Lecture Notes in Networks and Systems 506
Kohei Arai Editor
Intelligent Computing Proceedings of the 2022 Computing Conference, Volume 1
Lecture Notes in Networks and Systems Volume 506
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science. For proposals from Asia please contact Aninda Bose ([email protected]).
More information about this series at https://link.springer.com/bookseries/15179
Kohei Arai Editor
Intelligent Computing Proceedings of the 2022 Computing Conference, Volume 1
Editor Kohei Arai Saga University Saga, Japan
ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-031-10460-2 ISBN 978-3-031-10461-9 (eBook) https://doi.org/10.1007/978-3-031-10461-9 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Editor’s Preface
This edition of the proceedings series, “Intelligent Computing: Proceedings of the 2022 Computing Conference”, contains papers presented at the Computing Conference 2022, held virtually on the 14th and 15th of July 2022. We are delighted to announce that the complete conference proceedings were successfully produced through the will and co-operation of all its organizers, hosts, participants and other contributors. The conference has been held every year since 2013, with the aim of providing an ideal platform for researchers to exchange ideas, discuss research results and present practical and theoretical applications in areas such as technology trends, computing, artificial intelligence, machine vision, security, communication, ambient intelligence and e-learning. The proceedings of the 2022 conference have been divided into two volumes, which cover a wide range of the above-mentioned conference topics. This year the Computing Conference received a total of 498 papers from around the globe, of which only 179 were selected to be published in the proceedings of this edition. All the published papers passed a double-blind review by an international panel of at least three expert referees, and decisions were based on research quality. We are very pleased to report that the quality of the submissions this year turned out to be very high. The conference offers single-track sessions covering research papers, posters and videos, followed by keynote talks by experts to stimulate significant contemplation and discussion. Moreover, all authors presented their research papers very professionally to a large international audience online. We are confident that all the participants and interested readers will benefit scientifically from this book and that it will have a significant impact on the research community in the longer term. Our acknowledgment goes to the keynote speakers for sharing their knowledge and expertise with us. A big thanks to the session chairs and the members of the technical program committee for their detailed and constructive comments, which
were valuable for the authors in continuing to improve their papers. We are also indebted to the organizing committee for their invaluable assistance in making the conference such a great success. We expect that the Computing Conference 2023 will be as stimulating as this most recent one was. Kohei Arai
Contents
Estimation of Velocity Field in Narrow Open Channels by a Hybrid Metaheuristic ANFIS Network . . . . . . 1
Hossein Bonakdari, Hamed Azimi, Isa Ebtehaj, Bahram Gharabaghi, Ali Jamali, and Seyed Hamed Ashraf Talesh
Development of a Language Extension for Configuration of Industrial Asset Capabilities in Self-organized Production Systems . . . . . . 25
Eric Brandt, Felix Brandt, and Dirk Reichelt
Open-Source Mapping Method Applied to Thermal Imagery . . . . . . 43
André Vong, João P. Matos-Carvalho, Dário Pedro, Slavisa Tomic, Marko Beko, Fábio Azevedo, Sérgio D. Correia, and André Mora
Scalable Computing Through Reusability: Encapsulation, Specification, and Verification for a Navigable Tree Position . . . . . . 58
Nicodemus M. J. Mbwambo, Yu-Shan Sun, Joan Krone, and Murali Sitaraman
Generalizing Univariate Predictive Mean Matching to Impute Multiple Variables Simultaneously . . . . . . 75
Mingyang Cai, Stef van Buuren, and Gerko Vink
Timeline Branching Method for Social Systems Monitoring and Simulation . . . . . . 92
Anton Ivaschenko, Evgeniya Dodonova, Irina Dubinina, Pavel Sitnikov, and Oleg Golovnin
Webometric Network Analysis of Cybersecurity Cooperation . . . . . . . . 103 Emmanouil Koulas, Syed Iftikhar Hussain Shah, and Vassilios Peristeras Safety Instrumented System Design Philosophy Paradigm Shift to Achieve Safe Operations of Interconnected Operating Sites . . . . . . . . . . 123 Soloman M. Almadi and Pedro Mujica
Bifurcation Revisited Towards Interdisciplinary Applicability . . . . . . . . 138 Bernhard Heiden, Bianca Tonino-Heiden, and Volodymyr Alieksieiev Curious Properties of Latency Distributions . . . . . . . . . . . . . . . . . . . . . 146 Michał J. Gajda Multicloud API Binding Generation from Documentation . . . . . . . . . . . 171 Gabriel Araujo, Vitor Vitali Barrozzi, and Michał J. Gajda Reducing Web-Latency in Cloud Operating Systems to Simplify User Transitions to the Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 Luke Gassmann and Abu Alam Crescoware: A Container-Based Gateway for HPC and AI Applications in the ENEAGRID Infrastructure . . . . . . . . . . . . . . . . . . . 196 Angelo Mariano, Giulio D’Amato, Giovanni Formisano, Guido Guarnieri, Giuseppe Santomauro, and Silvio Migliori Significance in Marlo Diagrams Versus Thoroughness of Venn Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 Marcos Bautista López Aznar, Guillermo Címbora Acosta, and Walter Federico Gadea A Hybrid Real-Time Scheduling Mechanism Based on Multiprocessor for Real-Time Tasks in Weakly Hard Specification . . . . . . . . . . . . . . . . 228 Habibah Ismail, Dayang N. A. Jawawi, and Ismail Ahmedy Simulating the Arnaoutova-Kleinman Model of Tubular Formation at Angiogenesis Events Through Classical Electrodynamics . . . . . . . . . . 248 Huber Nieto-Chaupis Virtual Critical Care Unit (VCCU): A Powerful Simulator for e-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 Frederic Banville, Andree-Anne Parent, Mylene Trepanier, and Daniel Milhomme Silence in Dialogue: A Proposal and Prototype for Psychotherapy . . . . . 266 Alfonso Garcés-Báez and Aurelio López-López A Standard Content for University Websites Using Heuristic Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278 Mohd. Hisyamuddin Jainari, Aslina Baharum, Farhana Diana Deris, Noorsidi Aizuddin Mat Noor, Rozita Ismail, and Nurul Hidayah Mat Zain Development of a Mobile Application to Provide Employment Information to Private-Sector Workers in Peru . . . . . . . . . . . . . . . . . . . 293 Paul Ccuno Carlos, Pabel Chura Chambi, José Lipa Ochoa, and José Sulla-Torres
Analysis of Technical Factors in Interactive Media Arts, with a Focus on Prix Ars Electronica Award Winners . . . . . . . . . . . . . . . . . . . . . . . . 304 Yeeun Jo and Uran Oh Usability Evaluation of Mobile Application Software Mockups . . . . . . . 321 Fray L. Becerra-Suarez, Deysi Villanueva-Ruiz, Víctor A. Tuesta-Monteza, and Heber I. Mejia-Cabrera EasyChat: A Chat Application for Deaf/Dumb People to Communicate with the General Community . . . . . . . . . . . . . . . . . . . 332 W. W. G. P. A. Wijenayake, M. D. S. S. Gunathilake, P. M. Gurusinghe, W. A. H. K. Samararathne, and Disni Sriyaratna Influence of Augmented Reality on Purchase Intention . . . . . . . . . . . . . 345 Ana Zagorc and Andrija Bernik Encountering Pinchas Gutter in Virtual Reality and as a “Hologram”: Immersive Technologies and One Survivor’s Story of the Holocaust . . . 358 Cayo Gamber Strided DMA for Multidimensional Array Copy and Transpose . . . . . . 375 Mark Glines, Peter Pirgov, Lenore Mullin, and Rishi Khan The Machine Learning Principles Based at the Quantum Mechanics Postulates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394 Huber Nieto-Chaupis The Threat of Quantum Computing to SMEs . . . . . . . . . . . . . . . . . . . . 404 Paulina Schindler and Johannes Ruhland Quantum Computation by Means of Josephson Junctions Made of Coherent Domains of Liquid Water . . . . . . . . . . . . . . . . . . . . . . . . . . 414 Luigi Maxmilian Caligiuri Customer Response Modeling Using Ensemble of Balanced Classifiers: Significance of Web Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 433 Sunčica Rogić and Ljiljana Kašćelan What Augmentations are Sensitive to Hyper-Parameters and Why? . . . 449 Ch Muhammad Awais, Imad Eddine Ibrahim Bekkouch, and Adil Mehmood Khan Draw-n-Replace: A Novel Interaction Technique for Rapid HumanCorrection of AI Semantic Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 469 Kevin Huang, Ting-Ju Chen, Shashank Shekhar, and Ji Eun Kim Working Towards an AI-Based Clustering of Airports, in the Effort of Improving Humanitarian Disaster Preparedness . . . . . . . . . . . . . . . . 483 Maria Browarska and Karla Saldaña Ochoa
Wind Turbine Surface Defect Detection Analysis from UAVs Using U-Net Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499 Hassnaa Hasan Shaheed and Riya Aggarwal Anomaly Detection Using Deep Learning and Big Data Analytics for the Insider Threat Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512 Abu Alam and Harry Barron Near Infrared Spectra Data Analysis by Using Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532 Perry Xiao and Daqing Chen On Regret Bounds for Continual Single-Index Learning . . . . . . . . . . . . 545 T. Tien Mai Text to Image Synthesis Using Stacked Conditional Variational Autoencoders and Conditional Generative Adversarial Networks . . . . . 560 Haileleol Tibebu, Aadin Malik, and Varuna De Silva Analytical Decision-Making System Based on the Analysis of Air Pollution in the City of Nur-Sultan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581 Zhibek Sarsenova, Aldiyar Salkenov, Assel Smaiyl, and Mirolim Saidakhmatov A Local Geometry of Hyperedges in Hypergraphs, and Its Applications to Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590 Dong Quan Ngoc Nguyen and Lin Xing Extraction of Consumer Emotion Using Diary Data on Purchasing Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608 Yuzuki Kitajima, Shunta Nakao, Kohei Otake, and Takashi Namatame Significance in Machine Learning and Data Analytics Techniques on Oceanography Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620 K. Krzak, O. Abuomar, and D. Fribance Applying Latent Dirichlet Allocation Technique to Classify Topics on Sustainability Using Arabic Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630 Islam Al Qudah, Ibrahim Hashem, Abdelaziz Soufyane, Weisi Chen, and Tarek Merabtene Metrics for Software Process Quality Assessment in the Late Phases of SDLC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639 Gcinizwe Dlamini, Shokhista Ergasheva, Zamira Kholmatova, Artem Kruglov, Andrey Sadovykh, Giancarlo Succi, Anton Timchenko, Xavier Vasquez, and Evgeny Zouev
Online Quantitative Research Methodology: Reflections on Good Practices and Future Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656 Pierpaolo Limone, Giusi Antonia Toto, Piergiorgio Guarini, and Marco di Furia Application of Machine Learning in Predicting the Impact of Air Pollution on Bacterial Flora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670 Damjan Jovanovski, Elena Mitreska Jovanovska, Katja Popovska, and Andreja Naumoski Markov Chains for High Frequency Stock Trading Strategies . . . . . . . . 681 Cesar C. Almiñana Scalable Shapeoid Recognition on Multivariate Data Streams with Apache Beam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695 Athanasios Tsitsipas, Georg Eisenhart, Daniel Seybold, and Stefan Wesner Detection of Credit Card Frauds with Machine Learning Solutions: An Experimental Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715 Courage Mabani, Nikolaos Christou, and Sergey Katkov ALBU: An Approximate Loopy Belief Message Passing Algorithm for LDA for Small Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723 Rebecca M. C. Taylor and Johan A. du Preez Retrospective Analysis of Global Carbon Dioxide Emissions and Energy Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747 Rajvir Thind and Lakshmi Babu Saheer Application of Weighted Co-expressive Analysis to Productivity and Coping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762 Sipovskaya Yana Ivanovna An Improved Architecture of Group Method of Data Handling for Stability Evaluation of Cross-sectional Bank on Alluvial Threshold Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 769 Hossein Bonakdari, Azadeh Gholami, Isa Ebtehaj, and Bahram Gharebaghi Increasing Importance of Analog Data Processing . . . . . . . . . . . . . . . . . 797 Shuichi Fukuda New Trends in Big Data Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808 Júlia Colleoni Couto, Juliana Damasio, Rafael Bordini, and Duncan Ruiz Finding Structurally Similar Objects Based on Data Sorting Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 826 Alexey Myachin
Application of the Proposed Thresholding Method for Rice Paddy Field Detection with Radarsat-2 SAR Imagery Data . . . . . . . . . . . . . . . 836 Kohei Arai and Kenta Azuma Understanding COVID-19 Vaccine Reaction Through Comparative Analysis on Twitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 846 Yuesheng Luo and Mayank Kejriwal A Descriptive Literature Review and Classification of Business Intelligence and Big Data Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 865 Ammar Rashid and Muhammad Mahboob Khurshid Data Mining Solutions for Fraud Detection in Credit Card Payments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 880 Awais Farooq and Stas Selitskiy Nonexistence of a Universal Algorithm for Traveling Salesman Problems in Constructive Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . 889 Linglong Dai Addition-Based Algorithm to Overcome Cover Problem During Anonymization of Transactional Data . . . . . . . . . . . . . . . . . . . . . . . . . . 896 Apo Chimène Monsan, Joël Christian Adepo, Edié Camille N’zi, and Bi Tra Goore Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915
Estimation of Velocity Field in Narrow Open Channels by a Hybrid Metaheuristic ANFIS Network Hossein Bonakdari1(B) , Hamed Azimi2 , Isa Ebtehaj3 , Bahram Gharabaghi4 , Ali Jamali5 , and Seyed Hamed Ashraf Talesh5 1 Department of Civil Engineering, University of Ottawa,
161 Louis Pasteur Drive, Ottawa K1N 6N5, Canada [email protected] 2 Environmental Research Center, Razi University, Kermanshah, Iran 3 Department of Soils and Agri-Food Engineering, Université Laval, Québec G1V06, Canada 4 School of Engineering, University of Guelph, Guelph, ON NIG 2W1, Canada 5 Department of Mechanical Engineering, Guilan University, Rasht, Iran
Abstract. Owing to the significance of velocity distribution, several laboratory and simulation studies have been carried out on the velocity distribution in open channels. Laboratory and field studies show that the maximum velocity in narrow open flumes occurs beneath the water surface, a behaviour known as the velocity dip mechanism. The velocity dip is a highly complex feature of flow in narrow channels, and no expression can evaluate it exactly. This study introduces a hybrid metaheuristic model to estimate the velocity distribution within narrow open canals. To optimize the linear and nonlinear parameters of adaptive neuro-fuzzy inference system (ANFIS) models, singular value decomposition (SVD) and a genetic algorithm (GA) are employed. To increase the flexibility of the model for an optimal design, two different objective functions are used, and the best optimal point is selected using the Pareto curve. To evaluate the accuracy of the hybrid ANFIS-GA/SVD model, the velocity distribution under three hydraulic circumstances is compared to the measured values. ANFIS-GA/SVD predicts the velocity distribution with reasonable accuracy and estimates the velocity dip value with high precision. The Root Mean Squared Error (RMSE) for depths D = 0.65 m, D = 0.91 m, and D = 1.19 m is calculated as 0.052, 0.044, and 0.053, respectively. According to the numerical model results, ANFIS-GA/SVD simulated the velocity distribution at a depth of D = 0.91 m more accurately than at the other depths. Also, almost 94%, 96%, and 88% of the velocities predicted by the ANFIS-GA/SVD algorithm for D = 0.65 m, D = 0.91 m, and D = 1.19 m, respectively, show an error of less than 10% across the entire cross-section. Keywords: ANFIS · Dip phenomenon · Genetic algorithm · Narrow compound channel · Singular value decomposition · Velocity distribution
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 506, pp. 1–24, 2022. https://doi.org/10.1007/978-3-031-10461-9_1
1 Introduction

The velocity distribution, as a key parameter for all open canal studies, has long interested hydraulic engineers and scholars. For the velocity distribution, logarithmic-based laws are usually employed to represent the recorded velocity datasets. In these laws, the velocity is normalized by the shear velocity, and the normalized velocity is demonstrated in a logarithmic distribution. Schlichting [1] identified inner and outer regions in open-flume flow. The inner region lies near the floor, with a thickness of almost 10 to 20% of the whole water depth [2]. Within this region, the vertical velocity profile is modeled by the log law [3]. The log law is considered one of the most fundamental logarithmic approaches for representing the normalized velocity profile and was first proposed by Keulegan [4] and Nikuradse [5]. This log law deviates from laboratory measurements in the outer region (y/D > 0.2, where y is the distance from the floor and D is the water depth). Therefore, the approach is commonly considered insufficient to define the velocity profile in the outer region [6–8]. This deviation has been addressed by adding the wake function to the log law [9, 10]. For fully turbulent flow in wide open flumes with a large aspect ratio (the channel width to flow depth ratio), the velocity distribution can be evaluated by the modified log-wake laws with good predictive accuracy [11, 12]. However, in narrow open canals with an aspect ratio smaller than five, the highest velocity is located beneath the water surface, which produces a negative vertical gradient in the velocity distribution, called the velocity-dip mechanism [7]. Owing to this mechanism, the velocity profile is complicated; as a result, the velocity distribution cannot be effectively represented by the log-law and parabola models [13–15]. It is a serious problem for engineers to describe and offer explicit solutions to estimate the distribution of velocity within narrow open canals. Based on solutions of the Navier–Stokes equations, numerous methods have been proposed for the assessment of the velocity field within narrow channels (u), which can be categorized as numerical solutions [16–19] and simplified/averaged formulations [15, 19, 20]. Nevertheless, these approaches generally involve too many parameters and require considerable effort and time from engineers. Practical methods used to evaluate the velocity dip phenomenon comprise the log and modified log laws. As an example, Yang et al. [13] proposed a very functional equation, named the dip-modified log law, that was able to evaluate the velocity dip mechanism and was suitable for the whole cross-section. Yang et al. [13] proposed the following equation for the evaluation of velocity in narrow open channels:

\frac{u}{u_*} = \frac{1}{\kappa}\ln\frac{y}{y_0} + \frac{\alpha}{\kappa}\ln\left(1 - \frac{y}{h}\right)   (1)

where u_* = (gRS)^{0.5}, R is the hydraulic radius, g is the gravity acceleration, S is the channel slope, y_0 is a hypothetical elevation where the velocity equals 0, κ is the Karman constant, and α is a constant coefficient.
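As a quick, non-authoritative illustration of Eq. (1), the Python sketch below evaluates the dip-modified log law along one vertical and locates the velocity maximum below the free surface. The numerical values of u*, y0, h and α are placeholders chosen for the example, not parameters taken from the study.

```python
import numpy as np

# Sketch of Eq. (1), the dip-modified log law of Yang et al. [13]:
# u/u* = (1/kappa) ln(y/y0) + (alpha/kappa) ln(1 - y/h)
def dip_modified_log_law(y, u_star, y0, h, alpha, kappa=0.41):
    y = np.asarray(y, dtype=float)
    return u_star * (np.log(y / y0) + alpha * np.log(1.0 - y / h)) / kappa

# Illustrative vertical profile: the maximum occurs below the water surface (velocity dip).
y = np.linspace(0.01, 0.95, 100)          # distance from the bed (m), with y < h
u = dip_modified_log_law(y, u_star=0.05, y0=1e-4, h=1.0, alpha=0.3)
print("velocity maximum at y =", float(y[np.argmax(u)]), "m")
```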
Recently, machine learning (ML) technology has been widely applied to practical engineering problems. For instance, the adaptive neuro-fuzzy inference system (ANFIS), a robust approach, provides a technique for the fuzzy modeling procedure by learning information from data [21]. To train the ANFIS network and increase the accuracy of modeling, an evolutionary algorithm (the genetic algorithm) is used [22, 23]. In addition, singular value decomposition (SVD) is applied to find the vector of linear parameters of the consequent parts of the ANFIS model [24–27]. Khoshbin et al. [28] optimized ANFIS using the genetic algorithm/SVD technique to simulate the hydraulic characteristics of side weirs. Ebtehaj et al. [29] predicted the discharge capacity of side weirs through gene expression programming (GEP), and an equation was proposed to compute the discharge capacity. Azimi et al. [30] estimated the hydraulic jump roller length on a rough bed by combining ANFIS and the firefly algorithm. Azimi et al. [31] optimized the ANFIS network using evolutionary Pareto design for modeling scour around pile groups. They combined the ANFIS model with SVD and the Differential Evolution (DE) algorithm to design the nonlinear and linear parameters. Bonakdari et al. [32] simulated the velocity distribution in narrow compound flumes through an extreme learning machine (ELM). The authors concluded that the ELM algorithm demonstrated good efficiency compared to the entropy-based method. Additionally, Bonakdari et al. [33] applied the gene expression programming (GEP) model for the estimation of the velocity distribution within compound sewer canals. The investigation highlighted that the GEP model could well approximate the dip mechanism, which is a complex phenomenon in velocity distribution simulation. Bonakdari et al. [34] performed an uncertainty analysis for the velocity field in sewer flumes with the ELM model. The results of the ELM model were compared with artificial neural networks and empirical models, demonstrating the better performance of the ELM algorithm. Bonakdari et al. [35] predicted the velocity pattern in a narrow open canal using self-adaptive extreme learning machines. The authors showed that machine learning could model the negative gradient of velocity in the vicinity of the free surface. Regarding the literature, notwithstanding the significance of knowledge of the velocity field within narrow open flumes, only a few studies on the use of ML in this area have been reported. In addition, the trade-off between training and prediction errors for optimizing the ANFIS design using the Pareto curve for the velocity field in narrow open channels has not been reported either. Therefore, to fill this gap, a computer program was developed in which the ANFIS network was combined with SVD and a GA in this study. In this framework, the nonlinear antecedent and linear consequent parameters were optimized for the prediction of the velocity field in narrow open canals. The velocity distribution was modelled for three different depths using a hybrid multi-objective technique, ANFIS-GA/SVD, and the outcomes of ANFIS-GA/SVD were compared with field measurements and Yang et al.’s [13] velocity distribution equation.
2 Data Collection

In the current investigation, field data measured by Larrarte [36] and Bonakdari [37] are used to validate the velocity field computed by the numerical model. Velocity distribution values were measured at the Cordon Bleu site, situated in the main sewer system of Nantes, France. The channel has a compound cross-section, elliptical and narrow (1.7 < y/z < 2.6), with a sidewalk embedded on the left bank of the stream. The vertical and horizontal axes are denoted Y and Z, respectively. To measure the velocity field over the channel cross-sections, Cerbere, a remotely controlled 2D device, was employed. Cerbere recorded measurements for 10 different depths (0.65 < D (m) < 1.19) and 9 different flow discharges (0.54 < Q (m3/s) < 1.65). The maximum and minimum velocity values were measured as 0.97 m/s and 0.49 m/s, respectively. For all measured velocities, the flow was in the turbulent and subcritical regimes. In other words, the largest Reynolds number was on the order of 10^5, and Froude numbers of 0.2 and 0.3 were reported. In the current paper, to model the velocity field in narrow channels with the hybrid ANFIS-GA/SVD model, 363 data points measured at seven different depths are used to train the numerical model. Subsequently, 166 velocities measured at depths D = 0.65, 0.91, and 1.19 m are applied to test the ANFIS-GA/SVD model. After training and testing of the ML model, the results in the test mode are analyzed. Within the range of flow depths, the depths of 0.65 and 1.19 m correspond to the minimum and maximum depths, respectively, and are not applied in ML training. Hence, the trained ML model is evaluated under different hydraulic conditions.
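The hold-out split described above (363 training points at seven depths, 166 test points at D = 0.65, 0.91 and 1.19 m) can be reproduced along the lines of the sketch below. The intermediate depth values, the column layout (D, y, z, Q, u) and the random placeholder data are assumptions for illustration only and do not correspond to the original measurements.

```python
import numpy as np

# Hold-out split by measurement depth: the profiles at D = 0.65, 0.91 and
# 1.19 m are reserved for testing, the remaining depths for training.
rng = np.random.default_rng(0)
depths = [0.65, 0.70, 0.78, 0.85, 0.91, 0.98, 1.05, 1.10, 1.15, 1.19]  # placeholder depth list
data = np.column_stack([
    rng.choice(depths, size=529),      # D, flow depth (m)
    rng.random(529),                   # y, distance from the bed (m)
    rng.random(529),                   # z, transverse position (m)
    rng.uniform(0.54, 1.65, 529),      # Q, discharge (m3/s)
    rng.uniform(0.49, 0.97, 529),      # u, measured velocity (m/s) -- the target
])

is_test = np.isin(np.round(data[:, 0], 2), (0.65, 0.91, 1.19))
X_train, y_train = data[~is_test, :4], data[~is_test, 4]
X_test, y_test = data[is_test, :4], data[is_test, 4]
print(X_train.shape, X_test.shape)
```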
3 ANFIS-GA/SVD

In the following, an outline of ANFIS is first detailed. After that, the GA and SVD applied to the optimum design of the ANFIS network are described in Sects. 3.2 and 3.3.

3.1 Overview of ANFIS

An ANFIS network consists of a set of TSK-type IF-THEN fuzzy rules; it combines fuzzy logic with a neural network and is able to model complex nonlinear systems by mapping multiple inputs to an output. The antecedent and consequent parts, which are the two main parts of the ANFIS network, are connected with each other through TSK-type fuzzy rules in the form of a network. In system identification problems, the primary purpose is to obtain an approximate function f̂ that approximates the original function f, so that the output ŷ, for a set of input data X = (x_1, x_2, ..., x_n), has an acceptable difference from the actual output value y. Generally, for a set of multi-input–single-output data pairs, the ith output of the original function can be expressed as follows:

y_i = f(x_{i1}, x_{i2}, \ldots, x_{in}) \quad (i = 1, 2, \ldots, M)   (2)

here, M is the number of observation samples. A search table can be produced to train a fuzzy model to estimate the output parameter (ŷ) for any given input vector (X = (x_1, x_2, ..., x_n)):

\hat{y}_i = \hat{f}(x_{i1}, x_{i2}, \ldots, x_{in}) \quad (i = 1, 2, \ldots, M)   (3)
In the simulation process, the ANFIS network minimizes the differences between the real observed and predicted values through the following relationship:

E = \sum_{i=1}^{M} \big( f(x_{i1}, x_{i2}, \ldots, x_{in}) - y_i \big)^2 \rightarrow \min   (4)

The next step in modeling is the design of several linguistic fuzzy IF-THEN rules to model f̂ from the M samples of multi-input–single-output data pairs (X_i, y_i) (i = 1, 2, ..., M). These fuzzy rules of ANFIS models may be described by the following generic equation:

\text{Rule}_l: \text{IF } x_1 \text{ is } A_l^{(j_1)} \text{ AND } x_2 \text{ is } A_l^{(j_2)}, \ldots, x_N \text{ is } A_l^{(j_n)} \text{ THEN } y = \sum_{i=1}^{n} w_i^l x_i + w_0^l   (5)

here, j_i ∈ {1, 2, ..., r}, A_l^{(j_i)} is the ith fuzzy membership function (MF) of the antecedent part of the lth rule, r is the number of fuzzy sets related to each input in the used dataset, and W^l = (w_1^l, w_2^l, ..., w_n^l, w_0^l) is the set of variables of the consequent part of each fuzzy rule. The whole family of fuzzy sets in the x_i space is defined by the equation below:

A^{(i)} = \{A^{(1)}, A^{(2)}, \ldots, A^{(r)}\}   (6)

Owing to the good performance of Gaussian MFs reported recently [28, 31], the fuzzy sets used in the modeling of this study are considered Gaussian. The range used to define the function is taken as [−α_i, +β_i] (i = 1, 2, ..., n). To cover all fuzzy sets, these ranges should be properly selected, in particular so that for each x_i ∈ [−α_i, +β_i] there is an A^{(j)} (relationship 6) whose MF degree is non-zero (μ_{A^{(j)}}(x_i) ≠ 0). Each fuzzy set A^{(j)} (j = 1, 2, ..., r) using Gaussian MFs is determined as below:

\mu_{A^{(j)}}(x_i) = \text{Gaussian}(x_i, \sigma_j, c_j) = \exp\left( -\frac{(x_i - c_j)^2}{2\sigma_j} \right)   (7)

where σ_j and c_j are the two adjustable variances and centers (respectively) of the antecedent section of the fuzzy sets. The number of variables in the antecedent section of the fuzzy sets of ANFIS can be determined as NP = N × R, where R and N are the number of fuzzy sets in each antecedent section and the dimension of the input vector, respectively.
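A minimal sketch of Eq. (7) and of the product firing strength used in the rule evaluation that follows is given below. The centres and spreads are illustrative, not the optimized values of Table 1, and σ is used directly in the denominator 2σ as in the reconstructed Eq. (7).

```python
import numpy as np

# Gaussian membership function of Eq. (7): exp(-(x - c)^2 / (2*sigma)),
# with sigma acting as the adjustable "variance" described in the text.
def gaussian_mf(x, c, sigma):
    return np.exp(-((np.asarray(x, float) - c) ** 2) / (2.0 * sigma))

# Firing strength of one TSK rule: product of the MF degrees of the n inputs
# (Mamdani algebraic product).
def rule_firing_strength(x, centers, sigmas):
    mu = [gaussian_mf(xi, c, s) for xi, c, s in zip(x, centers, sigmas)]
    return float(np.prod(mu))

x = [0.91, 0.50, 1.10, 1.20]   # one input vector (D, y, z, Q) -- illustrative values
print(rule_firing_strength(x, centers=[0.9, 0.4, 1.0, 1.2], sigmas=[0.3, 0.5, 0.6, 0.7]))
```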
The fuzzy sets expressed in Eq. (5) form a fuzzy relationship in the domain R × R, such that Rule = A^{j_1} × A^{j_2} × ... × A^{j_n} → y, where the A^{(i)} are fuzzy sets in U_i (U = U_1 × U_2 × ... × U_n). It is obvious that y ∈ R and X = (x_1, x_2, ..., x_n)^T ∈ U. Using the Mamdani algebraic product, the firing degree of a TSK-type fuzzy rule is computed by the following equation:

\mu_{\text{Rule}_k} = \mu_U(x_1, x_2, \ldots, x_n)   (8)

here, U = A_l^{(j_1)} × A_l^{(j_2)} × ... × A_l^{(j_n)}, \mu_U(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \mu_{A_l^{(j_i)}}(x_i), and \mu_{A_l^{(j_i)}}(x_i) is the MF degree of x_i associated with the linguistic value of the lth fuzzy rule A_l^{(j_i)}. Using a singleton fuzzifier with a product inference engine, the accumulation of the individual contributions of each of the N TSK-type fuzzy rules of the form of Eq. (5) can be alternatively presented in a linear regression form as follows:

f(X) = \sum_{l=1}^{N} p_l(X)\, y^l + D   (9)

here, D is the discrepancy between the predicted f(X) and the corresponding actual output (y). Moreover, f(X) can be presented as below:

f(X) = \frac{\sum_{l=1}^{N} y^l \prod_{i=1}^{n} \mu_{A_l^{(j_i)}}(x_i)}{\sum_{l=1}^{N} \prod_{i=1}^{n} \mu_{A_l^{(j_i)}}(x_i)}   (10)

For the given M multi-input–single-output data pairs (X_i, y_i) (i = 1, 2, ..., M), Eq. (10) can be rewritten in the following matrix form:

Y = PW + D   (11)

here, P = [p_1, p_2, \ldots, p_S] ∈ R^{M×S}, S = N(n + 1), and W = [w_1, w_2, \ldots, w_S]^T ∈ R^S. Each vector w_i associated with the consequent section of a fuzzy rule includes (n + 1) components. The firing strengths collected in the matrix P are estimated once the input space is divided into a finite number of fuzzy sets. The number of rules is small enough (M ≥ S), as the number of available training sample pairs is commonly larger than the number of constants in the consequent section of the fuzzy rules. Hence, relationship (12) is obtained as a least squares approximation of the unknown vector W = [w_1, w_2, \ldots, w_S]^T such that D is minimized. Hence, the main equation is written as below:

W = (P^T P)^{-1} P^T Y   (12)

Correcting the coefficients in the consequent section of the TSK-type fuzzy rules results in a better estimation of the available data pairs by minimizing the magnitude of the vector D. The direct solution of the normal equations is sensitive to rounding error and, in addition, may suffer from singularity of the equations. Thus, in the current study, the SVD method is used as a practical and robust tool to optimize the linear constants of the consequent section of the TSK-type fuzzy rules of ANFIS. In addition, a combination of GA and SVD to optimize the ANFIS network is proposed and described in the following.

3.2 Optimization of ANFIS by GA

If the number of fuzzy rules in the multi-input–single-output system is taken as R, the actual values of the parameters of the Gaussian MF {c, σ} are represented as a string of concatenated
substrings of binary values in the “IF” section of the fuzzy mechanism, which is generated randomly at first and whose size is equal to R(n + 1). In fact, each substring represents, in a binary environment, the fuzzy rules considered in the antecedent section of the ANFIS network. The evolution procedure continues by producing, through mutation and crossover operators, a new binary string population from the previous generations. These chromosomes are candidate solutions for the fuzzy partitioning of the antecedent section of the rules, with a premium population generated randomly in each generation. In the present study, the values of two different targets, called the Training Error (TE) and the Prediction Error (PE), are calculated in terms of the Root Mean Square Error (RMSE). Thus, in the multi-objective Pareto design of the GA, the accuracy of the produced populations is simultaneously compared with respect to the two conflicting objective functions. In this way, the discrepancy between the real value (y) and the value ŷ calculated by the model proposed in this study gradually becomes minimal for each input variable. Also, simultaneously with the optimal estimation of the linear coefficients corresponding to the consequent part of the TSK-type fuzzy rules associated with each chromosome that represents the fuzzy partitioning of the antecedent sections, the SVD method is used.

3.3 Application of SVD in the Design of ANFIS

SVD [38] is a numerical technique with great potential for solving linear least squares problems. This method overcomes the singularity difficulties in the normal equations [31]. If the matrix P ∈ R^{M×S} is factorized using the SVD method, three different matrices are produced: a column-orthogonal matrix, an orthogonal matrix, and a diagonal matrix with non-negative values (the singular values):

P = U Q V^T   (13)

The variable W in relationship (12) is selected optimally using the modified inverse of the matrix Q, in which singular values equal or close to zero are set to zero. Thus, the optimal value of this variable is calculated using the following equation [28]:

W = V \, \text{diag}\!\left(\frac{1}{q_j}\right) U^T Y   (14)
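The consequent-parameter estimation of Eqs. (11)–(14) can be sketched as follows; P and Y are random placeholders standing in for the real matrix of rule firing strengths and the observed outputs, and near-zero singular values are dropped before inverting, as described above.

```python
import numpy as np

# Least-squares consequent weights via SVD (Eqs. (11)-(14)): Y ~ P W,
# W = V diag(1/q_j) U^T Y, with tiny singular values treated as zero.
rng = np.random.default_rng(1)
M, S = 200, 15                               # samples, number of consequent parameters
P = rng.normal(size=(M, S))                  # placeholder firing-strength matrix
Y = rng.normal(size=M)                       # placeholder outputs

U, q, Vt = np.linalg.svd(P, full_matrices=False)
tol = max(M, S) * np.finfo(float).eps * q.max()
q_inv = np.where(q > tol, 1.0 / q, 0.0)      # zero-out near-singular directions

W = Vt.T @ (q_inv * (U.T @ Y))               # Eq. (14)
print("residual ||Y - P W|| =", float(np.linalg.norm(Y - P @ W)))
```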
4 Results and Discussion

In the current study, to assess the performance of the ML models, several criteria including the Root Mean Squared Error (RMSE), Mean Absolute Relative Error (MARE), correlation coefficient (R), BIAS, Scatter Index (SI), and ρ are applied as below:

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(u_{(Predicted)i} - u_{(Observed)i}\right)^2}   (15)

MARE = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|u_{(Predicted)i} - u_{(Observed)i}\right|}{u_{(Observed)i}}   (16)
R = \frac{\sum_{i=1}^{n}\left(u_{(Observed)i} - \bar{u}_{(Observed)}\right)\left(u_{(Predicted)i} - \bar{u}_{(Predicted)}\right)}{\sqrt{\sum_{i=1}^{n}\left(u_{(Observed)i} - \bar{u}_{(Observed)}\right)^2 \sum_{i=1}^{n}\left(u_{(Predicted)i} - \bar{u}_{(Predicted)}\right)^2}}   (17)

BIAS = \frac{1}{n}\sum_{i=1}^{n}\left(u_{(Predicted)i} - u_{(Observed)i}\right)   (18)

SI = \frac{RMSE}{\bar{u}_{(Observed)}}   (19)

\rho = \frac{SI}{1 + R}   (20)
This study used two conflicting target parameters, called the Training Error (TE) and the Prediction Error (PE). To select the optimal point that shows good performance for both objective functions, the Pareto curve, as in Fig. 1, is used. In this figure, TE, PE, and Trd represent the point with the minimum training error, the point with the lowest prediction error, and the trade-off point between the PE and TE points, respectively. The Trd point between PE and TE has been determined using the nearest to ideal point (NIP) method. In this method, the best value of each objective function is selected as the ideal point, and the distance of the other points from this point is calculated. The Trd point is determined as the point with the minimum distance to the ideal point. Figure 2 shows the Gaussian-shaped membership function values for the input variables. The optimum value of each MF is determined using the GA at the trade-off point between the two objectives considered in this study (TE and PE). As shown, the x and y axes indicate the input value and the MF value. According to the figure, the number of MFs considered for all inputs equals 3. The optimal values of the Gaussian MF coefficients (c and σ) are presented in Table 1.

Table 1. Gaussian MF parameters optimized at the trade-off point

Input        | σ: MF1   | σ: MF2   | σ: MF3   | c: MF1   | c: MF2   | c: MF3
Input 1 (D)  | 0.43476  | 0.29752  | 0.29645  | 1.11813  | 0.92811  | 0.90543
Input 2 (y)  | 0.52723  | 0.43632  | 0.68051  | 0.90659  | 1.42469  | 0.33722
Input 3 (z)  | 1.06459  | 0.24599  | 0.60193  | 0.71725  | 0.46323  | 1.08919
Input 4 (Q)  | 0.69515  | 0.927139 | 0.05105  | 1.24991  | 1.64132  | 0.80916
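The nearest-to-ideal-point (NIP) selection of the trade-off design described above can be expressed as in the sketch below; the Pareto-front values of TE and PE are illustrative placeholders, not the values of Fig. 1.

```python
import numpy as np

# NIP selection: the ideal point takes the best TE and the best PE over the
# Pareto front, and the trade-off (Trd) design is the member closest to it.
pareto = np.array([[0.040, 0.061],      # columns: TE, PE (both RMSE), placeholder values
                   [0.043, 0.055],
                   [0.047, 0.050],
                   [0.053, 0.046]])

ideal = pareto.min(axis=0)                        # best TE and best PE taken separately
dist = np.linalg.norm(pareto - ideal, axis=1)     # Euclidean distance to the ideal point
print("trade-off design:", pareto[np.argmin(dist)])
```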
Fig. 1. Pareto diagram of TE versus PE
Here, u_{(Observed)i} is the experimental velocity, u_{(Predicted)i} is the velocity predicted by the ML model, \bar{u}_{(Observed)} is the average laboratory velocity, and n is the number of experiments. The above criteria quantify the discrepancy between the laboratory and simulated values, but they give no additional information on the distribution of the errors of the numerical models. Hence, the index below is suggested to compute the error distribution of the models [39]. The TS_x criterion is computed as:

TS_x = \frac{Y_x}{n} \times 100   (21)
Above, Y x is MARE estimated values by the numerical models that the error value is lower than x%. In this study, the dimensionless values of the velocity (ui /umax ) have been used. Although, the velocity values measured at three depths, D = 0.65 m, D = 0.91 m, and D = 1.19 m with numerical values ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15], and Yang et al. [13] models were compared. The statistical indices are arranged in Table 2 for various models. Table 2 presents the statistical indices of numerical models ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] for depths D = 0.65 m, D = 0.91 m and D = 1.19 m. According to Table 2, statistical indices for numerical models ANFIS-GA/SVD and Yang et al. [13] are predicted at a depth close to 0.65 m. For example, RMSE values for models ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13], has been achieved 0.052, 0.036 and 0.138, respectively. The SI and ρ values for ANFIS-GA/SVD have been surmised 0.057 and 0.031. In contrast, the scatter index value for Bonakdari et al. [15] and Yang et al. [13] were respectively computed
Fig. 2. MF optimized regarding the trade-off design point
0.039 and 0.148, respectively. Figures 3, 4 and 5 show the scatter plots of the test mode of the ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] models for D = 0.65 m, D = 0.91 m and D = 1.19 m. The correlation coefficients of the ANFIS-GA/SVD, Yang et al. [13] and Bonakdari et al. [15] models at a depth of 0.65 m were calculated as 0.783, 0.777 and 0.859, respectively. At a depth of 0.91 m, the RMSE, MARE, and BIAS values of the ANFIS-GA/SVD model are 0.044, 0.036 and 0.007, respectively, whereas for the Yang et al. [13] model an RMSE of 0.078, a MARE of 0.074, and a BIAS of −0.064 are calculated. Therefore, the error values at a depth of 0.91 m for the ANFIS-GA/SVD model are smaller than those of the Bonakdari et al. [15] and Yang et al. [13] models. The R index values for ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] are calculated as 0.788, 0.872, and 0.755, respectively. For the ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] models at a depth of 1.19 m, the correlation coefficient values were obtained as 0.761, 0.744, and 0.687, respectively. The RMSE and MARE values for the ANFIS-GA/SVD algorithm are computed as 0.053 and 0.047, whereas for Yang et al. [13] an RMSE of 0.108 and a MARE of 0.100 are calculated. The values of ρ for the ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] models are 0.033, 0.069, and 0.159, respectively. As can be seen,
increasing the flow depth increases the accuracy of the ANFIS-GA/SVD model compared to the Bonakdari et al. [15] and Yang et al. [13] models. In other words, the error value at a depth of 1.19 m for ANFIS-GA/SVD is calculated to be smaller than for the Bonakdari et al. [15] and Yang et al. [13] models.

Table 2. Statistical indices (RMSE, MARE, R, BIAS, SI, ρ) in test mode for the ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] models

Depth    | Models                | RMSE  | MARE   | R     | BIAS   | SI     | ρ
D = 0.65 | ANFIS                 | 0.048 | 0.043  | 0.736 | 0.008  | 0.052  | 0.030
D = 0.65 | ANFIS-GA              | 0.048 | 0.042  | 0.755 | 0.013  | 0.051  | 0.029
D = 0.65 | ANFIS-GA/SVD          | 0.052 | 0.047  | 0.783 | −0.016 | 0.057  | 0.031
D = 0.65 | Yang et al. [13]      | 0.138 | 0.125  | 0.777 | −0.112 | 0.148  | 0.083
D = 0.65 | Bonakdari et al. [15] | 0.036 | 0.032  | 0.859 | 0.117  | 0.039  | 0.021
D = 0.91 | ANFIS                 | 0.045 | 0.040  | 0.726 | 0.017  | 0.0487 | 0.0287
D = 0.91 | ANFIS-GA              | 0.042 | 0.0372 | 0.774 | 0.013  | 0.045  | 0.025
D = 0.91 | ANFIS-GA/SVD          | 0.044 | 0.036  | 0.788 | 0.007  | 0.047  | 0.026
D = 0.91 | Yang et al. [13]      | 0.078 | 0.074  | 0.872 | −0.064 | 0.084  | 0.045
D = 0.91 | Bonakdari et al. [15] | 0.080 | 0.074  | 0.755 | −0.039 | 0.085  | 0.049
D = 1.19 | ANFIS                 | 0.056 | 0.048  | 0.757 | 0.030  | 0.061  | 0.035
D = 1.19 | ANFIS-GA              | 0.067 | 0.060  | 0.490 | 0.022  | 0.072  | 0.048
D = 1.19 | ANFIS-GA/SVD          | 0.053 | 0.047  | 0.761 | 0.025  | 0.057  | 0.033
D = 1.19 | Yang et al. [13]      | 0.108 | 0.100  | 0.687 | −0.123 | 0.117  | 0.069
D = 1.19 | Bonakdari et al. [15] | 0.256 | 0.108  | 0.744 | −0.053 | 0.277  | 0.159
In Fig. 6, the error distributions of the ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] models for depths D = 0.65 m, D = 0.91 m and D = 1.19 m are plotted. As can be seen, almost 94% of the velocities predicted at a depth of 0.65 m by the ANFIS-GA/SVD model show less than 10% error. Also, 100% of the velocity values simulated by ANFIS-GA/SVD have an error of less than 20%. In contrast, Yang et al.'s [13] model computed approximately 31% of the velocity values with an error below 6%. That model also predicted 41% of the velocities at a depth of 0.65 m with an error of less than 10%, although approximately 81% of its simulated outcomes have an error smaller than 20%. Additionally, for Bonakdari et al. [15], almost 44% of the estimated velocities have an error of less than 12%. For the ANFIS-GA/SVD model, approximately 96% of the predicted results at a depth of 0.91 m have an error smaller than 10%, and roughly 5% of the simulation outcomes at this depth have an error between 10 and 15 percent, while all the results simulated by the
Fig. 3. The comparison of scatter plots of models ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] for Depth of D = 0.65 m
Fig. 4. The comparison of scatter plots of models ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] for Depth of D = 0.91 m
model have an error of less than 20%. The Yang et al. [13] model predicted approximately 30% of the results at a depth of 0.91 m with an error between 10 and 15 percent, and about 54% of the velocities simulated by that model at this depth are estimated with an error of less than 8%. Almost 88% of the velocities simulated by the ANFIS-GA/SVD model at a depth of 1.19 m have an error of less than 10%. According to the simulation results, the model calculated about 13% of the results with an error between 10 and 15 percent, while approximately 3% of the results estimated by ANFIS-GA/SVD at this depth have an error of more than 15%. In contrast, Yang et al.'s [13] model at a depth of 1.19 m predicted approximately 58% of the velocity values with less than 10% error, and 87% of the results simulated by that model have an error of less than 16 percent. Moreover, almost 92% of the values predicted by the Yang et al. [13] model at this depth have an error of less than 20%. In addition, nearly 88% of the velocities simulated by Bonakdari et al. [15] have an error of less than 14%.
Fig. 5. The comparison of scatter plots of models ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] for Depth of D = 1.19 m
Fig. 6. The error distribution of models ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] at Depths (a) D = 0.65 m (b) D = 0.91 m and (c) D = 1.19 m for Test Mode
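The error-distribution percentages discussed around Fig. 6 follow directly from the threshold statistic of Eq. (21); a small sketch with placeholder velocities is given below.

```python
import numpy as np

# Threshold statistic TS_x of Eq. (21): the share (%) of predictions whose
# absolute relative error is below x %.
def ts_x(u_obs, u_pred, x):
    u_obs, u_pred = np.asarray(u_obs, float), np.asarray(u_pred, float)
    err = 100.0 * np.abs(u_pred - u_obs) / u_obs
    return 100.0 * np.count_nonzero(err < x) / err.size

u_obs = np.array([0.49, 0.62, 0.80, 0.97])     # placeholder observed velocities
u_pred = np.array([0.51, 0.60, 0.83, 0.95])    # placeholder predicted velocities
print([ts_x(u_obs, u_pred, x) for x in (5, 10, 20)])
```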
To analyze the results simulated by the ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15], and Yang et al. [13] models, the discrepancy ratio, defined as the ratio of the modeled velocities to the experimental velocities, DR = u_(Predicted)/u_(Observed), is introduced. A DR value close to 1 indicates the proximity of the simulated values to the experimental measurements. In Figs. 7, 8 and 9, the discrepancy ratio variations of the ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15], and Yang et al. [13] models at depths of 0.65, 0.91, and 1.19 m are shown. The values DR(max), DR(min), and DR(ave) are the maximum, minimum, and average discrepancy ratios, respectively. According to the obtained results, the DR(ave) value at a depth of
0.65 m for the ANFIS-GA/SVD, Bonakdari et al. [15], and Yang et al. [13] models has been calculated as 0.984, 1.006, and 0.877, respectively. At this depth, the DR(max) and DR(min) values for ANFIS-GA/SVD are 1.124 and 0.897, respectively. For the depth of 0.91 m, Yang et al.'s [13] model gives maximum, minimum, and average discrepancy ratios of 1.033, 0.846, and 0.930, while the DR(ave) for ANFIS-GA/SVD at this depth is estimated as 1.007. At a depth of 1.19 m, the average discrepancy ratios for the ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] models are calculated as 1.030, 1.018, and 0.925, respectively. As can be seen, the average discrepancy ratio closest to 1 is obtained for the ANFIS-GA/SVD model at a depth of 0.91 m.
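The discrepancy ratio statistics reported in Figs. 7–9 can be computed as in the short sketch below; the velocity arrays are placeholders.

```python
import numpy as np

# Discrepancy ratio DR = u_predicted / u_observed and its max, min and average.
u_obs = np.array([0.49, 0.62, 0.80, 0.97])
u_pred = np.array([0.51, 0.60, 0.83, 0.95])

dr = u_pred / u_obs
print("DR(max) =", dr.max(), "DR(min) =", dr.min(), "DR(ave) =", dr.mean())
```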
Fig. 7. The discrepancy ratio changes of models ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] for Depth 0.65 m
Fig. 8. The discrepancy ratio changes of models ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] for Depth 0.91 m
In Figs. 10, 11 and 12, the profiles of velocity simulated by the ANFIS, ANFIS-GA, ANFIS-GA/SVD, Yang et al. [13], and Bonakdari et al. [15] models are compared with the experimental values measured at Z = 0.5, 0.7, 0.9, 1.1, 1.3, and 1.5 m for the three depths D = 0.65, D = 0.91, and D = 1.19 m. Furthermore, a comparison of the maximum u/umax parameter estimated by the ANFIS, ANFIS-GA, ANFIS-GA/SVD, Yang et al. [13], and Bonakdari et al. [15] models is illustrated in Fig. 13. Regarding the simulation results, the RMSE values for the ANFIS, ANFIS-GA, and ANFIS-GA/SVD models at D = 0.65 m and Z = 0.5 m are calculated as 0.045, 0.042, and 0.036. At D = 0.65 m and Z = 0.5 m, the RMSE values for the Yang et al. [13] and Bonakdari et al. [15] models are estimated as 0.068 and 0.086, respectively. For the ANFIS-GA/SVD, ANFIS-GA, ANFIS, Yang et al. [13], and Bonakdari et al. [15] models, the MARE index at D = 0.65 m and Z = 1.3 m is equal to 0.029, 0.037, 0.032, 0.218, and 0.153, respectively. The value of the ρ criterion for the ANFIS-GA/SVD, ANFIS-GA, and
Fig. 9. The discrepancy ratio changes of models ANFIS, ANFIS-GA, ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] for Depth 1.19 m
ANFIS algorithms at D = 0.91 m and Z = 0.5 m is approximated as 0.011, 0.019, and 0.023, respectively. Moreover, the value of the R index for the ANFIS-GA/SVD model at D = 0.91 m and Z = 0.9 m equals 0.949, whereas this index for the ANFIS-GA and ANFIS models is estimated as 0.933 and 0.920, respectively. At D = 1.19 m and Z = 0.9 m, the Scatter Index values for the ANFIS-GA/SVD, ANFIS-GA, ANFIS, Yang et al. [13] and Bonakdari et al. [15] models are estimated as 0.045, 0.060, 0.055, 0.087, and 0.301, respectively. As can be seen, in comparison with the other models, the ANFIS-GA/SVD model shows a better performance in simulating the maximum u/umax.
Fig. 10. Comparing of the profiles of velocity approximated by various models in D = 0.65 m
Fig. 11. Comparing of the profiles of velocity approximated by various models in D = 0.91 m
Fig. 12. Comparing of the profiles of velocity approximated by various models in D = 1.19 m
Fig. 13. Comparing the maximum u/umax simulated by ANFIS-GA/SVD, ANFIS-GA, ANFIS, Yang et al. [13], and Bonakdari et al. [15] models
5 Conclusions

Adequate and appropriate knowledge of the velocity field within open canals is of great importance. In the current work, a new hybrid multi-objective technique (ANFIS-GA/SVD) has been developed to estimate the velocity field within a narrow compound flume. The results of the soft computing model were compared with measured values and with Yang et al.'s [13] approach for various depths (D = 0.65, 0.91, 1.19 m), and the model proved its good performance in modeling the velocity field. According to the modeling results, at a depth of 0.91 m the RMSE and BIAS values of the ANFIS-GA/SVD model were obtained as 0.044 and 0.007, respectively. Also, for the ANFIS-GA/SVD, Bonakdari et al. [15] and Yang et al. [13] models at a depth of 1.19 m, the correlation coefficient was calculated as 0.761, 0.744, and 0.687, respectively, while only about 3% of the results estimated by the ANFIS-GA/SVD model at a depth of 1.19 m had an error of more than 15%. In comparison, the model of Yang et al. [13] at a depth of 1.19 m predicted approximately 58% of the simulated velocities with an inaccuracy smaller than 10%. Regarding the performed analyses, increasing the flow depth increases the accuracy of the ANFIS-GA/SVD model in comparison with the other algorithms. For ANFIS-GA/SVD, the highest, lowest, and average DR at a depth of 0.91 m were calculated as 1.115, 0.907, and 1.007, respectively. Based on the analysis of the discrepancy ratio index, the average DR closest to 1 was obtained for the ANFIS-GA/SVD model at a depth of 0.91 m.
References 1. Schlichting, H.: Boundary Layer Theory, 7th edn. McGraw-Hill book Company (1979) 2. Cebeci, T.: Analysis of Turbulent Flows, 2nd edn. Elsevier Ltd., Oxford, U.K. (2004) 3. Kirkgoz, M.S.: Turbulent velocity profiles for smooth and rough open channel flow. J. Hydraul. Eng. 115(11), 1543–1561 (1989)
4. Keulegan, G.H.: Laws of turbulent flow in open channels. J. National Bureau Stand. 21(6), 707–741 (1938) 5. Nikuradse, J.: Laws of Flow in Rough Pipes, Tech. Memorandum 1292, National Advisory Committee for Aeronautics, Washington, DC. (1950) 6. Cardoso, A.H., Graf, W.H., Gust, G.: Uniform flow in a smooth open channel. J. Hydraul. Res. 27(5), 603–616 (1989) 7. Nezu, I., Nakagawa, H.: Turbulent Open-Channel Flows. CRC Press, Taylor and Francis Group, Balkema, Rotterdam, IAHR Monograph (1993) 8. Kirkgoz, M.S., Ardiçlioglu, M.: Velocity profiles of developing and developed open channel flow. J. Hydraul. Eng. 123(12), 1099–1105 (1997) 9. Coles, D.: The law of the wake in the turbulent boundary layer. J. Fluid Mech. 1(2), 191–226 (1956) 10. Nezu, I., Rodi, W.: Open-channel flow measurements with a Laser Doppler Anemometer. J. Hydraul. Eng. 112(5), 335–355 (1986) 11. Guo, J., Julian, P., Meroney, R.N.: Modified wall wake law for zero pressure gradient turbulent boundary layers. J. Hydraul. Res. 43(4), 421–430 (2005) 12. Castro-Orgaz, O.: Hydraulics of developing chute flow. J. Hydraul. Res. 47(2), 185–194 (2009) 13. Yang, S.Q., Tan, S.K., Lim, S.Y.: Velocity distribution and dip phenomenon in smooth uniform open channel flow. J. Hydraul. Eng. 130(12), 1179–1186 (2004) 14. Hu, Y.F., Wan, W.Y., Cai, F.K., Mao, G., Xie, F.: Velocity distribution in narrow and deep rectangular open channels. J. Zhejiang Univ. (Eng. Sci.) 42(1), 183–187 (2008) 15. Bonakdari, H., Larrarte, F., Lassabatere, L., Joannis, C.: Turbulent velocity profile in fullydeveloped open channel flows. Environ. Fluid Mech. 8(1), 1–17 (2008) 16. Absi, R.: An ordinary differential equation for velocity distribution and dip-phenomenon in open channel flows. J. Hydraul. Res. 49(1), 82–89 (2011) 17. Gac, J.M.: A large eddy based lattice-Boltzmann simulation of velocity distribution in an open channel flow with rigid and flexible vegetation. Acta Geophys. 62(1), 180–198 (2013). https://doi.org/10.2478/s11600-013-0178-1 18. Fullard, L.A., Wake, G.C.: An analytical series solution to the steady laminar flow of a Newtonian fluid in a partially filled pipe, including the velocity distribution and the dip phenomenon. IMA J. Appl. Math. 80(6), 1890–1901 (2015) 19. Yang, S.-Q.: Depth-averaged shear stress and velocity in open-channel flows. J. Hydraul. Eng. 136(11), 952–958 (2010) 20. Lassabatere, L., Pu, J.H., Bonakdari, H., Joannis, C., Larrarte, F.: Velocity distribution in open channel flows: analytical approach for the outer region. J. Hydraul. Eng. 139(1), 37–43 (2012) 21. Jang, J.S.: ANFIS: adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 23(3), 665–685 (1993) 22. Azimi, H., Shabanlou, S., Ebtehaj, I., Bonakdari, H., Kardar, S.: Combination of computational fluid dynamics, adaptive neuro-fuzzy inference system, and genetic algorithm for predicting discharge coefficient of rectangular side orifices. J. Irrig. Drain. Eng. 04017015 (2017) 23. Gholami, A., et al.: A methodological approach of predicting threshold channel bank profile by multi-objective evolutionary optimization of ANFIS. Eng. Geol. 239, 298–309 (2018) 24. Marzbanrad, J., Jamali, A.: Design of ANFIS networks using hybrid genetic and SVD methods for modeling and prediction of rubber engine mount stiffness. Int. J. Automot. Technol. 10(2), 167–174 (2009) 25. Azimi, H., Bonakdari, H., Ebtehaj, I., Shabanlou, S., Ashraf Talesh, S.H., Jamali, A.: A pareto design of evolutionary hybrid optimization of ANFIS model in prediction abutment scour depth. 
S¯adhan¯a 44(7), 1–14 (2019). https://doi.org/10.1007/s12046-019-1153-6
24
H. Bonakdari et al.
26. Bonakdari, H., et al.: Pareto design of multiobjective evolutionary neuro-fuzzy system for predicting scour depth around bridge piers. In: Water Engineering Modeling and Mathematic Tools, pp. 491–517. Elsevier (2021) 27. Ebtehaj, I., et al.: Pareto multiobjective bioinspired optimization of neuro-fuzzy technique for predicting sediment transport in sewer pipe. In: Soft Computing Techniques in Solid Waste and Wastewater Management, pp. 131–144. Elsevier (2021) 28. Khoshbin, F., Bonakdari, H., Ashraf Talesh, S.H., Ebtehaj, I., Zaji, A.H., Azimi, H.: Adaptive neuro-fuzzy inference system multi-objective optimization using the genetic algorithm/singular value decomposition method for modelling the discharge coefficient in rectangular sharp-crested side weirs. Eng. Optim. 48(6), 933–948 (2016) 29. Ebtehaj, I., Bonakdari, H., Zaji, A.H., Azimi, H., Khoshbin, F.: GMDH-type neural network approach for modeling the discharge coefficient of rectangular sharp-crested side weirs. Eng. Sci. Technol. Int. J. 18(4), 746–757 (2015) 30. Azimi, H., Bonakdari, H., Ebtehaj, I., Michelson, D.G.: A combined adaptive neuro-fuzzy inference system–firefly algorithm model for predicting the roller length of a hydraulic jump on a rough channel bed. Neural Comput. Appl. 29(6), 249–258 (2016). https://doi.org/10. 1007/s00521-016-2560-9 31. Azimi, H., Bonakdari, H., Ebtehaj, I., Talesh, S.H.A., Michelson, D.G., Jamali, A.: Evolutionary Pareto optimization of an ANFIS network for modeling scour at Pile groups in clear water condition. Fuzzy Sets Syst. 319, 50–69 (2017) 32. Bonakdari, H., Gharabaghi, B., Ebtehaj, I.: Extreme learning machines in predicting the velocity distribution in compound narrow channels. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) SAI 2018. AISC, vol. 857, pp. 119–128. Springer, Cham (2019). https://doi.org/10.1007/9783-030-01177-2_9 33. Bonakdari, H., Gharabaghi, B., Ebtehaj, I.: A highly efficient gene expression programming for velocity distribution at compound sewer channel. In: The 38th IAHR World Congress from September 1st to 6th, Panama City, Panama, pp. 2019–0221. (2019) 34. Bonakdari, H., Zaji, A.H., Gharabaghi, B., Ebtehaj, I., Moazamnia, M.: More accurate prediction of the complex velocity field in sewers based on uncertainty analysis using extreme learning machine technique. ISH J. Hydraul. Eng. 26(4), 409–420 (2020) 35. Bonakdari, H., Qasem, S.N., Ebtehaj, I., Zaji, A.H., Gharabaghi, B., Moazamnia, M.: An expert system for predicting the velocity field in narrow open channel flows using self-adaptive extreme learning machines. Measurement 151, 107202 (2020) 36. Larrarte, F.: Velocity fields within sewers: an experimental study. Flow Meas. Instrum. 17(5), 282–290 (2006) 37. Bonakdari, H.: Modelisation des écoulements en conllecteur d’assainissement-application à la conception de points de mesures. Ph.D. Thesis, University of Caen, Caen, France (2006) 38. Golub, G.H., Reinsch, C.: Singular value decomposition and least squares solutions. In: Bauer, F.L. (eds.) Linear algebra, HDBKAUCO, vol. 2, pp. 134–151. Springer, Heidelberg (1971). https://doi.org/10.1007/978-3-662-39778-7_10 39. Ebtehaj, I., Bonakdari, H., Zaji, A.H., Azimi, H., Sharifi, A.: Gene expression programming to predict the discharge coefficient in rectangular side weirs. Appl. Soft Comput. 35, 618–628 (2015)
Development of a Language Extension for Configuration of Industrial Asset Capabilities in Self-organized Production Systems Eric Brandt(B) , Felix Brandt, and Dirk Reichelt HTW Dresden, Dresden, Germany {eric.brandt,felix.brandt,dirk.reichelt}@htw-dresden.de http://www.htw-dresden.de/
Abstract. The process of managing and configuring industrial assets has produced a number of solutions over the last few years. Although a number of process models and standards simplify the management of manufacturing equipment, there is still a lack of suitable free software applications and frameworks that incorporate new trends. The increasing need for self-organized production systems in adaptive manufacturing environments poses a particular challenge. This paper examines the current state of model-driven software development in self-organized adaptive production systems and proposes a tool that bridges the gap between asset management and configuration of a self-organized production system. The resulting system is developed and tested against a real self-organized production system, and the following research efforts are derived from the findings gained throughout the testing process. Keywords: Cyber pyhsical systems · Domain specific language · Industry 4.0 · Self-organized production · Modeling language · Tools
1 Introduction

The integration of plants still poses a challenge for many small and medium-sized companies. Although current research is already looking at more advanced issues of how plants can communicate intelligently with each other, a large proportion of companies say that integration and transformation tasks still represent a major part of the effort [1]. The future vision of small lot sizes being manufactured in self-organizing systems can only be realized if integration efforts and modeling tools are improved. Therefore, two major problems need to be addressed. The first is the creation of suitable tooling in the form of modeling software to model holistic Industry 4.0 architectures with seamless integration into business processes as well as shop floor sensors. The second is to take up recent research topics such as self-organized manufacturing systems and their boundaries and integrate them into the same tool [2].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
K. Arai (Ed.): SAI 2022, LNNS 506, pp. 25–42, 2022. https://doi.org/10.1007/978-3-031-10461-9_2
The Reference Model for Industry 4.0 (RAMI 4.0) is a widely cited model that proposes a common language to support the implementation of Industry 4.0 applications [3]. The model consists of three dimensions covering the most common aspects of Industry 4.0 scenarios. Because of that, the model supports the idea of a common language that can be used and shared by all participants (Fig. 1).
Fig. 1. RAMI 4.0 model [3]
The product aspect is dedicated to the individual phases of a product life cycle, from the design plan to commissioning and maintenance. Here, different actors are already considered, which brings further challenges from an integration point of view; for example, manufacturers and maintainers use different models to describe the equipment than an operator does. RAMI tries to bridge these gaps with a standardized asset model. To capture the business perspective, RAMI uses another dimension, the business layers. Each layer reflects a question people typically deal with during the creation of a business model for their industrial application. Typical questions are how the customer retrieves data, how a product should be integrated into the production process, or what the product is supposed to achieve [3]. The third and last dimension defines a model of interconnected systems and products forming a network. Opposed to the SCADA model, RAMI's goal is to enable participants in that network to communicate among each other and adapt their structure and services given ever-changing requirements and product changes [4]. This paper investigates the problems of managing manufacturing resources in the context of self-organized production systems. Current research already deals with both software architectures for the management of manufacturing resources and the need for a greater focus on adaptive production resources [5], as well as different technologies to enable self-organization in industrial environments. Especially for medium-sized companies, there is the problem of introducing new technologies in a systematic manner [6,7]. Previous solutions focus on process models but ignore intuitive usability and integration by the end user. Inspired by model-driven software development and low-code platforms, we offer a possible solution through an extension which offers users several
tools to manage manufacturing resources and integrate them with self-organized systems without deep technical knowledge. In past publications we have already created a basis in the form of various software artifacts [8]. We described the basic benefits and showed them by means of use cases, but we had not yet tested the solution [9]. This paper builds upon the feedback from user tests and introduces a way to model a language through Xtext and Microsoft's language server protocol. To gain knowledge on the requirements for modeling industrial assets and their semantic information, we conducted a systematic mapping study. The results from that study highlight open research topics and gaps we would like to tackle with our proposal. Based on those findings, we present the architecture of the developed software extension, its parts, limits, and challenges throughout the development process, and outline further research topics for future development. The paper is structured into a methodological part, which introduces our research questions, shows how the mapping study was conducted, and explains how the results so far have been tested. Subsequently, we present some areas of expertise that deal with the research questions. Based on this, we first present a formal model for the description of industrial capabilities and describe the integration of the resulting language into a software platform for asset management. At last, we discuss open challenges and outline further research topics.
2 Method

To answer the problems described above, we conducted a systematic review and derived needs and research gaps from it. Four research questions, explained below, served as the basis for our research.

2.1 Research Questions
RQ 1 - What is the current status on capability modeling for manufacturing devices? We already conducted research on capability modelling for industrial assets in previous work, but as the topic is constantly changing, we need to revisit this question to make sure that the developed tool reflects the needs of current research.

(Capabilitiy OR Capabilities OR Skill OR Ability) AND
(Industry 4.0 OR Industry OR IoT OR Manufacturing) AND
(Machine OR Asset OR Ressource)

Code Listing 1.1. Search string RQ1
RQ 2 - How can we model and match capabilities of machinery against product requirements? This question investigates the problem of intelligent information integration. As a couple of standards exist on machine side, the main question is how to map
it against changing and often not standardized product requirements. With this question we would like to identify the current challenges in information integration and propose a way of dealing with them inside our tool.

(Capability Matching OR Semantic relationship OR Integration) AND
(Industry 4.0 OR Industry OR IoT OR Manufacturing) AND
(Machine OR Asset OR Ressource)

Code Listing 1.2. Search string RQ2
RQ 3 - What are the peculiarities in self-organized production systems? Self-organized production systems have been studied many times, and over the years the procedures and algorithms have kept changing. This constant change makes it difficult to create a standardized process model, and therefore it is necessary to investigate what the currently researched mechanisms are and what intersections the different systems offer.

(Tools OR Framework OR Software) AND
(Self-organization OR Self-organized) AND
(Industry 4.0 OR Industry OR IoT OR Manufacturing) AND
(Machine OR Asset OR Ressource)

Code Listing 1.3. Search string RQ3
RQ 4 - What are the software requirements for integration and configuration in self-organized production systems? The final research question should answer how a developed tool can communicate with self-organized production systems and offer the capability of modelling asset capabilities as well as product requirements.

(Tools OR Framework OR Software) AND
(Modeling OR Design OR Modelling) AND
(Industry 4.0 OR Industry OR IoT OR Manufacturing) AND
(Machine OR Asset OR Ressource)

Code Listing 1.4. Search string RQ4
2.2 Conducting the Systematic Mapping Study
The systematic review was conducted according to [10] and performed according to the following scheme (Fig. 2). Inclusion criteria were established for all scientific databases to narrow the search to open-access and current research topics. For the following three databases, the inclusion criteria are as follows:

1. open access
2. not older than 2017
Fig. 2. Systematic mapping study strategy
3. peer reviewed
4. title or abstract must match at least one keyword

The three databases yielded different numbers of relevant papers in the first stage of the research. For Scopus, the initial search returned 2143 papers, of which 20 relevant papers remained after screening (Table 1).

Table 1. Scopus results

  SMS stage           Papers
  Conducting search   2143
  Screening           20
IEEE showed fewer results during the initial search but resulted in more relevant papers after the screening process (Table 2).

Table 2. IEEE results

  SMS stage           Papers
  Conducting search   1187
  Screening           31
Although ScienceDirect returned the most papers during the initial search, the number of relevant papers after applying the inclusion and exclusion criteria and the screening process was lower than for the other two (Table 3).

Table 3. ScienceDirect results

  SMS stage           Papers
  Conducting search   5213
  Screening           13

2.3 Proposal of the Extension Architecture
Given the results of the SMS, we propose an architecture to close the discovered research gaps and explain the challenges encountered during development, especially the peculiarities of working with Ecore models and Xtext when developing language servers. The documentation on the LSP module for Xtext is not very extensive, and we would therefore like to support the scientific community with our insights on working with the framework. For testing the implemented features we created a set of cases (Table 4).

Table 4. UX test criteria

  Tests to be conducted                                         Expected value
  The extension must be easy to install and run                 Setup time
  Errors during development must be clear and transparent       Error description
  Does the modeled use case reflect the output in the system    Boolean
2.4 Test Against SSOP Systems
To test the connection between the modeling of industrial assets and their capabilities and the actual use in self-organized production systems (SSOP), we integrate an existing system into the extension (Table 5).

Table 5. Integration criteria

  Tests to be conducted                                                            Expected value
  The extension should be able to trigger the production process                   Boolean
  The extension should be able to provide models for the SSOP system               Boolean
  The extension should be able to import capability models into the SSOP system    Boolean

3 Results and Background
This section is divided into two parts, which highlight the findings for RQ1 and RQ2. The first part covers modeling techniques and tools for capabilities in the manufacturing domain and discusses future directions and opportunities for the software solution proposed in this paper. The second part looks into findings for self-organized production systems and their peculiarities (RQ3 and RQ4). It discusses solutions for closing the gap between the first two research questions and looks into specific implementations and their ability for customization and extensibility.

3.1 Modeling Capabilities and Capability Matchings
The results of the first research question RQ1 revealed a heterogeneous picture of technologies and methods to describe capabilities of manufacturing assets. We formed three categories of implementation from all results. We received the most frequent hits for OPC UA, AutomationML, and the semantic description of capabilities via the web ontology language (OWL), where the application domain of the capability description varied. The most common application domains were asset configuration and self-description. The simple description of the topology of an asset produced many results, but no recent publications. This suggests a shift towards more expressive modeling methods (Fig. 4). The most relevant papers for RQ1 are listed in Table 6. These not only present concepts but also describe concrete models and their use in application scenarios.

Table 6. Relevant papers for RQ1

  Domain of interest   Paper
  Self-description     [1, 12–14]
  Configuration        [11, 12, 15]
[12] concludes with a model based on a holistic approach that seeks to create a collaborative environment. Although still limited in terms of providing a tool to validate its findings, this publication provides a sufficient overview of modelling techniques and places them in a holistic view of the production context. As Fig. 3 shows, the most used tools for modeling skills or capabilities of manufacturing assets are UML- or XML-based tools and languages like AutomationML and OPC UA information models. For concrete capability modelling, the possibility of integrating product BOMs (bills of materials) and a standardised asset model is required. The authors of [12] and [13] see future research efforts in the creation of a ubiquitous industrial language with associated software tools, as well as the seamless integration of the worker into the production process.

3.2 Self-organized Production Systems
Properties of self-organized systems are well researched for different domains. [21] and [18] took the more generalized properties of self-organized systems, mapped them onto the manufacturing domain, and reviewed competitors who deal with the development of self-organized production systems. The general properties are
Fig. 3. Modelling methods
simple system components, decentral control and locality, dynamic structuring, adaptiveness, flexibility, robustness, and scalability. For the purpose of this paper we were particularly interested in how the systems we found during the research dealt with the design of simple system components as well as flexibility and dynamic structuring from the perspective of configuration and maturity of tooling. The literature we found could be divided into three categories:

1. Proposal
2. Framework
3. Platform

Proposals are papers which propose concepts but untested models for managing and configuring manufacturing assets in self-organized production systems. The second category are papers which provide a testable framework for building and managing assets, and the last category comprises papers which provide a sophisticated platform with an end-to-end solution (Fig. 4). [16] proposed a software framework for modelling software agents with a strong focus on testing the expected behaviour of such systems. While they thoroughly explain and describe the concept and software components, we could not consider this system for our testing purpose, as the authors' software is not publicly accessible. [17] is a paper we categorised as a proposal, as it provides information on how to architect a multi-agent manufacturing execution system, but the results are difficult to reconstruct. Most publications we discovered do not provide a solution which can be tested, except for [17]. The authors of [17] created a proof of concept to verify concepts in self-organised production environments. The system's source code is on GitHub and has a permissive Apache 2.0 license1.
Fig. 4. Modelling methods
We analysed the project with respect to extensibility, configuration, and testability. While the project provides a way to configure resource capabilities and product BOMs, it does not come with a way of describing them declaratively. It does not offer configuration of assets other than by adapting the source code. As this system is the most advanced in terms of reproducibility and testing concepts, we will focus on extending it for our purpose in the following sections.

4 VS Code Architecture
In 2016, Microsoft published the idea of a language server protocol (LSP) in the Visual Studio Code development environment. The idea behind this was to separate the language-specific functionalities from the development environment. To further improve this approach, a protocol was introduced at the same time to ensure that multiple development environments can communicate with the language server [19]. Figure 5 shows the overall interaction between the language server and the production planning system; it depicts a basic scheme from the protocol specification of the LSP, with a communication flow shown as an example.

1 https://github.com/Krockema/MATE.
Fig. 5. Language server protocol
In this case the language client reports to the server that a document has changed. Within the server, appropriate actions are then executed. The implementation of the server is the responsibility of the creator of the language server. With the help of this logic, language-specific syntax checks, code analyses, or resource-intensive calculations can be performed. If a software project makes it necessary to work with different programming languages, it is possible to communicate with several language servers. Using the abstraction level of the language server, we propose to implement configuration specifics and system-dependent commands within the language server (Fig. 6).
Fig. 6. Language server in SSOP
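As a rough illustration of this didChange flow, the sketch below shows how a document-change notification could be handled on the server side. It is a minimal sketch based on the LSP4J library, not the Xtext-generated server used for our extension, and the class name CapabilityDocumentService as well as the validation logic are hypothetical.

import org.eclipse.lsp4j.DidChangeTextDocumentParams;
import org.eclipse.lsp4j.DidCloseTextDocumentParams;
import org.eclipse.lsp4j.DidOpenTextDocumentParams;
import org.eclipse.lsp4j.DidSaveTextDocumentParams;
import org.eclipse.lsp4j.services.TextDocumentService;

// Minimal server-side handler for the "textDocument/didChange" notification.
public class CapabilityDocumentService implements TextDocumentService {

    @Override
    public void didOpen(DidOpenTextDocumentParams params) {
        validate(params.getTextDocument().getUri());
    }

    @Override
    public void didChange(DidChangeTextDocumentParams params) {
        // The client reported an edit; re-run the language-specific checks here,
        // e.g. syntax validation of the capability description.
        validate(params.getTextDocument().getUri());
    }

    @Override
    public void didClose(DidCloseTextDocumentParams params) {
        // Release any resources held for the closed document.
    }

    @Override
    public void didSave(DidSaveTextDocumentParams params) {
        // Optionally re-export generated models on save.
    }

    private void validate(String uri) {
        // Placeholder for the resource-intensive analyses mentioned in the text
        // (syntax checks, code analyses, capability matching).
        System.out.println("Re-validating " + uri);
    }
}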
4.1 Xtext LSP
Eclipse Xtext is a framework for the development of programming languages and in particular domain-specific languages [20]. Support for the Language Server Protocol is an integral part of xtext-core2. Xtext implements most of the features of the Language Server Protocol and offers support for various development environments3. A summary of the features is listed in Table 7. Code lens was used for providing standard syntax highlighting and can be implemented in a very straightforward way. As Xtext uses EBNF for the formal description of the capability mechanism, it was possible to generate basic highlighting out of the box.

2 https://github.com/eclipse/xtext-core.
3 https://github.com/eclipse/xtext-core/tree/master/org.eclipse.xtext.ide/src/org/eclipse/xtext/ide/server.
Table 7. LSP features

  LSP feature     Description
  Code lens       Request from the client to the server to show contextual information
                  interspersed in the given source code document
  Command         Used to interact with the editor. The client issues a command which is
                  eventually handled by the language server
  Content assist  The server takes context information from the active client document and
                  provides assistance such as auto completion
  Formatting      Implements features like custom syntax highlighting
  Hover           Extra information for errors or warnings is shown while hovering over
                  specific parts of the document
We added extra information for asset inspection to navigate easily to the source of the requirements as well as to the asset file associated with the capability matching description (Fig. 7).
Fig. 7. Code lens
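The data types behind such a lens are small. The following sketch builds one code-lens entry with plain LSP4J model classes, roughly what a code-lens handler would return for an asset reference; the command identifier caso.showAsset is a hypothetical example, not part of the actual extension.

import java.util.Collections;
import java.util.List;
import org.eclipse.lsp4j.CodeLens;
import org.eclipse.lsp4j.Command;
import org.eclipse.lsp4j.Position;
import org.eclipse.lsp4j.Range;

public class AssetCodeLensExample {

    // Builds a single code-lens entry anchored to the line where an asset is referenced.
    static List<CodeLens> lensForAsset(String assetName, int line) {
        CodeLens lens = new CodeLens();
        lens.setRange(new Range(new Position(line, 0), new Position(line, assetName.length())));
        // The command is executed by the client when the lens is clicked;
        // "caso.showAsset" is a hypothetical command id handled by the extension.
        lens.setCommand(new Command("Show asset " + assetName, "caso.showAsset"));
        return Collections.singletonList(lens);
    }

    public static void main(String[] args) {
        System.out.println(lensForAsset("machine.component2", 4));
    }
}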
Commands are essential for providing request-response-like functionality with external systems and are used to integrate the self-organized production systems. As the language server is aware of the whole project structure, it can transform and delegate every asset description and capability description to any third-party application. The server is only limited by the capabilities of the underlying host language, which is in this case Java (Fig. 8).
Fig. 8. Commands
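A command arrives at the server as a workspace/executeCommand request. The sketch below shows how such a request could be dispatched with LSP4J; the command id caso.ssop.start and the returned message are assumptions for illustration, not the actual implementation.

import java.util.concurrent.CompletableFuture;
import org.eclipse.lsp4j.DidChangeConfigurationParams;
import org.eclipse.lsp4j.DidChangeWatchedFilesParams;
import org.eclipse.lsp4j.ExecuteCommandParams;
import org.eclipse.lsp4j.services.WorkspaceService;

// Workspace service that reacts to a hypothetical "caso.ssop.start" command from VS Code.
public class SsopWorkspaceService implements WorkspaceService {

    @Override
    public CompletableFuture<Object> executeCommand(ExecuteCommandParams params) {
        if ("caso.ssop.start".equals(params.getCommand())) {
            // Transform the asset and capability descriptions of the project into the
            // JSON configuration expected by the SSOP system and hand it over.
            return CompletableFuture.completedFuture("SSOP simulation triggered");
        }
        return CompletableFuture.completedFuture(null);
    }

    @Override
    public void didChangeConfiguration(DidChangeConfigurationParams params) {
        // Not needed for this sketch.
    }

    @Override
    public void didChangeWatchedFiles(DidChangeWatchedFilesParams params) {
        // Not needed for this sketch.
    }
}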
Content assist is different from code lens in that it provides information about the next steps during development. If our users start to develop a capability matching, it should provide sophisticated code completion for declared assets in the project structure as well as fitting product requirements (Fig. 9). The hover functionality, as shown in Fig. 10, ensures that detailed information is shown for the parts of the code the user hovers over.
Fig. 9. Content assist
It can show fine-grained explanations for debugging or semantic enrichment for complex asset descriptions. In our proposal it is used to explore details about assets or products and to link to fitting documents.
Fig. 10. Hover
Figure 11 demonstrates code actions, a quality-of-life feature of the language server protocol and of VS Code. With code actions, the developer of an extension can provide template-like snippets for faster development. Especially users who are not familiar with the language can generate snippets of code to quick-start a project.
Fig. 11. Code action
Implementing this in Xtext is straightforward, as xtext-core provides several service classes which can be extended. For the use of code actions we extended the IdeContentProposalProvider and added several snippets for the language (Listing 1.5).
class ContentAssistProvider extends IdeContentProposalProvider {
    ...
    override protected _createProposals(RuleCall ruleCall, ContentAssistContext context,
            IIdeContentProposalAcceptor acceptor) {
        if (ruleCall.rule == endpointRule) {
            acceptor.accept(
                proposalCreator.createSnippet(endpointSnippet, "endpoint snippet", context), 0)
        }
        super._createProposals(ruleCall, context, acceptor)
    }
    ...
}

Code Listing 1.5. Content assist provider with snippet proposals
4.2 Integration of the Extension
To describe and configure capabilities in self-organized production systems we created a language which serves as a contract between two parties (Listing 1.6). While the product typically comes with a bill of materials (BOM), the resources need to offer certain production operations and skills. For this purpose we already created a formal model for asset descriptions in [8], which is able to generate OPC-UA information models for RAMI-compliant asset administration shells [22]; we extended it with a capability description, and the current status can be tracked here4.

CapabilityMatching {
    parties [machine, product]

    product.Sewing requires machine.component1
    product.Injection requires machine.component2

    Constraint {
        machine.component2.temperature [100, 150]
    }
}

Code Listing 1.6. Capability Description
As the self-organized production system (SSOP) from [17] demands a configuration via code, we approached that problem by extending the project with a REST endpoint that starts and runs the simulation with a formatted JSON description generated from the capability and asset descriptions. This description is compliant with the needs of the SSOP system. To start and run the SSOP system for the description from Listing 1.6 (Fig. 12), we created a command for the extension; when it is executed, the language server submits a POST request to the REST endpoint we added to the SSOP system.

4 https://gitlab.com/smartproduction/caso/caso-vscode-extension.
Fig. 12. Running the configured SSOP system through code extension
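The POST request itself can be issued with the JDK HTTP client, as sketched below. The endpoint path /simulation/start and the JSON field names are assumptions derived from Listing 1.6, not the actual API of the extended SSOP system.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SsopRestClient {

    public static void main(String[] args) throws Exception {
        // Illustrative JSON generated from the capability matching in Listing 1.6.
        String config = """
            {
              "parties": ["machine", "product"],
              "requirements": [
                {"operation": "Sewing", "resource": "machine.component1"},
                {"operation": "Injection", "resource": "machine.component2",
                 "constraints": {"temperature": [100, 150]}}
              ]
            }
            """;

        // Hypothetical endpoint added to the SSOP system for this integration.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/simulation/start"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("SSOP responded with status " + response.statusCode());
    }
}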
The use of LSP features within the development environment facilitated the description of the asset configuration for the SSOP. The code-lens functionality expanded the textual description with asset-specific information to help the creator determine which asset is being addressed and what features it has. A good way to start the external SSOP system was to use commands: an ssop-start command was created, and the corresponding logic is executed within the language server. To support the creation of the description when the user has no knowledge about the structure of the assets, code completion was implemented within the language server. It reveals detailed information about the inner structure of a BOM or a machine resource. The user does not need an understanding of the SSOP system itself; the extension handles the transformation and mapping.
5 Conclusion and Future Work

This paper analysed the current state of modelling of industrial assets and their capability to fulfil certain product requirements. The results and publications for RQ1 and RQ2 showed that most modelling languages and research efforts leverage OPC UA, AutomationML, and ontology languages. The main goal is still to find holistic models to configure and describe manufacturing equipment. The main topics are self-description and the modelling of interoperability aspects in automating production processes. Although there is no clear trend favouring a specific modelling language, it is clear that UML-driven languages like AutomationML are not covered as much as OPC UA or ontology-based languages. This might be due to the fact that UML-based languages cannot cover complex semantic description and verification. Answering RQ3 and RQ4 provided insights into the maturity level of self-organized systems and the current state of modelling skills or capabilities in manufacturing. A major share of publications were limited in that they neither come with a testable platform nor are usable in different production scenarios. The solution we chose to test our proposal of an integrated tool to model assets and their semantics was the most mature in terms of accessibility and extensibility. The mapping of common technology showed that there is still a gap in frameworks and tools for configuration, and more research needs to be done in this field.

5.1 Findings in Developing the VSCode Extension
The initial testing criteria (Table 8) provided insights into challenges for future research. Provisioning the VSCode extension via Microsoft's marketplace would be one way to distribute the software, but it can also be deployed locally by hosting the extension in an external repository. The main goal of not having the end user deal with deployment and the setup of complex systems in order to quickly manage and describe manufacturing assets could therefore be achieved, which resolved one of the major issues we discovered in previous publications [8]. VSCode and Microsoft's language server protocol already come with integrated debugging solutions that fulfil the need to provide standardized feedback from different subsystems through a common interface. The debugging console keeps the user up to date with the current state of the SSOP system without having to deal with the underlying system. A final challenge we dealt with was closing the gap of configuring the SSOP system without code changes or specific knowledge of the system itself. The extension and the modelling language we presented here and in previous publications [9] served that purpose.

Table 8. Test criteria

  Conducted tests                                               Results
  The extension must be easy to install and run                 Depends on the deployment
  Errors from the SSOP system during development must           Usage of the debugging feature in VSCode
  be clear and transparent
  Does the modelled use case reflect the output in the system   SSOP system needs further improvements
The second set of criteria (Table 9) for a running VSCode extension was mostly technical and should reflect the use case of integrating self-organised production systems and modelling languages. While it was easy to communicate with third-party applications through common interfaces like REST or RPC, it is still a challenge to integrate different concepts of asset modelling, and especially of behaviour or capabilities, in self-organised production scenarios. The extension can run the simulation of the chosen system and configure the underlying production process, but it was necessary to fully understand the capability model of the SSOP system.

Table 9. Integration criteria

  Conducted tests                                                                 Results
  The extension should be able to trigger the production process                  Fulfilled
  The extension should be able to provide asset models for the SSOP system        Partly fulfilled
  The extension should be able to import capability models into the SSOP system   Partly fulfilled

5.2 Limits and Future Directions
Although the results of this paper have shown that it is possible to integrate modelling languages into a tool that integrates self-organised production systems and abstracts the complexity away from the end user, it is still a challenge to find a common model for matching product requirements against available machine resources. Our solution provides a formal matching description based on RAMI 4.0 suggestions, but the variety of self-organized systems often use their own models to describe resource capabilities. Future research needs to address this by involving existing standards for asset representation. As for the tooling, the declarative description of capabilities needs to be supported by more graphical solutions to reflect the current state of manufacturing and processes. Visual Studio Code offers webviews, and therefore we need to find ways to express the given use cases not only through formal textual languages but also through graphical process and asset models. The adoption of the extension could be further increased by publishing the software together with Visual Studio Code as a whole, for example on top of the recently published openvscode-server5. Although the paper proposes a solution to abstract the complexity of self-organized systems away from the end user, it still lacks validation through user tests with a significant number of users.

Acknowledgment. This research has been partially funded by the German Federal Ministry of Education and Research within the funding program Forschung an Hochschulen. Project number: 13FH133PX8

5 https://github.com/gitpod-io/openvscode-server.
References

1. Maddikunta, P.K.R., et al.: Industry 5.0: a survey on enabling technologies and potential applications. J. Ind. Inf. Integr. 26, 100257 (2021). https://doi.org/10.1016/j.jii.2021.100257
2. Alcácer, V., Cruz-Machado, V.: Scanning the Industry 4.0: a literature review on technologies for manufacturing systems. Eng. Sci. Technol. Int. J. 22(3), 899–919 (2019). https://doi.org/10.1016/j.jestch.2019.01.006
3. Federal Ministry for Economic Affairs and Energy: Plattform Industrie 4.0 - RAMI4.0 - a reference framework for digitalisation. Plattform Industrie 4.0 (2019)
4. Wright, P.K., Greenfeld, I.: Open architecture manufacturing. The impact of open-system computers on self-sustaining machinery and the machine tool industry, pp. 41–47 (1990). https://www.scopus.com/inward/record.uri?eid=2-s2.00025252006&partnerID=40&md5=db0e8cf053193548de25c09918669899
5. Vogel-Heuser, B., Seitz, M., Cruz Salazar, L.A., Gehlhoff, F., Dogan, A., Fay, A.: Multi-agent systems to enable Industry 4.0. At-Automatisierungstechnik 68(6), 445–458 (2020). https://doi.org/10.1515/auto-2020-0004
6. Gerrikagoitia, J.K., Unamuno, G., Urkia, E., Serna, A.: Digital manufacturing platforms in the Industry 4.0 from private and public perspectives. Appl. Sci. 9(14), 2934 (2019)
7. DKE Deutsche Kommission Elektrotechnik Elektronik Informationstechnik in DIN und VDE: German Standardization Roadmap: Industrie 4.0. DIN e. V., p. 146 (2018)
8. Brandt, F., Brandt, E., Heik, D., Reichelt, D.: Domain specific assets: a software platform for use case driven human friendly factory interaction, pp. 1–12
9. Brandt, E., Brandt, F., Reichelt, D.: Managing manufacturing assets with end-users in mind. In: Arai, K. (ed.) FICC 2021. AISC, vol. 1364, pp. 968–986. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73103-8_70
10. Petersen, K., Feldt, R., Mujtaba, S., Mattsson, M.: Systematic mapping studies in software engineering. In: 12th International Conference on Evaluation and Assessment in Software Engineering, EASE 2008, June 2008. https://doi.org/10.14236/ewic/ease2008.8
11. Hauck, M., Machhamer, R., Czenkusch, L., Gollmer, K.-U., Dartmann, G.: Node and block-based development tools for distributed systems with AI applications. IEEE Access 7, 143109–143119 (2019). https://doi.org/10.1109/ACCESS.2019.2940113
12. Qin, Z., Lu, Y.: Self-organizing manufacturing network: a paradigm towards smart manufacturing in mass personalization. J. Manuf. Syst. 60, 35–47 (2021). https://doi.org/10.1016/j.jmsy.2021.04.016
13. Morgan, J., Halton, M., Qiao, Y., Breslin, J.G.: Industry 4.0 smart reconfigurable manufacturing machines. J. Manuf. Syst. 59, 481–506 (2021). https://doi.org/10.1016/j.jmsy.2021.03.001
14. Silva, R.G., Reis, R.: Adaptive self-organizing map applied to lathe tool condition monitoring. In: 2017 22nd IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), pp. 1–6 (2017). https://doi.org/10.1109/ETFA.2017.8247641
15. Pang, J., Huang, Y., Xie, Z., Han, Q., Cai, Z.: Realizing the heterogeneity: a self-organized federated learning framework for IoT. IEEE Internet Things J. 8(5), 3088–3098 (2021). https://doi.org/10.1109/JIOT.2020.3007662
16. Nascimento, N., Alencar, P., Lucena, C., Cowan, D.: A metadata-driven approach for testing self-organizing multiagent systems. IEEE Access 8, 204256–204267 (2020). https://doi.org/10.1109/ACCESS.2020.3036668
17. Mantravadi, S., Li, C., Møller, C.: Multi-agent manufacturing execution system (MES): concept, architecture & ML algorithm for a smart factory case. In: ICEIS 2019 - Proceedings of the 21st International Conference on Enterprise Information Systems, vol. 1, pp. 465–470 (2019). https://doi.org/10.5220/0007768904770482
18. Ma, K.: Proof of Concept for a Next Generation ERP System for an Industrial 4.0 Environment. GitHub Repository (2021). https://github.com/Krockema/MATE
19. https://microsoft.github.io/language-server-protocol/overviews/lsp/overview/. Accessed 27 Sept 2021
20. Tutorials and Documentation for Xtext 2.0 (2011). http://www.eclipse.org/Xtext/documentation/
21. Krockert, M.: ERP 4.0: symbiosis of self-organization (2020)
22. Plattform Industrie 4.0; ZVEI: Details of the asset administration shell part 1 - the exchange of information between partners in the value chain of Industrie 4.0, p. 473 (2020)
Open-Source Mapping Method Applied to Thermal Imagery

André Vong1, João P. Matos-Carvalho2(B), Dário Pedro3, Slavisa Tomic2, Marko Beko2,4, Fábio Azevedo5, Sérgio D. Correia2,6, and André Mora1,7
4
NOVA School of Science and Technology, 2829-516 Caparica, Portugal [email protected] 2 Cognitive and People-Centric Computing Labs (COPELABS), Universidade Lus´ ofona de Humanidades e Tecnologias, Campo Grande 376, 1749-024 Lisbon, Portugal {joao.matos.carvalho,slavisa.tomic}@ulusofona.pt 3 PDMFC - Projecto Desenvolvimento Manuten¸ca ˜o Forma¸ca ˜o e Consultadoria, 1300-609 Lisbon, Portugal [email protected] Instituto de Telecomunica¸co ˜es, Instituto Superior T´ecnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal [email protected] 5 BEV - Beyond Vision, 2610-161 ´Ilhavo, Portugal [email protected] 6 VALORIZA, Polytechnic Institute of Portalegre, 7300-555 Portalegre, Portugal [email protected] 7 Centre of Technology and Systems, UNINOVA, 2829-516 Caparica, Portugal [email protected]
Abstract. The increase of the world population has had an impact in the agricultural field. As a consequence, this implies an increase in food production. To address this demand, farmers had to boost crop yields and land sizes. The latter one led to ineffectiveness of traditional methods for crop monitoring. For this reason, farmers began to adopt and adapt technological breakthroughs into agriculture by applying the concept of remote sensing. Remote sensing aims to collect information at a distance through the use of cameras and, in some cases, aerial platforms. Furthermore, multispectral cameras allowed farmers to better understand the management of crops and crop’s health. Also, through study of literature, it was found that thermal imaging could be an important tool to measure the crop’s condition. However, due to complexity of thermal images, an open-source tool that integrated this functionality was not found. Therefore, this paper proposes an open-source method that addresses the complexities of thermal images and is able to produce maps by exploiting them. Keywords: Remote sensing · Thermal images Structure from Motion (SfM) · Orthomap
· Photogrammetry ·
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 506, pp. 43–57, 2022. https://doi.org/10.1007/978-3-031-10461-9_3
1 Introduction
The increase of the world population has had an impact on the agricultural field. The rise in population means that farmers have to increase crop yields and use more farmland to keep up with this expansion. Likewise, concern over crop monitoring has increased. Thus, farmers started to use technological advances in their favor, most notably unmanned aerial vehicles (UAV). These provide accessible operational costs with low complexity, high mobility, and high resolution. In combination with photogrammetry techniques, such as Structure from Motion (SfM), they allow high-resolution mapping of cropland [1]. In the same way, thermal remote sensing has seen increased interest in the agricultural field to monitor temperatures of soil and crop surfaces, with potential uses in the detection of diseases and in the planning of irrigation and harvesting [2]. One form of monitoring is through the use of thermal imagery. However, this type of image poses a set of challenges regarding its processing. The main challenge in the image processing of thermal datasets is the lack of feature points, which leads to a lower-resolution model. A solution to this is to increase the image's footprint in the hope of capturing more feature points. Still, this solution leads to the same low-resolution model due to the height increase at the moment of capture. This paper comprises the present introduction followed by a summary of image processing works in the agricultural field in Sect. 2. Section 3 provides a detailed explanation of the proposed methodology, while Sects. 4 and 5 present the experimental results and their validation, respectively. Finally, a discussion of the obtained results is given in Sect. 6, with conclusions and further remarks in the final section. This paper aims to describe a developed open-source method to perform mapping using thermal imagery that is expected to further improve crop monitoring.
2 Related Work
Remote sensing consists of the acquisition of information regarding a study area from a distance. This process often uses complementary machinery, like aircraft or satellites. In agriculture, remote sensing has been applied to crop and soil monitoring using cameras. The images from these cameras can then be used in photogrammetry techniques, such as Structure from Motion (SfM) [3], and a point cloud of the captured area can be reconstructed digitally. Furthermore, some cameras are able to capture images with frequencies outside of the visible spectrum. With these, different point clouds can be constructed based on the combination of different frequencies. For instance, the MicaSense RedEdge-M camera [4] can capture images in five different bands, three in the visible spectrum (red (R), green (G), and blue (B)) and two in the invisible spectrum, near-infrared (NIR) and red edge (RE). Using a combination of the available bands, an estimation
on vegetation, water, and chlorophyll indexes can be made. Equation (1) gives the expressions used to calculate each index, where the Normalized Difference Vegetation Index (NDVI) estimates the vegetation area, the Normalized Difference Water Index (NDWI) measures the water content, and the chlorophyll content is assessed by the Normalized Difference Red Edge Index (NDRE) [5].

  NDVI = (NIR - Red) / (NIR + Red)
  NDWI = (NIR - SWIR) / (NIR + SWIR)        (1)
  NDRE = (NIR - RE) / (NIR + RE)

By using these indexes in agriculture, the crop's condition can be monitored. In [6], Hatfield et al. applied these indexes to monitor different crop characteristics in corn, soybean, wheat, and canola over eight years under different soils. The authors noted that changes within a crop follow a trend, with differences between different types of crops. They also observed that, during a crop's growth, similar reactions occurred in the R, G, and B frequencies, while different crops emitted different infrared patterns over the years. As an example, the infrared signal emitted by corn crops showed a significant increase over the crop's growth compared to the indexes of the visible spectrum, which developed more slowly. This was partly connected to the reflectance values of the soil converging to the values of the visible indexes rather than the infrared's. While [6] focused on the variations of multiple indexes, Gitelson et al. in [7] sought to correlate the chlorophyll content with reflectance. The light absorbed by a leaf can be described as a function of its chlorophyll content; this way, the potential output of the photosynthesis process can be directly estimated [8,9]. Studies also indicate that plant stress affects the chlorophyll content of leaves [10–13]. Using UAV photogrammetry, Hobart et al. in [14] mapped an apple tree orchard to assess tree height and position. To validate the results obtained, they compared them against results acquired using LiDAR, as displayed in Fig. 1. The comparison concluded that photogrammetry estimated lower plant heights than LiDAR. However, this can be due to smaller branches being displaced by wind, leading to more difficult feature matching; this could be rectified by a slower flight speed, which consequently increases image overlap. Arriola-Valverde et al. in [15] built a temporal graph that assessed the health of crops. Crop health was based on each plant's height and radius over its development period. Using height assessment through Digital Elevation Models (DEM), the evolution of each plant could be monitored, while the difference in DEM of each plant was used to determine its radius. The temporal graph is depicted in Fig. 2. As expected, the growth of each plant increases over time, until a decline was detected shortly after. This decline was associated with the presence of parasites, which was confirmed by a visual inspection of the crop.
Fig. 1. UAV (orange point cloud) and LiDAR (blue point cloud) overlapped. Orange line correspond to the estimation of tree height using UAV data. In the same way, LiDAR height estimation is represented with the blue line.
Fig. 2. Top figure shows the orthomosaic and delimitation of each plant. The graph represents each plant’s growth.
These works illustrate successful implementations of photogrammetry in agriculture to monitor the development of crop health and to detect plagues. Although these examples illustrate the good potential of remote sensing for crop monitoring, the usefulness of vegetation indexes can be limited, as conclusions can only be reached when significant damage has already been inflicted onto the crops.
Besides, temperature variation has been used as an important parameter to monitor growth. As the thermal condition of a crop can be traced to its rate of perspiration, which in turn correlates with atmospheric conditions, temperature can be used as a measuring tool to map any stress (i.e., water and/or nutrient deficiency, etc.) that crops might be experiencing.
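To make the index definitions in Eq. (1) concrete, the short sketch below computes NDVI, NDWI, and NDRE for a single pixel from co-registered band values. It is our own illustration rather than code from the cited works, and it assumes reflectance values already normalised to [0, 1].

// Pixel-wise vegetation indexes from Eq. (1); inputs are co-registered band reflectances.
public final class SpectralIndexes {

    // Generic normalised difference (a - b) / (a + b), guarded against division by zero.
    static double normalisedDifference(double a, double b) {
        double sum = a + b;
        return sum == 0.0 ? 0.0 : (a - b) / sum;
    }

    static double ndvi(double nir, double red)     { return normalisedDifference(nir, red); }
    static double ndwi(double nir, double swir)    { return normalisedDifference(nir, swir); }
    static double ndre(double nir, double redEdge) { return normalisedDifference(nir, redEdge); }

    public static void main(String[] args) {
        // Illustrative reflectances for a single healthy-vegetation pixel.
        double nir = 0.45, red = 0.08, swir = 0.20, redEdge = 0.30;
        System.out.printf("NDVI=%.2f NDWI=%.2f NDRE=%.2f%n",
                ndvi(nir, red), ndwi(nir, swir), ndre(nir, redEdge));
    }
}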
3 Proposed Method
The proposed method is based on the open-source algorithm developed in [16]. This algorithm implements the necessary functions to produce point clouds using aerial imagery [17]. The introduced images corresponding to the surveyed area are presented to the algorithm. Here, extraction of features is performed using SIFT algorithm [18]. This algorithm allows the extraction of scale-invariant features used to build a descriptor that contains the position and color of the feature, as well as its neighboring points. This information is essential so that every descriptor is distinct. The descriptors are used in the next step, feature matching, where the descriptors of different images are compared. When descriptors are considered to be similar enough, it is said that a match is found. These feature matches tell the program what images the feature is detected and the transformations that are suffered throughout the images. With this information, a sparse point cloud can be assembled after removing image’s distortion. A point cloud patch-based densification method developed by [19] is applied to the sparse point cloud. Each image that integrates with the reconstruction is paired by evaluating the camera’s viewpoint. Depthmaps of image pairings are computed and used to create a dense point cloud. This process is done by adding a point to a plane considering the intrinsic parameters, rotation matrix, and camera’s coordinates. The same point represented in neighboring cameras is also added. An evaluation of the depth value of each point is performed to ensure that no point repeats itself in the point cloud. This process repeats until all the points are added. The densification is then performed by propagating neighboring pixel panels if their cost is similar of the current pixel plane. This is due to the similarities of planes in the same neighborhood. A method developed by [20] was used to create a mesh of a dense point cloud by applying an implicit function combined with an octree. The point cloud points are encapsulated in an octree leaf and a base function that best represents the surface is estimated. A surface can be extracted by calculating the Poisson equation on the base function and averaging the result values. Finally, a path-based method of mesh texturing from [21] is applied. Here, the label system assigns the optimal visual image to each texture patch. To this end, a Markov Random Field (MRF) energy function is used to determine the optimal image, and the smoothness of how it integrates with the neighboring patches.
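The matching step described above typically accepts a correspondence only when the best descriptor distance is clearly smaller than the second best. The sketch below illustrates that ratio test on plain descriptor arrays; it is a simplified stand-in for the matching used by the underlying workflow, not its actual implementation.

// Simplified descriptor matching with a nearest/second-nearest ratio test.
public final class RatioTestMatcher {

    // Returns, for each descriptor in a, the index of its match in b, or -1 if rejected.
    static int[] match(double[][] a, double[][] b, double ratio) {
        int[] matches = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            double best = Double.MAX_VALUE, second = Double.MAX_VALUE;
            int bestIdx = -1;
            for (int j = 0; j < b.length; j++) {
                double d = squaredDistance(a[i], b[j]);
                if (d < best) { second = best; best = d; bestIdx = j; }
                else if (d < second) { second = d; }
            }
            // Accept only unambiguous matches: the best candidate must be clearly
            // closer than the runner-up (squared distances, hence ratio squared).
            matches[i] = (best < ratio * ratio * second) ? bestIdx : -1;
        }
        return matches;
    }

    static double squaredDistance(double[] x, double[] y) {
        double s = 0;
        for (int k = 0; k < x.length; k++) {
            double d = x[k] - y[k];
            s += d * d;
        }
        return s;
    }
}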
From the study of the workflow presented above, it was observed that implementation of the thermal imaging process was lacking. Furthermore, literature research showed that no open-source mapping tool for thermal images was available. Considering the characteristics of the thermal images, an assumption can be made. Thermal images often present low resolution, which is a result of the image capturing method during the survey and the survey area. In surveys performed in outdoor environments, a certain texture’s homogeneity is present. This fact, coupled with the lack of features in thermal images, led to a struggle in the feature extraction step. A standard method to correct this issue is by increasing the altitude from which the image is captured, expanding the camera’s footprint. However, by increasing the altitude of image capture, the resolution of images is reduced, which will influence the resolution of the point cloud. Therefore, an adaptation was performed on the developed workflow that integrates the ability to process thermal images without sacrificing the resolution. This adaptation consists of using the unaltered infrared images captured from the camera. These images contain the thermal information in their metadata. As the workflow already supports multispectral images, it can perform feature extraction and matching on infrared images. With the features matched, the program now knows which images express what features and can create feature trackers that contain the evolution of the feature over the images. As the thermal images carry the same characteristics as the infrared images, such as image size, a feature descriptor detected in one image is located in the same position in its corresponding thermal image. This way, the operations that would be performed on the infrared images are now carried out on the thermal images. The proposed method is illustrated in Fig. 3.
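The substitution itself can be pictured as a simple lookup: because the infrared and thermal rasters share the same geometry, a keypoint found at pixel (x, y) in the infrared image addresses the temperature value at the same (x, y) in the thermal raster. The sketch below illustrates this pairing on plain arrays; it is a conceptual illustration of the adaptation, not the actual workflow code.

import java.util.ArrayList;
import java.util.List;

// Conceptual sketch: reuse keypoints detected on the infrared band to sample thermal data.
public final class ThermalSubstitution {

    // A keypoint detected on the infrared image (pixel coordinates).
    record KeyPoint(int x, int y) { }

    // A keypoint enriched with the temperature read from the co-registered thermal raster.
    record ThermalPoint(int x, int y, float temperature) { }

    // thermal[y][x] holds the temperature of pixel (x, y); the raster has the same size
    // as the infrared image on which the keypoints were detected.
    static List<ThermalPoint> attachTemperatures(List<KeyPoint> infraredKeypoints, float[][] thermal) {
        List<ThermalPoint> result = new ArrayList<>();
        for (KeyPoint kp : infraredKeypoints) {
            result.add(new ThermalPoint(kp.x(), kp.y(), thermal[kp.y()][kp.x()]));
        }
        return result;
    }
}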
4 Experimental Results
In this section, the obtained results using the implemented workflow are displayed. The workflow was tested on an [email protected] GHz processor computer with 64 GB of RAM running a 64-bit Ubuntu 18.04 LTS distribution. The datasets were captured using a Flir Vue Pro R camera that allows accurate and calibrated images to be captured alongside radiometric data from the aerial platforms that can be used to assist in precision agriculture. The dataset tested consisted of 1821 images at the height of approximately 285 m. Given the displacement speed of the UAV and altitude, an overlapping of approximately 90% was achieved. The results can be seen in Fig. 4. The first image illustrates the 3D model generated from the thermal images. Here, it can be noted that the model’s edges are represented with a particular elevation, contrarily to what can be observed in reality. This can be attributed to the lack of constraints near the edges. The second figure represents the 2.5D model. This type of model consists of projecting the point cloud onto a plane and applying the parallax effects into
Fig. 3. Step-by-step representation of the proposed methodology.
the vertical objects [22]. As no vertical objects are present in this model, the models consist only of the horizontal plane. Moreover, it can also be stated that the artifacts near the edges present in the 3D model have been removed. The final image depicts the orthomap generated. An orthomap is composed of geometrically corrected images (orthophotos) so that the scale is uniform. It transmits that images integrated into the orthomap were taken parallel to the ground. This is the predominant type of model used when measurements of the true distances are required [23].
5 Validation
Due to the pandemic situation, the models could not be validated quantitatively: assessing the models' accuracy requires measurements of the survey area to be collected, for which authorization was needed but unfortunately not available. With this in mind, the platform designed in [24,25] was used to carry out a visual assessment of the models. The analysis of the models generated from the implemented workflow yielded satisfactory results with no visible artifacts besides the ones present in the 3D model. In addition, because geographic coordinates are stored in the images' metadata, the model's position on the Earth's surface could be evaluated using mapping platforms like Google Maps or interactive maps such as Leaflet. This evaluation is depicted in Fig. 5a. Visually assessing the overlap, it can be stated that a good overlap of the model is present despite some deviations, mainly the road on the bottom right.
(a) 3D model.
(b) 2.5D model.
(c) Thermal orthomap.
Fig. 4. Models generated from thermal images
Also, by only using thermal images, the result is a model, with some gaps appearing on the orthophoto, more specifically at the center, left and rightmost sides of the model. Furthermore, in comparison to the previous result (Fig. 5a), the model presents itself with lower resolution and a smoother surface, confirming the challenges faced when processing the thermal datasets. A comparison of the models can be seen in Fig. 5. In the same way, to validate the developed work, a comparison of results was observed against a commercial program, Pix4D [26]. The trial version of the commercial program Pix4D was used. This applied certain restrictions, such as a limitation of 1000 processed images, and only being able to process a maximum of 3 datasets. The obtained results can be seen in Fig. 6. Analyzing the figure of the obtained georeferenced orthomap, the unfilled gaps standout. As mentioned before, a limitation of 1000 images restricted the number of images that can be made available to the program. In addition, the deviations that are present from the result of our work do not seem to manifest in the orthomap of Pix4D. A justification for this can be due to a lower number of images used. On the other hand, another distinct difference between our work and Pix4D, is the fact that the latter did not use thermal data of the images. Instead, the initial infrared images were used to perform the mapping, even though Pix4D software acknowledged that the images being introduced to the program contained, in fact, thermal information.
(a) Model Result from Original Thermal Images.
(b) Model Result from Processed Thermal Images.
Fig. 5. Georeferenced models overlapped on Google maps interactive map.
Thus, a preprocessing of the infrared images was performed to further validate the thermal model obtained using the developed system. To this end, the thermal information was extracted from the metadata of the infrared images and stored in new image files. The dataset containing the thermal data was then imported into Pix4D. The results can be seen in Figs. 7b and 8.
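The paper does not state which tool was used for this extraction step; one common route, assuming ExifTool is installed and the camera embeds a RawThermalImage tag (as FLIR radiometric JPEGs typically do), is sketched below. Folder names are illustrative.

import subprocess
from pathlib import Path

def extract_raw_thermal(src_jpg: Path, dst_dir: Path) -> Path:
    """Dump the embedded raw thermal frame of a radiometric JPEG into a new file."""
    dst = dst_dir / (src_jpg.stem + "_thermal.png")
    raw = subprocess.run(
        ["exiftool", "-b", "-RawThermalImage", str(src_jpg)],
        check=True, capture_output=True,
    ).stdout
    dst.write_bytes(raw)  # the embedded container (often PNG or TIFF) depends on the camera
    return dst

out_dir = Path("thermal_only")
out_dir.mkdir(exist_ok=True)
for jpg in Path("flir_dataset").glob("*.jpg"):
    extract_raw_thermal(jpg, out_dir)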
Fig. 6. Georeferenced model obtained from Pix4D.
Again, due to the restriction on the number of images, only 1000 thermal images could be imported into Pix4D. The effort to fill the gaps in the mesh model was only partially successful: the smallest gap in Fig. 6 was filled with some success in Fig. 7b, but the lack of images in the larger gap meant that it was too large to be filled. Furthermore, Fig. 7 allows a comparison between the meshes created using our proposed method and Pix4D, respectively. Disregarding the artifacts near the edges of the mesh from the proposed method and the gap in Pix4D's mesh due to the lack of images, a satisfactory result is achieved, with both models successfully recreating the layout of the orchard as well as the roads surrounding it. Nonetheless, the thermal orthophoto produced by Pix4D correctly filled the gaps in the data. Figure 8 shows the location of the models overlapped on both the roadmap and the satellite view. Because the generated models contain some degree of transparency, attributes of the map can be seen through them, such as the roads in Fig. 8a; most notably, the orchards can be seen through the thermal orthomap in Fig. 8b. For this reason, a clear overlap of the results over the maps cannot be declared.
6 Discussion
In this paper, a 3D thermal model creation system is proposed using aerial images collected by UAVs. In this system, the infrared images captured during the survey, which contain the thermal data in their metadata, are imported into the program. As an algorithm to process thermal images had not been integrated into the system developed by [16], that system was used as the base, and adaptations were made after a careful study of how the system works and how the digital models are reconstructed. The comparison between the systems is highlighted in Table 1.
(a) Mesh Model Obtained from Proposed Method.
(b) Mesh Model Obtained from Pix4D.
Fig. 7. Mesh models obtained using proposed method compared with Pix4D.

Table 1. Comparison between developed systems.

Platform         RGB (Y/N)   Multi Spectral (Y/N)   Thermal (Y/N)
OpenDroneMap     Y           Y                      N
Ours Platform    Y           Y                      Y
Additionally, the generated thermal models were compared with models created using commercial software; in this paper, Pix4D was used. It should be underlined that a trial version of Pix4D was used, which implied certain restrictions. Most notably, only 1000 images and a maximum of three datasets could be processed.
(a) Georeferenced Thermal Model Overlapped over Roadmap Obtained from Pix4D.
(b) Georeferenced Thermal Model Overlapped over Satellite Obtained from Pix4D.
Fig. 8. Pix4D results of the processed thermal images.
With regard to the processing of thermal data in Pix4D, the same dataset was imported into both the open-source and the commercial program. Pix4D acknowledged the thermal information present in the metadata of the images; however, the final reconstruction was performed using the initial infrared images. Apart from the visible artifacts present in the 3D model, satisfactory results were obtained from both the 2.5D and 3D models, as well as the orthomap, compared to Pix4D. Furthermore, the visual assessment of the georeferenced models on Google Maps yielded a satisfactory overlap with key features such as roads. It can be said that our georeferencing results are on par with the same process performed by Pix4D. To further validate our thermal model against Pix4D's, a preprocessing of the infrared images was performed to extract the thermal information into new image files before importing them into Pix4D. Here, transparent orthomaps were obtained, which did not allow validation of their thermal information or of whether the georeferencing was done correctly. All in all, the results obtained using our platform are similar to those of Pix4D.
7 Conclusion and Future Work
Given the increased demand for available food, crop producers have embraced the opportunity to increase their knowledge in crop management. Through remote sensing and as a consequence of technological advances, new tools were made available to farmers. This paper aims to provide an open-source method capable of working with thermal images. Due to their characteristics, thermal images are often afflicted by low resolution, since the capture altitude is usually increased to enlarge the camera's footprint and capture feature points that the feature detection and matching step would otherwise struggle with. One of the benefits of the proposed method is that resolution is not affected, because a larger camera footprint is not required. Similar results were achieved with the obtained products compared to the trial version of the commercial program Pix4D. Furthermore, even though the obtained products achieved satisfactory results, future work will focus on improving the edges of the 3D model so that the artifacts can be corrected. Removing the background at the end of the orthomap process will also be part of our future work. Likewise, a joint effort is to be made with the developers of [16] to integrate the aforementioned thermal method within their workflow. Acknowledgment. This project was partially funded by the AI4RealAg project. The goal of the AI4RealAg project is to develop an intelligent knowledge extraction system, based on Artificial Intelligence and Data Science, to increase sustainable agricultural production. The project is financed by Portugal 2020, under the Lisbon Regional Operational Programme and the Competitiveness and Internationalization Operational Programme, worth 1,573,672.61 euros, from the European Regional Development Fund. This work was also partially funded by the Portuguese Fundação para a Ciência e a Tecnologia (FCT) under Project UIDB/04111/2020, Project foRESTER PCIF/SSI/0102/2017, Project IF/00325/2015, Project UIDB/00066/2020, and also Instituto Lusófono de Investigação e Desenvolvimento (ILIND) under Project COFAC/ILIND/COPELABS/1/2020.
References
1. Berni, J.A.J., Suárez, L., Fereres, E.: Remote sensing of vegetation from UAV platforms using lightweight multispectral and thermal imaging sensors 38 (2008)
2. Khanal, S., Fulton, J., Shearer, S.: An overview of current and potential applications of thermal remote sensing in precision agriculture. Comput. Electron. Agric. 139, 22–32 (2017)
3. Schönberger, J.L., Frahm, J.-M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4104–4113 (2016)
4. MicaSense RedEdge-M. https://micasense.com/rededge-mx/
5. Vegetation indices and their interpretation. https://www.auravant.com/en/blog/precision-agriculture/vegetation-indices-and-their-interpretation-ndvi-gndvimsavi2-ndre-and-ndwi/
6. Hatfield, J.L., Prueger, J.H.: Value of using different vegetative indices to quantify agricultural crop characteristics at different growth stages under varying management practices. Remote Sens. 2(2), 562–578 (2010). https://doi.org/10.3390/rs2020562
7. Gitelson, A.A., Gritz, Y., Merzlyak, M.N.: Relationships between leaf chlorophyll content and spectral reflectance and algorithms for non-destructive chlorophyll assessment in higher plant leaves. J. Plant Physiol. 160(3), 271–282 (2003). https://doi.org/10.1078/0176-1617-00887
8. Curran, P.J., Windham, W.R., Gholz, H.L.: Exploring the relationship between reflectance red edge and chlorophyll concentration in slash pine leaves. Tree Physiol. 15(3), 203–206 (1995). https://doi.org/10.1093/treephys/15.3.203
9. Filella, I., Serrano, L., Serra, J., Peñuelas, J.: Evaluating wheat nitrogen status with canopy reflectance indices and discriminant analysis. Crop Sci. 35, 1400–1405 (1995). https://doi.org/10.2135/cropsci1995.0011183X003500050023x
10. Merzlyak, M.N., Gitelson, A.A., Chivkunova, O.B., Rakitin, V.Y.: Non-destructive optical detection of pigment changes during leaf senescence and fruit ripening. Physiol. Plant. 106, 135–141 (1999). https://doi.org/10.1034/j.1399-3054.1999.106119.x
11. Peñuelas, J., Filella, I.: Visible and near-infrared reflectance techniques for diagnosing plant physiological status. Trends Plant Sci. 3, 151–156 (1998). https://doi.org/10.1016/S1360-1385(98)01213-8
12. Merzlyak, M.N., Gitelson, A.A.: Why and what for the leaves are yellow in autumn? On the interpretation of optical spectra of senescing leaves (Acer platanoides L.). J. Plant Physiol. 145(3), 315–320 (1995). https://doi.org/10.1016/S0176-1617(11)81896-1
13. Hendry, G.A.F., Houghton, J.D., Brown, S.B.: The degradation of chlorophyll - a biological enigma. New Phytol. 107, 255–302 (1987). https://doi.org/10.1111/j.1469-8137.1987.tb00181.x
14. Hobart, M., Pflanz, M., Weltzien, C., Schirrmann, M.: Growth height determination of tree walls for precise monitoring in apple fruit production using UAV photogrammetry. Remote Sens. 12(10), 1656 (2020)
15. Arriola-Valverde, S., Villagra-Mendoza, K., Méndez-Morales, M., Solórzano-Quintana, M., Gómez-Calderón, N., Rimolo-Donadio, R.: Analysis of crop dynamics through close-range UAS photogrammetry. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5 (2020). https://doi.org/10.1109/ISCAS45731.2020.9181285
16. OpenDroneMap Authors: ODM - A command line toolkit to generate maps, point clouds, 3D models and DEMs from drone, balloon or kite images. OpenDroneMap/ODM GitHub (2020). https://github.com/OpenDroneMap/ODM. Accessed 15 Sept 2021
17. Vong, A., et al.: How to build a 2D and 3D aerial multispectral map? - All steps deeply explained. Remote Sens. 13 (2021). https://doi.org/10.3390/rs13163227
18. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157 (1999). https://doi.org/10.1109/ICCV.1999.790410
19. Cernea, D.: OpenMVS: multi-view stereo reconstruction library (2020). https://cdcseacave.github.io/openMVS. Accessed 15 Sept 2021
20. Kazhdan, M., Maloney, A.: PoissonRecon. https://github.com/mkazhdan/PoissonRecon. Accessed 15 Sept 2021
21. Waechter, M., Moehrle, N., Goesele, M.: Let there be color! Large-scale texturing of 3D reconstructions. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 836–850. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_54
22. Elizabeth: Full 3D vs 2.5D Processing. https://support.dronesmadeeasy.com/hc/en-us/articles/207855366-Full-3D-vs-2-5D-Processing. Accessed 15 Sept 2021
23. ArcGIS. https://pro.arcgis.com/en/pro-app/latest/help/data/imagery/introduction-to-ortho-mapping.htm. Accessed 15 Sept 2021
24. Pino, M., Matos-Carvalho, J.P., Pedro, D., Campos, L.M., Costa Seco, J.: UAV cloud platform for precision farming. In: 12th International Symposium on Communication Systems, Networks and Digital Signal Processing (CSNDSP), pp. 1–6 (2020). https://doi.org/10.1109/CSNDSP49049.2020.9249551
25. Pedro, D., et al.: FFAU - framework for fully autonomous UAVs. Remote Sens. 12(21), 3533 (2020)
26. Pix4D. https://www.pix4d.com/. Accessed 15 Sept 2021
Scalable Computing Through Reusability: Encapsulation, Specification, and Verification for a Navigable Tree Position Nicodemus M. J. Mbwambo1(B) , Yu-Shan Sun1 , Joan Krone2 , and Murali Sitaraman1 1 Clemson University, Clemson, SC 29634, USA {nmbwamb,yushans,msitara}@clemson.edu 2 Denison University, Granville, OH 43023, USA [email protected]
Abstract. Design, development, and reuse of generic data abstractions are at the core of scalable computing. This paper presents a novel data abstraction that captures a navigable tree position. The mathematical modeling of the abstraction encapsulates the current tree position, which can be used to navigate and modify the tree. The encapsulation of the tree position in the data abstraction specification avoids explicit references and aliasing, thereby simplifying verification of (imperative) client code that uses the data abstraction. The generic data abstraction is reusable, and its design makes verification scalable. A general tree theory, rich with mathematical notations and results, has been developed to ease the specification and verification tasks. The paper contains an example to illustrate automated verification ramifications and issues in scalability. With sufficient tree theory development, automated proving seems plausible even in the absence of a special-purpose tree solver. Keywords: Automation · Data abstraction · Specification · Tree · Verification · Scalability · Reusability
1 Introduction
Most modern programming languages support construction and use of reusable software components as a step toward scalable computing. To be reused with ease and to scale up, components must be suitably designed and specified, and ultimately verified. This paper illustrates the ideas with a novel, reusable data abstraction that encapsulates a navigable tree position. Specification and verification of tree data structures have received considerable attention in the computing literature. As early as 1975, Ralph London used the Floyd-Naur inductive assertions method [1,2] to illustrate an iterative tree traversal verification. In his paper [3], London uses a program that traverses a (finite) binary tree to prove
Fig. 1. A general overview of artifacts related to this work
the loop assertions used in the program. Since then, more studies on the specification and verification of tree structures have been done, with a special focus on search trees. This paper presents a formal specification of a generic data abstraction for manipulating a tree position. The abstraction includes the type of information the tree contains and the number of children for a parent node as parameters to emphasize its generic nature. The generic data abstraction presented contains operations to create, navigate, and update the tree, yet it does not require explicit use of and reasoning about references. Its design is intended to promote ease of reuse and ease of reasoning, and hence scalability. Figure 1 shows a UML diagram that provides a broader overview and the focus of this work. The artifacts highlighted in gray are the focus of this paper and are discussed in the subsequent sections. Additionally, Fig. 1 illustrates where work corresponding to prior efforts (e.g., red-black trees in Peña's work and B+ tree work in Ernst et al. discussed in Sect. 2) fits within the context of this work. This work uses RESOLVE, an integrated specification and programming language with goals of supporting software engineers to (1) formally specify generic reusable software, (2) write efficient implementations, and (3) have the advantage
of automated verification [11]. In RESOLVE, Concept is the term used for a formally specified software component, analogous to a Java interface, except that in Java, interface specifications are written informally in English or Java. RESOLVE concepts are written using mathematical notation, which is part of the RESOLVE syntax. Realization is an implementation of a Concept. RESOLVE implementations are analogous to classes in Java but are written in RESOLVE code, whose syntax is similar to that of Ada [10]. A grammar for the RESOLVE language is available in [9]. In addition to concepts and realizations, there are mathematical theory units, which contain theorems and proofs to support automated verification. Some theory components are built-in, such as mathematical Booleans and Integers, and it is possible to add new theories as needed. When a theory is added to the mathematical library, an automated proof checker checks the proofs based on built-in mathematical logic. RESOLVE has value semantics, helping programmers avoid the difficulty of reasoning at the pointer level, and carries out parameter passing via swapping, a technique that allows only one access to a given variable at any time, thereby avoiding aliasing. This paper is organized as follows: Sect. 2 discusses prior efforts. Section 3 motivates the design of the tree data abstraction presented in this paper. Section 4 provides a two-part formal description of the abstraction. Section 5 discusses enhancements and the results of an experiment to verify the presented enhancement. Finally, Sect. 6 provides a summary and a discussion of ongoing research directions.
2 Related Work
Component-based software engineering is widely recognized as necessary for scaling up software development. Yet the benefits are realizable only if the components are designed with reusability as a central objective. Avoiding references in component specifications simplifies the verification process by averting aliasing, which complicates reasoning. Hiding references in specifications is a crucial similarity and motivation shared by Peña's work [4] and the work presented in this paper. Peña used the verification-aware programming language Dafny [6] to specify and provide a proof sketch for insert and delete functions without explicit use of references. The specifications provided by Peña define the behaviors of a set of operations for red-black trees by writing preconditions and postconditions for each operation. Unlike Peña's work on red-black trees, no tree structure is built into the underlying language used in our work. We provide a tree data abstraction that is generic, equipped with primary operations, and can be extended with secondary operations to satisfy different applications of trees. Because the abstraction does not rely on a built-in language structure, verification of any code using it requires no special machinery. While tree implementation is not the focus of Peña's work, in [13], Ernst et al. have specified and implemented B+ trees. Unlike the work by Ernst et al., the tree
abstraction in this paper is generic, and it encapsulates the tree position without explicit references. However, the implementation will need to address verification in the presence of explicit references. References are used to implement the tree structure (e.g., see the work in [15,21] for verification of linked structures), and its verification complexity will be similar in spirit to what is presented in [13]. This design simplifies reasoning for any code written using the abstraction. As in [4], Dross et al. have specified and verified red-black trees using SPARK [5]. To avoid aliasing, the specification of trees do not involve any pointers. The implementation is done at three different levels, and for each level, a type invariant is specified and used to enforce defined tree property. In the implementation, arrays are used as the underlying memory with references and addresses. Trees are bounded by the size of this array, which is shared between disjoint trees represented by a type Forest. The authors have verified their implementation using SPARK (in an auto-active fashion) and analyzed it via GNATprove, a formal verification tool for Ada. Unlike the work we present in this paper, in SPARK, trees are built-in. The notion of a tree position as a data structure was initially introduced in functional programming by Huet [7] and used in [8]. Huet presented a data structure (named the zipper) with a path structure and a tree position (location) as a memory-efficient solution to changeable data structures in functional programming. In Huet’s design, path structure and tree position are used to navigate and modify the zipper data structure. Any changes made to the structure are through the current location, which carries all information necessary to facilitate navigation and mutation. The path structure and location in the zipper data structure have similarities to the tree abstraction presented here. However, the similarities end there. Whereas Huet has focused his work on implementing the data structure, our work concerns a mathematical formalization of navigable position suitable for specification and verification of a generic tree data abstraction. We present a general tree theory and its use to encapsulate the tree position in a navigable tree data abstraction-without relying on built-in tree structures and without introducing explicit references-in contrast to prior work. In addition to the contributions of this research, an important ramification is that no special machinery is required to verify and simplify any imperative client code that uses this tree abstraction, as shown in this paper. The tree data abstraction in this paper does not contain a search operation, because searching can be layered as a secondary operation reusing previously verified tree primary operations. Secondary operations can be used to realize various search trees, and to illustrate how this is done, the authors have implemented a map data abstraction in a thesis [16]. Though searching itself is not a topic of discussion in this paper, the efforts to verify search trees discussed in this section are useful to motivate our work.
3 Encapsulating a Navigable Tree Position in a Data Abstraction
In imperative programming practice and literature, trees are realized via pointers: structural pointers between nodes and external pointers to nodes that serve as tree positions. However, explicit manipulation of pointers is inherently complex and makes it difficult to scale up reasoning [14,15,21]. To locally encapsulate this complexity, the use of reusable and comprehensible data abstractions is essential. The navigable tree data abstraction proposed here is surprisingly simple and presents a labeled tree and a single position within that tree where manipulations can be performed. Access to other parts of the tree is achieved using navigational operations, while modifying operations support changing the tree's shape and labeling. The verification of these navigational and modifying operations does indeed involve pointers, but once verified, the abstraction's reusability means that the upfront cost can be amortized over many deployments. In such deployments, the programmer's code is relieved from reasoning about explicit references, as would be necessary if traditional tree realizations were being manipulated.
Fig. 2. (a) Informal Presentation of a Tree Position (b) Formalized Presentation of a Tree Position Showing a Path with One Site and Remainder Tree
Figure 2 shows an example tree position for an instance of a generic tree abstraction we propose. In this example, the tree abstraction has been instantiated with integers as node labels and 3 as the maximum number of children for each node. The fat arrow in Fig. 2(a) is used to indicate the tree’s current position and an empty tree is denoted by Ω. By the nature of trees, any particular position factors a tree into two disjoint parts-the remainder tree below the position indicator and a path from the root out to the position indicator. If the path is viewed as including all the side branches to its left and right, then the underlying tree can be recovered, and further, movements between nearby positions can be readily specified. The path consists of a string of sites, each of which records the label for that path position, the string of subtrees that lie to the left of the path (Left Tree String, LTS), and the string of subtrees that lie to the right of the path (Right Tree String, RTS). Figure 2(b) formalizes the informal presentation of a tree position shown in Fig. 2(a). In the diagram, the remainder tree is shown in blue and the path in green. Currently, the path has two sites. A formal discussion on a tree position is provided in Sect. 4. A navigable tree data abstraction uses operations Advance and Retreat to navigate the exploration tree. These operations can move the current tree position to or from any one of the k immediate subtrees of the remainder tree. Operation Advance moves the current tree position to the next with an effect of adding one site to the path. Operation Retreat has an opposite effect to Advance. It reconstructs a new remainder tree using the last added site in the path and the current remainder tree, which is similar to navigating back to the previous tree position. For example, from a tree position in Fig. 3(a), advancing to the next tree position in direction 2 will land us to a tree position in Fig. 3(b). We can retreat from tree position in Fig. 3(b) to Fig. 3(a).
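To make the path/site bookkeeping concrete, the following rough Python analogue of a tree position (it is only an illustration of the mathematical model, not the RESOLVE component, and all identifiers are ours) shows how Advance pushes one site onto the path and how Retreat zips that site back onto the remainder tree.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Tree:                      # a k-ary labeled tree; None stands for the empty tree Omega
    label: object
    children: List[Optional["Tree"]]

@dataclass
class Site:                      # one path step: label plus the trees left/right of the chosen branch
    label: object
    lts: List[Optional["Tree"]]  # Left Tree String
    rts: List[Optional["Tree"]]  # Right Tree String

@dataclass
class TreePosn:
    path: List[Site] = field(default_factory=list)
    rem_tr: Optional[Tree] = None          # remainder tree below the position indicator

    def at_an_end(self) -> bool:
        return self.rem_tr is None

    def advance(self, direction: int) -> None:
        """Move the position into branch `direction` (1-based), pushing one site."""
        assert self.rem_tr is not None and 1 <= direction <= len(self.rem_tr.children)
        branches = self.rem_tr.children
        self.path.append(Site(self.rem_tr.label,
                              branches[:direction - 1],   # branches left of the chosen one
                              branches[direction:]))      # branches right of the chosen one
        self.rem_tr = branches[direction - 1]

    def retreat(self) -> None:
        """Undo the last advance by zipping the top site back onto the remainder tree."""
        site = self.path.pop()
        self.rem_tr = Tree(site.label, site.lts + [self.rem_tr] + site.rts)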
Fig. 3. (a) A Current Tree Position with Position Indicator at a Node Labeled 20. (b) Updated Tree Position after a Call to Advance Operation.
The other operations included in the tree abstraction are Add_Leaf and Remove_Leaf. They create or modify trees by adding or removing a node from a tree. A new node is added to a tree using the operation Add_Leaf. For this operation to work, the tree position should be at the end, where the remainder
tree is an empty tree (Ω) (see Fig. 4(a)). Adding a new node, say with a label 16 at this position, results in a tree in Fig. 4(b). Only leaves are removable from a tree, and operation Remove_Leaf is used. While Add_Leaf and Remove_Leaf may seem restrictive, another operation, Swap_Rem_Trees (Swap Remainder Trees) allows the remainder tree of two different tree positions to be swapped in and out, essentially making it possible to add new nodes in the middle. For efficient data movement, swapping is used wherever possible rather than copying references or values [17]. Swapping is chosen over copying as it does not introduce aliasing and does not compromise abstract reasoning. Swapping is also efficient for data movement as tree content in the generic abstraction we present may be of arbitrary types (not just Integers). Another efficient operation is Swap_Label, specified to update the node’s content at the current position regardless of type.
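Continuing the illustrative sketch above (and still assuming the Tree and TreePosn classes defined there), the remaining primary operations can be mimicked as follows; as in the specification, growth happens only at an end, removal only at a leaf, and data moves by swapping rather than copying.

# Continuing the illustrative TreePosn sketch above (k is the fixed arity).
def at_a_leaf(p: "TreePosn") -> bool:
    return p.rem_tr is not None and all(c is None for c in p.rem_tr.children)

def add_leaf(p: "TreePosn", label: object, k: int) -> None:
    assert p.at_an_end()                      # only allowed where the remainder tree is empty
    p.rem_tr = Tree(label, [None] * k)

def remove_leaf(p: "TreePosn") -> object:
    assert at_a_leaf(p)                       # only leaves are removable
    label, p.rem_tr = p.rem_tr.label, None
    return label

def swap_label(p: "TreePosn", new_label: object) -> object:
    old, p.rem_tr.label = p.rem_tr.label, new_label
    return old

def swap_rem_trees(p: "TreePosn", q: "TreePosn") -> None:
    p.rem_tr, q.rem_tr = q.rem_tr, p.rem_tr   # exchange, never copy or alias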
Fig. 4. (a) An Example Tree Position at an end. (b) Updated Tree Position with an Added Leaf.
Boolean operations At_an_End and At_a_Leaf are useful in checking to see if the current tree position is at the end of the tree and we cannot advance any further or at a leaf node having empty tree children, respectively. For example, a tree position in Fig. 4(a) is at an end, and it is at a leaf in Fig. 4(b). The specification for an exploration tree containing operations discussed above is provided in Listing 1. At this point, a skeleton version of the interface is used to explain key parts involved. A detailed interface is provided in Listing 3. The interface describes the behavior of all primary operations necessary to make it useful. The parameters of each operation have a type and are preceded by specification modes, i.e., updates, evaluates and restores. For the operations in Listing 1, only two parameter types are shown for brevity. The parameter dir (direction) of type Integer is used in the Advance operation, and P of type Tree_Posn (tree position) is used in both Advance and At_an_End operation. The operation Advance updates the tree position along the specified direction, while At_an_End does not change the tree position but is useful to check if the position can be advanced.
Concept Exploration_Tree_Template ( . . . ) ; .. . Operation Advance ( evaluates dir : Integer ; updates P : Tree_Posn ) ; Operation At_an_End ( restores P : Tree_Posn ) : Boolean ; Operation Retreat ( . . . ) Operation Add_Leaf ( . . . ) Operation Remove_Leaf ( . . . ) Operation At_a_Leaf ( . . . ) Operation Swap_Label ( . . . ) Operation Swap_Rem_Trees ( . . . ) .. . end Exploration_Tree_Template ; Listing 1. A skeleton version of a formal interface specification of exploration trees
4 A Reusable Exploration Tree Data Abstraction
For software engineers to author succinct specifications that are amenable to automated verification, mathematical theories with definitions, notations, and predicates are defined and used so that the specification can be written at the right level of abstraction. Exploration_Tree_Template (the exploration tree data abstraction we present) has been developed using a General Tree Theory for the same reason. The next section provides an overview of this theory.
4.1 A General Tree Theory
A mathematical tree theory is needed to describe different tree properties formally. A major challenge of this task is the wide range of applications trees have in computing. It is going to be costly to develop and verify theories for every tree application. A practical solution is to develop a general theory to all applications. Generic theories are complex, and their development is not a trivial task. Nevertheless, once developed and verified, their continuous reuse amortizes the up-front development and verification effort. General Tree Theory has been presented in detail elsewhere [16]. However, a few definitions, predicates, and notations found in the tree theory and used in later sections are presented next as needed for this paper’s interest. The first definition we present is a root label function Rt_Lab(Tr). This function provides a root node label from a non-empty tree. To illustrate how it works, suppose we are given the remainder tree (colored blue) in Fig. 3(a). Calling Rt_Lab(Tr) function gives an integer label 20 as a root node label. A second function is named Rt_Brhs(Tr) for root branches. Rt_Brhs(Tr) extracts all branches for a root node and returns them as a string of trees. For
example, for the remainder tree (colored blue) in Fig. 3(a), the root branches returned by this function will include T3, a tree with root node 18, and T4. A third function is named U_Tr_Pos(k, Node_Label) for Uniform Tree Positions and returns a collection of all tree positions in a specified tree with k maximum number of children in a node and Node_Label as node type. A fourth function height (ht(Tr)) is inductively defined to return an integer value representing the tree’s maximum depth. For example, the height of an empty tree is 0. Finally, to show what a definition looks like in the theory, we present Listing 2, which shows the definition of a site and a tree position. For each definition, a suitable mathematical model is chosen to describe the related properties. A site is mathematically modeled as a Cartesian Product (Cart_Prod) of a tree node label (Lab), Left Tree String (LTS) and Right Tree String (RTS). Similarly, the tree position is modeled as a Cartesian Product of a path (a string of sites) and the remainder tree (Rem_Tr). Def . Site = Cart_Prod Lab : El ; LTS : Str ( Tr ) ; RTS : Str ( Tr ) ; end ; Def . Tree_Posn = Cart_Prod Path : Str ( Site ) ; Rem_Tr : Tr ; end ; Listing 2. The mathematical definition of a site and a tree position
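Although only the base case of ht is quoted above, one plausible way to complete the inductive definition, consistent with that base case and with the recursive realization shown later in Listing 6, is

\[
\mathrm{ht}(\Omega) = 0, \qquad
\mathrm{ht}(T) = 1 + \max_{B \,\in\, \mathrm{Rt\_Brhs}(T)} \mathrm{ht}(B) \quad \text{for } T \neq \Omega,
\]

so that, for example, a node whose branches are all empty trees has height 1.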
4.2 A Formal Specification of an Exploration Tree Data Abstraction
An exploration tree data abstraction named Exploration_Tree_Template is a generic abstraction parameterized by a generic node type Node_Label, an integer value k, and an initial capacity. Generally, the value k sets the number of children per node in an instantiated tree. The initial capacity places a shared upper bound on the number of nodes for all trees created within a single instantiation of the template. The proposed tree template is presented in Listing 3. For clarity, operations are omitted and will be discussed separately. Concept Exploration_Tree_Template ( type Node_Label ; evaluates k , Initial_Capacity : Integer ) ; uses General_Tree_Theory . . . ; requires 1 ≤ k and 0 < Initial_Capacity . . . ; Var Remaining_Cap : N ; initialization ensures Remaining_Cap = Initial_Capacity ;
Type Family Tree_Posn ⊆ U_Tr_Pos ( k , Node_Label ) ; exemplar P ; initialization ensures P . Path = Λ and P . Rem_Tr = Ω ; finalization ensures Remaining_Cap = #Remaining_Cap + N_C ( P . Path Ψ P . Rem_Tr ) ; .. . Listing 3. A part of a formal specification of an exploration tree
After the concept’s name and parameters are defined, the uses statement provides the specification with access to General_Tree_Theory and any other theories needed for the concept’s specification. Next, a requires clause is specified to provide restrictions on the values to be supplied as parameters during an instantiation of the concept. The first condition is for the value k, which has to be greater than 0 (no tree can be created with 0 as a maximum number of children per node), the second condition is for initial capacity, which also has to be greater than 0. Following the requires clause is a declaration of a conceptual variable Remaining_Cap (remaining capacity) shared across all instantiated trees, and not each tree to be constrained individually. This design allows for safe memory sharing [21]. The remaining capacity is specified as a natural number and initialized to an initial capacity in the ensures clause of the initialization. The Type Family clause specifies a type Tree_Posn exported by the template as a subset of all Uniform Tree Positions (U_Tr_Pos) defined by k and Node_Label. The Type Family clause emphasizes the generic nature of this abstraction. Thus, not only one type is exported but a whole family of types, each with different contents. The exemplar clause after Type Family introduces a name P as an example tree position used in specifying initialization and finalization ensures clauses. Initially, a created tree position P will have its path (P.Path) equals to an empty string (Λ) and the remainder tree (P.Rem_Tr) to equal to an empty tree (Ω). The finalization clause guarantees that the number of tree nodes that belonged to the tree object (described by a node count (N C) of a tree formed by zipping (Ψ ) P.Path to P.Rem Tr) is added back to the current remaining capacity. To differentiate initial or incoming remaining capacity to outgoing or final remaining capacity, a # symbol is used as a prefix for the former. Our complete abstraction has all primary operations specified. For space consideration, we discuss only some of the operations in this paper. Some of these operations were mentioned in Listing 1, and here we provide a detailed mathematical description for the operation Advance. The complete specification is available in [16]. To advance from one tree position to another, advancing direction dir and a current tree position P are supplied as parameters to Advance operation. The two parameters are preceded by parameter modes evaluates and updates to state their behavior before and after the operation is called. Advance operation is specified in Listing 4.
The requires clause uses two conditions to specify the operation’s behavior before it is called by the user. The first condition restricts the operation to only tree positions with nonempty remainder trees because advancing pass an empty tree is not possible. The second precondition requires the supplied integer value dir be within a valid bound (i.e., 1 ≤ dir ≤ k) because only advancing from a node to one of its branches is possible. updates P : Tree_Posn ) ; requires P . Rem_Tr = Ω and 1 ≤ dir ≤ k . . . ensures P . Rem_Tr = ̸ ( Prt_Btwn ( dir − 1 , dir , Rt_Brhs(#P . Rem_Tr ) ) ) and P . Path = #P . Path o ( Rt_Lab(#P . Rem_Tr ) , Prt_Btwn ( 0 , dir − 1 , Rt_Brhs(#P . Rem_Tr ) ) , Prt_Btwn ( dir , k , Rt_Brhs(#P . Rem_Tr ) ) ) ; Listing 4. Formal specification of Advance operation
The ensures clause in the Advance operation specifies the behavior of the operation after the call is completed. The two specified postconditions guarantee that the operation moves the current tree position to the next as determined by the direction (dir) provided. This operation updates both parts of the tree position. The first postcondition updates the remainder tree, and the second postcondition updates the path. To succinctly state this behavior, mathematical definitions, notations, and predicates from the theories are used to specify the ensures clause. Because the conditions are complicated, we analyze each part separately in the following explanations. The first part of the ensures clause uses a function Rt_Brhs (Root Branches) to return a string containing all branches of a root node for a given tree. A Part Between operator (Prt_Btwn) will return a substring between two specified intervals of a given string. In the first part, the Prt_Btwn operator is used to extract the needed tree branch between two directions, dir − 1 and dir. The tree branch extracted is in a string. A destring (̸1 ) operator is used to produce the entry of type Tr in a singleton string. The second part of the ensures clause updates the tree position by adding a new site. The new site is a result of advancing into the remainder tree leaving behind a root node label specified by Rt_Lab(#P.Rem_Tr) and the string of branches on the left and right of the advancing direction. The left branches are specified by Prt_Btwn(0, dir − 1, Rt_Brhs(#P.Rem_Tr)) and the right branches are specified by Prt_Btwn(dir, k, Rt_Brhs(#P.Rem_Tr)).
5
Modular Verification as a Step Toward Scalability
To make the specification and realization of data abstraction reasonable, only orthogonal and efficiently implementable primary operations are specified in the 1
Current RESOLVE compiler parses only ASCII characters, so all mathematical characters used are converted to ASCII equivalents.
Scalable Computing Through Reusability
69
core concept. Any other useful operation, such as the one for searching, that can be implemented using primary operations is specified as a secondary operation. In the RESOLVE language, a specification inheritance mechanism permits a straightforward extension of the primary operations by writing enhancements that add functionality to the core concept. By building and verifying enhancements in a modular fashion, verification can be done for one component at a time, thus making it possible to scale up verification. In this section, we present Position_Depth_Capability specified in Listing 5 as one of the enhancements that extend the application of our tree data abstraction. Position_Depth finds the longest path (from the root node to one of the empty trees) in the remainder tree. Enhancement Position_Depth_Capability for Exploration_Tree_Template ; Operation Position_Depth ( restores P : Tree_Posn ) : Integer ; ensures Position_Depth = ( ht(#P . Rem_Tr ) ) ; end Position_Depth_Capability ; Listing 5. A formal specification of Position_Depth_Capability enhancement
Only a postcondition needs to be specified for this operation as it can be called on any tree. Therefore, given a tree position P, the ensures clause is specified to return an integer value representing the height of P’s remainder tree (#P.Rem_Tr). The specification uses a function ht (for height) defined in the General Tree Theory, and explained in Sect. 4.1. Listing 6 presents an implementation of Position_Depth_Capability enhancement. Recursion is used in this implementation to traverse the tree topto-bottom while iterating on each branch to get the maximum height. To establish the correctness of this implementation, the code is annotated with a termination (decreasing) clause and loop invariant specified in the maintaining clause. Realization Obv_Rcsv_Realiz for Position_Depth_Capability of Exploration_Tree_Template ; Recursive Procedure Position_Depth ( restores P : Tree_Posn ) : Integer ; decreasing ht ( P . Rem_Tr ) ; Var PrevDir , NextHeight , MaxBrHeight : Integer ; If ( At_an_End ( P ) ) then Position_Depth := 0 ; else PrevDir := 0 ; NextHeight := 0 ; MaxBrHeight := 0 ; While ( PrevDir < k ) maintaining P . Path = #P . Path and P . Rem_Tr = #P . Rem_Tr and MaxBrHeight = Ag ( max , 0 , ht [ [ Prt_Btwn ( 0 , PrevDir , Rt_Brhs ( P . Rem_Tr ) ) ] ] ) and PrevDir 0) Listing 8. A VC for establishing loop termination
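As a companion to the recursive realization just described, the following rough Python analogue (phrased against the illustrative TreePosn sketch from Sect. 3, not against RESOLVE, and assuming a fixed arity k) computes the same quantity using only the primary operations; it mirrors the structure of Listing 6: recurse into each branch, keep the maximum, and restore the position afterwards.

def position_depth(p: "TreePosn", k: int) -> int:
    """Height of the remainder tree, visited only through the primary operations."""
    if p.at_an_end():
        return 0
    max_branch_height = 0
    for direction in range(1, k + 1):        # iterate over the k branches of the current node
        p.advance(direction)
        max_branch_height = max(max_branch_height, position_depth(p, k))
        p.retreat()                           # restore the incoming position (P is "restores")
    return 1 + max_branch_height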
5.2 Discussion
Out of 40 VCs required to establish the example code’s correctness, 85% are straightforward, and their correctness can be established in a few steps by the RESOLVE current minimalist prover. Despite how complex the code is, these results show how a well-engineered specification contributes to code verification by ensuring as many VCs are obvious for verification. Nevertheless, some VCs may still end up being non-trivial for the verification system. In our experimentation, 15% of the generated VCs were non-trivial, and their verification challenges many fully automated provers discussed in the literature. Proving them can involve assertions in a theory for which no solvers are readily available, demonstrating the importance of having an extendable mathematical library.
6 Conclusion and Future Work
Generic data abstraction plays a central role in component-based software engineering and scalable computing. This paper has presented a novel formal specification of reusable data abstraction for a navigable tree with a position—the encapsulation of explicit references with a position that simplifies reasoning. A tree theory has been developed to formalize the specification provided in this paper. The abstraction presented only contains primary operations but can be extended with secondary operations. A secondary operation Position Depth is provided as an example, and its implementation is used in the experimentation. The results of our experimentation show that 85% of the VCs are obvious for verification. These results are promising. However, a more sophisticated prover is required to verify the 15% of the VCs classed as non-trivial automatically. Ongoing verification research in the group concerns the design and development of a general-purpose automated prover that is scalable enough to prove
more VCs, including non-trivial VCs resulting from complex components such as the one presented in this work. The goal is to allow developers to extend the mathematical library of the verification system with suitable mathematical units to specify and verify complex software components. Ultimately, verifying the implementation of a generic data abstraction such as the one presented in this paper will be possible. Acknowledgment. The authors would like to acknowledge the unique contribution of Dr. William F. Ogden to this work. More appreciation to other research group members at Clemson and Ohio State universities. This research is funded in part by grants from the U. S. National Science Foundation.
References 1. Floyd, R.: Assigning meaning to programs. In: Schwartz, J.T. (eds.) Proceedings of a Symposium in Applied Mathematics, vol. 19, pp. 19–32. American Mathematical Society (1967) 2. Naur, P.: Proof of algorithms by general snapshots. BIT 6, 310–316 (1966) 3. London, R.: A view of program verification. In: ACM Proceedings of the International Conference on Reliable Software, pp. 534–545. ACM Digital Library (1975) 4. Pe˜ na, R.: An assertional proof of red–black trees using Dafny. J. Autom. Reason. 64(4), 767–791 (2019). https://doi.org/10.1007/s10817-019-09534-y 5. Dross, C., Moy, Y.: Auto-active proof of red-black trees in SPARK. In: Barrett, C., Davies, M., Kahsai, T. (eds.) NFM 2017. LNCS, vol. 10227, pp. 68–83. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57288-8 5 6. Leno, K.R.M.: Dafny: an automatic program verifier for functional correctness. In: Clarke, E.M., Voronkov, A. (eds.) LPAR-16th International Conference, LPAR-16, pp. 348–370, Senegal (2010) 7. Huet, G.: The zipper. J. Funct. Program. 5, 549–554 (1997) 8. Darragh, P., Adam D.M.: Parsing with zippers (functional pearl). In: Proceedings of the ACM on Programming Languages, vol. 4, p. 28, No. ICFP. ACM, August 2020 9. RSRG: Research Grammar. https://www.cs.clemson.edu/resolve/research/ grammar/grammar.html. Accessed 13 Oct 2020 10. Barnes, J.G.P.: An overview of Ada. Softw. Pract. Experience 10(11), 851–887 (1980) 11. Sitaraman, M., et al.: Building a push-botton RESOLVE verifier: progress and challenges. Formal Aspects Comput. 23(5), 607–626 (2011) 12. Cook, C., Harton, H., Smith, H., Sitaraman, M.: Specification engineering and modular verification using a web-integrated verifying compiler. In: Glinz, M., Murphy, G.C., Pezz˜ e, M. (eds.) ICSE 2012, pp. 1379–1382. IEEE Computer Society (2012) 13. Ernst, G., Schellhorn, G., Reif, W.: Verification of B+ trees: an experiment combining shape analysis and interactive theorem proving. In: Barthe, G., Pardo, A., Schneider, G. (eds.) SEFM 2011. LNCS, vol. 7041, pp. 188–203. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24690-6 14 14. Weide, B., Heym, W.: Specification and verification with references. In: Proceedings OOPSLA Workshop on SAVCBS, October 2001
15. Kulczycki, G., Smith, H., Harton, H., Sitaraman, M., Ogden, W.F., Hollingsworth, J.E.: The location linking concept: a basis for verification of code using pointers. In: Joshi, R., M¨ uller, P., Podelski, A. (eds.) VSTTE 2012. LNCS, vol. 7152, pp. 34–49. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-27705-4 4 16. Mbwambo, N.: A Well-Designed, Tree-Based, Generic Map Component to Challenge the Progress towards Automated Verification. MS Thesis, Clemson University (2017) 17. Harms, D., Weide, B.: Copying and swapping on the design of reusable software components. IEEE Trans. Softw. Eng. 17(5), 424–435 (1991) 18. Kirschenbaum, J., et al.: Verifying component-based software: deep mathematics or simple bookkeeping? In: Edwards, S.H., Kulczycki, G. (eds.) ICSR 2009. LNCS, vol. 5791, pp. 31–40. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3642-04211-9 4 19. Harton, H.: Mechanical and Modular Verification Condition Generation for ObjectBased Software. PhD Thesis, Clemson University (2011) 20. Smith, H.: Engineering Specifications and Mathematics for Verified Software. PhD Thesis, Clemson University (2013) 21. Sun, Y.: Towards Automated Verification of Object-Based Software with Reference Behavior. Ph.D. Thesis, Clemson University (2018)
Generalizing Univariate Predictive Mean Matching to Impute Multiple Variables Simultaneously Mingyang Cai(B) , Stef van Buuren, and Gerko Vink Utrecht University, Padualaan 14, 3584 CH Utrecht, Netherlands [email protected]
Abstract. Predictive mean matching (PMM) is an easy-to-use and versatile univariate imputation approach. It is robust against transformations of the incomplete variable and violation of the normal model. However, univariate imputation methods cannot directly preserve multivariate relations in the imputed data. We wish to extend PMM to a multivariate method to produce imputations that are consistent with the knowledge of derived data (e.g., data transformations, interactions, sum restrictions, range restrictions, and polynomials). This paper proposes multivariate predictive mean matching (MPMM), which can impute incomplete variables simultaneously. Instead of the normal linear model, we apply canonical regression analysis to calculate the predicted value used for donor selection. To evaluate the performance of MPMM, we compared it with other imputation approaches under four scenarios: 1) multivariate normally distributed data; 2) linear regression with quadratic terms; 3) linear regression with interaction terms; 4) incomplete data with inequality restrictions. The simulation study shows that with moderate missingness patterns, MPMM provides plausible imputations at the univariate level and preserves relations in the data. Keywords: Missing data · Multiple imputation · Block imputation · Predictive mean matching · Multivariate analysis · Canonical regression analysis
1 Introduction
Multiple imputation (MI) is a popular statistical method for the analysis of missing data problems. To provide valid inferences from incomplete data, the analysis procedure of MI consists of three steps. First, in the imputation step, missing values are drawn from a plausible distribution (e.g., posterior distributions for Bayesian model-based approaches and a cluster of candidate donors for non-parametric approaches) to generate several (m) complete datasets. The value of m commonly varies between 3 and 10. Second, in the analysis step, complete-data analyses are used to estimate the quantity of scientific interest for each imputed dataset. This step yields m separate analyses because the imputed
datasets are different. Finally, in the pooling step, the m results are aggregated into a single result by Rubin's rules, accounting for the uncertainty of the estimates due to the missing data [1]. Two widely used strategies for imputing multivariate missing data are joint modeling (JM) and fully conditional specification (FCS). Joint modeling was proposed by Rubin [1] and further developed by Schafer [2]. Given that the data are assumed to follow a multivariate distribution, all incomplete variables are imputed by drawing from the joint posterior predictive distribution conditional on the other variables. Fully conditional specification, which was developed by Van Buuren [3], follows an iterative scheme that imputes each incomplete variable based on a conditionally specified model [3]. Fully conditional specification allows for tremendous flexibility in multivariate model design and in imputing non-normal variables, especially discrete variables [4]. However, FCS may suffer from incompatibility problems, and computational shortcuts like the sweep operator cannot be applied to facilitate computation [5]. On the other hand, joint modeling possesses more solid theoretical guarantees, but with an increasing number of incomplete variables, JM may lead to unrealistically large models and a lack of flexibility, which will not occur under FCS. In practice, there are often extra structures in the missing data which are not modelled properly. Suppose there are two jointly missing variables X1 and X2. There may be restrictions on the sum of X1 and X2 (e.g., X1 + X2 = C, where C is a fixed value) or on the rank of X1 and X2 (e.g., X1 > X2), data transformations (e.g., X2 = log(X1), X2 = X1^2), or interaction terms included in the data (X1, X2, X3 are jointly missing, where X3 = X1 * X2). In this paper, we focus on the setting of structures between two jointly missing variables, which is a simple scenario to illustrate. The two popular approaches to MI mentioned before may not be appropriate for modeling the relations among multiple variables in the missing data. Joint modeling may lack the flexibility to model the relations explicitly, and FCS imputes each missing variable separately, which may not ensure that the imputation remains consistent with the observed relations among multiple variables. Van Buuren [5] suggested block imputation, which combines the strong points of joint modeling and fully conditional specification. The general idea is to place incomplete variables into blocks and apply multivariate imputation methods to each block. Joint modeling can be viewed as a "single block" imputation method. In contrast, FCS is strictly a multiple-block imputation method, where the number of blocks equals the number of incomplete columns in the data. It is feasible to consider the relations among a set of missing variables if we specify them as a single block and perform the MI iteratively over the blocks. Based on the rationale of block imputation, we extend univariate predictive mean matching to the multivariate case to allow for the joint imputation of blocks of variables. The general idea is to match the incomplete case to one of the complete cases by applying canonical regression analysis and imputing the variables in a block entirely from the matched case [6]. We shall refer to the multivariate extension of PMM as multivariate predictive mean matching (MPMM).
Predictive mean matching (PMM) is a user-friendly and versatile nonparametric imputation method. Multiple imputation by chained equation (MICE), which is a popular software package in R for imputing incomplete multivariate data by Fully Conditional Specification (FCS), sets the PMM as the default imputation approach [7]. We tailor PMM to the block imputation framework, which will widen its application. More computational details and properties of PMM would be addressed in Sect. 2. For a comprehensive overview of missing data analysis, we refer to Little and Rubin [8] for a comparison of approaches to missing data other than multiple imputation (e.g., ad-hoc methods, maximum likelihood estimation and weighting methods). Schafer [9], Sinharay et al. [10] and Allison [11] introduced basic concepts and general methods of MI. Schafer and Graham [12] discussed practical issues of application of MI. Various sophisticated missing data analysis were developed on the fields of multilevel model [13], structural equation modeling [14,15], longitudinal data analysis [16,17] and meta-analysis [18]. Schafer [19] compared Bayesian MI methods with maximum likelihood estimation. Seaman and White [20] gave an overview of the use of inverse probability weighting in missing data problems. Ibrahim et al. [21] provided a review of various advanced missing data methods. Because an increasing number of missing data methodologies emerged, MI as well as other approaches were applied in many fields (e.g., epidemiology, psychology and sociology) and implemented in many statistical software packages (e.g., mice and mi in R, IVEWARE in SAS, ice in STATA and module MVA in SPSS) [7]. The following section will outline canonical regression analysis, introduce predictive mean matching (PMM), and connect the techniques to propose multivariate predictive mean matching (MPMM). Section 3 provides a simple comparison between PMM and MPMM. Section 4 is a simulation study investigating whether MPMM yields valid estimates and preserves functional relations between imputed values. The discussion closes the paper.
2 Multivariate Predictive Mean Matching
2.1 Canonical Regression Analysis (CRA)
Canonical regression analysis is a derivation and an asymmetric version of canonical correlation analysis (CCA). It aims to find a linear combination of covariates that predicts a linear combination of outcomes optimally in a least-squares sense [22]. The basic idea of canonical regression analysis is quite old and has been discussed under different names, such as rank-reduced regression [23] and partial least squares [24]. Let us consider the equation

\[ \alpha' Y = \beta X + \epsilon . \qquad (1) \]
We aim to minimize the variance with respect to α and β under some restrictions. CRA can be implemented by maximizing the squared multiple correlation coefficient for the regression of α Y on X, which can be written as
\[ R^{2}_{\alpha' y \cdot x} = \frac{\alpha' \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy}\, \alpha}{\alpha' \Sigma_{yy}\, \alpha}, \qquad (2) \]
where $R^{2}_{\alpha' y \cdot x}$ is the ratio of the amount of variance of $\alpha' Y$ accounted for by the covariates X to the total variance. According to McDonald [25], maximization of the above expression leads to an eigenvalue decomposition: the solution is that $\alpha$ is the right-hand eigenvector of $\Sigma_{yy}^{-1} \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy}$ corresponding to its greatest eigenvalue. After reducing the rank of $\alpha' Y$ to 1, we can estimate $\beta$ by multivariate regression analysis.
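A small numerical sketch of this eigenvector computation is given below (illustrative only, using numpy; the covariance blocks are estimated from completely observed rows, and real parts are taken because the product matrix need not be symmetric).

import numpy as np

def cra_weights(Y_obs, X_obs):
    """Weight vector alpha maximizing R^2 of alpha'Y regressed on X (largest-eigenvalue solution)."""
    n, I = Y_obs.shape
    S = np.cov(np.hstack([Y_obs, X_obs]), rowvar=False)
    Syy, Syx = S[:I, :I], S[:I, I:]
    Sxx, Sxy = S[I:, I:], S[I:, :I]
    M = np.linalg.solve(Syy, Syx) @ np.linalg.solve(Sxx, Sxy)   # Syy^-1 Syx Sxx^-1 Sxy
    eigvals, eigvecs = np.linalg.eig(M)
    alpha = np.real(eigvecs[:, np.argmax(np.real(eigvals))])    # right-hand eigenvector, largest eigenvalue
    return alpha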
2.2 Predictive Mean Matching (PMM)
PMM was first proposed by Rubin [26] and formalized by Little [6]. It can be viewed as an extension of the k nearest neighbor method. PMM calculates the estimated value of the missing variable through a specified imputation model (e.g., a linear imputation model). The method selects a set of candidate donors (typically five) from all complete cases whose estimated values are closest to the estimated value of the missing unit. The unobserved value is imputed by randomly drawing one of the observed values of the candidate donors [5]. Computational Details. We elaborate the algorithm of predictive mean matching for a clear illustration of its merger with canonical regression analysis [27]. Xobs, an Nobs × j matrix, denotes the observed part of the predictors and Xmis, an Nmis × j matrix, denotes the missing part of the predictors.
1. Use linear regression of Yobs given Xobs to estimate β̂ and ε̂ through ordinary least squares.
2. Draw σ²* = ε̂'ε̂/A, where A is a χ² variate with Nobs − j degrees of freedom.
3. Draw β* from a multivariate normal distribution with mean vector β̂ and covariance matrix σ²*(X'obs Xobs)⁻¹.
4. Calculate V̂obs = Xobs β̂ and V̂mis = Xmis β*.
5. For each missing cell ymis,n, where n = 1, ..., Nmis:
(a) Find Δ = |v̂mis,n − v̂obs,k| for all k = 1, ..., Nobs.
(b) Pick several observed entries yobs (5 as default in mice.impute.pmm) with the smallest distance defined in step 5(a).
(c) Randomly draw one of the yobs picked in the previous step to impute ymis,n.
6. Repeat steps 1–5 m times and save m completed datasets.
Predictive mean matching has been proven to perform well in a wide range of simulation studies and is an attractive way to impute missing data [7,27–30]. More precisely, PMM has the appealing features that the imputed values 1) follow the potential distributions of the data and 2) are always within the range of the observed data, because imputed values are actual observed values [5]. For the same reason, PMM yields acceptable imputations even when normality
assumptions are violated [30]. In cases where the observed values follow a skewed distribution, the imputations will also be skewed. If the observations are strictly positive, so will be the imputations from PMM. Furthermore, since PMM does not rely on model assumptions, it alleviates the adverse impact of a misspecified imputation model [31]. Although PMM was developed for situations with only a single incomplete variable, it is easy to implement under a fully conditional specification framework for imputing multivariate missing data. However, the application of PMM under the FCS framework is limited to univariate imputation. It may therefore distort the multivariate relations in the imputations and narrow the application of the method to more complex data structures. For example, Seaman et al. [32] concluded that a univariate implementation of predictive mean matching cannot be advised for producing plausible estimates when the analysis model contains non-linear terms. As a multivariate extension of PMM, we expect that MPMM could yield plausible and consistent imputations when missing covariates include polynomial or interaction terms.
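As a concrete illustration of univariate PMM, the sketch below imputes a single incomplete variable with the mice package and pools the analyses with Rubin's rules. The data set and variable names are invented for this example; mice(), pool() and with() are the package's real interface, and the default number of donors in mice.impute.pmm is five.

```r
library(mice)
set.seed(1)
n <- 500
x <- rnorm(n, mean = 2)
y <- 3 * x + rnorm(n)
dat <- data.frame(x = x, y = y)
dat$y[sample(n, 0.3 * n)] <- NA                  # 30% of y set missing (MCAR)
imp <- mice(dat, method = "pmm", m = 5, printFlag = FALSE)
summary(pool(with(imp, lm(y ~ x))))              # combine the m analyses
```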
2.3 Multivariate Predictive Mean Matching (MPMM)
For illustration, we present the algorithm with one missing data pattern. The appendix discusses the extension to cases with multiple missing patterns. Let $Y = (Y_1, \cdots, Y_I)$ and $X = (X_1, \cdots, X_J)$ be two sets of I jointly incomplete variables and J complete quantitative variables, respectively. Let $V = \alpha' Y$ denote the linear combination of the multiple response variables and X denote the predictors with j dimensions.

1. Use the observed data to estimate the $(I + J) \times (I + J)$ covariance matrix
$$\begin{pmatrix} \Sigma_{y_{obs} y_{obs}} & \Sigma_{y_{obs} x_{obs}} \\ \Sigma_{x_{obs} y_{obs}} & \Sigma_{x_{obs} x_{obs}} \end{pmatrix}.$$
2. Find the largest eigenvalue $\lambda^2$ of $\Sigma_{y_{obs} y_{obs}}^{-1} \Sigma_{y_{obs} x_{obs}} \Sigma_{x_{obs} x_{obs}}^{-1} \Sigma_{x_{obs} y_{obs}}$ and its corresponding right-hand eigenvector $\alpha$.
3. Calculate the linear combination $\alpha' Y$ for all completely observed individuals in the sample: $V_{obs} = \alpha' Y_{obs}$.
4. Use linear regression of $V_{obs}$ given $X_{obs}$ to estimate $\hat{\beta}$ and $\hat{\epsilon}$ through ordinary least squares.
5. Draw $\sigma^{2*} = \hat{\epsilon}^T\hat{\epsilon}/A$, where $A$ is a $\chi^2$ variate with $N_{obs} - j$ degrees of freedom.
6. Draw $\beta^*$ from a multivariate normal distribution with mean vector $\hat{\beta}$ and covariance matrix $\sigma^{2*}(X_{obs}^T X_{obs})^{-1}$.
7. Calculate $\hat{V}_{obs} = X_{obs}\hat{\beta}$ and $\hat{V}_{mis} = X_{mis}\beta^*$.
8. For each missing vector $y_{mis,n}$, where $n = 1, \cdots, N_{mis}$:
   (a) Find $\Delta = |\hat{v}_{mis,n} - \hat{v}_{obs,k}|$ for all $k = 1, \cdots, N_{obs}$.
   (b) Pick several observed components $y_{obs} = \{y_{1,obs}, \cdots, y_{I,obs}\}$ (5 as default) with the smallest distance defined in step 8(a).
   (c) Randomly draw one of the $y_{obs}$ picked in the previous step to impute $y_{mis,n}$.
9. Repeat steps 5–8 m times and save the m completed datasets.
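For readers who prefer code, the following R function is our own sketch of a single MPMM draw for one block missing-data pattern; it mirrors steps 1–8 above but is not the authors' implementation, and all helper and variable names are ours.

```r
mpmm_impute_once <- function(Y, X, mis, donors = 5) {  # mis: TRUE where the Y-block is missing
  Y <- as.matrix(Y); X <- as.matrix(X)
  Yobs <- Y[!mis, , drop = FALSE]
  Xobs <- cbind(1, X[!mis, , drop = FALSE])
  Xmis <- cbind(1, X[mis,  , drop = FALSE])
  # Steps 1-3: canonical regression weights and the combined outcome V = alpha'Y
  Syy <- cov(Yobs)
  Sxx <- cov(X[!mis, , drop = FALSE])
  Syx <- cov(Yobs, X[!mis, , drop = FALSE])
  alpha <- Re(eigen(solve(Syy) %*% Syx %*% solve(Sxx) %*% t(Syx))$vectors[, 1])
  vobs <- drop(Yobs %*% alpha)
  # Steps 4-6: Bayesian draw of the regression of V on X
  fit    <- lm.fit(Xobs, vobs)
  sigma2 <- sum(fit$residuals^2) / rchisq(1, df = nrow(Xobs) - ncol(Xobs))
  XtXinv <- chol2inv(chol(crossprod(Xobs)))
  beta_star <- drop(fit$coefficients + t(chol(sigma2 * XtXinv)) %*% rnorm(ncol(Xobs)))
  # Step 7: predicted means for observed and missing rows
  vhat_obs <- drop(Xobs %*% fit$coefficients)
  vhat_mis <- drop(Xmis %*% beta_star)
  # Step 8: copy the whole observed Y-row of a randomly chosen close donor
  Yimp <- Y
  for (n in seq_along(vhat_mis)) {
    d <- order(abs(vhat_obs - vhat_mis[n]))[seq_len(donors)]
    Yimp[which(mis)[n], ] <- Yobs[sample(d, 1), ]
  }
  Yimp
}
```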
We also tried other methods of multivariate analysis, such as multivariate regression analysis (MRA) [33] and redundancy analysis (RA) [34]. However, imputation models specified by MRA or RA are not appropriate because of the assumed independence between missing variables. The violation of this assumption leads to less sensible imputations when there are extra relations among missing covariates.
3 Comparison Between PMM and MPMM

We shall illustrate that although MPMM is a multivariate imputation method, where the whole missing component is assigned entirely from the matching donor, the derived imputed datasets are also plausible at the univariate level.

3.1 Simulation Conditions
The predictors were generated by a multivariate distribution
$$\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \sim N\left[\begin{pmatrix} 2 \\ 2 \\ 2 \end{pmatrix}, \begin{pmatrix} 12 & 0 & 0 \\ 0 & 12 & 0 \\ 0 & 0 & 12 \end{pmatrix}\right].$$
The responses were generated based on the multivariate linear model
$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} \sim N\left[\begin{pmatrix} 3X_1 + X_2 + 2X_3 \\ X_1 + 5X_2 + 2X_3 \\ 5X_1 + 3X_2 + X_3 \end{pmatrix}, \begin{pmatrix} 4 & 4\rho & 4\rho \\ 4\rho & 4 & 4\rho \\ 4\rho & 4\rho & 4 \end{pmatrix}\right],$$
where ρ denotes the correlation between the predictors X. Let R be the vector of observation indicators whose values are zero if the corresponding variable is missing and one if observed. We simulated missingness such that rows in the set (Y1, Y2, Y3) were always either observed or completely missing. This joint missingness was either completely at random (MCAR) with $P(R = 0 \mid X, Y) = 0.4$, or right-tailed missing at random (MARright) with $P(R = 0 \mid X, Y) = \frac{e^a}{1+e^a}$, where $a = \alpha_0 + X_1/SD(X_1)$ and $\alpha_0$ was chosen to make the probability of jointly missing Y equal to 0.4. Missing values were induced with the ampute function [35] from the package MICE [7] in R [36]. The correlation ρ was simulated from 0.2, 0.5 or 0.8 corresponding to a weak, moderate and strong dependence between predictors. The sample size was 2000, and 1000 simulations were repeated for different setups. For reasons of brevity, we focused our evaluation on the expectation of Y1 and the correlation between Y1 and Y2. We studied the average bias over 1000 simulations with respect to the designed population value and the coverage rate of the nominal 95% confidence interval. Within each simulation, we generated five imputed datasets and combined the statistics into a single inference by using Rubin's combination rules [1].
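A hedged sketch of this data-generating and amputation step is given below. It is our reading of the setup, not the paper's code: the joint amputation of (Y1, Y2, Y3) is expressed through a single missingness pattern, and the MARright dependence on X1 is passed through ampute's weights argument.

```r
library(MASS)   # mvrnorm
library(mice)   # ampute
set.seed(123)
n <- 2000; rho <- 0.5
X <- mvrnorm(n, mu = rep(2, 3), Sigma = diag(12, 3))
E <- mvrnorm(n, mu = rep(0, 3),
             Sigma = matrix(4 * rho, 3, 3) + diag(4 - 4 * rho, 3))
Y <- cbind(Y1 = 3 * X[, 1] +     X[, 2] + 2 * X[, 3],
           Y2 =     X[, 1] + 5 * X[, 2] + 2 * X[, 3],
           Y3 = 5 * X[, 1] + 3 * X[, 2] +     X[, 3]) + E
dat <- data.frame(X1 = X[, 1], X2 = X[, 2], X3 = X[, 3], Y)
amp <- ampute(dat, prop = 0.4,
              patterns = matrix(c(1, 1, 1, 0, 0, 0), nrow = 1),  # Y1-Y3 jointly missing
              weights  = matrix(c(1, 0, 0, 0, 0, 0), nrow = 1),  # missingness driven by X1
              mech = "MAR", type = "RIGHT")
incomplete <- amp$amp
```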
3.2 Results
Table 1. Simulation results for evaluating whether MPMM provide valid imputations at the univariate level.

| ρ   | Scenario | E(Y1): PMM bias | PMM cov | PMM-CRA bias | PMM-CRA cov | ρ(Y1,Y2): PMM bias | PMM cov | PMM-CRA bias | PMM-CRA cov |
|-----|----------|-----------------|---------|--------------|-------------|--------------------|---------|--------------|-------------|
| 0   | MCAR     | 0               | 0.94    | 0            | 0.95        | 0                  | 0.95    | 0            | 0.94        |
| 0   | MAR      | 0               | 0.93    | 0            | 0.94        | 0                  | 0.96    | 0            | 0.94        |
| 0.5 | MCAR     | 0               | 0.95    | 0            | 0.93        | 0                  | 0.95    | 0            | 0.95        |
| 0.5 | MAR      | 0               | 0.94    | 0            | 0.94        | 0                  | 0.94    | 0            | 0.94        |
| 0.8 | MCAR     | 0               | 0.93    | 0            | 0.94        | 0.01               | 0.91    | 0            | 0.95        |
| 0.8 | MAR      | 0               | 0.93    | 0            | 0.93        | 0.01               | 0.93    | 0            | 0.94        |
Table 1 shows the simulation results. In general, MPMM yielded no discernible difference from PMM when focusing on the correlation coefficient ρ(Y1, Y2). Under the MCAR missingness mechanism, both methods yielded unbiased estimates and displayed coverage rates close to the nominal 95%, even though there was 40% missingness in the joint set (Y1, Y2, Y3). It is notable that with MARright and a high correlation between Y1 and Y2, PMM had a somewhat reduced coverage rate, which suggests that MPMM yields more robust results across various correlation coefficients. For estimation of the mean value E(Y1), MPMM performed similarly to PMM. Both methods yielded plausible imputations under the various missingness scenarios and different pre-assumed correlation coefficients. These initial results suggest that multivariate predictive mean matching could be an alternative to predictive mean matching: if PMM yields sensible imputations, so will PMM-CRA.
4 Evaluation

To investigate the performance of MPMM when there are relations in the incomplete data, we performed the following simulation studies carried out in R 4.0.5 [36].

4.1 Linear Regression with Squared Term
We first simulated from a linear regression substantive model with a squared term.
Simulation Conditions. The dependent variable Y was generated according to the analysis model
$$Y = \alpha + \beta_1 X + \beta_2 X^2 + \epsilon, \quad (3)$$
where α = 0, β1 = 1, β2 = 1, and both the predictor X and the error term ε were assumed to follow standard normal distributions. These coefficients lead to a strong quadratic association between Y and X. A large sample size (n = 5000) was created. Simulations were repeated 1000 times so that we could achieve more robust and stable analyses. Forty percent of X and X² were designed to be jointly missing under five missingness mechanisms: MCAR, MARleft, MARmid, MARtail, and MARright¹, which means there were no cases with a missing value on only one of X and X² for any mechanism. Missing values were again created with the ampute function from the package MICE in R.
Estimation Methods. We compared the performance of MPMM to four other approaches: 'transform, then impute' (TTI), 'impute, then transform' (ITT), the polynomial combination method (PC) and substantive model compatible FCS (SMC-FCS). 'Impute, then transform', also known as passive imputation, excludes X² during imputation and appends it as the square of X afterwards. 'Transform, then impute', also known as just another variable (JAV), treats the squared term as another variable to be imputed. Both of these methods were proposed by von Hippel [37]. We also apply the polynomial combination method proposed by Vink and van Buuren [38]. PC imputes the combination of X and X² by predictive mean matching and then decomposes it by solving a quadratic equation for X. The polynomial combination method is implemented by the mice.impute.quadratic function in the R MICE package. Finally, SMC-FCS was proposed by Bartlett et al. [39]. In general, it imputes the missing variable based on the formula
$$f(X_i \mid X_{-i}, Y) = \frac{f(X_i, X_{-i}, Y)}{f(Y, X_{-i})} \propto f(Y \mid X_i, X_{-i})\, f(X_i \mid X_{-i}). \quad (4)$$
Provided the scientific model is known and the imputation model is specified precisely (i.e., $f(Y \mid X_i, X_{-i})$ fits the substantive model), SMC-FCS derives imputations that are compatible with the substantive model. SMC-FCS is implemented by the smcfcs function in the R smcfcs package, and a range of common models (e.g., linear regression, logistic regression, Poisson regression, Weibull regression and Cox regression) is available.
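To make the difference between the first two strategies concrete, the sketch below sets up 'transform, then impute' (JAV) and 'impute, then transform' (passive imputation) with mice. The data generation and variable names are ours; the passive-imputation formula syntax ("~ I(X^2)") and make.predictorMatrix() are standard mice usage, but the setup is an illustration rather than the authors' simulation code.

```r
library(mice)
set.seed(42)
n <- 5000
X <- rnorm(n); Y <- X + X^2 + rnorm(n)
dat <- data.frame(X = X, Xsq = X^2, Y = Y)
dat[sample(n, 0.4 * n), c("X", "Xsq")] <- NA       # joint 40% missingness (MCAR)

# 'Transform, then impute' (JAV): treat Xsq as just another variable
imp_tti <- mice(dat, method = c("pmm", "pmm", ""), printFlag = FALSE)

# 'Impute, then transform' (passive imputation): derive Xsq from the imputed X
meth <- c("pmm", "~ I(X^2)", "")
pred <- make.predictorMatrix(dat)
pred[, "Xsq"] <- 0                                  # keep Xsq out of the imputation models
imp_itt <- mice(dat, method = meth, predictorMatrix = pred, printFlag = FALSE)

summary(pool(with(imp_itt, lm(Y ~ X + Xsq))))
```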
¹ With a left-tailed (MARleft), centered (MARmid), both-tailed (MARtail) or right-tailed (MARright) missingness mechanism, a higher probability of X being missing is assigned to the units with low, central, extreme and high values of Y, respectively.
Results. Table 2 displays the results of the simulation, including the estimates of α, β1, β2, σ, R² and the coverage of the nominal 95% confidence intervals of β1 and β2. In general, MPMM performed similarly to the polynomial combination method. There were no discernible biases for either approach under the five types of missingness mechanisms (MCAR, MARleft, MARmid, MARtail, and MARright). The coverage of the CIs for β1 and β2 from MPMM and PC was close to 95% with MCAR, MARleft, and MARmid. However, MPMM and PC had low CI coverage with MARtail and MARright. The undercoverage issue is due to the data-driven nature of predictive mean matching. PMM might result in implausible imputations when sub-regions of the sample space are sparsely observed or even truncated, possibly because of the extreme missing data mechanism and the small sample size. In such a case, two possible results may occur. First, the same donors are repeatedly selected for the missing unit in the sparsely populated sample space, which may lead to an underestimation of the variance of the considered statistic [40]. Second, more severely, the selected donors are far away from the missing unit in the sparsely populated sample space, which may lead to a biased estimate of the considered statistic. Although the 'impute, then transform' method preserved the squared relationship, it resulted in severely biased estimates, even with MCAR. The CI coverage of β2 was considerably poor with all missingness mechanisms. With MCAR, the 'transform, then impute' method yielded unbiased regression estimates and correct CI coverage for β1 and β2. However, TTI distorted the quadratic relation between X and X². It also gave severely biased results, and the CIs for β1 and β2 had 0% coverage with MARleft, MARtail, and MARright. Since we knew the scientific model in the simulation study and specified a correct imputation model, SMC-FCS provided unbiased estimates and close to 95% CI coverage with all five missingness mechanisms. Furthermore, it was noteworthy that with MARtail and MARright, MPMM and PC yielded relatively accurate estimates of σ and R² compared with the model-based imputation method. Overall, multivariate predictive mean matching yielded unbiased estimates of the regression parameters and preserved the quadratic structure between X and X². Figure 1 shows an example of the observed-data and imputed-data relationships between X and X², generated by the multivariate predictive mean matching method.

4.2 Linear Regression with Interaction Term
This section considers a linear regression substantive model, which includes two predictors and their interaction effect.
Table 2. Average parameter estimates for different imputation methods under five different missingness mechanisms over 1000 imputed datasets (n = 5000) with 40% missing data. The designed model is Y = α + β1 X + β2 X 2 + , where α = 0, β1 = 1, β2 = 1 and ∼ N (0, 1). The population coefficient of determination R2 = .75. Missingness mechanism MCAR MARleft MARmid MARtail
MARright
Transform, then impute −0.04
0
0.15
Slope of X (β1 )
1(0.93)
0.93(0.02) 0.97(0.68) 1.13(0)
1.27(0)
Slope of X 2 (β2 ) 1(0.92)
0.93(0)
0.96(0.13) 1.13(0)
1.27(0)
Residual SD (σ ) 1
0.96
1
1.06
1.13
0.77
0.75
0.72
0.68
0.22
0.2
0.45
0.49
R
2
0.75
0
−0.11
Intercept (α)
Impute, then transform Intercept (α)
0.32
Slope of X (β1 )
0.94(0.62) 0.97(0.91) 0.89(0.08) 1(0.99)
1.04(0.92)
Slope of X 2 (β2 ) 0.68(0)
0.68(0)
0.74(0)
0.62(0)
0.7(0)
Residual SD (σ ) 1.41
1.36
1.35
1.52
1.57
R2
0.5
0.54
0.55
0.42
0.38
Intercept (α)
0
0
0
−0.05
−0.06
Slope of X (β1 )
1(0.93)
1(0.82)
PC 1(0.93)
1(0.93)
1(0.85)
Slope of X 2 (β2 ) 1.01(0.9)
1(0.94)
1(0.93)
1.07(0.12) 1.09(0.09)
Residual SD (σ ) 1
1
1
1.05
1.07
0.75
0.75
0.75
0.72
0.71
Intercept (α)
0
0
0
−0.03
−0.03
Slope of X (β1 )
1(0.93)
1(0.93)
1(0.91)
1.04(0.47) 1.06(0.4)
Slope of X 2 (β2 ) 1(0.91)
1(0.95)
1(0.93)
1.05(0.25) 1.07(0.23)
Residual SD (σ ) 1
1
1
1.05
1.07
0.75
0.75
0.75
0.72
0.71
Intercept (α)
0.01
0
0
0.03
0.05
Slope of X (β1 )
R
2
PMM-CRA
R
2
SMC-FCS 1(0.96)
1(0.95)
1(0.95)
1(0.97)
1.01(0.97)
Slope of X 2 (β2 ) 1(0.95)
1(0.96)
1(0.94)
1(0.96)
1.01(0.93)
Residual SD (σ ) 1.04
1
1
1.11
1.12
R2
0.75
0.75
0.69
0.68
0.73
Fig. 1. Predictive mean matching based on canonical regression analysis. Observed (blue) and imputed values (red) for X and X 2 .
Simulation Conditions. The dependent variable Y was generated according to the analysis model
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon, \quad (5)$$
where α = 0, β1 = 1, β2 = 1, β3 = 1, and the two predictors X1, X2 and the error term ε were assumed to follow standard normal distributions. Under five types of missingness mechanisms: MCAR, MARleft, MARmid, MARtail, and MARright, the
probability of jointly missing X1 and X2 was set to 0.4. There were no units with a missing value on only one of X1 and X2. Missing values were amputed with the ampute function from the package MICE in R. For each simulation scenario, n = 5000 units were generated and 1000 simulations were repeated.
Estimation Methods. We evaluated and compared the same methods as in Sect. 4.1, except the polynomial combination method. The model-based imputation method ensures a compatible imputation model by accommodating the designed model.
Results. Table 3 shows the estimates of α, β1, β2, β3, σ, R² and the coverage of the 95% confidence intervals for β1, β2 and β3. With MCAR, MARleft, and MARmid, MPMM was unbiased, and the CI coverage for the regression weights was at the nominal level. Similar to the linear regression with quadratic term situation, with MARtail and MARright, MPMM yielded unbiased estimates but had relatively reduced confidence interval coverage. The reason is explained in Sect. 4.1.2. The 'transform, then impute' method did not preserve the relations even though it resulted in plausible inferences in cases of MCAR and MARmid; the imputations were not plausible. Moreover, with MARleft, MARtail, and MARright, the 'transform, then impute' method gave severely biased estimates and extremely poor CI coverage. The 'impute, then transform' method generally yielded biased estimates, and the CIs for the coefficients β1, β2 and β3 had lower than nominal coverage with all five types of missingness. SMC-FCS yielded unbiased estimates of the regression weights and had correct CI coverage in all simulation scenarios. The only potential shortcoming of the model-based imputation method was that the estimates of σ and R² showed slight deviations from the true values with MARtail and MARright.

4.3 Incomplete Dataset with Inequality Restriction X1 + X2 ≥ C
Multivariate predictive mean matching is flexible enough to model relations among missing variables other than linear regression with polynomial or interaction terms. Lastly, we evaluate the inequality restriction X1 + X2 ≥ C, which is relatively difficult for the model-based imputation approach to specify. One application of such an inequality restriction would be the analysis of the academic performance of qualified students; for example, the sum score of the mid-term and final exams should exceed a fixed value.
Simulation Conditions. The data were generated from
$$\begin{pmatrix} X_1 \\ X_3 \end{pmatrix} \sim N\left[\begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 4 & 3.2 \\ 3.2 & 4 \end{pmatrix}\right], \quad X_2 = 3 - X_1 + \epsilon,$$
where ε followed a standard uniform distribution. The sum X1 + X2 ≥ 3 was the restriction in the generated data. We simulated
Table 3. Average parameter estimates for different imputation methods under five different missingness mechanisms over 1000 imputed datasets (n = 5000) with 40% missing data. The designed model is Y = α + β1 X1 + β2 X2 + β3 X1 X2 + , where α = 0, β1 = 1, β2 = 1, β3 = 1 and ∼ N (0, 1). The population coefficient of determination R2 = .75. Missingness mechanism MCAR MARleft MARmid MARtail
MARright
Transform, then impute Intercept (α)
0
0.05
−0.05
0.06
Slope of X1 (β1 )
1(0.93)
0.96(0.4)
1(0.94)
1.05(0.42) 1.08(0.05)
Slope of X2 (β2 )
1(0.94)
0.96(0.4)
1(0.96)
1.05(0.38) 1.09(0.02)
Slope of X1 X2 (β3 ) 1(0.94)
0.05
0.96(0.53) 0.95(0.25) 1.06(0.31) 1.09(0.02)
Residual SD (σ )
1
0.97
1
1.02
1.04
R2
0.75
0.76
0.75
0.74
0.73
−0.04
−0.01
0.01
0.11
Impute, then transform Intercept (α)
0
Slope of X1 (β1 )
0.98(0.88) 1.05(0.51) 0.96(0.71) 0.98(0.9)
Slope of X2 (β2 )
0.98(0.88) 1.05(0.48) 0.96(0.73) 0.98(0.92) 0.95(0.69)
0.95(0.69)
Slope of X1 X2 (β3 ) 0.64(0)
0.64(0)
0.7(0)
0.54(0)
0.61(0)
Residual SD (σ )
1.25
1.18
1.22
1.28
1.37
R2
0.61
0.65
0.63
0.59
0.53
Intercept (α)
0
0
0
0.01
0.02
Slope of X1 (β1 )
1(0.93)
1(0.86)
1(0.92)
1.02(0.8)
1.02(0.73)
Slope of X2 (β2 )
1(0.93)
1(0.84)
1(0.93)
1.02(0.8)
1.02(0.77)
PMM-CRA
Slope of X1 X2 (β3 ) 1(0.94)
1.01(0.86) 1(0.93)
Residual SD (σ )
1
1.01
1
1.03
1.03
R2
0.75
0.74
0.75
0.74
0.74
Intercept (α)
0
−0.01
0
0.01
0.03
Slope of X1 (β1 )
1(0.95)
1.01(0.95) 1(0.95)
1(0.96)
0.99(0.95)
Slope of X2 (β2 )
0.99(0.94) 0.99(0.93) 1(0.97)
1(0.96)
0.99(0.96)
1.02(0.71) 1.03(0.68)
SMC-FCS
Slope of X1 X2 (β3 ) 1(0.95)
1(0.96)
1(0.95)
1(0.97)
1.01(0.93)
Residual SD (σ )
1.02
1.02
1
1.07
1.06
0.74
0.74
0.75
0.71
0.72
R
2
missingness such that rows in the block (X1 , X2 ,) were always either observed or completely missing. We considered 30% joint missingness of X1 and X2 . 2000 subjects were generated and 1000 simulations were performed for two missingness
mechanisms: MCAR and MARright. We evaluated the mean of X1 and X2 and the coverage of nominal 95% CIs.
Estimation Methods. We compared MPMM with PMM to illustrate the limited performance of univariate imputation approaches when there are relations connected to multiple missing variables. We did not apply joint modeling and univariate model-based imputation methods because it is hard to specify the designed inequality restriction.

Table 4. Average parameter estimates for MPMM and PMM under MCAR and MARright over 1000 imputed datasets (n = 2000) with 30% missing data. The designed model is introduced in Sect. 4.3.1. The true values of E(X1) and E(X2) are 0 and 3.5.

|       | MPMM, MCAR (mean / coverage) | MPMM, MARright (mean / coverage) | PMM, MCAR (mean / coverage) | PMM, MARright (mean / coverage) |
|-------|------------------------------|----------------------------------|-----------------------------|---------------------------------|
| E(X1) | 0 / 0.95                     | 0.01 / 0.94                      | 0 / 0.92                    | −0.3 / 0                        |
| E(X2) | 3.5 / 0.95                   | 3.51 / 0.95                      | 3.5 / 0.92                  | 3.8 / 0                         |
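For completeness, here is our hedged sketch of the Sect. 4.3 data generation and joint amputation; the names and seed are ours, and the restriction X1 + X2 ≥ 3 follows from X2 = 3 − X1 + ε with ε ~ U(0, 1).

```r
library(MASS); library(mice)
set.seed(7)
n <- 2000
XZ <- mvrnorm(n, mu = c(0, 1), Sigma = matrix(c(4, 3.2, 3.2, 4), 2))
X1 <- XZ[, 1]; X3 <- XZ[, 2]
X2 <- 3 - X1 + runif(n)                       # enforces X1 + X2 >= 3
dat <- data.frame(X1 = X1, X2 = X2, X3 = X3)
stopifnot(all(dat$X1 + dat$X2 >= 3))
amp <- ampute(dat, prop = 0.3, mech = "MCAR",
              patterns = matrix(c(0, 0, 1), nrow = 1))   # X1 and X2 jointly missing
incomplete <- amp$amp
```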
Results. Table 4 shows the mean estimates of X1 and X2 and the coverage of the corresponding 95% CIs. The true values for E(X1) and E(X2) are 0 and 3.5. MPMM yielded unbiased estimates with both MCAR and MARright and had the correct CI coverage. PMM was unbiased with close to 95% coverage only when the missingness mechanism was MCAR; it had considerable bias and extremely poor coverage with MARright. The reason is that the relations between X1 and X2 are not modeled [29].
5 Conclusion
Predictive mean matching is an attractive method for missing data imputation. However, because of its univariate nature, PMM may not keep relations between variables with missing units. Our proposed modification of predictive mean matching, MPMM, is a multivariate extension of PMM that imputes a block of variables. We combine canonical regression analysis with predictive mean matching so that the models for donors selection are appropriate when there are restrictions involving more than one variable. MPMM could be valuable because it inherits the advantages of predictive mean matching and preserves relations between partially observed variables. Moreover, since predictive mean matching performs well in a wide range of simulation studies, so can the multivariate predictive mean matching. We assess the performance of the multivariate predictive mean matching under three different substantive models with restrictions. In the first two simulation studies, MPMM provides unbiased estimates where the scientific model
includes square terms and interaction terms under both MCAR and MAR missingness mechanisms. However, with MARtail and MARright, MPMM suffers from an undercoverage issue because the density of the response indicator is heavy-tailed in our simulation setup. This leaves units with large Y almost unobserved, so that there are more missing than observed data in the tail region. The missingness mechanism is commonly moderate in practice, unlike MARtail and MARright in the simulation studies. Overall, when no sub-regions of the sample space are sparsely observed, the multivariate predictive mean matching analysis will provide unbiased estimates and correct CI coverage. SMC-FCS yields better estimates and CI coverage of the regression weights, but MPMM provides relatively accurate σ and R². The comparison is not entirely fair because SMC-FCS, as used here, requires the correct substantive model for the data. In practice, we often do not know the model, and MPMM becomes attractive. MPMM is an easy-to-use method when the number of variables in the dataset increases or when only the estimates are of interest. The third simulation shows the appealing properties of MPMM: when relations among the missing variables are challenging to model, MPMM becomes the most effective approach to imputation. We expect that MPMM could be applied to other relations not yet discussed in Sect. 4. We limited our calculations and analyses to normally distributed X. However, since Vink [30] concluded that PMM yields plausible imputations with non-normally distributed predictors, we argue that the distributions of the predictors will not significantly impact the imputations. We focus on the simple case with one missing data pattern. One possible way to generalize MPMM to more complicated missing data patterns is proposed in the appendix. The general idea is to partition the cases into groups of identical missing data patterns in the block imputed with MPMM. We then perform the imputation in ascending order of the fraction of missing information, i.e., we first impute cases with relatively small missing data problems. Consider imputing partially observed covariates for a linear regression with a quadratic term, Y = X + X². We first impute cases with only a missing value in X² by squaring the observed X. Then cases with only a missing value in X are imputed with one square root of Y = X + X²; however, the selection of the root should be modeled with logistic regression. Finally, we impute cases with jointly missing X and X² with MPMM. A comprehensive understanding of MPMM with multiple missing data patterns is an area for further research.
Appendix

The MPMM algorithm with multiple missing patterns:

1. Sort the rows of Y into S missing data patterns Y[s], s = 1, · · · , S.
2. Initialize Ymis by a reasonable starting value.
3. Repeat for T = 1, · · · , t.
4. Repeat for S = 1, · · · , s.
5. Impute missing values by steps 1–8 of the PMM-CRA algorithm proposed in Sect. 2.3.
6. Repeat steps 1–5 m times and save m completed datasets.
References 1. Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. Wiley, New York (2004) 2. Schafer, J.L.: Analysis of Incomplete Multivariate Data. CRC Press, Boca Raton (1997) 3. Van Buuren, S.: Multiple imputation of discrete and continuous data by fully conditional specification. Stat. Methods Med Res. 16(3), 219–242 (2007) 4. Goldstein, H., Carpenter, J.R., Browne, W.J.: Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and non-linear terms. J. Roy. Stat. Soc. Ser. A. 177(2), 553–564 (2014) 5. van Buuren, S.: Flexible Imputation of Missing Data, 2nd edn. Chapman and Hall/CRC (2018). https://doi.org/10.1201/9780429492259 6. Little, R.J.A.: Missing-data adjustments in large surveys. J Bus. Econ. Stat. 6(3), 287–296 (1988). https://doi.org/10.1080/07350015.1988.10509663 7. van Buuren, S., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equations in R. J. Stat. Softw. 45(3) (2011). https://doi.org/10.18637/ jss.v045.i03 8. Little, R.J., Rubin, D.B.: Statistical Analysis with Missing Data, vol. 793. Wiley, New York (2019) 9. Schafer, J.L.: Multiple imputation: a primer. Stat. Methods Med. Res. 8(1), 3–15 (1999) 10. Sinharay, S., Stern, H.S., Russell, D.: The use of multiple imputation for the analysis of missing data. Psychol. Methods 6(4), 317 (2001) 11. Allison, P.D.: Missing Data. Sage Publications, Thousand Oaks (2001) 12. Schafer, J.L., Graham, J.W.: Missing data: our view of the state of the art. Psychol. Methods 7(2), 147 (2002) 13. Longford, N.: Multilevel analysis with messy data. Stat. Methods Med. Res. 10(6), 429–444 (2001) 14. Olinsky, A., Chen, S., Harlow, L.: The comparative efficacy of imputation methods for missing data in structural equation modeling. Eur. J. Oper. Res. 151(1), 53–79 (2003) 15. Allison, P.D.: Missing data techniques for structural equation modeling. J. Abnorm. Psychol. 112(4), 545 (2003) 16. Twisk, J., de Vente, W.: Attrition in longitudinal studies: how to deal with missing data. J. Clin. Epidemiol. 55(4), 329–337 (2002) 17. Demirtas, H.: Modeling incomplete longitudinal data. J. Mod. Appl. Stat. Methods 3(2), 5 (2004) 18. Pigott, T.D.: Missing predictors in models of effect size. Eval. Health Prof. 24(3), 277–307 (2001) 19. Schafer, J.L.: Multiple imputation in multivariate problems when the imputation and analysis models differ. Stat. Neerl. 57(1), 19–35 (2003) 20. Seaman, S.R., White, I.R.: Review of inverse probability weighting for dealing with missing data. Stat. Methods Med. Res. 22(3), 278–295 (2013)
21. Ibrahim, J.G., Chen, M.-H., Lipsitz, S.R., Herring, A.H.: Missing-data methods for generalized linear models: a comparative review. J. Am. Stat. Assoc. 100(469), 332–346 (2005) 22. Israels, A.Z.: Eigenvalue Techniques for Qualitative Data (m&t series). DSWO Press, Leiden (1987) 23. Izenman, A.J.: Reduced-rank regression for the multivariate linear model. J. Multivar. Anal. 5(2), 248–264 (1975) 24. Sun, L., Ji, S., Yu, S., Ye, J.: On the equivalence between canonical correlation analysis and orthonormalized partial least squares. In: Twenty-First International Joint Conference on Artificial Intelligence (2009) 25. McDonald, R.P.: A unified treatment of the weighting problem. Psychometrika. 33(3), 351–381 (1968). https://doi.org/10.1007/bf02289330 26. Rubin, D.B.: Statistical matching using file concatenation with adjusted weights and multiple imputations. J. Bus. Econ. Stat. 4(1), 87 (1986). https://doi.org/10. 2307/1391390 27. Vink, G., Lazendic, G., van Buuren, S.: Partitioned predictive mean matching as a large data multilevel imputation technique. Psychol. Test Assess. Model. 57(4), 577–594 (2015) 28. Heitjan, D.F., Little, R.J.A.: Multiple imputation for the fatal accident reporting system. Appl. Stat. 40(1), 13 (1991). https://doi.org/10.2307/2347902 29. Morris, T. P., White, I. R., Royston, P.: Tuning multiple imputation by predictive mean matching and local residual draws. BMC Med. Res. Methodol. 14(1) (2014). https://doi.org/10.1186/1471-2288-14-75 30. Vink, G., Frank, L. E., Pannekoek, J., van Buuren, S.: Predictive mean matching imputation of semicontinuous variables. Stat. Neerl. 68(1), 61–90 (2014). https:// doi.org/10.1111/stan.12023 31. Carpenter, J., Kenward, M.: Multiple Imputation and Its Application. Wiley, New York (2012) 32. Seaman, S.R., Bartlett, J.W., White, I.R.: Multiple imputation of missing covariates with non-linear effects and interactions: an evaluation of statistical methods. BMC Med. Res. Methodol. 12(1) (2012). https://doi.org/10.1186/1471-2288-1246 33. Rencher, A.C.: Methods of Multivariate Analysis, vol. 492. Wiley, New York (2003) 34. Van Den Wollenberg, A.L.: Redundancy analysis an alternative for canonical correlation analysis. Psychometrika 42(2), 207–219 (1977) 35. Schouten, R.M., Lugtig, P., Vink, G.: Generating missing values for simulation purposes: a multivariate amputation procedure. J. Stat. Comput. Simul. 88(15), 2909–2930 (2018). https://doi.org/10.1080/00949655.2018.1491577 36. R Core Team: R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria (2021). https://www.R-project.org/ 37. Von Hippel, P.: How to impute interactions, squares, and other transformed variables. Sociol. Methodol. 39(1), 265–291 (2009) 38. Vink, G., van Buuren, S.: Multiple imputation of squared terms. Sociol. Methods Res. 42(4), 598–607 (2013). https://doi.org/10.1177/0049124113502943 39. Bartlett, J. W., Seaman, S. R., White, I. R., Carpenter, J. R., Initiative*, A.D.N.: Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model. Stat. Methods Med. Res. 24(4), 462–487 (2015) 40. de Jong, R., van Buuren, S., Spiess, M.: Multiple imputation of predictor variables using generalized additive models. Commun. Stat. Simul. Comput. 45(3), 968–985 (2014). https://doi.org/10.1080/03610918.2014.911894
Timeline Branching Method for Social Systems Monitoring and Simulation Anton Ivaschenko1(B) , Evgeniya Dodonova2 , Irina Dubinina1 , Pavel Sitnikov2,3 , and Oleg Golovnin4 1 Samara State Technical University, Molodogvardeyskaya 244, Samara, Russia
[email protected]
2 ITMO University, Kronverksky Pr. 49A, St. Petersburg, Russia 3 SEC “Open code”, Yarmarochnaya 55, Samara, Russia 4 Samara National Research University, Moskovskoye shosse 34, Samara, Russia
Abstract. The paper presents a new method of timeline balancing for monitoring and simulation of the possible scenarios of a social system development based on cross-correlation analysis of non-even time series. The proposed approach allows using the simulation results to improve the adequacy and accuracy of the forecast and determine the necessary and sufficient number of considered situations. The situational model is introduced to solve the problem of describing several generated options of the pace of developments in a complex social system. Possible generated options include positive, negative and indifferent scenarios as well as those considered as a possible feedback to specific incoming events. The main idea is to split the current situation to several options and generate separate scenarios for its development in time. The method allows evaluating the closeness of neutral expected scenario to its positive and negative options, which leads to automatically generated recommendations of collecting additional input data or making efforts to mitigate the risks or avoid negative effect. The proposed method implemented by the monitoring and simulation software platform was probated and tested to analyze the dynamics of the accumulated statistics on the key parameter of the growth of morbidity. Initial data includes the incidence statistics of Samara region for 18 months starting from March 2020, taken from the open sources online. Timeline balancing method is recommended as a component for data management and visualization in the analytical systems and specialized software for situational management and decision-making support. Keywords: Digital transformation · Social system · Monitoring · Risk management · Time series · Decision-making support
1 Introduction

Modern trends of digital transformation require automated collection and processing of multiple parameters describing a social system's actual state and development. Considering the volume and variety of the input information processed by these processes, it can be characterized as Big Data. Being processed by the algorithms of semantic
and statistical analysis it is reorganized and reduced to several key indicators necessary and sufficient for automated decision-making support. In practical applications this activity is usually covered by the processes of monitoring and simulation. Monitoring provides a number of significant and relevant indicators in real time, while simulation develops a forecast of their dynamics considering possible positive and negative impacts. One of the challenging problems of these processes support by an automated business intelligence system is a necessity to deal with unpredictable event chains, which require the algorithms to be adaptive to the changing pace of developments. To solve this problem in this paper there is proposed a new abstract model and technology for monitoring and simulation of the possible scenarios of a complex system development based on cross-correlation analysis of non-even time series. The proposed approach allows using the simulation results to improve the adequacy and accuracy of the forecast and determine the necessary and sufficient number of considered situations.
2 State of the Art Nowadays almost all the processes of social and economic systems management are subjected to the digital transformation [1, 2]. This trend also addresses public service delivery processes modernization, which implementation in state and regional management requires “digital thinking”. The main approaches to the study of socio-cultural factors in the economy, as well as key modern trends in empirical research (including in connection with the identification of cause-and-effect relationships) are presented in [3]. The use of systems for monitoring and analyzing the situation allows you to effectively collect data on the current socio-economic situation, evaluate them, identify the main trends and patterns of development, find problem areas and correct them as soon as possible, as well as predict their occurrence in a timely manner. An important component of monitoring and simulation is formed by statistical methods and techniques [4–6] that most fully and accurately reflect the specifics of the development over time of indicators that characterize the situation under study. Statistical methods are based on the analysis of Big Data [7, 8], with which the category of statistical regularity is closely related. It helps to identify a natural relationship between the preceding and subsequent states of the system, which is under the influence of external, constantly changing conditions. This research is based on the analysis of non-even time series [9] and continues studying the statistical and semantic trends of the development of digital social and economy systems [10–12]. The information obtained during monitoring can be processed through comparative analysis, which includes comparing the results obtained with the established goals and objectives, analyzing the dynamics of the situation, and conducting benchmarking. A situational analysis can also be applied, with the help of which the strengths and weaknesses of the situation under consideration, as well as its threats and opportunities, are assessed. Continuity is becoming an important feature, which makes it possible to determine development adjustments taking into account the rapidly changing factors of the external and internal environment.
One of the main existing data analysis tools is Business Intelligence (BI) [13, 14]. The process of data processing in such systems consists of the stages of extracting initial data from various sources, transforming them and loading them into the system for subsequent analysis. In recent years, the development of proprietary BI platforms based on the Data Discovery technology has been developed, which makes it possible to discover useful information in the general data set. At the same time, great attention is paid to its most visual presentation through information panels (dashboards) that support personalization depending on the user’s needs. At the moment, there are many BI platforms and data visualization tools that allow you to effectively work with data, thereby helping in making management decisions [15]. Among the existing BI platforms, we can highlight QlikView, Tableau, Klipfolio, Power BI. A brief comparison of platforms is presented in. Their advantages include interactive visualization, the ability to simultaneously use different types of data, geospatial analytics, combining data from various databases and other sources, real-time monitoring of continuous data streams, and convenient user collaboration. Taking into account the latest trends in the development of the state in the field of digitalization, it is necessary to highlight the so-called strategy of a Smart Society [16–18]. It is a socio-economic and cultural strategy for the development of society, based on the use of digital technologies in all spheres of life to improve people’s lives. One of its elements is the presence of situational centers, which analyze the current situation using various monitoring tools: monitoring systems for transport, production lines, construction objects, aerospace monitoring, media monitoring, accounting tools, visualization, and response. Thus, modern situational centers are formed as a complex of information and hardware and software tools intended for the work of managers and expert groups in order to quickly assess the problem situation, promptly construct and simulate various scenarios for the development of the situation based on special methods of processing large amounts of information.
3 Situational Model Modern social system monitoring and control is based on an aggregation and analysis of a number of parameters that describe the current situation and its changes in time. For example, the city is described by population, economy, levels of education, medical care, tourist attraction, etc. The situation describes a complex a relationship that is the result of the interaction of various members. It includes subjects, relations between them and their states, formed as a result of various actions of subjects. Situation parameters are a set of indicators that numerically characterize the current situation, as well as trends in its change over time. Events are the marks in space-time that characterize significant changes in objects and subjects of the situation, as well as factors affecting them.
Scene is a representation of the current or predicted situation, implying the distribution of objects and subjects on the map using a geographic information platform. Scenario is a sequence of scenes containing occurring or predicted events. Using these basic definitions, an abstract model was developed which formalizes the situations of a complex social system. Let us designate the structural elements of the situation in the form $x_i$, where $i = 1..N_x$ is a unique index of the structural element in the network model. For the variant of the situation development described by the scenario option $V_k$ we will introduce the designation of the situation $S_{j,k}$, for which the belonging of the structural elements $x_i$ will be denoted using the logical relation:
$$s_{i,j,k} = s_{i,j,k}(x_i, S_{j,k}) = \{0, 1\}. \quad (1)$$
The function $s_{i,j,k}$ takes the value "1" if the structural element $x_i$ is included in the current description of the situation, and the value "0" if not. As an additional element of the model, it is proposed to introduce a geoinformation layer $L_{x,j,k}$ on the digital map, containing an exhaustive description of a certain aspect of the situation within the framework of the considered situation $S_{j,k}$. The distribution of structural elements by layers is carried out as needed by setting the relations:
$$L_{n,m,i,j,k} = g_{n,m,i,j,k}(x_i, d_m, L_{n,j,k}) = \{0, 1\}. \quad (2)$$
For the structural elements of the situation within the framework of an individual situation, logical connections are set in the form of binary relations of the type $A_p$:
$$A_{p,j,k}(x_{i1}, x_{i2}) = a_{p,j,k}(x_{i1}, x_{i2}, A_p, S_{j,k}) = \{0, 1\}. \quad (3)$$
To describe each situation, the values of the parameters of the situation $K_q$ are set:
$$k_{q,j,k} = k_{q,j,k}(K_q, S_{j,k}) \in \mathbb{R}. \quad (4)$$
It is proposed to describe the important factors that determine the change in the situation and the transition between situations in the form of events $E_r$, i.e., changes in structural elements:
$$e_{r,i,j,k} = e_{r,i,j,k}(E_r, x_i, D_{r,i,j,k}, t_{r,i,j,k}) = \{0, 1\}, \quad (5)$$
where $t_{r,i,j,k}$ is the time of the event occurrence and $H_{r,i,j,k}$ is a semantic descriptor characterizing the event as a set of keywords (tags) with weights:
$$H_{r,i,j,k} = \{(\tau_{r,i,j,k,y}, w_{r,i,j,k,y})\}, \quad (6)$$
where $\tau_{r,i,j,k,y}$ is a tag (keyword) and $w_{r,i,j,k,y}$ is the weight of the tag in the cloud.
Therefore, the situation is represented as follows:
$$S_{j,k} = \left(\{s_{i,j,k}\}, \{a_{p,j,k}\}, \{o_{q,j,k}\}, \{e_{r,i,j,k}\}\right). \quad (7)$$
Monitoring and simulation of the situation is carried out by sequentially expanding the graph (7) by calculating additional parameters {k q,j,k }, identifying new relations {ap,j,k } due to influencing factors and building new options for the development of the situation {si,j,k } as a result of the analysis of data on the events {er,i,j,k }.
4 A Timeline Branching Method

The proposed situational model was used to solve the problem of describing several generated options of the pace of developments in a complex social system. This method is intended for visualization of the predicted scenarios in automated systems for decision-making support. Possible generated options include positive, negative and indifferent scenarios as well as those considered as a possible feedback to specific incoming events. The main idea is to split the current situation into several options and generate a separate scenario for each option's development in time. The method allows evaluating the closeness of the neutral expected scenario to its positive and negative options, which leads to automatically generated recommendations to collect additional input data or to make additional efforts to mitigate the risks or avoid a negative effect. The solution is based on the theory of cross-correlation analysis of non-even time series, which is capable of processing the event chains to determine linear and possible causal relationships. The timeline branching method contains two steps: 1. generating the branches; 2. discretization of the timelines. According to the introduced model, each scenario option $V_k$ is characterized by a number of parameters $k_{q,j,k}$ and contains the sequence of scenes $s_{i,j,k}$ caused by the event queue $e_{r,i,j,k}$. The introduced timeline branching method includes the definition of the moment of time $t_0$ when the basic situation is split into several ongoing scenarios describing the alternative development options. Let us designate them as $V_k(t_0)^+$ for a positive scenario, $V_k(t_0)^-$ for a negative scenario, and $V_k(t_0)$ for a neutral scenario, which basically corresponds to an expected development of realistic events. In order to provide decision-making support, there should be determined a horizon of existence $t'$:
$$t': \; t' - t_0 \rightarrow \max, \quad V_k(t')^+ \neq null; \; V_k(t')^- \neq null; \; V_k(t') \neq null. \quad (8)$$
This means that the specified options $V_k$ are limited by the logical time interval $[t_0, t']$ and contain the corresponding scenes $s_{i,j,k}$ that describe the main situation changes.
Let us introduce a generalized scoring function that describes the current scene and produces a few key indicators describing the situation in total:
$$f(V_k(t)) = F_k(\{s_{i,j,k}\}). \quad (9)$$
Using this function, the closeness of scenarios can be determined. To provide effective decision-making support, there should be generated the sequence of scenes $s_{i,j,k}$ that provides the maximum deviation of the positive and negative scenarios:
$$|f(V_k(t')^+) - f(V_k(t')^-)| \rightarrow \max. \quad (10)$$
Subsequent analysis of the degree of closeness of the positive and negative scenarios to the neutral one is helpful for understanding the sufficiency of the data and efforts available at the point in time $t_0$:
$$f^+ = |f(V_k(t')^+) - f(V_k(t'))|; \quad f^- = |f(V_k(t')^-) - f(V_k(t'))|. \quad (11)$$
In case $f^+ < f^-$ there is a lack of data, and additional information should be collected to reduce the excessive optimism. Otherwise, additional activity should be scheduled to reduce the risks. In such a way, the possible situation development scenarios form an oriented graph, which contains the scenes $s_{i,j,k}$ as vertexes linked by scenarios. Generating the branches can be performed by reconstructing this graph in one of two ways: 1. Taking $V_k(t_0)$ as a basis, then generating $V_k(t')$, developing $V_k(t')^+$ and $V_k(t')^-$, and finally restoring the corresponding $V_k(t_0)^+$ and $V_k(t_0)^-$. 2. Taking $V_k(t_0)$ as a basis, then generating $V_k(t_0)^+$ and $V_k(t_0)^-$, and developing the corresponding $V_k(t')$, $V_k(t')^+$ and $V_k(t')^-$. The next step of the timeline branching method is adaptive discretization. To improve the reliability and adequacy of the developed estimations, it is proposed to introduce intermediate states of each timeline branch in the form of the scenes $s_{i,j,k}$. The general approach is based on uniform sampling, when each branch is split by constant time intervals. This approach is easily implemented, but requires generating multiple scenes, which is time consuming and computationally complex. Within the frame of this method, it is proposed to implement an adaptive discretization based on approximation of the functions $f(V_k(t))$, $f(V_k(t)^+)$ and $f(V_k(t)^-)$. The history of the situation development is taken as the basis for the approximation. Then statistically confirmed patterns of the dependence of key indicators of the situation on time are used as approximating expressions. The standard deviation is used to estimate the approximation error. The most optimistic and pessimistic patterns are used to develop the corresponding branches considering the final states at $t'$. Finally, adaptive discretization of each branch allows generating the necessary and sufficient number of scenes required to describe the dynamical changes. This method allows determining the moments of time of expected changes when there is a lack of information to describe the situation in full. To solve this problem, an additional process of data collection and processing should be scheduled to cover the gaps.
The proposed method is recommended as a component for data management and visualization in the analytical systems and specialized software for situational management and decision-making support.
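A minimal sketch of the branch-closeness rule behind Eqs. (10)–(11) is shown below; it is our toy illustration rather than code from the platform, with invented scores for the three branches at the horizon t'.

```r
# Compare the neutral branch with the optimistic and pessimistic ones
# and return the recommended action described in Sect. 4.
branch_recommendation <- function(f_neutral, f_pos, f_neg) {
  d_pos <- abs(f_pos - f_neutral)   # closeness to the optimistic branch
  d_neg <- abs(f_neg - f_neutral)   # closeness to the pessimistic branch
  if (d_pos < d_neg) "collect additional input data"
  else               "schedule additional risk-mitigation activity"
}
branch_recommendation(f_neutral = 100, f_pos = 110, f_neg = 140)
```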
5 Implementation The proposed method was implemented in a situational center based on a geographic information platform, including the information widgets that describe the dynamics of the current indicators of the development of the socio-economic situation. The goal of the system is to present the entire scope of key indicators of the regional development to decision-makers in a concise and convenient form (see Fig. 1).
Fig. 1. Monitoring and Simulation Software Platform.
The map in the implemented system is organized in the form of a multilayer model, which makes it possible to superimpose several layers containing different objects on top of each other. Thus, it becomes possible to combine and display more information required by the user, which greatly simplifies the spatial analysis of objects. In order to carry out a more accurate analysis, it is necessary to take into account the time factor. With its help, it will be more convenient to single out the cause-and-effect relationship in order to understand how the socio-economic situation in the region has changed, which will make it possible to assess the risk of its deterioration. In this regard, the timeline branching is used to compare and simulate the development of the situation considering several options. Also, the time analysis of the events displayed on the map and the values of socio-economic indicators allows you to calculate indirect indicators, such as the tension of the situation, fluctuations in the information background, etc.
6 Results

The proposed method, implemented in the monitoring and simulation software platform, was piloted and tested to analyze the dynamics of the accumulated statistics on the key parameter of the growth of morbidity, which is critical nowadays. The initial data include the incidence statistics of the Samara region for 18 months starting from March 2020, taken from open sources online (see Fig. 2). This indicator corresponds to the parameter "The number of new infections". The Ministry of Health constantly monitors this indicator in the areas accessible to it. The emergence of a focus of morbidity is possible, which can lead to a sharp increase in the values of the indicator under consideration. The Ministry of Health must promptly respond to such a leap in order to stabilize the situation.
Fig. 2. Accumulated Statistics on the Key Parameter of the Growth of Morbidity.
The functional fragments, which describe the basic patterns of critical dependencies, were approximated by linear splines as presented in Fig. 3. The resulting use of the approximation functions allowed developing the forecasts for positive, negative and expected timeline branches (see Fig. 4). As it can be noticed from the resulting graph, pessimistic estimation requires an early collection and evaluation of additional information. Realistic scenario turns out to be closer to the optimistic one, which inspires reassurance, but identifies the risk of the lack of initial data. This result is implemented as a widget in a monitoring system for decision-making support and is recommended to the regional administration.
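As a hedged illustration of the linear-spline approximation used for the timeline branches, the sketch below fits a degree-one B-spline to a synthetic incidence series; the data, knot placement and names are ours and only mimic the idea behind Fig. 3.

```r
library(splines)
set.seed(1)
t <- 1:540                                           # roughly 18 months of daily counts
y <- 50 + 0.3 * t + 40 * sin(t / 60) + rnorm(540, sd = 10)
fit <- lm(y ~ bs(t, degree = 1, knots = quantile(t, c(0.25, 0.5, 0.75))))
plot(t, y, pch = ".", main = "Linear-spline approximation of the series")
lines(t, fitted(fit), lwd = 2)
```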
Fig. 3. Examples of the Approximated Fragments.
Fig. 4. Generated Timeline Branches with Adaptive Discretization.
7 Conclusion Taking into account the multi-criteria description of the assessment of the socioeconomic situations, the proposed method of timeline branching makes it possible to investigate emerging problems most conveniently for the perception of users. In turn, the use of a situational model in conjunction with statistical analysis of non-even time series helps in predicting situations. Adaptive approximation allows identifying the critical scenes that become important elements in planning, which can then be used to improve the process of decision-making support. The next steps are planned to improve the algorithms of timeline branches automated generation based on prediction using the artificial neural networks. The domains of the proposed timeline branching method possible application include the various areas of government and corporate governance considering the modern concept of smart society digital transformation.
References 1. Digital Russia: New Reality. Digital McKinsey, 133 p (2017). https://www.mckinsey.com/ru/ our-work/mckinsey-digital 2. Patel, K., McCarthy, M.P.: Digital Transformation: The Essentials of E-Business leadership,134 p. KPMG/McGraw-Hill (2000) 3. Auzan, A.A., et al.: Sociocultural factors in economics: milestones and perspectives. Vopr. Ekon. 7, 75–91 (2020) 4. Filz, M.A., Herrmann, C., Thiede, S.: Simulation-based data analysis to support the planning of flexible manufacturing. In: 18 ASIM Fachtagung Simulation in Produktion und Logistik Conference, pp. 413–422 (2019) 5. Grami, A.: Analysis and Processing of Random Processes (2019). https://doi.org/10.1002/ 9781119300847.ch12 6. Kobayashi, H., Mark, B., Turin, W.: Probability, Random Processes and Statistical Analysis, 812 p. Cambridge University Press (2012) 7. Bessis, N., Dobre, C.: Big Data and Internet of Things: a roadmap for smart environments, Studies in Computational Intelligence, 450 p. (2014) 8. Ma, S., Huai, J.: Approximate computation for big data analytics. ACM SIGWEB Newsletter, pp. 1–8 (2021) 9. Prokhorov, S.: Applied analysis of random processes, Samara scientific center of RAS, 582 p. (2007) 10. Ivaschenko, A., Lednev, A., Diyazitdinova, A., Sitnikov, P.: Agent-based outsourcing solution for agency service management. In: Bi, Y., Kapoor, S., Bhatia, R. (eds.) IntelliSys. LNNS, vol. 16, pp. 204–215. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-56991-8_16 11. Ivaschenko, A., Stolbova, A., Golovnin, O.: Data market implementation to match retail customer buying versus social media activity. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) SAI 2020. AISC, vol. 1228, pp. 363–372. Springer, Cham (2020). https://doi.org/10.1007/978-3030-52249-0_26 12. Sitnikov, P., Dodonova, E., Dokov, E., Ivaschenko, A., Efanov, I.: Digital transformation of public service delivery processes in a smart city. Lect. Notes Networks Syst. 296, 332–343 (2021) 13. Trujillo, J., Maté, A.: Business intelligence 2.0: a general overview. In: Aufaure, M.-A., Zimányi, E. (eds.) eBISS. LNBIP, vol. 96, pp. 98–116. Springer, Heidelberg (2012). https:// doi.org/10.1007/978-3-642-27358-2_5
14. Rouhani, S., Asgari, S., Mirhosseini, V.: Review study: business intelligence concepts and approaches. Am. J. Sci. Res. 50, 62–75 (2012) 15. Srivastava, G., Muneeswari, S., Venkataraman, R., Kavitha, V., Parthiban, N.: A review of the state of the art in business intelligence software. Enterprise Information Systems, pp. 1–28 (2021) 16. Cifaldi, G., Serban, I.: Between a smart city and smart society. Intelligent Human Systems Integration, pp. 714–719 (2018) 17. Iannone, R.: Smart society. Smart Society, pp.1–12 (2019) 18. Arias-Oliva, M., Borondo, J., Murata, K., Lara, A.: Societal Challenges in the Smart Society, 635 p. Universidad de La Rioja (2020)
Webometric Network Analysis of Cybersecurity Cooperation Emmanouil Koulas1,2(B)
, Syed Iftikhar Hussain Shah2 , and Vassilios Peristeras2,3
1 University College London, London, UK {emmanouil.koulas,emmanouil.koulas.20}@ucl.ac.uk 2 International Hellenic University, Thessaloniki, Greece {i.shah,v.peristeras}@ihu.edu.gr 3 Council of the European Union, Brussels, Belgium
Abstract. Cyberspace is slowly but surely evolving into the battlefield of the future. The rise of computational power allows not only for state-sponsored hacking groups, but for lone hackers or organizations to wreak havoc on infrastructure, either for financial or political gain. Cooperation between states, on multiple levels, is required to defend their cyber sovereignty effectively. In this paper, we will identify Cyber Emergency Response Team (CERT) and Cyber Security Incident Response Team (CSIRT) which are under national or governmental supervision, and we will draw comparisons between the European Union and NATO. Furthermore, we will examine whether the cooperation between European Union Member States on the aforementioned technical level is reflected on the World Wide Web. Inter-linkage and co-mention analyses have been conducted to map the organizations’ footprint on Cyberspace, using Webometric Analyst. Keywords: Cybersecurity · CSIRT · CERT · Webometrics · Network analysis · European Union · NATO
1 Introduction The 21st century is characterized by the spread of internet and an overall change in the processes and functions of all institutions that occur due to the increase of the adoption of the Internet of Things (IoT) and the digitization of data and services [1]. IoT is characterized by Brass and Sowell [2] as a “disruptive technology”, in the sense that it challenges the existing systems. At the same time, the use of digital technologies can often be risky as the institutions that digitize their archives and databases can become more vulnerable and need to be more adaptable, flexible and innovative. The existing regulations of the European Union (EU) cover a wide range of topics, activities and potential threats. For instance, the main areas that are covered in the legislation regarding the cyber-security are [3] the economic sector and economic challenges (i.e. the digital economy, the digital institutions), data transfer and data security, social issues (i.e. the gender gap, digital literacy, etc.), capacity building and knowledge sharing, and nuclear and military security. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 506, pp. 103–122, 2022. https://doi.org/10.1007/978-3-031-10461-9_7
This paper approaches issues related to digital cooperation in the European Union and NATO. In detail, the aim of the paper is to conduct a webometric network analysis (WNA) in order to assess the networking patterns among the Cyber Security Incident Response Teams (CSIRTs)/Cyber Emergency Response Teams (CERTs) and the degree to which the cooperation between the partners is reflected in those patterns at an EU level. It is important to note that CSIRTs and CERTs inherently address the same threats, so the names are used interchangeably.

Webometrics is a relatively new field that has emerged from the areas of information technologies and telecommunications, using mathematical models as well as qualitative research methods. In detail, Webometrics is based on the idea of using the Web to collect data and information and to discover "patterns", as well as to study the content of the websites available on the internet, the most common users' preferences and/or the link structures and associations [4]. Different software tools can be used to discover the networking patterns among the Member States (MS) of the EU and NATO, their partners and their institutions; in this project we argue that Webometric Analyst can reveal such patterns. We opted for the EU and NATO because they constitute a large political organization and a large military organization, respectively, and thus we anticipate that the results they yield might apply to other organizations with similar aims. The research questions are therefore formulated as follows:

RQ1. What are the key cybersecurity actors, including state CSIRTs, and what do their official websites reveal about their efforts to safeguard the community?

RQ2. What are the main cybersecurity actors' networking patterns, and to what extent are they exposed at the national, regional, and international levels in cyberspace?

RQ3. What are the analytical techniques and concepts of the technological tool, and how can such tools be utilized to measure the main cybersecurity actors' intercommunication strength and their web presence?
2 Cyberspace Cooperation and Partnership

Since its founding, the European Union has been an organization that places particular emphasis on international cooperation, security and stability. Also, the EU actions in the field of cybersecurity, and security in general, have been mostly reactive. This means that, in the 1990s and early 2000s, the EU mostly focused on responding to the needs of its member states and their citizens and not on creating resilient institutions that are independent and flexible [5]. In recent years we have witnessed EU efforts to address modern challenges, such as misinformation and cybersecurity [6].

Markopoulou et al. [7] discuss the contemporary European policy on tackling cyber-attacks and securing the EU cyberspace. In detail, their paper starts with an assessment of the impact of Directive 2016/1148 (NIS Directive), which is regarded as the first systematic attempt of the Union to secure its network systems and to provide horizontal legislation to address those challenges. According to Markopoulou et al. [7], the NIS Directive addresses the issues related to cyber-terrorism and cyber-crime by strengthening the information systems across the EU and identifying the main changes that must be implemented in order to improve the performance of service providers at an EU level. Also, along with ENISA, its purpose is to assist the MS in their efforts
to minimize cyber-threats and also to offer the necessary tools for the EU to be able to cooperate with multi-national parties and third countries, as well as other organizations. Moreover, as Koulas [8] notes, EU-NATO cooperation is mostly fruitful, as the majority of the EU MS are also part of the alliance. The common efforts of the two organizations since 1999 have led to a series of common actions that are not limited to training but are of a systematic nature. According to Koulas [8], the EU-NATO common efforts to reduce vulnerability aim at (a) integrating their common efforts to raise awareness among the authorities as well as the public, (b) providing training to officials and members of the armed forces so that they become better equipped to secure European databases and communication systems, and (c) planning common exercises and activities to forward their mutual interests. The most effective tools used are the Memorandum of Understanding between NATO and the EU on Cyber Defense, the NATO Industry Cyber Partnership and the multinational Computer Emergency Response Teams. Those actions also impact the private sector, as they are aimed at securing cyberspace in general and not solely at providing guidance to the authorities.

According to Renard [9], the EU and the United States of America (USA) aim at reducing external threats, particularly from Asia (China, Iran, North Korea and Russia). Hence, the USA needs to cooperate with the European Union in order to be capable of monitoring, locating and prosecuting attackers and to ally with the official institutions, particularly in China and Russia, to tackle cybercrime. Renard [9] attributes the emphasis placed by the EU institutions on cyber-defense to two factors: (a) the EU's main ambition is to become a global power, and (b) as the international presence of the EU increases, so do the risks faced by the Union. Also, because the EU is a global diplomatic power, this cooperation can enable the signing of bilateral and multilateral agreements with smaller states that do not have a clear national cyber-strategy. As a result, apart from the fact that the partners can strengthen their capacity and capabilities, the goal of ensuring stability in the international system becomes attainable.

Additionally, it is important to highlight that the EU has created the CSIRTs, technical groups of experts that, according to Tanczer et al. [10], are equipped to maintain the integrity of the technical infrastructure of the internet and play a vital role in ensuring the success of the diplomatic approaches and missions of the European Union and its partners. Indeed, the development of international as well as regional networks, the enhancement of the capabilities of the response groups as well as the Task Force CSIRT, and the coordination of the European networks can lead to (a) the reduction of bureaucratic obstacles, (b) the reduction of transnational threats, and (c) the improvement of the diplomatic missions of the EU [10]. The role of ENISA appears to be of utmost importance in those efforts, along with the European Cybersecurity Research and Competence Centre [11]. However, although international law has been adapted to address cybersecurity issues, the new digital era also comes hand in hand with security challenges that cannot always be covered by the legal provisions included in international treaties and/or conventions [12, 13].
According to Chainoglou [14], the main risk factor is that, because cyberspace is limitless and can be accessed remotely, it is not rare for an aggressor to "attack" it without being traceable. The offenders/attackers may therefore be located outside the territory of the state where a specific organization and/or institution is situated, making it even more challenging for the authorities to either
trace and prosecute them, or even to punish them. Also, in many cases a cyber-attack can be complex, meaning that it is not always possible for the authorities to press charges against the perpetrators [8]. Furthermore, one could argue that in cybersecurity, as in terrorism and in nuclear or biological warfare, states need to adapt their perspective of sovereignty in order to effectively address modern threats [15].
3 Methods

3.1 Data Collection

The first step of our study was to create the seed site list. In order to do that, we used the European Union Agency for Cybersecurity (ENISA) interactive map for identifying CSIRTs by country. We only chose Government and National CSIRTs of Member States, thus ending up with a list of 38 different CSIRTs and their websites. The selection criteria for the CSIRTs that would constitute our seed sites were:

• be a MS' CSIRT,
• have an Accredited Trusted Introducer status, and
• have a National or Governmental constituency.

In addition to the MS' CSIRTs, we added ENISA's website as well as the EU CERT's website to our seed site list. We proceeded with these additions because we estimated that, due to their standing as European Institutions, they would play a central role in the network, and we wanted to compare the networks with and without the European Institutions. For NATO, we utilized ENISA's interactive map to compile a large portion of the NATO Member States' CSIRTs; the rest were found through Google Search, and we ended up with 42 different CSIRTs and their websites. Finally, we opted to include the NATO Communications and Information Agency (NCIA) in our list.

3.2 The Process

The above websites were analyzed using Webometric Analyst 2.0 (http://lexiurl.wlv.ac.uk) and the Bing Application Program Interface, which is capable of carrying out advanced Boolean searches and tracking external websites linked to the URL citations under study. Thus, the lists of external sites corresponding to the base query, i.e., the websites mentioned above, were obtained. The methods of analysis are summarized in the following subsection.

3.3 Techniques and Concepts

In order to assess our data, we have formulated research question three. The necessary techniques and concepts that we need in order to draw conclusions are interlinking, co-mentions, and the degree and betweenness centralities.

Interlinking is the directional, direct connection of two pages by a link. The page that has the link to the other page has an outlink, while the page that is the recipient of the link
has an inlink [16]. For the purposes of the study we have created a directional diagram that shows the patterns of connectivity between the seed sites [6, 17].

Co-mention is when a third website links two websites at the same time, with the additional limitation of having the links to both websites on the same page [16]. For the purposes of the study we have created a co-mention diagram. It is important to note that the co-mention diagram is not directional [6, 17]. Furthermore, in both network diagrams, the thickness of the lines connecting the nodes in the network is proportional to the number of links.

Degree centrality, whether indegree (number of inbound links) or outdegree (number of outbound links), is a node importance score which is proportional to the total number of links that the node has. This shows how many connections any particular node has to the rest of the network and reveals the most popular nodes of the network [18-20].

Betweenness centrality shows the number of times any particular node is found on the shortest path between different nodes of the network. Nodes with high betweenness centrality act as connectors in the network, thus having the capability to influence the flow of information through the network [18, 19]. This can be achieved through a variety of actions like enhancing, blocking, or altering the communication between the nodes [19, 21, 22].
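To make these measures concrete, the short sketch below builds a toy directed link graph and computes the indegree, outdegree and betweenness centralities defined above. It is an illustrative example written with the Python networkx library rather than part of the Webometric Analyst workflow; the edges reuse a few of the link counts reported later in the paper, mapped to the corresponding team websites.

```python
# Illustrative sketch (not the authors' tooling): computing the centrality
# measures discussed above on a small directed interlink network.
import networkx as nx

# A few interlink pairs reported later in the paper: (source, target, linking URLs).
links = [
    ("cert.at", "energy-cert.at", 274),      # CERT.at links AEC
    ("ccn-cert.cni.es", "cert.se", 196),     # CCN-CERT links CERT-SE
    ("enisa.europa.eu", "cert.at", 24),      # ENISA links CERT.at
    ("enisa.europa.eu", "ncsc.nl", 39),      # ENISA links NCSC-NL
]

G = nx.DiGraph()
for source, target, count in links:
    G.add_edge(source, target, weight=count)  # weight drives line thickness

# Indegree/outdegree: number of inbound/outbound connections of each node.
indegree = dict(G.in_degree())
outdegree = dict(G.out_degree())

# Betweenness: how often a node lies on shortest paths between other nodes.
betweenness = nx.betweenness_centrality(G, normalized=False)

for node in G.nodes:
    print(node, indegree[node], outdegree[node], betweenness[node])
```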
4 Results

Our results are divided into two sections. The first one includes the results of the EU Member States' CSIRTs networking patterns, both including and excluding the EU Institutions. The second section includes the results of the NATO Member States' CSIRTs.

4.1 EU Networking Pattern Results

Table 1 answers our first research question, identifying the key national cybersecurity actors in the EU. Figures 1 and 2, as well as Tables 2 and 3, answer our second research question, regarding the network patterns of these actors.

Table 1. European Union seed sites

No. | Country | Team name | Constituency | Website
1 | Austria | AEC | CIIP | energy-cert.at
2 | Austria | CERT.at | National | cert.at
3 | Austria | GovCERT Austria | Government | govcert.gv.at
4 | Belgium | CERT.be | CIIP, Government, National | cert.be
5 | Bulgaria | CERT Bulgaria | Government | govcert.bg
6 | Croatia | CERT ZSIS | Government | zsis.hr
7 | Croatia | CERT.hr | NREN, National | cert.hr
8 | Cyprus | CSIRT-CY | CIIP, Government, National | csirt.cy
9 | Czech Republic | CSIRT.CZ | Government, National, Private and Public Sectors | csirt.cz
10 | Czech Republic | GovCERT.CZ | Government, Private and Public Sectors | govcert.cz
11 | Denmark | CFCS | Government, National | govcert.dk
12 | Estonia | CERT-EE | CIIP, Financial, Government, National | cert.ee
13 | European Union | CERT-EU | EU Institutions | cert.europa.eu
14 | European Union | ENISA | EU Institutions | enisa.europa.eu
15 | Finland | NCSC-FI | CIIP, Government, National | kyberturvallisuuskeskus.fi
16 | France | CERT-FR | CIIP, Government | cert.ssi.gouv.fr
17 | Germany | CERT-Bund | CIIP, Government, National | bsi.bund.de/DE/Home/home_node.html
18 | Greece | GR-CSIRT | Government | csirt.cd.mil.gr
19 | Hungary | GovCERT-Hungary | Government | cert-hungary.hu
20 | Ireland | CSIRT-IE | Government, National | ncsc.gov.ie/CSIRT
21 | Italy | IT-CERT | Government, National | certnazionale.it
22 | Latvia | CERT.LV | Government, National | cert.lv
23 | Lithuania | CERT-LT | National | nksc.lt
24 | Luxembourg | CIRCL | National | circl.lu
25 | Luxembourg | GOVCERT.LU | CIIP, Government, Law Enforcement, Military, National | govcert.lu
26 | Luxembourg | NCERT.LU | National | govcert.lu/en/ncert.html
27 | Malta | CSIRTMalta | National | matacip.gov.mt/en/CIP_Structre/Pages/CSIRTMalta.aspx
28 | Netherlands | CSIRT-DSP | National | csirtdsp.nl
29 | Netherlands | NCSC-NL | National | ncsc.nl
30 | Poland | CERT POLSKA | NREN, National | cert.pl
31 | Poland | CSIRT-GOV | Government, National | cert.gov.pl
32 | Poland | CSIRT-MON | Military | csirt-mon.wp.mil.pl/pl/index.html
33 | Portugal | CERT.PT | Government, National | cncs.gov.pt
34 | Romania | CERT-RO | National | cert-ro.eu
35 | Slovakia | CSIRT.SK | Government | csirt.gov.sk
36 | Slovakia | SK-CERT | CIIP, National | sk-cert.sk
37 | Slovenia | SI-CERT | NREN, National | cert.si
38 | Spain | CCN-CERT | Government | ccn-cert.cni.es
39 | Spain | INCIBE-CERT | CIIP, National | incibe-cert.es
40 | Sweden | CERT-SE | Government, National | cert.se
From Table 1 we can observe that Austria, Luxembourg and Poland each have 3 CSIRTs that fulfil our selection criteria; Croatia, the Czech Republic, the European Union, the Netherlands, Slovakia and Spain each have 2 CSIRTs; while all the other members of the European Union have 1.

Table 2 shows the CSIRTs that have the highest indegree and outdegree centralities. As expected, a European Institution, ENISA, has the highest centrality in both categories. Apart from ENISA, only CERT-SE and CERT.at appear at the top of both centralities. An important observation is that CERT Bulgaria, CERT-Bund, GR-CSIRT, CSIRT-IE, IT-CERT, NCERT.LU, CSIRTMalta, CSIRT-DSP, and CSIRT-MON have no inlinks and no outlinks.

Table 3 shows the most influential nodes of the network. Those are the websites that can control the flow of information within the network. As expected, a European Institution, ENISA, is by far the most influential node of our network. The Austrian CERT.at is the most influential site belonging to a member state.
Table 2. European Union seed sites with the highest indegree and outdegree centralities

Team name | Indegree centrality | Team name | Outdegree centrality
ENISA | 18 | ENISA | 26
CERT-SE | 16 | CERT.at | 16
CERT.at | 14 | INCIBE-CERT | 12
AEC | 10 | CCN-CERT | 11
NCSC-NL | 8 | GovCERT.CZ | 6
CERT-EU | 6 | CERT-SE | 5
CIRCL | 5 | CERT.LV | 5
CERT POLSKA | 5 | CERT.hr | 5
CSIRT.CZ | 5 | NCSC-FI | 5
CERT-EE | 5 | SK-CERT | 5
Table 3. European Union seed sites with the highest betweenness centrality (>10)

No. | Team name | Betweenness centrality
1 | ENISA | 521.2
2 | CERT.at | 135.283
3 | CERT-SE | 33.783
4 | CERT POLSKA | 29.25
5 | NCSC-FI | 29.167
6 | NCSC-NL | 28.833
7 | CERT.hr | 28.5
8 | GovCERT.CZ | 27
9 | CERT-FR | 18.333
In contrast to the CSIRTs shown in Table 3, there are 21 CSIRTs with 0 betweenness centrality, namely AEC, CERT-EE, GovCERT Austria, CERT ZSIS, GOVCERT.LU, CSIRT-GOV, SK-CERT, CFCS, GovCERT-Hungary, CERT-RO, INCIBE-CERT, CERT-LT, CERT Bulgaria, CERT-Bund, GR-CSIRT, CSIRT-IE, IT-CERT, NCERT.LU, CSIRTMalta, CSIRT-DSP and CSIRT-MON. This means that those nodes exert no control over the flow of information within the network.
Fig. 1. EU CSIRT network interlinks.
Figure 1 is the network diagram that depicts the interlinking of the nodes. As mentioned before, the lines connecting the nodes are directional, hence the arrows, accounting for inlinks and outlinks. Red nodes are the ones that have at least one connection, whether inbound or outbound, whereas green nodes indicate that there is no connection with other nodes in the network. Figure 1 shows graphically what we have mentioned earlier, that CERT Bulgaria, CERT-Bund, GR-CSIRT, CSIRT-IE, IT-CERT, NCERT.LU, CSIRTMalta, CSIRT-DSP, and CSIRT-MON have no inlinks and outlinks.

There are 168 distinct pairs. The top 10, based on the number of different URLs linking them, are: CERT.at links AEC 274 times; CCN-CERT links CERT-SE 196 times; INCIBE-CERT links CERT-SE 167 times; GovCERT.CZ links CSIRT.CZ 68 times; CCN-CERT links ENISA 54 times; ENISA links NCSC-NL 39 times; ENISA links CERT.LV 36 times; CERT POLSKA links CSIRT-GOV 34 times; ENISA links CERT.at 24 times; and ENISA links CERT.be 22 times. 30 out of the 168 pairs have 10 or more different URLs linking them, whereas 56 pairs have 1 URL and 30 pairs have 2 URLs, while 51 pairs have between 3 and 9 URLs connecting them. ENISA is the most common site in the pairs, appearing 53 times in the list.
Fig. 2. EU CSIRT network co-mentions.
Figure 2 is the network diagram that depicts the co-mentions. This data comes from search engines [16] and shows how many pages link two of the seed sites at the same time. The lines that connect the nodes are non-directional, since both websites in any given pair are mentioned by a third page. The thickness of the line is proportional to the number of co-mentions any given pair has. As shown, every node in the network is connected, which means that any given seed site is co-mentioned at least one time, with at least one other seed site, by a third page. This is due to the very central role that ENISA's website plays in the network, which is especially highlighted by the fact that the vast majority of the pairs with a single co-mention are mentioned by ENISA, which hosts the CSIRT inventory.

There is a total of 657 co-mentioned pairs. 269 pairs have less than 10 co-mentions, 222 pairs have between 10 and 99 co-mentions, and 165 pairs have over 100 co-mentions. The top 10, based on the number of different URLs mentioning them, are: CSIRT-GOV and CERT-SE are co-mentioned 970 times; CSIRT-GOV and SI-CERT are co-mentioned 967 times; CERT.hr and CSIRT-GOV are co-mentioned 966 times; CERT.PT and SK-CERT are co-mentioned 964 times; CERT.LV and CSIRT-GOV are co-mentioned 962 times; CERT.PT and SI-CERT are co-mentioned 962 times; CSIRT-DSP and CSIRT-GOV are co-mentioned 961 times; CERT.at and CSIRT-GOV are co-mentioned 960 times; NCSC-NL and CSIRT-GOV are co-mentioned 959 times; and CERT.hr and CERT.PT are co-mentioned 957 times.

From the list above we observe that the Polish CSIRT-GOV is in 7 out of the 10 most co-mentioned pairs. ENISA's highest co-mention pair is with CERT-Bund, with the pair being co-mentioned 728 times. When ENISA and CERT-EU are removed from the seed site list, the yielded results change significantly. For this purpose our new seed site list is identical to Table 1, with the exception of rows 13 and 14, which are completely removed.
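As a rough illustration of how such co-mention counts can be organized, the sketch below (again a hypothetical example in Python with networkx, reusing a few of the pair counts reported above plus one invented low-count pair for contrast) builds the undirected weighted graph underlying a diagram like Fig. 2 and buckets the pairs by co-mention strength; it is not the actual Webometric Analyst output.

```python
# Illustrative sketch: an undirected, weighted co-mention network and a
# simple bucketing of pair strengths, mirroring how the text summarizes them.
import networkx as nx

co_mentions = {
    ("CSIRT-GOV", "CERT-SE"): 970,
    ("CSIRT-GOV", "SI-CERT"): 967,
    ("CERT.hr", "CSIRT-GOV"): 966,
    ("ENISA", "CERT-Bund"): 728,
    ("ENISA", "CIRCL"): 8,        # hypothetical low-count pair for contrast
}

H = nx.Graph()
for (a, b), count in co_mentions.items():
    H.add_edge(a, b, weight=count)   # edge weight drives line thickness in Fig. 2

# Bucket pairs as in the text: fewer than 10, 10-99, and 100 or more co-mentions.
buckets = {"<10": 0, "10-99": 0, ">=100": 0}
for _, _, weight in H.edges(data="weight"):
    if weight < 10:
        buckets["<10"] += 1
    elif weight < 100:
        buckets["10-99"] += 1
    else:
        buckets[">=100"] += 1

print(buckets)
# Strongest pairs first, analogous to the "top 10" listing in the text.
print(sorted(H.edges(data="weight"), key=lambda e: e[2], reverse=True)[:3])
```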
Fig. 3. EU CSIRT network interlinks without the European institutions.
Figure 3 is the network diagram that depicts the interlinking of the nodes. Figure 3, compared to Fig. 1, shows a very weakly interconnected network. Only 4 nodes have inlinks, and only 5 have outlinks. This means that 34 out of the 38 seed sites have no indegree centrality, and 33 out of 38 have no outdegree centrality. The betweenness centrality is 0 for the entire seed site list. Out of the 124 pairs, 110 have 0 interlinks. Table 4 shows the most influential nodes of the network. It is clearly depicted that the network, in the absence of the European Institutions, is very weak. All the nodes of the network have zero betweenness centrality.

Table 4. European Union seed sites with the highest indegree and outdegree centralities without the European institutions
Team name | Indegree centrality | Team name | Outdegree centrality
CERT.at | 10 | | 10
SI-CERT | 2 | GovCERT Austria | 1
CSIRT-GOV | 1 | GOVCERT.LU | 1
NCERT.LU | 1 | CSIRT.SK | 1
GovCERT.CZ | 1 | SK-CERT | 1
| | GovCERT.CZ |
As shown in Fig. 4, every node in the network is connected, which means that any given seed site is co-mentioned at least one time, with at least one other seed site, by a third page. There is a total of 703 co-mentioned pairs. 120 pairs have 0 co-mentions, 246 pairs have less than 10 co-mentions, 180 pairs have between 10 and 99 co-mentions, and 157 pairs have over 100 co-mentions. The top 10, based on the number of different URLs mentioning them, are: CERT.hr and CSIRT-GOV are co-mentioned 973 times; CSIRT-GOV and SI-CERT are co-mentioned 973 times; CSIRT-GOV and CERT-SE are co-mentioned 973 times; CERT.hr and CERT.PT are co-mentioned 969 times; CERT.at and CSIRT-GOV are co-mentioned 968 times; CIRCL and CERT.PT are co-mentioned 965 times; CSIRT-DSP and CSIRT-GOV are co-mentioned 963 times; CERT.LV and CSIRT-GOV are co-mentioned 962 times; CERT-LT and CSIRT-GOV are co-mentioned 960 times; and CERT.PT and SI-CERT are co-mentioned 960 times.
Fig. 4. EU CSIRT network co-mentions without the European institutions.
4.2 NATO

Table 5 answers our first research question, identifying the key national cybersecurity actors in NATO.
Table 5. NATO seed sites

No. | Country | Team name | Constituency | Website
1 | Albania | AKCESK/NAECCS | Government, Law Enforcement, National, Non-Commercial Organisation | http://cesk.gov
2 | Belgium | CERT.be | CIIP, Government, National | cert.be
3 | Bulgaria | CERT Bulgaria | Government | govcert.bg
4 | Canada | CCCS | Government, National | cyber.gc.ca
5 | Croatia | CERT ZSIS | Government | zsis.hr
6 | Croatia | CERT.hr | NREN, National | cert.hr
7 | Czech Republic | CSIRT.CZ | Government, National, Private and Public Sectors | csirt.cz
8 | Czech Republic | GovCERT.CZ | Government, Private and Public Sectors | govcert.cz
9 | Denmark | CFCS | Government, National | govcert.dk
10 | Estonia | CERT-EE | CIIP, Financial, Government, National | cert.ee
11 | France | CERT-FR | CIIP, Government | cert.ssi.gouv.fr
12 | Germany | CERT-Bund | CIIP, Government, National | bsi.bund.de/DE/Home/home_node.html
13 | Greece | GR-CSIRT | Government | csirt.cd.mil.gr
14 | Hungary | GovCERT-Hungary | Government | cert-hungary.hu
15 | Ireland | CSIRT-IE | Government, National | ncsc.gov.ie/CSIRT
16 | Italy | IT-CERT | Government, National | certnazionale.it
17 | Latvia | CERT.LV | Government, National | cert.lv
18 | Lithuania | CERT-LT | National | nksc.lt
19 | Luxembourg | CIRCL | National | circl.lu
20 | Luxembourg | GOVCERT.LU | CIIP, Government, Law Enforcement, Military, National | govcert.lu
21 | Luxembourg | NCERT.LU | National | govcert.lu/en/ncert.html
22 | Montenegro | CIRT.ME | National | cirt.me
23 | NATO | NCIA | NATO Body | ncia.nato.int
24 | Netherlands | CSIRT-DSP | National | csirtdsp.nl
25 | Netherlands | NCSC-NL | National | ncsc.nl
26 | North Macedonia | MKD-CIRT | National | mkd-cirt.mk
27 | Norway | EkomCERT | Government | eng.nkom.no
28 | Norway | HelseCERT | Government | nhn.no/helsecert
29 | Norway | NorCERT | Government | cert.no
30 | Poland | CERT POLSKA | NREN, National | cert.pl
31 | Poland | CSIRT-GOV | Government, National | cert.gov.pl
32 | Poland | CSIRT-MON | Military | csirt-mon.wp.mil.pl/pl/index.html
33 | Portugal | CERT.PT | Government, National | cncs.gov.pt
34 | Romania | CERT-RO | National | cert-ro.eu
35 | Slovakia | CSIRT.SK | Government | csirt.gov.sk
36 | Slovakia | SK-CERT | CIIP, National | sk-cert.sk
37 | Slovenia | SI-CERT | NREN, National | cert.si
38 | Spain | CCN-CERT | Government | ccn-cert.cni.es
39 | Spain | INCIBE-CERT | CIIP, National | incibe-cert.es
40 | Turkey | TR-CERT | Government | usom.gov.tr
41 | United Kingdom | MODCERT | Government | mod.uk/cert
42 | United Kingdom | NCSC (UK) | Government, National | cert.gov.uk
43 | United States | US-CERT | Government, National | us-cert.gov
Table 6 shows the most influential nodes of the network. It clearly depicts that the network of NATO member states' CSIRTs is very weak. Even NATO's own institution does not play an important part in the flow of information, a conclusion we can derive from Fig. 5 as well. The single node having a betweenness centrality metric is US-CERT (centrality = 4), which means that even though this node is the most influential of the network, its influence is very weak. Only 3 nodes have inlinks, and only 8 have outlinks. This means that 40 out of the 43 seed sites have no indegree centrality, and 35 out of 43 have no outdegree centrality. The betweenness centrality is 0 for the rest of the seed site list.
Fig. 5. NATO CSIRT network interlinks.
Table 6. NATO seed sites with the highest indegree and outdegree centralities

Team name | Indegree centrality | Team name | Outdegree centrality
CERT-IS | 16 | US-CERT | 10
US-CERT | 7 | CERT.hr | 5
CSIRT.CZ | 1 | CIRCL | 3
| | CCN-CERT | 5
| | GovCERT.CZ | 1
| | CERT-FR | 1
| | SK-CERT | 1
| | INCIBE-CERT | 1
There are 113 distinct pairs. The top 10, based on the number of different URLs linking them, are: US-CERT links SK-CERT 50 times; CERT-IS links US-CERT 50 times; US-CERT links circl.lu 48 times; CSIRT.CZ links GovCERT.CZ 47 times; US-CERT links CERT-FR 47 times; CERT-IS links CERT.hr 39 times; US-CERT links INCIBE-CERT 39 times; CSIRT-GOV links CERT POLSKA 33 times; CERT-IS links CCN-CERT 30 times; and US-CERT links CCN-CERT 28 times. 16 out of the 113 pairs have 10 or more different URLs linking them, whereas 45 pairs have 1 URL and 19 pairs have 2 URLs, while 33 pairs have between 3 and 9 URLs connecting them.
Fig. 6. NATO CSIRT network co-mentions.
There is a total of 765 co-mentioned pairs, as shown in Fig. 6. 333 pairs have less than 10 co-mentions, 173 pairs have between 10 and 99 co-mentions, 257 pairs have over 100 co-mentions, and 1 pair has over 1000 co-mentions. The top 10, based on the number of different URLs mentioning them, are: SI-CERT and NCSC (UK) are co-mentioned 1003 times; CERT-EE and US-CERT are co-mentioned 999 times; CERT.LV and NCSC (UK) are co-mentioned 989 times; CIRT.ME and NCSC (UK) are co-mentioned 987 times; NorCERT and CSIRT-GOV are co-mentioned 986 times; CERT-EE and NCSC (UK) are co-mentioned 984 times; MODCERT and NCSC (UK) are co-mentioned 982 times; CERT.be and NCSC (UK) are co-mentioned 980 times; NorCERT and NCSC (UK) are co-mentioned 980 times; and NorCERT and US-CERT are co-mentioned 980 times.
5 Discussion

There are two types of observations we can make from this study so far. On the one hand, we reach some conclusions based on the conducted literature review, and on the other hand we can draw conclusions from our webometric analysis.

From the relevant literature, it becomes apparent that: the EU places great emphasis on strengthening its networks and protecting cyberspace [1, 5, 9]; ENISA plays a key role in ensuring the stability and reliability of the European cyberspace [11]; the aim of the EU institutions is not only to protect the Union's networks but also to make the European Union a global power [5, 23]; networking is the ideal solution for the EU, although cooperation can be challenging for all partners [2, 10, 11, 13]; and the more coordinated the network, the more secure it becomes and the lower the risk of serious threats facing the EU in its cyberspace [5, 13, 14]. Having analyzed the results of the research as well as the literature review, at this point a comparison can be made between the findings of the experiment and the available academic literature.

Having considered the above, the need to reduce the vulnerability of public and private institutions becomes a priority for most states across the world. Indeed, the European Union and its member states (MS) are constantly investing in creating systems, technical guidelines and tools in order to enhance cooperation between the MS and third countries, as well as the North Atlantic Treaty Organization (NATO) and other organizations at a regional and global level [11].

In addition, studying the case of China and Asia in general, Li et al. [24] explain how the use of webometrics can allow authorities to pursue a more effective policy regarding cybersecurity and information technology. Specifically, in their article they examine 1,931 distinct cases of e-policies in China and conclude that China invests mainly in e-business and less in the protection of personal data, even though the internet is controlled by government agencies. This trend is also reflected in Chinese jurisprudence in relation to cybersecurity, which focuses on industry and commerce. This is not the case in the EU. On the other hand, Pupillo [11] argues that the European Union is making considerable efforts to improve its information infrastructure. In detail, the author mentions that the EU is currently taking part in a global effort to govern cyberspace and to become more efficient by leading, and not simply taking part in, the cybersecurity initiatives and programs. At the same time, Pupillo [11] considers the EU efforts to be rather fragmented, and notes that the voluntary nature of cooperation among the MS is not sufficient for the EU to be able to secure its borders as well as its networks. Respectively, Giudice et al. [25] study the example of the North African states to examine whether the policies of these states are deemed appropriate and commensurate with the available technological possibilities and the existing academic and scientific knowledge. This article concludes that a Social Network Analysis (SNA) can ultimately highlight specific patterns and trends in the cases in which it is utilized, as well as the presence of "disconnected components".

The tools and the platforms that are created, as well as the software and the hardware used for security purposes, must be dependable. Also, the majority of the technological tools available enable protection by implementing multi-level controls. For instance, the
vast majority of cyber-security software has strict standards and focuses on identity and encryption management. On the other hand, as the technical tools are used in public institutions and by governmental authorities, the systems and programs must also be comprehensive and, at the same time, compatible with conventional systems typically used in public servers [23].

In the webometric analysis, we have validated our assumption that European Institutions would play a major role in the networking patterns of the CSIRTs. ENISA has topped every evaluation metric of this study. In addition to this, if we factor in degree centralities and betweenness centrality, we observe that CERT Bulgaria, CERT-Bund, GR-CSIRT, CSIRT-IE, IT-CERT, NCERT.LU, CSIRTMalta, CSIRT-DSP, and CSIRT-MON are the weakest nodes of our network, ranking zero in all metrics. This is clearly depicted by Fig. 1. This is more important for CERT Bulgaria, CERT-Bund, GR-CSIRT, CSIRT-IE, IT-CERT and CSIRTMalta, because those are the single CSIRTs from their respective countries, whereas NCERT.LU, CSIRT-DSP and CSIRT-MON are each one of 2 or 3 different CSIRTs of their respective countries. However, when we removed the European Institutions from the equation, the network showed a minimal level of interconnectivity. This fact highlights ENISA's role in the public dissemination of information.

Furthermore, we opted to run our experiment with the NATO member states' CSIRTs. This was decided in order to showcase how different kinds of organizations compare: on the one hand an economic and political organization like the EU, and on the other hand a military alliance. NATO's network yielded slightly better results than the EU network without the European Institutions. US-CERT is the most influential actor in that network. This was expected, as within a NATO framework the US plays a leading role.

There are also certain limitations to this study. First of all, we evaluate the cooperation of the CSIRTs by assessing their online presence; however, due to the peculiar nature of cybersecurity, we expect that there are cooperation and information sharing processes and protocols not available to the public. Furthermore, the seed site analysis was conducted by Webometric Analyst, a third-party tool, and with the use of the Bing search engine, which limits us in regard to the results we can get.
6 Conclusions

Cybersecurity is one of the greatest challenges that countries will need to address effectively in the coming years. Our study has identified all the key national cybersecurity actors in the EU and NATO. Through the analytical framework we utilized, it is shown that there is room for improvement on the public side of CSIRT cooperation. With regard to the EU, as we expected, ENISA is the most important actor in the CSIRT cooperation networking pattern, whereas over 50% of the CSIRTs lack the ability to influence the network. The hypothesis that ENISA is the most influential node in the network was validated by the results we got when we ran the same experiment excluding ENISA. NATO's metrics in our experiment were slightly better than the EU's without ENISA. The network was a little more interconnected; however, it lagged severely when compared to the EU network with ENISA.
Our experiment highlights the necessity for stronger interstate relations with regard to cybersecurity cooperation, not only at the operational level but at the public interconnectivity layer as well. It has also validated the fact that, within the EU and NATO, ENISA is the most powerful actor with regard to cybersecurity information dissemination.

6.1 Future Work

Further study should focus on addressing the limitations of this study and on the incorporation of new technologies for the data processing. Furthermore, interesting conclusions might be drawn by comparing EU and Association of Southeast Asian Nations (ASEAN), and NATO and Shanghai Cooperation Organization (SCO), CSIRT network patterns. The first is a comparison between political and economic organizations, while the second is a comparison between military alliances. Lastly, the ever-increasing role of non-state actors in cybersecurity cooperation and governance calls for research into the networking patterns of organizations such as ICANN, IETF, and IEEE.

Acknowledgments. EK is funded by the EPSRC grant EP/S022503/1, which supports the Centre for Doctoral Training in Cybersecurity delivered by UCL's Departments of Computer Science, Security and Crime Science, and Science, Technology, Engineering and Public Policy.
References

1. Deshpande, A., Pitale, P., Sanap, S.: Industrial automation using Internet of Things (IOT). Int. J. Adv. Res. Comput. Eng. Technol. 5, 266–269 (2016)
2. Brass, I., Sowell, J.H.: Adaptive governance for the Internet of Things: coping with emerging security risks. Regul. Gov. 11–20 (2020). https://doi.org/10.1111/rego.12343
3. Napetvaridze, V., Chochia, A.: Cybersecurity in the making – policy and law: a case study of Georgia. Int. Comp. Law Rev. 19, 155–180 (2019). https://doi.org/10.2478/iclr-2019-0019
4. Holmberg, K.: Webometric Network Analysis: Mapping Cooperation and Geopolitical Connections Between Local Government Administration on the Web. ABO Akademi University Press (2009)
5. Carrapico, H., Barrinha, A.: The EU as a coherent (cyber)security actor? J. Common Mark. Stud. 55, 1254–1272 (2017). https://doi.org/10.1111/jcms.12575
6. Koulas, E., et al.: Misinformation and its stakeholders in Europe: a web-based analysis. In: Arai, K. (ed.) Intelligent Computing. LNNS, vol. 285, pp. 575–594. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-80129-8_41
7. Markopoulou, D., Papakonstantinou, V., de Hert, P.: The new EU cybersecurity framework: the NIS Directive, ENISA's role and the General Data Protection Regulation. Comput. Law Secur. Rev. 35, 105336 (2019). https://doi.org/10.1016/j.clsr.2019.06.007
8. Koulas, E.: Defining Sovereignty and National Interest on Cyberspace: National and Supranational Paradigms. University of Macedonia (2019)
9. Renard, T.: EU cyber partnerships: assessing the EU strategic partnerships with third countries in the cyber domain. Eur. Polit. Soc. 19, 321–337 (2018). https://doi.org/10.1080/23745118.2018.1430720
10. Tanczer, L.M., Brass, I., Carr, M.: CSIRTs and global cybersecurity: how technical experts support science diplomacy. Glob. Policy 9, 60–66 (2018). https://doi.org/10.1111/1758-5899.12625
11. Pupillo, L.: EU Cybersecurity and the Paradox of Progress. CEPS Policy Insights (2018)
12. Chaisse, J., Bauer, C.: Cybersecurity and the protection of digital assets: assessing the role of international investment law and arbitration. Vanderbilt J. Entertain. Technol. Law 21 (2019)
13. Finnemore, M., Hollis, D.B.: Constructing norms for global cybersecurity. Am. J. Int. Law 110, 425–479 (2016). https://doi.org/10.1017/s0002930000016894
14. Chainoglou, K.: Attribution policy in cyberwar. In: Kulesza, J., Balleste, R. (eds.) Cybersecurity and Human Rights in the Age of Cyberveillance. Rowman & Littlefield, London (2016)
15. Chainoglou, K.: Reconceptualising self-defence in international law. King's Law J. 18, 61–94 (2007). https://doi.org/10.1080/09615768.2007.11427664
16. Thelwall, M.: Introduction to Webometrics: Quantitative Web Research for the Social Sciences (2009)
17. Acharya, S., Park, H.W.: Open data in Nepal: a webometric network analysis. Qual. Quant. 51(3), 1027–1043 (2016). https://doi.org/10.1007/s11135-016-0379-1
18. Disney, A.: Social network analysis 101: centrality measures explained. In: Cambridge Intelligence (2019). https://cambridge-intelligence.com/keylines-faqs-social-network-analysis/. Accessed 12 Oct 2020
19. Valente, T.W., Coronges, K., Lakon, C., Costenbader, E.: How correlated are network centrality measures? Connect (Tor) 28, 16–26 (2008)
20. Friedkin, N.: Theoretical foundations for centrality measures. Am. J. Sociol. 96, 1478–1504 (1991)
21. Freeman, L.C.: Centrality in social networks conceptual clarification. Soc. Networks 1, 215–239 (1978). https://doi.org/10.1016/0378-8733(78)90021-7
22. Newman, M.E.J.: Ego-centered networks and the ripple effect. Soc. Networks 25, 83–95 (2003). https://doi.org/10.1016/S0378-8733(02)00039-4
23. Rajamäki, J.: Cyber security, trust-building, and trust-management: as tools for multi-agency cooperation within the functions vital to society. In: Clark, R.M., Hakim, S. (eds.) Cyber-Physical Security. PCI, vol. 3, pp. 233–249. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-32824-9_12
24. Li, J., Xu, W.W., Wang, F., et al.: Examining China's internet policies through a bibliometric approach. J. Contemp. East Asia 17, 237–253 (2018). https://doi.org/10.17477/jcea.2018.17.2.237
25. Lo, G.P., Russo, P., Ursino, D.: A new Social Network Analysis-based approach to extracting knowledge patterns about research activities and hubs in a set of countries. Int. J. Bus. Innov. Res. 17, 1–36 (2018)
Safety Instrumented System Design Philosophy Paradigm Shift to Achieve Safe Operations of Interconnected Operating Sites

Soloman M. Almadi and Pedro Mujica
SPE, ISA, Saudi Aramco, Dhahran, Saudi Arabia
[email protected]
Abstract. The petrochemical industry and other process manufacturing facilities require major infrastructure investment and involve dangerous operations that are susceptible to great risks such as fire, explosion, and/or un-orchestrated process upsets. The Safety Instrumented System (SIS) safeguards process operations that are managed by a Basic Process Control System (BPCS). The SIS is deployed in the local process automation zone of the processing facility without interconnection to a remote monitoring and operations facility. This represents one of the major challenges and limitations of SIS systems. There are historical major incidents in the Oil and Gas industry that could have been avoided if SIS performance had been proactively known and abnormal conditions had been acted upon autonomously. This paper examines a set of major petrochemical industry process-related incidents, with a primary focus on identifying sensor network and system weaknesses. The detailed analysis of earlier incidents revealed the need for SIS design enhancements in the networking, system architecture, and data flow interworking model, and the absence of centralized data processing with autonomous decision-making and execution rights. The use of different communication media (fiber, wireless, and VSAT) introduces new capabilities that can be utilized to achieve the required data delivery for process safety related actions. Moreover, the emerging Industrial Internet of Things (IIoT) technologies introduce a new automation layer that increases proactive decision making. This paper introduces a new concept for interlinking process operations that have multiple, distributed, and remote operational zones. The paper concludes with best practices that enhance the current design model, efficiency and operational reliability. The intent is to bridge a well-identified existing gap and prevent the escalation of hazardous events with reduced response time (milliseconds to seconds), as compared with the current design philosophy that relies on non-autonomous decision-making processes (human intervention) that take longer (minutes to hours) to detect and react after the fact, providing lagging indication and lacking a proactive approach.

Keywords: Safety instrumented system · SIS system · Basic process control system · BPCS · Industrial Internet of Things technologies · IIoT
1 Introduction

Petrochemical, chemical, mining, gas compression, and other process manufacturing facilities are dangerous and susceptible to great risks such as chemical exposure, fire, explosion, tank overflow, and gas release. The Safety Instrumented System (SIS) plays a key role in preventing physical harm to personnel and/or damage to the process infrastructure and environment [1]. The SIS is considered the last line of defense for process operation that is typically managed by a Basic Process Control System (BPCS). While the SIS has proven its capabilities and maturity, the key challenge is that its local operational performance is not remotely known; SIS performance availability is confirmed by direct interaction, due to its lack of inter-working capability outside the operating zone. Major incident investigations in the Oil and Gas industry have indicated that a proactive approach with leading indicators would be highly beneficial in the prevention of catastrophic events that claim human lives, destroy property, damage reputations, and impact the environment. Moreover, the ability to proactively conduct a holistic risk impact assessment will provide enhancements to SIS design and engineering [2]. Such proactiveness is now possible due to the advanced development of communication technologies in process automation applications [3]. The use of different communication media (fiber, wireless, and VSAT) introduces new capabilities that can be utilized to achieve the required data delivery. This paper introduces a new concept for inter-linking process operations that have multiple operational zones, and it concludes with best practices that enhance the current operational model, efficiency and reliability.

This enables reliable transport of critical process safety data from an operating site (downstream) into a global server that is then able to issue commands to another (or multiple) upstream operating site's SIS to make the necessary process adjustment in due time. This approach will demand skilled Safety Instrumented System (SIS) designers and Process Safety practitioners to analyze the process parameters, understand the cause and effect of events, gather information and make predictions in a leading and proactive manner that will trigger an autonomous action. Once confirmed in a matter of seconds or milliseconds (real-time computing power), this will trigger an action to activate a Safety Instrumented Function (SIF) loop located in a SIS physically located miles away from the actual site where the potential unsafe and hazardous situation is developing. This breaks the silo design concept and yields a global, barrier-less concept allowing reading and commanding critical process safety design inputs and outputs to act upon a confirmed process safety threat. Under the global design philosophy approach explained in this article, the designers will gather the information from the process safety critical elements relevant to the process (to name a few: high level and pressure transmitters, limit switch statuses, process temperature readings, and vibration sensors, among other parameters) and come up with a model with the purpose of predicting and preventing process safety events that involve immediate action from the upstream systems (operating sites) shipping hazardous materials to a distribution network of receiving clients downstream.
Hence, with the utilization of predictive leading indicators combined with the latest communication technologies, this paper intends to explain how, by applying a new design philosophy with the most advanced artificial intelligence, Boolean logic, and reliable and redundant communication methods, the occurrence of major catastrophic events such as Piper
Alpha [3] or Buncefield [4] can be prevented, since the hazardous event would be identified and acted upon autonomously, upon confirmation of process conditions, to reach the safe state with a global, all-encompassing design concept that breaks the barriers of physical distance limitations.

1.1 SIS Overview

A Safety Instrumented System (SIS) can be defined as an autonomous system designed to monitor, identify abnormal or unsafe conditions, and take the necessary actions to prevent, stop, or mitigate the escalation of undesired consequences of a given process [5]. The SIS is independent from the Basic Process Control System (BPCS), which is designed for control purposes so that process equipment operates within certain operating windows [6]. The distinction is worth noting since both systems are composed of similar components, which in the most simplistic way can be described as: sensing elements, a logic solver, and final elements. However, the design intent of the SIS is dedicated to automatically taking a given process to the safe state when certain conditions controlled by the BPCS are abnormally out of control. SISs are often designed to watch that certain conditions and process parameters are healthy to allow, in a restrictive permissive way, the process to start and move in a safe manner. Also, a SIS can be designed to be a mitigation element when things have entered a hazardous zone beyond control; a typical example would be when the mechanical integrity of a piping system is lost and the hazardous material is out of the pipe and in the open air. In that case the SIS could become a mitigation element, preventing a large escape of the hazardous material and minimizing the potential damage to people, the enterprise reputation, financial losses and the impact on the environment.

The SIS is not meant to be, or be part of, the BPCS; it is an independent autonomous system designed to prevent or mitigate undesired situations. SISs are composed of Safety Instrumented Functions (SIFs), which are designed with a specific purpose to prevent, stop, or mitigate specific hazardous situations [7]. Therefore, SISs often have within their design architecture multiple SIFs, with a variety of functionalities, with the intent of preventing undesired, hazardous, and harmful consequences. The SIS design comprises a robust architecture that is defined in principle by the type of hazardous events it is intended to prevent: the higher the potential consequences, the higher the expected robustness of the SIS design, with the redundancy required to assure that the necessary actions are taken when called upon on demand. Within a SIS there can be multiple SIFs, each designed to monitor and prevent specific hazardous outcomes, keeping the process safe.

Current Challenges and Problem Statement

A SIS is typically designed for a specific operating site with the primary purpose of preventing undesired harmful outcomes within the boundaries of the local operating facility. The system is based on an isolated autonomous design that is locally dedicated to that specific process zone. However, many industrial processes consist of producing, shipping, and/or receiving hazardous materials that travel long distances. An example is an upstream source, such as an Oil or Gas Plant facility, where the material is extracted or produced
and shipped to a downstream operating facility for further refining, upgrading, or storage. While it is feasible to establish an interconnected Industrial Automation and Control System over a wide operating geographical area for the BPCS [8], it is unthought-of and inconceivable under the current design philosophy to operate a SIS in a similar manner. Extending a SIS design or configuration to issue process safety commands to a separate and independent SIS entity that is geographically separated has an overall impact on the SIL ratings and performance. The complexity increases when assets are shared by different owners or operating entities that are nevertheless interconnected by complex piping systems. However, an interconnected SIS design concept brings with it great overall enhancement and may even lead to the avoidance of catastrophic incidents.

Piper Alpha, a well-known historical catastrophic event, has changed the way the industry views process safety, as it highlighted the limitations of SIS protections among interconnected operating facilities. The incident affirmed how the undesired events escalated due to the lack of interconnected and autonomous SISs among the multiple platforms that flowed into Piper Alpha [9]. The Piper Alpha post-incident analysis outlined that shutting the inter-platform oil lines would probably have made a difference. Exported oil flowed out of the ruptured oil line into the Piper Alpha facility, which resulted in flooding the floor and overflowing to the floor beneath. This resulted in a large pool fire, which led to the inevitable escalation of the catastrophic event on Piper Alpha [3]. It is also understood from the investigation that the human factor played a role in delaying the decision to halt production from the other platforms; this in part was because the people in charge were hesitant to make such a decision due to the expected major production impacts [9]. However, such an oversight resulted in a much bigger event, leading to the loss of the entire platform and, more importantly, the loss of multiple lives. The Piper Alpha incident is just an example of how operating sites lacking SIS interconnectivity could have benefitted from a proactive shutdown of the product being shipped into a failed Piper Alpha platform experiencing a major event; perhaps cutting the feed with a much faster reaction time would have prevented the disastrous outcome.

The questions [3] "How is my facility connected to other facilities?", "What could go wrong at the interface?", "What are the opportunities of introducing another computer layer that oversees the overall SIS process?", and "What is the impact of an interconnected system on the SIL rating?" are left unanswered to instigate further thinking. This has stimulated a design approach intended to provide a solution that addresses the process safety point of view for multiple sites. Moreover, design attributes and variables can be cross-examined, through the use of data analytics, to prevent the escalation of undesired events. The new approach applies a global SIS concept that interconnects multiple facilities via safety communication links to reliably initiate the necessary actions in a timely manner, aiming to mitigate catastrophic outcomes and therefore reduce the potential loss or damage.
As stated, opportunities and proactive actions could have reduced the ultimate outcome of the Piper Alpha event, which is considered to this date one of the worst industrial accidents and which challenged the status quo regarding process safety in the Oil and Gas industry. Yet, many years after this event occurred, a silo mentality is predominant in the design of SISs among operating facilities, relying heavily on human factors to take prompt corrective actions to prevent the escalation of process safety events. Although fiber optics
and Supervisory Control and Data Acquisition (SCADA) and Remote Terminal Unit (RTU) systems are robust and reliable, they are not intended to take the autonomous actions necessary to take the process to a safe state. Such systems are limited in their applications to control and to the communication of information that is brought into the control room for an Operator, the human factor, to identify, analyze and take the necessary corrective actions [10].

To overcome the challenge, the question arises: what about removing the "silo mentality" from the design of SISs of operating sites? Is it time for us to consider a new concept that looks from an elevated perspective to identify how the operating sites are related? One must answer how certain events will impact the entire network of producing systems upstream and downstream, understanding the leading indicators and design parameters that point, in a proactive manner, to the prevention of the potential development of a catastrophic event. What does it take to have the corrective action communicated and invoked in a timely manner?

Before exploring the solution to these questions, an example that illustrates perhaps the most basic application of this concept is the catastrophic event that occurred at Buncefield [4], UK. The Buncefield Oil Storage Depot was comprehensively damaged by a series of explosions and fires that occurred back in December 2005. It was considered the largest event of its type in Europe since the Second World War. The primary cause of the incident was found to be the overfilling of a tank containing unleaded petrol, resulting in the spill of 250,000 L of fuel, which encountered a source of ignition [4]. With the current computing power and the predictive modelling that can be generated utilizing artificial intelligence and Boolean logic, among others, there is an opportunity to move in the direction of leading indicators, which in the cited Buncefield case would have used all the process parameters readily available to determine that something was not going right and to stop the shipment of product from the upstream source.

An independent post-incident analysis [1] concludes that there is a need to move in the direction of prediction of events by using and applying leading and trailing indicators. The SIS design shall factor in the relationships between these indicators to improve safety performance. Although the need to identify and track leading indicators is recognized [1], such indicators are passive, without any autonomous or immediate response triggered or handled by the SIS; the current status quo therefore requires a profound paradigm shift to change the mindset in process safety design. This should encompass identifying leading indicators to proactively and autonomously apply solutions that use leading factors as input to be processed by the computing power. This should lead to and command corrective actions that are executed within a reasonable time and, more importantly, actioned within the boundaries of the operating site and beyond its fences. Developing insight into upstream or downstream sources enables the necessary corrective actions and results in an innovative autonomous, smart, local and global SIS concept: receiving input from an operating site, analyzing the process safety parameters, and actuating final elements under a global perspective. This results in control strategies to actuate final elements that may (or may not) be physically in proximity within the operating boundaries of the site, but reaching as far as necessary to take the whole system to a safe state.
This transforms the current design philosophy of SISs, extending actions beyond the boundaries of the particular operating site where the SIS physically resides. Under the
approach claimed in this paper, it is imperative to extend those boundaries and reach out via reliable, redundant and innovative mechanisms, technologies and platforms such as cloud computing, the Industrial Internet of Things (IIoT), Edge computing, etc. These technologies make it possible to write commands into SISs physically located far from the site where the unsafe condition is developing. In the Buncefield example, enough process safety data was produced in the field that, if processed by a predictive model capitalizing on artificial intelligence algorithms designed with process safety in mind, the system would have autonomously taken the actions necessary to stop the incident. Industry Emerging Solution. The reliability of a SIS can be scaled up or down according to the particular risk of a given process. This is customarily measured by applying Safety Integrity Level (SIL) techniques, which define the level of robustness required by the SIS to bridge identified residual risks [11]. Hence, the SIS can be made very robust by design at the local operating zone level, including enough redundancy and hardware fault tolerance to make it exceptionally reliable from the safety availability viewpoint, with high operating uptime and minimal nuisance trips of the operating process. The industry therefore benefits from a well-established and standardized approach to the reliable protection offered by SISs. SISs are designed with well-established communication protocols that safely and reliably bring the required signals and process parameters from the field into a processing element (logic solver), which takes autonomous actions upon confirmation that an undesired and hazardous process safety event has been initiated, all within the boundaries of the operating facilities [12]. The latest trends in digital transformation, which address connectivity, computing and decision-making power, and cybersecurity protection among other capabilities, lend themselves to interconnecting multiple, geographically spread systems [13]. Moreover, measured levels of reliability and high integrity of data and action commands against malicious intruders are evolving into tools that can be used to change the legacy silo mentality. All of this favors having true leading indicators which, when fed to emerging and powerful analytics with artificial intelligence capabilities, can autonomously trigger the actions necessary to prevent major catastrophic events.
2 Proposed Design Model
The proposed design philosophy retains the traditional SIS components for the sensing and final elements, which will continue to function as in current conventional applications. The new element of the model is a centralized logic solver that no longer functions in a silo but as part of a network of interconnected logic solvers. System capabilities such as high-speed communication, network protocols, and computing power establish the readiness for this approach. As a result, sending and accepting external commands generated by a logic solver that is not physically linked via hardwired protocols becomes feasible. Hence, Cloud-based computing and/or a centralized server architecture can be a migration path from the conventional localized and hardwired signals.
The new approach involves the interaction of multiple inputs received from scattered logic solvers that constantly feed real-time information to a centralized computing system: a global logic solver for process safety purposes. The system is designed with processing algorithms that provide predictive models to generate decisions and workflows that avert undesired process safety scenarios, and with the ability to issue commands that actuate final elements belonging to a SIS located miles away, physically connected with the operating equipment, to stop an unsafe situation. This model brings forward a paradigm shift in SIS design, departing from the legacy hardwired, silo system.
2.1 SIS System Changes
A traditional SIS can be described in a simplified manner as follows:
• It acts on a specific site within an operating site realm, referred to as the silo concept in this article.
• It uses hardwired connections between sensing elements, logic solver, and final elements.
– Present technology allows wireless applications between sub-systems within the same SIS; these are, however, bound by distance limitations. For example, wireless sensors feed input to a logic solver located in the same geographical area where the operation is occurring.
• It does not allow input signals with executive power and commands from a separate, independent SIS via known communication protocols (e.g. Ethernet), often granting only reading rights to external interfaces (e.g. Distributed Control Systems).
• It is unable to communicate signals to execute actions outside its own design boundaries.
• It is not capable of utilizing emerging communication technologies, such as Cloud technology, to actuate SISs that are geographically spread.
The changes that challenge the status quo of SIS design are:
• No longer a silo design philosophy:
– A SIS located at operating site A can read multiple sensing signals from a local process.
– It can orchestrate actions for local actuation and send output commands via cloud (or any other communication protocol) to a separate SIS located at site B to perform the required actuation of final elements.
• Process safety information captured locally can be shipped for processing in a global logic solver that may execute further commands in SISs located in other operating sites, as required by the severity and needs of the system to mitigate the predicted unsafe situation.
• SISs located in different operating sites (e.g., B, C, …, Z, or any site other than site A) can receive control action commands from a global logic solver, or directly receive signals output by the SIS located at site A to stop an unsafe condition.
• The logic solvers can process and actuate systems locally, as well as export process safety information to other interconnected (via cloud or any other communication medium) logic solvers for further processing and actuation of final elements as required.
2.2 Operational Model
Current industry practice widely utilizes Safety Instrumented Functions (SIFs) to prevent the materialization of process safety events. This is achieved by having sensing elements, safety logic solvers, and final elements solely dedicated to a specific set of process parameters, constantly monitoring the process conditions. The logic solver is the brain that decides to trigger a safety loop when a dangerous process condition is met. It consists of pre-defined logic containing process setting limits (trip set points) which trigger an action on the final elements (valves, pumps, contactors, etc.). The outcome of a process trip is an actuation of the final elements belonging to the stated SIF (from now on cited as SIF 1), which is designed to bring the process to its safe state and prevent the hazardous event. However, when the final elements actuate as a result of the SIF 1 trip, they quite commonly create a cascading hazardous effect that propagates to the upstream side of the process. Typically, this is monitored neither by the stated SIF 1 nor by the same process controller. Common practice is for the cascade effect to be dealt with by a different and independent SIF (from now on cited as SIF 2), which belongs to a control and safety system physically separated by a long distance from SIF 1. The novel concept establishes an intrinsic correlation between SIF 1 and SIF 2 via Cloud and/or Edge computing, yielding a tremendous opportunity to prevent escalation of the hazardous scenarios that arise as a consequence of SIF 1 activation. A simple schematic is shown in Fig. 1 below:
Fig. 1. Conceptual model approach. Parameters conforming SIF 1 are imported into the cloud environment (global server for process safety); SIF 2 feeds from the global server, with advanced processing of the information communicated via Cloud from SIF 1, to take leading and preventive actions.
An illustration of the newly introduced approach is depicted in Fig. 2. In this scenario, a receiving tank is fed gasoline pumped from a distant tank farm. A summary of the events in this typical example is as follows:
• Gasoline is shipped from Tank #2, located at a tank farm some distance away.
• The transfer Pump #2 empties Tank #2 and ships the product via long-distance piping (e.g. 50 km) to the receiving Tank #1.
• The receiving Tank #1 is protected against a hazardous scenario by a SIF 1 designed to prevent overfilling of the tank, which upon high-high level detection closes Valve #1 at the inlet of Tank #1, preventing the overfill.
• Valve #1 is physically located in proximity to the safety control system, allowing the signals from the Tank #1 level transmitter to be hardwired to the Tank #1 safety logic solver, and from the logic solver to Valve #1 (final element).
– However, SIF 1 has no connection whatsoever with Pump #2, which keeps pumping from afar (e.g. 50 km away) even after the high liquid level is detected and Valve #1 is closed by the actuation of SIF 1.
• The sudden closure of Valve #1 as a result of a real SIF 1 demand, preventing the overfilling of Tank #1, may generate a cascade effect, in this example commonly known as mechanical surge, which, if not diagnosed and designed for correctly, has the potential to cause a catastrophic failure of the pipeline system due to the pressure peaks experienced in the system. Pressure waves may be generated by the blocked outlet that no longer allows the fluid to enter Tank #1.
As briefly described, SIF 1 offers reliable protection against the hazardous scenario of overfilling Tank #1; however, it initiates a cascade effect that creates a potential hazardous event for the upstream pipeline system due to the mechanical surge phenomenon. The novel concept described here uses Cloud technology and/or edge computing to prevent or mitigate such a collateral (or cascade) hazardous scenario. Following the same example, process information such as the high-high level in Tank #1 demanding SIF 1, together with any other information (e.g. process alarms), is now recorded in the Cloud environment (global server for process safety). This enables the transport of real-time information to actuate a SIF 2 remotely located where the pump station is (e.g. at Pump #2), taking the necessary and immediate actions to prevent the cascaded effect; with connectivity via Cloud computing, this means stopping Pump #2 to prevent or mitigate the mechanical surge caused by the SIF 1 demand. Therefore, the concept of eliminating the cascade effect has wide application, since it is limited neither by geographical location nor by hardwiring constraints: emerging Cloud technology and edge computing in the remote area enable the sustained application of process safety concepts that were limited or not possible in the past.
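To make the cross-site interlock concrete, the following sketch (written in Haskell purely for illustration; it is not the authors' implementation, and the event type, trip point and tag names are hypothetical) shows the essential logic: the local SIF 1 closes Valve #1 and publishes a high-high level event, and the global solver maps that event to a stop command for the remote Pump #2. The cloud transport itself is abstracted away.

-- Illustrative sketch only: names, threshold and structure are assumptions,
-- not the implementation described in this paper.
data SafetyEvent = HighHighLevel { siteId :: String, tankId :: String }
  deriving (Show, Eq)

data Command = CloseValve String | StopPump String
  deriving (Show, Eq)

-- SIF 1 at Site A: local overfill protection for Tank #1.
-- Returns local commands plus the event forwarded to the global server.
sif1 :: Double -> ([Command], [SafetyEvent])
sif1 level
  | level >= highHigh = ([CloseValve "Valve#1"], [HighHighLevel "SiteA" "Tank#1"])
  | otherwise         = ([], [])
  where highHigh = 95.0   -- hypothetical high-high trip point (% of span)

-- Global logic solver: maps incoming events to commands for remote SIFs,
-- here stopping Pump #2 to prevent the mechanical-surge cascade.
globalSolver :: SafetyEvent -> [Command]
globalSolver (HighHighLevel "SiteA" "Tank#1") = [StopPump "Pump#2"]
globalSolver _                                = []

main :: IO ()
main = do
  let (localCmds, events) = sif1 97.3
  print localCmds                        -- [CloseValve "Valve#1"]
  print (concatMap globalSolver events)  -- [StopPump "Pump#2"]

In a real deployment the event would travel over the redundant media described in Sect. 3.2 and be protected as discussed in Sect. 3.3.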
3 Implementation Requirements and Best Practices
The implementation requirements do not differ from the current requirements of typical SISs in terms of hardware and software. However, the communication of process safety data in and out of operational sites is a key element for this approach to work optimally. Communication media, standard protocols and a global server for process safety are the key distinctions compared to the current design of Safety Instrumented Systems. Safety Instrumented Systems will continue to be composed of sensing elements, logic solvers, and final elements, with the newly added capability of a Global Server. The Global Server will be dedicated to handling process safety applications, receiving input from and sending control commands to the geographically distributed logic solvers.
Fig. 2. Process SIF line diagram
3.1 Computing
The new technologies of edge computing and the concept of a global server are expected to be part of the system design, allowing information to be collected and processed and commands to be sent. The robustness of the computing systems will be determined by the level of risk established by the risk assessment performed for each project and operating site, typically dictated by the results of the Safety Integrity Level (SIL) assessment indicating the need to design to SIL 1, 2, or 3 requirements. These requirements are clearly defined in international practice in IEC 61511: Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements. The computing platform should have built-in redundancy, processing power, and storage capacity. The use of advanced data analytics that produce vital data is required. A computing power failure shall not impact local system performance at the local process facility.
3.2 Network and Communication System Changes
The network is no longer limited to hardwiring of signals produced at a given location. In order to achieve interworking between SIF 1 and SIF 2, or SIF n (where n = 1, 2, 3, …), a highly reliable communication infrastructure must be part of the SIF solution in the design, typically captured in the HAZOP study. The design concept entails redundant communication infrastructure and, most importantly, redundant communication media interconnecting the different SIF facilities along the process. To this end, this approach introduces the trio communication medium design model to ensure reliability and availability at all times, as shown in Fig. 3. This includes the introduction of a self-healing transmission network in the form of SDH (Synchronous Digital Hierarchy) or OTN (Optical Transport Network), parallel to
Fig. 3. System design network robustness
dedicated fiber optic cable strands that follow a different routing, complemented by a long-haul wireless solution: Very Small Aperture Terminal (VSAT) or point-to-point broadband wireless. The design model establishes each infrastructure medium so that if one medium becomes unavailable, the second and/or the third remains in operation. The communication nodes of the design model have the capability to sense a communication medium outage and switch over to an alternate. Moreover, the design model can send duplicated data via each communication route; at the receiving end the received data is analyzed and the most correct data value is selected. The data selection is based on a voting process: 2 out of 3 values closest to the correct expected value (a sketch of such a voter is given after Fig. 4).
3.3 Cybersecurity
The SIF communication and system infrastructure layer interworking shown in Fig. 3 (above) is interfaced with the outside of the local process operation to enable advanced data analytics, data presentation, and mobile end-user access. To support this capability, the design model introduces a secure communication link that uses multiple layers of cybersecurity protection (such as a DMZ) leading to a connection to the Cloud IoT and/or a standard data server hosting system. Figure 4 (below) depicts the interface between the SIF layers and the external central computing platform. End-user desktops and/or mobile devices (tablet, smart phone, etc.) may have access to the data based on preset templates and data access authorization levels.
Fig. 4. Embedded network and system cyber protection
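Returning to the 2-out-of-3 data selection of Sect. 3.2, the sketch below (again Haskell, purely illustrative and not the authors' code) shows two natural readings of the rule: taking the median of the three copies received over the three media, so that a single corrupted route is outvoted, or keeping the two copies closest to the expected value and averaging them.

import Data.List (sortOn)

-- Classic 2oo3 voter: the median of the three received copies.
-- Example: vote2oo3 10.1 10.2 55.0 == 10.2 (the corrupted route is outvoted).
vote2oo3 :: Ord a => a -> a -> a -> a
vote2oo3 a b c = case sortOn id [a, b, c] of
  [_, m, _] -> m
  _         -> error "unreachable: exactly three inputs"

-- Alternative reading: keep the 2 of 3 copies closest to the expected value
-- and return their midpoint.
voteNearExpected :: Double -> (Double, Double, Double) -> Double
voteNearExpected expected (a, b, c) =
  case sortOn (\v -> abs (v - expected)) [a, b, c] of
    (x:y:_) -> (x + y) / 2
    _       -> error "unreachable: exactly three inputs"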
4 Process Safety Time
The typical SIS application requires rapid processing and prompt actuation of final elements to prevent a hazardous event. The available process safety time varies from milliseconds to hours before the dangerous scenario develops and evolves into a threat to the operating system. In principle, the current process safety time considerations will not change in this application, because the SISs retain their typical design functionality to control the hazardous scenarios of their own operating site. The new approach additionally considers the time it takes for a hazardous event to develop into a dangerous situation; this is where two (or multiple) SISs are required to act to prevent escalation of a hazardous event or a collateral effect, and such scenarios typically leave sufficient time for the SISs. The time for the SIS at site A to communicate with the SISs at sites B, C, …, n is bounded by the speed of signal and data communication from the SIS generating the input, the communication delay, and the Cloud processing. The communication is established quickly, and the processing time at all SISs consuming the data is on the order of milliseconds, or seconds at most, leaving sufficient time for the prevention or mitigation of the hazardous event. Every application will have to analyze the time frames in which its hazardous scenarios develop in order to establish reaction times and operational boundaries. The authors estimate that process safety time will not be a limitation for this concept of interconnected SISs, including those cases where the global server for process safety is a technically sound candidate to house artificial intelligence and machine learning capabilities, resulting in autonomous command decisions for local SISs to actuate final elements geographically spread across different operating sites.
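As a purely illustrative budget (all figures below are assumptions made for the sake of arithmetic, not measurements): with sensing and local logic-solver processing of about 0.1 s each, one cloud round trip of about 0.3 s, global-server processing of about 0.1 s, and final-element actuation of about 2 s, the end-to-end reaction time is roughly

t_total ≈ t_sense + t_local + t_cloud + t_global + t_actuate ≈ 0.1 + 0.1 + 0.3 + 0.1 + 2.0 ≈ 2.6 s,

which is dominated by the actuation itself rather than by the added communication, and which each application must then compare against the development time of its own hazardous scenario, as stated above.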
5 Human Factors
This effort is not intended to elaborate much on human factor aspects, since the human factor is a complex science in its own right. However, it is a fact that having autonomous computing power designed to evaluate complex process safety situations will alleviate the burden on operations personnel; difficult decisions in moments of distress are facilitated by this new approach. With available machine learning capabilities, the logic solvers at each site and in the global server concept can receive the necessary training to identify and react to different scenarios. The opportunity is unique: with specialized process safety personnel, the logic solvers' computing power can be pre-programmed to identify and learn to forecast potential hazardous scenarios and take the necessary actions. Of course, this is not intended to replace or diminish the importance of the human operators, but it will definitely be good insurance when all elements of the current operating model have failed.
6 Comparative Analysis
The comparative analysis between the current SIS design philosophy and the newly proposed model, Table 1, shows an enhancement in all categories. The enhancements are rated qualitatively, owing to SIS system complexity and the challenges of the available simulation tools. A conceptual model has, however, already shown an enhancement in response time and autonomy by focusing on one variable: the remote SIS response time, reduced from minutes to sub-seconds. Additional simulations for the remaining variables are required, as discussed in Future Work.

Table 1. Comparative analysis between the current SIS design philosophy and the new proposed model

Category | Current SIS design model (Time impact / Action impact / Process safety advantage) | Proposed SIS design model (Time impact / Action impact / Process safety advantage)
Sensor Layer | Real time / Localized / No change | Real time / Multi site / Introduced soft sensors
Local SIS | Real time / Localized / No change | Real time / Multi site / Data processing beyond local site boundaries
Final Elements | Real time / Localized / No change | Real time / Multi site / Prevention of large scale events
Inter-SIS | Does not exist / Non-existing / – | Real time / Multi site / Introduced interworking between remote SISs
Remote SIS | Delayed human intervention / Impaired / Variance | Real time / Multi site / Autonomous; response time reduced from minutes/hours to sub-seconds
7 Future Work
There is a great opportunity to explore the new approach using simulation and/or empirical data. However, it is recommended to conduct simulations of both the network and the SIS prior to moving to an actual field pilot. Field pilots are crucial for collecting empirical data and comparing it with the simulated scenarios.
8 Conclusion
Digital communication and centralized computing platforms can drive a major transformation of the current approach to SIS systems. This effort introduced a new approach that enables the interconnection of SIS systems scattered across geographically distributed sites, and with it a solution addressing past major industrial incidents whose catastrophic impacts were due to a lack of SIS interconnectivity. The solution capitalizes on standard communication media, network protocols, Cloud computing and the Internet of Things to provide holistic monitoring and automated command and control of SIS systems. The new approach requires simulation and pilot validation. The new model is expected to improve process safety by orders of magnitude, achieving the safety integrity level required for a given application while reducing the human factor to a bare minimum, which addresses the time delay in the response of a remote SIS. This effort defines the fundamental requirements and best practices to ensure successful and sustainable performance of the new approach. Additional simulations and empirical data are required to quantitatively demonstrate the enhancements. Acknowledgments. The authors express their appreciation to Saudi Aramco management for their permission to publish this paper.
References
1. Electric/Electronic/Programmable Electronic Safety Related Systems, Parts 1–7, document IEC 61508, International Electrotechnical Commission, Geneva, Switzerland (2010)
2. Anderson, W.E.: Risk analysis methodology applied to industrial machine development. IEEE Trans. Ind. Appl. 41(1), 180–187 (2005)
3. Macleod, F., Richardson, S.: Piper Alpha: The Disaster in Detail. https://www.thechemicalengineer.com (2018)
4. Howard, C.: The Buncefield Incident – 7 Years on: A Review. Buncefield Oil Storage Depot. Measurement and Control, vol. 46, no. 3. Health & Safety Laboratory, istech Consulting Ltd, Middlesbrough, UK (2013)
5. Mannan, M.S.: A Technical Analysis of the Buncefield Explosion and Fire, Symposium Series No. 155, Mary Kay O'Connor Process Safety Center, Texas A&M University System, College Station, Texas, USA (2019)
6. Scharpf, E., Thomas, H.W., Stauffer, T.R.: Practical SIL Target Selection, Risk Analysis per the IEC 61511 Safety Lifecycle, 2nd edn. (2022)
7. Generowicz, M.: Functional safety: the next edition of IEC 61511. I&E Syst. Pty Ltd., WA, Australia, Technical Report (2015)
8. Industrial Communications Networks—Network and System Security—Part 2-1: Establishing an Industrial Automation and Control System Security Program, Edition 1.0, document IEC 62443-2-1, International Electrotechnical Commission, Geneva, Switzerland (2011)
9. NASA Safety Center System Failure Case Study: The Case for Safety – The North Sea Piper Alpha Disaster. National Aeronautics and Space Administration, vol. 7, issue 4 (2013)
10. Alade, A.A., Ajayi, O.B., Okolie, S.O., Alao, D.O.: Overview of the supervisory control and data acquisition (SCADA) system. Int. J. Sci. Eng. Res. 8(10) (2017)
11. Catelani, M., Ciani, L., Luongo, V.: A simplified procedure for the analysis of Safety Instrumented Systems in the process industry application. Microelectron. Rel. 51(9–11), 1503–1507 (2011)
12. Functional Safety—Safety Instrumented Systems for the Process Industry Sector, Parts 1–3, document IEC 61511, International Electrotechnical Commission, Geneva, Switzerland (2003)
13. Nadkarni, S., Prügl, R.: Digital transformation: a review, synthesis and opportunities for future research. Manage. Rev. Quar. 71(2), 233–341 (2020). https://doi.org/10.1007/s11301-020-00185-7
Bifurcation Revisited Towards Interdisciplinary Applicability
Bernhard Heiden1,2(B), Bianca Tonino-Heiden2, and Volodymyr Alieksieiev3
1 Carinthia University of Applied Sciences, 9524 Villach, Austria
[email protected]
2 University of Graz, 8010 Graz, Austria
3 Leibniz University Hannover, 30823 Garbsen, Germany
http://www.cuas.at
Abstract. Bifurcation analysis is a very well established tool in chaos theory and non-linear dynamics. This paper revisits its practical application in two computer algebra tools, Matlab and Mathcad. After introducing the topic and giving applications in those computer algebra tools, these implementations are investigated with regard to their possible application beyond the core fields of chaos theory, and with regard to the possibility of bridging hard and soft sciences with these computational mathematical tools, in order to address complex dynamics across disciplinary borders as well as a transplanted or transferred tool use in soft-science disciplines.
Keywords: Computing · Soft science · Bifurcation analysis · Transdisciplinary research
1 Introduction
Bifurcation analysis is well known in chaos theory and can be seen as a tool to analyse non-linear dynamics [3,12]. The main problem in soft sciences and interdisciplinary applications is that there are no universal, generally applicable tools to tackle those fields mathematically or thoroughly quantitatively. The usual method of linearising and using linearised models dominates a vast field of science and has produced great advances in linear theory. Even now, in university education, this approach is firmly established, and new approaches from non-linear science are only very slowly being taken up. The reasons are manifold; ever new classifications with new characteristics arise, as the combinatorial possibilities increase vastly when investigating non-linear regimes. On the other hand, very simple models can easily unfold chaotic behaviour. Such models can be self-referential functions or Ordinary Differential Equations (ODEs), important function classes that can be analysed by bifurcation analysis with many related algorithms and methods. Although the method is very promising, only some other hard-science
applications have emerged, and no soft-science applications so far. It is therefore promising to initiate a bridge between these two fields of science and to investigate possibilities for its further realisation. An interesting event happened today: a glass that I was moving in one hand and a bottle in the other touched each other slightly. At that moment the glass exploded, and thousands of diamond-like splinters fell to the ground, most of equal size. It was not only shocking, but also amazing what this rare resonance phenomenon had produced: thousands of parts from a slight mechanical impulse. This could be a future possibility: saving energy by switching large systems with small impulses, with ease and in a predictive manner. Content of the Work. In this work, we first explore in Sect. 2 what bifurcation analysis is, what its possible applications are, and what typical implementations look like in Matlab and Mathcad as examples of computer algebra simulation tools. Then, in Sect. 3, we give an idea of how the bridging of hard and soft science can be done using bifurcation analysis as a general computational tool for non-linear systems. Finally, in Sect. 4, we give a conclusion and outlook. Goals and Limitations of the Work. This work aims to investigate possible applications of bifurcation analysis, with an emphasis on a system-theoretical projection onto future interdisciplinary system applications of soft and hard sciences. The limitation of this work is that it can only give a glimpse of the direction, as (1) bifurcation analysis from chaos theory is extraordinarily theoretical and restricted to a small set of unique problems, and (2) the soft sciences are extremely qualitatively driven. The proof of such systems can therefore only be given by testing the stated theses in future practical simulation applications, with the suggested or functionally similar computing environments.
2 Bifurcation Analysis
The basic idea of the paper is how bifurcation analysis can be used to quantify dynamics that occur in soft-science or interdisciplinary applications linking disciplines that are wide apart, with one unifying method.
2.1 Applications of Bifurcation Analysis in Logistics and Other Research Fields
Several industrial and logistical applications of bifurcation theory are described in this section. Sharma [18] applied bifurcation analysis to behaviour characterisation, determination of sudden changes, and conceptual design support of aircraft. Sharma [18] also mentioned that bifurcation analysis could be applied to analyse the “highly non-linear behaviour” of, e.g., uncrewed air vehicles, whose complex dynamics remains poorly explored. In [22] the dynamics of milling operations was investigated using bifurcation analysis. According to this
analysis, the dynamics of a two-degrees-of-freedom milling system differ depending on cutting speed and depth. In [17] bifurcation analysis is applied in logistics to analyse the stability of supply chain networks by means of several elasticity parameters obtained from the characteristics of production and inventory rates as well as the supply network topology. There are also many applications of bifurcation analysis in macroeconomics. For instance, [1] investigated the presence of cyclicity in the Marshallian Macroeconomic Model (MMM) and the existence of a Hopf bifurcation with respect to the entry and exit equation parameter. The results show that cyclical behaviour can appear for special combinations of own-price, cross-price and income elasticities in a special case of the two-sector MMM. Here the presence of a Hopf bifurcation was confirmed in the “theoretically feasible parameter space” [1]. In [13] a model of logistic population growth with delayed production, due to the installation of new capital, was presented. Using bifurcation analysis, oscillating behaviour of the dynamics due to the “time-to-build production” was detected [13].
2.2 Bifurcation Analysis with Matlab
In Fig. 1 we see a Matlab implementation of a bifurcation diagram. Here the iterated function is defined self-referentially by f(x) = r · cos(π/2 − x · π), with control parameter r. The function yue_bifur can be found at [23]. The nonlinearity of the function shows in the state space, and the non-connectivity, or discontinuity, in the bifurcation diagram. A tool for the general use of bifurcation diagrams has been developed by [3] as Matlab source code, supported by an additional C-code implementation for the sake of computation speed. In this library many quite new analysis methods are available, such as Poincaré maps, which result from the intersection of the planes under investigation with the state space of the dynamics of the regarded model.
2.3 Bifurcation Analysis with Mathcad
We see an example of a bifurcation diagram in Fig. 2. The example is from Andreas Thiemer [19], and the respective programming, kindly provided by Andreas Thiemer, is shown in Fig. 2. Here the quantity of a duopoly is on the y-axis and the control parameter on the x-axis, yielding a bifurcation diagram.
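The same kind of diagram is straightforward to reproduce outside Matlab or Mathcad. The sketch below (in Haskell, used here only as a language-neutral illustration; it is not the code behind Fig. 1 or Fig. 2) generates the raw points of a bifurcation diagram for the logistic map x_{n+1} = r·x_n·(1 − x_n): for each value of the control parameter r it iterates the map, discards a transient, and records the remaining samples of the attractor.

-- Illustrative only: logistic-map bifurcation data, not the paper's Matlab code.
logistic :: Double -> Double -> Double
logistic r x = r * x * (1 - x)

-- (r, x) points of the bifurcation diagram: sweep r, drop a transient,
-- keep samples of the attractor.
bifurcationData :: [(Double, Double)]
bifurcationData =
  [ (r, x)
  | r <- [2.5, 2.505 .. 4.0]
  , let orbit = iterate (logistic r) 0.5
  , x <- take 100 (drop 500 orbit)
  ]

main :: IO ()
main = mapM_ (\(r, x) -> putStrLn (show r ++ "\t" ++ show x)) bifurcationData

Plotting the resulting (r, x) pairs reproduces the familiar Feigenbaum cascade of period doublings into chaos, analogous to the diagrams shown in the figures.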
3 Suggested Bridging of Soft and Hard Sciences with Bifurcation Analysis
It seems that bifurcation analysis is a perfect computational tool to bridge very different fields of application with a mathematical approach, and with a particular relation to dynamics in the general sense, meaning the time dependence of movement in space-time; dynamics proper is typically purely time-dependent. In Fig. 3 we propose a cycle of how bifurcation analysis can be used systematically to produce models that can be further developed. This would then be an
Fig. 1. Bifurcation diagram.
epistemological evolutionary cycle that can enhance human imagination based on hard computational facts, underpinning qualitative features with model calculations. It would thus be a near-general epistemological system design tool. In particular, it would support the emerging increase in interaction between the hard and soft sciences (cf. e.g. [6]). In [3] the approach is similar, and two cases are distinguished there: (1) the design case, and (2) the case of following the trace of the causing parameter backwards, which the authors call the identification problem and which is difficult because of its ambiguities. Concerning a future research programme, the crucial steps are (I) to translate linguistic and other soft-science terminology into mathematical models, and (II) to allow for more osmotic mathematical approaches (see e.g. [11,20,21]) for implementing non-linear models from which bifurcation diagrams can be calculated and related analyses made.
Fig. 2. Feigenbaum diagram in Mathcad 15 according to [16, 19].
Fig. 3. Bifurcation analysis as quantitative-qualitative model mediator, by means of computation (cycle between the dynamic model with its model data, the bifurcation analysis, and the qualitative explanation and comparison against real-world data).
One approach to this bridging function is (a) to apply logical structure implementations directly to a mathematical formulation, e.g. via the lambda “computatrix” calculus (see e.g. [10]). Another (b) could be to link words, as functional attractors, directly to quantification measures, as they relate to the communication process which, as Luhmann concluded, is the basis of society [14,15]. Language could be defined as a bifurcated space of words, which are then representations of real objects as their correspondences. A third approach (c) might be that specific words closely connected to mathematical notations are directly translated into a mathematical language that can be spoken aloud. When Luhmann, for example, talks about the difference of differences, this could be regarded as the differential operator of second order, or the second derivative, thereby relating to a back-coupling process that can be expressed as a graph or as a differential equation directly from the word. In any case, the possibility of speaking about mathematics is analogous to speaking out written
language, or writing down spoken language, generating new bifurcations of the media language for the informational or communication process.
4 Conclusion and Outlook
In this paper, we have screened several current applications of bifurcation analysis. Only in the last decades has bifurcation analysis spread more widely from its physics origins towards interdisciplinary fields. Even now there are only sparse, hard-science-driven approaches to applying this mathematics, although it is evident that many more interdisciplinary connections exist that would not only be useful for the soft sciences but can also be described by these approaches. Future applications will certainly appear in all industries and in societal domains such as social science, law, communication technologies and even philosophy. It is indeed a system theory, described with different kinds of symbolisation found throughout all sciences, and hence natively interdisciplinary as an approach. As an outlook, we have to bridge from the side of mathematical physics and from the side of the societal and soft sciences towards uniting interdisciplinary models of prediction and explanation. In the course of these approaches, the qualitative comes nearer to the quantitative, and hence “machining” becomes more feasible. The machining may consist of physical applications, computing applications, or both; these create emergence by intermediating in the process (cf. Fig. 3 and [9]). Computers are, in this respect, the new backbone of our creative forces, as they let us do hard science and interpret it qualitatively. By this, there may well be a change in technical and physical phenomena. While today we try to avoid complicated structures and daily changes at all costs, in the future this may be a virtue that is certainly worthwhile. The reason is quite amazing: those non-linear regions of complexity hide laws that we would use eagerly if we were able to know them. We would, for instance, like to have faster communication, e.g. instantaneous information transmission, without time, according to the Pauli principle [2,4], or production processes that use no excess energy. This, for example, is only possible in machines and applications that design resonance phenomena, as our brain does in order to produce our phenomenological behaviour as a species by dynamically stabilising increasingly higher levels of instability (cf. e.g. [5,7,8]). Our future work will therefore focus on applying bifurcation analysis to the societal field, i.e. to logical applications and societal interaction processes and their possible computational implementation. As a hard-science approach, bifurcation analysis is a reliable base for qualitative analyses that can be compared to macro phenomena occurring in society. For this, we plan to implement mathematical modelling of societal problems, using bifurcation analysis as a bridging tool.
References
1. Banerjee, S., Barnett, W.A., Duzhak, E.A., Gopalan, R.: Bifurcation analysis of Zellner's Marshallian macroeconomic model (2011)
2. Bell, J.S.: On the Einstein Podolsky Rosen paradox. Physics 1(3), 195–200 (1964)
3. Dhooge, A., Govaerts, W., Kuznetsov, Yu.A., Meijer, H.G.E., Sautois, B.: New features of the software MatCont for bifurcation analysis of dynamical systems. Math. Comput. Modell. Dyn. Syst. 14(2), 147–175 (2008). https://doi.org/10.1080/13873950701742754
4. Einstein, A., Podolsky, B., Rosen, N.: Can quantum-mechanical description of reality be considered complete? Phys. Rev. 47, 777–779 (1935)
5. Grossberg, S.: Conscious Mind, Resonant Brain: How Each Brain Makes a Mind. Oxford University Press, Oxford (2021)
6. Götschl, J.: Self-organization: New Foundations Towards a General Theory of Reality. Series A: Philosophy and Methodology of the Social Sciences, pp. 109–128. Kluwer Academic Publishers, Dordrecht, Theory and Decision Library edition (1995)
7. Heiden, B., Alieksieiev, V., Tonino-Heiden, B.: Selforganisational high efficient stable chaos patterns. In: Proceedings of the 6th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS, pp. 245–252. INSTICC, SciTePress (2021). https://doi.org/10.5220/0010465502450252
8. Heiden, B., Tonino-Heiden, B.: Emergence and solidification-fluidisation. In: Arai, K. (ed.) IntelliSys 2021. LNNS, vol. 296, pp. 845–855. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-82199-9_57
9. Heiden, B., Tonino-Heiden, B., Alieksieiev, V., Hartlieb, E.: Digitisation model innovation system. In: O'Dell, M. (ed.) 2021 10th International Conference on Industrial Technology and Management (ICITM), pp. 128–133. IEEE (2021). https://doi.org/10.1109/ICITM52822.2021.00030
10. Heiden, B., Tonino-Heiden, B., Alieksieiev, V., Hartlieb, E., Foro-Szasz, D.: Lambda computatrix (LC)—towards a computational enhanced understanding of production and management. In: Yang, X.-S., Sherratt, S., Dey, N., Joshi, A. (eds.) Proceedings of Sixth International Congress on Information and Communication Technology. LNNS, vol. 236, pp. 37–46. Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-2380-6_4
11. Heiden, B., Volk, M., Alieksieiev, V., Tonino-Heiden, B.: Framing artificial intelligence (AI) additive manufacturing (AM). Procedia Comput. Sci. 186, 387–394 (2021). https://doi.org/10.1016/j.procs.2021.04.161
12. Hilborn, R.C.: Chaos and Nonlinear Dynamics - An Introduction for Scientists and Engineers. Oxford University Press, New York (1994)
13. Juratoni, A., Bundau, O.: Hopf bifurcation analysis of the economical growth model with logistic population growth and delay. In: Annals of DAAAM for 2010 & Proceedings of the 21st International DAAAM Symposium, vol. 21, no. 1 (2010). ISSN: 1726-9679
14. Luhmann, N.: Die Gesellschaft der Gesellschaft, 10th edn. Suhrkamp Verlag, Frankfurt/Main (2018)
15. Luhmann, N.: Soziale Systeme, 17th edn. Suhrkamp Verlag AG (2018)
16. Puu, T.: Attractors, Bifurcations, and Chaos. Nonlinear Phenomena in Economics (2000). https://doi.org/10.1007/978-3-662-04094-2
17. Ritterskamp, D., Demirel, G., MacCarthy, B.L., Rudolf, L., Champneys, A.R., Gross, T.: Revealing instabilities in a generalized triadic supply network: a bifurcation analysis. 28(7), 073103 (2018). https://doi.org/10.1063/1.5026746
18. Sharma, S., Coetzee, E.B., Lowenberg, M.H., Neild, S.A., Krauskopf, B.: Numerical continuation and bifurcation analysis in aircraft design: an industrial perspective. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 373(2051), 20140406 (2015). https://doi.org/10.1098/rsta.2014.0406
19. Thiemer, A.: Duopoly - Part II: lost in fractals (2001). https://www.fh-kiel.de/fileadmin/data/wirtschaft/dozenten/thiemer_andreas/mcd/duopol2.pdf. Accessed 23 June 2022
20. Tonino-Heiden, B., Heiden, B., Alieksieiev, V.: Artificial life: investigations about a universal osmotic paradigm (UOP). In: Arai, K. (ed.) Intelligent Computing. LNNS, vol. 285, pp. 595–605. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-80129-8_42
21. Villari, M., Fazio, M., Dustdar, S., Rana, O., Ranjan, R.: Osmotic computing: a new paradigm for edge/cloud integration. IEEE Cloud Comput. 3, 76–83 (2016)
22. Weremczuk, A., Rusinek, R., Warminski, J.: Bifurcation and stability analysis of a nonlinear milling process. AIP Conf. Proc. 1922(1), 100008 (2018). https://doi.org/10.1063/1.5019093
23. Wu, Y.: Function yue_bifur: plots 1d bifurcation figure (2010). https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/26839/versions/1/previews/yue_bifur.m/index.html. Accessed 23 June 2022
Curious Properties of Latency Distributions Michal J. Gajda(B) Migamake Pte Ltd., Singapore, Singapore [email protected]
Abstract. We set out to model the performance of distributed systems analytically. We show that, under reasonable assumptions, we can generalize probability calculus to network latency distributions. These distributions may model both network connectivity and the computational performance of the system. We describe their algebra, show how to implement computations on these distributions, and get exact results in most cases. We apply this methodology to model the performance of AWS DynamoDB analytically, as well as that of a gossip algorithm. Keywords: Latency · Bandwidth-Latency product · Latency distribution · Performance estimation · Network performance
1 Introduction
Here we explain the network simulation context which gave rise to this paper, and give key references. Capacity-Insensitive Networking: Most network connections on the internet are called mice [1] for a good reason: they have a bandwidth × latency product of less than 12k, and are thus largely insensitive to the network's capacity. The deciding factor in the performance of these flows is therefore latency, hence we ignore capacity limitations for the remainder of this paper. Packet Loss Modelling using Improper CDFs: In order to accurately simulate capacity-insensitive network miniprotocols, we formally define the network latency distribution as an improper CDF (cumulative distribution function) of arrived messages over time. We call it an improper CDF because it does not end at 100%: some messages can be lost. Time-Limited Model: For practical purposes we ignore answers that are delivered after a certain deadline: that is, the network connection timeout. Starting with a description of its apparent properties, we identify their mathematical definitions, and ultimately arrive at an algebra of ΔQ [2] with basic operations that correspond to abstract interpretations of network miniprotocols [3]. This allows us to use objects from a single algebraic body to describe the behaviour of entire protocols as improper CDFs.
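As a small numerical illustration (the numbers are invented for exposition and do not come from the paper): a link that delivers 20% of messages within 1 ms, 70% within 2 ms and 95% within 3 ms, and loses the rest, has the improper CDF ΔQ(1 ms) = 0.2, ΔQ(2 ms) = 0.7 and ΔQ(t) = 0.95 for every t ≥ 3 ms; the missing 0.05 is exactly the loss rate, and no choice of deadline can raise the completion rate above it.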
2 Related Work
Then we discuss expanding the concept to obtain the most sensitive metrics of protocol and network robustness [4]. However, instead of a heuristic measure like effective graph resistance [5], we use a logically justified measure derived from the actual behaviour of the network. This is similar to network calculus [6], but uses simpler methods and a more logical description with improper latency distribution functions, instead of max-plus and min-plus algebras.1 Basic operations ∧ and ∨ are similar to last-to-finish and first-to-finish synchronizations [2]. This approach allows us to use abstract interpretation [3] of a computer program to get its latency distribution, or a single execution to approximate the latency distribution assuming the same packet loss profile.
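To preview these two operations on concrete numbers (a hand-worked illustration, assuming the two links behave independently): if two links have completion rates F(t) and G(t), last-to-finish synchronization is complete by time t only when both are, giving the pointwise product F(t)·G(t), whereas first-to-finish is complete unless both are still pending, giving 1 − (1 − F(t))(1 − G(t)). For F(t) = 0.9 and G(t) = 0.8 at some deadline t, the two compositions yield 0.72 and 0.98 respectively.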
3 Preliminaries
Nulls and Units of Multiplication. We will be interested in null and unit of a single multiplication for each modulus we will consider: class Unit α where unitE :: α class Null α where nullE :: α instance Unit Int where unitE = 1 instance Null Int where nullE = 0 3.1
Discrete Delay
Discrete delays are defined as: newtype Delay = Delay {unDelay :: Int} For ease of implementation, we express each function as a series of values for discrete delays. First value is for no delay. We define start ∈ T as smallest Delay (no delay). start :: Delay start = Delay 0 3.2
Power Series Representing Distributions
We follow [7] exposition of power series, but use finite series and shortcut evaluation: 1
We describe how it generalizes these max-plus and min-plus algebras later.
newtype Series α = Series { unSeries :: [α] } deriving ( Eq , Show, Read, Functor, Foldable, Applicative, Semigroup) To have a precise discrete approximation of distributions, we encode them as a series. Finite series can be represented by a generating function Ft : F (t) = f0 ∗ t0 + f1 ∗ t1 + f2 ∗ t2 + ... + fn ∗ tn Which is represented by the Haskell data structure: ft = Series [a0, a1, .., an] Differential Encoding and Cumulative Sum. For any probability distribution, we need a notion of integration, that coverts probability distribution function (PDF) into cumulative distribution function (CDF). Cumulative Sums computes sums of 1..n-th term of the series: cumsum :: Num α ⇒ Series α → Series α cumsum = Series . tail . scanl (+) 0 . unSeries After defining our integration operator, it is not time for its inverse. Differential encoding is lossless correspondent of discrete differences2 , but with first coefficient being copied. (Just like there was a zero before each series, so that we always preserve information about the first term.) This is backward antidifference 3 as defined by [8]. That makes it an inverse of cumsum. It is backward finite difference operator, as defined by [9]. diffEnc :: Num α ⇒ Series α → Series α diffEnc ( Series [ ]) = Series [ ] diffEnc ( Series σ ) = Series $ head σ : zipWith (-) (tail σ) σ So that diffEnc of CDF will get PDF, and cumsum of PDF will get CDF. Since we are only interested in answers delivered before certain deadline, we will sometimes cut series at a given index: cut :: Delay → Series α → Series α cut (Delay t) ( Series σ) = Series ( take t σ) instance IsList (Series α) where type Item (Series α) = α fromList = Series toList ( Series σ) = σ Series enjoy few different multiplication operators. Simplest is scalar multiplication: 2 3
Usually called finite difference operators. Antidifference is an inverse of finite difference operator. Backwards difference subtracts immediate predecessor from a successor term in the series.
infixl 7 .* – same precedence as * (.*) :: Num α ⇒ α → Series α → Series α γ .* ( Series ( f : fs)) = Series ( γ * f : unSeries ( γ.* Series fs)) .* (Series [])= Series [] F (t) = f0 ∗ t0 + f1 ∗ t1 + f2 ∗ t2 + ... + fn ∗ tn Second multiplication is convolution, which most commonly used on distributions: F (t) ⊗ G(t) = Στt =0 xt ∗ f (τ ) ∗ g(t − τ ) Wikipedia’s definition:
(f ⊗ g)(t) ≜ ∫_{−∞}^{∞} f(τ) g(t − τ) dτ.
Distribution is from 0 to +∞: Wikipedia’s definition: 1. First we fix the boundaries of integration: ∞ f (τ )g(t − τ ) dτ. (f ⊗ g)(t) 0
(Assuming f (t) = g(t) = 0 when t < 0.) 2. Now we change to discrete form: (f ⊗ g)(t) Σ0∞ f (τ )g(t − τ ) 3. Now we notice that we approximate functions up to term n: (f ⊗ g)(t) Σ0n fτ gt−τ . Resulting in convolution: F (t) ⊗ G(t) = Στt =0 xt ∗ f (τ ) ∗ g(t − τ ) infixl 7 ‘convolve‘ – like multiplication convolve :: Num α ⇒ Series α → Series α → Series α Series ( f :fs) ‘convolve‘ gg@ (Series (g:gs)) = Series (f *g : unSeries ( f .* Series gs + ( Series fs ‘convolve‘ gg))) = Series [ ] Series [] ‘convolve‘ ‘convolve‘ Series [] = Series [ ]
Elementwise multiplication, assuming missing terms are zero. (.*.) :: Num α ⇒ Series α → Series α → Series α Series α .*. Series β = Series (zipWith (*) α β) Since we use finite series, we need to extend their length when operation is done on series of different length. Note that for emphasis, we also allow convolution with arbitrary addition and multiplication: convolve :: ( α → α → α) → ( α → α → α) → Series α → Series α → Series α convolve (+) (*) ( Series (f:fs) ) gg @ (Series (g:gs)) = Series (f * g : zipWithExpanding (+) ( f .* gs) ( unSeries (convolve (+) (*) ( Series fs) gg))) where bs α .* bs = ( α*) ♦$ ( Series [ ]) = Series [ ] convolve ( Series []) = Series [ ] convolve 3.3
Expanding Two Series to the Same Length
We need a variant of zipWith that assumes that shorter list is expanded with unit of the operation given as argument: zipWithExpanding :: ( α → α → α) → [ α] → [α] → [α] zipWithExpanding f = go where go [] ys = ys – unit ‘f‘ y == y go xs [] = xs – x ‘f‘ unit == x go (x: xs) (y: ys) = (x ‘f‘ y):go xs ys Here we use extension by a given element e, which is 0 for normal series, or 1 for complement series.
Extend both series to the same length with placeholder zeros. Needed for safe use of complement-based operations. extendToSameLength e (Series α, Series β) = (Series resultA , Series resultB) where (resultA, resultB) = go α β go [ ] [] = ( [ ] , [] ) go ( β : bs) (γ: cs) = ( β: bs’ , γ: cs’) where ˜ ( bs’ , cs’) = go bs cs go ( β : bs) [ ] = ( β : bs , e :cs’) where ˜( bs’, cs’) = go bs [] go [ ] ( γ:cs) = ( e : bs’, γ:cs ) where ˜(bs’, ) = go [] cs In a rare case (CDFs) we might also prolong by the length of the last entry: We will sometimes want to extend both series to the same length with placeholder of last element. extendToSameLength’ ( Series α, Series β) = (Series resultA , Series resultB) where (resultA, resultB) = go α β go [ ] [ ] = ( [ ] , [ ]) go [ β ] [γ] = ( [ β] , [γ] ) go ( β : bs) [ γ ] = ( β : bs , γ : cs’) where ˜ ( bs’, cs’) = go bs [γ] go [ β ] ( γ:cs) = ( β : bs’ , γ:cs ) where ˜ ( bs’ , ) = go [ β ] cs go ( β : bs) ( γ :cs) = ( β : bs’ , γ:cs’) where ˜(bs’, cs’) = go bs cs
3.4
M. J. Gajda
Series Modulus
We can present an instance of number class for Series: instance Num α ⇒ Num ( Series α) where Series α + Series β = Series ( zipWithExpanding (+) α β) (*) = convolve abs = fmap abs signum = fmap signum fromInteger = ⊥ negate = fmap negate Given a unit and null elements, we can give unit and null element of a Series.4 instance Unit α ⇒ Unit (Series α) where unitE = [unitE ] instance Null α ⇒ Null (Series α) where nullE = [nullE ] We may be using Series of floating point values that are inherently approximate. In this case, we should not ever use equality, but rather a similarity metric that we can generate from the similar metric on values: instance Real α ⇒ Metric ( Series α) where α ‘distance‘ β = sqrt $ realToFrac $ sum $ fmap (λx → x*x) $ α- β similarityThreshold = 0.001 Note that generous similarity threshold of 0.001 is due to limited number of simulations we can quickly run when checking distributions in the unit tests (10k by default). For a Series of objects having complement, there is also special definition: instance Complement α ⇒ Complement (Series α) where complement = fmap complement
4
Note that we in this context we are mainly interested in null and unit of multiplication.
Square Matrices of Declared Size This is a simple description of square matrices with fixed size5 . First we need natural indices that are no larger than n: newtype UpTo ( ν ::Nat) = UpTo { unU pT o :: N } deriving ( Eq , Ord, Num , Typeable) The only purpose of safe indices is to enumerate them: allUpTo :: KnownNat ν ⇒ [UpTo ν] Armed with safe indices, we can define square matrices: newtype SMatrix (ν::Nat) α = SMatrix { unSMatrix :: DM.Matrix α } deriving ( Show, Eq , Functor, Applicative , Foldable , Traversable, Typeable, Generic) size :: KnownNat ν ⇒ SMatrix ν α → Int size (σ :: SMatrix ν α)= intVal ( Proxy :: Proxy ν) We also need to identity (unitE) and null matrices with respect to multiplication (nullE)). Definition of parametrized matrix multiplication (sMatMult) is standard, so we can test it over other objects with defined multiplication and addition-like operators. sMatMult :: KnownNat ν ⇒ (α → α → α) – ˆ addition → (α → α → α) – ˆ multiplication → SMatrix ν α → SMatrix ν α → SMatrix ν α One might also want to iterate over rows or columns in the matrix: rows, columns :: KnownNat ν ⇒ SMatrix ν α → [[α]] rows sm = [ [ sm ! ( i , j) — j ← allUpTo ] — i ← allUpTo ] columns sm = [ [ sm ! ( i , j) — i ← allUpTo ] — j ← allUpTo ]
4
Latency Distributions
4.1
Introducing Improper CDF
To define our key object, lets imagine a single network connection. In this document, we ignore capacity-related issues. So ΔQ(t) is probability distribution function of event arriving at some point of time including infinity which stands for the even never arriving, as shown in Fig. 1. We will often describe it as completion rate up to a deadline, which is improper cumulative distribution function (iCDF, since it may not reach the value of 1.0), for events arriving up until some point in time as shown on Fig. 2. 5
Note that we considered using matrix-static, but it does not have typesafe indexing.
Fig. 1. Latency probability distribution function
Fig. 2. Completion rate against deadline as improper cumulative distribution function
For the sake of practicality, we also mark a deadline as the last possible moment when we still care about messages. (Afterwards, we drop them, just like TCP timeout works.) For example, when only 0.99 of messages arrive at all within desired time t, and we silently drop those that arrive later. For each distribution, we will define deadline formally as d(t) = maxargt (ΔQ(t)) or such t for which our improper CDF reaches maximum value. We also define ultimate arrival probability formally as au (ΔQ) = max(ΔQ). Our improper CDFs are assumed to be always defined within finite subrange of delays, starting at 0. Ultimate arrival probability allows us to compare attenuation between two links. In the following we define domain of arrival probabilities as A ∈ [0, 1.0], which is probability. We also define domain of time or delays as T . We also call a domain of ΔQ functions as Q = (T → A). Below is Haskell specification of this datatype: newtype LatencyDistribution α = LatencyDistribution { pdf :: Series α }
The representation above holds PDF (probability density function). Its cumulative distribution function can be trivially computed with running sum: cdf :: Num α ⇒ LatencyDistribution α → Series α cdf = cumsum . pdf Since it is common use complement of CDF, we can have accessor for this one too: complementCDF :: Probability α ⇒ LatencyDistribution α → Series α complementCDF = complement . cumsum . pdf 4.2
4.2 Canonical Form
Sometimes we need to convert a possibly improper LatencyDistribution into its canonical representation. We call a LatencyDistribution canonical when (i) it is a valid improper probability distribution, so its sum does not exceed 1.0; (ii) it does not contain trailing zeros after the first element (which are redundant). This definition assumes we have a finite series, and it assures that any distribution has a unique representation.

  canonicalizeLD :: Probability a => LatencyDistribution a -> LatencyDistribution a
  canonicalizeLD = LatencyDistribution
                 . Series
                 . assureAtLeastOneElement
                 . dropTrailingZeros
                 . cutWhenSumOverOne 0.0
                 . unSeries
                 . pdf
    where
      cutWhenSumOverOne aSum []     = []
      cutWhenSumOverOne aSum (x:xs)
        | aSum + x > 1.0            = [1.0 - aSum]
        | otherwise                 = x : cutWhenSumOverOne (aSum + x) xs
      assureAtLeastOneElement []    = [0.0]
      assureAtLeastOneElement other = other
      dropTrailingZeros             = reverse . dropWhile (== 0.0) . reverse
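As a self-contained illustration of these rules on plain lists (our own toy version, not the paper's module): truncate once the running sum would exceed 1.0, drop trailing zeros, and keep at least one element.

  canonicalize :: [Double] -> [Double]
  canonicalize = atLeastOne . dropTrailingZeros . cut 0.0
    where
      cut _   []     = []
      cut acc (x:xs)
        | acc + x > 1.0 = [1.0 - acc]
        | otherwise     = x : cut (acc + x) xs
      dropTrailingZeros = reverse . dropWhile (== 0.0) . reverse
      atLeastOne []     = [0.0]
      atLeastOne other  = other

  -- e.g. canonicalize [0.5, 0.7, 0.0]        == [0.5, 0.5]
  --      canonicalize [0.25, 0.25, 0.0, 0.0] == [0.25, 0.25]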
4.3 Construction from PDF and CDF
We use canonicalizeLD to make sure that every distribution is kept in canonical form (explained above). We might also want constructors that create a LatencyDistribution from a series that represents a PDF or a CDF:

  fromPDF :: Probability a => Series a -> LatencyDistribution a
  fromPDF = canonicalizeLD . LatencyDistribution

To create a LatencyDistribution from a CDF we need diffEnc (differential encoding, the backward finite difference operator from the Series module):

  fromCDF :: Probability a => Series a -> LatencyDistribution a
  fromCDF = fromPDF . diffEnc

Similarly, we can create a LatencyDistribution from the complement of a CDF:

  fromComplementOfCDF :: Probability a => Series a -> LatencyDistribution a
  fromComplementOfCDF = fromCDF . complement
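On plain lists (again a standalone sketch, not the paper's diffEnc), the backward finite difference undoes the running sum: the first element is kept, later elements are successive differences.

  diffEncL :: Num a => [a] -> [a]
  diffEncL xs = zipWith (-) xs (0 : xs)

  -- e.g. diffEncL (cumsumL [0.25, 0.5, 0.125]) == [0.25, 0.5, 0.125]
  --      diffEncL [0.25, 0.75, 0.875]          == [0.25, 0.5, 0.125]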
4.4 Intuitive Properties of Latency Distributions
1. We can define a few linear operators on ΔQ (for exact definitions, see the next section):
   A. Stretching in time – ignored here.
   B. Delaying by t – composition with wait, where
      wait(t_d)(t) = 0 for t < t_d, and 1.0 for t = t_d.
   C. Scaling arrival probability – in other words, attenuation.
2. We distinguish a special distribution that represents a null delay, the neutral element of sequential composition (; or afterLD), where we pass every message with no delay:
      preserved(1) = wait(0) = 1_Q
3. We can say that one ΔQ is no worse than another when its improper CDF values are never less than the other's after making them fit a common domain:
      ΔQ₁ ≥_Q ΔQ₂  ≡  ∀t. X(ΔQ₁)(t) ≥ X(ΔQ₂)(t)
   where X(ΔQ) is defined as:
      X(ΔQ)(t) ≡ ΔQ(t) for t ≤ d(ΔQ), and ΔQ(d(ΔQ)) otherwise.
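The comparison can be illustrated on discrete iCDFs stored as plain lists of cumulative probabilities (a standalone sketch assuming non-empty inputs): extend both curves to a common length by repeating their final value, then compare pointwise.

  noWorseThan :: [Double] -> [Double] -> Bool
  noWorseThan q1 q2 = and (zipWith (>=) (extend q1) (extend q2))
    where
      n        = max (length q1) (length q2)
      extend q = take n (q ++ repeat (last q))   -- hold the final value constant

  -- e.g. noWorseThan [0.5, 0.75, 0.875] [0.25, 0.5, 0.75, 0.75] == True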
4.5 Basic Operations on ΔQ
To model connections of multiple nodes and network protocols, we need two basic operations: sequential and parallel composition. Interestingly, both operations form associative-commutative monoids (with a unit that changes nothing) with a null element (a zero that nullifies the other argument); however, their null and neutral elements swap places.

1. Sequential composition (;) or afterLD⁶: given ΔQ₁(t) and ΔQ₂(t) of two different connections, we should be able to compute the latency function for routing the message through the pair of these connections in sequence: ΔQ₁(t);ΔQ₂(t).
   – associative: ΔQ₁(t);[ΔQ₂(t);ΔQ₃(t)] = [ΔQ₁(t);ΔQ₂(t)];ΔQ₃(t)
   – commutative: ΔQ₁(t);ΔQ₂(t) = ΔQ₂(t);ΔQ₁(t)
   – neutral element is 1_Q or noDelay, so: ΔQ(t);1_Q = 1_Q;ΔQ(t) = ΔQ(t)
   – null element is 0_Q or allLost, so: ΔQ(t);0_Q = 0_Q;ΔQ(t) = 0_Q

  afterLD :: Probability a
          => LatencyDistribution a
          -> LatencyDistribution a
          -> LatencyDistribution a
  rd1 `afterLD` rd2 = fromPDF $ pdf rd1 `convolve` pdf rd2

2. Alternative selection ∨: given ΔQ₁(t) and ΔQ₂(t) of two different connections, we should be able to compute the latency function for routing the message through the pair of these connections in parallel: ΔQ₁(t) ∨ ΔQ₂(t).
   * associative: ΔQ₁(t) ∨ [ΔQ₂(t) ∨ ΔQ₃(t)] = [ΔQ₁(t) ∨ ΔQ₂(t)] ∨ ΔQ₃(t)
⁶ Sometimes named.
   * commutative: ΔQ₁(t) ∨ ΔQ₂(t) = ΔQ₂(t) ∨ ΔQ₁(t)
   * neutral element is 0_Q or allLost, so: ΔQ(t) ∨ 0_Q = 0_Q ∨ ΔQ(t) = ΔQ(t)
   * null element of firstToFinish is 1_Q: ΔQ(t) ∨ 1_Q = 1_Q ∨ ΔQ(t) = 1_Q
   * monotonically increasing: ΔQ₁(t) ∨ ΔQ₂(t) ≥ ΔQ₁(t)

Here is the Haskell code for a naive definition of these two operations. The alternative of two completion distributions corresponds to an event that is the earliest possible conclusion of one of two alternative events. That can be easily expressed with improper cumulative distribution functions:

   P_{min(a,b)}(x ≤ t) = 1 − (1 − P_a(x ≤ t)) ∗ (1 − P_b(x ≤ t))

That is, the event min(a, b) occurs by time t when a ≤ t or b ≤ t; equivalently, it fails to occur (top complement: 1 − ...) exactly when:
   • a did not occur, 1 − P_a(x ≤ t),
   • and b did not occur, 1 − P_b(x ≤ t).

  firstToFinishLD :: Probability a
                  => LatencyDistribution a
                  -> LatencyDistribution a
                  -> LatencyDistribution a
  rd1 `firstToFinishLD` rd2 =
      fromComplementOfCDF $ complementCDF rd1' .*. complementCDF rd2'
    where
      (rd1', rd2') = extendToSameLengthLD (rd1, rd2)

Notes:
1. Since we model this with finite discrete series, we first need to extend them to the same length.
2. Using the fact that cumsum is the discrete correspondent of integration, and diffEnc is its direct inverse (backward finite difference), we can try to differentiate this to get the PDF directly:

   P_a(x ≤ t) = Σ_{x=0}^{t} P_a(x)        ∇(Σ_{x=0}^{t} P_a(x)) = P_a(t)

In the continuous domain, one can also differentiate both sides of the equation above to check whether we can save computations by computing the PDF directly. Unfortunately, instead of 2× cumulative sum operations, 1× elementwise multiplication, and 1× differential encoding operation, we would still need the same 2× cumulative sums, plus 3× pointwise additions, 3× pointwise multiplications, and two complements. So we get an equation that is less obviously correct and more computationally expensive. The code would look like:

  rd1 `firstToFinishLD` rd2 = canonicalizeLD $ LatencyDistribution {
      pdf = rd1' .*. complementCDF rd2'
          + rd2' .*. complementCDF rd1'
          + rd1' .*. rd2'
    }
    where
      (rd1', rd2') = extendToSameLengthLD (rd1, rd2)

  complement :: Series Probability -> Series Probability
  complement = fmap (1.0 -)

Note that complement above is correct only if both lists are of the same length. In order to use this approach here, we would need to prove that cumsum and diffEnc correspond to the integration and differentiation operators for the discrete time domain.

Now let's define the neutral elements of both operations above:

  preserved :: Probability a => a -> LatencyDistribution a
  preserved a = LatencyDistribution { pdf = Series [a] }

  allLostLD, noDelayLD :: Probability a => LatencyDistribution a
  allLostLD = preserved 0.0
  noDelayLD = preserved 1.0

Here:
– allLost indicates that no message ever arrives through this connection;
– noDelay indicates that all messages always arrive without delay.

3. Conjunction of two different actions executed simultaneously in parallel, which waits until both of them complete:

   P_{max(a,b)}(x ≤ t) = P_a(x ≤ t) ∗ P_b(x ≤ t)

  lastToFinishLD :: Probability a
                 => LatencyDistribution a
                 -> LatencyDistribution a
                 -> LatencyDistribution a
  rd1 `lastToFinishLD` rd2 = fromCDF $ cdf rd1' .*. cdf rd2'
    where
      (rd1', rd2') = extendToSameLengthLD (rd1, rd2)

(An attempt to differentiate these by parts also leads to a more complex equation: rd1 .*. cumsum rd2 + rd2 .*. cumsum rd1.) Now we can make an abstract interpretation of protocol code to derive the corresponding improper CDF of message arrival. It is also:
– commutative
– associative
– with neutral element noDelay

4. Failover A <t> B, when an action is attempted for a fixed period of time t, and if it does not complete in this time, the other action is attempted:

  failover deadline rdTry rdCatch =
      fromPDF $ initial <> fmap (remainder *) (pdf rdCatch)
    where
      initial   = cut deadline $ pdf rdTry
      remainder = 1 - sum initial

The algebraic properties of this operator are clear:
– Certain failure converts the deadline into a delay: fail <t> A = wait(t);A
– Failover to certain failure only cuts the latency curve: A <t> fail = cut_t A
– Certain success ignores the deadline: 1_Q <t> A = 1_Q when t > 0
– Failover with no time left means discarding the initial operation: A <0> B = B
– When the deadline is not shorter than the maximum latency of the left argument, it is similar to the alternative, with an extra delay before the second argument: A <t> B = A ∨ wait(t);B when t > d(A)

5. Retransmission without rejection of previous messages, A <t>> B, has slightly different algebraic properties, with an uncut left argument:
   A <t>> B = A ∨ wait(t);B
   A <0>> B = A ∨ B
   A <t>> fail = A
   fail <t>> A = A

6. Some approaches [2] propose an operator for probabilistic choice between scenario A with probability p and scenario B with probability 1 − p. In this work we assume that the only source of non-determinism is latency, for example if protocols use the unique agency property like those used in the Cardano network layer [10,11].

  infixr 7 `after`
  infixr 5 `firstToFinish`
  infixr 5 `lastToFinish`

  instance Probability a => TimeToCompletion (LatencyDistribution a) where
    firstToFinish = firstToFinishLD
    lastToFinish  = lastToFinishLD
    after         = afterLD
    delay         = delayLD
    allLost       = allLostLD
    noDelay       = noDelayLD
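The two compositions can be illustrated numerically on plain lists standing in for Series (a standalone sketch, not the paper's convolve or .*. operators): sequential composition convolves the PDFs, while first-to-finish multiplies the complements of CDFs that were already extended to the same length.

  -- Discrete convolution of two PDFs: index k sums products of indices i and k-i.
  convolvePDF :: [Double] -> [Double] -> [Double]
  convolvePDF xs ys =
    [ sum [ xs !! i * ys !! (k - i)
          | i <- [max 0 (k - length ys + 1) .. min k (length xs - 1)] ]
    | k <- [0 .. length xs + length ys - 2] ]

  -- First-to-finish on CDFs of equal length: 1 - (1 - a)*(1 - b) pointwise.
  firstToFinishCDF :: [Double] -> [Double] -> [Double]
  firstToFinishCDF = zipWith (\a b -> 1 - (1 - a) * (1 - b))

  -- e.g. convolvePDF [0.0, 1.0] [0.5, 0.5]      == [0.0, 0.5, 0.5]
  --      firstToFinishCDF [0.5, 0.75] [0.5, 1.0] == [0.75, 1.0]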
General Treatment of Completion Distribution over Time. Whether we aim for the minimum delay distribution of a message over a given connection, ΔQ(t), or the minimum time of propagation of the message over the entire network (ΔR(t), reachability), we still have a distribution of completion over time with standard operations.
We will need a standard library for treating these to speed up our computations. We can also define a mathematical ring of (probability, delay) pairs. Note that LatencyDistribution is a modulus over the ring R, with after as multiplication and whicheverIsFaster as addition. Then noDelay is the neutral element of multiplication (unit or one), and allLost is the neutral element of addition.⁷ Note that both of these binary operators also give rise to two almost-scalar multiplication operators:

  scaleProbability :: Probability a
                   => a
                   -> LatencyDistribution a
                   -> LatencyDistribution a
  scaleProbability a = after $ preserved a

  scaleDelay :: Probability a
             => Delay
             -> LatencyDistribution a
             -> LatencyDistribution a
  scaleDelay t = after $ delayLD t

  delayLD :: Probability a => Delay -> LatencyDistribution a
  delayLD n = LatencyDistribution
            $ Series
            $ [0.0 | _ <- [(0 :: Delay) .. n - 1]] <> [1.0]

To compare distributions represented by series of approximate values we need approximate equality:

  instance (Metric a, Num a, Real a)
        => Metric (LatencyDistribution a) where
    LatencyDistribution l `distance` LatencyDistribution m =
      realToFrac $ sum $ fmap (^2) $ l - m
    similarityThreshold = 1 / 1000
⁷ This field definition will be used for multiplication of connection matrices.
We choose 0.001 as the similarity threshold (it should depend on the number of samples).

  instance Unit a => Unit (LatencyDistribution a) where
    unitE = LatencyDistribution (Series [unitE])

  instance Null a => Null (LatencyDistribution a) where
    nullE = LatencyDistribution (Series [nullE])
4.6 Bounds on Distributions
Note that we can define bounds on LatencyDistribution that behave like functors over the basic operations from the TimeToCompletion class.
– The upper bound on a distribution is the latest possible time⁸:

  newtype Latest = Latest { unLatest :: SometimeOrNever }
    deriving (Eq, Ord, Show)

  newtype SometimeOrNever =
      SometimeOrNever { unSometimeOrNever :: Maybe Delay }
    deriving (Eq)

These estimates have the property that we can easily compute the same operations on the estimates, without really computing the full LatencyDistribution.
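A standalone sketch of the assumed semantics (these definitions are ours, not the paper's instance): for an upper bound on completion time, sequential composition adds the bounds, first-to-finish takes the smaller bound, last-to-finish the larger one; Nothing stands for "never completes".

  type Bound = Maybe Int   -- upper bound on delay; Nothing means "never"

  afterB, firstB, lastB :: Bound -> Bound -> Bound
  afterB a b = (+) <$> a <*> b          -- never, if either part never completes
  firstB Nothing  y        = y          -- the other bound alone suffices
  firstB x        Nothing  = x
  firstB (Just x) (Just y) = Just (min x y)
  lastB  a b = max <$> a <*> b          -- both must complete

  -- e.g. afterB (Just 3) (Just 5) == Just 8
  --      firstB Nothing (Just 5)  == Just 5
  --      lastB (Just 3) Nothing   == Nothing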
5 Representing Networks
The adjacency matrix is a classic representation of a network graph, where the i-th row corresponds to the outbound edges of node i, and the j-th column corresponds to the inbound edges of node j. So A_{i,j} is 1 when the edge is present, and 0 when it is not. It is common to store only the upper triangular part of the matrix, since:
– it is symmetric for undirected graphs;
– it should have 1 on the diagonal, since every node is connected to itself.
We use this trick to avoid double counting routes with different directions. So the network connectivity matrix:
– has units on the diagonal;
– has connectivity information between nodes i and j, for j > i, in element a_{i,j}.
⁸ Here liftBinOp is for lifting an operator to a newtype.
Generalizing this to ΔQ-matrices, we may be interested in:
– whether Aⁿ correctly mimics the shortest path between nodes (Earliest);
– whether Aⁿ correctly keeps paths shorter than n;
– for a strongly connected graph, there should exist n ≤ dim(A) such that Aⁿ has non-null elements in its upper triangular section.
A more rigorous formulation is:
   R⁰(A) = 1
   Rⁿ(A) = Rⁿ⁻¹(A) ∗ A
where:
– 1 or Id denotes the unit adjacency matrix, that is, a matrix where every node is connected with itself but with no other node, and these connections have no delay at all;
– A is the connection matrix as defined above, with A_{i,j} the distribution for a transmission of a single packet from the i-th to the j-th node. For pre-established TCP this matrix should be symmetric.
Our key metric will be the diffusion or reachability time of the network, ΔR(t), which is conditioned by the quality of the connection curves ΔQ(t) and the structure of the network graph. Later in this section, we discuss how ΔR(t) encompasses other plausible performance conditions on the network. This gives us an interesting example of using the matrix method over a modulus, not a group.

Reachability of Network Broadcast or ΔR(t). The reachability curve ΔR(t), or diffusion time, is plotted as the distribution of the event where all nodes are reached by a broadcast from a committee node, against time. We want to sum the curve for all possible core nodes, by picking a random core node. The area under the curve tells us the overall quality of the network. When the curve reaches a 100% rate, we have a strongly connected network, which should eventually always be the case after the necessary reconfigurations. Note that when running experiments on multiple networks, we will need to indicate when we show average statistics for multiple networks, and when we show a statistic for a single network.

Description of Network Connectivity Graph in Terms of ΔQ. The traditional way of describing graphs is by an adjacency matrix, where 0 means there is no edge, and 1 means that there is an active edge. We may generalize it to the unreliable network connections described above by using ΔQ instead of a binary value. So each diagonal element of the network connection matrix A will be noDelay, and A_{ij} will represent the connection quality for messages sent from node i to node j. That allows us to generalize typical graph algorithms to algorithms executed on network matrices:
1. If the series Rⁿ(A) converges to a matrix of non-zero (non-allLost) values in all cells in a finite number of steps, we consider the graph to be strongly connected [12]. Matrix multiplication uses (;, ∨) instead of (∗, +): sequential composition afterLD in place of multiplication, and alternative selection firstToFinish in place of addition. When it exists, the limit of the series Rⁿ(A) = Aⁿ is called A∗. In case of non-zero delays outside the diagonal, we may also consider convergence for delays up to t. NOTE: we need to add an estimate of convergence for cutoff time t and number of iterations n, provided that the least delay is in some relation to n∗t. Also note that this series converges to the ΔQ on a single shortest path between each two nodes. That means that we may call this matrix R_min(t), or the optimal diffusion matrix.

  optimalConnections :: (Probability a, KnownNat n, Real a, Metric a)
                     => SMatrix n (LatencyDistribution a)
                     -> SMatrix n (LatencyDistribution a)
  optimalConnections a = converges (fromIntegral $ size a) (|*| a) a

Of course, this requires a reasonable metric and definition of convergence. We will use this to define the path with the shortest ΔQ. It corresponds to the situation where all nodes broadcast the value from any starting point i for the duration of n retransmissions.⁹

2. Considering two nodes, we may consider the delay introduced by retransmissions in a naive miniprotocol:
– we have two nodes, a sender and a receiver;
– the sender sends the message once per period equal to the maximum network latency deadline;
– the message is resent if the receiver fails to send back confirmation of receipt ...
Assuming latency of the connection l, and timeout t > d(l), we get the simple solution: μX.l;l;X

3. We need to consider further examples of how our metrics react to issues detected by typical graph algorithms. We define a matrix multiplication that uses firstToFinish in place of addition and after in place of multiplication:

  (|*|) :: (Probability a, KnownNat n)
        => SMatrix n (LatencyDistribution a)
        -> SMatrix n (LatencyDistribution a)
        -> SMatrix n (LatencyDistribution a)
  (|*|) = sMatMult firstToFinish after

⁹ That we do not reduce loss over remainder yet?
Note that to measure convergence of the process, we need a notion of distance between two matrices. Here, we use the Frobenius distance between matrices, parametrized by the notion of distance between any two matrix elements.

  instance (Metric a, KnownNat n) => Metric (SMatrix n a) where
    a `distance` b =
      sqrt $ sum [ square ((a ! (i, k)) `distance` (b ! (i, k)))
                 | i <- allUpTo, k <- allUpTo ]
      where square x = x * x
    similarityThreshold = similarityThreshold @a
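The converges helper used by optimalConnections is not shown in the excerpt; a minimal sketch of such a fixed-point iteration, with the distance function and threshold passed explicitly rather than through the Metric class (so the names and parametrization here are our assumption), could look like:

  -- Iterate a step function at most n times, stopping early once two
  -- consecutive results are closer than the given threshold.
  convergesBy :: (a -> a -> Double)  -- distance between iterates
              -> Double              -- similarity threshold
              -> Int                 -- maximum number of iterations
              -> (a -> a)            -- step function, e.g. (|*| a)
              -> a -> a
  convergesBy dist eps n step x
    | n <= 0           = x
    | dist x x' <= eps = x'
    | otherwise        = convergesBy dist eps (n - 1) step x'
    where x' = step x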
6 Histogramming
To provide histograms of the average number of nodes reached by the broadcast, we need to define additional operations: (a) the sum of mutually exclusive events, and (b) k-out-of-n synchronization, which generalizes the k-out-of-n combinatorial symbol to a series of unequal probability distributions.
6.1 Sum of Mutually Exclusive Events
First we need to define the precise sum ⊕ of two events that are mutually exclusive. That is different from firstToFinish, which assumes that they are mutually independent.

  infixl 5 `exSum` -- like addition

  class ExclusiveSum a where
    exAdd :: a -> a -> a

Given a definition of exclusive sum for single events, and the existence of a neutral element of addition, we can easily expand the definition to latency distributions:
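The lifted definition is not legible in the extracted text; a plausible standalone sketch on plain PDF lists (assuming the exclusive sum is pointwise addition after extending both series to a common length) is:

  exSumPDF :: [Double] -> [Double] -> [Double]
  exSumPDF xs ys = zipWith (+) (pad xs) (pad ys)
    where
      n     = max (length xs) (length ys)
      pad s = take n (s ++ repeat 0.0)   -- extend the shorter series with zeros

  -- e.g. exSumPDF [0.25, 0.25] [0.125] == [0.375, 0.25]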
6.2 K-out-of-N Synchronization of Series of Events aₖ
For histogramming the fraction of events that have been delivered within a given time, we use a generalization of the recursive formula for the binomial coefficient (n over k):

   F^{(aⁿ)}_k(t) = (aₙ ∧ F^{(aⁿ⁻¹)}_{k−1}(t)) ⊕ (āₙ ∧ F^{(aⁿ⁻¹)}_k(t))

Here aⁿ⁻¹ is the finite series without its last term (ending at index n − 1).
It is more convenient to treat F^{(aⁿ)}_k(t) as a Series with indices ranging over k. Then we see the following equation:

   F^{(aⁿ)} = [āₙ, aₙ] ⊗ F^{(aⁿ⁻¹)}

where:
– ⊗ is convolution;
– [āₙ, aₙ] is a two-element series having the complement of aₙ as its first term and aₙ as its second term.
We implement it as a series of (n over k) with the parameter given as the series aₙ, and indices ranging over k:

  kOutOfN :: (TimeToCompletion a, ExclusiveSum a, Complement a)
          => Series a -> Series a
  kOutOfN (Series [])       = error "kOutOfN of empty series"
  kOutOfN (Series [x])      = [complement x, x]
  kOutOfN (Series (x : xs)) = [x] `convolution` Series xs
    where convolution = convolve exAdd lastToFinish `on` kOutOfN
Fraction of Reached Nodes. Now, for a connection matrix A, each row corresponds to a vector of latency distributions for the individual nodes. Naturally, the source node is indicated as a unit on the diagonal. We can use kOutOfN to transform the series corresponding to a single row vector into a distribution of latencies for reaching k out of n nodes. Note that this new series will have indices corresponding to the number of nodes reached instead of node indices:

  nodesReached :: (Probability a, ExclusiveSum a)
               => Series (LatencyDistribution a)
               -> Series (LatencyDistribution a)
  nodesReached = kOutOfN

For this we need to define complement for LatencyDistribution:

  instance Complement a => Complement (LatencyDistribution a) where
    complement (LatencyDistribution s) =
      LatencyDistribution (complement <$> s)
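The recursion can be illustrated on scalar probabilities instead of latency-distribution series (a standalone sketch where ∧ and ⊕ become multiplication and addition): starting from [1], repeatedly convolve with [1−p, p] for each event; the k-th entry of the result is the probability that exactly k of the events occurred (a Poisson-binomial distribution).

  kOutOfNProb :: [Double] -> [Double]
  kOutOfNProb = foldl step [1.0]
    where
      -- convolving with [1-p, p]: shift-and-add of the scaled accumulator
      step acc p = zipWith (+) (map (* (1 - p)) acc ++ [0])
                               (0 : map (* p) acc)

  -- e.g. kOutOfNProb [0.5, 0.5]  == [0.25, 0.5, 0.25]
  --      kOutOfNProb [0.5, 0.25] == [0.375, 0.5, 0.125]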
6.3 Averaging Broadcast from Different Nodes
Given a connection matrix A of the broadcast iterated n times, we may want a histogram of the distribution of the fraction of nodes reached for a random selection of the source node. We can perform this averaging with the exclusive sum operator, followed by pointwise division of the elements by the number of distributions summed:

  averageKOutOfN :: (KnownNat n, ExclusiveSum a, Probability a, Show a)
                 => SMatrix n (LatencyDistribution a)
                 -> Series (LatencyDistribution a)
  averageKOutOfN m = average $ nodesReached . Series <$> rows m

  average :: (ExclusiveSum a, Probability a, Show a)
          => [Series (LatencyDistribution a)]
          -> Series (LatencyDistribution a)
  average aList = scaleLD <$> exSum aList
    where
      scaleLD :: Probability a => LatencyDistribution a -> LatencyDistribution a
      scaleLD = scaleProbability (1 / fromIntegral (length aList))
7 Summary
We have shown how a few lines of Haskell code can be used to accurately model network latency and get n-hop approximations of packet propagation. It turns out they can also be used to model task completion distributions for an all-around estimation of software completion time.
7.1 What About Capacity-Limited Networks?
It turns out that most real network traffic is latency-limited and dominated by mice connections: that is, connections that never have a bandwidth-latency product greater than 12 kB*s. That means that our approximation is actually useful for most of the flows in real networks, even though real connections have limited capacity!
7.2 Relation of Latency to Other Plausible Metrics of Network Performance
One can imagine other key properties that a network must satisfy:
– That, absent permanent failures, the network will reach full connectivity. That corresponds to the situation where, given that the ΔQ(t) iCDF ultimately reaches 100% delivery probability for some delay, ΔR(t) will also always reach 100%. Moreover, the ΔR(t) metric allows us to put a deadline on reaching full connectivity in a convenient way.
– That, given a fixed limit on the rate of nodes joining and leaving the network, we also have a deadline by which ΔR(t) reaches a fixed delivery rate R_SLA within deadline t_SLA.
– That, given conditions on the minimum average quality of connections ΔQ(t), and a fixed rate of adversary nodes r_adv, we can still guarantee that the network reaches reachability R_SLA.
– That there are conditions for which ΔR(t) always reaches almost optimal reachability, defined by a given ratio o ∈ (0.9, 1.0), such that ΔR(o ∗ t) ≥ max(ΔR_optimal(t)/o). In other words: within a deadline o-times longer than the time to reach optimal reachability in an optimal network, we reach connectivity of no less than o-times the connectivity of the optimal network.
7.3 Interesting Properties
We note that the moduli representing latency distributions have properties that allow for efficient estimation by bounds that conform to the same laws. Square matrices of these distributions, or of their estimates, can be used to estimate network propagation and reachability properties. That makes for an interesting class of algebras that can be used as a demonstration of moduli to undergraduates, and also allows us to introduce them to latency-limited performance, which is characteristic of most of modern internetworking.
7.4 Future Work
We would like to apply these methods of latency estimation to modelling the most adverse scenarios: when a hostile adversary aims to mount a denial-of-service attack by delaying network packets [13].
Glossary
– t ∈ T – time since sending the message, or initiating a process
– ΔQ(t) – response rate of a single connection after time t (chance that the message was received by time t)
– ΔR(t) – completion rate of a broadcast to the entire network (rate of nodes expected to have seen the result by time t)
– rate of packets that are either dropped or arrive after the latest reasonable deadline we chose
References
1. Azzana, Y., Chabchoub, Y., Fricker, C., Guillemin, F., Robert, P.: Adaptive algorithms for identifying large flows in IP traffic (2009). https://arxiv.org/abs/0901.4846
2. Bradley, J.T.: Towards reliable modelling with stochastic process algebras (1999)
3. Nielson, F., Riis Nielson, H., Hankin, C.: Principles of Program Analysis. Springer, Berlin (1999). https://doi.org/10.1007/978-3-662-03811-6
4. Ellens, W., Kooij, R.E.: Graph measures and network robustness. CoRR abs/1311.5064 (2013)
5. Ellens, W., Spieksma, F.M., Mieghem, P.V., Jamakovic, A., Kooij, R.E.: Effective graph resistance. Linear Algebra Appl. 435, 2491–2506 (2011). https://doi.org/10.1016/j.laa.2011.02.024
6. Le Boudec, J.-Y., Thiran, P.: Network Calculus: A Theory of Deterministic Queuing Systems for the Internet. Springer, Berlin (2001). https://doi.org/10.1007/3-540-45318-0
7. McIlroy, M.D.: Power series, power serious. J. Funct. Program. 9, 323–335 (1999)
8. Wikipedia contributors: Indefinite sum — Wikipedia, the free encyclopedia (2019)
9. Wikipedia contributors: Finite difference — Wikipedia, the free encyclopedia (2019)
10. Haeri, S.H., Thompson, P., Van Roy, P., Hammond, K.: Mind your outcomes: quality-centric systems development. https://www.preprints.org/manuscript/202112.0132/v3. https://doi.org/10.20944/preprints202112.0132.v3
11. Brunetta, C., Larangeira, M., Liang, B., Mitrokotsa, A., Tanaka, K.: Turn-based communication channels. In: Huang, Q., Yu, Y. (eds.) Provable and Practical Security, pp. 376–392. Springer, Cham (2021)
12. O'Connor, R.: A very general method of computing shortest paths (2011). http://r6.ca/blog/20110808T035622Z.html
13. Anderson, C., Giannini, P., Drossopoulou, S.: Towards type inference for JavaScript. In: Black, A.P. (ed.) ECOOP 2005. LNCS, vol. 3586, pp. 428–452. Springer, Heidelberg (2005). https://doi.org/10.1007/11531142_19
Multicloud API Binding Generation from Documentation
Gabriel Araujo, Vitor Vitali Barrozzi, and Michal J. Gajda
Migamake Pte Ltd., Singapore, Singapore
[email protected]
Abstract. We present industry experience from implementing a retargetable cloud API binding generator. The analysis is implemented in Haskell, using type classes, types à la carte, and a code generation monad. It also targets Haskell, and allows us to bind cloud APIs on short notice and at unprecedented scale.

Keywords: Haskell · REST Web API · Code generation · Language bindings · Cloud computing

1 Introduction
We propose a solution for low-level cloud service integration among different providers without additional support from the cloud API vendor. Multicloud service orchestration allows leveraging the full power of specialized, scalable, best-of-a-kind services in the cloud. Its advent coincides with the so-called no-code and low-code solutions that enable rapid prototyping of cloud applications by small teams at startup companies. While excellent in theory, in practice SDKs usually address only a few programming languages, descriptions are untyped and incomplete, and API description languages like Swagger/OpenAPI [1] are rarely used. That means that the initial implementation is often an incoherent mix of incompatible scripts, deployment is only partly automated, and the source code uses a variety of platforms with varying levels of automation. An attempt to integrate these stumbles upon the barrier of cloud APIs documented in many different ways, usually ad hoc and without particular rigour. The API interface bindings often contain implicit assumptions and untyped JSON or text-based bindings. The additional efforts to support cloud API bindings by code generation are limited to a single API and a single language [2–4], or require the significant effort of handwriting Swagger/OpenAPI declarations for the entire API [5] if the vendor did not generate them. We solve this significant and essential interoperability problem by automatically parsing API documentation and then generating code for API bindings in any chosen programming language¹. This allows us to scale the effort to the multitude of cloud APIs available [2–5,7], as well as to a multitude of useful programming languages.
¹ We currently generate Haskell [6] code, but developing generation for any other typed language is offered as a paid service upon request.
Fig. 1. Data flow within the API binding code generation pipeline (stages: data gathering, parsing, code generation).
2 Implementation
Our solution (HurlAPI) is a data pipeline divided into three stages: (1) data gathering from web pages, (2) parsing and analysis, and (3) code generation. (The pipeline is shown in Fig. 1.)

Data Gathering. At the initial stage of data gathering, we obtain a complete description of each cloud API call with the Scrapy library [8], using Python. It allows us to download HTML pages with Chrome; we then examine the page structure with XPath [9], CSS [10], or JQ [11] selectors to extract data in a systematic manner. All this information is written into tabular CSV files [12], with some columns containing JSON objects, and carefully validated.

Data Analysis. As part of the data analysis stage, implemented in Haskell [6], we parse many possible formats: (a) HTTP request path descriptions with variables, (b) cURL command options, and (c) extracts from tables of parameter descriptions.
The parameter descriptions are tagged with the parameter passing convention²: (1) as part of the HTTP request path, (2) as a URL-encoded query parameter, (3) as part of the request body as JSON or plain text [13], or (4) as an HTTP header or a cookie. The content of parsed entries is carefully validated and cross-checked for possible inconsistencies. Every entry has a separate list of errors, which are reported per record. While we only allow 100% correct records to be used for code generation, the failed records are reported in detail. Summary statistics of erroneous records are reported on a validation dashboard that indicates the percentage correctness of the current data and allows us to assess the overall health of the data pipeline [14].

Agile Data Pipeline Principles. Our data analytics pipeline goes beyond previously described best practices in agile data science [15] and also draws from the BCBS 239 best practices for risk management in the financial industry [16]. The principles of our data pipeline development process are: (1) judge by the final impact – prioritize development of the entire pipeline so that issues can be judged by their impact on the final product, using a data processing dashboard; it allows us to focus our efforts on the few issues that have a significant impact on the final data³; (2) a record never disappears – trace the flow of records over the entire data pipeline with unique record identifiers⁴; when filtering out records, put them into an alternate output, so the impact of each filter can be examined; (3) an error is a tag or an alternate output – record errors as tags on the record, and then filter by sorting error records into an alternative output that requires similar examination as the final product; multiple errors and warnings attached to each record elucidate co-occurrence of data quality and handling issues; (4) late filtering – delay filtering when you have multiple data quality criteria that can be run in parallel on a single record; this makes it easy to examine issue co-occurrence, which is common for faulty data; (5) universal data formats – data at any stage of the pipeline is available for examination as CSV files; (6) gradual record enrichment with additional information, so we can examine all data related to a record in a single row, while preserving existing information, so that inputs and outputs can be quickly examined at any stage of processing; (7) iteration throughput is considered as important as iteration speed, since the number of successfully processed records increases the number of issues discovered during each iteration, and we try to process different records and categories as independently as possible; (8) tagging potential gaps with errors or warnings, instead of assuming total correctness of the input data and the processing pipeline; this facilitates data-based assessments of the completeness of the analysis; (9) use excerpts from real data as unit tests whenever possible, to avoid testing for issues that rarely or never occur in practice.⁵
² Cloud API call parameter passing conventions differ from binary function call conventions in this respect, and many different argument passing conventions may be assigned to different parameters of the same call.
³ In the absence of more precise goals, the primary measure of impact is the frequency of an issue.
⁴ In case records are merged, we also merge the identifiers.
The principles (1) and (6–9) are all guided by the Zeno principle [17] of extensive data processing: sorting data quality and processing issues by their impact on the final product shows that most of the issues occur in a relatively small number of records. Fixing the first issues gives big improvements, but getting to 100% accuracy needs much more work. We can easily observe that moving from 80% to 90% of correctly processed records takes about the same time as moving from 98% to 99%, yet the gain in the former case is more substantial.⁶ The principles (3–5) aim to increase iteration throughput in terms of simultaneously detected data quality and processing issues.

Code Generation. In the final stage, we generate code in a typed programming language. We start with a reference data structure that lists the necessary functions and type declarations. This part is language-independent, except for the function generating the language-dependent declaration identifiers themselves. The binding generation proceeds with templates that use these identifiers to generate full code modules, and then an entire API binding package along with its metadata; we use techniques described in [19–22]. Following current best practice, we also attach links to the original documentation website, which allows the user to cross-reference the information with the original API documentation.
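A hypothetical sketch of the kind of language-independent reference structure such a generator could start from (the names and fields below are ours, not HurlAPI's): one record per API call, listing the HTTP method, the path, and each parameter together with its passing convention.

  -- Where a parameter is passed, mirroring the conventions listed above.
  data ParamLocation = InPath | InQuery | InBody | InHeader | InCookie
    deriving (Show, Eq)

  data Param = Param
    { paramName     :: String
    , paramType     :: String        -- e.g. "string", "integer", or an object type
    , paramLocation :: ParamLocation
    , paramRequired :: Bool
    } deriving (Show)

  data ApiCall = ApiCall
    { callName   :: String           -- used to derive the generated identifier
    , httpMethod :: String           -- "GET", "POST", ...
    , pathParts  :: [String]         -- literal segments and "{variable}" parts
    , params     :: [Param]
    , docsUrl    :: String           -- link back to the original documentation
    } deriving (Show)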
3 Conclusion

3.1 Limitations
When proposing a code generation solution instead of handwritten code, it is important to consider its limitations compared to a manual process. For a few APIs, we need to implement specialized components, like AWS S3 chunked signatures [23] or the custom authentication rolled out for the TransferWise API [24].⁷ There are also a few API calls (less than 2% in the entire MS Graph API [25]) that use custom argument passing. For example, MS Graph uses a custom DSL for filtering by customizable extended properties [26]. Another example is the non-standard retry behaviour of the Backblaze API [27], which requires replacing the access token upon receiving a 503 response (service unavailable). There are also sometimes bugs in the documentation, which will cause the generation of incorrect code. Some companies provide only language-specific SDKs [28] instead of publicly documenting their REST interfaces. Luckily the situation improves, with bigger
⁵ It also allows for principled treatment of the potential problems for which we do not have any practical test examples. We firmly anchor our analysis in naturally occurring data.
⁶ This is a kind of Pareto [18] principle in data analytics. It is an instance of Zeno's paradox [17], since we move in smaller steps the closer we are to complete correctness.
⁷ We agree with TransferWise [24] that this is more secure than using a plain secret like an access token, but each new authentication method needs special support code.
companies even providing live debugging or live sandbox functionality for the REST interface [29–31]. Generally, API documentation offers response examples instead of the more informative response schemas. Some transformation steps are required to create comprehensive response decoders from response examples, and the resulting decoders may require extensive testing to ensure coverage of all response cases.
3.2 Related Work
The OpenAPI spec is the standard used for encoding REST API specifications in a uniform way [1,5]. This allows writing standard code generators for multiple languages. Unfortunately, these code generators are often incomplete, and the evolution of the OpenAPI spec makes it harder to conform to all uses of this format occurring in the wild. This approach requires tremendous manual effort, and ignores the problem of legacy specifications and of specifications that contain non-standard elements. There are approaches to generate OpenAPI specifications automatically from type declarations [32,33]. This promising approach allows for synchronized evolution of interface and implementation, by making the OpenAPI files easy to update as part of a continuous delivery process. However, not all languages provide type declarations sufficient for an exhaustive description of the OpenAPI interface.
3.3 Evaluation
We realized automatic code generation from HTML documentation at a truly massive scale, without the need for manually written OpenAPI/Swagger files. This way of generating code for large APIs is clearly the future. While we hope that the publication of standardized and complete OpenAPI files will become universal, we concede that our code generation project is a more practical way of publishing a large number of REST API client bindings (sizes and coverages are shown in Table 1). We recommend automatic analysis of web pages and automatic generation as state-of-the-art approaches when the input is relatively structured and the output is easily verified (like code in a programming language). We believe realizing this project manually would have been impossible with the given resources.
3.4 Summary
We implemented a retargetable code generator for cloud API bindings that presents the following benefits: (1) it provides bindings for thousands of API calls within months; (2) language retargeting takes little effort; (3) the systematic approach allows easy scaling to a number of APIs; (4) it removes the dependency on cloud API provider support; (5) it significantly reduces the maintained code base as compared with handwritten cloud API bindings. We offer to generate cloud API bindings for other programming languages and other cloud APIs as a paid service.
Table 1. Some of the analyzed APIs. The first percentage indicates coverage of the analysis, the second coverage of endpoint URL parsing, the third that of parameter types; cURL indicates correctness of shell command parsing for cross-checking information, and the last column gives the number of distinct calls within the API.

API            | ALL  | URL  | Parameters | cURL | Calls
Backblaze      | 55%  | 100% | 100%       | 85%  | 29
Cloudflare     | 95%  | 100% | 100%       | 97%  | 488
Transferwise   | 10%  | 98%  | 96%        | 93%  | 81
Gitlab         | 42%  | 99%  | 100%       | 99%  | 871
Linode         | 13%  | 100% | 13%        | 97%  | 196
Azure          | 64%  | 99%  | 100%       | 100% | 5402
Office365      | 56%  | 99%  | 100%       | 100% | 1588
Site24x7       | 99%  | 99%  | 100%       | 96%  | 195
CoinMarketCap  | 100% | 100% | 100%       | 100% | 39
Digitalocean   | 100% | 100% | 100%       | 100% | 419
Dropbox        | 58%  | 100% | 100%       | 81%  | 233
Mailgun        | 24%  | 100% | 100%       | 100% | 133
Vultr          | 45%  | 100% | 100%       | 100% | 157
References
1. SmartBear: OpenAPI 3.0.2 Specification. https://swagger.io/docs/specification/about/
2. B-Project Contributors: Boto 3 - The AWS SDK for Python, release 1.12.9. https://github.com/boto/boto3
3. Hay, B., et al.: Amazonka - a comprehensive Amazon Web Services SDK for Haskell. https://github.com/brendanhay/amazonka
4. Hay, B., et al.: Gogol - a comprehensive Google Services SDK for Haskell. https://github.com/brendanhay/gogol
5. SmartBear: SwaggerHub. https://swagger.io/tools/swaggerhub/
6. Marlow, S. (ed.): Haskell 2010 Language Report. https://www.haskell.org/definition/haskell2010.pdf
7. Amazon Web Services: Guides and API References. https://docs.aws.amazon.com/#user_guides
8. Evans, S.: Scrapy. https://scrapy.org/
9. Clark, J., DeRose, S. (eds.): XML Path Language (XPath) version 1.0. W3C (1999)
10. CSS Working Group: Selectors Level 3. https://www.w3.org/TR/2011/REC-css3-selectors-20110929/
11. Dolan, S.: Jq is a lightweight and flexible command-line JSON processor. https://stedolan.github.io/jq/manual/
12. Shafranovich, Y.: Common format and MIME type for Comma-Separated Values (CSV) files (2005). https://rfc-editor.org/rfc/rfc4180.txt. https://doi.org/10.17487/RFC4180
13. Gajda, M.J.: Towards a more perfect union type
14. Gajda, M.J.: Guidelines for agile data pipelines
15. Jurney, R.: Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark. O'Reilly Media Inc., Sebastopol (2017)
16. Basel Committee on Banking Supervision: BCBS 239: Principles for effective risk aggregation and risk reporting (2013). https://www.bis.org/publ/bcbs239.htm
17. Huggett, N.: Zeno's paradoxes. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University (2019). https://plato.stanford.edu/archives/win2019/entries/paradox-zeno/
18. Pareto, V.: Cours d'Economie Politique. Droz, Genève (1896)
19. Gajda, M.J.: JSON Autotype: Presentation for Haskell.SG. https://engineers.sg/video/json-autotype-1-0-haskell-sg-429
20. Gajda, M.J.: Art of industrial code generation. https://www.migamake.com/presi/art-of-industrial-code-generation-mar-6-2019-uog-singapore.pdf
21. Gajda, M.J., Krylov, D.: Fast XML/HTML tools for Haskell: XML TypeLift and improved Xeno (2020). https://doi.org/10.5281/zenodo.3929548
22. Gajda, M.J.: Do not give us a bad name
23. Amazon Web Services: Signature calculations for the authorization header: transferring payload in multiple chunks (chunked upload) (AWS signature version 4). https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html#sigv4-chunked-body-definition
24. TransferWise: Strong customer authentication. https://api-docs.transferwise.com/#payouts-guide-api-access
25. Microsoft: Microsoft Graph REST API v1.0 reference. https://docs.microsoft.com/en-us/graph/api/resources/domain?view=graph-rest-1.0
26. Microsoft Corporation: Outlook extended properties overview. https://docs.microsoft.com/en-us/graph/api/resources/extended-properties-overview?view=graph-rest-1.0
27. Backblaze: B2 integration checklist, uploading files. https://www.backblaze.com/b2/docs/integration_checklist.html
28. gotoAndPlay(): SmartFoxServer 2X documentation central. http://docs2x.smartfoxserver.com/
29. Alibaba Cloud: OpenAPI Explorer. https://api.alibabacloud.com/
30. TransferWise: Full API reference, simulation. https://api-docs.transferwise.com/#simulation
31. Shymansky, S.: Mailgun sandbox domain explained. https://blog.mailtrap.io/mailgun-sandbox-tutorial/
32. Johnson, D.: servant-swagger. https://hackage.haskell.org/package/servant-swagger
33. Johnson, D., Kudasov, N., Koltsov, M.: servant-openapi3. https://hackage.haskell.org/package/servant-openapi3
Reducing Web-Latency in Cloud Operating Systems to Simplify User Transitions to the Cloud
Luke Gassmann and Abu Alam
University of Gloucestershire, Cheltenham GL50 4AZ, UK
[email protected]
Abstract. Data storage and algorithms are continuing to move toward cloud-based storage. This provides users with faster processing speeds and online access. The reduction in latency between client and server now makes it possible for servers to handle intensive algorithms and data, rather than running and storing information on client machines. Negative impacts like poor memory, storage and CPU speeds can affect a desktop's architectural performance; in addition to having a local storage dependency, desktops do not have the multi-platform and worldwide access capabilities of cloud-based platforms. The proposed solution for this is the current transition from local hardware to cloud platforms. However, this transition involves difficulties relating to network latency, storage retrieval and access speeds. This study proposes a methodology to evaluate the validity of reducing web-latency in cloud platforms, so that they may simulate a local operating system (OS) environment. This research compares web architectures in categories that could positively impact network latency, like prefetching, clustering and deep learning prediction models, as well as Markov algorithms and layered architectures, which are reviewed to provide a development foundation. Surrounding studies on web-latency, storage access and architectural retrieval times are reviewed to assess applicable technologies and real-world obstacles involved in reducing web-latency in cloud operating systems.

Keywords: Online latency · Cloud computing · Cloud operating systems · Prefetching · Online architecture

1 Introduction
Building upon research into cloud-based operating systems and reducing web-latency, the platform Prometheus-OS was developed to study the web-latency effects of storage, architecture, caching techniques and prediction models. This paper covers the project's proposal, research into surrounding fields, and a methodology to test the developed system's latency. The process of designing the platform's high-level architecture covers client- and server-side communication, caching and storage interactions, as well as the lower-level implementation
of these areas and the platform's design iterations. This paper also explores how the platform testing was implemented, with a detailed analysis of the corresponding platform areas. The future plans and limitations of this research are then discussed, reviewing what improvements to the project can be made to further research into web-latency reduction in cloud-based operating systems. The paper has the following format: a literature review containing the surrounding and related work, a methodology on the design and implementation of the platform, an evaluation of the results comparing the platform's performance, the limitations of the platform, and a summary of the paper and future work.
2 Related Work

2.1 Premise
With industry technologies moving to the cloud, there are issues surrounding the increasing volume of data and user control [11]. When confronted with response time, network load and synchronous architecture [18,22], the issues faced in network latency are comparable across cloud platform environments. Alongside the risks, there are benefits in moving OS environments to the cloud, like the improved performance and integration of applications, OS and hardware [10]. This paper researches the reduction of web-latency within a Cloud OS (COS) to improve user transitions from local to cloud environments. It reviews industry methods for improving data acquisition, data storage and prediction models related to cloud architectures, to reduce the impact of web-latency [6,17].
2.2 XML Operating Systems and Streaming Applications
Introduction. Although limited research is available on cloud platforms that replicate local OS environments, the strategies used to reduce web-latency and improve performance vary. The following research shows two techniques for implementing a COS, focusing on implementation and on how both compare when reducing web-latency.

Approaches and Techniques. There are many approaches to applying cloud architecture designs in a COS. One example is an Extensible Markup Language (XML) OS, as examined by Garg et al. The platform simulates a computer environment providing limited functionality [10]. However, similar to AJAX-based web applications, the platform accesses data using synchronous database requests. Although XML-based OSs were noted as being fast due to cached code [10], work by Chen et al. improves upon this by using wrappers to stream application data (Fig. 1). Chen et al. research a COS using the Broadway GTK+ library (BGL) backend. The platform uses a Hypertext Markup Language (HTML) canvas element to stream data from a virtual machine. This establishes a communication channel between browser and desktop with better efficiency than VNC software, with additional application support [5] (Fig. 1).
Fig. 1. Comparison in theoretical COS designs by [5, 10]
Comparisons. Unlike the XML OS's reduced features and reliance upon web languages, Chen et al.'s BGL platform shows fewer network latency issues [10]. Using BGL, only 1 megabyte per second (MBPS) of bandwidth was needed to transfer data between the virtual machine and the browser [5]. The platform does, however, have virtual machine hardware dependencies, in addition to plans to transfer application computation from the server to clients in the future. Although the BGL platform offers OS application compatibility, the XML OS has dynamic web-application benefits like styling and global updates, maintaining cloud flexibility with local OS proficiency [10,16]. These two areas of research provide different approaches to COS. However, both rely upon synchronous communication with online servers to offload complex computational and storage needs and reduce processing time. Garg et al.'s XML-based OS focuses on basic services using cache storage to reduce web-latency, whilst Chen et al.'s BGL platform instead streams applications from a virtual machine with more effective bandwidth utilization. Prometheus-OS is an extension of these ideas, allowing the platform to function by streaming application data from a server (similar to Chen et al.) whilst hosting and running the visual and interactive components client-side, in an attempt to predict and pre-fetch user interactions.
2.3 Performance
Reasoning. Cloud applications facilitate global platforms. However, web traffic has put new demands on system performance. Considering the successive data transfers necessary, it is important that POS's communication is efficient, to reduce latency issues.
Fig. 2. In-memory caching as described by [18]
Caching Definition and Benefits. Kumar and Chen et al. explain the benefit of caching as improving data acquisition speed by retrieving past data from high-access memory [6,13]. As the client and server are factors for high performance, Prakash and Kovoor suggest that in-memory caching can be used for retrieving data and reducing redundant database calls. The proposed method stores data in cache memory with an 'entity key'. If the memory being requested is already loaded, it is fetched and provided to the client [18] (Fig. 2).

Constraints and Solutions. [15] explain that in-memory caching may cause unpredictable latency based on its cache miss ratio. Gracia proposes that prefetching models can compensate for cache misses, as prefetching manages virtual memory by anticipating cache misses and retrieving data beforehand [11].

Prefetching Definition and Benefits. Prefetching is used within web interfaces to reduce slow memory retrieval by anticipating cache misses. This is obtained by combining clustering, association rules and Markov models [6,11]. Yogish defines multiple types of prefetching methods, including proxy and semantic prefetching, which reduce retrieval latency by predicting requests based on past semantic keys and web pages [23]. Venketesh et al. explain that the core components behind semantic prefetching are the tokenizer, the prediction unit, the prefetching unit and the prefetching cache. Tokens are created based on uniform resource locator (URL) patterns and key words. The prediction unit then computes probability values based on tokens and pages that have not been accessed by the user; once calculated, these prediction values are passed to the prefetching unit to retrieve predicted pages for cache storage [22] or a conventional storage log [1] (Fig. 3).

Alternative Methods. Kumar proposes a model that reviews semantic data packets rather than the URL tokens defined by Venketesh et al. These tokens contain details like the user's platform host, username, date/time, request, status code, etc. This would allow a platform to set the variables that predict and identify behaviour [13]. Some improvements suggested by Zou include filtering techniques based on storage size and latency, to prioritize data with higher latency risks [25].
Fig. 3. Sequence diagram of semantic prefetching as described by [22]
Within Prometheus-OS, caching plays an important role in reducing response time by offering data at higher levels of memory. In-memory caching reduces redundant calls to the database; however, due to cache misses, prefetching becomes necessary to compensate for incorrect memory management by anticipating requests. The semantic prefetching design outlined by Venketesh et al. develops this feature to incorporate application data by using tokens/data packets to predict requests and make better use of retrieval time.
2.4 Architecture and Scalability
Reasoning. Architectural reliability is paramount for third-party software, future updates and data storage, achieved by applying software service models [8,24]. This section discusses the effect of database structures on processing stability, retrieval time and web-latency factors.

NoSQL and Relational Databases. Outline. Factors surrounding cloud data storage are vital for improving platform performance and latency [7]. NoSQL databases are a paradigm substitute for relational databases' poor scaling and flexibility [20], making them suitable for large or dynamic data [12]. Relational databases, however, commonly adopt SQL as a querying language [9] and, although they ensure data and transaction integrity and consistency [20], their poor flexibility and response time make them a poor-performance solution for large applications [19].

Performance. Research by Mahmood et al. shows that relational databases and NoSQL databases are equally capable of retrieving records using range-based hashing [14]. When comparing them, Jose and Abraham showed that NoSQL performed more quickly and flexibly, supporting external storage engines and programming languages [12]. As examined by Fatima et al. and Rautmare et al., MongoDB outperforms relational databases in write-intensive operations, using schema-less and flexible data modeling [7,9,19]. However, it should be noted
that this research also showed a poorer standard deviation, making response time unpredictable. In summary, database retrieval and management require negotiating flexibility, reliability and access speeds when considering performance and scale. For example, NoSQL offers horizontal storage expansion whilst offering less reliable access times [19].
3 Design and Implementation

3.1 Project Resources and Development
Overview. Developing upon the XML OS and the BGL platform [5,10], we integrate surrounding cloud architecture research to reduce web-latency when transferring from local to COS environments. Based on XML OS research, POS will be deployed on a Linux server implementing an n-tier architecture to divide business logic. This tiered architecture will contain a database tier consisting of relational and non-relational databases, a business tier for application logic, and a presentation tier for displaying data. The platform will also implement Laravel to improve object-oriented behaviour and Framework7 for interface design.

N-Tier Architecture Structure. Based on object-oriented behavioural research, the implemented PHP Laravel framework will improve object-oriented and n-tier framework development [2]. The platform's server architecture will be layered, using page controllers, terminal and database communication objects, and graphical frameworks. The page controllers generate page data, the terminal objects communicate between client and server objects and anticipate requests, the database communication objects retrieve and send data from/to the relational and/or non-relational database, and the graphical frameworks generate the dynamic and hierarchical interfaces.

Terminal Design. The cloud platform will involve communication between the client and server, thus affecting the latency and consistency of data retrieval. The proposed architecture will adopt a four-layered approach to improve data integrity when requesting application scripts (Fig. 4). This not only provides a modular approach but means that any application or graphical user interface (GUI) will involve user-related triggers that can be formatted and processed by the client and server terminals, making debugging and standardisation simpler. Once a request is sent, the client's local JavaScript terminal will process the information and send the formatted data to the Server Terminal via a synchronous AJAX request. The Server Terminal then retrieves, prepares and returns application information to the Local Terminal, which then generates a dedicated encapsulated object to maintain a server communication link for updates and for sending prediction token data. Figure 5 displays folder GUI logic using the four-layered terminal design. The folder class reduces latency by maintaining communication by sending terminal
Fig. 4. Four layered approach for generating an application
Figure 5 displays folder GUI logic using the four-layered terminal design. The folder class reduces latency by maintaining communication, sending terminal requests using the simplified format: 'gui id request type request information'. Once retrieved, the application's callback function executes the returned data. By using object-oriented GUIs, the folder object can inherit design and business logic, reducing redundant code and data retrieval. Data Storage. To improve data retrieval whilst maintaining the integrity, horizontal scaling and adaptability of stored data, the platform will use non-relational and relational databases. The relational database will consist of platform account information and basic directory/file storage, for improved retrieval consistency and transaction integrity [20], whereas the non-relational database will store application information within a unique collection, due to its flexibility, schema-less design and faster writing speeds. Requests writing to the user's collection will adopt application account keys to authorise access. Prediction Modeling. To anticipate users' data collection patterns, user account and interaction information will be processed within a prediction model to retrieve application data prior to terminal requests. Figure 6 proposes a prediction model's communication design. Once data is predicted and retrieved using event data, the Terminal Gateway will store the object in memory for a period of time. If the data is requested, the terminal can return the cached data to the client. The result will improve web-latency by reducing client data-retrieval time.
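As a rough illustration of this request format and prediction cache (the class and function names below are hypothetical, and the platform itself implements this logic in PHP and JavaScript rather than Python), the flow could be sketched as follows.

```python
# Simplified sketch of the terminal request format and prediction cache described
# above. Names (TerminalGateway, fetch_from_storage) are hypothetical; the real
# platform implements this logic in PHP and JavaScript.
import time

def fetch_from_storage(gui_id, request_type, request_info):
    # Placeholder for the database tier call.
    return f"<div>{request_type} for {gui_id}: {request_info}</div>"

class TerminalGateway:
    def __init__(self, ttl_seconds=30):
        self.cache = {}          # request string -> (payload, expiry time)
        self.ttl = ttl_seconds

    def prefetch(self, request, payload):
        """Store data retrieved by the prediction model before it is requested."""
        self.cache[request] = (payload, time.time() + self.ttl)

    def handle_request(self, request):
        """Serve cached data if a prediction already fetched it, else hit storage."""
        gui_id, request_type, request_info = request.split(" ", 2)
        cached = self.cache.get(request)
        if cached and cached[1] > time.time():
            return cached[0]                      # cache hit: no storage round trip
        return fetch_from_storage(gui_id, request_type, request_info)

gateway = TerminalGateway()
gateway.prefetch("folder42 list_contents /home", "<ul>...</ul>")   # predicted data
print(gateway.handle_request("folder42 list_contents /home"))      # served from cache
```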
3.2 Latency Measurements
Data Collection. Analysis of the platform’s web-latency and architecture will cover three non-functional tests surrounding the third research objective.
Fig. 5. Sequence diagram in generating and updating a folder GUI
Fig. 6. Prometheus-OS prediction model trigger sequence diagram
These three test areas (Table 1) will measure the generation of text documents and the requesting of data. These metrics will also be used in unit testing the platform's prefetching and retrieval algorithms.
Table 1. Tasks associated with the third research objective
Test categories | Description
Load testing | Measuring system response time to work load ratio
Stress testing | Measuring high work load response time stability
Spike testing | Measuring sudden increases in work load to response time
The data will then be analysed based on software components such as client and server response times, terminal speed and prefetching accuracy. Throughout each test, a control variable will measure the server's ping distance/time to assess web-latency (Fig. 7).
Fig. 7. Diagram showing non-functional performance testing procedure
Methods of Analysis. Non-functional platform and unit testing will be analysed by assessing the standard deviation of the platform's latency in each testing metric. Determining the mean and correlations will relate contributing unit-test factors to the platform's reduction in web-latency. Load and Spike testing will also measure unit test/algorithm efficiency based on traffic. To self-evaluate the project, an expert review will be conducted to analyse the research methods and to recommend changes by providing a judgment on research/project practice. Problems. Researchers will measure the platform's reduction in web-latency by assessing and comparing the factors of each unit test and their impact on the overall platform. Contribution. This research will show how the platform's performance under different non-functional tests affects web-latency. The data obtained under different workloads should show the platform's efficiency in read-write operations and the contributing factors in latency reduction.
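The analysis described above can be illustrated with a short sketch that times a batch of synchronous requests and reports the mean and standard deviation after subtracting the ping control value; the request function and timings below are placeholders rather than the platform's actual test harness.

```python
# Illustrative sketch of the analysis step: mean and standard deviation of latency
# samples collected during a load test, with the server ping as a control value.
import statistics
import time

def timed_request(send_request):
    start = time.perf_counter()
    send_request()
    return (time.perf_counter() - start) * 1000.0   # milliseconds

def summarise(samples_ms, ping_ms):
    corrected = [s - ping_ms for s in samples_ms]    # remove network distance
    return {
        "mean_ms": statistics.mean(corrected),
        "stdev_ms": statistics.stdev(corrected),
        "max_ms": max(corrected),
    }

# Example with synthetic timings (in a real run these come from the load test).
samples = [timed_request(lambda: time.sleep(0.01)) for _ in range(20)]
print(summarise(samples, ping_ms=2.0))
```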
4 Evaluation
4.1 Cloud Operating System Comparison
Two other studies, by [5] and [10], have greatly influenced the approach to and development of this platform. These studies took very different approaches to developing an online operating system. Garg et al. developed an XML-based dynamic online operating system and highlighted its advantages in portability and its lack of dependence on local operating system hardware [10]. They qualitatively regarded the platform as extremely fast, as data would be stored in the browser's cache rather than being retrieved from a server. This is not unlike Prometheus-OS; however, this platform retrieves and caches data both dynamically and through prediction/prefetching algorithms to reduce the amount of unnecessary data stored in the cache. Prometheus-OS was also influenced by Garg et al.'s study to remain portable and dynamic, by reducing external dependencies and offline criteria. The Prometheus-OS platform also followed Garg et al.'s view on system updates being capable of occurring only once on the server rather than on each client machine. This was done via the client-server terminal communication link, as the bulk of data retrieval and GUI design logic is applied server-side, to further reduce dependency on client machines.
Table 2. Amount of data transferred with different requests.
Request type | Data sent (Bytes) | Data received (Bytes)
Text File Initiation | 116 | 747
Text File HTML Retrieval | 138 | 38683
File Option Menu HTML Retrieval | 152 | 67100
File Option Menu List Contents | 142 | 19992
Folder Initiation | 113 | 726
Folder HTML Retrieval | 132 | 66064
Folder List Contents | 122 | 19932
Upper Boundary | 152 | 67100
Chen et al., however, used a BGL back-end to stream application data from a virtual machine to an HTML5 canvas, which was capable of transferring data with 1 Mbps of bandwidth. This meant that the platform could display and interact with offline operating system applications faster than VNC software [5]. Prometheus-OS was developed to mirror the reduced data transfer size and bandwidth of Chen et al.'s platform, but in web-application form, using JavaScript and HTML rather than a streamed image. As shown in
Table 2, the maximum sent packet for applications reaches 0.152 kB (0.000152 MB), and the maximum retrieved from the server reaches 0.0671 MB. Thus, the Prometheus-OS platform requires one tenth of the bandwidth required by Chen et al.'s model. Figure 8 shows that the average HTML data and directory contents request retrieves 0.035 MB, so although Chen et al.'s platform currently has more functionality, the bandwidth needed to operate the Prometheus-OS platform is lower.
[Fig. 8 data: Sending Command 0.13 kB; Object Initiation Code 0.73 kB; Object HTML Code 57.29 kB; List Directory Contents 39.92 kB]
Fig. 8. Bar chart showing the average kilobytes for sending and receiving server data.
4.2 Initial Setup
The platform’s initial loading speed is vital as a lot of data has to be retrieved from the relational and file database. The administrator console was used to display processes and times as the command was executed for each database five times. The results showed a large difference in total duration time, with the relational SQL database averaging at 30 ms whilst the file database was 72% faster at only 8.6 ms (Fig. 9). This was due to a decrease in formatting time, as the database had to join several tables. Although the file database was faster overall, it was slower in retrieving the application data, this may be due to the large amount of redundant data that can be avoided in a normalised SQL database. 4.3
Writing Data and Storage
The file writing test measured the duration for the platform to write new file data to the server. The test measured the server's total and database durations for writing a new 1 MB file of random alpha-numeric text. Two tests were carried out to assess each database's ability: a Load Test and a Spike Test.
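A minimal sketch of how such a write test could be driven is shown below; the file paths, repeat count and use of local files (rather than the platform's server-side databases) are illustrative assumptions.

```python
# Sketch of the file-writing test: generate 1 MB of random alpha-numeric text and
# time how long each write takes. Paths and the repeat count are illustrative.
import random
import string
import time

def random_payload(size_bytes=1_000_000):
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=size_bytes))

def timed_write(path, payload):
    start = time.perf_counter()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(payload)
    return (time.perf_counter() - start) * 1000.0    # duration in ms

payload = random_payload()
durations = [timed_write(f"test_file_{i}.txt", payload) for i in range(10)]
print(f"mean write time: {sum(durations) / len(durations):.2f} ms")
```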
[Fig. 9: command duration (ms) per initiation task, comparing the relational database and the file database]
Fig. 9. Bar chart showing the average time for data retrieval during platform initiation.
[Fig. 10 panels: (a) Total Time and (b) Writing Time; duration (ms) versus synchronous requests, for the file database and the relational database]
Fig. 10. Scatter plots showing the increase in duration during load testing on save file procedures for both the total and writing time.
As shown in Fig. 10, the tests showed significant differences in write times. With just 10 synchronous requests, the initial total time to write data in the relational database was 45.5 ms, whereas the file database wrote the data in 17.5 ms, around 61% less time. The relational database's duration then increased exponentially, reaching 969.34 ms at 90 requests; in contrast, the file database took only 18.4 ms. This exponential increase in duration makes the relational database a poorer choice for platform file storage, as latency increases with the number of users writing to it.
The Spike Test showed an increase in duration for both the file and relational databases. The file database showed a 91% increase in total duration and a 100% increase in writing time. The relational database saw an increase of around 500% in both total and writing latency during the request spike. Not only was this increase higher, but the relational database's latency rose further while recovering after the spike, whereas the file database's total and writing times fell back to mirror their initial lows.
4.4 Reading Data and Storage
To measure the reading time for both the file and relational databases, the same procedure was carried out; this time, however, the command retrieved a 1 MB platform file from the two databases. Because of the relational database's low redundancy, it needed to retrieve data from two tables, whilst the file database opened a single formatted file. As shown in Fig. 11, load testing saw both databases' latency drop as the synchronous requests increased, with the file database remaining on average 2 ms slower than the relational database. This was the case in both the total and database-retrieval latency tests, where the relational database performed faster than the file database.
[Fig. 11 panels: (a) Total Time and (b) Retrieval Time; duration (ms) versus synchronous requests, for the file database and the relational database]
Fig. 11. Scatter plots showing the increase in duration during load testing on file retrieval procedures for both the total and reading time.
When reviewing the Spike Test data, both databases saw a rise in latency during a request spike. During the spike, the file database's total latency increased by 44%, whilst the relational database's increased by 54%. Although the relational database saw a larger spike in latency, it remained 19% lower than the file database's duration. During recovery, however, both databases saw a drop in latency, with both returning to their initial lows.
4.5 Caching System
Two tests were carried out to measure the caching mechanism's effectiveness and latency. The cache mechanism was tested to assess the latency of data retrieval when using the prediction model. The test assessed the prediction model's processing, retrieval and HTML caching for initial and subsequent caching requests. The results showed that the highest cause of latency in this model was the prediction method itself, whilst other interactions with the database, such as retrieving HTML data, were 80% faster. Formatting and storing the cached code remained the shortest cause of latency, at 1 ms across all tests. The caching mechanism was also tested against a server retrieval test. Based on the five tests carried out for retrieving from storage and cache, the caching mechanism was 52–67% faster than retrieving data from online storage. Even when accounting for the latency in sending data to the server (averaging 59 ms), the caching mechanism still retrieved data faster, although with only a 1.8–19% improvement in latency/duration. This demonstrates that the caching mechanism reduces latency most effectively when the ping duration is high.
5 Limitations
5.1 Caching Prediction Accuracy
One limitation of this study was the inability to test the accuracy of the caching prediction mechanism and the Prometheus Mouse algorithm. This was not achieved due to socio-economic constraints, and although it was not a primary focus of the study, further research with participants would provide data on how accurate these methods are and how much latency can be reduced.
5.2 Machine Learning Models
Other limitations include the inability to test more than one machine learning model or to use a larger dataset. By introducing additional models, the platform could be tested to measure how effective each model was outside of training. However, this was not achievable due to time constraints and the lack of real-world data.
5.3 Server Resources
Constraints in server resources meant that the platform's Python Smart Caching system was tested on a local server; with additional time and resources this could be rectified. Furthermore, additional tests should compare the platform's latency and performance on multiple devices. A key part of online operating systems is their cross-platform capability; however, this was not researched or tested, so further development is necessary.
5.4 Study Comparison
The comparisons made with past research showed promising improvements in bandwidth and latency; however, these platform comparisons are not normalised, as each platform's architecture provides a variety of different features with different purposes in mind. This was further hindered by the lack of directly comparable research in this area: although surrounding research greatly improved the platform's design, only two closely comparable studies were available.
6 Conclusion
6.1 Overview
In summary, this study has shown the process of designing, developing and latency testing an online operating system. Objective 1 was achieved with an in-depth review of surrounding research, critically reviewing the topics of caching, prefetching, database architecture and cloud-based operating systems, and was influenced by works such as [5, 10]. Fulfilling objective 2, the Prometheus-OS platform was developed using the Laravel and Framework7 frameworks to improve the presentation and object-oriented nature of the platform. The platform was designed to mirror a modern-day operating system layout, including GUI applications and graphical interactions. The platform's communication between client and server used a pair of switching stations and terminals to execute and route commands effectively, whilst data storage was divided to test the latency of relational and file-based databases. Both databases had very different structures to achieve different goals: the relational database was designed to have less redundancy, whilst the file database could offer formatted JSON data. Two prediction mechanisms were also developed for the platform's caching method. The first used a Naive Bayes network to predict upcoming user commands with past user data, whilst the Prometheus Mouse class provided predictions based on mouse movement and direction. For objective 3, the platform's systems were critically tested and showed that the highest cause of latency was HTML data retrieval, although the retrieval latency could be reduced when using the relational database. When writing to the database during both Load and Spike Tests, the file database showed the best performance and recovery. The caching mechanism also showed a further overall reduction in latency in comparison to server storage retrieval. This was also true when accounting for server-client ping duration. Overall, this study has provided evidence that online operating systems have the potential to reduce latency and retrieval times by identifying contributing factors.
6.2 Future Plans
Considering the results and scope of this project, the platform’s online operating system architecture not only has the potential to provide online/worldwide access
to user desktops and cross-device compatibility, but the dynamic nature of hybrid mobile applications (HMA) also allows for greater viewing experiences. Using HMA, the concept of adapting augmented reality (AR) would allow a platform to host multiple monitors [3]. By applying the principles of QR codes, better stability could be provided to the AR system [4, 21]. This concept would, however, need more work on web-latency reduction if the platform were to retrieve large amounts of dynamic application data from the server. Further potential work includes a study reviewing user perceptions of online operating systems. This could involve observing user response times when using Prometheus Mouse and additional work on reviewing the real-world implications of platform latency. These real-world implications include research into the security architecture necessary for designing an online operating system, as well as further developing the system's smart caching method using different machine learning models and additional user factors such as mouse position, recent events and device. Additional server testing should measure the platform's latency and performance using additional server hardware to determine what is necessary on a per-user basis. Client architecture could also explore how the platform could handle drops in connection by applying a short-term offline mode or a synchronising algorithm to ensure that both server and client are updated. As mentioned, the platform's performance should also be researched using mobile hardware, as this research was primarily done with desktops and one of the core benefits of a web application is its cross-device capability. This research could cover mobile hardware, connection speeds and data usage. Acknowledgment. I would like to thank my close family for supporting me on this journey. To my wife, you have been a solid cornerstone to rely on, your sense of team work and genuine love has offered me only open arms after long days. Even during times of stress and anxiety, you are always there for me with a smile and a joke. For whatever the day throws at us, your kindness and hugs are always something I can rely on.
References
1. Ahmad, N., Malik, O., Ul Hassan, M., Qureshi, M.S., Munir, A.: Reducing user latency in web prefetching using integrated techniques. In: Proceedings - International Conference on Computer Networks and Information Technology, pp. 175–178 (2011)
2. Alfat, L., Triwiyatno, A., Rizal Isnanto, R.: Sentinel web: implementation of Laravel framework in web based temperature and humidity monitoring system. In: ICITACEE 2015 - 2nd International Conference on Information Technology, Computer, and Electrical Engineering: Green Technology Strengthening in Information Technology, Electrical and Computer Engineering Implementation, Proceedings, pp. 46–51 (2016)
3. Alulema, D., Simbana, B., Vega, C., Morocho, D., Ibarra, A., Alulema, V.: Design of an augmented reality-based application for Quito's historic center. In: 2018 IEEE Biennial Congress of Argentina, ARGENCON 2018 (2019)
4. Bunma, D., Vongpradhip, S.: Using augment reality to increase capacity in QR code. In: 2014 4th International Conference on Digital Information and Communication Technology and Its Applications, DICTAP 2014, pp. 440–443 (2014)
5. Chen, B., Hsu, H.P., Huang, Y.L.: Bringing desktop applications to the web. IT Prof. 18(1), 34–40 (2016)
6. Chen, Z., Liu, X., Xing, Y., Hu, M., Ju, X.: Markov encrypted data prefetching model based on attribute classification. In: 2020 5th International Conference on Computer and Communication Systems, ICCCS 2020, pp. 54–59 (2020)
7. Chickerur, S., Goudar, A., Kinnerkar, A.: Comparison of relational database with document-oriented database (MongoDB) for big data applications. In: Proceedings - 8th International Conference on Advanced Software Engineering and Its Applications, ASEA 2015, pp. 41–47 (2016)
8. Di Lucca, G.A., Fasolino, A.R.: Testing web-based applications: the state of the art and future trends. Inf. Softw. Technol. 48(12), 1172–1186 (2006)
9. Fatima, H., Wasnik, K.: Comparison of SQL, NoSQL and NewSQL databases for internet of things. In: IEEE Bombay Section Symposium 2016: Frontiers of Technology: Fuelling Prosperity of Planet and People, IBSS 2016 (2016)
10. Garg, K., Agarwal, A., Gaikwad, M., Inamdar, V., Rajpurohit, A.: XML based lucid web operating system. In: AICERA 2012 - Annual International Conference on Emerging Research Areas: Innovative Practices and Future Trends (2012)
11. Gracia, C.D.: A Case Study on Memory Efficient Prediction Models for Web Prefetching (2016)
12. Jose, B., Abraham, S.: Exploring the merits of NoSQL: a study based on MongoDB. In: 2017 International Conference on Networks and Advances in Computational Technologies, NetACT 2017, pp. 266–271, July 2017
13. Kumar, P.: Prefetching Web Pages for Improving User Access Latency Using Integrated Web Usage Mining, pp. 5–9 (2015)
14. Mahmood, K., Orsborn, K., Risch, T.: Comparison of NoSQL datastores for large scale data stream log analytics. In: Proceedings - 2019 IEEE International Conference on Smart Computing, SMARTCOMP 2019, pp. 478–480 (2019)
15. Mocniak, A.L., Chung, S.M.: Dynamic cache memory locking by utilizing multiple miss tables. In: Proceedings of 2016 International Conference on Data and Software Engineering, ICoDSE 2016 (2017)
16. Nunkesser, R.: Beyond web/native/hybrid: a new taxonomy for mobile app development. In: 2018 IEEE/ACM 5th International Conference on Mobile Software Engineering and Systems (MOBILESoft), pp. 214–218 (2018)
17. Pereira, A., Silva, L., Meira, W.: Evaluating the impact of reactivity on the performance of web applications. In: 2006 Conference Proceedings of the IEEE International Performance, Computing, and Communications Conference, pp. 425–432 (2006)
18. Prakash, S.S., Kovoor, B.C.: Performance optimisation of web applications using. In: International Conference on Inventive Computation Technologies (ICICT), pp. 3–7 (2016)
19. Rautmare, S., Bhalerao, D.M.: MySQL and NoSQL database comparison for IoT application. In: 2016 IEEE International Conference on Advances in Computer Applications, ICACA 2016, pp. 235–238 (2017)
20. Sahatqija, K., Ajdari, J., Zenuni, X., Raufi, B., Ismaili, F.: Comparison between relational and NOSQL databases. In: 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2018 – Proceedings, pp. 216–221 (2018)
21. Teng, C.H., Wu, B.S.: Developing QR code based augmented reality using SIFT features. In: Proceedings - IEEE 9th International Conference on Ubiquitous Intelligence and Computing and IEEE 9th International Conference on Autonomic and Trusted Computing, UIC-ATC 2012, pp. 985–990 (2012)
22. Venketesh, P., Venkatesan, R., Arunprakash, L.: Semantic web prefetching scheme using naïve Bayes classifier. Int. J. Comput. Sci. Appl. 7(1), 66–78 (2010)
23. Yogish, H.K., Raju, G.T.: A novel ART1NN clustering and pre-fetching technique for reducing web latency. In: Proceedings - 5th International Conference on Computational Intelligence and Communication Networks, CICN 2013, pp. 327–330 (2013)
24. Zissis, D., Lekkas, D.: Addressing cloud computing security issues. Future Gener. Comput. Syst. 28(3), 583–592 (2012)
25. Zou, W., Won, J., Ahn, J., Kang, K.: Intentionality-related deep learning method in web prefetching. In: Proceedings - International Conference on Network Protocols, ICNP, October 2019
Crescoware: A Container-Based Gateway for HPC and AI Applications in the ENEAGRID Infrastructure Angelo Mariano1(B) , Giulio D’Amato2 , Giovanni Formisano3 , Guido Guarnieri3 , Giuseppe Santomauro3 , and Silvio Migliori4 1 ENEA, TERIN-ICT, via Giulio Petroni 15/F, 70124 Bari, Italy
[email protected]
2 Sysman Progetti & Servizi S.r.l., Via G. Lorenzoni 19, 00143 Rome, Italy 3 ENEA, TERIN-ICT-HPC, Portici Research Centre, 80055 Portici (NA), Italy 4 ENEA, TERIN-ICT, ENEA Headquarters, 00100 Rome, Italy
Abstract. The purpose of this work is to introduce Crescoware, a container-based gateway for HPC/AI applications in the ENEAGRID infrastructure. Crescoware is a web application created to facilitate the migration of CRESCO users’ workloads to Singularity containers, bringing all the necessary actions to work with containers within a visual environment capable of abstracting many complex command-line operations, typically executed over multiple terminal sessions. Among the features offered by Crescoware there are: a collaborative catalog of Singularity containers; a workspace for the creation of new containers, based on centralized build servers; and a web-based SSH client to test these containers. After reviewing the advantages offered by containers if compared to a traditional bare-metal approach, this work also focuses on how the problem of enabling users to create and launch Singularity containers has been dealt with, having in mind the goal of enabling a frictionless experience on ENEAGRID/CRESCO systems. Keywords: High performance computing · Cloud computing · Containers · Science gateway
1 Introduction
The convergence of artificial intelligence, high performance computing and data analytics is being driven by a proliferation of advanced computing workflows that combine different techniques to solve complex problems [1–3]. Traditional HPC workloads can be augmented with AI techniques to speed up scientific discovery and innovation. At the same time, data scientists and researchers are developing new processes for solving problems at massive scale that require HPC systems and oftentimes AI-driven applications like machine learning and deep learning. These applications need to benefit from virtualization and cloud architectures that allow them to obtain customized libraries and tools. HPC and cloud technologies need to converge in order to accelerate discovery and innovation, while also putting pressure on IT teams to support increasingly complex environments.
Container-based virtualization technologies abstract away the manual steps that can slow down provisioning and may lead to configuration errors, and they automate the deployment of customized libraries, frameworks, operators, services, platforms and applications. ENEA, the Italian National Agency for New Technologies, Energy and Sustainable Economic Development, operates in a range of R&D sectors, including energy technologies, new materials, and life and environmental sciences. In support of its institutional research activities, the ENEA IT division provides computing and storage resources organized in the ENEAGRID distributed infrastructure based on the HPC CRESCO clusters [4], whose latest cluster, CRESCO6, ranked 420th in the TOP500 list of Nov. 2018. These HPC hardware resources need to be made available to a diverse set of environments and libraries that researchers can put together in order to solve their needs. For advanced computing applications such as simulation, high-throughput computing, machine learning, deep learning and data analytics, container-based virtualization can give the flexibility to run these workloads in different environments, with a single interface for container provisioning and deployment, and all with easy-to-use point-and-click templates. Containers allow multiple applications to run in isolation on a single host, and they encapsulate an application together with the runtime environment it needs, allowing it to move almost freely between different hosts without the need for special reconfiguration. Containers greatly lighten the work of system administrators, and they are objects suitable to be shared and treated as if they were black boxes. Not all containerization and orchestration environments meet the needs of the HPC world: compared to the enterprise case, the aim is not to keep services alive but to execute jobs. In this context, surpassing the popular Docker platform [5], Singularity [6] constitutes the main solution created to be interoperable with the tools and workflows already common in the HPC world. However, moving to Singularity and abandoning the current "bare metal" approach requires users to become familiar with a new way of thinking and with two key actions: creating new containers and then launching them on target hardware. Crescoware is a Software-as-a-Service (SaaS) platform built to ease users into this paradigm shift. The idea is to bring all the actions necessary to work with containers within a visual environment capable of abstracting all the complexities and promoting the advantages offered by containerization, with the intent of creating a frictionless experience on the ENEAGRID HPC resources together with the main science gateway interface FARO 2.0 [7]. The paper is organized as follows. Section 2 is focused on Singularity for containerized HPC/AI applications. Section 3 presents the Crescoware interface, while Sect. 4 presents its design and software architecture. Finally, Sect. 5 discusses experimental results and planned future enhancements.
2 Containerized HPC/AI Applications
Within just the past few years, the use of containers has revolutionized the way in which industries and enterprises have developed and deployed computational software and distributed systems. The containerization model is gaining traction because it provides improved reliability, reproducibility, and levels of customization that have never been possible before. This paradigm shift made inroads into the high-performance computing community via new tools such as Singularity, allowing users to securely run containers in distributed HPC centers. From the onset of containerization in high performance
computing, the Singularity platform has been a leader in providing container services, ranging from small clusters to massive supercomputers, and container computing has revolutionized the way research groups are developing, sharing, and running software. Researchers interested in AI, deep learning, compute-driven analytics, and IoT are increasingly demanding HPC-like tools and resources. Singularity has many features that make it a good container solution for this new type of R&D workload. Instead of a layered filesystem, a Singularity container is stored in a single file that encapsulates applications, data, scripts and supporting libraries. This simplifies the container management lifecycle and facilitates features such as image signing and encryption to produce a trusted container environment. The Singularity container system started as an open-source project in 2015, created as a result of scientists wanting a new method of packaging analytics applications for mobility and repeatability. By combining its success in HPC environments with the rapid expansion of artificial intelligence, deep learning, and machine learning in the R&D domain, Singularity is uniquely qualified to address the needs of a container-based HPC environment. At runtime, Singularity blurs the lines between the container and the host system, allowing users to read and write persistent data to different distributed and parallel filesystems in ENEAGRID like OpenAFS [8], GPFS [9], Ceph [10], and to leverage hardware like GPUs and InfiniBand [11] with ease. The Singularity security model is also unique among container solutions. Users can build containers on resources they control, and they can move their containers to a production environment where the Linux kernel enforces privileges as it does with any other application. These features make Singularity a simple, secure container solution perfect for HPC and AI workloads. Moreover, Singularity blocks privilege escalation within the container, so if a user wants to be root inside the container, they must be root outside the container. This usage paradigm mitigates many of the security concerns that exist with containers on multitenant shared resources. Programs inside containers can be called from outside the container, fully incorporating pipes, standard I/O, file system access, graphic displays, and MPI. By choosing to implement Singularity in ENEAGRID computing facilities, researchers can execute containers as if they were native programs or scripts on a host computer. All standard input, output, error, pipes, and other communication pathways used by locally running programs are synchronized with the applications running locally within the container. From a theoretical point of view, containerization by means of this architecture is the easiest and most secure solution to extend ENEA CRESCO HPC cluster features outside their classical use. Custom requests sent by researchers to set up specialized environments for specific simulations are managed directly by the researchers themselves, avoiding the need to modify, as administrator, configurations or general software adoptions that could conflict with the integration of a compute node into the ENEAGRID infrastructure.
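For readers unfamiliar with the workflow that Crescoware abstracts, the sketch below shows the typical command-line steps (a `singularity build` from a definition file, then a `singularity exec` with filesystem bindings and GPU support) wrapped in Python; the definition file, image name and bind paths are placeholders, not ENEAGRID-specific values.

```python
# Illustrative sketch of the command-line steps that Crescoware abstracts away.
# The definition file, image name and bind paths are placeholders; the
# `singularity build` and `singularity exec` commands (with --bind and --nv)
# are standard Singularity CLI usage.
import subprocess

def build_container(definition_file="pytorch.def", image="pytorch.sif"):
    # Building from a definition file normally requires root (or --fakeroot).
    subprocess.run(["singularity", "build", image, definition_file], check=True)

def run_in_container(image="pytorch.sif"):
    subprocess.run(
        [
            "singularity", "exec",
            "--nv",                           # expose NVIDIA GPUs inside the container
            "--bind", "/gpfs/scratch:/data",  # mount a parallel filesystem path
            image,
            "python", "train.py",
        ],
        check=True,
    )

if __name__ == "__main__":
    build_container()
    run_in_container()
```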
3 Crescoware Interface
Basically, Crescoware is a web application bringing all the necessary actions to work with containers in a visual environment capable of abstracting many complex command-line
operations, typically executed over multiple terminal sessions. Around this main feature, a collaborative catalog of Singularity containers has been developed in such a way that users, typically researchers, can maintain (or simply take advantage of) a common set of tools. In the following, the most important features that Crescoware offers to its users are illustrated (Fig. 1).
[Fig. 1 elements: Login; Catalog; Workspace (Manage Containers, Manage Versions, Manage Builds, Playground); Help]
Fig. 1. Overview diagram of crescoware’s web interface structure
After the login process, Crescoware prompts the collaborative catalog of Singularity containers (Fig. 2) which are divided into three categories: private, shared and public.
Fig. 2. The collaborative catalog of Singularity containers
By clicking on any container in the list, a detailed view is presented (Fig. 3), showing a text description of its features, its inputs, its outputs and the position of the image on the file system. The detailed view is connected to a MediaWiki instance [12], in such a way that any other relevant information about the project (such as a user manual)
can be recorded. Users are enabled to give “likes” to containers: this ability is a great opportunity to engage users and to persuade them to share more custom containers. In the detailed view, the user can browse previous versions of a given container, together with any related “changelogs”.
Fig. 3. View of a Singularity container from the catalogue
The workspace entry highlights the graphical tools provided in order to build and run containers (Fig. 4). Currently, the workspace has four functions: “Manage containers”; “Manage versions”; “Manage builds”; and “Playground”. The features called “Manage containers” and “Manage versions” allow users to create, read, update or delete containers and their versions info. In this design, containers are released in many versions. Container classes store general purpose information such as: name; target architecture; sharing preference; catalog wallpaper; description; overview; inputs; outputs; featured YouTube video; and tags. Version classes store more specific information such as: container; version number; changelog; and building recipe. As soon as a version is assigned to a container, a build process is started. Under the hood, the build scheduler will perform any relevant action to return the user the log deriving from its building recipe, namely the “stdout” and “stderr” of the build process and (if present) the Singularity Image Format (SIF) file. The feature called “Manage builds” (Fig. 5) presents information such as: container; version number; build status; last update; and the log.
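As a minimal sketch of the information held by the container and version classes described above (the field names mirror the text, but the classes themselves are illustrative and not Crescoware's actual PHP implementation):

```python
# Minimal sketch of the data stored by the container and version classes described
# in the text. Field names mirror the text; the classes are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContainerVersion:
    version_number: str
    changelog: str
    recipe: str                      # the Singularity building recipe
    build_status: str = "pending"    # updated as the build scheduler progresses

@dataclass
class Container:
    name: str
    target_architecture: str         # e.g. "x86_64" or "ppc64le"
    sharing: str                     # "private", "shared" or "public"
    description: str = ""
    tags: List[str] = field(default_factory=list)
    versions: List[ContainerVersion] = field(default_factory=list)

    def add_version(self, version: ContainerVersion):
        # In the real system, assigning a version is what triggers a build.
        self.versions.append(version)
```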
Fig. 4. The workspace of Crescoware
Fig. 5. The “Manage Builds” feature of the workspace
Finally, the workspace is completed by the "Playground" (Fig. 6), a web-based SSH client. The first outcome of this feature is that users can run their own containers without the need to configure an SSH client on their computer. Moreover, in the "Playground" the SSH shell is complemented by a widget named the "Cheatsheet", a set of configurable commands that can be copied to the shell and then executed. This saves users from remembering long and complex commands, replacing them with just a few clicks. For example, a user can: load the Singularity module on the node; import (namely download) a Singularity container into the current working directory, with the option to specify the version; and run (or submit to an HPC queue) the Singularity container with bindings to distributed filesystems.
Fig. 6. The playground feature of the workspace
This feature is not meant to compete against full-featured stand-alone SSH clients, but it constitutes a test-bench for new builds as well, and it can ease users' migration towards a container-based approach for their HPC and/or AI workloads. Considering the structure of the workspace, one can say that it provides an end-to-end solution towards the containerization of a service. Containers can be defined, versioned, built, and ultimately run inside the workspace. However, evaluating how the container and version classes were modeled, Crescoware is ready to be enhanced soon to support processing pipelines based upon Singularity containers in the catalog. In fact, considering that containers can be thought of as black boxes operating on a certain set of inputs to return a certain set of outputs (all information that can be stored in the system), entire HPC/AI workflows could be designed visually around them. Of course, this enhancement will be effective whenever a semantic meaning is enforced on input and output fields in order to make them machine-readable. It seems important to emphasize the uniqueness of the proposed solution: to our knowledge, web applications that deliver similar functionality and a similar degree of integration with HPC resources do not exist. Another important aspect is the look-and-feel of the user interface, which is uncommon compared to other tools for HPC: the frictionless experience that Crescoware provides to end users is a key factor for broad adoption of the solution.
4 System and Software Architecture
The current system architecture (Fig. 7) of Crescoware consists of: a machine with the role of web application server (LAMP stack), WebSocket server and build scheduler; two machines with the role of build servers (one for the x86 and one for the ppc64le architecture); and an SSH terminal server. These machines are distributed onto the ENEAGRID infrastructure as virtual machines running specific roles. The system is partitioned in such a way that a decoupled implementation of each service can be realized. The machines run a Linux distribution on a virtual infrastructure and, from a networking point of view, firewall rules have been configured to allow communications only on specific ports with specific hosts.
Fig. 7. Diagram of the current system architecture from an IT perspective
The first machine serves the web application front-end and back-end services through the standard HTTPS port (443). The front-end has been developed in HTML5, CSS3 and JavaScript following the Single Page Application (SPA) paradigm (Fig. 8). The back-end is written in PHP following the REST API paradigm. A MariaDB instance is also involved to allow the persistence of application data. The machine also serves a WebSocket server through a custom TCP port (3000). This component is written in Node.js and is used to implement the web-based SSH client of the "Playground". WebSockets allow an asynchronous communication to be maintained between the web browser and the server, a tool that in this case has been used to deliver users a full-featured interactive shell towards an SSH terminal server (Fig. 9). Moreover, the machine runs the build scheduler (Fig. 10), which is an agent developed in Node.js that connects to a build server as root to start the build process of any new container version's recipe stored in the system (and retrieve the resulting logs and SIF file). These build servers are machines that have been configured with Ubuntu 20.04.3 LTS and Singularity. Building is allowed on two different machines in order to compile on both the x86 and ppc64le architectures. In fact, for the purpose of compiling for a specific platform architecture, root access to a machine with that architecture is normally required. Otherwise, the generation of containers for target architectures other than the one on which the build process is performed is extremely complex, requiring the use of emulation tools that typically show very limited performance and are difficult for newcomers to implement. In the case of exotic platforms (such as ppc64le) a shared build
Fig. 8. Sequence diagram of the SPA for persisted information retrieval
Fig. 9. Sequence diagram of the web-based SSH client
server allows resource allocation and system administration to be made more efficient in the datacenter, and it makes building viable for a broader group of users. The SSH terminal server is the CRESCO node to which the web-based SSH client of the "Playground" actually connects. In fact, the above WebSocket server acts as a message relay (also performing a protocol conversion) between the SPA and the CRESCO node. Any communication between the SPA and the WebSocket server is protected with TLS. Users log into the SPA with their credentials thanks to the integration of the back-end services with the ENEAGRID Kerberos-based federated identity management. However, users are required to enter their credentials again when connecting to the SSH terminal server upon entering the "Playground", since it is implemented as a native (not mediated) SSH connection.
Fig. 10. Sequence diagram of the build process
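The build scheduler itself is a Node.js agent; purely as an illustration of the flow in Fig. 10, a Python sketch using paramiko might connect to a build server, run `singularity build` on a stored recipe and collect the log, as below. The hostname, credentials and paths are hypothetical.

```python
# The real build scheduler is a Node.js agent; this Python sketch (using paramiko)
# only illustrates the flow of Fig. 10: connect to a build server over SSH, run
# `singularity build` on the stored recipe, and collect stdout/stderr as the log.
import paramiko

def run_build(host, recipe_path, image_path):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username="root")           # key-based auth assumed
    try:
        command = f"singularity build {image_path} {recipe_path}"
        _, stdout, stderr = client.exec_command(command)
        log = stdout.read().decode() + stderr.read().decode()
        status = stdout.channel.recv_exit_status()  # 0 means the build succeeded
        return status, log
    finally:
        client.close()

status, log = run_build("build-x86.example", "recipes/42.def", "images/42.sif")
print("build", "succeeded" if status == 0 else "failed")
```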
From a security standpoint, the current implementation only partially takes into account the fact that the build process runs on build servers as root. For example, commands in the "%setup" section of a recipe are executed on the build server as root; this is prevented by banning this section from recipes. However, a future development is planned to implement a better approach to this problem, namely the usage of "disposable" build servers: virtual machines that are provisioned for just one build process through integration with virtualization platforms already available in the ENEA IT Division.
5 Conclusions and Future Developments
This work introduces Crescoware, a container-based gateway for HPC/AI applications in the ENEAGRID infrastructure. Crescoware is a web application built to ease CRESCO users into migrating their workloads to Singularity containers, bringing all the necessary actions to work with containers within a visual environment that abstracts many complex command-line operations, typically executed over multiple terminal sessions. Among the features offered by Crescoware are: a collaborative catalog of Singularity containers; a workspace to create new containers, based on centralized build servers; and a web-based SSH client to test container execution. After a short review of the advantages offered by containers compared to a traditional bare-metal approach, this work focuses on how the problem of enabling users to create and launch Singularity containers has been dealt with, having in mind the goal of avoiding the perception of an increased complexity of their workflow. Currently, the system is open to a restricted set of highly skilled researchers (about 20 end users). At the moment, Crescoware hosts about 10 different public containers ranging from deep learning libraries, like PyTorch and TensorFlow, to computational science, like AiiDA, and from automated code quality tests to specialized residual attention networks. Another recent application involves an advanced molecule editor and visualizer designed for cross-platform use in computational chemistry, molecular modeling, bioinformatics, materials science, and related areas. Even applications in the graphic design sector can benefit from the Crescoware container-based approach.
Future work will focus on the implementation of graphical tools to design processing pipelines based upon Singularity containers in the catalog, and on "disposable" virtualized build servers to enhance the security and isolation of the build process. The Crescoware ecosystem will soon be opened to all ENEAGRID researchers; then, broader feedback will be collected and presented.
References
1. Huerta, E.A., et al.: Convergence of artificial intelligence and high performance computing on NSF-supported cyberinfrastructure. J. Big Data 7(1), 1–12 (2020). https://doi.org/10.1186/s40537-020-00361-2
2. Brayford, D., Vallecorsa, S., Atanasov, A., Baruffa, F., Riviera, W.: Deploying AI frameworks on secure HPC systems with containers. In: 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6 (2019)
3. Colonnelli, I., Cantalupo, B., Spampinato, C., Pennisi, M., Aldinucci, M.: Bringing AI pipelines onto cloud-HPC: setting a baseline for accuracy of COVID-19 diagnosis. ENEA CRESCO in the Fight Against COVID-19, pp. 66–73 (2021)
4. Iannone, F., et al.: CRESCO ENEA HPC clusters: a working example of a multifabric GPFS Spectrum Scale layout. In: 2019 International Conference on High Performance Computing & Simulation (HPCS), pp. 1051–1052. Dublin, Ireland (2019)
5. Bashari Rad, B., Bhatti, H., Ahmadi, M.: An introduction to Docker and analysis of its performance. Int. J. Comput. Sci. Network Secur. 173, 8 (2017)
6. Kurtzer, G.M., Sochat, V., Bauer, M.W.: Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5), e0177459 (2017)
7. Mariano, A., et al.: Fast Access to Remote Objects 2.0 A renewed gateway to ENEAGRID distributed computing resources. Future Gener. Comput. Syst. 94(2019), 920–928 (2017)
8. OpenAFS homepage. http://www.openafs.org/. Accessed 25 Nov 2021
9. GPFS homepage. https://www.ibm.com/docs/en/spectrum-scale/4.2.0?topic=scale-overview-gpfs. Accessed 25 Nov 2021
10. Ceph homepage. https://ceph.com/en/. Accessed 25 Nov 2021
11. InfiniBand homepage. https://www.nvidia.com/en-us/networking/products/infiniband/. Accessed 25 Nov 2021
12. MediaWiki homepage. https://www.mediawiki.org/wiki/MediaWiki. Accessed 25 Nov 2021
Significance in Marlo Diagrams Versus Thoroughness of Venn Diagrams Marcos Bautista López Aznar1(B) , Guillermo Címbora Acosta2 , and Walter Federico Gadea1 1 University of Huelva, Huelva, Spain
[email protected] 2 University of Seville, Seville, Spain
Abstract. Logical diagrams allow expressing intuitively the exact analogy that exists between the relations of concepts and those of spatial figures. Undoubtedly, among the logical diagrams, those created by Venn stand out. These diagrams form a complete logical system with a well-defined syntax, equivalent to the monadic logic of the first order. However, Venn diagrams have the disadvantage that they need to explicitly and comprehensively show all the combinations that make up the universe of discourse, including empty classes. This requirement considerably reduces clarity and its usefulness as the number of letters increases. By contrast, Marlo diagram, quantifying the predicate and without giving up the functional dichotomy “subject-predicate”, overcomes this difficulty and can communicate the same information with rigor and precision, but more economically, handling only relevant information. In this way, it is possible to maintain a greater correspondence between the formal notation, the linguistic processes that lead us from the premises to the conclusion, and the graphic representation of each of the reasoning steps that underlie the inference. After many years of efforts to make our diagrams simple and intuitive tools for the didactics of logic, we present here a more detailed analysis of some of its operations that could be useful to investigate the constitutive processes of reasoning involved in First-order logic. Keywords: Logic diagrams · Diagrammatic reasoning · Visual reasoning
1 Elementary Propositions in Marlo and Venn Diagrams
Logic diagrams can represent propositions and inferences in an insightful way [see, e.g., 2, 3, 10, 12, 16, 17, 22, 23, 26, 29, 30, 32, 33, 35]. The diagrammatic representation of logic that Venn proposed [37] is widely spread, from the field of didactics to even the context of bioinformatics [14]. Starting from Boolean symbolic logic [11], Venn proposed an alternative system to Aristotelian logic, which he accompanied by a diagramming system in perfect harmony with his principles. Venn diagrams form a complete logical system with well-defined primitive notions of syntax [2, 27, 33], so they are an example of heterogeneous reasoning equivalent to first-order monadic logic [19] and can represent syllogistic in a perspicuous way [13]. He improved the design of
Leibniz-Euler’s diagrams and contributed to a great extent to the abandonment of the Aristotelian approach to reasoning by modern logicians [1, 2, 10]. In Venn’s logic the predicate of a proposition no longer corresponds to an attribute of an object, nor are there subjects or predicates. For him, the fundamental concept is that of class, which allows us to speak of mathematical equality as identity. However, from the perspective of the Theory of the Quantification of the Predicate that was rejected by Venn due to the limitations of Hamilton’s proposal [15, 18, 38], Marlo’s diagram allows to maintain the distinction between subjects and predicates and to perform logical calculations with mathematical precision. In the first row of Fig. 1, we see how Venn diagrams represent logical propositions by eliminating different combinations of the universe of discourse. In the case of the proposition “A ∧ B”, we have represented the existence with an “X” in the conjunction of classes A and B. Under each of the propositions expressed in Venn diagrams, we can observe four equivalent ways in which they can be represented in the Marlo diagram [4–6, 9, 24].
Fig. 1. Elementary propositions in Marlo and Venn diagrams, with and without distinction between subject and predicate.
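As a purely illustrative aid (not part of the original paper), the short sketch below enumerates, for each connective over A and B, which of the four regions of the universe of discourse is ruled out, which is exactly what the eliminations in the Venn diagrams of Fig. 1 encode.

```python
# Illustrative sketch (not from the paper): each connective over A and B can be
# read as ruling out some of the four regions of the universe of discourse.
from itertools import product

REGIONS = list(product([True, False], repeat=2))   # (A, B) value pairs

def eliminated(connective):
    """Return the regions a connective rules out, labelled AB / A¬B / ¬AB / ¬A¬B."""
    def label(a, b):
        return ("A" if a else "¬A") + ("B" if b else "¬B")
    return [label(a, b) for a, b in REGIONS if not connective(a, b)]

print("A <-> B eliminates:", eliminated(lambda a, b: a == b))        # A¬B, ¬AB
print("A -> B  eliminates:", eliminated(lambda a, b: (not a) or b))  # A¬B
print("A or B  eliminates:", eliminated(lambda a, b: a or b))        # ¬A¬B
print("A xor B eliminates:", eliminated(lambda a, b: a != b))        # AB, ¬A¬B
```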
Firstly, let us look at the biconditional. Venn represents it explicitly by eliminating the regions of A¬B and B¬A. The region of class AB is explicitly allowed, while the ¬A¬B class is implicitly affirmed by not being deleted. The rest of the logical connectives are represented in the same way, eliminating prohibited combinations. The elimination of the region ¬A¬B becomes explicit in the disjunctions. Venn diagrams have the disadvantage
of showing explicitly and completely all the combinations that make up the universe of discourse, including empty classes [14, 16, 19, 31]. To understand the meaning of our diagrams, it may be useful to attend to the formal notation that appears at the bottom of Fig. 1 and 2. When one of the variables of the proposition is considered universally, that variable is accompanied by a subscript x, but not when it is particular (to avoid terminological disputes, variables are referred to hereinafter as characters or letters). Each proposition can be converted and transformed without changing its logical meaning. To do the conversion, just change the order of the characters. When we transform, we exchange the subscript x and modify the quality of both characters. We can see these processes in detail in Fig. 2, in which we obtain the four formal language expressions of the conditional statement “A → B”. Once we understand the meaning of our formal notation, it is easy to understand how diagrams are constructed. The character that appears first in the formal notation is taken as the subject of the proposition and is placed in the center of the diagram. The character that appears in second place is taken as the predicate and is placed within the diagram of the subject, but not in the center. We call diagrams propositional models [9]. If the character taken as subject is universal, so that it appears with the subscript x in the formal notation, then its model is not divided [6]. An example can be found in Fig. 2, particularly in the representation of propositions ax b and ¬bx ¬a. But if the character is taken as particular, then its model is divided, so that the association with the character that acts as a predicate in the proposition only takes place in a part of the set and not in the whole set. That is the case in bax and ¬a¬bx . In the divided models we observe a part that appears blank, in which sometimes a question mark is placed to indicate that it remains undetermined.
Fig. 2. Conversion and transformation of a conditional proposition “A → B”.
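The conversion and transformation rules just described can be made mechanical; the following sketch (an illustration added here, not the authors' software) represents a proposition as two characters plus the position of the subscript x, and reproduces the four expressions of "A → B" shown in Fig. 2.

```python
# Illustrative sketch of conversion and transformation (Fig. 2). A proposition is
# two characters, one of which carries the universal subscript x; for example
# ("a", "b") with the x on the first character is read as "ax b".
def show(first, second, x_on_first):
    return f"{first}{'x' if x_on_first else ''} {second}{'' if x_on_first else 'x'}"

def negate(character):
    return character[1:] if character.startswith("¬") else "¬" + character

def convert(first, second, x_on_first):
    # Conversion: only the order of the characters changes; x stays with its character.
    return second, first, not x_on_first

def transform(first, second, x_on_first):
    # Transformation: change the quality of both characters and move the subscript x.
    return negate(first), negate(second), not x_on_first

p = ("a", "b", True)                    # ax b, i.e. "A -> B" with a taken universally
print(show(*p))                         # ax b
print(show(*convert(*p)))               # b ax
print(show(*transform(*p)))             # ¬a ¬bx
print(show(*convert(*transform(*p))))   # ¬bx ¬a
```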
For example, in the model of bax , the upper part indicates that, based only on the proposition, and forgetting our previous knowledge, this part of b could be associated with a or ¬a. For the same reason, the top of the upper part of the model of ¬a¬bx could be filled with “b?” to indicate that ¬ab is possible and that ¬a¬b is also possible. Let us focus now on the predicate. When it is taken universally, it only appears once, within the subject model, but when it is particular, it also appears outside the subject’s model. The predicate is universal in bax because we cannot find a not associated with b, that is, in the region of ¬b. Similarly, in ¬a¬bx , we cannot place any part of ¬b outside the model of ¬a. And now observe Fig. 3.
Fig. 3. Conditional and disjunction propositions on Marlo diagrams, tree diagrams, and truth tables. Conventional formal notation, and Marlo notation by quantifying the predicate [9].
If we represent the conditional proposition in a tree diagram (Fig. 3), we can understand why we consider that a and ¬b are taken universally in it, while ¬a and b are taken in a particular way. The disjunction tree diagram (lower part of Fig. 3) also allows us to understand that, behind the grammatical relations between subjects and predicates of the propositions, there are implicit universal and particular quantitative relations [11, 18, 20]. To understand the tree diagrams in Fig. 3, we must know that any underlined letter represents a criterion that allows us to distinguish between stimuli that possess or do not possess a certain quality. For example, the criterion of beauty “b” in a dichotomous system allows people to be labeled as beautiful “b” or not beautiful “¬b”. If we also use the intelligence criterion “i” dichotomously, then we have four types of people: ab, a¬b, ¬ab, ¬a¬b. Of course, between not meeting a criterion at all and meeting this criterion perfectly, we can establish an unlimited number of divisions in theory. But such divisions should not exceed the power to discriminate one stimulus from another. The more divisions we make, the more ability to perceive nuances, but also more cognitive costs that are not always profitable. In any case, one of the greatest advantages of our proposal is that the same inference principles are used even if we are working with a dichotomous, trichotomous, or any other type of system. If we think carefully about this issue, we will come to understand that the degrees of the fulfillment of a criterion, for
example, being very beautiful "0.8b" or being only slightly beautiful "0.2b", are different from the values of the truth tables. For example, if we affirm that "If you are absolutely intelligent (1i), then you are very beautiful (0.8b)", and it is true that you are "1i", then it is absolutely true (1 in the truth table) that you are "0.8b".
2 Main Elements of Marlo Diagram
Fig. 4. Some elements and logical symbols in Marlo diagram (1)
Fig. 5. Some elements and logical symbols in Marlo diagram (2)
We are aware that prior knowledge in logic interferes with learning our notation. That is why we have included Fig. 4 and 5 to make this article easier to understand. We are confident that our proposal has some advantages that we can only try to point out in this document. Returning to the quantification of the predicate present in George Boole [11] allows us to address the uncertainty and probable conclusions. We do not need to distinguish between propositional logic and predicate logic. We assert only one reasoning system in which modal logic is naturally included. The separation of mathematical logic from Aristotelian logic allowed the elimination of useless metaphysical questions, but it also caused the loss of important reflections on how material truth is communicated. We explain all inferences as syllogisms, without the need to postulate different logical connectives. We believe that this simplifies logical calculus and returns a central role to common sense in reasoning tasks. Our notation is a renewed version of the Quantification of the Predicate followed by authors like Stanhope and Jevons, who were the first to build logic machines as early as the 19th century. Similar to Solomon Maimon (1753–1800), we add the subscript x when the variable is taken universally. If it is taken in a particular way, we do not add any subscript x. For example, when we said, If you are a mammal then you are a vertebrate, we are talking about all mammals, but only a part of vertebrates. So, we write “mx v”. We can also operate with subscripts that mean most, half, few (See Fig. 4.3) [8]. Any formula can be converted and transformed. When we convert, we only change the
place of the variables. When we transform, we change the quality of both variables and permute the subscript x (see Fig. 5.4). From the perspective of the Quantification of the Predicate, we can represent any logical connective. You must bear in mind that we understand a proposition as the assertion of the association of the whole or one part of a variable with the whole or the part of another. We do not care if the qualities of the variables are positive "a" or negative "¬a": propositions are always affirmative. For example, No prime number is even = The whole of prime numbers is equivalent to part of non-even numbers.

Table 1. How to work logical connectives in Marlo diagram

Logical connective | Propositional calculus notation | Marlo notation | Action rules
Biconditional | ↔ | ax bx | 1. Remove all combinations containing a¬b and b¬a from the network. 2. Add a to all formal language combinations that contain b. 3. Add b to all combinations containing a
 | | ¬ax ¬bx | 4. Add ¬a to all formal language combinations that contain ¬b. 5. Add ¬b to all combinations containing ¬a
Exclusive Or | | ax ¬bx | 1. Remove all combinations containing ab and ¬a¬b from the network. 2. Add ¬b to all formal language combinations that contain a. 3. Add a to all combinations containing ¬b
 | | ¬ax bx | 4. Add b to all formal language combinations that contain ¬a. 5. Add ¬a to all formal language combinations that contain b
Conditional | → | ax b | 1. Remove all combinations containing a¬b from the network. 2. Add b to all formal language combinations that contain a
 | | ¬bx ¬a | 3. Add ¬a to all combinations containing ¬b
Inclusive Or | ∨ | ¬bx a | 1. Remove all combinations containing ¬a¬b from the network. 2. Add a to all formal language combinations that contain ¬b
 | | ¬ax b | 3. Add b to all combinations containing ¬a
Nand | | ax ¬b | 1. Remove all combinations containing ab from the network. 2. Add ¬b to all formal language combinations that contain a
 | | bx ¬a | 3. Add ¬a to all combinations containing b
Empty set | ∅ | a | 1. Remove all combinations containing a from the network
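A short sketch of how the removal rules of Table 1 act on the network of combinations, assuming the network is held as complete assignments over the characters (with complete combinations the "add" rules are already implicit); the function names are illustrative, not the authors' code:

from itertools import product

def universe(chars):
    # the 2**n combinations of an Alfa dichotomous system
    return [dict(zip(chars, vals)) for vals in product([True, False], repeat=len(chars))]

def remove(network, **forbidden):
    # action rule: delete every combination that contains the forbidden sub-combination
    return [c for c in network if not all(c[k] == v for k, v in forbidden.items())]

def show(network):
    return ["".join(("" if c[k] else "¬") + k for k in c) for c in network]

net = universe("ab")
conditional = remove(net, a=True, b=False)            # ax b: forbid a¬b
biconditional = remove(conditional, a=False, b=True)  # ax bx: also forbid ¬ab
print(show(conditional))    # ['ab', '¬ab', '¬a¬b']
print(show(biconditional))  # ['ab', '¬a¬b']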
3 Existential Import of Propositions in Marlo Diagram In addition to formal relations, some propositions also assert existential content. We are aware that in the conventions of mathematical logic universal propositions are considered hypothetically [10]. However, giving existential content to the terms allows us to see how material truth is communicated in inference, which we consider important when communicating with natural language. In everyday life, when I say that All my sisters are beautiful, the listener understands that I actually have sisters. In the Marlo diagram we express that there is something defined by a character (i.e., that it exists) and that it is present using the capital letter. Venn claimed that it was best to eliminate the existence of universal proposition terms such as All men are mortal to avoid an existential commitment with the term non-mortal after converting the initial proposition to No nonmortal is human [28, 36, 38]. However, we have solved this problem by distinguishing between conversion and transformation of propositions [5, 9]. We can see in Fig. 6 why when converting a proposition the existential content is preserved, but not when it is transformed. When we convert All A is non-B into Part of non-B is A, the existential charge remains, but not when transforming this proposition into If B, then ¬A. That is, when converting (ax ¬b = ¬bax ) we are on the same path of the tree diagram, so, if A is true in the first proposition, it is true in the second. But when transforming (ax ¬b = bx ¬a) we change the path. In this way, we avoid problems associated with the principle of contraposition like those discussed by John Buridan [28].
Fig. 6. “(A → ¬B) ∧ A” in Marlo diagram [9].
Like the tradition of Aristotelian logic, we also believe that natural language communicates the certainty associated with the existence of the terms that operate as subjects and predicates of sentences. It is easy for human cognitive systems to see the difference between conjectures, theories, and ongoing facts. We distinguish in the notation the following existential modes: 1. Hypothetical or theoretical 1.1 Possible as conjecture or guess: Any variable or combination of variables without contradiction is possible a priori. It is impossible “a¬a?”. If it is possible
A, then it is possible ¬A, as well as it is possible a. a? = ¬a? 1.2 Theoretical generalization supported by experience or social agreement: For instance, All men are mortal. The presence or absence of the terms is not considered as really true here and now. These theoretical objects are not considered to make decisions during ongoing situations. We do not run because all lions eat people unless there is a lion here and now. The presence or absence of the terms is not contained in the theoretical generalizations. The theoretical affirmation or denial of combinations prevails over conjecture (see Fig. 7.D). ax b → (a¬b) 2. Factual: its presence or absence is actually activated by an external or internal stimulus: there can be physical, mental, or social facts such as roles, laws, values, etc. An activated node can modify the values of other linked nodes. 2.1 Present here and now: There is sufficient reason to affirm the presence of something with the quality a: A 2.1.1 There is sufficient reason to affirm the presence of everything that possesses the quality a: [A] 2.1.2 It is impossible A¬A in the same individual at the same time, but it is possible A.¬ A as the presence of two different objects. 2.2 Absent here and now. Only things that exist can be absent here and now. For example: There is sufficient reason to affirm the absence of something with the quality ¬a: ¬A 2.2.1 It is impossible A¬A, but it is possible A.¬ A as absent of two different objects. 2.2.2 It is impossible AA, but it is possible A.A as an aggregation of two different objects. 2.2.3 There is sufficient reason to affirm the absence of everything with the quality a: A 2.2.4 It is impossible [A]. A as well as [A].A. That is, it is impossible at the same time that all my students are here and now and one of them is absent. 2.2.5 If everything is absent, at least one object is absence: A → A 2.2.6 If all A is B, and all B is absent, at least one A is absent. ax b. B → A
4 Elementary Notation and Building Rules in Marlo Diagram [9] Below are some elementary construction rules for Marlo diagrams, as well as the meaning of their basic notation. We recommend looking at Fig. 7 to understand the explanations.
• a = Criterion that allows a cognitive system to establish differences between stimuli. In an Alfa dichotomous system, a allows to distinguish between two characters: a and ¬a. Each of the unique and distinct sets of characters that can be formed is called an object. In an Alfa dichotomous system, the number of objects = 2n (n = number of criteria). An object can be removed from when it is not reasonable to the system believe in its existence at any time or place a ∨ ¬a ∨ ∅]. Sometimes objects just have the state of conjecture “a?”. Sometimes a cognitive system only deals with objects in a theoretical way “a”. When an object must be taken into account during an ongoing situation, that object will be present or absent [(¬ab + stimulus) → ¬AB ∨ ¬AB]. Sometimes all objects that share a character or set of characters are present [¬AB], and sometimes all are absent ¬AB . The meaning of A is “There is something in the current situation that could be A or ¬A (see Fig. 6). • [ax bx = bx ax = ¬bx ¬ax = ¬ax ¬bx ] = Any combination of the “a character” contains b and any combination of b contains a. It is impossible a¬b and ¬ab. Rules of action: You must add b to any character sequence containing a. You must add a to any character sequence containing b. You must delete any character sequence containing a¬b or ¬ab. You can convert the equation by swapping the order of the characters (see Fig. 7D). • ax b: Any combination of a contains b. At least part of the combinations of b contain a. It is impossible a¬b. The Rule of action is Synthesis or Generalization: You must add b to any character sequence containing a. You must delete any character sequence containing a¬b (see Fig. 7.E.9). You can convert the equation by swapping the order of the characters. You can transform equations by exchanging the subscript X and changing the qualities of both characters. If it is a fact that all A is associated with B, we can theoretically say that if we have A, then we will have B. [A]B → ax b ∧ AB. • A propositional model graphically expresses the relations between the subject “S” and the possible predicates “P” of a proposition. The subject and the predicate of the propositions with which we work are characters. • The set of possible combinations of a character or conjunction of characters is represented in a geometric figure such as a circle, a triangle, or a square. • We place the character that acts as the subject in the center of the figure and each of its possible predicates on the sides. Each side of the figure is a region of the propositional model (Remember what we said in Fig. 5.5). • Following the Theory of the Quantification of the Predicate, when we draw propositional models, we ask ourselves about the quantity of the subject and the predicate. If the subject is particular, then we divide the model, but not if it is universal. If the predicate is particular, we place the predicate inside and outside the model of the subject, but if the predicate is universal, we only place the predicate inside the model of the subject. • If we attribute to S a predicate P that can also be associated with ¬S, then we must place P inside and outside the “S model”, but with a question mark “P?” outside the “S model”. • If we attribute to S a predicate P that cannot be associated with ¬S, then we must place P only inside the “S model”.
Fig. 7. Exercise solved in Marlo diagram to show some basic rules of inference.
• All sides of the figure are potentially combinable with each other when this does not lead to a contradiction. Any combination not explicitly prohibited is allowed. • Each side of a propositional model can be indeterminate or determinate. Once it is necessarily determined by a predicate P, it is not possible to place ¬P on this side. P and ¬P can share the same model, but they must be located in separate regions. • A blank region within a model means the same as a region with a question mark. The meaning of a question mark in a region depends on the context. That is, if P determines a region, the question mark in any other region should be interpreted as “¬P?”. • Together, the inside and outside of a model represent the Alpha system. That is, if the conjunction AB is taken as the subject of a model, the exterior of the model implicitly contains any combination other than AB, that is, A¬B, ¬AB, ¬A¬B. If a character with a question mark “a?” is placed within a region of a propositional model, then it is equally possible to place ¬a? in the same region, but not if there is a, or A, or A. • Following the precautionary principle, each particular predicate of S involves drawing a new side in its propositional model. In this way we avoid associating two characters that may not be associated (Remember what we said in Fig. 5.5). • If two models share the subject, both models can be synthesized into one. We must take into account the precautionary principle and recall collecting the characters located outside the models. It is possible to use the convention defined in Table 1 to express all allowed combinations outside of a model. And it is also possible to collect all the characters forbidden by the premises to limit the number of combinations allowed outside the model.
5 Comprehensive Coding of Possibilities Figure 8 shows the conclusion of a syllogism that states that All A is ¬C, in the Euler, Venn, and Marlo diagrams. In the three diagrams, each of the closed areas represents an object class although the outside of the enclosed areas can also represent object classes. If we agree that we have started with three dichotomous criteria, then we have eight possible combinations: abc, ab¬c, a¬bc, a¬b¬c, ¬abc, ¬ab¬c, ¬a¬bc, ¬a¬b¬c. Four of them are eliminated by the premises. We have indicated in each of the regions of the diagrams of Fig. 8 the combinations that are confirmed and eliminated. The number of regions required to solve this syllogism is eight in Venn, four in Euler and three in Marlo. What can we observe in Euler diagram? First, abc (1), a¬bc (3) and a¬b¬c (4) are forbidden within the circle of A: a¬bc and a¬b¬c because A is completely enclosed in B, so it is impossible a¬b. And abc is impossible because the circle of A and the circle of C are separated. The class ¬abc is false because the circle of B and the circle of C are separated. Second, the class ab¬c (2) is true because A is enclosed in B and separated from C. Third, ¬ab¬c (6) is true inside the circle of B, but in the part of B that is outside the circle of A. Fourth, ¬a¬bc (7) is true because it is the only possible class within the circle of C, which is absolutely separated from A and B. Finally, ¬a¬b¬c is true outside the three circles that represent A, B and C.
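The class elimination described above can be made explicit in a few lines of Python (a sketch only; the premises "All A is B" and "No B is C" are read off the Euler description given in this paragraph):

from itertools import product

classes = list(product([True, False], repeat=3))               # (a, b, c): 2**3 = 8 classes

all_a_is_b = [t for t in classes if not (t[0] and not t[1])]   # eliminate a¬b*
no_b_is_c = [t for t in all_a_is_b if not (t[1] and t[2])]     # eliminate *bc

def label(t):
    return "".join(("" if v else "¬") + n for v, n in zip(t, "abc"))

print([label(t) for t in no_b_is_c])
# ['ab¬c', '¬ab¬c', '¬a¬bc', '¬a¬b¬c'] – the four classes the premises leave open (cf. Fig. 8)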
Fig. 8. Comparison of the diagrams of Euler, Venn and Marlo [9].
What can we observe in Venn diagram? We must realize that the intersections of the circles of A, B and C can represent all classes except ¬a¬b¬c, which is represented in the region outside the three intersecting circles. In Venn, classes (1), (3), (4), and (5) are false because the intersections of the circles representing them have been removed, while all other classes are true because they have not been removed. In both Euler and Venn, we could indicate that ab¬c is not an empty class by adding an “X” in the region of the diagram that represents it. What can we observe in the Marlo diagram? In the BA¬C region of the Marlo diagram it is stated that AB¬C (2) is true, and four incompatible combinations with the information contained in that region are eliminated. First, abc (1) is false because all A, the only one that appears in the model, is associated with ¬C. Second, the combinations a¬bc (3) and a¬b¬c (4) are incompatible with the fact that the only A that is drawn is within the model B. Third, ¬abc (5) is incompatible with the fact that all regions of B are already determined by ¬C. AB¬C is capitalized to indicate that it is not an empty class. The combination b¬c¬a (6) is possible in the upper part of the B model, but not necessary (it may be the case that all B is A). The letter “¬c?”, outside the model of B, can only be ¬b¬c. And, since all of A is already inside B, the combination ¬b¬c must necessarily be completed with ¬b¬c¬a (8) outside of B. The only thing left for us to consider is ¬a¬bc. This combination is not prohibited by premises. It is obvious that it cannot occur within B. But outside of this model, nothing prevents conjecturing ¬a¬bc, because nothing is explicitly said about ¬a or c within the model that prevents us from doing so. In Fig. 9, we compare several Venn and Marlo diagrams with the same information about prohibited combinations. The process to decode the information compressed in the Marlo diagram is the same as we have followed previously. Remember that characters outside a model are potentially combinable and that all combinations that are not incompatible with the information already explicitly stated in the model are allowed. Of course, we cannot add characters that are not justified by the premises.
Fig. 9. Different propositions with three criteria in Venn and Marlo [9].
For example, in case 1, ¬z outside the model of x can only be ¬z¬x¬y: ¬x because it is outside x and ¬y because y is already fully included in x. In case 2, zy¬x, z¬y¬x, ¬zy¬x, and ¬z¬y¬x are implicitly accepted outside the model of x, and any combination other than x¬z¬y is explicitly prohibited within this model. In case 3, we see that one part of x is y and another part is z. These parts are not exclusive, but potentially combinable, as occurs in inclusive disjunctions. Then we can say that x can be xyz, xy¬z, x¬yz, but it cannot be x¬y¬z. In case 4, apart from the x model, it is only possible to assume ¬x¬y¬z, which is a possibility that is also implicitly admitted in the Venn diagram. In case 5, all x and all y are exhausted within z in the form of an inclusive disjunction, which makes it impossible to assume that ¬Z exists. In case 6, the complexity of relationships forces us to expand the notation if we want to be rigorous. The following table shows us how to interpret the notation of case 6 of Fig. 9. We can see there is a sign of multiplication “*” which means to combine. The underlined letters represent the affirmation and denial of a character. That is, in case 6, z means z and ¬z, y means y and ¬y, etc. Then z *y compresses the combinations zy, z¬y, ¬zy, ¬z¬y. In case 6, that expression is followed by an interrogation because the combinations compressed are uncertain. However, regardless of x, all combinations containing ¬y¬z must be subtracted “-[¬y¬z]”. Table 2 explains how to decompress the combinatorial combinations of two criteria.
Table 2. How to decompress the notation [9].

a*b (both letters underlined) | ab?, a¬b?, ¬ab?, ¬a¬b?
a*b (a underlined) | ab?, ¬ab?
a*b (b underlined) | ab?, a¬b?
a*b (no letter underlined) | ab?
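A small sketch of the decompression rule of Table 2, where underlined characters are treated as ranging over both their affirmation and their denial (the function name and argument layout are illustrative assumptions):

from itertools import product

def decompress(order, fixed, varying):
    # fixed: characters taken with a single quality; varying: underlined characters,
    # which stand for both their affirmation and their denial
    out = []
    for vals in product([True, False], repeat=len(varying)):
        combo = {**fixed, **dict(zip(varying, vals))}
        out.append("".join(("" if combo[k] else "¬") + k for k in order) + "?")
    return out

print(decompress("zy", {}, "zy"))          # ['zy?', 'z¬y?', '¬zy?', '¬z¬y?']  (z*y, both underlined)
print(decompress("ab", {"b": True}, "a"))  # ['ab?', '¬ab?']  (second row of Table 2)

The subtraction rule "-[¬y¬z]" of case 6 would simply filter the forbidden combinations out of the decompressed list afterwards.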
In the diagrams of Allan Marquand [25], John Venn [37, 38], and Lewis Carroll [12], the requirement to represent the entire universe of discourse to interpret the conclusions considerably reduces their clarity and usefulness as the number of characters increases. And trying to express all possible combinations in the universe of discourse can become a tedious task. In addition, it is unnecessary when the objective is to reach a certain significant conclusion. In any case, the readers can evaluate for themselves the advantages and disadvantages of each system of diagrammatic representation when we solve more complex problems. In Fig. 10, we can see how to solve an AE-type syllogism, which belongs to the socalled first Aristotelian figure. The diagram of the first premise tells us that the totality of X is associated with Y, but it does not say that “all Y is associated with X”. That is why we must add “y” outside of x. We can do this in two ways, by modifying the amount of implicit information on the margin of the model of premise 1. First: if we put “y?”, then we explicitly state that it is still possible that is associated with ¬x. At the same time, we implicitly indicate with “y?” that none of the combinations of ¬y have been eliminated by the premises. Nothing is explicitly declared or denied in the diagram about ¬y, and although it is clear that ¬y can no longer be affirmed within the “X model”, nothing prevents this character from being associated with ¬x. Second: if we put y (letter underlined), we explicitly indicate that all combinations of y and ¬y are still possible in the region of ¬x. The premise two diagrams have been constructed in the same way as the premise number one diagram. Diagram number three is a simple conversion of the first premise, and the diagram in step four is a synthesis of diagrams two and three. Diagram five, which is the conclusion, focuses on the relationships between X and ¬Z which are already contained in step four. Although here we exhaustively point out all the possible combinations outside the X model of step five, we do not work this way with our students. In the classroom, the only character outside the X model would be “¬z?” in step five. And while it is true that we lose information proceeding in this way, it is no less true that there is also loss of information when we formulate the same conclusion using natural language stating that No X is Z. And there is also loss of information if we conclude in formal language that X → ¬Z. So, to reach relevant conclusions we do not need to be exhaustive in computing the empty classes that are allowed or prohibited by the premises.
Fig. 10. Syllogism solved in Marlo and Venn diagrams [9].
In Fig. 11, we solve the following problem taken from Venn [25]: -1. All x is either both y and z, or not y. -2 All xy that is z is also w. -3 No wx is yz. The first premise does not eliminate any combination of ¬x, and therefore all possible combinations of y and z outside the x model are preserved. The lower part of the x model is already fully determined as xyz. However, the only thing we know about the top of x model that it necessarily contains ¬y. And this ¬y can be ¬yz or ¬y¬z because we have z outside the x model, which means two things. First, that z is not necessarily included in its entirety in the region defined as zyz. Second, that we can speculate ¬z out of x. Now, everything that can be conjectured outside can be conjectured inside if we do not incur in contradictions. We need to realize that the meaning of the characters depends on the context, and although sometimes it becomes more complicated to interpret the regions of the models, in return, we get clarity when expressing the relevant relations. We have granted existence to x in the first premise to show how this existence is transmitted throughout the inference when we have sufficient reasons to do so. The subject of the second premise model is composed of the characters x, y, z. When we operate with a compound subject, we must take into account two things. First, the region outside this model represents all combinations other than the one that acts as a subject within the model. Second, that none of the characters that are part of a compound in a proposition can be considered taken in its entirety. For example, the premise two states that when x is yz, x is w, but says nothing about x when it is not y or it is not z. And the same applies to the other characters. Outside the xyz model, the expression ¬(xyz) compresses the combinations: ¬xyz, x¬yz, xy¬z, ¬x¬yz, ¬xy¬z, x¬y¬z, ¬x¬y¬z. It should not be difficult for the reader to decompress all the combinations summarized in the expression ¬(xyz)*w: w¬xyz, wx¬yz, wxy¬z, w¬x¬yz, w¬xy¬z, wx¬y¬z, w¬x¬y¬z, ¬w¬xyz, ¬wx¬yz, ¬wxy¬z, ¬w¬x¬yz, ¬w¬xy¬z, ¬wx¬y¬z, ¬w¬x¬y¬z.
Fig. 11. Problem, with four characters, solved in Venn and Marlo. Based on Venn [9].
In the third premise we use the expression ¬(yz) to summarize the combinations ¬yz, y¬z, ¬y¬z. Nor should it be difficult for the reader to unzip all the combinations summarized in the expression ¬(xyz)*¬(yz). Diagram number four generalizes the information about xyz contained in premise two within the model of premise number one. Diagram number five generalizes the information about wx contained in premise three within model number four. Finally, in step number six, we remove the bottom of the X model from step five, because it contains a contradiction. In this way, X, whose existence was asserted in premise one, is determined solely and totally by ¬Y, which must necessarily exist as well. Remember that all the characters in an object must exist in the same way (ongoing fact, conjecture, or theory). Letters within the same region of a model must be considered part of the same object.
6 Correspondence of Marlo Diagram with the Truth Tables Marlo diagram allows us to work with the conventions of the propositional calculus. To do this, we only must establish that A and ¬A cannot be true at the same time. Then, we can ask ourselves if that model collects the same information as the truth tables. In Fig. 12, we can observe the synthesis of two propositional models that share the middle term “a” (a → b) ∧ (a → c). We have indicated in the Figure the relations between the two regions of the model obtained as a conclusion (a → b ∧ c) with the truth tables. We must realize that any combination other than abc is impossible within “a model”.
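The correspondence claimed here can be checked mechanically; the following sketch verifies by truth tables that (a → b) ∧ (a → c) agrees row by row with a → (b ∧ c), and that inside the "a model" only the combination abc survives:

from itertools import product

def implies(p, q):
    return (not p) or q

rows = list(product([True, False], repeat=3))                    # (a, b, c)
premise = [implies(a, b) and implies(a, c) for a, b, c in rows]  # (a → b) ∧ (a → c)
conclusion = [implies(a, b and c) for a, b, c in rows]           # a → (b ∧ c)
print(premise == conclusion)                                     # True
print([r for r, ok in zip(rows, premise) if ok and r[0]])        # [(True, True, True)] – only abc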
Fig. 12. (a → b ∧ c) in Marlo diagram and truth tables
Figure 13 shows us the correspondence of the model of (¬d ∧ ¬c) → (a ∧ ¬b) with the truth tables. In that figure, we can see in which regions of the diagram the possibilities admitted in the truth table are represented. We have listed the regions of the diagram from one to four. Region number one corresponds to the core of the model. The diagram regions number two, three, and four develop the different combinations contained in the formula “¬(¬d¬c) *a, b”. Remember that everything outside the “¬d¬c model” must be different from “¬d¬c”. We have decompressed “¬(¬d¬c)” into ¬dc, d¬c and cd. Of course, these possibilities can be combined with a, ¬a, b and ¬b, because nothing in the premises prohibits assuming them outside of “¬d¬c”.
Fig. 13. (¬d¬c → a¬b) in Marlo diagram and truth tables [9].
There is no place within “¬d¬c model” for the events of truth tables number 4 “¬d¬cab”, 12 “¬d¬c¬ab” and 16 “¬d¬c¬a¬b”. These three combinations are incompatible with ¬d¬ca¬b and they cannot be conceived apart from “¬d¬c model”.
7 Conclusions Venn diagrams with three letters are simple, clear, and precise tools, extremely useful in logic proofs. They have the merit of having highlighted the importance of considering the class elimination processes that are at the base of all inferences. Venn isolated the more primitive formal aspects of reasoning, but to do so he had to ignore more complex aspects such as those underlying the categorical syllogism. He showed that the exact combinatorics of the letters that exclusively and exhaustively form the classes is an a priori process without which no form of reasoning would be possible. However, his diagrams also promoted the disconnection of natural and symbolic language, leaving aside the importance of mechanisms that allow the communication of information implicitly. This omission contributes to its learning requiring a shorter learning period than Marlo diagrams, but it is also the reason why Venn diagrams are so limited as the number of variables increases. Venn could not adequately assess the potential of the doctrine of the Quantification of the Predicate [10, 20] and he did not realize all that we can learn from the processes by which communication systems linked to articulate language have naturally evolved [38]. Our strategy attempts to reconnect natural language and logic by prioritizing significance over completeness. Our diagrams, which are still in the early stages of development, require a greater investment in learning, but, in return, they allow us to develop cognitive shortcuts to reach conclusions without having to be exhaustive in constructing, deleting, and reviewing all the regions of the diagrams. We do not work with a fixed image in which the classes are compartmentalized, but with a dynamic and flexible representation of the inference. In this way, and within the paradigm of heterogeneous reasoning [3, 34], we have managed to increase the synchrony between the inferences of natural, formal, and diagrammatic language in the classroom [4, 7]. That is, a greater parallelism is established between the graphic and formal demonstration steps, facilitating, at the same time, the interpretation of each step in natural language. An experienced student uses the graphical representation to check the validity of his formal inferences and, at the same time, relies on these formal inferences to construct his diagrams. Thus, the teacher, in addition to the conclusion of the exercise, will be able to visualize the logical operations that the student has carried out to achieve it [9]. Marlo diagrams allow plausible conclusions to be drawn, distinguishing between the possible and the probable [5, 6]. In this way, they integrate uncertainty into the reasoning processes in a more decisive and clear way than Leibniz, Lambert, Venn, or Lewis Carroll [10, 12]. In addition, the Marlo diagrams can represent and solve inferences using propositions with intermediate quantifiers [8], something Venn considered excessively complex and useless [38]. For all these reasons, we believe that the Marlo diagrams can be a complementary tool to Venn diagrams in teaching logic. However, we are convinced that our representations of inference, such as Peirce diagrams [21], may be more useful than Venn diagrams in attempting to reveal the constitutive processes of reasoning [9].
References 1. Abeles, F.F.: Lewis Carroll’s visual logic. Hist. Philos. Log. 28(1), 1–17 (2007). https://doi. org/10.1080/01445340600704481
2. Alemany, F.S.: Hacia la lógica plástica: emergencia de la lógica del razonamiento visual. Contextos 16(31–32), 281–296 (1998) 3. Allwein, G., Barwise, J.: Logical Reasoning with Diagrams. Oxford University Press, New York (1996) 4. Aznar, M.B.L.: Adiós a bArbArA y Venn. Lógica de predicados en el diagrama. Paideia Revista de Filosofía y didáctica filosófica 35(102), 35–52 (2015) 5. Aznar, M.B.L.: Visual reasoning in the Marlo diagram. In: Sato, Y., Shams, Z. (eds.) SetVR@ Diagrams, vol. 2116. Ceurws.org (2018). http://ceur-ws.org/Vol-2116 6. Aznar, M.B.L.: Visual reasoning in the Marlo diagram (2018). [video file]. https://www.you tube.com/watch?v=ivEZ4Pfr6tQ&t=7s 7. Aznar, M.B.L.: The Marlo diagram in the classroom. In: Pietarinen, A.-V., Chapman, P., Bosveld-de Smet, L., Giardino, V., Corter, J., Linker, S. (eds.) Diagrams 2020. LNCS (LNAI), vol. 12169, pp. 490–493. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-542498_41 8. Aznar, M.B.L.: Syllogisms with intermediate quantifiers solved in Marlo logic diagrams. In: Pietarinen, A.-V., Chapman, P., Bosveld-de Smet, L., Giardino, V., Corter, J., Linker, S. (eds.) Diagrams 2020. LNCS (LNAI), vol. 12169, pp. 473–476. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-54249-8_37 9. Aznar, M.B.L: Diagramas lógicos de Marlo para el razonamiento visual y heterogéneo: válidos en lógica matemática y aristotélica. Unpublished doctoral dissertation, University of Huelva, Spain (2020) 10. Boche´nski, I.M.: A History of Formal Logic. University of Notre Dame Press, Notre Dame (1961) 11. Boole, G.: An Investigation of the Laws of Thought, on Which are Founded the Mathematical Theories of Logic and Probabilities. Dover, New York (1854) 12. Carroll, L.: The Game of Logic. Macmillan, London (1886) 13. Castro-Manzano, J.: Remarks on the idea of non-monotonic (diagrammatic) inference. Open Insight 8(14), 243–263 (2017) 14. Chen, H., Boutros, P.C.: Venn diagram: a package for the generation of highly customizable Venn and Euler diagrams in R. BMC Bioinform. 12(1), 35 (2011). https://doi.org/10.1186/ 1471-2105-12-35 15. Címbora Acosta, G.: Lógica en el Diagrama de Marlo. Tratando de hacer evidente la certeza. In: Campillo, A., Manzanero, D. (eds.) Actas II Congreso Internacional de la Red española de Filosofía, pp. 41–55. REF, Madrid (2017). http://redfilosofia.es/congreso/wp-content/upl oads/sites/4/2017/07/7.5.pdf 16. Gardner, M.: Logic Machines and Diagrams. McGraw-Hill, New York/Toronto/London (1958) 17. Giardino, V.: Diagrammatic reasoning in mathematics. In: Magnani, L., Bertolotti, T. (eds.) Springer Handbook of Model-Based Science. SH, pp. 499–522. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-30526-4_22 18. Hamilton, W.: Lectures on Metaphysics and Logic, vol. IV. William Blackwood and Sons, Edinburgh/London (1860) 19. Hammer, E., Shin, S.: Euler’s visual logic. Hist. Philos. Log. 19(1), 1–29 (1998). https://doi. org/10.1080/01445349808837293 20. Jevons, W.: Pure Logic or the Logic of Quality. Stanford, London (1864) 21. Johnson-Laird, P.: Peirce, logic diagrams, and the elementary operations of reasoning. Think. Reason. 8(1), 69–95 (2002). https://doi.org/10.1080/13546780143000099 22. Legris, J.: Visualizar y manipular: sobre el razonamiento diagramático y la naturaleza de la deducción. In: Lassalle Cassanave, A., Thomas Sautter, F. (eds.) Visualização nas Ciências Formais, pp. 89–103. College Publications, London (2012)
23. Macbeth, D.: Realizing Reason: A Narrative of Truth and Knowing. Oxford University Press, Oxford (2014). https://doi.org/10.1093/acprof:oso/9780198704751.001.0001 24. Marlo diagram. http://www.diagramademarlo.com. Accessed 4 Apr 2021 25. Marquand, A.: Logical diagrams for n terms. Phil. Mag. 12, 266–270 (1881). https://doi.org/ 10.1080/14786448108627104 26. Moktefi, A., Shin, S.-J.: A history of logic diagrams. In: Gabbay, D.M., Pelletier, F.J., Woods, J. (eds.) Handbook of the History of Logic, vol. 11: ‘Logic: A History of its Central Concepts, pp. 611–682. North-Holland (2012). https://doi.org/10.1016/B978-0-444-52937-4.50011-3 27. Pagnan, R.J.: A diagrammatic calculus of syllogisms. J. Log. Lang. Inform. 21(3), 347–364 (2012). https://doi.org/10.1007/s10849-011-9156-7 28. Parsons, T.: The traditional square of opposition. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Summer Edition. Stanford University (2017). https://plato.stanford.edu/ archives/sum2017/entries/square/ 29. Peirce, C.S.: Prolegomena to an apology for pragmaticism. Monist 16(4), 492–546 (1906). https://doi.org/10.5840/monist190616436 30. Peirce, C.S., Hartshorne, C., Weiss, P.: Collected Papers of Charles Sanders Peirce. Harvard University Press, Cambridge (1933) 31. Riche, N.H., Dwyer, T.: Untangling euler diagrams. IEEE Trans. Vis. Comput. Graph. 16(6), 1090–1099 (2010). https://doi.org/10.1109/TVCG.2010.210 32. Roberts, D.: The existential graphs of Charles S. Peirce.: Mouton, The Hague (1973) 33. Shin, S.-J.: Logical Status of Diagrams. Cambridge University Press, Cambridge (1995) 34. Shin, S.-J.: Heterogeneous reasoning and its logic. Bull. Symb. Log. 10(1), 86 (2004). https:// doi.org/10.2178/bsl/1080330275 35. Shin, S.-J., Lemon, O., Mumma, J.: Diagrams. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Summer Edition. Stanford University (2018). https://plato.stanford.edu/arc hives/sum2018/entries/diagrams/ 36. Strawson, P.F.: Introduction to Logical Theory. Methuen, London (1952) 37. Venn, J.: On the diagrammatic and mechanical representation of propositions and reasonings. Lond. Edinb. Dublin Philos. Mag. J. Sci. 10(59), 1–18 (1880). https://doi.org/10.1080/147 86448008626877 38. Venn, J.: Symbolic Logic. MacMillan, London (1881). https://doi.org/10.1037/14127-000
A Hybrid Real-Time Scheduling Mechanism Based on Multiprocessor for Real-Time Tasks in Weakly Hard Specification Habibah Ismail1(B) , Dayang N. A. Jawawi2 , and Ismail Ahmedy3 1 Centre of Foundation Studies, Universiti Teknologi MARA, Dengkil, Selangor, Malaysia
[email protected]
2 School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai, Johor,
Malaysia 3 Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur,
Malaysia
Abstract. In real-time systems, there are three categories which are based on the “seriousness” of missing a deadline, which are hard, soft, and weakly hard. Realtime scheduling algorithms proposed for use can guarantee a bounded allowance of deadline misses in a predictable way dedicated to weakly hard real-time tasks. A number of studies from previous research on multiprocessors in scheduling algorithms for weakly hard tasks in real-time systems used non-optimal heuristics, wherein these cannot guarantee that an allocation of all tasks can be feasibly scheduled. Moreover, the use of a hierarchical scheduling algorithm under the PFair algorithm may cause high scheduling overhead due to frequent preemptions and migrations. This research is done to address the problem of optimization in partitioned scheduling and task migration in global scheduling, that causes scheduling overheads. Therefore, to achieve these objectives, this study proposes a hybrid scheduling mechanism that uses the partitioning and global approaches, which are R-BOUND-MP-NFRNS and RM-US (m/3m-2) with the multiprocessor response time test. Based on the simulation results, when comparing the hybridized scheduling approach and R-BOUND-MP-NFRNS, it is seen that the deadline satisfaction ratio improves by 2.5%. In case of the proposed approach versus multiprocessor response time, the deadline satisfaction ratio has seen an improvement of 5%. The overhead ratio for the proposed hybrid approach versus R-BOUND-MP-NFRNS has reduced by 5%, and in case of the proposed hybrid approach versus multiprocessor response time, it reduces by 7%. According to the results, it can be seen that the proposed hybrid approach achieved a higher percentage in the ratio of deadline satisfaction, and minimized its overhead percentage when compared to the other approaches. Keywords: Real-time systems · Hybrid multiprocessor scheduling · Weakly hard real-time tasks
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 506, pp. 228–247, 2022. https://doi.org/10.1007/978-3-031-10461-9_15
1 Introduction In real-time systems, each task that has been assigned in the system needs to be completed in a certain period of time. To ensure that the timing constraint is always met, the realtime system must be able to predict the task. This specification in the real-time system is defined by timing requirements or constraints for the deadline activities. The “realtime tasks” depend on the computations in the real-time system based on the timing requirements. There are two categories for the real-time system classifications that need to be considered based on the seriousness of missing deadlines, which are, hard realtime and soft real-time, for each task in the scheduler [1]. In the event of a missed deadline, a serious consequence could arise in a hard real-time system. The missed deadline event would not be accepted since the failure of the task executed could harm the system performance. This is different than soft real-time systems, where missing a certain deadline is still acceptable in some applications. In weakly hard real-time systems such as multimedia data types, the delay in task execution can be tolerated. The missed deadline can be acceptable as long as the missed deadline specification has been defined. Hence, the weakly hard approach is more robust as compared to the hard or soft approach. In multiprocessors, for real-time scheduling systems, there are two categories: partitioning and global scheduling. The partitioning approach has acceptable overheads, but optimality cannot be guaranteed. Meanwhile, this guarantee can be provided by the global scheduling approach; however, it has considerable overheads [5]. Multiprocessor overhead in a scheduling system is an important factor to be considered when completing real-time tasks by their deadlines. Hence, the real-time scheduling algorithm using multiprocessors in weakly hard real-time systems is proposed. In preemptive real-time scheduling, the multiprocessor preempts a task by another task, but the original task still needs to continue its implementation later, and as such, this is then resumed via another processor. In order to perform preemptive multiprocessor scheduling, two different approaches, partitioned and global scheduling, can be used. Fixed-Priority Scheduling in multiprocessors has been proposed by [9] to overcome high-performance and low-power requirements. Using the partitioning approach, a set of tasks is divided into separated tasks, where each task can only be processed by a single processor. Based on this scenario, every single processor has its own capability for task queuing, and a task belonging to one processor is not permitted to migrate to another processor. In the global approach, in case of global task queuing where a current task has been placed, it can be transferred to another processor. This process is called preemption, where after this process is conducted, the task execution can be resumed by any other processor. In this study, the partitioned approach was deployed, in which the set of tasks needed were linked to all the processors using a given algorithm. Following this, the global scheduling approach was used to schedule the remaining tasks.
2 Real-Time Paradigms and Design Issues In real-time systems, design issues for scheduling techniques, including the timing requirements/constraints, are the principal concerns. Timing constraints are an important specification for the system to satisfy. In cases where these constraints are not satisfied as per the requirement specifications, it could lead to a system failure. In real-time scheduling systems, a set of tasks is a number of tasks in priority order, that need to be met and executed to achieve and meet the satisfactory conditions or requirements. The execution of tasks is based on three important components, which are, period of time, the deadline, and the execution time. These components must be considered when designing a scheduling algorithm. For deadlines, it can either be a hard, soft, or weakly hard approach [6]. Furthermore, the tasks in real-time systems scheduling are characterized as periodic, aperiodic, or sporadic tasks, and the scheduling method could be preemptive or non-preemptive. In preemptive scheduling, a low priority task is allowed to be preempted by a higher critical task due to its assurance of meeting the deadline in the task scheduler. Schedulability analysis has been done by [8] to analyze the degradable performance on global multiprocessor scheduling. The author in [10] has addressed the problem in schedulability analysis of the deadline on tasks scheduled in multiprocessors. A new schedulability analysis approach has been suggested by [11] by applying the constraint specification for the deadline using a fixed priority. The use of multiprocessors in scheduling techniques has shown that the real-time systems can perform in an optimum state since a multiprocessor can run and execute scheduled multi tasks in less time. There are two types of real-time system scheduling mechanisms, which are off-line and on-line scheduling. In off-line scheduling, the characteristics and requirements of the application are needed to be known, and they can change frequently. Additionally, system performance using off-line scheduling is guaranteed based on a predictable environment for the scheduler. As for on-line scheduling, the characteristic requirements do not need to be known, and the scheduler is more scalable to the environmental changes. According to [6], energy is the main issue in weakly-hard real-time scheduling methods running on multiprocessors. In [12], the nonpreemptive real-time tasks are focused upon, by taking into account the feasibility and energy perspective. In real-time systems characterization, the tasks in a scheduler need to be described and categorized. Partitioning and global strategies are two approaches in real-time systems that use multiprocessor systems. These approaches allocate and schedule the realtime tasks that are put into the scheduler. Figure 1 illustrates the paradigm of real-time scheduling as discussed above.
Fig. 1. Real-time scheduling paradigms
3 Proposed Hybrid Scheduling Approach The conceptual component of this approach is shown in Fig. 2. The selected case studies, namely, Videophone application, Inertial Navigation System (INS), and Autonomous Mobile Robot (AMR), were originally derived from [2, 3] and [4], respectively. These studies will be used to produce the sets of tasks which will be characterized as an input, specifically for three existing multiprocessor real-time scheduling algorithms, RBOUND-MP-NFRNS and RM-US (m/3m − 2) with multiprocessor response time tests. The three algorithms use the specific techniques as discussed, which are hyperperiod analysis, weakly hard temporal constraints, and μ-pattern. These three techniques will be applied to the algorithms to ensure that the render process of the multiprocessor is resilient for the scheduling algorithm, so that it involves less overheads and can be optimal at the same time. To ensure that the task priority constant value is stagnant, the fixed priority algorithm needs to be implemented in this work. In conventional partitioned scheduling, this algorithm results in some unused processor space after being partitioned.
Fig. 2. Conceptual component of the proposed approach.
The proposed approach can use this space for migrating tasks and thus improve system utilization. This is made possible as the hybrid approach extends the scheduling partition to allow the task to be feasible to the designated processor for migration purposes. This combination contributes an advantage by having the measurement of the deadline violation using the weakly hard specification, which includes constraints and μ-pattern. By implementing this concept, the total number of missed deadlines in a worst case scenario can be guaranteed by response time tests, and the allocation of tasks to a given processor can be identified by computing the R-BOUND-MP-NFRNS; hence, it can guarantee the satisfaction of delayed tasks for all of the deadline values. The hybridized scheduling approach flowchart for this work is illustrated in Fig. 3. Additionally, Fig. 4 shows the pseudocode of the proposed hybrid approach.
Fig. 3. Hybrid scheduling approach flowchart
The pseudocode below can be categorized into three phases. In the first phase, the partitioned approach is used to allocate each task to an appropriate processor. The Rate Monotonic Algorithm (RMA) is specifically used as the scheduling algorithm in the real-time system that runs on each processor. In the second phase, the response time is calculated on each task scheduled, by checking the response time in each specific task for deadline verification in terms of the equality of the deadline task. If the verification does not meet the condition, the missing of the deadline will be verified by the constraint and μ-pattern of the weakly hard specification. In the third phase, for the remaining tasks arriving in the system scheduler, the migration process for the tasks in each processor is executed using RMUS (Rate Monotonic with Utilization Separation), which is also called the utilization capacity method.
Fig. 4. Pseudocode of the proposed hybrid approach
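The three phases described above can be sketched as follows. This is a simplified Python illustration, not the pseudocode of Fig. 4: the first-fit partitioning is only a placeholder standing in for R-BOUND-MP-NFRNS, deadlines are taken equal to periods, and the task dictionary keys are our own naming choices.

from math import ceil

def response_time(task, higher):
    # standard fixed-priority response-time iteration: R = C + sum(ceil(R/Tj) * Cj)
    r = task["C"]
    while True:
        nxt = task["C"] + sum(ceil(r / h["T"]) * h["C"] for h in higher)
        if nxt == r or nxt > task["T"]:
            return nxt
        r = nxt

def hybrid_schedule(tasks, m):
    # Phase 1: partition tasks onto m processors (first-fit by utilisation as a placeholder
    # for R-BOUND-MP-NFRNS), rate-monotonic (shortest period first) order on each processor.
    processors = [[] for _ in range(m)]
    unassigned = []
    for t in sorted(tasks, key=lambda t: t["T"]):
        for p in processors:
            if sum(x["C"] / x["T"] for x in p) + t["C"] / t["T"] <= 1.0:
                p.append(t)
                break
        else:
            unassigned.append(t)
    # Phase 2: response-time test; a task whose worst-case response time exceeds its
    # deadline is flagged for the weakly hard check instead of being rejected outright.
    for p in processors:
        for i, t in enumerate(p):
            t["R"] = response_time(t, p[:i])
            t["needs_weakly_hard_check"] = t["R"] > t["T"]
    # Phase 3: the remaining tasks would be scheduled globally (RM-US) and migrated to
    # whichever processor has spare capacity at run time.
    return processors, unassigned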
Figure 5 illustrates the pseudocode of weakly hard. As mentioned in the previous phase, response time holds responsibility of checking tasks which miss their deadlines. In line 1, each task is selected, and it has been analyzed in the next 14 lines to find the best position of the task. The variable resultFailedTasks mentioned in line 3 records the count of tasks that have failed to meet the deadline. In line 5, if the value of ResponseTime variable is more than the given deadline of the task, it means that the task has failed to be scheduled. In the next line, the author creates the DeadlineMeet variable to indicate the number of deadlines met and missed for each task. In line 9, the count of the selected task is calculated, along with the number of deadlines met and missed, and this is stored in a variable called weaklyHardConstraint. In line 12, a loop is created to verify the location of the tasks. The author adds an index to locate each task; in line 13, the index of the task is stored in a variable called actualTaskLocation.
Fig. 5. Pseudocode for the weakly hard phase.
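A compact sketch of the bookkeeping described for Fig. 5, recording met and missed deadlines per task and the task's location; the dictionary keys loosely follow the variable names in the description (DeadlineMeet, actualTaskLocation) but are otherwise our own assumptions:

def record_deadline_meets(tasks):
    # compare each invocation's response time with the deadline, count met and missed
    # deadlines, build the µ-pattern, and keep the task's position in the set
    failed_tasks = []
    for index, task in enumerate(tasks):
        pattern = [1 if r <= task["D"] else 0 for r in task["response_times"]]
        task["deadline_meet"] = sum(pattern)
        task["deadline_miss"] = len(pattern) - sum(pattern)
        task["mu_pattern"] = "".join(map(str, pattern))
        task["location"] = index            # cf. actualTaskLocation
        if task["deadline_miss"] > 0:
            failed_tasks.append(task["name"])
    return failed_tasks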
4 Models of Hybrid Multiprocessor Real-Time Scheduling Approach Real-time system scheduling is one of the techniques that aims to provide an optimal schedule for a specific task in any system domain. The partitioning multiprocessor implementation in real-real time scheduling can be considered as a tolerable approach, due to the simplification of the implementation in the scheduler system. This is because the task assignment for partitioned approach is static, in addition to the tasks migration disability. Moreover, it needs to share the processors’ capacities in order to execute a task. Furthermore, the partitioning approach makes the problem of the real-time scheduling in multiprocessor simpler, in relation to a single processor scenario. This includes, for example, using RMA or Earliest Deadline First (EDF), two optimal uniprocessor real-time scheduling algorithms for every unit of the processor task waiting queues. However, scheduling for task migration using the timing approach is expected. Nevertheless, several preconditions need to considered in order to achieve the migration.
Additionally, its considerable overhead is one of the most important deficits of the global multiprocessor real-time scheduling approach [7]. In order to satisfy system requirements, two existing approaches were combined; this proposed combination is called a hybrid multiprocessor real-time scheduling approach. It aims to employ the strong elements of the two existing multiprocessor real-time scheduling approaches, partitioning and global. As discussed before, the hybridized multiprocessor real-time scheduling method combines the advantages of these two recognized approaches. The aim of developing a hybridized real-time scheduling method for multiprocessors is to achieve optimal scheduling results by assigning tasks under optimal conditions across the tasks in the scheduler, while keeping the overhead produced by the processors low. This benefits the partitioning process in the task scheduler, especially on the specific processor being used. The proposed approach uses partitioning as the base of the combination and, at the same time, uses global capabilities to complete the execution and improve the performance of the system. Accordingly, the partitioning approach, which has minimal overheads, is the basis of the hybrid multiprocessor scheduling approach. The global approach is adopted due to its ability to be near optimal through migration of tasks, and its capability to improve system utilization by making use of the empty capacities of the processors for the execution of tasks. The partitioning scheduling algorithms, R-BOUND-MP-NFRNS and RM-US, are used on the multiprocessors to generate the response times. The outcome is then used to produce a predictable output, the number of missed deadlines, by implementing the μ-pattern.
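For the global part, the RM-US family is usually defined by a utilisation-separation threshold. Assuming RM-US (m/3m − 2) follows that pattern, i.e. tasks whose utilisation exceeds m/(3m − 2) receive the highest priority and the remaining tasks are ordered rate-monotonically (an assumption about the algorithm family, not a detail stated in this paper), a sketch of the priority assignment is:

def rm_us_priorities(tasks, m):
    threshold = m / (3 * m - 2)
    heavy = [t for t in tasks if t["C"] / t["T"] > threshold]
    light = sorted((t for t in tasks if t["C"] / t["T"] <= threshold), key=lambda t: t["T"])
    # heavy tasks first (highest priority), then rate-monotonic order for the rest
    return heavy + light

example = [{"name": "T1", "C": 20, "T": 40}, {"name": "T2", "C": 10, "T": 40},
           {"name": "T3", "C": 40, "T": 66}, {"name": "T4", "C": 30, "T": 66}]
print([t["name"] for t in rm_us_priorities(example, m=2)])   # ['T3', 'T1', 'T2', 'T4']

The example parameters are the videophone task set of Table 1.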
5 An Experimental Result on Case Studies In designing real-time systems, one of the significant considerations for activities of scheduling is selecting the appropriate approaches. In this approach, the allocation of the tasks will use the specific algorithm in the partitioning approach, which is R-BOUNDMP-NFRNS. This algorithm is implemented in each processor (Multiprocessor) along with the RM-US (m/3m − 2) algorithm that acts as a real-time scheduling algorithm. The process is calculated in the multiprocessor for the task given, to generate the task deadline. 5.1 Videophone Application In this case study, the parameters from the Videophone application have been used as a set of tasks which act as inputs in the executed experiment. Based on Table 1, the VSELP speech encoding has become the task with the highest priority as the process task priority is decreasing. In this outcome, it shows that the four tasks that have four parameters are required to be scheduled by using two processors along the R-BOUND-MP-NFRNS algorithm implementation.
Table 1. Task set in multiprocessor for task parameters

Task | Ti | Ci | Ui | Pi
VSELP speech encoding (T1) | 40 | 20 | 0.5 | 1
VSELP speech decoding (T2) | 40 | 10 | 0.3 | 1
MPEG-4 video encoding (T3) | 66 | 40 | 0.6 | 2
MPEG-4 video decoding (T4) | 66 | 30 | 0.5 | 2
The result generated in Table 1 has been extracted into an illustration, as shown in Fig. 6. In this chart, the partitioning of the task from every processor is drawn using a timing diagram that shows the performance based on the scheduling algorithm implemented.
Fig. 6. Timing diagram for videophone application case study using R-BOUND-MP-NFRNS.
In Fig. 6, it can be seen that the scheduling of Task 4 is unsuccessful for the assigned processor. To overcome this scenario, the task must be allocated to a processor which is not in use. By implementing this approach, the scheduling algorithm has achieved the optimal time in reducing the deadline miss. Furthermore, the response time is calculated, showing the worst-case scenario based on the testing condition. The worst-case scenarios in the response times of the scheduling algorithm are depicted in Table 2. The result shows that the Task 4 response time is greater than its designated deadline. Although the deadline is not met based on the criteria, the weakly
hard implementation allows the unscheduled task to still be scheduled within the specified acceptable bound on deadline misses.

Table 2. Sets of videophone task application.

Task | Ti | Ci | hi | ai
VSELP speech encoding (T1) | 40 | 20 | 40 | 1
VSELP speech decoding (T2) | 40 | 10 | 40 | 1
MPEG-4 video encoding (T3) | 66 | 40 | 1320 | 20
MPEG-4 video decoding (T4) | 66 | 30 | 1320 | 20
Based on Table 2, it can be seen that Task 4 could miss its deadline in the worst-case scenario. The total number of deadline misses for Task 4 is specified using an analysis method called hyperperiod analysis. Figure 7 shows the invocations of the Task 4 response times in the worst-case scenario using the hyperperiod analysis method with level 4 priority.
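The hi and ai columns of Table 2 appear to follow from the hyperperiod: hi is the least common multiple of the task's period and those of the higher-priority tasks, and ai = hi/Ti is the number of invocations inside it. Under that reading (our inference from the table, not a formula stated in the paper), the values can be reproduced as follows:

from math import lcm

periods = [("T1", 40), ("T2", 40), ("T3", 66), ("T4", 66)]       # Table 2, in priority order
for i, (name, t) in enumerate(periods):
    h = lcm(*(p for _, p in periods[:i + 1]))   # hyperperiod over this task and the higher-priority ones
    print(name, h, h // t)                      # hi and the invocation count ai
# T1 40 1, T2 40 1, T3 1320 20, T4 1320 20 – matching the hi and ai columns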
Fig. 7. Invocations in the hyperperiod analysis response time for task 4
In Table 3, the missed deadlines, along with the met deadlines, for Task 4 are shown for invocations 1 to 90. Based on the result, it shows that Task 4 satisfies the weakly hard constraint (2, 5). For the multiprocessor experiment, the unassigned task is migrated from processor2 to processor1 using 4 units of the task number. The generated result is shown in Fig. 8
Table 3. Task 4 distribution in µ-pattern

Invocations | µ-pattern
1–30 | 01010 11010 11101 11110 01111 10011
31–60 | 11111 11111 10011 11001 11101 11101
61–90 | 01111 10010 11011 11111 01111 11100
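A μ-pattern such as the one in Table 3 can be checked mechanically against a weakly hard constraint. The sketch below assumes the constraint 2/5 is read as "at least 2 deadlines met in any 5 consecutive invocations", which is an interpretation of the paper's notation; it also includes the usual hyperperiod (least common multiple) helper.

```python
from math import lcm
from functools import reduce

def hyperperiod(periods):
    # Hyperperiod = least common multiple of the task periods.
    return reduce(lcm, periods)

def satisfies_weakly_hard(mu_pattern, m, k):
    """True if every window of k consecutive invocations contains
    at least m met deadlines ('1' = met, '0' = missed)."""
    bits = [int(b) for b in mu_pattern if b in "01"]
    return all(sum(bits[i:i + k]) >= m for i in range(len(bits) - k + 1))

if __name__ == "__main__":
    # First 30 invocations of Task 4 from Table 3 (spaces are ignored).
    mu = "01010 11010 11101 11110 01111 10011"
    print(satisfies_weakly_hard(mu, m=2, k=5))   # check the (2, 5) constraint
    print(hyperperiod([40, 40, 66, 66]))         # 1320 for the videophone periods
```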
Fig. 8. Timing diagram for videophone application case study using the proposed hybrid scheduling approach.
5.2 Inertial Navigation System (INS)
An Inertial Navigation System (INS) is considered as the second case study in this work. Table 4 shows that the Attitude Updater task holds the highest priority, with task priorities decreasing on each processor. In this experiment, six tasks, each with six parameters, have been considered in the task scheduler. These tasks are then scheduled on four processors equipped with the R-BOUND-MP-NFRNS algorithm. Figure 9 shows a timing diagram for the Navigation Sender task (Task 4) with four processors.
Table 4. Case study of inertial navigation system (INS).

Task | Period (Ti) | Ci | Pi | Ri | hi | ai
Attitude updater (T1) | 10 | 5 | 1 | 1 | 10 | 1
Velocity updater (T2) | 15 | 10 | 1 | 3 | 30 | 2
Attitude sender (T3) | 20 | 15 | 2 | 12 | 60 | 3
Navigation sender (T4) | 50 | 40 | 2 | 28 | 300 | 6
Status display (T5) | 50 | 25 | 3 | 50 | 300 | 6
Position updater (T6) | 60 | 30 | 4 | 75 | 300 | 5
Based on the result, it can be seen that 5 units are needed for processor2 to preempt and migrate the task set to processor3 so that it can be scheduled. Furthermore, the response time is invoked α4 = 60 times based on the hyperperiod, at priority level 4, as shown in Fig. 10.
Fig. 9. INS task scheduling using proposed hybrid scheduling algorithm
Although Task 4 misses the deadline, the task response time is satisfied under the constraint of (1/5) for the weakly hard specification. The distribution of the missed deadlines for Task 4, along with the met deadlines, is shown in Table 5.
5.3 Autonomous Mobile Robot (AMR)
An Autonomous Mobile Robot (AMR) was used as the third case study for this work. As can be seen in Table 6, the task MobileRobot has the highest priority, compared with the decreasing priorities of the other tasks on the other processors. The seven tasks, along with six parameters, are scheduled using six processors equipped with the R-BOUND-MP-NFRNS scheduling algorithm.
Fig. 10. Navigation sender task response time
Table 5. The navigation sender task distribution

Invocations | µ-pattern
1–20 | 00111 10101 00001 11001
21–40 | 11101 11011 01000 10111
41–60 | 01111 10111 11011 11100
Table 6. AMR task data set

Task | Period (Ti) | Ci | Ri | Pi | hi | ai
MobileRobot | 50 | 10 | 1 | 1 | 50 | 1
motorctrl_left | 50 | 20 | 11 | 2 | 50 | 1
motorctrl_right | 50 | 20 | 42 | 3 | 50 | 1
Subsumption | 50 | 33 | 51 | 3 | 50 | 1
Avoid | 90 | 17 | 52 | 4 | 450 | 5
Cruise | 90 | 10 | 31 | 5 | 450 | 5
manrobotintf | 90 | 16 | 12 | 6 | 8550 | 95
Figure 11 shows a timing diagram for Task 4 with six processors. Based on the results, three units are needed for processor3 to preempt and migrate the task set to processor4 so that it can be scheduled. Furthermore, the response time is invoked α4 = 52 times based on the hyperperiod, at priority level 4, as shown in Fig. 12. The distribution of missed and met deadlines for the task Subsumption, for invocations 1 to 52, is shown in Table 7.
Fig. 11. AMR data set scheduler using the proposed hybrid scheduling algorithm
Fig. 12. AMR subsumption response times
Table 7. Task subsumption distribution

Invocations | µ-pattern
1–52 | 0011 1101 1111 1011 0011 1101 0011 1110 1101 0010 1100 0110 0110
6 Performance Evaluation
In real-time systems, the design of the scheduling algorithm is one of the important components for ensuring high performance, especially in meeting task deadlines. Based on the studied scenarios, the INS case study has been selected as the benchmark for the performance evaluation. In this case study, the system has been modelled using the weakly hard approach, with specifications covering both functionalities, the hard and the soft. Figure 13 shows a timing diagram in which the Position Updater (Task 6) requires two time units to preempt and migrate from processor1 to processor2 to ensure that the task is scheduled.
Fig. 13. INS case study timing diagram for 2 processors
Figure 14 shows the Task 6 response time using hyperperiod analysis at priority level 6. The task has been invoked α6 = 75 times. Although missed deadlines occurred in this experiment, Task 6 is considered satisfied based on the weakly hard specification constraint of (2/5). The distribution of deadlines
(met and missed) for the task Position Updater is shown in Table 8, for invocations 1 to 75.
Fig. 14. Task position updater response time
Table 8. Task position updater distribution.

Invocations | µ-pattern
1–25 | 11111 11110 10111 11111 11100
26–50 | 11111 11110 10010 00110 11001
51–75 | 10001 10101 10111 11111 11110
7 Performance Comparison
In multiprocessor scheduling approaches, the parameters of the scheduled tasks are treated as independent inputs. Task periods are generated in the range 10 to 100, within a time limit that keeps the task-set hyperperiod bounded. The periods of tasks 2 to 5 are generated as random multiples of the first task's period. Execution times are selected randomly from the range 1 to 50, such that each execution time is less than the corresponding period. The multiprocessor approaches are evaluated for task sets of varying size, from 2 to 10 tasks, in terms of their success ratio under the weakly hard specification constraint of (2/3) on deadlines.
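One possible way to generate task sets with the characteristics described above (periods in 10 to 100, later periods as random multiples of the first, execution times in 1 to 50 and below the period) is sketched here. The particular random distributions and the capping rule are assumptions, not the paper's exact generator, and the utilization sweep from 0 to 2 described below would additionally require scaling the execution times, which is omitted.

```python
import random

def generate_task_set(n_tasks, rng=random):
    """Generate one synthetic task set in the spirit of the comparison setup:
    the first period is drawn from [10, 100], later periods are integer
    multiples of it (kept at 100 or below where possible), and each
    execution time is drawn from [1, 50] but kept below its period."""
    base = rng.randint(10, 100)
    periods = [base]
    for _ in range(n_tasks - 1):
        k = rng.randint(1, max(1, 100 // base))
        periods.append(base * k)
    tasks = []
    for T in periods:
        C = rng.randint(1, min(50, T - 1))
        tasks.append((C, T))
    return tasks

if __name__ == "__main__":
    random.seed(1)
    for n in range(2, 11):                     # 2 to 10 tasks, as in the evaluation
        ts = generate_task_set(n)
        u = sum(C / T for C, T in ts)
        print(n, round(u, 2), ts)
```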
In this simulation, 10 tasks have been set as the maximum task count, and the real-time system does not exceed this maximum. The system utilization was varied from 0 to 2 in increments of 0.05, as shown below.
Fig. 15. (a) System utilization (b) Number of tasks
Figure 15(a) shows the deadline satisfaction ratio for m = 2 processors. Based on the illustrated chart, the proposed hybrid scheduling approach produces a higher QoS outcome than the existing approaches. For the deadline meet ratio, it has
been observed that it is higher in GMRTS-MK than in MPDBPs and EMRTS-MK, and the proposed hybrid approach yields an increase of almost 6%. This outcome is due to the mandatory task portions executed in R-BOUND-MP-NFRNS with RM-US (m/3m − 2). Thus, in the proposed hybrid scheduling approach, when the weakly hard model and the multiprocessor response time test are used, the total number of deadline misses is reduced for each task. Calculating the worst-case response time of the scheduled tasks leads to this outcome. The four weakly hard real-time scheduling approaches have been simulated and, from the outcome, the success ratio has been generated and plotted as the deadline satisfaction ratio against the number of tasks. Figure 15(b) shows the deadline satisfaction ratio values with varying workloads (2 to 10 tasks in steps of 10), with average values at 10%, 20%, 30%, 40%, and 50% processor utilization. From this, it can be inferred that, out of the four approaches, as the number of tasks increases, the hybridized scheduling approach achieves a lower miss ratio than the others, which is a real-time system's most important parameter.
8 Conclusion
To conclude, the result obtained was that the deadline value of the Videophone application was higher than that of the INS, based on the invocations of 1 to 75. The experimental results have also shown that the deadline satisfaction ratio of the Videophone application is lower than that of the INS, with a difference of 7%; thus, the success ratio of the INS is higher than that of the Videophone application. The soft tasks of the Videophone application, which tolerate deadline misses, explain this outcome compared with the INS. Both case studies have been set with a constraint value of (2/5) to satisfy the weakly hard specification. Based on the analysis of the results, R-BOUND-MP-NFRNS, together with its use of the multiprocessor, can predict and bound the scheduled tasks using the weakly hard specification constraint and the μ-pattern, even when the R-BOUND-MP-NFRNS task set is unsuccessfully partitioned. As observed, the migration of the remaining unassigned tasks was executed using RM-US (m/3m − 2) on different processors to schedule the system tasks. Based on the schedulability analysis, the results have shown that the deadlines were met for the scheduled tasks. The proposed scheduling approach has shown the ability to guarantee met deadlines while keeping the timing constraints satisfied. Both are measured using the multiprocessor response time test and the weakly hard model, and these techniques guarantee the number of missed deadlines in the worst case for the system. Thus, the violation of a few timing constraints is acceptable for a task in a weakly hard real-time system. Moreover, in our studies, the approach produced an optimal schedule with a reduced number of overheads. Hybrid real-time scheduling on multiprocessors is considered optimal with the partitioning approach, as compared to the global approach, because the overhead of the underlying system can be tolerated. The proposed scheduling approach handles the tasks through the partitioning approach, in which the tasks are assigned to
a single processor before execution; during execution, the remaining tasks are scheduled by migrating them among processors, employing the global approach. As such, the hybrid scheduling approach has proven to be a solution for distributing resources in an efficient manner, with a high guarantee of tasks being scheduled at the same time. Acknowledgment. The authors would like to thank Universiti Malaya for the financial support under the Ministry of Higher Education Malaysia Fundamental Research Grant Scheme (FRGS) FP055-2019A, Universiti Teknologi MARA (UiTM) and the Software Engineering Research Group (SERG) of Universiti Teknologi Malaysia (UTM).
References 1. Shin, K.G., Ramanathan, P.: Real-time computing: a new discipline of computer science and engineering. Proc. IEEE 82(1), 6–24 (1994) 2. Shin, D., Kim, J., Lee, S.: Intra-task voltage scheduling for low-energy hard real applications. IEEE Des. Test Comput. 18(2), 20–30 (2001) 3. Borger, M.W.: VAXELN experimentation: programming a real-time periodic task dispatcher using VAXELN Ada 1.1. Technical report, CMU/SEI-87-TR-032 ESD-TR-87-195, November 1987 4. Jawawi, D.N.A., Deris, S., Mamat, R.: Enhancement of PECOS embedded real-time component model for autonomous mobile robot application. In: IEEE International Conference on Computer Systems and Applications, pp. 882–889 (2006) 5. Carpenter, J., Funk, S., Holman, P., Srinivasan, A., Anderson, J., Baruah, S.: A categorization of real-time multiprocessor scheduling problems and algorithms. In: Handbook on Scheduling Algorithms, Methods and Models, pp. 30.1–30.19. Chapman Hall. CRC (2004) 6. Kong, Y., Cho, H.: Energy-constrained scheduling for weakly-hard real-time tasks on multiprocessors. In: Park, J.J., Chao, H.-C., Obaidat, M.S., Kim, J. (eds.) Computer Science and Convergence, pp. 335–347. Springer, Dordrecht (2012). https://doi.org/10.1007/978-94-0072792-2_32 7. Pathan, R.M., Jonsson, J.: A new fixed-priority assignment algorithm for global multiprocessor scheduling. Technical report, Department of Computer Science and Engineering, Chalmers University of Technology (2012) 8. Back, H., Chwa, H.S., Shin, I.: Schedulability analysis and priority assignment for global joblevel fixed-priority multiprocessor scheduling. In: 2012 IEEE 18th Real Time and Embedded Technology and Applications Symposium, pp. 297–306, April 2012. ISSN: 1545-3421 9. Guan, N., Stigge, M., Yi, W., Yu, G.: Parametric utilization bounds for fixed-priority multiprocessor scheduling. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 261–272, May 2012. ISSN: 1530-2075 10. Sun, Y., Lipari, G., Guan, N., Yi, W.: Improving the response time analysis of global fixedpriority multiprocessor scheduling. In: 2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications, pp. 1–9, August 2014. ISSN: 2325-1271 11. Sun, Y., Natale, M.D.: Weakly hard schedulability analysis for fixed priority scheduling of periodic real-time tasks. ACM Trans. Embed. Comput. Syst. 16, 1–19 (2017) 12. Mayank, I., Mondal, A.: An integer linear programming framework for energy optimization of non-preemptive real-time tasks on multiprocessors. J. Low Power Electron. 15(2), 162–167 (2019)
Simulating the Arnaoutova-Kleinman Model of Tubular Formation at Angiogenesis Events Through Classical Electrodynamics
Huber Nieto-Chaupis(B)
Universidad Autónoma del Perú, Panamericana Sur Km 16.3, Villa el Salvador, Lima, Peru
[email protected], [email protected]
Abstract. The Arnaoutova-Kleinman model is simulated in an entirely classical electrodynamics scenario. To this end, four steps are considered: (i) the migration of endothelial cells, (ii) the random attachment among them, (iii) the apparition of bFGF proteins generating electric fields and field lines, and (iv) the tubule formation from these proteins. Simulations have shown that tubule formation, as one of the first phases of angiogenesis, would require large values of the electric field and a fast adhesion of bFGF proteins to produce stable electric field lines. On the other hand, tubular formation can also be stopped through external fields that might cancel the adhesion of proteins. Therefore, prospective nano devices would play a relevant role in avoiding the first phases of tumor formation.
Keywords: Classical electrodynamics · Angiogenesis · Cancer cell

1 Introduction
A key process that gives origin to tumor formation is known as angiogenesis [1], in which there is an aggregation of cells that, at some phase of its evolution, initializes the creation of new vessels from the pre-existing veins through well-defined steps [2]. From the viewpoint of modern Internet technologies, the anticipation and subsequent tackling of these anomalies might be done with new branches of nanotechnology such as Nanomedicine, which might face current open problems belonging to Oncology [3]. Under this philosophy, nano engineering has proposed well-sustained methodologies such as TDD, whose expectation is far from the known values of the cost-benefit ratio [4,5]. This paper focuses its attention entirely on the protocol proposed by Arnaoutova and Kleinman, called along this document the A-K model, which in an explicit manner identifies inhibitors or stimulators of angiogenesis [6,7]. The A-K model can be summarized in 4 independent events such as:
– Random apparition of endothelial cells,
– Random attachment,
– Apparition of bFGF proteins,
– Tubule formation.
Under this model, the role of the protein bFGF [8,9] is noteworthy as a crucial ingredient for the formation of tubules. Thus, for a close simulation of these events, it is necessary to have a main equation that governs the interaction of the cells among themselves as well as their diffusivity. Some attempts can be made through the diffusion and Fokker-Planck equations, for example. As shall be seen along this paper, the usage of classical electrodynamics turns out to be relevant to understand the A-K protocol. Based on the ionic components, the electric force is given by the Coulomb force:
\[ \mathbf{F} = \frac{1}{4\pi\epsilon_0}\int\!\!\int \frac{\rho_1(\mathbf{r})\,dV_1\,\rho_2(\mathbf{r})\,dV_2}{|\mathbf{r}_2(t)-\mathbf{r}_1(t)|^2}, \tag{1} \]
which is written above in its standard definition, with the charges written as volumetric integrations of the charge densities [10].
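Eq. (1) can be evaluated numerically by discretizing each charge density into point charges and summing the pairwise Coulomb contributions. The sketch below is only an illustration of that discretization; the cloud geometry, the charge values and the use of NumPy are assumptions, not quantities taken from the paper.

```python
import numpy as np

EPS0 = 8.8541878128e-12  # vacuum permittivity, F/m

def coulomb_force(points1, q1, points2, q2):
    """Approximate Eq. (1): total force on distribution 1 due to distribution 2,
    with each density replaced by point charges q located at `points`
    (N x 3 arrays, charges in coulombs)."""
    F = np.zeros(3)
    for r1, c1 in zip(points1, q1):
        d = r1 - points2                       # vectors from every charge of 2 to r1
        r = np.linalg.norm(d, axis=1)
        F += np.sum(c1 * q2[:, None] * d / (4 * np.pi * EPS0 * r[:, None] ** 3), axis=0)
    return F

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cell_a = rng.normal([0, 0, 0], 1e-6, size=(50, 3))      # two small charge clouds a few µm apart
    cell_b = rng.normal([5e-6, 0, 0], 1e-6, size=(50, 3))
    qa = np.full(50, 1e-18)                                  # net positive cloud
    qb = np.full(50, -1e-18)                                 # net negative cloud
    print(coulomb_force(cell_a, qa, cell_b, qb))             # attractive: points toward +x
```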
Fig. 1. The Arnaoutova-Kleinman model: while a random number of endothelial cells emerges without any order (top-left panel), the attachment of these cells is expected because of the ionic component of cells, as claimed by Liu and collaborators in [11]; thus Coulomb forces can exert electrical attraction among them (top-right panel). A possible external source that accomplishes the subsequent events is the presence of bFGF proteins, which would cause electrical interaction because their surface-exposed charges [12] might be the source of electric field lines (down-left panel). Because of this, the formation of tubules might proceed efficiently (down-right panel).
2 Physics of Arnaoutova-Kleinman Model
As seen in Fig. 1, the A-K model consists of up to four phases. The top-left and top-right panels are purely biological processes, whereas the down-left and down-right panels are electrical processes involving ions. From the A-K protocol one can write down a well-defined sequence based entirely on classical electrodynamics, whose main mechanism is assumed to be caused by electrical interactions. In this manner it is strongly assumed that each step has an argument based on physics as the basis of the protocol.
2.1 Presence of Endothelial Cells
Cells can be randomly deployed and expelled from others because of their electric charge. Thus these charges can be expressed as:
\[ Q = \int \rho\, dV = \int \rho(\mathbf{r},t)\, dV, \tag{2} \]
where ρ can satisfy the Fokker-Planck equation or simply the diffusion equation. It is important to remark that electrical currents can emerge, as reported in [13]. Thus, in combination with the diffusion equation, one gets:
\[ I = \frac{dQ}{dt} = \frac{d}{dt}\int \rho\, dV = \int \frac{\partial \rho(\mathbf{r},t)}{\partial t}\, dV = \int D\nabla^2 \rho(\mathbf{r},t)\, dV. \tag{3} \]
2.2 Random Attachment
This phase is assumed to comprise all those events that maximize the electrical attraction of ions of opposite sign. In this manner the forces can be written as:
\[ \mathbf{F} = \sum_{m,n}^{M,N} \frac{1}{4\pi\epsilon_0}\,\frac{Q_m Q_n}{|\mathbf{R}_{m,n}(t)|^2}. \tag{4} \]
2.3 Apparition of bFGF Proteins
The electrically charged bFGF proteins can appear at this phase as a source of electric fields originating field lines. The corresponding electric field can be written as:
\[ E = \frac{1}{4\pi\epsilon_0}\int \frac{\rho_{\mathrm{PRO}}}{R^2(t)}\, dV, \tag{5} \]
which, under a crude approximation, can be rewritten as:
\[ E = \frac{1}{4\pi\epsilon_0}\,\frac{\rho_{\mathrm{PRO}}\,V}{R^2(t)}. \tag{6} \]
2.4 Formation of Tubules
In this phase it is mandatory to derive a closed form that gives the length of the tubule as a function of the electrical properties of both the cells and the bFGF proteins. Under the assumption that tubules are perfect cylindrical volumes, one gets:
\[ dL = \frac{dV}{\pi r^2} \;\Rightarrow\; L = \frac{V}{\pi r^2}. \tag{7} \]
Inserting Eq. 6 into Eq. 7, one arrives at an expression for the length of the tubule as a function of the electric field generated by the bFGF proteins:
\[ L = \frac{1}{\pi r^2}\,\frac{4\pi\epsilon_0\, E(t)\, R^2(t)}{\rho_{\mathrm{PRO}}}. \tag{8} \]
On the other hand, in Eq. 8 the electric field from the bFGF proteins can be written as \( E = q(t)/(4\pi\epsilon_0 r^2(t)) \), so that one arrives at:
\[ L = \frac{1}{\pi r^2}\,\frac{4\pi\epsilon_0 R^2(t)}{\rho_{\mathrm{PRO}}}\,\frac{q(t)}{4\pi\epsilon_0 r^2(t)} = \frac{q(t)\,R^2(t)}{\rho_{\mathrm{PRO}}\, r^2(t)}, \tag{9} \]
with q(t) the total effective charge of the bFGF proteins, which can also contribute to the formation of tubules and is assumed to be free of screening from other electrical compounds. With the definition \( R^2 \Rightarrow r^2(t) - R^2(t) \), one gets:
\[ L = \frac{q(t)\,[r^2(t) - R^2(t)]}{\rho_{\mathrm{PRO}}\, r^2(t)}, \tag{10} \]
\[ L[q(t)] = \frac{q(t)}{\rho_{\mathrm{PRO}}}\left[1 - \frac{q^2(t)}{r^2(t)}\right] \approx \frac{q(t)}{\rho_{\mathrm{PRO}}}\,\mathrm{Exp}\!\left(-\frac{q^2(t)}{r^2(t)}\right), \tag{11} \]
by which the tubule length acquires a well-defined Gaussian form. This, in a first instance, provides an interesting window to test phenomenological approaches that would give a direct bridge between the theoretical formulation and experimental tests done in the past.
2.5 Current Associated to the Tubule Formation
Equation 11 constitutes a purely theoretical expression for the tubule formation of sprouting events in a process of angiogenesis. A possible event is the apparition of currents, as written below:
\[ \frac{dL[q(t)]}{dt} = \frac{d}{dt}\left[\frac{q(t)}{\rho_{\mathrm{PRO}}}\,\mathrm{Exp}\!\left(-\frac{q^2(t)}{r^2(t)}\right)\right]. \tag{12} \]
The straightforward usage of the chain rule on the left side of Eq. 12 gives:
\[ \frac{dL[q(t)]}{dq(t)}\,\frac{dq(t)}{dt} = \frac{d}{dt}\left[\frac{q(t)}{\rho_{\mathrm{PRO}}}\,\mathrm{Exp}\!\left(-\frac{q^2(t)}{r^2(t)}\right)\right], \tag{13} \]
which enables us to write the tubule length as a function of the created electric current:
\[ \frac{dL[q(t)]}{dq(t)} = \frac{1}{i(t)}\,\frac{d}{dt}\left[\frac{q(t)}{\rho_{\mathrm{PRO}}}\,\mathrm{Exp}\!\left(-\frac{q^2(t)}{r^2(t)}\right)\right], \tag{14} \]
so that it is easy to recognize the possible electrical roles of the ingredients of Eq. 14 from the well-known Ohm's law V = IR and the definition of electrical capacitance Q = CV, in the following manner:
\[ \frac{1}{Q(t)} = \frac{1}{i(t)RC}, \tag{15} \]
with Q(t) the charge per unit of length. Thus the pseudo "constant of time" can be written as:
\[ \frac{1}{RC} = \frac{d}{dt}\left[\frac{q(t)}{\rho_{\mathrm{PRO}}}\,\mathrm{Exp}\!\left(-\frac{q^2(t)}{r^2(t)}\right)\right], \tag{16} \]
so that one can wonder whether electricity principles are actually behind the network of sprouting prior to the consolidation of the tumor. Although this idea emerges as a logical response to the usage of classical electrodynamics, this subsequent debate is beyond the scope of this paper.
3 Simulations
In Fig. 2 (up) the simulation of the normalized tubule length is displayed as a function of time (expressed in arbitrary units). For this, the following equation has been employed:
\[ L = 2.2 \times \mathrm{Sin}(qt)\,\mathrm{Exp}\!\left[-\left(\frac{\mathrm{Sin}(qt)}{\mathrm{Cos}(qt)}\right)^2\right], \tag{17} \]
with q an integer number running from 1 to 3. The charge of the proteins was chosen to be sinusoidal in time as a consequence of an expected constant attraction and repulsion with noise or charged compounds around the cancer cells. This integer number is also associated with the frequency of the electrical action among the ions of the bFGF proteins and the cancer cells. The magenta color denotes the biggest frequency, while blue and green denote 0.5 and 0.3 units of frequency. Thus the linear model is shown to be rather oscillatory. As the frequency gets smaller, the apparition of peaks is noted.
In Fig. 2 (down) a simulation with the approximation dictated by the equation:
\[ L = 2.2 \times \mathrm{Sin}(qt^2)\,\mathrm{Exp}\!\left[-\left(\frac{\mathrm{Sin}(qt^2)}{\mathrm{Cos}(qt)}\right)^2\right] \tag{18} \]
is displayed. Here one can see the role of an electrical charge that is nonlinear, as seen in the square of time. This reveals a kind of preference in time to form the tubule, as seen at t = 0.9 when a peak emerges (green curve). The nonlinearity can be found in cases where the cancer cells are totally under oscillation due to the presence of charges, either from the same cells or from different proteins with an electric charge around them.
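The curves of Fig. 2 were produced with the Wolfram Mathematica package [14]; an equivalent sketch of Eqs. (17) and (18) in Python (NumPy and Matplotlib assumed) would be:

```python
import numpy as np
import matplotlib.pyplot as plt

def tubule_length_linear(t, q):
    # Eq. (17): L = 2.2 * sin(q t) * exp(-(sin(q t)/cos(q t))**2)
    s, c = np.sin(q * t), np.cos(q * t)
    return 2.2 * s * np.exp(-(s / c) ** 2)

def tubule_length_nonlinear(t, q):
    # Eq. (18): the charge oscillates in t**2 while the denominator keeps cos(q t)
    s, c = np.sin(q * t ** 2), np.cos(q * t)
    return 2.2 * s * np.exp(-(s / c) ** 2)

t = np.linspace(0.01, 2.0, 2000)
for q in (1, 2, 3):
    plt.plot(t, tubule_length_linear(t, q), label=f"Eq. (17), q={q}")
plt.plot(t, tubule_length_nonlinear(t, 1), "--", label="Eq. (18), q=1")
plt.xlabel("time (arb. units)")
plt.ylabel("normalized tubule length")
plt.legend()
plt.show()
```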
Fig. 2. Simulation of Eq. 16 through the usage of the package [14], for the linear (up) and nonlinear (down) cases. The oscillatory character of the tubule length is due to the assumption that the bFGF proteins are in constant attraction and repulsion with electrically charged compounds prior to the sprouting formation. (Color figure online)
4 Conclusion
In this paper, the Arnaoutova-Kleinman protocol has been modeled through classical electrodynamics. The main focus has been the derivation of the tubule length, which is mandatory for the sprouting formation as an important phase in the processes of angiogenesis. The simulations have shown that the tubule length might oscillate as a consequence of electrical interactions with charged compounds. In addition, the electrodynamics approach has served to relate Ohm's law as well as capacitance to the time change of the tubule. In this form it is plausible to anticipate that angiogenesis might be dictated by electrical principles more than by biological facts. Of course, the results of this paper require a direct comparison with experimental tests to validate the theory proposed in this study. The electrical properties, if any, of the Arnaoutova-Kleinman protocol might be of great interest for prospective technologies such as the Internet of Bio-Nano Things [15], which is rather related to drug delivery techniques, seen as a firm technology to tackle cancer diseases in their first phases.
References 1. Piasentin, N., Milotti, E., Chignola, R.: The control of acidity in tumor cells: a biophysical model. Sci. Rep. 10, 13613 (2020) 2. Carmeliet, P.: Mechanisms of angiogenesis and arteriogenesis. Nat. Med. 6, 389– 395 (2000) 3. Nieto-Chaupis, H.: Electrodynamics-based nanosensor to identify and detain angiogenesis in the very beginning of cancer. In: IEEE CHILEAN Conference on Electrical, Electronics Engineering, Information and Communication Technologies (CHILECON) (2019) 4. Cho, D.-I.D., Yoo, H.J.: Microfabrication methods for biodegradable polymeric carriers for drug delivery system applications: a review. J. Microelectromech. Syst. 24(1), 10–18 (2015) 5. Nieto-Chaupis, H.: Probabilistic theory of efficient internalization of nanoparticles at targeted drug delivery strategies. In: 2020 IEEE 20th International Conference on Bioinformatics and Bioengineering (BIBE) (2020) 6. Arnaoutova, I., Kleinman, H.K.: In vitro angiogenesis: endothelial cell tube formation on gelled basement membrane extract. Nat. Protoc. 5(4), 628–635 (2010) 7. Arnaoutova, I., George, J., Kleinman, H.K., Benton, G.: The endothelial cell tube formation assay on basement membrane turns 20: state of the science and the art. Angiogenesis 12(3), 267–274 (2009). https://doi.org/10.1007/s10456-009-9146-4 8. Kim, H.S.: Assignment1 of the human basic fibroblast growth factor gene FGF2 to chromosome 4 band q26 by radiation hybrid mapping. Cytogenet. Cell Genet. 83(1–2), 73 (1998) 9. Burgess, W.H., Maciag, T.: The heparin-binding (fibroblast) growth factor family of proteins. Ann. Rev. Biochem. 58, 575–606 (1989). ibid Liang, H., Peng, B., Dong, C., et al. Cationic nanoparticle as an inhibitor of cell-free DNA-induced inflammation. Nat Commun 9, 4291 (2018) 10. Jackson, J.D.: Classical Electrodynamics, 2nd edn. Wiley, New York (1975) 11. Liu, B., Poolman, B., Boersma, A.J.: Ionic strength sensing in living cells. ACS Chem. Biol. 12, 2510–2514 (2017) 12. Gitlin, I., Carbeck, J.D., Whitesides, G.M.: Why are proteins charged? Networks of charge charge interactions in proteins measured by charge ladders and capillary electrophoresis. Angew. Chem. Int. Ed. 45, 3022–3060 (2006) 13. Nieto-Chaupis, H.: Nano currents and the beginning of renal damage: a theoretical model. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS) 14. Wolfram Mathematica. www.wolfram.com 15. Akyildiz, I.F., Pierobon, M., Balasubramaniam, S., Koucheryavy, Y.: The internet of Bio-Nano things. In: IEEE Commun. Mag. 53(3) (2015). ibid Chun Tung Chou, Extended master equation models for molecular communication networks. In: IEEE Transactions on NanoBioscience, Volume: 12, Issue: 2, 2013
Virtual Critical Care Unit (VCCU): A Powerful Simulator for e-Learning Frederic Banville(B) , Andree-Anne Parent, Mylene Trepanier, and Daniel Milhomme Université du Québec à Rimouski, 300 Allee des Ursulines, Rimouski, QC G5L 3A1, Canada [email protected]
Abstract. Overview. Learning clinical skills in nursing is challenging for students and teachers, especially because several security and safety issues could affect the patient in real-life situations. Between learning theoretical content and practising in real life, it appears very important to give practice a concrete meaning and to allow errors without consequences for others. High-fidelity simulation could be useful, but it is expensive for faculties. However, the benefits of simulation are well documented, particularly regarding decision making and critical thinking (Cant and Cooper 2010; Harder 2010; Rhodes and Curran 2005), managing patient instability (Cooper et al. 2012), developing leadership and teamwork, and improving performance and technical skills. In the literature, virtual reality (VR) appears to be an acceptable, safe alternative at a relatively low cost. This technology is now known to be useful for the development of clinical skills such as assessment, intervention, or rehabilitation. Through its ecological approach (its ability to reproduce real life with good similarity), it constitutes an excellent educational strategy by reducing the artificial conditions found in conventional laboratories. Aim: In this paper, we describe an experimentation of the VCCU functionality according to four variables: sense of presence, cybersickness, usability and user experience (UX). Method: In a multiple case-study design, we assessed four participants to explore the feasibility of this type of learning in education. In a 30-minute session, participants were immersed in a virtual environment representing an intensive care unit (ICU). They had to practice the clinical surveillance process and make adequate nursing interventions. Results: After the experimentation, we observed that the sense of presence was high and cybersickness very low. However, the mental load associated with the task AND the human-machine interface was very high. A qualitative analysis showed that usability and UX had a direct impact on the learning processes. Discussion: Finally, we discuss the importance of considering the impact of interfaces on the cognitive functions that directly affect learning capacities, from the perspective of Cognitive Load Theory. Keywords: Virtual reality · e-Learning · Critical care · Nursing · User experience
1 Introduction
Clinical surveillance is essential to ensure patient safety in nursing. It is one of the main activities reserved to the field of practice of nurses in the Province of Québec, Canada
[1]. As part of their training, various educational activities are made available to nursing students to promote the development of their skills in connection with the clinical surveillance process in different contexts. For example, during their study curricula, students participate in laboratories, simulations, and internships to mobilize their knowledge in practice and facilitate their integration. In literature, the benefits of simulation are well documented, particularly about decision making and critical thinking [2–4], managing patient instability [5], developing leadership and teamwork [6] and improving performance and technical skills [7]. These concepts have obvious links with those that constitute the foundations of the clinical surveillance process, namely data collection, analysis and interpretation, problem detection, decision-making and working in synergy [8, 9]. High-fidelity simulations, a healthcare education approach involving the use of expensive and very sophisticated manikins, have been valid teaching methods for years which can have significant advantages over so-called “traditional” teaching methods [2], especially for teaching interventions. However, with the arrival of new technologies, we considered using a new approach, namely immersive virtual reality, to immerse students in a critical care environment before their arrival in the internship and optimize their performance both in their actions and in their decision-making process during the deployment of clinical surveillance. Virtual reality (VR) simulates a digitalized environment from a computer. This technology is proving to be an option that is increasingly recognized and used for the development of clinical skills such as assessment, intervention, or rehabilitation. Through its ecological approach (their ability to reproduce real life with very great similarity), it constitutes an excellent educational strategy. Like other disciplines in the health sciences, nursing has followed suit over the past decade in the use of virtual reality, immersive and non-immersive, for nurse education [10–15]. In this mind, we have developed a Virtual Critical Care Unit (VCCU). This care unit allows in particular: a) to raise awareness among students before their internship in intensive care unit; b) to train healthcare professionals to make the right decisions at the right time, in a completely safe environment for patients; c) to standardize students’ exposure to different scenarios to standardize student exit profiles; d) to study the clinical surveillance process unfold by nurses in the context of critical care. The aim of this paper is 1) to describe the development of the VCCU and 2) to show the first results about the user experience (UX) of nurses immersed in this unit when they unfold the clinical surveillance process in critical care context.
2 Method
2.1 Design and Development
The VCCU has been in development since 2018 on the Unity 3D interactive 3D content creation and management platform. Inspired by a real intensive care unit, the VCCU is made up of 14 rooms equipped with monitoring systems. In addition to these rooms, a nurses' station, offices, a pharmacy, clean and dirty rooms, and a teaching room were also modeled. Most of the computer-generated resources were modeled and animated by a 3D artist using 3DS Max software. The VCCU is a completely immersive work environment in which
the user wears a Head-Mounted Display (Oculus Rift S) with a 360-degree experience that is close to that of the real world. The user can move freely around the environment and interact with a multitude of objects from the controllers provided with Oculus. User movement is by translation or by teleportation, which gives the user some flexibility to choose the movement system that suits them. Interactions are made possible using buttons on the joysticks and menus appear near objects to suggest possible actions. During interactions, some menus open to leave room for interpretation by the user. The different moving methods (teleportation or translation) in the VCCU as well as the interactions were chosen by the researchers and the development team to minimize the impact of the technology on the mental load or cybersickness. All environment menus are similar in terms of formatting and colors. It should be noted that all objects in the virtual environment represent objects and instruments frequently encountered in real critical care units in Province of Quebec. As for the virtual agent (named patient), he corresponds to a 43-year-old Caucasian man hospitalized for chest pain of undetermined etiology. At the time of the study, this character was partially animated. 2.2 Integration of Scenarios The scenario implemented in the VCCU was created by a professor-researcher (DM) in the field of critical care. Educational objectives have been clearly targeted to determine the essential learning to acquire during the deployment of the clinical surveillance process in the context of critical care. The scenario was based on the study of Milhomme [8, 9] and was validated by three critical care experts, namely an expert professor in the field of pathophysiology, a clinical advisor in the field of cardiac surgery as well as a clinical nurse with more than 20 years of experience in the field of medical and surgical intensive care. The scenario, lasting up to 20 min, takes place mainly in the patient’s room as well as at the nurses’ station. During the scenario, users must unfold clinical surveillance process and manage the risk of complications, while being exposed to various stimuli (alarms, ringing phones, etc.). They must interact with the patient and the objects in their environment, within a given time. All the clocks in the environment were functional. The user was therefore able to know the time he must carry out the various actions he wants or those requested of him by the scenario. This scenario was built by considering four main concepts in the conception: the game design, the presence and immersion, the game flow, and the playability. Game Design. The VCCU use virtual reality to simulates authentic situations for learning purpose. Previous authors explored player experience and game design theories to engage students for learning in serious games [16]. Even if the VCCU is mostly a simulator, game design models can improve user motivation and engagement to offer a better learning experience. For this reason, the development took into consideration the following principles to optimize user experience and engagement. Previous authors try to define the player or gaming experience [17, 18] using game design elements and models: the presence and immersion, the game flow, and the playability. When defined, these core elements can be related to key concepts in teaching and learning such as relevance, confidence, attention, engagement, challenges, skills, control, and feedbacks [19–22].
Based on the different player experience models [16, 23–25], the VCCU development used various strategies to apply these three concepts. Game Flow. The Game Flow was introduced in 2005 [24] to explain the enjoyment in a game based on the Flow concept [24]. The authors proposed eight Game Flow elements: concentration, challenge, skills, control, clear goals, feedback, immersion, and social interaction. Each part has sub-criteria were described to help to develop the game [24]. More specifically, concentration refers to an ability to concentrate on the task that needs to be accomplished. This concentration can be related to the learner that become cognitively engage by completing complex tasks over a time; a concept also crucial in a teaching and learning framework [16]. The Challenge referred to the sufficiently challenging and coherent with the player skill level. A sense of progression and challenging activities concepts are also used in teaching and learning to keep the learner engagement’s [26]. The player skills were identified as the game needs to be developed to learn and master skills. This element is close to the challenge aspect, where the progression needs to promote a progressive learning curve [26]. The control element referred to the player perception of control over their actions in the game. This control can be related to the competence perception to a certain extent, where the feedback from the actions is coherent and authentic. In the VCCU development, these elements were taken into consideration when the scenario was built. For example, different difficulty levels were plan (with and without hint and scenario difficulties); the scenario level and the possibility to have help or hint from a non-player character can offer a progression for the learner and a higher challenge to match the user’s skill level. Furthermore, clear goals and authentic feedback created by experts in nursing help the learner understand the consequences of their decisions and actions in real situations; increasing the immersion sensation, the feedback level, and promoting a high concentration to succeed. Finally, social interaction will be integrating into a future version where the multiplayer mode will be available; for the moment, the social interactions are done with non-player characters. Playability. Playability can be described as the gameplay quality in terms of storyline, rules, mechanics, goals, and design [25]. These variables contribute to improve the game flow. The control’s responsiveness pace, usability, the interaction’s intensity, the personalisation/customisation, and realism are criteria for the playability. The usability and interactions were an important challenge in VCCU’s development. Most users will have little experience with virtual reality systems meaning that human-computer interactions need to be simple and intuitive. However, scenarios (storyline) needed a high level of interactions with the environment due to the complex task. For this reason, controls design required attention to reducing the system demands interfering with the professional competencies learning curve. During the development, different game menus and controls were explored. At this moment, the pointing control in menus and the standard controls for movements were chosen. However, the possibility of including voice recognition in French and hand movements is discussed for a future improvement to simplify and increase realism to achieve the tasks in the VCCU. Presence and Immersion. 
Immersion is defined as a “deep but effortless involvement, reduced concern for self and sense of time” [24], where the player has a total immersion that alters the awareness of everyday life. For other authors [23], immersion is related to
accurate sensory stimuli provided by the virtual system, while the presence was described as the “sense of being there”. Although the immersion definition may slightly diverge, the immersion seems to transport the user in an absorbing virtual environment, forgetting the real-world stimuli. During the development of the VCCU, the high graphic quality, the environment authenticity, and the sound used were designed to increase the sensation of presence and immersion in the virtual environment. These components are often linked with the game flow described in the next paragraph to enhance this sensation. 2.3 Experiment To study the user experience in the VCCU, we have chosen a multiple case study design. This exploratory method serves as a precedent for studies that can be reproduced and generalized, involving a larger sample [27] such as experimental designs. It is often recommended either for description, explanation, prediction, or the search for process control. However, this approach remains often criticized on the scientific level, as for the external validity and the generalization of the results. Therefore, it is important to ensure that pre-established measures are defined to obey rigorous scientific standards equivalent to those of quantitative methods. Participants. Participants were recruited from an intensive medical and surgical care unit in the Province of Québec. A total of four nurses participated in the experimentation. The mean of experience was three years as a critical care nurse; participants were all women, aged in mean 39 years old (36–43 years old). No participant had experimented video games, virtual reality and they estimated they ease with computer to 72% (60%–80%). The data collection took place from November 2019 to March 2020. After reading and accepting the information and consent form, the participants read the patient’s file (i.e., Virtual Agent) and received a verbal report by a third party who represented a fictive nurse colleague. After that, the participant received verbal information regarding the use of virtual reality equipment (e.g., human-computer interface and interaction with virtual environment). Finally, the participant was prompted to put on the virtual reality headset. From then on, the scenario began by placing her in immersion at the nurses’ station, very close to the patient’s file. The goal of participants was to ensure clinical surveillance for the patient who had thoracic issues. The nursing actions were to collect and analyze data and made clinical decisions concerning the health of the patient. Post-immersive Questionnaires. All the questionnaires were valid in French and were administered after immersion except for the sociodemographic questionnaire that aim to collect personal characteristics of the sample and was administered before immersion. User’s experience questionnaire [23] was administrated to have opinion of participants concerning the game design of VCCU. The questionnaire includes 75 Likert scale questions, four questions involving two choices and three open-ended questions at the end for a deeper understanding of the user’s experience. Simulator Sickness Questionnaire (SSQ) [28], was targeted to observe cybersickness during or after the task’s realization in the VE. This Questionnaire includes 16 items answered on a Likert scale (0–3). iPresence questionnaire (iPQ) [29] was used to describe a which point the participant feel
present in the VE during the assessment. Its 14 questions cover the sense of presence, the realism of the environment, spatial presence, and the involvement of the participant in the environment. To assess the mental (cognitive) workload, which could reduce performance by affecting attention processes and concentration and which is caused by both the human-computer interface AND the nursing task, we administered the National Aeronautics and Space Administration Task Load Index (NASA-TLX) [30]. The participant must answer six statements concerning the task in terms of the mental, physical and temporal demands of the task/environment, effort, performance, and frustration, using a visual analog scale. The AttrakDiff questionnaire was used to assess user experience (UX); it helps us understand how users rate the usability and design of the interactive VCCU. It consists of 28 items scored on a visual analog scale based on a semantic differential. Three scales emerge from this questionnaire: pragmatic qualities, and hedonic qualities of the environment based on stimulation or identity. The descriptive statistical analysis was conducted using SPSS 27 (IBM, USA), and the subjective comments from the user experience questionnaire were analyzed individually.
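Scoring these instruments amounts to averaging item responses into subscale scores. The sketch below illustrates that step generically; the item grouping shown is a placeholder, not the validated scoring key of the SSQ, IPQ, NASA-TLX or AttrakDiff.

```python
from statistics import mean

def scale_scores(responses, scoring_key):
    """Average raw item responses into named subscale scores.

    responses: dict item_id -> numeric answer
    scoring_key: dict subscale name -> list of item_ids
    (The grouping used below is illustrative only.)"""
    return {scale: mean(responses[i] for i in items)
            for scale, items in scoring_key.items()}

if __name__ == "__main__":
    nasa_tlx = {  # six NASA-TLX dimensions rated 0-100 on a visual analog scale
        "mental": 50, "physical": 15, "temporal": 60,
        "performance": 58, "effort": 80, "frustration": 48.75,
    }
    key = {"overall workload": list(nasa_tlx)}        # unweighted (raw TLX) average
    print(scale_scores(nasa_tlx, key))
```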
3 Results Concerning the Game Flow, two participants (1 and 2) mentioned to feel neutral face to the virtual environment; they had some difficulties to feel absorbed by the game or task. They judged that Human-Computer Interface was difficult to learn and manipulate. They said also that the objectives of the clinical tasks were difficult to reach. One participant did not have a felling to be enveloped or engaged into the virtual environment. More specifically, Participant 2 had a low score for all the aspects of the game. However, the participant explained it by the difficulties with the learning curve using the controllers. Participant 1 had a moderate score for the following components: flow, competence, emotion, and usability. Participant 1 explained it also by the hard time at the beginning with the controls. The participant 3 and 4 had a generally good experience in the different game design’s components. All participants said, in general, have had a nice experience with virtual reality. The most significant qualitative comments, link to usability, were the following: • “I was too preoccupated by the joystick and by virtual objects to found in the virtual environment; that limit to me in the interaction with the virtual patient. I think we need to practice or to train before the realisation of the task to be more comfortable with the interface” (P2). • “To know all the possibilities about the interaction into the virtual environment before performing the task” (P1). • “I was lacking experience in the use of virtual reality; that impacted directly my performance on the clinical tasks”. The sense of presence was, even if the usability was relatively low, high in general (mean: 3.75/4); participants had the feeling “to be there” in the virtual environment. More specifically, the spatial presence (mean: 3.5/4) was higher than the perception of realism (2.72/4) or the sense to be involved (mean: 2.06/4) by the virtual environment.
Concerning the workload, overall, the task seems to be moderately demanding (mean: 50%). Specifically, we can observe that the scale "Effort" (How hard did you have to work to accomplish your level of performance?) was the most affected by the task (mean: 80%), followed by "Temporal pressure" (How hurried or rushed was the pace of the task?) (mean: 60%) and "Mental demand" (How mentally demanding was the task?) (mean: 50%). "Frustration" (How insecure, discouraged, irritated, stressed, and annoyed were you?) can be considered, to a lesser degree (mean: 48.75%), as important in understanding the impact of the task AND the human-computer interaction on the feelings of participants. Finally, "Physical demand" (How physically demanding was the task?) (mean: 15%) was not a factor to consider in understanding how the virtual setting impacts clinical and, eventually, learning processes. Despite this, the participants seemed to be relatively pleased with their performance on the whole task (mean: 58%). User experience (UX) was interesting to analyse. First, participants said the VCCU was globally very attractive (mean: 2.89; range −3/+3). Pragmatic quality was low (mean: 0.75; range −3/+3). Hedonic quality – stimulation was better (2.82; range −3/+3) than hedonic quality – identity (1.46; range −3/+3). Cybersickness, assessed after immersion, was very low in general and on the oculomotor and nausea scales.
4 Discussion
The aim of this paper was to present, firstly, the development of a potentially powerful e-learning tool in nursing using virtual reality and, secondly, the results of a pre-experimentation with the VCCU to establish a proof of concept that allows us to continue the development and implementation of this simulator as a training environment supporting competency development in critical care nursing. For this, we asked four experienced nurses to carry out interventions with a virtual patient hospitalized in an intensive care unit. The collected data helped us understand the participants' perception of the game flow, the workload during immersion in the VE, the sense of presence, and whether the user experience was agreeable or not. The following paragraphs discuss the obtained results. Game Flow. Following the results of the preliminary tests, participants 1 and 2 seem to have had a poor experience regarding the sense of competence, the flow, the usability, and the emotion, probably due to the frustrations and difficulties with the controls. These results corroborate previous papers explaining the relation between usability and game flow [23, 24, 31]. Notice that no tutorial was developed for these tests. Consequently, a tutorial is the next addition to the VCCU development and appears to be essential to prepare participants to be trained in virtual reality. Our goal is to incorporate the tutorial into an authentic simulation. Using more straightforward controls, such as hand movement recognition and vocal commands, will probably improve and simplify the future version's controls. To sum up, it appears that usability can affect the feeling of competence when using the human-computer interface to perform the tasks. Usability problems can cause discomfort and negatively impact concentration on the task
because participants focus on the aspects of the environment that not work or are not realistic. Sense of Presence. In psychology, it had been proved that sense of presence is very important to allow to the user to be really engaged in an intervention [32]. This conclusion is less true on a neuropsychologic perspective; Indeed, it was observed that participants who realized a cognitive task into a virtual environment seem less present to the environment [33]. The fact is that the executive components of attention is limited in term of resources. Therefore, if the participant is concentrated on an aspect of the task OR if he/she ought to learn how to manipulate the Human-Computer interfaces, this is normal that he/she feels less attentive to the spatial or visual aspects of the virtual environment. These observations could explain our results concerning the sense of presence were participants have the general feeling “to be there” in the virtual environment (general factor of presence + special presence) but did not feel involved in the environment; this one appear having characteristics that interfere with the real representation of the environment. We can think also that the usability difficulties describing above influence negatively the feeling that the virtual environment or the tasks were realistic. Workload. The results observed give us a good idea concerning how the tasks AND human-computer interface influence the subjective perception of an elevation of the workload. In fact, even if participants were relatively pleased to their performance in the virtual environment, they expressed that the task AND the use of human-computer interface rise the workload by the necessity to work harder to reach the same level of performance such as experiencing in real life. More specifically, we observe the same pattern here that our precedent research [33]. More specifically, the mental and temporal demand of the tasks are always important when the participant is asked to perform a cognitive task with an explicit limit of time to perform. That impact necessarily on the frustration to feel limited in the performance because of the technology (interfaces). UX. The participants said that the VCCU is a Virtual Environment with several quality and a high validity concerning the numeric design and imagery (global attractivity scale). However, the participants shared the difficulties experienced to reach the goal expected; that reduce considerably the usability of the VCCU (Pragmatic quality scale). On the other hand, participants appreciated the novelty, the interactions style and they said that the VCCU was interesting promising (Hedonic quality – stimulation scale). To finish, participant said that VCCU did not bear socialisation and communication and did not feel like having its proper identity (Hedonic quality – identity scale). This last observation was normal because no interaction – except for the patient – are really implemented so far in the VCCU. These results are less consistent with those obtained in another study [34]. These authors related that HMD is better than a flatscreen to improve UX and usability when participant can walk into a virtual environment [35]. In contrast, we believe that the complexity of manipulation of the virtual objects (human-computer interaction) in the VCCU can, itself, explained poor outcome obtained for UX variable in our study. 
Despite the difficulties expressed in manipulating objects and moving around the virtual environment, participants did not experience cybersickness; this is therefore a less important variable to consider for the optimisation of the VCCU in the future.
To conclude, the academic contribution of this paper is to show the potential of virtual reality simulators like the VCCU as tools for simulating real-world work situations in nursing for training purposes. Moreover, the results show the importance of considering human-computer interaction as a variable that can interfere negatively with learning. This suggests that it is important to train participants to use the immersive material correctly before exposing them to the learning situation; indeed, developing ease with the technological aspects could be a prerequisite for using the learning simulator effectively. Furthermore, it can be used for e-learning, with possibly low risks of cybersickness. However, the VCCU needs improved usability, which influences the sense of presence and the game flow. Our team is now working on the results obtained here to enhance usability and user experience by reducing the complexity of the human-computer interaction, for example by a) using only the triggers to interact within the environment (fewer buttons to memorise); b) simplifying menus; and c) using teleportation to improve efficacy in the game. In parallel, our team is now working to develop multiplayer situations accessible remotely over the Internet. Implementing a multiplayer situation will make it possible to practice interprofessional collaboration in a nursing situation, will allow guidance or supervision by the teacher, and will eventually give the opportunity to practice from home with other students. In addition, a tutorial is under development using a realistic nursing situation that allows the player to learn the different interactions before being immersed in the virtual environment. In this way, we will focus on the learning process and the development of professional expertise in nursing. Finally, we are currently developing a general Virtual Care Unit for learning competencies in several fields of nursing, such as obstetrics, interprofessional collaboration and oncology.
Silence in Dialogue: A Proposal and Prototype for Psychotherapy
Alfonso Garcés-Báez(B) and Aurelio López-López
Computational Sciences Department, Instituto Nacional de Astrofísica, Óptica y Electrónica, Sta. Ma. Tonantzintla, Puebla, Mexico
{agarces,allopez}@inaoep.mx
Abstract. The current lockdown caused by the worldwide pandemic has forced people to interact to a great extent online and through social media. In this situation, people are prone to depression, and new approaches to tackle this problem are required. The occurrence of intentional silence in a dialogue is common, and its study in the framework of Cognitive Behaviour Therapy (CBT) can be of great help, on the one hand to achieve more complete communication between the therapist and the patient, and on the other to identify depressed mood states that put the health and physical integrity of patients at risk. The study of silence can shed light for the therapist to carry out preventive actions for the benefit of the patient. In this work, within the framework of CBT, we propose the study of silence in dialogue, and an interactive system, to identify risk situations of depression. A system prototype based on the Beck inventory is presented, and preliminary results are shown and discussed. Keywords: Intentional silence · Cognitive Behaviour Therapy · Dialogue · Depression · Beck inventory
1 Introduction
When a dialogue is started, a turn-based role is established for the speech exchange, according to Sacks et al. [12]. Turn-taking is used for the ordering of moves in games, for allocating political office, for regulating traffic at intersections, for serving customers at business establishments, and when talking in interviews, meetings, debates, ceremonies, or conversations. These last are members of the set of activities which we shall refer to as speech exchange systems. There are six types of dialogues, namely, [11]: Information-Seeking Dialogues are those where one participant seeks the answer to some question(s) from another participant, who is believed by the first to know the answer(s). In Inquiry Dialogues, the participants collaborate to answer some question or questions whose answers are not known to any of the participants. Persuasion Dialogues involve one participant seeking to persuade another to accept a proposition he or she does not currently endorse. In Negotiation Dialogues, the participants bargain over the
division of some scarce resource. Participants of Deliberation Dialogues collaborate to decide what action or course of action should be adopted in some situation. In Eristic Dialogues, participants quarrel verbally as a substitute for physical fighting, aiming to vent perceived grievances. In dialogues with a therapeutic purpose, silence can be grouped into productive, neutral, and obstructive. Here, there are seven types of pauses whose characteristics are summarized in Fig. 1.
Fig. 1. Pausing experiences [10].
The communication literature emphasizes positive aspects of silence, viewing it as a critical component of social interaction [9]. For example, Scott [14] described silence and speech as two dialectical components of effective communication. Without both silence and voice, effective communication is impossible because no one would be listening. Silence is implicit and relative to speech acts. According to Searle [15], a speech act has three parts: 1) Locutionary act - the communicative act; 2) Illocutionary act - the speaker's intention; 3) Perlocutionary act - the effect that the speech act has on the world of the context participants. Spenader [16] holds that sentence types have conventional relationships to certain types of speech acts: 1) Declarative (assertions): The class finishes at 6 p.m. 2) Interrogative (questions): Does this class finish at 6 p.m.? 3) Imperative (orders): Stop teaching immediately! 4) Optative (wishes): I wish this class would be over! In this paper, we propose the study of silence in dialogue, and an interactive system, to identify risk situations of depression. A system prototype based on the Beck inventory, called the Psychotherapeutic Virtual Couch (PVC), is developed and presented, reaching some preliminary results.
This paper is organized as follows: Sect. 2 presents some preliminary viewpoints related to implicature and silence. Section 3 describes the PVC prototype. Section 4 includes a discussion of the application. We conclude in Sect. 5, additionally discussing work in progress.
2 Preliminaries
Grice [8] introduces the important concept of conversational implicature through the cooperative principle and the maxims. For example, in the dialogue: A: I am out of petrol. B: There is a garage around the corner. B conversationally implicates: the garage is open and has petrol to sell. Benotti [2] explored the use that conversational implicatures have in an interactive process, as part of the communication, and proposed the Frolog software as a dialogue game for context maintenance. She also mentioned the importance of the concept of a tacit or implicit act. In the Stanford Encyclopedia of Philosophy [18], we found the following formal definition of implicature. A representative formulation goes as follows, with S the speaker and H the hearer (also referred to as the addressee).
Definition 1 [18]. Implicature is formally defined as: S conversationally implicates p iff S implicates p when:
1. S is presumed to be observing the Cooperative Principle (cooperative presumption);
2. The supposition that S believes p is required to make S's utterance consistent with the Cooperative Principle (determinacy); and
3. S believes (or knows), and expects H to believe that S believes, that H is able to determine that (ii) is true (mutual knowledge).
Omission in implicatures has been studied before, in linguistic interactions. Eco [4] refers to the silence occurring in dialogues as a voluntary or involuntary action that can be interpreted in the same way, as intentional or unintentional. According to Swanson [17], in some contexts, not saying p generates a conversational implicature: that the speaker did not have sufficient reason, all things considered, to say p. Considering that S is the speaker, H the listener and l is a locution in the sequence L of length n, we formally define the omissive implicature or intentional silence as:
Definition 2. Omissive implicature: In the context of the linguistic exchange L = (l1, ..., li−1, li, li+1, li+2, ..., ln) between S and H, there is an omissive implicature of H iff:
1. S conversationally implies q in li−1: S q, i.e. Says(S, q).
2. H conversationally implies that p was not said in li: H ξ, i.e. not Says(H, p).
3. S conversationally implies r in li+1: S r, i.e. Says(S, r).
4. H conversationally implies t in li+2: H t, i.e. Says(H, t).
Where:
– The conversational implicature is according to the formal definition of the Stanford Encyclopedia of Philosophy [18].
– Says(A, p) states that agent A says or holds p.
– In the locution li: H ξ, H produces an omissive implicature (silence) by not saying p.
– The term p, according to the Cooperative Principle, must be in some way related to q produced by S.
– Between li−1 and li+1 there is a significant time lapse1.
– In (3) and (4), the communication in the conversation is recovered.
This definition expresses that it is possible to draw linguistic inferences from omission without definitively interrupting communication, and this can be useful in dialogic interactions in specific contexts. Silence has been previously studied in testimonies [6], where three interpretations of silence were semantically formulated and explored in a case study, employing answer-set programming. Omissions that generate inferences during everyday dialogues are common; however, there are situations where such omissions are even more significant. Next, one of these examples, with some key signs or transcription keys, is discussed. In the dialogue below, taken from a study of the use of no comment by suspects in police interviews [13], the officer (OF) goes further in attempting to obtain some response from the suspect (SUS), moving the suspect from resistance to participation, making it clear that the suspect is aware of the potential repercussions of remaining silent. During the dialogue some transcription keys are used; additional keys can be used in other dialogues, but those in the example are detailed at the end.
Interview
1. OF: Anything else?
2. OF2: (hhhh) the inevitable question but where did you get the stuff from?
3. SUS: No comment
4. (5.9)
5. OF: Okay so an overall round up
Transcription Keys
1. (number) pauses of over a number of seconds.
2. (h) exhalation with the number of 'h's indicating length of breath.
Footnote 1: Time lapse of at least 7 s.
There is a representative sample in [5] of this kind of dialogues with expressions (i.e., no comment) that generate conversational implicature. Another quite significant exercise is found in a generic example of the CBT corpus [3], where T is the therapist and C is the client, through the following extract of interaction:
Extract 1
1–2. T: So do you think your mood's stabilised, do you think now?
3. (1.5)
4. C: Ye::a?
5. (1.0)
6. T: Yea?
7. C: Just er: more positive thinking really.
8. T: The more positive thinking.
In Extract 1, after an enquiry by the therapist, the client confirms that her mood has stabilised. However, a potential misalignment between what is being claimed by the client and what is actually conveyed is evident. Even though her reply is ostensibly affirmative, the 1.5-s pause on line 3 (combined in this case with the elongated delivery and upward intonation of her ‘ye::a?’) conspires to convey that this may not really be the case. The impression given is that, while it cannot be assumed that what she says is untrue, she projects that she may well have some reservations or issues. The therapist appears to pick up on this because there is a similarly extended pause before her next turn (line 5), and she also then uses a questioning (i.e. rising) intonation to prompt the client for more detail (‘yea?’ - line 6) [3]. In the case of the communication schema where voluntary and involuntary silence occur, as in the cases of the police interviews and the generic examples of the CBT, the time units are considered to indicate the silence of any of the interlocutors. There are three kinds of acoustic silence in conversation: pauses, gaps, and lapses. Pauses refer to silences within turns, gaps refer to short silences between turns and lapses are longer or extended silences between turns [3], however in human-computer communication, the study of silence must consider some aspects formalized by Goren and Moses [7]. For example, in a reliable system in which M messages from j to i are guaranteed to be delivered within Δ time units, if j sends no M messages to i at time t, then i is able to detect this fact at time t + Δ. This, in turn, can be used to pass information from j to i. We can assert that there is an intentional silence. Definition 3. There is an intentional silence from j agent to i agent iff M = ∅ at time t + Δ in a reliable system. In a setting where j can fail, however, it is possible for i not to receive j s message because j failed in some way, i.e. here there is an unintentional silence.
Definition 4. There is an unintentional silence from agent j to agent i iff i does not receive j's message because j failed in some way at time t + Δ. In both definitions M, t and Δ are as stated by Goren and Moses [7]. There are studies related to the occurrence of silence in the communicative process, but its application in the field of psychology through interviews with patients is particularly important. A study of silence from a psychological angle can be found in [3], which holds that silences and pauses in ordinary interaction have received a lot of attention. Likewise, the authors argued that Cognitive Behaviour Therapy (CBT) exists as an umbrella term for a range of interventions based on modifying unhelpful thoughts and behaviours, the principles of which are underpinned by psychological models of human emotion and behaviour. Originally developed to treat depression, CBT is a psycho-social approach that focuses on the development of personal coping strategies with the aim of changing unhelpful patterns of behaviour.
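As a rough illustration of Definitions 3 and 4, the sketch below shows how a receiving agent in a message-passing setting could label the absence of a message once the Δ deadline of the reliable system has passed. It is only an illustration of the idea, not part of the PVC prototype; channel_alive is a hypothetical callback telling whether agent j is still reachable.

import queue

def classify_silence(inbox, delta, channel_alive):
    # inbox: queue.Queue of messages sent from agent j to agent i
    # delta: the delivery guarantee (in seconds) of the reliable system
    try:
        msg = inbox.get(timeout=delta)
        return ('message', msg)
    except queue.Empty:
        # No message arrived within delta time units (M is empty at t + delta).
        if channel_alive():
            return ('intentional silence', None)    # Definition 3
        return ('unintentional silence', None)      # Definition 4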
3 Silence Management in Psychotherapeutic Dialogue
The design of experiments through interviews for the study of omission in response to some questions can yield interesting information; in our case it was not possible to conduct interviews due to the global health emergency imposed by the Covid-19 pandemic. Introducing the proof of concept behind the Psychotherapeutic Virtual Couch (PVC) oriented to depression, the prototype is intended to illustrate some advantages of attending to the occurrence of intentional omission of an answer, or silence. For instance, it can set off alarms in some cases that may require specialized help. This silence is considered of the interactional obstructive type summarized in Fig. 1 [10]. This system can serve for empirical studies of the occurrence of silence with two objectives: the first is to discover depressive mood states that put people's health at risk; the second is to keep a record relating the context, the answers obtained from the contextualized dialogues, and the action undertaken in response to the silence. The treatment of silence can be done wherever a dialogue is established, although it is particularly important where a question is asked that requires an answer that does not come. Figure 2 illustrates the steps necessary to detect silence in a given sequence of dialogue processing. Once silence is detected, it can be handled properly in a silence management module. In the management of silence (see Fig. 3), the patient is asked "Are you OK?" to identify the answer that best describes her silence, with the possibility that her response is silence again. The patient's behavior may be studied with the help of a specialist. At the end of the module, a Silence Knowledge Base (SKB) is maintained, linking the context or scenario, the answer, and the corresponding decision-making action.
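A minimal sketch of the SKB record keeping described above could look as follows; the field layout and the CSV storage are assumptions made purely for illustration.

import csv, time

def record_silence(skb_path, scenario, answer, action):
    # Append one SKB entry linking the context, the obtained answer
    # and the decision-making action taken for the silence.
    with open(skb_path, 'a', newline='') as f:
        csv.writer(f).writerow([time.strftime('%Y-%m-%d %H:%M:%S'),
                                scenario, answer, action])

# Example: record_silence('skb.csv', 'G9-Suicidal wishes', 'silence', 'alert the therapist')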
Fig. 2. Silence detection
To illustrate the interpretation of silence in the framework of dialogue with psychotherapeutic use, we developed the PVC system that reports to the patient their level of depression based on a 21-question quiz, a questionnaire known as the Beck inventory [1]. The 21 groups of numerical responses with levels from 0 (null) to 4 (maximum) correspond to the following states [1]: G1-Sadness, G2-Pessimism, G3-Failure, G4-Loss Pleasure, G5-Feelings of Guilt, G6-Feelings of Punishment, G7-Disagreement with oneself, G8-Selfcriticism, G9-Suicidal thoughts or wishes, G10-Crying, G11-Agitation, G12-Loss of interest, G13-Indecision, G14-Devaluation, G15-Loss of Energy, G16-Changes in Sleep Habits, G17-Irritability, G18-Changes in Appetite, G19-Difficulty Concentrating, G20-Tiredness or Fatigue, G21-Loss of Interest in Sex.
Fig. 3. Silence management
The pseudocode of the PVC system is as follows:
  Patient welcome
  Parameter initialization
  Random choice of one of the 21 questions:
    If the answer is empty within 7 sec or more:
      Do the silence management (yes, no, Alert: 'I prefer not to speak')
      Create evidence of the occurrence of silence
    Create evidence of interaction
  Writing results
  Thank the patient
In Fig. 4, the dialogue with the PVC system is shown, where one can notice the absence of explicit responses that, according to the context, are interpreted in various ways, as can be observed in the log records depicted in Fig. 5. In the SilenceData file (Fig. 5), the first silence corresponds to question number 9, 'Suicidal wishes', generated randomly; the double silence turns on the alarm, registering the message 'I prefer not to speak'.
Fig. 4. PVC prototype.
Fig. 5. Dialogical interaction of PVC.
The second silence corresponds to question number 21 ‘Loss of interest in sex’ where after the question: ‘Are you okay? y/n’ ENTER is given without response within 7 s and the value ‘no’ is assigned, which represents another type of alarm that may be the trigger for a new dialogue module. The next two records correspond to questions number 6 and number 15 which have explicit answers to the question: ‘Are you okay? y/n’, where the answer ‘no’ should also be considered an alarm related to question 15 ‘Loss of energy’. The last record of silence refers to question number 5 ‘Feelings of guilt’ which has as an assigned answer ‘yes’ to the question: ‘Are you okay? y/n’.
4 Discussion
Based on previous work, we are proposing the inclusion of the consideration of silence in dialogues, and its management, to discover information that could lie behind people's omissions in the context of a dialogic interaction. It is important to clarify that the difference between the responses (values) that can be adopted by default ('y' or 'n' in our case) and those that are inferred from silence is a function of time as part of a context: a silence corresponds to 7 or more seconds without a response. Therefore, in the system prototype, the inferences 'yes' and 'no' based on historical data are consequences of a single silence (we randomly generate the polarity for illustrative purposes), and the inference 'I prefer not to speak' is a consequence of two silences. The application we have developed so far is monolithic, and it will be necessary to analyze the timing of the answers to questions formulated through distributed systems, based on the definition of a reliable system.
5 Conclusion and Future Work
There are contexts where the occurrence of silence within dialogues can lead us to an important source of information that can help to make decisions, for instance the application of preventive measures to avoid risks to people’s lives and health. Intentional silence in dialogues is quite common and can be adequately treated for decision-making. Some extensions of this work are the following: 1. Collaborate with a psychotherapist to develop a timely help tool in interviews with patients who resort to silence. 2. An application for people to measure their level of depression. This can be available at any time. 3. Statistics of the occurrence of silence in dialogues and its consequences, once the app is ready and in use by people. Acknowledgments. The first author thanks the support provided by Consejo Nacional de Ciencia y Tecnolog´ıa. The second author was partially supported by SNI, M´exico.
Annexure A: System Prototype Code

# System Prototype Beck inventory, Silence manager
import time, random

t_stat = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
t_beck = ['Sadness','Pessimism','Failure','Loss of pleasure',
          'Feelings of guilt','Feelings of punishment','Disagreement with oneself',
          'Self-criticism','Suicidal wishes','Crying','Agitation','Loss of interest',
          'Indecision','Devaluation','Loss of energy','Change in sleeping habits',
          'Irritability','Changes in appetite','Difficulty concentrating',
          'Tiredness or fatigue','Loss of interest in sex']
t_level = [4,4,4,4,4,4,4,4,4,4,3,4,4,4,4,4,4,4,4,4,4]
t_user = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
def silence_m(n):
    # Ask the patient whether she is OK; an empty answer after 7 s or more
    # is treated as a second silence and raises the alert message.
    opt = ['yes','no']
    t0 = time.perf_counter()
    a = input(n+' Are you OK? y/n: ')
    t1 = time.perf_counter()
    Dt = t1-t0
    if (a == '') and not(Dt < 7.0):
        a = ' I prefer not to speak'
    if a == '':
        a = random.choice(opt)
    return a

def dlevel(name,t_stat,t_user):
    # Ask the 21 Beck inventory questions in random order; s_data and
    # s_history are file handles for the silence log and the interaction history.
    for j in range(21):
        i = random.randint(0,20)          # 0-based index into the 21 items
        while t_stat[i] != 0:
            i = random.randint(0,20)
        t_stat[i] = 1
        t0 = time.perf_counter()
        answer = input(t_beck[i]+' (of 0 to '+str(t_level[i])+'):')
        t1 = time.perf_counter()
        Dt = t1-t0
        if (answer == '') and not(Dt < 7.0):
            sm = silence_m(name)
            s_data.write(name+' '+str(i)+' '+sm+' ')
            answer = t_level[i]
        if answer == '':
            answer = 0
        answer = int(answer)
        t_user[i] = answer
    s_history.write(name+str(t_user))
    return sum(t_user)

def eval(l):
    print('')
    print('Depression Level Score:')
    if l

18.2 µg/m3 and > 33 µg/m3, it decreases significantly to 40.41 CFU/L, if OutPM2.5+10 is around the average measured > 18.2 µg/m3, and it almost decreases by 2–3 times at OutPM2.5+10 > 9.82 µg/m3. The winter period is accompanied by at least 7 times higher air pollution in the outside ("outdoor") air, which directly affects the increase of the degree of microbiological contamination of the indoor ("indoor") air, by the same factor of at least 7 (24.75 CFU/L to 142.8 CFU/L) between the period of lowest summer air pollution, OutPM2.5+10 (9.82 µg/m3), and the period of highest winter air pollution, OutPM2.5+10 (136 µg/m3).
Fig. 3. Laboratory measurements of pollution with PM 2.5 (LabPM2.5) and microbiological contamination and relationship with seasonal variation.
The next model (see Fig. 3) shows that air pollution in terms of LabPM2.5 concentration in the winter months (> 46 µg/m3) is at least 2.5 times higher than in the summer months (> 18 µg/m3). Winter measurements of LabPM2.5 in the laboratory are on average around the legally allowed standard, i.e. 46 µg/m3. The microbiological contamination of 85.8 CFU/L does not follow the same increasing trend, in contrast to the PM2.5 values above the standard, between 46 µg/m3 and 96 µg/m3, where the microbiological contamination is significantly higher, i.e. 137.5 CFU/L.
Fig. 4. Laboratory Measurements of Pollution with PM 10 (LabPM10 ) and Microbiological Contamination and Relationship with Seasonal Variation.
In winter months where the LabPM2.5 concentration was half the permissible standard (21 µg/m3), the microbiological contamination does not decrease analogously but remains around the values (115.08 CFU/L and 124.3 CFU/L) which are also obtained
for LabPM2.5 values between > 46 µg/m3 and > 96 µg/m3. The summer results of LabPM2.5, with values > 10 µg/m3 to > 18 µg/m3, are accompanied by approximately 2 times lower values (72.3 CFU/L) compared to the counts at > 21 µg/m3 or 96 µg/m3 measured in some winter months. The last model depicts the air pollution with LabPM10 concentration (see Fig. 4). The model shows that the winter LabPM10 pollution (> 66 µg/m3) is at least 3 times higher than the air pollution in the summer months (> 22 µg/m3). A microbiological contamination of 84.2 CFU/L, which satisfies the hygienic correctness of air, can be found only in the summer season, while most of the winter values are above 100 CFU/L and some of them lie at the upper limit of hygienic correctness of air. In Table 2, the evaluation results of the model performance are presented. Two metrics are considered for model evaluation, the Root Mean Square Error (RMSE) and the Pearson Correlation Coefficient (CC). According to the results from Table 2, the model with the highest descriptive and predictive accuracy is the LabPM10 model, while the other models obtained relatively similar results.

Table 2. Model performance evaluation using the RMSE and Pearson correlation coefficient metrics for both train/test procedures. Lowest RMSE values (highest performance) are bolded, while the highest CC values are underlined for both train/test procedures. (RF - Random Forest algorithm, Bag - Bagging algorithm)

Models        LabPM2.5+10   OutPM2.5+10   LabPM2.5   LabPM10
PCT_Train*    28.29         28.99         28.08      26.18
PCT_Test*     28.82         29.76         29.65      26.86
RF_Train*     24.15         14.03         22.55      19.46
RF_Test*      31.33         33.23         30.21      30.47
Bag_Train*    28.14         28.62         28.40      23.91
Bag_Test*     28.57         29.16         29.05      27.76
PCT_Train**   0.80          0.79          0.79       0.83
PCT_Test**    0.79          0.77          0.77       0.82
RF_Train**    0.86          0.96          0.87       0.91
RF_Test**     0.75          0.73          0.77       0.77
Bag_Train**   0.80          0.79          0.79       0.86
Bag_Test**    0.79          0.79          0.78       0.81
* RMSE, ** CC
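For reference, the two metrics reported in Table 2 can be computed from observed and predicted CFU/L values as in the following NumPy sketch; the example arrays are placeholders and not the study data.

import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Square Error between observed and predicted values.
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def pearson_cc(y_true, y_pred):
    # Pearson correlation coefficient between observed and predicted values.
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Placeholder example (illustrative values only):
y_obs = [84.2, 115.1, 137.5, 72.3]
y_hat = [80.0, 120.0, 130.0, 75.0]
print(rmse(y_obs, y_hat), pearson_cc(y_obs, y_hat))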
If we analyze the results of the experimental evaluation of the different algorithms presented in Table 2, and more importantly the model interpretability, we can note that the PCT algorithm not only produces the interpretable models presented in this section, but also achieved the highest and most stable (not overfitted) performance compared to the Random Forest algorithm. We know this because the difference between test and train performance is smallest for the PCT and Bagging models, but not for the Random Forest models.
Besides the algorithmic presentation of the microbiological contamination of air, there are some biological indicators that can be detected only qualitatively, and they are presented in Table 3. Table 3 shows the qualitatively isolated microorganisms that are internationally accepted as indicators that can be associated with the type of hygienic defect of the air and of the rest of the inanimate environment.

Table 3. Microorganisms that are used as indicators for a certain type of contamination of the living environment in a period of 1 year.

Month                                    Bacterial flora (indicative microorganisms)
December 2019
January, February, March, April, May     Bacillus subtilis
June, July                               Bacillus subtilis
August                                   Streptococcus fecalis
September, October, November             mold
December

** Indicators of microorganisms:
Bacillus subtilis - an indicator of an increased presence of dust
Fecal streptococcus - an indicator of fecal contamination
Mold - an indicator of increased humidity in the space
From the models we can note that there is a seasonal variation in total air pollution in relation to the legal standards: in the winter months, the air pollution is higher than the legally allowed standard in both outdoor and indoor air, while in the summer months the air pollution values are close to the allowed standard values in both outdoor and indoor air. What is constant in these models is that in the winter period the outside air has several times higher pollution than in the summer months, which directly and proportionally affects the increase of the level of microbiological contamination.
4 Conclusion
In this paper we applied a machine learning algorithm, particularly decision trees, on data collected for air quality and the microbiological properties of the given samples. The
data contained measurements of both PM2.5 and PM10 particles taken inside the laboratory as well as outside the laboratory. The bacteriological data was a collection of aero-microbiological samples taken in two microbiological laboratories with an impactor-type sampler. Using the decision tree machine learning methodology, we obtained three models from the laboratory measurements (LabPM2.5+10, LabPM10, LabPM2.5) and one model from the outside measurements (OutPM2.5+10). The results from all the models clearly indicated that the winter season has the greatest influence on bacterial growth, compared with the summer measured data. In the summer months there is no visible difference between the total air pollution outside and the indoor air, but in the summer months the isolated microorganisms are more often indicators of the presence of dust and fecal contamination. Air pollution with PM particles proportionally affects the microbiological contamination of indoor air. A reduction of air pollution is proportionally followed by a reduction of microbiological air contamination in both seasons and in both measured air samples. There is no visible association of microbiological contamination with the origin of the increased air pollution, i.e. outside/indoor air. Based on these results, we will focus our future work on the application of other machine learning models, including models for hierarchical clustering or multi-target classification.
Markov Chains for High Frequency Stock Trading Strategies Cesar C. Almiñana(B) Universidade Presbiteriana Mackenzie, São Paulo, Brazil [email protected]
Abstract. The price prediction problem for any given stock has been an object of study, deepening and evolution over the past few decades, with the basic goal of positive financial realizations at the smallest possible risk. A known risk-minimization strategy is a high frequency trading regime, which benefits from minor price variations and achieves small profits multiple times a day. Under the hypothesis that price variations behave as a Random Walk, it is possible to characterize a Markov (stochastic) process with known states and transition probabilities, allowing price variations to be described through time. Therefore, this study aims to estimate the financial outcome of using Markov chains for automated decision making in a high frequency stock trading strategy, and then to compare the computed results with the stock's valuation (or devaluation) within the same analyzed period. Keywords: Stocks · Markov chains · High frequency trading
1 Introduction
Predicting prices or values of financial products with a high precision level – an index, a stock, an exchange, or a fund – has been an object of study, deepening and evolution for the past few decades, mainly with the basic objective of monetary profits, as riskless as possible. Technical (or graphical) analysts, for instance, are devoted to creating behavioral projections for stocks, based upon indicators, observed performance, demand and supply parameters, quotation projections and transacted volumes, looking for a proper way to gather future performance estimates. In parallel, innumerable authors have devoted themselves to studying Machine Learning (ML) and regression techniques and algorithms, mainly related to Artificial Neural Networks (ANN), to accurately predict the future values that a given asset may assume. Some approaches try to realize short-term exact value predictions [1–4]; some try to aggregate complexity, using ANN methods [5–9]; others use Deep Artificial Neural Networks (Deep Learning, DL) methods, still trying to predict future values with the highest precision level, model adaptability, and refining [10–12]. In other words, accomplishing value prediction with a relevant precision level translates into executing buying and selling operations at the correct times, allowing profits to be maximized, risk levels to be reduced and investment strategies to be optimized.
However, an operating system can only be considered fully automated when two subsystems are combined and work together: the prediction system and the trading system [13]. Paradoxically, their objectives differ: while one seeks to minimize prediction errors, the other maximizes profitability. But, since operation rules are normally not incorporated within the learning and predicting process, many real-world variables end up not being considered, such as liquidity, latency, and transaction costs. As a counterpoint to the prediction problem, this study handles the profitable trading challenge under the latter perspective. Also, more active trading methods (high frequency trading, or HFT) benefit from minor price variations, making multiple returns a day. Theoretically, this kind of strategy allows risk reduction, since profits are collected through a high volume of operations, focusing on the trading system rather than the prediction system. Thus, this work's general goal is to develop, implement, test, and evaluate an automated stock trading system that eliminates the need for exact price prediction at a given instant, considering only market tendencies. Also, the algorithm must be capable of instant and automated decision making in real time, allowing and supporting an HFT strategy. This paper is organized as follows. Section 2 describes all the theoretical references that sustain this study, such as the Brazilian stock exchange specificities, and a statistical review regarding random variables, random walks, and Markov chains. Section 3 presents the proposed methodology and related works. Section 4 details the observed experimental results, and finally, Sect. 5 concludes this paper with discussions, conclusions, and future works.
2 Literature Review

2.1 From Random Walks to Markov Chains
Randomness and uncertainty, which always walk side by side, exist in virtually every aspect of life [14]. A random variable is a value attributed to an experiment. It could be the result in a game of chance, the voltage of a random power source or any other numerical quantity of interest when performing an experiment [15]. A random process is an extension of the random variable concept, but now computed as a function of time. That means that, for any chosen observation time instant, the random process will yield a random variable [14]. The notion of time aggregated to random variables defines the characteristics of a stochastic process. RWs are just a particular case that describes a sequence of independent random variables with known values, regularly spaced from each other [15]. Since stock price changes are caused by successive random "impulses" that vary in time, they can be considered a RW. A Markovian process represents the simplest generalization of an independent process: at any time instant, its result depends only on the immediately previous result, and none other. In a Markovian process, the past does not have any interference in the future if the present is specified [15]. Every Markovian process can be described as a chain (a Markov chain), with a well-defined and finite (or, in some cases, infinite) number of transition states. A transition
state describes how a particular system is at a time instant. Therefore, as stated in Eq. 1, the probability of the process moving on to the next state depends only on the current state, and on none of the past states.

P[x(tn) = xn | x(tn−1), ..., x(t1)] = P[x(tn) = xn | x(tn−1)]   (1)

For n ≥ m ≥ 0, the sequence tm+1 → tm → · · · → tn represents the system's evolution. Then, pi(m) = P[xm = si] represents the probability that, at time instant t = tm, the system will assume state si, and pij(m, n) = P[xn = sj | xm = si] is the probability that, at instant t = tn, the system will assume state sj, given that it was in state si at instant tm. From the state transitioning idea, it is possible to arrange the pij(m, n) probabilities as a probability matrix, as in Eq. 2. The P(m, n) matrix has rows and columns made of positive numbers only, since it deals with probabilities, and the sum of each row is always equal to 1.

P(m, n) = | p11 · · · p1j |
          | ...  ...  ... |
          | pi1 · · · pij |   (2)
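As an aside, a transition probability matrix such as the one in Eq. 2 can be estimated from an observed sequence of states by counting transitions between consecutive observations and normalizing each row; the short Python/NumPy sketch below is purely illustrative and uses a made-up state sequence.

import numpy as np

def transition_matrix(states, n_states):
    # states: sequence of integer state indices (0 .. n_states-1) observed over time
    counts = np.zeros((n_states, n_states))
    for current, nxt in zip(states[:-1], states[1:]):
        counts[current, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1       # avoid division by zero for unvisited states
    return counts / row_sums          # each row sums to 1 (or stays all-zero)

# Made-up example with 3 states:
P = transition_matrix([0, 1, 1, 2, 0, 1], 3)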
It is also possible to represent a Markov chain as a graph where each node represents a state, and each edge represents a transition between states (nodes). The weight attributed to each edge represents the transition probability between each state pair (see Fig. 1).
Fig. 1. Markov Chain Representation, with States, and Transitions Pairs and Probabilities
2.2 Decision Process
Beyond knowing the next event's probability – which, for the purposes of this paper, means the next stock price value – it is still necessary to make decisions to achieve the basic and main objective of financial realizations.
A Markov decision process (MDP) can be considered a stochastic extension of finite automata, or even a Markovian process with actions (and rewards). By applying an action a at any given state s, the system transits to a new state s', according to the probability distribution of all possible transitions. A reward function indicates the reward for being in a state or for executing an action from a state. It is worth mentioning that rewards can be not only positive but also negative, interpreted as a penalization. An MDP can be represented as an extension of the graphical representation of Markov chains, with actions illustrated by intermediary nodes that, when taken, allow state transitions (see Fig. 2).
Fig. 2. MDP representation, with states, actions, transitions pairs and probabilities, and rewards
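For concreteness, an MDP of the kind sketched in Fig. 2 could be represented in code as a nested mapping from states and actions to (probability, next state, reward) triples; the states, actions and numbers below are made up for illustration only.

# Hypothetical MDP: in each state the agent may 'buy', 'sell' or 'hold';
# each action leads to possible next states with a probability and a reward.
mdp = {
    's_up': {
        'buy':  [(0.6, 's_up', +1.0), (0.4, 's_down', -1.0)],
        'hold': [(1.0, 's_up', 0.0)],
    },
    's_down': {
        'sell': [(0.7, 's_down', +1.0), (0.3, 's_up', -1.0)],
        'hold': [(1.0, 's_down', 0.0)],
    },
}

def expected_reward(state, action):
    # One-step expected reward of taking `action` in `state`.
    return sum(p * r for p, _, r in mdp[state][action])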
Some works have used Deep Learning based models and implemented trading agents in a reinforcement learning framework, for Chinese and South Korean stock markets [16, 17]. They differ from each other on the quantity of traded stocks – in simulations and other tests – and used historical data. The use of different pairs of currencies at the Forex (Foreign Exchange trading) market, using 15-min detailed data from 2012 and 2017, has proven to be a valid and applicable approach [16]. Other models differ by using cryptocurrencies and data from a shorter past period to develop decision making layer [18]. Others approached this issue by using a model ensemble to study and simulate the American stock market between 2009 and 2020, with observed accumulated profits relevantly bigger when comparing to other similar studies [19]. All works have used an offline learning and decision-making process, allied to the use of different data timeframes and granularities, different variables for model training (e.g., open, close, max, min prices), and different trading frequencies (e.g., hourly, daily, and other timeframes). This paper is limited to the mentioned Markov chain/MDP concepts and does not use any ML algorithm or model.
2.3 Stock Trading
The best price for the buyer is the lowest value offered by a seller. In more technical terms: the buyer places the best (lowest) buying order for any given stock, known as the bid, and the seller places the best (highest) selling order for that stock, known as the ask [20]. The difference between these two values is known as the spread and is the implicit cost of being immediately able to execute an order at the market [21]. The Brazilian stock market has three important trading aspects to be considered by any automated trading strategy: (i) the minimum monetary fraction is R$0,01 (one cent of Real), (ii) the minimum trading volume (batch) is 100 stock units, and, as in other markets, (iii) open and close values are updated every minute, while the current value (for bids and asks) varies and is updated in real time. There are several buying and selling order types. For the purposes of this paper, only 2 will be considered: market orders and limited orders. The first happens when there is no limit to the negotiation price and the deal occurs at the best price available at closing moment. The latter is attached to a negotiation price limit, most of the time with a minimum selling price or a maximum buying price. It is worth mentioning that all negotiations are processed by an electronic trading interface. There are also two kinds of positioning: buying and then selling – a "long" position, which tends to be the most intuitive order when talking about buying and selling stocks – or selling and then rebuying – a "short" position, which consists of selling (or lending) and then repurchasing (or returning the loan). In the "long" position, or the purchase and subsequent sale operation, the negotiation is easily described: a first order (limited or market order) is sent to purchase a certain stock at a certain value, respecting the minimum transacting volume; then a target price is established – either with the goal of making a profit or to limit a maximum loss value; then, a second opposite order is sent, with the objective of closing the opened position, taking the profit (or loss). In other words, this buy/sell operation is composed of 2 orders: a first buy order and then a second sell order. An example of that would be buying 100 stocks of ABCD3 for R$10.00 each, so the total invested amount is R$1,000.00. When selling this same stock quantity for R$11.00 each, the profit-taking action results in R$1.00 per stock, or R$100.00 in total. On the other hand, it is possible to have a "short" position by selling and then repurchasing the same stock. Practically, this kind of position consists in borrowing assets and then selling them. Later, the investor needs to purchase the same number of stocks to return to the lender. In the meantime, if the stock price has decreased, a profit can be taken. An example of that would be selling (after borrowing them) 100 stocks of ABCD3 for R$10.00 each, so the total amount is R$1,000.00. When repurchasing this same stock quantity for R$9.00 each, the profit-taking action results in R$1.00 per stock, or R$100.00 in total. For every executed order, it is also necessary to consider all due taxes and other transaction fees. According to Brazilian law, an investor might have to pay an income tax when profiting from trading stocks, depending on the position period. An income tax of 20% on all summed profits must be declared monthly by the investor, plus a 1% fee directly charged when a transaction is completed. At the end
of the month, all losses can be applied as a discount [22]. Also, other market taxes (charged by B3, in Brazil – the equivalent of Nasdaq or other stock markets) need to be considered: a 0,005% negotiation tax, and a 0,018% liquidation tax applied to the total traded amount – only applied to day trades [23]. In Brazil, all transactions go through a broker, which may also apply its own taxes and fees for each trade and for custody. For the strategy presented in this paper, these fees have been assumed to be zero, since the considered brokers have a zero-fee trading policy [24]. Every time the market operates contrarily to the prevailing tendency, and then returns to it, a cycle ends and a new one begins. This pattern, which constantly occurs in every financial market, may be called a swing. The so-called swing trades take advantage of cyclical movements within an existing tendency [25]. In addition to these cycles, a more active operating method, called scalping, takes advantage of minor movements or micro price variations, taking multiple profits a day [26]. Minor price variations might be described as a consequence of successive random "impulses" that happen through time. So, the concept of HFT is introduced here as an approximation of the scalping strategy [27, 28]. From a statistical perspective, stock price variations (or impulses) may be characterized as random walks (RW) [20], that is, a stochastic process that describes a sequence of independent random variables with known values, regularly spaced from each other. A Markov (or Markovian) process may represent the simplest generalization of an independent process, that is, a process that, at any given point in time, depends only on the immediately previous value, and none other. Some works have been capable of showing the adherence of both the Indian and South African markets to the RW behavior hypothesis [29, 30].
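The fee and tax figures quoted above can be combined into a rough back-of-the-envelope calculation of the net result of a single buy/sell round trip, as sketched below in Python. The rates follow the text (0,005% negotiation, 0,018% liquidation, 20% income tax on the month's net day-trade profit, zero brokerage); it is a simplified illustration only and ignores the 1% charge mentioned above.

NEGOTIATION_FEE = 0.00005   # 0.005% of the total traded amount
LIQUIDATION_FEE = 0.00018   # 0.018% of the total traded amount (day trades)

def day_trade_result(buy_price, sell_price, quantity):
    # Gross result of a long round trip minus the two exchange charges on the traded amount.
    traded_amount = (buy_price + sell_price) * quantity
    fees = traded_amount * (NEGOTIATION_FEE + LIQUIDATION_FEE)
    return (sell_price - buy_price) * quantity - fees

# Example from the text: buy 100 shares at R$10.00 and sell them at R$11.00.
gross = day_trade_result(10.00, 11.00, 100)     # about R$99.52 after exchange fees
income_tax = 0.20 * max(gross, 0.0)             # 20% on the month's net profit, simplified here
net = gross - income_tax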
3 Methodology Based on and inspired by technical simplicity, such as [31, 32], this paper dismisses the need of exact predictions. The Markovian processes approach can describe price changes of any traded stock, and then a logical operation layer, converts state transition probabilities into a decision-making system. Practically, it translates into how probable a random variable, positive or negative, may occur, knowing the current state that a system is. For that, historical data is needed to properly describe stock price changes using a Markov chain. Stock prices are normalized by, instead of absolute instant prices, using percentual changes from one minute to the other. A transition probability matrix can then be calculated and, for any given state, sum the probability that a stock has, at any given moment, to increase, decrease or maintain the current price. So, at the beginning of every working (trading) minute, the system will compute these probabilities and take the decision of buying (if the sum indicates an increasing tendance), selling (if the sum indicates a decreasing tendance) or doing nothing (holding current position). The one-minute interval was chosen to increase the amount of open and closed trades, decreasing risks, and making relevant profit by taking successive small gains (HFT strategy). A swing strategy would involve much more operating period, not to mention that relevant gains would require relevant investments.
All operating system logics and routines were developed using Python programming language, under the 3.8 version. It not only allowed to properly model this problem and to connect to the Metatrader 5 trading system (an online environment), allowing to observe and work with real time data of any stock, but also to retrieve past historical data. Both data sources were used: the first for trading simulation, the latter for modeling, Markov chain creation and transition probabilities calculation. This process will be described in detail in the following Sect. 4. All tests were performed in a Virtual Machine (VM) with the Windows Server OS, hosted at the South American region of Google Cloud Platform (GCP) – physically located at São Paulo, Brazil. This VM has for 4 processing cores and 6GB memory. These choices were made once (i) the Metatrader 5 can only be executed under a Windows OS (and not using any ARM architecture) and (ii) to improve latency reduction for sending and receiving new information.
4 Decision-Making Strategy and Development It is possible to develop, implement, test, and evaluate an automated stock trading system that does not require a specific value prediction for any given stock at any instant in time. This system will only consider the market tendence. It’ll also be capable of taking automated, instant, and real-time decisions, supporting the HFT strategy. Metatrader 5 platform allows to use programming languages (in this case, Python) to automatic trading (algotrading), also allowing to extract historical data for opening, closing, maximum and minimum prices for any stock or index, considering the minimum timeframe of 1 min per data point. With this data available is possible to define the states and the transition probabilities of a Markov chain (process). This paper uses the stock CCRO3 as an initial study object, and the following Table 1 shows a data sample, which comprehends opening (“open”), closing (“close”), maximum (“high”) and minimum (“low”) prices for each timestamp. With the data extraction was possible to validate the trading strategy basic premise that consists of taking small profits in a short period of time, multiple times a day. For that, two basic statements need validation: (i) Prices need to vary within a minute, so the absolute difference between high and low prices is different than R$0.00 and (ii) prices also need to vary between one minute and the other. In other words, (i) will guarantee that profits are taken within a minute, and (ii) guarantees that a new order can be placed, and then (i) again. An exploratory analysis has shown that the absolute difference between high and low prices frequency distribution. Approximately 90% of all differences are comprehended between R$0.00 and ± R$0.07, which means there’s potential and theoretical trading opportunities that may result in R$0.01 profit per traded stock, every minute (see Fig. 3). With the same data extraction was possible to perform a shift transformation, allowing to relate data with itself (see Fig. 4). That means that the opening price at time ti+1 is equivalent to closing price at time ti . As a counterpoint to other studies, this paper uses minute-opening price instead of minute-closing price, once decision making happens at the beginning of each new minute. Also, the differences between last-minute opening price at day di and first-minute opening price at day di+1 were not considered.
Table 1. Historical data extract sample for CCRO3.

Timestamp     Open      High      Low       Close
1592509260    R$14.51   R$14.51   R$14.48   R$14.49
1592509320    R$14.50   R$14.54   R$14.49   R$14.50
1592509380    R$14.50   R$14.54   R$14.50   R$14.53
1592509440    R$14.52   R$14.52   R$14.51   R$14.51
1592509500    R$14.51   R$14.52   R$14.49   R$14.51
1592509560    R$14.52   R$14.54   R$14.50   R$14.52
1592509620    R$14.52   R$14.56   R$14.52   R$14.54
1592509680    R$14.54   R$14.56   R$14.54   R$14.55
1592509740    R$14.55   R$14.56   R$14.51   R$14.51
Fig. 3. Absolute Difference between High and Low Prices Frequency Distribution.
This operation also allowed observing the frequency distribution of the absolute difference between opening prices: approximately 95% of all differences fall between R$0.00 and ±R$0.03 (see Fig. 5). The nominal variation (in monetary units) does not describe the system's state changes in enough detail (the same R$0.01 variation could result from a difference between R$10.00 and R$10.01 or between R$100.00 and R$100.01). Therefore, the Markov process states were defined not by absolute prices or absolute differences, but by percentual differences. By shifting the data against itself, the percentual variation (delta) from one minute to the next was calculated (see Table 2). However, the delta calculation resulted in more than 4,100 unique values, and working with such a high number of different values (translated to states) is not only impractical but also less relevant, since many of them nominally represent the same variation (e.g., variations of −0.068918% and −0.068870% both represent the same R$0.01 difference for close price ranges, R$14.51–14.52 and R$14.52–14.54, respectively).
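A short sketch of this shift-and-delta computation with pandas is given below. The column names follow the export layout of Table 1; the binning of deltas into the 15 ranges of Table 3 is shown only as an illustration.

```python
import pandas as pd

def minute_deltas(bars: pd.DataFrame) -> pd.DataFrame:
    """Shift the opening-price series against itself and compute the percentual delta per minute.

    bars: DataFrame with columns "time" and "open", one row per minute (cf. Table 1).
    """
    out = pd.DataFrame({
        "start": bars["time"],
        "end": bars["time"].shift(-1),
        "initial_price": bars["open"],
        "final_price": bars["open"].shift(-1),
    }).dropna()
    out["delta_pct"] = (out["final_price"] / out["initial_price"] - 1.0) * 100.0
    return out

def delta_to_state(delta_pct: float, edges) -> int:
    """Map a percentual delta to a state of Table 3; edges is a list of (lower, upper) limits."""
    for state, (low, high) in enumerate(edges, start=1):
        if low <= delta_pct < high:
            return state
    # Clamp values outside the table to the first or last state.
    return 1 if delta_pct < edges[0][0] else len(edges)
```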
Fig. 4. Shift operation example.
Fig. 5. Absolute difference between opening prices frequency distribution.
Table 2. Shifted data creating price ranges and delta calculation.

Start         End           Initial Price   Final Price   Delta (%)
1592509260    1592509320    R$14.51         R$14.50       −0.068918%
1592509320    1592509380    R$14.50         R$14.50       0%
1592509380    1592509440    R$14.50         R$14.52       0.137931%
1592509440    1592509500    R$14.52         R$14.51       −0.068870%
1592509500    1592509560    R$14.51         R$14.52       0.068918%
1592509560    1592509620    R$14.52         R$14.52       0%
1592509620    1592509680    R$14.52         R$14.54       0.137741%
1592509680    1592509740    R$14.54         R$14.55       0.0687758%
So, instead of individual percentual variations, percentual delta ranges (e.g., [−0.019%, 0.118%), [0.709%, 0.808%)) were used, which not only reduced the number of states but also allowed unreachable states to be eliminated. A 15-state Markov chain was created through this simplification process. Each state is described by its inferior and superior delta variation limits, as shown in Table 3.

Table 3. Transition state ranges.

State   Lower limit   Upper limit
1       −0.671%       −0.572%
2       −0.572%       −0.474%
3       −0.474%       −0.375%
4       −0.375%       −0.277%
5       −0.277%       −0.178%
6       −0.178%       −0.079%
7       −0.079%       0.019%
8       0.019%        0.118%
9       0.118%        0.216%
10      0.216%        0.315%
11      0.315%        0.413%
12      0.413%        0.512%
13      0.512%        0.610%
14      0.610%        0.709%
15      0.709%        0.808%
After defining the states and their state-pair transition probabilities, a decision-making strategy was added to guide stock trading, working as an MDP. Generically, at each initial state ei it is possible to take an action ai from the set of possible actions A = {buy, sell, hold} that subsequently leads to the final state si+1 (see Fig. 6). The system buys when the sum of all transition probabilities indicating a price-increase transition (σ+) is higher than the sum of those indicating a price-decrease or price-stability transition. The system sells when the sum of all transition probabilities indicating a price-decrease transition (σ−) is higher than the sum of those indicating a price-increase or price-stability transition. Finally, the system holds when the highest transition probability (σ0) indicates a transition to the current state, which is the same as remaining in it.
Fig. 6. MDP Representation Example.
In summary, the system will (i) buy, sell, or hold any given stock at the beginning of every minute, if the market is open; (ii) aim for the highest profit, a multiple of the minimum R$0.01 monetary fraction per traded stock and in accordance with the 100-stock minimum trading volume – whenever this gain is achieved, the operation takes the profit and the position is closed; and (iii) limit losses through a maximum loss value, so that positions are compulsorily closed if the loss limit is reached. This value can be set at the beginning of every trading routine; the simulations described in this paper used R$3.00, since it covers 95% of all price variations within a minute. The first trading minute works only as a starting point to establish an initial stock price value. At the second trading minute, the price is updated to the last opening value. The percentual delta can then be calculated from the two stored values, and the current chain state identified. The values of σ+, σ− and σ0 are computed next. At this point, a decision-making step defines which action to take, considering a threshold: if σ+ ≥ threshold, the system places a buy order; if σ− ≥ threshold, a sell order; otherwise it holds. The threshold value starts at 0.33 (or 33%) – intuitively, the lowest winning probability – and is then adjusted if necessary. By the end of the current trading minute, open positions are closed if the take-profit (or stop-loss) orders were not automatically executed. Then, at the beginning of the following minute, the same loop starts over. A flowchart illustrates the decision-making algorithm based on the MDP structure and its trading rules (see Fig. 7); a short code sketch of the decision rule is given after the figure.
Fig. 7. Decision-making and trading system loop.
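As an illustration of the rule above, the sketch below computes σ+ and σ− from one row of the transition matrix and applies the threshold. The mapping of target states to price direction (lower states as decreases, higher states as increases) is an assumption of this example, guided by Table 3, and the σ0/hold case is folded into the fall-through branch for brevity.

```python
import numpy as np

def decide(P: np.ndarray, state: int, increase_states, decrease_states,
           threshold: float = 0.33) -> str:
    """Decide buy/sell/hold from the transition probabilities of the current state.

    P: row-stochastic transition matrix over the 15 states of Table 3 (0-based indices).
    increase_states / decrease_states: indices of target states whose delta range
    is positive / negative (an assumption of this sketch).
    """
    row = P[state]
    sigma_plus = row[list(increase_states)].sum()    # probability mass of price-increase transitions
    sigma_minus = row[list(decrease_states)].sum()   # probability mass of price-decrease transitions

    if sigma_plus >= threshold and sigma_plus > sigma_minus:
        return "buy"
    if sigma_minus >= threshold and sigma_minus > sigma_plus:
        return "sell"
    return "hold"

# Example: states 1-6 (indices 0-5) are decreases, states 8-15 (indices 7-14) increases,
# state 7 (index 6) straddles zero:
# action = decide(P, state=6, increase_states=range(7, 15), decrease_states=range(0, 6))
```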
5 Results, Discussion and Further Development

The algorithm was tested in an online environment, which is fed with real-time information and tends to behave exactly like a "real-life" trading situation. A 12-day test period was performed, between 12-29-2020 and 01-15-2021, not only to test the trading strategy but also to calibrate the threshold value as previously described. The total gross relative profit observed was 61.8%, averaging 5.2% per day (or, in absolute terms, approximately R$68.00 of daily profit), relative to the initial traded amount of approximately R$1,300.00. These results are comparable to those described by [18, 19, 27], which observed, approximately, up to 74%, 98% and 70% when trading cryptocurrencies, American stocks and ForEx, respectively, with the difference that the present paper does not use any reinforcement learning strategy as those works did. The CCRO3 stock lost 4.9% of its value within the same trading period (buy-and-hold strategy). It was also possible to observe 75% average accuracy, where 63% represents the win rate (correct decision-making) and 12% represents the neutral rate (zero profit, zero loss). Unfortunately, when considering trading taxes and other fees – approximately 0.02% of the total traded amount – and the 20% governmental tax applied to day-trade profits, the results are not positive at all. These taxes amount to R$1,970.00, which is more than the sum of all profits, so the total net relative profit is −151%. Since this system operated according to an HFT strategy, it performed 332 trading orders per day on average, or 0.8 orders per minute (from 10 A.M. to 5 P.M.). Table 4 describes each trading day (excluding taxes). Undoubtedly, it was possible to use both MDP and HFT concepts to develop an automated stock trading system focused on the decision-making process rather than on exact value prediction. Financial results were positive when compared to the benchmark (−4.9% by buying and holding, versus ~60% by automated trading). Taxes and operational fees could be significantly reduced by increasing the win rate and trading cheaper stocks (since approximately 0.02% of the total transacted volume is later charged as an exchange fee). Further developments should extend the testing scope with longer trading periods in a real-life online trading environment. The spread (bid–ask difference) was not considered so far, which means that profit levels could be seriously decreased. Finally, there is a clear opportunity to evolve this work by adding a learning layer, taking advantage of reinforcement learning strategies and ML models.
Table 4. Detailed Trading Performance.

Date          Absolute profit   Relative profit   Win rate   Neutral rate   Total accuracy
12-29-2020    R$49.00           3.6%              56%        15%            71%
12-30-2020    R$82.00           6.0%              56%        18%            74%
01-04-2021    R$59.00           4.4%              61%        12%            73%
01-05-2021    R$130.00          10.0%             67%        12%            79%
01-06-2021    R$84.00           6.5%              66%        10%            76%
01-07-2021    −R$12.00          −0.8%             51%        14%            69%
01-08-2021    R$65.00           5.0%              85%        2%             87%
01-11-2021    R$33.00           2.5%              65%        10%            75%
01-12-2021    −R$7.00           −0.5%             50%        16%            66%
01-13-2021    R$106.00          8.1%              66%        12%            78%
01-14-2021    R$148.00          10.9%             74%        5%             79%
01-15-2021    R$80.00           6.1%              64%        11%            75%
References 1. Henrique, B.M., Sobreiro, V.A., Kimura, H.: Stock price prediction using support vector regression on daily and up to the minute prices. J. Finan. Data Sci. 4(3), 183–201 (2018) 2. Almasarweh, M., Wadi, S.: Arima model in predicting banking stock market data. Mod. Appl. Sci. 12(11) (2018) 3. Chou, J., Nguyen, T.: Forward forecast of stock price using sliding-window metaheuristicoptimized Machine Learning regression. IEEE Trans. Indust. Inf. 14(17) (2018) 4. Sadorsky, P.: A Random Forests approach to predicting clean energy stock prices. Journal of risk and financial management (2021) 5. Ballini, R., Luna, I., Lima, L., Silveira. R.: A comparative analysis of neurofuzzy, ANN and ARIMA models for Brazilian stock index forecasting. In: 16th International Conference on Computing in Economics and Finance – CEF (2010) 6. Parmar, I., et al.: Stock market prediction using machine learning. In: First International Conference on Secure Cyber Computing and Communication (ICSCCC) (2018) 7. Pandey, V.S., Bajpai, A.: Predictive efficiency of ARIMA and ANN Models: a case analysis of nifty fifty in Indian stock market. Int. J. Appl. Eng. Res. 14(2) (2019) 8. Yadav, A., Jha, C.K., Sharan, A.: Optimizing LSTM for time series prediction in Indian stock market. Procedia Comput. Sci. 167, 2091–2100 (2020) 9. Vijh, M., Chandola, D., Tikkiwal, V.A., Kumar, A.: Stock closing price prediction using machine learning techniques. Procedia Comput. Sci. 167, 599–606 (2020) 10. Lachiheb, O., Gouider, M.S.: A hierarchical Deep neural network design for stock returns prediction. Procedia Comput. Sci. 126, 264–272 (2018) 11. Nikou, M., Mansourfar, G., Bagherzadeh, J.: Stock price prediction using Deep learning algorithms and its comparison with machine learning algorithms (2019) 12. Jiang, W.: Applications of deep learning in stock market prediction: recent progress (2020) 13. Conegundes, L., Pereira, A.C.M.: Beating the Stock Market with a Deep Reinforcement Learning Day Trading System. Belo Horizonte, MG, Brazil (2020)
14. Grami, A.: Probability, Random Variables, Statistics and Random Processes: Fundamentals & Applications, 1st edn. Wiley (2020) 15. Papoulis, A., Pillai, S.U.: Probability, Random Variables, and Stochastic Processes. 4th edn. E. McGraw-Hill (2002) 16. Wu, J., Wang, C., Xiong, L., Sun, H.: Quantitative trading on stock market based on deep reinforcement learning. In: International Joint Conference on Neural Networks (IJCNN) (2019) 17. Shin, H., Ra, I., Choi, Y.: A Deep multimodal reinforcement learning system combined with CNN and LSTM for Stock Trading (2019) 18. Sattarov, O., et al.: Recommending cryptocurrency trading points with deep reinforcement learning approach. Appl. Sci. 10, 1506 (2020) 19. Yang, H., Liu, X., Zhong, S., Walid, A.: Deep Reinforcement Learning for automated stock trading: an ensemble strategy (2020) 20. Vishwanath, S.R., Krishnamurti, C.: Investment Management. Springer, A Modern Guide to Security Analysis and Stock Selection (2009) 21. Rampell, A., Kupor, S.: Breaking down the payment for order flow debate. A16Z (2021). https://a16z.com/2021/02/17/payment-for-order-flow/. Accessed 13 June 2021 22. Corral, H.: Day trade no imposto de renda: como declarar? ExpertXP (2021). https://conteu dos.xpi.com.br/aprenda-a-investir/relatorios/day-trade-no-imposto-de-renda/. Accessed 15 June 2021 23. B3. Ações. http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/ acoes.htm. 24. Rico. Custos. https://www.rico.com.vc/custos. Accessed 15 June 2021 25. Crane, J.: Advanced Swing Trading: Strategies to Predict, Identify, and Trade Future Markets Swings. Wiley (2003) 26. Abell, H.: Digital Day Trading: Moving from One Winning Stock Position to the Next. Dearborn (1999) 27. Rundo, F.: Deep LSTM with reinforcement learning layer for financial trend prediction in FX high frequency trading systems. Appl. Sci. 9 (2019) 28. Briola, A., Turiel, J., Marcaccioli, R., Aste, T.: Deep Reinforcement Learning for active high frequency trading. Fev. (2021) 29. Chitenderu, T.T., Maredza, A., Sibanda, K.: The random walk theory and stock prices: evidence from Johannesburg stock exchange. Int. Bus. Econ. Res. J. 13(6), 1241–1250 (2014) 30. Vigg, S., Arora, D.: A study of randomness in the national stock exchange. Prestige Int. J. Manage. IT-Sanchayan 7, 112–125 (2018) 31. Huang, J.C., et al.: Applying a Markov chain for the stock pricing of a novel forecasting model. Commun. Statist. Theory Meth. 46, 9, 4388–4402 (2016) 32. Cruz, R.R., Gonzalez, A.D.D.: Investment portfolio trading based on Markov chain and fuzzy logic. In: 2018 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp. 1–6 (2018). https://doi.org/10.1109/LA-CCI.2018.8625246
Scalable Shapeoid Recognition on Multivariate Data Streams with Apache Beam
Athanasios Tsitsipas(B), Georg Eisenhart, Daniel Seybold, and Stefan Wesner
Institute of Information Resource Management, Ulm University, Ulm, Germany
{athanasios.tsitsipas,georg.eisenhart,daniel.seybold,stefan.wesner}@uni-ulm.de
Abstract. Time series representation and discretisation methods struggle to scale over massive data streams. A recent approach for transferring time series data to the realm of symbolic primitives, named shapeoids, has emerged in the area of data mining and pattern recognition. A shapeoid characterises a subset of the time series curve in words derived from its morphology. Data processing frameworks are typical examples for running operations on top of fast unbounded data, with innate traits that enable other methods otherwise restricted to bounded data. Apache Beam is emerging with a unified programming model for streaming applications, able to translate to and run on multiple execution engines, saving development time to focus on other design decisions. We develop an application on Apache Beam which transfers the concept of shapeoids to a scenario in a large-scale network flow monitoring infrastructure and evaluate it over two stream computing engines.
Keywords: Pattern recognition · Shapeoids · Scalability · Data streams · Apache Beam

1 Introduction
A large-scale Cyber-Physical System (CPS) can produce tremendous amounts of data in a fast and multivariate manner. The data contain elements from different sources, mirroring the multiple ramifications of causes. They enable many opportunities and challenges for scalable techniques to gain valuable insights. Various data mining techniques arise from our natural reification incentive to envisage the shapes of data [19]. A human perceives the notion of shape, even from high-dimensional data, and almost instantly may observe similarities with known patterns at various time scales. For example, financial analysts inspect stock prices in patterns, usually described in natural language [10], to understand and predict the market [22]. The process of identifying shapes, or patterns, in line charts is a vital part of data exploration. Usually, domain experts perform data exploration tasks via pattern mining in sequential offline data to acquire
new insights [41] or detect anomalies and motifs interactively [39]. However, although both of the previous methods treat patterns as their focus artefacts, leading to more human-related representations (at least visually), they are not scalable to streaming data for real-world applications. A recent time series primitive, named shapeoids [45], which is based on the time series representation of Symbolic Aggregate Approximation (SAX), describes a set of shapes that are primarily data-driven, interpretable to the naked eye and expressed in natural language (cf. Sect. 2.2). Extracting shapeoids from a time series curve enables methods that are not directly compatible with real-valued data [31], such as suffix trees or many algorithms from the domain of natural language processing (NLP). In our previous work [43] we developed Scotty, a framework to extract shapeoids from time series curves. Nevertheless, Scotty is limited by two factors: (i) it handles only finite time series, and (ii) it is unsuitable for multivariate data. Scotty has the potential to scale over vast amounts of data [13,43]; however, due to its limitations, it cannot handle streaming data in the way the mentioned approaches for data exploration would require. The requirements already discussed are innate in current big data platforms (e.g., Google Dataflow [23], Apache Flink [4]), which make it simpler to design and build distributed processing pipelines. The emerging Apache Beam project recently attempted to create a standard interface above the mentioned platforms and many more [30]. We utilize the Beam programming model, which sufficiently abstracts and implements the features Scotty lacks to become dynamic and adaptable to continuous multivariate data. Our contribution is the design of a processing pipeline based on Apache Beam with a windowing mechanism to split longer time series data into shorter chunks that Scotty can handle and produce good results from. In addition, we provide a function that groups multivariate data by their type and lets Scotty operate on each of them as univariate instances. Finally, we evaluate the performance and scalability of our approach with a real streaming scenario in a network flow monitoring infrastructure [36] deployed over BelWü, an extensive German research network [12]. The scenario application is also based on Apache Beam. The experimental evaluation follows a streaming run over a clustered Apache Flink deployment, a standalone Apache Flink deployment, and a standalone Apache Samza deployment on a private OpenStack-based cloud. We discuss the results in terms of the system metrics, the throughput and memory management considerations. When a data mining technique benefits from the characteristics innate in data processing frameworks, we showcase the importance of selecting the appropriate one, as it can affect the underlying technology executing upon real-time streaming data. The rest of the paper is organised as follows. Section 2 contains background information on Apache Beam, a brief description of Scotty and short related work on methods for streaming time series for data exploration. Next, Sect. 3 elaborates upon the implementation details of retrofitting Scotty into an Apache Beam pipeline. Section 4 presents the evaluation results and discusses them as valuable lessons. Finally, in Sect. 5, we conclude the work.
2 Background and Related Work

2.1 Apache Beam
Apache Beam is a unified model for batch and stream processing, which has attracted considerable interest from industry and academia. An interesting perspective of Apache Beam is that it considers batch processing as a particular case of stream processing (via a window type). Applications use its model to build a so-called pipeline that consists of PCollections and PTransforms. A PCollection can be either a finite dataset or an infinite data stream. A PTransform takes one or more PCollections as input and applies user-defined functions, producing new PCollections. The Beam SDK offers a rich palette of PTransforms, including the core functions (i.e., ParDo and GroupByKey) of the underlying concept, which is based primarily on the Dataflow model [2]. Each PTransform constitutes a different processing paradigm: (i) the ParDo transform, used for parallel processing, maps a user-defined function (DoFn) over an input PCollection; (ii) the GroupByKey processes collections of key/value pairs, gathering all of the values associated with each unique key; (iii) the CoGroupByKey performs a relational join between two or more key/value PCollections having the same key; (iv) the Combine transform combines either whole PCollections or the values for each key in a PCollection of key/value pairs, using a user-defined function with the combining logic (e.g., summing all the values for a given key); and finally (v) the Flatten and Partition transforms merge or split PCollections, respectively. Additionally, the Beam model provides read and write transforms to read data from an external source, such as a file, a database or a messaging queue, and respectively write the output of a pipeline as the final result.

Moreover, the concept of windowing and triggering has a central role in the model. The windowing mechanism divides data along predetermined boundaries that remain unchangeable after creation, and supports dynamic cases through a series of different window types (e.g., sliding, fixed, session). Depending on its event timestamp, an element from a PCollection may reside in one or multiple windows. However, the time at which an event happens and the time at which it is processed may differ. The model tracks this via a progressing watermark, which captures the system's expectation that all the data for a given window has arrived. If the watermark passes the window's boundaries, any element arriving with a timestamp in the passed window is late data. The Beam model allows window contexts to be retrieved, and can accommodate late arrivals and update the outputs accordingly. Moreover, it supports stateful processing and timer callbacks over state saved in the pipeline. To summarise, one can execute any application using the Beam model on various big data processing platforms. The runners include Google Dataflow, Apache Flink, Apache Spark [8], Apache Samza [7], Apache Nemo [6], Hazelcast Jet [24], Twister2 [47] and local standalone executions (i.e., the Direct Runner). Apache Beam eases the development of data pipelines by removing the burden
of selecting the underlying execution engine beforehand, which would otherwise lead to a so-called "vendor lock-in".
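As a minimal illustration of the model (using the Beam Python SDK and the default Direct Runner; element values and keys are made up for the example), the sketch below applies an element-wise transform and then combines values per key:

```python
import apache_beam as beam

with beam.Pipeline() as p:  # Direct Runner by default; other runners are chosen via pipeline options
    (p
     | "Create" >> beam.Create([("sensor-a", 1.2), ("sensor-b", 3.4), ("sensor-a", 2.0)])
     | "Square" >> beam.Map(lambda kv: (kv[0], kv[1] ** 2))   # element-wise (ParDo-style) transform
     | "SumPerKey" >> beam.CombinePerKey(sum)                  # grouping and combining per key
     | "Print" >> beam.Map(print))
```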
2.2 Recap Scotty
Scotty is a structure-based representation framework that extracts human-relatable representations (i.e., in natural language). It is suitable for dealing with high-dimensional data due to its reliance on the symbolic time series representation method SAX, which has inspired many studies [14,26,28,34] in the data mining field. SAX is a transformation process converting a real-valued time series into a sequence of symbols with a predetermined length l and an alphabet size A. In general, the technique contains the following steps: (a) the time series is z-normalized (i.e., to a mean of zero and a standard deviation of one); (b) the normalised time series is divided into l equal-sized partitions, calculating for each the mean value to obtain the Piecewise Aggregate Approximation (PAA) [27] of the time series; (c) according to a statistical lookup table, each element in the resulting vector (from step (b)) is substituted by a symbol from the table, which depends on the selected alphabet size A. The alphabet size splits a canonical normalised curve into equiprobable areas, and each partition is associated with a symbol, resulting in a SAX word. Figure 1 illustrates an example of the SAX process.
Fig. 1. An example of SAX process using as parameters the raw time series length L = 120, l = 20, A = 5. Using the PAA representation of the time series (turquoise line), the resulting word is eeeeeeddddbaaaaaabbb. Adapted from [45]
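A compact sketch of this discretisation (z-normalisation, PAA, and breakpoint lookup) is given below. The Gaussian breakpoints are computed with SciPy rather than read from a printed lookup table, and the parameter values mirror the caption of Fig. 1; the series length is assumed to be divisible by the word length.

```python
import numpy as np
from scipy.stats import norm

def sax(series, word_length=20, alphabet_size=5):
    """Convert a real-valued series to a SAX word: z-normalise, PAA, then discretise."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)                # (a) z-normalisation
    paa = x.reshape(word_length, -1).mean(axis=1)         # (b) PAA over equal-sized partitions
    # (c) breakpoints splitting the standard normal curve into equiprobable areas
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = "abcdefghijklmnopqrstuvwxyz"[:alphabet_size]
    return "".join(symbols[np.searchsorted(breakpoints, v)] for v in paa)

# Example: L = 120 points reduced to a 20-symbol word over a 5-letter alphabet (cf. Fig. 1).
word = sax(np.sin(np.linspace(0, 3 * np.pi, 120)), word_length=20, alphabet_size=5)
```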
Further, Scotty operates directly on the SAX word to extract shapeoids, our human-relatable structure-based primitives. A shapeoid is essentially a data-driven pattern, expressed in natural language and extracted from the discretised form of a given time series. It contains the lexical representation of the discretised subsequence and the interval of its occurrence. In the following, we provide an overview of the available shapeoids: (i) An ANGLE is a gradual and continuous
increase or decrease; (ii) a HOP describes a distinctive phase shift leading to an overall visible change; (iii) the HORN is a temporary effect which fades quickly, as the end of the pattern returns almost to the initial point; and (iv) the FLAT is an almost steady line with minor variations in the curve. In addition, the inclusion of trend information (upward and downward) provides an additional level of interpretation for the actual direction of the shapeoid. The trend value does not apply to the FLAT shapeoid, for obvious reasons. The shapeoid transformation follows a string-based algorithm. It does not rely on any similarity measure that calculates distances (e.g., Euclidean distance) across the time series, and it is therefore not threshold-dependent. During the SAX process described above, Scotty attaches indexes to every point in the PAA representation. The SAX word is then traversed with a sliding window of three and an offset of one, always starting from the beginning of the SAX word. The subsequence in the window represents a shapeoid, accompanied by its duration interval. In addition, Scotty keeps the original event timestamps from the raw data to reflect the associated event times of the actual data. Therefore, it is suited to real-world problems, rather than keeping the result in discrete time points. For more information about the theoretical aspects of the extraction, we point the reader to [45], where an author of the current paper provides evidence, via a proof in group theory, that the shapeoid extraction (SE) process is always valid without having to feed in all possible time series, being validated instead against the descriptive definition of a mathematical group. More details about the Scotty framework and its efficiency and accuracy are available in [43]. The remaining issue is the inability of Scotty, in its current form, to scale over multivariate time series in data streams. Next, we revisit some related work in the context of data mining and data exploration, which supports the aim of this paper.
2.3 Related Work
Related work on time series representation and discretisation methods for describing a time series can be found in our previous paper [43] on the Scotty framework. Methods in time series data mining [20] are used primarily to evaluate short or large datasets offline, for various downstream tasks, using shape-based similarity measures [17] (e.g., Euclidean distance and Dynamic Time Warping), which fail on long or noisy data. In addition, structure-based approaches employ abstracted representations (e.g., shapelets [49], Bag-of-Patterns (BOP) [32]) to extract a model or higher-level feature vectors [38]. Most commonly, structure-based models follow the same combination of techniques, i.e., a sliding window over a symbolic representation. The main incentive of numerous works is to identify interesting patterns for discovering motifs and discords by capturing the frequency of appearance of subsequences from the extracted SAX word of a time series. In addition, tools for interactive time series pattern mining have been developed [37,39]. They offer informative visualisations via data clustering or various methods for visual motif or discord discovery. The limitation of these methods is that they are closed systems
operating on offline univariate data with a predetermined number of datasets. Therefore, they do not generalise to real multivariate data, which should follow a streaming scheme. In this work, we transfer the SE technique, based on a structure-based method, to real-time data able to scale on modern data processing frameworks. Moreover, closer to the incentives of our work, concerning the extraction of patterns in natural language, are visual querying systems. Such systems [35,40,41] assist the user in searching for the shape they are after, taking as input an actual sketch of that shape drawn on a canvas. Most of these tools employ similarity measures using shape-based distances, as mentioned at the beginning of the section. In contrast, we use a string-based method for the SE, which does not use any fixed threshold. Some tools [11] let users apply location-based thresholds on the time series' x and y ranges. However, again the offline nature of the data hinders their uptake in streaming scenarios, which is a significant limitation. In addition, they do not account for multivariate data in their data exploration tasks. We elevate data mining and data exploration perspectives to become stream-based and handle volatile sources using modern data processing frameworks. Our approach offers the incentives of visual querying systems; however, extracting human-relatable representations relies on sensitivity analysis to find the proper parameters for the window granularity the use case requires. As such, multiple windowing sizes should be accommodated in a given solution.
3 Beaming Scotty
This section elaborates on retrofitting Scotty into big data processing platforms to deal with its limitations. An implementation on the Beam programming model offers a plethora of options, both in the programming languages available for development and in the final execution environment hosting the data pipeline. Below, we divide the transition of Scotty to streaming scenarios into four critical aspects of the process. Each aspect includes the implementation details for "beaming" Scotty to unbounded data.
3.1 Pipeline Input/Output
The Beam programming model foresees source and sink transforms. For both of them, many options are available in its Application Programming Interface (API), with Input/Output (I/O) connectors ranging from file-based and file system interfaces to unbounded messaging systems and several databases. In contrast to traditional data mining techniques, we are not restricted to a single I/O option to interact with data. The availability of many I/O options therefore makes Scotty a malleable approach for extracting valuable shapeoids from different data forms. In addition, a Scotty application developed via Apache Beam may also be integrated with other processing pipelines sequentially, thanks to the enabled I/O structure, as we illustrate in Fig. 2; i.e., dynamically outputting
the resulting shapeoids to desirable sinks for further computations (e.g., logic programming, suffix trees, NLP tasks). As such, both I/O operations handle batch and streaming options for given use cases, where another critical aspect arises: that of time.
Fig. 2. An example of composing I/O Beam pipelines, with pipeline A being the Beam-Scotty application and pipeline B receiving the resulting shapeoids.
3.2 Data Time Management
For both bounded and unbounded data, the time domain [9] is vital. In the Dataflow model whitepaper [2], Akidau et al. similarly split time into two domains: event time and processing time. The event time is the time value attribute (i.e., timestamp) of an event1. For example, a sensor reading or a database state change is associated with a clock reading in the system where it was created or observed [33]. The event time cannot essentially change. The processing time in the Dataflow model, and therefore in Apache Beam, is the timestamp at which the occurred event is observed by the system during the data processing task, again recorded from the system's clock. We assume no global synchronisation of clocks in a distributed setting. During data processing, factors such as latency, scheduling, actual computation and pipeline serialisation produce a lag between the two domains. In a perfect world, this lag would always be zero, processing the event data as they happen, without delay. Apache Beam tackles this out-of-order processing using a concept like MillWheel's watermark [1], a heuristically calculated lower bound on event times that gives the system a notion of completeness, hence tackling the time asymmetry. Transferring these concepts to the context of Scotty is of great importance, especially for handling late data and out-of-order processing, where Scotty would otherwise output an incorrect shapeoid from the SAX word resulting from its internal computation (cf. Fig. 3): data arriving later in processing time would tamper with the overall result. Scotty's process runs on online streaming data as well as offline data—the distinction between the two is based on the kind of processing we would like to offer. The input data in the offline streaming case are still bounded but simulated through an unbounded source (e.g., Apache Kafka [5], RabbitMQ [48]). The timestamps depend on the unbounded source, with a configuration determining how the timestamp is extracted from the raw data stream. As such, the watermark begins at the timestamp of the first offline streaming event and preserves correctness for the given input data. If this configuration is missing, the watermark starts as soon as the application starts, and events with earlier timestamps would be treated as late and excluded from the resulting PCollection. Our application selects the KafkaIO source and provides a custom timestamp policy to determine the timestamp field from the event (e.g., the Kafka record).

1 An event is anything that happens or is contemplated as happening [33].
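In the Beam Python SDK (shown here as a stand-in for the Java KafkaIO timestamp policy used in the application), event timestamps can be attached explicitly so that windowing runs in event time. The record format below is invented for the example.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

def attach_event_time(record):
    # record is assumed to be a dict carrying its own epoch-seconds "timestamp" field
    yield TimestampedValue(record, record["timestamp"])

# Usage inside a pipeline:
# events   = raw | "EventTime" >> beam.FlatMap(attach_event_time)
# windowed = events | "Minutes" >> beam.WindowInto(FixedWindows(60))
```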
Fig. 3. The time skewness between processing and event time could tamper with the resulting computation of Scotty. The resulting SAX word, using an alphabet with four letters and seven PAA partitions, may begin with a FLAT (ccc) shapeoid instead of a HORN (cdc).
3.3 Windowing
A window is a common mechanism in data processing systems [21], providing a bounded view of the data regardless of how much of it exists in the source. Especially in a streaming setting, it is the default mechanism for processing the stream over bounded subsets (i.e., windows). Typical computations over this fixed view of data are forms of aggregation [42], outer joins [51], and various time-bounded operations, to name a few. We focus on the latter aspect of computation over windowed data streams for our purposes. The sliding, fixed, and session windows are the main windows addressed by most systems dealing with unbounded data. A sliding window has a fixed length l and an offset lo, with a new window being initiated every lo (e.g., hourly windows starting every thirty minutes); consecutive sliding windows overlap when lo < l. Fixed windows (often called tumbling windows) are a special case of sliding windows when l = lo, making them aligned and equally distributed across time. Finally, session windows group elements that are separated from the following element by no more than a gap duration; they usually model activity over time by calculating these unaligned sessions [2]. By default, in Apache Beam, data in a PCollection are assigned to a single global window,
unless otherwise indicated. An interesting aspect of processing windowed data with Apache Beam is the ability to trigger, in processing time, the time-based operation over the elements that have arrived so far, producing early results. For more information about triggering with window contents, we point the reader to the original Dataflow paper [2]. The windowing mechanism offers the bounded view that Scotty needs in order to operate over real-time data streams: a deterministic enactment of its computation over the contents of the defined window. Furthermore, Scotty does not have to cater for vital components and challenges in stream processing, such as punctuations [46], watermarks [1], out-of-order processing [29] and heartbeats [25]. We provide evidence in our previous work [43] that Scotty is suitable for big data processing, efficiently handling more than a million data points in a bounded setting in under six seconds. Therefore, building on the Beam programming model assists in developing applications that use Scotty to extract data-driven shapeoids over the course of continuous time, where real-time analysis is required. Figure 4 illustrates how Scotty operates over the different types of bounded windows. The process of shapeoid recognition is always valid [45], although the user has to perform a detailed sensitivity analysis over Scotty's input parameters (i.e., the PAA step and the size of the alphabet). These parameters are undoubtedly related to the selected window size for the given application, as the window granularity will reveal different types of shapeoids. High dimensionality is always inherent in the data of real-time scenarios; the input parameters can control the smoothness of the time series curve in the given window. Choosing to observe a data stream for a short duration (e.g., a minute) could include many short shapeoids in it, while observing the same data stream in a larger window (e.g., an hour, or a day) may look different, offering more rounded shapeoids which may be more meaningful to detect.
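The three window types discussed above can be expressed in the Beam Python SDK as follows; the durations are illustrative, not values from the paper.

```python
import apache_beam as beam
from apache_beam.transforms import window

fixed    = beam.WindowInto(window.FixedWindows(size=60))                   # 1-minute tumbling windows
sliding  = beam.WindowInto(window.SlidingWindows(size=3600, period=1800))  # hourly windows every 30 min
sessions = beam.WindowInto(window.Sessions(gap_size=600))                  # sessions with a 10-min gap

# e.g.: windowed = keyed_values | "FixedWindows" >> fixed
```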
Fig. 4. The SE upon common window types. Each “triad” from the SAX word will always form a valid shapeoid [45].
3.4 Keyed Value Processing
The data flowing in a data processing system can have any format, from raw numerical values to complex data objects (e.g., JavaScript Object Notation (JSON), Data Transfer Objects (DTO)). Apache Beam usually treats the data as key/value pairs, even though some transforms, such as ParDo, do not require a key for their operation. On the one hand, the parallel ParDo transform operates element-wise on every input, so the data keep flowing while passing through it. On the other hand, a grouping operation in Apache Beam (e.g., GroupByKey) collects all the data for a given key before forwarding them down the stream for further reduction. In the case of unbounded data, the windowing mechanism (cf. Sect. 3.3) is the solution to the non-terminating output, as grouping would never terminate in a global window, for example. The Beam GroupByKey is similar to the partitioning process in the shuffle phase of the MapReduce [16] model and to the grouping operation in an SQL expression. As mentioned in Sect. 3.3, applying a time-bounded operation to the windowed data, even after a key-grouping process, is vital to transfer Scotty to big data scenarios. It is equally important to move it to big multivariate data through the application of grouping mechanisms. Scotty's design assumes that the underlying measured data carry information we need to know. Especially in cases where a phenomenon occurs and creates a so-called dimensional footprint [44], the instruments measuring or recording it provide different views of the same phenomenon, which coincides with the definition of multivariate data [18]. Going beyond a statistical correlation concept, we interpret deterministic relations between the multivariate data. Therefore, applying Scotty as a time-based operation over the windowed data is only possible using a grouping operation that partitions the data into singular, homogeneous downstreams based on a characteristic key (e.g., type, location, feature). As illustrated in Fig. 5, the incoming data are grouped by key through a fan-out mechanism, creating relevant partitions and splitting the data into subsequences ready for a shapeoid transformation. Consequently, the grouping operation, innate in modern data processing programming models, scales the capabilities of Scotty for big data application use cases, as we showcase in our experimental evaluation (cf. Sect. 4).
3.5 The Beam-Scotty Pipeline
Fig. 5. The key-grouping operator over windowed data, with a final per-key reduction phase feeding the SE.

Scotty contains dimensionality reduction steps for abstracting a numerical time series into the symbolic domain. Therefore, Scotty builds on the computational and storage savings of transferring raw data into a form valuable for anyone who seeks the interpretability of a time series through human-relatable constructs. A series of steps has to be taken to develop an application using Scotty over extensive streaming data. As elaborated in the sections above, Apache Beam and most data processing engines support the concepts of I/O, windowing and grouping. The following steps allow one to build an application as a data processing pipeline for any domain, ideally one containing numerical data (a skeleton of these steps is sketched after the list).

Source: A suitable method to acquire any source of data, with abundant options from Apache Beam.

Time Domain Selection: As Apache Beam unifies batch and stream processing, the developer should choose to compute the results in event time and therefore use a timestamp field from the message payload or the database/file column.

Data Alignment: The data may be aligned, transformed into an arithmetic form if needed, and pivoted into key-value pair structures. The key must be unique and, ideally, informative (e.g., metadata from the data origin; a unique identifier is less preferable).

Bounded Windows: The selection of the window type relates to the examined application scenario. For example, sensor data with a particular sampling rate should use fixed or sliding windows, whereas session windows could be used to detect the vacancy of an Uber rider. This step can occur more than once down the pipeline (e.g., one window to aggregate the data and another to bound the aggregations).

GroupByKey: This step is the last chance to apply a transformation to an arithmetic form, either via a supported aggregation function per key (e.g., counting the elements in a window) or a user-defined function. This logical step can be applied more than once, resulting in a vector of values per key.

Shapeoid Recognition: This step of the pipeline is a straightforward, deterministic execution of the Scotty library, recognising one or more shapeoids in each per-key vector. Scotty outputs the shapeoid type and its duration on the timeline, with real-valued start and end timestamps.

Sink: The final step involves a design choice: either store the found shapeoids for further offline analysis, or choose a sink to output the data continuously for another data pipeline to consume.

To summarise, with the above key design steps, one may bring the power of Scotty to any domain requiring a fast and human-relatable description of their multivariate data.
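A skeleton of these steps in the Beam Python SDK is sketched below. It is not the paper's implementation: the file-based I/O, the input format ("key,timestamp,value" lines) and the scotty_extract placeholder are assumptions made for the example.

```python
import apache_beam as beam
from apache_beam.transforms import window

def scotty_extract(keyed_vector):
    """Placeholder for the SE step: the real code would call the Scotty library here."""
    key, values = keyed_vector
    return key, list(values)   # shapeoids would be derived from this per-key vector

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("readings.csv")                       # 1. Source
     | "Parse" >> beam.Map(lambda line: line.split(","))                    # 3. Data alignment
     | "EventTime" >> beam.Map(                                             # 2. Time domain selection
         lambda f: window.TimestampedValue((f[0], float(f[2])), float(f[1])))
     | "Window" >> beam.WindowInto(window.FixedWindows(60))                 # 4. Bounded windows
     | "Group" >> beam.GroupByKey()                                         # 5. GroupByKey
     | "Scotty" >> beam.Map(scotty_extract)                                 # 6. Shapeoid recognition
     | "Write" >> beam.io.WriteToText("shapeoids"))                         # 7. Sink
```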
In the next section, we evaluate the performance of an application developed as a Beam-Scotty pipeline on two big data processing engines, writing the application once and executing it on multiple different engines.
4 Experimental Evaluation
We evaluate the performance and scalability of Scotty operating on top of two big data processing frameworks in a network monitoring scenario. Other works comparing various stream computing systems have to design an identical application using each of the respective programming APIs [50]; the evaluation results may therefore not be credible enough, as the implementation intricacies usually depend on the developer's acquaintance with the target system. Apache Beam removes this concern because the same code can run on different execution engines, so the performance results are entirely associated with the selected Beam runner and the underlying engine. In this section, we compare Apache Flink and Apache Samza. We select their runners based on their capabilities [3] to support the aspects our computation requires (cf. Sect. 3.5). Runners such as Google Dataflow submit the Beam pipelines to the Google Cloud Platform, which we omit from the study for privacy concerns.
4.1 Setup
This section describes the hardware and software configurations used for the evaluation, and the data our application uses in a network monitoring scenario.

Hardware Configuration. We deploy our Flink and Samza installations in a private OpenStack-based cloud at Ulm University, running the OpenStack "Victoria" release. We provide three types of installations for evaluation purposes: a four-node clustered version of Flink, a standalone Flink installation and a standalone Samza installation. A clustered flavour of Samza was also planned initially; however, due to deployment shortcomings and unknown interrelations between the software versions, we could not spawn a job successfully over Samza's YARN cluster. The two standalone installations are spawned in individual virtual machines (VMs) with two virtual CPU cores, six gigabytes (GB) of memory and twenty GB of SSD storage (tagged as mb1.medium.ssd). The Flink cluster utilises four mb1.medium.ssd instances as well. It is worth mentioning that we create equal execution environments for the clustered and standalone versions of the respective installations2.
The standalone version should be equal to one worker VM in the clustered version to observe also the speedup.
Software Configuration. Both Flink and Samza are deployed on Java 1.8, and the Scotty library and the Beam-Scotty applications target the same Java version. We use Flink version 1.11 and configure each worker, in the clustered setup, with a JVM heap size of two GB; the Flink managed memory amounts to almost two GB of reserved allocation. The standalone Flink installation has the same memory allocation as the worker VMs in the clustered version. Moreover, the standalone Samza installation uses Kafka 2.3.1, Zookeeper 3.5.5 and Samza 1.6.0. Samza allocates the available CPU cores and four GB of memory from the VM. For the experiments, we allow one job (i.e., application run) to utilise all the available CPUs as tasks in the given VM.

Workload Description. BelWü monitors about thirty border interfaces on nine routers, which export NetFlow v9 [15] flow records with a 1:32 sampling rate, to collect flow data leaving or entering the BelWü network. This leads to twenty thousand flows per second on average, with peaks of up to one hundred and sixty thousand flows per second. The flows are collected with Cloudflare's goflow as a NetFlow collector, re-encoded with protobuf and sent to a five-node Apache Kafka cluster. For other consumers, the monitoring platform enriches the flows with additional metadata, e.g., SNMP data from the interface the flow originates from, protocol names, or geolocation data. Currently, this usage serves DDoS mitigation and the detection of high-volume networks [36]. Meanwhile, the usage has been extended with multiple use cases such as post-mortem analysis, peering analysis, ACL oversight, and ad-hoc tooling. A Grafana dashboard provides a visual overview of the traffic composition with different views, such as the total monitored bandwidth for incoming and outgoing traffic, the amount of traffic by IP version, used protocols, targeted applications, and views based on geolocation information. Additional views give insight into the data rates by routers, peering interfaces and remote networks.

Scenario. In our scenario, we performed online experiments using the BelWü flows, filtering them for two services provided at Ulm University: a Moodle3 e-learning platform, which serves as the primary endpoint for organising courses across the whole university every semester, and an Opencast4 video service, where recordings of individual classes are available to students. The workload contains four different variables of the data groupings, per Fig. 5—the outgoing and incoming NetFlows from and towards both of the described services. Every variable contains counter data points covering five seconds each. This process allows us to translate the massive load at the flow monitoring collector into a numerical form for our purposes. The intention of using the data from these two services as our variables originates from the diversity of causes that create them. Utilising the calculated shapeoids from the time series data, one may infer such use cases in a subsequent reasoning computation (e.g., logic-based [44]).
https://moodle.com/. https://opencast.org/.
4.2 Beam-Scotty for NetFlow Monitoring
The individual design guidelines from Sect. 3.5 drive the development of the NetFlow monitoring scenario, for further use in different use cases as indicated in the workload description above. The application code is publicly available5 for reference. Writing the application against the unified Beam API allows different execution engines to be bootstrapped; the Flink and Samza runners interchangeably complete the artefacts for each execution run in the experiments. The protobuf FlowMessage objects arrive at the Source endpoint of the application through a Kafka consumer that receives all the available messages in a topic. Each application artefact for the experiments uses a different consumer group, so as not to tamper with the pool of incoming flows in the Kafka infrastructure. The next step reads each FlowMessage element-wise in a ParDo transform, unmarshalling each object to read individual fields from the NetFlow v9 flow record format, and then creates key-value pairs using as key the IP of the monitored service concatenated with the flow direction integer (i.e., zero for ingress and one for egress). Further down the pipeline, we count the number of flow records in five-second windows and feed the incoming counts into a bag state that is released every five minutes; therefore, we may end up with a maximum of sixty elements per computation. Finally, Scotty runs on top of this vector of elements for each key (i.e., IP address and flow direction), with a predefined set of parameters6 for Scotty's runtime. At the end of the pipeline, we write the resulting shapeoids to an InfluxDB database for visualisation purposes (e.g., using a Grafana dashboard). There are also side outputs of the counters to the same database.
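A rough Python-SDK sketch of the buffering step is shown below; the actual application lives in the repository referenced above and may differ, and the exact stateful-DoFn parameter names can vary across Beam SDK versions. It counts per-key values arriving from five-second windows into a bag state and releases the buffered vector when the enclosing window closes, ready for the SE.

```python
import apache_beam as beam
from apache_beam.coders import FloatCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

class BufferCounts(beam.DoFn):
    """Buffer per-key counter values and emit them as one vector when the window ends."""
    BUFFER = BagStateSpec('buffer', FloatCoder())
    FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

    def process(self, element,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH),
                window=beam.DoFn.WindowParam):
        key, count = element          # (ip_and_direction, count within a 5-second window)
        buffer.add(float(count))
        flush.set(window.end)         # fire when the (e.g., 5-minute) window closes

    @on_timer(FLUSH)
    def flush_buffer(self, key=beam.DoFn.KeyParam,
                     buffer=beam.DoFn.StateParam(BUFFER)):
        values = list(buffer.read())
        buffer.clear()
        yield key, values             # vector of up to 60 counts, handed to the SE step
```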
4.3 Analysis
The selection of a big data scenario such as NetFlow monitoring supports the aim of elevating a data mining technique with a data exploration perspective to a stream-based setting that handles volatile sources using modern data processing frameworks. The multivariate data in the scenario contain the incoming and outgoing traffic of two monitored services at Ulm University (cf. Sect. 4.1). Using the data and the extracted structural features, one can induce or abduce information; it may concern a severe use case, such as a DDoS attack, or an equally informative insight, such as the start of the semester or the working sessions of students during the day. All these intentions occur at different observation window lengths, and the time management and windowing techniques support the realisation of the application. Additionally, the grouping operations enable the SE for each data class (i.e., the incoming/outgoing network traffic counters). All of the above constitute the vital characteristics of a data processing pipeline in Apache Beam and the major data processing engines. However, as the results will also show, different execution engines in their two deployment strategies (i.e.,
https://github.com/thantsi/beam-scotty-for-netflows. A sensitivity analysis may be beneficial in this step.
standalone or cluster-based) can handle the data volume (e.g., the network traffic counters) differently. Therefore, when a data mining technique leverages the advantages of big data processing frameworks (cf. Sect. 3), selecting the appropriate execution engine becomes a significant burden—a concern that the Beam programming model alleviates.
4.4 Results
We begin with the runtime characteristics of the three installations. The application (cf. Sect. 4.2) is started simultaneously on all the endpoints, and the operation starts instantly as the Kafka consumer delivers the input data to the pipeline. This is also evident from an initial spike of around thirty per cent in the CPU load of the worker nodes in every installation. In Fig. 6, we illustrate the number of records throughout the experiment. The number of NetFlow records was increasing, as the investigation took place on a working day for the university and higher traffic was expected. There is an initial burst of incoming data, after which both engines react to the load and stabilise quite quickly. In Table 1, we summarise the system metrics of the worker VMs over a three-hour run. The Flink installation also provides metrics from within the JVM and is therefore more detailed. We collected the Samza values using the Ganglia monitoring system, because the runtime and performance metrics were unavailable due to an internal bug in Samza when used with Apache Beam; the issue was mentioned on the developers' mailing list and no alternative was available. Thus, the rest of the evaluation only uses Flink-related performance metrics.
Fig. 6. The number of incoming NetFlow records in the application during the experiment.
Nonetheless, to accommodate this shortcoming, we capture a custom metric in the Beam-Scotty application: the number of calculated counters over the three-hour time range. As we already described in the scenario section (cf.
Table 1. The average system metrics in the worker VMs of Flink and Samza; the clustered version reports the average over its three workers.

System metric             Flink       Flink clustered   Samza
Avg. JVM CPU load (%)     15.2        8.7               14.7
Avg. JVM memory (Mb)      697 + 266   649 + 342         955
Table 2. The number of counter data points (one per five-second window) computed from the NetFlow records.

Flink   Flink clustered   Samza
3350    2005 + 1318       3766
Sect. 4.1), we sample the counters every five seconds. Additionally, there may be no data samples for the filtered IPs in a given window, in which case the application does not calculate a counter. As observed in Table 2, Samza found slightly more data points, which can become crucial during the SE process further down the pipeline. Overall, the numbers are almost identical, but this is a positive aspect favouring Samza. Both engines use RocksDB7, an in-memory key-value store for fast storage; the evaluation is therefore ultimately based on the internal implementations of the respective frameworks. Every data processing framework has its peculiarities, so developing the Scotty application on Apache Beam brings immediate benefits by removing the hassle of selecting one, especially at the beginning of the development cycle. A Beam-Scotty application is stateless when it operates directly on numerical data; however, as in our example, the Beam-Scotty application for NetFlow monitoring contains one more windowing step and may require different levels of observation (e.g., hourly, daily, weekly) for extracting the shapeoids. Therefore, it uses the state management of the data processing engine and flushes the state bag on the expiration of the set timer (currently five minutes). The memory buffer is flushed at the end of every SE and hence does not tamper with the overall run time.
5 Conclusion
The paper designed and evaluated a data processing pipeline based on the Scotty framework. The transition of Scotty to a streaming setting, together with the development against a unified programming API for data processing, enables it to handle multivariate data streams, a characteristic of real-world applications. We provide evidence by developing a processing pipeline in a NetFlow monitoring platform, extracting shapeoids that characterise the data by their morphology for two extensive learning services at Ulm University.
The shapeoids could serve use cases where a constantly increasing angle or a reoccurring spike in the data is a valuable source of information and requires further characterisation, by selecting values for the parameters that produce the best results for the task at hand. Moreover, development with the Beam programming model alleviates the burden of selecting the best execution environment during the initial stages of bootstrapping, and lets developers focus on the sensitivity analysis Scotty requires to give the best results. We consider a set of windowing strategies for bounding the underlying data on which Scotty operates. One may use any type of window to bound the data, as long as it is suited to the underlying data source. For example, using a session window would lead to the suppression of the FLAT shapeoid, at least for longer inactivity gaps; this has to be handled appropriately in the application logic, or be made known to the user down the line who consumes the shapeoids (e.g., the absence of data should not be read as an operational malfunction). Nevertheless, in this paper we provide evidence that application development with Scotty on big data processing systems is possible, with a one-time encoding of the application and a variety of execution engines to run it. As future work, we plan to evaluate the resulting shapeoids for use cases in anomaly detection on the NetFlow monitoring platform. We omit this task here, as our previous work showed the efficiency and accuracy of Scotty on another dataset. Acknowledgment. The research leading to these results has received partial funding from Germany's Federal Ministry of Education and Research (BMBF) under HorME (01IS18072) and from the federal state of Baden-Württemberg (Germany), under the project bwNet2020+.
Detection of Credit Card Frauds with Machine Learning Solutions: An Experimental Approach

Courage Mabani, Nikolaos Christou(B), and Sergey Katkov

School of Computer Science and Technology, University of Bedfordshire, Luton LU1 3JU, UK [email protected] http://beds.ac.uk/Computing

Abstract. In many cases frauds in payment transactions can be detected by analysing the customer's behaviour. In the United States alone, fraudulent transactions led to financial losses of 300 billion dollars a year. Machine learning (ML) and Data Mining techniques have been shown to be efficient for the detection of fraudulent transactions. This paper proposes an experimental way of designing a ML solution to the problem, which allows practitioners to minimise financial losses by analysing the customer's behaviour and common patterns of using credit cards. The solution, designed within a Random Forest (RF) strategy, is examined on a public data set available to the research community. The results obtained on the benchmark data show that the proposed approach provides a high accuracy of detecting fraudulent transactions based on the customer's behaviour patterns that were learnt from data. This allows us to conclude that the use of RF models for detecting credit card fraud transactions allows practitioners to design an efficient solution in terms of sensitivity and specificity. Our experimental results show that practitioners using RF models can find new insights into the problem and minimise the losses.

Keywords: Fraud detection · Payment transactions · Customer's behaviour · Machine learning · Random forest

1 Introduction
The global rate of online fraudulent transactions has grown to 1% according to the UK Finance report [33]. In the United States, fraudulent claims in healthcare and insurance alone led to financial losses amounting to 98 billion and 300 billion dollars a year, respectively. Machine learning (ML) and Data Mining techniques have been shown to be efficient in improving the detection of credit card frauds, see e.g. [8,10,21]. A realistic data set of credit card transactions made by EU cardholders is available. This data set contains a large number of payment transactions: during only two days, 284,807 transactions were made, of which 492 (0.17%) were found to be fraudulent.
The data set made available to the research community is represented by 31 variables, most of which are principal components that keep the data anonymous; only the variables Time, Amount, and Class are given in their original form. Class is the target variable, where 1 stands for fraudulent and 0 for normal transactions [7,22]. Because of the small number of fraudulent transactions, the data set is unbalanced and therefore requires special techniques for estimating the detection accuracy, sensitivity and specificity [5,6,17]. Research on such a data set is limited because the principal components cannot be directly explained, as described in [2,15,25]. In our research we first aim to analyse Random Forest (RF) models, which are well known in the related literature for providing efficient solutions to real-world problems including credit card fraud detection, see e.g. [16,35]. According to [2], random forests perform better than other models such as ANNs, which can be affected by noise and overfitting problems. Random forest models are robust in handling missing values and noise. Such models are easy to use because only a small number of parameters needs to be fitted to the given data set, as described in [3]. In particular, in our study we analyse: (i) how samples of customers' payment transactions impact the detection accuracy and (ii) how a fraudulent transaction can be detected most accurately and reliably. The RF is a well-known ML technique which aggregates decision tree (DT) models built on a given data set. Within this technique, DT models are randomised by using different combinations of explaining variables as well as different data samples, as shown in [4]. Another problem that will be studied is the imbalance of payment transaction data. This problem is typically mitigated by using sampling methods, see e.g. [23,24]. The main imbalance strategies are oversampling, undersampling, and synthetic generation of data. In the related literature it has been reported that the best performance is provided by the under-sampling strategy, where the fraud rate is increased by designing a balanced class distribution; this is achieved by reducing the disproportion between the majority (normal) and minority (fraud) classes of payment transactions. In most cases, the imbalanced data make the specificity and sensitivity critical for fraud detection. An increase in sensitivity reduces the false negative outcomes while increasing the true positive detections, see e.g. [9]. This paper describes an experimental approach, based on RF and ML techniques, to designing solutions for detecting credit card frauds. Within this study the fraudulent transactions are detected by analysing the customer's purchasing behaviour. Finally, the designed models are analysed and compared in terms of detection accuracy, taking into account the fact that the credit card data are imbalanced.
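As a concrete illustration of the under-sampling strategy referred to above, the following sketch (our own, not the authors' code) balances the two classes by randomly reducing the majority class; the column names follow the public data set.

import pandas as pd

def undersample(df: pd.DataFrame, target: str = 'Class',
                random_state: int = 0) -> pd.DataFrame:
    # Keep all fraud rows and an equal-sized random sample of normal rows.
    fraud = df[df[target] == 1]
    normal = df[df[target] == 0].sample(n=len(fraud), random_state=random_state)
    return pd.concat([fraud, normal]).sample(frac=1.0, random_state=random_state)

# Usage, assuming the public CSV has been downloaded locally:
# data = pd.read_csv('creditcard.csv')
# balanced = undersample(data)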
2 Related Work
The use of Hidden Markov Model transaction sequences provides additional features whose correlations with fraudulent transactions improve the detection
accuracy, as described in [18]. According to [34], transaction aggregation and descriptive features representing past periods allow for a 28% increase in the accuracy of fraud detection. Deep Learning concepts have been used efficiently for the detection of abnormal patterns [19,20] and human “brain-print” identification [30], as well as for the early detection of bone pathologies [1,11]. Bayesian Learning concepts capable of providing reliable estimation of risks have been efficiently implemented for trauma severity estimation [13,29] and survival prediction [26,27], estimation of brain development [14,31,32], and collision avoidance at Heathrow [28]. The detection accuracy can be increased by using a random walk sampler based on Markov chains, as described in [12,13]. Such methods have provided reliable estimates of the predictive posterior density distribution, which is critically important for the minimisation of risks on imbalanced data.
3 Methods and Data
The credit card data introduced in the Introduction contain 284,315 legitimate and 492 illegitimate payment transactions, giving an imbalance rate of 0.0017. It is therefore of crucial importance to find parameters of a ML model which provide the maximal detection accuracy on such an imbalanced data set. In this study a solution developed within the RF framework is explored as follows: (i) the number of decision trees included in a RF varies between 200 and 500, and (ii) the number of explaining variables is explored within a range between 5 and 25. To evaluate the designed ML solutions we use a confusion matrix along with a Receiver Operating Characteristic (ROC) curve, which provide metrics for the fraud detection outcomes. The ROC is constructed by plotting the True Positive rate against the False Positive rate. The True Positive rate is the proportion of positive data samples which are correctly detected as positive. The False Positive rate is the proportion of negative events which are incorrectly detected as positive. The ROC curve depicts a trade-off between the sensitivity and specificity. A curve close to the top left corner of the ROC plot indicates better model performance; the closer the curve is to the 50% diagonal line, the less accurate the model. The ROC is evaluated via the area under the curve (AUC), and the higher the AUC, the better the detection accuracy.
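The experimental grid described above can be sketched as follows with scikit-learn; this is an assumed setup rather than the authors' code, with the tree and feature counts taken from the stated ranges.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_rf(X, y, n_trees=200, n_features=5, random_state=0):
    # Hold out a stratified test split, fit the RF, and report the AUC and
    # the confusion matrix later used for sensitivity/specificity.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state)
    model = RandomForestClassifier(n_estimators=n_trees,     # explored: 200-500
                                   max_features=n_features,  # explored: 5-25
                                   random_state=random_state)
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    cm = confusion_matrix(y_te, model.predict(X_te))
    return auc, cm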
4 Experiments and Results
The best performance in terms of the confusion matrix was achieved with a model providing a detection error of 0.04 and a True Positive rate of 0.194. Figure 1 shows the fitting (training) error over the number of DTs included in the RF models explored in the study.
Fig. 1. Training error over the number of decision trees included in the random forest model.
We can see that the error rate drops with the number of DT models and becomes stable after 200 trees. We could therefore suppose that using a larger number of DT models could further improve the detection accuracy. Figure 2 shows the distribution of the number of DT nodes included in the designed RF model. This number varies largely between 80 and 160 nodes, with a mean value of around 130 nodes. The performance of ML techniques applied to an imbalanced problem depends critically on both the model parameters and the imbalance strategy. The oversampling strategy increases the rate of positive (fraudulent) transactions to the same value as that of the negative (legitimate) transactions. According to our observations shown in Table 1, the oversampling strategy has the smallest sensitivity, defined by a True Positive rate of 0.7770. In contrast, the under-sampling strategy has the highest sensitivity of 0.8851, whilst the sensitivity of the synthetic strategy is only 0.8446. Therefore, under-sampling is more efficient when the sensitivity is required to be maximal. The ability of ML techniques to detect the True Negative (legitimate) events, defined by the specificity, plays a secondary role in fraud detection. In practice the specificity is typically defined by a trade-off between financial losses of two types. We can see in Table 1 that the oversampling strategy has the highest specificity of 0.9995, followed by the synthetic strategy with a specificity of 0.9890 and the under-sampling strategy with 0.9781. The overall accuracy was highest for the oversampling strategy at 0.9995, compared to 0.9779 for the under-sampling and 0.9888 for the synthetic strategies. The accuracy, however, plays a secondary role in imbalanced data problems.
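Sensitivity and specificity, as reported in Table 1, are read directly off the binary confusion matrix; the sketch below (our own illustration with made-up counts) shows the computation.

import numpy as np

def sensitivity_specificity(cm: np.ndarray):
    # cm has rows = true class (0, 1) and columns = predicted class (0, 1),
    # as returned by scikit-learn's confusion_matrix.
    tn, fp, fn, tp = cm.ravel()
    return tp / (tp + fn), tn / (tn + fp)

cm = np.array([[990, 10],    # legitimate: 990 kept, 10 flagged
               [ 20, 80]])   # fraud: 20 missed, 80 caught
print(sensitivity_specificity(cm))  # (0.8, 0.99)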
Fig. 2. Distribution of decision tree nodes in the designed random forest model.

Table 1. Performances of the oversampling, undersampling, and synthetic strategies in terms of the detection accuracy, sensitivity, and specificity.

Method       Accuracy  Sensitivity  Specificity
OverSample   0.9995    0.7770       0.9995
UnderSample  0.9779    0.8851       0.9781
Synthetic    0.9888    0.8446       0.9890

5 Discussion
The presented study aimed to experimentally examine the performance of ML strategies known in the literature on a benchmark that represents a real-world fraud detection problem. The benchmark data used in this study are heavily imbalanced. The use of a confusion matrix alone for evaluation of the detection performance is not informative because of the imbalance problem. For this reason the experimental results were analysed in terms of sensitivity and specificity. The experiments were conducted with Random Forest (RF) models, which benefit practitioners by having only a small set of parameters that need to be optimised experimentally on the given data. The experiments were run over the number of decision trees in the RF model as well as over two further parameters, namely the data and feature sampling rates.
6 Conclusions
There is a growing need to prevent losses caused by frauds in payment transactions. Machine learning (ML) and Data Mining techniques have been shown to be efficient for improving the fraud detection accuracy in many applications described in the related literature. This paper has proposed an experimental approach to building ML solutions to the problem, with the aim of minimising the financial losses by analysing the customer's behaviour and patterns of credit card use. The designed solution is examined on a public data set available to the research community and evaluated in terms of fraud detection accuracy.
In practice, ML fraud detection techniques require retrospective transaction data which demonstrate the card holder's behaviour. There is a wide range of ML approaches used for credit card fraud detection. This paper has experimentally investigated the efficiency of a Random Forest strategy on a realistic set of credit card fraud detection data. RF models are well known for providing practitioners with a high detection accuracy, which can often be better than that provided by other ML techniques. When data are heavily imbalanced, special strategies are required to deal with the detection problem. Applications such as fraud detection require the analysis of True Positive (sensitivity) and True Negative (specificity) rates in order to let practitioners find a cost-efficient solution and minimise the financial losses. The study has shown that the use of the under-sampling strategy enables the RF models to achieve a higher sensitivity on the imbalanced data. In practice the sensitivity plays a significant role in the design of efficient strategies for fraud detection in credit card transactions. The above allows us to conclude that the use of RF models for detecting credit card fraud transactions lets practitioners design an efficient solution in terms of sensitivity and specificity. Our experimental results show that practitioners using RF models can find new insights into the problem and minimise the losses. Acknowledgments. The authors are grateful to the anonymous reviewers for constructive recommendations as well as to Dr L. Jakaite and Dr V. Schetinin from the University of Bedfordshire for insightful and useful comments.
References 1. Akter, M., Jakaite, L.: Extraction of texture features from X-Ray images: case of osteoarthritis detection. In: Yang, X.-S., Sherratt, S., Dey, N., Joshi, A. (eds.) Third International Congress on Information and Communication Technology. AISC, vol. 797, pp. 143–150. Springer, Singapore (2019). https://doi.org/10.1007/978-981-131165-9 13 2. Bhattacharyya, S., Jha, S., Tharakunnel, K., Westland, J.C.: Data mining for credit card fraud: a comparative study. Decis. Support Syst. 50(3), 602–613 (2011) 3. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees, p. 368. Chapman and Hall, New York (1984) 4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001) 5. Carcillo, F., Le Borgne, Y.-A., Caelen, O., Bontempi, G.: Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. Int. J. Data Sci. Anal. 5(4), 285–300 (2018) 6. Carcillo, F., Le Borgne, Y.-A., Caelen, O., Kessaci, Y., Obl´e, F., Bontempi, G.: Combining unsupervised and supervised learning in credit card fraud detection. Inf. Sci. 557, 317–331 (2021) 7. Carcillo, F., Dal Pozzolo, A., Le Borgne, Y.-A., Caelen, O., Mazzer, Y., Bontempi, G.: SCARFF: a scalable framework for streaming credit card fraud detection with spark. Inf. Fusion 41, 182–194 (2018)
8. Bahnsen, A.C., Aouada, D., Stojanovic, A., Ottersten, B.: Feature engineering strategies for credit card fraud detection. Expert Syst. Appl. 51, 134–142 (2016) 9. Dzakiyullah, N.R., Pramuntadi, A., Fauziyyah, A.K.: Semi-supervised classification on credit card fraud detection using autoencoders. J. Appl. Data Sci. 2(1), 01–07 (2021) 10. Fiore, U., De Santis, A., Perla, F., Zanetti, P., Palmieri, F.: Using generative adversarial networks for improving classification effectiveness in credit card fraud detection. Inf. Sci. 479, 448–455 (2019) 11. Jakaite, L., Schetinin, V., Hladuvka, J., Minaev, S., Ambia, A., Krzanowski, W.: Deep learning for early detection of pathological changes in X-ray bone microstructures: case of osteoarthritis. Sci. Rep. 11 (2021) 12. Jakaite, L., Schetinin, V., Maple, C.: Bayesian assessment of newborn brain maturity from two-channel sleep electroencephalograms. Comput. Math. Methods Med., 1–7 (2012) 13. Jakaite, L., Schetinin, V., Maple, C., Schult, J.: Bayesian decision trees for EEG assessment of newborn brain maturity. In: The 10th Annual Workshop on Computational Intelligence, UKCI 2010 (2010) 14. Jakaite, L., Schetinin, V., Schult, J.: Feature extraction from electroencephalograms for Bayesian assessment of newborn brain maturity. In: 24th International Symposium on Computer-based Medical Systems (CBMS), pp. 1–6 (2011) 15. Jha, S., Westland, J.C.: A descriptive study of credit card fraud pattern. Global Bus. Rev. 14(3), 373–384 (2013) 16. Le Borgne, Y.-A., Bontempi, G.: Machine learning for credit card fraud detectionpractical handbook. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004) 17. Lebichot, B., Borgne, Y.L., He-Guelton, L., Obl´e, F., Bontempi, G.: Deep-learning domain adaptation techniques for credit cards fraud detection. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds.) INNSBDDL 2019. PINNS, vol. 1, pp. 78–88. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-16841-4 8 18. Lucas, Y., et al.: Towards automated feature engineering for credit card fraud detection using multi-perspective HMMs. Future Gener. Comput. Sys. 102, 393– 402 (2020) 19. Nyah, N., Jakaite, L., Schetinin, V., Sant, P., Aggoun, A.: Evolving polynomial neural networks for detecting abnormal patterns. In: 2016 IEEE 8th International Conference on Intelligent Systems (IS), pp. 74–80 (2016) 20. Nyah, N., Jakaite, L., Schetinin, V., Sant, P., Aggoun, A.: Learning polynomial neural networks of a near-optimal connectivity for detecting abnormal patterns in biometric data. In: 2016 SAI Computing Conference (SAI), London, pp. 409–413 (2016) 21. Pourhabibi, T., Ong, K.-L., Kam, B.H., Boo, Y.L.: Fraud detection: a systematic literature review of graph-based anomaly detection approaches. Decis. Support Syst. 133, 113303 (2020) 22. Dal Pozzolo, A., Boracchi, G., Caelen, O., Alippi, C., Bontempi, G.: Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Trans. Neural Networks Learn. Syst. 29(8), 3784–3797 (2018) 23. Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Waterschoot, S., Bontempi, G.: Learned lessons in credit card fraud detection from a practitioner perspective. Expert Syst. Appl. 41(10), 4915–4928 (2014) 24. Pozzolo, A.D., Caelen, O., Johnson, R.A., Bontempi, G.: Calibrating probability with undersampling for unbalanced classification. In: IEEE Symposium Series on Computational Intelligence SSCI 2015, Cape Town, South Africa, 7–10 December 2015, pp. 159–166. IEEE (2015)
25. Prusti, D., Rath, S.K.: Fraudulent transaction detection in credit card by applying ensemble machine learning techniques. In: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–6. IEEE (2019) 26. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian averaging over decision tree models: an application for estimating uncertainty in trauma severity scoring. Int. J. Med. Informatics 112, 6–14 (2018) 27. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian averaging over decision tree models for trauma severity scoring. Artif. Intell. Med. 84, 139–145 (2018) 28. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian learning of models for estimating uncertainty in alert systems: application to air traffic conflict avoidance. Integr. Comput. Aided Eng. 26, 1–17 (2018) 29. Schetinin, V., Jakaite, L., Krzanowski, W.J.: Prediction of survival probabilities with Bayesian decision trees. Expert Syst. Appl. 40(14), 5466–5476 (2013) 30. Schetinin, V., Jakaite, L., Nyah, N., Novakovic, D., Krzanowski, W.: Feature extraction with GMDH-type neural networks for EEG-based person identification. Int. J. Neural Syst. 28(6), 1750064 (2018) 31. Schetinin, V., Jakaite, L., Schult, J.: Informativeness of sleep cycle features in Bayesian assessment of newborn electroencephalographic maturation. In: 2011 24th International Symposium on Computer-Based Medical Systems (CBMS), pp. 1–6 (2011) 32. Schetinin, V., Jakaite, L.: Extraction of features from sleep EEG for Bayesian assessment of brain development. PLoS ONE 12(3), 1–13 (2017) 33. UK Finance: Fraud the facts 2019. https://www.ukfinance.org.uk/policy-andguidance/reports-publications/fraud-facts-2019. Accessed 01 Oct 2021 34. Whitrow, C., Hand, D.J., Juszczak, D., Weston, D., Adams, N.M.: Transaction aggregation as a strategy for credit card fraud detection. Data Min. Knowl. Disc. 18(1), 30–55 (2009) 35. Xuan, S., Liu, G., Li, Z., Zheng, L., Wang, S., Jiang, C.: Random forest for credit card fraud detection. In: 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC), pp. 1–6. IEEE (2018)
ALBU: An Approximate Loopy Belief Message Passing Algorithm for LDA for Small Data Sets Rebecca M. C. Taylor(B) and Johan A. du Preez Department of Electrical and Electronic Engineering, Stellenbosch University, Stellenbosch 7602, RSA [email protected], [email protected]
Abstract. Variational Bayes (VB) applied to latent Dirichlet allocation (LDA) has become the most popular algorithm for aspect modeling. While sufficiently successful in text topic extraction from large corpora, VB is less successful in identifying aspects in the presence of limited data. We present a novel variational message passing algorithm as applied to Latent Dirichlet Allocation (LDA) and compare it with the gold standard VB and collapsed Gibbs sampling. In situations where marginalisation leads to non-conjugate messages, we use ideas from sampling to derive approximate update equations. In cases where conjugacy holds, Loopy Belief Update (LBU) (also known as Lauritzen-Spiegelhalter) is used. Our algorithm, ALBU (approximate LBU), has strong similarities with Variational Message Passing (VMP) (which is the message passing variant of VB). To compare the performance of the algorithms in the presence of limited data, we use data sets consisting of tweets and news groups. Using coherence measures we show that ALBU learns latent distributions more accurately than does VB, especially for smaller data sets.

Keywords: Latent Dirichlet Allocation · Variational · Loopy Belief · Cluster graph · Graphical model · Message passing · Lauritzen-Spiegelhalter

1 Introduction
Latent Dirichlet Allocation [7] is a hierarchical Bayesian model that is most commonly known for its ability to extract latent semantic topics from text corpora. It can also be thought of as the categorical/discrete analog of Bayesian principal component analysis (PCA) [8] and can be used to extract latent aspects from collections of discrete data [7]. It finds application in fields such as medicine [3,36], computer vision [11,43,45] and finance [12,42]. In non-text specific applications, it is commonly referred to as grade of membership [50] or aspect modeling [28].

1.1 Topic Extraction in the Presence of Less Data
For most text topic modeling problems the traditional approximate inference techniques are sufficient. However, in many cases where less data is available
(due to smaller collections and/or collections where document size is small) or where there is large topic overlap (both of which make inference more difficult), alternative approaches are favored [18,20,35,37]. The phenomenal increase in social media usage, typified by digital microblogging platforms such as Twitter [37,47], has called attention to the need for addressing problems associated with the semantic analysis of text corpora involving short texts [4,18,33]. The difficulties inherent in the extraction of topic information from very short texts [26,35,51] are partly due to the small number of words remaining after the removal of stop words [1,18]. On average, the number of words per tweet, even before data cleaning, is between 8 and 16 words [9,19]. There are many ways of improving LDA performance in the presence of minimal data, such as document pooling, including additional information (such as temporal [46] or author [40] information), or performing transfer learning [18,20,35], where word-topic distributions are learnt from a large, universal corpus [20,35]. In this article, we present our general approach to improving LDA accuracy on small data sets by using an alternative approximate inference technique. By modifying the inference only, we allow other strategies such as pooling or transfer learning to be applied additionally where the data set allows.

1.2 Inference Algorithms for LDA
Exact inference is intractable for many useful graphical models such as LDA [5–7, p. 461]. In fact, one cannot perform exact inference on any graphical model where continuous parent distributions have discrete children [32]. A range of approximate techniques can be used to overcome this difficulty. These techniques vary in performance, based on the models to which they are applied [21]. Particle-based approaches (such as Markov chain Monte Carlo [5, p. 462]) are computationally expensive [44] and convergence rates can be slow, though asymptotic convergence is guaranteed. Blei et al. [7] presented variational Bayes as the inference technique for LDA. The success of VB, and all of its variants, as approximate inference techniques for LDA has resulted in their widespread use for topic modeling by machine learning practitioners [13]. However, when working with small text corpora, or performing aspect modeling on small non-text data sets, the quality of extracted aspects can be low [18,51]. Typically, when topic extraction is more difficult (such as when extracting topics out of very informal text such as tweets or forum messages [14]), collapsed Gibbs sampling [15,16] is used [14]. Collapsed Gibbs sampling, like its other Markov chain Monte Carlo (MCMC) sampling counterparts, however, has its drawbacks: it is hard to determine when it has converged (resulting in difficulty in debugging [54]), it can be slow [38], and it has inherent scalability issues [54,55]. We propose that our algorithm be used as an alternative to collapsed Gibbs sampling since it achieves similar results but is based on variational methods and not sampling, which results in greater stability and lower computational complexity.
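Standard VB for LDA, the baseline referred to throughout, is available off the shelf. The sketch below is an assumed setup using gensim (the specific VB implementation benchmarked in this paper is not named here), fitted on a toy corpus of short documents.

from gensim import corpora
from gensim.models import LdaModel

docs = [['short', 'tweets', 'have', 'few', 'words'],
        ['vb', 'struggles', 'with', 'short', 'texts'],
        ['topic', 'models', 'extract', 'latent', 'topics']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
# Variational Bayes with symmetric Dirichlet priors.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha='symmetric', eta='symmetric', passes=10, random_state=0)
print(lda.print_topics())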
1.3 Loopy Belief Propagation and Update Algorithms
For the LDA model, Zeng et al. [53] apply a modified version of loopy belief propagation (LBP) and loopy belief update (LBU) to the collapsed LDA graphical model (in factor graph form). They, however, use a sum-sum version, where the sum-product algorithm is typically used for Shafer-Shenoy on factor graphs [23,48]. Zeng et al.'s algorithm is shown to outperform VB by a large margin and to be comparable to collapsed Gibbs sampling (given a set number of epochs), as well as being considerably faster than VB. Zeng et al. also present their research group's MATLAB toolbox, TMBP [52], in which the core algorithms are coded in C++ and called from MATLAB using MEX. This toolbox has been used to apply BP to other LDA-like problems [25]. Our work complements that of Zeng et al., since our approach also uses a form of LBU (but without collapsing the graphical model) and also performs significantly better than VB.

1.4 Other Message Passing Algorithms for LDA
Expectation propagation (EP) [28,29] is a message passing algorithm that has been shown to reduce to belief propagation when certain conjugacies hold [29]. We can therefore think of it as a more general form of belief propagation (or belief update, if the update form is used). It is also variational in nature, but instead of minimising the reverse Kullback-Leibler divergence (KLD), it minimises the forward KLD. Minka applies EP to the initial (unsmoothed) version of LDA successfully [28] and shows how EP outperforms standard VB for this model (using simulated and hybrid data sets). Our work is similar to EP, but we propose specific closed-form approximate message updates for LDA, where EP relies on moment matching (typically calculated iteratively) for approximate messages [28]. Variational message passing (VMP) [48,49] is the message passing analog of VB, and the two algorithms are mathematically equivalent [49]. In VMP, the VB algorithm is posed as a sequence of local updates to achieve global graph-wide optimisation. We have not found a published implementation of either the equations or the code for VMP for LDA, but we can measure its performance since VB is widely used for LDA. In our work, we compare our new algorithm, ALBU, to VMP in terms of its messages, but use a VB implementation when comparing performance, because we want to use a reputable implementation in this research.

1.5 Our Contribution
In this article, we present a novel algorithm, approximate loopy belief update (ALBU), for LDA. We utilize the Lauritzen-Spiegelhalter algorithm, LBU, directly where possible, and where approximate messages are required, we derive update equations by using a sampling approach (which turns out to have strong similarities with the VMP approximate messages [48]). We use fixed updates
for the approximate messages, unlike in Minka et al.'s approach [29]. Our approach is similar to that of Zeng et al. [52], but differs in that we do not use the collapsed version of the model (whereas Zeng et al. do) and we use the standard Lauritzen-Spiegelhalter algorithm where possible (whereas Zeng et al. [52] use the sum-sum version of BP). We have not seen any other belief propagation based inference algorithms for LDA that use the full, uncollapsed graphical model, and we aim to address this gap with the work presented in this article. In this article, we present our ALBU algorithm for LDA and compare it with the standard VB algorithm algorithmically, since VB can easily be compared with our message passing algorithm by using its VMP variant. We also benchmark ALBU against collapsed Gibbs sampling for completeness.
2 Message Passing Background
Message passing algorithms recursively update messages on a graph by using local computations done at graph vertices. Belief propagation (BP), also known as the Shafer-Shenoy algorithm [41], is one of the many variational message passing algorithms [22] that can be used to perform approximate inference on a subset of graphical models. The message passing protocol used in the belief propagation method (also used in the sum-product algorithm for factor graphs [23]) requires a node to have received messages from all its neighbours before sending a message to a neighbour [17,31]. The Lauritzen-Spiegelhalter (LS) algorithm [24] (also commonly referred to as the belief update algorithm [22, p. 364–366]), however, allows messages to be sent along an arbitrary edge. It also differs from Shafer-Shenoy in the manner in which over-counting is handled: instead of excluding reverse direction messages when calculating an outgoing message, message cancelling is performed when updating the target factor [22, p. 364–366]. For trees, the update rules yield exact marginals for both BP and LS algorithms [34]. For graphs with loops, the same message passing algorithms can be applied, but convergence guarantees are lost [31,49]. Murphy et al. [31], however, note that if the algorithms are stopped after (a) a fixed (sensible) number of iterations or (b) minimal change in beliefs after subsequent iterations is observed, results can still be useful. Furthermore, should the algorithm converge, the approximation is usually good [31]. The loopy variants are named loopy belief propagation (LBP) and loopy belief update (LBU), respectively [22]. In ALBU, the loopy belief update (LBU) algorithm is followed wherever possible. With regard to the message passing strategy, this differs from VMP (the equivalent of VB, framed in a message passing formulation) in that, unlike in VMP, reverse messages (messages that the previous factor had sent to that node) are always excluded [49]. This is done explicitly when updating the target cluster by the division of the current forward direction sepset belief by the previous forward direction sepset belief. This division of sepset beliefs allows us to avoid over-counting [22].
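Because multiplying and dividing Dirichlet densities amounts to adding and subtracting their parameters, the belief update with sepset cancellation described above reduces to simple parameter arithmetic for the Dirichlet clusters used later in this paper. The sketch below is our own illustration, not code from the paper.

import numpy as np

def lbu_dirichlet_update(cluster_alpha, new_sepset_alpha, old_sepset_alpha):
    # Dir(a) * Dir(b) / Dir(c) is proportional to Dir(a + b - c), so the
    # cluster update (multiply in the new sepset belief, divide out the old
    # one) is carried out directly on the parameter vectors.
    return cluster_alpha + new_sepset_alpha - old_sepset_alpha

cluster = np.array([0.1, 0.1, 0.1])      # current cluster belief
old_sepset = np.array([1.0, 1.0, 1.0])   # previous (uninformative) sepset belief
new_sepset = np.array([1.8, 1.1, 1.1])   # refined incoming sepset belief
print(lbu_dirichlet_update(cluster, new_sepset, old_sepset))  # [0.9 0.2 0.2]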
3 The Graphical Representation and Message Passing Schedule for ALBU
In this section we present the LDA graphical model and define the relevant random variables (RVs) and distributions. The message passing schedule for ALBU is also presented.

3.1 Notation and Graphical Representation of LDA
LDA was originally depicted [7] in the Bayes net plate model format shown in Fig. 1. The symbols used are shown in Table 1 and described in the paragraphs below.
Fig. 1. Plate model of LDA system as a Bayes net. Each node in the graph is assigned a name based on the random variable that is to the left of the conditioning bar for its initial belief. The symbols used in this figure are shown in Table 1 as well as explained in the paragraphs below.
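To make the roles of these variables concrete, the following sketch (our own illustration, with arbitrary sizes and hyperparameters) samples a tiny corpus from the LDA generative process using the random variables defined in the list below.

import numpy as np

rng = np.random.default_rng(0)
M, N, K, V = 2, 4, 3, 6           # documents, words per document, topics, vocabulary
alpha, beta = 0.1, 0.1            # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)     # phi_k: word-topic vectors
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))    # theta_m: topic-document vector
    for n in range(N):
        z = rng.choice(K, p=theta_m)              # Z_{m,n}: topic assignment
        w = rng.choice(V, p=phi[z])               # W_{m,n}: observed word
        print(m, n, z, w)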
LDA is comprised of the following random variables and vectors.
– Topic-document random vectors, θm: one per document (M in total). Each θm is K-dimensional, with θm,k indicating the probability of topic k in document m, where {θm,k ∈ R | θm,k ∈ (0, 1)}.
– Topic-document random variables, Zm,n: one for each word in each document and therefore \sum_m N_m in total (with Nm words in document m). These RVs can take on one of K values and describe the probability of each word deriving from the various topics.
– Word-topic random variables, Wm,n: one of these for each word in each document. Each of these RVs can take on one of V words in the vocabulary (each v represents a word in the vocabulary).
– Word-topic random vectors, φk: one per topic, where φk = {φk,1, φk,2, ..., φk,V}. Each vector, φk, is V-dimensional, with φk,v indicating the probability of word v in topic k, where {φk,v ∈ R | φk,v ∈ (0, 1)}.
The graphical model shown in Fig. 1 allows us to visually identify the conditional dependencies in the LDA model. Arrows in the graph indicate the direction of dependence.
Table 1. Symbols for the LDA model, as used in Fig. 1.

Symbol   Description
M        Total number of documents
m        Current document
N        Number of words in current document
n        Current word (in document)
K        Total number of topics
k        Current topic
V        Total number of words in the vocabulary
v        Current word (in vocabulary)
v        Observed word (in vocabulary)
θm       Topic-document Dirichlet node for document m
Zm,n     Topic-document categorical node for word n in document m
Wm,n     Word-topic conditional categorical node for word n in document m
φk       Word-topic Dirichlet set node for topic k
From Fig. 1, we can see that the n'th word in document m is Wm,n. This word depends on the topic Zm,n present in the document, which selects the Dirichlet random vector φk describing the words present in each topic. Based on the graph we can identify the following distributions.
– Topic-document Dirichlet distributions are used to inform our belief about the topics present within each document, as well as how confident we are in this belief. They have a belief of Ψθm(θm; αm) = p(θm; αm,1, ..., αm,K).
– Topic-document distributions: each distribution informs the local belief about the topic proportions for each word in each document. They have an initial belief of ΨZm,n(Zm,n, θm) = p(Zm,n|θm). The form changes to that of a joint with the respective topic-document Dirichlet distribution: ΨZm,n(Zm,n, θm) = p(Zm,n, θm), which we refer to as a Polya distribution.
– Word-topic conditional distributions are conditioned on the RV Zm,n, which means that if we could observe the topic, we would be left with a categorical or Polya distribution. These distributions effectively tell us what the topic proportions are per word within a document, for each topic. They initially have a belief of ΨWm,n(Wm,n, Zm,n, Φ) = p(Wm,n|Zm,n, Φ). The joint distributions ΨWm,n(Wm,n, Zm,n, Φ) = p(Wm,n, Zm,n|Φ) are used after initialisation.
– Word-topic Dirichlet distributions (a Dirichlet set) inform our belief that a word should be assigned to a topic, as well as our confidence in this. These have a belief of Ψφk(φk) = p(φk|βk). We can see the group of all the word-topic Dirichlet distributions in our graph as a set of Dirichlet distributions, denoted by ΨΦ(Φ) = p(Φ|B).
Figure 2 shows an unrolled factor graph for a corpus containing two documents, two words per document and three topics. When we describe the algorithm below, we refer to one branch of the LDA graph; this relates to one specific word, n, in a single specific document, m.
Fig. 2. Unrolled cluster graph representation of LDA for a two document corpus with two words per document. A single branch is highlighted (where the document number is m = 1 and the word in the document is number n = 2). We include the cluster and sepset beliefs in the diagram, but exclude the function arguments to avoid cluttering the diagram: for example, we show Ψθm instead of Ψθm (θm ). For the meanings of the symbols, refer to Table 1.
The branch where m = 1 and n = 2 is highlighted in Fig. 2. We now proceed to derive the VMP equations for LDA.

3.2 Loopy Message Passing Schedule for ALBU
Because LDA has a simple, known structure per document, it is sensible to construct a fixed message passing schedule; this is not always the case for graphical models, where in some cases we chose rather to base the schedule on message priority using divergence measures to prioritise messages, based on the impact they will have. We provide the message passing schedule used in our implementation in Algorithm 1, and then proceed to present the derivation of our ALBU algorithm.
4 The ALBU Algorithm
We now take each cluster and present the sepset beliefs in the direction of the clusters to its left and to its right. We also present the updates to the target clusters.

4.1 Topic-Document Dirichlet Clusters, {θm}
Topic-document clusters have a cluster belief of,

\Psi_{\theta_m}(\theta_m) = \mathrm{Dir}(\theta_m; \alpha_m) = \frac{\Gamma\left(\sum_k \alpha_{m,k}\right)}{\prod_k \Gamma(\alpha_{m,k})} \prod_k \theta_{m,k}^{\alpha_{m,k}-1},   (1)
Algorithm 1. Loopy Message Passing for LDA

For each epoch:
  For each document m:
    For each word n in document m:
      – send message from ΨΦ(Φ; B) to ΨWm,n(Wm,n, Zm,n, Φ)
      – observe word Wm,n = v
      – send message from ΨWm,n(Wm,n, Zm,n, Φ) to ΨZm,n(Zm,n, θm)
      – send message from ΨZm,n(Zm,n, θm) to Ψθm(θm; αm)
    For each word n in document m:
      – send message from Ψθm(θm; αm) to ΨZm,n(Zm,n, θm)
      – send message from ΨZm,n(Zm,n, θm) to the ΨWm,n(Wm,n, Zm,n, Φ)
  For each word n in each document m:
    – send message from ΨWm,n(Wm,n, Zm,n, Φ) to ΨΦ(Φ)
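In code, the schedule of Algorithm 1 is a set of nested loops over epochs, documents and words. The skeleton below is our own illustration; send_message and observe_word stand in for the cluster and sepset updates derived in the remainder of this section.

def albu_schedule(corpus, n_epochs, send_message, observe_word):
    # corpus[m] is the list of word ids of document m.
    for epoch in range(n_epochs):
        for m, doc in enumerate(corpus):
            # Upward sweep: word evidence flows towards the document Dirichlet.
            for n, v in enumerate(doc):
                send_message('Phi', 'W', m, n)
                observe_word(m, n, v)
                send_message('W', 'Z', m, n)
                send_message('Z', 'theta', m, n)
            # Downward sweep: the refined document belief flows back to the words.
            for n, _ in enumerate(doc):
                send_message('theta', 'Z', m, n)
                send_message('Z', 'W', m, n)
        # Finally, every word cluster updates the word-topic Dirichlet set.
        for m, doc in enumerate(corpus):
            for n, _ in enumerate(doc):
                send_message('W', 'Phi', m, n)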
where α_{m,k} are the hyperparameters for document m and topic k. Initially these hyperparameters are the hyperparameters chosen before inference is performed, namely α^{prior}_{m,k}. In further iterations they are updated and are no longer equal to the initial hyperparameter settings.

Updating the Topic-Document Dirichlet Clusters, Ψθm(θm). For a single branch, each topic-document Dirichlet cluster belief, Ψθm(θm), is only updated by information from one sepset, as seen in Fig. 2. This sepset's belief, ψθm,Zm,n(θm), is of Dirichlet form and is initially an uninformative Dirichlet. We can update Ψθm(θm) by applying the LBU equation,
\Psi_{\theta_m}(\theta_m) = \Psi_{\theta_m}(\theta_m)\,\frac{\psi_{\theta_m,Z_{m,n}}(\theta_m)}{\psi^{*}_{\theta_m,Z_{m,n}}(\theta_m)} = \mathrm{Dir}(\theta_m; \alpha_m)\,\frac{\mathrm{Dir}(\theta_m; \alpha_{m,n})}{\mathrm{Dir}(\theta_m; \alpha^{*}_{m,n})} = \mathrm{Dir}(\theta_m; \alpha_{m,n}),   (2)
where division between Dirichlet distributions is simple to calculate. At convergence, we will have final α hyperparameters, resulting in the posterior,

\Psi_{\theta_m}(\theta_m) = \mathrm{Dir}(\theta_m; \alpha_m) = \frac{\Gamma\left(\sum_k \alpha_{m,k}\right)}{\prod_k \Gamma(\alpha_{m,k})} \prod_k \theta_{m,k}^{\alpha_{m,k}-1}.   (3)

Updating the Neighboring Sepset Beliefs. Each topic-document Dirichlet cluster belief, Ψθm(θm), is used to update N neighbouring sepset beliefs. For a single branch, Ψθm(θm) updates ψθm,Zm,n(θm).
Initially the sepset belief, ψθm,Zm,n(θm), is the product of the messages from both opposite directions passing through it,

\psi_{\theta_m,Z_{m,n}}(\theta_m) = \mu_{\theta_m \rightarrow Z_{m,n}}\,\mu^{1}_{Z_{m,n} \rightarrow \theta_m} = \mathrm{Dir}(\theta_m; \alpha^{\mathrm{prior}}_m)\,\mathrm{Dir}(\theta_m; \alpha^{1}_m) = \mathrm{Dir}(\theta_m; \alpha^{\mathrm{prior}}_m).   (4)
No marginalisation of the cluster belief, Ψθm (θm ), is required because the cluster is only a function of the variable, θm . In each iteration, we can therefore simply update the topic-document Polya cluster belief, ΨZm,n (Zm,n , θm ),
\psi_{\theta_m,Z_{m,n}}(\theta_m) = \Psi_{\theta_m}(\theta_m).   (5)

4.2 Topic-Document Polya Clusters, {Z_{m,n}, θ_m}
Each topic-document Polya cluster belief, ΨZm,n(Zm,n, θm), is initialised with the belief,

\Psi_{Z_{m,n}}(Z_{m,n}, \theta_m) = p(Z_{m,n} \mid \theta_m) = \prod_k \theta_{m,k}^{[Z_{m,n}=k]},   (6)
where each categorical distribution is assigned equal probabilities for each topic. The Polya distribution only arises after the initial update from the sepset ψθm,Zm,n(θm).

Updating a Topic-Document Polya Cluster, ΨZm,n(Zm,n, θm). The sepset belief, ψθm,Zm,n(θm), is of Dirichlet form and is initialised as shown in Eq. 4. The topic-document cluster belief is known as a Polya distribution,

\Psi_{Z_{m,n}}(Z_{m,n}, \theta_m) = \frac{\Gamma\left(\sum_k \alpha_{m,n,k}\right)}{\prod_k \Gamma(\alpha_{m,n,k})} \prod_k \theta_{m,k}^{[Z_{m,n}=k]+\alpha_{m,n,k}-1}.   (7)
The updated sepset belief, ψθ m ,Zm,n (θm ) (Eq. 5), can be used to update the topic-document cluster belief,
\Psi_{Z_{m,n}}(Z_{m,n}, \theta_m) = \Psi_{Z_{m,n}}(Z_{m,n}, \theta_m)\,\frac{\psi_{\theta_m,Z_{m,n}}(\theta_m)}{\psi^{*}_{\theta_m,Z_{m,n}}(\theta_m)} = p(Z_{m,n}, \theta_m; \alpha_{m,n})\,\frac{\mathrm{Dir}(\theta_m; \alpha_{m,n})}{\mathrm{Dir}(\theta_m; \alpha^{*}_m)} = p(Z_{m,n}, \theta_m; \alpha_{m,n}).   (8)
As mentioned earlier, when referring to messages or parameters that have been updated to avoid over-counting (by division by the previous sepset belief), we use the notation α_{m,n}. We can now update the cluster belief as follows:
\Psi_{Z_{m,n}}(Z_{m,n}, \theta_m) = \Psi_{Z_{m,n}}(Z_{m,n}, \theta_m)\,\frac{\psi_{Z_{m,n},W_{m,n}}(Z_{m,n})}{\psi^{*}_{Z_{m,n},W_{m,n}}(Z_{m,n})}.   (9)
After each topic-document Polya cluster belief, ΨZm,n(Zm,n, θm,n), has been updated, it becomes a joint distribution,

\Psi_{Z_{m,n}}(Z_{m,n}, \theta_{m,n}) = \frac{\Gamma\left(\sum_k \alpha_{m,n,k}\right)}{\prod_k \Gamma(\alpha_{m,n,k})} \prod_k \theta_{m,k}^{[Z_{m,n}=k]+\alpha_{m,n,k}-1}\,\frac{p_{m,n,k}^{[Z_{m,n}=k]}}{p_{m,n,k}^{*[Z_{m,n}=k]}} = \frac{\Gamma\left(\sum_k \alpha_{m,n,k}\right)}{\prod_k \Gamma(\alpha_{m,n,k})} \prod_k \theta_{m,k}^{[Z_{m,n}=k]+\alpha_{m,n,k}-1}\,p_{m,n,k}^{[Z_{m,n}=k]},   (10)

where p_{m,n,k} are the topic proportions that are multiplied in when ΨZm,n(Zm,n, θm) is updated by the sepset belief ψZm,n,Wm,n(Zm,n), and p^{*[Z_{m,n}=k]}_{m,n,k} are from the previous epoch. After the first epoch, we have a unique α_{m,n} for each word n in each document m due to message refinement.
Updating the Neighboring Sepset Beliefs. To calculate the new sepset belief, ψθ m ,Zm,n (θm ), from the cluster belief, ΨZm,n (Zm,n , θm ), we remove Zm,n from ΨZm,n (Zm,n , θm ) by marginalizing it out. The exact sepset belief, ψθ*m ,Zm,n (θm ), can be shown to be, ψθ*m ,Zm,n (θm ) = p(Zm,n , θm ; αm,n ) Zm,n
Γ ( k αm,n,k ) [Zm,n =k]+α m,n,k −1 [Zm,n =k] = θm,k pm,n,k Γ (α ) k m,n,k Zm,n k Γ ( k αm,n,k ) αm,n,k −1 k αm,n,k k pm,n,k θm,k . = θm,k k Γ (αm,n,k ) k pm,n,k αm,n,k k
(11) The exact sepset belief, ψθ*m ,Zm,n (θm ) is unfortunately not conjugate to a Dirichlet distribution. We can use an approximate sepset belief instead. We know how to update the Dirichlet posterior if we observe values for Z – simply incre ment the count of the corresponding αm values. However, because Z is latent, we can use the topic proportions to scale the respective αm values before adding this scaled fractional count to the current αm . We validate this by sampling a large number of Z values and averaging the counts: 1. We start by choosing a numerical value for our observed value of W , namely, v as well as the values in a conditional probability table for P (W |Z).
2. Sample θ from a Dirichlet distribution with the specified αs.
3. Observing this θ in a Polya distribution changes it into a categorical distribution in Z.
4. Sample Z = k from the categorical distribution and use this in the distribution for p(W|Z). This gives us a distribution over W as dictated by the particular v.
5. Sample W from this distribution; if it matches the chosen v, we keep it, otherwise we discard it.

This results in the closed-form approximate update,

\alpha_{m,k} \leftarrow \frac{p_{m,n,k}\,\alpha_{m,n,k}}{\sum_j p_{m,n,j}\,\alpha_{m,n,j}} + \alpha_{m,k},   (12)

for each α_{m,k} in the Dirichlet distribution applicable to a document m. Alternatively, one could use one of the updates proposed by Minka [27]. Our update has the advantage that it is a closed-form update that is simple and computationally efficient. To update the sepset belief, ψZm,n,Wm,n(Zm,n), we need to marginalise over θm,
ψZm,n ,Wm,n (Zm,n ) =
θm
=
k
p(Zm,n , θm ; αm,n )dθm [Z
m,n pm,n,k d
=k]
θm
Γ ( k αm,n,k ) [Zm,n =k]+αm,n,k −1 θm,k dθm , k Γ (αm,n,k ) k
(13) k pk αk . k αk
with d = Given that the volume of the Dirichlet distribution equals 1, we can write the sepset belief as, ⎧ pm,n,1 αm,n,1 with Zm,n = 1 ⎪ ⎪ pm,n,k αm,n,k ⎪ pkm,n,2 ⎪ αm,n,2 ⎪ ⎨ p with Zm,n = 2 k m,n,k αm,n,k ψZm,n ,Wm,n (Zm,n ) = . (14) ⎪ .. ⎪ ⎪ ⎪ ⎪ ⎩ pm,n,K αm,n,K with Zm,n = K, pm,n,k α k
with d = 4.3
m,n,k
k pk αk . k αk
Word-Topic Conditional Polya Clusters, {Wm ,n , Zm ,n , Φ}
Word-topic conditional Polya clusters initially have a belief of,

\Psi_{W_{m,n}}(W_{m,n}, Z_{m,n}, \Phi) =
\begin{cases}
\prod_v \phi_{1,v}^{[W_{m,n}=v]} & \text{with } Z_{m,n} = 1 \\
\quad \vdots \\
\prod_v \phi_{K,v}^{[W_{m,n}=v]} & \text{with } Z_{m,n} = K.
\end{cases}   (15)
This belief is a conditional categorical distribution. To obtain a normalised joint, each categorical distribution will later be scaled by p(Zm,n = k; αm ), resulting in a distribution with the total volume of 1. Updating the Word-Topic Conditional Polya Clusters, ΨW m , n (Wm ,n , Zm ,n , Φ). The sepset belief, ψZm,n ,Wm,n (Zm,n ), is in the form of a categorical distribution and is initialised to be noninformative. We use Eq. 14 to obtain the probability for each topic in the categorical distribution in our sepset belief, ψZm,n ,Wm,n (Zm,n ), and update our word-topic conditional Polya cluster belief, ΨWm,n (Wm,n , Zm,n , Φ), as follows,
$$
\Psi'_{W_{m,n}}(W_{m,n}, Z_{m,n}, \Phi)
= \Psi_{W_{m,n}}(W_{m,n}, Z_{m,n}, \Phi)\,
\frac{\psi'_{Z_{m,n},W_{m,n}}(Z_{m,n})}{\psi_{Z_{m,n},W_{m,n}}(Z_{m,n})}, \tag{16}
$$
where the primed beliefs denote the newly updated cluster and sepset beliefs, and where we provide the result of Eq. 16 in Eq. 18. For each topic, this becomes a division of normalised topic proportions. The update from the sepset belief, $\psi_{W_{m,n},\Phi}(\Phi)$, is very similar to the update in Eq. 8 except that we have a multiplication and division per topic,
$$
\Psi'_{W_{m,n}}(W_{m,n}, Z_{m,n}, \Phi)
= \Psi_{W_{m,n}}(W_{m,n}, Z_{m,n}, \Phi)\,
\frac{\psi'_{W_{m,n},\Phi}(\Phi)}{\psi_{W_{m,n},\Phi}(\Phi)}
= p(W_{m,n}, Z_{m,n}, \Phi; \alpha_{m,n}, B_{m,n}). \tag{17}
$$
Taking the updates from both adjoining sepsets into account, the joint distribution for this cluster becomes,
$$
\Psi_{W_{m,n}}(W_{m,n}, Z_{m,n}, \Phi) =
\begin{cases}
p(Z_{m,n}=1;\alpha_m)\,\dfrac{\Gamma\!\left(\sum_v \beta_{1,v}\right)}{\prod_v \Gamma(\beta_{1,v})}
\prod_v \phi_{1,v}^{[W_{m,n}=v]+\beta_{1,v}-1} & \text{with } Z_{m,n}=1\\[1.5ex]
\quad\vdots\\[0.5ex]
p(Z_{m,n}=K;\alpha_m)\,\dfrac{\Gamma\!\left(\sum_v \beta_{K,v}\right)}{\prod_v \Gamma(\beta_{K,v})}
\prod_v \phi_{K,v}^{[W_{m,n}=v]+\beta_{K,v}-1} & \text{with } Z_{m,n}=K.
\end{cases} \tag{18}
$$
Note that the $\alpha$ and $\beta$ hyperparameters are unique for each branch (for specific $m$ and $n$) due to message refinement, but we do not indicate this in the subscripts to avoid unnecessary notational complexity. Updating the Neighboring Sepset Beliefs. Now that we have the full distribution, we can apply LBU to update the neighbouring sepset beliefs. To update the sepset belief, $\psi_{Z_{m,n},W_{m,n}}(Z_{m,n})$, we need to marginalise over $\Phi$ and $W_{m,n}$,
$$
\psi_{Z_{m,n},W_{m,n}}(Z_{m,n})
= \int_{W_{m,n},\Phi} \Psi_{W_{m,n}}(W_{m,n}, Z_{m,n}, \Phi)\, dW_{m,n}\, d\Phi. \tag{19}
$$
In LDA, the word is observed ($W_{m,n} = v$) and this simplifies Eq. 19. For each topic, $k$, we can write,
$$
\tilde\psi_{Z_{m,n},W_{m,n}}(Z_{m,n}=k)
= \tilde p_{m,n,k} \int_{\phi_k} \frac{\Gamma\!\left(\sum_v \beta_{k,v}\right)}{\prod_v \Gamma(\beta_{k,v})}
\prod_v \phi_{k,v}^{[W_{m,n}=v]+\beta_{k,v}-1}\, d\phi_k
= \tilde p_{m,n,k} \int_{\phi_k} \frac{\Gamma\!\left(\sum_v \beta_{k,v}\right)}{\prod_v \Gamma(\beta_{k,v})}\,
\phi_{k,v} \prod_u \phi_{k,u}^{\beta_{k,u}-1}\, d\phi_k
= \tilde p_{m,n,k}\, \frac{\beta_{k,v}}{\sum_v \beta_{k,v}}, \tag{20}
$$
where $\tilde p_{m,n,k} = p(Z_{m,n}=k;\alpha_m)$ are the topic proportions that have recently been updated using information from the topic-document side of the graph. Equation 20 is un-normalised (as indicated by the $\tilde{\cdot}$) and has a volume of $\mathrm{vol}_{k,v} = Z$. We need to normalize the sepset belief for each topic by multiplying each $\tilde\psi_{Z_{m,n},W_{m,n}}(Z_{m,n}=k)$ by a normalizing constant to give,
$$
\psi_{Z_{m,n},W_{m,n}}(Z_{m,n}=k) = \frac{1}{Z}\, p_{m,n,k}\, \frac{\beta_{k,v}}{\sum_v \beta_{k,v}}. \tag{21}
$$
To calculate the updated sepset belief, $\psi_{W_{m,n},\Phi}(\Phi)$, from the word-topic distribution, $\Psi_{W_{m,n}}(W_{m,n}, Z_{m,n}, \Phi)$, we need to marginalise over all the variables except for $\Phi$,
$$
\psi_{W_{m,n},\Phi}(\Phi)
= \int_{W_{m,n},Z_{m,n}} \Psi_{W_{m,n}}(W_{m,n}, Z_{m,n}, \Phi)\, dW_{m,n}\, dZ_{m,n}. \tag{22}
$$
Once again, observation of the word simplifies matters. Inserting Eq. 18 into Eq. 22, and observing $W_{m,n} = v$, gives us the true marginal. We use an approximate message, derived in a similar manner to our approximation in Eq. 12,
$$
\tilde\psi_{W_{m,n},\phi_k}(\phi_k) =
\begin{cases}
0 & \text{with } W_{m,n} = 0\\
\quad\vdots\\
p_{m,n,k}\, \dfrac{\beta_{k,v}}{\sum_v \beta_{k,v}} & \text{with } W_{m,n} = v,\\
\quad\vdots\\
0 & \text{with } W_{m,n} = V
\end{cases} \tag{23}
$$
for each $\phi_k$. As with the update for the sepset belief, $\psi_{Z_{m,n},W_{m,n}}(Z_{m,n})$, normalisation over $K$ is required. For a single topic,
$$
p_{m,n,k,v} = \frac{1}{Z}\, p_{m,n,k}\, \frac{\beta_{k,v}}{\sum_v \beta_{k,v}}, \tag{24}
$$
with $\frac{1}{Z}$ used for normalisation. Note the additional subscript, $v$, in $p_{m,n,k,v}$ that is included to indicate these proportions are after observing the word.
4.4 Word-Topic Dirichlet Set Cluster, $\{\Phi\}$
The word-topic Dirichlet set contains $K$ Dirichlet distributions; one for each of the topics that the RV $Z$ can assume. The $k$'th word-topic Dirichlet factor contains the $k$'th RV, $\phi_k = \{\phi_{1,Z=k}, \phi_{2,Z=k}, \ldots, \phi_{V,Z=k}\}$, parameterised by $\beta_k = \{\beta_{1,Z=k}, \beta_{2,Z=k}, \ldots, \beta_{V,Z=k}\}$. Each factor in the cluster has an initial belief of,
$$
\Psi_{\phi_k}(\phi_k) = \mathrm{Dir}(\phi_k;\beta_k)
= \frac{\Gamma\!\left(\sum_v \beta_{k,v}\right)}{\prod_v \Gamma(\beta_{k,v})} \prod_v \phi_{k,v}^{\beta_{k,v}-1}, \tag{25}
$$
where $\beta_{k,v}$ are the hyperparameters for each word, $v$, in the vocabulary for the specific topic, $k$. Initially these hyperparameters will be the hyperparameters chosen before inference is performed, namely $\beta^{\mathrm{prior}}_{k,v}$ (for example $\beta^{\mathrm{prior}}_{k,v} = 0.1$ for all $v$ and all $k$). In further iterations these will be updated and no longer equal to the initial hyperparameter settings. Updating the Word-Topic Dirichlet Set Cluster, $\Psi_{\Phi}$. The word-topic Dirichlet set cluster belief, $\Psi_{\Phi}(\Phi)$, is updated by using the partial counts calculated in Eq. 24 to update the corresponding Dirichlet distribution's hyperparameters. The update from each sepset belief, $\psi_{W_{m,n},\phi_k}(\phi_k)$, to its respective Dirichlet distribution within the cluster belief, $\Psi_{\Phi}(\Phi)$, is performed in the same manner as done for the $\alpha$'s in Eq. 9. The Dirichlet hyperparameters are updated as follows,
$$
\beta_{k,v} = p_{m,n,k,v} + \beta_{k,v}, \tag{26}
$$
where,
$$
p_{m,n,k,v} = \frac{1}{Z}\, \frac{p_{m,n,k}}{p^{*}_{m,n,k,v}}\, \frac{\beta_{k,v}}{\sum_v \beta_{k,v}}, \tag{27}
$$
with $p^{*}_{m,n,k,v}$ being the previous iteration of the sepset (this accounts for the cancelling portion). It is important to note here that each $p_{m,n,k,v}$ contains both updated and adjusted $\alpha$ and $\beta$ hyperparameters and that the normalizing constant, $Z$, accounts for normalisation over all topics. At convergence, we will have final $\beta$ hyperparameters in the distribution defined in Eq. 25.
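To make Eqs. 24, 26 and 27 concrete, the sketch below applies the per-token word-topic update for a single observed word. It is our own illustration rather than the authors' implementation; the array shapes, variable names and the convention of using all-ones for the first-pass cancelling term are assumptions.

```python
import numpy as np

def update_word_topic(beta, p_mnk, v, p_prev=None):
    """One per-token update of the word-topic Dirichlet hyperparameters.

    beta   : (K, V) array of topic-word Dirichlet hyperparameters
    p_mnk  : (K,)   topic proportions p_{m,n,k} for this token
    v      : int    observed word id W_{m,n} = v
    p_prev : (K,)   previous iteration of this token's sepset, p*_{m,n,k,v}
    """
    if p_prev is None:
        p_prev = np.ones_like(p_mnk)          # no cancelling term on the first pass
    # Eq. 27: un-normalised partial counts, then 1/Z normalisation over topics.
    psi = (p_mnk / p_prev) * beta[:, v] / beta.sum(axis=1)
    p_mnkv = psi / psi.sum()
    # Eq. 26: add the partial counts to the hyperparameters of the observed word.
    beta = beta.copy()
    beta[:, v] = beta[:, v] + p_mnkv
    return beta, p_mnkv
```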
Updating the Sepset Belief, $\psi_{W_{m,n},\Phi}$. To calculate the new sepset belief, $\psi_{W_{m,n},\Phi}(\Phi)$, from the cluster belief $\Psi_{\Phi}(\Phi)$, no marginalisation is required, since $\Phi$ is the only variable in the cluster. For a single topic, $k$, we can write the sepset belief as,
$$
\psi_{W_{m,n},\phi_k}(\phi_k) = \mathrm{Dir}(\phi_k;\beta_k)
= \frac{\Gamma\!\left(\sum_v \beta_{k,v}\right)}{\prod_v \Gamma(\beta_{k,v})} \prod_v \phi_{k,v}^{\beta_{k,v}-1}, \tag{28}
$$
which are simply the word-topic Dirichlet distributions for each topic.
5 Method
In this study, we compare the topic extraction performance of three algorithms, namely ALBU, collapsed Gibbs sampling and VB (or VMP) using two text corpora.
5.1 Text Corpora
We limit our analyses to corpora on which topic extraction is expected to be difficult and which are likely to emphasise performance differences among the algorithms. We therefore choose corpora that fulfill the three criteria enumerated below.
1. Few documents: Algorithms generally perform better on larger text corpora than on smaller ones. In order to assess the ability of various algorithms to extract useful information from text corpora, we test our algorithms on small corpora, and compare these results with existing algorithms. Using smaller corpus sizes also shortens execution time and requires less memory, factors which are beneficial since all simulations are run on a local machine.
2. Short document length: The document length affects the ability of the algorithm to gain confidence in the topic-document distributions. Using short documents makes inference more difficult, since few tokens remain per document after pre-processing. The advantages of using short documents for algorithm evaluation are similar to using few documents.
3. Large topic overlap: Where topics overlap appreciably, it is hard for inference algorithms to distinguish between topics. Differences in performance among algorithms could therefore become more pronounced. The corpus containing tweets about the Coronavirus pandemic falls into this category.
Two text corpora are chosen, (1) the 20 Newsgroups corpus, and (2) a selection of 20,000 Covid-related tweets, Covid Tweets. Both of these corpora are considered small (they contain few terms per document and also contain a relatively small number of documents). Word clouds of the two corpora are shown in Fig. 3.
738
R. M. C. Taylor and J. A. du Preez
(a) Word cloud for the 20 Newsgroups corpus. We see that people and one are two of the most common words. There are many general words not obviously relating to a specific topic.
(b) Word cloud for the Covid Tweets corpus. The token covid has been removed since it is present in every tweet. The coronavirus token has been retained in the corpus.
Fig. 3. Word clouds are based on the frequency of individual words occurring in a corpus. In (a) we show the 20 Newsgroups corpus and in (b) the Covid Tweets corpus.
In all of the text corpora we remove all characters that are not in the English alphabet. Standard text processing techniques are then applied such as lemmatisation and stopword removal. We also remove some specific additional stopwords for each corpus, such as corona from Covid Tweets because this is the word used to identify the selection of tweets. Documents shorter than 4 tokens in length after initial preprocessing are discarded. The token frequency histograms (after preprocessing) are shown in Fig. 4.
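A rough sketch of this preprocessing pipeline is given below. It is our illustration, not the authors' script: the exact stopword lists, the choice of lemmatiser and the placeholder raw_documents list are assumptions (the WordNet lemmatiser requires the NLTK wordnet data to be downloaded).

```python
import re
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer

EXTRA_STOPWORDS = {"corona"}                            # corpus-specific stopwords (example)
lemmatizer = WordNetLemmatizer()

def preprocess(doc: str):
    tokens = re.findall(r"[a-zA-Z]+", doc.lower())      # keep English letters only
    tokens = [lemmatizer.lemmatize(t) for t in tokens]  # lemmatisation
    return [t for t in tokens if t not in STOPWORDS and t not in EXTRA_STOPWORDS]

raw_documents = ["Wear a mask and get tested for corona today",
                 "New covid cases reported in the city hospital"]   # placeholders
docs = [preprocess(d) for d in raw_documents]
docs = [d for d in docs if len(d) >= 4]                 # discard documents under 4 tokens
```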
5.2 Metric and Hyperparameter Selection
Held-out perplexity [10] has been the most popular evaluation metric for topic models. However, a large-scale human topic labeling study by Chang et al. [10] demonstrated that low held-out perplexity is often poorly correlated with interpretable latent spaces. Coherence on the other hand, has been shown to be highly correlated with human interpretability of topics [39]. In a comprehensive study of multiple coherence measures [39], the CV coherence score had the highest correlation with human topic ratings. We therefore use this as our measure of performance. Standard practice has been to choose the same hyperparameter settings over all data sets and algorithms, popular choices being α = β = 0.1 [30], and α = β = 0.01 and α = 0.1, β = 0.01 [53]. We choose a list of standard hyperparameter settings and evaluate their performance based on coherence metrics as recommended by [2]. Hyperparameters of α = β = 0.1 worked well over all algorithms over a larger range of topics, and these were therefore used for evaluation. Based on initial convergence tests, we fix the number of epochs for both ALBU and VB to 150 epochs for the text data sets. For collapsed Gibbs sampling, 5,000 samples are used after an initial 2,000 iterations with rather poor results.
(a) Histogram for documents with under 500 words in length. (b) Tweet lengths in the Covid Tweets corpus.
Fig. 4. Histogram showing document length after preprocessing for the (a) 20 Newsgroups and (b) Covid Tweets corpora.
This is significantly more than the 2,000 samples recommended in the lda Python package and the 1,000 used by Zeng et al. [53].
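For reference, a VB baseline and the C_V coherence evaluation described above can be run with gensim roughly as follows. This is a hedged sketch rather than the authors' evaluation script; docs is the list of tokenised documents from the preprocessing step, and the parameter values shown reflect the settings reported in this section.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Variational Bayes baseline with alpha = beta = 0.1 and 150 passes over the corpus.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=13,
               alpha=0.1, eta=0.1, passes=150, random_state=0)

top_words = [[w for w, _ in lda.show_topic(k, topn=10)]
             for k in range(lda.num_topics)]

cm = CoherenceModel(topics=top_words, texts=docs, dictionary=dictionary,
                    coherence="c_v", window_size=50)
print(cm.get_coherence())
```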
6 Results
We now present the topic extraction performance of the various algorithms based on both coherence and by inspecting the topics manually.
6.1 20 Newsgroup Corpus
Based on both Fig. 5a, and inspection of the actual words within the topics (topics extracted by ALBU shown in Table 2), it is evident that the algorithms all perform relatively well at K = 13. When extracting more than 13 topics, however, ALBU and collapsed Gibbs sampling perform significantly better than VB. Many of the topics can be directly mapped to a specific newsgroup, as can be seen in Table 2 where 6 of the 13 topics are shown. However, some topics contain words that seem to come from multiple newsgroups. This is to be expected, since some of the newsgroups address similar subjects. For example, the topic to which we have assigned the label hardware contains words that predominantly occur in two of the newsgroups, namely comp.sys.ibm.pc.hardware and comp.sys.mac.hardware. This is the case with a number of the topics, and it is therefore not surprising that the most coherent topics extracted from the 20 Newsgroups corpus occur at the settings of 13 ≤ K ≤ 20. Inspection of the top 10 words per topic for K = 13 reveals that the three algorithms yield similar results for many of the topics. The differences between Gibbs and ALBU do not seem semantically significant, but for VB there are topics that could be considered incongruent mixtures of topics. For example VB extracts a topic which seems to be a mixture of talk.politics.guns and
Fig. 5. Cv scores for the (1) 20 Newsgroups and (2) Covid Tweets corpora.
Table 2. Top 10 words in 6 of the topics extracted from the 20 Newsgroup text corpus where K = 13 for ALBU. We assign a topic category to each topic based on themes captured by most of these top ranking words. Some are mixtures of newsgroups, such as comp.sys.ibm.pc.hardware and comp.sys.mac.hardware, which we have called Hardware.

hardware | politics.mideast | religion.misc | sci.med  | sci.space  | talk.politics.guns
drive    | armenian         | god           | drug     | space      | gun
card     | people           | one           | health   | system     | law
one      | said             | would         | medical  | nasa       | right
use      | turkish          | one           | one      | launch     | people
problem  | jew              | christian     | doctor   | satellite  | state
scsi     | one              | jesus         | use      | water      | would
system   | israel           | say           | disease  | may        | file
disk     | israeli          | think         | food     | earth      | government
bit      | nazi             | believe       | patient  | mission    | work
woman    | bible            | day           | weapon   | technology | crime
rec.motorcycles since it contains words like gun, firearm, weapon and handgun as well as bike, motorcycle and ride. It is important to note that although the top 10 words are very similar for most of the extracted topics (especially for Gibbs and ALBU), we use a coherence window of 50 which takes more of the words into account and this is why there are differences in the coherence values, even when the top 10 words per topic are almost identical.
6.2 Covid Tweets
Tweets are short messages that are posted by users on the Twitter platform and are typically only a sentence or two in length. We expect coherence values to be low for such short documents, especially with such small corpora because
there is less information available from which to learn interpretable topics. From Fig. 5b we can see that VB performs more poorly than do the ALBU and Gibbs sampling algorithms. Table 3. Top 15 words in 6 of the topics extracted from the Covid Tweets corpus where K = 8 for ALBU. masks
children cases
trump
spread help
testing
mask
like
case
trump
people
get
new
realdonaldtrump health
test
face
school
death
via
spread
hospital
make
back
total
american
risk
patient
wear
know
day
every
even
testing
home
want
india
give
symptom
tested
safe
would
number
sign
daily
need
life
going
report
support
earth
say
keep
see
today
month
self
minister chief
positive
stay
still
last
change
slow
social
think
positive
worker
community pradesh
one
thing
reported
government
identify
doctor
wearing
kid
update
need
sooner
govt
everyone one
hour
million
long
care
please
confirmed america
impact
police
look
We inspect the extracted topics at K = 8 for ALBU and K = 9 for the collapsed Gibbs sampling algorithm. By inspecting 6 of the 8 Covid topics extracted by ALBU in Table 3, we see coherent, relevant topics such as mask wearing and testing. The topics extracted by collapsed Gibbs sampling are very similar to these, especially with K = 9. VB extracts significantly fewer coherent topics; for example, two of the topics relate to both Donald Trump and children (trump, school, and america were top words in one, and realdonaldtrump, american and school in another topic).
7 Discussion
We now discuss the results in light of the algorithmic differences between VB, ALBU and collapsed Gibbs sampling. ALBU and collapsed Gibbs sampling perform similarly for the two corpora presented above, although ALBU seems to marginally outperform collapsed Gibbs sampling for the Covid tweets corpus. We suspect that with even longer chains, collapsed Gibbs sampling would achieve similar results to that of
ALBU. Our results confirm that ALBU shows similar performance to collapsed Gibbs sampling but with more consistent results (we do not need to wait as long for convergence). This indicates that ALBU can be used as an alternative to collapsed Gibbs sampling. For both corpora, VB does not perform as well as the other algorithms, especially for the Covid tweets corpus. The Covid tweets corpus is a difficult corpus from which to extract coherent topics, since the topics have significant word overlap. This makes the differences in topic extraction performance between the algorithms more prominent.
7.1 Algorithmic Comparison
Since VB is the algorithmic equivalent of VMP, we can compare ALBU to VMP as they are both message passing algorithms. There are three main differences between the VMP and ALBU approaches.
1. VMP does not cancel out reverse-direction messages [49].
2. ALBU uses full distributions where possible and does not explicitly use expected values.
3. The approximate messages used by ALBU to update the Dirichlet distributions are equivalent to the sampling solution, but in closed form.
Message cancelling is known to stop an algorithm from becoming too overconfident in its predictions; we believe that VB becomes overconfident that certain words are in a topic, and that certain topics are in a document, whereas ALBU maintains a sufficient amount of uncertainty. In terms of our Dirichlet update equations, we believe that using the same messages that hierarchical sampling would send (but in a closed form) allows ALBU to converge with minimal bias. These differences cause VB and ALBU to perform differently when extracting topics. We believe that this behavior is due to the combination of ALBU's explicit message cancelling as well as its superior message updates (equivalent to that of hierarchical sampling) of the word-topic Dirichlet set members. Important future work should include the application of these techniques to other similar graphical models. This could enhance the performance of many other Bayesian models. Furthermore, the impact of applying our closed form message updates to the VMP algorithm directly could be valuable. This would result in an algorithm that is computationally simpler than VMP and VB due to the exclusion of digamma functions.
8 Conclusion
Variational Bayes, the standard approximate inference algorithm for LDA, performs well at aspect identification when applied to large corpora of documents but when applied to smaller domain-specific data sets, performance can be poor. In this article we present an alternative approximate technique, based on loopy
belief update (LBU) for LDA that is more suited to small or difficult data sets because more information is retained (we do not use expectations but rather full distributions). We test our ALBU implementation against the Python Gensim VB implementation on two difficult data sets. We benchmark these results by including the standard collapsed Gibbs sampling algorithm (allowing an extended period for convergence). The performance of ALBU is significantly better than that of VB and comparable to that of Gibbs, but at a lower computational cost. The findings in this article indicate that although we live in a world where large volumes of data are available for model training and inference, using fully Bayesian principles can provide improved performance when problems get hard (without significant extra complexity). Based on our findings we invite others to apply the principles that we have used in ALBU to other similar graphical models with Dirichlet-Categorical relationships.
References 1. Albakour, M., Macdonald, C., Ounis, I., et al.: On sparsity and drift for effective real-time filtering in microblogs. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 419–428. ACM (2013) 2. Asuncion, A., Welling, M., Smyth, P., Teh, Y.W.: On smoothing and inference for topic models. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 27–34. AUAI Press (2009) 3. Backenroth, D., et al.: FUN-LDA: a latent Dirichlet allocation model for predicting tissue-specific functional effects of noncoding variation, p. 069229. BioRxiv (2017) 4. Basave, A.E.C., He, Y., Xu, R.: Automatic labelling of topic models learned from Twitter by summarisation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 618–624 (2014) 5. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006) 6. Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112(518), 859–877 (2017) 7. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003) 8. Buntine, W.: Variational extensions to EM and multinomial PCA. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) ECML 2002. LNCS (LNAI), vol. 2430, pp. 23–34. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36755-1 3 9. Celikyilmaz, A., Hakkani-T¨ ur, D., Feng, J.: Probabilistic model-based sentiment analysis of Twitter messages. In: 2010 IEEE Spoken Language Technology Workshop, pp. 79–84. IEEE (2010) 10. Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L., Blei, D.M.: Reading tea leaves: how humans interpret topic models. In: Advances in Neural Information Processing Systems, pp. 288–296 (2009) 11. Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 524–531. IEEE (2005)
12. Feuerriegel, S., Pr¨ ollochs, N.: Investor reaction to financial disclosures across topics: an application of latent Dirichlet allocation. Decis. Sci. 52(3), 608–628 (2018) 13. Foulds, J., Boyles, L., DuBois, C., Smyth, P., Welling, M.: Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 446–454. ACM (2013) 14. Han, B., Baldwin, T.: Lexical normalisation of short text messages: makn sens a# Twitter. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 368–378 (2011) 15. Heinrich, G.: Parameter estimation for text analysis. Technical note version 2.4, Vsonix GmbH and University of Leipzig, August 2008, 2010 16. Heinrich, G.: Parameter estimation for text analysis. Technical report (2005) 17. Heskes, T.: Stable fixed points of loopy belief propagation are local minima of the bethe free energy. In: Advances in Neural Information Processing Systems, pp. 359–366 (2003) 18. Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the First Workshop on Social Media Analytics, pp. 80–88. ACM (2010) 19. Jabeur, L.B., Tamine, L., Boughanem, M.: Uprising microblogs: a Bayesian network retrieval model for tweet search. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, pp. 943–948. ACM (2012) 20. Jin, O., Liu, N.N., Zhao, K., Yu, Y., Yang, Q.: Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 775–784. ACM (2011) 21. Knowles, D.A., Minka, T.: Non-conjugate variational message passing for multinomial and binary regression. In: Advances in Neural Information Processing Systems, pp. 1701–1709 (2011) 22. Koller, D., Friedman, N., Bach, F.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009) 23. Kschischang, F.R., Frey, B.J., Loeliger, H.-A.: Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory 47(2), 498–519 (2001) 24. Lauritzen, S.L., Spiegelhalter, D.J.: Local computations with probabilities on graphical structures and their application to expert systems. J. Roy. Stat. Soc.: Ser. B (Methodol.) 50(2), 157–194 (1988) 25. Liu, C., Lin, H., Gong, S., ji, Y., Liu, Q.: Learning topic of dynamic scene using belief propagation and weighted visual words approach. Soft. Comput. 19(1), 71–84 (2014). https://doi.org/10.1007/s00500-014-1384-8 26. Mehrotra, R., Sanner, S., Buntine, W., Xie, L.: Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 889–892. ACM (2013) 27. Minka, T.: Estimating a Dirichlet distribution (2000) 28. Minka, T., Lafferty, J.: Expectation-propagation for the generative aspect model. In: Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 352–359. Morgan Kaufmann Publishers Inc. (2002) 29. Minka, T.P.: Expectation propagation for approximate Bayesian inference. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann Publishers Inc. (2001) 30. Mukherjee, I., Blei, D.M.: Relative performance guarantees for approximate inference in latent Dirichlet allocation. 
In: Advances in Neural Information Processing Systems, pp. 1129–1136 (2009)
31. Murphy, K.P., Weiss, Y., Jordan, M.I.: Loopy belief propagation for approximate inference: an empirical study. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 467–475. Morgan Kaufmann Publishers Inc. (1999) 32. Murphy, K.P.: Dynamic Bayesian networks: representation, inference and learning. Dissertation, Ph.D. thesis, UC Berkley, Department of Computer Science (2002) 33. Nugroho, R., Molla-Aliod, D., Yang, J., Zhong, Y., Paris, C., Nepal, S.: Incorporating tweet relationships into topic derivation. In: Hasida, K., Purwarianti, A. (eds.) Computational Linguistics. CCIS, vol. 593, pp. 177–190. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-0515-2 13 34. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan kaufmann, San Francisco (1988) 35. Phan, X.-H., Nguyen, L.-M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International Conference on World Wide Web, pp. 91–100. ACM (2008) 36. Pritchard, J.K., Stephens, M., Donnelly, P.: Inference of population structure using multilocus genotype data. Genetics 155(2), 945–959 (2000) 37. Ramage, D., Dumais, S., Liebling, D.: Characterizing microblogs with topic models. In: Fourth International AAAI Conference on Weblogs and Social Media (2010) 38. Robert, C.P., Casella, G., Casella, G.: Monte Carlo Statistical Methods, vol. 2. Springer, New York (2004). https://doi.org/10.1007/978-1-4757-4145-2 39. R¨ oder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 399–408. ACM (2015) 40. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494. AUAI Press (2004) 41. Shafer, G.R., Shenoy, P.P.: Probability propagation. Ann. Math. Artif. Intell. 2(1), 327–351 (1990). https://doi.org/10.1007/BF01531015 42. Shirota, Y., Hashimoto, T., Sakura, T.: Extraction of the financial policy topics by latent Dirichlet allocation. In: TENCON 2014-2014 IEEE Region 10 Conference, pp. 1–5. IEEE (2014) 43. Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering object categories in image collections (2005) 44. Wainwright, M.J., Jordan, M.I., et al.: Graphical models, exponential families, and variational inference. Found. Trends® Mach. Learn. 1(1–2), 1–305 (2008) 45. Wang, X., Grimson, E.: Spatial latent dirichlet allocation. In: Advances in Neural Information Processing Systems, pp. 1577–1584 (2008) 46. Wang, X., McCallum, A.: Topics over time: a non-Markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 424–433. ACM (2006) 47. Weng, J., Lim, E.-P., Jiang, J., He, Q.: TwitterRank: finding topic-sensitive influential Twitterers. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 261–270. ACM (2010) 48. Winn, J., Bishop, C.M.: Variational message passing. J. Mach. Learn. Res. 6, 661– 694 (2005) 49. Winn, J.M.: Variational message passing and its applications. Ph.D. thesis, Citeseer (2004) 50. Woodbury, M.A., Manton, K.G.: A new procedure for analysis of medical classification. Methods Inf. Med. 21(04), 210–220 (1982)
51. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1445– 1456. ACM (2013) 52. Zeng, J.: A topic modeling toolbox using belief propagation. J. Mach. Learn. Res. 13, 2233–2236 (2012) 53. Zeng, J., Cheung, W.K., Liu, J.: Learning topic models by belief propagation. IEEE Trans. Pattern Anal. Mach. Intell. 35(5), 1121–1134 (2012) 54. Zhai, K., Boyd-Graber, J., Asadi, N.: Using variational inference and MapReduce to scale topic modeling. arXiv preprint arXiv:1107.3765 (2011) 55. Zhai, K., Boyd-Graber, J., Asadi, N., Alkhouja, M.L.: Mr. LDA: a flexible large scale topic modeling package using variational inference in mapreduce. In: Proceedings of the 21st International Conference on World Wide Web, pp. 879–888 (2012)
Retrospective Analysis of Global Carbon Dioxide Emissions and Energy Consumption Rajvir Thind and Lakshmi Babu Saheer(B) Anglia Ruskin University, Cambridge, UK [email protected]
Abstract. The international climate emergency has triggered actions for understanding and controlling the Green House Gases (GHGs) all over the world to achieve the net zero "dream". It is important to understand the global trend in the GHG emissions, particularly focusing on Carbon dioxide. This paper investigates the trends in carbon dioxide emissions between 1965 and 2019. A retrospective look at this data will help us understand the major global trends and help design better strategies to work towards the zero carbon target. This data analysis will look at a breakdown of carbon emissions by geographical regions, which could provide a greater insight into the regions experiencing a rise in carbon emissions. The relationship between energy consumption and carbon emissions is also tested. The work also aims to provide a forecast of future energy consumption based on the learning from this analysis.
Keywords: Data analysis · Carbon emissions · Energy consumption · Renewable energy · Statistics
1 Introduction
The Green House Gases (GHGs), mainly carbon dioxide (CO2), trap heat in the atmosphere, which is a natural phenomenon that keeps the Earth warm and maintains the balance of life. CO2 is very abundant and stays longer in the atmosphere compared to the other GHGs like methane or nitrous oxide. Recently, this balance of temperature and life on Earth has been tampered with by the unprecedented increase in the amount of CO2 emissions in the atmosphere. According to the National Oceanic and Atmospheric Administration (NOAA), the level of atmospheric CO2 is at its highest point in 3 million years. The temperature at that point in time was 2–3 °C (3.6–5.4 °F) higher and sea levels 15–25 m higher than today. CO2 has been increasing at a rate around 100 times faster in the past 60 years compared to previous natural increases, such as those that occurred 11,000–17,000 years ago. This increase has resulted in trapping additional heat in Earth's atmosphere and slowly increasing the global temperature. Carbon dioxide is a colourless and odourless naturally occurring gas in the earth's atmosphere which is made up of one carbon atom and two oxygen
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 506, pp. 747–761, 2022. https://doi.org/10.1007/978-3-031-10461-9_51
atoms [2]. The negative connotations of carbon dioxide often lead to the misconception of it being harmful. However, a natural amount of CO2 plays a crucial part in maintaining our ecosystem. It only causes damage to the environment when there is an excess of CO2 , usually generated by man-made activities such as fossil fuel use [13]. Excess CO2 can be a problem because it acts like a “greenhouse gas”. Due to its molecular structure, CO2 absorbs and emits infrared radiation, warming the Earth’s surface and the lower levels of the atmosphere [3]. Scientists and experts believe that carbon emissions need to be reduced or there may be irreversible harm to the global ecosystem, it is therefore necessary to measure emissions of carbon dioxide so that progress (or lack of progress) can be identified and the necessary changes can be implemented. The main contribution of atmospheric CO2 is from burning fossil fuels. The energy industry has been keen to look at this information. BP plc was one of the first companies to gather this data in terms of global energy consumption and release of CO2 [7]. The data since 1965 is available for analysis. This data analysis aims to explore carbon dioxide emissions data from this dataset. This paper will first look at some related literature followed by analysis and some insights from this analysis.
2 Literature Review
A group of scientists in the 19th century discovered that atmospheric gases produce a “greenhouse effect”, a natural process in which the accumulation of gases in the atmosphere absorb energy and contributes to a rise in temperature of the earth. However, this was even seen as beneficial, as per the Swedish scientist, Svante Arrhenius whose perspective was that it could lead to higher agricultural yield in colder climates. The concept of climate change was largely dismissed until the 20th century, at which point several studies of climate change began to emerge. The study of carbon isotopes by Hans Suess [10] revealed that carbon dioxide was not absorbed by the ocean immediately. Another significant study was conducted by the Stanford research institute in 1968. It was predicted that: “If the Earth’s temperature increases significantly, a number of events might be expected to occur, including the melting of the Antarctic ice cap, a rise in sea levels and a warming of the oceans” [8]. Carbon dioxide emissions have increased for several reasons; however, it is widely regarded that human activities, including the production and consumption of oil, natural gas and coal are the main cause of rising carbon emissions. These activities are linked to economic growth and industrialisation in emerging economies. However it has been discovered that higher income levels increases the demand for policies which promote environmental protection, reducing emissions in developed countries [11]. Another paper which investigates the relationship between carbon emissions, energy consumption and economic growth in China [15], concludes that the government of China is able to implement policies aimed at reducing carbon emissions without compromising economic growth. Although this paper finds that China is one of the world’s major emitters of carbon
dioxide, contradicting its findings. Developed countries face different challenges to the rest of the world; for example, countries such as the UK have experienced structural change in their economy. A shift towards the service sector means that the manufacturing and agricultural sectors have declined, meaning a larger proportion of goods consumed by UK households are produced abroad. This increases transport related carbon emissions [1]. The lack of technology meant that more quantitative research and analysis was not undertaken until recently. There are now several publications looking at several different aspects of carbon emissions. Bulent Tutmez [12] looked at trends in energy emissions. The work uses trend analysis to predict future emissions; however, it concludes that carbon emissions have several influencing factors which should be considered alongside the projections provided in the paper. The effects of the COVID-19 pandemic demonstrate this. Recent studies by Liu et al. [4] reveal the effect of the COVID-19 pandemic on global CO2 emissions. The research finds that there was an 8.8% drop in carbon dioxide emissions in the first half of 2020 compared to the same period in 2019. This outlines the difficulty in accurately predicting future carbon emissions. Chi Xu [14] investigates the distribution of human populations and its environmental niche. The study states that global warming will change human health, food and water supply and economic growth. A strength of this analysis is that it was funded by reputable bodies such as the European Research Council and uses various graphical techniques to visualise its findings. However, it is not clear if the paper offers a balanced view of the facts. The projection for MATs (Mean Annual Temperatures) for 2070 has been referred to as a 'worst case scenario' by many scientists. Mark Maslin, an earth science professor at UCL, claims that the study does not consider the dynamic and adaptable nature of human technology and society [5]. Nevertheless, the paper provides several valid arguments and exemplifies the need to study carbon emissions and their effects in greater detail. The effect of carbon emissions on global warming is one of the recurring themes of carbon emissions analysis that can be found online, and research has been conducted by major organisations and government institutions. NASA has created a "Global Climate Change Website" [6] which provides interactive visualisations, appealing to those from non-science backgrounds, for visitors to view. The presence of freely available and comprehensible information has piqued the interest of the public. A survey conducted by the UK Government Department for Business, Energy and Industrial Strategy found that 76% of respondents were somewhat concerned about global warming [9]. It is therefore worthwhile to investigate the relationship between carbon dioxide emissions and global warming in greater detail. This paper will also explore, using time series analysis, the reasons behind changes in carbon emissions and will conclude with 'attempting' to forecast when renewable energy will account for all of the world's energy consumption, something that is seen as a crucial objective yet is lacking from previous research and analysis on this topic. The data for carbon emissions from the burning of fossil fuels have been published by BP Plc [7]. The company has also analysed this dataset and publishes an annual report which provides insights into world energy production and
consumption. Its main focus is analysis of the world energy market from the previous year. The 2020 edition analyses world energy data and trends prior to the COVID-19 pandemic, covering areas such as the oil, natural gas and coal markets. Renewable energy is also becoming a prominent feature in recent annual reports as it continues its rapid growth. Despite being a highly detailed report, carbon emissions form only a very small fraction of the total analysis and therefore it can't be considered a very comprehensive analysis of carbon emissions data. This proposed research will analyze the details in the same dataset. The dataset seems quite exhaustive. The main findings in the report do refer to China being the "biggest driver of energy" and "accounting for more than three quarters of net global growth" [7]. This research will validate these claims on the BP dataset and at the same time look at new insights to be gained from the dataset. To the best of our knowledge, there has not been any other analysis of the carbon dioxide and energy consumption section of the dataset, making this paper the most comprehensive analysis of BP's carbon dioxide and energy consumption data.
3 Data Analysis
3.1 Carbon Emissions Introductory Analysis
This data analysis was performed using Python and various libraries. We will begin by investigating the overall trend of carbon emissions between 1965 and 2019. Below is a time series graph demonstrating this:
Fig. 1. Carbon emissions 1965–2019
We can draw the following key findings from Fig. 1:
– There is a clear upwards trend with some minor fluctuations.
– The largest increase happened between 2009 and 2010; however, the largest decrease happened the year before, in 2008, the year of the global recession. Carbon emissions could have reduced due to lower energy consumption as a result of higher prices for oil and gas. This suggests that the main reason for the sharp increase in carbon emissions the following year was the world recovering from the event.
– Carbon emissions reached their highest ever point in 2019, with a value of 34,169 million tonnes of carbon.
To gain a better understanding of this time series data, a rate of change graph can be plotted to find outliers, which in this case are specific years with higher than usual growth. In a statistical sense, an appropriate definition is: "a point which is 2 standard deviations away from the mean." Standard deviation is a measure of variance, which in simple terms is the extent to which individual data points are spread out from the mean. In a normal distribution, around 68% of the data lies within one standard deviation of the mean and approximately 95% of the data is within 2 standard deviations. A point more than two standard deviations from the mean is therefore very likely to be an outlier. Figure 2 confirms the earlier point on the exceptional rate of change in 2009.
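The sketch below illustrates this outlier rule. It is our own example rather than the paper's code; the file name and column names are assumptions.

```python
import pandas as pd

df = pd.read_csv("global_co2_emissions.csv")         # assumed columns: year, emissions
change = df["emissions"].pct_change() * 100           # year-on-year rate of change (%)

mean, std = change.mean(), change.std()
outlier_years = df.loc[(change - mean).abs() > 2 * std, "year"]
print(outlier_years.tolist())                         # years with unusually large changes
```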
Fig. 2. Carbon emissions rate of change 1965–2019
3.2 Carbon Emissions Breakdown by Region
A breakdown of carbon emissions by region will provide a greater insight into what geographical regions are experiencing a rise in carbon emissions. For this breakdown, countries are placed into 7 different regions based on their geographical location. The regions are listed as follows: “North America, South and Central America, Europe, CIS, Middle East, Africa and Asia Pacific.
Fig. 3. Choropleth map of carbon emissions
Figure 3 is a choropleth map which are used to represent data through shading patterns on geographical areas. In this case it shows the variability of carbon emissions across the world, the legend on the right hand side translates the various shading patterns to numerical data. This type of visualisation does well to illustrate how carbon emissions vary around the world and is easy to understand for the reader. A disadvantage of this is that it is difficult to identify exact values Figure 4 is a nested pie chart, it does well to summarise a dataset in visual form and is more detailed than a traditional pie chart as it has more than one ‘level’. For example, we can see that Asia Pacific is the biggest contributor of carbon emissions and China is the biggest contributor in Asia Pacific. The reader can make an immediate analysis at a glance. However, like the previous visualisation it is difficult to get an exact value. Figure 5 is an ECDF plot (Empirical Cumulative Distribution Function). It allows the plotting of data from the lowest value to the greatest value and visualise the distribution of the dataset. It also shows the percentage of data that has a particular x value. For example, Fig. 5 shows that almost 90% of countries had carbon emissions below 1000 million tonnes of carbon in 2019 with a handful of countries exceeding that threshold.
Fig. 4. Nested pie chart of carbon emissions
Fig. 5. ECDF plot of carbon emissions
Figure 6 is a time series graph of carbon emissions between 1965 and 2019 showing growth by region. It is clear that Asia Pacific's emissions are rising whereas the other regions have levelled off. Another interesting observation is that in 1965 they were among the lowest emitters of carbon dioxide. We can deduce the following key findings from these visualisations:
– Asia Pacific is responsible for approximately half of the world's carbon emissions; a significant part of this is China.
– The US accounts for more than 70% of North America's carbon emissions.
– South and Central America is the region with the lowest carbon emissions.
– Around 90% of countries had individual carbon emissions below 1000 million tonnes of carbon in 2019. Perhaps if the other 10% took action to reduce their carbon emissions, it could help to bring worldwide emissions down.
– Asia Pacific's emissions are rising whereas other regions have levelled off; this implies that the Asia Pacific region is the main driving force for the increase in global carbon emissions.
– African data may not be valid as there are several countries not included in the data (the choropleth map has several blank spaces in the African region).
Fig. 6. Carbon emissions breakdown by region 1965–2019
3.3 Energy Consumption
It is widely considered that energy consumption and carbon dioxide emissions are related; to test this relationship, Pearson's correlation coefficient can be used. This measures the statistical relationship between two continuous variables: it gives information about the magnitude of the association, or correlation, as well as the direction of the relationship.
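The test described here amounts to a single library call; the sketch below is illustrative, with the file and column names assumed.

```python
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("energy_and_emissions.csv")       # assumed columns: energy, emissions
r, p_value = pearsonr(df["energy"], df["emissions"])
print(round(r, 2), p_value)                         # Sect. 3.3 reports r of about 0.99
```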
Fig. 7. Relationship between energy consumption and carbon emissions by region
Plotting energy consumption and carbon emissions results in a clear linear relationship consistent with all geographical regions (as shown in Fig. 7). Using Pearson’s correlation coefficient to test the magnitude of the relationship results in a coefficient of 0.99, where 1 is an exact linear relationship and 0 is no relationship, implying a very strong association. More specifically, it is the consumption of fossil fuels such as oil, coal and gas that is the main contributor to carbon emissions. By plotting the consumption of each of these fossil fuels against carbon and calculating the correlation coefficient for each relationship, we can determine which fossil fuel has the closest relationship with carbon (Fig. 8). It seems that coal and oil seem to have the closest correlation to carbon consumption, to verify this, Pearson’s correlation coefficient has been used. Coal and oil have a close to perfect positive correlation with carbon (correlation coefficients are 0.95 and 0.94, respectively), with gas having a slightly lower but still strong relationship with carbon (correlation coefficient of 0.76). Some fossil fuel plots have more outliers than others. This study uses only what was provided in the BP dataset and did not look for other factors contributing to emissions which might give a better understanding of these correlations. The distribution of usage of different types of fossil fuels varies for each country or region which could explain the high variance.
Fig. 8. Correlation between oil, coal gas and carbon dioxide emissions
3.4 Renewable Energy Consumption
Whilst this paper is mostly focused on analysing carbon emissions and the reasons why they might be increasing, it is also worthwhile to explore renewables consumption data, as many scientists believe that by switching to renewable energy, carbon emissions can be brought down. As a result, analysis of renewables consumption is relevant. Figure 9 is a simple time series graph of renewables consumption between 1965 and 2019. We can see that there was slow growth between 1965 and 2000 and then a steep rise until 2019. By breaking down Fig. 9 to show renewables consumption by region (Fig. 10), we can see that the Asia Pacific region is experiencing the greatest growth. This is interesting as it is also responsible for the steepest rise in carbon emissions. Europe and North America are following closely behind. Whilst the growth in renewables consumption has been close to exponential in recent years, it is still insignificant compared to total energy consumption. As shown in Fig. 11, forecasting total energy consumption and renewable energy consumption can provide an insight into how the current trend will progress into the future. The "prophet" library, an open source library which uses an additive regression model, produced some interesting results. This is shown in Fig. 12.
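A forecast of this kind can be set up roughly as below. This is a hedged sketch, not the authors' exact configuration: the file and column names are assumptions, and depending on the installed version the package is imported as prophet or fbprophet. Prophet expects a dataframe with a ds date column and a y value column.

```python
import pandas as pd
from prophet import Prophet

energy = pd.read_csv("total_energy_consumption.csv")     # assumed columns: year, consumption
df = pd.DataFrame({
    "ds": pd.to_datetime(energy["year"], format="%Y"),
    "y": energy["consumption"],
})

m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=46, freq="YS")   # forecast 46 years ahead
forecast = m.predict(future)
fig = m.plot(forecast)    # dark line: most likely forecast (yhat); shaded area: bounds
```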
Fig. 9. Renewable energy consumption 1965–2019
Fig. 10. Renewable energy breakdown by region
Fig. 11. Total energy and renewable energy consumption 1965–2019
The dark blue line represents the most likely forecast and the light blue shaded area represents the bounds. We can see that growth in total energy consumption is mostly linear, although the shaded area underneath the line is unlikely to be valid, as scientists predict that energy consumption is unlikely to reduce as developing and emerging economies industrialise and the world's population increases. The forecast for renewable energy shows growth even in a worst-case scenario (Fig. 13); however, the forecast is mostly linear, which is unexpected. It also never reaches the level of total energy consumption, which is what scientists predict will happen in the next 30 years. This disparity is likely because this forecast predicts the next 46 years using historical data which only accounts for 55 years. The model also assumes that there will not be a drastic rate of change in renewables consumption, whereas scientists feel that there will be; this leads to the problem of overfitting. Whilst this forecast is likely not valid, it is still interesting to see how the current trend would play out in the future if there is no action by countries to increase investment and consumption in renewable sources of energy.
Fig. 12. Forecast for total energy consumption
Fig. 13. Forecast for renewable energy consumption
4 Conclusion
Through the analysis of this data, we can conclude the following:
– Carbon emissions have maintained a constant upwards trend since 1965 with an average rate of change of 2%. The financial crisis in 2008 led to the largest reduction in carbon emissions between 1965 and 2019; this was quickly followed by a large increase in carbon emissions due to the world recovering from the event. Carbon emissions reached an all time high in 2019, suggesting that the world has not done enough to reduce its carbon footprint.
– Asia Pacific accounts for around half of the world's carbon emissions; China is a major contributor, having almost as many emissions as North America and Europe combined. Asia Pacific has experienced a sharp rise in carbon emissions whereas every other region has levelled off, implying that Asia Pacific is the main driving force for the rise in the world's carbon emissions in recent years. The top 5 carbon dioxide emitting countries (China, US, India, Russia and Japan) are responsible for 58% of the world's carbon emissions; this suggests that changes in environmental policy in these countries can be effective in reducing global emissions. 90% of countries have emissions below 1000 million tonnes of carbon.
– Energy consumption has a very strong correlation with carbon emissions, implying that higher energy consumption is one of the main causes of higher carbon emissions. More specifically, coal and oil consumption is very strongly associated with carbon emissions. Using alternative sources of energy can go a long way in reducing energy related carbon emissions.
– Renewable energy consumption, which is widely regarded as crucial in reducing carbon emissions, has rapidly increased in the last 15 years. Asia Pacific has experienced more growth than any other region. This suggests that they acknowledged their contribution to carbon emissions and are taking action to reduce it. Despite this rapid growth, renewable energy is still an insignificant proportion of total energy consumption as many countries are still reliant on fossil fuels such as oil, coal and gas.
– Forecasting the next 46 years with historical data between 1965 and 2019 shows that even in a best-case scenario, renewable energy does not account for total energy consumption in that period, despite many scientists predicting that the world is on track to reach 100% renewable energy by 2050. However, the forecast does not account for external factors such as drastic changes in environmental policy and investment in renewable energy, which means that it is not completely valid.
– The coronavirus pandemic has drastically reduced carbon emissions this year due to lower industrial activity and transport use, and it is not clear what effect this may have regarding carbon emissions and renewable energy consumption in the future. During previous crises the growth rate of carbon emissions has recovered; however, this pandemic has reduced carbon dioxide emissions by more than any other event in history.
References 1. Food & Rural Affairs Department for Environment. UK’s carbon footprint 1997– 2018 (2018) 2. Drax. What is Carbon dioxide? (2020) 3. Inspire. Why are Greenhouse Gases a Problem? (2016)
4. Liu, Z., et al.: Near-real-time monitoring of global co2 emissions reveals the effects of the COVID-19 pandemic. Nat. Commun. 11(1), 5172 (2020) 5. Maslin, M.: Will three billion people really live in temperatures as hot as the sahara by 2070? The Conversation (2020) 6. NASA. Climate Change: How Do We Know? (n.d.) 7. BP PLC. BP Corporate Statistical Review (2020) 8. Robinson, E., Robbins, R.C.: Sources, abundance, and fate of gaseous atmospheric pollutants. Final report and supplement (1968) 9. Statista. How concerned, if at all, are you about current climate change, sometimes referred to as ‘global warming’ ? Statista (2020) 10. Suess, H.E.: Radiocarbon concentration in modern wood. Science 122(3166), 415– 417 (1955) 11. Tucker, M.: Carbon dioxide emissions and global GDP. Ecol. Econ. 15(3), 215–223 (1995) 12. Tutmez, B.: Trend analysis for the projection of energy-related carbon dioxide emissions. Energy Explor. Exploit. 24(1–2), 139–150 (2006) 13. Viessmann. What are carbon emissions (and why do they matter)? (n.d.) 14. Chi, X., Kohler, T.A., Lenton, T.M., Svenning, J.-C., Scheffer, M.: Future of the human climate niche. Proc. Natl. Acad. Sci. 117(21), 11350–11355 (2020) 15. Zhang, X.-P., Cheng, X.-M.: Energy consumption, carbon emissions, and economic growth in china. Ecol. Econ. 68(10), 2706–2712 (2009)
Application of Weighted Co-expressive Analysis to Productivity and Coping Sipovskaya Yana Ivanovna(B) Institute of Psychology of the Russian Academy of Sciences, 129366, Yaroslavskayast., 13, Moscow, Russia [email protected]
Abstract. The article examines the features of the use of weighted co-expression and correlation analysis (the Spearman method) in the context of studying the relationship between school academic performance and coping strategies. The study involved 155 high school students (15–17 years old). Methodological base of the research: "Strategies for coping behavior of Lazarus" and data from the school electronic diary. The results of the study allow us to conclude that both methods of statistical analysis used in the study showed similar results, and the choice of the preferred statistical tool depends on the need for better visualization of the result and, for example, the ability to build a model of the investigated construct in its most significant components, excluding insignificant, "noisy" components. The results showed that with an increase in the degree of school success, the number of cognitively loaded coping strategies also increases, while the number of emotionally loaded strategies, on the contrary, decreases. Keywords: Data analysis · Weighted co-expression analysis · School performance · Coping strategies
1 Introduction Adolescence is one of the critical age periods of ontogenetic development. It is accompanied by sharp qualitative changes in a number of spheres of human life, among which are the physical, intellectual, moral and others. Such dramatic changes are accompanied by stress and, accordingly, the formation and development of methods for overcoming it, coping, aimed at reducing the negative consequences of difficult life situations, their resolution or avoidance of these unfavorable factors. In addition, in older adolescence, cognitive activity begins to play a particularly significant role, coupled with educational and professional self-determination, increasing and increasing methods of monitoring school performance (final tests, exams), etc. Thus, the task of this empirical study is to analyze the structure of preferred coping behaviors in older adolescents with different levels of school performance. Coping behavior is a set of mechanisms, ways to overcome difficulties in various spheres of mental activity, which are aimed at increasing the adaptation of adolescents © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 506, pp. 762–768, 2022. https://doi.org/10.1007/978-3-031-10461-9_52
to the changing conditions of the external or internal environment. Difficult life circumstances in the individual consciousness differ in the degree of significance [4]. So, for example, L.I. Antsyferova defines "coping" as an individual way of interaction between the subject of activity and the situation in accordance with its own logic, significance and psychological capabilities [2]. N.I. Bykasova and E.A. Kalyagina consider "coping" as "constantly changing cognitive, emotional and behavioral attempts to cope with specific external or/and internal requirements, which are assessed as stress or exceed a person's resources to cope with them" [4, p. 83]. Speaking about the styles or strategies of coping behavior in the framework of this study, it should be separately pointed out that all the strategies involved differ in the prevalence of cognitive and emotional-personal components of activity, but one should not operate with the concepts of productivity or unproductiveness of any of the strategies: the effectiveness of one or a different strategy depends on the characteristics of the current situation, the degree of actualization of the strategy itself, the selectivity of it or its individual components, and available individual resources. Nevertheless, a number of coping behavior strategies (self-control, responsibility, planning) are associated with a reflexive cognitive component of activity, while others, for example conflict and distancing, are associated with various ways of emotionally resolving a problem situation. N.I. Bykasova and E.A. Kalyagina point out that "… cognitive coping strategies aimed at resolving stressful situations contribute to the adaptation of adolescents to the conditions of the social environment. Emotional coping strategies aimed at avoiding problems lead to maladjustment of the adolescent, his isolation, decreased self-esteem, increased psychological anxiety, etc." [4, p. 85]. Accordingly, following this position, one should expect the prevalence of cognitively loaded coping strategies in schoolchildren who have good academic performance, and a preference for more emotional coping strategies associated with avoiding active problem solving in poorly performing students [1, 3]. However, as one would expect in psychological research, one should not expect linear relationships between research variables (fuzzy sets). That is why in this work, along with the classical correlation method (according to Spearman), the network modeling method is relevant.
1.1 Research Questions
Hypotheses:
• the choice of one or another coping behavior strategy is associated with the success of older adolescents;
• with an increase in the degree of school success, the number of cognitively loaded coping behavior strategies also increases, while the number of emotionally loaded ones, on the contrary, decreases.
1.2 Purpose of the Study
The subject of this empirical study was the study of the relationship between coping strategies of older adolescents with different school performance, and the object of the study was older adolescents with different school performance and preferred methods of coping behavior in difficult life situations.
2 Study Participants

Ninth-grade students took part in the study: 155 older adolescents aged 15–17 years with different levels of academic achievement, studying in secondary schools.
3 Research Methods

3.1 Data of the Student's Electronic Diary

3.2 Questionnaire "Methods of Coping Behavior" by Lazarus [5]

The technique is designed to determine coping mechanisms, that is, ways of overcoming difficulties in various areas of mental activity, through the following coping strategies: confrontation; distancing; self-control; seeking social support; acceptance of responsibility; escape-avoidance; planning a solution to the problem; positive revaluation.

3.3 Research Techniques

Correlation analysis (Spearman's method) and network modeling are used in this work to establish the structure of the intellectual competence construct.
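To make the two techniques concrete, a minimal sketch is given below. It assumes the questionnaire scale scores and the diary-based grade average are already gathered in one table; the file name, column names and the edge-selection rule are illustrative assumptions, not the study's exact procedure.

```python
# Minimal sketch: Spearman correlation matrix plus a simple correlation network.
# File name, column names and the edge filter are illustrative assumptions.
import pandas as pd
import networkx as nx
from scipy.stats import spearmanr

df = pd.read_csv("coping_and_grades.csv")   # columns: A_Perf, Confrontation, ...

# Spearman correlation matrix (visualized as the correlogram in Fig. 1).
rho, pval = spearmanr(df.values)
corr = pd.DataFrame(rho, index=df.columns, columns=df.columns)

# Network model: variables are nodes; an edge is kept when the pairwise
# correlation is significant and non-trivial (psychological network studies
# often use regularized partial correlations instead of this simple filter).
G = nx.Graph()
G.add_nodes_from(df.columns)
for i, a in enumerate(df.columns):
    for j, b in enumerate(df.columns):
        if j > i and pval[i, j] < 0.05 and abs(rho[i, j]) > 0.2:
            G.add_edge(a, b, weight=round(float(rho[i, j]), 2))

print(corr.round(2))
print(list(G.edges(data=True)))
```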
4 Results

4.1 Correlation Analysis

Figure 1 shows the correlation matrix in the form of a correlogram. According to the results of the correlation analysis, which included indicators of school academic performance and the coping strategies (confrontation; distancing; self-control; seeking social support; acceptance of responsibility; escape-avoidance; planning a solution to the problem; positive revaluation), significant correlations were found between academic performance and the coping strategies of confrontation (negative), positive revaluation, self-control, and acceptance of responsibility. As follows from the results, the choice of a particular coping strategy is only weakly associated with the school success of older adolescents. Among the most preferred coping strategies in all research groups was "positive reappraisal," characterized by an orientation toward a transpersonal, metacognitive understanding of the problem situation that has arisen. This strategy combines a pronounced metaregulatory component with a probable underestimation by the individual of the possibilities of effectively resolving the problem situation, instead of avoiding its solution in real time. The results also established a preference for the "self-control" coping strategy, which involves actively resolving a difficult situation and a preference for rational comprehension of the difficulty instead of an emotional decision and avoidance of overcoming the problem.
Fig. 1. Correlogram of indicators of school academic performance (A.Perf) and coping strategies (Confrontation; Distancing; Self-Control; Seeking Social Support; Acceptance of Responsibility; Escape-Avoidance; Planning a Solution to the Problem; Positive Revaluation).
Another strategy significantly related to academic performance was "taking responsibility," which involves a person's recognition of his or her role in the emergence of a problem and responsibility for its solution, in some cases with a distinct component of self-criticism and self-accusation. With moderate use, this strategy reflects the individual's desire to understand the relationship between their own actions and their consequences, the willingness to analyze their behavior, and to look for the causes of current difficulties in personal shortcomings and mistakes. At the same time, excessive reliance on this strategy can lead to unjustified self-criticism, feelings of guilt, and dissatisfaction with oneself; these features are known to be a risk factor for the development of depressive conditions [6]. Finally, the only significant negative correlation in the obtained correlation matrix belongs to "confrontation," an attempt to resolve a problem through not always purposeful behavioral activity, that is, taking concrete actions aimed either at changing the situation or at responding to the negative emotions associated with the difficulties that have arisen. With a pronounced preference for this strategy, impulsivity in behavior (sometimes with elements of hostility and conflict), difficulties in planning actions, predicting their results and correcting behavior strategies, and unjustified persistence can
be observed. The confrontation strategy is often considered maladaptive, but when used moderately it provides the ability to resist difficulties, energy and resourcefulness in resolving problem situations, the ability to defend one's own interests, and the ability to cope with anxiety under stress. Thus, the correlation analysis of the relationship between indicators of school success in older adolescents and their preferred coping behavior strategies suggests that these constructs are weakly connected. Network modeling of the same data clarifies this finding and suggests further development of this topic.

4.2 Network Analysis

Figure 2 presents the results of studying the network structure of school academic performance and coping strategies through network modeling. There is reason to conclude that school academic performance is heterogeneous with respect to the functional load and the cognitive and behavioral complexity of the coping components.
Fig. 2. Network model of school performance and coping strategies.
Notes:
• A.P - manifestations of school academic performance
• Pl. - planning a solution to the problem
• P.V - positive revaluation
• SC. - self-control
• SS. - seeking social support
• Cn. - confrontation
• Es. - escape-avoidance
• Rs. - acceptance of responsibility
• Ds. - distancing
The modeling results allow us to conclude that the central links of school academic performance are the coping strategies of positive reappraisal and control of one's own activity. These strategies are characterized both by a passive attitude to the problem situation and, on the contrary, by active actions to overcome it. The negative role of the confrontation strategy is also noticeable, albeit not clearly expressed. Thus, according to the results, there is reason to conclude that by adolescence the set of coping behavior strategies is still insufficiently developed, which leads to a differentiated coping structure in the context of school academic success.
5 Findings

The data allow us to draw the following general conclusions:
1. Both statistical methods used in the study (Spearman's correlation method and network modeling) showed similar results; however, the network analysis method visualizes the result better.
2. The value of using network analysis alongside correlation analysis has been demonstrated: it makes it possible to build a model of the investigated construct from its most significant components and to exclude from the analysis insignificant components that can "interfere" with the results.
3. The statistical analysis yielded the following significant psychological result: with an increase in the degree of school success, the number of cognitively loaded coping behavior strategies does not increase, while emotionally loaded strategies are present in the identified structure.
6 Conclusion

The article examines two approaches to studying and understanding the structure of the intellectual competence construct, correlation analysis and network analysis, using psychological data (school performance and coping behavior strategies) as an example. Both statistical methods showed similar results; the choice of the preferred statistical tool depends on the need for better visualization of the result and, for example, on the ability to build a model of the investigated construct from its most significant components while excluding insignificant, "noisy" components. According to the analysis, the empirical data refuted the hypothesis that with an increase in the degree of school success the number of cognitively loaded coping strategies also increases while the number of emotionally loaded strategies, on the contrary, decreases.
7 Future Work

The analysis of the relationship between indicators of school success in older adolescents and their preferred coping behavior strategies suggests that these constructs are weakly connected or are still being actively formed. This finding needs to be clarified and calls for further development of the topic.
Acknowledgments. This work was carried out under state assignment No. 0159-2019-0008 of the Institute of Psychology of the Russian Academy of Sciences (IP RAS).
References 1. Alexandrov, Yu.I., et al.: Regression as a Stage of Development. Institute of Psychology RAS, Moscow (2017) 2. Antsyferova, L.I.: Personality in difficult living conditions: rethinking, transformation of situations and psychological defense. Psychol. J. 15(1), 3–18 (1994) 3. Bogomaz, S.A., Filonenko, A.L.: Differences in the choice of coping strategies by individuals with different tendencies to manipulative behavior. Siberian Psychol. J. 10, 122–126 (2009) 4. Bykasova, N.I., Kalyagina, E.A.: Comparative analysis of coping strategies among older adolescents with different levels of academic achievement. Personality, Family and Society: Issues of Pedagogy and Psychology, no. 35–2, pp. 81–87. Association of researchers “Siberian Academic Book”, Novosibirsk (2013) 5. Kryukova, T.L., Kuftyak, E.V.: Questionnaire of coping methods (adaptation of the WCQ methodology). J. Pract. Psychol. Moscow 3, 93–112 (2007) 6. Rokitskaya, Yu.A.: Factor structure of coping behavior of adolescents. Bulletin of the Chelyabinsk State Pedagogical University, no. 3, pp. 220–233. South Ural State Humanitarian and Pedagogical University, Chelyabinsk (2018)
An Improved Architecture of Group Method of Data Handling for Stability Evaluation of Cross-sectional Bank on Alluvial Threshold Channels Hossein Bonakdari1(B) , Azadeh Gholami2 , Isa Ebtehaj3 , and Bahram Gharebaghi4 1 Department of Civil Engineering, University of
Ottawa, 161 Louis Pasteur Drive, Ottawa K1N 6N5, Canada [email protected] 2 Environmental Research Center, Razi University, Kermanshah, Iran 3 Department of Soils and Agri-Food Engineering, Université Laval, Quebec City, QC G1V06, Canada 4 School of Engineering, University of Guelph, Guelph, ON N1G 2W1, Canada
Abstract. The depth and surface of the water in the center of stable channels are two variables that the majority of river engineers have been studying. Because the natural profile shape formed on stable banks is of great importance in designing threshold channels with gravel beds, extensive experiments are done in this study to examine the geometric shape and dimensions of a channel in the stable state. A novel method called the Improved Architecture of the Group Method of Data Handling (IAGMDH) is designed to overcome the main limitations of the classical GMDH model, namely the use of only 2nd order polynomials, only two inputs for each neuron, and no use of neurons from non-adjacent layers. The developed IAGMDH is applied to estimate the bank profile specifications of stable channels. Accordingly, the flow discharge (Q) and the transverse distance (x) of points located on stable banks from the center line are considered as input parameters, and the vertical boundary level (y) of the points is considered as the output parameter. The performance of IAGMDH is compared and evaluated against seven previous models proposed by other researchers, a well-known scheme of the GMDH optimized with the genetic algorithm (GMDH-GA), and a Non-Linear Regression (NLR) model. Comparing the nine models' results with the experimental data shows that the IAGMDH model outperformed the others in testing mode (MARE = 0.5107, RMSE = 0.052, and R = 0.9848) and is thus more accurate than the other models. The Vigilar and Diplas Model (VDM), with an RMSE of 0.2934, performs best among the previous relationships. The GMDH model presented in this study is similar to VDM, suggesting a polynomial curve shape for the proposed threshold channel cross-section. Among the other shapes proposed, the polynomial curve is the most appropriate compared with the experimental values. The IAGMDH model also offers a robust and straightforward relationship that can predict a variety of channels' given cross-section dimensions; hence, the proposed approach can be employed in the design, construction, and operation of artificial channels and rivers.
Keywords: Bank profile shape · Experimental model · Improved Architecture of Group Method of Data Handling (IAGMDH) · Hybrid model · Threshold channel · Water resource management
1 Introduction

During offshore flow, sediment grains around a channel are in impending motion, a state called the threshold channel. A threshold channel, also known as a static equilibrium channel, is considered the ultimate state of channels that widen. This mostly occurs in rivers with gravel beds and not so easily in rivers with fine bed particles (like silt) in the threshold state, unless the discharge is very low [1]. Recognizing threshold channel characteristics is highly significant in irrigation and artificial channel design [2]. Numerous studies have measured the hydraulic geometry of alluvial channels in the dynamic equilibrium or regime state [3–15], but relatively few studies have focused on the profile shapes of the bank zones after stability is reached [16]. The first study on threshold channel bank profiles was done by the U.S. Bureau of Reclamation (USBR) using the tractive force approach. The USBR considered the formed profile of the stable channel as a function of the material repose angle (ϕ) [17, 18]; the proposed bank profile function was a sine curve. Glover and Florey [19] tried to expand the USBR model to predict the stable bank profile by modifying the sine curve. They ignored lateral momentum diffusion induced by turbulence in their equation and suggested the classic cosine curve for a channel bank profile. The cosine curve is one of the threshold channel shapes [20, 21]. Parker [21] extensively studied self-formed straight rivers with moving gravel beds and presented a cosine bank profile accounting for the lift force on the bank particles. Other bank profile shapes presented by researchers for a threshold bank are the parabolic curve [22], the exponential function [18, 23, 24] and the polynomial shape [25–28]. Some studies have been done on natural rivers [21] and others with experimental models [28–31]. Vigilar and Diplas [16] evaluated two different channel shapes on a moving bed, with and without momentum diffusion. The presence of the momentum diffusion term reduces the stress and increases the flow depth in the central region of the channel; at the same time, the stress in the upper bank areas increases and the bank slope becomes milder. It is argued that an optimized stable channel has banks consisting of particles in impending motion while sediment transport occurs in the flat-bed region. This type of channel is called "mobile-bed, stable bank" and is mostly seen in natural channels and rivers with gravel beds [1]. Therefore, considering the momentum diffusion term yields a more realistic shape than the other forms presented for channel banks and natural rivers. The bank profile geometry proposed by Vigilar and Diplas [26] was a third-order polynomial curve. Cao and Knight [22] solved the bank profile equations based on the entropy principle; the proposed shape for a threshold channel was a parabolic curve. Babaeyan-Koopaei [31] experimentally compared different existing models of the bank profile geometry of stable alluvial channels and concluded that Diplas and Vigilar's [25] fifth-degree polynomial gives a better approximation than the normal-depth method [24]
and the cosine profile [21]. Khodashenas [28] experimentally evaluated the stable channel geometry of gravel rivers. In a broad investigation, Khodashenas [28] predicted the stable bank profile using 13 previous models against six laboratory data sets and compared the results, which showed that almost all previous relations differed considerably from the laboratory results and that additional research is required in this field. Gholami et al. [32] extensively evaluated the physical justification of the entropy parameter of the Shannon entropy-based model proposed to estimate the transverse slope and bank profile characteristics of stable channels. They stated that different hydraulic and geometric parameters influence the entropy parameter and hence the estimated bank profile shape. Moreover, using the Tsallis entropy concept, Gholami et al. [33] proposed a profile shape equation for stable channel banks. The main focus of the study by Gholami et al. [33] was to appraise the signs and assigned values of the entropy parameters by numerically solving the obtained bank profile shape equation; they pointed to further studies using the entropy concept for estimating the bank profile shape and transverse slope of stable channels. It is common in different sciences to use a variety of soft computing (SC) techniques for modeling, as they require less time and cost [34–37]. Many researchers have used SC to solve various hydraulic problems, such as simulating hydraulic open channel bends [38–41], hydraulic jumps [42, 43], sediment transfer [44], and time series predictions [45–48]. Based on the authors' knowledge, the majority of studies performed to predict the geometry of stable channel shapes have employed Artificial Neural Networks (ANN) [49]. The high accuracy of these models in estimating the dimensions of the stable channel is notable; however, one of the main problems with ANN modeling is the lack of clear relationships to apply in practical tasks. Among soft computing techniques, the Group Method of Data Handling (GMDH), a self-organized method, is able not only to solve complex nonlinear problems [50, 51] but also to provide a simple set of polynomials for practical tasks. Gholami et al. [11] evaluated the ability of the GMDH model for the first time in estimating the dimensions of stable channels (slope, depth and width) and compared the obtained results with the previous regime equations proposed by others. They identified the high ability of GMDH, in comparison with regression models, which were the most common methods, in predicting the geometry of stable channels. Moreover, Shaghaghi et al. [12] evaluated the efficiency of the GMDH approach integrated with a genetic algorithm in comparison with Particle Swarm Optimization (PSO), another evolutionary algorithm, to estimate stable channel dimensions (slope, depth, and width). Subsequently, Shaghaghi et al. [13, 14] evaluated the performance of other SC-based methods such as Least Square Support Vector Regression (LSSVR), the M5 Model Tree (M5Tree), Gene Expression Programming (GEP), and Multivariate Adaptive Regression Splines (MARS) in the prediction of stable channel dimensions. However, to the authors' knowledge, no study except the authors' own recent works has evaluated the bank profile shapes of stable channels using AI methods. Further, the authors have examined the bank profile characteristics of stable channels using different AI models. Accordingly, Gholami et al.
[52] assessed the application of Gene Expression Programming (GEP) compared to the bank profile shape equations previously proposed by others. They reported the high ability and acceptable agreement of the GEP model with experimental values compared to other rational methods. Furthermore, Gholami
et al. [53, 54] examined the application of the Adaptive Neuro-Fuzzy Inference System (ANFIS) integrated with Singular Value Decomposition (SVD) and evolutionary algorithms, namely Differential Evolution (DE) [53] as well as Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) [54], to measure the bank profile characteristics of stable channels under various hydraulic conditions. The authors showed that although the results of these models are acceptable, the hybrid models did not provide acceptable accuracy for some samples. Besides, Gholami et al. [55] evaluated the performance of different hybrid ANFIS models (ANFIS-DE/SVD and ANFIS-DE) compared to simple ANFIS to estimate the bank profile shape of stable channels. Gholami et al. [56], for the first time, applied the Emotional Artificial Neural Network (EANN) model to predict the bank profile shape of stable channels based on various experimental data sets, in comparison with previous bank profile shape equations proposed by others and with the GEP model [52]; they indicated that the EANN outperformed the other models. However, the EANN model proposed in [56] cannot provide a straightforward relationship to estimate the bank profile shapes of channels or rivers with different morphological and geological conditions. The main limitation of the current studies is the lack of a simple relationship to apply in practical tasks: most of them do not provide any relation that other researchers can employ, and a method such as GEP, which does provide an explicit relation, is not accurate enough compared to other methods (such as EANN). To the authors' best knowledge, no study has been carried out on predicting bank profile shape using GMDH in straight stable open channels. In addition, the classical GMDH has some disadvantages, such as (1) providing the final relationship as a set of second-order polynomials only and (2) using only neurons located in the adjacent layer as inputs of the newly generated neurons in each layer. To overcome these drawbacks, a new version of the GMDH, known as the improved architecture of the GMDH (IAGMDH), is developed here for predicting the bank profile shape; it maps the input variables to the output using a set of 2nd and 3rd order polynomials in which the inputs of each neuron can be selected from both adjacent and non-adjacent layers. The main aim of the current study is to provide a robust IAGMDH model that generates a set of simple mathematical formulations for predicting the bank profile shape. The authors experimented extensively with passing discharge rates of 6.2, 2.57, 2.18, and 1.157 (l/s) to establish the shape of the stable channel in a laboratory flume. The experimental tests are employed to develop different IAGMDH-based models in terms of 2nd order and 3rd order polynomials with one, two, and three layers. Besides, the performance of these IAGMDH-based models is compared with GMDH-GA, a well-known optimized version of the GMDH applied by the authors in different fields of science [57–59], and with a Non-Linear Regression (NLR) model fitted to the same data sets. The current study presents an appropriate geometric shape from channel profile predictions using IAGMDH modeling and different discharge rates. In addition, the IAGMDH and NLR results are compared with seven previous mechanical (rational) relationships introduced by other researchers based on different numerical and experimental models. Finally, the superior model is introduced.
2 Materials and Method

In this section, the experiments performed are first described. Subsequently, the IAGMDH models and other researchers' empirical relationships are described. Finally, the statistical indices employed to assess the performance of the different machine-learning-based models are explained.

2.1 Experimental Model

The authors performed several experimental tests to measure the bank profile specifications (x and y) [28, 53]. The flume used in the experimental tests was 1.22 m wide and 20 m long, the flume bottom was covered with smooth sand with d50 = 0.53 mm, and the flow depth throughout the flume was 0.6 m. In the experiments, two cross-sections with triangular and trapezoidal shapes were selected, and the experiments were done under various hydraulic conditions. Four discharge rates passed through the channel and remained constant until a steady channel state was reached. Table 1 presents the four experimental sets performed. A magnetic flow meter measured the discharge rate. For each rate, a wooden former fixed on a carriage carved the channel with a slope of 23% along the centerline of the flume while the discharge passed through the channel. The experimental tests were continued until a stable shape was achieved in the channel's plan and cross-section.

Table 1. Different experimental characteristics

Test No | Discharge (Q) (l/s) | d50 (mm) | S | hc (mm) | Cross section's shape
1 | 1.157 | 0.53 | 0.0023 | 37 | Triangular
2 | 2.57 | 0.53 | 0.0023 | 61.3 | Triangular
3 | 6.2 | 0.53 | 0.0023 | 80 | Triangular
4 | 2.18 | 0.53 | 0.0023 | 61.2 | Compound channel
2.2 Overview of the Rational Model

The formation of bank profiles in threshold channels under the passing flow is important. To predict the channel shape, the coordinates of all points of the profile can be measured. Each point is defined by two characteristics (x, y), known as the "bank profile characteristics", which are the transverse and vertical distances from the channel bed midpoint (m), respectively. Figure 1 indicates the two bank profile characteristics: x and y are the transverse and vertical distances from the channel bed midpoint (m), respectively, hc is the centerline flow depth and T is the water surface width. The two channel characteristics (x and y) can be made dimensionless as follows: x* = x/hc, y* = y/hc and T* = T/hc.
774
H. Bonakdari et al.
Fig. 1. Schematic representation of cross section
where x* and y* are the dimensionless transverse and vertical distances, respectively, and T* is the dimensionless width of the water surface. Table 2 presents the relationships of the Glover and Florey Model (GFM) [19], the U.S. Bureau of Reclamation model (USBR) [17], the Ikeda Model (IKM) [23], the Pizzuto Model (PIM) [18], the Diplas Model (DIM) [24], the Babaeyan-Koopaei Model (BVM) [31], and the Vigilar and Diplas Model (VDM) [26] that were examined in this study. These models were derived numerically or experimentally and are based on the mechanical (rational) method [60].

Table 2. Models studied in this research
2.3 Improved Architecture of the Group Method of Data Handling (IAGMDH)

The Group Method of Data Handling (GMDH) is a self-organized method introduced by Ivakhnenko [61]. GMDH employs a set of neurons to present different models. These neurons are interconnected through a 2nd order polynomial, which leads to the production of new neurons in the subsequent layer. The main objective of GMDH is to find an approximation function $\hat{f}$ with output $\hat{y}$ for a set of inputs $x = (x_1, x_2, \ldots, x_n)$ such that the function estimates the output values with the least difference from the observed values. For M training samples with n input variables and one output, the real output value $y_i$ is defined as follows:

$$y_i = f(x_{i1}, x_{i2}, \ldots, x_{in}), \quad i = 1, 2, \ldots, M \qquad (1)$$

The GMDH network output values in the model training stage are expressed as:

$$\hat{y}_i = \hat{f}(x_{i1}, x_{i2}, \ldots, x_{in}), \quad i = 1, 2, \ldots, M \qquad (2)$$

In order to reach an optimal solution, the squared error due to the difference between the actual and training values is minimized:

$$E = \sum_{i=1}^{M} \left( \hat{f}(x_{i1}, x_{i2}, \ldots, x_{in}) - y_i \right)^2 \rightarrow \text{Min} \qquad (3)$$

In the GMDH network, to map the input variables to the output(s), the Kolmogorov-Gabor polynomial is applied:

$$\hat{y} = a_0 + \sum_{i=1}^{m} a_i x_i + \sum_{i=1}^{m} \sum_{j=1}^{m} a_{ij} x_i x_j + \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} a_{ijk} x_i x_j x_k + \ldots \qquad (4)$$

Rewriting the above equation for two input variables leads to a simpler relationship:

$$\hat{y} = G(x_i, x_j) = a_1 + a_2 x_i + a_3 x_j + a_4 x_i^2 + a_5 x_j^2 + a_6 x_i x_j \qquad (5)$$

The set of unknown coefficients $\{a_1, a_2, a_3, a_4, a_5, a_6\}$ is calculated using regression to minimize the difference between the calculated value $\hat{y}$ and the actual output $y$ for all data pairs $(x_i, x_j)$ [61–63]. Thus, the set of coefficients of the second-degree polynomial of Eq. (5) is calculated. The coefficients of each function $G_i$ (each neuron built during modeling) are obtained so as to minimize the error of all neurons and to optimize the inputs over all input-output pairs. The modeling process continues until the following expression is minimized:

$$E = \frac{1}{M} \sum_{i=1}^{M} (y_i - G_i)^2 \rightarrow \text{Min} \qquad (6)$$
In the basic GMDH algorithm, the number of neurons generated in each layer is $C_2^n = n(n-1)/2$ (where n is the number of input variables). Hence, for Eq. (5), the M data triples are arranged as:

$$\begin{bmatrix} x_{1p} & x_{1q} & \vdots & y_1 \\ x_{2p} & x_{2q} & \vdots & y_2 \\ \ldots & \ldots & \ldots & \ldots \\ x_{Mp} & x_{Mq} & \vdots & y_M \end{bmatrix} \qquad (7)$$

For each row, the following matrix equation is defined:

$$A\,a = y \qquad (8)$$

in which

$$A = \begin{bmatrix} 1 & x_{1p} & x_{1q} & x_{1p}x_{1q} & x_{1p}^2 & x_{1q}^2 \\ 1 & x_{2p} & x_{2q} & x_{2p}x_{2q} & x_{2p}^2 & x_{2q}^2 \\ 1 & x_{3p} & x_{3q} & x_{3p}x_{3q} & x_{3p}^2 & x_{3q}^2 \\ \ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\ 1 & x_{Mp} & x_{Mq} & x_{Mp}x_{Mq} & x_{Mp}^2 & x_{Mq}^2 \end{bmatrix} \qquad (9)$$

$$y = \{y_1, y_2, \ldots, y_M\}^T \qquad (10)$$

$$a = \{a_1, a_2, a_3, a_4, a_5, a_6\} \qquad (11)$$

The least squares method gives the second-order coefficients $a_i$ using multiple regression analysis as follows:

$$a = (A^T A)^{-1} A^T Y \qquad (12)$$
The above equation is applied to compute the unknown coefficient vector of the polynomial in Eq. (5) for the set of M data triples. The process is repeated for each neuron according to the neural network structure. The most important advantages of classical GMDH that distinguish it from other methods are: (1) the ability to select the most influential input variables and eliminate variables that have less impact than others; (2) automatic selection of the final structure of the model, which, as a self-organizing method, is its most important feature [64, 65]; and (3) the provision of simple equations that can be applied in practical work.
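To make the fitting of a single neuron concrete, the sketch below assembles the design matrix of Eq. (9) for one two-input quadratic neuron (Eq. (5)) and solves for its six coefficients by least squares as in Eq. (12). The data are synthetic and the helper names are illustrative; this is a minimal sketch, not the authors' implementation.

```python
import numpy as np

def fit_quadratic_neuron(xp, xq, y):
    """Fit G(x_p, x_q) = a1 + a2*xp + a3*xq + a4*xp^2 + a5*xq^2 + a6*xp*xq
    by ordinary least squares, i.e. a = (A^T A)^{-1} A^T y as in Eq. (12)."""
    A = np.column_stack([np.ones_like(xp), xp, xq, xp**2, xq**2, xp * xq])
    a, *_ = np.linalg.lstsq(A, y, rcond=None)  # numerically safer than an explicit inverse
    return a, A

# Synthetic demonstration data (not the paper's measurements).
rng = np.random.default_rng(0)
xp, xq = rng.uniform(0.0, 1.0, 200), rng.uniform(0.0, 1.0, 200)
y = 1.0 + 2.0 * xp - 0.5 * xq + 0.3 * xp * xq + rng.normal(0.0, 0.01, 200)

a, A = fit_quadratic_neuron(xp, xq, y)
rmse = np.sqrt(np.mean((A @ a - y) ** 2))   # training error, cf. Eq. (6)
print(np.round(a, 3), round(float(rmse), 4))
```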
Although these advantages differentiate classical GMDH from other methods, it also has drawbacks that must be overcome to achieve an optimal model. Its main limitations are: (1) the final relationship is provided as a set of second-order polynomials only, (2) each neuron has only two inputs, and (3) only neurons located in the adjacent layer can serve as inputs of the newly generated neurons in each layer. The restriction to second-order models may force GMDH to use many equations, which negatively affects the model's simplicity. In addition, limiting each neuron to two inputs and selecting input neurons only from adjacent layers also increases the complexity of the model. To overcome these limitations, an advanced version of GMDH entitled the improved architecture of GMDH (IAGMDH) is presented in this study: the polynomials produced in this model can be of the second and third order, the number of inputs of each generated neuron can be two or three, and the inputs of the produced neurons can be selected from non-adjacent layers, which greatly contributes to the simplicity of the final model. An example of models developed with the classical GMDH and with IAGMDH for four inputs is presented in Fig. 2. According to the figure, to establish a nonlinear mapping between the problem inputs (x1, x2, x3, x4) and the output (y), the GMDH method must produce three neurons in the first layer (x11, x12, x13); using these three neurons, two neurons (x21 and x22) are generated in the second layer, and finally the output variable is calculated from the two neurons of the second layer. In IAGMDH, by contrast, three inputs are used directly in neuron x11, neurons from non-adjacent layers are used for x21 (second layer) and for the output variable (y), and only one neuron is needed in each of the second and third layers. Indeed, the structure presented in this study significantly increases the simplicity of the model.
Fig. 2. Chromosome structure of the classical GMDH and IAGMDH network
The equations used in the developed IAGMDH model, in addition to the second-order polynomial with two inputs (Eq. 5), have three other forms, given as Eqs. (13) to (15): Eq. (13) is the second-order polynomial with three inputs, Eq. (14) is the third-order polynomial with two inputs, and Eq. (15) is the third-order polynomial with three inputs.

$$G(x_i, x_j, x_k) = a_1 + a_2 x_i + a_3 x_j + a_4 x_k + a_5 x_i^2 + a_6 x_j^2 + a_7 x_k^2 + a_8 x_i x_j + a_9 x_i x_k + a_{10} x_j x_k \qquad (13)$$

$$G(x_i, x_j) = a_1 + a_2 x_i + a_3 x_j + a_4 x_i^2 + a_5 x_j^2 + a_6 x_i x_j + a_7 x_i^3 + a_8 x_j^3 + a_9 x_i^2 x_j + a_{10} x_j^2 x_i \qquad (14)$$

$$G(x_i, x_j, x_k) = a_1 + a_2 x_i + a_3 x_j + a_4 x_k + a_5 x_i^2 + a_6 x_j^2 + a_7 x_k^2 + a_8 x_i x_j + a_9 x_i x_k + a_{10} x_j x_k + a_{11} x_i^3 + a_{12} x_j^3 + a_{13} x_k^3 + a_{14} x_i^2 x_j + a_{15} x_i^2 x_k + a_{16} x_j^2 x_i + a_{17} x_j^2 x_k + a_{18} x_k^2 x_i + a_{19} x_k^2 x_j + a_{20} x_i x_j x_k \qquad (15)$$
The pseudo-code of the developed IAGMDH is presented in Code 1. The number of all possible input combinations from which the model selects the best one is:

$$C_2^n + C_3^n = \frac{n^3 - n}{6} \qquad (16)$$

It should be noted that, to evaluate the performance of each neuron, two objective functions are available: the root mean square error (RMSE), given by Eq. (6), and the corrected Akaike Information Criterion (AICc), which considers the error and complexity terms simultaneously:

$$AIC_c = \frac{1}{N} \sum_{i=1}^{N} \left( G_i^M - G_i^A \right)^2 + 2K + \frac{2K(K+1)}{N - K - 1} \qquad (17)$$

where $G^M$ and $G^A$ are the values estimated by the model and the actual values, respectively, K is the number of variables tuned during the training phase, and N is the number of training samples. The values of K for Eqs. (5), (13), (14) and (15) are 6, 10, 10 and 20, respectively.
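The candidate-generation and selection step summarized by Eqs. (16) and (17) can be sketched as follows: build the monomial design matrix corresponding to Eqs. (5)/(13)–(15), fit every two- and three-input subset by least squares, and keep the neuron with the lowest AICc. This is only a minimal sketch of the idea, not the authors' Code 1.

```python
from itertools import combinations, combinations_with_replacement
import numpy as np

def design_matrix(X, order):
    """Constant column plus every monomial of the selected inputs up to `order`
    (order 2 reproduces Eqs. (5)/(13); order 3 reproduces Eqs. (14)/(15))."""
    cols = [np.ones(X.shape[0])]
    for deg in range(1, order + 1):
        for combo in combinations_with_replacement(range(X.shape[1]), deg):
            cols.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(cols)

def aicc(y, y_hat, k):
    """Eq. (17): mean squared training error plus the complexity terms."""
    n = len(y)
    return np.mean((y_hat - y) ** 2) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def best_neuron(X, y, order=2):
    """Score every 2- and 3-input candidate neuron (C(n,2) + C(n,3) = (n^3 - n)/6
    of them, Eq. (16)) and return the subset and coefficients with the lowest AICc."""
    best = None
    for size in (2, 3):
        for subset in combinations(range(X.shape[1]), size):
            A = design_matrix(X[:, list(subset)], order)
            a, *_ = np.linalg.lstsq(A, y, rcond=None)
            score = aicc(y, A @ a, k=A.shape[1])
            if best is None or score < best[0]:
                best = (score, subset, a)
    return best

# Illustrative use with four synthetic inputs (not the paper's data).
rng = np.random.default_rng(1)
X = rng.uniform(size=(150, 4))
y = 0.5 + X[:, 0] * X[:, 2] - 0.3 * X[:, 1] ** 2 + rng.normal(0, 0.01, 150)
score, subset, coeffs = best_neuron(X, y, order=2)
print(subset, round(float(score), 4))
```

Note that the number of design-matrix columns produced for each form (6, 10, 10 and 20) matches the K values quoted above.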
2.4 Error Measures

To evaluate the performance of the different models developed in the current study and of the existing ones, several statistical indices are applied, including a correlation-based index (the coefficient of determination, R2), a relative index (the Mean Absolute Relative Error, MARE), and three absolute indices (the Root Mean Square Error, RMSE, the Mean Absolute Error, MAE, and the Mean Error, ME). Besides, BIAS is used as an index of a model's overall tendency to over- or underestimate, and ρ as a newly developed
hybrid index is also applied. The mathematical definitions of these indices are as follows:

$$ME = \frac{1}{N} \sum_{i=1}^{N} (O_i - P_i) \qquad (18)$$

$$MAE = \frac{1}{N} \sum_{i=1}^{N} |O_i - P_i| \qquad (19)$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (O_i - P_i)^2} \qquad (20)$$

$$MARE = \frac{1}{N} \sum_{i=1}^{N} \frac{|O_i - P_i|}{O_i} \qquad (21)$$

$$R = \frac{\sum_{i=1}^{N} (O_i - \bar{O})(P_i - \bar{P})}{\sqrt{\sum_{i=1}^{N} (O_i - \bar{O})^2 \sum_{i=1}^{N} (P_i - \bar{P})^2}} \qquad (22)$$

$$BIAS = \frac{1}{N} \sum_{i=1}^{N} (P_i - O_i) \qquad (23)$$

$$\rho = \frac{SI}{1 + R} \qquad (24)$$

where $P_i$ and $O_i$ are the predicted and actual values of the i-th sample, respectively, $\bar{O}$ and $\bar{P}$ are the means of the actual and predicted samples, respectively, and N is the number of (training or testing) samples.
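The indices of Eqs. (18)–(23) can be computed directly from paired observed and predicted arrays; the sketch below mirrors those definitions (ρ of Eq. (24) additionally requires the scatter index SI and is therefore omitted).

```python
import numpy as np

def error_measures(obs, pred):
    """ME, MAE, RMSE, MARE, R and BIAS as defined in Eqs. (18)-(23)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    me = np.mean(obs - pred)
    mae = np.mean(np.abs(obs - pred))
    rmse = np.sqrt(np.mean((obs - pred) ** 2))
    mare = np.mean(np.abs(obs - pred) / obs)
    r = np.sum((obs - obs.mean()) * (pred - pred.mean())) / np.sqrt(
        np.sum((obs - obs.mean()) ** 2) * np.sum((pred - pred.mean()) ** 2))
    bias = np.mean(pred - obs)
    return {"ME": me, "MAE": mae, "RMSE": rmse, "MARE": mare, "R": r, "BIAS": bias}

# Example with illustrative values (not the paper's data):
print(error_measures([0.2, 0.5, 0.9], [0.25, 0.45, 0.95]))
```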
3 Results and Discussion

In the first part of this section, four GMDH-based methods, namely three IAGMDH variants (IAGMDH – Simplest, IAGMDH – 2nd order polynomial, and IAGMDH – 3rd order polynomial) and one well-known optimized version of the GMDH (GMDH-GA), together with NLR and the existing models, are validated, and the shape of stable channels is discussed. The second part investigates the bank profile shape suggested by the model developed in the current study. Finally, the sensitivity of the developed IAGMDH model to different discharge rates is analyzed.

3.1 Verification of GMDH-Based Models, NLR, and Previous Rational Relations in Predicting the Profile Shape Characteristic of a Stable Channel

The performance of the five models presented in this study (IAGMDH – Simplest, IAGMDH – 2nd order polynomial, IAGMDH – 3rd order polynomial, GMDH-GA, and
NLR) and of the USBR model, the first approach to stable channel design, is evaluated in this section. Experiments conducted at discharge rates of 6.2, 2.57, 2.18, and 1.157 (l/s), comprising 241 data points, were used to calibrate and validate the GMDH-based models. The data were distributed into two equal parts (120 data for calibration and 121 for validation). The model inputs are x and Q, and the output is y. The GMDH-based structures of the developed models are provided in Fig. 3. The structures in this figure show that for IAGMDH (Simplest), only one neuron is generated to calculate y: the variable y is calculated directly from the input variables x and Q. To calculate the target parameter, the structures proposed for the IAGMDH variants that have more than one neuron (2nd order polynomial and 3rd order polynomial) use not only the newly generated neuron (y1) but also input neurons located in non-adjacent layers. The difference between these two structures lies in the inputs used to generate the output neuron in the third layer: they are y1 and x for the model generated using the 2nd order polynomial, and y1 and Q for the 3rd order polynomial.
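A reproducible version of this 120/121 calibration/validation split can be sketched as follows; the paper does not state how the rows were assigned, so the seeded random permutation and the file name are assumptions.

```python
import numpy as np

# data: array of shape (241, 3) holding the (x, Q, y) records; the file name
# and loading step are placeholders for the experimental data set.
data = np.loadtxt("bank_profile_records.csv", delimiter=",")

rng = np.random.default_rng(42)            # fixed seed for reproducibility
idx = rng.permutation(len(data))
calib, valid = data[idx[:120]], data[idx[120:]]

X_calib, y_calib = calib[:, :2], calib[:, 2]   # inputs x, Q; output y
X_valid, y_valid = valid[:, :2], valid[:, 2]
print(X_calib.shape, X_valid.shape)
```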
Fig. 3. The structure of the developed GMDH-based models: (a) IAGMDH – Simplest, (b) GMDH – GA, (c) IAGMDH – 2nd order polynomial, (d) IAGMDH – 3rd order polynomial
The mathematical formulations of the developed GMDH-based models are provided as Eqs. (25)–(28). It should be noted that the details of GMDH-GA are provided by Gholami et al. [11]. Moreover, the NLR model (Eq. 29) was likewise applied to the available observational data to give the following equation for predicting the y value of channel bank profiles:

GMDH-GA:

$$y = -0.602 + 1.0908\,y_2 - 0.062\,y_3 + 6.283\,y_2^2 + 6.247\,y_3^2 - 12.531\,y_2 y_3 \qquad (25)$$

$$y_1 = 11.243 - 5.596\,Q + 0.0439\,x + 0.756\,Q^2 + 0.00209\,x^2 - 0.0219\,Q x \qquad (25\text{-}1)$$

$$y_2 = -1.483 + 1.0934\,y_1 - 0.00457\,x - 0.00351\,y_1^2 + 0.00025\,x^2 + 0.0000799\,y_1 x \qquad (25\text{-}2)$$
$$y_3 = -2.0284 + 0.0457\,Q + 1.203\,y_1 + 0.0075\,Q^2 - 0.00327\,y_1^2 + 0.00179\,Q y_1 \qquad (25\text{-}3)$$

IAGMDH – Simplest:

$$y = 10.98170306 + 0.04487201989\,x - 5.456555995\,Q - 0.02247282915\,Q x + 0.002098163456\,x^2 + 0.7482779205\,Q^2 \qquad (26)$$

IAGMDH – 2nd order polynomial:

$$y = -1.505182126 + 1.104799365\,x_3 - 0.003786844611\,x + 6.270280505 \times 10^{-5}\,x x_3 - 0.00354071475\,x_3^2 + 0.0002273622657\,x^2 \qquad (27)$$

$$x_3 = 10.98170306 + 0.04487201989\,x - 5.456555995\,Q - 0.02247282915\,Q x + 0.002098163456\,x^2 + 0.7482779205\,Q^2 \qquad (27\text{-}1)$$

IAGMDH – 3rd order polynomial:

$$x_3 = 77.95831728 + 0.003907330452\,x - 102.4274293\,Q - 0.01587996486\,Q x + 0.002483725879\,x^2 + 39.34158713\,Q^2 - 0.0001418282622\,Q x^2 + 0.001860122794\,Q^2 x + 6.115340033 \times 10^{-7}\,x^3 - 3.99008815\,Q^3 \qquad (28)$$

$$y = -5.789257066 + 1.906004882\,x_3 - 2.989175602\,Q - 0.6364441019\,Q x_3 + 0.007064321987\,x_3^2 + 3.703786575\,Q^2 + 0.001909424724\,Q x_3^2 + 0.06262942332\,Q^2 x_3 - 0.0001483834252\,x_3^3 - 0.477933747\,Q^3 \qquad (28\text{-}1)$$

NLR:

$$y = 0.01\,Q^{-0.188}\,x^{1.718} \qquad (29)$$
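As a usage note, the IAGMDH – Simplest relation of Eq. (26) is a single quadratic in x and Q and can be transcribed directly, assuming the exponents restored above are the intended reading of the original typesetting; units follow the paper's data set.

```python
def iagmdh_simplest_y(x: float, Q: float) -> float:
    """Eq. (26): vertical bank coordinate y from the transverse distance x
    and the discharge Q, as one second-order polynomial neuron."""
    return (10.98170306 + 0.04487201989 * x - 5.456555995 * Q
            - 0.02247282915 * Q * x + 0.002098163456 * x ** 2
            + 0.7482779205 * Q ** 2)

# Illustrative call only; the value of x here is arbitrary, not a measured point.
print(iagmdh_simplest_y(x=50.0, Q=2.57))
```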
Figure 4 presents scatter plots for the training and testing of the GMDH-based models and NLR, as well as the analytical USBR method [17], in approximating the bank profile shape of stable channels. Table 3 presents the values of the different error indices in comparison with the experimental values. Error lines of ±10% and ±20% are drawn in Fig. 4 for easier comparison. Figure 4 indicates that with the GMDH-based models, almost all data are estimated with very small deviations from the exact line. The coefficient of determination (R2) values for GMDH-GA, IAGMDH-Simplest, IAGMDH-2nd order and IAGMDH-3rd order are 0.9471, 0.9371, 0.9438, and 0.973, respectively, in the training stage and 0.9427, 0.9303, 0.9378, and 0.9696, respectively, in the testing stage. These coefficients of determination indicate the high accuracy of the developed GMDH-based models in predicting stable bank profiles. It can be seen from the plots that most data were estimated with less than ±10% error (especially in network training), and some were approximated with relative errors in the range of ±10% to ±20%. According to the table, the correlation coefficient (R) values of the GMDH-based models are close to 1 in both training and testing (similar to R2), confirming the high accuracy of the GMDH-based models in predicting the geometric properties of the channel shape. With the NLR model, the estimated samples are not very close to the exact line and the data are scattered; nonetheless, the R2 values of 0.8746 and 0.9399 in testing and training, respectively, signify acceptable model performance. For this model, the fit line differs considerably from the exact line and is located to its left (especially in testing mode). With NLR, more data lie within the positive error range, indicating model underestimation. Moreover, negative and positive BIAS index values indicate model underestimation and overestimation, respectively (Eq. (23)); the negative BIAS values in Table 3 show that the NLR model underestimates. For the GMDH-based models, there are no great differences between the exact and fit lines, and the BIAS values are close to zero, which shows acceptable agreement of the model results with the experimental results. Nonetheless, the negative BIAS values indicate underestimation, as with the NLR model. In addition to the GMDH-based and NLR models, the y* values predicted with the USBR method, using the corresponding relation in Table 2, were calculated for the experimental data of this study. As seen in the USBR scatter graph, the inability of this relation to predict the coordinates of points on a stable bank profile is clear, as the large deviation of the data from the fit line is significant. The high error-index values of the USBR relation in Table 3 illustrate the large difference from the GMDH-based models. The low R2 values of the USBR method in both testing and training (R2 = 0.19 and 0.0626, respectively) signify a very weak correlation between the estimated and observed samples (especially for high y* values close to the free water surface). The reason for this is that the stable state in the present experiments is defined by sediment movement on the channel bed together with the threshold of motion of the particles located on the channel banks (mobile bed and stable bank), which is introduced as the optimum stable channel [16, 21].
The lateral momentum diffusion consideration causes significant non-uniform shear stress redistribution at the bed and banks of the channel. This leads to a continuously changing channel bank profile curve that is mostly described as a multi-degree polynomial (threshold channel) [16]. In the USBR method, the momentum diffusion term is ignored and all particles at the banks and on the bed of the channel are
at the threshold of motion, and there is no sediment motion anywhere in the channel [30]. Hence, applying the existing experimental data to the USBR relation causes scatter in the predicted y* values. This relation considers sine and cosine curves for the stable channel state, and considering momentum diffusion in a cosine channel leads to an unstable channel. The inefficiency of the USBR relation, especially for high y* values, is due to the large difference between the cosine and threshold channel profiles. Figure 5 shows the cosine and threshold channel bank profile shapes. A comparison of the error values of the GMDH-based and USBR models is given in Table 3. The low error of the IAGMDH indicates its close agreement with the experimental data; therefore, the IAGMDH can predict the threshold channel bank profile shape well. The difference between the GMDH-based and USBR models is natural: the USBR method does not consider the governing hydraulic parameters of a stable channel in natural rivers, which is a weakness of the USBR analytical relation. The reader is referred to Khodashenas [28] and Cao and Knight [22] for further discussion of the previous relations and of the large differences between them and the USBR model. It can be concluded from the table (for both testing and training values) that the MAE index of the GMDH-based models is lower than that of the NLR model, by about half for GMDH-GA and by about one third for IAGMDH – 3rd order polynomial. The RMSE values of the GMDH-based models are also lower than those of NLR. In terms of MARE (the relative error index), however, the NLR model scores lower than the GMDH-based models (unlike the absolute error indices). It can be stated that in predicting small y* values (close to zero) the GMDH-based models do not perform well and exhibit major differences from the observed values; the differences at these points might also be attributed to sampling errors (errors in the measured and observed data), which can be ignored in evaluating model performance [66]. The small y* values lie at the beginning of the cross-section formation (near the bed), which is ineffective in forming the stable cross-section of the channel. Besides, the high R values of the GMDH-based models (especially IAGMDH – 3rd order, which exceeds 0.98 in both the training and testing stages, whereas the NLR model reaches 0.8993 in the testing stage) show the correlation between the predicted and observed values, confirming the models' accuracy. In general, the IAGMDH – 3rd order model, with lower absolute error values (RMSE = 0.052, MAE = 0.039) than the other GMDH-based models and the NLR model (RMSE = 0.1415, MAE = 0.0954), although with a higher MARE (0.5107 versus 0.3537), predicts the channel bank profile more accurately. The lower ρ value of the GMDH-based models compared with NLR (by about half) also indicates a smaller error, especially for the IAGMDH – 3rd order model. Comparing the GMDH-based models and NLR demonstrates that the GMDH-based models perform better and much more efficiently than the regression model; for the case investigated in this study (stable channel shape), this is evident from the error indices of both models. Therefore, it can be concluded that the GMDH-based models clearly outperform the USBR analytical method and the NLR regression model in approximating the bank profile shape of stable channels.
An Improved Architecture of Group Method of Data Handling
785
Fig. 4. Comparison of the y* values predicted by the (a) IAGMDH, (b) NLR and (c) USBR models with experimental values in training and testing modes
Fig. 5. View of cosine and threshold channel bank profile shapes [16]
Next, the relationships for predicting the stable channel shape proposed by other researchers are compared with the models presented here, and the superior model is introduced. Table 4 provides the MAE, RMSE, and ρ indices for the different models of Table 2 and compares the IAGMDH – 3rd order and NLR models with the observed samples for the 1.157 (l/s) discharge. Figure 6 illustrates bar graphs of the different models' indices. It is clear that the IAGMDH – 3rd order model has the lowest error (MAE = 0.063, RMSE = 0.08 and ρ = 0.11) and is thus selected as the best model for estimating the shape characteristics of a channel in the stable state. After IAGMDH – 3rd order, the NLR model has similar but higher error-index values and is considered acceptable. Among the previous relationships, the VDM model has the lowest error indices (MAE = 0.31, RMSE = 0.2934 and ρ = 0.1037). The IKM and BVM models have the greatest error values and are considered the worst models. The high ρ value of the USBR model (1.0083) represents
Table 3. Comparison of IAGMDH, NLR, and USBR models with experimental values in training and testing modes using various statistical indices

Models | Stage | R | MARE | RMSE | MAE | BIAS | ρ
IAGMDH – Simplest | Train | 0.9681 | 0.3705 | 0.0753 | 0.0538 | −0.0011 | 0.1079
IAGMDH – Simplest | Test | 0.9636 | 0.6845 | 0.0800 | 0.0561 | −0.0005 | 0.1105
IAGMDH – 2nd Order Polynomial | Train | 0.9715 | 0.3214 | 0.0712 | 0.0529 | −0.0026 | 0.1018
IAGMDH – 2nd Order Polynomial | Test | 0.9679 | 0.5628 | 0.0751 | 0.0552 | −0.0009 | 0.1035
IAGMDH – 3rd Order Polynomial | Train | 0.9864 | 0.3114 | 0.0495 | 0.0370 | −0.0025 | 0.0703
IAGMDH – 3rd Order Polynomial | Test | 0.9848 | 0.5107 | 0.0520 | 0.0390 | −0.0013 | 0.0711
GMDH-GA | Train | 0.9732 | 0.3447 | 0.069 | 0.0531 | −0.0028 | 0.0986
GMDH-GA | Test | 0.9705 | 0.5551 | 0.0719 | 0.0548 | −0.0018 | 0.099
NLR | Train | 0.9695 | 0.2504 | 0.0791 | 0.0553 | −0.026 | 0.1133
NLR | Test | 0.8993 | 0.3537 | 0.1415 | 0.0954 | −0.0555 | 0.191
USBR | Train | 0.0626 | 5.9281 | 0.5804 | 0.5091 | 0.3915 | 1.747
USBR | Test | 0.189 | 4.303 | 0.7551 | 0.5751 | −0.3385 | 1
Table 4. Evaluation of IAGMDH and seven rational relationships from previous studies in comparison with experimental values from the present study for 1.157 (l/s) discharge using the error indices MAE, RMSE and ρ

Models | MAE | RMSE | ρ
IAGMDH – 3rd Order polynomial | 0.063 | 0.08 | 0.11
USBR | 0.3663 | 0.4382 | 1.0083
NLR | 0.1132 | 0.1643 | 0.214
IKM | 0.874 | 1.0703 | 0.3776
BVM | 0.986 | 1.0332 | 0.3648
VDM | 0.31 | 0.2934 | 0.1037
DIM | 0.5119 | 0.4343 | 0.1533
PIM | 0.4456 | 0.3366 | 0.119
GFM | 0.446 | 0.3103 | 0.1097
the low accuracy and weak correlation between the values predicted by the USBR relation and the experimental data, as well as the large gap between this method and the IAGMDH model presented in this study; the reasons were explained in the previous section.
Fig. 6. Various index error values in the calculation of bank profile characteristics using different models compared to the corresponding experimental values
Figure 7 presents plots of the threshold channel bank profiles estimated by the two best models (IAGMDH and VDM) in comparison with the experimental model; the fit lines for both models are also plotted. The better agreement of the IAGMDH model is clear, while the VDM model shows a considerable difference from the experimental results. The shape proposed by the two models for the cross-section of a channel in the stable state is worth noting in this figure. Despite its larger error, the VDM model estimated a polynomial curve in the channel deformation process, almost the same as the experimental model. In addition to the drag force, the lift force is also involved in the VDM model equation (unlike the other models). The IAGMDH model, with a higher R2 value, performed similarly to VDM (R2 = 0.9999), except for a lower bank slope and a higher water surface width. The deviation of IAGMDH from the experimental model for small y values (near the bed) is clear in Fig. 7; the MARE relative error increases there, as pointed out in the previous sections (see Table 3). The deviation of IAGMDH near the water surface is also evident, as it predicted a lower y value, because the channel shape in this model is a little wider and larger than in the other two models. However, the predicted form is quite similar to the experimental one. The shape proposed by the IAGMDH model is a polynomial curve, which is consistent with the polynomial equations obtained from the line fitted to the model. A polynomial curve is therefore deemed a reasonable channel shape in the present study; this result is also consistent with the polynomial curve shape suggested by the VDM model for a stable channel. The polynomial equation derived by fitting each model is also given in the figure.
Fig. 7. Bank profile shapes proposed by the two superior models in the present study (IAGMDH and VDM) compared with the experimental model
3.2 Bank Profile Shape Prediction by IAGMDH in Different Hydraulic Conditions

In this section, the bank profile shape estimated by the IAGMDH at different discharge rates is evaluated against the observed samples. Figure 8 contains plots of the stable channel bank profile predictions by the IAGMDH model, with the fit line drawn for each discharge rate. Table 5 shows the error values of this model compared with the experimental values. High R values (close to 1) show the high accuracy of the IAGMDH model. At the low discharges of 1.157 and 2.57 (l/s), IAGMDH had higher error indices, and the differences between the model and the experimental model are evident. A comparison shows that at 1.157 (l/s) the error indices are higher and R is lower (MARE = 0.635, RMSE = 0.285, MAE = 0.207 and R = 0.990) than at 2.57 (l/s) (MARE = 0.582, RMSE = 0.154, MAE = 0.124 and R = 0.997); therefore, the IAGMDH model accuracy is lower at 1.157 (l/s) than at 2.57 (l/s). A remarkable point in the error values at these discharge rates is the large MARE at 1.157 (l/s), which is due to the very small y values in those experiments. Despite its ρ (0.085) being lower than at 2.57 (l/s) (ρ = 0.119), it can be stated that the IAGMDH model does not perform well at this discharge rate. At 6.2 (l/s), the error-index values are low and R is high; hence, it is concluded that at this discharge rate the IAGMDH model performed best (MARE = 0.149, RMSE = 0.02, ρ = 0.047 and R = 0.994). In general, it can be concluded that the larger the discharge, the greater the compatibility of the IAGMDH model with the observed samples, and the further its accuracy increases.
A notable point in these figures is the threshold channel shape proposed by the IAGMDH model at the various discharge rates. Figure 9 shows the threshold channel shape predicted by the IAGMDH model at the four discharge rates. Clearly, with higher discharge the bank side slope increases and the water surface width decreases. At 1.157 (l/s), the channel depth is low, the water surface width is high, and the channel cross-section is wide. With increasing discharge, the depth increases as well. Unlike in natural channels, in threshold channels the flow depth increases with increasing passing flow rate, while the changes in channel width and slope are less correlated with the flow rate. These changes can be evaluated for the three discharge rates of 6.2, 2.57, and 1.157 (l/s), because the experimental cross-sections considered in all three cases are the same (trapezoidal), whereas at 2.18 (l/s) the cross-section is compound. Figure 8 reveals that the polynomial curve gives the best fit at all discharge rates (especially 6.2 (l/s)) and is almost fully compliant with the experimental model. Moreover, the fit line equation (a second-order polynomial) is shown for each discharge rate. Therefore, it can be said that the cross-sectional shape of the recommended threshold channel is of the "polynomial type". Negative BIAS index values at 1.157 and 2.57 (l/s) and positive values at 2.18 and 6.2 (l/s), respectively, indicate
Fig. 8. Stable channel bank profile predicted by the IAGMDH model and regression trend for different discharge rates compared with the experimental model
IAGMDH model underestimation and overestimation of the predicted values compared with the observed values.

Table 5. Different error index values in predicting the bank profile characteristics by the IAGMDH model in comparison with experimental results at different discharge rates
Discharge (Q) (l/s) | 1.157 | 2.57 | 6.2 | 2.18 (compound)
R | 0.99 | 0.997 | 0.994 | 0.995
MARE | 0.635 | 0.582 | 0.149 | 0.365
RMSE | 0.285 | 0.154 | 0.02 | 0.087
MAE | 0.207 | 0.124 | 0.015 | 0.073
BIAS | −0.202 | −0.093 | 0 | −0.073
ρ | 0.085 | 0.119 | 0.047 | 0.074
Fig. 9. Threshold channel shapes predicted by the IAGMDH Model for different discharge rates: 1.157, 2.18, 2.57 and 6.2 (l/s)
4 Conclusions

This research comprised an extensive experimental study and examination of the bank profile characteristics (x and y) under various hydraulic conditions. Moreover, robust GMDH-based models were designed, and their application to the estimation of the bank profile shape of stable channels was evaluated for the first time. Two hundred forty-one experimental data points were obtained and used to train the models (the GMDH-based models and NLR), and the models' performance was evaluated accordingly. The models' results
were compared with seven previous rational models (based on experimental and numerical methods), and the superior model was introduced. The most remarkable point of this study is the shape these models suggest for threshold channels. The evaluation of the results indicated that the IAGMDH model, the superior GMDH-based model, had lower error values and a higher coefficient of determination than the NLR regression model and the previous rational relationships. The earlier USBR model differs greatly from the IAGMDH model in bank profile prediction, which is attributable to its neglect of lateral momentum transfer in deriving its equation based on rational (mechanical) methods. Thus, IAGMDH agreed with the experimental model better and was chosen as the best model in the present study (RMSE = 0.08 and ρ = 0.11). Moreover, the IAGMDH model provides a simple, robust and straightforward equation for predicting the bank profile characteristics of a threshold channel with different passing flow rates. Among the previous models, the one presented by Vigilar and Diplas (VDM) [26] was the most accurate (RMSE = 0.1415 and ρ = 0.2021) because of the momentum diffusion term; consequently, the non-uniform redistribution of shear stress is considered in their numerical model. The IAGMDH model suggested a polynomial curve-shaped threshold channel, and with increasing discharge this curve became deeper and the water surface width decreased. This shape is similar to that proposed by the VDM model and conforms well to the experimental results. The IAGMDH model and its derived equation can be used to design the geometric characteristics of natural channels and rivers.
References 1. Yu, G., Knight, D.W.: Geometry of self-formed straight threshold channels in uniform material. Proc. Inst. Civil Eng. Water Maritime Energy 130(1), 31–41 (1998) 2. Kazemian-Kale-Kale, A., Bonakdari, H., Gholami, A., Khozani, Z.S., Akhtari, A.A., Gharabaghi, B.: Uncertainty analysis of shear stress estimation in circular channels by Tsallis entropy. Phys. A 510, 558–576 (2018) 3. Millar, R.G.: Theoretical regime equations for mobile gravel-bed rivers with stable banks. Geomorphology 64(3–4), 207–220 (2005) 4. Lee, J.S., Julien, P.Y.: Downstream hydraulic geometry of alluvial channels. J. Hydraul. Eng. 132(12), 1347–1352 (2006) 5. Afzalimehr, H., Abdolhosseini, M., Singh, V.P.: Hydraulic geometry relations for stable channel design. J. Hydrol. Eng. 15(10), 859–864 (2010) 6. Kaless, G., Mao, L., Lenzi, M.A.: Regime theories in gravel-bed rivers: models, controlling variables, and applications in disturbed Italian rivers. Hydrol. Process. 28(4), 2348–2360 (2014) 7. Singh, U.: Controls on and morphodynamic effects of width variations in bed-load dominated alluvial channels: experimental and numerical study (Doctoral dissertation, University of Trento) (2015) 8. Liu, X., Huang, H.Q., Nanson, G.C.: The morphometric variation of islands in the middle and lower Yangtze River: a variational analytical explanation. Geomorphology 261, 273–281 (2016) 9. Eaton, B., Millar, R.: Predicting gravel bed river response to environmental change: the strengths and limitations of a regime-based approach. Earth Surf. Proc. Land. 42(6), 994–1008 (2017) 10. Zhang, M., Townend, I., Zhou, Y., Cai, H.: Seasonal variation of river and tide energy in the Yangtze estuary. China. Earth Surface Process. Landforms 41(1), 98–116 (2016) 11. Gholami, A., Bonakdari, H., Ebtehaj, I., Shaghaghi, S., Khoshbin, F.: Developing an expert group method of data handling system for predicting the geometry of a stable channel with a gravel bed. Earth Surf. Proc. Land. 42(10), 1460–1471 (2017) 12. Shaghaghi, S., Bonakdari, H., Gholami, A., Ebtehaj, I., Zeinolabedini, M.: Comparative analysis of GMDH neural network based on genetic algorithm and particle swarm optimization in stable channel design. Appl. Math. Comput. 313, 271–286 (2017) 13. Shaghaghi, S., et al.: Stable alluvial channel design using evolutionary neural networks. J. Hydrol. 566, 770–782 (2018) 14. Shaghaghi, S., Bonakdari, H., Gholami, A., Kisi, O., Binns, A., Gharabaghi, B.: Predicting the geometry of regime rivers using M5 model tree, multivariate adaptive regression splines and least square support vector regression methods. Int. J. River Basin Manage. 17(3), 333–352 (2019) 15. Nanson, G.C., Huang, H.Q.: A philosophy of rivers: equilibrium states, channel evolution, teleomatic change and least action principle. Geomorphology 302, 3–19 (2018) 16. Vigilar Jr, G.G., Diplas, P.: Stable channels with mobile bed: formulation and numerical solution. J. Hydraul. Eng. 123(3), 189–199 (1997) 17. Henderson, F.M.: Stability of alluvial channels. J. Hydraul. Div. 87(6), 109–138 (1961)
18. Pizzuto, J.E.: Numerical simulation of gravel river widening. Water Resour. Res. 26(9), 1971– 1980 (1990) 19. Glover, R.E., Florey, Q.L.: Stable channel profiles, Lab. Rep. 325Hydraul, U.S. Bureau of Reclamation, Washington, DC (1951) 20. Simons, D.B., Senturk, F.: Sediment transport technology, Fort Collins. Water Resources Publications. Colorado , 4 (TC175. 2, S5) (1976) 21. Parker, G.: Self-formed straight rivers with equilibrium banks and mobile bed. Part 2. The gravel river. J. Fluid Mech. 89(1), 127–146 (1978) 22. Cao, S., Knight, D.W.: Entropy-based design approach of threshold alluvial channels. J. Hydraul. Res. 35(4), 505–524 (1997) 23. Ikeda, S.: Self-formed straight channels in sandy beds. J. Hydraul. Div. 107(4), 389–406 (1981) 24. Diplas, P.: Characteristics of self-formed straight channels. J. Hydraul. Eng. 116(5), 707–728 (1990) 25. Diplas, P., Vigilar, G.: Hydraulic geometry of threshold channels. J. Hydraul. Eng. 118(4), 597–614 (1992) 26. Vigilar, G.G., Jr., Diplas, P.: Stable channels with mobile bed: model verification and graphical solution. J. Hydraul. Eng. 124(11), 1097–1108 (1998) 27. Dey, S.: Bank profile of threshold channels: a simplified approach. J. Irrig. Drain. Eng. 127(3), 184–187 (2001) 28. Khodashenas, S.R.: Threshold gravel channels bank profile: a comparison among 13 models. Int. J. River Basin Manage. 14(3), 337–344 (2016) 29. Stebbings, J.: The shape of self-formed model alluvial channels. Proc. Inst. Civ. Eng. 25(4), 485–510 (1963) 30. Ikeda, S., Parker, G., Kimura, Y.: Stable width and depth of straight gravel rivers with heterogeneous bed materials. Water Resour. Res. 24(5), 713–722 (1988) 31. Babaeyan-Koopaei, K.: A study of straight stable channels and their interactions with bridge structures (Doctoral dissertation, Newcastle University) (1996) 32. Gholami, A., Bonakdari, H., Mohammadian, M., Zaji, A.H., Gharabaghi, B.: Assessment of geomorphological bank evolution of the alluvial threshold rivers based on entropy concept parameters. Hydrol. Sci. J. 64(7), 856–872 (2019) 33. Gholami, A., Bonakdari, H., Mohammadian, A.: A method based on the Tsallis entropy for characterizing threshold channel bank profiles. Physica A: Statistical Mechanics and its Applications, p. 121089 (2019) 34. Gholami, A., Bonakdari, H., Zaji, A.H., Akhtari, A.A.: Simulation of open channel bend characteristics using computational fluid dynamics and artificial neural networks. Eng. Appl. Comput. Fluid Mech. 9(1), 355–369 (2015) 35. Gholami, A., Bonakdari, H., Zaji, A.H., Akhtari, A.A., Khodashenas, S.R.: Predicting the velocity field in a 90 open channel bend using a gene expression programming model. Flow Meas. Instrum. 46, 189–192 (2015) 36. Gholami, A., Bonakdari, H., Ebtehaj, I., Akhtari, A.A.: Design of an adaptive neuro-fuzzy computing technique for predicting flow variables in a 90 sharp bend. J. Hydroinf. 19(4), 572–585 (2017) 37. Yaseen, Z.M., et al.: Novel hybrid data-intelligence model for forecasting monthly rainfall with uncertainty analysis. Water 11(3), 502 (2019) 38. Gholami, A., Bonakdari, H., Zaji, A.H., Ajeel Fenjan, S., Akhtari, A.A.: Design of modified structure multi-layer perceptron networks based on decision trees for the prediction of flow parameters in 90° open-channel bends. Eng. Appl. Comput. Fluid Mech. 10(1), 194–209 (2016)
39. Gholami, A., Bonakdari, H., Zaji, A.H., Michelson, D.G., Akhtari, A.A.: Improving the performance of multi-layer perceptron and radial basis function models with a decision tree model to predict flow variables in a sharp 90° bend. Appl. Soft Comput. 48, 563–583 (2016) 40. Gholami, A., Bonakdari, H., Zaji, A.H., Akhtari, A.A.: A comparison of artificial intelligencebased classification techniques in predicting flow variables in sharp curved channels. Eng. Comput. 36(1), 295–324 (2019). https://doi.org/10.1007/s00366-018-00697-7 41. Fenjan, S.A., Bonakdari, H., Gholami, A., Akhtari, A.A.: Flow variables prediction using experimental, computational fluid dynamic and artificial neural network models in a sharp bend. Int. J. Eng. 29(1), 14–22 (2016) 42. Karimi, S., Bonakdari, H., Karami, H., Gholami, A., Zaji, A.H.: Effects of width ratios and deviation angles on the mean velocity in inlet channels using numerical modeling and artificial neural network modeling. Int. J. Civil Eng. 15(2), 149–161 (2017) 43. Azimi, H., Bonakdari, H., Ebtehaj, I.: Gene expression programming-based approach for predicting the roller length of a hydraulic jump on a rough bed. ISH J. Hydraul. Eng. 27(1), 77–87 (2019) 44. Ebtehaj, I., Bonakdari, H., Zaji, A.H.: A new hybrid decision tree method based on two artificial neural networks for predicting sediment transport in clean pipes. Alex. Eng. J. 57(3), 1783–1795 (2018) 45. Ebtehaj, I., Bonakdari, H., Zeynoddin, M., Gharabaghi, B., Azari, A.: Evaluation of preprocessing techniques for improving the accuracy of stochastic rainfall forecast models. Int. J. Environ. Sci. Technol. 17(1), 505–524 (2019). https://doi.org/10.1007/s13762-019-02361-z 46. Ebtehaj, I., Bonakdari, H., Gharabaghi, B.: A reliable linear method for modeling lake level fluctuations. J. Hydrol. 570, 236–250 (2019) 47. Lotfi, K., et al.: Predicting wastewater treatment plant quality parameters using a novel hybrid linear-nonlinear methodology. J. Environ. Manage. 240, 463–474 (2019) 48. Zaji, A.H., Bonakdari, H., Gharabaghi, B.: Developing an AI-based method for river discharge forecasting using satellite signals. Theoret. Appl. Climatol. 138(1–2), 347–362 (2019). https:// doi.org/10.1007/s00704-019-02833-9 49. Bonakdari, H., Gholami, A.: Evaluation of artificial neural network model and statistical analysis relationships to predict the stable channel width, 11–14 July, p. 417. River Flow 2016: Iowa City, USA (2016) 50. Azimi, H., Bonakdari, H., Ebtehaj, I., Gharabaghi, B., Khoshbin, F.: Evolutionary design of generalized group method of data handling-type neural network for estimating the hydraulic jump roller length. Acta Mech. 229(3), 1197–1214 (2017). https://doi.org/10.1007/s00707017-2043-9 51. Ebtehaj, I., Bonakdari, H., Gharabaghi, B.: Development of more accurate discharge coefficient prediction equations for rectangular side weirs using adaptive neuro-fuzzy inference system and generalized group method of data handling. Measurement 116, 473–482 (2018) 52. Gholami, A., Bonakdari, H., Zeynoddin, M., Ebtehaj, I., Gharabaghi, B., Khodashenas, S.R.: Reliable method of determining stable threshold channel shape using experimental and gene expression programming techniques. Neural Comput. Appl. 31(10), 5799–5817 (2018). https://doi.org/10.1007/s00521-018-3411-7 53. Gholami, A., et al.: A methodological approach of predicting threshold channel bank profile by multi-objective evolutionary optimization of ANFIS. Eng. Geol. 239, 298–309 (2018) 54. 
Gholami, A., Bonakdari, H., Ebtehaj, I., Mohammadian, M., Gharabaghi, B., Khodashenas, S.R.: Uncertainty analysis of intelligent model of hybrid genetic algorithm and particle swarm optimization with ANFIS to predict threshold bank profile shape based on digital laser approach sensing. Measurement 121, 294–303 (2018) 55. Gholami, A., Bonakdari, H., Ebtehaj, I., Talesh, S.H.A., Khodashenas, S.R., Jamali, A.: Analyzing bank profile shape of alluvial stable channels using robust optimization and evolutionary ANFIS methods. Appl. Water Sci. 9(3), 40 (2019)
56. Gholami, A., Bonakdari, H., Samui, P., Mohammadian, M., Gharabaghi, B.: Predicting stable alluvial channel profiles using emotional artificial neural networks. Appl. Soft Comput. 78, 420–437 (2019) 57. Ebtehaj, I., Bonakdari, H., Khoshbin, F.: Evolutionary design of a generalized polynomial neural network for modelling sediment transport in clean pipes. Eng. Optim. 48(10), 1793– 1807 (2016) 58. Ebtehaj, I., Bonakdari, H., Khoshbin, F., Bong, C., Joo, H., Ab Ghani, A.: Development of group method of data handling based on genetic algorithm to predict incipient motion in rigid rectangular storm water channel. Scientia Iranica 24(3), 1000–1009 (2017) 59. Walton, R., Binns, A., Bonakdari, H., Ebtehaj, I., Gharabaghi, B.: Estimating 2-year flood flows using the generalized structure of the Group Method of Data Handling. J. Hydrol. 575, 671–689 (2019) 60. ASCE Task Committee on Hydraulics, Bank Mechanics, and Modeling of River Width Adjustment on River width adjustment. I: Processes and mechanisms. J. Hydraul. Eng. 124(9), 881–902 (1998) 61. Ivakhnenko, A.G.: Polynomial theory of complex systems. IEEE Trans. Syst. Man Cybern. 4, 364–378 (1971) 62. Farlow, S.J.: Self-Organizing Method in Modelling: GMDH Type Algorithm, p. 54. Marcel Dekker Inc., CRC Press (1984) 63. Iba, H., deGaris, H., Sato, T.: A numerical approach to genetic programming for system identification. Evol. Comput. 3(4), 417–452 (1995) 64. Safari, M.J.S., Ebtehaj, I., Bonakdari, H., Es-haghi, M.S.: Sediment transport modeling in rigid boundary open channels using generalize structure of group method of data handling. J. Hydrol. 577, 123951 (2019) 65. Soltani, K., Amiri, A., Zeynoddin, M., Ebtehaj, I., Gharabaghi, B., Bonakdari, H.: Forecasting monthly fluctuations of lake surface areas using remote sensing techniques and novel machine learning methods. Theoret. Appl. Climatol. 143(1–2), 713–735 (2020). https://doi.org/10. 1007/s00704-020-03419-6 66. Harman, C., Stewardson, M., DeRose, R.: Variability and uncertainty in reach bankfull hydraulic geometry. J. Hydrol. 351(1–2), 13–25 (2008)
Increasing Importance of Analog Data Processing Shuichi Fukuda(B) System Design and Management Research Institute, Keio University, 4-1-1, Hiyoshi, Kohoku-Ku, Yokohama 223-8526, Japan [email protected]
Abstract. This paper points out that although DX (Digital Transformation) is getting wide attention these days, and it is very good from the standpoint of making the most of the current bit computer, we should remember that most data the Natural World is composed of are analog. And analog data are quickly increasing in importance, because materials are getting softer and softer with the remarkable progress of material engineering. We need to directly interact with the objects to find out what they are and how we should handle them. And although our traditional engineering approach was Euclidean Space based, with orthonormality and interval-based distance with units as its requirements, the Real World is getting more and more complicated and complex, and the data come to contain a wide variety of information. So, the Euclidean Space approach cannot be applied anymore, and we need to develop a Non-Euclidean Space approach. To cope with this situation, Mahalanobis Distance, an ordinal approach combined with patterns, is proposed. The basic idea of this approach is to help our Instinct with a performance indicator, providing it with the whole picture of the situation and enabling it to make appropriate decisions on what actions we should take. As this approach is Instinct-based, it can process data with an extremely large number of dimensions. Keywords: Analog data processing · Non-Euclidean space · Ordinal approach · Strategic decision making · Prioritization · Subjective · Qualitative · Instinct · Performance indicator
1 Rapidly Changing Real World
The Real World is rapidly changing. Yesterday, changes were smooth, so we could differentiate them and predict the future. But today changes are sharp, so we cannot differentiate them. We cannot predict the future anymore. And our world was closed with a boundary yesterday, but it keeps on expanding; now the boundary has disappeared, and it is an open world. Another very important change is that materials are getting softer and softer with the remarkable progress of material engineering. Until very recently, materials were hard. We called them hardware. But today they are “software”. So, we need to directly interact with the material to find out how we should handle it (Fig. 1). © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 506, pp. 797–807, 2022. https://doi.org/10.1007/978-3-031-10461-9_54
Fig. 1. Rapidly changing real world
2 Sensing and Actuation
Sensing and actuation play a fundamental role in perceiving the Real World and in deciding what actions to take. In fact, living things are called “Creatures” because we create movement to survive. Movements constitute our basis for living. VAK (Visual, Auditory and Kinesthetics) is crucial in sensing and actuation. Although DX (Digital Transformation) is getting wide attention these days, we must remember that the Real World or the Natural World is analog. In fact, until very recently, we could easily identify an object with our eyes, even from a distance, and we could understand how we should handle it, because most materials or objects in the Natural World were hard. In fact, we called them “Hardware”. But with the remarkable progress of material engineering, materials are getting softer and softer. Therefore, unless we directly interact with an object, we do not understand what it is and how we should handle it. They are now “Software”. So, VAK becomes increasingly important. Visual means eyes. Eyes play the important roles described above. But we need other sensing functions. Another one is Auditory. It means ears. But it is not only the usual hearing function; more important is the sense of balance and of the positions of the body. The three semicircular canals are the organs responsible. The reason we can walk in the darkness or scratch our back, although we cannot see it, is thanks to these organs. The final K means Kinesthetics. This is movement. So, let us discuss Human Movement next.
3 Human Movements
Human movements are divided into two: one is Motion, which can be observed from outside; the other is Motor, which is the movement inside of us. Movements of muscles, etc. are included in Motor. Nikolai Bernstein clarified Motion [1] (Fig. 2). At first, our motion trajectories vary widely, but as we get close to the target object, our muscles harden and they move together with our skeleton. They constitute a musculoskeletal system, which includes bones, muscles, tendons, ligaments and soft tissues. So, we can easily identify parameters
and apply mathematical approaches. In fact, most research on human movements is directed at this stage and discusses how we can control them. But what is more important in this age of frequent, extensive and unpredictable changes is the initial stage. The reason our trajectories vary widely from time to time is that we try to adapt to the changing environments and situations. We coordinate many body parts and balance our body to cope with the changing environments and situations. Yes, Coordination becomes more important these days to cope with the changing Real World. But we should remember that Coordination is not only important for actuation. We need to perceive the current environment and situation to make adequate decisions on what actions to take. So, Coordination is important for both. When it comes to control, it can be processed ideally in the current bit, or binary digit, computer framework, which processes numerical data on the basis of 0 or 1. In other words, if datasets are processed in Euclidean Space, then it works perfectly. But the Euclidean Space approach calls for orthonormality and interval-scale distance with units. When our world was small and closed with a boundary, we could satisfy these requirements, but the rapidly changing Real World cannot satisfy them anymore. Datasets come to be composed of a wide variety of information. So, we must now develop a Non-Euclidean Space approach.
Fig. 2. Human motion
4 Mahalanobis Distance (MD): Ordinal Approach
P. C. Mahalanobis proposed the Mahalanobis Distance (MD) [2] (Fig. 3). He was a researcher in the design of experiments. He proposed MD to remove outliers from his datasets to improve the quality of the datasets for the design of experiments. But although MD is called a distance, it is not a cardinal (one, two, three) distance. It is ordinal (first, second, third). In other words, by introducing MD, we can prioritize which data should be removed first, second or third. MD serves to prioritize our decisions.
MD only indicates how far the point P is from the mean of the dataset. The dataset does not have to satisfy the Euclidean Space requirements; any dataset is OK. So, we can process a wide variety of information. Another great benefit of MD is that it is ordinal. So, we can introduce subjective and qualitative evaluation. In fact, when we make decisions, our senses play a very important role. If the objects are identical and do not change from product to product, we may make decisions objectively and quantitatively. But when the object shapes or behaviors change from product to product, we depend on our deep sensation, or proprioception. Until now, i.e., in the age of Euclidean Space, digital data processing was our chief focus, but in the Natural World, most things are analog, and they differ in shape and behavior from one to another. Thus, MD opens the door to Non-Euclidean Space. In fact, “analog” originates from Latin and Greek, meaning ana (according to) and logos (ratio). So, MD fits perfectly with this idea.
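As a minimal illustration of this ordinal use of MD, the sketch below (a hypothetical example; the dataset values, the choice of three indicators and the NumPy-based implementation are all assumptions, not taken from the paper) computes the Mahalanobis distance of each object from the mean of a small dataset and ranks the objects so that the most atypical one becomes the first candidate for removal.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X from the dataset mean."""
    mean = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))  # pseudo-inverse guards against singular covariance
    diff = X - mean
    # d_i = sqrt((x_i - mean)^T Sigma^-1 (x_i - mean))
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# toy dataset: 3 indicators for 6 objects (values are invented)
X = np.array([
    [0.65, 0.94, 0.18],
    [0.66, 0.93, 0.38],
    [0.57, 0.92, 0.37],
    [0.55, 0.93, 0.19],
    [0.10, 0.20, 0.90],   # an atypical object
    [0.59, 0.87, 0.11],
])

d = mahalanobis_distances(X)
# ordinal use: rank objects by MD and treat the largest as the first candidate for removal
order = np.argsort(d)[::-1]
print("removal priority (object indices):", order)
```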
Fig. 3. Mahalanobis distance (MD)
5 Real World and Industrial World
Let us consider and compare the Real World, or the Natural World, and the Industrial World. Since the Industrial Revolution, we have been focused on how we can improve technology and develop a more sophisticated Industrial Society. The Industrial Revolution introduced the Division of Labor, and we started to work for others, i.e., for external rewards. But we should remember that engineering started in order to make our dreams come true. And it is a challenge. Challenge is the core and mainspring of all human activities. Why do mountain climbers choose to climb the most difficult route, in spite of the fact that easier routes are available? Because they enjoy the challenge. They would like to actualize themselves and demonstrate how capable they are, as Abraham Maslow pointed out [3] (Fig. 4). Edward Deci and Richard Ryan proposed Self-Determination Theory (SDT) and pointed out [4] that if a job is internally motivated and self-determined, it will provide the maximum satisfaction and feeling of achievement, which no external rewards can provide. Another important point they insisted on is that SDT satisfies human needs for growth. In the Real World, or the Natural World, things are changing, and all things keep on growing. The Industrial World, however, pursues reproducibility. Current engineering does not take growth into consideration.
Analog may be expressed as continuous and digital as discrete, although in the strict sense analog and continuous, and digital and discrete, are different things. But if a rough discussion is allowed, we may say that even though our bodies are composed of many cells, we pay attention to our bodies, and the focus of our discussion on Control or Coordination is our bodies. This is because we need an appropriate unit. We identify things in the Natural World based on these appropriate units. When it comes to digital or discrete, how we can digitalize them is a big challenge. That is why in business a performance indicator is considered very important, and why KPIs (Key Performance Indicators) are strongly pursued. Business activities involve a wide variety of information, so instead of discussing individual problems, businesses discuss the whole picture.
Fig. 4. Maslow’s hierarchy of human needs
6 Communication: Verbal and Non-verbal
Human communication started as non-verbal. We observed the body signs of others and understood what they wished to tell us. Then, we developed words. But today, languages have developed so much that we have forgotten to pay attention to body messages. Today, we rely too much on words. Yet even today, we try to detect emotion from the face, etc., because emotion plays an important role in understanding what the other party is thinking or feeling in his or her mind. Kevin Ashton proposed the concept of IoT (Internet of Things) [5], and he emphasized the importance of humans and machines playing together on the same team. He calls it the “Things team”. The Internet in his IoT does not mean only the internet; it means Communication. What he insists is that we must form a team with everything to produce what we want. We must communicate with the Real World to adapt to its frequent, extensive, and unpredictable changes.
6.1 Mind, Body and Brain
Figure 5 illustrates the relation among Mind, Body and Brain. We should note that it is the Body that really interacts with the Real World, while the Brain receives these signals and structures them into knowledge. So, the Body interacts with the Real World without any delay; it is real time. The Brain, on the other hand, makes decisions based on Knowledge. So, there is delay.
Fig. 5. Mind, body and brain
6.2 What Octopuses and Babies Teach Us
When it comes to direct interaction with the Real World, we have a lot to learn from octopuses and babies. Octopuses die soon after their offspring are born, so the young do not inherit knowledge. They must live on their Instinct alone. Yet they are known as experts of escape. They can escape from any environment and situation; they can even escape from a screwed container. If we were locked in a screwed container, we would panic and could not do anything. But the octopus can. The brain of the octopus is large, but its capability is about the same as a dog’s. The octopus, however, has eight arms, and its brain works to coordinate them. Peter Godfrey-Smith published “Other Minds: The Octopus, the Sea, and the Deep Origins of Consciousness” [6]. He tells us how wise the octopus is and that we should learn from it. For example, it can recognize that the mirror image is itself. Even among vertebrates, only humans and apes can recognize their own mirror image. This is because they can sense their motor movements. The octopus is the only invertebrate which can. This is because its head is used to coordinate its eight arms, and this feeling of movement leads it to self-mirror recognition. Thus, the octopus’s intelligence is body intelligence, while the human’s is brain intelligence. In other words, the octopus’s intelligence is Wisdom, while the human’s is Knowledge (Fig. 6). Jean Piaget, the Swiss psychologist, clarified how babies develop their cognitive capabilities [7]. Babies directly interact with the Real World, learn how to cope with the changing environments and situations, and grow. Betty Edwards, the American sketch artist, made it clear [8] that children up to 7 years old draw sketches of the Real World, but after 7 years old they draw sketches based on concepts. It will interest you if I introduce the following psychological test. You hold a seminar in a room with a clock.
Fig. 6. The octopus and the human
During the seminar, tell everybody to leave the room and ask them whether there is a clock in the room. Half of them do not remember. The other half say “yes”. But if you ask them how the clock looks, only one or two can answer exactly. We grownups are not seeing the Real World; we are seeing a World of Concepts. We should bear this in mind.
7 Emotion
Communication is deeply associated with emotion. Even between humans, who have words to communicate, we observe the other party to detect emotion. Emotion originates from the Latin “movere”, and what is important is that it shares this etymology with motivation. So, when you watch the other party, you are hoping to detect what is motivating him or her to communicate with you, i.e., you would like to know what is in his or her mind. If you can share the emotion, you will have a feeling of togetherness.
8 Patterns
MD opened the door to Non-Euclidean Space. But MD just describes how far the point P is from the mean of an individual dataset. It says nothing about the relation between datasets. To manage a system as we like, we would like to have the whole picture and know how the datasets are related. Our past study on detecting emotion from the face provided the solution.
8.1 Detecting Emotion from Face
Fukuda and his group worked on detecting emotion from the face around 2000. We tried image processing tools one after another, but they took too much time and did not produce satisfactory results. During these attempts, Fukuda suddenly realized that we can easily detect emotion from characters in cartoons. In those days, most cartoons were in black and white. So, we developed a cartoon face model and, based on it, we tried to
804
S. Fukuda
detect emotion. It worked very well: it does not take much time, and we can identify the emotion at once with satisfactory results [9] (Fig. 7). We realized that patterns work for non-verbal communication.
Fig. 7. Detection of emotion from face
9 Mahalanobis Distance-Pattern (MDP) Approach
Based on the above discussion, Fukuda developed the Mahalanobis Distance-Pattern (MDP) Approach to process analog data by combining Mahalanobis Distance and patterns. To process dynamic data, we can introduce a Recurrent Neural Network (RNN) (Fig. 8). But an RNN assigns links to nodes in a random manner. So, it is a black box, and we cannot manipulate the system as we wish. Therefore, Reservoir Computing (RC) is introduced. RC enables us to make adjustments at the output, so we can manage the system as we wish. Another great benefit of RC is that it enables us to introduce micro technologies, so we can make our sensors and actuators extremely small. In fact, we can make them part of our body. Up to now, machines and humans have worked in separate worlds, and machines have followed humans’ instructions. But now the true Things Team, which Ashton proposed in his IoT, is realized. Up to now, HMI has meant Human Machine Interaction; the two worlds were different. But now it means “Human Machine Integration”. Until now, we humans had to give instructions to machines, but since humans and machines are integrated, machines can play the role of humans. They can now understand what we want. If the handicapped need help, they offer the help that is needed without any instruction. They can help seniors realize what they want (Fig. 9). They can act spontaneously.
9.1 Basic Concept of MDP: Swimming as an Example
Let us take swimming to explain how MDP works. In swimming, the water changes from moment to moment. So, we cannot identify parameters and we cannot apply mathematical approaches. But if we put wearable sensors on the swimmer or take images of
Fig. 8. Recurrent neural network
Fig. 9. Reservoir computing
their swimming, we can obtain a data sheet as shown in Fig. 10. Each row represents an individual muscle. We apply MD to this sheet and compute MD at time T1 and time T2. If MD is decreasing, that muscle is working OK; if it is increasing, we need to change the moving style of that muscle to improve swimming (a minimal computational sketch of this idea is given after Sect. 9.2). As our muscles and how we move them vary from person to person, there is no other way but to learn swimming on our own. Self-learning is the only way to learn swimming.
9.2 From Individual to Team: Soccer as an Example
The previous section describes how MDP can be applied to individuals. Here we discuss how it can be applied to Team Organization and Management (TOM), taking soccer as an example. In yesterday’s soccer, the game did not change much, so the formation remained the same during the game. Managers were off the pitch and gave instructions. Every player was expected to play his best at his own position. But today soccer changes very frequently during the game. Thus, the formation needs to be updated every minute; so rapid and frequent is the change that managers need to play together with the players on the pitch. In other words, the soccer formation of yesterday was a tree, to use a graph-theory term, but now it has changed to a network. What differentiates the soccer network is that in other
Fig. 10. Mahalanobis distance-pattern (MDP)
network applications, the number of members or players on the team is fixed and the rules for managing the team are also fixed. In soccer, however, the team formation changes every minute, i.e., the number of nodes and links in its network varies extensively. Then, how can we manage such rapid changes of team formation? In short, proactive is the keyword. Every player needs to be prepared beforehand for the next formation. Not only should each player anticipate what the playing manager will expect of him, but he also needs to perceive what the other players are thinking about the next formation and know which plays the other players excel at. Thus, the players need to know each other well and must know the playing manager’s favorite strategy. All players must be proactive and prepare for any possible team formation.
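To make the swimming example of Sect. 9.1 concrete, here is a minimal sketch (a hypothetical toy implementation, not the author's code; the muscle names, sensor features and reference data are invented) that computes an MD value per muscle at two time points against a reference dataset and flags the muscles whose MD increased.

```python
import numpy as np

def md_to_reference(x, ref):
    """Mahalanobis distance of a feature vector x from a reference dataset ref
    (rows of ref = past or expert measurements of the same features)."""
    mean = ref.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(ref, rowvar=False))
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# hypothetical data: for each muscle, a reference set of feature vectors plus
# the feature vectors measured at times T1 and T2 (all values invented)
rng = np.random.default_rng(0)
muscles = {
    "deltoid":    {"ref": rng.normal(0.5, 0.1, (30, 3)),
                   "T1": np.array([0.70, 0.55, 0.40]), "T2": np.array([0.60, 0.52, 0.45])},
    "latissimus": {"ref": rng.normal(0.4, 0.1, (30, 3)),
                   "T1": np.array([0.45, 0.40, 0.38]), "T2": np.array([0.80, 0.20, 0.70])},
}

for name, m in muscles.items():
    d1 = md_to_reference(m["T1"], m["ref"])
    d2 = md_to_reference(m["T2"], m["ref"])
    status = "working OK" if d2 < d1 else "change the moving style"
    print(f"{name}: MD(T1)={d1:.2f}, MD(T2)={d2:.2f} -> {status}")
```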
10 Instinct
The greatest feature of MDP is that it is based on Instinct. The traditional approach to human movement is based on bit processing, so it has a high affinity with the current computer. And as pointed out above, most studies on human movement are based on the musculoskeletal system and discuss Control. But, as Bernstein pointed out, Coordination is becoming increasingly important, and to perceive the current environment and situation by directly interacting with the Real World, we need to free ourselves from knowledge and make the most of our Instinct, as the octopus and the baby teach us. And with increasing dimensionality, we cannot process the data mathematically. We need to deal with information, which is composed of a wide variety of data. So, the number of dimensions increases exponentially, and the problem cannot be solved by a mathematical approach. Instinct, however, can lead us to understand the context and enable us to make appropriate decisions on what actions we should take. Thus, what is important is how we can support Instinct. That is why the importance of performance indicators is stressed in the business field. What businesses need is a tool that lets them see the whole picture and helps their instinct to prioritize their decisions. MDP is nothing other than a performance indicator to support our Instinct.
If we remember that the Natural World is analog, we should recognize how important our Instinct is and do our best to fully utilize it.
11 Self-sustaining and Self-satisfying Society: Society in the Next Generation
AI is getting wide attention these days, but AI consumes 10,000 times more energy than a human brain. In fact, unless we use AI for specific purposes, energy resources will soon be depleted. The Industrial Society is reaching its ceiling, and it is now time to explore a new society for the next generation. To sustain growth, we need to take a completely different perspective. If we look back, we started engineering to make our dreams come true. We pursued and enjoyed not only products but also processes. Even failures provided us with the next opportunity for challenge. So, we enjoyed a self-sustaining life, and it brought us a self-satisfying society. We should turn around and go the other way. Then, we can truly enjoy life, as our mental needs will be fully satisfied.
References 1. Bernstein, N.: The Co-ordination and Regulation of Movements. Pergamon Press, Oxford (1967) 2. Mahalanobis, P.C.: On the generalized distance in statistics. Proc. Natl Inst. Sci. 2(1), 49–55 (1936) 3. Maslow, A.H.: A theory of human motivation. Psychol. Rev. 50(4), 370–396 (1943) 4. Deci, E.L., Ryan, R.M.: Intrinsic Motivation and Self-Determination in Human Behavior. Springer, Berlin (1985) 5. Ashton, K.: That ‘Internet of Things’ Thing. RFID J. 22 (2009) 6. Godfrey-Smith, P.: Other Minds: The Octopus, the Sea, and the Deep Origins of Consciousness. William Collins, Glasgow (2016) 7. https://en.wikipedia.org/wiki/Jean_Piaget 8. Edwards, B.: Drawing on the Right Side of the Brain: A Course in Enhancing Creativity and Artistic Confidence, J.P. Tarcher, Los Angeles (1979) 9. Kostov, V., Fukuda, S., Johansson, M.: Method for Simple extraction of paralinguistic features in human face image & visual computing. J. Inst. Image Electron. Eng. Jpn. 39(2), 111–125 (2001)
New Trends in Big Data Profiling Júlia Colleoni Couto(B), Juliana Damasio, Rafael Bordini, and Duncan Ruiz School of Technology, PUCRS University, Porto Alegre, RS, Brazil [email protected]
Abstract. A known challenge related to big data is that data ingestion occurs continuously and at high speed, and the data profile can quickly vary because of this dynamism. Data profiling can range from simple summaries to more complex statistics and is essential for understanding the data. For a data scientist, it is essential to know the profile of the data to be handled, and this information needs to be updated according to the new data that is continuously arriving. This paper reviews the literature about how data profiling is being used in big data ecosystems. We search eight relevant web databases to map the papers that present big data profiling trends. We focus on categorizing and reviewing the current progress on big data profiling for the leading tools, scenarios, datasets, metadata, and information extracted. Finally, we explore some potential future issues and challenges in big data profiling research.
Keywords: Big data · Data profiling · Data lakes
1 Introduction
One of the ways to present information about the data we have stored is by generating data profiles. Data profiling creates data summaries of varied complexities, from simple counts, such as the number of records [13], to more complex inferences, such as functional data dependencies [7]. Data profiling allows us to understand better the data we have, and it is essential to help us choose the tools and techniques we will use to process the data according to its characteristics. It is useful for query optimization, scientific data management, data analytics, cleansing, and integration [14]. Data profiling is also useful in conventional file systems (such as those used in Windows and Linux), but it is essential in big data environments, mainly due to volume, velocity, and variety. In this paper, we review the literature about data profiling in big data. We aim to understand the big picture of how data profiling is being done in big data. We present the most used tools and techniques, the types of data, the areas of application, the type of information extracted, and the challenges related to the big data profiling research field. To the best of our knowledge, no previous studies have systematically addressed this issue. To achieve our goal, we perform a Systematic Literature Review (SLR), based on eight electronic databases, containing papers published from 2013 to 2019. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 506, pp. 808–825, 2022. https://doi.org/10.1007/978-3-031-10461-9_55
We started with 103 papers, and, using inclusion and exclusion criteria, we selected 20 papers for the final set. We use the PRISMA checklist [22] to help us improve the quality of our report, and we use the process suggested by Brereton et al. [5] to plan the steps to follow. We use the Kappa method [21] to enhance the quality of the results and measure the level of agreement between the researchers. Two researchers worked on analyzing the papers to reduce bias, and two others were involved in case of disagreement. Our main contribution is related to characterizing new trends in big data profiling. For instance, we found that R, Python, and Talend are the most used tools, and we identified seven areas of application, namely, automotive, business, city, health, industry, web, and others. We also mapped the datasets used in those areas, mostly based on online repositories, real-world datasets, and auto-generated data. Our analysis also shows that most papers use data profiling to generate metadata rather than using metadata to generate data profiles. Furthermore, data type, origin, and temporal characteristics are among the most frequent metadata presented in the papers. We also create a classification for the type of information extracted using data profiling (statistics, dependencies, quality, data characteristics, data classification, data patterns, timeliness, and business processes and rules). Finally, we present and discuss 15 challenges related to big data profiling: complexity, continuous profiling, incremental profiling, interpretation, lack of research, metadata, online profiling, poor data quality, profiling dynamic data, topical profiling, value, variability, variety, visualization, and volume. We believe that our findings can provide directions for people interested in researching the field of big data profiling.
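As a concrete illustration of the kind of per-column profile summaries discussed above, the following sketch (a minimal hypothetical example, not drawn from the surveyed papers; it assumes the data fits in a pandas DataFrame) computes simple profile statistics such as counts, missing values and distinct values.

```python
import pandas as pd

def simple_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return a per-column profile: type, counts, missing values and distinctness."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "count": int(s.count()),              # non-null values
            "missing": int(s.isna().sum()),
            "distinct": int(s.nunique(dropna=True)),
            "min": s.min() if pd.api.types.is_numeric_dtype(s) else None,
            "max": s.max() if pd.api.types.is_numeric_dtype(s) else None,
        })
    return pd.DataFrame(rows)

# toy dataset standing in for newly ingested records
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["BR", "BR", "CA", None],
    "amount": [10.5, 3.2, None, 7.7],
})
print(simple_profile(df))
```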
2 Materials and Methods
An SLR is a widely used scientific method for systematically surveying, identifying, evaluating, and interpreting existing papers on a topic of interest [16]. We performed an SLR using the protocol proposed by Brereton et al. [5]. This method has three phases, namely, Plan, Conduct, and Document. Also, we chose to develop and report our systematic review following the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P) [22] because the document helps us build the protocol and best arrange the items to report. In the following sections, we detail how we perform each phase.
2.1 Plan Review
The planning phase introduces the processes and steps to ground the SLR, and it should be carefully done because it is the basis of all subsequent research. In this phase, we define research questions, develop, and assess the review protocol. Specify Research Question. Our main goal is to identify how data profiling is being used in the big data context. To do this, we create the Research Questions
(RQ) presented in Table 1, which are important so that we can get an overview of big data profiling. RQ1 is useful for understanding how to perform big data profiling, so people who would like to start working with big data profiling can begin by exploring the most used tools. RQ2 helps us understand the main areas of application and what kinds of datasets are being used to report studies on data profiling, so beginners in big data profiling can focus on some areas or specific datasets to start exploring profiling, for instance. In RQ3, we are interested in understanding the type of metadata collected by the papers, which can help people who will develop a big data profiling application to map the most important characteristics of the data to be presented. RQ4 explores the most commonly presented type of information. Finally, RQ5 maps the main challenges pointed out by the selected papers, and thus we suggest future research directions for big data profiling. We use PICO (Population, Intervention, Comparison, and Outcome) and PICo (Population, Interest, and Context) to help in formulating our RQs. PICO and PICo are similar evidence-based models that can be combined and used to improve the research's delimitation, clarify the scope, and elaborate the research question. Table 2 presents our research scope. Developing the Review Protocol. We selected eight relevant computer science electronic databases to develop and apply our search protocol: Scopus, IEEE Xplore, Springer, Google Scholar, Science Direct, ACM, Web of Science, and arXiv. We included papers published in English, regardless of the year of publication. We do not specify a start date because we aim to map data profiling's evolution in big data since its beginning. Afterward, we identify the most important keywords related to our research question, such as "data profiling" AND ("big data" OR "data lake"). We combine these terms to create the search expression according to each electronic database's mechanism (see Table 3). For example, in arXiv and ACM, we joined two search strings since the results obtained using both were more aligned with what we expected. We performed the searches in the abstract, title, and keywords fields.
Table 1. Research questions.
RQ1 What are the tools for big data profiling?
RQ2 What are the areas of application and datasets reported to be profiled?
RQ3 What type of metadata did the papers collect?
RQ4 Which information is extracted using data profiling?
RQ5 What are the challenges in big data profiling?
Table 2. PICO and PICo definitions.
PICO | Population: Big data systems | Intervention: Data profiling | Comparison: Data warehouses | Outcome: Tools and challenges
PICo | Population: Big data systems | Interest: Tools and challenges | Context: Data profiling
We defined a control study to validate the search expression. A control study is a primary study resulting from a non-systematized web search, which is known to answer our research questions. We use it to check whether the search strings are adequate. If this paper were in the electronic database, it had to come up in the search with the search string that we previously defined. If the search did not return the control study, the search string needed to be adjusted until it did so. We chose the following control study: Juddoo, Suraj. "Overview of data quality challenges in the context of Big Data." 2015 International Conference on Computing, Communication and Security (ICCCS). IEEE, 2015. [14]. We chose this paper because it is highly related to our research: it presents a related literature review and answers some of our research questions. Assessing Quality of the Studies. We followed the selection criteria for the inclusion and exclusion of papers to get only results related to our research topic. The papers we accepted met all the following criteria:
– Be qualitative or quantitative research about data profiling in big data.
– Be available on the internet for downloading.
– Present a complete study in electronic format.
– Be a paper, review, or journal, published on the selected electronic databases.
The papers we rejected met at least one of the following criteria:
– Incomplete or short paper (less than four pages).
– Unavailable for download.
– Duplicated paper.
– Written in a language other than English.
– Paper is not about data profiling in big data.
– Literature review or mapping (this criterion was only used for the review about data integration in data lakes).
– Ph.D., M.Sc., or Undergraduate theses.
Validating Review Protocol. One researcher (JC) developed the review protocol and made several trials changing the search string to obtain results relevant and aligned to the research question. Then, another researcher (JD) performed the second review. They made new adjustments together, based on their reviews. Based on this validation, we agreed to develop the SLR using the protocol we present here.
Table 3. Search strings for each electronic database.
Electronic database | Search string
Scopus
(TITLE-ABS-KEY ("data profil*") AND TITLE-ABS-KEY ("big data" OR "data lake*"))
IEEE Xplore
("All Metadata":"data profiling" AND ("big data" OR "data lake"))
Springer
https://link.springer.com/search?dc.title=%22data+prof iling%22+%28%22big+data%22+OR+%22data+lake%22%29&datefacet-mode=between&showAll=true
Google Scholar
allintitle: "big data" OR "data lake" OR"data profiling"
Science Direct
Title, abstract, keywords: "data profiling" AND ("big data" OR "data lake")
ACM
(Searched for acmdlTitle:(+"data profiling" +"big data") OR recordAbstract:(+"data profiling" +"big data") OR keywords.author.keyword:(+"data profiling" +"big data")) JOIN( Searched for acmdlTitle:(+"data profiling" +"data lake") OR recordAbstract:(+"data profiling" +"data lake") OR keywords.author.keyword:(+"data profiling" +"data lake"))
Web of Science
(from all databases): TOPIC: ("data profiling") AND TOPIC: ("big data" OR "data lake") Timespan: All years. Databases: WOS, DIIDW, KJD, RSCI, SCIELO. Search language=Auto
arXiv
(Query: order: -announced date first; size: 50; include cross list: True; terms: AND all="data profiling"; AND all="big data") JOIN (Query: order: -announced date first; size: 50; include cross list: True; terms: AND all="data profiling"; AND all="data lake")
2.2 Conducting the Review
In this phase, we start applying the protocol we previously defined. To do so, we apply the search string to each electronic database and extract the results in BibTeX file format. Only Springer and arXiv do not facilitate this process, so we have to select each record, copy its BibTeX, and then consolidate everything into a single file. Also, Google Scholar has a slightly different process, where we have to log in to a Google account, run the search, mark each result as a favorite, export the results 20 at a time, and then consolidate them into a single file as well. Of course, this process refers to the versions of the web search engines available when we conducted the review phase (Dec. 2019), and for each one the process can evolve or change in future versions.
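A small sketch of the consolidation step described above (a hypothetical helper, not part of the authors' tooling; the folder layout, file names and the regex-based title extraction are assumptions) merges the exported BibTeX files and drops duplicate entries by normalized title.

```python
import glob
import re

def entries_with_titles(bibtex_text):
    """Yield (normalized title, entry text) pairs from a BibTeX string using a simple regex."""
    for raw in re.split(r"\n@", bibtex_text):
        if not raw.strip():
            continue
        entry = raw if raw.startswith("@") else "@" + raw
        m = re.search(r"title\s*=\s*[{\"](.+?)[}\"]\s*,?", entry, re.IGNORECASE | re.DOTALL)
        title = re.sub(r"\s+", " ", m.group(1)).strip().lower() if m else entry[:60]
        yield title, entry

seen, merged = set(), []
for path in glob.glob("exports/*.bib"):      # one export file per electronic database
    with open(path, encoding="utf-8") as f:
        for title, entry in entries_with_titles(f.read()):
            if title not in seen:            # keep only the first occurrence of each title
                seen.add(title)
                merged.append(entry)

with open("consolidated.bib", "w", encoding="utf-8") as f:
    f.write("\n\n".join(merged))
```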
Table 4. Kappa results through each iteration—table based on Landis and Koch [18].
Kappa values | Strength of agreement

a) the law of trichotomy – for all a, b: exactly one of a > b, a = b or a < b holds; b) the law of transitivity – for all a, b, c: a > b and b > c => a > c. Thus, if it is possible to search for objects for which a single sequence of indicators forms a monotonically increasing/decreasing sequence, such objects can be combined into a single ordinal-invariant pattern cluster. However, the question arises about the possibility and expediency of using such algorithms to search for diffusion-invariant pattern clusters. This question is complicated by the need to exchange indicator values for the implementation of diffusion-invariant pattern clustering and, therefore, the need to repeat the search procedure for monotonically increasing/decreasing sequences.
3 Diffusion-Invariant Pattern Clustering: Possible Implementations
3.1 Alternative Implementation of Diffusion-Invariant Pattern Clustering
To use diffusion-invariant pattern clustering in conjunction with data sorting algorithms, we must define an implementing algorithm. It was proved in [4] that the partition obtained on the basis of ordinal-invariant pattern clustering is unique (since the pairwise comparisons are uniquely determined). Using sorting algorithms for this method implies taking one object, arranging its indicators in the form of a monotonically increasing/decreasing sequence, and searching for other objects with a similar sequence of indicators. Based on this, we define an alternative algorithm for diffusion-invariant pattern clustering and carry out calculations using real data from [1] as an example. For the practical implementation, one ordinal-invariant pattern cluster containing several objects is considered. If a cluster contains a single object, such a pattern will automatically be a diffusion-invariant pattern cluster consisting of a single object. For definiteness, we take the same objects x_k = (x_k1, x_k2, …, x_kn) and x_m = (x_m1, x_m2, …, x_mn). Since these objects are already combined on the basis of ordinal-invariant pattern clustering, it can be argued that there is a single monotonically non-decreasing/non-increasing sequence of indicators for them. Next, we exchange the first indicator between the objects, obtaining x_k* = (x_m1, x_k2, …, x_kn) and x_m* = (x_k1, x_m2, …, x_mn). We arbitrarily choose one of them (for definiteness, the first). For this object, a pairwise comparison of indicators is performed and an increasing/decreasing sequence is determined. This sequence is compared with the arrangement of the indicators of the object x_m*. If they do not coincide, the objects are divided into different groups. If they match, we make a mutual exchange of the second indicator between the objects. In this case, we obtain the objects x_k* = (x_k1, x_m2, …, x_kn) and x_m* = (x_m1, x_k2, …, x_mn), and at the j-th step x_k* = (x_k1, x_k2, …, x_mj, …, x_kn) and x_m* = (x_m1, x_m2, …, x_kj, …, x_mn). Next, we fix one of them and again form a monotonically increasing/decreasing sequence. If the arrangement of the indicators coincides, this procedure is repeated until all n indicators have been exchanged between the objects.
3.2 Correction of the Results of Diffusion-Invariant Pattern Clustering
The accumulation of a large variety of data currently allows us to extract useful knowledge and practically apply the developed algorithms in various fields, e.g. increasing sales, identifying target groups, etc. However, the question of the accuracy of the information provided for data processing is very important. If errors occur at the stage of collecting and processing information, they may accumulate at various stages of data analysis, which in turn can lead to erroneous results. This statement is especially true for diffusion-invariant pattern clustering, since this method is based on pairwise comparison of indicators. The obvious solution seems to be to round the data to a certain precision. However, the disadvantage of this method is the decrease in accuracy, which can be critical for certain tasks. Of course, here we should proceed from the specific formulation of the original problem and the expected results. An alternative approach may be to adjust the results using the centroids of diffusion-invariant pattern clusters.
In [14], the possibility of adjusting the results of ordinal-invariant pattern clustering was demonstrated. In [4],
a statement was proved that allows the use of centroids for this task: "If object x_1 = (x_11, …, x_1j, …, x_1n) belongs to some ordinal-invariant pattern cluster v_inv, then for any positive value α (α > 0) the object x_α = αx_1 also belongs to this cluster". Thus, some "average object", formed as the arithmetic mean of the indicators of all objects included in a certain ordinal-invariant pattern cluster, will belong to this cluster (using a similar partition procedure for the objects under study). We provide an algorithmic implementation of a possible adjustment (depending on the task) of the results of diffusion-invariant pattern clustering. Suppose that, using this method, z diffusion-invariant pattern clusters v_{diff} were formed. For each such cluster, it is possible to calculate the centroid (or "middle object") according to the formula:

x_{centroid}^{v_l} = \frac{\sum_{i=1}^{|v_l|} x_i}{|v_l|}   (1)

where l is the number of the diffusion-invariant pattern cluster. In other words, the centroid of each diffusion-invariant pattern cluster is calculated as the mean of the values of the objects that form it. Next, we take the proximity measure required by the particular problem (we give an example using the Euclidean metric) and evaluate the distance from the obtained centroids to all the objects under study, d_{Ew}(x_{centroid}^{v_l}, x_i). In this case, the partition results are corrected according to the minimum distance to the centroids of the diffusion-invariant pattern clusters. Formally, an additional parameter is calculated according to the formula

g_i = \min\left( d_{Ew}\left(x_{centroid}^{v_1}, x_i\right), \ldots, d_{Ew}\left(x_{centroid}^{v_z}, x_i\right) \right)   (2)

The object is then assigned to cluster v_l if g_i = d_{Ew}(x_{centroid}^{v_l}, x_i). Thus, the object belongs to the diffusion-invariant pattern cluster whose centroid is at the minimum distance. It should be noted that in some cases the application of this method may not significantly change the results. This can happen when certain clusters are clearly identified in the specific problem under consideration, with relatively small intra-cluster distances and large inter-cluster distances.
3.3 Case Studies
Let us give an example based on the results of the analysis of data on the state capacity of 166 countries published in [1]. In this study, 5 indicators were used for 3 reference years: 1996, 2005, and 2015. Some of the objects of one of the groups are presented in Table 1. Due to space limitations, we took only 8 objects from this group (in the original work there are 50 of them). More details on the parameters used and the partition results obtained can be found in [1]. The order of indicators was set arbitrarily. For demonstration, we visualize the objects from Table 1 in a system of parallel coordinates. Figure 2 shows that the values of the indicators of these countries are very similar, which means they need to be combined into a single group according to the structure
Table 1. The study of state capacity.
Country | Taxes | WGI | Mil_exp | Safety | Mil_pers
Australia 1996 | 0.65 | 0.94 | 0.18 | 0.99 | 0.1
Australia 2005 | 0.66 | 0.93 | 0.38 | 0.99 | 0.05
Australia 2015 | 0.57 | 0.92 | 0.37 | 0.99 | 0.08
Austria 2005 | 0.55 | 0.93 | 0.19 | 0.99 | 0.09
Austria 2015 | 0.59 | 0.87 | 0.11 | 1 | 0.08
Belgium 1996 | 0.59 | 0.86 | 0.19 | 0.99 | 0.15
Belgium 2005 | 0.59 | 0.84 | 0.24 | 0.97 | 0.07
Belgium 2015 | 0.59 | 0.84 | 0.14 | 0.98 | 0.09
Fig. 2. Polylines of different countries in parallel coordinates.
Pattern analysis methods allow this. It should be noted that, since the absolute values of the indicators in this example are very similar, a similar result would be obtained using classical methods of cluster analysis. This example was chosen only to demonstrate the possibility of using data sorting algorithms to implement ordinal-invariant and diffusion-invariant pattern clustering. To make sure that data sorting algorithms can really help with the implementation of these methods, we choose an arbitrary object (for definiteness, take Australia 1996). Using the indicators of this object, we construct a monotonically increasing sequence. Next, we compare the obtained ordering with the values of the other studied objects. As can be seen from Fig. 3, all the studied objects with a similar arrangement of indicators form a monotonically increasing sequence; therefore, we can combine them into a single group. In this case, since all objects can be combined, adjustment of the obtained results is not required. For comparison, we add objects of another group obtained in [1] and apply the same sequence of indicators.
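A minimal sketch of this sorting-based check (not the author's implementation): the permutation that sorts the reference object's indicators is applied to every other object, and an object is grouped with the reference if its reordered indicators are also monotonically non-decreasing. The numeric values below are taken from Table 1.

```python
import numpy as np

def group_by_ordinal_pattern(reference, candidates):
    """Return the sorting permutation and the indices of candidates whose
    indicators increase monotonically in that order."""
    order = np.argsort(reference)                        # data sorting step
    members = [i for i, obj in enumerate(candidates)
               if np.all(np.diff(np.asarray(obj)[order]) >= 0)]
    return order, members

# Indicator order: Taxes, WGI, Mil_exp, Safety, Mil_pers (values from Table 1)
australia_1996 = np.array([0.65, 0.94, 0.18, 0.99, 0.10])
others = [np.array([0.66, 0.93, 0.38, 0.99, 0.05]),      # Australia 2005
          np.array([0.59, 0.84, 0.24, 0.97, 0.07])]      # Belgium 2005
order, members = group_by_ordinal_pattern(australia_1996, others)
# members == [0, 1]: both objects follow the same monotonically increasing pattern
```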
Fig. 3. Monotonically increasing sequence of parameters.
Fig. 4. The original sequence of indicators and the sequence obtained for Australia 1996.
It is worth noting a feature of this group: for two objects, a different ordering was obtained for the indicators “Taxes” and “Mil_pers”. This difference is due to the use of the centroid-based adjustment of the results.
Fig. 5. Hierarchical clustering results.
As can be seen from Fig. 4, the indicators of these objects do not form a monotonically increasing sequence, which means that they are not combined with the objects presented in Table 1. Similar results are obtained, for example, when using hierarchical clustering, the results of which are presented in Fig. 5.
Thus, data sorting algorithms can be used to form both ordinal-invariant and diffusion-invariant pattern clusters, and the results can be compared with those of classical clustering methods.
4 Conclusion

The significant accumulation of huge data arrays over the past decades requires the study and creation of additional methods for their storage, processing, and extraction of useful knowledge. One possible approach is pattern analysis, one implementation of which is proposed in this work. The paper demonstrates the possibility of using various data sorting algorithms to reduce the time required to implement diffusion-invariant pattern clustering. This implementation makes it possible to obtain a relatively high partitioning accuracy on large samples of objects. The described algorithm for adjusting the results makes it possible to work with data that contain a certain amount of error, at the cost of a slight loss in the accuracy of the final partition. Note, however, that such an algorithmic solution increases the computational complexity of the pattern analysis method under consideration. Thus, the article proposes an additional approach to structuring and finding similar objects (according to the chosen measure of proximity) and confirms the expediency of using it through an applied example. However, it should be noted that the choice and use of each pattern analysis method requires an understanding of both the task and the interpretation of the final results.

Acknowledgment. The article was prepared within the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE) and supported within the framework of a subsidy by the Russian Academic Excellence Project ‘5-100’.
References

1. Akhremenko, A., Myachin, A.: The study of trajectories of the development of state capacity using ordinal-invariant pattern clustering and hierarchical cluster analysis. In: 8th International Conference on Computers Communications and Control (ICCCC) 2016. Oradea: Agora University (2020)
2. Aleskerov, F., Ersel, H., Yolalan, R.: Multicriterial ranking approach for evaluating bank branch performance. Int. J. Inf. Technol. Decis. Mak. 3(2), 321–335 (2004)
3. Aleskerov, F., Nurmi, H.: A method for finding patterns of party support and electoral change: an analysis of British General and Finnish municipal elections. Math. Comput. Model. 48, 1225–1253 (2008)
4. Myachin, A.: Pattern analysis in parallel coordinates based on pairwise comparison of parameters. Autom. Remote. Control. 80(1), 112–123 (2019)
5. Sammon, J.W.: Interactive pattern analysis and classification. IEEE Trans. Comput. 100(7), 594–616 (1970)
6. Niemann, H.: Pattern Analysis and Understanding, vol. 4. Springer, Berlin (2013). https://doi.org/10.1007/978-3-642-74899-8
7. Siedlecki, W., Siedlecka, K., Sklansky, J.: An overview of mapping techniques for exploratory pattern analysis. Pattern Recog. 21(5), 411–429 (1988)
8. Williams, W.: Pattern Analysis in Agricultural Science. Elsevier, Netherlands (1976)
9. Aleskerov, F., Egorova, L., Gokhberg, L., Myachin, A., Sagieva, G.: Pattern analysis in the study of science, education and innovative activity in Russian regions. Procedia Comput. Sci. 17, 687–694 (2013)
10. Myachin, A.: Pattern analysis: diffusion-invariant pattern clustering. Problemy Upravleniya, vol. 4, pp. 2–9 (2016). (in Russian)
11. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
12. Niemann, H.: Pattern Analysis and Understanding, vol. 4. Springer, Berlin (2013). https://doi.org/10.1007/978-3-642-74899-8
13. Inselberg, A.: The plane with parallel coordinates. Vis. Comput. 1(2), 69–91 (1985)
14. Myachin, A.: Determination of centroids to increase the accuracy of ordinal-invariant pattern clustering. Upravlenie Bol’shimi Sistemami, vol. 78, pp. 6–22 (2019). (in Russian)
Application of the Proposed Thresholding Method for Rice Paddy Field Detection with Radarsat-2 SAR Imagery Data

Kohei Arai and Kenta Azuma

Saga University, Saga 840-8502, Japan
[email protected]
Abstract. One of the applications of the proposed thresholding method for rice paddy field detection with RADARSAT-2 SAR imagery data is shown in this paper. A planted area estimation method using SAR data based on the proposed thresholding is also proposed. Through comparative research on the proposed and the conventional thresholding methods with RADARSAT-2 SAR imagery data, it is found that the proposed method is superior to the conventional methods. Also, it is found that the proposed thresholding method does work for rice paddy field detection with SAR imagery data, through the comparison between the classification result obtained with the proposed method and the field survey report of the rice paddy fields.

Keywords: Thresholding · Classification · Quantity of a class · Counting accuracy · Synthetic Aperture Radar: SAR · RADARSAT
1 Introduction

The true value intended in this study is defined as follows: when there are multiple types of sets (classes) in a certain population and the population is composed of elements that clearly belong to each set, the true value is the number of elements that belong to each set. For example, when a population is composed of elements that clearly belong to either “paddy rice plantations” or “other fields,” the number of elements that belong to “paddy rice plantations” is the true value of the integrated value of “paddy rice plantations,” and the number of elements belonging to “other fields” is the true value of the integrated value of “other fields.” When discussing the true value, even if an element belonging to “paddy rice planting land” exhibits the feature values of “other fields,” that element is still “paddy rice planting land” and is one of the elements that make up the true value of the integrated value of “paddy rice planting land.” As a practical matter, however, most sets of elements that have some physical quantity show variations in the values of that quantity, and in this study as well, the elements that belong to one set are assumed to have certain characteristics.

K. Azuma—Former student of Saga University.
A constant characteristic does not mean that the values are identical, but rather a characteristic whose variations follow a certain regularity, as described by an arbitrary probability density function. In this study, correct-answer data are used for each verification item when evaluating the analysis results. In the analysis using the observed satellite images, some ground surface covers are assigned to categories at the time of classification, but the ground surface covers found by the field survey of the target area are taken as the true values. For example, in the classification of “paddy rice plantation” and “other fields,” the number of fields judged to be “paddy rice plantation” and the number of fields judged to be “other fields” in the area by the field survey are the true values of the integrated values of the respective groups. It is unknown in advance what kind of characteristics each element has, but it is assumed that all the elements belonging to each group have physical quantities that follow a certain characteristic, and classification discrimination is performed according to that characteristic.

In this paper, an appropriate thresholding method for population analysis is proposed, because the purpose of the method is planted area estimation. The thresholding method itself has already been proposed; it is based on the number of pixels assigned to each category after classification. To show one of its applications, detection of rice paddy fields with space-based Synthetic Aperture Radar (SAR) imagery data is attempted. To verify the practicality of the proposed method, an application example using actual data is shown. In this study, as an application example, we discriminate between water and land areas by binarizing satellite images. By implementing and evaluating not only the proposed method but also the conventional methods, the characteristics of the proposed method are reconfirmed and its effectiveness is confirmed.

The following section describes related research works. The classification method with SAR data based on the proposed thresholding method is then described, followed by some experimental results with Radarsat-2 SAR imagery data. After that, the conclusion is given together with some discussion. The proposed thresholding method, with some theoretical background, is described in the Appendix.
2 Related Research Works

Dr. Sahoo and Dr. Wong listed and evaluated thresholding methods for binarization of the images in concern [1]. The thresholding methods are categorized as point dependent techniques [2–7], region dependent techniques [8–12] and multi-thresholding [13–15]. Dr. Otsu proposed the widely used thresholding method [16, 17], Dr. Kittler et al. proposed a modified thresholding method [18, 19], and Dr. Kurita et al. proposed their original thresholding method [20, 21]. These are known as typical thresholding methods for image classification applications. There are many application papers which deal with satellite remote sensing imagery data classification based on these thresholding methods [22–25]. These thresholding methods are used not only for image classification but also for unmixing of mixed pixels [26]. On the other hand, a thresholding-based method for rain and cloud detection with NOAA/AVHRR data by means of the Jacobi iteration method has been proposed [27]. Also, the optimum threshold for maximizing classification performance of Landsat TM image classification has been estimated with these thresholding methods [28].
Optimization techniques for engineering can be referred to for threshold level determination, in particular for the image classification application of thresholding methods [29]. A modified simulated annealing method has been proposed for estimation of the refractive index and size distribution of aerosols using direct and diffuse solar irradiance as well as aureole data [30]. Also, classification accuracy assessment is well reviewed as a classification performance evaluation method [31].
3 Proposed Method

A technique for discriminating between paddy rice plantations and other fields using satellite images and Geographic Information System (GIS) data has been proposed. First, the backscatter count of SAR data is used to determine the paddy rice planting area. Paddy rice plantations are flooded before rice planting. In a flooded field, the surface becomes flat due to the water surface, so the microwaves used in SAR undergo predominantly forward scattering and the backscatter is extremely weak. Then, as the rice grows, the surface shape of the field becomes rough, and backscattering becomes stronger. In this way, the backscattering of a paddy rice planting area first becomes extremely small and then gradually becomes stronger. By observing at multiple times with SAR satellites, the characteristics of backscattering in paddy rice plantations can be extracted. It is important to detect whether a field is flooded during the rice planting period in order to identify the paddy rice planting site. In addition, by linking GIS data that contain polygons representing the position and shape of the fields with the backscattering counts of the SAR data, the area of each field becomes known, and the quadrature of the paddy rice planting area becomes possible.

Conventionally, threshold selection methods such as Otsu’s method and Kittler’s method have been used with the above approach to identify paddy rice plantings and flooded fields from the change in the backscattering count. However, these threshold selection methods are designed to increase the accuracy rate. Although this has the effect of increasing the accuracy rate in discrimination, there is a possibility that the counting accuracy in discrimination will be extremely deteriorated. Therefore, it was verified whether an improvement over the conventional methods could be recognized by using the threshold value selected by the proposed method for the quadrature of the paddy rice planting site (flooded fields). The three threshold selection methods to be compared are the following:

• Otsu’s method
• The method of Kittler et al.
• The proposed method (a method to bring the totals of the discriminated classes closer to the true values by selecting a threshold value that reduces the difference in the number of misjudgments between the classes; a minimal code sketch is given after this list)
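The sketch below illustrates the proposed selection criterion under the assumption, made explicit in Sect. 4.1, that the input values follow a mixture of two normal distributions. The mixture fit here uses scikit-learn's GaussianMixture rather than the authors' own optimization, so it should be read as an illustration and not as the reference implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def balanced_misjudgment_threshold(values, n_grid=2000):
    """Choose the threshold that makes the expected numbers of misjudged
    elements in the two classes as close as possible, so that the estimated
    class totals stay near the true totals."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    order = np.argsort(gmm.means_.ravel())          # component 0 = smaller mean (e.g. flooded)
    mu = gmm.means_.ravel()[order]
    sd = np.sqrt(gmm.covariances_.ravel()[order])
    n = gmm.weights_.ravel()[order] * len(x)        # expected element count per class
    t = np.linspace(x.min(), x.max(), n_grid)
    miss_low = n[0] * norm.sf(t, mu[0], sd[0])      # class-0 elements above the threshold
    miss_high = n[1] * norm.cdf(t, mu[1], sd[1])    # class-1 elements below the threshold
    return t[np.argmin(np.abs(miss_low - miss_high))]

# Usage: fields whose mean backscatter falls below the returned threshold
# would be labeled "flooded" in the discrimination step described above.
```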
4 Experiment

4.1 Procedure

For verification, RADARSAT-2 data (Ultra-Fine, HH-polarization) observed over Kawakita Town, Ishikawa Prefecture, Japan on June 3, 2009, were used as SAR data. For the GIS data, 949 polygon data were created and used by extracting the outlines of the fields through interpretation of an optical satellite image of Kawakita Town, Ishikawa Prefecture. For the RADARSAT-2 data, the backscattering count (sigma naught) was calculated as the pixel value and superimposed on the created polygon data. The average backscatter count within each polygon was calculated and used as the input population for identifying the flooded fields. Figure 1 shows the intensive study area. As threshold selection methods, not only the proposed method but also the method of Otsu and the method of Kittler et al. were implemented at the same time for comparison. When calculating the threshold value with the proposed method, it is assumed that the distributions of the flooded fields and the non-flooded fields follow normal distributions; the probability density function obtained from the input image is modeled as the sum of two normal distributions and then optimized. Fields with an average backscatter count in the polygon smaller than the obtained threshold were identified as flooded fields, and fields with a larger average backscatter count were identified as non-flooded fields.

The procedure for determining the paddy rice planting site is as follows:

• Overlay the polygons that indicate the fields on the SAR data.
• Calculate the average backscatter count of the SAR data for each polygon.
• Using the average backscattering count of each polygon as input, select the threshold value for separating flooded and non-flooded (or paddy rice and non-paddy rice) fields with the above three threshold selection methods.
• In the case of flooding discrimination, fields whose average backscatter count in the polygon is below the threshold value are classified as “flooded”, and fields above the threshold value are classified as “non-flooded”.
• In the case of paddy rice discrimination, if the difference between the average backscatter counts of multiple periods is larger than the threshold value, the field is judged as “paddy rice”, and fields below the threshold value are judged as “non-paddy rice”.

Figure 2 shows the histogram obtained from the input data, the distribution of each class obtained by optimizing the fit to the histogram, and their mixture distribution.

4.2 Experimental Result

As the correct answer data for the evaluation, we used the results of a field survey conducted in Kawakita Town, Ishikawa Prefecture (see Fig. 3). By shooting a video of the local fields, we investigated whether each field was flooded or not.
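Returning to the procedure of Sect. 4.1, the per-polygon preparation can be sketched roughly as follows. The file names, the column name and the choice of geopandas/rasterstats are illustrative assumptions, since the paper does not state which software was used to overlay the polygons and compute the per-polygon means.

```python
import geopandas as gpd
from rasterstats import zonal_stats

# Hypothetical inputs: the 949 field polygons and a sigma-naught (dB) backscatter raster
fields = gpd.read_file("kawakita_field_polygons.shp")
stats = zonal_stats("kawakita_field_polygons.shp",
                    "radarsat2_ultrafine_hh_sigma0_db.tif",
                    stats=["mean"])                     # one dict per polygon, same feature order
fields["mean_sigma0_db"] = [s["mean"] for s in stats]   # average backscatter per polygon

threshold_db = -14.3   # e.g. the threshold selected by the proposed method (see Table 2)
fields["flooded"] = fields["mean_sigma0_db"] <= threshold_db
```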
Fig. 1. Intensive study area of Kawakita, Ishikawa, Japan: (a) Radarsat-2 SAR; (b) Google Map. Field polygons representing the outlines of the fields are superimposed on the satellite image. The number in each field polygon is the average value of the backscatter count within that polygon.
Fig. 2. Histogram of SAR input image in discrimination of paddy rice planting site (gray), and histogram of class 1 (Pink), class 2 (blue) and their mixture distribution (red) estimated by optimization
Fig. 3. The result of investigating the actual usage of the field by a field survey. Light blue indicates paddy rice planting area, and orange indicates other fields.
Comparing the discrimination results obtained with each threshold selection method against the correct answer data from the field survey, we created the confusion matrix shown in Table 1 and calculated the correct answer rate, user accuracy, creator accuracy, and counting accuracy. With these, the results and the methods were evaluated and validated. The evaluation results are shown in Table 2.
Table 1. Results from the water field detection and field survey report

Method        Water_Field  Non-water_Field
Otsu          649          300
Kittler       652          297
Proposed      644          305
Field_Survey  679          270
Table 2. Results from the experiment of rice paddy field detection with reference to the field survey report

Method    Threshold    O     P1    U1    C1    P2    U2    C2
Otsu      −14.11 (dB)  91.6  96.2  81.7  91.9  90.7  95.6  111.1
Kittler   −13.99 (dB)  91.9  96.2  82.5  92.3  90.7  96.0  110.0
Proposed  −14.3 (dB)   91.3  96.3  80.7  91.3  91.1  94.9  113.0
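For reference, a minimal sketch of how such evaluation quantities can be computed from a 2 × 2 confusion matrix is given below. Overall accuracy (the correct answer rate O), producer (creator) accuracy P and user accuracy U follow the standard remote-sensing definitions; the counting accuracy C is implemented here as the ratio of the estimated class total to the surveyed class total, which is one plausible reading consistent with the class totals of Table 1 (e.g. 305/270 ≈ 113%, matching C2 for the proposed method), although the paper's own definition may differ in detail.

```python
import numpy as np

def accuracy_metrics(cm, estimated_totals, reference_totals):
    """cm[i, j] = number of fields whose true class is i and predicted class is j."""
    cm = np.asarray(cm, dtype=float)
    overall = np.trace(cm) / cm.sum() * 100          # correct answer rate O (%)
    producer = np.diag(cm) / cm.sum(axis=1) * 100    # P, per true (surveyed) class
    user = np.diag(cm) / cm.sum(axis=0) * 100        # U, per predicted class
    # One plausible reading of counting accuracy: estimated total / surveyed total.
    counting = (np.asarray(estimated_totals, dtype=float)
                / np.asarray(reference_totals, dtype=float) * 100)
    return overall, producer, user, counting

# e.g. proposed method (Table 1): estimated_totals = [644, 305], reference_totals = [679, 270]
# gives a class-2 ratio of about 113%, in line with C2 in Table 2.
```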
As a result of the accuracy evaluation of the discrimination of the paddy rice planting site by the two conventional methods and the proposed method, the accuracy rate O and the counting accuracy C of the proposed method were the worst, and the results differed from the theory. Furthermore, the effect of the user accuracy U and the creator accuracy P approaching each other, which was recognized in the verification of the characteristics of the proposed method using the sample images in the previous section, did not occur. As assumed before the verification, based on the theory proved in Sect. 2, the result of the proposed method should have shown the counting accuracy C approaching 100%, with the user accuracy U and the creator accuracy P approaching each other. However, the expected results were not obtained. The following are possible reasons why:

• The distribution of backscatter counts in flooded or non-flooded fields does not fit the normal distribution.
• The number of input data is not enough to obtain stable results (threshold values) by statistical methods.
• The number of data used for discrimination and the amount of data for evaluation are not sufficient for a stable evaluation.

Regarding the first reason, the distribution of the backscatter count in flooded and non-flooded fields may not follow a normal distribution. In the classification of water and land areas using the SAR data shown in the previous section, the backscatter count of the SAR data was approximated by a normal distribution in both water and land areas. In the data of this verification, however, as shown in Fig. 2, the histogram of the input data itself has large variations due to the small amount of data.
As a rough feature, it seems that the histogram can be approximated by a mixture of normal distributions, but this is unclear because the histogram shows large variation. The reason for the large variation in the histogram is that the number of input data (sample size), 949, is not sufficient. Since the number of input data is not enough, it is possible that the threshold value is not stable enough to yield the theoretical result. Furthermore, even if a stable threshold value were obtained, the amount of data for evaluation is also 949, so it is doubtful whether the number of data is sufficient for a stable evaluation.
5 Conclusion

As a result, the proposed method was able to select a threshold that brings the numbers of misjudgments in the two classes close to each other. Furthermore, the comparison between the proposed method and the conventional methods showed that the proposed method has the following characteristics:

• Higher counting accuracy can be obtained compared to the conventional methods.
• It has the effect of bringing the creator accuracy closer to the user accuracy compared to the conventional methods.
• It is effective even if there is a difference in quantity and variance between the classes.
• It is effective even when the distribution of the input data does not show bimodality.
• It is necessary to assume or know the distribution characteristics of the input data.

Land cover classification using satellite images and an application example for the quadrature of the paddy rice planted area were shown. In some cases, the proposed method did not produce the expected effect, so the scope of application of the proposed method was verified. As a result, it was confirmed that the obtained threshold value is not stable when the number of samples is too small. At the same time, it was also confirmed, although it is self-evident, that the results and methods cannot be evaluated properly if the amount of data for evaluation is too small. However, when the number of samples is large enough, the expected effect is obtained by the proposed method, and it was confirmed that the proposed method is an effective means when the purpose is to improve the estimation accuracy of the quantity of each set in the discrimination results. In addition, it was confirmed that the accuracy of the classification (discrimination) result changes when the distribution of the input data differs from the expected distribution or when the distribution contains noise. As a result, it was confirmed that, even if there are no correct answer data for performing an accuracy evaluation, a value that can be used as a guide for evaluating the credibility of the threshold selection (the root mean square error between the probability density function of the input data and the mixture distribution estimated by optimization) can be obtained. Furthermore, it was confirmed that if this guide value is larger than a certain value, the correct answer rate and counting accuracy may be significantly deteriorated in the discrimination using the threshold value obtained with it.

It was confirmed that high counting accuracy can be obtained in pixel classification by using the proposed method. Therefore, it can be said that the number of pixels in the area occupied by the target class can be calculated from a given image with high accuracy. In addition, since the values of creator accuracy and user accuracy approach each other by
using the proposed method, the discrepancy between the evaluation of the classification result from the creator’s point of view and the evaluation from the user’s point of view is reduced; this possibility was also confirmed. Furthermore, since the method is effective even when the number of pixels and the standard deviation of each class are biased, it is considered that the number of pixels can be calculated with high accuracy even between classes composed of a small number of pixels and classes with different distribution properties. Finally, although a general statistical method is sufficient for estimating the number of pixels alone, it is a major feature of the proposed method that information such as which pixel corresponds to which class can also be obtained.
6 Future Research Works

In this paper, we have shown a method for classifying each pixel of a given image, but the proposed method can be applied not only to images but also to various sets that have a value for each element, such as in remote sensing and GIS, and it is considered that it can be widely applied to other data mining tasks. Further research is required on applications other than rice paddy field detection with SAR imagery data.

Acknowledgment. The authors would like to thank Professor Dr. Hiroshi Okumura and Professor Dr. Osamu Fukuda of Saga University for their valuable discussions.
References

1. Sahoo, P.K., Soltani, S., Wong, A.K.C.: A survey of thresholding techniques. Comput. Vis. Graph. Image Process. 41, 233–260 (1988)
2. Doyle, W.: Operation useful for similarity-invariant pattern recognition. J. Assoc. Comput. 9, 259–267 (1962)
3. Prewitt, J.M.S., Mendelsohn, M.L.: The analysis of cell images. Ann. New York Acad. Sci. 128, 1035–1053 (1966)
4. Pun, T.: A new method for gray-level picture thresholding using the entropy of the histogram. Signal Process. 2, 223–237 (1980)
5. Kapur, J.N., Sahoo, P.K., Wong, A.K.C.: A new method for gray-level picture thresholding using the entropy of the histogram. Comput. Vis. Graph Image Process. 29, 273–285 (1985)
6. Johannsen, G., Bille, J.: A threshold selection method using information measures. In: Proceedings, 6th International Conference on Pattern Recognition, Munich, Germany, pp. 140–143 (1982)
7. Tsai, W.: Moment-preserving thresholding: a new approach. Comput. Vis. Graph. Image Process. 29, 377–393 (1985)
8. Mason, D., Lauder, I.J., Rutoritz, D., Spowart, G.: Measurement of C-bands in human chromosomes. Comput. Biol. Med. 5, 179–201 (1975)
9. Ahuja, N., Rosenfeld, A.: A note on the use of second-order gray-level statistics for threshold selection. IEEE Trans. Syst. Man Cybernet SMC-8, 895–899 (1978)
10. Kirby, R.L., Rosenfeld, A.: A note on the use of (gray level, local average gray level) space as an aid in thresholding selection. IEEE Trans. Syst. Man Cybernet SMC-9, 860–864 (1979)
11. Deravi, F., Pal, S.K.: Gray level thresholding using second-order statistics. Pattern Recogn. Lett. 1, 417–422 (1983)
12. Southwell, R.: Relaxation Methods in Engineering Science. A Treatise on Approximate Computation. Oxford University Press, London (1940)
13. Boukharouba, S., Rebordao, J.M., Wendel, P.L.: An amplitude segmentation method based on the distribution function of an image. Comput. Vis. Graph. Image Process. 29, 47–59 (1985)
14. Wang, S., Haralick, R.M.: Automatic multi-threshold selection. Comput. Vis. Graph. Image Process. 25, 46–67 (1984)
15. Kohler, R.: A segmentation system based on thresholding. Comput. Graph. Image Process. 15, 319–338 (1981)
16. Otsu, N.: An automatic threshold selection method based on discriminant and least squares criteria. IEICE Trans. Fundam. Electron. Jpn. D 63(4), 349–356 (1980)
17. Otsu, N.: A thresholding selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. SMC-9(1), 62–66 (1979)
18. Kittler, J., Illingworth, J.: Minimum error thresholding. Pattern Recogn. 19(1), 41–47 (1986)
19. Kittler, J.: Fast branch and bound algorithms for optimal feature selection. IEEE Trans. Pattern Anal. Mach. Intell. 26(7), 900–912 (2003)
20. Sekita, I., Kurita, T., Otsu, N., Abdelmalek, N.N.: Thresholding methods considering the quantization error of an image. IEICE Trans. Fundam. Electron. J78-D-2(12), 1806–1812 (1995)
21. Kurita, T., Otsu, N., Abdelmalek, N.: Maximum likelihood thresholding based on population mixture models. Pattern Recogn. 25(10), 1231–1240 (1992)
22. Katoh, M., Tsushima, T., Kanno, M.: Monitoring of Ishikari River model forest for sustainable forest management, Grasp and monitor of forest area using satellite data. Hoppou Ringyo. 51(11), 271–274 (1999)
23. Takeuchi, H., Konishi, T., Suga, Y., Oguro, Y.: Rice-planted area estimation in early stage using space-borne SAR data. J. Jpn. Soc. Photogram. 39(4), 25–30 (2000)
24. Takeuchi, W., Yasuoka, Y.: Mapping of fractional coverage of paddy fields over East Asia using MODIS data. J. Jpn. Soc. Photogram. 43(6), 20–33 (2005)
25. Sutaryanto, A., Kunitake, M., Sugio, S., Deguchi, C.: Calculation of percent imperviousness by using satellite data and application to runoff analysis. J. Agric. Eng. Soc. Jpn 63(5), 23–28 (1995)
26. Azuma, K., Arai, K., Ishitsuka, N.: A thresholding method to estimate quantities of each class. Int. J. Appl. Sci. 3(2), 1–11 (2012)
27. Arai, K.: Thresholding based method for rain, cloud detection with NOAA/AVHRR data by means of Jacobi iteration method. Int. J. Adv. Res. Artif. Intell. 5(6), 21–27 (2016)
28. Arai, K., Goodenough, D.G., Iisaka, J., Fuang, K., Robson, M.: Consideration on an optimum threshold for maximum likelihood classification. In: Proceedings of the 10th Canadian Symposium on Remote Sensing, pp. 1–8 (1986)
29. Amaya, K.: Introduction to Optimization Techniques for Engineering. Mathematical Engineering Publishing Co., Ltd., Tokyo (2008)
30. Arai, K., Liang, X.: Method for estimation of refractive index and size distribution of aerosol using direct and diffuse solar irradiance as well as aureole by means of a modified simulated annealing. J. Remote Sens. Soc. Jpn. 23(1), 11–20 (2003)
31. Foody, G.M.: Classification accuracy assessment. IEEE Geosci. Remote Sens. Soc. Newsl. 2011, 8–14 (2011)
Understanding COVID-19 Vaccine Reaction Through Comparative Analysis on Twitter

Yuesheng Luo and Mayank Kejriwal

Information Sciences Institute, University of Southern California, Los Angeles, USA
[email protected]
Abstract. Although multiple COVID-19 vaccines have been available for several months now, vaccine hesitancy continues to be at high levels in the United States. In part, the issue has also become politicized, especially since the presidential election in November. Understanding vaccine hesitancy during this period in the context of social media, including Twitter, can provide valuable guidance both to computational social scientists and policy makers. Rather than studying a single Twitter corpus, this paper takes a novel view of the problem by comparatively studying two Twitter datasets collected between two different time periods (one before the election, and the other, a few months after) using the same, carefully controlled data collection and filtering methodology. Our results show that there was a significant shift in discussion from politics to COVID-19 vaccines from fall of 2020 to spring of 2021. By using clustering and machine learning-based methods in conjunction with sampling and qualitative analysis, we uncover several fine-grained reasons for vaccine hesitancy, some of which have become more (or less) important over time. Our results also underscore the intense polarization and politicization of this issue over the last year.

Keywords: COVID-19 vaccine reaction · Social media analysis · Computational social science
1 Background

Although the rapid development and manufacturing of COVID-19 vaccines has been touted as a modern miracle, vaccine hesitancy is still very high in the United States and many other western nations [1, 2]. As of the time of writing (October 2021), even single-dose vaccination in the US has only just crossed 65% and full-dose vaccination is significantly lower1. Since vaccine supply currently far outstrips demand across the US, communication and outreach have proven to be valuable tools but thus far, the effects have not been as pronounced as expected. Furthermore, vaccine hesitancy is not uniform across all sociodemographic segments of the US population, suggesting there may be complex causal drivers at play [3]. Therefore, a better understanding of vaccine hesitancy at scale, and the public perception of it, especially on platforms like social media, may be valuable for social scientists and policy experts alike in tackling this problem with effective strategies and incentives.

1 https://usafacts.org/visualizations/covid-vaccine-tracker-states/.
Table 1 provides some samples of tweets arguably expressing vaccine hesitancy for a variety of reasons. Already from this sample, we find that there is heterogeneity in the reasoning. For example, some tweets have some religious or even conspiratorial connotations (e.g., ‘Mark of the Beast’, ‘666 with nano technology to change our DNA’) while others are doing an implicit risk analysis, such as in the very first tweet. Given the volume of tweets on this subject on social media, a natural opportunity arises for combining the right tools with qualitative analysis to understand both the reasons, and the prevalence of those reasons, behind vaccine hesitancy.

Table 1. Examples of tweets (from the Fall 2020 dataset collected prior to the US Presidential Election) expressing vaccine hesitancy for a variety of reasons.

– I’m not getting a vaccine for a virus with a 99.5% survival rate
– Cuomo: Americans should be ‘very skeptical’ about COVID-19 vaccine
– Gov. Newsom added to concerns about COVID-19 vaccines and said the state will review the safety of any vaccine approved by the Trump administration
– “Coronavirus will probably never disappear and a vaccine won’t stop it completely, according to Sir Patrick Vallance.“Then what on earth are the government waiting for?
– SOUTH KOREA: Five people have died after getting flu shots in the past week, raising concerns over the vaccine’s safety just as the seasonal inoculation program is expanded to head off potential COVID-19 complications
– Everyone please retweet. Asymptomatic carriers have NEVER driven the spread of airborne viral disease. From FAUCI himself!!!
– The covid-19 Vaccine is the Mark of the Beast! The 666 with nano technology to change our DNA!
– Who else REFUSES to get a covid vaccine?
In this paper, we propose a methodology for conducting a comparative study on this issue using Twitter data. Specifically, we collect data during a period in October 2020 and also in February 2021 using an identical collection and filtering methodology. We compare and contrast these two corpora in rigorous ways, using both statistical analysis on keywords and hashtags, including keywords that are related to the pharmaceutical organizations developing the vaccines (e.g., Moderna) as well as keywords with emotional connotation. We also aim to understand the causes of vaccine hesitancy, and their fluctuation between these two time periods by using, in addition to these purely statistical and count-based analyses, machine learning methods like clustering and word embeddings, in conjunction with manual labeling of potential vaccine hesitancy reasons that we detect in the tweets. Analysis of the nature described above has utility, both from a computational social science perspective (understanding COVID-19 vaccine hesitancy as a phenomenon in itself) and from a policy perspective (to counteract vaccine hesitancy with effective
communication strategies and incentives). We list below two specific research questions we address in this paper.

1.1 Research Questions

• How have units of discussion (such as hashtags and keywords) changed on Twitter from Fall 2020 (prior to the US presidential election) to Spring 2021, especially pertaining to COVID-19 and vaccines, and can these changes be explained by events that have come to the public’s attention since?
• Can a combination of machine learning (especially, representation learning and clustering) and qualitative sample-based analysis on tweets help understand some of the finer-grained causes, and prevalence thereof, of vaccine hesitancy?
2 Related Work

In the last decade, computational social science has become a mainstream area of study [4–6], especially in using social media and other such datasets to understand complex phenomena at lower cost [7–9]. Twitter studies have been particularly prolific [10–12]. To our knowledge, however, comparative studies of the kind we conduct in this paper have been rarer. One reason is the natural confound of using datasets that have not been collected or processed in a near-identical manner, e.g., using the same sets of keywords and preprocessing modules, and so on. At the same time, vaccine hesitancy has also been an area of research in itself [1, 2], well before COVID-19 [13–15].

COVID-19 vaccines were approved in late 2020 and early 2021 for public use in countries across the world. In May, President Biden announced his goal of getting at least 70% of Americans partially vaccinated against COVID-19 by early July at the latest. Current statistics indicate that this goal has only just been achieved (and then only for a single shot). Low vaccination uptake is not due to supply constraints, but rather, due to vaccine hesitancy among segments of the population. Numerous news articles have documented this phenomenon2, but there have been few rigorous vaccine hesitancy studies (specific to COVID-19) due to the recency of the issue. Our study, which uses grassroots social media data, offers some perspectives on this issue both before and after the election.

The methods in this paper rely on established and time-tested machine learning methods that have been found to work well broadly [16], rather than the very latest methods whose utility has not necessarily been validated outside their evaluation contexts, such as specific benchmarks or applications. Many of the natural language algorithms we use are already pre-packaged in open-source software such as the natural language toolkit (NLTK) or Scikit-Learn [17, 18]. Algorithms such as k-means and word representation learning can be found in many now-classic papers or books on machine learning [16, 19, 20].
2 Two relatively recent examples include https://www.sltrib.com/news/2021/09/27/heres-why-vaccine/ and https://www.nytimes.com/2021/09/24/opinion/vaccines-identity-education.html.
3 Materials and Methods

3.1 Data Collection and Statistics

Using the publicly available Twitter API, we started collecting our first dataset, referred to as Fall 2020, on October 19, 2020 and finished collecting the data on October 25, 2020. A total of 371,940 tweets were collected over this period using a set of COVID-19 specific keywords and phrases as input to the API3. However, despite this focused collection, not all of these tweets were useful for our study. We filtered the tweets by dropping N/A tweets (which could not be retrieved by the API despite having a tweet ID), removing duplicates4, and removing non-English tweets. After this filtering, we were left with 128,472 tweets. We pre-processed these tweets by removing URLs, special characters, punctuation and emojis. Hashtags were retained as they are important for our subsequent analysis.

Table 2. Dataset statistics (for full corpus, retweets and non-retweets) for the two datasets (Fall 2020/Spring 2021) described in the main text.
Data          Number of tweets  Number of unique user IDs  Hashtags (total)  Hashtags (unique)
Retweets      22,224/76,883     18,201/57,450              5,953/18,057      2,076/5,527
Non-retweets  106,248/281,128   72,304/173,902             21,062/117,590    5,132/21,600
Total         128,472/358,011   90,505/218,528             27,015/135,647    5,672/22,966
Our second dataset, referred to as Spring 2021, was collected using a near-identical methodology (and in particular, the same set of keywords) to facilitate accurate comparisons. A total of 970,708 tweets were collected over the period of collection from February 18, 2021 to February 24, 2021; after pre-processing, 358,011 tweets were retained. Statistics on both datasets are tabulated in Table 2, wherein we show not just the total numbers of tweets in both datasets, but the proportion of tweets that are retweets5, as well as the number of unique user IDs, the total number of hashtags and the unique number of hashtags. Note that, despite using the same collection methodology, the volume of tweets (both before and after pre-processing) in the Spring dataset is almost three times higher than that of the Fall dataset, attesting to the increased relative importance that COVID-19 related topics had assumed on Twitter since our initial data collection and following the election. In keeping with our research questions, we study these changes more systematically in Results.

3 “covid vaccine”, “coronavirus vaccine”, “china virus vaccine”, “covid injection”, “covid shot”, “kungflu vaccine”, “wuhan virus vaccine”, “pfizer covid”.
4 These are tweets with the same tweet text. It happens when different users retweet the same text without editing the original tweet text.
5 Note that a retweet can contain more than just the ‘original’ tweet (that is being retweeted) since the retweeting user can also add to it.
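A minimal sketch of the filtering and pre-processing steps of Sect. 3.1 is given below. The exact rules and the source of the language label are not specified in the paper, so the 'lang' column (assumed to come from the Twitter API) and the regular expressions are illustrative assumptions; note that the '#' character is kept so that hashtags are retained.

```python
import re
import pandas as pd

URL_RE = re.compile(r"https?://\S+")
# Keep only letters, digits, whitespace and '#' (drops punctuation, emojis, special characters)
NON_TEXT_RE = re.compile(r"[^0-9A-Za-z#\s]")

def preprocess(tweets: pd.DataFrame) -> pd.DataFrame:
    """tweets is assumed to have 'text' and 'lang' columns."""
    tweets = tweets.dropna(subset=["text"])             # drop N/A tweets
    tweets = tweets.drop_duplicates(subset=["text"])    # remove duplicate tweet texts
    tweets = tweets[tweets["lang"] == "en"]             # keep English tweets only
    cleaned = (tweets["text"]
               .str.replace(URL_RE, " ", regex=True)        # remove URLs
               .str.replace(NON_TEXT_RE, " ", regex=True)   # punctuation, emojis, special chars
               .str.replace(r"\s+", " ", regex=True)
               .str.strip())
    return tweets.assign(clean_text=cleaned)
```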
3.2 Clustering and Labeling

One of the goals of the paper is to discover and understand the reasons for COVID-19 vaccine hesitancy specifically through the analysis of Twitter data. Since our datasets have hundreds of thousands of tweets, manually reading and categorizing tweets is clearly infeasible. However, fully automated approaches are unlikely to be fruitful either, mainly because Twitter data tends to be noisy. Instead, we rely on a combination of manual and machine learning approaches.

Our first step in discovering structure in the data is to cluster the tweets. However, in order to cluster the tweets, we need to first convert them into vectors. In recent times, word embedding techniques have emerged as a particularly powerful class of techniques for converting sequences of text into dense, continuous and low-dimensional vectors. Hence, we rely on these word embeddings for tweet vectorization. Before describing the approach, however, we note that, in order to have meaningful tweet vectors, tweets with fewer than 50 characters are filtered out. After this additional filtering step, 29,951 and 43,879 tweets were found to have been excluded from the Fall 2020 and Spring 2021 datasets in Table 2, respectively.

For each tweet, the text is first tokenized using NLTK’s tokenize package6 [17], following which we embed each tweet by using the fastText ‘bag-of-tricks’ model implemented in the gensim7 package [21, 22]. Although fastText provides a pre-trained model based on Wikipedia, we decided to train our own model, both because social media text is noisy and non-grammatical and because we wanted to control the dimensionality of the word embedding vectors. Specifically, we trained a separate word embedding model for each of the two datasets (Fall 2020 and Spring 2021), setting the vector dimensionality to 25 in each case. Once the word embedding model had been trained, we calculated the tweet embedding by averaging the word embeddings of the tokens in the tweet, as is standard practice.

Following the procedure above, a single embedding is obtained per tweet. Using these embeddings, we used the K-Means8 algorithm [19] to cluster the embeddings in the Fall dataset into 7 clusters, following early explorations. Next, for each cluster, we randomly sampled 100 tweets and manually labeled them with a vaccine hesitancy cause (e.g., ‘COVID-19 is common flu’) to obtain a small, but reliable, ground-truth for better understanding vaccine hesitancy causes qualitatively. After labeling, we were able to determine six broad, but distinct, reasons that seemed to be significantly impacting vaccine hesitancy expressed on Twitter.

For the Spring dataset, the procedure was very similar, except that we picked k = 5 rather than k = 7. One reason for doing so is that, in our initial exploration, we sampled 1,000 tweets from the Spring 2021 dataset and found that only 5 out of the 6 causes identified in the Fall 2020 labeling exercise were present in that sample. After obtaining the 5 clusters, we again sampled 100 tweets from each cluster and manually labeled them to counter any kind of structural bias. In fact, in this last step, we uncovered a new sixth reason that had not been present before in the Fall 2020 dataset.

6 https://www.nltk.org/api/nltk.tokenize.html.
7 https://radimrehurek.com/gensim/models/fasttext.html.
8 We used the implementation provided in the Python sklearn library: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html. We used the default parameters, with k set to 7, and selected initial centers.
We comment both on the excluded classes, as well as on the specific reasons we identified through the clustering and labeling process, in Results.
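A minimal sketch of the vectorization and clustering pipeline of Sect. 3.2 is shown below, using gensim's FastText and scikit-learn's KMeans as named in the footnotes. The use of NLTK's TweetTokenizer, and any parameter not stated in the text (only dimension 25 and k = 7 are stated), are assumptions; this is an illustration rather than the authors' exact code.

```python
import numpy as np
from nltk.tokenize import TweetTokenizer
from gensim.models import FastText
from sklearn.cluster import KMeans

def embed_and_cluster(texts, dim=25, k=7):
    """Train a corpus-specific fastText model, average token vectors per tweet,
    and cluster the tweet embeddings with k-means."""
    tokenizer = TweetTokenizer()
    tokenized = [tokenizer.tokenize(t.lower()) for t in texts]   # assumes non-empty tweets
    ft = FastText(sentences=tokenized, vector_size=dim, min_count=1, epochs=10)
    tweet_vecs = np.vstack([
        np.mean([ft.wv[tok] for tok in toks], axis=0) for toks in tokenized
    ])
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(tweet_vecs)
    return tweet_vecs, labels

# Usage: vecs, labels = embed_and_cluster(fall_clean_texts, dim=25, k=7)
# Sampling 100 tweets per label then supports the manual labeling step.
```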
4 Results

Our first set of results concerns a comparison of the top hashtags and keywords in the Spring and Fall datasets. To compute a ranked list of top hashtags and keywords, we collected the hashtags by searching for the hashtag symbol (#). Next, keywords were collected by tokenizing the whole document using a publicly available term frequency-inverse document frequency (tf-idf) vectorization facility9 [23]. Common English stop words and task-specific stop words were removed. Keywords with similar meanings are grouped together; for example, ‘death’, ‘dies’ and ‘died’ are grouped into one. More specific grouping rules are described in a table that we provide in an Appendix (Table 8) for replicability. The popularity of each hashtag or keyword was computed by calculating the frequency of tweets in which it appeared.

Figure 1 illustrates the ranked list of the top 30 hashtags for each dataset, with the ‘lines’ showing which hashtags in the top 30 terms of the Fall dataset are also in the Spring dataset. We find that there was considerable flux, attesting to the rapid and real-time pace of discussion on Twitter, especially on political matters, ranging from the elections that were ongoing in the state of Bihar in India in the fall (and that were well over by the spring) to the US presidential election debate. We also find that ‘trump’ is not present in the Spring dataset, likely due to the former president being removed from Twitter in January 2021. Focusing on the vaccine-related tweets, we find that ‘lockdown’ became more frequent in spring, while, as expected, ‘covid19’ and ‘covidvaccine’ continued to be the top hashtags. The hashtag ‘astrazeneca’ went down due to concerns about the AstraZeneca vaccine that had just emerged. In contrast, ‘pfizer’, which is not even present in the Fall top-30 list, emerges as a top-5 hashtag in the Spring top-30 list. We also find more vaccine-related hashtags in the top-10 in the Spring dataset due to the success of vaccines in early 2021. The ‘moderna’ hashtag has also entered the top-30, though it is still trending well below ‘pfizer’.

Figure 2, which illustrates a similar plot but for keywords rather than hashtags, shows similar trends. The keyword ‘pfizer’ and vaccine-related keywords are even more prominent in Fig. 2 than in Fig. 1, suggesting that, at least on Twitter, keywords may be better at capturing COVID-19 and vaccine-related matters than raw hashtags alone. Indeed, we find that even ‘johnson’ and ‘johnsonjohnson’ appear as prominent keywords in the Spring top-30 list, although these are not yet present as hashtags in the Fig. 1 top-30 list. At the time of writing, the Moderna, Pfizer and Johnson & Johnson vaccines have proved to be the most popular options in Western countries.

Figures 3, 4 and 5 show a slightly different and more quantitative view reinforcing the findings.

9 Specifically, we used the scikit-learn package in Python: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html with max_features = 2000, and ngram_range from 1 to 2 (i.e. unigrams and bigrams).
Fig. 1. Change in ranked list of top hashtags (in descending order) between the two datasets. The left and right hand columns represent the top hashtags in the Fall 2020 and Spring 2021 datasets respectively.
In Fig. 3, we quantitatively map the percentage change of tweets in which a shared keyword (between the Fall 2020 and Spring 2021 datasets) appears. While keywords like ‘flu’, ‘trump’ and ‘astrazeneca’ are very high in the Fall dataset compared to the Spring, the ‘pfizer’, ‘dose’, ‘vaccinated’ and ‘vaccination’ keywords have become much more prominent in the spring, which also suggests that there is much more discussion about vaccination on Twitter during this time, following the politics-heavy discussions in the fall. In order to approximately understand the ‘public mood’ during this time, we also conducted a similar exercise, but with keywords that have an ‘emotional connotation’. The high prevalence of words like ‘trust’, ‘lie’ and ‘concern’, which do not necessarily10 indicate either positive or negative sentiment per se, suggests intense discussion on vaccines and COVID-19. Unfortunately, we also see a rise in words like ‘worry’ and

10 For example, ‘How can I trust the government on vaccines?’ and ‘I trust the vaccine’ both count toward the prevalence of ‘trust’, but the former indicates more hesitancy and negative sentiment. In Discussion, as well as later in this section, we explore sentiments and vaccine-related clusters in more qualitative detail.
‘forced’, and a decline in ‘hope’, but we also witness a sharp rise in ‘amazing’ and less extremity of prevalence in various words compared to the fall.
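The prevalence and change statistics behind Figs. 3, 4 and 5 can be sketched very simply, as below. The keyword-grouping rules of Appendix Table 8 are omitted, and since the paper does not state whether "percentage change" is a percentage-point difference or a relative change, the sketch assumes the former.

```python
def prevalence(keyword, tweets):
    """Percentage of tweets in which the keyword appears."""
    hits = sum(1 for t in tweets if keyword in t.lower())
    return 100.0 * hits / len(tweets)

def prevalence_change(keyword, fall_tweets, spring_tweets):
    """Percentage-point change in prevalence from Fall 2020 to Spring 2021."""
    return prevalence(keyword, spring_tweets) - prevalence(keyword, fall_tweets)

# e.g. a positive prevalence_change("pfizer", fall_texts, spring_texts) would
# reflect the rise of 'pfizer' visible in Figs. 2 and 3.
```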
Fig. 2. Change in ranked list of top keywords (in descending order) between the two datasets. The left and right hand columns represent the top keywords in the Fall 2020 and Spring 2021 datasets respectively.
Finally, Fig. 5 compares keywords corresponding to the four vaccine-related organizations on Twitter using the two datasets. As expected, ‘astrazeneca’ and ‘pfizer’ witness opposing trends, while ‘johnsonjohnson’ and ‘moderna’ remain steady, with the former exhibiting some growth. In interpreting these results, it must be borne in mind that, although the signals are weaker than might have been expected through direct surveys such as the one conducted by Gallup, the analysis has been conducted over tens of thousands of tweets and users,
Fig. 3. Visualizing the percentage change of tweets in which a shared keyword (between the Fall 2020 and Spring 2021 datasets) appears. Keywords are in descending order of prevalence in the Fall 2020 dataset. Only keywords that appear in at least 0.01% of tweets in at least one dataset are considered.
Fig. 4. Visualizing the percentage change of tweets in which a shared keyword (between the Fall 2020 and Spring 2021 datasets) with emotional connotation appears. Keywords are in descending order of prevalence in the Fall 2020 dataset. Only keywords that appear in at least 0.001% of tweets in at least one dataset are considered.
making it a much larger scale and organic enterprise than a focused and closed-answer survey. Hence, these results should be thought of as being complementary to more rigorous (but smaller-scale and less exploratory) methods, rather than a substitute, in studying COVID-19 from a computational social science lens. Although some of the
Fig. 5. A comparison of the keywords corresponding to the four vaccine-related organizations on Twitter using the two datasets.
keywords and hashtags (including the ones with emotional connotation such as in Fig. 4) may not always be referring to, or found in, COVID-19 related tweets, our data collection and processing was tailored to such topics to the extent possible. Evidence that many of the tweets are indeed COVID-19 related can be seen both in the prevalence and popularity of COVID-19 related keywords and hashtags earlier in Fig. 1 and 2, as well as in the sample shown in Table 1 and the more qualitative insights discussed subsequently.

Table 3. Results from the sampling and labeling exercise for the Fall 2020/Spring 2021 datasets, the methodology of which was described in Materials and Methods.

Vaccine hesitancy reason/Label                         Number of positively labeled samples  Percentage of positively labeled samples
Negative influence                                     46/0                                  11.1%/0
Efficacy of the vaccines                               76/5                                  18.3%/10.4%
Negative vaccine (trial) news                          174/6                                 42%/12.5%
Distrust toward government and vaccine research        82/7                                  19.8%/14.5%
Blatantly refuse                                       14/8                                  3.3%/16.7%
Covid-19 is common flu                                 18/2                                  4.3%/4.2%
Complaints about vaccine distribution and appointment  0/20                                  0/41.7%
Total                                                  414 (out of 700)/48 (out of 500)      100%/100%
Table 4. Three examples each from the Fall 2020 and Spring 2021 datasets for each of the identified vaccine hesitancy reasons/labels (see Table 3). For the Spring 2021 dataset, note that we could not find any tweets with label Negative influence, likely due to a more concerted effort by social media platforms such as Twitter to crack down on COVID-19 misinformation; similarly, for the Fall 2020 dataset, there were no tweets corresponding to vaccine distribution and appointment complaints, probably because viable vaccines had not emerged yet.

Negative influence
  Fall 2020 dataset:
  – Cuomo: Americans should be ‘very skeptical’ about COVID-19 vaccine
  – Gov. Newsom added to concerns about COVID-19 vaccines and said the state will review the safety of any vaccine approved by the Trump administration
  – California Governor Newsom is going to have his own health experts review the Covid-19 vaccine which will not make it available until around April
  Spring 2021 dataset: N/A

Efficacy of the vaccines
  Fall 2020 dataset:
  – “Coronavirus will probably never disappear and a vaccine won’t stop it completely, according to Sir Patrick Vallance.” Then what on earth are the government waiting for?
  – “It is worth reflecting that there’s only one human disease that’s been truly eradicated”
  – If #COVID1984 infection doesn’t provide long term immunity then a COVID Vaccine won’t either. While it might kill. So our only best hope at moment is population immunity aka herd immunity!
  Spring 2021 dataset:
  – The current vaccines won’t work against the new covid variant shit vaccine will only speed up covid mutation and make it more deadly
  – Local doctor shares concern about covid 19 vaccine side effects for breast cancer survivors
  – Avoid the flu shot it dramatically increases your risk of covid if concerned about heart disease change your diet lower your risk to zero most medical orgs and cardiologists are the last you want to listen to or follow unless you want to throw away your health

Negative vaccine (trial) news
  Fall 2020 dataset:
  – SOUTH KOREA: Five people have died after getting flu shots in the past week, raising concerns over the vaccine’s safety just as the seasonal inoculation program is expanded to head off potential COVID-19 complications
  – Brazilian volunteer of AstraZeneca’s COVID-19 vaccine trial dies
  – Breaking News: A U.S. government-sponsored clinical trial for a Covid-19 antibody treatment was paused because of a potential safety concern, a day after a vaccine trial was paused after a volunteer fell ill
  Spring 2021 dataset:
  – 4500 people diagnosed with covid after getting 1st vaccine dose
  – A 78 year old woman with a history of heart problems died after receiving a covid-19 vaccine in Los Angeles county
  – Pfizer vaccine kill woman 78 who died hours after having it via’, ‘louisiana woman convulsing after pfizer experimental covid vaccine

Distrust toward government and vaccine research
  Fall 2020 dataset:
  – This article is long but O M G…but PLEASE read it. It gives you all the details of why you MUST resist and learn about WHAT is coming through that COVID19 vaccine. Please, for the future of humanity and your great-grand children, learn and refuse
  – Nope! There is already too much harm done to the children, leave them out of this political game!
  – IF WE DONT BUY THEIR VACCINE OUR PEOPLE WILL FUCKING DIE
  Spring 2021 dataset:
  – Censored Dr Kaufman want to genetically modify us with covid 19 loses his job and willing to go to jail to resist
  – Covid live updates minister for regional health urges people in the safest places to still get vaccinated why would he say is he been paid by big pharma to advertise vaccine to healthy stall cure medicines to symptomatic
  – When the globalist make up a pandemic they can lie about the figures up or down now they’re saying they’re going down only to promote the vaccine

Blatantly refuse
  Fall 2020 dataset:
  – The covid-19 Vaccine is the Mark of the Beast! The 666 with nano technology to change our DNA!
  – no matter bullshit happens never take vid vaccine write fake papers submit bullshit
  – Never ever ever ever will I take this vaccine. This should be shared FAR and WIDE
  Spring 2021 dataset:
  – Thousands of service members saying no to covid vaccine
  – Doctors and nurses giving the covid 19 vaccine will be tried as war criminals
  – I wonder about vaccine covid 19 huge toxic hope not force me and parents didn’t get this stuff we hate it

Covid-19 is common flu
  Fall 2020 dataset:
  – I ask this question at least three times a week for the last three months. Why does the world need a vaccine for COVID-19 a.k.a. Chinese Virus that has a recovery rate of 99.74%? No comments huh Bill Gates?
  – The science says that children have 99.97% of living with COVID
  – Enough of the Chinese Wuhan virus BS!!! NO way am I taking a vaccine where 99% of the population gets it and, ummm LIVES!
  Spring 2021 dataset:
  – Stop making kids wear a mask it’s more dangerous for their health than not wearing it their immune systems don’t need a vaccine to fight covid-19, covid-21 or any other virus
  – boooooo to government of Victoria for locking people down over a fake virus created by china we will not take this fake vaccine to depopulate the population

Complaints about vaccine distribution and appointment
  Fall 2020 dataset: N/A
  Spring 2021 dataset:
  – The way we are handling the distribution of the covid vaccine is an absolute joke
  – The disabled are being systematically denied the covid vaccine across the country
  – The county just cancelled my mom’s first covid shot back to square one for a shot so angry 80 and get a shot this is ridiculous makes me want to cry
4.1 Profiling Vaccine-Related Clusters on Twitter

In Materials and Methods, we had proposed a methodology for clustering and labeling the tweets, with results provided in Table 3. Based on our labeling, the results show that the primary causes of vaccine hesitancy in the fall were negative vaccine news (perhaps caused by the complications observed for the AstraZeneca vaccine), distrust toward government and vaccine research, and vaccine efficacy. After the Biden administration was sworn in, however, we find significant declines across these three causes, perhaps due to better communication, more trust in the administration, and a control of misinformation on social media platforms. However, an unfortunate rise was also observed in the 'blatantly refuse' cluster. Examples of sampled tweets from all the clusters, for both Fall and Spring datasets, are provided in Table 4. Samples for the 'blatantly refuse' cluster suggest a variety of reasons, including conspiracy theories. As more misinformation control measures are implemented, we may see a decline in such clusters; however, this does not necessarily mean that the people who are blatantly refusing have changed their views. Rather, with such control measures, we may have to be more cautious in computing such statistics, as the statistics may be less representative of the population. Since such strong misinformation-control measures had not been implemented on these platforms prior to 2021, we leave an investigation of such selection and sampling-related biases to future work. Some such bias is already observed, since 'negative influence' tweets have virtually disappeared in the Spring dataset despite constituting almost 11% of the Fall dataset. In contrast, the 0% prevalence of tweets in the 'complaints about vaccine distribution and appointment' cluster in the Fall dataset is expected, since vaccines were not available to anyone yet when the Fall dataset was collected.
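The clustering-and-labeling step itself is specified in Materials and Methods; purely as an illustration of how such a pipeline could be assembled, the sketch below embeds tweets with TF-IDF and groups them with k-means using scikit-learn, then draws a small random sample from each cluster for manual labeling. The feature representation, cluster count, and function names here are assumptions made for the example, not the authors' exact implementation.

```python
# Illustrative sketch only (assumed TF-IDF + k-means, not the paper's exact pipeline):
# cluster preprocessed tweets and draw a per-cluster sample for manual labeling.
import random

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_and_sample(tweets, n_clusters=7, sample_size=10, seed=42):
    """Return (cluster id per tweet, {cluster id: sampled tweets})."""
    vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
    X = vectorizer.fit_transform(tweets)  # sparse document-term matrix
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = km.fit_predict(X)

    rng = random.Random(seed)
    samples = {}
    for c in range(n_clusters):
        members = [t for t, cid in zip(tweets, cluster_ids) if cid == c]
        samples[c] = rng.sample(members, min(sample_size, len(members)))
    return cluster_ids, samples
```

Each sampled cluster would then be labeled by hand with a hesitancy reason such as 'efficacy of the vaccines' or 'blatantly refuse', yielding counts like those reported in Table 3.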
5 Discussion

We supplement the findings in Results by using sentiment analysis, along with tweet-sampling, to get a sense of how sentiment around the three public figures prominent during COVID-19 (Trump, Biden and Fauci) fluctuated from fall to spring. We use the VADER sentiment analysis tool due to its ease of use and interpretability. For a given tweet, VADER outputs a score between −1 and +1. Tweets with scores greater than 0.05 are considered positive and tweets with scores less than −0.05 negative, with the remainder of the tweets labeled as 'neutral'. Table 5 shows that the percentage of positive tweets for all three figures has actually declined and the percentage of negative tweets has increased. The numbers reflect the polarization on Twitter, but it is worth noting that the negative percentage of tweets for Trump increased very significantly (by over 20 percentage points) from fall to spring, while the positive percentage declined by 5 percentage points. For Fauci and Biden, the numbers are steadier and more consistent. For instance, while positive sentiment for Biden declines significantly, negative sentiment only increases modestly, which is the opposite of the trend observed for Trump. More encouragingly, the percentage of tweets that are somewhat neutral remains steady for Fauci, increases for Biden (from 0.25 to 0.34) and more than halves for Trump. The results also indicate that, in the Twittersphere, Fauci is as polarizing a figure as the political figures. Accounts in the mainstream media seem to bear this finding out, with Fauci considered overwhelmingly positively and negatively on left-wing and right-wing outlets,
respectively. An important agenda is to replicate these analyses using other sentiment analysis tools, as well as comparative analyses using datasets collected after the Spring dataset.

Table 5. Sentiment score ratios of tweets mentioning 'Biden', 'Trump' or 'Fauci' using the VADER sentiment analysis tool. We show results for both the Fall 2020/Spring 2021 datasets.

Figure | Positive | Negative | Neutral | Total tweets
Biden | 0.49/0.32 | 0.26/0.34 | 0.25/0.34 | 1,761/6,722
Fauci | 0.41/0.33 | 0.31/0.39 | 0.28/0.27 | 679/3,165
Trump | 0.40/0.35 | 0.31/0.54 | 0.29/0.11 | 6,088/4,297
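To make the scoring and thresholding above concrete, the following sketch shows how compound scores could be computed with the vaderSentiment package and bucketed with the ±0.05 cutoffs behind Table 5, plus the stricter ±0.9 cutoff applied to the samples discussed next. The helper names are hypothetical and this is an assumed implementation rather than the authors' code; NLTK's bundled VADER interface would work equally well.

```python
# Assumed implementation sketch: VADER compound scores with the thresholds from the text.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()


def label_tweet(text):
    """Return (compound score in [-1, +1], coarse sentiment label)."""
    score = analyzer.polarity_scores(text)["compound"]
    if score > 0.05:
        return score, "positive"
    if score < -0.05:
        return score, "negative"
    return score, "neutral"


def extreme_subset(tweets, cutoff=0.9):
    """Tweets whose sentiment is unambiguous (|compound| > cutoff)."""
    scored = [(t, analyzer.polarity_scores(t)["compound"]) for t in tweets]
    return [(t, s) for t, s in scored if abs(s) > cutoff]
```

Per-figure ratios like those in Table 5 then follow from counting labels over the tweets mentioning 'Biden', 'Trump' or 'Fauci'.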
We sample some tweets from both datasets, concerning all three figures, with positive sentiment labels (Table 6) and negative sentiment labels (Table 7). To ensure that we only consider the tweets that are unambiguously positive or negative, we isolate the sets of tweets with VADER sentiment scores greater than 0.9 or less than −0.9, respectively. We sample tweets from these more extreme sets in Tables 6 and 7, respectively. The results show that the tweets are divided along party lines, with other political figures (such as Harris and Cuomo) mentioned as well. Depending on the sentiment, the tweets are divided along the lines of who should take credit for the vaccine, and also whether the vaccine is safe. Negative sentiment tweets tend to have more explicit invectives in them, as expected. While qualitative and small-sample, the tweets highlight the intense polarization around, and politicization of, a public health issue (COVID-19), which is unfortunate.

Table 6. Examples of tweets concerning Trump, Fauci and Biden that were labeled as having positive sentiment (score >= 0.9) by the VADER sentiment analysis system. Explicit content has not been filtered, although the tweets have been preprocessed using the methodology presented in Materials and Methods.

Biden
Fall 2020:
– trump says good chance coronavirus vaccine ready weeks biden predicts dark winter pick president america
– absolute best case biden perfect vaccine rollout gets credited ending coronavirus lol
– let get real people joe biden may greatest guy world blessed thing end covid going another shutdown vaccine ready ready people wear wear masks please nothing change folks
Spring 2021:
– you mean how they take credit for everything that was done by the previous yeah i noticed that too joe biden has done a fabulous job developing the covid 19 vaccine amazing what he has accomplished in a month
– thank you president biden a caring competent president will save america from covid we just wish it could have happened 400 000 ago
– oh my god thank you please president biden tells to the lab scientists upgrade some powerful safe effective vaccines

Fauci
Fall 2020:
– we love care dr fauci voice medically sound knowledge reason would take vaccine vid 19 says safe so
– fauci probably highly respected infectious disease expe world also terrific communicator great confidence francis collins director national institutes health
– fauci predicts safe effective coronavirus vaccine end year
Spring 2021:
– fauci says yes to hugs white house chief medical adviser dr anthony fauci says that it's very likely that family members who have been vaccinated against coronavirus can safely hug each other
– love dr fauci so kind so congenial and so committed
– dr fauci has said this my friend hope you have a positive and amazing week

Trump
Fall 2020:
– wow can't wait coronavirus vaccine stas getting distributed weeks wow god bless president trump
– trump says good chance coronavirus vaccine ready weeks biden predicts dark winter pick president america
– fda plan ensure safety vaccines developed since trump dismantled office charge trump administration shut vaccine safety office last year
Spring 2021:
– thanks to the fantastic efforts of president trump for promoting an environment creating a vaccine and pushing pharma to be better and also create a path to combat this china virus
– my friend who plagued me with her love of trump for four years is now convinced that the covid vaccine is 666 she was never an anti vaxxer not till now
– south dakota is doing a great job getting out people vaccinated we have a great governor and medical system our medical personnel are doing an average of 900 folks per day and our civil air patrol are getting the vaccine out into smaller towns thanks you president trump
Table 7. Examples of tweets concerning Trump, Fauci and Biden that were assigned a score less than −0.9 by the VADER sentiment analysis system. Explicit content has not been filtered, although the tweets have been preprocessed using the methodology in Materials and Methods.

Biden
Fall 2020:
– cuomo joins biden harris spreading reckless anti vaxxer misinformation right arrow curving down
– biden want close borders scaring people getting vaccine democrats history wanting people die look murder rate cities aboion
– biden keeps lying lying lying lying people eat up
Spring 2021:
– Joe Biden said in his town hall last night that African Americans and Latinos don t know how to register online to get the vaccine this racist shit along with his lies about children can't spread SMDH
– trump says Biden is either lying or he s mentally gone after claiming there was no covid vaccine when he took office and floats creating his own social media after twitter ban and teases a 2024 run via fuck you fat basterd

Fauci
Fall 2020:
– fauci stupid science wrong covid go away vaccine coming election new health care plan wall get built trade deals get better deeper debt world relations worse crisis vote
– fauci fraud needs dealt like traitors criminals need dealt with
– vaccine agenda fail fauci birx cuomo lamont plan fail spent billions terrible idea ignoring treatments work pushing fear
Spring 2021:
– and why are trumplicans now mad about vaccines when for 11 months refused to wear masks said covid was no worse than the flu and called people sheep for listening to fauci instead of get to the end of the vaccine line
– anyone who listens to fauci is just stupid the sob lied and people died because of the lies
– dr fauci said he had pain in these 2 places after the covid vaccine dr fauci is a pain all by himself lying to the media and scaring people on a regular basis

Trump
Fall 2020:
– trump fake wife fake economy fake coronavirus vaccine fake insurance plan
– for months trump routinely undermined advice doctors researchers mes fighting covid 19 examines operation warp speed political fight vaccine turmoil inside fda
– trump persists refusing publicly acknowledge true severity coronavirus crisis america he lied failed forewarn public impending danger jan rejecting projections due to
Spring 2021:
– imbecile biden ipotus is an idiot he and carmala said trump should have left a stockpile of vaccine what assholes needs to be kept in refrigerated state you f n idiot
– 60 million americans were infected in 2009 with joe biden in charge if you apply that number of infected to covid instead of swine flu in 2009 we would have 1 3 million dead with no vaccine in site be grateful trump developed 6 vaccines with the first from pfizer in 9 months
– covid is a hoax mask do nothing the vaccine are for people not the virus the death rate has not change in 3 year they is no pandemic it is a power and money grab by the globalist and the ccp it has all been done to remove trump from office election fraud is real
6 Conclusion

Media reports, as well as non-partisan observations, have all noted the alarming rise and politicization of public health issues that, on their face, seem politically neutral. At the same time, vaccine hesitancy remains persistent in the population, even though the Biden administration has continued to aggressively promote the benefits of vaccination (including making it mandatory for federal employees in the United States). Understanding vaccine reaction and polarization around COVID-19 this last year through a comparative lens, especially given the scale and reach of social media platforms like Twitter, is an important agenda for computational social scientists and policy makers. For the latter, in particular, it may help inform decisions both in the near future as well as from a planning standpoint for future pandemics. Our study took a step in this direction by presenting both a methodology for conducting such comparative analyses on raw corpora collected from Twitter, and results guided by both quantitative and sample-driven qualitative analyses. In particular, the latter uncovered some reasons for vaccine hesitancy, and found, encouragingly, that in the spring many of the concerns may have been due to logistics-related matters like vaccine distribution and getting appointments. Indeed, in the months since this data was collected, and as vaccine supply exceeded demand, these concerns have been mitigated and vaccination rates have consequently risen. Nevertheless, almost 15% of sampled tweets in the Spring dataset suggested that vaccine hesitancy was also being caused by mistrust toward the government, among other similarly fundamental reasons. Therefore, significantly more outreach and communication may be necessary in the next few months to reach vaccination rates of 90% or beyond.
Appendix

See Table 8.

Table 8. Manually determined keyword/hashtag groupings used in Results.

Covid19 | Covid/covid19/covid-19/covid-19/covid_19/coronavirus/covid__19/corona
Covid vaccine | Covid vaccine/covid vaccines/corona virus vaccine/covaxin/covax/covid19vaccination/covid vaccination/vaccines/vaccine
Die | Deaths/dies/died
Dose | Dose/doses
Trust | Trust/trusted
Worry | Worried/worries/worrying
Scare | Scary/scared/scare
Concern | Concerns/concerned/concerning
Lie | Lying/lies/lied
Doubt | Doubts/doubt
Hope | Hopes/hopeful
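One plausible way to apply the groupings in Table 8 when counting keyword/hashtag prevalence, offered only as an illustration (the mapping below is abridged and the matching logic is an assumption, not the authors' code), is to map every surface variant to its canonical group and count each group at most once per tweet:

```python
# Illustrative sketch: count canonical keyword groups (abridged from Table 8) in a tweet.
import re
from collections import Counter

KEYWORD_GROUPS = {
    "covid19": ["covid", "covid19", "covid-19", "covid_19", "coronavirus", "corona"],
    "covid vaccine": ["covid vaccine", "covid vaccines", "covaxin", "covax", "vaccines", "vaccine"],
    "die": ["deaths", "dies", "died"],
    "worry": ["worried", "worries", "worrying"],
    "hope": ["hopes", "hopeful"],
}


def group_counts(text):
    """Count which canonical groups appear in one lower-cased tweet."""
    counts = Counter()
    for group, variants in KEYWORD_GROUPS.items():
        for v in variants:
            # Word-boundary match so 'corona' does not also fire inside 'coronavirus'.
            if re.search(r"(?<!\w)" + re.escape(v) + r"(?!\w)", text):
                counts[group] += 1  # count each group at most once per tweet
                break
    return counts

# Example: group_counts("worried about the covid vaccine")
# -> Counter({'covid19': 1, 'covid vaccine': 1, 'worry': 1})
```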
A Descriptive Literature Review and Classification of Business Intelligence and Big Data Research

Ammar Rashid1 and Muhammad Mahboob Khurshid2(B)

1 College of IT, Ajman University, Ajman, UAE
[email protected]
2 Department of Examinations, Virtual University of Pakistan, Lahore, Pakistan
Abstract. Business Intelligence (BI) leverages IT tools and services to transform data into insights that inform an organization's business decisions. BI using Big Data has gained popularity in recent years and has become a significant study area for academics and practitioners. However, prior studies have highlighted the technical challenges of BI using Big Data. The extant BI and Big Data literature has mainly focused on technology and behavior-related factors to examine this field. Fewer studies have mapped the extent of this area to provide a classification of the BI and Big Data field. Given the significant nature of BI and Big Data, this paper presents a descriptive literature review and classification scheme for BI and Big Data. The study includes 128 refereed journal articles published since the inception of BI and Big Data research. The articles are classified based on a scheme that consists of three main categories: Management, Technological, and Application and Domain of usage. The results show that current research is still skewed towards technological aspects, followed by management, and then by application and domain of use. This review provides a reference source and classification scheme for information systems researchers interested in the Big Data and Business Intelligence domain and indicates under-focused areas and future directions.

Keywords: Big Data · Business Intelligence · Business analytics · Classification · Descriptive literature review
1 Introduction

In the current decade of the digital revolution, achieving Business Intelligence (BI) using Big Data (BD) has become important. Organizations continuously develop the capacity to acquire, store, manipulate, and transmit complex data quickly so that social, political, technical, operational, and economic benefits can be leveraged. More importantly, BD has become the engine of this revolution and provides unique opportunities to the social, human, and natural sciences. Importantly, BD is more than just data [1]. Therefore, the way we understand the world is being reshaped by the growing trends of BI and BD, the latter commonly characterized by five V's, namely volume, velocity, variety, veracity, and value [2].
Big Data can give rise to scientific and business research by linking and correlating complex data to understand the system, business trends, and human behaviors. There are many areas of research where such rises are closely relevant such as weather and climate forecasting [3], understanding the workings of the brain, behavior of the global economy [4], quality of human life, sustainable developments in cities [5]. However, on the other hand, Big Data bears several challenges related to data itself (such as volume, velocity, veracity, variability, visualization, etc.), process (such as data acquisition and warehousing, data mining and cleansing, interpretation, etc.), and management (such as security, privacy, governance, operational and maintenance expenditures, ownership, etc.) [6]. The relative novelty and rapidly increasing growth of achieving BI using Big Data make it an exciting area of research. The majority of the research studies reviewed are analytical in the categories of experiments, algorithms, simulations, and/or mathematical modeling techniques to tackle Big Data [6]. Regardless of their research approaches, the articles in BI and Big data domains are presented as a source of generating new knowledge. Therefore, considering the novelty of BI and Big Data, the present paper aims to assess the state of BI and Big Data literature through descriptive literature review and classification scheme that can help to develop a taxonomy of BI and Big Data research. The paper is organized as follows. The first section outlines the research motivations. The second section discusses the literature related to BI and Big data. The third section explains the research methodology used in this study. The fourth section discusses the result, followed by the last section on the limitation of this research study.
2 Literature Review

2.1 Big Data

Recently the dynamics of the work environment have been changing rapidly, with increasing dependence on business analytics and big data followed by smart decision making. Big data has emerged as a most useful tool for many organizations. Many organizations are investing in big data research, exploring the ways to create maximum value for firms, communities, individuals, and businesses through the effective deployment of vast volumes of data [7, 8]. Many businesses worldwide speculate regarding whether they are extracting the maximum value from the data they have. Upgraded technologies enable firms to collect a larger volume of data than was ever possible before. Yet, many firms are still struggling to extract maximum value from that data [9, 10]. Collecting data from these vastly extensive social and economic data sets requires powerful computational capability to generate patterns between them. The use of new techniques can provide more precious and in-depth insights from data that largely remain static (such as surveys, statistics, archival data). This extraction of data and providing in-depth insight from it happens in real-time, reducing time and information gaps (reference). The massive volume of data distinguishes it from other data, as it cannot be computed, managed, and analyzed through typical database software [11].
Big Data is characterized by the massive volume of data held by various organizations, frequently in structured or unstructured form [12]. Big Data refers to a set of advanced technologies that enable the accumulation, organization, and analysis of datasets that are too large for traditional databases to process [13, 14]. The significance of big data does not lie entirely in its volume, but rather in its application in terms of making business decisions more effective and efficient. Big data can provide valuable insights regarding strategies and business direction [15]. Its significance can also be assessed from the fact that it not only assists in analyzing vast volumes of data to identify patterns; it can also predict the probability of an event. Predictive outcomes like consumer preference, patterns of traffic, consumer search behavior, or disease eruptions can all be assessed from the effective use of big data analytics [16]. Big data is produced from various sources ranging from business transactions (such as purchase transactions), clicks on the internet, and mobile transactions to user-created content and social media (reference). Also, operations management, the health industry, engineering management, and finance have contributed to the extensiveness of big data [17–19].

2.2 Business Intelligence

Big data is the most recent advancement emerging from business analytics and business intelligence. Big data can work through advanced technologies by incorporating unusual sources of data and an exceptional combination of user skills [4, 20]. Business Intelligence is defined as a system capable of providing complex information to decision-makers and policy planners through knowledge management, data assembling, data storage, and analytically advanced tools [21, 22]. This definition highlights that business intelligence facilitates its users in making calculated decisions at the right time. It assists decision-makers with actionable information through capitalizing on multiple inputs [23]. Business intelligence is also defined as the broadly categorized system of accumulation, storage, and analysis of data through various technologies, applications, and processes that assist decision-makers in making the right decisions [24]. Business intelligence is a combination of the technologies, applications, and processes associated with business decision-making. It is referred to as a system that gets data out and accumulates it through warehousing or a data mart [25, 26]. Business intelligence systems deal with both semi-structured and structured data [27]. Semi-structured data is the type of data that does not have structured flat files [8, 28]. Because BI consumes both semi-structured and structured data, its architecture is business-oriented rather than technical. The architecture of BI systems entails policies, business regulations, metadata, standards, and protocols [1, 29]. Business intelligence initiatives differ in their nature and scope depending upon the needs of the organization. Three specific targets of BI are identified as follows [14, 30]: the first target is whether to develop a single or a few BI applications, depending upon the need of the organization. The second target entails creating organizational infrastructure that can accommodate the needs of BI systems (both current and future). The third and final target is the transformation required on a corporate level. It can be in the form of a new
business model, funding, sponsorship, and acknowledgment of BI's significance at the top management level. BI transforms data into useful and actionable information. Managers and decision-makers use this data and produce valuable knowledge. Tasks like anticipating future direction, estimating and forecasting based on existing data, analyzing the results of the anticipated change, and providing strategic insight are all part of the functions performed by business intelligence systems [31]. BI facilitates operational and strategic decision-making. These strategic decision-making areas include performance management of the organization, monitoring of business activities, customer relationship management, and management reporting of BI [19, 32, 33]. BI as a concept has provided a quality substitute for management information systems and decision support systems [25]. BI facilitates organizations in making more sophisticated decisions through capitalizing on diverse business opportunities [34]. With its ability to compute, analyze and assess multiple data sets, BI assists firms in their development [32]. It facilitates numerous beneficiaries, including customers, suppliers, regulators, managers, knowledge workers, analysts, and IT developers [35–40].

2.3 Business Analytics

Recent studies indicate that business analytics and intelligence are becoming the supreme priority for CIOs (chief information officers), even more significant than cloud computing and mobile technology [41]. International Data Corporation forecasted the business analytics market to be 50.7 billion dollars by the end of 2016. The market is growing rapidly, reflecting its importance for businesses [42, 43]. The study of analytics supports investigating it from various perspectives and dimensions. Collectively considered, three aspects of analytics are orientation, domain, and technique [29, 44]. Orientation refers to the direction analytics presumes; more simply, it concerns what can be termed as part of analytics. Orientation is the take on what analytics does, as to whether it can be referred to as "description, prediction, or prescription" [10, 45, 46]. Domain refers to the sphere of business analytics in terms of the subject fields it covers. More simply, the domain specifies in which subject fields analytics is being practiced [16, 47]. Areas include further sub-domains and disciplines within. For instance, the field of business includes human resources, marketing, finance, supply chain management, information systems, etc. The level of analytics can even be specified further at the level of a discipline in a specific subdomain [48, 49]. Technique refers to the way through which business analytics is executed. The technique can also be viewed from various perspectives. For instance, the techniques can be qualitative, quantitative, technology-oriented, or practice-oriented [31, 50]. Business analytics comprises modern advances in technological systems to facilitate decision-making. It includes a robust system to acquire, create, integrate, select and disseminate useable knowledge that can facilitate decision making [32, 51, 52]. The business analytics system works on qualitative and quantitative knowledge types and makes appropriate decisions accordingly. The decision support situation can be well organized and structured or entirely complex and unstructured [53]. Conventionally
analytics is concerned with data operation that supports business activities. The technological frame of business analytics accordingly supports analytics embedded in decision support activities of businesses. As a holistic approach, business analytics encompasses all disciplines of business administration. Every subject can implement and extract useful information from analytics depending on their requirements [54, 55]. The future of business analytics is anticipated to include non-SQL databases, mobile, cloud, bandwidth, and newer data forms. It is also expected to include analytics like pervasive, social, real-time, and more [56, 57].
3 Research Methodology

A literature review can be conducted along multiple dimensions. The selection of execution style depends on the stance of the researcher and the nature of the study. There are four primary methods, categorized depending upon their application to the qualitative and quantitative nature of the study [2]. These methods are: narrative review, descriptive review, vote counting, and meta-analysis. The following figure illustrates how these methods scale along the qualitative-quantitative range (Fig. 1).
Fig. 1. Qualitative and quantitative methods
The first approach, narrative review, is the orthodox way of conducting a literature review. This approach is inclined towards the interpretation of literature qualitatively. It involves the verbal description of prior research work, encompasses mainly the research findings, the underlying theories, and framework designs about a hypothesized relationship [58]. Though this is the general pattern of the narrative review approach, there is no standard execution procedure. Researchers have applied multiple ways in the past, and it mostly depends on subjectivity. Consequently, it is inherently vulnerable to subjectivity, and presumably, two different research may present two different conclusions from the same set of literature [59]. The second vote counting approach is prominently used in developing inferences about focal relationships by accumulating individual research outcomes [58]. A count is developed, representing the amount of time a specific research position is supportive of a particular stance. Mostly this approach assists in generating insights from a series of multiple experiments. The rationale behind adopting this approach lies in spreading
the result through the cumulative power of numerous incidences (even if some are nonsignificant) and pointing in one direction rather than building upon a significant isolated effect [58]. The third approach, meta-analysis, is more inclined towards the quantitative continuum. It involves analyzing multiple studies to synthesize and formulate inferences on a research domain [58]. It may entail relationships among numerous independent and dependent variables extracted from available studies. The nature of the review allows only similar quantitative studies as a part of the analysis and may exclude any qualitative literature on the subject matter under investigation. Consequently, this approach successfully provides an objective view of the research in a specific context. The last approach that is a more apt one, the descriptive review, is predominantly focused on uncovering an interpretable pattern from the extant literature [59]. The review method involves a rigorous process of searching, filter and further classifying the existing literature. The reviewer accumulates relevant literature by collecting domain-relevant maximum available research papers, followed by identifying trends and patterns among the collected literature in the relevant domain [58]. This very approach of review involves the systematic study of the obtained literature with some preliminary quantifications such as the order and record of publishing year, research methodology applied, and then the research output. The nature of the descriptive review approach is said to exhibit a closeto-the-current state of a specific research realm. The descriptive literature review has been considered the best approach for the interfusion of previous research [60]. Hence, a descriptive literature review is adopted for this study to uncover patter from the extant literature. 3.1 Scope of Literature Search Traditionally the first step in literature review analysis is to search for domain-specific relevant literature. This can be done by manually searching or using computers. Mostly it involves targeting multiple noticeable journals and conferences in that research domain. For research in information systems, using online databases for accumulating research literature has been evident as an emerging trend lately [61]. Hence, a literature review on BI and Big data is conducted using an online database within the information systems field. 3.2 Filtering Process A total of 240 articles were checked for duplication. Thirty-two (32) articles were removed after finding them exact duplicates. The remaining 208 Articles were then processed through different phases of filtering and scanning following a systematic selection process [62]. In the first phase of filtration, the articles were manually screened following the relevance of their titles. The articles that did not address big data or business intelligence were excluded from the repository. Those articles were included coarsely by the online search engines. A total of 56 articles were excluded in this phase. In the second phase of filtering, the remaining articles were filtered manually through reading their pertaining abstracts. Full-text description of articles was also studied where
needed. Those articles were excluded that didn’t have big data or business intelligence as a focal theme. The search engines picked these articles depending on even their slightest relevance with the subject matter. This second phase was the most comprehensive and time-consuming because more careful reading and further scanning were needed. A comprehensive screening through in-depth reading allowed to eliminate editorial notes, letters, briefs, book reviews, opinion papers, and other literature that does not conform to the objectivity of this study. Also, those articles that almost covered the same/similar topics by a similar group of authors were also considered duplicates and further excluded. The most recent research article was included for the final review. After the second phase of filtering, 128 articles were considered for further investigation. 3.3 Classification Scheme Special attention is given to creating a literature classification scheme related to BI and Big data research. This classification was based on categorizing the research focus of the 128 articles, which remained after closely monitored filtering processes for literature to be reviewed. A ‘bottom-up’ approach informed by grounded theory [63] was assumed to identify the categories used for this literature analysis. Such an approach has recently been recommended as a rigorous method for reviewing the literature [64]. Hence, specific subcategories were assigned to each article and then fused into more general top categories in three steps, as described further below. The first step was an initial reading of the 128 papers. In the initial coding stages, open coding techniques were applied and generated a wide range of codes to capture the themes represented in each article [65]. Codes were generated from article keywords, analysis of the article abstract, and, where necessary to explain the paper’s content further, a careful reading of the entire article. In this process, thirty to forty codes were identified. In the next stage, we pursued connections between our initial axial coding and reduced the codes initially identified into a concluding set comprising 20 subcategories [65]. This aggregation was revised iteratively to confirm that it was not dragged into the unnecessary overlapping theme; instead, a diverse representation of the coding was performed initially. In the next step, we aggregated those 20-coding sets of sub-groupings into three domain labels of topics to simplify the broader research in the literature available. This was performed using a technique called “affinity diagramming” developed by Jiro Kawakita in 1960. It evaluates and substantiates classifications in a well-crafted systematic direction. Hence, this resulted in the following grid, shown in Table 1 [66]. Consequently, 128 articles were reviewed with complete text, henceforth clustered into three wide-ranging groupings: management aspects, Technological aspects and Applications and Domains in usage.
4 Result and Analysis

In the stated approach to the descriptive literature review, 128 articles were classified following the scheme developed in this study. According to the analysis performed, 'Technological aspects' visibly stands out as the most consistently published category (64
articles, 50%), this is followed by ‘Management aspects’ (36 articles, 29%), and lastly ‘Application and Domains in usage’ (28 articles, 21%). 4.1 Management Aspect This grouping emphasizes precise and effective management attributes details of big data and business intelligence. This grouping entails those articles in which researchers contemplate big data and business intelligence in the context of originating, managing, analyzing, utilizing, and further impacting for the benefit or otherwise, not only to one entity such as organization or firms rather every stakeholder that is involved within it, may it be an individual or a group constituting a major shareholder in this context. Seven subgroupings are constituted to such management aspects of big data and business intelligence. 1. ‘Influx and Management’: Articles in this subgrouping entails entry, invasion, incursion, and the Inflow of the massive volume in which the data is being generated. In effect, the world is creating more and more data each day and, this data is stored on history much higher. 2. ‘Context and Relevance’: The central point of this subgrouping is the aspect of context and relevance of big data in which it is intended to be used. Given that the volume is huge, we need to trim down the relevant and context-aware data to meet the desired objectives regarding results. Given the processing cost, context and considerable relevance are becoming even more critical. 3. ‘Dimensions and Diversity’: In this subgrouping, the discussion is the varied variety of big data. For instance, in the business domain, studies show that appropriate and effective use of data-driven knowledge is a competitive advantage and the magnitude in which it is being produced. This data is not just unidirectional; instead, it exhibits a complex set of dimensions to explore and add perspective. In another instance, big data is changing the biomedical industry by bringing benefits for human beings. Still, it all is being brought with a lot of versatility, given the complex, multi-faceted domain of the biomedical industry. In addition to this, big data is also available in the study of history and geography, and it is being originated in humongous volume [67]. 4. ‘Ethical and Legal issues’: This subgrouping substantiates two critical social phenomena. Big data has many interventions with various subjects and their relevant data. May it be social media, medicine, and financial or interdisciplinary cross-over among these distinct natures of big data, ethical and legal constraints are essential when it comes to big data and its manipulation and application. Some of such obstacles involve privacy, ownership rights, and rightful possession, to name a few. 5. ‘Risk Management and Governance’: This management aspect emphasizes that more and more entities are utilizing big data in today’s age businesses. This interdependence poses a certain amount of risks that need to manage to operate better with big data. It includes both technical risks such as access management of potential data, financial risks such as transactional over-runs, and behavioral risks such as duplication and archival routines are some of the soaring risks to this aspect.
6. 'Cost of Processing and Economics': Articles concerning this management aspect signify the specificity of processing costs involving big data. As discussed above, huge volume is one of the striking features of big data, and it bears high processing costs relative to its size. Traditional software, means, and methods may not be capable of processing such a gigantic quantity of data for its intended purpose.

7. 'Analysis and Research Insights': Articles in this subgrouping pinpoint one of the most important aspects of big data: the sought-after research and analysis of big data and its resultant insights for various purposes, including strategy development, contemplation of trends and patterns, and supporting decisions for future growth and improvements.

As far as the 'Management aspects' are concerned, 'Analysis and Research Insights' and 'Dimensions and Diversity' constitute closer to half of the literature. This shows the focus of researchers at this point, and aligns closely with the early objectives of the organizations themselves. Big data has a strategic direction. Hence these two aspects reflect, and further capitalize on, an organization's ability to derive meaningful trends and patterns to guide its future direction in ways where traditional analysis loses its capabilities.

Table 1. Management aspects of Big Data and BI.

Management aspects | Articles and percentage
Influx and Management | 4 (11%)
Context and Relevance | 5 (14%)
Dimensions and Diversity | 8 (22%)
Ethical and Legal issues | 4 (11%)
Risk Management and Governance | 3 (8%)
Cost of Processing and Economics | 3 (8%)
Analysis and Research Insights | 9 (25%)
Total | 36 (100%)
Aspects such as 'Ethical and Legal Issues' and 'Risk Management and Governance', along with 'Influx and Management', are presumably still in their infancy. It is expected that once better platforms and architectural frameworks are at hand for big data research, these aspects will rise in terms of research and development in the future.
4.2 Technological Aspects This grouping explores technological elements about big data and business intelligence. This grouping selects those articles in which researchers predict big data and business intelligence in the technological context of creating, managing, and analyzing. Six subgroupings are established to technological aspects of big data and business intelligence under this grouping. 1. ‘Storage platforms and Archiving’: This stream of subgrouping discusses how big data on an enormous scale can be stored and archived. It puts light on the technological platforms which partake to make this happen. It becomes necessary to prepare grounds to host such huge volume, variety, and velocity of big data so that entities can manipulate this huge incursion of data generated in this digital age. 2. ‘Processing Capability and Capacity’: This subgrouping contains the articles which progress on the technological capacity and capability to process big data. Specialized platforms are needed to process big data to cater to depth (signified with volume) and breath (signified with variety) it embodies. 3. ‘Data sources’: When it comes to data sources for big data, this subgrouping contains articles that emphasize the wide variety of sources that contribute to streaming-in big data for further analysis. 4. ‘Transformation and Portability’: This is perhaps one of the most talked-about and research interest areas from the current crop of big data researchers. Several articles have been reviewed to focus on this particular area in big data research. 5. ‘Data Visualization’: This subgrouping contains big data output. It includes articles that have been deemed to focus on extrapolating the possible ways in which big data could result and further meet the objectives, may it be aiding a strategy for businesses or scientific simulation for a spaceship. This area is gaining a lot of momentum but currently is hindered through computing-intensive resources being required. 6. ‘Knowledge Management and Intelligence’: Last but one prominent subgroupings is this one. It contains the literature which, perhaps, has garnered the most attention by the researcher as it is directly linked with one of the big data objectives termed as ‘knowledge/Information /Intelligence’ out of data supplied. The following table (Table 2) provides a picture of the literature explored regarding technological aspects. According to this percentage, ‘Technological aspects’ represented 50% of the total descriptive literature review. This indicates that there is an apparent inclination towards resolving technological hindrances and addressing concerns regarding technology when it comes to research for big data. 4.3 Application and Domain in the Usage This grouping contains various subjects and domains under which big data and business intelligence are currently being used and further investigated (Table 3). Business, Social, Science and Technology, Arts and History, Biotechnology, Medicine and Health, Agriculture and Farming, and Spatial and Geography are the subjects under investigation
Table 2. Technological aspects of Big Data and BI.

Technological aspects | Articles and percentage
Storage platforms and Archiving | 12 (19%)
Processing Capability and Capacity | 14 (22%)
Data Sources | 9 (14%)
Transformation and Portability | 8 (12%)
Data Visualization | 7 (11%)
Knowledge Management and Intelligence | 14 (22%)
Total | 64 (100%)
through the lens of big data and the advancements made to help experts take maximum benefit in these subjects. Lastly, but importantly, application and domain usage of big data is not widely dispersed. This shows that big data is gaining momentum in every walk of life, though the progress is gradual. Nonetheless, 'Business', 'Science and Technology' and 'Biotechnology, Medicine and Healthcare' are the dominant domains, due to the nature of the data they contain and the humongous volume generated. Progressively but consistently, the role of big data in arts, history, and geography is on the rise, as depicted in Table 3, and is expected to increase further in time to come.

Table 3. Application and Domains in Big Data and BI.

Application and domains in usage | Articles and percentage
Business | 6 (25%)
Social | 4 (18%)
Science and Technology | 7 (25%)
Art and History | 2 (7%)
Biotechnology, Medicine and Healthcare | 6 (14%)
Agriculture and Farming | 1 (4%)
Spatial and Geography | 2 (7%)
Total | 28 (100%)
4.4 Discussion

A comprehensive descriptive review has been conducted and presented, classifying the extant BI and Big data literature into a range of categories, further disintegrated into resultant subgroupings. The purpose of this write-up is to reveal a backdrop of current academic research in the field of BI and Big data. The results estimated and inferred in this study suggest useful insights for both practitioners and academic researchers. One may draw upon several insights while investigating the literature about BI and Big data. Nonetheless, this study has helped to provide a comprehensive sketch of the current condition of the BI and Big data literature available on hand, with specific observable patterns. While a mix of both 'technological' and 'management' inclined literature was observed, 'technological' aspects still account for more of the literature. Big data and BI, specifically, are going through a transition phase from traditional data to big data and the associated business intelligence transformation; therefore, the current focus on technical-level field preparation is natural and understandable. Not much time has passed since the advent of big data, so the literature is still in its infancy. Though advancement in the technology is well appreciated and contributes significantly to maturing frameworks at a rapid pace, when it comes to the velocity and variety that big data contains, it will take some time to consider this literature in a mature state with proven frameworks and tested methodologies. Under 'Technological' aspects, a clear inclination of the studied literature was toward 'capturing' and 'retaining' big incoming data. Due to the constant evolution of computational capacity, which has been progressing in leaps and bounds, especially over the last decade, 'Processing Capacity and Capability' in conjunction with 'Knowledge Management and Intelligence' for big data have been major highlights of the investigated literature. Under 'Management' aspects, many articles inclined towards 'Analysis and Research Insights', followed by 'Dimensions and Diversity'.
5 Conclusion and Limitations

An immense interest in the ever-emergent phenomenon of BI and Big data has been observed among practitioners and academics. Hence, while this review cannot claim to be comprehensive, it provides good insights into the current state of BI and Big data research. The classifications and descriptive review should provide a useful quality reference for academics and practitioners with a concentration in BI and Big Data, along with suggestions for forthcoming lines of research. This study has several limitations. Firstly, the literature reviewed was mainly based on academic publications and research articles published in journals. A vast majority of leading research on BI and Big data is being done in industry due to its industry-driven nature; subsequently, many quality professional articles may also embrace this phenomenon. Secondly, the research articles encompassed are all refereed journal articles. Consequently, the classification scheme might not reflect the topic distribution of conference papers related to BI and Big data. Thirdly, the search criteria might be inadequate, as some articles may not have BI and Big data in the abstract or list of keywords.
References 1. Khurshid, M.M., Zakaria, N.H., Rashid, A.: Big data value dimensions in flood disaster domain. J. Inf. Syst. Res. Innov. 11(1), 25–29 (2017) 2. Chen, Y., Han, D.: Big data and hydroinformatics. J. Hydroinf. 18(4), 599–614 (2016) 3. Li, D., S. Guo, and J. Yin. Big data analysis based on POT method for design flood prediction. In: 2016 IEEE International Conference on Big Data Analysis (ICBDA) (2016) 4. Nguyen, T., et al.: Big data analytics in supply chain management: a state-of-the-art literature review. Comput. Oper. Res. 98, 254–264 (2018) 5. Hardy, K., Maurushat, A.: Opening up government data for big data analysis and public benefit. Comput. Law Secur. Rev. 33(1), 30–37 (2017) 6. Sivarajah, U., et al.: Critical analysis of Big data challenges and analytical methods. J. Bus. Res. 70, 263–286 (2017) 7. Sharma, S., Mangat, V.: Technology and trends to handle big data: survey. In: Fifth International Conference on Advanced Computing and Communication Technologies. IEEE (2015) 8. Fosso Wamba, S., et al.: How ‘big data’ can make big impact: findings from a systematic review and a longitudinal case study. Int. J. Prod. Econ. 165, 234–246 (2015) 9. Salleh, K.A., Janczewski, L.: Technological, organizational and environmental security and privacy issues of big data: a literature review. Procedia Comput. Sci. 100, 19–28 (2016) 10. de Camargo Fiorini, P., et al.: Management theory and big data literature: from a review to a research agenda. Int. J. Inf. Manage. 43, 112–129 (2018) 11. McInnis, D.: Taking advantage of Big Data (2016). http://www.binghamton.edu/magazine/ index.php/magazine/story/taking-advantage-of-big-data 12. Fang, H., et al.: A survey of big data research. IEEE Netw 29(5), 6–9 (2015) 13. Litchfield, A.T., Althouse, J.: A systematic review of cloud computing, big data and databases on the cloud. In: Twentieth Americas Conference on Information Systems, Savannah (2014) 14. Shin, D.-H.: Demystifying big data: anatomy of big data developmental process. Telecommun. Policy 40(9), 837–854 (2016) 15. Siddiqa, A., et al.: A survey of big data management: taxonomy and state-of-the-art. J. Netw. Comput. Appl. 71, 151–166 (2016) 16. Khade, A.A.: Performing customer behavior analysis using big data analytics. Procedia Comput. Sci. 79, 986–992 (2016) 17. Yadegaridehkordi, E., et al.: Influence of big data adoption on manufacturing companies’ performance: An integrated DEMATEL-ANFIS approach. Technol. Forecast. Soc. Change 137, 199–210 (2018). https://doi.org/10.1016/j.techfore.2018.07.043 18. Wang, Y.F., et al.: Power system disaster-mitigating dispatch platform based on big data. In: 2014 International Conference on Power System Technology (POWERCON) (2014) 19. Weerakkody, V., et al.: Factors influencing user acceptance of public sector big open data. Prod. Plann. Control 28(11–12), 891–905 (2017) 20. Sirin, E., Karacan, H.: A review on business intelligence and big data. Int. J. Intell. Syst. Appl. Eng. 5(4), 206–215 (2017) 21. Monaghan, A., Lycett, M.: Big data and humanitarian supply networks: can big data give voice to the voiceless? In: 2013 Global Humanitarian Technology Conference (GHTC). IEEE (2013) 22. Gonzalez-Alonso, P., Vilar, R., Lupiáñez-Villanueva, F.: Meeting technology and methodology into health big data analytics scenarios. In: 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS). IEEE (2017) 23. Bendre, M.R., Thool, V.R.: Analytics, challenges and applications in big data environment: a survey. J. Manage. Anal. 3(3), 206–239 (2016)
878
A. Rashid and M. M. Khurshid
24. Duan, L., Xiong, Y.: Big data analytics and business analytics. J. Manage. Anal. 2(1), 1–21 (2015) 25. Chen, Y., et al.: Big data analytics and big data science: a survey. J. Manage. Anal. 3(1), 1–42 (2016) 26. Miller, G.J.: Comparative analysis of big data analytics and BI projects. In: 2018 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE (2018) 27. Tiwari, S., Wee, H.M., Daryanto, Y.: Big data analytics in supply chain management between 2010 and 2016: insights to industries. Comput. Ind. Eng. 115, 319–330 (2018) 28. Bodislav, D.-A.: Transferring business intelligence and big data analysis from corporations to governments as a hybrid leading indicator. Theor. Appl. Econ. 22(1), 257–264 (2015) 29. Loshin, D.: Introduction to High-Performance Appliances for Big Data Management, pp. 49– 59 (2013) 30. Olszak, C.M.: Business intelligence and analytics in organizations. In: Mach-Król, M., M. Olszak, C., Pełech-Pilichowski, T. (eds.) Advances in ICT for Business, Industry and Public Sector. Studies in Computational Intelligence, vol. 579, pp. 89–109. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-11328-9_6 31. Hallman, S., et al.: BIG DATA: Preconditions to Productivity, pp. 727–731 (2014) 32. Akter, S., Wamba, S.F.: Big data analytics in E-commerce: a systematic review and agenda for future research. Electron. Mark. 26(2), 173–194 (2016). https://doi.org/10.1007/s12525016-0219-0 33. Soon, K.W.K., Lee, C.A., Boursier, P.: A study of the determinants affecting adoption of big data using integrated Technology Acceptance Model (TAM) and Diffusion of Innovation (DOI) in Malaysia. Int. J. Appl. Bus. Econ. Res. 14(1), 17–47 (2016) 34. Miloslavskaya, N., Tolstoy, A.: Big data, fast data and data lake concepts. Procedia Comput. Sci. 88, 300–305 (2016) 35. Lau, R.Y.K., et al.: Big data commerce. Inf. Manage. 53(8), 929–933 (2016) 36. Almeida, F.: Big data: concept, potentialities and vulnerabilities. Emerg. Sci. J. 2(1), 1–10 (2010) 37. Almeida, F., Low-Choy, S.: Exploring the relationship between big data and firm performance. Manage. Res. Pract. 13(3), 43–57 (2021) 38. Cassel, C., Bindman, A.: Risk, benefit, and fairness in a big data world. JAMA 322(2), 105–106 (2019) 39. Balachandran, B.M., Prasad, S.: Challenges and benefits of deploying big data analytics in the cloud for business intelligence. Procedia Comput. Sci. 112, 1112–1122 (2017) 40. Hussein, A.E.E.A.: Fifty-six big data V’s characteristics and proposed strategies to overcome security and privacy challenges (BD2). J. Inf. Secur. 11(04), 304–328 (2020) 41. Abawajy, J.: Comprehensive analysis of big data variety landscape. Int. J. Parallel Emergent Distrib. Syst. 30(1), 5–14 (2015) 42. Ma’ayan, A., et al.: Lean big data integration in systems biology and systems pharmacology. Trends Pharmacol. Sci. 35(9), 450–460 (2014) 43. Chen, H., Chiang, R.H., Storey, V.C.: Business Intelligence and Analytics: From Big Data to Big Impact. MIS Q. 36(4), 1165–1188 (2012) 44. Minelli, M., Chambers, M., Dhiraj, A.: Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. Wiley, Hoboken (2012) 45. Marjanovic, O., Dinter, B., Ariyachandra, T. R.: Introduction to the Minitrack on Organizational Issues of Business Intelligence, Business Analytics and Big Data (2018) 46. Grover, V., et al.: Creating strategic business value from big data analytics: a research framework. J. Manag. Inf. Syst. 35(2), 388–423 (2018) 47. 
Seddon, P.B., et al.: How does business analytics contribute to business value? Inf. Syst. J. 27(3), 237–269 (2017)
A Descriptive Literature Review and Classification of Business Intelligence
879
48. Zhang, Y., Hua, W., Yuan, S.: Mapping the scientific research on open data: a bibliometric review. Learn. Publ. 31, 95–106 (2017) 49. Asadi Someh, I., et al.: Enablers and Mechanisms: Practices for Achieving Synergy with Business Analytics (2017) 50. Yerpude, S., Singhal, T.K.: Internet of things and its impact on business analytics. Indian J. Sci. Technol. 10(5), 1–6 (2017) 51. Jin, X., et al.: Significance and challenges of big data research. Big Data Res. 2(2), 59–64 (2015) 52. Ngai, E.W.T., Gunasekaran, A., Wamba, S.F., Akter, S., Dubey, R.: Big data analytics in electronic markets. Electron. Mark. 27(3), 243–245 (2017). https://doi.org/10.1007/s12525017-0261-6 53. Fazal-e-Amin, et al.: Big data for C4i systems: goals, applications, challenges and tools. In: 2015 Fifth International Conference on Innovative Computing Technology (INTECH) (2015) 54. Kemp, R.: Legal aspects of managing big data. Comput. Law Secur. Rev. 30(5), 482–491 (2014) 55. Nalchigar, S., Yu, E.: Conceptual modeling for business analytics: a framework and potential benefits. In: 2017 IEEE 19th Conference on Business Informatics (CBI). IEEE (2017) 56. Zhuang, Y., et al.: An evaluation of big data analytics in feature selection for long-lead extreme floods forecasting. In: 2016 IEEE 13th International Conference on Networking, Sensing, and Control (ICNSC) (2016) 57. Marjanovic, O., Dinter, B.: 25+ years of business intelligence and analytics minitrack at HICSS: a text mining analysis. In: Proceedings of the 50th Hawaii International Conference on System Sciences (2017) 58. King, W.R., He, J.: Understanding the role and methods of meta-analysis in IS research. Commun. Assoc. Inf. Syst. 16(1), 32 (2005) 59. Guzzo, R.A., Jackson, S.E., Katzell, R.A.: Meta-analysis analysis. Res. Organ. Behav. 9(1), 407–442 (1987) 60. Kitchin, R.: Big data and human geography: opportunities, challenges and risks. Dialogues Hum. Geogr. 3(3), 262–267 (2013) 61. Sabherwal, R., Jeyaraj, A., Chowa, C.: Information system success: individual and organizational determinants. Manage. Sci. 52(12), 1849–1864 (2006) 62. Dybå, T., Dingsøyr, T.: Empirical studies of agile software development: a systematic review. Inf. Softw. Technol. 50(9–10), 833–859 (2008) 63. Glaser, B., Strauss, A.: The Discovery of Grounded Theory. Chicago, p. 230. Adeline, Chicago (1967) 64. Wolfswinkel, J.F., Furtmueller, E., Wilderom, C.P.: Using grounded theory as a method for rigorously reviewing literature. Eur. J. Inf. Syst. 22(1), 45-55 (2013). https://doi.org/10.1057/ ejis.2011.51 65. Strauss, A., Corbin, J.M.: Grounded Theory in Practice. Sage, Thousand Oaks (1997) 66. Yang, H., Tate, M.: Where are we at with cloud computing? A descriptive literature review. In: 20th Australasian Conference on Information Systems (2009) 67. Mo, Z., Li, Y.: Research of big data based on the views of technology and application. Am. J. Ind. Bus. Manage. 05(04), 192–197 (2015)
Data Mining Solutions for Fraud Detection in Credit Card Payments Awais Farooq(B) and Stas Selitskiy School of Computer Science and Technology, University of Bedfordshire, Park Square, Luton LU13JU, UK {awais.farooq,stanislav.selitskiy}@study.beds.ac.uk http://beds.ac.uk/computing
Abstract. We describe an experimental approach to design a fraud detection system using supervised Machine Learning (ML) methods such as decision trees and random forests. We believe that such an approach allows financial institutions to investigate fraudulent cases efficiently in terms of accuracy and time. When ML methods are applied to imbalanced problems such as fraud detection, the outcomes of decision models must be accurately calibrated in terms of predicted fraud probabilities. The use of different ML models allows practitioners to minimize the risks and cost of the solutions. We discuss the main results obtained in our experiments on the benchmark problems.
Keywords: Fraud detection · Payment transactions · Machine learning · Data mining · Imbalance data
1 Introduction
The rapid rise in the use of the Internet around the world has shaken and transformed normality. According to the Statista report, "In 2019, the number of internet users worldwide stood at 3.97 billion, which means that more than half of the global population is currently connected to the world wide web" [3]. This growth includes online shopping at a new scale, as it offers a more accommodating approach than traditional brick-and-mortar businesses. As online shopping rises, the number of online payments increases too, which attracts fraudsters and scammers. Another report suggests that "In 2018, $24.26 billion was lost due to payment card fraud worldwide" [1]. Fraud is commonly defined as any criminal deception carried out with the motive of acquiring financial gain. Unauthorized card use or stolen personal details can lead to fraudulent credit card payments. The combination of data mining, machine learning, and artificial intelligence to detect fraudulent payments has constrained the issue to some extent. However, scammers' ability to find ways around these defences requires data scientists to be one step ahead. "Data mining technique is one of the most used techniques for solving the problem of credit card fraud detection. Credit card fraud detection is the process of detecting whether a transaction is genuine or fraudulent" [2]. A multitude of techniques could be used to detect fraud across the board. This project compares and evaluates three data mining techniques: Decision Tree (DT), Random Forest (RF) and the k-nearest neighbours algorithm (k-NN).
2 Data
2.1 Benchmark Data
Our study is based on an analysis of the European cardholders' credit card fraud data set covering two days in September 2013. This data set consists of "492 frauds out of 284,807 transactions. The data set is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions" (Credit Card Fraud Detection, 2021). This data set includes 31 variables, and Principal Component Analysis (PCA) transformations have been applied to protect privacy. PCA has been described as "a standard tool in modern data analysis used by almost all scientific disciplines. The goal of PCA is to identify the most meaningful basis to re-express a given data set. It is expected that this new basis will reveal hidden structure in the data set and filter out the noise" [12].
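The published dataset already ships with PCA components V1 to V28, so no transformation is needed in this study; the following minimal sketch only illustrates how such components are typically produced with scikit-learn. The array raw_features and the choice of 28 components are assumptions made for the example, not part of the original work.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
raw_features = rng.normal(size=(1000, 30))   # stand-in for raw transaction attributes

scaled = StandardScaler().fit_transform(raw_features)
pca = PCA(n_components=28)                   # keep 28 components, mirroring V1-V28
components = pca.fit_transform(scaled)
print(components.shape)                      # (1000, 28)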
2.2 Preprocessing the Dataset and Designing a Solution
The Python programming language has been used for this project, and the credit card fraud data set has been loaded into the Kaggle integrated development environment (IDE) to examine the data. This data set is large and could not be viewed in one frame. The columns V1 to V28 were produced by PCA, converting the original features into numerical values used to identify unusual occurrences in the data pattern. Non-fraudulent transactions are considerably more numerous than fraudulent transactions, as shown in Fig. 1, which gives an overview of a highly unbalanced data set. Legitimate transactions account for 284,315 entries, while the fraudulent transaction count is only 492. In the target variable, the number of legitimate transactions is far higher than the number of fraudulent transactions. Feeding this data set into machine learning models during the training phase can lead to inaccurate predictions; therefore, this unbalanced data set needs to be pre-processed before experimenting. After separating the data for analysis, two variables, 'legit' and 'fraud', have been created, and the training data were under-sampled and over-sampled respectively. The data were then split into two sets, train and test. Four variables or arrays, "X_train, X_test, Y_train, Y_test", have been created to randomly split the features from 'X' and the labels from 'y'. 80% of the data is stored in 'X_train', with the corresponding labels in 'Y_train', and 20% is stored in 'X_test', with the corresponding labels in 'Y_test'. The number of data points in each split is the following: 284,807 values in the original data set, of which 80% (227,845) are saved in 'X_train' and 20% (56,962) are stored in 'X_test'.
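A minimal sketch of the split described above is given below. It assumes the public Kaggle file name creditcard.csv and its 'Class' column (1 for fraud, 0 for legitimate), and the random seed is arbitrary, so the exact rows in each split may differ from the authors' run.

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('creditcard.csv')         # assumed Kaggle file name
legit = data[data['Class'] == 0]             # 284,315 legitimate transactions
fraud = data[data['Class'] == 1]             # 492 fraudulent transactions

X = data.drop(columns=['Class'])
y = data['Class']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=2)
print(len(X_train), len(X_test))             # 227845 56962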
Fig. 1. Unbalanced classes of the transaction dataset
3 Outcome Analysis
All the models in this report were evaluated on the accuracy score (1), the F1 score (2), and the confusion matrix.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)

where True Positive (TP) is the number of correct positive predictions of the algorithm, True Negative (TN) the number of correct negative predictions, False Positive (FP) the number of erroneous positive predictions, and False Negative (FN) the number of incorrect negative ones.

F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \quad (2)

where Precision is a measure of the 'pollution' of the true positive verdicts by the false positive errors (3):

Precision = \frac{TP}{TP + FP} \quad (3)

and Recall is a measure of the true positive verdicts 'loss' due to false negative errors:

Recall = \frac{TP}{TP + FN} \quad (4)
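As a small worked example of Eqs. (1)-(4), the snippet below recomputes the metrics from a confusion matrix. The counts used are those reported for the KNN model in the conclusions of this paper; the recomputed accuracy agrees with the value quoted there, while the F1 value depends on which class is treated as positive, so it is not asserted here.

TP, TN, FP, FN = 56_861, 5, 0, 96            # counts reported for the KNN model

accuracy = (TP + TN) / (TP + TN + FP + FN)   # Eq. (1)
precision = TP / (TP + FP)                   # Eq. (3)
recall = TP / (TP + FN)                      # Eq. (4)
f1 = 2 * precision * recall / (precision + recall)   # Eq. (2)
print(round(accuracy, 6))                    # 0.998315, matching the reported KNN accuracy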
4 Experiments
4.1 Decision Tree
According to a study, a simple explanation of a Decision Tree is "A tree comprises of internal nodes that denote a test on an attribute, each branch denotes an outcome of that test, and each leaf node (terminal node) holds a class label. It recursively partitions a data set using either depth-first greedy approach or breadth-first greedy approach and stops when all the elements have been assigned a particular class" [6]. Another reason to use the Decision Tree (DT) algorithm is that it is easy to understand, as a study suggests [11]. The DT algorithm was used for the 1st experiment. A model variable has been created according to Listing 1.1, with the maximum depth set to 4 and the training criterion set to 'entropy'.
Listing 1.1. Decision Tree model creation and training
from sklearn.tree import DecisionTreeClassifier

# train a depth-limited decision tree and predict labels for the test set
DT_model = DecisionTreeClassifier(max_depth=4, criterion='entropy')
DT_model.fit(X_train, Y_train)
tree_yhat = DT_model.predict(X_test)

The decision tree model achieved an accuracy score of 0.999367 and an F1 score of 0.81026, which is below the KNN model and above the random forest model. The confusion matrix computed for the decision tree algorithm's outcome can be seen in Fig. 2.
Fig. 2. Confusion matrix for decision tree
4.2 Random Forest
This model can build many decision trees at random data points using the ensemble learning technique. One academic explains ensemble learning as follows: "The idea of ensemble learning is to employ multiple learners and combine their predictions" [17]. The average of their predictions is then calculated. Taking the average of numerous decision trees' forecasts is usually better than using a single decision tree. An illustration for two trees can be seen in Fig. 3.
Fig. 3. Decision trees in random forest example
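To make the averaging idea concrete, the following toy sketch trains a few depth-limited trees on bootstrap samples and averages their predicted probabilities. The synthetic data from make_classification and the ensemble size of 10 are assumptions used only to keep the example self-contained; this is not the authors' implementation, and Listing 1.2 below uses scikit-learn's RandomForestClassifier directly.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.95], random_state=0)
rng = np.random.default_rng(0)

probas = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))          # bootstrap sample
    tree = DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx])
    probas.append(tree.predict_proba(X)[:, 1])

ensemble_pred = (np.mean(probas, axis=0) >= 0.5).astype(int)
print((ensemble_pred == y).mean())                      # accuracy of the averaged ensemble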
The explanation of Random Forest (RF), as a study suggests, is: "The idea of the random forest is to partition the training set into subsets and construct the different decision trees, subset by subset. In the random forest, each item is classified by voting the outputs of the trees" [10]. The RF model's ability to discriminate between classes also provides an edge, as academics suggest that "This classifier can be successfully used to select and rank those variables with the greatest ability to discriminate between the target classes" [5]. This RF model has been used for the 2nd experiment to solve a classification problem. A model variable was created according to Listing 1.2, with the maximum depth set to 5.
Listing 1.2. Random Forest model creation and training
from sklearn.ensemble import RandomForestClassifier   # RandomForestClassifier lives in sklearn.ensemble, not sklearn.tree

RF_model = RandomForestClassifier(max_depth=5)
RF_model.fit(X_train, Y_train)
RF_yhat = RF_model.predict(X_test)
The RF model predicts all the labels for the test data and stores them in the variable RF_yhat; these predictions are then compared with the original labels to generate an accuracy score. The F1 score of the Random Forest model is 0.793296 and the accuracy rate is 0.99332, which can be considered an excellent outcome for this model. The resulting confusion matrix for the Random Forest model is shown in Fig. 4.
Fig. 4. Confusion matrix for random forest
4.3 k-Nearest Neighbours' Algorithm
A study suggests that "KNN is one of the efficient machine learning techniques for classification and K-means considered an efficient clustering algorithm. Much research proved that KNN provided reliable accuracy in classification based on supervised learning methodology whereas K-means based on unsupervised learning methodology" [14]. Another work describes it as follows: "The KNN is a learning method based on the case, which keeps all the training data for classification" [13]. The k-nearest neighbours algorithm (k-NN) was used for the 3rd experiment. A model variable was created according to Listing 1.3, with the parameter n set to 5.
Listing 1.3. k-nearest neighbours model creation and training
from sklearn.neighbors import KNeighborsClassifier   # KNeighborsClassifier lives in sklearn.neighbors

n = 5
knn_model = KNeighborsClassifier(n_neighbors=n)
knn_model.fit(X_train, Y_train)
knn_yhat = knn_model.predict(X_test)   # prediction step added for completeness

The KNN model achieved an accuracy score of 0.998314, and the F1 score for the KNN model is 0.857142. Comparing the RF F1 score with the KNN F1 score for this project, it could be argued that the KNN model predicts with higher accuracy. The confusion matrix of the KNN algorithm is shown in Fig. 5.
Fig. 5. Confusion matrix for k-nearest neighbours’ algorithm
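The reported scores and the matrix in Fig. 5 could be obtained, for example, with scikit-learn's metric helpers as sketched below; knn_model, X_test and Y_test are the objects defined in the listings above, and the exact numbers depend on the split actually used.

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

knn_yhat = knn_model.predict(X_test)
print(accuracy_score(Y_test, knn_yhat))      # reported as 0.998314
print(f1_score(Y_test, knn_yhat))            # reported as 0.857142
print(confusion_matrix(Y_test, knn_yhat))    # plotted in Fig. 5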
In particular the ML approach has been efficiently used to solve problems such as detection of abnormal patterns [15,16] and detection of bone pathologies at the early stage [4,7]. The other examples of efficient applications of ML include Bayesian evaluation of brain development from biomedical signals [9,18,19], trauma patient survival prediction [8,20,21] and prediction of aircraft positions [22].
5 Conclusions
This study aimed to investigate the challenge of detecting fraudulent transactions using a data set with PCA implemented to protect consumers’ privacy.
Three different data mining algorithms have been implemented in this project to identify and differentiate fraudulent transactions from legitimate transactions. Even a simple decision tree and a random forest model can obtain decent recall, while the more sophisticated KNN model outperforms the other models and gives the best result. Initially, an accuracy score of 0.998314 was obtained for the KNN model. To check the reliability of this model, the F1 score was also computed, which reduced the score to 0.857142; this is still a better result than the other two models, the Decision Tree and the Random Forest. After applying the confusion matrix to the KNN model, the results can be seen in Fig. 5, where True Positive = 56,861, False Positive = 0, False Negative = 96 and True Negative = 5 with the test data as input. It could be argued that the KNN model performs very well in identifying fraudulent transactions, with the highest prediction accuracy rate.
References 1. Credit Card Fraud Statistics, September 2021. Accessed 24 Jan 2022 2. Bagga, S., Goyal, A., Gupta, N., Goyal, A.: Credit Card Fraud Detection using Pipeling and Ensemble Learning (2020). Accessed 24 Jan 2022 3. Topic: Internet usage worldwide, January 2022. Accessed 24 Jan 2022 4. Akter, M., Jakaite, L.: Extraction of texture features from x-ray images: case of osteoarthritis detection. In: Yang, X.-S., Sherratt, S., Dey, N., Joshi, A. (eds.) Third International Congress on Information and Communication Technology. AISC, vol. 797, pp. 143–150. Springer, Singapore (2019). https://doi.org/10.1007/978-981-131165-9 13 5. Belgiu, M., Dr˘ agut¸, L.: Random forest in remote sensing: a review of applications and future directions. ISPRS J. Photogramm. Remote. Sens. 114, 24–31 (2016) 6. Jain, Y., Tiwari, N., Dubey, S., Jain, S.: A comparative analysis of various credit card fraud detection techniques, October 2019. Accessed 24 Jan 2022 7. Jakaite, L., Schetinin, V., Hladuvka, J., Minaev, S., Ambia, A., Krzanowski, W.: Deep learning for early detection of pathological changes in x-ray bone microstructures: case of osteoarthritis. Sci. Rep. 11, 1–9 (2021) 8. Jakaite, L., Schetinin, V., Maple, C., Schult, J.: Bayesian decision trees for EEG assessment of newborn brain maturity. In: The 10th Annual Workshop on Computational Intelligence UKCI 2010 (2010) 9. Jakaite, L., Schetinin, V., Schult, J.: Feature extraction from electroencephalograms for Bayesian assessment of newborn brain maturity. In: 24th International Symposium on Computer-Based Medical Systems (CBMS), pp. 1–6 (2011) 10. Jo, T.: Machine Learning Foundations. Springer, Cham (2021). https://doi.org/ 10.1007/978-3-030-65900-4 11. Kotsiantis, S.B.: Decision trees: a recent overview. Artif. Intell. Rev. 39, 261–283 (2011) 12. Kurita, T.: Principal component analysis (PCA). In: Computer Vision: A Reference Guide, pp. 1–4 (2019) 13. Li, H., Li, H., Wei, K.: Automatic fast double KNN classification algorithm based on ACC and hierarchical clustering for big data. Int. J. Commun. Syst. 31(16), e3488 (2018). e3488 IJCS-17-0750.R1
14. Mittal, K., Aggarwal, G., Mahajan, P.: Performance study of k-nearest neighbor classifier and k-means clustering for predicting the diagnostic accuracy. Int. J. Inf. Technol. 11(3), 535–540 (2018) 15. Nyah, N., Jakaite, L., Schetinin, V., Sant, P., Aggoun, A.: Evolving polynomial neural networks for detecting abnormal patterns. In: 2016 IEEE 8th International Conference on Intelligent Systems (IS), pp. 74–80 (2016) 16. Nyah, N., Jakaite, L., Schetinin, V., Sant, P., Aggoun, A.: Learning polynomial neural networks of a near-optimal connectivity for detecting abnormal patterns in biometric data. In: 2016 SAI Computing Conference (SAI), pp. 409–413 (2016) 17. Rocca, J.: Ensemble methods: bagging, boosting and stacking - towards data science. Medium, December 2021 18. Schetinin, V., Jakaite, L.: Classification of newborn EEG maturity with Bayesian averaging over decision trees. Expert Syst. Appl. 39(10), 9340–9347 (2012) 19. Schetinin, V., Jakaite, L., Schult, J.: Informativeness of sleep cycle features in Bayesian assessment of newborn electroencephalographic maturation. In: 2011 24th International Symposium on Computer-Based Medical Systems (CBMS), pp. 1–6 (2011) 20. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian averaging over decision tree models: an application for estimating uncertainty in trauma severity scoring. Int. J. Med. Informatics 112, 6–14 (2018) 21. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian averaging over decision tree models for trauma severity scoring. Artif. Intell. Med. 84, 139–145 (2018) 22. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian learning of models for estimating uncertainty in alert systems: application to air traffic conflict avoidance. Integr. Comput. Aided Eng. 26, 1–17 (2018)
Nonexistence of a Universal Algorithm for Traveling Salesman Problems in Constructive Mathematics Linglong Dai(B) George School, Newtown, PA 18940, USA [email protected]
Abstract. Proposed initially from a practical circumstance, the traveling salesman problem caught the attention of numerous economists, computer scientists, and mathematicians. These theorists were intrigued by seeking a systemic way to find the optimal route. Many attempts have been made along the way, and all concluded the nonexistence of a general algorithm that determines the optimal solution to all traveling salesman problems alike. In this study, we present a proof of the nonexistence of such an algorithm for both asymmetric (with oriented roads) and symmetric (with unoriented roads) traveling salesman problems in the setup of constructive mathematics.
Keywords: Traveling salesman problem · Constructive mathematics
1 Introduction
The traveling salesman problem (TSP) is a problem of major concern in computer science, economics, and mathematics. It abstracts the realistic situation of traveling between cities as a graph. In the graph, cities are denoted as nodes and the roads connecting them as edges, each costing the traveler a certain amount of toll fee. For a long time, theorists have worked on finding systemic ways to obtain the optimal route, following which a salesman could travel to all cities non-repeatedly in a complete tour and return to the original location with a minimum cost. Although various algorithms have been proposed and improved upon one another, a universal algorithm that computes the optimal routes of all traveling salesman problems in finite time seems not to exist. In this paper, we focus on the problem of the existence of such an algorithm. In particular, we consider this problem in constructive mathematics. The principle of omniscience grants two possible outcomes: given an algorithm, either optimal solutions could be found in all TSPs, or an optimal solution computable in finite time does not exist for at least one TSP. We establish the validity of the latter in this paper. This research was supported by the Ivy Mind Summer Program and the project was created as a result of joint collaboration of Viktor Chernov and Vladimir Chernov.
Specifically, we formulate our objective in this paper as proving the following theorem:

Theorem 1. There does not exist an everywhere defined computable algorithm Ĥ which determines the optimal route in all traveling salesman problems when the costs of roads are constructive real numbers.

Furthermore, in Sect. 4, we limit the type of TSPs to symmetric TSPs, meaning that traveling back and forth between two cities has the same cost, and we prove that:

Remark 1. Theorem 1 holds true for symmetric TSPs.

Note that there are inextensible partially defined algorithms from N to [0, 1].

Theorem 2 (Shen & Vereshchagin [3]). There exists a computable function that has no total computable extension.

The fact that there are partially defined inextendable algorithms is crucial in the proofs.
2 Definitions
Before proving Theorem 1 and Remark 1, we define one key term and two sequences. In this paper, all traveling salesman problems under consideration have toll fees that are constructive real numbers, which we define as follows:

Definition 1 (Constructive Real Numbers). Given two programs α and β, α generates a sequence of rational numbers α(n), n ∈ N, and β, the regulator program, generates a sequence of natural numbers β(m). The sequence of rational numbers α(n) converges to a constructive real number such that

\forall m,\ \forall i, j > \beta(m):\ |\alpha(i) - \alpha(j)| \le 2^{-m} \quad (1)
In addition, we define two sequences that will be utilized in the remaining sections.

Definition 2 (Sequence C). Given a partially defined computable algorithm H that takes in an input n, we define sequence C in the following manner:

C_{n,k} = \begin{cases} 1 & \text{if by step } k, H \text{ did not finish working on input } n \text{ or already gave } 1, \\ 1 - 2^{-m} & \text{if by step } k, H \text{ finished working on input } n \text{ and gave } 0. \end{cases}

Here m is the step number at which 0 was printed, and m, n, k ∈ N.

Definition 3 (Sequence D). In a similar manner, we define a new sequence D:

D_{n,k} = \begin{cases} 1 & \text{if by step } k, H \text{ did not finish working on input } n \text{ or already gave } 0, \\ 1 - 2^{-m} & \text{if by step } k, H \text{ finished working on input } n \text{ and gave } 1. \end{cases}

Same as in the previous definition, m is the step number at which 1 was printed, and m, n, k ∈ N.
For each fixed n, the sequences Cn,k and Dn,k define constructive numbers, which we denote by Cn and Dn. In either of the following cases, Cn and Dn are constructive real numbers:
1. If the program prints 0 for C or 1 for D, the terms are always equal to 1 − 2^{−m} after or by step m, so that, according to Definition 1, the corresponding Cn and Dn are constructive real numbers with the standard regulator.
2. If the program prints 1 for C or 0 for D, or does not terminate, then the terms in the sequence are always equal to 1 and, by Definition 1, they are constructive real numbers with the standard regulator.
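For readers who prefer code, the following sketch (not part of the paper) tabulates the first terms of C_{n,k} and D_{n,k} for a toy partial algorithm; representing H as a Python function that returns (output, halting step) within a step budget is an assumption of the illustration.

def C(n, k, H):
    output, m = H(n, k)
    return 1 - 2 ** (-m) if output == 0 else 1

def D(n, k, H):
    output, m = H(n, k)
    return 1 - 2 ** (-m) if output == 1 else 1

def toy_H(n, k):
    # Toy partial algorithm: halts with output n % 2 after n steps; undefined for n = 0.
    return (n % 2, n) if n > 0 and k >= n else (None, None)

print([C(3, k, toy_H) for k in range(1, 6)])  # [1, 1, 1, 1, 1]          (H printed 1)
print([D(3, k, toy_H) for k in range(1, 6)])  # [1, 1, 0.875, 0.875, 0.875]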
3 The Case of General TSPs
In this section, we present a proof of the nonexistence of an algorithm that always decides the optimal solution to general traveling salesman problems by constructing a particular contradiction. To begin, we consider a simple instance of 3 nodes and 6 one-way roads, with numbers representing the tolls. The Cn and Dn indicated in Fig. 1 are constructive real numbers generated by the partially defined algorithm H as defined in Sect. 2.
Fig. 1. TSP with 3 nodes
In this case, there are two complete tours: n1 → n2 → n3 → n1 in blue and n1 → n3 → n2 → n1 in red. Let Ĥ be an algorithm that solves all traveling salesman problems by deciding the optimal route with a minimum cost. We have a partially defined computable algorithm H that generates the sequences C and D. There is an extension of algorithm H, H′, that applies to the aforementioned instance by determining min(2 + Cn, 2 + Dn), which further reduces to computing min(Cn, Dn).
To explain the process of determining the optimal route by H′, we enumerate a few possible sequences C and D and the corresponding outputs of H′. We use Z to denote the case in which the program does not terminate.
Table 1. Enumerations of possible sequences Cn,k and Dn,k over the steps k, together with the corresponding output of H′ (0, 1, or Z when the program does not terminate).
As we observe from Table 1, each corresponding line of some Cn and some Dn satisfies Definition 1 of constructive real numbers, and these are possible tolls of some particular roads. Notice that when H′ outputs 1, Dn ≤ Cn and H′ decides that the red route is the optimal solution; when H′ outputs 0, Cn ≤ Dn and we would instead have the blue route as the optimal solution. When Cn = Dn, both routes are optimal, which means that the previous statement would also be correct. We proceed to prove Theorem 1 by contradiction.

Theorem 1. There does not exist an everywhere defined computable algorithm Ĥ which determines the optimal route in all traveling salesman problems when the costs of roads are constructive real numbers.

Proof. Assume there exists an algorithm Ĥ that always finds the optimal routes in traveling salesman problems. Let algorithm H be a partially defined computable algorithm that generates the sequences C and D mentioned in Sect. 2. Then there exists an extension of H, H′, that determines which of Cn and Dn is smaller and decides the optimal route in the previously mentioned traveling salesman problem. This extension contradicts our knowledge that H is an inextendable algorithm (see Theorem 2). Therefore, we have proven Theorem 1.

The idea of this proof can be utilized to prove Remark 1, which we present in the next section.
4 The Case of Symmetric TSPs
In Sect. 3, we proved Theorem 1 by constructing a contradiction with an asymmetric traveling salesman problem. In this section, we attempt to limit the range of traveling salesman problems down to the symmetric cases, and we prove the nonexistence of an algorithm that finds the optimal solution in all symmetric traveling salesman problems (with unoriented roads). Similarly, we begin by constructing a particular traveling salesman problem:
Fig. 2. TSP with 5 nodes
The dotted paths in Fig. 2 have costs of 100, which are economically inefficient. Therefore, complete tours (tours that connect 5 nodes and return to the initial position) with at least one road costing 100 could be ignored in the selection of the optimal route in the course of the proof. In the hypothetical situation of machine decision, the algorithm H would also eliminate those cases as it recognizes them as economically inefficient when making a comparison between tours marked in the following diagram. The binary search would culminate in comparing the tours below, which are the only candidates of the optimal solution in the aforementioned TSP.
Fig. 3. The only 2 possibilities
As indicated in the graphs in Fig. 3, the tour in blue, n1 → n2 → n5 → n4 → n3 → n1, costs the traveler 4 + Cn; the tour in red, n1 → n2 → n5 → n3 → n4 → n1, costs the traveler 4 + Dn. Cn and Dn represent some constructive numbers in [0, 1]. We then follow the idea in the previous section to prove the following:

Remark 1. Theorem 1 holds true for symmetric TSPs.

Proof. Assume there is an algorithm Ĥ that always finds the optimal routes in symmetric traveling salesman problems. Let a partially defined computable algorithm H be the algorithm that generates the sequences C and D. We could find an extension of H, H′, that applies specifically to the previously mentioned symmetric traveling salesman problem with 5 nodes. To determine the optimal route, the algorithm compares 4 + Cn and 4 + Dn, which reduces to comparing Cn and Dn. With the same reasoning as in Sect. 3, we would conclude that such an extension exists. However, since algorithm H is inextendable (see Theorem 2), the constructed H′ poses a contradiction. Therefore, Ĥ does not exist and we have proven Remark 1.
5 Conclusion and Future Work
In this paper, we proved the nonexistence of a universal algorithm that determines the optimal complete tour in all traveling salesman problems (both symmetric and asymmetric) in a constructive mathematical setup, and further in all symmetric traveling salesman problems.
The aforementioned tours are unoriented: n1 → n3 → n4 → n5 → n2 → n1, for example, is considered the same as n1 → n2 → n5 → n4 → n3 → n1.
For future research, we wish to explore the possibility of creating an infinite series of examples as in Fig. 1 and 3. Additionally, we propose that gaining topological alternatives to understand and construct examples might be a plausible direction.
References
1. Bishop, E., Bridges, D.: Constructive Analysis. Springer, Heidelberg (1985). https://doi.org/10.1007/978-3-642-61667-9
2. Börger, E., Grädel, E., Gurevich, Y.: The Classical Decision Problem
3. Shen, A., Vereshchagin, N.K.: Computable Functions. American Mathematical Society (2002)
Addition-Based Algorithm to Overcome Cover Problem During Anonymization of Transactional Data
Apo Chimène Monsan1(B), Joël Christian Adepo2, Edié Camille N'zi1, and Bi Tra Goore1
1 Institut National Polytechnique Félix Houphouët-Boigny, BP 1093, Yamoussoukro, Ivory Coast
[email protected]
2 Université Virtuelle de Côte d'Ivoire, BP 536, Abidjan, Ivory Coast
Abstract. Transactional data (such as diagnostic codes, customer shopping lists) are shared or published on the Internet for use in many applications. However, before sharing, it is protected by anonymization techniques such as disassociation. Disassociation makes data confidential without suppressing or altering it. However, it has been found to have a cover problem in disassociated data, which weakens its level of privacy. To overcome these shortcomings, we propose an algorithm based essentially on the addition of items. The performance evaluation results show that our algorithm completely suppresses the cover problem without significant information loss.
Keywords: Transactional data · Disassociation · Cover problem · Anonymization
1 Introduction
Transactional data are data in which records are represented as sets of items [1,2]. These sets of items can be customer shopping lists, query logs, diagnostic codes, etc. In the context of openness and data sharing, transactional data are made available to the public through the internet, where they are widely accessed and exploited for research purposes and for use in many applications such as prediction systems [3] and data mining [4]. However, a malicious user may attempt to re-identify individuals in the dataset. This is likely to reveal sensitive or confidential information about individuals. This information may concern pathologies, preferences, or sexual orientation. Hence, the re-identification of individuals' information could cause harm to some individuals. To avoid this situation, various methods based on anonymization [5–7] have been proposed in the literature. Anonymization is different from cryptography and access control [8], which respectively rely on a decryption key or a password without which it is impossible to access the data. These models provide limited access to information, while
anonymization provides protected data that are open to the public. Therefore, anonymization is more efficient in the context of open and published data. The anonymization process removes directly identifying data (e.g., ID number, surname, first names) and modifies other attributes (e.g., date of birth, gender, address) using its techniques. Disassociation [9–11] is one of the most commonly used techniques for anonymization. This technique separates records into several groups called clusters. Then, it vertically subdivides each group of records (see Fig. 1a) into small groups or record chunks that are k^m-anonymous [12]. k^m-anonymity, or the k^m-anonymity constraint, is a privacy model that ensures that an attacker who knows up to m items cannot associate them with fewer than k records. Disassociation renders the data confidential without suppressing or altering it and retains it in its original form. As a result, it provides more useful and effective data for future analyses. However, in [13], the authors have shown that disassociation has a defect that makes it vulnerable. This defect is caused by the cover problem. Consider a dataset in which each record is an itemset, and the same itemset can be repeated. The cover problem occurs if an item is attached only to the same itemsets. Therefore, if a cover problem occurs in the disassociated data, then it is possible for an adversary with some background knowledge to link a record to its owner with certainty. To solve this problem, some authors have proposed solutions such as partial suppression [14], suppression/addition [15] and generalization [15]. The first two solutions fail to eliminate all instances of the cover problem while keeping the entire dataset k^m-anonymous. The generalization solution eliminates the cover problem, but with a large amount of information loss. Therefore, what solution can solve all instances of the cover problem while keeping the dataset k^m-anonymous and causing less information loss?
Fig. 1. Example of transactional data
This paper aims to solve the cover problem in order to improve the privacy level of disassociated data while keeping the disassociated data k m -anonymous with less information loss. To this end, we propose the following contributions: – An algorithm called ADCOV (ADdition of Covered items), which is based on adding items to solve the cover problem.
– A study of the efficiency of this algorithm in solving the cover problem and a proof of the guarantee of the k m -anonymity model after the application of ADCOV. The rest of the paper is structured as follows: Sect. 2 presents succinctly related works. Section 3 presents a formulation through a detailed explanation of the cover problem. Section 4 concerns the development and proof of the effectiveness of our algorithm with some illustrations. In Sect. 5, the evaluations and results of our study are presented. Section 6 gives the conclusion of our work.
2 Related Works
Several studies on data protection use anonymization techniques to protect the data. These are generalization [16,17], suppression [1,18], suppression/addition [6] or bucketization [19]. These techniques are associated with anonymization models such as k-anonymity [20], k^m-anonymity and many other models. k-anonymity consists of forming classes of k records (k ≥ 2). In a class, each record is identical to the other k − 1 records. This is a widely used model to protect relational data [21]. However, some authors, such as Tsai et al. [22], Motwani and Nabar [5] and others, have extended k-anonymity to the context of transactional data. Nevertheless, k-anonymity has been criticized for suffering from a dimensionality problem. The dimensionality problem means that k-anonymity is not suitable for data (or datasets) with high dimension. A dataset with high dimension is a dataset with many attributes. Transactional data are high-dimensional data [23], which causes huge information loss in the k-anonymization process. To minimize this loss, the authors use other models such as k^m-anonymity, which are better suited to transactional data. k^m-anonymity was introduced by Terrovitis in [12]. k^m-anonymity states that if an attacker knows up to m items of a specific record, he cannot link them to fewer than k records. Since then, several authors, including Terrovitis himself, have built on this model to propose anonymization approaches. Other algorithms such as local and global recoding [7], COAT (Constraint-Based Anonymization of Transactions) [24] and PCTA (Privacy-Constrained Clustering-based Transaction data Anonymization) [16] have been proposed. However, these algorithms are based on generalization and suppression, which alter the data. To overcome these shortcomings, the authors proposed disassociation [5–7], which is an approach similar to bucketization [19]. Bucketization is a technique used to anonymize relational data. This technique consists of partitioning the data horizontally and vertically to form groups of data. In the same way, in the disassociation approach, records are sorted according to the most frequent items and then grouped horizontally into clusters. Then, each cluster is partitioned vertically to form k^m-anonymous record chunks. This is a technique that preserves the utility of the data by keeping the original values intact. However, it was found vulnerable in the work of Barakat et al. [13], through the cover problem. The authors in [14] attempted to solve it using an approach called safe disassociation. This work consists of removing some items from the record chunk concerned by the cover
problem. This is an effective approach, but it has some preconditions for application. These preconditions lead to a partial resolution of the cover problem: in effect, some record chunks concerned by the cover problem are not treated because they do not respect these preconditions. Puri et al. recently proposed two algorithms, Improvement in Disassociation by Generalizing (IDGC) [15] and Improvement in Disassociation using Suppression and Addition (IDSA) [15], for solving the cover problem. IDGC is a generalization approach that applies only to categorical data and causes considerable information loss. As for IDSA, it does not completely remove the cover problem. In summary, different approaches have been proposed to solve the cover problem, but these approaches all have their limitations.
3 Formulation of the Problem
This section defines the cover problem and shows its impact on data privacy. We also show how a privacy breach could occur in a disassociated dataset owing to the cover problem.
3.1 Data and Notations
Let T be the original dataset, which is structured as a collection of records. Each record is an itemset associated with a specific individual. For example, consider dataset T in Fig. 1a, which represents the log of purchases made in a supermarket at a given time.

Table 1. Notations
D: Set of items {x1, x2, ..., xn}
T: Original dataset, a dataset containing individuals' related records
T*: Disassociated dataset, a table anonymized using the disassociation technique
T^iv: Dataset reconstructed by inverse transformation of T*
Ri: The i-th cluster in a disassociated dataset, formed by the horizontal partitioning of T
RiCj: The j-th record chunk of the i-th cluster in a disassociated dataset
Ii,j: The i-th itemset of the j-th record chunk
s(I, RiCj): The support of I in RiCj, which is the number of subrecords in RiCj that contain I
δ: The maximum number of records allowed in a cluster
Each record is an itemset purchased by a specific customer. In the context of anonymization through disassociation, this dataset can be divided into several
clusters R1, R2, ..., Rn. However, in Fig. 1a there is only one cluster, containing records T1 to T5. In each cluster Ri, records are partitioned into smaller subrecords grouped into record chunks Cj. For more precision, the record chunks are denoted by RiCj. The set of distinct record chunks constitutes the disassociated dataset T*. Thus, a record chunk contains itemsets or subrecords I. The number of subrecords containing itemset I in the record chunk RiCj is called the support and is denoted by s(I, RiCj). Table 1 presents some useful notations.
3.2 Description of the Cover Problem
The particularity of the disassociation technique is that the data are not suppressed, altered, or generalized, and in the disassociated data all record chunks are k^m-anonymous [12]. This k^m-anonymity constraint is fundamental in the disassociation technique: respecting this constraint makes it possible to say that the data are safe. Formally, k^m-anonymity is defined as follows:

Definition 1 (k^m-anonymity). Let T be a dataset. T is k^m-anonymous if ∀ I ⊆ D such that |I| ≤ m, s(I, T) ≥ k, with k ≥ 2 and m ≥ 2.

Note that the support of I in T, denoted by s(I, T), is equal to the number of records containing I in dataset T. For example, the dataset in Fig. 1b is 3^2-anonymous because all combinations of m = 2 items exist in at least k = 3 subrecords in both record chunks R1C1 and R1C2. As previously mentioned, respecting k^m-anonymity keeps the disassociated data confidential and preserves the privacy of individuals. Indeed, k^m-anonymity prevents a malicious user from identifying an individual in the dataset with certainty. However, when the data is disassociated, it is possible for an attacker to try to recover the original dataset by a reverse transformation, which means that the data is no longer disassociated. However, in [9], the authors guarantee that, among the datasets reconstructed by the inverse transformation, there is at least one dataset that is k^m-anonymous.

Definition 2 (The guarantee of disassociation). Let Iv be the set of inverse transformations of the disassociated dataset T*. The guarantee of disassociation certifies that for any itemset I ⊆ D such that |I| ≤ m, there exists T^iv ∈ Iv with s(I, T^iv) ≥ k.

Therefore, even when the data is reversed, it is still protected by the k^m-anonymity constraint. Unfortunately, with the cover problem, this guarantee no longer holds. This is the case shown in Fig. 2a, which presents a disassociated transactional dataset subject to the cover problem. In the rest of the document, some items from Fig. 1a are reused, with a = alcohol, b = bread, c = milk and f = fruit. In principle, since the dataset in Fig. 2 is 2^2-anonymous, all item combinations must exist in at least k = 2 records, such as {a, b}, {a, f} and {b, c}. However, it can be noticed that in all inverted datasets T^iv in Fig. 2b, the support s({c, f}, T^iv) = 1 and not 2, unlike the other item pairs. This is due to the cover problem.
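As an illustration of Definition 1, the sketch below checks the k^m-anonymity of a record chunk represented as a list of item sets. This is our own toy code, not the authors', and it restricts the check to items that actually occur in the chunk; the example chunk is likewise an assumption and not the one in Fig. 1b.

from itertools import combinations

def support(itemset, chunk):
    # number of subrecords of the chunk that contain the given itemset
    return sum(1 for record in chunk if set(itemset) <= set(record))

def is_km_anonymous(chunk, k=2, m=2):
    # Definition 1, restricted to items that occur in the chunk
    items = sorted({x for record in chunk for x in record})
    for size in range(1, m + 1):
        for itemset in combinations(items, size):
            if support(itemset, chunk) < k:
                return False
    return True

chunk = [{'a', 'b', 'c'}, {'a', 'b', 'c'}, {'a', 'b', 'c'}, {'a', 'b'}, {'a', 'b'}]
print(is_km_anonymous(chunk, k=3, m=2))   # True: every 1- and 2-itemset appears at least 3 times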
Fig. 2. Disassociated dataset T ∗ with cover problem
Definition 3 (Cover Problem). Let I_{i,j-1} (j ≥ 2) be the i-th itemset of the (j-1)-th record chunk, having a support greater than or equal to the support of an item x_{i,j} ∈ R_iC_j. A cover problem exists if there is an item y_{i,j-1} ∈ I_{i,j-1} such that (i) the support of I_{i,j-1} is equal to the support of the singleton y_{i,j-1} in R_iC_{j-1}. More formally, (i) is expressed by (1):

\forall I_{i,j-1} \in R_iC_{j-1},\ \exists y_{i,j-1} \in I_{i,j-1}:\ s(y_{i,j-1}, R_iC_{j-1}) = s(I_{i,j-1}, R_iC_{j-1}) \quad (1)
where yi,j−1 is a covered item and Ii,j−1 is the covered itemset. The covered itemset is the itemset that contains at least one covered item. However, the covered itemset may contain several covered items and, some uncovered items. Illustration: Consider the example in Fig. 2. I1,1 = {a, b, c} is an itemset belonging to R1 C1 . c ∈ {a, b, c} and s({a, b, c}, R1 C1 ) = 3; s(c, R1 C1 ) = 3. Therefore, s({a, b, c}, R1 C1 ) = s(c, R1 C1 ). In conclusion, as shown in Fig. 2a, the record chunk R1 C1 is subject to the cover problem, where c is a covered item; a and b are the uncovered items and {a, b, c} is the covered itemset. 3.3
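One way to operationalise Definition 3 in code is sketched below: following the intuition of Sect. 1 that a covered item is attached only to one distinct itemset, it flags items whose support equals the support of the single itemset containing them. The toy chunk mirrors the example above (three copies of {a, b, c}, with a and b also appearing elsewhere) and is an assumption, since the exact contents of Fig. 2a are not reproduced in the text.

def covered_items(chunk):
    # our own sketch, not the authors' code
    items = {x for record in chunk for x in record}
    covered = {}
    for item in items:
        records_with_item = {frozenset(r) for r in chunk if item in r}
        if len(records_with_item) == 1:
            itemset = set(next(iter(records_with_item)))
            if len(itemset) > 1:
                covered[item] = itemset   # here s(item) = s(itemset), cf. Eq. (1)
    return covered

chunk = [{'a', 'b', 'c'}, {'a', 'b', 'c'}, {'a', 'b', 'c'}, {'a', 'b'}, {'a'}]
print(covered_items(chunk))   # {'c': {'a', 'b', 'c'}}: c is covered, a and b are not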
Privacy Breach
Lemma 1. Let T ∗ be a disassociated dataset subject to a cover problem with a covered item yi,j−1 ∈ Ri Cj−1 . Let’s assume that there exists an item xi,j belonging to the adjacent record chunk Ri Cj such that s(yi,j−1 , Ri Cj−1 ) ≥ s(xi,j , Ri Cj ). Then, xi,j binds to the covered item yi,j−1 and they appear together less than k times in all datasets reconstructed by inverse transformations [13]. In addition, the pair (yi,j−1 , xi,j ) is linked to the same uncovered items. Thus, there is a possibility of privacy breach.
To show how a cover problem from the disassociated dataset can lead to a privacy breach, consider Fig. 2. The covered item c belongs to R1 C1 and x1,2 = f , another item belonging to R1 C2 . We notice that the pair {c, f } appears less than k (here k = 2) times in all datasets reconstructed by inverse transformations sets. Thus, if an adversary has information about an individual who has bought items c and f then, he will be able to link each record containing c and f to less than k (here k = 2) records in the inverted datasets. Worse is that with the cover problem, the itemsets containing c and f are all the same. This itemset is {a, b}. Therefore, it becomes simple for an attacker to find the rest of the linked items, which are {a, b} and completely de-anonymize the individual.
4 Contributions: Safe Disassociation by Addition Technique
This section describes the approach. First, we present our method with some explanations before demonstrating how the addition technique is sufficient not only for solving the cover problem but also for preserving the k m -anonymity. Then, we present our algorithm along with scenarios that illustrate the algorithm. 4.1
Presentation of the Proposed Method
From (1), it results that the cover problem can be solved if there is a difference between the support of yi,j−1 and the support of Ii,j−1 , i.e.: s(yi,j−1 , Ri Cj ) = s(Ii,j−1 , Ri Cj ). It is generally on this last assertion that the authors in [15,16] rely to solve the cover problem. This is also the basis of our algorithm. To get the supports of yi,j−1 and Ii,j−1 different, we can decide to increase or reduce the support of yi,j−1 or Ii,j−1 . Authors often choose to reduce the supports by suppressing certain items in the record chunks [13,15]. However, the difficulty that arises with these suppression techniques is the respect of k m -anonymity. The cover problem may no longer exist after suppression. However, it may change the initial layout of the data so that it is no longer k m -anonymous. Thus, the data is still vulnerable to privacy breaches. In order to preserve the k m -anonymity, some works have needed to associate the addition technique [15,16] with suppression, while others have used preconditions. Note that, the methods proposed in these works are able to keep all record chunks k m -anonymous. However, there are some record chunks whose cover problem cannot be resolved. In our case, we essentially use the addition technique to solve the cover problem and do not encounter any difficulties with the preservation of k m -anonymity. In this approach, we only add the covered items to one or two available subrecords. An available subrecord It,j is an itemset in a record chunk that is different from the covered itemset and on which the addition of a covered item yi,j−1 does not form a covered itemset. Note that It,j is the t-th itemset of the j-th record chunk. Therefore, we can have several of these available subrecords in a record chunk or none at all. In some cases where an available subrecord It,j does not exist, the covered item may be added directly to the record chunk as an additional record.
Finally, our algorithm denoted by ADCOV guarantees that the record chunks obtained after the application of the addition technique are k m -anonymous and no longer contain a cover problem. We justify this result in the following subsection. For more concision, in the rest of the document, yc is the set of covered items and Ii,j−1 is replaced by Ic with yc ⊂ Ic . 4.2
Proof of the Elimination of the Cover Problem by Addition Technique
We recall that, in an initial record chunk, the existence of the cover problem is characterized by the equality between the support of the covered itemset and the support of y_c. Formally:

\forall I_c \in R_iC_j,\ \exists y_c \subset I_c:\ s(I_c, R_iC_j) = s(y_c, R_iC_j) \quad (2)

Here, we prove that after the addition of y_c to the initial record chunk, the support of the covered itemset I_c and the support of the set of covered items y_c become different in the obtained record chunk, i.e. \forall I_c \in R'_iC_j, \forall y_c \subset I_c,\ s(I_c, R'_iC_j) \neq s(y_c, R'_iC_j), where R_iC_j is the initial record chunk of the disassociated data and R'_iC_j is the record chunk obtained after the addition of the covered items. Let s(y, R_iC_j) be the support of y in the record chunk R_iC_j; it is equal to the number of subrecords of R_iC_j that contain y. Equation (3) gives the expression of s(y, R_iC_j):

s(y, R_iC_j) = \sum_{l=1}^{|R_iC_j|} \alpha_y(I_{l,j}) \quad (3)

where

\alpha_y(I_{l,j}) = \begin{cases} 1 & \text{if the itemset } I_{l,j} \text{ contains the itemset } y, \\ 0 & \text{otherwise.} \end{cases} \quad (4)

According to Subsect. 4.1, to solve the cover problem it is necessary to add the covered items to the available itemsets. These available itemsets may or may not exist.

Case 1. Assume that the available itemset I_{t,j} exists. The itemset I_{t,j} to which the set of covered items y_c is added is chosen such that

I_{t,j} \in R_iC_j,\quad I_{t,j} \neq I_c \quad \text{and} \quad I_{t,j} \cup y_c \neq I_c \quad (5)

The support of I_c is unchanged after the addition of y_c. According to (3),

s(I_c, R_iC_j) = \sum_{l=1}^{|R_iC_j|} \alpha_{I_c}(I_{l,j}),\quad I_{l,j} \in R_iC_j \quad (6)

With the addition of y_c, the support over the whole record chunk becomes

s(I_c, R'_iC_j) = \sum_{l=1}^{|R_iC_j|} \alpha_{I_c}(I_{l,j}) + \alpha_{I_c}(I'_{t,j}),\quad I'_{t,j} = I_{t,j} \cup y_c \quad (7)

According to (4), \alpha_{I_c}(I'_{t,j}) = 0, so

s(I_c, R'_iC_j) = \sum_{l=1}^{|R_iC_j|} \alpha_{I_c}(I_{l,j}) + 0 \quad (8)

s(I_c, R'_iC_j) = s(I_c, R_iC_j) \quad (9)

The support of y_c changes after the addition of y_c:

s(y_c, R_iC_j) = \sum_{l=1}^{|R_iC_j|} \alpha_{y_c}(I_{l,j}) \quad (10)

After the addition of y_c:

s(y_c, R'_iC_j) = \sum_{l=1}^{|R_iC_j|} \alpha_{y_c}(I_{l,j}) + \alpha_{y_c}(I'_{t,j}) \quad (11)

From (4),

s(y_c, R'_iC_j) = \sum_{l=1}^{|R_iC_j|} \alpha_{y_c}(I_{l,j}) + 1 \quad (12)

s(y_c, R'_iC_j) \neq s(y_c, R_iC_j) \quad (13)

From (2), (9) and (13), we deduce for the first case:

s(I_c, R'_iC_j) \neq s(y_c, R'_iC_j) \quad (14)

Case 2. Assume that the available itemset I_{t,j} does not exist, i.e. I_{t,j} = \emptyset. Let I_{p,j} = I_{t,j} \cup y_c = y_c; I_{p,j} is an additional record with p > |R_iC_j|. The support of I_c is unchanged after the addition of y_c:

s(I_c, R'_iC_j) = \sum_{l=1}^{|R_iC_j|} \alpha_{I_c}(I_{l,j}) + \alpha_{I_c}(I_{p,j}),\quad I_{p,j} \in R'_iC_j \quad (15)

According to (4), \alpha_{I_c}(I_{p,j}) = 0. Therefore:

s(I_c, R'_iC_j) = s(I_c, R_iC_j) \quad (16)

The support of y_c changes after the addition of y_c:

s(y_c, R'_iC_j) = \sum_{l=1}^{|R_iC_j|} \alpha_{y_c}(I_{l,j}) + \alpha_{y_c}(I_{p,j}),\quad \alpha_{y_c}(I_{p,j}) = 1 \quad (17)

s(y_c, R'_iC_j) \neq s(y_c, R_iC_j) \quad (18)

From (2), (16) and (17), we deduce:

s(I_c, R'_iC_j) \neq s(y_c, R'_iC_j) \quad (19)
In conclusion, (14) and (19) show that the cover problem is eliminated with or without an available itemset It,j , through our addition technique. 4.3
Proof of k^m-Anonymity Guarantee After Addition
According to Definition 1, for all y ∈ R_iC_j such that |y| ≤ m (m ≥ 2), R_iC_j being k^m-anonymous means:

s(y, R_iC_j) \ge k \quad (20)

Equation (20) can be expressed as follows:

\sum_{l=1}^{|R_iC_j|} \alpha_y(I_{l,j}) \ge k \quad (21)

\sum_{l=1}^{|R_iC_j|} \alpha_y(I_{l,j}) + \alpha_y(I'_{t,j}) \ge k \quad (22)

s(y, R'_iC_j) \ge k \quad (23)
From (23), it results that Ri Cj is k -anonymous. Thus our addition method preserves the k m -anonymity. 4.4
Proposed Algorithm with Description and Correctness
The algorithm ADCOV (ADdition of COVered items) solves the cover problem. It takes as input a disassociated dataset affected by a cover problem and outputs another disassociated dataset without the cover problem, in which all record chunks are k^m-anonymous. To do this, ADCOV first scans the dataset cluster by cluster and record chunk by record chunk. If a record chunk is affected by a cover problem, the set of covered items y_c and the covered itemset I_c are identified (lines 1 to 8). Then, to solve the cover problem, the covered items are either added directly to an available subrecord or split into two sets that are added separately (lines 9-10 and 14-19). However, if the covered items cannot be added to any available subrecord, additional subrecords are created that contain only the covered items (lines 11-13 and 20-22). Therefore, for any disassociated dataset, ADCOV returns a disassociated dataset without cover problems while preserving k^m-anonymity. In summary, ADCOV is designed to produce a disassociated dataset without the cover problem while preserving the k^m-anonymity constraint. A disassociated dataset is likely to contain a cover problem when there is a record chunk in which the supports of the covered itemset I_c and of the set of covered items y_c are equal. ADCOV breaks this relationship between the covered itemset and the covered items by making their supports different in all record chunks while retaining k^m-anonymity (see Subsect. 4.2).
Algorithm 1. ADCOV
Input: disassociated dataset T*
Output: disassociated dataset without the cover problem
1:  for each R_i ∈ T* do
2:    for each R_iC_j ∈ R_i do
3:      I_c ← ∅, y_c ← ∅
4:      for each I_{i,j} ∈ R_iC_j do
5:        if ∃ z ∈ I_{i,j} such that s(z, R_iC_j) = s(I_{i,j}, R_iC_j) then
6:          I_c ← I_{i,j}, and y_c receives the set of items of I_c respecting Eq. (1)
7:        end if
8:      end for
9:      if I_{t,j} ∈ R_iC_j is found such that I_{t,j} ≠ I_c and (I_{t,j} ∪ y_c ≠ I_c) then
10:       Add y_c to I_{t,j}
11:     else if (I_{t,j} ∪ y_c = I_c) or (I_{t,j} = I_c) then
12:       if |y_c| = 1 then
13:         Add y_c to the record chunk as an additional subrecord
14:       else if |y_c| ≥ 2 then
15:         Split y_c into two sets y_c1 and y_c2 and check to find I_{t,j}
16:         if s(I_{t,j}, R_iC_j) ≥ 2 then
17:           Add y_c1 and y_c2 to two itemsets
18:         else if s(I_{t,j}, R_iC_j) = 1 then
19:           Add y_c1 to I_{t,j} and y_c2 as an additional subrecord
20:         else if s(I_{t,j}, R_iC_j) = 0 then
21:           Add y_c1 and y_c2 as additional subrecords
22:         end if
23:       end if
24:     end if
25:   end for
26: end for
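For readers who prefer an executable form, the following Python sketch is our simplified rendering of the per-record-chunk repair step of Algorithm 1. It is not the authors' implementation: the helper names support, find_available_subrecord and repair_record_chunk are ours, and lines 16-21 of the algorithm are collapsed into the simplest branch, which adds both halves of y_c as additional subrecords.

def support(itemset, chunk):
    target = set(itemset)
    return sum(1 for sub in chunk if target <= set(sub))

def find_available_subrecord(chunk, I_c, y_c):
    # Index of a subrecord I_t satisfying Eq. (5), or None (lines 9 and 11).
    for t, sub in enumerate(chunk):
        if set(sub) != set(I_c) and (set(sub) | set(y_c)) != set(I_c):
            return t
    return None

def repair_record_chunk(chunk, I_c, y_c):
    # Eliminate one cover problem (I_c, y_c) in a record chunk (lines 9-24).
    chunk = [set(sub) for sub in chunk]
    y_c = set(y_c)
    t = find_available_subrecord(chunk, I_c, y_c)
    if t is not None:                    # line 10: add y_c to the available I_t
        chunk[t] |= y_c
    elif len(y_c) == 1:                  # lines 12-13: one new subrecord
        chunk.append(set(y_c))
    else:                                # lines 14-21, simplified: split y_c
        items = sorted(y_c)              # and append both halves
        half = len(items) // 2
        chunk.append(set(items[:half]))
        chunk.append(set(items[half:]))
    return chunk

chunk = [{"a", "b", "c"}, {"a", "b", "c"}, {"a"}, {"a", "b"}]
repaired = repair_record_chunk(chunk, I_c={"a", "b", "c"}, y_c={"c"})
# s({a, b, c}) stays at 2 while s({c}) rises to 3, so Eq. (2) no longer holds.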
4.5 Illustration of ADCOV
Consider the dataset shown in Fig. 2a, which is affected by the cover problem. We apply our algorithm to this dataset. ADCOV identifies I_c = {a, b, c} as the covered itemset and c as the only covered item; in other words, the set of covered items is y_c = {c}. The available itemset {a} is different from the covered itemset {a, b, c}, and the union of the itemset {a} and the set of covered items {c} is also different from the covered itemset {a, b, c}. Thus, ADCOV replaces the itemset {a} with the itemset {a} ∪ y_c = {a, c} in Fig. 3a. According to the guarantee of disassociation (see Definition 2), the disassociated dataset in Fig. 3b is safe because there are inverse datasets T1^iv, T2^iv, T4^iv in which the supports of {c, f} are greater than or equal to k (k = 2). Moreover, the pair {c, f} is related to {a} and {a, b}, not to a single itemset.
5 Performance Evaluation
Fig. 3. Elimination of the cover problem with one covered item

The aims of the experiments are to:
– Evaluate the privacy breach in the disassociated dataset through the rate of elimination of the cover problem;
– Evaluate the information loss, in terms of the number of modified subrecords and the association error, when applying the different techniques;
– Study the execution-time performance of the algorithms.

In addition, the performance of our algorithm (ADCOV) is compared with two other algorithms, IDSA and IDGC, which are, to the best of our knowledge, the most recent and most efficient algorithms.

5.1 Evaluation Metrics
Metric of Privacy Breach (RECP): A potential privacy breach occurs when a record chunk in the disassociated dataset is vulnerable, which means that the disassociated dataset is affected by the cover problem. Thus, the number of privacy breaches detected in a disassociated dataset is equal to the number of vulnerable record chunks. We use the Rate of Elimination of the Cover Problem (RECP) as the privacy breach metric.
– VRC(T_f*) is the number of vulnerable record chunks, i.e., the number of privacy breaches detected in the final disassociated dataset T_f* after applying a technique;
– RC(T_i*) is the total number of record chunks in the initial disassociated dataset T_i*.
Equation (24) gives the expression of RECP:

RECP = [(RC(T_i*) − VRC(T_f*)) / RC(T_i*)] × 100    (24)
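Equation (24) translates directly into code; the helper below is an illustrative sketch with names of our choosing.

def recp(total_record_chunks, vulnerable_after):
    # Rate of Elimination of the Cover Problem, Eq. (24).
    return (total_record_chunks - vulnerable_after) / total_record_chunks * 100

# Example: 200 record chunks initially, none still vulnerable afterwards.
print(recp(200, 0))   # -> 100.0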
Metric of Information Loss Relative to the Fraction of Modified Itemsets (FMI): The FMI evaluates the proportion of records modified by a technique to eliminate the cover problem. Equation (25) gives the expression of FMI:

FMI = (I_m / I) × 100    (25)

where I_m is the number of modified records and I is the total number of records.

Metric of Information Loss Relative to the Relative Association Error (RAE, or RE): The RAE assesses the loss of association between items. In other words, it shows whether two items that were associated in the original dataset remain associated after the application of the proposed approach. Equation (26) gives the expression of RAE:

RAE = |s({x, y}, T_i*) − s({x, y}, T_f*)| / AVG(s({x, y}, T_i*), s({x, y}, T_f*))    (26)

where AVG(s({x, y}, T_i*), s({x, y}, T_f*)) is the average of the supports of the pair {x, y} in the initial disassociated dataset T_i* and the final disassociated dataset T_f*.
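Similarly, Eqs. (25) and (26) can be computed as follows (an illustrative sketch; in practice the per-pair RAE would be aggregated over all item pairs of interest, and the average in the denominator is assumed to be non-zero).

def fmi(modified_records, total_records):
    # Fraction of Modified Itemsets, Eq. (25).
    return modified_records / total_records * 100

def rae(support_initial, support_final):
    # Relative Association Error for one pair {x, y}, Eq. (26);
    # assumes the pair occurs in at least one of the two datasets.
    avg = (support_initial + support_final) / 2
    return abs(support_initial - support_final) / avg

print(fmi(100, 59601))   # e.g. 100 modified records out of 59,601
print(rae(12, 10))       # pair supported 12 times before, 10 after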
5.2 Results Analysis
We conducted our experiments on the BMS-WebView-1¹ and BMS-WebView-2² datasets, whose characteristics are listed in Table 2.

Table 2. Characteristics of the datasets

Datasets          Total number of records   Number of distinct items   Average items per record   Domain
BMS-WebView-1     59601                     497                        2.42                       Clickstream data from e-commerce
BMS-WebView-2     77512                     3340                       4.62                       Clickstream data from e-commerce
These are clickstream e-commerce data, commonly used in transactional data anonymization approaches. In our case, each dataset was first processed by the disassociation algorithms horpat and verpat [9] in order to partition it into clusters R_i (groups of records) and record chunks R_iC_j. This made the data disassociated, and we used the resulting datasets as input for our algorithm (ADCOV) and for the two other algorithms, IDSA and IDGC [15].

¹ http://www.philippe-fournier-viger.com/spmf/datasets/BMS1 spmf.
² http://www.philippe-fournier-viger.com/spmf/datasets/BMS2.txt.
The different partitioning parameters k and m are the privacy parameters of k^m-anonymity (k ≥ 2, m ≥ 2), and δ is the maximum cluster size. The value of m was set to 2 as in [14,15], while δ and k were varied as shown in the different graphs. Finally, all methods were implemented in Python and the experiments were performed on a 2.3 GHz Intel Core i5 processor with 8 GB of RAM.

Results of the Privacy Breach Evaluation: The Rate of Elimination of the Cover Problem (RECP) is used as the privacy breach metric, and the RECP trends of the different algorithms are shown in Fig. 4. To obtain Fig. 4, for each dataset (BMS-WebView-1 and BMS-WebView-2) we vary the privacy parameter k from 2 to 5 with maximum cluster sizes of 20 and 50: Fig. 4a shows the privacy breach levels for δ = 20 and Fig. 4b for δ = 50. From these figures, it can be seen that, for each dataset and regardless of the cluster size δ and the value of k, no vulnerable record chunks remain after applying ADCOV (our method) or IDGC. For ADCOV, this result is explained by the fact that ADCOV handles all instances of the problem, whereas IDSA, for example, does not. IDSA removes on average 82.88% of the record chunks affected by the defect on BMS-WebView-1 and 73.72% on BMS-WebView-2, while ADCOV and IDGC remove 100% of the affected record chunks on both datasets. In summary, IDSA only partially solves the cover problem, unlike IDGC and ADCOV.

Results of Relative Association Error (RAE): As defined above, the Relative Association Error (RAE) shows whether two items that were associated in the basic disassociated dataset remain associated after the application of an algorithm. Specifically, RAE measures how many item associations each technique adds or loses with respect to the basic disassociation. For each dataset, BMS-WebView-1 and BMS-WebView-2, we vary the maximum cluster size from 10 to 50 and compute, for each value, the RAE of the disassociated datasets produced by IDGC, IDSA and ADCOV. Figure 5 shows that IDGC causes the largest RAE-related information loss on both the BMS-WebView-1 and BMS-WebView-2 datasets. This level of RAE is due to the fact that, in the generalization process of IDGC, each covered item is replaced by a relatively large group of items, whereas in the ADCOV technique only the covered items are added: if there is a single covered item in a record chunk, only that item is added to the record chunk. The RAE of IDSA is lower because its technique for solving the cover problem both deletes and adds items; the additions and deletions balance out, so the difference in items between the basic disassociation and IDSA is smaller.
Fig. 4. Rate of elimination of the cover problem
Results of Fraction of Modified Itemsets (FMI): The Fraction of Modified Itemsets (FMI) is computed by varying the privacy parameter k from 2 to 5, for cluster sizes 20 and 50, on the BMS-WebView-1 and BMS-WebView-2 datasets. Figure 6a shows the proportion of modified subrecords for the maximum cluster size δ = 20 and Fig. 6b for δ = 50. As observed in Fig. 6, on BMS-WebView-1 and BMS-WebView-2, IDSA modifies 5.03% and 3.5% of the records respectively, and IDGC modifies 28.82% and 13.67% respectively. ADCOV, in contrast, modifies very few records, 0.17% for BMS-WebView-1 and 0.35% for BMS-WebView-2, while still eliminating all detected privacy breaches. Our algorithm modifies few subrecords because it solves the cover problem not only by adding the covered items to existing subrecords, but also by adding them to the record chunk as additional subrecords. Thus, to eliminate a cover problem in a record chunk, the original records are not necessarily altered.
Fig. 5. Relative association errors according to cluster sizes
Fig. 6. Fraction of modified records (FMI) according to privacy parameter k
Runtime Performance: The runtime evaluation consists in studying the execution time of the different approaches while varying the size of the dataset. Different record set sizes were considered, from 10K to 50K, with K = 1000 records. As observed in Fig. 7, the best execution performance is obtained by our algorithm, ADCOV.
Fig. 7. Runtime representation
6 Conclusion
The cover problem is an anonymization flaw that may cause a privacy breach. To better secure the data, we have proposed in this paper an approach to eliminate the cover problem. Our approach consists of reinforcing the support of the covered items, which are at the origin of the cover problem. These covered items are added to other subrecords in order to increase their support and to diversify the subrecords that contain them. The particularity of our addition-based approach is that it causes little information loss and never leads to a breach of k^m-anonymity. It is also suitable for any type of cover problem and can be applied to all types of data. Our approach has been evaluated through different metrics of information loss, privacy breach and execution performance, and compared with previous approaches. The results show that our approach (ADCOV) is better in terms of information loss and runtime, and that ADCOV completely removes the cover problem. Although we achieve good performance in terms of information loss, we notice that for larger datasets (BMS-WebView-2) our values increase. In future work, we plan to reduce information loss in the context of large data.
References

1. Xu, Y., Wang, K., Fu, A.W.C., Yu, P.S.: Anonymizing transaction databases for publication. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 2008. ACM Press (2008). https://doi.org/10.1145/1401890.1401982
2. Arava, K., Lingamgunta, S.: Adaptive k-anonymity approach for privacy preserving in cloud. Arab. J. Sci. Eng. 45(4), 2425–2432 (2019). https://doi.org/10.1007/s13369-019-03999-0
3. Adar, E., Weld, D.S., Bershad, B.N., Gribble, S.S.: Why we search. In: Proceedings of the 16th International Conference on World Wide Web - WWW 2007. ACM Press (2007). https://doi.org/10.1145/1242572.1242595
4. Kim, S.S., Choi, S.H., Lee, S.M., Hong, S.C.: Enhanced catalytic activity of Pt/Al2O3 on the CH4 SCR. J. Ind. Eng. Chem. 18(1), 272–276 (2012). https://doi.org/10.1016/j.jiec.2011.11.041
5. Motwani, R., Nabar, S.U.: Anonymizing unstructured data (2008)
6. Wang, S.-L., Tsai, Y.-C., Kao, H.-Y., Hong, T.-P.: On anonymizing transactions with sensitive items. Appl. Intell. 41(4), 1043–1058 (2014). https://doi.org/10.1007/s10489-014-0554-9
7. Terrovitis, M., Mamoulis, N., Kalnis, P.: Local and global recoding methods for anonymizing set-valued data. VLDB J. 20(1), 83–106 (2010). https://doi.org/10.1007/s00778-010-0192-8
8. Gai, K., Qiu, M., Zhao, H.: Privacy-preserving data encryption strategy for big data in mobile cloud computing. IEEE Trans. Big Data 7(4), 678–688 (2017). https://doi.org/10.1109/tbdata.2017.2705807
9. Terrovitis, M., Mamoulis, N., Liagouris, J., Skiadopoulos, S.: Privacy preservation by disassociation. Proc. VLDB Endow. 5(10), 944–955 (2012). https://doi.org/10.14778/2336664.2336668
10. Bewong, M., Liu, J., Liu, L., Li, J.: Utility aware clustering for publishing transactional data. In: Kim, J., Shim, K., Cao, L., Lee, J.-G., Lin, X., Moon, Y.-S. (eds.) PAKDD 2017. LNCS (LNAI), vol. 10235, pp. 481–494. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57529-2_38
11. Loukides, G., Liagouris, J., Gkoulalas-Divanis, A., Terrovitis, M.: Utility-constrained electronic health record data publishing through generalization and disassociation. In: Gkoulalas-Divanis, A., Loukides, G. (eds.) Medical Data Privacy Handbook, pp. 149–177. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23633-9_7
12. Terrovitis, M., Mamoulis, N., Kalnis, P.: Privacy-preserving anonymization of set-valued data. Proc. VLDB Endow. 1(1), 115–125 (2008). https://doi.org/10.14778/1453856.1453874
13. Barakat, S., Bouna, B.A., Nassar, M., Guyeux, C.: On the evaluation of the privacy breach in disassociated set-valued datasets. In: Proceedings of the 13th International Joint Conference on e-Business and Telecommunications. SCITEPRESS - Science and Technology Publications (2016). https://doi.org/10.5220/0005969403180326
14. Awad, N., Al Bouna, B., Couchot, J.-F., Philippe, L.: Safe disassociation of set-valued datasets. J. Intell. Inf. Syst. 53(3), 547–562 (2019). https://doi.org/10.1007/s10844-019-00568-7
15. Puri, V., Kaur, P., Sachdeva, S.: Effective removal of privacy breaches in disassociated transactional datasets. Arab. J. Sci. Eng. 45(4), 3257–3272 (2020). https://doi.org/10.1007/s13369-020-04353-5
16. Gkoulalas-Divanis, A., Loukides, G.: PCTA. In: Proceedings of the 4th International Workshop on Privacy and Anonymity in the Information Society - PAIS 2011. ACM Press (2011). https://doi.org/10.1145/1971690.1971695
17. Samarati, P.: Protecting respondents' identities in microdata release. IEEE Trans. Knowl. Data Eng. 13(6), 1010–1027 (2001). https://doi.org/10.1109/69.971193
18. Jia, X., Pan, C., Xu, X., Zhu, K.Q., Lo, E.: ρ-uncertainty anonymization by partial suppression. In: Bhowmick, S.S., Dyreson, C.E., Jensen, C.S., Lee, M.L., Muliantara, A., Thalheim, B. (eds.) DASFAA 2014. LNCS, vol. 8422, pp. 188–202. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-05813-9_13
19. Martin, D.J., Kifer, D., Machanavajjhala, A., Gehrke, J., Halpern, J.Y.: Worst-case background knowledge for privacy-preserving data publishing. In: 2007 IEEE 23rd International Conference on Data Engineering. IEEE (2007). https://doi.org/10.1109/icde.2007.367858
20. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 557–570 (2002). https://doi.org/10.1142/s0218488502001648
21. Puri, V., Sachdeva, S., Kaur, P.: Privacy preserving publication of relational and transaction data: survey on the anonymization of patient data. Comput. Sci. Rev. 32, 45–61 (2019). https://doi.org/10.1016/j.cosrev.2019.02.001
22. Tsai, Y.C., Wang, S.L., Ting, I.H., Hong, T.P.: Flexible anonymization of transactions with sensitive items. In: 2018 5th International Conference on Behavioral, Economic, and Socio-Cultural Computing (BESC). IEEE (2018). https://doi.org/10.1109/besc.2018.8697320
23. Ghinita, G., Tao, Y., Kalnis, P.: On the anonymization of sparse high-dimensional data. In: 2008 IEEE 24th International Conference on Data Engineering. IEEE (2008). https://doi.org/10.1109/icde.2008.4497480
24. Loukides, G., Gkoulalas-Divanis, A., Malin, B.: COAT: constraint-based anonymization of transactions. Knowl. Inf. Syst. 28(2), 251–282 (2010). https://doi.org/10.1007/s10115-010-0354-4
Author Index
A Abuomar, O., 620 Adepo, Joël Christian, 896 Aggarwal, Riya, 499 Ahmedy, Ismail, 228 Al Qudah, Islam, 630 Alam, Abu, 178, 512 Alieksieiev, Volodymyr, 138 Almadi, Soloman M., 123 Almiñana, Cesar C., 681 Arai, Kohei, 836 Araujo, Gabriel, 171 Awais, Ch Muhammad, 449 Azevedo, Fábio, 43 Azimi, Hamed, 1 Aznar, Marcos Bautista López, 207 Azuma, Kenta, 836 B Baharum, Aslina, 278 Banville, Frederic, 255 Barron, Harry, 512 Barrozzi, Vitor Vitali, 171 Becerra-Suarez, Fray L., 321 Bekkouch, Imad Eddine Ibrahim, 449 Beko, Marko, 43 Bernik, Andrija, 345 Bonakdari, Hossein, 1, 769 Bordini, Rafael, 808 Brandt, Eric, 25 Brandt, Felix, 25 Browarska, Maria, 483
C Cai, Mingyang, 75 Caligiuri, Luigi Maxmilian, 414 Carlos, Paul Ccuno, 293 Chambi, Pabel Chura, 293 Chen, Daqing, 532 Chen, Ting-Ju, 469 Chen, Weisi, 630 Christou, Nikolaos, 715 Címbora Acosta, Guillermo, 207 Correia, Sérgio D., 43 Couto, Júlia Colleoni, 808 D D’Amato, Giulio, 196 Dai, Linglong, 889 Damasio, Juliana, 808 De Silva, Varuna, 560 Deris, Farhana Diana, 278 di Furia, Marco, 656 Dlamini, Gcinizwe, 639 Dodonova, Evgeniya, 92 du Preez, Johan A., 723 Dubinina, Irina, 92 E Ebtehaj, Isa, 1, 769 Eisenhart, Georg, 695 Ergasheva, Shokhista, 639 F Farooq, Awais, 880 Formisano, Giovanni, 196
916 Fribance, D., 620 Fukuda, Shuichi, 797 G Gadea, Walter Federico, 207 Gajda, Michał J., 146, 171 Gamber, Cayo, 358 Garcés-Báez, Alfonso, 266 Gassmann, Luke, 178 Gharabaghi, Bahram, 1 Gharebaghi, Bahram, 769 Gholami, Azadeh, 769 Glines, Mark, 375 Golovnin, Oleg, 92 Goore, Bi Tra, 896 Guarini, Piergiorgio, 656 Guarnieri, Guido, 196 Gunathilake, M. D. S. S., 332 Gurusinghe, P. M., 332 H Hashem, Ibrahim, 630 Heiden, Bernhard, 138 Huang, Kevin, 469 I Ismail, Habibah, 228 Ismail, Rozita, 278 Ivanovna, Sipovskaya Yana, 762 Ivaschenko, Anton, 92 J Jainari, Mohd. Hisyamuddin, 278 Jamali, Ali, 1 Jawawi, Dayang N. A., 228 Jo, Yeeun, 304 Jovanovska, Elena Mitreska, 670 Jovanovski, Damjan, 670 K Kaš´celan, Ljiljana, 433 Katkov, Sergey, 715 Kejriwal, Mayank, 846 Khan, Adil Mehmood, 449 Khan, Rishi, 375 Kholmatova, Zamira, 639 Khurshid, Muhammad Mahboob, 865 Kim, Ji Eun, 469 Kitajima, Yuzuki, 608 Koulas, Emmanouil, 103 Krone, Joan, 58 Kruglov, Artem, 639 Krzak, K., 620
Author Index L Limone, Pierpaolo, 656 López-López, Aurelio, 266 Luo, Yuesheng, 846 M Mabani, Courage, 715 Mai, T. Tien, 545 Malik, Aadin, 560 Mariano, Angelo, 196 Mat Noor, Noorsidi Aizuddin, 278 Mat Zain, Nurul Hidayah, 278 Matos-Carvalho, João P., 43 Mbwambo, Nicodemus M. J., 58 Mejia-Cabrera, Heber I., 321 Merabtene, Tarek, 630 Migliori, Silvio, 196 Milhomme, Daniel, 255 Monsan, Apo Chimène, 896 Mora, André, 43 Mujica, Pedro, 123 Mullin, Lenore, 375 Myachin, Alexey, 826 N N’zi, Edié Camille, 896 Nakao, Shunta, 608 Namatame, Takashi, 608 Naumoski, Andreja, 670 Nguyen, Dong Quan Ngoc, 590 Nieto-Chaupis, Huber, 248, 394 O Ochoa, José Lipa, 293 Ochoa, Karla Saldaña, 483 Oh, Uran, 304 Otake, Kohei, 608 P Parent, Andree-Anne, 255 Pedro, Dário, 43 Peristeras, Vassilios, 103 Pirgov, Peter, 375 Popovska, Katja, 670 R Rashid, Ammar, 865 Reichelt, Dirk, 25 Rogi´c, Sunˇcica, 433 Ruhland, Johannes, 404 Ruiz, Duncan, 808 S Sadovykh, Andrey, 639 Saheer, Lakshmi Babu, 747
Author Index Saidakhmatov, Mirolim, 581 Salkenov, Aldiyar, 581 Samararathne, W. A. H. K., 332 Santomauro, Giuseppe, 196 Sarsenova, Zhibek, 581 Schindler, Paulina, 404 Selitskiy, Stas, 880 Seybold, Daniel, 695 Shah, Syed Iftikhar Hussain, 103 Shaheed, Hassnaa Hasan, 499 Shekhar, Shashank, 469 Sitaraman, Murali, 58 Sitnikov, Pavel, 92 Smaiyl, Assel, 581 Soufyane, Abdelaziz, 630 Sriyaratna, Disni, 332 Succi, Giancarlo, 639 Sulla-Torres, José, 293 Sun, Yu-Shan, 58 T Talesh, Seyed Hamed Ashraf, 1 Taylor, Rebecca M. C., 723 Thind, Rajvir, 747 Tibebu, Haileleol, 560 Timchenko, Anton, 639
917 Tomic, Slavisa, 43 Tonino-Heiden, Bianca, 138 Toto, Giusi Antonia, 656 Trepanier, Mylene, 255 Tsitsipas, Athanasios, 695 Tuesta-Monteza, Víctor A., 321 V van Buuren, Stef, 75 Vasquez, Xavier, 639 Villanueva-Ruiz, Deysi, 321 Vink, Gerko, 75 Vong, André, 43 W Wesner, Stefan, 695 Wijenayake, W. W. G. P. A., 332 X Xiao, Perry, 532 Xing, Lin, 590 Z Zagorc, Ana, 345 Zouev, Evgeny, 639