Lecture Notes in Networks and Systems 921
Kohei Arai Editor
Advances in Information and Communication Proceedings of the 2024 Future of Information and Communication Conference (FICC), Volume 3
Series Editor Janusz Kacprzyk , Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Advisory Editors Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas— UNICAMP, São Paulo, Brazil Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Türkiye Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA Institute of Automation, Chinese Academy of Sciences, Beijing, China Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus Imre J. Rudas, Óbuda University, Budapest, Hungary Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems, quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the worldwide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science. For proposals from Asia please contact Aninda Bose.
Editor Kohei Arai Saga University Saga, Japan
ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-031-54052-3 ISBN 978-3-031-54053-0 (eBook) https://doi.org/10.1007/978-3-031-54053-0 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.
Preface
We are delighted to present the seventh edition of the Future of Information and Communication Conference 2024 (FICC 2024), held successfully on April 4 and 5, 2024, in Berlin, Germany. The hybrid conference allowed 140 scholars from across the globe to present their groundbreaking research on communication, data science, computing, and the Internet of Things. The remarkable lineup of keynotes and presenters addressed many problems, proposed workable solutions, and offered a glimpse into the future of technology. Ever since its inception, FICC has made a substantial contribution to the scientific world. This year we received 401 submissions, of which 155 were selected after careful review for originality, applicability, and presentation, and 139 are published in this edition. The conference also hosted Best Student Paper, Best Paper, Best Poster, and Best Presentation competitions. All those who submitted papers, particularly the finalists, did a marvelous job. This success would not have been possible without the coordinated efforts of many people. The eager participation of the authors, whose papers were chosen by our experienced technical committee, was the pillar of the entire event. The role of the session chairs deserves special mention. Our sincere thanks go to all the keynote speakers for sparing their valuable time and delivering thought-provoking keynote addresses. Finally, the silent workers who ran the entire event smoothly, our Organizing Committee, deserve huge applause. We sincerely hope that the well-researched studies published in this edition provide enriching food for thought for our readers. The overwhelming response from authors, participants, and readers motivates us to better ourselves each time. We hope to receive continued support and enthusiastic participation from our distinguished scientific fraternity.
Regards, Kohei Arai Saga University, Japan
Organization
Conference Chair Pascal Lorenz
University of Haute Alsace, France
Program Chair Kohei Arai
Saga University, Japan
Contents
Cognitive Programming Assistant
    Indervir Singh Banipal, Shubhi Asthana, Sourav Mazumder, and Nadiya Kochura
Correcting User Decisions Based on Incorrect Machine Learning Decisions
    Saveli Goldberg, Lev Salnikov, Noor Kaiser, Tushar Srivastava, and Eugene Pinsky
On Graphs Defined by Equations and Cubic Multivariate Public Keys
    Vasyl Ustimenko, Tymoteusz Chojecki, and Michal Klisowski
System Tasks of Digital Twin in Single Information Space
    Mykola Korablyov, Sergey Lutsky, Anatolii Vorinin, and Ihor Ivanisenko
Smart City, Big Data, Little People: A Case Study on Istanbul’s Public Transport Data
    Emre Kizilkaya, Kerem Rizvanoglu, and Serhat Guney
Educommunication as a Communicative Strategy for the Dissemination of Cultural Programs
    David Xavier Echeverría Maggi, Washington Dután, Lilibeth Orrala, Gregory Santa-María, Mariana Avilés, Ángel Matamoros, María-José Macías, Martha Suntaxi, Lilian Molina, Gabriel Arroba, and Arturo Clery
On Implemented Graph-Based Generator of Cryptographically Strong Pseudorandom Sequences of Multivariate Nature
    Vasyl Ustimenko and Tymoteusz Chojecki
Fengxiansi Cave in the Digital Narrative
    Wu-Wei Chen
Lambda Authorizer Benchmarking Tool with AWS SAM and Artillery Framework
    Cornelius and Shivani Jaswal
Hardware and Software Integration of Machine Learning Vision System Based on NVIDIA Jetson Nano
    Denis Manolescu, David Reid, and Emanuele Lindo Secco
Novel Approach to 3D Simulation of Soft Tissue Changes After Orthognathic Surgery
    B. A. P. Madhuwantha, E. S. L. Silva, S. M. S. P. Samaraweera, A. I. U. Gamage, and K. D. Sandaruwan
EdgeBench: A Workflow-Based Benchmark for Edge Computing
    Qirui Yang, Runyu Jin, Nabil Gandhi, Xiongzi Ge, Hoda Aghaei Khouzani, and Ming Zhao
An Algebraic-Geometric Approach to NP Problems II
    F. W. Roush
Augmented Intelligence Helps Improving Human Decision Making Using Decision Tree and Machine Learning
    Mohammed Ali Al-Zahrani
An Analysis of Temperature Variability Using an Index Model
    Wisam Bukaita, Oriehi Anyaiwe, and Patrick Nelson
A Novel Framework Predicting Anxiety in Chronic Disease Using Boosting Algorithm and Feature Selection Techniques
    N. Qarmiche, N. Otmani, N. Tachfouti, B. Amara, N. Akasbi, R. Berrady, and S. El Fakir
Machine Learning Application in Construction Delay and Cost Overrun Risks Assessment
    Ania Khodabakhshian, Umar Malsagov, and Fulvio Re Cecconi
Process Mining in a Line Production
    Cristina Santos, Joana Fialho, Jorge Silva, and Teresa Neto
Digital Transformation of Project Management
    Zornitsa Yordanova
Text Analysis of Ethical Influence in Bioinformatics and Its Related Disciplines
    Oliver Bonham-Carter
A Multi-Criteria Decision Analysis Approach for Predicting User Popularity on Social Media
    Abdullah Almutairi and Danda B. Rawat
A Computationally Inexpensive Method for Anomaly Detection in Maritime Trajectories from AIS Dataset
    Zahra Sadeghi and Stan Matwin
An Ontology-Based Recommendation Module for Optimal Career Choices
    Maria-Iuliana Dascalu, Rares Birzaneanu, and Constanta-Nicoleta Bodea
Digital Infrastructures for Compliance Monitoring of Circular Economy: Requirements for Interoperable Data Spaces
    Wout Hofman, Boriana Rukanova, Yao Hua Tan, Nitesh Bharosa, Jolien Ubacht, and Elmer Rietveld
Predictive Analytics for Non-performing Loans and Bank Vulnerability During Crises in Philippines
    Jessie James C. De Guzman, Macrina P. Lazo, Ariel Kelly D. Balan, Joel C. De Goma, and Grace Lorraine D. Intal
Classification of Academic Achievement in Upper-Middle Education in Veracruz, Mexico: A Computational Intelligence Approach
    Yaimara Céspedes-González, Alma Delia Otero Escobar, Guillermo Molero-Castillo, and Jerónimo Domingo Ricárdez Jiménez
Accelerating the Distribution of Financial Products Through Classification and Regression Techniques: A Case Study in the Wealth Management Industry
    Edouard A. Ribes
Are K-16 Educators Prepared to Address the Educational and Ethical Ramifications of Artificial Intelligence Software?
    Julie Delello, Woonhee Sung, Kouider Mokhtari, and Tonia De Giuseppe
Teleoperation of an Aerial Manipulator Robot with a Focus on Teaching: Learning Processes
    Alex R. Chanataxi and Jessica S. Ortiz
Exploring Traditional and Tech-Based Toddler Education: A Comparative Study and VR Game Design for Enhanced Learning
    Fatemehalsadat Shojaei
A Universal Quantum Technology Education Program
    Sanjay Vishwakarma, D. Shalini, Srinjoy Ganguly, and Sai Nandan Morapakula
Digital Formative Assessment as a Transformative Educational Technology
    Boumedyen Shannaq
Meaningful Learning Processes of Service Robots for Tracking Trajectories Through Virtual Environments
    Jhonatan W. Tercero and Jessica S. Ortiz
Audience Engagement Factors in Online Health Communities: Topics, Domains and “Scale Effect”
    Konstantin Platonov
Enhancing Human-Computer Interaction: An Interactive and Automotive Web Application - Digital Associative Tool for Improving Formulating Search Queries
    Boumedyen Shannaq
Dtnmqtt: A Resilient Drop-In Solution for MQTT in Challenging Network Conditions
    Lars Baumgärtner
Forensic Analysis of Social Media Android Apps via Timelines
    Ayodeji Ogundiran, Hongmei Chi, Jie Yan, and Jerry Miller
AES-ECC and Blockchain in Optimizing the Security of Communication-Rich IoT
    Ibrahima Souare and Khadidiatou Wane Keita
Bitcoin Gold, Litecoin Silver: An Introduction to Cryptocurrency Valuation and Trading Strategy
    Haoyang Yu, Yutong Sun, Yulin Liu, and Luyao Zhang
A Stakeholders’ Analysis of the Sociotechnical Approaches for Protecting Youth Online
    Xavier Caddle, Jinkyung Katie Park, and Pamela J. Wisniewski
Intrusion Detection in IoT Network Using Few-Shot Class Incremental Learning
    Mostafa Hosseini and Wei Shi
A Transcendental Number-Based Random Insertion Method for Privacy Protection
    Ki-Bong Nam, Suk-Geun Hwang, Ki-Suk Lee, and Jiazhen Zhou
Zero Trust and Compliance with Industry Frameworks and Regulations: A Structured Zero Trust Approach to Improve Cybersecurity and Reduce the Compliance Burden
    Yuri Bobbert and Tim Timmermans
The Forever Robotics Rules? An Overview Analysis of Their Applicability Scaled Over Time from Isaac Asimov to Our Software Robots
    Monica-Ioana Vulpe and Stelian Stancu
Artificial Intelligence (AI) in Medical Diagnostics: Social Attitude of Poles
    Joanna Ejdys and Magdalena Czerwińska
Predictive Models of Gaze Positions via Components Derived from EEG by SOBI-DANS
    Akaysha C. Tang, Rui Sun, Cynthina Chan, and Janet Hsiao
Revealing the Relationship Between Beehives and Global Warming via Machine Learning
    Jeongwook Kim and Gyuree Kim
Author Index
Cognitive Programming Assistant
Indervir Singh Banipal1(✉), Shubhi Asthana2, Sourav Mazumder1, and Nadiya Kochura3
1 IBM Silicon Valley Lab, San Jose, CA 95141, USA
2 IBM Research, San Jose, CA 95120, USA
3 IBM Data & AI, Lowell, MA 01851, USA
Abstract. With the advent of intelligent software engineering tools that can write code, there has been growing interest in automating the programming tasks that surround coding. While completely eliminating the human element is very difficult and unsustainable, some aspects of these tools can be augmented with artificial intelligence and natural language understanding in order to enhance code understanding. The goal of a programming assistant should not only be to write code automatically, but also to suggest the right tech stack so that the code suits the use case and is scalable, high-performance and maintainable in the long run. Another example is automatically identifying the intent of a programming task, that is, its objective and the subject on which it is performed, and, based on that, recommending the most suitable language, framework, and design pattern to fulfill the programmer’s intent. In this paper, we develop a novel, intelligent system which provides useful recommendations to the user in real time, taking into account the current use case, the software language’s documentation, the historical performance of functions, and certain user constraints. The system addresses the above challenges by recommending the correct language paradigm and language level, and by suggesting optimal and efficient functions to programmers as they code, so that the code is maintainable, efficient and scalable, and code base bugs are reduced in the long term.
Keywords: Source code · Artificial intelligence · Natural language processing · Human-in-the-loop · Software engineering · Machine learning
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): FICC 2024, LNNS 921, pp. 1–11, 2024. https://doi.org/10.1007/978-3-031-54053-0_1
Optimal selection of programming languages and frameworks is critical when building production-grade distributed systems. With the rapid development of new technologies, it is challenging for a programmer to deeply understand all of the available approaches and to track which languages are currently best
suited and effective for given use cases. For example, the recent shift from the imperative to the declarative style of programming expects the user to simply tell the system what to do, using the language's own primitives, instead of prescribing how to perform the intended actions. Also, an increase in software development in the areas of data science, machine learning and big data has led to radical programming paradigm shifts from imperative to declarative [1]. On the real-time systems side, with the advent of event-driven architectures, there has been more focus on the reactive [2] and asynchronous styles of programming. With these recent changes in programming paradigms and design patterns, driven by the focus on big data and real-time systems, control of the program flow has shifted away from the user, and programming has become more declarative and reactive [3]. These shifts are evidence that programming languages expect the user not to reinvent the wheel but to reuse existing low-level routines and algorithms, so that the programmer can focus more on product-level design, architecture and behaviors.
There has been prior research in this area in the past few years. CodeHint [4] presents a technique for synthesizing code that allows programmers to reason about concrete executions and that is easy to use and interactive. In essence, it proposes a system for dynamic and interactive code suggestions that takes contextual information into account. Liu et al. [5] explore the use of neural network techniques to automatically learn code completion from a large corpus of dynamically typed JavaScript code. They present different neural networks that leverage not only token-level information but also structural information, and evaluate their performance on different prediction tasks. Gu et al. [6] propose a deep-learning-based approach to generate API usage sequences for a given natural language query.
It adapts a neural-network-based language model named RNN Encoder-Decoder. Allamanis et al. [7] provide a comprehensive survey of work in the area of intelligent programming. They contrast programming languages with natural languages and discuss how the similarities and differences drive the design of probabilistic models. They also present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. The field of automated programming using artificial intelligence [8,9] is attracting the attention of software engineering researchers, especially after the advances in generative AI [10].
In the rest of the paper, we provide a high-level approach in Sect. 2, followed by a detailed methodology in Sect. 3, wherein we assist programmers by understanding their intent and then suggesting the best programming paradigms [11] and language levels. We also describe the experiments in Sect. 4, followed by the conclusion and future work.
2 Approach
In our methodology, we address the challenges in a multi-faceted manner. The first step is to identify the coding target the programmer is trying to achieve. This means analyzing the last few lines of code written, the most recent files and packages
modified, the pull requests raised and the issues that have been assigned. It includes metadata from sources such as GitHub [12] and Jira [13], collected in an opt-in manner. This metadata helps the system better understand the coding target, that is, identify the intent behind the programmer’s actions in code. The second step is to define intent buckets. An intent bucket is defined by a certain programming goal, such as looping or iterating over objects, or the handling of errors and exceptions. Consider the case of looping or iteration. In languages such as Java or C/C++, whenever a programmer loops or iterates over objects, the intent is detected and classified into the ‘iteration’ intent bucket. Before this can happen, the identification of intent buckets itself needs to be trained through pre-trained machine learning models [14], in a semi-automated process with a human-in-the-loop [15] element. The pre-trained models are trained on a data corpus of popular languages that are diverse in paradigm and used day to day by developers and research scientists. As we see in the upcoming sections, the language documentation is ingested and used to train machine learning models. This includes metadata about the issues and pull requests users are working on as well. With human-in-the-loop, the system also generates custom models, so that intent identification is smooth and fits a large set of varied programming languages. Once the intent buckets are ready, the system can identify, in real time, the intent of the programming target the developer is trying to achieve. This also covers the actions the user intends to perform through the programming interfaces and routines of the language at hand. Based on the ongoing programming actions of the user, the system performs intent classification of the programming task in real time.
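To make the idea of intent buckets concrete, the following is a minimal sketch, assuming a purely rule-based stand-in for the trained classifier described above. It inspects the syntax tree of a Python snippet and votes for a bucket. The bucket names, the `NODE_TO_BUCKET` table and the `classify_intent` function are hypothetical illustrations, not part of the system described in this paper.

```python
import ast
from collections import Counter

# Hypothetical mapping from syntax-tree node types to intent buckets.
# The paper does not prescribe this list; it is an illustration only.
NODE_TO_BUCKET = {
    ast.For: "iteration",
    ast.While: "iteration",
    ast.ListComp: "iteration",
    ast.Try: "error_handling",
    ast.Raise: "error_handling",
    ast.Dict: "dictionary_lookup",
}


def classify_intent(source: str) -> str:
    """Return the intent bucket with the most syntactic hints in the snippet."""
    votes = Counter()
    for node in ast.walk(ast.parse(source)):
        bucket = NODE_TO_BUCKET.get(type(node))
        if bucket:
            votes[bucket] += 1
        # A direct call to the built-in open() is treated as a file I/O hint.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "open"):
            votes["file_io"] += 1
    return votes.most_common(1)[0][0] if votes else "unknown"


if __name__ == "__main__":
    print(classify_intent("for item in items:\n    total += item\n"))
```

A learned model would replace the lookup table, but the input (recent code) and the output (a single bucket label) would plausibly keep this shape.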
Most of the application programming interfaces exposed by any language fall under certain broad classes of programming tasks. Some of the most common programming tasks are iterating over objects, handling references, performing file I/O, error and exception handling, building dictionaries and lookups, and many others. Based on the functionality provided by the language and its documentation, the models can learn and generate buckets which are specialized for certain groups of programming tasks [16]. As we will see in the upcoming sections, the coding targets that lead to an intent bucket further help in training models which decide what programming paradigm [17] should be recommended. The intent buckets can pertain to multiple languages, because most languages provide these basic functionalities to developers. This helps us devise models, later on, which recommend the programming paradigm, the level of language, and finally the most optimal functions and routines for the user. Figure 1 shows the workflow, starting with intent bucket classification followed by paradigm classification. Every intent bucket contains information about the programmer’s coding target for any subject language at hand. The next step is to identify the best programming paradigm based on what the developer is trying
to achieve. This leads to another machine learning model, which identifies the paradigm realm towards which the programmer’s coding target is heading. For example, intent buckets that stress filtering, mapping and reduce operations [18] are more inclined towards declarative paradigms, and also towards the functional programming paradigm. In contrast, intent buckets specializing in pointers and memory management [19] lean towards imperative and procedural programming [20] paradigms. Each bucket has a certain broader programmatic goal to be achieved.
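The contrast between the two paradigm families can be illustrated on a single filter, map and reduce task. The sketch below is an assumed example, not taken from the paper: it sums the squares of the even readings, once in an imperative style and once in a declarative, functional style. The data and function names are hypothetical.

```python
from functools import reduce

readings = [3, 8, 5, 12]  # hypothetical input data


def sum_even_squares_imperative(values):
    """Imperative style: spell out how to loop, test and accumulate."""
    total = 0
    for value in values:
        if value % 2 == 0:
            total += value * value
    return total


def sum_even_squares_declarative(values):
    """Declarative/functional style: state what to filter, map and reduce."""
    evens = filter(lambda v: v % 2 == 0, values)
    squares = map(lambda v: v * v, evens)
    return reduce(lambda acc, v: acc + v, squares, 0)


if __name__ == "__main__":
    print(sum_even_squares_imperative(readings))
    print(sum_even_squares_declarative(readings))
```

Both functions compute the same value; a coding target expressed in the second form is the kind that would fall into a filter, map and reduce bucket and lean towards a declarative recommendation.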
Fig. 1. Intent buckets & paradigm classification
While classification of programming goals into intent buckets is achievable by building machine learning models and using natural language processing [21–23], which we discuss in the upcoming sections, the system also employs a human-in-the-loop approach to refine the initial ground truth. With a combination of knowledge discovery and subject matter labeling, an internal ranking of paradigms and language levels is prepared for each intent bucket. For example, for the intent bucket specializing in filtering, mapping and reduce operations, functional programming languages such as Scala and Python rank higher than imperative programming languages such as Java, C and C++. Each intent bucket focusing on a specific programming goal has a certain degree of inclination towards different paradigms. For example, the intent bucket focusing on filtering, mapping and reduce operations has a higher likelihood, say more than 0.5, for the declarative style and a much lower one for the imperative and procedural styles. It also outputs a recommended level of language on the higher side. A low-level language would instead be better suited to assembly-level or systems-level programming.
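The per-bucket ranking described above can be sketched as a small lookup, assuming the likelihoods have already been produced by the models and refined by subject matter experts. All numbers, bucket names and the `recommend` function below are illustrative placeholders, not results reported in this paper.

```python
# Hypothetical likelihoods per intent bucket (placeholders, not measured values).
PARADIGM_SCORES = {
    "filter_map_reduce": {
        "declarative": 0.7, "functional": 0.6,
        "imperative": 0.2, "procedural": 0.1,
    },
    "memory_management": {
        "imperative": 0.7, "procedural": 0.6,
        "declarative": 0.1, "functional": 0.1,
    },
}

# Hypothetical recommended language level per bucket.
LANGUAGE_LEVEL = {"filter_map_reduce": "high", "memory_management": "low"}


def recommend(bucket, threshold=0.5):
    """Return paradigms scoring above the threshold (best first) and the level."""
    ranked = sorted(PARADIGM_SCORES[bucket].items(),
                    key=lambda pair: pair[1], reverse=True)
    paradigms = [name for name, score in ranked if score > threshold]
    return paradigms, LANGUAGE_LEVEL[bucket]


if __name__ == "__main__":
    print(recommend("filter_map_reduce"))
```

The 0.5 threshold mirrors the example likelihood mentioned in the text; in practice the cut-off would be a tunable parameter.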
3 Methodology
The primary goal is to identify what the user is trying to achieve in the current programming session. This means understanding the programming intent and the target task. As the user writes code using functions and APIs, the user intent is extracted from that text for further data mining and processing, so that the system can identify the programmatic action the user is trying to perform. Suppose the user is using Python and trying to build a data science pipeline. The system will already have ingested the Python documentation, with the relevant APIs and functions mapped to the intent buckets described above. First, the documentation of the language is ingested, and a knowledge corpus [24] is built and indexed. This knowledge corpus, which has search capabilities [25], is then analyzed to build generic buckets which can pertain not only to Python but to other languages as well. For example, for data-science-heavy functions such as mappers, filters and reduce operations [18] (see Fig. 3), a separate bucket is created, and the Python documentation about such functions is mapped to this bucket. Similarly, there is one bucket for looping and iterative functions only, one for error handling and exception-related functions, and so on. The ingestion module is responsible for consuming the documentation of the language, which can happen through web crawlers for dynamic processing or on a standalone basis for established popular languages; it then passes the content on to the enrichment modules, where the models are run, and the enriched information is stored in the appropriate data stores. Using custom classifiers built on base natural language understanding models, the enrichment module decides which bucket a function or API should be assigned to.
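The ingestion, enrichment and storage flow described above can be sketched as follows. This is a minimal illustration under stated assumptions: keyword overlap stands in for the custom classifiers, an in-memory dictionary stands in for the data stores, and `DocEntry`, `enrich`, `BucketStore` and `ingest` are hypothetical names. The `reassign` method models a subject matter expert moving an entry to a different bucket.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical keyword sets; a real system would use trained classifiers.
BUCKET_KEYWORDS = {
    "iteration": {"iterate", "loop", "iterator"},
    "error_handling": {"exception", "error", "raise"},
    "filter_map_reduce": {"filter", "map", "reduce"},
}


@dataclass
class DocEntry:
    name: str
    description: str
    bucket: str = "unassigned"


def enrich(entry: DocEntry) -> DocEntry:
    """Assign the bucket whose keywords overlap most with the description."""
    words = set(re.findall(r"[a-z]+", entry.description.lower()))
    hits = {b: len(words & kws) for b, kws in BUCKET_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    if hits[best] > 0:
        entry.bucket = best
    return entry


@dataclass
class BucketStore:
    buckets: Dict[str, List[DocEntry]] = field(default_factory=dict)

    def save(self, entry: DocEntry) -> None:
        self.buckets.setdefault(entry.bucket, []).append(entry)

    def reassign(self, name: str, new_bucket: str) -> bool:
        """Expert feedback: move the named entry to another bucket."""
        for entries in list(self.buckets.values()):
            for entry in list(entries):
                if entry.name == name:
                    entries.remove(entry)
                    entry.bucket = new_bucket
                    self.save(entry)
                    return True
        return False


def ingest(raw_docs, store: BucketStore) -> None:
    """Consume (name, description) pairs, enrich them and store the result."""
    for name, description in raw_docs:
        store.save(enrich(DocEntry(name, description)))
```

For example, calling `ingest` with a description of `functools.reduce` that mentions "reduce" places the entry in the filter, map and reduce bucket, and `reassign` lets a reviewer correct a misplaced entry afterwards.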
Once a base classification happens, the subject matter expert can review it and provide feedback so that the bucket can be changed if required. Similar approaches have been employed before for enterprise document understanding [26] and analysis. As discussed above, the system can ingest the language documentation for the current language from the help sections, or crawl it from trusted and official web sources, into the enrichment module and then extract important features [27]. The system maps the extracted features to the existing intent buckets, which have been predefined for the programming language understanding models. For example, documentation on large file iteration from the help corpus of a new language is mapped to the iteration bucket corresponding to large file reading and iteration. A corpus of natural language texts is used to train the system to identify user intents, actions and entities related to programming tasks from natural language. Additionally, in an opt-in manner, textual data from project management software such as GitHub, Jira, etc., which describes the tasks and issues, may be supplied as training data as well. The language corpus data from online and offline sources is collected in relevant data stores. This data is used to train intent classification buckets which specialize in certain coding targets.
Fig. 2. Custom classification models
Fig. 3. Reduce functions in Project Reactor
Since this involves understanding the functions and mapping them into buckets, the system uses Natural Language Processing (NLP) technologies to understand the documentation and map the relevant functions and routines to it. These processes use NLP-driven annotations to map the programming language functions into the intent buckets, and these annotations can be used to build pre-trained NLP models that understand the language documentation and functions. As shown in Fig. 2, while building the pre-trained models, the first layer understands basic syntactic NLP primitives and operators such as tokenization, lemmatization, parts of speech, etc. The second layer, built on top of the first, contains Linguistic Logical Expressions (LLEs) and ML models. These first two layers are domain agnostic. The final layer is domain specific and can understand documents from the language’s official documentation, whether online or from a local installation. The domain-agnostic and domain-specific layers are constructed using SystemT [28], SystemML [29] and Annotation Query Language [30]. The NLP annotations [31] generated as output are mapped against an ontology which is specific to the security policy document understanding domain. This ontology can be made specific to intent buckets for the intent bucket model. When we build the machine learning model for paradigm classification, the ontology is different, in that it contains various paradigms such as declarative, functional, reactive, imperative, procedural, etc. The model for language classification can have ontologies which describe various levels, such as very high, high, medium, low and very low level languages. This output, in combination
Cognitive Programming Assistant
with the paradigm classification, helps decide the right language for the user. While generating the clues responsible for classification, the system uses the underlying tokens and lemmas, which can optionally also include BERT-enabled entity extraction [32]. These tokens can help train the higher layers of the pre-trained models mentioned in the previous steps. It is recommended to have GPUs enabled [33] when employing BERT-based approaches. While rule-based systems can require less computational power and offer quicker model iteration cycles, deep learning based approaches can lead to higher accuracy but carry a risk of model overfitting. The above-mentioned rule-based approaches using LLEs have been used in applications such as building efficient component review systems [34], cognitive queries [35], generating personalized media content [36], and devising advanced auto-complete methods [37]. They have also been applied in building contextually aware chatbots [38] for public usage, in enterprise settings such as the retail industry [39], and in supply chain applications [40]. The paradigm ontology can start with major paradigms [41] such as declarative, imperative, reactive, object oriented, procedural and a few others, and can be updated by the admin subject matter expert as needed if the existing paradigms are not sufficient for bucket classification. A similar ontology can be employed for language level classification as well, such as high, medium and low level languages. If the user is already using the recommended language, the system still retains the programming intent bucket information. If the system detects that the user is not using the most suitable function for the coding target, an appropriate function is suggested. For recommending the optimal function, the Subject Matter Expert (SME) driven approach for pre-training is helpful and can be further optimized on a case-by-case basis [42]. In this specific case, a federated model construction [43] approach records the performance of certain functions based on the use case and hardware, and the model can recommend the best function to employ for the coding target at hand. This can be further extended to recommend optimal functions and routines based on the user's system configuration and resource allocation [44] configuration.
Fig. 4. Likelihood scores for Paradigms
The system can monitor the user's intent buckets in real time and present the following recommendations to the programmer:
1. The intent buckets most used by the user, aggregated over time. This indicates the aggregated coding target, or programming goal, that the user is trying to achieve in real time.
2. The most highly ranked programming paradigms and language levels from the aggregated bucket list in the previous step. If the user is already using the recommended language, the system still retains the coding target information and, based on that, recommends the most suitable function or routine for that particular language.
3. Based on the aggregated analytics in the previous steps, the language paradigm that best suits the current coding target. The likelihood metrics of the intent buckets (see Fig. 4) are used to generate the final scores for the paradigm recommendation.
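The aggregation described in the three recommendations can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bucket names, the `LIKELIHOOD` table and its values, and the function `recommend_paradigm` are all hypothetical, and the scoring rule (bucket frequency times likelihood) is one plausible way to combine the two signals.

```python
from collections import Counter

# Hypothetical likelihood table: how well each paradigm suits an intent bucket.
# The bucket names and numbers are illustrative assumptions only.
LIKELIHOOD = {
    "large_file_iteration": {"imperative": 0.5, "functional": 0.3, "reactive": 0.2},
    "event_stream_processing": {"imperative": 0.1, "functional": 0.3, "reactive": 0.6},
}


def recommend_paradigm(observed_buckets):
    """Rank paradigms by likelihood, weighted by how often each bucket was observed."""
    counts = Counter(observed_buckets)
    total = sum(counts.values())
    if total == 0:
        return []
    scores = Counter()
    for bucket, count in counts.items():
        weight = count / total  # share of this bucket in the aggregated history
        for paradigm, likelihood in LIKELIHOOD.get(bucket, {}).items():
            scores[paradigm] += weight * likelihood
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    history = ["event_stream_processing"] * 3 + ["large_file_iteration"]
    for paradigm, score in recommend_paradigm(history):
        print(f"{paradigm}: {score:.3f}")
```

Weighting by bucket share keeps the scores comparable across sessions of different length; a production system would presumably learn the likelihood table rather than hard-code it.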
4 Experiments
The system can start by gathering the programming languages' documentation, corpus and help section information through either online or offline sources. This is required for training the models for intent bucket classification [45]. This data can either be crawled, which is the recommended method, or be manually ingested in enterprise settings. Enterprises that do not want to rely on public data or expose their source code through crawlers can use the local document ingestion approach or use connectors to source from internal repositories. In an opt-in manner, metadata from the user's latest pull requests, GitHub repositories and Jira tasks can also be used to train and refine the programming intent bucket classification models. While building the machine learning pipeline, it is recommended to start with a standard list of paradigms and corresponding languages. Based on our experimentation, we draw the following conclusions and recommendations:
1. We recommend a minimum of 50 source code repositories per language category, and ∼200 per language paradigm category, to build a robust ground truth collection. Based on the experimentation, it is advisable to thoroughly prepare the ground truth and ontologies that specify which language belongs to which paradigm and how high or low the level of the language is.
2. The source code repositories should have high ratings (90%+), 100 or more verified repository forks and followers, active source code contributions in the past five years, and a high sentiment analysis rating.
3. During the training of intent bucket classification, only official sources of the language documentation should be used as the training corpus. In enterprise settings, proprietary in-house languages can be used, but the ground truth should come only from the official documentation used for on-premise server training.
4. Some languages can belong to multiple paradigms, which should be allowed. The likelihood estimates of the classification should allow the user to disambiguate the language selection in the final steps.
Our experimentation was conducted in an enterprise setting on internal repositories, for both internal and external customers. Depending on how the crawler services and machine learning pipelines are set up for the experimentation, it is important to allocate resources to the cloud services intelligently as well. The SMEs should have past exposure to different languages belonging to the important language paradigms that are employed by developers throughout industry and research institutions. It is advisable to assign SMEs to the training sets based on their past experience and expertise in specific programming languages and paradigms. In case of conflicts during ground truth construction [46] or feedback labeling [47], it is recommended to resolve them through the SMEs with the most experience in that programming area. We are exploring new algorithms that can recommend optimal languages and functions for specific types of code, such as security [48,49], cryptography [50], and quantum [51], to build intelligent programming assistants [52,53] for specialized domains.
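The repository selection criteria in recommendations 1 and 2 can be expressed as a simple filter. The sketch below is an assumption-laden illustration: the `Repo` fields, the function names and the way a "rating" is represented as a fraction are hypothetical, and the sentiment analysis criterion is left out because the text does not give a threshold for it.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    """Hypothetical repository metadata record (field names are assumptions)."""
    name: str
    rating: float                   # fraction in [0, 1]; 0.90 corresponds to 90%
    forks: int
    followers: int
    years_since_last_commit: float


def qualifies(repo, min_rating=0.90, min_forks=100, min_followers=100,
              max_inactive_years=5.0):
    """Apply the rating, fork/follower and recent-activity thresholds from the text."""
    return (repo.rating >= min_rating
            and repo.forks >= min_forks
            and repo.followers >= min_followers
            and repo.years_since_last_commit <= max_inactive_years)


def enough_ground_truth(repos_by_language, minimum=50):
    """Report, per language, whether at least `minimum` repositories qualify."""
    return {
        language: sum(qualifies(repo) for repo in repos) >= minimum
        for language, repos in repos_by_language.items()
    }


if __name__ == "__main__":
    good = Repo("example-a", 0.95, 250, 400, 0.5)
    stale = Repo("example-b", 0.97, 900, 1200, 7.0)
    print(qualifies(good), qualifies(stale))
```

The same check could be run per paradigm category with `minimum=200` to mirror the second threshold in recommendation 1.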
5 Conclusion and Future Work
In this paper, we propose a novel methodology that can recommend the programming language paradigm, the language itself, and optimal functions and routines, enabling developers to build efficient, maintainable and scalable software. As enterprises move towards hybrid and multi-cloud [54] approaches while building enterprise systems, it is important to devise more sophisticated algorithms that can support upcoming programming languages and paradigms, in order to build smarter programming assistants.
References
1. Smith, J., et al.: A predicate construct for declarative programming in imperative languages. In: Proceedings of the 24th International Symposium on Principles and Practice of Declarative Programming (2022)
2. Bainomugisha, E., et al.: A survey on reactive programming. ACM Comput. Surv. (2013)
3. Chupin, G., et al.: Functional reactive programming, restated. In: Proceedings of the 21st International Symposium on Principles and Practice of Declarative Programming (2019)
4. Galenson, J., et al.: Codehint: dynamic and interactive synthesis of code snippets. In: Proceedings of the 36th International Conference on Software Engineering (2014)
5. Liu, C., et al.: Neural code completion (2016)
6. Gu, X., et al.: Deep API learning. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (2016)
7. Allamanis, M., et al.: A survey of machine learning for big code and naturalness. ACM Comput. Surv. (CSUR) (2018)
8. Banipal, I.S., et al.: US11185780B2: Artificial intelligence profiling (2017)
9. Banipal, I.S., et al.: US20220335302A1: Cognitive recommendation of computing environment attributes (2021)
10. Generative AI (Nvidia). https://research.nvidia.com/research-area/generative-ai
11. Kwatra, et al.: US11556335B1: Annotating program code (2021)
12. GitHub. https://github.com
13. Jira. https://www.atlassian.com/software/jira
14. Banipal, I.S., et al.: US20220309379A1: Automatic identification of improved machine learning models (2021)
15. Ustalov, D.: Challenges in data production for AI with human-in-the-loop. In: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (2022)
16. Skupas, B., et al.: Developing classification criteria for programming tasks. In: Proceedings of the 14th Annual ACM SIGCSE Conference on Innovation and Technology in Computer Science Education (2009)
17. Floyd, R.W.: The paradigms of programming. In: ACM Turing Award Lectures (2007)
18. Dean, J., et al.: MapReduce: a flexible data processing tool. Commun. ACM (2010)
19. Daconta, M.C.: C++ pointers and dynamic memory management. ACM (1995)
20. Wirth, N.: The development of procedural programming languages personal contributions and perspectives. In: Weck, W., Gutknecht, J. (eds.) JMLC 2000. LNCS, vol. 1897, pp. 1–10. Springer, Heidelberg (2000). https://doi.org/10.1007/10722581_1
21. Heyman, G., et al.: Natural language-guided programming. In: Onward 2021, Proceedings of the 2021 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (2021)
22. Silverstein, et al.: US20210264480A1: Text processing based interface accelerating (2020)
23. Banipal, I.S., et al.: US20220215047A1: Context-based text searching (2021)
24. Kwatra, et al.: US11552966B2: Generating and mutually maturing a knowledge corpus (2020)
25. Banipal, I.S., et al.: Relational Social Media Search Engine. The University of Texas at Dallas (2016)
26. Asthana, S., et al.: Joint time-series learning framework for maximizing purchase order renewals. IEEE Big Data 2021 (2021)
27. Wong, H.M., et al.: Feature selection and feature extraction: highlights. In: 2021 5th International Conference on Intelligent Systems, Metaheuristics and Swarm Intelligence (2021)
28. IBM SystemT. https://en.wikipedia.org/wiki/IBM_SystemT
29. IBM SystemML. https://systemds.apache.org/docs/1.2.0/
30. Annotation Query Language, IBM Watson Knowledge Studio. https://cloud.ibm.com/docs/watson-knowledge-studio?topic=watson-knowledge-studio-annotation-query-language-reference
31. Banipal, I.S., et al.: US20210042290A1: Annotation Assessment and Adjudication (2019)
32. Brandsen, A., et al.: Can BERT Dig It? Named entity recognition for information retrieval in the archaeology domain. ACM J. (2022)
33. Chen, S., et al.: E.T.: re-thinking self-attention for transformer models on GPUs. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2021)
34. Banipal, I.S., et al.: US11188968B2: Component based review system (2020)
35. Baughman, et al.: US11481401B2: Enhanced cognitive query construction (2020)
36. Kwatra, et al.: US11445042B2: Correlating multiple media sources for personalized media content (2020)
37. Trim, C., et al.: US11556709B2: Text autocomplete using punctuation marks (2020)
38. Kwatra, et al.: US11483262B2: Contextually-aware personalized chatbot (2020)
39. Kochura, N., et al.: US11488240B2: Dynamic chatbot session based on product image and description discrepancy (2020)
40. Banipal, I.S., et al.: US11514507B2: Virtual image prediction and generation (2020)
41. Floyd, R.W.: The paradigms of programming. Commun. ACM (1979)
42. Bravo, R., et al.: US10921887B2: Cognitive state aware accelerated activity completion and amelioration (2019)
43. Yang, Q., et al.: Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) (2019)
44. Gan, S.C., et al.: US11556385B2: Cognitive processing resource allocation (2020)
45. Liu, H., et al.: A simple meta-learning paradigm for zero-shot intent classification with mixture attention mechanism. In: SIGIR 2022, Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (2022)
46. Banipal, I.S., et al.: US11188517B2: Annotation assessment and ground truth construction (2019)
47. Silverstein, et al.: US11055119B1: Feedback Responsive Interface (2020)
48. Banipal, I.S., et al.: US20220358237A1: Secure data analytics (2021)
49. Cannon, G.F., et al.: US20220164472A1: Recommending post modifications to reduce sensitive data exposure (2020)
50. Xing, W., et al.: Ensuring correct cryptographic algorithm and provider usage at compile time. In: Proceedings of the 23rd ACM International Workshop on Formal Techniques for Java-like Programs (2021)
51. Serrano, M.A., et al.: Quantum software components and platforms: overview and quality assessment. ACM Comput. Surv. (2022)
52. Trim, C., et al.: US20220012018A1: Software programming assistant (2020)
53. Trim, C., et al.: US20220188525A1: Dynamic, real-time collaboration enhancement (2020)
54. Banipal, I.S., et al.: Smart system for multi-cloud pathways. In: IEEE Big Data 2022 (2022)
Correcting User Decisions Based on Incorrect Machine Learning Decisions
Saveli Goldberg1, Lev Salnikov2, Noor Kaiser3, Tushar Srivastava3, and Eugene Pinsky3(B)
1 Department of Radiation Oncology, Mass General Hospital, Boston, MA 02115, USA [email protected]
2 AntiCA Biomed, San Diego, CA, USA [email protected]
3 Department of Computer Science, Metropolitan College, Boston University, 1010 Commonwealth Avenue, Boston, MA 02215, USA {nkaiser,tushar98,epinsky}@bu.edu
Abstract. It is typically assumed that for the successful use of machine learning algorithms, these algorithms should have higher accuracy than a human expert. Moreover, if the average accuracy of ML algorithms is lower than that of a human expert, such algorithms should not be considered and are counter-productive. However, this is not always true. We provide strong statistical evidence that even if a human expert is more accurate than a machine, interacting with such a machine is beneficial when communication with the machine is non-public. The existence of a conflict between the user and ML model and the private nature of user-AI communication will make the user think about their decision and hence increase overall accuracy.
Keywords: User-computer interaction · Machine learning · Decision-making factors
1 Introduction
The use of machine learning (ML) and AI is becoming ever more widespread in numerous industries, including (but not limited to) healthcare, e-commerce, business, and finance (see bibliography). In situations where ML is used to inform the decision-making process, the decisions made by human experts and ML systems will inevitably differ. The question that then arises is: how should such conflicts be resolved? Intuitively, if the results of ML algorithms are more accurate than the decisions of an expert, they would help the expert make a well-informed and correct decision. Conversely, when the results of ML algorithms are less accurate than the decisions of an expert, the ML decisions are expected to be counter-productive. As stated in [4], for successful use of machine learning decisions, their accuracy should be at least 70% even if an expert's accuracy is below this level. However, our experiments suggest that even in cases where the ML model's accuracy is lower than that of an expert, such ML decisions are still beneficial in the expert's decision-making process, as long as the ML model's accuracy is not significantly lower than that of the expert. How can we explain such a surprising result? We show that in case of such a conflict, and when the ML model's decisions are not publicly known, an expert would think more critically about their decisions and subsequently achieve higher accuracy. We illustrate this by running a series of experiments. It is important to note that while the accuracy of an ML model is very important in gauging its performance, using it in isolation as a metric to assess the quality of the model can be misleading, since "good accuracy" in the field of ML is subjective and heavily context-dependent [1–10]. However, in this paper, we will examine the impact of a conflict between the outcome of an ML model and a human expert. This paper is organized as follows. In Sect. 2 we describe the set-up for our experiments. In Sect. 3 we summarize results for different configurations of our experiments. Finally, in Sect. 4 we discuss the obtained results and future plans.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): FICC 2024, LNNS 921, pp. 12–20, 2024. https://doi.org/10.1007/978-3-031-54053-0_2
2 Experiment Design
To replicate accurately a situation in which an expert consults an oracle, we chose to experiment with music, since many people have expertise in this area. Additionally, we had several participants willing to partake in these experiments because the subject matter of the tests was interesting and familiar to them and because they were eager to learn more about the research we were conducting. It is worth noting that all participants had different levels of knowledge in this realm and that the decisions of the participants had no consequences. There were two types of tests the participants of this study completed in all rounds across time. 34 different songs were used to create these tests. Tests 1 and 2 consisted of 12 questions each, and Test 3 consisted of ten questions. The pattern of these tests was as follows: first, participants are presented with an audio clip and asked to choose who they believe the song's artist is (refer to Fig. 1 for an example of this question). Once they make their preliminary decision, they move on to the next question, where the same audio clip is played again and the artist identified by the machine learning model is displayed as well. They are then asked to make their final decision on the song's artist (refer to Fig. 2 for an example of this question). This allows them either to stick with their preliminary response or to change it anonymously in light of the new information presented to them. Both types of tests were completed independently of one another. To standardize all the tests, the audio clips presented to the participants were ten seconds long, and the time given to complete each test was set to ten minutes. It is important to note that backtracking was not permitted in these tests, i.e., once
Fig. 1. Preliminary decision question
Fig. 2. Final decision question
participants had recorded a response and moved on to the next question, they could not go back to previous questions to view or change their responses. This ensured that the integrity of the collected data would remain intact. Additionally, participants were not allowed to move to the next question without responding to the current one, which ensured that the collected data was complete, since missing information and incomplete fields were ruled out. Three levels of accuracy of the machine learning model were used: 66.67%, 75% and 80%. For example, when the accuracy of the machine learning model
is set to 80% in the third test, 8 out of 10 of the “system responses” shown to the participants in the test are correct, while 2 out of 10 of these responses are incorrect. The participant is unaware of what the accuracy level of the ML model is for their respective tests. All participants completed these tests on Blackboard Learn, which is the primary learning management system at Boston University. Participants were able to access and complete these tests on this platform at any time within three days once the tests were made available to them as part of their course. However, as mentioned earlier, the performance on these tests was not treated as part of their coursework, i.e., participants were not rewarded or penalized for their performance on these tests in any manner.
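The fixed-accuracy "system responses" described above can be generated as in the following sketch. It is an illustration under stated assumptions rather than the authors' procedure: the function name `build_system_responses`, the random choice of which questions are answered incorrectly, and the placeholder artist labels are all hypothetical.

```python
import random


def build_system_responses(correct_artists, candidate_artists, accuracy, seed=0):
    """Return one simulated ML answer per question, with a fixed share of correct ones.

    Exactly round(accuracy * n) answers match the true artist; the remaining
    answers are drawn from the other candidates, so they are always wrong.
    """
    rng = random.Random(seed)
    n = len(correct_artists)
    n_correct = round(accuracy * n)
    wrong_positions = set(rng.sample(range(n), n - n_correct))
    responses = []
    for index, truth in enumerate(correct_artists):
        if index in wrong_positions:
            alternatives = [artist for artist in candidate_artists if artist != truth]
            responses.append(rng.choice(alternatives))
        else:
            responses.append(truth)
    return responses


if __name__ == "__main__":
    candidates = ["artist0", "artist1", "artist2", "artist3"]  # placeholder labels
    truth = [candidates[i % 4] for i in range(10)]
    shown = build_system_responses(truth, candidates, accuracy=0.80)
    print(sum(s == t for s, t in zip(shown, truth)), "of", len(truth), "correct")
```

Fixing the number of correct answers, rather than sampling each answer independently, keeps the realized accuracy of a short test exactly at the intended level.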
3 Results
A general description of the experimental results is presented in Table 1, which shows that the final solution of the students turned out to be much more accurate than both the students' preliminary solution and the ML solution.
Table 1. Summary statistics for all student classes
ML Acc   #Songs   Pre. Acc μ   Pre. Acc σ   Final μ   Final σ   # stud   ML vs. Pre p-value   ML vs. Final p-value   Pre. vs. Final p-value
66.7%    12       77.5%        24.5%        83.1%     17%       239
3 and $(3, d) = 1$ to construct a cubical map $G_m$ of the affine space $K^m$, $m \ge 2$, which acts injectively on $T_m(K) = (K^*)^m$ and has a Eulerian inverse $E_m$, which is an endomorphism of $K[x_1, x_2, \dots, x_m]$ such that the composition of $G_m$ and $E_m$ acts on $T_m(K)$ as the identity map. The degree of $E_m$ is at least $3 \times t$, where $t$ is the maximal power of 3 which is $< d$. So we take an affine transformation $T_1$ from $AGL_m(K)$ such that $T_1(x_1) = \alpha x_1$, where $\alpha \in K^*$, together with $T_2 \in AGL_m(K)$ and the tuple $u = (x, x + a_1, x + a_1, x + a_2, x + a_2, \dots, x + a_{s-1}, x + a_{s-1}, x + a_s, x^3)$, where the even $s$ is selected as in the previous example. The standard form $G_m$ of $T_1\,{}^{A(n,K)}\eta(u)\,T_2$ is a toric automorphism of $K[x_1, x_2, \dots, x_m]$. The knowledge of the trapdoor accelerator $(T_1, u, T_2)$ allows one to compute reimages of elements of $G_m((K^*)^m)$ in time $O(m^2)$. So we have a cubic endomorphism with a trapdoor accelerator of level $t$. It can be used for the construction of public keys with the space of plaintexts $(K^*)^m$ and the space of ciphertexts $K^m$.
Example 6.4.2. This is the case obtained by replacing $A(n, K)$ with $D(n, K)$. It is noteworthy that in the case $K = \mathbb{F}_q$ we get Examples 5.3.1 and 5.3.2. We implemented this algorithm in some cases of $K = \mathbb{Z}_{2^n}$, $n = 8, 16, 32, 64$. It uses a cubical toric automorphism of level $3t$, where $t$ is the maximal power of 3 from the interval $(0, 2^{n-1})$. In this case we can use a more general form for $T_1$, defined by the condition $T_1(x_1) = a_1 x_1 + a_2 x_2 + \dots + a_m x_m$, where an odd number of the $a_i$ are odd residues modulo $2^n$ (see [13,24]). We are going to present these "nonbijective public keys" in our future publications. In the next section we consider the implementation of a public key based on the trapdoor accelerator of Example 6.2.1. We will use the equations of $A(n, q)$ given in Sect. 3.
7 On the Example of Public Key Rule
Assume that Alice is the owner of the public key and Bob is a public user of the cryptosystem. Alice selects an appropriate finite field and the dimension of the vector space $V$ of ciphertexts. Let us assume that the field is $\mathbb{F}_{2^{32}}$ and the dimension equals $n$. Alice may consider $V$ as the point set $P$ or the line set $L$. Suppose she selects $L$. In that case her plaintext is the vector $[x_1, x_2, \dots, x_n]$. Alice also selects an even parameter $s$, which is the length of the walk in the graph $A(n, \mathbb{F}_{2^{32}})$. Let $s$ be a number of size $O(n)$. Assume that $[l] = [x_1, x_2, \dots, x_n]$ is the "symbolic" line of length $n$. Part of the trapdoor accelerator is the path $p(t_1, t_2, \dots, t_s)$ of length $s$ with the starting line $[l]$; it is defined by the colours of the vertices $x_1, x_1 + t_1, x_1 + t_2, \dots, x_1 + t_s$, where $t_2 \ne 0$ and $t_i \ne t_{i-2}$ for $i = 3, 4, \dots, s$. We assume that $s \le n$ and $u$ is the last vertex of the path. The public rule uses the destination line $[y]$ of the walk in $A(n, \mathbb{F}_q[x_1, x_2, \dots, x_n])$ defined by the colours $x_1, x_1 + t_1, x_1 + t_2, \dots, x_1 + t_s$ starting in $[l]$. Assume that $[y]$
V. Ustimenko et al.
$= [x_1 + t_s, g_2, g_3, g_4, \dots, g_n]$, where $g_2, g_3, \dots, g_n$ are multivariate polynomials of degree $\le 3$ in the variables $x_1, x_2, \dots, x_n$. The trapdoor accelerator is the cubical transformation $F = F(t_1, t_2, \dots, t_s)$ acting on $L = \mathbb{F}_q^n$ given by the rules $x_1 \to x_1 + t_s$, $x_2 \to g_2$, ..., $x_n \to g_n$. Alice uses the operator of the change of colour and forms the transformation $G$: $x_1 \to x_1^2$, $x_i \to g_i$, $i = 2, 3, \dots, n$. Alice generates two bijective affine transformations ${}^1T$ and ${}^2T$ of $L$ of degree 1 given by
$x_1 \to {}^il_1(x_1, x_2, \dots, x_n)$
$x_2 \to {}^il_2(x_1, x_2, \dots, x_n)$
...
$x_n \to {}^il_n(x_1, x_2, \dots, x_n)$
where $i = 1, 2$. Alice can use the idea of LU factorisation. So she generates each ${}^iT$ as a product of a lower triangular matrix ${}^iL$, $i = 1, 2$, with nonzero entries on the diagonal and an upper triangular matrix ${}^iU$ with unity elements on the diagonal. For the selection of the tuple $t_i$, $i = 1, 2, \dots, 256$, and of ${}^iL$ and ${}^iU$, $i = 1, 2$, Alice can use pseudorandom generators of field elements or some methods of generating genuinely random sequences (existing implementations of quantum computers, other probabilistic modifications of the Turing machine, quasi-stellar radio sources (quasars), etc.). Alice forms the tuple of variables $[x] = (x_1, x_2, \dots, x_n)$ and conducts the steps S1–S4.
S1. She forms the product of the vector $[x]$ and the matrix ${}^1T$. The output is the string $[{}^1l_1(x_1, x_2, \dots, x_n), {}^1l_2(x_1, x_2, \dots, x_n), \dots, {}^1l_n(x_1, x_2, \dots, x_n)] = [{}^1u]$, which is a line of the graph $A(n, \mathbb{F}_{2^{32}}[x_1, x_2, \dots, x_n])$.
S2. Alice computes the destination vector $[{}^2u]$ of the walk with the initial line $[{}^1u]$ given by the colours ${}^1u_1, {}^1u_1 + t_1, {}^1u_1 + t_2, \dots, {}^1u_1 + t_s$.
S3. She takes the vector $[{}^2u] = [{}^1u_1 + t_s, {}^2u_2, {}^2u_3, \dots, {}^2u_n]$ of elements of $\mathbb{F}_{2^{32}}[x_1, x_2, \dots, x_n]$ and forms the line ${}^3u = [({}^1u_1)^2, {}^2u_2, {}^2u_3, \dots, {}^2u_n]$ of the vector space $L$.
S4. Alice computes the product of the vector ${}^3u$ and the matrix of the linear transformation ${}^2T$. So Alice has the tuple of cubic multivariate polynomials ${}^4u = (f_1, f_2, \dots, f_n)$. She presents the transformation ${}^4u$ in its standard form and writes the following public rule $F$:
$x_1 \to f_1(x_1, x_2, \dots, x_n)$, $x_2 \to f_2(x_1, x_2, \dots, x_n)$, $x_3 \to f_3(x_1, x_2, \dots, x_n)$, ..., $x_n \to f_n(x_1, x_2, \dots, x_n)$.
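The change of colour in step S3 replaces the first coordinate by its square, which is undone during decryption by a square root. The sketch below illustrates why this is possible in characteristic 2, under two assumptions that are not taken from the paper: the change of colour is read as plain squaring, and the small field $\mathbb{F}_{2^8}$ with the reduction polynomial $x^8 + x^4 + x^3 + x + 1$ stands in for $\mathbb{F}_{2^{32}}$. In $\mathbb{F}_{2^m}$ every element satisfies $a^{2^m} = a$, so raising to the power $2^{m-1}$ inverts squaring.

```python
def gf256_mul(a, b, modulus=0x11B):
    """Multiply two elements of GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        a <<= 1
        if a & 0x100:
            a ^= modulus         # reduce when the degree reaches 8
        b >>= 1
    return result


def gf256_pow(a, exponent):
    """Raise a to a non-negative integer power by square-and-multiply."""
    result = 1
    while exponent:
        if exponent & 1:
            result = gf256_mul(result, a)
        a = gf256_mul(a, a)
        exponent >>= 1
    return result


def square(a):
    return gf256_mul(a, a)


def square_root(a):
    """Inverse of squaring in GF(2^8): a -> a^(2^7)."""
    return gf256_pow(a, 2 ** 7)


if __name__ == "__main__":
    # Squaring permutes the field, and square_root undoes it for every element.
    print(all(square_root(square(x)) == x for x in range(256)))
```

For the field $\mathbb{F}_{2^{32}}$ used in the text the same argument gives the exponent $2^{31}$; only the reduction polynomial and the word size would change.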
On Graphs Defined by Equations and Cubic Multivariate Public Keys
At the end Alice announces the multivariate rule to public users. Notice that for the generation of this private key Alice uses only the operations of addition and multiplication in the multivariate ring $\mathbb{F}_{2^{32}}[x_1, x_2, x_3, \dots, x_n]$.
Encryption Process. Public user Bob writes his message $p = (p_1, p_2, \dots, p_n)$ from the space $(\mathbb{F}_{2^{32}})^n$. He computes the tuple $(f_1(p_1, p_2, \dots, p_n), f_2(p_1, p_2, \dots, p_n), \dots, f_n(p_1, p_2, \dots, p_n))$, which is the ciphertext $c$. The theoretical estimate of the execution time is $O(n^4)$. Let $D(n)$ be the density of the public rule $F$, i.e. the total number of monomial terms in all the multivariate polynomials $f_1, f_2, f_3, \dots$. The execution time is $cD(n)$, where the constant $c$ is the time needed to compute a single cubic monomial term. We can encode each character of $\mathbb{F}_{2^{32}}$ by four symbols of $\mathbb{F}_{2^8}$. Thus we can identify the plaintext and the ciphertext with tuples of binary symbols of length 1024. So we can encrypt files with extensions .doc, .jpg, .avi, .tif, .pdf, etc.
Decryption Procedure. Alice has the private key, which consists of the sequence $t_1, t_2, \dots, t_s$ and the matrices ${}^1T$ and ${}^2T$. Assume that she got a ciphertext $c$ from Bob. She computes ${}^2T^{-1} \times c = {}^1c$ and treats this vector as the line $[{}^1l] = [c_1, c_2, c_3, \dots, c_n]$. Alice computes the parameter $d = c_1^{2^{31}}$. She changes the colour of $[{}^1l]$ to $d + t_{256}$ and gets the line $[l] = [d + t_s, c_2, \dots, c_n]$. Alice has to form the path in the graph $A(n, \mathbb{F}_{2^{32}})$ with the starting line $[l]$ and further elements defined by the colours $d + t_{s-1}, d + t_{s-2}, \dots, d + t_1$ and $d$. So she computes the destination line $[{}^1l] = [d, d_{1,1}, d_{1,2}, \dots, d_{128,128}]$. Finally Alice computes the plaintext $p$ as $[{}^1l] \times {}^1T^{-1}$.
Table 1. Public map generation time (ms), $D(n, \mathbb{F}_{2^{32}})$, case 1, length of the path ($s = 2l$)
n      32        64        128       256
16     48        100       212       420
32     648       1372      2816      5712
64     8397      19454     41568     85783
128    139366    357361    824166    1758059

Table 2. Public map generation time (in ms), $D(n, \mathbb{F}_{2^{32}})$, case 2, length of the path ($s = 2l$)
n      32        64        128       256
16     140       268       524       1036
32     2328      4541      8968      17828
64     40417     77480     151592    299844
128    812140    1526713   2946022   5792889

Table 3. Public map, number of nonzero coefficients, $D(n, \mathbb{F}_{2^{32}})$, case 1
       length of w
n      16        32        64        128       256
16     4679      4679      4679      4679      4679
32     52570     59873     59873     59873     59873
64     490906    729992    847109    847109    847109
128    4214042   7165704   10829396  12705549  12705549

Table 4. Public map, number of nonzero coefficients, $D(n, \mathbb{F}_{2^{32}})$, case 2
       length of w
n      16        32        64        128       256
16     15504     15504     15504     15504     15504
32     209440    209440    209440    209440    209440
64     3065920   3065905   3065920   3065920   3065920
128    46866560  46866560  46866560  46866560  46866560
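The coefficient counts reported for case 2 can be checked against the largest possible density of a cubic rule. A polynomial of degree at most 3 in $n$ variables has at most $\binom{n+3}{3}$ monomials, so a rule made of $n$ such polynomials has at most $n\binom{n+3}{3}$ nonzero coefficients, which grows like $n^4/6$ and agrees with the $O(n^4)$ encryption estimate. The helper name `max_density` below is an assumption introduced for this sketch.

```python
from math import comb


def max_density(n, degree=3):
    """Largest possible number of monomial terms in n polynomials of the given degree.

    comb(n + degree, degree) counts the monomials of total degree <= degree
    in n variables (including the constant term).
    """
    return n * comb(n + degree, degree)


if __name__ == "__main__":
    for n in (16, 32, 64, 128):
        print(n, max_density(n))
```

For n = 16, 32, 64 and 128 this gives 15504, 209440, 3065920 and 46866560, which are the values listed in Table 4 (apart from the single entry 3065905 for n = 64 and length 32), suggesting that the case 2 maps are essentially of full density, while the case 1 counts in Table 3 stay well below this bound.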
8 Conclusions
Multivariate cryptography in the classical case $K = \mathbb{F}_q$ is covered in [5–7]. The public rule $F$ defines an automorphism $\sigma_n$ of the multivariate ring $K[x_1, x_2, \dots, x_n]$ given by its values on the variables $x_i$. Its degree can be defined as the maximum of the degrees of the polynomials $f_i$. For $F$ to serve as an efficient encryption tool, the degree of $\sigma_n$ should be bounded by some constant $c$; the cases $c = 2$ and $c = 3$ are popular. A multivariate public key scheme assumes that the rule $F$ is given publicly. Public users use it for encryption; they are unable to decrypt because the information on $G$ is not given. Presumably $G$ has to be of high degree to resist approximation attempts. The key owner (Alice) is supposed to have some additional piece $T$ of private information about the pair $(F, G)$ in order to decrypt a ciphertext obtained from a public user (Bob). In [45] the following formalization of $T$ is given. We say that the family $\sigma_n$, $n = 2, 3, \dots$ has a trapdoor accelerator ${}^nT$ if the knowledge of the piece of information ${}^nT$ allows one to compute the reimage $x$ of $y = \sigma_n(x)$ in time $O(n^2)$. We use families of extremal algebraic graphs which approximate an infinite forest (or tree) for the construction of families of automorphisms $\sigma_n$ with trapdoor accelerators and $(\sigma_n)^{-1}$ of large order. We use bipartite regular graphs ${}^nG(K)$ with partition sets $K^n$ (the set of points and the set of lines), such that the incidence relation between a point and a line is given by a system of linear equations over $K$, and the projective limit of the bipartite graphs ${}^nG(K)$ is well defined and tends to an infinite regular forest. Two families, $D(n, K)$ and $A(n, K)$, defined over an arbitrary integrity domain $K$, i.e. a commutative ring without zero divisors, are known.
To define trapdoor accelerator for the family σn , n = 2, 3, . . . we use special walks on graphs n G(K) and n G(K[x1 , x2 , . . . , xn ]). This way in the case of K = F2m we construct trapdoor accelerator n T for the special map σn with the inverse of order > 3 × 2m−1 . So we can construct a family of public keys working with the space of plaintexts F2n64 and multivariate rule F with the inverse of order > 3 × 263 . In this paper we discuss implemented not so ambitious case m = 32 (the examples of trapdoor accelerators 5.2.2 and 5.2.1 of Sect. 5) which is also can give secure and efficient cubic cryptosystem. Partial results in the case 5.2. 1 were presented in note [41]. We introduce here trapdoor accelerators 5.3.1 and 5.3.2 based on graphs A(n, q) and D(n, q) with q − 1 = 0 (mod 3, corresponding bijective public keys were implemented in some selected cases. Section 5 also contains the description of trapdoor accelerators 5.4.1 and 5.4.2 defined in the cases of commutative rings K with zero divisors, some of them were implemented in the special cases of arithmetical rings Z2t but we are going to present the results of computer simulations in the nearest future. Groups generated by transformations presented in Examples 5.1.1 and 5.1.2 can be used as platforms of Noncommutative Cryptography (see [39,40]). They allow Alice and Bob to elaborate the collision cubical transformation H on the affine space K n , where K is a finite field or arithmetical ring Zm and n is freely selected positive number. Thus the following alternative to public key option asymmetric algorithm is possible (1) Alice and Bob elaborate H via the protocol. (2) Alice generates the described above map G in its standard form with the trapdoor accelerator. 3 Alice sends G+H to Bob. So Bob uses G for the encryption and Alice decrypt because of her knowledge on trapdoor accelerator. 
The security of this scheme rests on the security of the underlying protocols, which in turn rests on the complexity of the word decomposition problem in the affine Cremona group. For the safe delivery of a multivariate rule from one correspondent to the partner, one can also use the protocol of [47], based on the subgroup of the affine Cremona semigroup generated by endomorphisms of linear degree and minimal density.
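The three-step exchange above (agree on H, send G + H, recover G) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: polynomial maps are stored as lists of dictionaries from exponent tuples to coefficients, the ring is the toy ring $Z_{256}$, and the map `G` is a simple triangular cubic map standing in for the graph-based maps of the paper, which are not reproduced here.

```python
from collections import defaultdict

# A polynomial map is a list of polynomials; each polynomial is a dict
# {exponent tuple: coefficient} over Z_m. All names are hypothetical.

def add_maps(F, G, m):
    """Coordinate-wise sum of two polynomial maps over Z_m."""
    out = []
    for f, g in zip(F, G):
        acc = defaultdict(int)
        for mono, c in list(f.items()) + list(g.items()):
            acc[mono] = (acc[mono] + c) % m
        out.append({mono: c for mono, c in acc.items() if c})  # drop zero terms
    return out

def neg_map(F, m):
    """Additive inverse of a polynomial map over Z_m."""
    return [{mono: (-c) % m for mono, c in f.items()} for f in F]

def evaluate(F, x, m):
    """Value of the polynomial map F at the point x in (Z_m)^n."""
    result = []
    for f in F:
        total = 0
        for mono, c in f.items():
            term = c
            for xi, e in zip(x, mono):
                term = term * pow(xi, e, m) % m
            total = (total + term) % m
        result.append(total)
    return result

m = 256  # toy modulus (assumption)
# Toy cubic map G: (x1, x2) -> (x1 + 1, x2 + x1^3); it is triangular, hence bijective.
G = [{(1, 0): 1, (0, 0): 1}, {(0, 1): 1, (3, 0): 1}]
# Shared cubic map H, standing in for the output of the protocol.
H = [{(1, 1): 5, (0, 3): 7}, {(2, 1): 3, (0, 0): 9}]

masked = add_maps(G, H, m)                       # what Alice sends
recovered = add_maps(masked, neg_map(H, m), m)   # what Bob computes

plaintext = [2, 3]
ciphertext = evaluate(recovered, plaintext, m)   # Bob encrypts with G

# Alice inverts the toy map step by step (her "trapdoor" here is the triangular form).
x1 = (ciphertext[0] - 1) % m
x2 = (ciphertext[1] - pow(x1, 3, m)) % m
decrypted = [x1, x2]
```

In this sketch an eavesdropper sees only `masked`; recovering `G` from it requires `H`, which is why the security of the exchange reduces to that of the protocol producing `H`.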
References
1. Bodnarchuk, Yu.: Every regular automorphism of the affine Cremona group is inner. J. Pure Appl. Algebra 157, 115–119 (2001)
2. Bollobás, B.: Extremal Graph Theory. Academic Press 1978, Dover (2004)
3. Buekenhout, F. (ed.): Handbook on Incidence Geometry. North Holland, Amsterdam (1995)
4. Canteaut, A., Standaert, F.-X. (eds.): Eurocrypt 2021, Part I. LNCS, vol. 12696, 839p. Springer, Heidelberg (2021). https://doi.org/10.1007/978-3-030-77870-5
5. Ding, J., Gower, J.E., Schmidt, D.S.: Multivariate Public Key Cryptosystems. AIS, vol. 80, 260p. Springer, New York (2006). https://doi.org/10.1007/978-10716-0987-3
6. Goubin, L., Patarin, J., Yang, B.-Y.: Multivariate Cryptography, Encyclopedia of Cryptography and Security, 2nd edn, pp. 824–828 (2011)
7. Koblitz, N.: Algebraic Aspects of Cryptography, 206p. Springer, Heidelberg (1998). https://doi.org/10.1007/978-3-662-03642-6
V. Ustimenko et al.
8. Lazebnik, F., Ustimenko, V.: Some algebraic constructions of dense graphs of large girth and of large size. DIMACS Series in Discrete Mathematics and Theoretical Computer Science 10, 75–93 (1993)
9. Lazebnik, F., Ustimenko, V.A.: New examples of graphs without small cycles and of large size. Europ. J. Comb. 14, 445–460 (1993)
10. Lazebnik, F., Ustimenko, V., Woldar, A.J.: A new series of dense graphs of high girth. Bull. AMS 32(1), 73–79 (1995)
11. Lubotzky, P., Sarnak, A., Lubotsky, R., Philips, P.S.: Ramanujan graphs. J. Comb. Theory 115(2), 62–89 (1989)
12. MacKay, D.J.C., Postol, M.S.: Weaknesses of Margulis and Ramanujan-Margulis low-density parity-check codes. Electron. Notes Theor. Comput. Sci. 74, 97–104 (2003)
13. Margulis, G.A.: Explicit construction of graphs without short cycles and low density codes. Combinatorica 2, 71–78 (1982)
14. Noether, M.: Luigi Cremona. Math. Ann. 59, 1–19 (1904)
15. Post-Quantum Cryptography, Call for Proposals. https://csrc.nist.gov/Project; Post-Quantum-Cryptography-Standardization/Call-for-Proposals, Post-Quantum Cryptography: Round 2 Submissions
16. Shafarevich, I.R.: On some infinite dimension groups II. Izv. Akad. Sci. Ser. Math. 2(1), 214–226 (1981)
17. Sharma, D., Ustimenko, V.: Special graphs in cryptography. In: The Poster Papers Collection, Third International Workshop on Practice and Theory in Public Key Cryptography (PKC 2000), Melbourne Exhibition Centre, Australia, January 2000, pp. 16–19 (2000)
18. Tits, J.: Buildings of Spherical Type and Finite BN-Pairs. LNM. Springer, Heidelberg (1974). https://doi.org/10.1007/978-3-540-38349-9
19. Ustimenko, V.: Linguistic dynamical systems, graphs of large girth and cryptography. J. Math. Sci. 140(3), 412–434 (2007). Springer
20. Ustimenko, V.: On the Extremal Graph Theory and Symbolic Computations, no. 2, pp. 42–49. Dopovidi National Academy of Science, Ukraine (2013)
21. Ustimenko, V.: CRYPTIM: graphs as tools for symmetric encryption. In: Boztaş, S., Shparlinski, I.E. (eds.) AAECC 2001. LNCS, vol. 2227, pp. 278–286. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-45624-4_29
22. Ustimenko, V.: On the graph based cryptography and symbolic computations. Serdica J. Comput. Proceedings of International Conference on Applications of Computer Algebra 2006, Varna, N1 (2007)
23. Ustimenko, V., Romańczuk, U.: On extremal graph theory, explicit algebraic constructions of extremal graphs and corresponding Turing encryption machines. In: Yang, X.S. (ed.) Artificial Intelligence, Evolutionary Computing and Metaheuristics. SCI, vol. 427, pp. 257–285. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-29694-9_11
24. Ustimenko, V.: On infinite connected real networks without cycles and pseudorandom and random real sequences. In: Isaac Newton Institute, Workshop Fractional Kinetics, Hydrodynamic Limits and Fractals, 21.03.2022–25.03.2022, Cambridge, UK (2022)
25. Ustimenko, V.: Coordinatisation of trees and their quotients, in the Voronoi’s impact on modern science. Kiev, Inst. Math. 2, 125–152 (1998)
26. Khmelevsky, Yu., Ustimenko, V.: Walks on graphs as symmetric and asymmetric tools for encryption. South Pac. J. Nat. Stud. 20, 23–41 (2002). http://www.usp.ac.fj/spjns
27. Khmelevsky, Yu., Govorov, M., Sharma, P., Ustimenko, V., Dhanjal, S.: Security solutions for spatial data in storage (implementation case within Oracle 9iAS). In: Proceedings of 8th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2004) Orlando, USA, 18–21 July 2004, pp. 318–323 (2004)
28. Govorov, M., Khmelevsky, Y., Khorev, A., Ustimenko, V.: Security control for spatial warehouses. In: Proceedings of 21st International Cartographic Conference (ICC), Durban, South Africa, pp. 1784–1794 (2003)
29. Khmelevsky, Yu., Ustimenko, V.: Practical aspects of the Informational Systems reengineering. South Pac. J. Nat. Sci. 21, 75–21 (2003). http://www.usp.ac.fj/spjns/volume21
30. Govorov, M., Khmelevsky, Y., Ustimenko, V., Chorev, A., Fisher, P.: Security for GIS N-tier architecture. In: Govorov, M., Khmelevsky, Y., Ustimenko, V., Khorev, A., Fisher, P. (eds.) Development Spatial Data Handling, pp. 71–83. Springer, Heidelberg (2005). https://doi.org/10.1007/3-540-26772-7_6
31. Tousene, A., Ustimenko, V.: CRYPTALL - a system to encrypt all types of data. Not. Kiev - Mohyla Acad. 23, 12–15 (2004)
32. Tousene, A., Ustimenko, V.: Graph based private key crypto-system. Int. J. Comput. Res. 13(4), 12p. (2005). Nova Science Publisher
33. Touzene, A., Ustimenko, V.: Private and public key systems using graphs of high girth. In: Chen, R.E. (ed.) Cryptography Research Perspectives, pp. 205–216. Nova Publishers, Hauppauge (2008)
34. Touzene, A., Ustimenko, V., Al Raisi, M., Boudelioua, I.: Performance of algebraic graphs based stream ciphers using large finite fields. Annales UMCS Informatica AI XI 2, 81–93 (2011)
35. Romanczuk-Polubiec, U., Ustimenko, V.: On two windows multivariate cryptosystem depending on random parameters. Algebra Discrete Math. 19(1), 101–129 (2015)
36. Ustimenko, V.: On multivariate algorithms of digital signatures of linear degree and low density. IACR Cryptology ePrint Archive 2020:1015 (2020)
37. Ustimenko, V.: On multivariate algorithms of digital signatures based on maps of unbounded degree acting on secure El Gamal type mode. IACR Cryptology ePrint Archive 2020:1116 (2020)
38. Ustimenko, V.A., Wroblewska, A.: Dynamical systems as the main instrument for the constructions of new quadratic families and their usage in cryptography. Annales UMCS Informatica, AI XII 3, 65–74 (2012)
39. Ustimenko, V., Klisowski, M.: On non-commutative cryptography with cubical multivariate maps of predictable density. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) CompCom 2019. AISC, vol. 998, pp. 654–674. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22868-2_47
40. Ustimenko, V., Klisowski, M.: On new protocols of noncommutative cryptography in terms of homomorphism of stable multivariate transformation groups. J. Algebra Discrete Math. 220–250 (2023)
41. Ustimenko, V., Chojecki, T., Klisowski, M.: On the implementations of new graph based cubic Multivariate Public Keys. In: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, ACSIS, vol. 35, pp. 1179–1184 (2023)
42. Ustimenko, V.: Affine system of roots and Tits geometries, Voprosy teorii grupp i gomologicheskoy algebry, Yaroslavl, pp. 155–157 (1989). (in Russian)
43. Ustimenko, V.: On the embeddings of some geometries and flag systems in Lie algebras and superalgebras, in Root systems, representations and geometries, pp. 3–16. Kiev, IM AN UkrSSR (1990)
44. Futorny, V.M., Ustimenko, V.A.: On the Hecke algebras corresponding to Tits geometries. In: Root Systems, Representations and Geometries, pp. 17–31. IM AN UkrSSR, Kiev (1990)
45. Ustimenko, V.: On Extremal Algebraic Graphs and Multivariate Cryptosystems. IACR e-print archive, 2022/1537
46. Ustimenko, V.: Graphs in terms of algebraic geometry, symbolic computations and secure communications in post-quantum world, p. 198. UMCS Editorial House, Lublin (2022)
47. Ustimenko, V.: On Eulerian semigroups of multivariate transformations and their cryptographic applications. Eur. J. Math. 9, 93 (2023)
48. Ustimenko, V.: On new results on extremal graph theory, theory of algebraic graphs and their applications in cryptography and coding theory. Reports of Natl. Acad. Sci. Ukraine 4, 42–49 (2022)
49. Wroblewska, A.: Lingwistyczne uklady dynamiczne oparte na grafach algebraicznych i ich zastosowanie w kryptografii, PAN Instytut Podstawowych Problemow Techniki, rozprawa doktorska. Promotor prof. Vasyl Ustymenko, Warszawa (2016)
50. Wroblewska, A.: On some properties of graph based public keys. Albanian J. Math. 2(3), 229–234 (2008). NATO Advanced Studies Institute: “New challenges in digital communications”
51. Ustimenko, V., Wroblewska, A.: On the key exchange with nonlinear polynomial maps of stable degree. Annales UMCS Informatica AI XI(2), 81–93 (2011)
52. Ustimenko, V., Wroblewska, A.: On new examples of families of multivariate stable maps and their cryptographical applications. Ann. UMCS Informatica 14(1), 19–35 (2014)
53. Ustimenko, V., Wroblewska, A.: On the key exchange with nonlinear polynomial maps of degree 4. In: Proceedings of the Conference “Applications of Computer Algebra” Vlora, Albanian J. Math. (December) 4(4), 161–170 (2010)
54. Klisowski, M., Romanczuk, U., Ustimenko, V.: On public keys based on a new family of algebraic graphs. Annales UMCS Informatica AI XI(2), 127–141 (2011)
55. Klisowski, M.: Zwiekszenie bezpieczenstwa kryptograficznych algorytmów wielu zmiennych bazujacych na algebraicznej teorii grafow, Politechnika Czestochowska, Wydzial Inzynierii Mechanicznej i Informatyki, rozprawa doktorska. Promotor prof. dr hab. Vasyl Ustymenko, Czestochowa (2014)
56. Klisowski, M., Ustimenko, V.: On the public keys based on the extremal graphs and digraphs. In: International Multiconference on Computer Science and Informational Technology, October 2010, Wisla, Poland, CANA Proceedings, 12 p. (2010)
57. Klisowski, M., Ustimenko, V.: On the implementation of cubic public rules based on algebraic graphs over the finite commutative ring and their symmetries. In: MACIS2011: Fourth International Conference on Mathematical Aspects of Computer and Information Sciences, Beijing, 13 p. (2011)
58. Klisowski, M., Ustimenko, V.: On the comparison of cryptographical properties of two different families of graphs with large cycle indicator. Math. Comput. Sci. 6(2), 181–198 (2012)
59. Kotorowicz, S.J.: Kryptograficzne algorytmy strumieniowe oparte na specjalnych grafach algebraicznych, Wydzial Matematyki, Fizyki i Informatyki Uniwersytet Marii Curie-Sklodowskiej w Lublinie. Rozprawa doktorska napisana pod kierunkiem prof. dr hab. Vasyla Ustimenko, IPPT PAN, Warszawa (2014)
60. Kotorowicz, S., Ustimenko, V.: On the properties of stream ciphers based on extremal directed graphs. In: Chen, R.E. (ed.) Cryptography Research Perspectives. Nova Publishers (2008)
61. Kotorowicz, S.J., Ustimenko, V.A.: On the implementation of cryptoalgorithms based on algebraic graphs over some commutative rings. In: Condensed Matter Physics, Special Issue: Proceedings of the International Conferences on Finite Particle Systems, Complex Systems Theory and Its Application, Kazimerz Dolny, Poland, 11 no. 2 (54), 2008, 347–360 (2006)
62. Kotorowicz, J., Romanczuk, U., Ustimenko, V.: Implementation of stream ciphers based on a new family of algebraic graphs. In: Proceedings of Federated Conference on Computer Science and Information Systems (FedCSIS), 13 p. (2011)
63. Khmelevsky, Y., Hains, G., Ozan, E., Kluka, C., Ustimenko, V., Syrotovsky, D.: International Cooperation in SW Engineering Research Projects. In: Proceedings of Western Canadian Conference on Computing Education, University of Northern British Columbia, Prince George BC, May 6–7, 2011, 14 pp.
64. Futorny, V., Ustimenko, V.: On small world semiplanes with generalised Schubert cells. Acta Appl. Math. 98(1), 47–61 (2007)
65. Ustimenko, V., Romanczuk-Polubiec, U., Wroblewska, A., Polak, M., Zhupa, E.: On the implementation of new symmetric ciphers based on non-bijective multivariate maps. In: Ganzha, M., Maciaszek, L., Paprzycki, M. (eds.) Proceedings of the 2018 Federated Conference on Computer Science and Information Systems. ACSIS, vol. 15, pp. 397–405 (2018)
66. Ustimenko, V., Romańczuk-Polubiec, U., Wróblewska, A., Polak, M., Zhupa, E.: On the constructions of new symmetric ciphers based on non-bijective multivariate maps of prescribed degree. Secur. Commun. Netw. 2, 2137561, 15 p. (2019)
67. Ustimenko, V.: On algebraic graph theory and nonbijective maps in cryptography. Algebra Discrete Math. 20(1), 152–170 (2015)
68. Polak, M.K.: Wykorzystanie algebraicznej Teorii Grafow w kodowaniu, Wydzial Matematyki, Fizyki i Informatyki Uniwersytet Marii Curie-Sklodowskiej w Lublinie. Rozprawa doktorska napisana pod kierunkiem prof. dr hab. Vasyla Ustimenko, Lublin, 26 kwietnia (2016)
69. Polak, M., Ustimenko, V.A., Wroblewska, A.: On multivariate cryptosystems based on edge transitive graphs. In: Third International Conference on Symbolic Computations and Cryptography, Castro Urdiales, 9–13 July 2012, Extended Abstracts, pp. 160–164 (2012)
70. Ustimenko, V.: On the extremal graph theory for directed graphs and its cryptographical applications. In: Advances in Coding Theory and Cryptography, Series on Coding Theory and Cryptology, vol. 3, pp. 181–200. World Scientific (2007)
71. Ustimenko, V., Romanczuk, U.: Finite geometries. LDPC codes and cryptography, Lublin, UMCS (2012)
72. Ustimenko, V.: On extremal graph theory and symbolic computations, vol. 2, pp. 42–49. Dopovidi National Academy of Sci, Ukraine (2013)
73. Ustimenko, V., Romańczuk, U.: On dynamical systems of large girth or cycle indicator and their applications to multivariate cryptography. In: Yang, X.S. (ed.) Artificial Intelligence, Evolutionary Computing and Metaheuristics. SCI, vol. 427, pp. 231–256. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-29694-9_10
74. Polak, M., Romanczuk, U., Ustimenko, V., Wroblewska, A.: On the applications of extremal graph theory to coding theory and cryptography. Electron. Notes Discrete Math. 43, 329–342 (2013)
75. Ustimenko, V.: On new multivariate cryptosystems based on hidden Eulerian equations. Dopovidi Natl. Acad. Sci. Ukraine 5, 17–24 (2017)
76. Romanczuk-Polubiec, U., Ustimenko, V.A.: On new key exchange multivariate protocols based on pseudorandom walks on incidence structures. Dopovidi NAN Ukrainy 1, 41–49 (2015)
77. Priyadarsini, P.L.K.: A survey on some applications of graph theory in cryptography. J. Discret. Math. Sci. Cryptogr. 18(3), 209–217 (2015)
78. Ustimenko, V.: Maximality of affine group, hidden graph cryptosystem and graph’s stream ciphers. J. Algebra Discret. Math. 1, 51–65 (2005)
79. Ustimenko, V.: Graphs in Terms of Algebraic Geometry Symbolic Computations and Secure Communications in Post-quantum World. Maria Curie-Sklodowska University Press, Lublin (2022)
80. Ustimenko, V.: On inverse protocols of post quantum cryptography based on pairs of non-commutative multivariate platforms used in tandem. IACR Cryptology ePrint Archive 2019:897 (2019)
System Tasks of Digital Twin in Single Information Space
Mykola Korablyov1(B), Sergey Lutsky1, Anatolii Vorinin2, and Ihor Ivanisenko3,4
1 Kharkiv National University of Radio Electronics, Kharkiv 61166, Ukraine
[email protected]
2 Simon Kuznets Kharkiv National University of Economics, Kharkiv 61166, Ukraine
[email protected]
3 University of Jyväskylä, 40014 Jyväskylä, Finland
[email protected]
4 Kharkiv National Automobile and Highway University, Kharkiv, Ukraine
Abstract. An approach to solving the system tasks of a digital twin in a single information space is considered, taking uncertainty into account. The system tasks of the digital twin are solved on the basis of system-information models using the Unified System Information Space (USIS) and Product Lifecycle System Information (PLSI) software products, which form a platform for product lifecycle management and are system-compatible with Product Lifecycle Management (PLM). One of the information system problems addressed is estimating the value of the system uncertainty from the system-information model of the dynamic processes of the digital twin in the system (USIS + PLSI + PLM); these processes characterize the stability of the system and optimize its state. A mathematical model of information interaction between the elements of the system is proposed and formalized with the methodology of the system-information approach. The presented approach to describing the system-information process and system is a theoretical platform for solving the system problems of analysis, synthesis, identification, observability, forecasting, estimation, solvability, control, stability, optimization, and dynamics on the basis of system-information models of the processes and systems of a digital twin built on (USIS + PLSI + PLM).
Keywords: Digital twin · System problem · Uncertainty · System information model · Dynamic process · Stability · Optimization
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): FICC 2024, LNNS 921, pp. 47–56, 2024. https://doi.org/10.1007/978-3-031-54053-0_4

1 Introduction

A digital twin is a digital copy of any process, system, or physical asset that enhances the functionality of the virtual and real systems serving the stated goals. Digital twins can be created for any “real world” scenario as part of the manufacturing process. However, as indicated in [1], the theoretical basis and practical implementation of digital twins do not yet fully comply with this concept, since there is no universal model and there are no standards for digital twins. Some technical and domain-related problems therefore remain to be solved. In [2], it is noted that the design of intelligent production systems based
M. Korablyov et al.
on digital twins is a difficult task, since the concept of digital twins in this case remains uncertain. This calls for a new approach to the analysis and modeling of digital twins that provides a scientific basis for studying information models of the processes and systems of the product life cycle under uncertainty, on which algorithms and software products for the digital twin can be built. The digital twin, as an information system, inherits the classical tasks of general systems theory [3], which require principles for formalizing system tasks and solving them on the basis of system-information models. This rests theoretically on the concept of the system information of objects [4–7]. The solution of the system tasks of the digital twin is based on the Unified System Information Space (USIS) and Product Lifecycle System Information (PLSI) software products, which are a platform for product lifecycle management; it also draws on the concept of the system information of objects [8] and is implemented with system-information models. The USIS and PLSI of the digital twin form a system whose elements are the parameters of the processes and systems of real production, presented in a form that characterizes their system-information properties. One of the information system problems addressed is estimating the value of the uncertainty of the parameters of a technical (closed) system from the system-information model of the dynamic processes of the digital twin in the system (USIS + PLSI + PLM), which characterize the stability of the system and optimize its state. System-information models of processes and systems based on USIS and PLSI make it possible to solve the system problems of a digital twin in terms of the quantity, quality, and value of the system information that characterizes the parameters of a production process or system.
Determining the value of system information is based on measuring the parameter and its uncertainty. Measurement errors always contain a random component. Therefore, to estimate the parameter correctly and determine the objective value of system information, the random component of the measurements (the expanded uncertainty) should not exceed the accuracy tolerance of the parameter (the sensitivity threshold). In this case, the system information of the parameters of the processes and systems of real production acquires a deterministic character. The system tasks of digital twins based on system-information models are related to the methods of managing the uncertainty of the parameters of the processes and systems of real production. The system tasks of the digital twin are the questions of systems theory: analysis, synthesis, identification, observability, forecasting, evaluation, solvability, control, stability, dynamics, optimization, and others. The paper considers the solution of digital twin system problems, taking uncertainty into account, implemented on the basis of system-information models using the USIS and PLSI software products, which are system-compatible with PLM. Section 2 discusses the system-information approach to modeling digital twins, which is the next evolutionary stage in the automation and computerization of production processes. Section 3 considers the system tasks of the digital twin: it presents the mathematical model of the system-information process of the digital twin, analyzes the dynamics of the elements of the information system, and gives an example of solving the problem of the dynamics of the elements of the information system. Section 4 presents the conclusion.
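The criterion above (expanded uncertainty not exceeding the sensitivity threshold) can be made concrete with a short sketch. It assumes a Type A evaluation from repeated readings and a coverage factor of 2; neither choice is stated in the text, and the function names and sample readings are hypothetical.

```python
import statistics

def expanded_uncertainty(readings, coverage_factor=2.0):
    """Expanded uncertainty of the mean of repeated readings.

    Assumption: Type A evaluation, i.e. the standard uncertainty is the
    sample standard deviation divided by sqrt(n), scaled by a coverage factor.
    """
    n = len(readings)
    s = statistics.stdev(readings)   # sample standard deviation
    u = s / n ** 0.5                 # standard uncertainty of the mean
    return coverage_factor * u

def is_deterministic(readings, sensitivity_threshold, coverage_factor=2.0):
    """True if the expanded uncertainty does not exceed the threshold."""
    return expanded_uncertainty(readings, coverage_factor) <= sensitivity_threshold

# Hypothetical repeated measurements of one production parameter.
readings = [10.02, 9.98, 10.01, 9.99, 10.00]
U = expanded_uncertainty(readings)
```

With these readings the expanded uncertainty is about 0.014, so the parameter would count as deterministic for a threshold of 0.05 but not for a threshold of 0.01.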
2 System-Information Approach to Modeling Digital Twins

The system-information approach is a scientific direction in information theory. It is a set of scientific methods for modeling processes and systems based on the concepts of the quantity, quality, and value of the system information of physical quantities (parameters), which are the system-information characteristics of the state of a process or system. According to its theoretical provisions, system information is carried by the set of object properties, the time at which these properties manifest themselves, and the place in space where they do so. The information properties of the system manifest themselves through the physical interaction of objects, which is formalized by a mathematical system-information model. The formalization of system-information models of objects is based on system information that reflects the system-information features of objects [9]. The methodology of system-information modeling is based on the concept of system information, which is characterized by a quantitative indicator of the communication ability of an object to exchange information with the environment [9, 10]. In the process of exchanging system information, the object changes its state by a value that is a multiple of its threshold of sensitivity to the influencing object. The mathematical model of information interaction between the elements of the system is formalized on the basis of the laws, regularities, and established rules of the methodology of the system-information approach. The information process of the system is considered as a whole, with the system information being redistributed between the elements of the system. The formalization of the system information of the digital twin covers the a priori and a posteriori system information that the elements of the system possess, as well as the quality and value of the system information that characterizes the result of the process [9].
The a priori information model characterizes the features of an information object before interaction. The a posteriori information model characterizes the features of an information object as a result of an interaction. The model of the information process characterizes the features of the process of interaction between objects. Modern research in general systems theory integrates the developments accumulated in “classical” general systems theory, cybernetics, systems analysis, operations research, systems engineering, and synergetics. The main idea of the general theory of systems is the recognition of the isomorphism of the laws governing the functioning of system objects [11]. The subject of research within the general theory of systems is the study of the various classes and types of systems, the basic principles and patterns of their behavior, and the processes of their functioning and development. From a mathematical point of view, a system is a set on which a pre-given relation with fixed properties is realized. Relations usually act as a requirement for a certain order of communication between the elements of the system: the processes occurring in one element of the system affect, in a certain way, the processes in the other elements. Any system is located and operates in some definite external environment [12]. The interaction of the system with the external environment is carried out through the input and output of the system. Here the input is understood as a point or area of influence on the system from the outside, and the output as a point or area of influence of the system on the outside. The system can be in different states. The state of any system at
a certain point in time can be characterized with a certain accuracy by a set of values of internal state parameters. System-information dependencies between the elements of the system are formalized on the basis of the concept of the measure of their communication ability. Communication ability (CA) is the potential ability of the properties of some objects to be “displayed” on other objects discretely. It is a scalar value, the only measure of the different forms of object communication, and is defined as $x_i/\Delta x_i$. The logarithmic unit measure (LUM) of communication ability is defined as $\log_2 \frac{1}{\Delta x_i}$. The logarithmic measure (LM) of communication ability (the amount of system information) is defined as $\log_2 \frac{x_i}{\Delta x_i}$, where $x_i$ is the feature value and $\Delta x_i$ is the sensitivity threshold [7]. The theoretical foundations of the methodology of system-information modeling of processes and systems are based on the reflection of system information: object properties (intensity X), space (length L), and time (duration T). System-information modeling considers fragments of the reality in which information objects manifest themselves, combined in various configurations: intensity and duration (X, T); intensity and length (X, L); intensity, duration, and length (X, T, L), within the framework of homogeneity. The system information of an object is characterized by an information measure and an information norm [13]. The information measure $|I(X)|$ is a function of the absolute value of the qualitative and/or quantitative proportion of the attribute to the established measure and is determined by the ratio

$$ |I(X)| = f\left(\frac{X_{\max} - X_{\min}}{\Delta x}\right). \tag{1} $$

The information norm, denoted here by $N(X)$, indicates the place of the particular in the general and is equal to the ratio of the general possible value of the attribute to its particular value:

$$ N(X) = f\left(\frac{X_{\max} - X_{\min}}{x}\right) = f\left(\frac{X_{\max} - X_{\min}}{n \cdot \Delta x + X_{\min}}\right), \tag{2} $$

where $x = n \cdot \Delta x + X_{\min}$.
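The measures just introduced can be evaluated numerically. The sketch below assumes that the communication ability is the ratio of the feature value to the sensitivity threshold, that the logarithmic measure is its base-2 logarithm, and that the function f in the information measure is the identity; the last point in particular is an assumption, since the text does not specify f. All function names are hypothetical.

```python
import math

def communication_ability(x, dx):
    """CA: number of sensitivity thresholds dx contained in the value x."""
    return x / dx

def logarithmic_measure(x, dx):
    """LM: amount of system information in bits, log2(x / dx)."""
    return math.log2(x / dx)

def information_measure(x_max, x_min, dx, f=lambda r: r):
    """|I(X)| = f((x_max - x_min) / dx); f defaults to the identity (assumption)."""
    return f((x_max - x_min) / dx)

# Hypothetical parameter: value 8.0 measured with sensitivity threshold 0.25.
ca = communication_ability(8.0, 0.25)           # 32 distinguishable levels
lm = logarithmic_measure(8.0, 0.25)             # 5 bits
levels = information_measure(10.0, 2.0, 0.25)   # 32 thresholds across the range
```

Halving the sensitivity threshold doubles the communication ability and adds exactly one bit to the logarithmic measure, which is the sense in which a finer threshold means more system information.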
The classical approach to the analysis of a “process”, understood as a successive change of phenomena or states in the development of something, constructs the mathematical concept of a system from three processes: the input U, the output Y, and the process X in the state space. To specify a process mathematically, one singles out the set of its values and an ordered set that fixes the sequence in which these values are realized. Often the ordered set is interpreted as time, and one then speaks of processes occurring in time. The ordered set is taken to be the same for all three processes and is called the set of time points T. The ordered set can also have a different interpretation. Each specific system is characterized by its own set of inputs $u(t) \subset U$, the set of all reactions of the system, that is, the set of all outputs $y(t) \subset Y$, and the set of states $x(t) \subset X$. The specific output y(t) at each moment t is completely determined by the state of the system at this moment t alone, and the following relation holds:

$$ y(t) = \eta(t, x(t)), \quad t \in T. \tag{3} $$
The absence of dependence of the output y(t) at the moment t on u(t) can be interpreted as the impossibility of changing the output of the system within an infinitely small time by changing the input action. The mapping $\eta$ is called the output mapping or observation function. It should be noted that at each moment t the system is in a certain state x(t), and the states at times $t > \tau$ are uniquely determined by the state at time $\tau$ and the input segment $u(t, \tau)$. This reflects the principle of determinism (certainty) in the behavior of the system. Formalizing this circumstance establishes the existence of a family of mappings $\sigma$, the transition mappings, which satisfy the relation

$$ x(t) = \sigma(t, \tau, x(\tau), u(t, \tau)), \quad t \in T,\ \tau \in T,\ \tau < t. \tag{4} $$

The requirement that the transition mapping must satisfy is that equality (4) with $\tau = t$, that is $x = \sigma(t, t, x, u)$, holds identically for all $t \in T$, $x \in X$, $u \in U$: over a time interval of zero length the system cannot go to another state (equivalently, the system cannot be in two different states at the same time). Thus, the system generating processes from Y can be defined as the triple $\{\sigma, X, \eta\}$. This triple expresses the law of the system’s behavior and its generating principle. It is more traditional to define the system by specifying the relations that describe the transition mappings $\sigma$. Real production, like the digital twin, is a closed information system, unlike biological systems. The mathematical model of a dynamic information system makes it possible to study and describe not only the evolution of systems over time but also fluctuations and other statistical phenomena. In real production, the accuracy of parameters is controlled by technological methods. For example, the statistical principle of achieving the accuracy of product processing is a method for managing the uncertainty of process parameters in production. The study of methods for solving system problems based on uncertainty management algorithms in digital twin problems is an important direction in the development of modern digital production.
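A discrete-time sketch may help to fix the roles of the transition mapping and the output mapping in the triple of transition mapping, state set, and output mapping. The particular update rule (halve the state and add the input) and the output rule (double the state) are arbitrary illustrative choices, not taken from the text; time is assumed to be integer-valued.

```python
def sigma(t, tau, x_tau, u):
    """Transition mapping: state at time t, given the state x_tau at time tau
    and the input function u on the segment [tau, t).

    Illustrative dynamics (assumption): x(k + 1) = 0.5 * x(k) + u(k).
    """
    if t < tau:
        raise ValueError("the transition is defined only for t >= tau")
    x = x_tau
    for k in range(tau, t):
        x = 0.5 * x + u(k)
    return x

def eta(t, x):
    """Output mapping: the output depends only on the current time and state."""
    return 2.0 * x

def constant_input(k):
    """Hypothetical input signal, equal to 1.0 at every time step."""
    return 1.0

x0 = 4.0
same_state = sigma(2, 2, x0, constant_input)   # zero-length interval
x3 = sigma(3, 0, x0, constant_input)           # 4.0 -> 3.0 -> 2.5 -> 2.25
y3 = eta(3, x3)
```

The value `same_state` equals `x0`, which is the zero-length-interval requirement on the transition mapping, and `y3` is computed from `x3` alone, without reference to the input at time 3.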
3 System Tasks of Digital Twin 3.1 Mathematical Model of System-Information Process of Digital Twin In the system-information process, as in the display, two objects (systems) participate. A distinctive feature of the information process from the display process is that the display process is symmetrical, and the information process is one-way directed [13]. The cause-and-effect chains of the system-information process is a sequence of increments of system information of states I (X A ) and I (X B ) of systems A and B interacting by the value I (UeA,B ) = I (IeB,A ) (see Fig. 1). The set of system information of output, transitional, as well as mappings of coordination of interaction between systems A and B is an information system, which is described by a tuple of sets: f (I ) : I UeA YeB → XeA → YeA UeB → XeB → YeB UeB ,
M. Korablyov et al.
Fig. 1. Scheme of the system-information process of objects A and B
$$I\left(X^A, X^B\right) = I\left(U^A, \sigma^A, X^A, \eta^A, Y^A, \theta, U^B, \sigma^B, X^B, \eta^B, Y^B\right), \tag{5}$$
where $I$ is the system information; $U^A$, $U^B$, $X^A$, $X^B$, $Y^A$, $Y^B$ are the inputs, states and outputs of systems A and B; $\eta^A$, $\eta^B$, $\sigma^A$, $\sigma^B$ are the output and transition mappings of systems A and B; and $\theta$ is the coordination mapping of systems A and B. The presented approach (5) to the description of the system-information process and the system is a theoretical platform for solving system problems of analysis, synthesis, identification, observability, forecasting, evaluation, solvability, control, stability, optimization and dynamics based on the system information models of processes and digital twin systems based on (USIS + PLSI + PLM) systems.

3.2 Analysis of Dynamics of Information System Elements

The conditions of the system-information process have the form:

$$I_Y = \sum_{i=1}^{N} I_{X_i}, \qquad \Delta x_i = U_{X_i},$$

$$\frac{x_1}{U_1} = \frac{x_2}{U_2}, \qquad x_1 = \frac{U_1}{U_2}\,x_2, \qquad dx_1 = \frac{U_1}{U_2}\,dx_2,$$

where $x_1$, $x_2$ are elements of the system and $U_1$, $U_2$ are their expanded uncertainties. The increment of system information $I(dx_1) = I(dx_2)$ determines the characteristics of the dynamics of the information link between the elements $x_1(t)$ and $x_2(t)$, depending on the values $U_1$, $U_2$, which can be a function of time $U_i(x_i)$. The task is to formalize the dynamics of the system-information process:

$$I_Y(t) = \sum_{i=1}^{N} I_{X_i}(t), \qquad \log_2\frac{y(t)}{\Delta y} = \sum_{i=1}^{N} \log_2\frac{x_i(t)}{\Delta x_i}. \tag{7}$$

The problem of system information dynamics is solved in several stages. 1) A matrix of information links of the elements $x_i(t)$ is built:
| N/N | $x_1$ | $x_2$ | … | $x_N$ |
|---|---|---|---|---|
| $x_1$ | $K_{11} = x_1/x_1$ | $K_{12} = x_1/x_2$ | … | $K_{1N} = x_1/x_N$ |
| $x_2$ | $K_{21} = x_2/x_1$ | $K_{22} = x_2/x_2$ | … | $K_{2N} = x_2/x_N$ |
| … | … | … | … | … |
| $x_N$ | $K_{N1} = x_N/x_1$ | $K_{N2} = x_N/x_2$ | … | $K_{NN} = x_N/x_N$ |
The time variable $x_i(t)$ can be an argument of a function; in that case the equation must be solved and its positive roots taken as separate elements $x_i(t)$. 2) A system of equations of dynamics is composed for each pair of elements; the number of such pairs is N × N:

$$\Delta x_i(k) = x_i(k+1) - x_i(k), \qquad i = 1, \ldots, N,$$

where $k$ is discrete time and $\Delta x_i$ is the sensitivity threshold. The coefficients and initial conditions are

$$a_{12} = \frac{x_1}{x_2},\; \ldots,\; a_{1N} = \frac{x_1}{x_N},\; \ldots,\; a_{N1} = \frac{x_N}{x_1},\; \ldots,\; a_{NN} = \frac{x_N}{x_N},$$

$$x_1(0), \ldots, x_N(0), \qquad \frac{x_1}{x_N} = \frac{\Delta x_1}{\Delta x_N}, \qquad x_1 = \frac{\Delta x_1}{\Delta x_N}\,x_N,$$

$$\begin{cases}
x_1(k+1) - x_1(k) = a_{11}x_1(k) + a_{12}x_2(k) + \cdots + a_{1N}x_N(k),\\
x_2(k+1) - x_2(k) = a_{21}x_1(k) + a_{22}x_2(k) + \cdots + a_{2N}x_N(k),\\
\quad\vdots\\
x_N(k+1) - x_N(k) = a_{N1}x_1(k) + a_{N2}x_2(k) + \cdots + a_{NN}x_N(k),
\end{cases}$$

or

$$\begin{cases}
x_1(k+1) = (1 + a_{11})x_1(k) + a_{12}x_2(k) + \cdots + a_{1N}x_N(k),\\
x_2(k+1) = a_{21}x_1(k) + (1 + a_{22})x_2(k) + \cdots + a_{2N}x_N(k),\\
\quad\vdots\\
x_N(k+1) = a_{N1}x_1(k) + a_{N2}x_2(k) + \cdots + (1 + a_{NN})x_N(k),
\end{cases}$$

or, in matrix form,

$$\begin{bmatrix} x_1(k+1)\\ x_2(k+1)\\ \vdots\\ x_N(k+1) \end{bmatrix} =
\begin{bmatrix}
1 + a_{11} & a_{12} & \cdots & a_{1N}\\
a_{21} & 1 + a_{22} & \cdots & a_{2N}\\
\vdots & \vdots & \ddots & \vdots\\
a_{N1} & a_{N2} & \cdots & 1 + a_{NN}
\end{bmatrix}
\begin{bmatrix} x_1(k)\\ x_2(k)\\ \vdots\\ x_N(k) \end{bmatrix} \tag{8}$$
3) The reduced equations of the dynamics of the information system are solved, which makes it possible to determine the values of the variables xi (t) depending on the values of the coefficients aij , as well as the dynamic characteristics of the system as a whole.
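The three stages above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names and all numeric values are hypothetical, (7) is assumed to be read as the sum of log2 of each value divided by its sensitivity threshold, and the coefficients a_ij are simply passed in as numbers rather than derived from the link matrix.

```python
import math


def link_matrix(x):
    """Stage 1: K[i][j] = x[i] / x[j], the matrix of information links."""
    return [[xi / xj for xj in x] for xi in x]


def system_information(x, dx):
    """I_Y = sum_i log2(x_i / dx_i), an assumed reading of equation (7)."""
    return sum(math.log2(xi / dxi) for xi, dxi in zip(x, dx))


def step(x, a):
    """Stage 2, one step of (8): x_i(k+1) = x_i(k) + sum_j a[i][j] * x_j(k)."""
    return [xi + sum(aij * xj for aij, xj in zip(row, x)) for xi, row in zip(x, a)]


def simulate(x0, a, steps):
    """Stage 3: return the trajectory [x(0), x(1), ..., x(steps)]."""
    trajectory = [list(x0)]
    for _ in range(steps):
        trajectory.append(step(trajectory[-1], a))
    return trajectory


# Hypothetical numbers, chosen only so that the arithmetic is easy to follow.
K = link_matrix([4.0, 2.0, 1.0])
I_Y = system_information([4.0, 2.0, 1.0], [0.5, 0.25, 0.125])
a = [[0.0, 0.5],
     [0.5, 0.0]]
trajectory = simulate([1.0, 1.0], a, steps=3)
```

Iterating the recurrence directly avoids any matrix library; for larger N the same update is a single matrix-vector product with the matrix in (8).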
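3.3 An Example of Solving the Problem of Dynamics of Information System Elements

Solve the system of equations of dynamics for the elements $x_1$ and $x_2$:

$$\frac{dx_1(t)}{dt} = \lambda x_2, \qquad \frac{dx_2(t)}{dt} = \phi x_1,$$

$$\lambda = K_{12} = \frac{x_1}{x_2} = \frac{U_1}{U_2}, \qquad \phi = K_{21} = \frac{x_2}{x_1} = \frac{U_2}{U_1},$$

$$\Delta x_1, \Delta x_2 = \text{const}, \qquad \min \le x_i \le \max,$$

where $x_i$ are the variables, $\Delta x_i$ is the sensitivity threshold, $U_i$ is the expanded uncertainty, and $K_{ij}$ is the information link coefficient. Then we have:

$$x_1(n+1) = x_1(n) + \lambda x_2(n), \qquad x_2(n+1) = x_2(n) + \phi x_1(n), \qquad x_1(0),\; x_2(0). \tag{9}$$

We apply the discrete Laplace transform (Z-transform):

$$x(z) = \sum_{n=0}^{\infty} x(n)\,z^{-n}.$$

System (9) in matrix form becomes:

$$\begin{pmatrix} z - 1 & -\lambda\\ -\phi & z - 1 \end{pmatrix}
\begin{pmatrix} x_1(z)\\ x_2(z) \end{pmatrix} = z \begin{pmatrix} x_1(0)\\ x_2(0) \end{pmatrix}, \tag{10}$$

or

$$\begin{pmatrix} x_1(z)\\ x_2(z) \end{pmatrix} = \frac{z}{z^2 - 2z + 1 - \lambda\phi}
\begin{pmatrix} z - 1 & \lambda\\ \phi & z - 1 \end{pmatrix}
\begin{pmatrix} x_1(0)\\ x_2(0) \end{pmatrix}. \tag{11}$$

Setting $\sigma^2 = \lambda\phi \ge 0$, we can determine the roots of the characteristic equation:

$$z^2 - 2z + 1 - \sigma^2 = 0, \qquad (z - (1 - \sigma))\,(z - (1 + \sigma)) = 0, \qquad z_1 = 1 - \sigma, \quad z_2 = 1 + \sigma. \tag{12}$$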
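The step from (9) to (10) rests on the shift property of the transform. The short derivation below spells it out. The label λ for the coefficient of x2 in the first equation is an assumption made here for readability, and the second equation is taken to couple x2 to x1 through ϕ, as the matrix in (10) indicates.

```latex
\sum_{n=0}^{\infty} x(n+1)\,z^{-n}
  = z \sum_{m=1}^{\infty} x(m)\,z^{-m}
  = z\,x(z) - z\,x(0)
\quad\Longrightarrow\quad
\begin{aligned}
  z\,x_1(z) - z\,x_1(0) &= x_1(z) + \lambda\,x_2(z),\\
  z\,x_2(z) - z\,x_2(0) &= x_2(z) + \phi\,x_1(z),
\end{aligned}
\qquad
\det\begin{pmatrix} z-1 & -\lambda\\ -\phi & z-1 \end{pmatrix}
  = z^2 - 2z + 1 - \lambda\phi .
```

Collecting the transformed unknowns on the left gives the matrix equation (10), and the determinant is the denominator that appears in (11) and whose roots are found in (12).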
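Solution (11) takes the form:

$$x_1(z) = \frac{z^2 - z}{z^2 - 2z + 1 - \sigma^2}\,x_1(0) + \frac{\lambda z}{z^2 - 2z + 1 - \sigma^2}\,x_2(0),$$

$$x_2(z) = \frac{\phi z}{z^2 - 2z + 1 - \sigma^2}\,x_1(0) + \frac{z^2 - z}{z^2 - 2z + 1 - \sigma^2}\,x_2(0).$$

Find the inverse transformation to the time variable $n$:

$$\frac{z^2 - z}{z^2 - 2z + 1 - \sigma^2} = \frac{1}{2}\left(\frac{z}{z - (1 - \sigma)} + \frac{z}{z - (1 + \sigma)}\right) \;\Rightarrow\; \frac{1}{2}(1 - \sigma)^n + \frac{1}{2}(1 + \sigma)^n,$$

$$\frac{z}{z^2 - 2z + 1 - \sigma^2} = \frac{1}{2\sigma}\cdot\frac{\sigma - 1}{z - (1 - \sigma)} + \frac{1}{2\sigma}\cdot\frac{\sigma + 1}{z - (1 + \sigma)} \;\Rightarrow\; \frac{1}{2\sigma}(1 + \sigma)^n - \frac{1}{2\sigma}(1 - \sigma)^n. \tag{13}$$

It follows that

$$x_1(n) = \frac{x_1(0)}{2}\left((1 - \sigma)^n + (1 + \sigma)^n\right) + \frac{\lambda}{2\sigma}\left((1 + \sigma)^n - (1 - \sigma)^n\right)x_2(0),$$

$$x_2(n) = \frac{\phi}{2\sigma}\left((1 + \sigma)^n - (1 - \sigma)^n\right)x_1(0) + \frac{x_2(0)}{2}\left((1 - \sigma)^n + (1 + \sigma)^n\right),$$

or

$$x_1(n) = \frac{1}{2}\left(x_1(0) - \frac{\lambda}{\sigma}\,x_2(0)\right)(1 - \sigma)^n + \frac{1}{2}\left(x_1(0) + \frac{\lambda}{\sigma}\,x_2(0)\right)(1 + \sigma)^n,$$

$$x_2(n) = \frac{1}{2}\left(x_2(0) - \frac{\phi}{\sigma}\,x_1(0)\right)(1 - \sigma)^n + \frac{1}{2}\left(x_2(0) + \frac{\phi}{\sigma}\,x_1(0)\right)(1 + \sigma)^n. \tag{14}$$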
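A quick numerical check of the closed-form solution (14) is to iterate the recurrence (9) directly and compare. The sketch below assumes that the coefficient of x2 in the first equation is named `lam` and that the second equation couples x2 to x1 through `phi`, as in the matrix form (10); the function names and the parameter values in the usage loop are hypothetical.

```python
import math


def iterate(x1_0, x2_0, lam, phi, n):
    """Apply recurrence (9) n times, starting from (x1_0, x2_0)."""
    x1, x2 = x1_0, x2_0
    for _ in range(n):
        # Both right-hand sides use the values from step k before assignment.
        x1, x2 = x1 + lam * x2, x2 + phi * x1
    return x1, x2


def closed_form(x1_0, x2_0, lam, phi, n):
    """Evaluate solution (14); needs sigma^2 = lam * phi > 0."""
    sigma = math.sqrt(lam * phi)
    decay, growth = (1 - sigma) ** n, (1 + sigma) ** n
    x1 = 0.5 * (x1_0 - lam / sigma * x2_0) * decay + 0.5 * (x1_0 + lam / sigma * x2_0) * growth
    x2 = 0.5 * (x2_0 - phi / sigma * x1_0) * decay + 0.5 * (x2_0 + phi / sigma * x1_0) * growth
    return x1, x2


# Hypothetical parameters: sigma = sqrt(0.5 * 0.125) = 0.25.
for n in range(6):
    print(n, iterate(2.0, 1.0, 0.5, 0.125, n), closed_form(2.0, 1.0, 0.5, 0.125, n))
```

For 0 < σ < 1 the term with (1 − σ)^n decays while the term with (1 + σ)^n grows, so the growing mode dominates for large n.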
Thus, in the closed information system of the digital twin, the stability of the dynamic process is ensured by the conditions:

$$\min \le x_i \le \max \qquad \text{and} \qquad \sum_{i=1}^{N} \frac{x_i}{\Delta x_i} = \text{const}.$$
The analysis of the dynamics of the system-information process showed that the dynamics of the system connections of the elements {x2 → x1, x1 → x2} of the information system are self-developing and represent an evolutionary spiral, whereas the dynamics of the system connections of the elements {1/x2 → x1, 1/x1 → x2} of the information system are fading. In a stable information system of real production, there are connections between the elements of both kinds of dynamics. Their relationship is determined by physical laws.
4 Conclusions

The digital twin as an information system requires the development of principles for the formalization of system tasks and their solution based on system information models. Formalization of the system tasks of the digital twin and their solution based on system-information models is based on the concept of system information of objects. The system-information model of the digital twin requires the development of new methods and approaches to solving system problems, taking into account the presence of parameter uncertainty in the model. The solution to information system problems is based on system-information models of the dynamic processes of the digital twin in the system (USIS + PLSI + PLM), which characterize the stability of the system. The study of methods for solving system problems based on uncertainty management algorithms in digital twin problems is an important direction in the development of modern digital production.
The system-information process of the digital twin is a sequence of system-information increments. A mathematical model of the system-information process of the digital twin is presented, which is a theoretical platform for solving system problems. The task of the dynamics of system information is formalized and solved in several stages: 1) a matrix of information links of system elements is built; 2) a system of equations of dynamics is compiled for each pair of elements of the system; 3) the equations of the dynamics of the information system are solved. An example of solving the problem of the dynamics of information system elements is given, which showed that the dynamics of system connections of information system elements are self-developing and represent an evolutionary spiral. The direction of future work is to study the influence of both the values of the simulated features and their fluctuations on the stability of system-information models.
References

1. Sharma, A., Kosasih, E., Zhang, J., Brintrup, A., Calinescu, A.: Digital twins: state of the art theory and practice, challenges, and open research questions. J. Ind. Inf. Integr. 30, 100383 (2022)
2. Leng, J., Wang, D., Shen, W., Li, X., Liu, Q., Chen, X.: Digital twins-based smart manufacturing system design in Industry 4.0: A review. J. Manuf. Syst. 60, 119–137 (2021). https://doi.org/10.1016/j.jmsy.2021.05.011
3. Forrest, J.Y.-L.: General Systems Theory. Foundation, Intuition and Applications in Business Decision Making (2018)
4. Li, L., Lei, B., Mao, C.: Digital twin in smart manufacturing. J. Ind. Inf. Integr. 26, 100289 (2022)
5. Gao, L., Jia, M., Liu, D.: Process digital twin and petrochemical industry. J. Softw. Eng. Appl. 15(8), 308–324 (2022)
6. Molnár, B., Benczúr, A., Béleczki, A.: Formal approach to modeling of modern information systems. Int. J. Inf. Syst. Proj. Manag. 4(4), 69–89 (2016)
7. Prakash, N., Prakash, D.: Novel Approaches to Information Systems Design. Hershey, New York (2020)
8. Lutskyy, S.: System-information approach to the uncertainty of process and system parameters. Innov. Technol. Sci. Solut. Ind. 3(17), 91–106 (2021)
9. Korablyov, M., Lutskyy, S.: System-information models for intelligent information processing. Innov. Technol. Sci. Solut. Ind. 3(21), 6–13 (2022)
10. Love, P.E.D., Zhou, J., Matthews, J., Sing, M.C.P., Edwards, D.J.: System information modelling in practice: Analysis of tender documentation quality in a mining mega-project. Autom. Construct. 84, 176–183 (2017). https://doi.org/10.1016/j.autcon.2017.08.034
11. Hasan, F.F.: A review study of information systems. Int. J. Comput. Appl. 179(18), 15–19 (2018)
12. Korablyov, M., Lutskyy, S., Ivanisenko, I., Fomichov, O.: System-information rationing of digital twins accuracy. In: Proceedings of the 13th Annual Computing and Communication Workshop and Conference (CCWC), pp. 653–660. Las Vegas (2023)
13. Korablyov, M., Lutskyy, S., Ivanisenko, I.: Data Processing of Digital Productions under Conditions of Uncertainty using System Information Models. Sel. Paper IX International Science Congress Information Technology Implementation, WS Proceed. (IT&I-WS 2022), CEUR-WS, vol. 3384, pp. 62–73 (2023)
Smart City, Big Data, Little People: A Case Study on Istanbul’s Public Transport Data

Emre Kizilkaya(B), Kerem Rizvanoglu, and Serhat Guney

Galatasaray University, Ciragan Cad. No: 36 Ortakoy, 34357 Istanbul, Turkey
[email protected], {krizvanoglu,sguney}@gsu.edu.tr
Abstract. Over the course of two years, a team of researchers collaborated with the Istanbul Metropolitan Municipality (IBB) in a joint research endeavor. The project aimed to leverage a substantial 30-gigabyte dataset containing information on more than 265 million mass transit journeys to serve public interest, with a particular emphasis on enhancing the urban experience of international students through locative media applications. This article presents a case study on communication that explores the intricate and evolving network of both human and nonhuman actors involved, employing Actor-Network Theory (ANT), a sociological approach designed to study relationships within heterogeneous networks. Fundamental methodological principles of ANT - agnosticism, generalized symmetry, and free association - were utilized to decipher the “moments of translation”. The article proceeds to discuss a unique set of observed effects specific to the context of a developing country’s “fraying” metropolis, its distinct “buffer mechanisms”, and its patron-client network relations. It concludes by proposing the integration of the concept of affordances to enhance ANT critically and further the cause of public interest-oriented urban co-governance of Big Data.

Keywords: Smart cities · Data governance · Actor-network theory · Urban studies · Human-computer interaction · Geographical information systems
1 Introduction

1.1 Background

In the era of digital transformation and algorithmic automation, data has emerged as a pivotal resource with the potential to transform urban governance. The concept of Smart Cities has gained momentum, heralding efficient and sustainable urban environments through the integration of data-driven technologies and services. Consequently, the use of “Big Data” and advanced analytics by public institutions has become a prominent area of research and practice. This article presents a case study examining the distinct networks at the intersection of social sciences and e-governance, focusing on the Istanbul Metropolitan Municipality (IBB) in Turkey. Istanbul, one of the largest cities worldwide, is undergoing significant sociopolitical and technological changes, making it a compelling context to study the dynamics of urban data utilization for public interest.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 57–75, 2024. https://doi.org/10.1007/978-3-031-54053-0_5
A research team from Galatasaray University in Istanbul initiated a joint project with IBB in February 2021 to analyze the abundant citizen data collected daily, with a focus on mass transit usage of international students. The primary objective was to gain insights into how students, particularly the Erasmus exchange community, “use” the city. This collaborative effort, spanning over two years, aimed to understand and enhance the students’ urban experiences through innovative locative media applications, thereby contributing to Istanbul’s Smart Cities agenda. Simultaneously during the fieldwork, we utilized the Actor-Network Theory (ANT) as a comprehensive framework for analyzing the relationships and interactions within the complex network that we worked with/in. This network consisted of a web of relationships between researchers and various actants within IBB, including different departments, managers, technical teams, and data infrastructure. Our study coincided with a period of rapid change in Istanbul. Over the past four decades, Istanbul experienced significant urbanization, with its population tripling from 4.7 million in 1980 to 15.9 million in 2022 [1]. This growth transformed IBB into one of the largest public bodies in Europe with a budget exceeding those of several countries worldwide. According to official figures shared with researchers during the project, IBB employed over 92,000 people, owned 30 companies, and provided municipal services across a territory larger than 5,461 square kilometers of land as of 2022. The study also occurred alongside two key events. Firstly, Turkey’s conservative Justice and Development Party (AKP) lost Istanbul’s mayoral seat after 25 years in power [2]. The center-left main opposition, the Republican People’s Party (CHP), saw its candidate, Ekrem Imamoglu, elected as the new mayor. 
The transition involved more than political/ideological/rhetorical shifts; Imamoglu’s administration embarked on a technologically-driven transformation. During this period, IBB’s software and hardware infrastructure for managing “Big Data” underwent modification, and the municipality embarked on a journey to integrate its various services into a “super app.” This app, named “Istanbul Senin” (Istanbul Is Yours), was launched in November 2021, and it surpassed the 100,000-user milestone in less than three months. The second key event was the transition from the COVID-19 period to the “new normal.” Throughout the pandemic that started in Turkey in March 2020, Turkish authorities imposed extraordinary measures from time to time. The last of the total shutdowns occurred in May 2021, with “normalization” declared in March 2022 when the outdoor mask mandate was scrapped. Most of the fieldwork was conducted during this period.

1.2 Literature Review

ANT is a theoretical and methodological approach to social theory that was first proposed by academics from the Centre de Sociologie de l’Innovation (CSI) of the École nationale supérieure des mines de Paris in the early 1980s [3]. Also known as the “sociology of translation”, ANT was developed by scholars such as Michel Callon, John Law, and Bruno Latour [4]. These scholars aimed to understand how innovation and knowledge-creation occur in science and technology. ANT adopts an anti-essentialist stance, positing that everything in the social and natural worlds exists within a heterogeneous network of relationships between entities. These entities are simultaneously material and semiotic. ANT seeks to redefine actors not
merely as willful or intentional agents, but as any entity, human or non-human, that in some way influences or disrupts the activity of a techno-social system [5]. As Crawford [6] further explains, ANT is underpinned by three main principles: agnosticism, which involves abandoning a priori assumptions about networks and their entities; generalized symmetry, which involves treating all entities, whether human or non-human, as equal; and free association, which rejects the concept of pre-existing social structures, considering them as perpetually emerging from associations between entities. ANT was described not as a full-fledged theory but as “an adaptable, open repository” [7, p. 253]. Its practitioners have adopted or modified pre-existing concepts, creating a unique terminology in the process. They derived the concept of translation from French philosopher Michel Serres, redefining it as “the identity of actors, the possibility of interaction, and the margins of maneuver are negotiated and delimited” within an actor-network [8, p. 203]. According to Latour, a specific instance of translation is delegation, through which social or technical actants “enroll” in a mediating process that transforms intentions into forms for a particular end, such as a bridge built by engineers to facilitate travel across a river [9, p. 602]. Power in actor-networks is mainly conveyed through delegation, which can involve not only humans or physical structures but also machines and other non-human actants. The discipline process “means convincing or forcing those delegated to conform to patterns of action and representation” [10, p. 28]. ANT practitioners also adapted the term “black box,” rooted in information science, to describe a set of complex sociotechnical features that are hidden once they are packaged into a discrete object, such as a consumer product like a seatbelt [11].
They further developed this terminology by introducing the concept of punctualization, which Callon [12] described as the process of black-boxing, or converting an entire network into a single point or node in another network. Since the mid-1990s, ANT has been applied to a diverse range of fields beyond science and technology studies. These include organizational analysis, international relations, feminist studies, and economics. Moreover, ANT has been used as an analytical tool to understand the implementation of information technologies in various sectors, such as healthcare [13]. ANT has also been extensively employed in urban studies, as well as in research into “smart cities,” “Big Data,” and algorithmic governance [14]. It was praised for its potential to “act upon and even change urban studies, and a less or more critical diagnostic of the current state, conventions and blind spots of urban studies” [15, p. 2]. The framework has been found to “provide a valuable insight into the local and global actor-networks that surround e-government projects” [16, p. 1]. Secinaro et al. [17] argued that hybrid organizations benefit from actor-network theory when managing smart city initiatives, while Smith [18, p. 25] acknowledged the potential of ANT to achieve “further progress in the conceptualization and empirical study of world cities and their networks.” Despite its influence, ANT has also been controversial. From the late 1990s onwards, it faced intensified criticism. Some argued that it “fails to offer a satisfactory theory of the actor which is allegedly endowed either with limitless power or deprived of any room for maneuver at all” [19, p. 181]. It was criticized for being analytically underpowered
[20], ambiguous [21], and prone to “technicism” [22]. While some argued that ANT’s empiricism was too strict [23], others contended that it was not empiricist enough [24]. Due to “a naturalizing ontology, an un-reflexive epistemology and a performative politics,” its contribution to developing a critical theory was also questioned [25, p. 611]. Despite these critiques, and particularly after its proponents revised or clarified some of the key elements of their theses in the late 2000s (a period known as “ANT and After”), the framework remains a useful mode of inquiry for understanding the complex relationships between entities in a network [26]. In this article, we present a comprehensive case study on Istanbul’s public transport system, focusing on the utilization of a large-scale dataset containing information on millions of mass transit journeys. The study, conducted in collaboration with IBB, aims to leverage big data to enhance the urban experience of international students through locative media applications. We employ ANT to explore the intricate and evolving network of both human and non-human actors involved. The article delves into the fundamental methodological principles of ANT, including agnosticism, generalized symmetry, and free association, to decipher the “moments of translation”. The ensuing sections provide detailed insights into the methodology, findings and analysis, shedding light on the potential of big data in shaping smart cities and improving the urban experience for inhabitants.
2 Methodology

Large-scale municipal administrations like IBB, particularly in the digital governance era, can be analyzed as heterogeneous sociotechnical networks, suitable for study through the lens of ANT. The application of ANT, however, can take various forms. For instance, Kanger [21] lists seven distinct interpretations for ANT’s scope of applicability: as a sensitizing framework, as a tool for structuring descriptions/explanations, as a description/explanation in and of itself, as a framework of/for the fluid, as a methodology, and as a coherent combination of different ontological and methodological assumptions. Upon recognizing through our initial observations that the IBB environment could indeed be analyzed as a heterogeneous sociotechnical network, we decided to employ ANT as a tool for structuring descriptions/explanations in our long-term field study. Within such fieldwork, ANT extends beyond traditional qualitative research approaches by incorporating a variety of unique techniques and methods to analyze the intricate interactions in actor-networks. One of the primary methods we employed was ethnographic observation, which offers a comprehensive understanding of network dynamics and the roles of various actors. Throughout the first year of our project, we engaged in prolonged, immersive fieldwork, observing and documenting the interactions between actors in their natural settings. As researchers, we were actors ourselves, participating in dozens of conversations with over 30 IBB figures, observing how they collaboratively created and stabilized the network at the municipality headquarters in Istanbul’s Sarachane neighborhood and its extension building in the Kasimpasa district, as well as in numerous video conferences.
Following one of the most well-known ANT principles—Latour’s directive to “just follow the actors” [27]—we traced the trajectories of actants as they moved across different contexts in space and time, focusing on their agency and the changes they brought
about in the network. Over time, the roles of some human actors changed, with some getting promotions at the IBB headquarters, others having their job descriptions altered, while others left the institution or were laid off. Non-human actors, too, experienced shifts. For example, IBB’s hardware and software for handling mass transit databases underwent significant changes, its main data center was moved, and the mobile applications available to the public were replaced with new ones. Meanwhile, we continually updated our observations and conversations, creating visual representations of the network in the form of actor maps. We also conducted document analysis, examining various materials, including IBB’s internal policy papers, raw datasets, reports, presentations, emails, and other documents, to deepen our understanding of this actor-network. As ANT is commonly applied in book-length texts, we narrowed our focus to the network surrounding a non-human actor: a 30-gigabyte file named export-GSU.csv, which contained a dataset of more than 265 million mass transit journeys in one year that IBB shared with us for the main research project. After utilizing ANT to structure the descriptions/explanations in the Results section of this paper, we attempted to extend it towards critical theories of organization with a denaturalizing ontology, reflexive epistemology, and anti-performative politics in the Discussion section. For this, we turned to the concept of affordances as an analytical tool for generally observed features of this particular network, described in [28] as one of the “predominant lenses through which to theorize about how digital technologies are involved in organizational change and innovation.” To interpret locally observed features, we utilized three key concepts developed by Turkish sociologist Mubeccel Kiray: fraying, patronage [29] and buffer mechanisms [30].
3 Results

As depicted in the figure below, Istanbul’s mayor oversees 11 entities (see Fig. 1). The secretary-general is a key figure within this hierarchy, supported by seven deputies, each responsible for a different set of areas. For an external stakeholder, such as a university researcher, determining which IBB office to approach may initially be challenging as a potential collaboration might intersect with the responsibilities of several deputies. In the context of our study, the following deputies and their areas of responsibility seemed relevant: Deputy 1 for Information Technology (IT) and external relations, Deputy 2 for rail transport, Deputy 3 for culture, Deputy 4 for public transport, Deputy 5 for strategy development, and Deputy 7 for research and projects. Under each deputy, there are 27 “presidents” who manage more specific areas. Moreover, 104 “branch managers” report to the presidents, focusing on even more particular functions.

Fig. 1. The corporate hierarchy of the Istanbul Metropolitan Municipality (IBB), displayed on its website, does not accurately depict the intricate interactions within its networks, particularly those involved in the multi-dimensional operationalization of public data.

Our entry point for collaboration with the IBB was the secretary-general’s Deputy 1, to whom the president of IT reports. Four branch managers report to this presidency: IT (again), Electronics Systems, Geographical Information System, and Smart City. Following the initial meetings between the university representative and IBB’s high-ranking officials, and before the formal agreement was signed, Deputy 1 delegated the co-management of the joint research project to the Smart City branch. Actors from several other IBB entities later collaborated with the researchers, such as the Geographical Information System branch, Bus and Trams Directorate (IETT) and BELBIM, the public-owned company that is responsible for developing and operating electronic payment systems in mass transit. The research project, which was initially envisaged as a one-year study commencing in early March 2021, was only concluded after two years due to a series of challenges on the IBB side, which are described below (see Fig. 2).

Fig. 2. The timeline of the study
Year One
February 2021: The research team’s first contact with the Istanbul Metropolitan Municipality (IBB)
March 2021: The research team meets IBB’s top managers
June 2021: The joint research partnership is signed and the project officially begins
September 2021: The research team meets IBB’s technical team for data sharing
November 2021: Another meeting with IBB’s technical teams (challenges)
Year Two
March 2022: Another meeting with IBB’s technical team
April 2022: IBB’s technical team notifies researchers that the project is on hold due to an impasse
May 2022: Another meeting with high-level managers to resolve the impasse
September 2022: Data sharing restarts
November 2022: Final versions of the data are shared
March 2023: Concluding conversations between research and IBB teams

3.1 First Moment of Translation: Problematisation

The researchers’ initial objective was to enlist the support of IBB in a study designed to explore the mobility patterns of international students. The study aimed to gain an in-depth understanding of these students’ urban experiences and expectations. This understanding could guide the formulation of improved city policies and practices, including potential developments in locative media applications. While the project incorporated qualitative elements, such as ethnographic studies, the phase that required IBB’s involvement leaned more towards quantitative data. The researchers endeavored to gather all relevant datasets from IBB that pertained to mobility
and students. Their goal was to analyze these datasets holistically, providing municipal authorities with an evidence-based policy proposal. In this context, the researchers made a presentation to IBB’s top officials, including Deputy 1. The meeting, conducted in person, highlighted the potential benefits of the project. They confirmed to IBB that only anonymized versions of several datasets concerning students, foreigners, and mobility were needed. Deputy 1 and his team, representing an internal network within IBB, found the proposal compelling. This collaboration represented an opportunity for IBB to enhance its organizational capacity without necessitating significant financial or human capital
investment. In return, the organization’s “dormant” datasets would be analyzed by external experts affiliated with a state university, a non-profit educational institution. Recognizing the potential value of the collaboration, IBB greenlit the partnership immediately after the researchers’ presentation. The initial step in the translation process, problematization, proceeded without a hitch. The researchers successfully convinced an internal network at IBB that a critical issue existed: unutilized data pertaining to an overlooked segment of Istanbul’s citizens. They presented themselves as willing and capable solution providers, working for the public good. Implicitly, both parties agreed that the researchers had established themselves as an obligatory passage point in the network of relationships they were constructing (see Fig. 3).
Fig. 3. The Callonesque actor-map after the first moment of translation, drawn by adapting the ANT diagramming method proposed by Potts [31]
3.2 Second Moment of Translation: Interessement

Following the signing of the joint research partnership protocol, the researchers’ subsequent step involved requesting IBB’s related datasets for analysis. These included data on mass transit, municipal bike rentals, the municipal taxi app, the “Walk & Discover” app, usage of municipal public Wi-Fi, census data on foreign residents in Istanbul, and complaints filed by foreigners to IBB’s help desk. The network began to expand and the process became more complex as the technical materiality started to obscure the previously clear and concise domain of rhetoric. Communication shifted to email, an actor that brought its own advantages and limitations. Deputy 1 and the internal network around him expanded to include roughly a dozen new actors in the email’s “CC” box, representing various IBB functions, such as presidents and branch experts. However, “not every message lends itself for email” [32,
p. 52]. The researchers anticipated initial challenges, as effective organizational communication is vital, particularly in the early stages of developing or revealing a new actor-network. It is noteworthy that the main figures at IBB, as depicted in the organizational hierarchy, are human actors (politicians, bureaucrats, technicians, etc.) serving as figureheads (directors, managers, experts, etc.). However, each entity is an actor-network in itself, encompassing non-human actors as well, such as email (as a medium) and even the selected email operator/server. Following the partnership protocol, IBB’s top management assigned the lead role to the Smart City branch. Within this branch, the “Open Data” and “Big Data” departments are separate units where human actors (e.g., data scientists, business intelligence analysts, visualization experts) work alongside non-human actors (e.g., databases, the hardware used to store them, the software used to manipulate them). Email communication may have impeded or delayed the formation of a robust network. However, another non-human actor, the datasets in question, now assumed a significant role at the core of the increasingly intricate network. Under ANT’s principle of generalized symmetry, all actors within a network are perceived as equally important and influential. However, the framework also posits that an actor’s power and agency arise from its interactions within the network. As such, the heightened significance of the datasets as an actor was not inherent but rather a temporary condition that emerged from transient interactions at this stage. Exhaustive email correspondence and occasional video conferences were used to discuss legal and ethical aspects of data sharing, such as privacy concerns, along with technical matters like file structures and formats.
The total number of participants (actors in their own right and/or representatives of internal actor-networks) fluctuated throughout this process.
3.3 Third Moment of Translation: Enrolment
In this case study, certain entities, beginning with the IBB’s Smart City branch, were identified by the initial problematization. However, they resisted quick integration into the project as it stood, implicitly indicating diverging goals and motivations at the time of these interactions. This pushed the researchers into the interessement phase, where they aimed to impose and stabilize the identity of the actor-network as initially envisioned. The researchers’ strategy was clear-cut. They first abandoned the protracted email thread, choosing instead to engage in face-to-face meetings to clarify how the cooperation would benefit all actors involved. Using insights gained during this phase, they sought to align the priorities of the coordinating actor, the Smart City department, with the goals of the research. It became apparent that discussions on, for example, the technical aspects of the project should be conducted not throughout the entire network, but within specific ‘black-boxed’ units of it. As observed by Callon, “[i]nteressement achieves enrolment if it is successful” [8, p. 211]. In this case study, success was intermittent and limited, particularly with respect to the research project’s objectives. Enrolment was neither complete nor permanent. There was often an actor in the network that obstructed or completely halted interactions and information flow. At times, non-human actors were the culprits - a faulty database
export procedure could result in data loss. However, more often than not, it was human actors who caused disruption, with their names and positions undergoing constant change over nearly two years. Consequently, negotiations frequently had to restart due to various changes - from software updates to human resources decisions such as dismissals and promotions.
3.4 Fourth Moment of Translation: Mobilisation
The mobilisation phase commenced with two nearly simultaneous moments of translation: one affecting a more localized (and black-boxed) actor-network, and the other impacting the entire actor-network. On a more local scale, the black box of the Smart City department was opened or leaked to the researchers during a negotiation to secure their enrolment. The researchers endeavoured to convince the department managers and experts that the ensuing data analysis would offer mutual benefits. Initially, the department remained unpersuaded. However, upon viewing a data visualization created by the researchers from a sample of the requested IBB data, the department head - its spokesperson - mistakenly believed that it had been produced by his own department. This revelation initiated new channels of interaction, leading the department to agree with the researchers that the data analysis could yield valuable insights for all involved actors, including themselves. They were now aligned with the project. The more general mobilising moment of translation required the intervention of the spokesperson for a larger, related actor-network. While the IBB is politically represented by the mayor and the council, its “on-the-ground” representation is tied to deputy secretary-generals. When Deputy 1’s IT director was informed about the slowing progress of the joint research project, he intervened, not to dictate, but to revitalise the network interactions by supplying new information.
This intervention precipitated a process of reverse punctualization, as certain sub-networks were de-black-boxed, revealing their constituent elements - both human and non-human - to all actors, akin to viewing the inner workings of a cell under a microscope (see Fig. 4). Despite a delay of nearly a year, the researchers met the project objectives by accessing the dormant urban mobility datasets and analyzing them in the interest of public policy formulation. Nevertheless, they were unsuccessful in establishing a durable alliance (a more enduring actor-network) in which they served as spokespersons for various actors. This was a missed opportunity: given IBB’s demonstrated capacity to form ad hoc actor-networks with external research partners, such an alliance could have supported an ecosystem where public data is fully utilized with a focus on public interest. In the following section, we will explore the potential reasons for this shortcoming (see Fig. 5).
Fig. 4. IBB’s mass transit data sample was finally completed and handed over to the researchers in a file named export-GSU.csv, which was later analyzed and visualized
Fig. 5. The actor-map that was updated after the last moment of translation. We “zoomed” in and out while reverse punctualizing some entities to evolve a thicker and more expansive description of the network
4 Discussion
The insights we gleaned through ANT’s descriptive toolkit can be scrutinized more critically, particularly in relation to the concept of public interest, which involves normative judgments and values. In this context, how should a public dataset be utilized for the public interest by human and non-human actors within a complex network? As we diverge from the ANT perspective, we introduce a dichotomy here. Public interest has often been defined in negative terms, that is, as that which is not “self-interest” [33]. While some have attempted to define this term positively [34], certain attributes can be unanimously ascribed to the public interest. As it concerns the welfare of the largest number of people, it generally promotes greater participation, inclusivity, openness, accessibility, accountability, and transparency. So, how is the public interest associated with a “product” like an urban transit dataset? In agreement with Gandy’s argument [35], we recognize information as a “public good,” while rejecting the neoclassical definition that frames it as an essence, a natural resource, or merely as a product of labour. In this context, utilizing public data for the public interest means (re)producing a public good, which is “non-rivalrous” and “non-excludable” [36, p. 308]. The design of artifacts, as well as of the structures that produce and distribute them, is the result of human decisions, even when these decisions are influenced by non-human actors, such as algorithms. Acemoglu and Johnson [37, p. 318] observe that “there is also nothing inherently antidemocratic in digital technologies (…) It was a matter of choice—choice by tech companies, AI researchers, and governments—that got us into our current predicament.” In our case study, we can re-examine the observations about the central artifact, the export-GSU.csv file, and IBB’s internal mass transit data dashboards, through the lens of public interest, imagining new possibilities.
To do so, we must focus on the human actors who had the ability to exert power through delegation and discipline in various processes to produce public goods. To keep this article within an acceptable length, we will limit the discussion to a small number of features that were observed in our actor-network across two distinct dimensions.
4.1 Affordances of IBB’s Datasets
Gibson’s notion of affordance, introduced in 1979, refers to the potential actions provided or made possible by an environment or object. The concept has been applied in communication research in a variety of ways, sometimes even conflictingly [38]. However, it remains a useful descriptor for technology utilization as it ties together the materialist perspective, which focuses on objects, with the constructivist viewpoint centered on human agency [39]. Regarding telic affordances, the IBB mass transit datasets and associated dashboards interact with the user in limited ways. For example, variables included in or excluded from the dataset are determined through a somewhat opaque decision-making process. It became apparent to IBB experts that they could not provide data regarding the mobility of Erasmus students simply because they had not collected it. IBB had student data, including data on foreign students, but not specifically on participants in the Erasmus program.
How many other sub-groups of the public were overlooked in the datasets? The answer remains unclear. Design choices are similarly made through a non-transparent, non-pluralistic process, which potentially undermines the inclusivity of public datasets. One of many key issues is the discretion of a technician in deciding which data points to include in the Executive Summary panel at IBB’s mass transit board meetings. Consequently, we can posit that datasets should be viewed not only as objects with affordances but also as a kind of “frozen discourse” that is either inclusive or exclusive towards other actors in the network - be it by design or accident (see Fig. 6).
Fig. 6. IBB’s internal data dashboard for mass transit. This panel shows the pedestrian accessibility of the nearest station, as well as many other related data points
Indeed, beyond issues of inclusivity, the accessibility of the data presented its own set of challenges. The decisions regarding the color palettes used for internal data visualizations were made within the “black-boxed” actor-networks of the IBB. The choice of colors in these visualizations might not only have had implications for policy decisions but could also pose a barrier for individuals with color blindness (see Fig. 7).
4.2 Fraying, Patron-Client Networks and Buffer Mechanisms
Certain characteristics of the actor-network at IBB are specific to the local context, unfolding in conditions that, while not unique, are typical for an urban setting in Turkey. The concepts developed by sociologist Kiray, who studied social change and urbanization in Turkey, may provide valuable insights to explain these characteristics. Our observations noted that the intervention of influential actors within the IBB network facilitated the resolution of a deadlock. Although delegation and discipline feature prominently in ANT, the conditions within the IBB case presented unique dynamics. This deadlock was not resolved by an order from a hierarchically superior actor. In fact, previous orders from this actor had proven ineffective with other actors within the network. Was this merely an act of disobedience or could there be a more complex, underlying reason?
Fig. 7. An internal dashboard of IBB, displaying the traffic forecast for the first day of school in the 2021 fall semester. Note that the palette chosen for this visualization is not color-blind friendly
According to Kiray, the rapid modernization of Turkey led to significant social changes throughout the 20th century, largely driven by unplanned urbanization and industrialization. Traditional patron-client relationships transformed over the decades, with the feudal patronage of the village lord in the 1950s shifting to town merchants in the 1960s, then to politicians in the 1970s, followed by religious leaders in the 1990s and eventually returning to politicians in the 2000s. As large numbers of people migrated from rural areas to cities, social solidarity networks - comprising not only elders but also fellow townsfolk - were fortified. Rapidly expanding cities, such as Istanbul, found themselves overwhelmed by internal migration and unable to plan adequately for their transition to a metropolis. As a result, new industrial areas sprang up on the peripheries of these cities, populated by an influx of migrants and city dwellers in search of affordable housing. Kiray describes this process as fraying [29], and identifies the sudden emergence of a new type of low-quality housing (known as “gecekondu” in Turkish) as a buffer mechanism [30] that society employs to manage rapid change. As such, the process of urbanization, industrialization, and social restructuring in Turkey significantly differed from the experiences of Europe and North America during their respective industrialization periods in the 1960s. In revisiting our network at IBB, we can reinterpret the interactions of the actors who are linked to the dataset in question. In this context, Kiray’s three key concepts manifest themselves. The human actors at IBB, and potentially non-human entities created by these actors (such as algorithms), operate within a network characterized by corporate and political territorialism. Written rules or verbal commands are accepted by these actors as valid directives only when they originate from, or are endorsed by, their own patron.
Although the laws and statutes governing Turkey and Istanbul bear similarity to those in Western contexts, their application exhibits a distinct “flexibility.” However, caution must be exercised when attributing these differences to cultural or religious factors. The interactions within the IBB actor-network suggest they can be
interpreted as a buffer mechanism, developed by social groups as a temporary response to ongoing changes in the city’s demographics, economy, technology, and political administration. Consequently, urban governance was also fraying at the edges of the network. This could explain why we were unable to identify a single center of translation within the network; instead, such centers appear to be irregularly distributed. This process has likely exacerbated well-documented challenges within such networks, such as the formation of information silos. One of the informative conversations conducted during this research was with a “Big Data Analyst” at IBB. We hoped that getting him on board would expedite the joint research. Having worked at IBB for nearly 15 years (he was appointed during the tenure of the previous conservative mayor), he offered a unique perspective. At the time of our meeting, COVID-19 pandemic protocols were in place, but the analyst chose not to keep his mask on within the confines of a small room. During the talk, he signaled a deep mistrust towards his new superiors, who were appointed by the new social democrat mayor. He candidly admitted that he had intentionally “slowed down” processes and occasionally ignored directives from his superior in other projects as well. He provided two reasons for his conduct. First, he contended that the new administration, essentially a new network within an existing network, primarily invested in “PR projects”. These projects were aimed at yielding rapid results before the next election while allegedly sidelining more critical initiatives intended for public interest. For instance, he claimed that while it was currently technically impossible to determine the number of passengers in a metro train at a particular time and location, this information could be made accessible with the necessary infrastructural investments.
His second reason was less political and more practical: “Data is big, and we are small,” the analyst remarked, bemoaning that the growth of IBB’s human resources and urban information infrastructure had not kept pace with the increasing demand and rapidly escalating expectations. As a result, there existed a big dataset that was largely unexplored until some outsiders knocked on the door. The “Big Data Analyst” left IBB a few months before our project was concluded.
5 Conclusion
The findings from our case study on Istanbul’s mass transit datasets highlight the intricate interplay between Actor-Network Theory (ANT), urban governance, and the public interest in data-driven environments. The key facets of our findings primarily emphasize the interconnectedness and localized specificities of actor-networks, and their often-unforeseen implications for the transparency, inclusivity, and accessibility of public goods like urban datasets. Notably, the utilization of ANT offered novel insights into the interrelationships among various actors, both human and non-human, involved in the creation, dissemination, and interpretation of IBB’s urban transit datasets. The pivotal role of individual actors’ power dynamics in the delegation and discipline process was also underscored, revealing complexities within the decision-making networks. Moreover, the influence of socio-cultural factors in actor-networks cannot be overlooked. As our research revealed, the peculiarities of the local environment and historical
context in Turkey, largely characterized by rapidly changing patron-client relations, significantly influenced the interactions within the IBB’s actor-networks. This hints at an underlying social buffer mechanism acting as a temporary solution to cope with Istanbul’s rapid urbanization and the ensuing changes in its demographics, economy, technology, and political administration. Despite the detailed analysis and novel insights, this study has some limitations that should be acknowledged. First, the scope of the research was limited to a single case study, focusing on IBB and its specific actor-networks. While this approach allowed for an in-depth exploration of the interrelations among human and non-human actors within a specific context, it may limit the generalizability of the findings. The dynamics, interactions, and power relations observed within this actor-network may vary substantially in different cultural, political, or organizational contexts. Therefore, caution should be exercised when attempting to extrapolate the insights gained from this study to other settings, without taking into account the peculiarities of the local environment and the unique characteristics of the actor-networks within those environments. Second, the nature of ANT-based research inherently involves dealing with complex, often nebulous networks with diverse actors and interactions, which makes it challenging to capture the full spectrum of relations within the network. While the study meticulously tracked the interactions between various actors within the IBB and their influence on dataset creation and usage, some subtleties or less-visible interactions may have been overlooked. Additionally, the study’s focus on the internal dynamics of the IBB actor-network may have downplayed the impact of external influences or actors from outside the immediate network.
The idiosyncrasies observed in our study, as well as its limitations, bring forth an important implication for future research: the need for contextualizing actor-networks within their specific socio-cultural environments to accurately interpret their dynamics. It is important to be wary of generalizing interactions in these networks across different cultural or geographical settings without taking local conditions into account. Finally, the significance of public interest and its often-contested interpretation within actor-networks underscores the need for more transparent, inclusive, and accountable processes in data generation and dissemination in urban governance. Whether data is considered an essence, a natural resource, a product of labor, or a public good, its role in serving the public interest cannot be overstated. In the age of rapid digital transformation and growing reliance on data for decision-making, our research illuminates the intricate dynamics of actor-networks and their influence on urban governance. We hope that this study serves as a foundation for further exploration into how these networks can be structured to promote transparency, inclusivity, and public interest in the generation and use of urban transit datasets and other public goods.
Acknowledgments. This research, its authorship, and subsequent publication were supported by Galatasaray University’s Scientific Research Project under grant number 21/021, titled "Student Mobility and Locative Media Applications." The collaborative project was initiated based on the partnership agreement signed between Galatasaray University and the Istanbul Metropolitan Municipality on June 14, 2021, under order number E-59749965-030.01-2021.692847. The study underwent rigorous ethical review and was approved by Galatasaray University’s Ethical Board
for Scientific Studies and Publication on November 27, 2021, under Decision No. E-65364513050.06.04-16.169. The authors wish to express their heartfelt appreciation to all participants, including the Istanbul Metropolitan Municipality, and our esteemed colleagues, Betül Aydoğan, Onurcan Güden and Robin Kanat, for their invaluable contributions to this study.
References
1. Turkish Statistical Institute. Adrese Dayalı Nüfus Kayıt Sistemi Sonuçları, 2022 [Address-Based Population Registration System Results, 2022] (2023). https://data.tuik.gov.tr/Bulten/Index?p=49685
2. Gall, C.: Turkey’s President Suffers Stinging Defeat in Istanbul Election Redo. New York Times (2019). https://www.nytimes.com/2019/06/23/world/europe/istanbul-mayor-election-erdogan.html
3. Law, J., Hassard, J. (eds.): Actor Network Theory and After. Blackwell/Sociological Review, Oxford, England (1999)
4. Callon, M., Law, J., Rip, A. (eds.): Mapping the dynamics of science and technology: Sociology of science in the real world (transferred to digital printing). Macmillan (1998)
5. Crawford, T.H.: Actor-network theory. In: Hugh Crawford, T. (ed.) Oxford Research Encyclopedia of Literature. Oxford University Press (2020). https://doi.org/10.1093/acrefore/9780190201098.013.965
6. Crawford, C.: Encyclopedia of social theory. In: Ritzer, G. (ed.). Sage Publications (2005)
7. Mol, A.: Actor-Network Theory: Sensitive terms and enduring tensions. Kölner Zeitschrift Für Soziologie Und Sozialpsychologie. Sonderheft 50, 253–269 (2010)
8. Callon, M.: Some elements of a sociology of translation: domestication of the scallops and the fishermen of St Brieuc Bay. Sociol. Rev. 32(1_suppl), 196–233 (1984). https://doi.org/10.1111/j.1467-954X.1984.tb00113.x
9. Mead, G., Barbosa Neves, B.: Contested delegation: Understanding critical public responses to algorithmic decision-making in the UK and Australia. Sociol. Rev. 71(3), 601–623 (2023). https://doi.org/10.1177/00380261221105380
10. Star, S.L.: Power, technology and the phenomenology of conventions: On being allergic to onions. In: Law, J. (ed.) A Sociology of Monsters: Essays on Power, Technology and Domination (1. publ, pp. 26–56). Routledge (1991)
11. Latour, B.: Where are the missing masses? The sociology of a few mundane artifacts. In: Bijker, W.E., Law, J. (eds.) Shaping technology/building society: Studies in sociotechnical change, pp. 225–258. MIT Press (1992)
12. Callon, M.: Techno-Economic Networks and Irreversibility. In: Law, J. (ed.) A Sociology of Monsters: Essays on Power, Technology and Domination (1. publ). Routledge (1991)
13. Cresswell, K.M., Worth, A., Sheikh, A.: Actor-Network Theory and its role in understanding the implementation of information technology developments in healthcare. BMC Med. Inform. Decis. Mak. 10(1), 67 (2010). https://doi.org/10.1186/1472-6947-10-67
14. Campbell-Verduyn, M., Goguen, M., Porter, T.: Big Data and algorithmic governance: The case of financial practices. New Politic. Econ. 22(2), 219–236 (2017). https://doi.org/10.1080/13563467.2016.1216533
15. Farías, I., Bender, T.: Urban Assemblages: How Actor-Network Theory Changes Urban Studies. Routledge (2012)
16. Heeks, R., Stanforth, C.: Understanding e-Government project trajectories from an actor-network perspective. Eur. J. Inf. Syst. 16(2), 165–177 (2007). https://doi.org/10.1057/palgrave.ejis.3000676
17. Secinaro, S., Brescia, V., Calandra, D., Biancone, P.: Towards a hybrid model for the management of smart city initiatives. Cities 116, 103278 (2021). https://doi.org/10.1016/j.cities.2021.103278
18. Smith, R.G.: World city actor-networks. Prog. Hum. Geogr. 27(1), 25–44 (2003). https://doi.org/10.1191/0309132503ph411oa
19. Callon, M.: Actor-network theory—the market test. Sociol. Rev. 47(1_suppl), 181–195 (1999). https://doi.org/10.1111/j.1467-954X.1999.tb03488.x
20. Alcadipani, R., Hassard, J.: Actor-Network Theory, organizations and critique: Towards a politics of organizing. Organization 17(4), 419–435 (2010). https://doi.org/10.1177/1350508410364441
21. Kanger, L.: Mapping ‘the ANT multiple’: A comparative, critical and reflexive analysis. J. Theory Soc. Behav. 47(4), 435–462 (2017). https://doi.org/10.1111/jtsb.12141
22. Grint, K., Woolgar, S.: The Machine at Work: Technology, Work, and Organization. Polity Press (1997)
23. Elder-Vass, D.: Searching for realism, structure and agency in Actor Network Theory. Br. J. Sociol. 59(3), 455–473 (2008). https://doi.org/10.1111/j.1468-4446.2008.00203.x
24. Krarup, T.M., Blok, A.: Unfolding the social: quasi-actants, virtual theory, and the new empiricism of Bruno Latour. Sociol. Rev. 59(1), 42–63 (2011). https://doi.org/10.1111/j.1467-954X.2010.01991.x
25. Whittle, A., Spicer, A.: Is actor network theory critique? Organ. Stud. 29(4), 611–629 (2008). https://doi.org/10.1177/0170840607082223
26. Law, J.: Actor network theory and material semiotics. In: Turner, B.S. (ed.) The New Blackwell Companion to Social Theory, pp. 141–158. Wiley-Blackwell (2009)
27. Fioravanti, C.H., Velho, L.: Let’s follow the actors! Does Actor-Network Theory have anything to contribute to science journalism? J. Sci. Commun. 09(04), A02 (2010). https://doi.org/10.22323/2.09040202
28. Lehrer, C., Wieneke, A., Vom Brocke, J., Jung, R., Seidel, S.: How big data analytics enables service innovation: materiality, affordance, and the individualization of service. J. Manag. Inf. Syst. 35(2), 424–460 (2018). https://doi.org/10.1080/07421222.2018.1451953
29. Kiray, M.B.: Kentlesme Yazilari [Writings on Urbanization] (1. basım). Bağlam Yayıncılık (1998)
30. Kiray, M.B.: Social change in Çukurova: A Comparison Of Four Villages. In: Tümertekin, B., Mansur (eds.) Turkey, pp. 179–203. BRILL (1974). https://doi.org/10.1163/9789004491106_012
31. Potts, L.: Diagramming with Actor Network Theory: A method for modeling holistic experience. In: 2008 IEEE International Professional Communication Conference (2008). https://doi.org/10.1109/ipcc.2008.4610231
32. Marques, J.F.: Enhancing the quality of organizational communication: A presentation of reflection-based criteria. J. Commun. Manag. 14(1), 47–58 (2010). https://doi.org/10.1108/13632541011017807
33. Lewin, L.: Self-interest and Public Interest in Western Politics. Oxford University Press (1991)
34. Feintuck, M.: ‘The Public Interest’ in Regulation, 1st edn. Oxford University Press, Oxford (2004). https://doi.org/10.1093/acprof:oso/9780199269020.001.0001
35. Gandy, O.: The political economy of communications competence. In: Mosco, V. (ed.) The Political Economy of Communication, 2nd edn. SAGE (2009)
36. Stiglitz, J.: Knowledge as a global public good. In: Kaul, I., Grunberg, I., Stern, M.A. (eds.) Global Public Goods: International Cooperation in the 21st Century. Oxford University Press (1999)
37. Acemoglu, D., Johnson, S.: Power and Progress: Our Thousand-Year Struggle Over Technology and Prosperity, 1st edn. PublicAffairs (2023)
38. Evans, S.K., Pearce, K.E., Vitak, J., Treem, J.W.: Explicating affordances: a conceptual framework for understanding affordances in communication research. J. Comput.-Mediat. Commun. 22(1), 35–52 (2017). https://doi.org/10.1111/jcc4.12180
39. Leonardi, P.M., Barley, S.R.: Materiality and change: Challenges to building better theory about technology and organizing. Inf. Organ. 18(3), 159–176 (2008). https://doi.org/10.1016/j.infoandorg.2008.03.001
Educommunication as a Communicative Strategy for the Dissemination of Cultural Programs
David Xavier Echeverría Maggi1(B), Washington Dután1, Lilibeth Orrala1, Gregory Santa-María1, Mariana Avilés1, Ángel Matamoros1, María-José Macías1, Martha Suntaxi1, Lilian Molina1,2, Gabriel Arroba1,2, and Arturo Clery1,2
1 Universidad Estatal Península de Santa Elena, La Libertad, Ecuador
{decheverria,clery}@upse.edu.ec
2 Universidad de Guayaquil, Guayaquil, Ecuador
Abstract. Educommunication is the exchange of ideas, knowledge, messages and information; it allows a broad understanding of the environment, its characteristics and the development that distinguishes a territory. Culture and educommunication are closely interrelated. Both require various forms of communication to thrive and to be created, re-created and shared. Consequently, culture shapes much of the content and forms of communication. Together, culture and educommunication have the capacity to produce and disseminate a great wealth of information, knowledge, ideas and content, contributing to the expansion of people’s options in bringing values and identity to life, thus creating favorable environments for people-centred development. The communicational dimension of this research examines the degree to which educommunication and the cultural spaces that may arise interact positively. It does so by evaluating the right to freedom of expression, the existing opportunities to access the media and the content they transmit, and finally the supply of local production in traditional media (radio), a pleasant, friendly and compatible option in an environment that lacks cultural spaces, even though most media broadcast a vast range of content. This research responds to the need to recover and strengthen spaces for cultural programming, so that the public shows interest in its identity; the objective is for educommunication to enrich cultural programming, which in turn is transmitted within appropriate spaces in traditional media. The proposal of this study is to disseminate various areas of history, creating interest in citizen learning and forming a society with a high level of knowledge of training spaces.
Communication processes are also pedagogical components that enrich learning, since in this way the stations will impart knowledge that will build an educational field. In addition, this will allow the development of critical thinking when consuming programs broadcast by a traditional medium.
Keywords: Communication · Communication strategy · Cultural programs
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 76–83, 2024. https://doi.org/10.1007/978-3-031-54053-0_6
1 Introduction
This investigation arises from the lack of educational programs in some traditional media with a large local audience, such as radio, that promote an active cultural identity in the new generations, and from the concern for having spaces that serve citizens who want to learn more. Various communication institutions predominate within the local territory. They provide the public with news that keeps the population up to date on daily events, but there is a deficit of content that enriches the population in culture, art and identity. It is also true that the content grids of traditional media such as radio include segments that fuel interest in programming such as music, sports and news, among others. Education is a long-term process, so promoting cultural spaces will be very helpful: new ways of thinking will form in society as a set of new learning. The media are fundamental to transforming thought and knowledge through innovative methodologies that capture the listener's attention. These tools will foster a participatory society, and the result will be new, motivating knowledge for the recipient.
2 Conceptual Framework
At present, cultural spaces are a subject of deep reflection for learning. Radio can also be a strategic tool to innovate and to educate the local population in an interactive and didactic way through educommunication, since the new generations learn dynamically. This traditional medium is a mass information channel that can help impart different techniques for the benefit of society; without a doubt, radio stations can support an alternative education. It is worth emphasizing that cultural identity is everything that represents a population, that is, its culture, customs, traditions and other elements that strengthen the patrimonial and cultural roots within modern society. It is necessary to facilitate community learning by finding spaces with suitable broadcast hours for programs that feed knowledge and the interest in connecting with the cultural identity that a territory has and that differentiates it from others. Good information delivered through educommunication on the radio will lead the audience to make a habit of tuning in to the station, with the development of knowledge as a significant contribution.
3 Education and Communication
For Carias-Pérez et al. (2021) [1], education, communication and culture are issues within the discursive context in which educational radio is used as a facilitator of processes related to cultural activation, in order to manage good educational practice (p. 42). That is to say, culture has the power to improve the quality of life of individuals, peoples and society itself. For years, perhaps decades, it has been a powerful tool for
emotional and intellectual development, and it is indispensable to the identity of a society, whose members find in cultural expression a language and a vantage point from which to understand the world and connect with other societies. It is necessary to create an identity that allows a society to stand out from others. Despite this, the multiple expressions of culture still do not escape the effects of inequality. Similarly, for Tissera (2019) [2], in the social construction of knowledge and the teaching-learning process, educommunication is a space and a setting for activities that promote a better quality of life, an advance for citizen development and for collective interlearning. The communicative space of education contributes tools for transforming teaching so that a community can enjoy a good quality of life, developing processes that favor educommunication and recognizing the cultural identity and history that have been fading in recent years. According to Barbas (2012) [3], the purpose of educommunication is collective social construction and creation through symbolic exchange and the flow of meanings. This means considering, firstly, the collaborative and participatory nature of educommunication; secondly, its creative and transformative possibilities; and, thirdly, the means and codes through which the educommunicative process is established. The purpose of educational communication is to create the conditions needed for participatory development, with learning through continual experimentation that involves cultural identity; given the lack of educational and cultural programs, educommunication aims to benefit the community. In the same way, Ortega and Fernández (2014) [4] mention that education is a human and cultural process.
Its purpose and definition are always conditioned by a context, that is, a time and a space, which maintains an intrinsic relationship with the condition and nature of man and of culture as a whole. Education is a process in which culture is adapted to the educational field; this generates content for traditional media, educates the general population and opens spaces that interest listeners. Traditional media reach into the lives of the inhabitants, who are interested in the information content that the stations offer; but these frequencies lack cultural programming that encourages citizenship and an identity that must prevail across all generations. It is the responsibility of all media outlets to make room in their programming grids for informative, cultural and consumer-interest programs. As cited in Ávila Muñoz (2012) [5], “education in communication matters includes all forms of studying, learning and teaching”. The use of the media as practical arts and scientific techniques, as well as cultural ones, leaves much to be desired. For several years the lack of cultural programming in some traditional radio media has been notable, so much so that it has dampened citizens' interest in an identity that is gradually being lost as time goes by. This is due to the weak interest of the media in going beyond routine “fun” programming and opting for programming that, in addition to entertaining, presents the audience with cultural content native to the study area.
According to Molina-Benavides et al. (2016) [6], interest in knowledge is essential at a time when education has no limits; education therefore gains importance and represents a person's most valuable resource, since it determines the competitive capacity of individuals, organizations and the State. It arises as a need for progress, initially social and then personal. That is, education has no borders and remains central to fundamental learning; it is the most valuable resource for a community, born of the need to excel and to increase spaces of cultural identity. According to Clery (2021) [7], “the composition of educational proposals favors the formation of critical citizens, capable of socializing, which tends to mastery in the development of critical and cultural content.” Within the educational context, educommunication must be understood as a new, autonomous field of cultural and social intervention whose constitutive core is the transversal relationship between education and communication. It is a field that is never fully defined but under permanent construction, inasmuch as it is influenced by the continuous process of social change and innovation, where the fundamental aim is to teach, educate and create cultural identity, not just to entertain the population. It must be taken into account that the use of any medium does not by itself solve the educational problem of cultural identity; the medium is just a tool that, used correctly at the right time, reinforces what has been acquired with a vision of cultural identity within the social environment. In the local context, the authors (Cortez Clavijo et al.
2018) [8] affirm that: An educational activity that is projected beyond the institution that promotes it, interprets the instances of society that should be close to the local educational administration, in order to favor the system of relations in the idea of detecting the concerns of the territory and building the sense of belonging. The lack of educational content in traditional, far-reaching media such as radio prevents part of the population from engaging fully with cultural programming that contributes to building a society with identity.
4 Educommunication Theory
According to Barbas Coslado (2012) [3], “Educommunication becomes an interdisciplinary instrument in which knowledge, meanings, experiences and sounds are exchanged. This supposes three fundamental aspects: its participative nature, the creative and transformative possibilities and the channels to generate an educommunicative space”. In this sense, Andrade Martínez (2020) [9] states that “When we refer to educommunication, we refer to a discipline or area of knowledge in the process of consolidation, which is discussed from different perspectives, approaches and scenarios, but which above all requires praxis to materialize”. In this postmodern context, any disciplinary perspective that proposes a redefinition of the social subject and its sociocultural context will be important, and educommunication is one possible way. For Barbas Coslado (2012) [3], “educommunication conceives learning as a creative process where the construction of knowledge to knowledge is possible through the
promotion of the creation and activity of the participants.” According to Cruz Díaz (2016) [10], “it is a continuous educational process, in charge of providing basic tools to promote awareness about the world in which we live, promoting a committed citizenry and mobilized by their environment”. In this sense, Teatring (2017) [11] reports that “educommunication has an enormous responsibility: to contextualize interculturality, so that the formation of a much more open criterion arises”. According to Teatring (2017) [11], learning is no longer mechanical and focused simply on memorization; its raison d’être is to give society a new direction, so that it has a more critical, inclusive and truly educated vision, committed to the identity of each person. On the other hand, Barbas Coslado (2012) [3] mentions that “educommunication is an interdisciplinary and transdisciplinary field of studies that addresses the theoretical-practical dimensions of two disciplines: education and communication”. The same author adds that media education covers all forms of research, learning and teaching, at all levels and in all contexts: the history, creation, use and evaluation of media as artistic practices and techniques, as well as the place the media occupy in society, their social impact, the consequences of media communication, participation, the change they bring about in perception, the role of creative work and access to the media. Cruz Díaz (2016) [10] states that: “The role of communication is the key in all processes of intervention and social transformation, in the performance of its institutions, but it is conviction, it appears weighed down when it comes to embodying social practices by the communicative and instrumental inertias in which we are formed and educated and with which we must break, before society breaks with us.”
For Rodas (2017) [12], “educommunication is a transversal axis, for radio production: filmmakers and announcers must value this component so that they have that aspect and educational, cultural, intercultural or community intentionalities”. According to Soler Rocha and López Rivas (2022) [13], the social communication media shape our lives and capture us; but through them it is also possible to learn correct denotations, typical of a liberating education that forms a dynamic individual in his integral development, as required for the permanent transformation that society needs: a human, critical, creative and social being. Likewise, Molina-Benavides et al. (2017) [14] describe the space created with the intention of serving as a means for the dissemination of different artistic, philosophical, educational. De la A Quinteros (2012) [15] affirms that “a radio program can be defined as a communicative product of a massive nature, which has a set of specific characters, objectives, audiences, content, etc. that allow it to differentiate itself from another space”. For this reason, Salazar Cisneros (2019) [16] considers that “culture is closely related to the degree of development of the productive forces achieved by society and points out that despite this premise, there is an unequal development between material development and cultural development and artistic”. For Salazar Cisneros (2019) [16], it is through cultural development that the human being is capable of displaying all his creativity and achieves full cultural development,
using this human potential for his own benefit and that of the people. Cultural development promotes social action through culture as the foundation of development, in order to contribute to the formation of human capital, the cohesion of the social fabric, the strengthening of governance and the cultural integration of the region. They assume man as subject and main result, under the principle of equity and promotion of participation and creativity.
5 Conclusions
Educommunication has been consolidated as an effective communication strategy for the dissemination of cultural programs in contemporary society. In recent decades we have witnessed a significant change in the way culture is transmitted and consumed, and educommunication has played a fundamental role in this transformation. First of all, it is important to highlight that educommunication is based on the idea that education and communication are intrinsically related. In the context of the dissemination of cultural programs, this means that the objective is not only to entertain or inform, but also to educate the public about cultural issues. Cultural programs are not only a form of entertainment, but also an opportunity to promote knowledge and appreciation of culture. One of the most outstanding aspects of educommunication in the dissemination of cultural programs is its ability to reach a diverse and broad audience. Through different media, such as television, radio, Internet and social networks, audiences of different ages, interests and cultural backgrounds can be reached. This allows cultural programs to reach a broader and more diverse audience that would not otherwise have had access to this form of cultural expression. Furthermore, educommunication encourages active public participation. It is not simply about consuming cultural content, but about participating in it. Cultural programs can include interactive elements, such as online discussions, surveys or culture-related activities that encourage the audience to participate and contribute their own ideas and perspectives. This creates a richer and more meaningful experience for the audience. Educommunication also promotes reflection and dialogue. Cultural programs can address complex and controversial topics, providing the opportunity to generate meaningful debates in society.
This not only enriches the cultural content, but also contributes to the development of a more critical and reflective society. In conclusion, educommunication has become an essential strategy for the dissemination of cultural programs in contemporary society. It makes it easier to reach diverse audiences, promotes active participation, encourages reflection and dialogue and, ultimately, contributes to cultural enrichment and the strengthening of cultural identity in an era marked by globalization. As a result, educommunication continues to play a crucial role in promoting and preserving cultural diversity around the world.
Acknowledgment. The authors thank the Universidad Estatal Península de Santa Elena and the Universidad de Guayaquil for the resources contributed to the development of the projects "Study of the impact of communication 4.0" and "Training by skills of the communicator in the digital age".
References
1. Carias Perez, F., Marín Gutierrez, I., Hernando Gómez, Á.: Educomunicación e interculturalidad a partir de la gestión educativa con la radio. Revista de ciencias sociales y humanas Universitas 42 (2021). https://doi.org/10.17163/uni.n35.2021.02
2. Tissera, V.: Educomunicación en organizaciones culturales públicas. Estrategia de posicionamiento del Centro Cultural Comunitario Leonardo Favio. Actas de Periodismo y Comunicación 2(1), 1–10 (2019). https://perio.unlp.edu.ar/ojs/index.php/actas/article/view/4139
3. Barbas Coslado, Á.: Educomunicación: desarrollo, enfoques y desafíos en un mundo. Dialnet 165 (2012). https://dialnet.unirioja.es/servlet/articulo?codigo=4184243
4. Ortega, R., Fernández, J.: La Ontología de la Educación como un referente para la compresión de sí misma y del mundo. Redalyc 40 (2014). https://www.redalyc.org/pdf/4418/441846098003.pdf
5. Ávila Muñoz, P.: Educación a distancia y educomunicación. Repositorio Institucional de INFOTEC (2012). https://infotec.repositorioinstitucional.mx/jspui/bitstream/1027/190/1/Educaci%C3%B3n%20a%20distancia%20y%20educomunicaci%C3%B3n.%20Patricia%20%C3%81vila%20Mu%C3%B1oz.pdf
6. Molina, L., Rey, C., Vall, A., Clery, A., Santa María, G.: The Ecuadorian University System. Revista Científica y Tecnológica UPSE 3(3), 80–89 (2016). https://incyt.upse.edu.ec/ciencia/revistas/index.php/rctu/article/view/201
7. Clery, A., Molina, L., Linzán, S., Zambrano-Maridueña, R., Córdova, A.: University communication in times of covid-19: the ecuadorian case. In: Antipova, T. (ed.) Comprehensible Science. ICCS 2020. LNNS, vol. 186. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-66093-2_16
8. Cortez Clavijo, P., Molina, L., Clery, A., Cochea Panchana, G.: Aprendizaje servicio en el conocimiento de los estudiantes universitarios mediado por TIC. Un enfoque teórico. Repositorio UPSE (2018). file:///C:/Users/User/Downloads/428Texto%20del%20art%C3%ADculo-1244-4-10-20190820.pdf
9. Andrade Martínez, C.: La educomunicación de Don Bosco y la formación de universitarios como buenos ciudadanos. Scielo (2020). https://doi.org/10.4067/S0718-07052020000300007
10. Cruz Díaz, R.: La Educomunicación como herramienta de transformación social. La formacion de los profesionales de los medios. ResearchGate 10 (2016). https://www.researchgate.net/publication/318724114_La_Educomunicacion_como_herramienta_de_transformacion_social_La_formacion_de_los_profesionales_de_los_medios
11. Teatring, H.: Recursos Educativos. Recursos Educativos (2017). https://comunidad.recursoseducativos.com/la-educomunicacion-destino-una-nueva-sociedad/
12. Rodas, B.I.: Influencia de la Radio con un Enfoque Educomunicativo para la Formación Ciudadana, vol. 233. Docplayer (2017). https://docplayer.es/90906517-Influencia-de-la-radio-con-un-enfoque-educomunicativo-para-la-formacion-ciudadana.html
13. Soler Rocha, J., López Rivas, O.: Educomunicación y radio escolar en los campos boyacenses. Una perspectiva desde la hermenéutica de Gadamer. Scielo (2022). https://doi.org/10.19053/01227238.11663
14. Molina, L., Vera, N., Parrales, G., Laínez, A., Clery, A.: Applied Research in Social Sciences. La Libertad, Ecuador: Universidad Estatal Península de Santa Elena (2017). https://incyt.upse.edu.ec/libros/index.php/upse/catalog/book/60
15. De la A Quinteros, E.: Programa radial de orientación educativa para el fortalecimiento de valores dirigido a la comunidad del cantón La Libertad. Repositorio UPSE (2012). https://repositorio.upse.edu.ec/bitstream/46000/495/1/NARCISA%20DE%20VERA%20TESIS%20DE%20GRADO.pdf
16. Salazar Cisneros, Y.: El desarrollo cultural, complicidad necesaria. Scielo (2019). http://scielo.sld.cu/scielo.php?script=sci_arttext&pid=S2308-01322019000100088#B3
On Implemented Graph-Based Generator of Cryptographically Strong Pseudorandom Sequences of Multivariate Nature
Vasyl Ustimenko1,2(B) and Tymoteusz Chojecki3
1 Royal Holloway University of London, Egham, UK
[email protected]
2 Institute of Telecommunications and Global Information Space, Kyiv, Ukraine
3 University of Maria Curie-Skłodowska, Lublin, Poland
Abstract. Classical Multivariate Cryptography searches for special families of functions F on the affine space $K^n$, where F is a quadratic or cubic polynomial map and K is a finite commutative ring with unity. Usually the map F is given in its standard form, which is the tuple C(F) of nonzero coefficients of F ordered lexicographically. The owner of the publicly given map needs the private key, a piece of information T whose knowledge allows the reimage of F to be computed in polynomial time O(n). We consider the Inverse Problem of Multivariate Cryptography: find T for the given tuple C(F) of the multivariate map F. If α ≤ 2, then the solution of the Inverse Problem is harder than computation of the reimage of F, which is an NP-hard problem for general quadratic or cubic maps. We use the Inverse Problem of constructing C(F) for cubic multivariate maps constructed in terms of the well-known families of graphs D(n, K) and A(n, K), which appear in studies of Extremal Graph Theory and its applications. So the piece of information T is a seed for the cryptographically strong pseudorandom sequence C(F). We consider the cases $K = \mathbb{F}_q$, $K = \mathbb{Z}_q$, $q = 2^m$, and the case of the Boolean ring B(m) of cardinality $2^m$.
Keywords: Secure pseudorandom sequences · Multivariate cryptography · Stream ciphers · Public keys
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 84–98, 2024. https://doi.org/10.1007/978-3-031-54053-0_7
1 Introduction
Let K be a commutative ring and let ${}^{n}T_1$ and ${}^{n}T_2$ be elements of the group $AGL_n(K)$ of affine transformations of the free module $K^n$. The graphs D(n, K) and A(n, K) are bipartite graphs with partition sets isomorphic to $K^n$ such that the incidence of the point $(x_1, x_2, \dots, x_n)$ and the line $[y_1, y_2, \dots, y_n]$ is given by special equations of the kind $x_2 - y_2 = f_2(x_1, y_1)$, $x_3 - y_3 = f_3(x_1, x_2, y_1, y_2)$, …, $x_n - y_n = f_n(x_1, x_2, \dots, x_{n-1}, y_1, y_2, \dots, y_{n-1})$ (see [27] or [15] and further references). Let us refer to the first coordinates $\rho(x) = x_1$ and $\rho(y) = y_1$ as the colors of the point and the line. It is easy to see that each vertex of the graph has a unique neighbor of a chosen color. A path in the graph is given by the starting point and the consecutive sequence $\alpha_1, \alpha_2, \dots, \alpha_t$ of colors of vertices. Let $v(x, \alpha_1, \dots, \alpha_t)$ be the last vertex of the path of even length t starting from the point $(x_1, x_2, \dots, x_n)$ and defined by the colors $x_1 + \alpha_1, x_1 + \alpha_2, \dots, x_1 + \alpha_t$. We have $v(x, \alpha_1, \dots, \alpha_t) = (x_1 + \alpha_t, g_2(x_1, x_2), g_3(x_1, x_2, x_3), \dots, g_n(x_1, \dots, x_n))$. In fact the map $F: K^n \to K^n$ given by the rule $x_1 \to x_1 + \alpha_t$, $x_2 \to g_2(x_1, x_2)$, …, $x_n \to g_n(x_1, \dots, x_n)$ is a cubic multivariate map in the cases of the graphs D(n, K) and A(n, K) (see [15] for further references). We study the sequences C(G) for $G = T_1 F T_2$ as possible pseudorandom sequences with the “seed” given by $T_1$, $T_2$ and the tuple $\alpha_1, \alpha_2, \dots, \alpha_t$. It will be shown that these sequences are cryptographically strong, i.e. the recovery of the seed T from C(G) is an NP-hard task. We understand that various tests of pseudorandomness have to be used to justify the statistical properties of the sequences; these studies are planned as future research. It is noteworthy that the seed owners (Alice and Bob) can use $C(G) = (c_1, c_2, \dots, c_l)$ for the exchange of a genuinely random sequence $(p_1, p_2, \dots, p_n)$ generated by a modern quantum computer via one-time pad encryption $(p_1, \dots, p_n) + (c_1, \dots, c_n)$. In Sect. 2 we consider the concept of a trapdoor accelerator of a multivariate map of bounded degree and discuss the inverse problem of Multivariate Cryptography. The idea of a graph-based trapdoor accelerator T of Multivariate Cryptography is discussed in Sect. 3 for the cases of our graphs. We suggest a protocol for the elaboration of the seed T in Sect. 4. Conclusions and a general iterative scheme for the generation of a potentially infinite cryptographically strong sequence of ring characters are presented in Sect. 5.
2 On the Inverse Problem of Multivariate Cryptography
The generation of a cryptographically strong pseudorandom sequence of elements of a finite field $\mathbb{F}_q$ is a traditional problem of applied cryptography. We can replace $\mathbb{F}_q$ by a general commutative ring K with unity; the infinite cases $K = \mathbb{Z}$, $K = \mathbb{R}$ and $K = \mathbb{F}_2[x]$ are especially important. Some practical applications are discussed in the books [1] (Chapters 16, 17), [2] and [3]; the papers [4–9] were selected to demonstrate different approaches to the construction of pseudorandom sequences. It is noteworthy that genuinely random sequences can be constructed with quantum computers or other natural sources of randomness (see [10–12]). The task is to generate a potentially infinite sequence $a(n) = (a_1, a_2, \dots, a_{f(n)})$ of field characters which depends on the secret seed. We assume that f(n) is an increasing function of the natural variable n on the set $\mathbb{N}$ of natural numbers. In practice, the requirement of pseudorandomness means that the sequences a(n) pass several special tests which confirm that the behavior of the sequence is “similar” to that of a genuinely random sequence. Nowadays the term cryptographically strong means that knowledge of a(n) for some value of n does not allow an adversary to recover the seed and reconstruct the computation of a(x) for arbitrary x. It means that the adversary's task is at least as hard as a known NP-hard problem that is intractable even with a quantum computer.
We assume that two correspondents, Alice and Bob, use some protocol for the secure elaboration of the “seed”, which is a tuple $S = (s(1), s(2), \dots, s(d))$ of nonzero symbols from a finite field $\mathbb{F}_q$ of characteristic 2. They would like to construct secure renovations of this seed in the form of potentially infinite sequences ${}^{m}R_Y(S) = R(S)$ and ${}^{n}H_Z(S) = H(S)$ of nonzero field elements of polynomial lengths f(Y, m) and g(Z, n), where n and m are potentially infinite natural numbers. The parameters n and m, as well as the pieces of information Y and Z, are publicly known. In the case of finite commutative rings the correspondents use the string H(S) as the password of a one-time pad to encrypt a plaintext P from $(\mathbb{F}_q)^{g(Z,n)}$, so the ciphertext is P + H(S). The tuple R(S) is used as a new seed for the next round of the procedure. The correspondents agree on new numbers $n^*$ and $m^*$ and information pieces $Y^*$ and $Z^*$ and compute ${}^{m^*}R_{Y^*}(R(S)) = {}^{*}R$ and ${}^{n^*}H_{Z^*}(R(S)) = {}^{*}H$. They keep ${}^{*}R$ safely as the seed for the next session and use ${}^{*}H$ for encryption. Assume that an adversary has obtained the password H(S). He or she knows Z and n and can try to restore the seed S and break the communication process. We use Multivariate Cryptography techniques to implement this scheme and to make seed restoration an NP-hard problem. We generalize the above scheme by simply replacing $\mathbb{F}_q$ with an arbitrary commutative ring K with unity. We assume that a multivariate map F is given in its standard form $x_1 \to f_1(x_1, x_2, \dots, x_n)$, $x_2 \to f_2(x_1, x_2, \dots, x_n)$, …, $x_n \to f_n(x_1, x_2, \dots, x_n)$, where the $f_i$ are polynomials from $K[x_1, x_2, \dots, x_n]$ given in their standard forms, which are lists of monomial terms ordered lexicographically. Let C(F) be the list of nonzero coefficients of the lexicographically ordered monomial terms. In practice we will use quadratic or cubic multivariate maps.
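To make the two notions in this paragraph concrete, the following minimal Python sketch lists C(F) for a toy map over $\mathbb{Z}_8$ and uses it as a one-time pad P + H(S). The dictionary representation of polynomials, the function names, the toy map and the choice of descending lexicographic order on exponent tuples are assumptions made for illustration; they are not taken from the authors' implementation.

```python
# Illustrative sketch only: names and the toy map are hypothetical.
# A polynomial is a dict {exponent tuple: coefficient}.

def coefficient_list(poly_map, q):
    """Return C(F): nonzero coefficients of f_1, ..., f_n in order."""
    coeffs = []
    for poly in poly_map:
        # Assumed order: exponent tuples compared lexicographically,
        # largest first (x1 before x2, higher powers first).
        for exps in sorted(poly, reverse=True):
            c = poly[exps] % q
            if c:
                coeffs.append(c)
    return coeffs

def pad_encrypt(plaintext, pad, q):
    """One-time pad over Z_q: coordinate-wise addition P + H(S)."""
    return [(p + c) % q for p, c in zip(plaintext, pad)]

def pad_decrypt(ciphertext, pad, q):
    """Inverse operation: coordinate-wise subtraction of the pad."""
    return [(x - c) % q for x, c in zip(ciphertext, pad)]

q = 8
F = [
    {(2, 0): 3, (0, 1): 5, (0, 0): 1},  # f_1 = 3*x1^2 + 5*x2 + 1
    {(1, 1): 2, (0, 0): 7},             # f_2 = 2*x1*x2 + 7
]
pad = coefficient_list(F, q)
plaintext = [1, 2, 3, 4, 5]
ciphertext = pad_encrypt(plaintext, pad, q)
recovered = pad_decrypt(ciphertext, pad, q)
```

In a field of characteristic 2 the same addition could be realised as a bitwise XOR of the binary representations of the symbols; the modular version is used here only because it also covers $\mathbb{Z}_q$.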
For a nonlinear map F of bounded degree given in its standard form we define a trapdoor accelerator $F = {}^{1}T\, G_D\, {}^{2}T$ as the triple ${}^{1}T$, ${}^{2}T$, $G_D$ of transformations of $K^n$, where ${}^{i}T$, i = 1, 2, are elements of $AGL_n(K)$, $G = G_D$ is a nonlinear map on $K^n$ and D is a piece of information which allows us to compute the reimage for the nonlinear G in time $O(n^2)$ (see [20]). In this paper we assume that D is given as a tuple of characters (d(1), d(2), …, d(m)) in the alphabet K. We consider the INVERSE PROBLEM for the construction of a trapdoor accelerator of a multivariate rule, i.e. for a given standard form of F, find a trapdoor accelerator ${}^{1}T\, G_D\, {}^{2}T$ for F. We suggest the following general scheme. Let ${}^{n}F_r$ be a family of nonlinear maps in n variables which has a trapdoor accelerator of the kind $G_{D(n)}$, where $D(n) = ({}^{n}d(1), {}^{n}d(2), \dots, {}^{n}d(r))$ and r = m(n); the affine maps are identities. The correspondents have the initial seed (s(1), s(2), …, s(d)). One of them selects the parameters n and r = m(n), forms the multivariate frame Y(n, r), which consists of the tuple $h = (i_1, i_2, \dots, i_r)$ of elements from M = {1, 2, …, d}, the tuples $b(k) = ({}^{k}b_1, {}^{k}b_2, \dots, {}^{k}b_n)$ from $M^n$ and the matrices $M(k) = ({}^{k}z(i,j))$, $i, j \in \{1, 2, \dots, n\}$, k = 1, 2, with entries ${}^{k}z(i,j)$ from M, and sends it to his or her partner via an open channel. They compute the specialised matrices ${}^{k}M = (s({}^{k}z(i,j)))$ and tuples ${}^{k}b = (s({}^{k}b_1), s({}^{k}b_2), \dots, s({}^{k}b_n))$, k = 1, 2. They form the affine maps ${}^{1}T(x) = {}^{1}Mx + {}^{1}b$ and ${}^{2}T(x) = {}^{2}Mx + {}^{2}b$. Each correspondent computes the standard form of ${}^{1}T\, {}^{n}F_r\, {}^{2}T = G(Y(n,r)) = G$ and writes down the list C(G(Y(n, r))) of coefficients of monomial terms. They can treat C(G)
as the password H(S) for a one-time pad and use another multivariate frame $Y^*(m^*, r^*)$ as the new seed R(S). REMARK. It is possible to modify the definition of ${}^{1}M$ and ${}^{2}M$ with the option of entries from $M \cup \{1, 0\}$.
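The specialisation step of the frame can be sketched in a few lines of Python. The sketch assumes $K = \mathbb{Z}_q$, a toy seed of length 3 and hypothetical function names; it only shows how publicly sent index matrices and index tuples turn into a concrete affine map once the secret seed is substituted. It does not check that the specialised matrix is invertible, which membership in $AGL_n(K)$ requires; how this is ensured is not stated in the text above.

```python
# Illustrative sketch only: names, seed and frame are hypothetical.

def specialise_matrix(seed, index_matrix):
    """Replace each public index z(i, j) in {1..d} by the secret symbol s(z(i, j))."""
    return [[seed[z - 1] for z in row] for row in index_matrix]

def specialise_tuple(seed, index_tuple):
    """Replace each public index b_i in {1..d} by the secret symbol s(b_i)."""
    return [seed[i - 1] for i in index_tuple]

def affine_apply(matrix, shift, x, q):
    """Evaluate T(x) = M x + b over Z_q."""
    n = len(x)
    return [
        (sum(matrix[i][j] * x[j] for j in range(n)) + shift[i]) % q
        for i in range(n)
    ]

q = 8
seed = [3, 2, 7]            # secret seed (s(1), s(2), s(3))
z1 = [[1, 2], [2, 3]]       # public index matrix M(1) of the frame
b1 = [3, 1]                 # public index tuple b(1) of the frame

M1 = specialise_matrix(seed, z1)
shift1 = specialise_tuple(seed, b1)
image = affine_apply(M1, shift1, [1, 2], q)
```

Only the indices travel over the open channel; an observer who sees `z1` and `b1` but not `seed` cannot evaluate the map. In this toy case the determinant of the specialised matrix is odd, so the map happens to be invertible over $\mathbb{Z}_8$.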
3 On Graph Based Trapdoor Accelerators of Multivariate Cryptography
We suggest an algorithm in which the trapdoor accelerator ${}^{n}F_r$ defined over a commutative ring K is a cubic rule ${}^{w}F$ induced by a walk $w = {}^{r}w$ of length r on an algebraic incidence structure (bipartite graph) ${}^{n}\Gamma(K)$ with point and line sets isomorphic to the variety $K^n$. The walk depends on a sequence of symbols (s(1), s(2), …, s(r)) of length r in the alphabet K, and recovery of the walk between the plaintext tuple and the ciphertext gives information about the seed. It is noteworthy that Dijkstra's algorithm is able to find a path between given vertices in time $O(v \ln v)$, where v is the order of the graph. In our case the order is $2q^n$. It means that the complexity of this algorithm is subexponential. In the case $K = \mathbb{F}_q$ the graphs ${}^{n}\Gamma(q)$ of the suggested algorithm form one of the known families of graphs of increasing girth, D(n, q) and A(n, q) (see [13, 14] and further references, [15] and further references). Recall that the girth is the length of a minimal cycle in a graph. If the distance r between two vertices is less than half of the girth, then the shortest path between them is unique. For the graphs of each family the projective limit is well defined and tends to a q-regular forest. Connected components of these graphs are good tree approximations. It means that if n is sufficiently large then the expected complexity is $q(q-1)^{r-1}$. We select r, r ≤ n, as an unbounded linear function l(n) in the variable n. In fact it can be proven that the $a_i$, i = 1, 2, …, f(n), are polynomial expressions of degree r in the variables s(1), s(2), …, s(r). Let us construct the function ${}^{n}F$. The incidence structure A(n, K) is defined as the bipartite graph with point set $P = K^n$ and line set $L = K^n$ (two copies of a Cartesian power of K are used). We use parentheses and brackets to distinguish tuples from P and L, so $(p) = (p_1, p_2, \dots, p_n) \in P_n$ and $[l] = [l_1, l_2, \dots, l_n] \in L_n$.
The incidence relation $I = A(n, K)$ (or the corresponding bipartite graph $I$) is given by the condition that $p\,I\,l$ if and only if the following equations hold:

$p_2 - l_2 = l_1 p_1$, $p_3 - l_3 = p_1 l_2$, $p_4 - l_4 = l_1 p_3$, $p_5 - l_5 = p_1 l_4$, $\dots$, $p_n - l_n = p_1 l_{n-1}$ for odd $n$ and $p_n - l_n = l_1 p_{n-1}$ for even $n$.

We can consider an infinite bipartite graph $A(K)$ with points $(p_1, p_2, \dots, p_n, \dots)$ and lines $[l_1, l_2, \dots, l_n, \dots]$. It is proven that for each odd $n$ the girth indicator of $A(n, K)$ is at least $[n/2]$.

Another incidence structure $I = D(n, K)$ is defined as follows. We use the same notation for points and lines as in the case of the graphs $A(n, K)$. Points and lines of $D(n, K)$ are also elements of two copies of the affine space over $K$. The point $(p) = (p_1, p_2, \dots, p_n)$ is incident with the line $[l] = [l_1, l_2, \dots, l_n]$ if the following relations between their coordinates hold:

$p_2 - l_2 = l_1 p_1$, $p_3 - l_3 = p_1 l_2$, $p_4 - l_4 = l_1 p_3$, $\dots$, $l_i - p_i = p_1 l_{i-2}$ if $i$ is congruent to 2 or 3 modulo 4, and $l_i - p_i = l_1 p_{i-2}$ if $i$ is congruent to 1 or 0 modulo 4.

The incidence structures $D(n, F_q)$, $q > 2$, form a family of graphs of large girth (see [13]): for each pair $n \ge 2$, $q > 2$ the girth of the graph is at least $n + 5$.
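The incidence equations of $A(n, K)$ determine, for a given point and a chosen first coordinate of a line, all remaining coordinates of that line one after another, and symmetrically for a given line. A minimal Python sketch of this computation is given here, assuming $K = Z_m$ for a small modulus $m$. The function names and the choice of ring are illustrative assumptions and are not taken from the authors' implementation.

```python
def line_neighbour(point, colour, m):
    """Line of A(n, Z_m) incident with `point` whose first coordinate is `colour`."""
    n = len(point)
    line = [colour % m] + [0] * (n - 1)
    for i in range(2, n + 1):                       # i is the 1-based coordinate index
        if i % 2 == 0:                              # p_i - l_i = l_1 * p_{i-1}
            line[i - 1] = (point[i - 1] - line[0] * point[i - 2]) % m
        else:                                       # p_i - l_i = p_1 * l_{i-1}
            line[i - 1] = (point[i - 1] - point[0] * line[i - 2]) % m
    return line


def point_neighbour(line, colour, m):
    """Point of A(n, Z_m) incident with `line` whose first coordinate is `colour`."""
    n = len(line)
    point = [colour % m] + [0] * (n - 1)
    for i in range(2, n + 1):
        if i % 2 == 0:                              # p_i = l_i + l_1 * p_{i-1}
            point[i - 1] = (line[i - 1] + line[0] * point[i - 2]) % m
        else:                                       # p_i = l_i + p_1 * l_{i-1}
            point[i - 1] = (line[i - 1] + point[0] * line[i - 2]) % m
    return point


def is_incident(point, line, m):
    """Check all incidence equations of A(n, Z_m) for the pair (point, line)."""
    for i in range(2, len(point) + 1):
        if i % 2 == 0:
            rhs = line[0] * point[i - 2]
        else:
            rhs = point[0] * line[i - 2]
        if (point[i - 1] - line[i - 1] - rhs) % m != 0:
            return False
    return True


# Toy example over Z_7 with n = 5 (values chosen only for illustration).
example_point = [1, 2, 3, 4, 5]
example_line = line_neighbour(example_point, 3, 7)
```

Because every coordinate is forced by the previous ones, the neighbour of a prescribed colour is unique, which is the property the walk construction relies on.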
Let $\Gamma(n, K)$ be one of the graphs $D(n, K)$ or $A(n, K)$. The graph $\Gamma(n, K)$ has the so-called linguistic colouring $\rho$ of the set of vertices. We assume that $\rho(x_1, x_2, \dots, x_n) = x_1$ for the vertex $x$ (point or line) given by the tuple with coordinates $x_1, x_2, \dots, x_n$. We refer to $x_1 \in K$ as the colour of the vertex $x$. It is easy to see that each vertex has a unique neighbour of a selected colour. Let $N_a$ be the operator of taking the neighbour with colour $a \in K$.

Let $[y_1, y_2, \dots, y_n]$ be the line $y$ of $\Gamma(n, K[y_1, y_2, \dots, y_n])$ and let $(\alpha(1), \alpha(2), \dots, \alpha(t))$ and $(\beta(1), \beta(2), \dots, \beta(t))$, $t \ge 2$, be sequences of nonzero elements. We form the sequence of colours of points $a(1) = y_1 + \alpha(1)$, $a(2) = y_1 + \alpha(1) + \alpha(2)$, $\dots$, $a(t) = y_1 + \alpha(1) + \alpha(2) + \dots + \alpha(t)$ and the sequence of colours of lines $b(1) = y_1 + \beta(1)$, $b(2) = y_1 + \beta(1) + \beta(2)$, $\dots$, $b(t) = y_1 + \beta(1) + \beta(2) + \dots + \beta(t)$, and consider the sequence of vertices of $\Gamma(n, K[y_1, y_2, \dots, y_n])$:

$v = y$, ${}^1v = N_{a(1)}(v)$, ${}^2v = N_{b(1)}({}^1v)$, ${}^3v = N_{a(2)}({}^2v)$, $\dots$, ${}^{2t-1}v = N_{a(t)}({}^{2t-2}v)$, ${}^{2t}v = N_{b(t)}({}^{2t-1}v)$.

Assume that ${}^{2t}v = [v_1, v_2, \dots, v_n]$, where the $v_i$ are from $K[y_1, y_2, \dots, y_n]$. We consider the bijective quadratic transformation $g(\alpha(1), \alpha(2), \dots, \alpha(t) \mid \beta(1), \beta(2), \dots, \beta(t))$, $t \ge 2$, of the affine space $K^n$ of the kind $y_1 \to y_1 + \beta(t)$, $y_2 \to v_2(y_1, y_2)$, $y_3 \to v_3(y_1, y_2, y_3)$, $\dots$, $y_n \to v_n(y_1, y_2, \dots, y_n)$. It is easy to see that

$g(\alpha(1), \dots, \alpha(t) \mid \beta(1), \dots, \beta(t)) \cdot g(\gamma(1), \dots, \gamma(s) \mid \sigma(1), \dots, \sigma(s)) = g(\alpha(1), \dots, \alpha(t), \gamma(1) + \beta, \dots, \gamma(s) + \beta \mid \beta(1), \dots, \beta(t), \sigma(1) + \beta, \dots, \sigma(s) + \beta)$,

where $\beta = \beta(1) + \beta(2) + \dots + \beta(t)$.

THEOREM 1 [11]. Bijective transformations of the kind $g(\alpha(1), \alpha(2), \dots, \alpha(t) \mid \beta(1), \beta(2), \dots, \beta(t))$, $t \ge 2$, generate the subgroup $G(\Gamma(n, K))$ of transformations of $K^n$ with maximal degree 3.

Let $F$ be a standard form of ${}^1T\, g(\alpha(1), \alpha(2), \dots, \alpha(t) \mid \beta(1), \beta(2), \dots, \beta(t))\, {}^2T$, where ${}^1T$ and ${}^2T$ are elements of $AGL_n(K)$ and $t = O(n)$. Then the triple ${}^1T$, ${}^2T$, $(\alpha(1), \alpha(2), \dots, \alpha(t), \beta(1), \beta(2), \dots, \beta(t))$ is a trapdoor accelerator of $F$.
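The walk that defines $g$ can be evaluated numerically at a concrete starting line, which gives a quick way to observe the bijectivity claimed above. The sketch below is a minimal illustration under stated assumptions: it takes the graph $A(n, Z_m)$ for a small modulus $m$, uses the cumulative colour sequences $a(i)$ and $b(i)$ exactly as written in the text, and evaluates the map at numbers rather than computing its symbolic standard form. All function names are hypothetical.

```python
from itertools import product


def point_to_line(point, colour, m):
    """Neighbouring line of given colour in A(n, Z_m)."""
    line = [colour % m] + [0] * (len(point) - 1)
    for i in range(2, len(point) + 1):
        if i % 2 == 0:
            line[i - 1] = (point[i - 1] - line[0] * point[i - 2]) % m
        else:
            line[i - 1] = (point[i - 1] - point[0] * line[i - 2]) % m
    return line


def line_to_point(line, colour, m):
    """Neighbouring point of given colour in A(n, Z_m)."""
    point = [colour % m] + [0] * (len(line) - 1)
    for i in range(2, len(line) + 1):
        if i % 2 == 0:
            point[i - 1] = (line[i - 1] + line[0] * point[i - 2]) % m
        else:
            point[i - 1] = (line[i - 1] + point[0] * line[i - 2]) % m
    return point


def walk_map(y, alphas, betas, m):
    """Final line of the walk of length 2t that starts at the line y."""
    vertex = list(y)
    a = b = y[0]
    for alpha, beta in zip(alphas, betas):
        a = (a + alpha) % m                    # a(i) = y_1 + alpha(1) + ... + alpha(i)
        b = (b + beta) % m                     # b(i) = y_1 + beta(1) + ... + beta(i)
        vertex = line_to_point(vertex, a, m)   # odd step: point of colour a(i)
        vertex = point_to_line(vertex, b, m)   # even step: line of colour b(i)
    return vertex


def is_bijection(alphas, betas, m, n):
    """Brute-force check that y -> walk_map(y) permutes (Z_m)^n."""
    images = {tuple(walk_map(y, alphas, betas, m)) for y in product(range(m), repeat=n)}
    return len(images) == m ** n
```

With the cumulative colours used here, the first coordinate of the image equals $y_1$ plus the sum of all $\beta(i)$. The brute-force check is only feasible for toy sizes such as $m = 5$, $n = 4$.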
We will use the families of graphs $A(n, K)$ and $D(n, K)$ together with $A(n, K[y_1, y_2, \dots, y_n])$ and $D(n, K[y_1, y_2, \dots, y_n])$. Let $\Gamma(n, K)$ be one of those graphs defined over the commutative ring $K$ with unity. Assume that the correspondents Alice and Bob have already completed some seed agreement protocol and elaborated the seed $s = (s(1), s(2), \dots, s(k))$. Without loss of generality we assume that $s(i) \ne 0$ for $i = 1, 2, \dots, k$. For the construction of the multivariate frame they select parameters $t$ and $n$ together with sequences $(i_1, i_2, \dots, i_t)$, $(j_1, j_2, \dots, j_t)$ of elements from $M = \{1, 2, \dots, k\}$ and matrices ${}^rU = ({}^ru(i,j))$, $r = 1, 2$, with ${}^ru(i,j)$ from $M$. The correspondents take the linear transformations ${}^1T$ and ${}^2T$ corresponding to the matrices ${}^1A$ and ${}^2A$ with entries $s({}^1u(i,j))$ and $s({}^2u(i,j))$ and compute the standard form of

$F = {}^1T\, g(s(i_1), s(i_2), \dots, s(i_t) \mid s(j_1), s(j_2), \dots, s(j_t))\, {}^2T$.

We need some "general frame generation algorithm". A simple suggestion is the following. We concatenate the word $(s(1), s(2), \dots, s(d))$ with itself and get an infinite sequence $s_1, s_2, \dots, s_i, \dots$. We identify $({}^1u(1,1), {}^1u(1,2), \dots, {}^1u(1,n))$ with $(s_1, s_2, \dots, s_n)$, use the cyclic shift and set $({}^1u(i,1), {}^1u(i,2), \dots, {}^1u(i,n)) = (s_i, s_{i+1}, \dots, s_n, s_{n+1}, s_{n+2}, \dots, s_{n+i-1})$ for $i = 2, 3, \dots, n$.
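The cyclic-shift rule for the rows of ${}^1U$ can be sketched in a few lines. This is a minimal illustration, assuming the seed is given as a Python list and that the infinite sequence $s_1, s_2, \dots$ is its periodic repetition. The function name `shifted_rows` is hypothetical.

```python
from itertools import cycle, islice


def shifted_rows(seed, n):
    """Rows (s_i, s_{i+1}, ..., s_{n+i-1}), i = 1..n, of the periodic extension of `seed`."""
    stream = list(islice(cycle(seed), 2 * n - 1))   # s_1, ..., s_{2n-1}
    return [stream[i:i + n] for i in range(n)]


# Toy seed of length 3 and a 4 x 4 frame matrix (values chosen only for illustration).
rows = shifted_rows([3, 5, 7], 4)
```

Only the first $2n - 1$ terms of the periodic sequence are needed, so the cost of building the matrix is $O(n^2)$.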
We use reverse tuples to form the matrix ${}^2U$. So $({}^2u(1,1), {}^2u(1,2), \dots, {}^2u(1,n)) = (s_n, s_{n-1}, \dots, s_1)$ and $({}^2u(i,1), {}^2u(i,2), \dots, {}^2u(i,n)) = (s_{n+i-1}, s_{n+i-2}, \dots, s_{n+1}, s_n, s_{n-1}, \dots, s_i)$ for $i = 2, 3, \dots, n$. We set $(i_1, i_2, \dots, i_t)$ and $(j_1, j_2, \dots, j_t)$ as $(s_1, s_2, \dots, s_t)$ and $(s_{1+t}, s_{2+t}, \dots, s_{2t})$ respectively.

So the correspondents can use the sequence of symbols $C(F)$ as a password for the additive one-time pad with plaintext space $K^{d(F)}$, where $d(F)$ is the number of monomial terms of the multivariate map $F$. Another multivariate frame can be used for the seed renovation. Alternatively, the correspondents can use a new session of the protocol for the seed elaboration.

Another option is to use the stream cipher on $K^n$ where each ${}^rT$ is replaced by the composition of lower and upper unitriangular matrices ${}^rL$ and ${}^rU$ with nonzero entries from ${}^rA$. One of the options is to use the transformations $T_1: y \to {}^1U\,{}^1L\,y + ({}^1a(1,1), {}^1a(2,2), \dots, {}^1a(n,n))$ and $T_2: y \to {}^2L\,{}^2U\,y + ({}^2a(1,1), {}^2a(2,2), \dots, {}^2a(n,n))$. So the correspondents use the bijective transformation $F = T_1\, g(\alpha(1), \alpha(2), \dots, \alpha(t) \mid \beta(1), \beta(2), \dots, \beta(t))\, T_2$ for the encryption. Knowledge of the trapdoor accelerator allows the correspondents to encrypt or decrypt in time $O(n^2)$.

Trapdoor Modification. In the case $K = F_q$, $q = 2^r$, $r \ge 16$, we can use the operator ${}^aJ$ of changing the colour $p_1$ of the point $(p_1, p_2, \dots, p_n)$ of the graph $\Gamma(n, K)$ for the ring element $a$. We take the path in the graph $\Gamma(n, K[y_1, y_2, \dots, y_n])$ corresponding to $g(\alpha(1), \alpha(2), \dots, \alpha(t) \mid \beta(1), \beta(2), \dots, \beta(t))$ with the starting point $(y_1, y_2, \dots, y_n)$ and the ending point ${}^{2t}v$. We change ${}^{2t}v$ for $v = {}^aJ({}^{2t}v) = (y_1^2, v_2, \dots, v_n)$, $a = y_1^2$, and consider the rule $y_1 \to y_1^2$, $y_2 \to v_2(y_1, y_2)$, $y_3 \to v_3(y_1, y_2, y_3)$, $\dots$, $y_n \to v_n(y_1, y_2, \dots, y_n)$. This rule induces the bijective cubic transformation $h(\alpha(1), \alpha(2), \dots, \alpha(t) \mid \beta(1), \beta(2), \dots, \beta(t))$ of the vector space $K^n$.
Then the polynomial degree of the inverse of $G = T_1\, h(\alpha(1), \alpha(2), \dots, \alpha(t) \mid \beta(1), \beta(2), \dots, \beta(t))\, T_2$ is at least $2^{r-1}$; a description of this graph based accelerator can be found in [20]. Note that the map $F$ and its inverse are cubic transformations. An adversary has to intercept more than $n^3/2$ plaintext/ciphertext pairs to restore $F$ or its inverse. Theoretically, interception of $O(n^3)$ pairs allows the adversary to break the stream cipher in time $O(n^{10})$ via linearisation attacks. It is easy to see that the transformation $G$ is resistant to linearisation attacks.

REMARK 1. In the case of $\Gamma(n, K)$ based encryption we can use a sparse frame given by two numbers $r$ and $n$ and sequences $(i_1, i_2, \dots, i_t)$, $(j_1, j_2, \dots, j_t)$ of elements from $M = \{1, 2, \dots, k\}$ together with two sequences $({}^1i, {}^2i, \dots, {}^{n-1}i)$ and $({}^1j, {}^2j, \dots, {}^{n-1}j)$ from $M^{n-1}$. So Alice and Bob form linear transformations ${}^1\tau$ and ${}^2\tau$ such that ${}^1\tau(y_1) = y_1 + s({}^1i)y_2 + s({}^2i)y_3 + \dots + s({}^{n-1}i)y_n$, ${}^2\tau(y_1) = y_1 + s({}^1j)y_2 + s({}^2j)y_3 + \dots + s({}^{n-1}j)y_n$, and ${}^j\tau(y_i) = y_i$ for $j = 1, 2$ and $i \ge 2$. So the correspondents compute the standard form of $F = {}^1\tau\, g(s(i_1), s(i_2), \dots, s(i_t) \mid s(j_1), s(j_2), \dots, s(j_t))\, {}^2\tau$ and are able to use the string $C(F)$. Let us assume that $t = O(n^{\alpha})$ where $0 \le \alpha < 1$. Then the inverse problem of restoration of the sparse frame is harder than finding an algorithm computing $F^{-1}$ in time $O(n^{\alpha+1})$.
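The additive one-time pad with the password $C(F)$, mentioned in this section, amounts to coordinate-wise addition in the ring. A minimal sketch follows, assuming $K = Z_q$ and assuming the pad is already available as a list of residues. The pad values below are made up for illustration and are not coefficients of an actual map $F$.

```python
def otp_encrypt(plaintext, pad, q):
    """Add the pad to the plaintext coordinate-wise in Z_q."""
    if len(pad) < len(plaintext):
        raise ValueError("the pad must be at least as long as the plaintext")
    return [(p + c) % q for p, c in zip(plaintext, pad)]


def otp_decrypt(ciphertext, pad, q):
    """Subtract the pad from the ciphertext coordinate-wise in Z_q."""
    if len(pad) < len(ciphertext):
        raise ValueError("the pad must be at least as long as the ciphertext")
    return [(x - c) % q for x, c in zip(ciphertext, pad)]


# Hypothetical pad standing in for C(F); a real pad would hold the d(F) coefficients of F.
pad = [7, 1, 4, 9]
ciphertext = otp_encrypt([3, 15, 0], pad, 16)
```

Each pad symbol may be used only once, which is why the text pairs the pad with a seed renovation step.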
Recall that solving a nonlinear system of polynomial equations is a known NP-hard problem; if the inverse map $F^{-1}$ is cubic, it can be found in time $O(n^{10})$. We implemented the algorithm described above for generating $C(F)$ in the cases of finite fields $F_q$, $q = 2^m$, of characteristic 2, arithmetical rings $Z_q$ and Boolean rings $B(m, 2)$ of order $2^m$.

REMARK 2. We can treat the element $\alpha_1 + \alpha_2 x + \alpha_3 x^2 + \dots + \alpha_m x^{m-1}$ of $F_q$, $q = 2^m$, as a sequence of elements $(\alpha_1, \alpha_2, \dots, \alpha_m)$ of $F_2$ (an element of the Boolean ring) or as the number $\alpha_1 + \alpha_2 2 + \alpha_3 2^2 + \dots + \alpha_m 2^{m-1}$ (an element of $Z_q$).

The results of computer simulations are presented in [19]. For the reader's convenience we present the results from [19] on the density of the quadratic maps defined over the finite fields $F_q$, $q = 2^{32}$, in Tables 3, 4, 5 and 6. We conducted new computer simulations to investigate maps in the case of arithmetical rings $Z_q$, $q = 2^{32}$. These results are presented in Tables 1 and 2. The comparison of densities of quadratic maps in the cases of finite fields $F_q$, commutative rings $Z_q$ and Boolean rings of size $q$ is given in Figs. 1, 2, 3 and 4.

Table 1. Number of monomial terms of the cubic map induced by the walk on the graph $\Gamma(n, Z_{2^{32}})$, case of sparse frame (rows: $n$; columns: length of the walk $r$)

n \ r         16          32          64         128         256
16          4899        4899        4899        4899        4899
32         53710       61013       61013       61013       61013
64        498046      737132      854249      854249      854249
128      4264158     7215820    10879512    12755665    12755665
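The identification in Remark 2 between a coefficient tuple, an element of the Boolean ring and a residue modulo $2^m$ can be sketched as follows. The helper names are hypothetical. The sketch also shows that the same bit pattern is added differently in $Z_q$ (with carries) and in $F_q$ or $B(m, 2)$ (coordinate-wise modulo 2).

```python
def bits_to_int(bits):
    """Map (a_1, ..., a_m) to a_1 + a_2 * 2 + ... + a_m * 2^(m-1)."""
    return sum(bit << i for i, bit in enumerate(bits))


def int_to_bits(value, m):
    """Map a residue modulo 2^m back to its tuple of m binary coefficients."""
    return [(value >> i) & 1 for i in range(m)]


def add_in_zq(x, y, m):
    """Addition in the arithmetical ring Z_q, q = 2^m."""
    return (x + y) % (1 << m)


def add_in_char2(x, y):
    """Addition in F_q or in the Boolean ring B(m, 2): coordinate-wise modulo 2."""
    return x ^ y


# The tuple (1, 0, 1) corresponds to 1 + x^2 in F_8 and to the number 5 in Z_8.
five = bits_to_int([1, 0, 1])
```

For example, the patterns 5 and 7 add to 4 in $Z_8$ but to 2 in $F_8$ and in $B(3, 2)$.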
Table 2. Number of monomial terms of the cubic map induced by the walk on the graph $A(n, Z_{2^{32}})$, case of full linear transformations $T_1$ and $T_2$ (rows: $n$; columns: length of the walk $r$)

n \ r         16          32          64         128         256
16         15504       15504       15504       15504       15504
32        209440      209440      209440      209440      209440
64       3065920     3065905     3065920     3065920     3065920
128     46866560    46866560    46866560    46866560    46866560
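For orientation, the entries of Table 2 can be compared with the largest possible number of monomial terms of a cubic map. A polynomial of degree at most 3 in $n$ variables has at most $\binom{n+3}{3}$ monomials, so a map of $K^n$ with $n$ coordinates has at most $n\binom{n+3}{3}$ terms. The sketch below is a simple arithmetic check of this bound against the first column of Table 2. It is not part of the experiment reported in the paper, and the function name is hypothetical.

```python
from math import comb


def max_cubic_terms(n):
    """Upper bound on the number of monomial terms of a cubic map of K^n."""
    return n * comb(n + 3, 3)


# First column (r = 16) of Table 2, copied from the text.
table2_r16 = {16: 15504, 32: 209440, 64: 3065920, 128: 46866560}

for n, observed in table2_r16.items():
    print(n, observed, max_cubic_terms(n), observed == max_cubic_terms(n))
```

Under this reading, the maps with full linear transformations over $Z_{2^{32}}$ attain the bound in this column, while the sparse-frame values of Table 1 stay well below it.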
REMARK 3. The experiment presented above does not use the TRAPDOOR MODIFICATION described above. In fact, we also modified the map in the cases of $A(n, F_q)$ and $A(n, Z_q)$, $q = 2^{32}$, and obtained tables with similar numbers of nonzero coefficients. A graphical presentation of the results of computer simulations in the cases of $F_q$, $Z_q$ and $B_q$ is given below.
Table 3. Number of monomial terms of the cubic map induced by the walk on the graph $D(n, F_{2^{22}})$, case of sparse frame (rows: $n$; columns: length of the walk $r$)

n \ r         16          32          64         128         256
16          3649        3649        3649        3649        3649
32         41355       41356       41356       41356       41356
64        440147      529052      529053      529053      529053
128      3823600     6149213     7405944     7405945     7405945
Table 4. Density of the cubic map induced by the walk on the graph $D(n, F_{2^{22}})$, case of full linear transformations (rows: $n$; columns: length of the word)

n \ length    16          32          64         128         256
16          6544        6544        6544        6544        6544
32         50720       50720       50720       50720       50720
64        399424      399424      399424      399424      399424
128      3170432     3170432     3170432     3170432     3170432
Table 5. Density of the cubic map of linear degree induced by the graph $A(n, Z_{2^{22}})$, case of sparse frame (rows: $n$; columns: length of the walk)

n \ length    16          32          64         128         256
16          5623        5623        5623        5623        5623
32         53581       62252       62252       62252       62252
64        454375      680750      781087      781087      781087
128      3607741     6237144     9519921    10826616    10826616
Table 6. Density of the map of linear degree induced by the graph $A(n, Z_{2^{22}})$, case of general linear transformations $T_1$ and $T_2$ (rows: $n$; columns: length of the walk)

n \ length    16          32          64         128         256
16          6544        6544        6544        6544        6544
32         50720       50720       50720       50720       50720
64        399424      399424      399424      399424      399424
128      3170432     3170432     3170432     3170432     3170432

4 Example of the Seed Elaboration Protocol of Multivariate Nature

The algorithms presented above for generating potentially infinite sequences of ring elements use seeds in the form of tuples of nonzero elements. Such seeds can be elaborated via protocols of Noncommutative Cryptography (see [21–25]) based on various platforms. We will use one of the simplest protocols of Noncommutative Cryptography, which is a straightforward generalization of the Diffie-Hellman algorithm. The scheme is presented below.
Fig. 1. Number of monomial terms of the cubic map induced by the walk on the graph (n = 128), case of sparse frame.

Fig. 2. Number of monomial terms of the map induced by the walk on the graph (n = 128), case of general frame.

Fig. 3. Number of monomial terms of the cubic map induced by the graph (n = 128), case of sparse frame.

Fig. 4. Number of monomial terms of the map induced by the walk on the graph (n = 128), case of general frame.

Twisted Diffie-Hellman Protocol. Let $S$ be an abstract semigroup which has some invertible elements. Alice and Bob share an element $g \in S$ and a pair of invertible elements $h$, $h^{-1}$ from this semigroup. Alice takes positive integers $t = k_A$ and $d = r_A$ and forms $h^{-d} g^t h^d = g_A$. Bob takes $s = k_B$ and $p = r_B$ and forms $h^{-p} g^s h^p = g_B$. They exchange $g_A$ and $g_B$ and compute the collision element $X$ as ${}^Ag = h^{-d} g_B^{\,t} h^d$ and ${}^Bg = h^{-p} g_A^{\,s} h^p$ respectively; both expressions equal $h^{-(d+p)} g^{st} h^{d+p}$. The security of the scheme rests on the Conjugation Power Problem: the adversary has to solve the equation $h^{-x} g^y h^x = b$, where $b$ coincides with $g_B$ or $g_A$. The complexity of the problem depends heavily on the choice of a highly noncommutative platform $S$.

We will use semigroups of polynomial transformations of the affine space $K^n$ of the kind $x_1 \to f_1(x_1, x_2, \dots, x_n)$, $x_2 \to f_2(x_1, x_2, \dots, x_n)$, $\dots$, $x_n \to f_n(x_1, x_2, \dots, x_n)$, where $f_i$, $i = 1, 2, \dots, n$, are polynomials. Note that in the case $n = 1$ the composition of two nonlinear transformations of degrees $s$ and $r$ has degree $rs$. The same fact holds for the majority of nonlinear transformations in $n$ variables. For the feasibility of computations in the semigroup of transformations we require that the composition of $n$ elements can be computed in polynomial time $O(n^{\alpha})$, $\alpha > 0$. We refer to this property as the Multiple Composition Polynomiality Property (MCP). Below we present one of the MCP type families for which the Conjugation Power Problem is postquantum intractable, i.e. the usage of a Quantum Computer for cryptanalysis does not lead to a change of its NP-hard status.

Let $K$ be a finite commutative ring with the multiplicative group $K^*$ of regular elements of the ring. We take the Cartesian power ${}^nE(K) = (K^*)^n$ and consider the Eulerian semigroup ${}^nES(K)$ of transformations of the kind

$x_1 \to \mu_1 x_1^{a(1,1)} x_2^{a(1,2)} \cdots x_n^{a(1,n)}$,
$x_2 \to \mu_2 x_1^{a(2,1)} x_2^{a(2,2)} \cdots x_n^{a(2,n)}$,
$\dots$
$x_n \to \mu_n x_1^{a(n,1)} x_2^{a(n,2)} \cdots x_n^{a(n,n)}$,    (1)

where $a(i,j)$ are elements of the arithmetic ring $Z_d$, $d = |K^*|$. Let ${}^nEG(K)$ stand for the Eulerian group of invertible transformations from ${}^nES(K)$. A simple example of an element of ${}^nEG(K)$ is the transformation written above with $a(i,j) = 1$ for $i \ne j$ or $i = j = 1$, and $a(j,j) = 2$ for $j \ge 2$. It is easy to see that the group of monomial linear transformations $M_n$ is a subgroup of ${}^nEG(K)$. So the semigroup ${}^nES(K)$ is a highly noncommutative algebraic system. Each element of ${}^nES(K)$ can be considered as a transformation of the free module $K^n$ (see [15]).

We implemented the protocol described above with the platform ${}^nES(K)$ in the cases of fields $K = F_q$, $q = 2^m$, and arithmetical rings $K = Z_q$. The output of the algorithm is an element as above with elements $a(i,j)$ from the multiplicative group $F_q^*$ (case of the field) or the group $Z_t$, $t = 2^{m-1}$, in the case of the arithmetical ring. We form the matrix of regular elements of $K$ and treat it as a sequence of elements of length $n^2$. If necessary we identify the nonzero field element $a_0 + a_1 x + a_2 x^2 + \dots + a_{m-1} x^{m-1}$ with the tuple $(a_0, a_1, \dots, a_{m-1})$ from the Boolean ring $B(m, 2)$ of order $2^m$.
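The message flow of the twisted Diffie-Hellman protocol can be sketched on a toy platform. The example below is a minimal illustration under stated assumptions: the semigroup $S$ is taken to be the $2 \times 2$ matrices over $Z_{101}$, which is not the platform used by the authors and offers no security; the concrete matrices and exponents are made up. It only shows that both parties arrive at the same collision element.

```python
P = 101  # toy prime modulus (assumption, far too small for real use)


def mat_mul(a, b):
    """Product of two 2 x 2 matrices over Z_P."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) % P for j in range(2)]
            for i in range(2)]


def mat_pow(a, exponent):
    """Square-and-multiply power of a 2 x 2 matrix over Z_P."""
    result = [[1, 0], [0, 1]]
    while exponent > 0:
        if exponent & 1:
            result = mat_mul(result, a)
        a = mat_mul(a, a)
        exponent >>= 1
    return result


def mat_inv(a):
    """Inverse of an invertible 2 x 2 matrix over Z_P via the adjugate."""
    det = (a[0][0] * a[1][1] - a[0][1] * a[1][0]) % P
    inv_det = pow(det, -1, P)          # modular inverse, Python 3.8+
    return [[a[1][1] * inv_det % P, -a[0][1] * inv_det % P],
            [-a[1][0] * inv_det % P, a[0][0] * inv_det % P]]


def twisted_power(h, h_inv, x, power, twist):
    """Compute h^(-twist) * x^power * h^twist."""
    return mat_mul(mat_mul(mat_pow(h_inv, twist), mat_pow(x, power)), mat_pow(h, twist))


# Public data shared over the open channel.
g = [[1, 2], [3, 4]]
h = [[2, 1], [1, 1]]
h_inv = mat_inv(h)

# Private exponents: Alice holds (t, d), Bob holds (s, p).
t, d = 5, 3
s, p = 7, 4

g_alice = twisted_power(h, h_inv, g, t, d)       # sent to Bob
g_bob = twisted_power(h, h_inv, g, s, p)         # sent to Alice

key_alice = twisted_power(h, h_inv, g_bob, t, d)
key_bob = twisted_power(h, h_inv, g_alice, s, p)
```

Both keys equal $h^{-(d+p)} g^{st} h^{d+p}$ because the powers of $h$ commute with each other and conjugation by $h^p$ commutes with taking powers.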
For the generation of the invertible element $h$ of the protocol we use the transformation $E$ which is obtained as the composition of the "upper triangular element" ${}^1E$

$x_1 \to q_1 x_1^{a(1,1)} x_2^{a(1,2)} \cdots x_n^{a(1,n)}$,
$x_2 \to q_2 x_2^{a(2,2)} x_3^{a(2,3)} \cdots x_n^{a(2,n)}$,
$\dots$
$x_{n-1} \to q_{n-1} x_{n-1}^{a(n-1,n-1)} x_n^{a(n-1,n)}$,
$x_n \to q_n x_n^{a(n,n)}$    (2)

and the lower triangular element ${}^2E$

$x_1 \to r_1 x_1^{b(1,1)}$,
$x_2 \to r_2 x_1^{b(2,1)} x_2^{b(2,2)}$,
$\dots$
$x_n \to r_n x_1^{b(n,1)} x_2^{b(n,2)} \cdots x_n^{b(n,n)}$,    (3)

where $q_i$ and $r_i$ are regular elements of $K$, the elements $a(i,j)$, $b(i,j)$ are from the group $Z_t$, where $t$ is the order of the multiplicative group of the ring, and the residues $a(i,i)$, $b(i,i)$ are mutually prime with the modulus $t$. Note that computation of the inverses of ${}^1E$ and ${}^2E$ is straightforward. In fact, we can use other platforms of affine transformations and more general protocols in terms of semigroups of transformations of $K^n$ and their homomorphic images (see [16–19]). The security of generalised protocols rests on the complexity of the Word Decomposition Problem, which asks for the decomposition of an element $w$ of a semigroup $S$ into a combination of given generators of $S$. This problem is harder than its particular case, the Conjugation Power Problem.
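Elements of the kinds (1), (2) and (3) can be stored as a coefficient vector together with an exponent matrix, and their composition reduces to a matrix product of the exponents plus an update of the coefficients. The sketch below illustrates this under stated assumptions: $K$ is taken to be the prime field $F_{11}$ (so $K^*$ has order 10), the representation and function names are hypothetical, and the concrete elements are made up.

```python
from itertools import product

P = 11        # toy prime; K = F_P and K* = {1, ..., P - 1}
D = P - 1     # order of K*, so exponents can be reduced modulo D


def apply_element(element, x):
    """Apply x_i -> mu_i * prod_j x_j^a(i,j) to a tuple x from (K*)^n."""
    mu, exponents = element
    image = []
    for mu_i, row in zip(mu, exponents):
        value = mu_i
        for x_j, e in zip(x, row):
            value = value * pow(x_j, e, P) % P
        image.append(value)
    return image


def compose(first, second):
    """Element acting as x -> second(first(x))."""
    mu, a = first
    nu, b = second
    n = len(mu)
    coefficients = []
    for i in range(n):
        c = nu[i]
        for j in range(n):
            c = c * pow(mu[j], b[i][j], P) % P
        coefficients.append(c)
    exponents = [[sum(b[i][j] * a[j][k] for j in range(n)) % D for k in range(n)]
                 for i in range(n)]
    return coefficients, exponents


# Upper triangular element of kind (2) and lower triangular element of kind (3);
# the diagonal exponents are coprime to D = 10.
upper = ([2, 3, 4], [[1, 2, 3], [0, 3, 1], [0, 0, 7]])
lower = ([5, 6, 7], [[3, 0, 0], [2, 1, 0], [4, 5, 9]])
h = compose(upper, lower)

points = list(product(range(1, P), repeat=3))
images = {tuple(apply_element(h, x)) for x in points}
```

The composition costs one $n \times n$ matrix product and $O(n^2)$ modular exponentiations. For the toy parameters above, `images` has as many elements as `points`, so `h` permutes $(K^*)^3$.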
5 Conclusions and Topics for Further Research

We suggest a protocol based communication scheme for postquantum usage. It uses nonlinear transformations of the affine space $K^n$, where $K$ is a finite commutative ring with unity. Choices of $K$ that are convenient for practical applications are the finite field of characteristic 2 of order $2^s$, the arithmetic ring $Z_t$, $t = 2^s$, and the Boolean ring $B(s, 2)$. The correspondents Alice and Bob can use the following communication scheme or its modification.

• Firstly, they have to generate a "seed of information". The correspondents agree on the parameter $s$, the basic commutative ring $K$, which is the field $F_q$ or the arithmetic ring $Z_t$, and the dimension $n$ of the affine space. Alice selects elements ${}^1E$ of kind (2) and ${}^2E$ of kind (3). She computes $h = {}^1E\,{}^2E$ and its inverse $h^{-1}$. She selects a transformation $g$ of kind (1) and sends the triple $(h, h^{-1}, g)$ to her partner Bob via an open channel. Alice and Bob conduct the algorithm described in Sect. 4. So they elaborate a collision element $C$ of kind (1) with coefficients $\mu_i$ and $a(i,j)$ in a secure way. They form the matrix $B = (b(i,j))$ of regular elements of $K$. The correspondents arrange these entries according to the lexicographical order and get the seed in the form of a tuple $(s(1), s(2), \dots, s(n^2))$. Note that the complexity of this protocol is $O(n^4)$.

1) The correspondents have to agree via an open channel on the commutative ring $R$. They can treat the characters $s(i)$ as field elements, residues or elements of the Boolean ring $B(s, 2)$.

2) They use the elaborated seed for the creation of a cryptographically strong potentially infinite sequence $(b(1), b(2), \dots, b(t))$ from $R^t$ for some parameter $t$. The correspondents agree on a potentially infinite parameter $m$, the graph $\Gamma_m(R)$ ($A(m, R)$ or $D(m, R)$), the type of the frame for the multivariate accelerator (general frame of Sect. 3 or sparse frame of Remark 1) and the parameter $r$ (length of the path). They construct the multivariate accelerator, which is the cubic transformation $F$ acting on the affine space $R^m$.

3) They can exchange information with the usage of the following options.
   a) Compute the standard form of $F$ and the tuple $C(F) = (c_1, c_2, \dots, c_l)$, where $l = l(r, m, R)$ depends on the choice made during step 2. The experiment demonstrates that the parameter $l$ does not depend on the coordinates of the seed, $n^2 < l < n^3$. The correspondents use the one-time pad. One of them creates the plaintext $(p) = (p_1, p_2, \dots, p_l)$ and sends the ciphertext $(p) + C(F)$ to his or her partner. After this action the correspondents can go to the next step.
   b) The correspondents use their knowledge of the frame for $F$ and use the bijective trapdoor accelerator for the encryption of plaintexts from $R^n$. They can exchange up to $n^3/2$ messages and after that go to step 4. In the case of large fields of characteristic 2 the correspondents can change $F$ for $G$ described in the remark on trapdoor modification presented above. They can use this $G$ without time limitations.

4) The change of seed. There are the following two options.
   a) The correspondents repeat step 2 with the same seed $s(1), s(2), \dots$ and with different data, which include a new graph of kind $\Gamma_{m'}(R')$ and a different type of frame in comparison with the previous frame usage. They create the corresponding accelerator $G$ and take $C(G)$ as a new seed.
   b) Alice and Bob change the seed via a new session of the described twisted Diffie-Hellman protocol.

After step 4 they perform the sequence of actions 2) and 3) for the encryption with the new seed and go to step 4 again.

We plan to test sequences of kind $C(F)$ for the graph based cubic transformations presented above via various approaches for the investigation of pseudorandom sequences (see [26]).

Funding Information. This research is partially supported by British Academy Fellowship for Researchers under Risk 2022 and by UMCS Mini-Grants programme.
References

1. Schneier, B.: Applied Cryptography, Second Edition: Protocols, Algorithms, and Source Code in C. Wiley, 784 p.
2. Boneh, D., Shoup, V.: A Graduate Course in Applied Cryptography. Stanford University, free on-line course
3. Easttom, W.: Random number generators. In: Easttom, W. (ed.) Modern Cryptography: Applied Mathematics for Encryption and Information Security, pp. 257–276. Springer International Publishing, Cham (2021). https://doi.org/10.1007/978-3-030-63115-4_12
4. Grozov, V., Guirik, A., Budko, M., Budko, M.: Development of a pseudo-random sequence generation function based on the cryptographic algorithm "Kuznechik". In: Proceedings of the 12th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT 2020), Czech Republic, pp. 93–98 (2020). https://doi.org/10.1109/ICUMT51630.2020.9222457
5. Balková, Ľ., Bucci, M., De Luca, A., Hladký, J., Puzynina, S.: Aperiodic pseudorandom number generators based on infinite words. Theor. Comput. Sci. 647, 85–100 (2016). https://doi.org/10.1016/j.tcs.2016.07.042
6. Kaszián, J., Moree, P., Shparlinski, I.E.: Periodic structure of the exponential pseudorandom number generator. In: Larcher, G., Pillichshammer, F., Winterhof, A., Xing, C. (eds.) Applied Algebra and Number Theory, pp. 190–203. Cambridge University Press (2014). https://doi.org/10.1017/CBO9781139696456.012
7. Panneton, F., L'Ecuyer, P., Matsumoto, M.: Improved long-period generators based on linear recurrences modulo 2. ACM Trans. Math. Software 32, 1–16 (2006)
8. Hastad, J., Impagliazzo, R., Levin, L.A., Luby, M.: A pseudorandom generator from any one-way function. SIAM J. Comput. 28(4), 1364–1396 (1999)
9. Blackburn, S., Murphy, S., Paterson, K.G.: Comments on "Theory and applications of cellular automata in cryptography" [with reply]. IEEE Trans. Comput. 46(5), 637–639 (1997). https://doi.org/10.1109/12.589245
10. Wikramaratna, R.S.: Theoretical and empirical convergence results for additive congruential random number generators. J. Comput. Appl. Math. (2009). https://doi.org/10.1016/j.cam.2009.10.015
11. Herrero-Collantes, M., Garcia-Escartin, J.C.: Quantum random number generators. Rev. Mod. Phys. 89(1), 1–54 (2016). https://doi.org/10.1103/RevModPhys.89.015004
12. Johnston, D.: Random Number Generators – Principles and Practices: A Guide for Engineers and Programmers. DeG Press (2018)
13. Lazebnik, F., Ustimenko, V.A., Woldar, A.J.: A new series of dense graphs of high girth. Bull. Am. Math. Soc. 32(1), 73–79 (1995). https://doi.org/10.1090/S0273-0979-1995-00569-0
14. Ustimenko, V.A.: On the extremal graph theory and symbolic computations. Dopovidi National Academy of Science, No. 2, pp. 42–49. Ukraine (2013)
15. Ustimenko, V.: Graphs in Terms of Algebraic Geometry, Symbolic Computations and Secure Communications in Post-Quantum World, p. 198. University of Maria Curie Sklodowska Editorial House, Lublin (2022)
16. Ustimenko, V., Klisowski, M.: On non-commutative cryptography with cubical multivariate maps of predictable density. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) Intelligent Computing: Proceedings of the 2019 Computing Conference, Volume 2, pp. 654–674. Springer International Publishing, Cham (2019). https://doi.org/10.1007/978-3-030-22868-2_47
17. Ustimenko, V., Klisowski, M.: On D(n; q) quotients of large girth and hidden homomorphism based cryptographic protocols. In: Ganzha, M., Maciaszek, L., Paprzycki, M., Ślęzak, D. (eds.) Communication Papers of the 17th Conference on Computer Science and Intelligence Systems, ACSIS, vol. 32, pp. 199–206 (2022). https://doi.org/10.15439/2022F54
18. Ustimenko, V.: On new symbolic key exchange protocols and cryptosystems based on a hidden tame homomorphism. Dopovidi National Academy of Science, No. 10, pp. 26–36. Ukraine (2018)
19. Ustimenko, V., Klisowski, M.: On Noncommutative Cryptography and homomorphism of stable cubical multivariate transformation groups of infinite dimensional affine spaces. Cryptology ePrint Archive, 2019/593
20. Ustimenko, V.: On Extremal Algebraic Graphs and Multivariate Cryptosystems. Cryptology ePrint Archive, 2022/593
21. Myasnikov, A., Shpilrain, V., Ushakov, A.: Non-commutative Cryptography and Complexity of Group-theoretic Problems. American Mathematical Society, Providence, Rhode Island (2011)
22. Moldovyan, D.N., Moldovyan, N.A.: A new hard problem over non-commutative finite groups for cryptographic protocols. In: Kotenko, I., Skormin, V. (eds.) Computer Network Security, pp. 183–194. Springer Berlin Heidelberg, Berlin, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14706-7_14
23. Kahrobaei, D., Khan, B.: A non-commutative generalization of ElGamal key exchange using polycyclic groups. In: IEEE GLOBECOM 2006 - 2006 Global Telecommunications Conference [4150920]. https://doi.org/10.1109/GLOCOM.2006
24. Tsaban, B.: Polynomial-time solutions of computational problems in noncommutative-algebraic cryptography. J. Cryptol. 28(3), 601–622 (2015)
25. Roman'kov, V.A.: A nonlinear decomposition attack. Groups Complex. Cryptol. 8(2), 197–207 (2016)
26. Bassham, L., et al.: A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications, Special Publication (NIST SP), National Institute of Standards and Technology, Gaithersburg, MD (2010). https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=906762. Accessed 8 May 2023
27. Ustimenko, V.: Linguistic dynamical systems, graphs of large girth and cryptography. J. Math. Sci. Springer 140(3), 412–434 (2007)
Fengxiansi Cave in the Digital Narrative

Wu-Wei Chen

Faculty of Center for Global Asia, Core Faculty of Shanghai Key Laboratory of Urban Design and Urban Science, New York University Shanghai, Shanghai 200135, China
[email protected], [email protected]
Abstract. Located 12 km south of Luoyang City in Henan province, the Longmen Grottoes mark a milestone of the Chinese royal Buddhist cave temples since the Taiho era of the Emperor Hsiaowenti of the Northern Wei in China, 459AD. Among thousands of deities and caves, the Locana Buddha and the pairing deities of Fengxiansi Cave, accomplished during Empress Wu's regime in 675AD, symbolize the mixture of ideology among royalty, politics, and belief. The representative image of the site, both colossal and volumetric, also depicts the future Buddha and Messiah in the thought. Unfortunately, the constant looting of the last century caused huge loss and damage to the sites and works. Persistent efforts are being made through collaborations of academia, global museums and local institutions to bring the overseas heritage back to the home soil by means of digital doubles. Inspired by the joint efforts of the forerunners in digital restoration, this paper focuses on the challenges, former digital restoration projects, and further possibilities of virtual reinterpretations of the site through digital heritage imaging.

Keywords: Longmen grottoes · Fengxiansi Cave · Digital heritage · Photogrammetry · Cyber archiving · Ground-truth documentation · Virtual restoration · World heritage site
1 Visual Documentations of Longmen Grottoes

1.1 Early Documentation of Fengxiansi

In ancient China, text descriptions of the Longmen Grottoes exist in inscriptions such as 河洛上都龍門之陽大盧舍那像龕記 (Record of Locana Niche) in the Fengxiansi Cave and the 龍門二十品 (Twenty Inscriptions of Longmen) in the Guyang Cave (one of the twenty inscriptions is located in the Laulong Cave, no. 660). In the late 19th century, the Longmen Grottoes attracted the attention of multi-nation delegations for their value to art, education, and archaeology. Okakura Kakuzo, Sekino Tadashi, Félix Leprince-Ringuet, Édouard Chavannes and Charles Lang Freer were among the early tiers of global explorers who left the world with documentation. In 2001, the Longmen Grottoes were inscribed on the UNESCO World Heritage List. Compared with the archaeological research on the architectural structures of the Yungang Grottoes, the Longmen Grottoes carry further significance for the achievements of Chinese sculpture. In the Fengxiansi Cave, for example, holes, grooves and the curvature of the roof remain (see Fig. 1), but the wooden hall that sheltered the temple has vanished. With virtual visualization, we can reimagine the possible existence of the structure.

Charles Lang Freer's photographic glass plate negatives of 1910 and the book "Research of Longmen Grottoes" [1] published in 1941 provide various angles from which to look into the sculptures of the Fengxiansi Cave. The condition of the primary deity and attendants was dire and continuous restorations were needed. Cracking and deterioration on the left face, nostril, and upper torso can be identified in the earlier archive photos (see Fig. 2). The disciple deities standing at both sides of the Locana Buddha also carry severe damage from the face to the upper torso (see Fig. 3).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): FICC 2024, LNNS 921, pp. 99–107, 2024. https://doi.org/10.1007/978-3-031-54053-0_8
Fig. 1. Locana Buddha statue in the center of Fengxiansi Cave, Longmen Grottoes. Photograph by the author
Fig. 2. Photograph of the Locana Buddha from 龍門石窟の研究 (Research of Longmen Grottoes), 1941
Fig. 3. Photographic glass plate negative of Charles Lang Freer, 1910
1.2 Overseas Heritage Documentations of Fengxiansi

The overseas heritage from the Longmen Grottoes echoes the quintessential styles of the deity sculptures. These works have drifted to other parts of the world. The relief mural of the Binyang Central Grotto of Longmen, for example, showing Emperor Xiaowen and his entourage worshipping the Buddha, is now displayed in the Nelson-Atkins Museum of Art and the Metropolitan Museum of Art. The fragmented hands, heads, and figures of the deities are displayed in museums in China, Japan (see Fig. 4), North America, and Europe [2]. Missing parts of the deity sculptures of Fengxiansi are also found in various institutions overseas. The Buddha head sculptures in the Ono Collection of the Osaka City Museum of the Fine Arts resonate with the rounded and secularized appearance of the colossal deities at the centre of the cave.
Fig. 4. Ono collection of the Buddha head, Fengxiansi Cave of Longmen Grottoes
2 Digital Restorations of Buddhist Iconography from the Global Institutions

2.1 Iconography of Locana [3] Buddha

The iconography of Avatamsaka-School Buddhism is made clear in the sculptures of Fengxiansi Cave. Even with Empress Wu's sponsorship, the facial appearance of the Locana Buddha reflects an androgynous look rather than the superimposition of a feminine image or even Wu Zetian's image. The Kasaya robe of the Locana Buddha resonates with the Saṃghāti style of India (Gupta) and Central Asia (Gandhara). The octagonal pedestal below is carved with lotus and sitting Buddha reliefs believed to be inspired by the Brahmajāla-sūtra. The inscription [4] on the making of the deity is embedded on the front of the pedestal as well. In the earlier photo records of the 1920s, the right side of the aureole (both the halo of the head and the mandorla around the Locana Buddha) was severely cracked, and the mudras of the Buddha had also gone missing. Nowadays there is no way to tell whether the right-hand gesture of the Locana Buddha is the Abhaya, Varada or Bhumisparsha mudra, even though the relief inside the flame pattern of the halo is depicted with the Abhaya mudra [5].
2.2 Bodhisattvas, Attendants, Lokapalas, and Dvarapalas

The Attendants (Ananda and Kashyapa) stand right next to the Locana Buddha. The Ananda at the right of the Buddha has been restored, yet the Kashyapa statue is severely deteriorated and only fragments are left. The two Bodhisattvas are next to the attendants. Their overall shapes, such as the facial appearance, draperies and gestures, are intact. Pairs of Lokapalas and Dvarapalas stand at both sides. The Lokapalas with their armour outfits indicate the transformation of styles after the dissemination of Buddhist iconography into China. The masculine look of the Dvarapalas at the far ends, on the other hand, completes the notion of cosmos depicted in the Buddhāvataṃsaka Sūtra.

2.3 Collaborative Projects Among Global Institutions

In the earlier records, ink rubbings and photographs provide early documentation of the steles and inscriptions on the steep limestone cliffs. The adoption of digital technology further extends the recording and conservation of the sites with abundant imagery in terms of Buddhist iconography (Avatamsaka, Chan, Pure Land, Tantrist, Dharmalaksana and Three Stages School) [6]. The nation-wide, multi-year plans of China to systematically document and restore cultural relics further consolidate the inventory of the Longmen Grottoes. Earlier collaborative projects of China and international experts in the late 20th century, such as Mt. Xiantan Shan and Shuilu-An, provided opportunities to digitally restore sites demolished by looting, theft and vandalism. The recent project of the Longmen Grottoes Research Institute, the Center for the Art of East Asia of the University of Chicago, Xi'an Jiaotong University and the Nelson-Atkins Museum of Art on the virtual restoration of the Binyang Central Cave [7] further examines the possibilities of cross-institution collaboration. The AR project to restore the head of the Bodhisattva high relief in front of the Wanfo Cave also marks the efforts of local institutions to integrate scanned data, XR experience, and immersive storytelling onsite.
3 Digital Documentation of the Fengxiansi Cave
3.1 Ground-Truth Documentation
During the author’s field documentation, the data of Fengxiansi Cave was captured with a DSLR camera and a Ricoh spherical camera. The photogrammetry method enables the triangulation of each spot of the cave. The ground-truth model can be established by multiple methods, such as Structure-from-Motion (SfM) and Neural Radiance Field (NeRF).
3.2 Point Cloud Visualization
Once the result of the photogrammetry documentation is converted into a point cloud model, a volumetric view of the 3D cave emerges on the screen. The level of detail in the rendering of the cave depends on the density of the point cloud, which becomes the base layer for overlaying multiple levels of information. A de-noising process also helps to filter out excess points (see Figs. 5, 6, 7 and 8).
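The text does not say which de-noising filter was applied to the point cloud. As a minimal sketch, assuming a statistical outlier filter (one common choice, not necessarily the author’s), the idea can be illustrated in plain Python. The function names, the toy coordinates and the parameters k and std_ratio are all hypothetical.

```python
import math
from statistics import mean, pstdev


def mean_knn_distance(points, k):
    """Mean distance from each point to its k nearest neighbours (brute force)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(mean(dists[:k]))
    return result


def remove_statistical_outliers(points, k=3, std_ratio=1.0):
    """Keep points whose mean k-NN distance is at most mean + std_ratio * std."""
    knn = mean_knn_distance(points, k)
    threshold = mean(knn) + std_ratio * pstdev(knn)
    return [p for p, d in zip(points, knn) if d <= threshold]


# Hypothetical toy cloud: a tight cluster plus one stray point far away.
cloud = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (0.1, 0.1, 0.0), (0.0, 0.0, 0.1), (5.0, 5.0, 5.0),
]
clean = remove_statistical_outliers(cloud, k=3, std_ratio=1.0)
```

The brute-force neighbour search is quadratic in the number of points, so a dense model would need a spatial index; the sketch only shows the filtering criterion.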
Fengxiansi Cave in the Digital Narrative
Fig. 5. Dense point-model of the Fengxiansi Cave, perspective view
Fig. 6. Dense point-model of the Fengxiansi Cave, perspective view
3.3 Reinterpretations Through Visual Programming
Point cloud models of the documented deities and sites can represent and simulate visual phenomena for storytelling, or augment the physical space with a virtual experience through Extended Reality (XR). The procedural networks established by node-based visual programming keep production and collaborative projects open and sustainable. The node-based structure allows the analysis and breakdown of color, UVs, texture, and normals. This multi-faceted information can then be used to enhance, highlight and emphasize features of the model.
Fig. 7. Dense point-model of the Fengxiansi Cave, perspective view
Fig. 8. Dense point-model of the Fengxiansi Cave, perspective view
4 Extended Applications on the Machine Learning Platform
4.1 Alternative Methods for Representing the Digital Restoration
The digitized sites and objects can be preserved as point cloud models, or further converted into low-poly models for dissemination, annotation, and demonstration online or on hand-held and wearable devices. Museum exhibitions with curatorial approaches are one common application of the documented data. The data can be further interpreted through immersive experiences or interactive installations for educational purposes.
4.2 Ground-Truth Data in the Realm of Machine Learning
The massive number of photographs from the photogrammetry documentation at the ground level of the sites is converted into point cloud models. Besides the usage for visualization and
conservation, each point out of the trillions in the model can serve as an inspection point in the software. Using the point cloud for navigation further enables visualization of the triangulations from the photographs. Ground-truth data emerges by combining data from hybrid sources (laser scanning, remote sensing, etc.). In the realm of machine learning, Neural Radiance Field (NeRF) is also able to generate a volumetric 3D model from 2D images. The advantages of NeRF are speed (millisecond-level) and the smaller image overlap it requires. Yet the challenges of using NeRF lie in post-production and further iteration of the compiled models: unnecessary parts inside the established 3D model also get generated. These parts work against the need for a clean 3D model structure, and they are a concern when converting the 3D models to line drawings for cross-disciplinary research.
References
1. Charles, L.F.: Photographic glass plate negatives taken during travels in China. FSA A.01 12.05.GN. Freer/Sackler Gallery of Art Archives. https://sova.si.edu/details/FSA.A.01#ref3164. Accessed 20 Oct 2023
2. Amy, M.: Donors of Longmen: Faith, Politics, and Patronage in Medieval Chinese Buddhist Sculpture, pp. 2–5. University of Hawai’i Press, Honolulu (2007)
3. Roshana (Lushena 盧舍那) or Vairocana (Biluzhena 毗盧遮那)
4. 河洛上都龍門山之陽大盧舍那像龕記 (Record of Locana Niche)
5. Hida, R.: Translated by Yan, Juan Ying. 雲翔瑞像-初唐佛教美術研究 (Hatsukarabukkyo Bijutsu No Kenkyu, Research of Buddhist Art In the Early Tang Dynasty), pp. 203–223. National Taiwan University Press, Taipei (2018)
6. WHC Nomination for Cultural Heritage of Cultural Properties for Inscription. https://whc.unesco.org/en/nominations/. Accessed 20 Oct 2023
7. Title of the mural relief: Emperor Xiaowen and his entourage worshipping the Buddha
Lambda Authorizer Benchmarking Tool with AWS SAM and Artillery Framework
Cornelius and Shivani Jaswal(B)
National College of Ireland, Dublin, Ireland
[email protected]
Abstract. Leading cloud provider Amazon Web Services (AWS) provides a security feature called Lambda Authorizer in its serverless service, AWS Lambda. This security feature processes the security token in the request header according to custom code written by the developers. This development in security technology has led more developers to use serverless technology, which previously tended to be kept to private scope, to build serverless APIs with public access. This study determines the performance and cost of a serverless function that implements the Lambda Authorizer. Knowing the benchmarking results, developers can tune the performance parameters to realize a secure and cost-effective serverless public API. This paper proposes a benchmarking tool based on the AWS Serverless Application Model (SAM) and the Artillery framework to measure the performance of serverless functions that implement a Lambda Authorizer, using three primary performance parameters: start-up conditions, programming language runtimes, and authorization types. Using this combination of parameters, the Lambda Authorizer Benchmarking Tool shows that Python is still more performant and cost-efficient than other runtimes. It is also the best choice for achieving the lowest response time when combined with the request authorizer during warm conditions. An interesting result is that Go performs better if the function code requires much memory, since it starts faster and has better memory management than Python.
Keywords: AWS Lambda · AWS Serverless Application Model · Artillery framework · Benchmarking tool · Lambda authorizer
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): FICC 2024, LNNS 921, pp. 108–128, 2024. https://doi.org/10.1007/978-3-031-54053-0_9
1 Introduction
Amazon EC2 is one of AWS’s first Infrastructure-as-a-Service (IaaS) products. IaaS provides computer resources as a service, allowing customers to lease computer resources, easily set up server environments, and access them through the internet. IaaS offers many benefits, including a reduced need for staff because the infrastructure configuration process is automated. In this way, servers and resources are operated efficiently, development and production costs are reduced, and companies are able to compete quickly in the global market. Platform-as-a-Service (PaaS) is the next step in cloud computing. AWS Elastic Beanstalk is a well-known example. PaaS is a service abstraction layer built on top of
IaaS to ease complex server installations by including the Operating System (OS) in the outsourced infrastructure, and it enables developers to deploy software, patch software, and monitor systems. Another comparable service is Containers-as-a-Service (CaaS). CaaS is a cloud-based service architecture that hosts and orchestrates containerized workloads within a cluster. A well-known technology in the CaaS space is Docker, a container platform that promoted the notion of containerization among developers all over the world [1]. Serverless computing is the most recent breakthrough in the field of cloud computing. Serverless services divide into Backend-as-a-Service (BaaS) and Function-as-a-Service (FaaS). BaaS, such as AWS Amplify, aims to replace the back-end servers that developers generally set up and manage themselves. In contrast, FaaS is more analogous to a computing service that allows developers to run code at function-level granularity. One of the most prominent examples of FaaS is AWS Lambda, the most widely used FaaS today. The term “serverless” does not indicate the lack of a server. In essence, serverless technology frees users from the burden of server administration [2]. Figure 1 depicts the distinction between traditional server and serverless applications.
Fig. 1. Traditional server vs. FaaS software application [26]
The technology sector as a whole is embracing and developing serverless technologies. Serverless technology introduces a novel approach to creating and delivering applications, and the components of the serverless ecosystem vary in how complex they are to use. As a result, each software company has come up with its own serverless application architecture. Each architecture, however, has one thing in common: the authorization and authentication procedure, which is carried out using BaaS, Amazon Cognito, or custom logic in a Lambda Authorizer. On that premise, this paper proposes a benchmarking tool for serverless functions based on start-up conditions, programming language runtimes, and authorization logic in order to improve the performance of the access-controlled AWS serverless API.
1.1 Motivation and Research Objective
In the world of cloud computing, this benchmarking tool is expected to bring benefits in the following ways. First, researchers and developers can identify which performance parameters yield the most favorable cost-performance ratio when calling an AWS serverless API that is access-controlled by a Lambda Authorizer. Second, researchers and developers can use the Serverless Benchmarking Tool to evaluate different programming languages or access control mechanisms. The objective of this research is to determine how much influence start-up conditions, programming language runtimes, and authorization types have on the cost and performance of executing a Lambda Authorizer-enabled AWS serverless API. This research contributes to the area of cloud computing by determining the effect of these parameters on the performance of serverless APIs, so that processes can be run without problems from the initial stages. The rest of this paper is organized as follows. Section 2 covers related work in the field. Section 3 covers the research methodology and the process flow. In Sect. 4, the benchmarking tool is proposed along with its sequence diagram. Sections 5 and 6 present the performance implementation and the evaluation and result analysis, respectively. The last section presents the conclusion and future work, followed by the references.
2 Related Work
An understanding of serverless technology itself is necessary for researching and evaluating its performance. This section gives a brief overview of existing research works. A significant part of it explains what is needed to conduct this work and presents helpful information from trustworthy sources to support the author’s research. In software architecture, serverless computing is a dream come true. By using serverless technology, developers can accelerate and simplify the development process while scalability, security, and performance are handled for them. The concept of serverless is widely misunderstood by people who believe it does not require a server. It is essential to understand that serverless still uses servers; the internal infrastructure is simply concealed from the consumer. A serverless approach saves developers’ time by letting them focus on function development rather than on configuring infrastructure [3]. Functional components can be separated granularly with the help of serverless technology. As a result, they consume as few resources and incur as little cost as possible. For instance, when deploying functions as serverless services, developers do not need to pay for the creation and setup of the functions. The pricing system differs in serverless services, where developers pay based on the memory usage and the runtime duration of the functions they use. Single-purpose APIs and web services can be leveraged to easily create serverless APIs, which allow developers to construct loosely coupled, scalable, and efficient structures. Ultimately, when using serverless APIs, developers should concentrate on writing code rather than worrying about how the server infrastructure works [3]. Serverless services apply to a wide range of business applications, including enterprise
applications, mobile applications, data processing, scheduling, and embedded applications. This technology fuses the processes of containerization and virtualization into a unified architecture. Even though serverless is still considered new to developers, they can choose the best strategy for moving a system to serverless by understanding the various technologies that work around it [4]. Three of the most popular serverless cloud providers are AWS, Microsoft Azure, and Google Cloud Platform (GCP). According to the findings reported in [5], AWS is the platform with the best performance across all testing situations, whereas Microsoft Azure has the most significant variance in working behavior and GCP has unpredictable performance. To objectively evaluate the performance of reference applications on cloud platforms, the workloads generated by the benchmark experiments must be realistic and use the same workload model and data volume as the input data [6]. However, the author does not have to create a fair workload, because the focus of this work is benchmarking serverless functions that implement a Lambda Authorizer, a feature that is only available in the AWS environment. In addition, a fully secured serverless application can be created with the help of AWS Lambda, which makes sense for enterprises already committed to using AWS as their public cloud provider. To allow fast access to internal resources, integration with other AWS services is necessary, including storage, databases, and streaming. AWS Lambda automatically scales the bandwidth and processing power needed for each function based on the amount of memory specified by the developers [7]. AWS provides its own AWS SAM framework to manage serverless resources programmatically or through the Command-Line Interface (CLI). This framework can provision and perform serverless operations with multiple programming runtimes in AWS Lambda.
AWS SAM is built on top of AWS CloudFormation as an extension. AWS CloudFormation is primarily used to define the structural components of AWS services because of its outstanding performance in the AWS environment, but it has a steep learning curve. AWS SAM is therefore built with a script syntax that is more human-readable and simpler to understand. The configuration files adopt popular formats such as JSON and YAML and expose various vital properties for serverless management, ranging from global and security settings to event sources. Since the implementation of AWS SAM is open source, the community may contribute to improving its functionality, and the popularity of AWS SAM has continued to rise to compete with other serverless frameworks [8]. Artillery and JMeter are among the most versatile and well-known testing frameworks. Both can generate function performance test results offline. However, JMeter is not specifically designed to measure performance in serverless environments [9]. The Artillery framework, on the other hand, is built purposely for testing serverless functions, although the two frameworks are similar in what they aim to achieve. Libraries in this module measure the performance of serverless test data scenarios. The maximum number of concurrent tests the Artillery framework can run depends on the computing capabilities and network of its cloud environment. The Lambda functions are executed using the Artillery package, which can increase throughput for developers. Due to this, the application can handle a larger number of tests simultaneously. Additionally, Artillery allows developers to
deploy custom methods for delivering header data and query string parameters in cases where the serverless API requires them [10].
Consider what would happen if unauthorized users gained access to the AWS environment. Inadvertently publishing AWS credentials to a public GitHub project is one of the most prevalent examples. Hackers can quickly access the developer’s AWS resources by searching the public repository. Events such as these may interrupt the working process of serverless functions, since service constraints apply at the region level. All available throughput in the region can be exhausted by one team or service, which causes all other processes to slow down. Every change made in a non-production environment may affect users in production if all environments share the same AWS account. This subsection explains a few examples of how developers can restrict access to AWS Lambda serverless APIs [3]. AWS Lambda has many security features, from IAM Permissions and Amazon Cognito to the Custom Authorizer. The part below explains more about the custom authorizer because it is directly related to the main scenario of this research project. AWS has since renamed the Custom Authorizer feature to Lambda Authorizer. As of the most recent update, the Lambda Authorizer allows developers to create custom access controls for serverless APIs. The API caller’s identity is determined using bearer token schemes such as OAuth or SAML, or using request parameters. The two types of Lambda Authorizer work as follows:
A TOKEN authorizer identifies callers using bearer tokens, such as JSON Web Tokens (JWT) and OAuth tokens. Bearer tokens are not meant to have any meaning for users. A typical implementation of a token authorizer is GitHub’s authorization process.
A REQUEST authorizer analyzes headers, query string parameters, stage variables, and context variables to determine the caller’s identity. WebSocket APIs support only this request parameter-based type of authorizer.
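The difference between the two authorizer types can be sketched by looking at where each one finds the caller’s credential in its input event. The sketch below assumes the event fields that API Gateway passes to a Lambda Authorizer (type, authorizationToken, headers, queryStringParameters); the header name x-api-token and the query parameter token are hypothetical conventions chosen for illustration, not part of the paper.

```python
def extract_credential(event):
    """Return the raw credential an authorizer would validate, or None."""
    if event.get("type") == "TOKEN":
        # A TOKEN authorizer receives the bearer token in a single field.
        raw = event.get("authorizationToken") or ""
        prefix = "Bearer "
        return raw[len(prefix):] if raw.startswith(prefix) else (raw or None)
    if event.get("type") == "REQUEST":
        # A REQUEST authorizer sees headers, query string parameters, etc.
        headers = event.get("headers") or {}
        query = event.get("queryStringParameters") or {}
        # Hypothetical convention: header first, then query string.
        return headers.get("x-api-token") or query.get("token")
    return None


# Hypothetical sample events for both authorizer types.
token_event = {
    "type": "TOKEN",
    "authorizationToken": "Bearer abc123",
    "methodArn": "arn:aws:execute-api:eu-west-1:123456789012:api/prod/GET/items",
}
request_event = {
    "type": "REQUEST",
    "headers": {},
    "queryStringParameters": {"token": "abc123"},
    "methodArn": "arn:aws:execute-api:eu-west-1:123456789012:api/prod/GET/items",
}

token_credential = extract_credential(token_event)
request_credential = extract_credential(request_event)
```

Both sample events yield the same credential here, which mirrors the paper’s setup of comparing the two authorizer types on otherwise identical functions.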
Fig. 2. Lambda authorizer workflow [11]
Lambda Authorizer performs the authentication process between the client and serverless resource, as illustrated in Fig. 2. First, clients request API Gateway APIs using
bearer tokens or request parameters. API Gateway checks the method configuration to determine whether a Lambda Authorizer is configured and, if so, calls the Lambda function. The Lambda function authenticates the caller, for example by obtaining an access token from an OAuth provider or a SAML assertion from a SAML provider, or by generating an IAM policy based on the request parameter values. If the check is successful, the Lambda function returns an object containing the IAM policy and the principal identity. API Gateway then evaluates the policy. If access is denied, API Gateway returns an HTTP status code such as 403 ACCESS_DENIED. If access is allowed, API Gateway executes the method. Additionally, when caching is enabled, API Gateway caches the policy so that the Lambda Authorizer does not have to be called again for repeated requests [12].
By surveying research on serverless benchmarking tools, the author found a number of existing projects, such as the Serverless Benchmarking Suite, which uses the HyperFlow engine to support many clouds [13]; the Serverless Performance Framework, which was built from the Serverless Framework [14]; the Serverless Application Analytics Framework, which was used to measure the performance of serverless data processing pipelines [15]; and the PanOpticon benchmarking tool, which supports custom serverless functions with Python runtimes in AWS and GCP environments [16]. The list also includes the most comprehensive benchmarking tool, Serverless Benchmark Suite, which was used to assess the performance of serverless data processing pipelines and supports several serverless cloud platforms [5], as well as the latest research, titled Serverless Benchmarks, which uses fair workloads to test the performance of different serverless cloud platforms [6].
This subsection will explain the factors that influence a serverless function’s performance.
When a request triggers an initialization (a cold start), the latency of that request is called cold latency, whereas the latency of the serverless function execution itself is called warm latency. There are only two situations in which the cold start penalty occurs. The first occurs when AWS Lambda must create new environments to handle an influx of incoming requests. The second occurs when a serverless function is called after not having been called for a long time. Generally, cold starts make up less than 0.5% of incoming requests for serverless functions, although traffic spikes of infrequently invoked functions are not affected as much by cold starts. Moreover, AWS Lambda’s timeout setting applies to the entire request delay, so requests with cold starts may also experience timeouts [3]. Given the cost transparency provided by serverless pricing models, understanding the cost impact of a function’s language is critical. For example, the C# programming language has a modest advantage in minimizing the number of cold starts because of the maturity of support for distributed tracing in Microsoft Azure Functions. As can be seen in Fig. 3, calling a serverless function is not straightforward. The call must go through several layers, such as the internal compute substrate, the execution environment, and the language runtime. The execution environment and the function code are instantiated on demand. When the first request arrives, AWS Lambda runs the code within the function handler after the environment is formed. As soon as the handler logic is complete, AWS Lambda considers the function complete. AWS Lambda does not destroy the execution environment; instead, it saves it. When a subsequent request happens, and
a cached execution environment is available, AWS Lambda handles the request using that execution environment. AWS Lambda will construct the new execution environment if a cached execution environment is unavailable.
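The reuse of a cached execution environment can be observed from inside a function, because module-level code runs only when a new environment is created. The following is a minimal sketch, not the paper’s benchmark code; the handler name and the returned field names are hypothetical, and the two direct calls at the end merely simulate two requests served by the same environment.

```python
import time

# Module-level code runs once per execution environment, i.e. on a cold start.
_ENVIRONMENT_STARTED_AT = time.time()
_invocation_count = 0


def handler(event, context):
    """Report whether this invocation was served by a fresh environment."""
    global _invocation_count
    _invocation_count += 1
    return {
        # Only the first invocation in an environment pays the cold start.
        "cold_start": _invocation_count == 1,
        "invocations_in_this_environment": _invocation_count,
        "environment_age_seconds": time.time() - _ENVIRONMENT_STARTED_AT,
    }


# Simulate two consecutive requests handled by the same cached environment.
first = handler({}, None)
second = handler({}, None)
```

In a deployed function, the first response after a deployment or a long idle period would report a cold start, and later responses served by the saved environment would not.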
Fig. 3. Serverless function layers execution [3]
Many limitations must be overcome for serverless technology to be effective, including one arising from the opaque side of the cloud platform. When developing non-serverless applications, the developer has total control over the software and hardware stack. In serverless application development, however, almost all hardware and software aspects become opaque and invisible from the developer’s point of view. Because of this, serverless environments are less flexible than traditional ones. Aspects that are missing in a serverless environment include full control over power and the underlying hardware, and there is only minimal control over complex security features. These limitations can sometimes cause benchmark results to be inconclusive and to differ from time to time, due to many effects that are unexpected and hidden from the developer’s perspective [17, 18].
Figure 4 describes a use case where the Lambda authentication process uses a third-party identity provider as the source of validation checks. In the first step, the user successfully logs in with the entered credentials and gets an access token from the third-party identity provider. In the second step, the user makes an HTTP call with the GET request method to the API Gateway, with the access token added to the request header. In the third and fourth steps, the API Gateway forwards the request to the Lambda Authorizer. The function in the Lambda Authorizer is custom code deployed by the developer to validate the provided access token. If the access token is valid, the process continues to the fifth step, where the Lambda Authorizer generates the IAM policy that grants access permissions to the requested resource. In the sixth and final step, the API Gateway analyzes the provided IAM policy and grants or denies access to the intended resources [27].
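The authorizer side of this flow (validating the token and returning an IAM policy with a principal identity) can be sketched as follows. This is a simplified illustration rather than the authors’ implementation: validate_with_identity_provider is a hypothetical stand-in for a real call to the third-party identity provider, and the token and user values are invented. The returned object follows the principalId and policyDocument structure that API Gateway expects from a Lambda Authorizer.

```python
def build_policy(principal_id, effect, method_arn):
    """Build the authorizer response that API Gateway evaluates."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,  # "Allow" or "Deny"
                    "Resource": method_arn,
                }
            ],
        },
    }


def validate_with_identity_provider(token):
    """Hypothetical stand-in for the identity provider check."""
    known_tokens = {"valid-token": "user-123"}  # invented sample data
    return known_tokens.get(token)


def handler(event, context):
    """TOKEN-style authorizer: validate the token, then emit a policy."""
    raw = event.get("authorizationToken", "")
    token = raw[len("Bearer "):] if raw.startswith("Bearer ") else raw
    principal = validate_with_identity_provider(token)
    if principal is None:
        return build_policy("anonymous", "Deny", event["methodArn"])
    return build_policy(principal, "Allow", event["methodArn"])


arn = "arn:aws:execute-api:eu-west-1:123456789012:api/prod/GET/items"
allowed = handler({"authorizationToken": "Bearer valid-token", "methodArn": arn}, None)
denied = handler({"authorizationToken": "Bearer wrong", "methodArn": arn}, None)
```

A production authorizer would verify the token signature and expiry against the provider instead of using a lookup table.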
In Table 1, related work has been compared based on their framework, scenario, advantages and limitations.
Fig. 4. Lambda authentication with a third-party identity provider use case [27]
Table 1. Comparison of related work
Reference | Framework | Scenario | Advantages | Limitation
This Research | Lambda Authorizer Benchmarking Tool (LABT) constructed using AWS SAM and Artillery Framework | A serverless function that implements Custom Lambda Authorizer | Analyze the performance of a serverless function with access control | –
[6] | Serverless Benchmarker (SB) | Thumbnail generator, model training, and video processing | Fair benchmark comparison to broad cloud serverless provider | Mitigating construct validity depends on cloud platform features
[5] | Serverless Benchmark Suite (SeBS) | General serverless application | Support multiple Cloud platforms | Support only specific serverless scenarios
[15] | Serverless Application Analytics Framework (SAAF) | Transform-Load-Query serverless application | Support Processing Pipeline Scenario Data | Data Storage is limited to S3
[16] | PanOpticon (PO) uses Serverless Framework and JMeter | Custom simple function | Easy configuration using a dedicated configuration file | Python runtimes are the only ones supported
[14] | Serverless Performance Framework (SPF) made using Serverless Framework | Empty function | As benchmarking runs without third parties involved, they produce accurate results | The empty function scenario is plain compared to the actual serverless application
[25] | Microbenchmark (MB) built on top Serverless Framework | Basic function | Wide range of cloud provider support | Limited input and output performance parameters
[13] | Serverless Benchmarking Suite (SBS) produced using Serverless and HyperFlow Engine | Custom simple function | Multi-cloud provider support | Output parameters are only CPU and RAM usage
3 Research Methodology
Appropriate methodologies are required when benchmarking serverless functions. With the correct approach, the configuration process is easier to manage and the outputs are more accurate. This section presents a general overview of the benchmarking process for an access-controlled serverless function, from the input until it generates the expected outcomes.
3.1 Process Overflow
Figure 5 depicts the flow of the benchmarking tool process in calculating performance for multiple secured serverless functions. Users must input a specific command to
trigger the benchmarking process. In short, the benchmarking process undergoes several stages and validation to produce the desired output. Below is a more detailed description of each flowchart component:
Fig. 5. Process flowchart for general serverless benchmarking tool
In the initial stage, the user starts the serverless application builder from the benchmarking tool command line, which generates input from the fetched command properties. In the second step, the system receives the user’s input and processes the existing application builder properties. The system checks whether the application builder has a value. If it does, the application is built and deployed based on the serverless builder properties; if not, the system moves to the end step. In the former case, the system keeps checking whether elements remain in the serverless builder properties and continues to build and deploy the application. When no elements remain, a Lambda Authorizer function is built and deployed based on the properties and linked with the desired function by the serverless application builder framework. Before moving on to the performance testing stage, a final check is performed for any application building properties that are still unexecuted; if one is found, it is run. Subsequently, the system performs serverless performance testing based on configuration files that the user has already set up. This process then determines which scenarios
users wish to run and how many testing iterations are needed based on input in the beginning. As a result of benchmarking, the serverless monitoring dashboard displays each tested function’s performance metrics. Also, users can retrieve benchmarking results in JSON format in the local output folder. Additionally, the users can repeat the testing process by providing different command parameters and settings for the serverless application builder to analyze different scenarios.
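The build, deploy and test loop described above can be sketched as a plan of command lines, one set per combination of runtime and authorizer type. This is a simplified illustration under stated assumptions: the runtimes and authorizer types are taken from the metrics template of this paper, while the directory layout, stack names and scenario file name are hypothetical. The sketch only builds the commands for the sam and artillery command-line tools and does not execute them.

```python
from itertools import product

RUNTIMES = ["python3.9", "nodejs16.x", "go1.x", "java11"]
AUTHORIZER_TYPES = ["request", "token"]


def plan_benchmark(runtimes, authorizer_types, iterations):
    """Return the ordered list of commands for every scenario."""
    steps = []
    for runtime, auth in product(runtimes, authorizer_types):
        # Stack names may not contain dots, so they are replaced by hyphens.
        stack = f"labt-{runtime.replace('.', '-')}-{auth}"
        template = f"templates/{runtime}/{auth}/template.yaml"  # hypothetical layout
        steps.append(["sam", "build", "--template-file", template])
        steps.append(["sam", "deploy", "--stack-name", stack])
        for i in range(iterations):
            report = f"output/{stack}-{i}.json"  # hypothetical output folder
            steps.append(["artillery", "run", "--output", report, "scenario.yml"])
    return steps


steps = plan_benchmark(RUNTIMES, AUTHORIZER_TYPES, iterations=2)
```

Each returned command could be passed to subprocess.run once the templates and the scenario file exist; keeping planning separate from execution makes the scenario matrix easy to inspect before any resources are deployed.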
4 Proposed Work
Cutting-edge technologies and techniques must be employed to develop a reliable and robust benchmarking tool for serverless computing. Accordingly, this section briefly describes the high-level process of the Lambda Authorizer Benchmarking Tool using a sequence diagram and explains the performance metrics that the serverless benchmarking tool can gather. Sequence diagrams demonstrate the interaction between objects in sequential order. They are intended primarily for developers, although in some companies they are also used to facilitate communication between technical and business department employees. Figure 6 illustrates the sequence diagram of the Lambda Authorizer benchmarking process and explains how the application’s internal system works.
Fig. 6. Lambda authorizer benchmarking tool sequence diagram
The user instructs the Lambda Authorizer Benchmarking Tool to run an automated serverless benchmarking test. AWS SAM reads the authorizer and application function code within the source folder and continues building the code using the configured
runtime. If there is no compilation error, the system deploys it. The Artillery framework then generates multiple virtual users that call the request URL simultaneously for a particular duration. After the requests have passed through AWS Lambda, the Artillery framework converts the performance outputs into human-readable content in JSON format, which is stored in the local folder. At the same time, the AWS CloudWatch service monitors serverless function activity and writes logs. By default, users can view a graphical representation of the performance activity on the AWS CloudWatch dashboard. In addition, this service lets users analyze the written logs interactively with AWS CloudWatch Logs Insights [19]. There are three main components in the Lambda Authorizer Benchmarking Tool: AWS SAM and the Artillery framework, plus AWS CloudWatch as the monitoring service. AWS SAM is selected as the serverless application builder framework rather than popular alternatives such as the Serverless Framework because the benchmark scenario in this research work is to measure the performance of serverless functions that implement a Lambda Authorizer. Since AWS SAM is a native framework, it performs much better than rival frameworks. Meanwhile, the Artillery framework is used as the serverless performance testing framework because it is purposely engineered to measure the performance of scalable cloud functions [20]. In contrast, popular performance testing frameworks like JMeter target functions in general. AWS CloudWatch is chosen as the serverless monitoring system since it natively supports logging and monitoring within AWS, the cloud provider being tested. Finally, JSON is selected as the output format because it is highly flexible and can be processed regardless of programming language.
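Metrics such as initialization duration, execution duration and memory use can be recovered from the logs that AWS CloudWatch collects. The sketch below assumes the standard REPORT line that AWS Lambda writes at the end of each invocation; the sample log lines are invented, and the function and key names are hypothetical rather than taken from the tool itself.

```python
import re

# Fields of the REPORT line; Init Duration appears only on cold starts.
REPORT_PATTERN = re.compile(
    r"Duration: (?P<duration>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_size>\d+) MB\s+"
    r"Max Memory Used: (?P<max_memory>\d+) MB"
    r"(?:\s+Init Duration: (?P<init>[\d.]+) ms)?"
)


def summarize(log_lines):
    """Aggregate the maxima reported across all REPORT lines."""
    rows = [m.groupdict() for m in map(REPORT_PATTERN.search, log_lines) if m]
    inits = [float(r["init"]) for r in rows if r["init"] is not None]
    return {
        "max_init_duration_ms": max(inits, default=None),
        "max_duration_ms": max((float(r["duration"]) for r in rows), default=None),
        "max_memory_used_mb": max((int(r["max_memory"]) for r in rows), default=None),
    }


# Invented sample log lines: one cold invocation, one warm, one unrelated.
logs = [
    "REPORT RequestId: a1\tDuration: 102.25 ms\tBilled Duration: 103 ms\t"
    "Memory Size: 1024 MB\tMax Memory Used: 58 MB\tInit Duration: 152.31 ms",
    "REPORT RequestId: a2\tDuration: 3.10 ms\tBilled Duration: 4 ms\t"
    "Memory Size: 1024 MB\tMax Memory Used: 59 MB",
    "START RequestId: a3 Version: $LATEST",
]
summary = summarize(logs)
```

The same fields can also be queried directly in CloudWatch Logs Insights; parsing them locally is shown here only because it keeps the example self-contained.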
5 Performance Implementation

The results of the benchmarking scenarios are presented in the structure of Table 2. For four programming runtimes and two authorizer types, the recorded performance outputs are the maximum duration for cold and warm starts, maximum memory use, response time, and performance cost. Using this table, the research determines the overall winner as the runtime that wins the most categories.

With AWS Lambda, the service price is calculated based on the compute time consumed rather than the amount of data processed [21]. The service counts all types of requests and their duration, whether they come from event managers, Amazon API Gateway, or the AWS console directly [22]. Duration is measured from the time the function code starts executing until it ends, rounded up to the nearest 1 ms. Serverless functions are also priced based on the memory allocated to them. Memory allocation influences function duration as well, which in turn affects cost. Memory can be allocated to functions between 128 MB and 10,240 MB in 1 MB increments. By default, each function has a timeout of three seconds, 128 MB of memory and 512 MB of ephemeral storage. Table 3 lists the prices used in the cost calculation; the duration price is per GB-second, that is, for one second of a function using 1024 MB of memory.

The implementation of the proposed design is presented in this section. Besides explaining the high-level architecture of the application and the types of scenarios that are
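Cornelius and S. Jaswal

Table 2. Serverless benchmarking tool performance metrics template

Runtime    | Authorizer Type | Max Init Duration | Max Duration | Max Memory Used | Avg Response Time | Max Cost
-----------|-----------------|-------------------|--------------|-----------------|-------------------|---------
Python 3.9 | Request         | ?                 | ?            | ?               | ?                 | ?
Python 3.9 | Token           | ?                 | ?            | ?               | ?                 | ?
Node 16.x  | Request         | ?                 | ?            | ?               | ?                 | ?
Node 16.x  | Token           | ?                 | ?            | ?               | ?                 | ?
Go 1.x     | Request         | ?                 | ?            | ?               | ?                 | ?
Go 1.x     | Token           | ?                 | ?            | ?               | ?                 | ?
Java 11    | Request         | ?                 | ?            | ?               | ?                 | ?
Java 11    | Token           | ?                 | ?            | ?               | ?                 | ?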
Table 3. AWS Lambda pricing table [23]

Architecture | Duration                     | Requests
-------------|------------------------------|----------------------
x86          | $0.0000166667 per GB-second  | $0.20 per 1M requests
Arm          | $0.0000133334 per GB-second  | $0.20 per 1M requests
run, how the performance cost calculations are made is also described. Figure 7 displays the high-level architecture of the Lambda Authorizer Benchmarking Tool. The application is developed based on the sequence diagram in Fig. 6. AWS SAM is used to remove the existing function stack and then to build and deploy the function code according to the template scripts. The application uses the Artillery framework to run the performance test scenarios, passing query string parameters for the request authorizer type and authorization headers for the token authorizer type. Using the basic configuration provided by the tool, the user can adjust the duration and rate of the function calls. Performance testing results are split into two parts: those obtained from the Artillery framework and those queried directly from AWS CloudWatch Logs Insights. The Artillery results include important outputs such as request rate and response time. They are produced in JSON format, which can be converted into HTML for a more human-friendly view. The AWS CloudWatch Logs Insights output is available only in JSON format but is more complete, covering cold and warm duration as well as memory usage.
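The cold duration, warm duration and memory usage described above all appear in the REPORT line that AWS Lambda writes to CloudWatch Logs at the end of each invocation. The Python sketch below shows one way such lines could be reduced to the three maxima used in this paper. It is an illustration and not the tool's actual implementation: the helper names `parse_report` and `summarize`, the sample lines and the request IDs are hypothetical, and the sketch assumes that an invocation is a cold start exactly when its REPORT line carries an `Init Duration` field.

```python
import re

# Matches the numeric fields of a Lambda REPORT log line.
FIELD = re.compile(
    r"(Init Duration|Billed Duration|Duration|Memory Size|Max Memory Used): "
    r"([\d.]+) (?:ms|MB)"
)


def parse_report(line):
    """Return the numeric fields of one REPORT line as a dict of floats."""
    return {name: float(value) for name, value in FIELD.findall(line)}


def summarize(lines):
    """Reduce REPORT lines to the maxima recorded in the metrics table."""
    reports = [parse_report(line) for line in lines if line.startswith("REPORT")]
    # Assumption: only cold starts report an Init Duration.
    cold = [r for r in reports if "Init Duration" in r]
    warm = [r for r in reports if "Init Duration" not in r]
    return {
        "invocations": len(reports),
        "cold_starts": len(cold),
        "max_init_ms": max((r["Init Duration"] for r in cold), default=None),
        "max_warm_duration_ms": max((r["Duration"] for r in warm), default=None),
        "max_memory_mb": max((r["Max Memory Used"] for r in reports), default=None),
    }


# Hypothetical sample lines; the request IDs are made up for illustration.
SAMPLE = [
    "REPORT RequestId: 00000000-0000-0000-0000-000000000001\tDuration: 12.40 ms"
    "\tBilled Duration: 13 ms\tMemory Size: 128 MB\tMax Memory Used: 36 MB"
    "\tInit Duration: 131.03 ms",
    "REPORT RequestId: 00000000-0000-0000-0000-000000000002\tDuration: 1.85 ms"
    "\tBilled Duration: 2 ms\tMemory Size: 128 MB\tMax Memory Used: 36 MB",
    "REPORT RequestId: 00000000-0000-0000-0000-000000000003\tDuration: 1.42 ms"
    "\tBilled Duration: 2 ms\tMemory Size: 128 MB\tMax Memory Used: 35 MB",
]

summary = summarize(SAMPLE)
print(summary)
```

Keeping the parsing separate from the summary makes it straightforward to add further statistics, such as averages, without touching the regular expression.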
6 Performance Evaluation and Result Analysis

A total of eight benchmarking scenarios are run, combining four programming language runtimes with two authorization types. The programming language runtimes used are Python 3.9, NodeJS 16.x, Go 1.x and Java 11. According to AWS Lambda’s
Fig. 7. High-level architecture of lambda authorizer benchmarking tool
documentation, Python and NodeJS have superior overall performance, Go starts quickly, and Java starts slowly but runs fast after initialization [24]. Based on this documentation, these four runtimes were selected as the test subjects of this research. The default specification for all tested serverless functions is x86 architecture, 128 MB memory, 512 MB ephemeral storage, and no cache. Each function is tested for ten seconds, with ten virtual users each invoking the function once per second, for a total of 100 invocations.

6.1 Logs Insights Custom Queries

The Logs Insights feature of the Lambda Authorizer Benchmarking Tool generates seven pre-defined query results: an overview query, plus one query for each combination of the two authorizer types and the three main parameters (maximum initialization time, maximum duration, and maximum memory used). All these results are displayed in JSON format. Users who want other custom query output can run the query manually, as shown in Fig. 8.

6.2 Performance Cost Calculation

The function cost is calculated manually with the help of the AWS Calculator tool. This tool estimates the monthly price based on the service used and the resource parameters entered. For example, consider a function that receives 1 million calls per month and for which three of the 100 benchmark invocations were cold starts. The maximum cold start duration is 131.03 ms and the maximum warm start duration is 1.85 ms. Weighting these two values by their share of invocations, (3 × 131.03 ms + 97 × 1.85 ms) / 100, gives an average maximum duration of 5.7254 ms. The result must be rounded up to six milliseconds because AWS Lambda does not record billed duration in decimal
Fig. 8. Logs insight manual query for overview output
form. Next, the result is entered into the AWS Calculator, which displays the following:

• Function duration 6 ms and 1,000,000 requests/month
  Memory allocated: 128 MB × 0.0009765625 GB per MB = 0.125 GB
  Ephemeral storage allocated: 512 MB × 0.0009765625 GB per MB = 0.5 GB
• Pricing calculations
  1,000,000 requests × 6 ms × 0.001 s per ms = 6,000.00 total compute (seconds)
  0.125 GB × 6,000.00 s = 750.00 total compute (GB-s)
  Tiered price for 750.00 GB-s: 750.00 GB-s × $0.0000166667 = $0.0125 (monthly compute charges)
  1,000,000 requests × $0.0000002 = $0.20 (monthly request charges)
  Billable ephemeral storage = 0.50 GB − 0.5 GB = 0.00 GB (no charge)
  Lambda costs without Free Tier (monthly): $0.0125 + $0.20 = $0.2125, rounded to $0.21

Evaluation. This section presents five summary bar charts, one for each category of main output parameters in the section Performance Metrics. Subsequently, the advantages and disadvantages of each programming language runtime are discussed based on the benchmarking results.

Maximum Initialization Duration Query Result. The first bar chart, in Fig. 9, shows the maximum initialization duration of a function for each runtime. This duration is known as cold latency. Cold latency usually occurs when a new function is created or when a function that has not been used for a long time is activated again. Typically, with the default settings and caching enabled, cold starts occur in fewer than 0.25% of the total requests.

Maximum Duration Query Result. The maximum duration bar chart illustrated in Fig. 10 shows the speed in processing the Lambda Authorizer function with the Request and
Fig. 9. Authorizer maximum initialization duration bar chart
Token types. This maximum duration is also called the warm duration, meaning that the function is already active and waiting for incoming requests. The discrepancy between Java and the other runtimes in the maximum time required is recorded and analyzed. This lengthy processing duration was observed in only 1 to 5% of total requests.
Fig. 10. Authorizer maximum duration bar chart
Maximum Memory Used Query Result. The memory efficiency of each runtime is shown in the bar chart in Fig. 11. Even when the Lambda Authorizer is implemented with the lowest memory specification (128 MB), a lot of memory is left over. A function that implements the access control feature therefore has no reason to exceed its memory limit. Interestingly, the gap between the Go and Java runtimes is quite large: Java uses more than twice as much memory as Go.

Response Time Calculation Result. There is a difference between response time and duration. Duration is the time required to initiate and execute the Lambda Authorizer function. Response time, in contrast, is the overall time from when the user sends the request, through the Lambda Authorizer function and the actual function, until the response is returned to the user. The Artillery
Fig. 11. Authorizer maximum memory used bar chart
framework generates the data displayed in Fig. 12. The bar chart shows the average response time over 100 invocations. Interestingly, the Java runtime performs relatively well here: after averaging, its result in the token authorizer scenario is the same as that of the NodeJS runtime.
Fig. 12. Authorizer average response time bar chart
Performance Cost Calculation Result. Finding the best performance cost without reducing the security of a function is one of the objectives of this research work. The performance cost of a scenario is generally affected by the speed at which functions are executed and by the amount of memory they consume. The values in Fig. 13 are calculated manually from the maximum init and function durations, weighted by the proportion of cold and warm conditions. The results of these calculations are then entered into the AWS Lambda estimate in the AWS Calculator. The chart shows that Python is the runtime with the best performance cost. However, according to Fig. 11, Go is more efficient in terms of memory usage. In memory-intensive scenarios, Go is therefore likely to have the upper hand in performance cost.
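The manual calculation can be reproduced in a few lines of Python. The sketch below follows the worked example of Sect. 6.2 (three cold starts in 100 invocations, 128 MB of memory, x86 prices from Table 3); the function names and the assumption that the weighted average is rounded up to whole milliseconds before pricing are taken from that example rather than from any AWS API, and the free tier and ephemeral storage are ignored.

```python
import math

# Prices from Table 3 (x86 architecture).
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000


def average_duration_ms(cold_ms, warm_ms, cold_count, total_count):
    """Weight cold and warm durations by their share of invocations."""
    warm_count = total_count - cold_count
    return (cold_count * cold_ms + warm_count * warm_ms) / total_count


def monthly_cost(duration_ms, requests, memory_mb):
    """Estimate the monthly Lambda cost without the free tier."""
    billed_ms = math.ceil(duration_ms)          # round up to whole milliseconds
    compute_seconds = requests * billed_ms / 1000
    gb_seconds = compute_seconds * memory_mb / 1024
    compute_charge = gb_seconds * PRICE_PER_GB_SECOND
    request_charge = requests * PRICE_PER_REQUEST
    return compute_charge + request_charge


# Python 3.9 with a request authorizer, as in Sect. 6.2.
avg_ms = average_duration_ms(cold_ms=131.03, warm_ms=1.85,
                             cold_count=3, total_count=100)
cost = monthly_cost(avg_ms, requests=1_000_000, memory_mb=128)
print(f"average duration: {avg_ms:.4f} ms, monthly cost: ${cost:.2f}")
```

Because the request charge of $0.20 dominates at this memory size, differences in duration only become visible in the total once functions run for tens or hundreds of milliseconds.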
Fig. 13. Authorizer performance cost bar chart
Discussion. Table 4 is a copy of the performance matrix of Table 2, filled with the benchmarking results from the previous subsections. The best result in each category is highlighted in blue and the worst result in red. The programming language runtimes are ranked by the highlights they receive: each blue highlight scores plus 1, each red highlight scores minus 1, and cells without highlights score 0 points. After calculation, the order of the runtimes from best to worst is Python 3.9, Go 1.x, NodeJS 16.x, and Java 11. Moreover, the rows with the most blue highlights represent the best scenarios. There are two such scenarios. The first combines the Python runtime with a request authorizer in warm conditions. The other combines the Go runtime with a token authorizer in cold conditions. However, cold starts account for only a tiny part of the total requests, and Go's warm duration in the second scenario is about ten times slower than Python's in the first scenario. The first scenario is therefore recommended as the best one.

Table 4. Serverless benchmarking tool performance metrics result

Runtime    | Authorizer Type | Max Init Duration | Max Duration | Max Memory Used | Avg Response Time | Max Cost
-----------|-----------------|-------------------|--------------|-----------------|-------------------|---------
Python 3.9 | Request         | 131.03 ms         | 1.85 ms      | 34.33 MB        | 62.2 ms           | $0.21
Python 3.9 | Token           | 116.82 ms         | 1.82 ms      | 34.33 MB        | 73 ms             | $0.22
Node 16.x  | Request         | 223.68 ms         | 14.14 ms     | 55.31 MB        | 64.7 ms           | $0.24
Node 16.x  | Token           | 165.6 ms          | 17.31 ms     | 54.36 MB        | 74.4 ms           | $0.25
Go 1.x     | Request         | 255.78 ms         | 8.27 ms      | 28.61 MB        | 68.7 ms           | $0.24
Go 1.x     | Token           | 85.51 ms          | 19.99 ms     | 27.66 MB        | 64.7 ms           | $0.24
Java 11    | Request         | 739.91 ms         | 358.61 ms    | 83.92 MB        | 83.9 ms           | $1.61
Java 11    | Token           | 584.48 ms         | 577.48 ms    | 78.20 MB        | 74.4 ms           | $1.41
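The scoring rule described in the Discussion can be replayed on the Table 4 values. The Python sketch below is a reconstruction under stated assumptions, not the authors' procedure: it assumes that in every column the single lowest value is the blue (best) cell and the single highest value is the red (worst) cell, and the names `ROWS` and `score_runtimes` are hypothetical.

```python
# Table 4 rows: runtime, authorizer type, max init (ms), max duration (ms),
# max memory (MB), average response time (ms), max cost (USD).
ROWS = [
    ("Python 3.9", "Request", 131.03, 1.85, 34.33, 62.2, 0.21),
    ("Python 3.9", "Token", 116.82, 1.82, 34.33, 73.0, 0.22),
    ("Node 16.x", "Request", 223.68, 14.14, 55.31, 64.7, 0.24),
    ("Node 16.x", "Token", 165.6, 17.31, 54.36, 74.4, 0.25),
    ("Go 1.x", "Request", 255.78, 8.27, 28.61, 68.7, 0.24),
    ("Go 1.x", "Token", 85.51, 19.99, 27.66, 64.7, 0.24),
    ("Java 11", "Request", 739.91, 358.61, 83.92, 83.9, 1.61),
    ("Java 11", "Token", 584.48, 577.48, 78.20, 74.4, 1.41),
]
METRIC_COLUMNS = range(2, 7)  # every metric is "lower is better"


def score_runtimes(rows):
    """Give +1 for each best cell and -1 for each worst cell per runtime."""
    scores = {row[0]: 0 for row in rows}
    for col in METRIC_COLUMNS:
        values = [row[col] for row in rows]
        best, worst = min(values), max(values)
        for row in rows:
            if row[col] == best:
                scores[row[0]] += 1   # assumed blue highlight
            if row[col] == worst:
                scores[row[0]] -= 1   # assumed red highlight
    return scores


scores = score_runtimes(ROWS)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Under these assumptions the resulting order agrees with the ranking stated in the Discussion, and all negative points fall on the Java rows.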
Python is the overall winner of this research’s benchmarking tests. This runtime ranks first for function duration, response time, and performance cost. Based on the authors’ observation, the Python runtime is swift because, besides its fast start-up time, cold starts occurred only three times in 100 invocations. What is more impressive is that its function duration in warm conditions is at least four times faster than that of the other runtimes. Because everything is fast, performance costs are reduced, so this runtime combined with a request authorizer is the best choice for developers who aim to create serverless applications with a limited budget but strong performance.

Go ranks second in this benchmark. This runtime stands out because, according to the benchmark output, it beats Python in cold start and memory efficiency by a substantial margin. Its cold start in the token authorizer scenario is approximately 30 ms, or 27%, faster than Python's. Its memory management is also notable: it requires about 20% less memory than its closest competitor, again Python. For this reason, the authors suggest that developers who want to build responsive or real-time serverless applications with the best memory usage use Go.

NodeJS is firmly in third place. Its results for the individual metrics are entirely satisfactory. However, its average response time in the token authorizer scenario is the same as that of the Java runtime, 74.4 ms, which is unexpected. Because the scope of this research is limited, this anomaly needs to be investigated in further research. Developers can still use this runtime if they are familiar with it. 
According to the authors, this runtime is the most developer-friendly: it is easy to develop with, has few constraints, and has a large community that provides open-source libraries.

Java is at the bottom of this benchmark. All red highlights fall on this runtime, especially in the request authorizer scenario. It has the longest start-up duration, the longest function execution, the highest memory usage, a high response time, and the highest cost per request. However, a closer look at the response time shows that the result in the token authorizer scenario is only 13% slower than that of the first-ranked runtime, Go. This means that the Java runtime rarely hits its maximum function duration and mostly runs with a much lower duration. Java is used by millions of developers and has proven reliable and secure in various production systems. For this reason, the authors understand that some developers want to use Java for serverless applications. This paper recommends enabling caching to reduce the poor performance and high cost of a serverless API built with the Java runtime.
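The relative figures quoted in this discussion can be checked directly against Table 4. The short Python sketch below does so; the helper name `percent_lower` is hypothetical, and each difference is expressed as a percentage of the larger (reference) value, which is the convention that appears to reproduce the quoted numbers.

```python
def percent_lower(value, reference):
    """How much lower `value` is than `reference`, in percent of `reference`."""
    return (reference - value) / reference * 100


# Token authorizer, max init duration: Go (85.51 ms) vs Python (116.82 ms).
go_cold_start_gain = percent_lower(85.51, 116.82)

# Token authorizer, max memory used: Go (27.66 MB) vs Python (34.33 MB).
go_memory_gain = percent_lower(27.66, 34.33)

# Request authorizer, max warm duration: Go (8.27 ms) vs Python (1.85 ms).
python_warm_speedup = 8.27 / 1.85

# Token authorizer, average response time: Go (64.7 ms) vs Java (74.4 ms).
go_response_gain_over_java = percent_lower(64.7, 74.4)

print(f"Go cold start vs Python: {go_cold_start_gain:.1f}% lower")
print(f"Go memory vs Python: {go_memory_gain:.1f}% lower")
print(f"Python warm duration vs Go: {python_warm_speedup:.2f}x faster")
print(f"Go response time vs Java: {go_response_gain_over_java:.1f}% lower")
```

The results come out at roughly 27%, 19%, 4.5 times and 13%. Note that the 13% figure is relative to Java's response time; relative to Go's response time, Java is about 15% slower.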
7 Conclusion and Future Work

In conclusion, as seen from the evaluation results, the Python runtime is the best option for the lowest response time when combined with the request authorizer type in warm conditions. The Python runtime is both powerful and economical, especially when running functions in warm conditions. Compared to its closest competitor, Go, this runtime is four times faster. However, this achievement is not flawless. Go, which ranks second, is very close to Python. It also has advantages over Python,
such as a 27% quicker cold start-up time and 20% more efficient memory usage. Go may have the upper hand if the scenario has higher memory usage. In third place is the NodeJS runtime, which performs well and is stable in individual component testing. However, its average response time in the token authorizer scenario is the same as Java's. In last position is the Java runtime, which has the least satisfactory results in almost every benchmarking scenario. Despite this, Java’s average response time with the token authorizer is on par with NodeJS and only 13% slower than that of the first-placed Go.

With the creation of the Lambda Authorizer Benchmarking Tool, understanding the interplay between start-up conditions, programming language runtimes, and authorization types in a Lambda Authorizer-enabled serverless function or API becomes more accessible and accurate. However, the application can still be improved in many respects. One improvement is the addition of command options so that users can easily set up a new language runtime or other access control features in the AWS SAM builder properties. More complex benchmarking scenarios can also be added, such as scenarios accessed via databases, message queues, and event managers, along with a greater variety of output formats and parameters. The application could be commercialized with more diverse features, such as CI/CD integration, multiple account access, a friendly user interface, web accessibility, and a proper support system. It is also possible to develop this benchmarking tool for other cloud platforms. The only requirement is to choose a framework that supports both automated serverless application building and multiple cloud platforms, such as the Serverless Framework, Spring Cloud Function, or Terraform. Alternatively, the tool can be developed natively for each cloud provider.
References

1. Mohammed, C.M., Zeebaree, S.R., et al.: Sufficient comparison among cloud computing services: IAAS, PAAS, and SAAS: A review. Int. J. Sci. Bus. 5(2), 17–30 (2021)
2. Shafiei, H., Khonsari, A., Mousavi, P.: Serverless computing: a survey of opportunities, challenges, and applications. ACM Comput. Surv. 54(11s), 1–32 (2022)
3. Sbarski, P., Kroonenburg, S.: Serverless Architectures on AWS: With Examples Using AWS Lambda. Simon and Schuster (2017)
4. Rajan, R.A.P.: Serverless architecture-a revolution in cloud computing. In: 2018 Tenth International Conference on Advanced Computing (ICoAC), pp. 88–93. IEEE (2018)
5. Copik, M., Kwasniewski, G., Besta, M., Podstawski, M., Hoefler, T.: Sebs: A serverless benchmark suite for function-as-a-service computing. In: Proceedings of the 22nd International Middleware Conference, pp. 64–78 (2021)
6. Deng, R.: Benchmarking of serverless application performance across cloud providers: An in-depth understanding of reasons for differences (2022)
7. Patterson, S.: Learn AWS Serverless Computing: A Beginner’s Guide to Using AWS Lambda, Amazon API Gateway, and Services from Amazon Web Services. Packt Publishing Ltd. (2019)
8. Grumuldis, A.: Evaluation of “serverless” application programming model: How and when to start serverless (2019)
9. Abbas, R., Sultan, Z., Bhatti, S.N.: Comparative analysis of automated load testing tools: Apache JMeter, Microsoft Visual Studio (TFS), LoadRunner, Siege. In: 2017 International Conference on Communication Technologies (ComTech), pp. 39–44. IEEE (2017)
10. Andell, O.: Architectural implications of serverless and function-as-a-service (2020)
11. Use API Gateway Lambda authorizers (2022). https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-use-lambda-authorizer.html
12. Calles, M.A.: Authentication and authorization. In: Serverless Security, pp. 229–256. Springer (2020)
13. Malawski, M., Figiela, K., Gajek, A., Zima, A.: Benchmarking heterogeneous cloud functions. In: Heras, D.B., Bougé, L. (eds.) Euro-Par 2017. LNCS, vol. 10659, pp. 415–426. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75178-8_34
14. Jackson, D., Clynch, G.: An investigation of the impact of language runtime on the performance and cost of serverless functions. In: 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion), pp. 154–160. IEEE (2018)
15. Cordingly, R., et al.: Implications of programming language selection for serverless data processing pipelines. In: 2020 IEEE International Conference on Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International Conference on Cloud and Big Data Computing, International Conference on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), pp. 704–711. IEEE (2020)
16. Somu, N., Daw, N., Bellur, U., Kulkarni, P.: Panopticon: A comprehensive benchmarking tool for serverless applications. In: 2020 International Conference on COMmunication Systems & NETworkS (COMSNETS), pp. 144–151. IEEE (2020)
17. Shahrad, M., Balkind, J., Wentzlaff, D.: Architectural implications of function-as-a-service computing. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1063–1075 (2019)
18. Kelly, D., Glavin, F., Barrett, E.: Serverless computing: Behind the scenes of major platforms. In: 2020 IEEE 13th International Conference on Cloud Computing (CLOUD), pp. 304–312. IEEE (2020)
19. Analyzing log data with CloudWatch Logs Insights (2022). https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html
20. Ritzal, R.: Optimieren von Java für Serverless Applikationen. PhD thesis, University of Applied Sciences (2020)
21. Farley, D.: Modern Software Engineering: Doing What Works to Build Better Software Faster. Addison-Wesley Professional (2021)
22. Ibrahimi, A.: Cloud computing: Pricing model. Int. J. Adv. Comput. Sci. Appl. 8(6) (2017)
23. AWS Lambda Pricing (2022). https://aws.amazon.com/lambda/pricing/
24. AWS Lambda Runtimes and performance (2022). https://docs.aws.amazon.com/lambda/latest/operatorguide/runtimes-performance.html
25. Back, T., Andrikopoulos, V.: Using a microbenchmark to compare function as a service solutions. In: European Conference on Service-Oriented and Cloud Computing, pp. 146–160. Springer (2018)
26. Roberts, M., Chapin, J.: What is Serverless. O’Reilly Media, Incorporated (2017)
27. Use AWS Lambda authorizers with a third-party identity provider to secure Amazon API Gateway REST APIs. Amazon Web Services, Inc (2020). https://aws.amazon.com/blogs/security/use-aws-lambda-authorizers-with-a-third-party-identity-provider-to-secure-amazon-api-gateway-rest-apis/
Hardware and Software Integration of Machine Learning Vision System Based on NVIDIA Jetson Nano

Denis Manolescu, David Reid, and Emanuele Lindo Secco(B)

Robotics Lab, School of Mathematics, Computer Science and Engineering, Liverpool Hope University, Liverpool, United Kingdom
{20203547,reidd,seccoe}@hope.ac.uk
Abstract. This study investigates the capabilities and flexibility of edge devices for real-time data processing near the source. A configurable Nvidia Jetson Nano system is used to deploy nine pre-trained computer vision models, demonstrating proficiency in local data processing and analysis and in providing real-time feedback. Additionally, the system offers deployment control via a customized Graphical User Interface (GUI) and provides very low-latency re-streaming of the inference output to other local devices using the GStreamer framework. The Machine Learning models, which cover a wide range of applications including image classification, object recognition and detection, depth estimation and semantic segmentation, show potential for IoT and industrial applications. Further, the fusion of these capabilities with AI and machine learning algorithms offers a promising perspective for substantial industrial redevelopment. This research underscores the strategic significance of edge devices in modern computational frameworks and their potential role in future technological advancements.

Keywords: Computer vision inference · Deep learning · IoT · Machine learning · Jetson nano · Low-cost systems · Portable systems
1 Introduction

Machine Learning and Computer Vision have rapidly evolved in recent years, enabling a wide range of applications across various industries that favour efficiency enhancement and automation. The development of compact, modular and powerful hardware, such as the NVIDIA Jetson Nano, has further fuelled this growth by decentralising access to high-performance AI capabilities and edge computing [1]. Such a system is versatile across a variety of applications where Computer Vision is required together with portability, low-cost assembly and modularity [2–6]. In this context, we aim to provide an integrated, low-cost and portable system for Computer Vision that combines adequate computational capability with easy interaction through a customized end-user interface.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 129–137, 2024. https://doi.org/10.1007/978-3-031-54053-0_10
D. Manolescu et al.
The paper presents an experimental study focused on deploying deep learning and computer vision models on edge devices, highlighting the potential for real-world applications. The hardware setup includes a Jetson Nano equipped with an 8 MP camera module connected by a 2-m HDMI cable using two CSI-to-HDMI extension adapters. Additionally, the device has been fitted with a Wi-Fi network adapter with a dual-antenna range extension and a 64 GB SD card for OS storage (see Fig. 1). The system is operated by the Nvidia AI framework, namely the Jetpack SDK, built on top of Ubuntu OS, which offers a comprehensive set of tools, libraries, and APIs to build, train, optimise, and run machine learning models. This software suite leverages GPU-accelerated computing with the help of specialised libraries and tools such as the CUDA toolkit, cuDNN, TensorFlow, TensorRT and PyTorch.

The final results of this research present nine Computer Vision models running on the Jetson Nano and deployed in real time on the CSI camera video stream. Using the Nvidia transfer learning technique, these models are pre-trained in image classification, object recognition and detection, depth estimation, and semantic segmentation. Additionally, the device runs a local closed-loop server that hosts a website landing page, allowing users to switch between the models. Furthermore, the processed image is broadcast to a specific local IP address on another device. Through these results, the study demonstrates the reliability, performance and low-latency video re-streaming of various computer vision algorithms deployed on edge computing devices.

Section 2 details the hardware settings of the Jetson Nano and all the components necessary to replicate the experiment, and continues with the SDK software setup and swap memory, the Docker container, the Flask server and the device automation process. The section ends with details about developing the HTML server landing page and integrating GStreamer broadcasting into the project. Section 3 briefly presents the nine convolutional networks incorporated into the research. Section 4 analyses the challenges and the solutions adopted, while the conclusion in Sect. 5 reflects on the perspective, importance and advantages of edge devices.
Fig. 1. The Jetson Nano system: Hardware components and assembly on the left and right panels, respectively
2 Materials and Methods

2.1 Hardware Set-up and Assembly

The Nvidia Jetson Nano is a single-board edge computing system powered by a 128-core Maxwell GPU and a quad-core ARM A57 CPU with 4 GB of RAM, offering about 0.5 teraflops of half-precision (FP16) GPU computing performance. These capabilities make it suitable for various AI and deep learning real-time inferencing tasks, including running computer vision or natural language processing models. While the device is sold as a development kit, the manufacturer provides only the carrier board and the Jetson GPU module. For this study, the device has been fitted with an Intel Wi-Fi network card (AC8265) and a set of Wi-Fi antennas, enhancing its mobility and accessibility (see Fig. 1). In addition, to make the camera easy to manoeuvre, two CSI-to-HDMI adapters connect the board to an 8 MP camera module via a 2-m-long HDMI cable. The Jetson’s 5V-2A AC power adapter mode is selected by shorting the J48 jumper pins; otherwise, the device is powered via micro-USB by default. Initially, the Jetson board uses a USB mouse and keyboard and is connected to an HDMI monitor. This configuration is essential because numerous computer vision models and some superuser (root) system settings cannot be started through headless connections such as SSH.

2.2 Software Configuration and Further Developments

The Jetson Nano runs a customized version of Ubuntu 18.04 LTS called Linux for Tegra (L4T), which is integrated into the multi-tool Jetpack SDK (v4.6). The first step in installing the SDK is to download the SD card system image file provided by the developer (Nvidia, 2019) and flash it onto the memory card using the Etcher software (Etcher, 2021 – [7]). After booting up, the on-screen GUI of the OS gives the remaining instructions. Additional libraries and packages may also be necessary, depending on the scope of the installation and on the compatibility and support between the versions of the different tools.
Swap Memory. Before using the device for any memory-intensive task, such as deep learning, it is important to increase the swap memory to prevent Out-Of-Memory (OOM) events and to give the system additional memory headroom when necessary. This was done by executing the terminal commands shown in Fig. 2.

Nvidia Docker Container. NVIDIA provides a selection of Docker containers developed explicitly for the Jetson Nano platform. These containers are optimized for the Nano hardware and include most of the necessary dependencies and software, making it easier to deploy AI projects. The Docker container is installed with the terminal commands shown in Fig. 3. After the Docker installation, the user can download different pre-trained inference models from a pop-up Graphical User Interface (GUI), as shown in Fig. 4. For experimentation purposes, all the available models are downloaded. Once the models are installed, they can be tested with a command such as imagenet csi://0, which should run the inference directly on the CSI camera stream.
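After changing the swap configuration described under Swap Memory, it can be useful to confirm that the new size is active before starting a memory-intensive model. The Python sketch below is one possible check and is not part of the original setup: it reads the `SwapTotal` entry that Linux exposes in `/proc/meminfo`, and the function names and the sample text in the usage lines are hypothetical.

```python
def swap_total_mib(meminfo_text):
    """Return the total swap size in MiB from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The value is reported in kB, e.g. "SwapTotal:  4194300 kB".
            return int(line.split()[1]) / 1024
    return 0.0


def read_swap_total_mib(path="/proc/meminfo"):
    """Read the current swap size of the running Linux system."""
    with open(path) as handle:
        return swap_total_mib(handle.read())


# Illustrative input only; on the device, call read_swap_total_mib() instead.
example = "MemTotal:        4059712 kB\nSwapTotal:       4194300 kB\n"
print(f"swap: {swap_total_mib(example):.0f} MiB")
```

A value of zero indicates that no swap is configured, in which case the commands of Fig. 2 have not taken effect.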
Fig. 2. Swap memory setting to prevent OOM events
Fig. 3. Installation of the Docker container
Fig. 4. Jetson Nano inference ML models
Flask Server, System Start-Up and Autorun Configurations. Flask is a lightweight, easy-to-use web micro-framework for Python that allows web applications to be created quickly and with minimum effort, without needing any particular tools or libraries. Before installing Flask, it is generally recommended to set up a Python virtual environment, whose tooling is installed with the command sudo apt install python3-venv. A server folder is then created, in which the Python environment is set up with the command python3 -m venv name, where name is user-defined. The virtual environment is activated by running source name/bin/activate, after which Flask is installed with pip install Flask. The Flask server is run through a backend Python script that is linked to the HTML frontend GUI (see Fig. 5). The Flask server starts automatically every time the Jetson system restarts. This automation has been implemented by using a shell script file and a systemd service
Fig. 5. Jetson Nano system diagram
configuration file. Once a copy of jetserver.service is placed in /etc/systemd/system/, the service is enabled with sudo systemctl enable jetserver.service and started with sudo systemctl start jetserver.service. The jetserver.sh script must also be made executable by running chmod +x /path-to-server/jetserver.sh.

GUI Browser Control Page – Design and Setup. The GUI of the webpage is developed in HTML5 and CSS and adapts to any screen size (see Fig. 6). The page sends user commands to a backend Python script responsible for stopping the running ML model and switching to another one. The landing and control page is hosted by the Flask server running on the Jetson Nano, which simultaneously performs inference on the live camera feed without degrading stream latency or performance.

GStreamer – Real-Time Data Streaming Protocol. The project uses GStreamer with the RTP network protocol to broadcast the inference output from the Jetson Nano to an external machine connected to the local network. The receiving device must have GStreamer installed to receive the data stream. The streaming pipeline can transmit, encode, decode and display data packets at very high speed in real time. Once the CNN model starts the inference, the stream is sent to a specific IP address and port, and it is received with the command line shown in Fig. 7.
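The closing and switching of models performed by the backend script can be pictured with the following Python sketch. It is a minimal illustration under stated assumptions and not the authors' script: only the imagenet command and the csi://0 source appear in the text, while the other tool names, the `rtp://` output address, the IP address and port, and the names `build_command` and `ModelSwitcher` are assumptions made for the example. In the real system a Flask route would call `switch` when the user clicks a model on the control page.

```python
import subprocess

# Assumed command-line tool names for the inference models (only "imagenet"
# is confirmed by the text above).
KNOWN_TOOLS = {"imagenet", "detectnet", "segnet", "posenet",
               "actionnet", "backgroundnet", "depthnet"}


def build_command(tool, source="csi://0", sink=None):
    """Build the command line for one inference tool."""
    if tool not in KNOWN_TOOLS:
        raise ValueError(f"unknown model tool: {tool}")
    command = [tool, source]
    if sink is not None:
        command.append(sink)  # e.g. an RTP address on the local network
    return command


class ModelSwitcher:
    """Keep at most one inference process running at a time."""

    def __init__(self, launcher=subprocess.Popen):
        self._launcher = launcher  # injectable so the logic can be tested
        self._process = None
        self.current = None

    def stop(self):
        """Terminate the running model, if any."""
        if self._process is not None:
            self._process.terminate()
            self._process.wait()
            self._process = None
            self.current = None

    def switch(self, tool, source="csi://0", sink=None):
        """Stop the running model and start the requested one."""
        command = build_command(tool, source, sink)  # validate before stopping
        self.stop()
        self._process = self._launcher(command)
        self.current = tool
        return command


# Usage on the device (hypothetical receiver address and port):
# switcher = ModelSwitcher()
# switcher.switch("detectnet", sink="rtp://192.168.1.50:1234")
```

Validating the requested tool before stopping the current process means that an invalid request from the web page leaves the running model untouched.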
3 Machine Learning Inference Models

The entire training, validation and testing process of the models takes place in isolation in the Jetson Docker container. In this research, the Jetson Nano runs nine pre-trained network models through the server user interface, all fully optimized for the Nvidia edge device architecture, as follows:
• ImageNet relies on one of the pre-trained convolutional neural network versions of GoogleNet, trained on the extensive ImageNet dataset for object recognition tasks. It classifies objects within input images or real-time video streams.
• DetectNet is a model designed explicitly for object detection, identifying the class of objects and their location within the input image or video. It provides bounding boxes around detected objects.
D. Manolescu et al.
Fig. 6. Design of the GUI webpage
Fig. 7. Integration of the GStreamer RTP network protocol
• The SegNet model focuses on semantic segmentation, which classifies individual pixels in an input image or video stream, allowing for precise segmentation of objects and scenes. It is useful for applications such as autonomous navigation and scene understanding.
• PoseNet and HandNet are similar models dedicated to human pose estimation, capable of detecting body joints and estimating the pose of individuals within an input image or video. This inference enables applications such as motion analysis, gesture recognition, hand manipulation, and human-computer interaction.
• ActionNet is designed for human action recognition: it identifies and classifies various human actions within a given video stream, enabling applications such as video surveillance and activity monitoring.
• BackgroundNet is designed for background subtraction and foreground object detection. It isolates moving objects from static backgrounds in video streams, which is useful for applications such as traffic analysis, security, and video editing.
• DepthNet specializes in depth estimation, predicting the depth of objects and scenes within input images or video streams. This network is helpful for applications such as 3D reconstruction, augmented reality, and robotics, taking advantage of edge devices’ capabilities for real-time performance.
• Kiwi_noKiwi is a ResNet18 classification network built and trained entirely on the Jetson Nano for this research. The dataset was collected using the camera-capture tool developed by Nvidia, which organizes the images taken via the CSI camera into a train-validate-test folder structure [8]. The network was trained for 35 epochs on 210 photographs labelled as scenes with and without kiwi fruit. The final results show 87% accuracy, although a larger dataset could significantly improve identification for untrained angles and backgrounds.
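To make the train-validate-test structure mentioned for Kiwi_noKiwi concrete, the sketch below partitions a list of image names into three subsets. The paper does not state how the 210 photographs were divided, so the 80/10/10 ratio, the fixed seed, and the file names are assumptions used only for illustration.

```python
import random


def split_dataset(files, train=0.8, val=0.1, seed=0):
    """Shuffle file names reproducibly and split them into train/val/test lists."""
    shuffled = sorted(files)               # sort first so the result is order-independent
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # whatever remains goes to the test set
    }


if __name__ == "__main__":
    # Hypothetical file names standing in for the 210 captured photographs.
    photos = [f"img_{i:03d}.jpg" for i in range(210)]
    parts = split_dataset(photos)
    print({name: len(items) for name, items in parts.items()})
```

With these assumed ratios, 210 images give 168 training, 21 validation, and 21 test images, which illustrates how small the held-out sets are for a dataset of this size.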
4 Results

During this study, the Jetson Nano performed stably when running almost all of the convolutional networks, with only occasional processor overheating warnings. The device often crashed when training a new CNN model because of accumulated out-of-memory (OOM) errors. The solution to this OOM issue was to switch the board to runlevel 3, the text-based mode (command: init 3), which disables the graphical interface and releases around 800 MB of RAM. Furthermore, because of the limited GPU resources, the ActionNet and BackgroundNet inference models run at only three to six frames per second. Another issue with these two networks is their lack of support outside the Jetson inference Docker container. Many attempts were made to deploy these models outside the container, or to have a Python script trigger the container to run shell commands autonomously in the backend, but none worked. For this reason, these networks currently do not have webpage GUI control.
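The out-of-memory issue described above suggests checking the available RAM before starting a training run. The following sketch parses the MemAvailable field of Linux's /proc/meminfo; the helper names and the 800 MB threshold used in the example are assumptions chosen to mirror the figure quoted in the text, not part of the original system.

```python
from pathlib import Path


def available_mb(meminfo_text: str) -> float:
    """Extract MemAvailable (reported in kB) from /proc/meminfo text, in MB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kilobytes = int(line.split()[1])
            return kilobytes / 1024
    raise ValueError("MemAvailable field not found")


def enough_memory(meminfo_text: str, required_mb: float) -> bool:
    """Return True when at least required_mb of RAM is currently available."""
    return available_mb(meminfo_text) >= required_mb


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # only present on Linux systems such as the Jetson
        text = meminfo.read_text()
        print(f"Available RAM: {available_mb(text):.0f} MB")
        # Hypothetical threshold; the real requirement depends on the model.
        print("OK to train" if enough_memory(text, 800) else "Consider init 3 first")
```

Such a check does not free memory by itself; it only indicates whether switching to the text-based mode is likely to be necessary before training.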
5 Conclusions

The experiments in this research have shown the potential and flexibility of handling data closer to its source using a configurable edge device. These hardware tools offer excellent support for local information processing, analysis or filtering, and enable real-time feedback for IoT devices or other industrial applications that require Computer Vision combined with Machine Learning [9–11]. Edge devices can be highly capable of optimising their computational resources
and can enhance or be easily integrated into any large-scale project. Their capacity for real-time image and video processing, combined with the inference ability of AI and Machine Learning algorithms, offers a promising outlook for cohesive industrial redevelopment. Moreover, other methods of unsupervised learning could be explored in the future, including, for example, clustering-based methods. In this context, it could also be worth exploring a set of applications where these technologies could be effectively implemented with practical impact [10–12].

Acknowledgments. This work was completed by Denis Manolescu as part of his coursework requirements for the BEng in Robotics at Liverpool Hope University’s Robotics Laboratory within the School of Mathematics, Computer Science, and Engineering.
Conflict of Interest. There is no conflict of interest for this study.
References
1. Nvidia: Getting Started with Jetson Nano Developer Kit (2019). https://developer.nvidia.com/embedded/learn/get-started-jetson-nano-devkit#write
2. Isherwood, Z., Secco, E.L.: A Raspberry Pi computer vision system for self-driving cars. Comput. Conference 2, 910–924 (2022). https://doi.org/10.1007/978-3-031-10464
3. Humphreys, J., Secco, E.L.: A low-cost thermal imaging device for monitoring electronic systems remotely. Computing Conference 2023, Lecture Notes in Networks and Systems, in press
4. Van Eker, M., Secco, E.L.: Development of a low-cost portable device for the monitoring of air pollution. Acta Scientific Computer Sciences 5(1) (2023)
5. Tharmalingam, K., Secco, E.L.: A Surveillance Mobile Robot based on Low-Cost Embedded Computers. In: 3rd International Conference on Artificial Intelligence: Advances and Appl. 25, 323–334 (ICAIAA 2022). https://doi.org/10.1007/978-981-19-7041-2
6. Brown, K., Secco, E.L., Nagar, A.K.: A Low-Cost Portable Health Platform for the Monitoring of Human Physiological Signals. In: The 1st EAI International Conference on Technology, Innovation, Entrepreneurship and Education. 978–3–030–02242–6_16 (2017)
7. Etcher: Etcher - Flash. Flawless. Flash OS images to SD cards & USB drives, safely and easily (2021). https://www.balena.io/etcher
8. Dustin Franklin, N.: Re-training on the Cat-Dog Dataset (2023). https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-cat-dog.md. Accessed 2023
9. McHugh, D., Buckley, N., Secco, E.L.: A low-cost visual sensor for gesture recognition via AI CNNS. Intelligent Systems Conference (IntelliSys) 2020, Amsterdam, The Netherlands
10. Buckley, N., Sherrett, L., Secco, E.L.: A CNN sign language recognition system with single & double-handed gestures. In: IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1250–1253 (2021). https://doi.org/10.1109/COMPSAC51774.2021.00173
11. Secco, E.L., McHugh, D.D., Buckley, N.: A CNN-based Computer Vision Interface for Prosthetics’ application. In: EAI MobiHealth 2021 - 10th EAI International Conference on Wireless Mobile Communication and Healthcare, pp. 41–59 (2022). https://doi.org/10.1007/978-3-031-06368-8_3
12. Myers, K., Secco, E.L.: A Low-Cost Embedded Computer Vision System for the Classification of Recyclable Objects. Congress on Intelligent Systems (CIS - 2020), Intelligent Learning for Computer Vision, Lecture Notes on Data Engineering and Communications Technologies 61. https://doi.org/10.1007/978-981-33-4582-9_2
Novel Approach to 3D Simulation of Soft Tissue Changes After Orthognathic Surgery

B. A. P. Madhuwantha(B), E. S. L. Silva, S. M. S. P. Samaraweera, A. I. U. Gamage, and K. D. Sandaruwan

University of Colombo School of Computing, Colombo 00700, Sri Lanka
[email protected], [email protected]
Abstract. This research paper discusses a novel approach to simulating soft tissue changes after orthognathic surgery. Since the surgery is complex, simulating the outcome is valuable and ultimately benefits both the doctor and the patient. It lets doctors give patients a more accurate picture of the surgery outcome and improves their decision-making. However, current simulation models such as Finite Element Analysis (FEA), the Mass Tensor Model (MTM), and the Mass Spring Model (MSM) are computationally expensive. Commercial solutions such as Dolphin and SurgiCase are costly, complicated, and need extensive training. Thus, it is difficult for small-scale medical practices and public hospitals to adopt these existing solutions. In our proposed approach, the lower area of the face is divided into 10 regions, and by considering the bone-to-soft tissue movement ratio of each region and a custom fall-off algorithm, the 3D face mesh is changed to reflect the planned changes. These changes can be customized because the implemented application allows the value of bone movement, texture scale, and other properties to be changed. This supports doctors in a well-informed, easy decision-making process on how much the bone tissue should move to get the best possible outcome. After skin texture is added, the final 3D prediction is achieved. A working browser-based application was implemented using ThreeJS, and the preliminary results of the experiments were promising. The accuracy and other aspects of this approach are yet to be evaluated systematically.

Keywords: 3D-Soft-Tissue simulation · Orthognathic surgery · Cephalometric landmarks
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 138–149, 2024. https://doi.org/10.1007/978-3-031-54053-0_11

1 Introduction

1.1 Problem Statement

Orthognathic surgery is a corrective jaw surgery in which misaligned jaws are surgically cut and moved into correct alignment. Alignment issues are often primarily due to problems in the teeth, not in the jaw (base/skull). Those issues can be fixed with orthodontic treatment, which straightens or moves the teeth, typically using braces, to improve their appearance and how they work. However, when it is impossible to correct the misalignment with orthodontic treatment alone, the
Novel Approach to 3D Simulation of Soft Tissue Changes
patient is directed to jaw surgery. Misalignment of the upper jaw (maxilla) and lower jaw (mandible) significantly impacts the face’s appearance and aesthetics. Orthognathic surgery is not a single event but a complex process. It starts with orthodontic treatment, which takes about one to two years, and then proceeds to the surgery itself. After the surgery, there will be swelling and inflammation for two to three weeks [1]. It takes around six months to heal completely from the surgery, including the bone. Because the procedure is complex, it is reassuring for the patient to know what the face will look like after the surgery. A simulation of the outcome also helps doctors decide how much correction is needed to achieve the desired result. An increasing number of adult patients [2] have chosen orthodontic care during the past ten years. Surgery has become a realistic choice for many patients thanks to advancements in jaw surgery, computer graphics and visualization, robust skeletal fixation, and shorter admission periods. Misalignment of the upper jaw (maxilla) and lower jaw (mandible) cannot be fixed without proper orthognathic surgery.

1.2 Significance

Orthognathic surgery seeks to enhance jaw function and create a more aesthetically pleasing facial skeleton. This prospective study examined whether orthognathic surgery improved the quality of life of individuals with dentofacial deformities and whether such improvements were clinically significant. Jaw surgery has many advantages, such as a balanced appearance of the lower face; health benefits from improved sleep, breathing, chewing, and swallowing; and improvements in speech, appearance, and self-esteem. Predicting the surgery results at an early stage lets the patient form an idea of the outcome and request changes or make suggestions.
Existing models and commercial tools face various problems, such as high costs, heavy computational requirements, bulky applications, device-specific limitations, and the need for extensive training. Within the information systems subject area, this research, unlike most, is not focused only on coding or developing algorithms. The lack of a cost-effective and efficient prediction method for orthognathic surgery is a real-world problem faced by many small-scale medical practices and public hospitals, and this research addresses that issue. It therefore goes beyond pure research and falls within applied research. In contrast to the existing tools, our approach is designed to be highly useful for small-scale medical practices. It is easy to adopt and can be used on a wide range of devices because it is browser-based. Furthermore, it is user-friendly and has a shallow learning curve, making it accessible to a wide range of medical professionals.

1.3 Related Work

According to certain research, visualizing the outcome helps patients understand how they will appear after the treatment and better match their expectations with the clinical results [3, 4]. Orthognathic surgery has relied extensively on manual tracing [5] of cephalometric X-rays, which give the dentist a complete radiographic image of the side of the face, for treatment planning and the prediction of operational results [6]. By comparing the prediction models built on this scale and the patient’s postoperative condition, studies were performed to develop
B. A. P. Madhuwantha et al.
2D models based on the ratio between the movement of the facial soft tissue and that of the bones [7], focusing on sub-regions of the face. It is noted that traditional approaches have limitations in predicting the outcome of large and complex movements, and many factors must be selected [8]. On the other hand, planning based on three-dimensional (3D) computing has recently grown in popularity, as a precise 3D surgical simulation is helpful for communicating with patients, planning orthognathic surgery, and evaluating surgical outcomes [9]. For 3D prediction, Hee-Yeon Suh et al. [9] used the sparse partial least squares method to create a prediction system that anticipates soft tissue changes following orthognathic surgery. They found no statistically significant differences in prediction accuracy with respect to surgical movements, the individuals’ sex, or additional surgeries. Olga-Elpis Kolokitha and Nikolaos Topouzelis [10] compared manual and computer-based techniques for modeling the results of orthognathic surgery and pointed out the shortcomings of various traditional tracing techniques. Samir Aboul-Hosn Centenero and Federico Hernández-Alfaro [11] determined the advantages of 3D planning in orthognathic surgery when using CAD/CAM technologies and found that the results can be reliable in certain areas. Based on probabilistic finite element modeling, Paul G.M. Knoops et al. [12] suggested a novel soft tissue prediction methodology for orthognathic surgery and noted its benefits over conventional methods. Several commercial tools are available for 3D orthognathic surgery planning, soft tissue prediction, and result simulation, such as Dolphin 3D, Maxilim, SurgiCase, IPS Case Designer, and ProPlan CMF [13]. The fundamental distinction between them is the physical model they apply. All of this software comes at a considerable cost, with most requiring an annual license renewal [13].
However, for small-scale medical practices, these solutions may not be practical or affordable. Our approach offers such practices a more practical and affordable alternative: a user-friendly, browser-based solution tailored to their needs and budget. FEM (Finite Element Models), MTM (Mass Tensor Models), and MSM (Mass Spring Models) are several dense volumetric models [14]. The main prediction programs in the orthognathic subject area rely on these dense volumetric models [15]. Some also rely on sparse models that require orientation and are mainly based on interpolation between positions. Certain investigations have shown that various prediction programs have errors of several millimeters, although some are considered acceptable, while other researchers question several of the predictions [12, 16, 17]. Currently, there is no adequate, cost-effective approach to displaying the patient’s post-surgery appearance in Sri Lanka. Most hospitals use printed images of the patient that are adjusted by cutting and rearranging portions of the photo to get a general notion of the outcome. This approach shows the changes in the profile view, that is, the lateral side of the face. However, it does not give the doctor or patient an accurate picture of the surgery’s outcome. Orthognathic surgery planning software systems such as Dolphin, Maxilim, and SurgiCase are used at some facilities. Notably, no analysis or process for simulating soft tissue changes after orthognathic surgery has been developed in Sri Lanka. Also, the use of computer-based orthognathic surgery result prediction is an area
that many have failed to address or have left unconsidered. This background motivates the present research. This research proposes a novel approach to predicting soft tissue changes after orthognathic surgery using a CBCT (Cone Beam Computed Tomography) scan and free and open-source libraries, with the added advantage of a browser-based solution that anyone can use without installing any third-party software. A further advantage is that a CBCT scan usually has a lower radiation dose than a CT scan while providing a higher resolution [18].
2 Methodology

When planning an orthognathic surgery, the doctor first takes a lateral x-ray of the patient and marks certain points on it, called cephalometric landmarks, as shown in Fig. 1. The doctor also takes photos of the patient from different angles (front view, side view). Using the cephalometric landmarks, the doctor carries out a cephalometric analysis and determines the bone movement needed to correct the patient’s jaw. In addition to the x-ray, the doctor also takes a CBCT scan.
(a) Soft tissue landmarks
(b) Hard tissue landmarks
Fig. 1. Cephalometric landmarks
2.1 Pre-processing and Importing

The proposed approach starts from a CBCT scan, which provides a 3D model of the patient’s jaw area. The CBCT scan is stored in the DICOM format. The DICOM file is imported into a medical image viewer, from which two separate 3D meshes are generated and exported: one for the skull and one for the outermost soft tissue layer. Because the number of vertices in these meshes is very high, the meshes need to be decimated. Decimation reduces the number of vertices while maintaining the overall shape and appearance of the object. A web-based application was implemented using ThreeJS, a JavaScript library and API that enables the creation and rendering of 3D computer graphics in a web browser using WebGL (Web Graphics Library). It is compatible with various web browsers and supports animation. The pre-processed models are imported into the implemented application.
2.2 Marking Cephalometric Landmarks

Hard Tissue Landmarks. In the implemented web-based application, the doctor marks the hard tissue landmarks on the imported 3D skull mesh. For convenience and a better user experience, an x-ray view is incorporated into the skull mesh by means of an x-ray shader. Shaders decide how the pixels of a 3D object are rendered on the screen. The x-ray shader is a fragment shader that adjusts the opacity of the pixels based on the camera position. Pixels that are closer to the camera are more visible and less transparent, while pixels that are further away are less visible and more transparent. This makes it look as if the viewer can see through the skull mesh, which helps the doctor to visualize and mark hard tissue landmarks accurately on the skull mesh in 3D space, as shown in Fig. 1. The experience is similar to marking points on a lateral skull x-ray. After the points are marked, the cephalometric tracing is done and the medical expert decides how much the jaw(s) should move. For example, the doctor might conclude that the lower jaw should move forward by 0.9 cm. This result is then used in the change simulation.

Soft Tissue Landmarks. To simulate the soft tissue change, the doctor must mark 13 soft tissue landmarks on the outer mesh (skin tissue layer); based on those landmarks, 10 facial regions are identified following [7]. Table 1 gives the details of the 13 soft tissue landmarks needed for this study. Figure 2 shows the ten regions identified using those landmarks. Based on some of those landmarks, four vertical planes and five horizontal planes mark the borders of each region, as shown in Fig. 3.

Table 1. Identified soft tissue landmarks

| Landmark | Abbreviation | Definition |
|---|---|---|
| Exocanthion | ex_l, ex_r | The leftmost and rightmost points at the outer commissure of the fissure |
| Superior alar curvature | sac_l, sac_r | The leftmost and rightmost points at the upper margin of the curved base line of each ala |
| Alare | al_l, al_r | The most lateral point on each alar contour |
| Alar curvature | ac_l, ac_r | The outermost point of the curved base line of each ala, marking the nasal wing base’s facial attachment |
| Subnasale | sn | The midpoint on the nasolabial soft tissue contour between the columella crest and the upper lip |
| Labiale superius | ls | The midpoint of the upper vermilion line |
| Labiale inferius | li | The midpoint of the lower vermilion line |
| Cheilion | ch_l, ch_r | The leftmost and rightmost points located at each labial commissure |
Fig. 2. Soft tissue landmarks and the 10 regions identified [7]
Fig. 3. Identified horizontal and vertical planes based on soft tissue landmarks
2.3 Identifying Mesh Vertices Per Region

To simulate the soft tissue movement, the position of each vertex of the face mesh that lies within a particular region needs to be updated. Taking region no. 3 as an example, it is bounded by the vertical plane of ExR, the horizontal plane of SN, and the horizontal and vertical planes of ChR. The horizontal and vertical planes surrounding region 3 are shown in Fig. 4.
Fig. 4. Planes surrounding Region 3
Once the surrounding planes are identified, the following test can be used to find the vertices enclosed by them.
if vertex.x < rightVerticalPlane.x && vertex.x > leftVerticalPlane.x
   && vertex.y < topHorizontalPlane.y && vertex.y > bottomHorizontalPlane.y
then
    /* this vertex belongs to this particular region */
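The membership test above can be written as a small runnable sketch. The Region class, its field names, and the sample coordinates are hypothetical and serve only to illustrate the comparison; the implemented application performs this step in ThreeJS rather than Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Region bounded by two vertical and two horizontal planes."""
    left: float    # x of the left vertical plane
    right: float   # x of the right vertical plane
    bottom: float  # y of the bottom horizontal plane
    top: float     # y of the top horizontal plane

    def contains(self, vertex) -> bool:
        """Apply the strict inequalities of the test to a vertex (x, y, z)."""
        x, y = vertex[0], vertex[1]
        return self.left < x < self.right and self.bottom < y < self.top


def vertices_in_region(vertices, region):
    """Return the indices of the vertices that fall strictly inside the region."""
    return [i for i, v in enumerate(vertices) if region.contains(v)]


if __name__ == "__main__":
    # Illustrative coordinates only; real planes come from the marked landmarks.
    region = Region(left=-4.0, right=-1.0, bottom=-3.0, top=0.0)
    mesh = [(-2.0, -1.0, 0.5), (0.0, -1.0, 0.5), (-2.0, 1.0, 0.2)]
    print(vertices_in_region(mesh, region))
```

Because the comparisons are strict, a vertex lying exactly on a bounding plane is assigned to no region; whether such vertices should be included is a design choice the text leaves open.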
2.4 Simulating Soft Tissue Change

The bone-to-soft tissue movement ratio can be obtained for each of the 10 regions, and the bone movement needed to correct the misalignment of the jaw is identified from the results of the cephalometric tracing. Combining the bone movement with the bone-to-soft tissue movement ratio of each region gives the final position of each region. The movement of each region R is calculated using Eq. (1):

$$R_{STM} = HTM \times R_{Ratio} \qquad (1)$$

According to Eq. (1), the soft tissue movement (STM) in region R is the product of the cephalometric analysis result (hard tissue movement, HTM) and the bone-to-soft tissue movement ratio of region R. Based on this calculation, the mesh vertices in each region are moved by a value determined by the bone-to-soft tissue ratios ($R_{Ratio}$) reported in [7], which are summarized in Table 2.

Table 2. Identified 10 regions

| Region no. | Region | Avg. S/H movement ratio |
|---|---|---|
| 1 | Right paranasal | 0.68 ± 0.36 |
| 2 | Left paranasal | 0.63 ± 0.33 |
| 3 | Right supracommissural | 0.68 ± 0.20 |
| 4 | Left supracommissural | 0.67 ± 0.34 |
| 5 | Upper lip | 0.77 ± 0.35 |
| 6 | Upper vermilion | 0.99 ± 0.48 |
| 7 | Lower vermilion | 0.95 ± 0.19 |
| 8 | Chin | 0.83 ± 0.11 |
| 9 | Right infracommissural | 0.80 ± 0.23 |
| 10 | Left infracommissural | 0.82 ± 0.19 |
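As a worked instance of Eq. (1), the sketch below multiplies a hard tissue movement by the mean ratio of a region from Table 2. The function name is hypothetical, only the mean ratios are used (the ± spread is ignored), and the 9 mm movement simply reuses the 0.9 cm example given earlier.

```python
# Mean soft-to-hard tissue movement ratios per region, taken from Table 2.
RATIOS = {
    1: 0.68, 2: 0.63, 3: 0.68, 4: 0.67, 5: 0.77,
    6: 0.99, 7: 0.95, 8: 0.83, 9: 0.80, 10: 0.82,
}


def soft_tissue_movement(htm_mm: float, region: int) -> float:
    """Eq. (1): soft tissue movement = hard tissue movement x region ratio."""
    return htm_mm * RATIOS[region]


if __name__ == "__main__":
    htm = 9.0  # hard tissue movement in mm (0.9 cm, as in the earlier example)
    for region in sorted(RATIOS):
        print(f"region {region:2d}: {soft_tissue_movement(htm, region):.2f} mm")
```

For a 9 mm bone movement, the chin (region 8) would move about 9 × 0.83 = 7.47 mm, while the right supracommissural region (region 3) would move about 9 × 0.68 = 6.12 mm.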
Since each region moves according to its own ratio, a custom fall-off algorithm is integrated to smooth the transitions between regions. This algorithm is based on the fall-off algorithms used in the Blender software, which are shown in Fig. 5. There are several types of fall-off
algorithms in Blender [19]. For the proposed approach, only the sphere, linear, smooth, and inverse-square fall-off algorithms were considered for experimentation, because they show visual similarities with the behavior of human skin tissue. Sphere fall-off causes the intensity of a given effect to decrease in a spherical pattern around a specific point on the object: the further away from the point, the weaker the effect. Inverse-square fall-off is based on the inverse-square law, which states that the intensity of an effect or force decreases with the square of the distance from the source. In smooth fall-off, the intensity of the effect decreases gradually rather than stopping abruptly at a certain point on the object. In linear fall-off, the intensity decreases linearly from the center of the object to the edge. Using the custom fall-off algorithm, the ratios between regions are interpolated to obtain a smoother surface. For example, the ratio applied to the vertices between region no. 3 and region no. 9 is interpolated between the values 0.68 and 0.80, as shown in Fig. 6.
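The interpolation between two neighbouring ratios can be sketched as follows. The closed-form curves are commonly used fall-off shapes modelled on Blender-style proportional editing; they are assumptions, not formulas quoted from the paper or from the Blender manual, and the paper's custom algorithm may differ. The function names and the normalized distance t are likewise hypothetical.

```python
import math


def falloff(kind: str, t: float) -> float:
    """Weight in [0, 1] for a normalized distance t (0 = center, 1 = edge)."""
    d = 1.0 - min(max(t, 0.0), 1.0)  # 1 at the center, 0 at the edge
    if kind == "linear":
        return d
    if kind == "smooth":
        return 3 * d**2 - 2 * d**3       # smoothstep-style S-curve
    if kind == "sphere":
        return math.sqrt(2 * d - d**2)   # quarter-circle profile
    if kind == "inverse_square":
        return d * (2 - d)               # equals 1 - t**2
    raise ValueError(f"unknown fall-off type: {kind}")


def blended_ratio(ratio_a: float, ratio_b: float, t: float, kind: str = "smooth") -> float:
    """Interpolate from ratio_a (t = 0) to ratio_b (t = 1) with the chosen fall-off."""
    w = falloff(kind, t)
    return ratio_a * w + ratio_b * (1.0 - w)


if __name__ == "__main__":
    # Transition between region 3 (0.68) and region 9 (0.80), as in the text.
    for kind in ("linear", "smooth", "sphere", "inverse_square"):
        values = [round(blended_ratio(0.68, 0.80, t / 4, kind), 3) for t in range(5)]
        print(f"{kind:15s} {values}")
```

All four curves agree at the two ends of the transition and differ only in how quickly the ratio changes in between, which is what determines how soft or abrupt the boundary between two regions looks.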
Fig. 5. Different fall-off algorithms used in Blender software [19]
Fig. 6. Ratios between vertical regions [7]
2.5 Adding Skin Texture and Other Facial Features

As the last step, other facial features are incorporated into the mesh by projecting a 2D photograph of the patient onto the changed face mesh, as shown in Fig. 7. This technique uses a shader program that calculates the projection of the texture, together with a custom material that applies the projected texture to the 3D object.
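The idea of projecting a photograph onto the mesh can be illustrated with a simplified planar projection. This sketch assumes a frontal photograph that is already aligned with the mesh and ignores perspective, so it is a stand-in for, not a description of, the shader used in the application; the function name is hypothetical.

```python
def planar_uvs(vertices):
    """Map each vertex (x, y, z) to a texture coordinate (u, v) in [0, 1].

    The photograph is assumed to be a frontal view aligned with the mesh, so
    the projection drops z and normalizes x and y by the mesh bounding box.
    """
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    width = (max_x - min_x) or 1.0    # avoid division by zero for flat meshes
    height = (max_y - min_y) or 1.0
    return [((x - min_x) / width, (y - min_y) / height) for x, y, _ in vertices]


if __name__ == "__main__":
    # Illustrative vertices only; a real face mesh has many thousands of them.
    mesh = [(0.0, 0.0, 5.0), (2.0, 0.0, 1.0), (2.0, 4.0, 0.0), (1.0, 2.0, 3.0)]
    for vertex, uv in zip(mesh, planar_uvs(mesh)):
        print(vertex, "->", uv)
```

One consequence of this simplification is that a vertex moved only along the depth axis keeps the same texture coordinate, so forward or backward jaw movement does not shift the photograph on the face.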
Fig. 7. 2D photograph of the patient and the final prediction
3 Experiments and Preliminary Results

We carried out several experiments to find a workable path towards the research outcomes. CBCT scans of the patients are the input for our research; these are large files, and the resulting meshes contain a large number of vertices that require high-performance computers to handle. For that reason, we experimented with several methods and chose to decimate the 3D mesh. However, the decimation value has to be selected carefully, because decimating too aggressively makes the 3D model look less like a human face with recognizable facial features. To achieve this, we experimented with 3D modeling tools such as Blender, 3D Slicer, and MeshLab, and investigated several other tools as well. Finally, we settled on using 3D Slicer for mesh simplification and ThreeJS for the other required tasks. That these tools are free and open source is an added advantage. The decimated skull and face meshes are shown in Fig. 8.
Fig. 8. Decimated face mesh and skull mesh
When marking the soft tissue and hard tissue points on the 3D model, it is difficult to mark the exact point because of the angle from which the model is viewed. After input from several medical experts, it was decided to incorporate an x-ray shader as mentioned in the methodology. Cephalometric landmarks were then marked on a mid-plane that divides the face and skull meshes symmetrically. Marking points on this mid-plane is convenient for the medical professional. Landmarks marked on the skull mesh (hard tissue) are shown in Fig. 1(b), and soft tissue landmarks are shown in Fig. 1(a).
After the points are marked and the results of the cephalometric tracing are obtained, the next step is to divide the face mesh into 10 regions. The face mesh is edited according to the bone-to-soft tissue ratio and the custom fall-off algorithm. Some of the regions considered are shown in Fig. 9. After the regions are identified, the simulation can be applied based on calculations done using Eq. (1). Figure 10 shows the before and after scenarios of the simulation of the soft tissue changes in the upper jaw. After simulating the change, the last step is to incorporate skin texture and other facial features. This stage is still under experimentation with different parameters to get a more realistic prediction. Figure 7 shows the preliminary results obtained. Parameters such as the lighting effects and the texture should be enhanced to get a better prediction. The accuracy and other performance aspects of the proposed approach are yet to be evaluated properly. After the evaluation, it will be possible to define the scope and identify the limitations of the approach. For the evaluation process, data will be collected from 10–15 patients who underwent double jaw surgery. Preoperative CBCT scans, preoperative and postoperative lateral skull x-rays, and photographs of each patient will be collected and compared against the predicted model. A qualitative evaluation by a panel of experts is also yet to be done.
Fig. 9. Regions separated by planes
Fig. 10. Before and After scenarios of change simulation in upper jaw
4 Conclusion

This research paper proposes a novel approach to simulating soft tissue changes after orthognathic surgery. Current simulation models such as Finite Element Analysis (FEA), the Mass Tensor Model (MTM), and the Mass Spring Model (MSM) are computationally expensive and complicated, making them unsuitable for small-scale medical practices and public hospitals. In the proposed approach, the lower area of the face is divided into ten regions, and by considering the bone-to-soft tissue movement ratio of each region and a custom fall-off algorithm, the 3D face mesh is changed to reflect the planned changes. After skin texture is added, the final 3D prediction is achieved. A working browser-based application was implemented using ThreeJS, and the preliminary results of the experiments were promising. The accuracy and other aspects of this approach are yet to be evaluated systematically. The proposed approach is designed to be cost-effective and efficient, and to support communication with patients as well as the planning and evaluation of surgical outcomes, making it applicable in many small-scale medical practices and public hospitals.
References
1. Phillips, C., Blakey, G., Jaskolka, M.: Recovery after orthognathic surgery: short-term health-related quality of life outcomes. J. Oral and Maxillofacial Surgery 66(10), 2110–2115 (2008)
2. Kaipatur, N.R., Flores-Mir, C.: Accuracy of computer programs in predicting orthognathic surgery soft tissue response. J. Oral and Maxillofacial Surgery 67, 751–759 (2009)
3. Phillips, C., Bailey, T., Kiyak, H.A.: Effects of a computerized treatment simulation on patient expectations for orthognathic surgery (2001)
4. Ryan, F.S., Barnard, M., Cunningham, S.J.: What are orthognathic patients’ expectations of treatment outcome: a qualitative study. J. Oral Maxillofac. Surg. 70(11), 2648–2655 (2012)
5. Dvortsin, D.P., Sandham, A., Pruim, G.J., Dijkstra, P.U.: A comparison of the reproducibility of manual tracing and on-screen digitization for cephalometric profile variables. European J. Orthodontics 30, 586–591 (2008)
6. Moragas, J.S.M., Van Cauteren, W., Mommaerts, M.Y.: A systematic review on soft-to-hard tissue ratios in orthognathic surgery part I: maxillary repositioning osteotomy. J. Cranio-Maxillofacial Surgery 42, 1341–1351 (2014)
7. Lo, L.J., Weng, J.L., Ho, C.T., Lin, H.H.: Three-dimensional region-based study on the relationship between soft and hard tissue changes after orthognathic surgery in patients with prognathism. PLoS ONE 13(8) (2018)
8. Donatsky, O., Bjørn-Jørgensen, J., Holmqvist-Larsen, M.: Computerized cephalometric evaluation of orthognathic surgical precision and stability in relation to maxillary superior repositioning combined with mandibular advancement or setback (1997)
9. Suh, H.Y., Lee, H.J., Lee, Y.S., Eo, S.H., Donatelli, R.E., Lee, S.J.: Predicting soft tissue changes after orthognathic surgery: the sparse partial least squares method. Angle Orthodontist 89, 910–916 (2019)
10. Kolokitha, O.E., Topouzelis, N.: Cephalometric methods of prediction in orthognathic surgery 10, 236–245 (2011)
11. Centenero, S.A.-H., Hernández-Alfaro, F.: 3D planning in orthognathic surgery: CAD/CAM surgical splints and prediction of the soft and hard tissues results - our experience in 16 cases. J. Cranio-Maxillofacial Surgery 40, 162–168 (2012)
12. Knoops, P.G.M., et al.: A novel soft tissue prediction methodology for orthognathic surgery based on probabilistic finite element modelling. PLoS ONE 13 (2018)
13. Willinger, K., Guevara-Rojas, G., Cede, J., Schicho, K., Stamm, T., Klug, C.: Comparison of feasibility, time consumption and costs of three virtual planning systems for surgical correction of midfacial deficiency. Maxillofacial Plastic and Reconstructive Surgery 43, 12 (2021)
14. Mollemans, W., Schutyser, F., Nadjmi, N., Maes, F., Suetens, P.: Predicting soft tissue deformations for a maxillofacial surgery planning system: from computational strategies to a complete clinical validation. Med. Image Anal. 11, 282–301 (2007)
15. Kim, H., Jürgens, P., Reyes, M.: Soft-tissue simulation for cranio-maxillofacial surgery: clinical needs and technical aspects (2011)
16. Mundluru, T., Almukhtar, A., Ju, X., Ayoub, A.: The accuracy of three-dimensional prediction of soft tissue changes following the surgical correction of facial asymmetry: an innovative concept. Int. J. Oral Maxillofac. Surg. 46, 1517–1524 (2017)
17. Terzic, A., Combescure, C., Scolozzi, P.: Accuracy of computational soft tissue predictions in orthognathic surgery from three-dimensional photographs 6 months after completion of surgery: a preliminary study of 13 patients. Aesthetic Plast. Surg. 38, 184–191 (2014)
18. Li, G.: Patient radiation dose and protection from cone-beam computed tomography. Imaging Science in Dentistry 43, 63–69 (2013)
19. “Proportional Edit - Blender Manual,” docs.blender.org. https://docs.blender.org/manual/en/2.79/editors/3dview/object/editing/transform/control/proportional_edit.html. Accessed 23 July 2023
EdgeBench: A Workflow-Based Benchmark for Edge Computing
Qirui Yang1(B), Runyu Jin1, Nabil Gandhi1, Xiongzi Ge2, Hoda Aghaei Khouzani2, and Ming Zhao1
1 Arizona State University, Tempe, USA [email protected]
2 NetApp, San Jose, USA
Abstract. Edge computing has been developed to utilize heterogeneous computing resources from different physical locations for privacy, cost, and Quality of Service (QoS) reasons. Edge workloads are data-driven, latency-sensitive, and privacy-critical. As a result, edge systems have been developed to be both heterogeneous and distributed to utilize different computing tiers' resources and features. The unique characteristics of edge workloads and edge systems have motivated EdgeBench, a workflow-based benchmark aiming to provide the ability to explore the full design space of edge applications and edge systems. EdgeBench is both customizable and representative. It allows users to customize the workflow logic of edge workloads, the data storage backends, and the distribution of the individual workflow functions to different computing tiers. To illustrate the usability of EdgeBench, we implement two representative edge workflows, a video analytics workflow and an IoT hub workflow, which represent a large portion of today's edge applications. Both workflows are evaluated using the workflow-level and system-level metrics reported by EdgeBench. We show that EdgeBench can effectively discover performance bottlenecks and point to possible improvements for the edge workloads and the edge systems.
Keywords: Edge computing · Edge system · Cloud computing · Function as a service · Edge benchmark
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 150–170, 2024. https://doi.org/10.1007/978-3-031-54053-0_12
1 Introduction
While the Internet-of-Things (IoT) revolution brings unprecedented opportunities for economic growth, it also presents serious challenges to the existing computational infrastructure. The cloud is projected to fall short by orders of magnitude in its ability to transfer, store, or process such a vast amount of streaming data. Moreover, the cloud-based solution will not be able to provide satisfactory quality of service for many time-sensitive or privacy-sensitive IoT applications. Over the next decade, a vast amount of computation and storage resources will be deployed in the proximity of IoT devices, enabling a new computing paradigm, namely edge computing.
Contrary to the rapid growth of edge computing, there is a lack of suitable benchmarks that capture the unique characteristics of edge applications and edge systems. Edge applications are highly data-driven, latency-sensitive, and privacy-critical. To satisfy the unique requirements of edge applications, edge systems are designed to be both highly heterogeneous and highly distributed, covering different computing and storage tiers including the IoT tier, edge tier, and cloud tier. However, existing benchmarks fail to capture these characteristics. First, existing benchmarks [20,27,32] predefine edge applications, so users cannot create and customize their own applications or the configurations of the edge systems based on the special requirements of the edge applications. Second, many benchmarks [18,19,29,32] focus on only one type of computation and storage resource and do not take the heterogeneous and distributed characteristics of edge systems into consideration.
To better explore the different design choices of edge applications and edge systems, we introduce EdgeBench, a workflow-based benchmark for edge computing. EdgeBench allows users to define edge workflows by providing 1) the workflow logic, 2) the storage backend for data, and 3) the best computing tier for workflow execution. EdgeBench automates the workflow and cross-tier execution based on user inputs. EdgeBench also provides cross-tier monitoring, which collects metrics from applications and edge systems to help system and application developers better understand the edge applications and edge systems. When developing edge applications and edge systems, users can easily try different workflows with different backend storage and computing resources in EdgeBench to discover the optimal design. Consequently, representative edge applications with the optimal design can be easily generated using EdgeBench. As Function-as-a-Service (FaaS) is being widely adopted by the industry in edge computing [7,11], EdgeBench is built on top of OpenFaaS [14], an open-source FaaS-based computing platform.
We implement two edge workflows to study the usability of EdgeBench. We first show that EdgeBench can be used to discover the edge applications' performance bottlenecks by showing the performance metrics of different functions and their storage backends. We then evaluate the edge system's performance bottlenecks by running functions across different computing tiers. In summary, the contributions of this paper are as follows:
1. We design and implement EdgeBench (open-sourced at https://github.com/njjry/EdgeBench), which is highly customizable to cater to different edge scenarios, based on the unique characteristics of edge applications and the underlying systems.
2. We provide two workflows using EdgeBench, which can represent a large portion of modern edge applications, as case studies.
3. We demonstrate through a comprehensive evaluation that EdgeBench can effectively discover bottlenecks in edge workloads and edge systems and provide implications for possible improvement.
The remainder of this paper is organized as follows. Section 2 discusses the characteristics of edge workloads and edge systems. In Sect. 3, we detail the design of EdgeBench and present the two edge workflows developed using EdgeBench. In Sect. 4, we evaluate EdgeBench using the two workflows and discuss the implications of the results. In Sect. 5, we distinguish EdgeBench from existing works. In Sect. 6, we discuss the future challenges and conclude the work.
2 Edge Applications and Edge System Characteristics
With the proliferation of edge devices, even the fast-growing cloud is believed to be inadequate in handling the explosive demand for data collection, transportation, and processing. The emerging edge computing paradigm expands computing infrastructure from the cloud to the computing resources on the edge, including edge devices, edge servers, and edge clusters [30], and conducts computation close to the data source. It is a promising way to resolve the challenges on the cloud side. Edge computing consists of edge applications and the edge systems on which the applications run. Edge applications and edge systems have some unique characteristics. First, edge applications are data-driven: their input data are mostly generated by edge devices. For example, smart home devices collect home metrics through different sensors, and surveillance systems stream videos from cameras. Edge applications are also latency-sensitive. Because edge devices interact closely with the physical world by sensing and actuating, many edge applications have low-latency or even real-time requirements. For example, autonomous driving requires the system's response time to be capped at 100 ms [24] for sensing and processing the surrounding environment. Security alarms are expected to react immediately to detect security issues. Furthermore, edge applications are privacy-critical. Data collected from edge devices may contain private information about the physical world. For example, medical devices capture patients' private vital data. Based on the above characteristics, edge applications prefer to process the input data locally or in nearby edge data centers to 1) save the latency of transferring the data to the cloud and the output data back to the edge; 2) avoid the performance bottleneck caused by processing the huge amount of edge data in the cloud; and 3) reduce the possibility of data leakage during network transportation or when private data is stored in the cloud.
Although the characteristics of edge applications require edge systems to push computation closer to the data source, edge devices are resource-limited in computing and I/O and perform much worse than the cloud. In addition, the cloud is equipped with different accelerators that can optimize computation performance for different workloads. To achieve optimal performance, edge computing can offload computation to peer edge devices, nearby edge data centers, and the distant cloud, which form three tiers of computing resources distributed across the edge and IoT devices (IoT tier), the edge servers (edge tier), and the cloud data centers (cloud tier). Within each tier, the resources may also be distributed across different geographical areas and administrative domains. Edge systems are built upon highly heterogeneous resources. Edge devices, edge servers, and cloud servers are drastically different tiers in the system, whereas within each tier the resources, especially the devices in the edge device tier, are also highly diverse in terms of processors, accelerators, memory, storage, OS/hypervisor, and libraries/runtimes. All these heterogeneous resources are utilized by the edge systems to provide high scalability and good performance for various edge applications. As edge computing continues to evolve, application and system developers and researchers face many design choices, such as deciding in which tier to process the data, which computing resource in the tier to use, and which storage resource to use. EdgeBench is designed to help developers easily explore different design choices by letting users define the workflow, storage backend, and computing tier for the different functions of an application based on the application's requirements. In this way, EdgeBench encourages the best design for both edge applications and edge systems.
Fig. 1. Function template snippet
Fig. 2. EdgeBench workflow
3 EdgeBench
3.1 Workflow
In EdgeBench, the user defines each function using a function template (a YAML file). Several fields in the template allow users to define the workflow logic, data storage backends, and computing tier (see Fig. 1). The basic fields include input, output, and sync under the spec field. Input defines the persistent storage backend used for the input data of this function. Output defines both the data output and the function output. Backend defines where the output data of this function is stored, while next function gives the function name of the next stage and the desired computing tier of the next function. Sync defines whether the function is invoked synchronously or asynchronously. For synchronous invocation, a function returns when the computation is finished. Otherwise, the function returns immediately after the invocation. Figure 1 shows a function template snippet that uses MinIO [12] as the input data storage and AWS S3 [6] as the output data storage. When the function is invoked, EdgeBench fetches the input data from the user-specified input source and feeds it to the function. After the function finishes, the output data is saved to the corresponding output destination. The next function is invoked by its function name on the computing tier specified under next function. Figure 2 illustrates the EdgeBench workflow.
Based on the workflow design, we generalize five basic workflow logics in edge applications: pipeline, cron, one-to-many, many-to-one, and branching (see Fig. 3). These basic logics can be used as the building blocks for complex workflow logics.
Fig. 3. EdgeBench defined workflow logics
The pipeline workflow (see Fig. 3a) forms a pipeline of functions. Each function's output is used as the next function's input. EdgeBench executes each function in order and passes the output of one function to the input of the next. In the pipeline, the last function leaves the next function field empty to identify itself as the last function. An example of a pipeline workflow is a face recognition application, which includes image processing (e.g., resizing), face detection, and face identification.
The cron workflow (see Fig. 3b) sets the whole workflow as a cron job. Users specify the cron field to define the schedule of the workflow. The interval can be set from seconds to hours. The cron workflow can be either an end-to-end workflow involving many functions or a single function. This workflow is widely used in IoT scenarios where AI-based algorithms or data analytics are carried out periodically for model training or monitoring purposes.
The one-to-many workflow (see Fig. 3c) allows many functions to be executed after one function finishes. In this case, users can specify many pairs of function and tier fields under the next function field. All the next functions use the same input data, which is the current function's output data. Each next function, however, can specify its own output storage backend and execution tier. This workflow follows the pattern of concurrent execution, e.g., running AI inference on different neural networks to analyze age, gender, and race in parallel after the same face detection function.
The many-to-one workflow (see Fig. 3d) executes one function after many functions finish. In this case, the outputs generated by the many functions are aggregated and sent to the one function as input data. The many functions set the wait group field to true under the output field, and all of them need to specify the same output field. A long-running service is provisioned to aggregate the input data for the next function. A use case for this workflow is distributed deep learning, where each worker is allocated a part of the model training job and the whole model is aggregated in the next function.
The branching workflow (see Fig. 3e) works in an if-else conditional branching manner. The first function generates outputs based on different conditions, and the next function differs depending on the output. Although the branching workflow looks similar to the one-to-many workflow, only the next function whose condition is satisfied is invoked. EdgeBench matches the prefix of the output data generated by the current function against the prefix field to decide which branch is executed next. This workflow is useful for some video analytics applications, e.g., processing objects according to their categories after applying object detection on video streams.
3.2 Reported Metrics
EdgeBench deploys Prometheus [15] on different computing tiers as monitors to report system-level and function-level resource usage such as CPU utilization, memory usage, and I/O throughput for the workflow and the different computing tiers. To provide function-level performance metrics, EdgeBench also traces the function handler, the load and store operations of data, and inter-function communication. These data provide measurements for each function's runtime latency, data storage latency, and network communication latency, respectively. The data is automatically collected and stored in Prometheus and is parsed and processed by EdgeBench.
Fig. 4. EdgeBench architecture
3.3 Implementation
EdgeBench is implemented based on OpenFaaS [14], an open-sourced FaaS platform that relies on the underlying container orchestration system such as Kubernetes [17] to support event-driven functions and microservices. Since functions are stateless, input and output data need to be stored and accessed on external storage for persistency [4]. EdgeBench provides the flexibility to utilize different data storage backends for the input and output data of each function. Users can use their own storage backends by implementing the data storage interface provided by EdgeBench. The interface includes load and store functions to be used for loading input data and storing output data. EdgeBench implements three popular storage backends: MinIO [12], AWS S3 [6], and NATS [1]. Users can use them directly by specifying the name of the storage backend as minio, s3, and nats in the function template. In OpenFaaS, each function is deployed using an OpenFaaS template and a function image. EdgeBench extends OpenFaaS’s template to support workflow logic, computing tier, and storage backend of a function as in Fig. 1. To use EdgeBench, a user deploys it at every computing tier to orchestrate the workflow. The user accesses EdgeBench through REST APIs and provides each function’s image and template. EdgeBench parses the function templates and deploys functions on the specified computing tiers. When the user invokes the workflow through EdgeBench REST APIs, functions are invoked at the specified computing tiers and store staging data on the specified storage backend following the workflow logic. Figure 4 shows the architecture of EdgeBench.
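The storage interface described above can be pictured with a minimal Python sketch. It is not EdgeBench's actual code: the names StorageBackend, InMemoryBackend, BACKENDS, and run_function are hypothetical, and the in-memory dictionaries only stand in for real MinIO, S3, or NATS clients. The sketch shows a backend exposing load and store, and a stateless function reading its input from one backend and writing its output to another.

```python
from typing import Callable, Dict, Protocol


class StorageBackend(Protocol):
    """Hypothetical shape of a storage backend: one load and one store call."""

    def load(self, key: str) -> bytes: ...

    def store(self, key: str, data: bytes) -> None: ...


class InMemoryBackend:
    """Stand-in backend that keeps objects in a dict (illustration only)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def load(self, key: str) -> bytes:
        return self._objects[key]

    def store(self, key: str, data: bytes) -> None:
        self._objects[key] = data


# Registry keyed by the backend names a function template would mention.
# A real deployment would register MinIO, S3, or NATS clients here instead.
BACKENDS: Dict[str, StorageBackend] = {
    "minio": InMemoryBackend(),
    "s3": InMemoryBackend(),
}


def run_function(
    handler: Callable[[bytes], bytes],
    input_backend: str,
    output_backend: str,
    key: str,
) -> None:
    """Load the input, run the stateless handler, and store the output."""
    data = BACKENDS[input_backend].load(key)
    result = handler(data)
    BACKENDS[output_backend].store(key, result)


# Usage: input lives in the "minio" stand-in, output goes to the "s3" stand-in.
BACKENDS["minio"].store("frame-0", b"raw frame")
run_function(lambda d: d.upper(), "minio", "s3", "frame-0")
print(BACKENDS["s3"].load("frame-0"))  # b'RAW FRAME'
```

Because the handler only sees bytes in and bytes out, swapping the backend in this sketch changes I/O behavior without touching the function body, which is the property that makes per-function storage choices easy to compare.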
3.4 Two Representative Workflows
We implement two representative edge workflows using EdgeBench: an end-to-end video analytics workflow and a multi-component IoT hub workflow.
Fig. 5. Video analytics workflow
Fig. 6. IoT hub workflow
Video Analytics Workflow. The video analytics workflow (see Fig. 5) consists of four functions: video generator, motion detection, face detection, and face recognition. Each function's output is used as the input for the next function. The video stream generator continuously generates live videos and chunks them into groups of pictures (GoP) using FFmpeg [9]. These GoPs are the output of the first stage, and each picture serves as an input for motion detection. Motion detection detects motion within each picture and outputs only the pictures that contain motion. The motion detection function uses OpenCV [13] to do the inter-frame comparison. Face detection then detects the faces in the pictures using the Single Shot MultiBox Detector (SSD) [25] and outputs only the images that contain faces. Face recognition uses a ResNet-34-based [22] pre-trained model to encode each detected face and k-nearest neighbors (k-NN) to classify the faces. The final output is the images containing faces marked with identities. The input and output data of each function are transferred through different storage backends. The functions are chained together using EdgeBench's pipeline workflow logic.
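The chaining and filtering behavior of this pipeline can be sketched in a few lines of Python. This is an illustrative stand-in rather than the authors' implementation: the stage bodies are trivial predicates instead of OpenCV, SSD, or ResNet models, and the STAGES and NEXT tables as well as run_pipeline are hypothetical names. An empty next entry marks the last function, and a stage that returns None drops the frame so later stages are never invoked for it.

```python
from typing import Callable, Dict, Optional

# Each stage maps one frame to an output frame, or None to drop it.
Stage = Callable[[dict], Optional[dict]]


def motion_detection(frame: dict) -> Optional[dict]:
    # Stand-in for inter-frame comparison: keep frames flagged as moving.
    return frame if frame["motion"] else None


def face_detection(frame: dict) -> Optional[dict]:
    # Stand-in for a face detector: keep frames with at least one face.
    return frame if frame["faces"] > 0 else None


def face_recognition(frame: dict) -> Optional[dict]:
    # Stand-in for encoding plus classification: label every face.
    return {**frame, "identities": ["person"] * frame["faces"]}


STAGES: Dict[str, Stage] = {
    "motion_detection": motion_detection,
    "face_detection": face_detection,
    "face_recognition": face_recognition,
}

# Hypothetical chain description: "" marks the last function of the pipeline.
NEXT: Dict[str, str] = {
    "motion_detection": "face_detection",
    "face_detection": "face_recognition",
    "face_recognition": "",
}


def run_pipeline(first: str, frame: dict, calls: Dict[str, int]) -> Optional[dict]:
    """Follow the chain from `first`, counting invocations per stage."""
    name, item = first, frame
    while name:
        calls[name] = calls.get(name, 0) + 1
        item = STAGES[name](item)
        if item is None:  # filtered out: later stages are not triggered
            return None
        name = NEXT[name]
    return item


frames = [
    {"motion": False, "faces": 0},
    {"motion": True, "faces": 0},
    {"motion": True, "faces": 2},
]
calls: Dict[str, int] = {}
results = [run_pipeline("motion_detection", f, calls) for f in frames]
print(calls)
# {'motion_detection': 3, 'face_detection': 2, 'face_recognition': 1}
```

The invocation counts shrink along the chain, which is why a later, more expensive stage can end up consuming fewer resources overall than an earlier, cheaper one.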
Table 1. Specifications of computing tiers

                   Cloud Tier            Edge Tier         IoT Tier
CPU                Xeon Platinum 8175M   Xeon E5 2630 V3   Broadcom BCM2711
RAM                128 GB                64 GB             4 GB
Storage            800 GB EBS volume     400 GB NVMe SSD   64 GB SD Card
Operating System   Amazon Linux 2        Ubuntu 18.04      Raspbian Stretch Lite 2020
Number of Nodes    6–10                  9                 10
IoT Hub Workflow. The IoT hub workflow (see Fig. 6) implements a scenario where hundreds of thousands of IoT devices and sensors frequently report diagnostic data to an IoT hub, which generates real-time analysis. The data includes the location (latitude, longitude, elevation) of the IoT device, the temperature and moisture of the environment, the power consumption of each device, and its health status. The workflow includes four functions: sensor data generator, LSTM training, LSTM prediction, and time series query. First, the IoT devices' diagnostic data is generated continuously at a certain frequency and then published to a time series database using MQTT publishers. Second, LSTM training trains a Long Short-Term Memory (LSTM) model [23], which can predict data based on existing data sequences. LSTM training reads the latest data points from the database, trains an LSTM model, and stores the model as output in the persistent storage. Third, LSTM prediction conducts data prediction based on the trained LSTM model and the history of sensor data. It takes two inputs: sensor data from the database and the latest trained LSTM model from the persistent storage. The output is printed to standard output. Fourth, the query function generates a random time series query for the sensor data in the database. It reads the latest data from the database as input and prints the query result as output. Except for the sensor data generator, which is a long-running job, the other three functions use cron workflows. Each stage executes separately without being chained.
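To give a feel for how three independent cron workflows interleave, the following Python sketch counts invocations over a time window with a simulated clock. It is an illustration only, not EdgeBench's scheduler: simulate_cron is a hypothetical helper, the periods are the ones stated in the evaluation setup (training every 30 min, prediction every 5 s, query every 3 s), and the first firing of each function is assumed to happen one period after the start.

```python
import heapq
from typing import Dict, List, Tuple


def simulate_cron(periods_s: Dict[str, int], horizon_s: int) -> Dict[str, int]:
    """Count how often each periodic function fires within horizon_s seconds.

    Assumption: every function first fires one period after the start.
    """
    counts = {name: 0 for name in periods_s}
    # Min-heap of (next firing time in seconds, function name).
    queue: List[Tuple[int, str]] = [(p, name) for name, p in periods_s.items()]
    heapq.heapify(queue)
    while queue and queue[0][0] <= horizon_s:
        t, name = heapq.heappop(queue)
        counts[name] += 1
        heapq.heappush(queue, (t + periods_s[name], name))
    return counts


periods = {"lstm_training": 1800, "lstm_prediction": 5, "query": 3}
print(simulate_cron(periods, horizon_s=3600))
# {'lstm_training': 2, 'lstm_prediction': 720, 'query': 1200}
```

Over one hour the short-period functions dominate the invocation count, while the rare training job is the one expected to dominate CPU time, so the two kinds of load stress the system differently.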
4 Evaluation
4.1 Evaluation Setup
We evaluated EdgeBench based on the two workflows on OpenFaaS. The environment setup involves three tiers: the IoT tier, the edge tier, and the cloud tier. Each tier's specifications are shown in Table 1. The IoT tier is a cluster of ten Raspberry Pi 4 devices running K3S [10] (version v1.19.2) with OpenFaaS (gateway version 0.18.18, faas-netes version 0.12.3, same for the OpenFaaS deployment on all tiers). The edge tier is the nearby ASU data center. The cluster consists of nine identical edge servers and uses Kubernetes (server version 1.18.9) with OpenFaaS. The cloud tier uses AWS Elastic Kubernetes Service (EKS) [5]. The backend uses m5.8xlarge AWS EC2 [8] instances as worker nodes. The cloud tier scales the number of worker nodes from six to ten and uses Kubernetes with OpenFaaS.
In the IoT tier and the edge tier, one node serves as the master node that manages the services of OpenFaaS and the MQTT message broker, and one node serves as the data node that hosts the persistent storage and database. InfluxDB (version 1.8.3) and MinIO (version 2020-10-03T02-19-42Z) are deployed on Kubernetes as storage backends on the data node. All the other nodes are compute nodes.
For the video analytics workflow, a 1920 × 1080, 22 Mb/s bitrate video file is used as the video source for reproducible results. For the IoT hub workflow, we use VerneMQ [3] (version 1.10.4.1) as the MQTT message broker to deliver sensor data. We simulate IoT devices by maintaining publisher network connections to the broker using the Paho MQTT client library [2]. Each IoT device is set to send sensor data every second. LSTM training is set to execute every half an hour. The training process is based on the data generated within the last 30 min. The LSTM prediction function is set to be invoked every five seconds. It conducts prediction based on the data generated within the last 30 s. The query function is set to be invoked every three seconds; each time, it randomly executes one query from a pool of 12 queries implemented by the TSBS benchmark [16].
During the experiments, we set OpenFaaS to use its autoscaling feature to scale the function instances from 25 to 100 when the load increases by 25%, and it automatically scales down the number of function instances when the load drops. We ran each experiment five times.
Fig. 7. Video analytics workflow CPU usage of each edge tier compute node
Fig. 8. Video analytics workflow resource usage
4.2 Edge Tier Evaluation
The first set of experiments runs only on the edge tier to evaluate both the edge system and the edge workloads. We vary the number of concurrent requests issued to each workflow to observe the behaviors of edge applications and edge systems under different levels of workload stress.
Fig. 9. CPU usage of each video analytics workflow stage
Fig. 10. Computation latency of each video analytics workflow stage
Video Analytics Workflow. We ran the video analytics workflow with three workload settings: 10, 30, and 50 concurrent video streams. For each setting, the workflow runs for 30 min. Figure 7 shows the average CPU usage of each compute node under different workload settings. We observe that in edge systems, the workload is not evenly distributed. With 30 concurrent video streams, node 5 has the highest CPU utilization which is 99% while node 1 has merely 45% of CPU utilization. Similar behaviors happen in memory and network utilization. For example, with 50 concurrent video streams, node 5 has used 21% of memory capacity while node 2 only used 14% of memory capacity. This uneven distribution of load can easily cause some compute nodes to become performance bottlenecks while some compute nodes are severely underutilized.
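One simple way to quantify such imbalance is to compare the busiest node with the cluster average. The Python sketch below is an illustrative calculation, not part of EdgeBench: imbalance is a hypothetical helper, and the input contains only the two per-node CPU values quoted above for 30 concurrent streams, since the values for the remaining nodes appear only in Fig. 7.

```python
from statistics import mean, pstdev
from typing import Dict


def imbalance(cpu_util: Dict[str, float]) -> Dict[str, float]:
    """Summarize how unevenly CPU load is spread across compute nodes."""
    values = list(cpu_util.values())
    avg = mean(values)
    return {
        "spread": max(values) - min(values),   # gap between busiest and idlest
        "max_over_mean": max(values) / avg,    # 1.0 means perfectly even
        "cv": pstdev(values) / avg,            # coefficient of variation
    }


# Only the two nodes whose utilization is quoted in the text (in percent).
reported = {"node1": 45, "node5": 99}
print(imbalance(reported))
```

A max-over-mean ratio close to 1 would indicate an even spread. Tracking such a ratio across workload settings makes it easy to see whether added load worsens the skew.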
Fig. 11. Latency breakdown of each video analytics workflow stage
Fig. 12. Latency comparison of using different storage backends for video analytics workflow
Figure 8 shows the average CPU utilization and I/O throughput of the video analytics workflow for different numbers of concurrent video streams. From the figures, we can observe that CPU usage reaches 99.7% when 50 concurrent video streams are processed. I/O throughput, however, grows linearly, and the bandwidth of the NVMe SSD is not fully utilized with 50 concurrent video streams. This observation shows that the video analytics workflow is CPU-bound rather than I/O-bound, and that accelerators such as GPUs and FPGAs could be added to increase the throughput of the workflow.
We further investigated this workflow by evaluating the results of the different functions: motion detection, face detection, and face recognition. Figure 9 illustrates the CPU utilization of each function under different workload settings. The number of CPU cores used by each function grows with the number of concurrent video streams issued. The growth rate drops when the number of concurrent streams increases from 30 to 50, compared with the increase from 10 to 30, since the CPU starts to become the bottleneck with 50 concurrent video streams. Although face recognition is the most CPU-intensive function (it involves two learning algorithms: deep learning inference and k-NN classification), it does not use the largest number of CPU cores in the workflow. This is because the previous two functions have filtered out images without faces, so face recognition is not triggered as frequently as the other two functions. In our video streams, approximately 18% of the video frames trigger the face recognition function.
Figure 10 shows the 95th percentile latency for each function to finish one execution. Face recognition is the most computation-intensive function and takes the longest time to finish one execution. Motion detection and face detection take roughly the same amount of time. Motion detection runs a bit longer than face detection since it involves expensive video decoding. With increased workloads, the latency increases for all three functions due to interference among the functions, and it increases more when the load grows from 30 to 50 concurrent video streams than from 10 to 30 concurrent video streams.
Figure 11 shows the latency breakdown of processing one frame of the videos for each function. We only show the results for 10 concurrent video streams to save space; the results for 30 and 50 concurrent streams have similar implications. The latency is broken down into I/O latency and computation latency. I/O latency includes the data load and data store latency of the MinIO storage backend. Computation latency is the latency of the computation part of each function. Motion detection spends most of its time doing I/O, and computation is not the bottleneck. For the last two stages, face detection and face recognition, however, computation accounts for most of the latency. This shows that to optimize the workflow's runtime latency effectively, users should optimize each function differently depending on whether it is I/O-intensive or computation-intensive.
We also tried different storage backends for transferring intermediate data and compared the total computation and communication latency across them. Figure 12 shows the latency results for 10 concurrent video streams with the storage backends NATS and MinIO. NATS is the default message broker of OpenFaaS for asynchronous function invocation. OpenFaaS serializes HTTP requests to NATS and asynchronously deserializes the HTTP requests to invoke functions. By using NATS as the storage backend, we embed the intermediate data in the HTTP requests. The total communication latency of using NATS is higher than that of MinIO by 26%. OpenFaaS deploys NATS to store data in memory for 24 h. With the large intermediate data and the long storage time, the data uses up the memory, and it takes a long time to reclaim the memory. These two factors contribute to the high communication latency. The computation latency, however, shows no noticeable difference between the storage backends.
Fig. 13. IoT hub workflow resource usage of each edge tier compute node
Fig. 14. Latency with and without LSTM training
Fig. 15. Resource usage of MQTT
Fig. 16. Resource usage of InfluxDB
IoT Hub Workflow. Contrary to the video analytics workflow, each stage in the IoT hub workflow runs separately. In the experiment, we stress each stage with different workload settings. We first show that EdgeBench can be used as a macro benchmark by running the four functions of the workflow altogether and then show EdgeBench can also be used as a micro benchmark by running each function separately. Since LSTM training is the most CPU-intensive function of the four, we compare the resource usage and performance of the workflow with and without LSTM training to see the impact of LSTM training on other functions. To run the four functions together, we first start the other three functions without LSTM training. LSTM training is triggered 20 min after the start of the other three functions. LSTM training takes more than an hour to finish a 10-epoch training process, we stop the four functions 40 min after the initial start. In the experiment, we set the sensor data generator to receive data from 15K sensors. The sensor data generator runs on a separate compute node and maintains 15K MQTT subscribers to insert received messages to influxDB. A separate node outside the edge tier establishes 15K MQTT publishers and sends data to the corresponding topics periodically. LSTM training issues 20 concurrent training
Fig. 17. Latency changes of IoT hub workflow with different stress of loads
Fig. 18. Latency of video analytics workflow across tiers
requests, LSTM prediction issues 80 concurrent prediction requests, and query issues 20 concurrent queries at a time. Figure 13 shows the CPU and memory usage of each compute node with and without LSTM training. As we observed in the video analytics workflow, the load distribution is imbalanced. When LSTM training is not running, the highest CPU usage is 51% on node 4, while node 5 uses only 29% of the CPU. We also observe that both CPU usage and memory usage increase substantially after LSTM training starts: on average, CPU usage increases by 54% and memory usage by 69%. Besides resource usage, we also observe how the 95th percentile latency of the other functions changes with and without LSTM training, as shown in Fig. 14. The latency of both LSTM prediction and query increases. LSTM prediction is affected more severely, with its latency increasing by 67%.
We then show the results of running each function separately. In this experiment, each function is stressed with a different number of concurrent requests to show the function’s resource usage and performance. We vary the number of sensors in the sensor data generator from 5K to 20K. For LSTM training, we issue from 10 to 25 concurrent requests in steps of 5. For LSTM prediction, the workload varies from 60 to 100 concurrent requests in steps of 20. For query, the workload varies from 10 to 30 concurrent requests in steps of 10. Figure 15 and Fig. 16 show the experiment results for the sensor data generator. Since the sensor data generator mainly uses the MQTT broker for data delivery and InfluxDB for data storage, we show the resource usage of these two services. Figure 15a shows the number of CPU cores used by the MQTT broker
Fig. 19. Communication latency of video analytics workflow under IoT and edge setting
Fig. 20. Latency of IoT hub workflow across tiers
under different workloads. When the workload changes from 15K sensors to 20K sensors, the CPU usage does not increase linearly, which indicates that the CPU has become the bottleneck with 20K sensors. In Fig. 15b and Fig. 15c, the network I/O throughput increases linearly with the workload, which shows that the network is not under stress with this many concurrent sensors. Figure 16 shows the InfluxDB CPU usage and storage I/O throughput. Both increase linearly as the workload increases, so InfluxDB is not overloaded. Figure 17 shows the 95th percentile latency of LSTM prediction and query under different workload settings. Similar to what we observed in the video analytics workflow, the latency increases more as the workload becomes more intensive, due to interference among the function instances.
4.3 Cross Tier Evaluation
The second set of experiments runs across the IoT tier, the edge tier, and the cloud tier to evaluate which distribution setting works best for each workflow. We show the 95th percentile latency of each function for both workflows and the end-to-end latency of the video analytics workflow. For each workflow, we have three distribution settings: use only the IoT tier and the edge tier, use only the IoT tier and the cloud tier, and use all three tiers. Because the IoT tier is used to generate input data, it is always involved.
Video Analytics Workflow. The cross-tier distribution of the video analytics workflow under each setting is shown in Table 2. The first two stages are placed on the IoT tier to save network latency. The computation-intensive face detection and face recognition stages are placed on the edge tier or the cloud tier. For the IoT and edge tiers, we have MinIO deployed on the nodes’ local storage.
For the cloud tier, we use AWS S3 as storage. For each setting, we issue 10 concurrent video streams for 30 min, considering the IoT devices’ capacity.
Table 2. Cross tier distribution of video analytics workflow
Settings       | Video Generator | Motion Detection | Face Detection | Face Recognition
IoT and edge   | IoT             | IoT              | Edge           | Edge
IoT and cloud  | IoT             | IoT              | Cloud          | Cloud
Three tiers    | IoT             | IoT              | Edge           | Cloud
Figure 18 shows the latency results. We present the latency of each function and the end-to-end latency of each request (the video generator result is not reported since it is a long-running job). Comparing the IoT and edge setting with the IoT and cloud setting, face detection and face recognition execute faster on the cloud tier by 11% and 14%, respectively. However, due to the high network latency of transferring data across tiers, the overall end-to-end latency of IoT and cloud exceeds that of IoT and edge by 6%. The three-tier setting takes advantage of both the computation power of the cloud tier and the low network latency of the edge tier, and it has the lowest end-to-end latency: it outperforms IoT and edge by 3% and IoT and cloud by 8.5%. The video analytics workflow should therefore be deployed across the three tiers to benefit from the different resources.
Figure 19 presents the communication latency of transferring data across the tiers under the IoT and edge setting. IoT represents the communication latency within the IoT tier, i.e., between the video generator and motion detection. IoT-Edge represents the communication latency across the IoT tier and the edge tier, i.e., between motion detection and face detection. Edge represents the communication latency within the edge tier, i.e., between face detection and face recognition. An interesting finding is that the communication latency within one tier is larger than that across tiers. This may be because the internal gateway of OpenFaaS within the same network is overloaded and thus incurs high latency. The OpenFaaS gateway handling external network traffic is not as busy as the internal gateway, so requests from the external network can be handled faster. This reveals a communication bottleneck of the underlying edge system and shows that, under certain circumstances, the communication latency can be lower across tiers.
IoT Hub Workflow. The IoT hub workflow has four separate functions, so we show the latency of running each function on the IoT tier, the edge tier, and the cloud tier. The sensor data generator is a long-running job, so its latency is not reported. During the experiments, we run the sensor data generator continuously to generate updated data for the database. LSTM training usually takes hours to run. To show the latency, we reduce the number of epochs of the training process
to one to reduce the running time. Since LSTM training and LSTM prediction are computation intensive and not suitable to run on IoT devices, we only show the edge tier and cloud tier latency for these two jobs. To run functions on the IoT tier and the edge tier, we deploy InfluxDB on the edge tier to store sensor data. For the cloud tier, we use the AWS Timestream database to store sensor data. The machine learning model used by LSTM training and LSTM prediction is stored in MinIO deployed on the edge tier. For the cloud tier, we use AWS S3 for model storage. Every time the cron jobs are triggered, we issue 10 concurrent requests for LSTM training, 60 concurrent requests for LSTM prediction, and 10 concurrent requests for query. Each experiment runs for 30 min.
Figure 20 shows the latency of LSTM training, LSTM prediction, and query when running on different tiers. For query, the IoT tier takes the longest time to finish due to the resource limitations of IoT devices; compared to the cloud tier, it has a 6X overhead. For all three functions, the edge tier takes more time to finish each request than the cloud tier by 39% for LSTM training, 27% for LSTM prediction, and 58% for query. LSTM training, as the most computation-intensive function, has the largest overhead of 64% when running on the edge tier compared to the cloud tier. For the IoT hub workflow, users can place the most computation-intensive function on the cloud tier to improve the function’s performance and reduce its interference with other jobs.
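The two quantities used throughout this evaluation, the 95th percentile latency and the relative overhead of one tier over another, can be computed as in the sketch below. It assumes the nearest-rank definition of a percentile, which the paper does not specify; the helper names `percentile` and `relative_overhead` and the sample latencies are hypothetical and are not taken from the reported measurements.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Smallest rank covering at least q percent of the samples (integer ceiling).
    rank = (q * len(ordered) + 99) // 100
    return ordered[rank - 1]


def relative_overhead(slow_ms, fast_ms):
    """Percentage by which slow_ms exceeds fast_ms."""
    return (slow_ms - fast_ms) / fast_ms * 100.0


# Made-up request latencies in milliseconds, for illustration only.
cloud_ms = [float(x) for x in range(1, 101)]
edge_ms = [1.5 * x for x in cloud_ms]

p95_cloud = percentile(cloud_ms, 95)
p95_edge = percentile(edge_ms, 95)
overhead = relative_overhead(p95_edge, p95_cloud)

print(f"p95 cloud = {p95_cloud} ms, p95 edge = {p95_edge} ms")
print(f"edge overhead over cloud = {overhead:.1f}%")
```

Other percentile definitions (for example, linear interpolation between ranks) give slightly different values on small samples, so the definition should be fixed before comparing tiers.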
5 Related Works
Edge workloads have attracted considerable attention, and multiple edge benchmarks have been developed for generating edge workloads and evaluating edge system performance. Edge AIBench [21] and AIoTBench [26] focus on the machine learning and AI algorithms used in edge computing. pCAMP [33] studies the performance and system usage of different machine learning libraries on IoT devices. Edgedroid [28] evaluates augmented reality (AR) and wearable cognitive assistance (WCA) applications by replaying traces collected from these applications. Edge Bench [19] compares the performance of AWS IoT Greengrass [7] and Azure IoT Edge [11] by running three edge applications on IoT devices. EdgeFaaSBench [29] is a FaaS-based benchmark suite for edge devices on a single tier. ComB [18] proposes a video analytics pipeline based on the microservice architecture to benchmark the underlying hardware’s computing and I/O capabilities. These benchmarks have certain limitations. First, benchmarks [21, 26, 28, 33] only consider one category of edge applications, and their workloads are not generic or representative enough of different edge workloads. Second, benchmarks [18, 19, 21, 29] do not reflect the heterogeneous and distributed nature of edge systems, and their evaluation focuses on only one tier. DeathStarBench [20] is a benchmark suite that includes a drone swarm IoT application to study the implications microservices have on IoT devices and cloud backends. Computation-intensive tasks such as image recognition are evaluated
on the drones and remote cloud servers. DeFog [27] contains six edge applications that involve machine learning algorithms, IoT gateway and mobile games. Some DeFog applications run across IoT devices and cloud servers to examine the communication overhead of edge systems. CAVBench [32] provides six typical applications targeting connected and autonomous vehicle scenarios. Compared to DeathStarBench, DeFog, and CAVBench, which do not support easy customization of the edge workloads and edge systems, EdgeBench gives users the flexibility to define their own edge workloads and tune the setup of edge systems. This is particularly important for edge computing since edge applications evolve rapidly and frequently. Benchmarks designed with a narrow focus on specific workloads typically cannot represent real and comprehensive edge workloads [31]. Existing benchmarks do not provide the flexibility that edge computing requires.
6 Conclusion
We have presented EdgeBench, a workflow-based benchmark for edge computing. EdgeBench is customizable. It provides user-defined workflow orchestration. Users can choose different data storage backends for different stages of the workflow, and each individual stage can be distributed to a different computing tier. EdgeBench is representative. The benchmark implements two representative edge workflows, the video analytics workflow and the IoT hub workflow. EdgeBench can be used as both a micro and a macro benchmark, with metrics reported at both the system level and the workflow level. We encourage users to use EdgeBench to study both edge systems and edge workloads.
References
1. NATS - Open Source Messaging System (2011). https://nats.io/
2. Eclipse - Paho (2014). https://www.eclipse.org/paho/
3. VerneMQ - high-performance, distributed MQTT broker (2014). https://vernemq.com/
4. Choosing between AWS Lambda data storage options in web apps (2020). https://aws.amazon.com/blogs/compute/choosing-between-aws-lambda-data-storage-options-in-web-apps/
5. Amazon - Amazon Elastic Kubernetes Service (2021). https://aws.amazon.com/eks/
6. Amazon - Amazon S3 (2021). https://aws.amazon.com/s3/
7. Amazon - AWS IoT Greengrass (2021). https://aws.amazon.com/greengrass/
8. Amazon EC2 - Secure and resizable compute capacity to support virtually any workload (2021). https://aws.amazon.com/ec2/
9. FFmpeg - FFmpeg (2021). https://ffmpeg.org/
10. Lightweight Kubernetes - The certified Kubernetes distribution built for IoT & Edge computing (2021). https://k3s.io/
11. Microsoft - Azure IoT Edge (2021). https://azure.microsoft.com/en-us/services/iot-edge/
12. MinIO - Kubernetes Native, High Performance Object Storage (2021). https://min.io/
13. OpenCV - OpenCV (2021). https://opencv.org/
14. OpenFaaS - Serverless Functions, Made Simple (2021). https://www.openfaas.com/
15. Prometheus - From metrics to insight. Power your metrics and alerting with a leading open-source monitoring solution (2021). https://prometheus.io/
16. TSBS - Time Series Benchmark Suite (TSBS) (2021). https://github.com/timescale/tsbs
17. Kubernetes (2023). https://kubernetes.io/
18. Bäurle, S., Mohan, N.: ComB: a flexible, application-oriented benchmark for edge computing. In: Proceedings of the 5th International Workshop on Edge Systems, Analytics and Networking, EdgeSys ’22, pp. 19–24. Association for Computing Machinery, New York (2022)
19. Das, A., Patterson, S., Wittie, M.: EdgeBench: benchmarking edge computing platforms. In: 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion), pp. 175–180. IEEE (2018)
20. Gan, Y., et al.: An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 3–18 (2019)
21. Hao, T., et al.: Edge AIBench: towards comprehensive end-to-end edge computing benchmarking. In: International Symposium on Benchmarking, Measuring and Optimization, pp. 23–30. Springer (2018)
22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
23. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
24. Lin, S.-C., et al.: The architectural implications of autonomous driving: constraints and acceleration. In: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 751–766 (2018)
25. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: single shot MultiBox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
26. Luo, C., Zhang, F., Huang, C., Xiong, X., Chen, J., Wang, L., Gao, W., Ye, H., Wu, T., Zhou, R., Zhan, J.: AIoT Bench: towards comprehensive benchmarking mobile and embedded device intelligence. In: Zheng, C., Zhan, J. (eds.) Bench 2018. LNCS, vol. 11459, pp. 31–35. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32813-9_4
27. McChesney, J., Wang, N., Tanwer, A., de Lara, E., Varghese, B.: DeFog: fog computing benchmarks. In: Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, pp. 47–58 (2019)
28. Olguín Muñoz, M., Wang, J., Satyanarayanan, M., Gross, J.: Edgedroid: an experimental approach to benchmarking human-in-the-loop applications. In: Proceedings of the 20th International Workshop on Mobile Computing Systems and Applications, pp. 93–98 (2019)
29. Rajput, K.R., Kulkarni, C.D., Cho, B., Wang, W., Kim, I.K.: EdgeFaaSBench: benchmarking edge devices using serverless computing. In: 2022 IEEE International Conference on Edge Computing and Communications (EDGE), pp. 93–103 (2022)
30. Shi, W., Cao, J., Zhang, Q., Li, Y., Xu, L.: Edge computing: vision and challenges. IEEE Internet Things J. 3(5), 637–646 (2016)
31. Varghese, B., et al.: A survey on edge benchmarking. ACM Computing Surveys (2020)
32. Wang, Y., Liu, S., Wu, X., Shi, W.: CAVBench: a benchmark suite for connected and autonomous vehicles. In: 2018 IEEE/ACM Symposium on Edge Computing (SEC), pp. 30–42. IEEE (2018)
33. Zhang, X., Wang, Y., Shi, W.: pCAMP: performance comparison of machine learning packages on the edges. In: USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18) (2018)
An Algebraic-Geometric Approach to NP Problems II
F. W. Roush
Alabama State University, Montgomery, AL 36101-0271, USA
[email protected]
Abstract. We explore in more detail how an actual probabilistic or quantum algorithm for NP-problems might be developed from the ideas in the previous paper. There we proved that there is an NP-complete problem which is equivalent to a system of a bounded number of Diophantine equations of at most polynomial degree over a function field K(t), where K can be a global field or its algebraic closure. This is equivalent to finding a section of a map from an algebraic variety to projective space of dimension 1. We suggested how this might be done for a curve. Here we propose that it could be done for a higher-dimensional algebraic variety by determining whether a generalized Brauer-Manin obstruction is trivial.
Keywords: NP-complete problem · Diophantine equation over function field · Brauer-Manin obstruction · Computational complexity classes
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): FICC 2024, LNNS 921, pp. 171–181, 2024. https://doi.org/10.1007/978-3-031-54053-0_13
1 Introduction
Informally, for a sequence of problems of increasing size, let n be the number of binary inputs required to specify them. Polynomial time means an algorithm that can be completed in a number of steps which is at most $Cn^k$ for some positive integers C, k, for each n. Exponential time means the number of steps is at most $e^{Cn^k}$. Algorithms can be given by Turing machines, or any computer languages with unlimited memory, or equivalent mathematical steps. P is the class of problems that can be solved by polynomial time algorithms. NP is the class of problems such that there are certificates (informally, solutions) which, if given, can be verified in polynomial time. NP-complete problems are problems such that every NP problem can be reduced to special cases of them. Examples of NP problems are SAT, satisfying a Boolean expression of the form $\bigwedge_{i=1}^{n} \left( \bigvee_{j=1}^{n} y_{ij} \right)$ where the $y_{ij}$ are any choices from a set of Boolean variables $x_1, \ldots, x_n$ and their complements; 3-coloring a graph; and the problem we will use as a starting point, the knapsack problem. One form of it is to be given a set S of n numbers having at most n digits and another number N, and to ask if N is the sum of a subset of S. Another form is, given a family of n subsets of $\{1, 2, \ldots, N\}$ and another subset T, to ask if T is a disjoint union of some of the subsets in the family.
Our goal in this paper is to propose a possible approach to quickly solving NP problems using techniques from algebraic geometry for a probabilistic algorithm. This was partially done in our previous paper, which involves quantum methods at one point. Here we will propose something considerably more explicit. The additional technique is known, but its application to NP problems is not.
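The distinction between verifying a certificate and finding one can be made concrete for the set form of the knapsack problem stated above. The following is a minimal sketch, not part of the paper: the function names `verify_certificate` and `brute_force` and the small instance are hypothetical. Verification takes time polynomial in the input size, while the naive search tries up to $2^n$ candidate certificates.

```python
from itertools import combinations


def verify_certificate(family, target, chosen):
    """Check that the chosen subsets are pairwise disjoint and that their union is target."""
    covered = set()
    for index in chosen:
        subset = family[index]
        if covered & subset:      # overlap means the union is not disjoint
            return False
        covered |= subset
    return covered == target


def brute_force(family, target):
    """Try every sub-family; exponential in len(family)."""
    for size in range(len(family) + 1):
        for chosen in combinations(range(len(family)), size):
            if verify_certificate(family, target, chosen):
                return chosen
    return None


# Hypothetical instance: four subsets of {1, 2, 3, 4} and the target T = {1, 2, 3, 4}.
family = [{2, 3}, {1, 3}, {1, 2}, {1, 4}]
target = {1, 2, 3, 4}

certificate = brute_force(family, target)
print("certificate:", certificate)
```

Here the certificate is the tuple of indices of the chosen subsets; anyone handed that tuple can confirm it with a single pass over the chosen sets.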
2 Previous Paper and Other Literature
In a previous paper [16] we argued first that NP-complete problems of the following type can be found by polynomial time algorithms: first, problems of the nature $x \mid N$, $x \equiv c \pmod{M}$. Here x, N, c, M can either be numbers with at most polynomially many digits, in terms of the number of binary inputs to the NP problem, or polynomials of polynomial degree whose coefficients satisfy similar bounds, in either characteristic 0 or characteristic p, and we can make the coefficient fields algebraically closed. The quantity N is given in factored form, x chooses some of these factors, and the second equation enables us to state a version of the knapsack problem. The factors modulo M are products of more basic primes which are like vector entries in a knapsack problem. We will restate our theorems in terms appropriate to this paper.
Theorem 1. The following class of problems is NP-complete: $\exists x,\ x \mid N,\ x \equiv c \pmod{M}$. Here x, N, c, M are polynomials in a ring $F[t]$ of at most polynomial degree, with coefficients of at most exponential size, specified by a polynomial number of multiplications, additions, and parentheses. The field F may be chosen as any of the following: (i) a cyclotomic extension of the rational numbers generated by a root of unity $\eta$ of at most exponential order, (ii) its algebraic closure, (iii) $\mathbb{F}_p[\eta]$, (iv) its algebraic closure, (v) an extension of the p-adic field generated by $\eta$, or (vi) its algebraic closure. In the p-adic case, we can also work with their rings of integers. We may require the polynomial N, and therefore x, to be a monic product of factors $t - \eta^k$, c to be a power of $\eta$, and M = t.
The proof of this theorem was indicated in our former paper by Table 1, which codes the NP-complete knapsack problem stated in terms of disjoint unions of sets. That is, x represents a subset S of the set represented by N, in terms of which factors occur in x. Its congruence class modulo M is a product that indicates exactly which subset the union of S is.
Table 1. Outline of proof of Theorem 1
knapsack problem | divisibility problem
$\exists X,\ T_0 = \bigcup_{i \in X} S_i$ | $\exists x,\ x \mid N_0 \wedge x \equiv c \pmod{M}$
elements of $S_i$, $T_0$ | small primes $p_j$
subsets $S_i$ | large primes $P_j$
membership in $S_i$ | congruence class mod large M
set $T_0$ | congruence class of c mod M
disjoint union of sets | product of $P_i$
sets $S_i$, $i \in X$ | divisors of $N_0 = \prod P_i$
Our second theorem makes this into a Diophantine problem over the rational function field $F(t)$. Here we will require in characteristic 0 that F is closed under taking pth roots, for p as in the theorem.
Theorem 2. There is an NP-complete problem which can be stated as $xy = N$, $x - c = Mz$, where x is a monic rational function in $F(t)$ of the form
$$\sum_{i=0}^{k} t^i x_i^q,$$
y, z are polynomials of the same form, N is a given polynomial of this form with degree 2k, M = t, and c is a power of a root of unity $\eta$ of exponential order. Here $2k < q = p^{n_0}$, p is a prime, and $q = p^{n_0}$ is a number of at most polynomial size. If there is a solution then there is a constant solution involving numbers of at most exponential size.
This theorem is first proved in the modulo p case from Theorem 1, where $\sum a_i t^i$ is the polynomial x in Theorem 1 and similarly for N. This means that if there is a solution there, then there is a solution here of the required type. The idea of this proof is that polynomials can be modeled in terms of addition and a single multiplication by rational functions of this type, because the denominators have terms only in degrees that are multiples of $p^k$. When we clear denominators and look at the lowest terms remaining, we have polynomials of degree k which are well-determined. If there is a solution in terms of rational functions then there is a polynomial solution. In characteristic zero we use the case of an extension of the p-adic integers in Theorem 1. If there was a solution before, then there is a constant solution here. If there is a solution here, then we can clear the p-part of denominators, reduce modulo a prime extending p, and get a solution modulo p. This is illustrated by Table 2.
Table 2. Outline of proof of Theorem 2
polynomial ring | function field
$x \mid N_0 \wedge x \equiv c \pmod{M}$ | $\exists y, z,\ N_0 = xy \wedge x - c = zM$
given polynomial | rational function in form $\sum_{i=1}^{q} c_i^q t^i$
unknown polynomial | rational function, denominator cleared, lowest terms
factorization of polynomials | factorization retracting to polynomials
In turn, the problem of finding a solution to the Diophantine system is known to be equivalent to finding a section of a map from an algebraic variety representing solutions of the Diophantine system to $\mathbb{CP}^1$ representing the variable t.
Example 1. Consider this knapsack problem: 0110, 1010, 1100, 1001, with the sum required to be 1111. First code it in terms of roots of unity. We want their order to be a 4th power of a number which is at least 4, such as $5^4$, to allow all combinations of 4 of them to be distinct. The pieces are coded by the exponents $5^1 + 5^2$, $5^0 + 5^2$, $5^0 + 5^1$, $5^0 + 5^3$. Let $\eta$ denote a primitive $5^4$-th root of unity. We can make calculations with polynomially many decimals. So we want a monic divisor $F_1 = \sum_{j=0}^{4} b_j t^j$ of the polynomial $P_1 = (t - \eta^{5+25})(t - \eta^{1+25})(t - \eta^{1+5})(t - \eta^{1+125}) = \sum_{j=0}^{4} a_j t^j$ which modulo t is $c_0 = \eta^{1+5+25+125}$. Let $F_2 = P_1/F_1 = \sum_{j=0}^{4} b_j t^j$. We choose a power of 2 that is strictly greater than twice the degree of $P_1$, here 16. Now write
$$\left(\sum_{j=0}^{4} t^j x_j^{16}\right)\left(\sum_{j=0}^{4} t^j y_j^{16}\right) = \sum_{j=0}^{4} t^j a_j^{16}.$$
Then we deal with the congruence equation in the same way. This says originally $F_1 - c_0 = tF_2$. It becomes
$$\left(\sum_{j=0}^{4} t^j x_j^{16}\right) - c_0 = t\left(\sum_{j=0}^{3} t^j z_j^{16}\right).$$
We can also eliminate the second equation by substituting it into the first, for the x variables, and we can also solve for one of the q powers of variables.
The second part of [16] depended on a number of hypotheses, such as the Tate conjecture in dimension 2, in order to propose a quantum algorithm for finding points on algebraic curves in polynomial time. In particular, this is a probabilistic algorithm, and the paper [8] of Jinbai Jin suggests that probabilistic algorithms might be better at solving this problem. The quantum portion is in studying the structure of algebraic curves over finite fields. The biggest obstacle to finding a specific algorithm in the previous paper is to pass from a procedure on a higher-dimensional variety to a procedure on a variety of bounded dimension, preferably a curve. Hypersurfaces and their intersections in particular tend to be simply connected, by theorems of Lefschetz and calculations of Hirzebruch, and that means that there is no abelian variety which provides a natural invariant. Here we propose that this could be dealt with using a version of the Brauer-Manin obstruction or its generalizations.
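The coding used in Example 1 can be checked with exact integer arithmetic on exponents, without computing any roots of unity numerically. The sketch below is an illustration under stated assumptions, not the paper's procedure: it reads each 0/1 string left to right as the digits of $5^0, 5^1, 5^2, 5^3$, and it tracks the constant term of a monic divisor $\prod (t - \eta^{e})$ as a pair (sign, exponent mod $5^4$). The sign bookkeeping is an added assumption, since the example does not discuss it; the names `exponent` and `constant_term` are hypothetical.

```python
from itertools import combinations

BASE = 5            # the example uses 5, a number which is at least 4
ORDER = BASE ** 4   # order of the root of unity eta, here 5^4 = 625


def exponent(bits):
    """Map a 0/1 string b0 b1 b2 b3 to the sum of 5^i over positions i with b_i = 1."""
    return sum(BASE ** i for i, bit in enumerate(bits) if bit == "1")


items = ["0110", "1010", "1100", "1001"]
target = "1111"

exps = [exponent(bits) for bits in items]   # exponents of eta in the factors of P_1
c0_exp = exponent(target)                   # exponent of eta in c_0


def constant_term(chosen_exps):
    """Constant term of the product of (t - eta^e), as (sign, exponent mod ORDER)."""
    sign = -1 if len(chosen_exps) % 2 else 1
    return sign, sum(chosen_exps) % ORDER


# Every monic divisor of P_1 corresponds to a subset of its four linear factors.
solutions = [
    chosen
    for size in range(len(exps) + 1)
    for chosen in combinations(range(len(exps)), size)
    if constant_term([exps[i] for i in chosen]) == (1, c0_exp)
]

print("exponents:", exps, "target exponent:", c0_exp)
print("divisors congruent to c_0 modulo t:", [[items[i] for i in s] for s in solutions])
```

In this instance the only divisor whose constant term equals $c_0$ is the one built from the pieces 0110 and 1001, which is also the only way to write 1111 as a sum of the given strings.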
3 Research Methodology
3.1 The Brauer-Manin Obstruction
The problem of finding rational points on algebraic varieties over fields is a difficult one in general. It is unsolved even for algebraic curves, but at least there are methods which could be used in practice. Algebraic curves of genus at least 2 [12], and higher-dimensional varieties which are hyperbolic in a strong enough sense, can be embedded in abelian varieties using maps on their first cohomology groups. If the Tate-Shafarevich conjecture is true then there is an algorithm for finding a complete set of generators and relations for the rational points on abelian varieties over number fields [17]. If the Tate conjecture is true then there is an algorithm [9, 15] for finding points on abelian varieties over rational function fields. Once we determine the structure of an abelian variety, the rational points on its subvarieties are strongly limited.
It was proved by Putnam, Davis, Robinson, and Matiyasevich (see [13]) that the problem of solving Diophantine equations over the integers is algorithmically undecidable; the halting problem for Turing machines can be expressed as such a Diophantine equation. The problem of an algorithm to solve Diophantine equations (systems of polynomials in several variables) over the rationals is open, as is the problem of an algorithm to solve Diophantine systems over the function field C(t) whose coefficients are expressed in terms of the algebraic closure of the rational numbers and given transcendentals. It is also known that Diophantine systems for function fields in 2 or more variables are algorithmically undecidable.
In principle, an algorithm for finding rational points on an algebraic variety or showing nonexistence could be to try a countable sequence of possible solutions one by one, provided there is a similar way to prove nonexistence of rational solutions in cases where there are none. So the question of nonexistence is a place to start in finding an algorithm. The first condition for nonexistence is the Hasse principle: if there is a rational solution to a system of Diophantine equations then there is a solution to that system over every p-adic field and the real numbers. It was proved by Ax [1] that existence of solutions over all p-adic fields is decidable by an algorithm, and it was known earlier that this is the case for the real numbers and any particular p-adic field. This used a reduction to the case of algebraic curves and then the Weil conjectures to bound the numbers of points on those curves. However, it is also known that there are Diophantine systems, even elliptic curves, which have points over all p-adic fields and the real numbers, but no rational points.
The best general method for determining existence of rational points which goes beyond the Hasse principle is the Brauer-Manin obstruction and its generalizations. This depends on the properties of the Brauer group, which is used to classify division algebras and, correspondingly, simple algebras. It was proved by Wedderburn that every simple finite-dimensional algebra over a field is a full matrix algebra over a division algebra over that field. The Brauer group of a field F is the group of equivalence classes of simple algebras whose center is F, where two full matrix algebras over isomorphic division algebras are considered equivalent. The operation is tensor product.
The Brauer group of algebraic number fields F is known, and there are expressions for the Brauer group which hold more generally. The simplest nontrivial subgroup of the Brauer group of F is generated by quaternion algebras whose norm forms (determinants in a linear representation) are quadratic forms $x_1^2 + a x_2^2 + b x_3^2 + ab x_4^2$ where a, b ∈ F. The norm form determines the division algebra, and a great deal is known about these quadratic forms. This part of the norm form is also closely related to Milnor’s algebraic K-theory. More generally, the entire Brauer group over a local or global field is represented by cyclic algebras. The cyclic algebra (a, b) has algebra generators u, v and relations $u^n = a$, $v^n = b$, $uv = \zeta vu$, where a, b, ζ ∈ K and ζ is a primitive nth root of unity. Over general fields, the cyclic algebras will generate a proper subgroup of the Brauer group. At least they provide a way to construct many division algebras.
Brauer groups are also defined for schemes [7], which include algebraic varieties, sets of points on systems of polynomial equations over a field. Brauer groups can be defined using any of the following: Azumaya algebras, which generalize simple algebras; projective bundles over the scheme; or étale cohomology groups. For a global field K with localizations $K_v$, Hasse constructed an exact sequence
$$0 \to \mathrm{Br}(K) \to \oplus_v \mathrm{Br}(K_v) \to \mathbb{Q}/\mathbb{Z} \to 0$$
where Br denotes the Brauer group. The third map is denoted $\oplus_v \mathrm{inv}_v$, the sum of Hasse invariants over all primes. It is convenient to work with the direct sum $\oplus K_v$ of the local fields or completions $K_v$ (p-adic fields and the real field in the case of the rational numbers). A direct sum of groups is the subset of the direct product such that only a finite number of coordinates are not the identity. This is known as the ring of adèles. The ring of adèles also has a topology which we will not use. For an algebraic variety X defined by a system of polynomial equations, X(K) denotes its points over the field K and $X(A_K)$ denotes the set of solutions over the ring of adèles. The restriction mentioned above is that $X(K) \subset X(A_K)$, so that adèle points must exist in order for points over K to exist.
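For quaternion algebras over the rational numbers, the local Hasse invariants in the exact sequence above can be read off from Hilbert symbols, and exactness in the middle becomes the statement that the symbol is nontrivial at an even number of places. The sketch below is an illustration and not part of the paper. It uses the standard closed formulas for the Hilbert symbol $(a, b)_v$ of nonzero integers, with the common convention that $(a, b)_v = 1$ exactly when $z^2 = ax^2 + by^2$ has a nontrivial solution in $\mathbb{Q}_v$; this sign convention is an assumption and may differ from the norm form written above. The function names and the use of `None` for the real place are hypothetical choices.

```python
def prime_factors(n):
    """Set of primes dividing the nonzero integer n (trial division)."""
    n = abs(n)
    factors = set()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors


def split(n, p):
    """Write n = p**k * u with u not divisible by p; return (k, u)."""
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k, n


def legendre(u, p):
    """Legendre symbol of a unit u modulo an odd prime p, by Euler's criterion."""
    return 1 if pow(u % p, (p - 1) // 2, p) == 1 else -1


def hilbert(a, b, p):
    """Hilbert symbol (a, b)_p for nonzero integers; p is a prime, or None for the real place."""
    if p is None:
        return -1 if a < 0 and b < 0 else 1
    alpha, u = split(a, p)
    beta, v = split(b, p)
    if p == 2:
        eps = lambda x: ((x - 1) // 2) % 2       # 1 when x is 3 mod 4
        omega = lambda x: ((x * x - 1) // 8) % 2  # 1 when x is 3 or 5 mod 8
        e = eps(u) * eps(v) + alpha * omega(v) + beta * omega(u)
        return -1 if e % 2 else 1
    sign = -1 if (alpha * beta * ((p - 1) // 2)) % 2 else 1
    return sign * legendre(u, p) ** beta * legendre(v, p) ** alpha


def ramified_places(a, b):
    """Places of Q where the quaternion algebra (a, b) has nonzero local invariant."""
    places = [None] + sorted(prime_factors(2 * a * b))
    return [p for p in places if hilbert(a, b, p) == -1]


for a, b in [(-1, -1), (2, 5), (2, 2)]:
    print((a, b), "ramified at", ramified_places(a, b))
```

Only the real place and the primes dividing $2ab$ need to be examined, because the symbol is trivial at every other prime. The even count is Hilbert reciprocity; a Brauer-Manin argument looks for a class on the variety whose local invariants cannot sum to zero at any adèle point.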
Suppose that we have an element α ∈ Br(X). For each x ∈ X(AK ) there will be a restriction xα of α which lies in Br(AK ). If x ∈ X(K) then xα comes from the Brauer group of the algebraic number field K and ⊕v invv (xα ) = 0. This is the Brauer-Manin obstruction: Suppose ∀v, X(Kv ) = 0. Suppose all x ∈ X(AK ) fail to satisfy ⊕v invv (xα ) = 0 then X can have no points over K. Example. We can have Brauer group elements corresponding to quadratic forms. The condition of having total Hasse invariant zero in this case of quadratic forms implies by a form of quadratic reciprocity that the number of v where the local Hasse invariant invv (xα ) is nonzero is even. But it can happen that this number is always odd. Then still X(K) = ∅. For details on this and the above arguments see [19]. Computing Brauer groups other than for specially designed examples tends to be lengthy and complicated. Usually it is not necessary to compute the entire Brauer group to get a nonexistence result. There are several links between the Brauer-Manin invariant and the Tate conjecture in dimension 2. The Brauer-Manin invariant for abelian varieties is equivalent to the Tate-Shafarevitch group, which is analogous to the group of
Galois invariant cohomology classes used for the Tate conjecture. For the case of global function fields of positive characteristic, which can be used in Theorem 2, it seems likely that functoriality properties give a link between the two. The Brauer group can be expressed in terms of the cohomology of the Picard scheme as in exact sequence (25) of [19]. The identity component of the Picard scheme is closely related to a natural generalization of the Jacobian to higher dimensions.
3.2 Algorithm
The following is the algorithm suggested by this argument. Take an NP problem. Restate it in terms of the knapsack problem using Karp's theorem. Restate that in terms of a problem of finding a section of an algebraic variety mapped to CP using the two theorems in the previous paper, as described in the second section of this paper and the example. This can be done over a field of the form Q(η, t) where η is a root of unity. This is equivalent to a problem of solving a system of Diophantine equations over this field. Then use the generalization of the Brauer-Manin invariant to test whether the system of Diophantine equations has a solution over this field. There are many primes over a function field, as in the exact sequence of Milnor for the algebraic K-theory of K(t), but perhaps we just need to use one representative for each image element of the exact sequence, which means it is the same as for Q.
A variation on this algorithm is to substitute a random rational number with polynomially many digits for t and then use the Brauer-Manin obstruction over Q. It is not always true that for function fields K(t) this accurately tells whether there is a point over the function field: it often happens [4] that Diophantine equations over K(t) are algorithmically undecidable but Diophantine equations over K are decidable. If substitution of points always told whether there is a point over the function field, and the field is countable, that would give an algorithm for solving over K(t), which is a contradiction. But here we have some points in our favor: we are working with problems that are algorithmically decidable and can be made hyperbolic. For hyperbolic spaces, solutions tend to be rare, and that also means that spurious solutions tend to be rare. A second variation is to work over a characteristic p global field.
The following is an optimistic estimate of the speed of this algorithm.
All the steps except the generalized Brauer-Manin invariant can be done in polynomial time. For that step we use the generalization of Corwin and Schlank, which involves choosing Zariski open subsets. Make this choice at random using polynomially many subsets and covers of small orders of them. The total cost is then a polynomial factor times the cost of the method applied to one of them. We estimate the time required to compute the Brauer-Manin obstruction as being polynomially more than the time required to determine if a solution exists modulo every prime, and that can be done in polynomial time using methods from the previous paper, and reductions to algebraic curves. We may hope that all but a polynomially sized set of primes can be eliminated by general methods and then those can be studied individually.
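To make the phrase "a solution exists modulo every prime" concrete, the following Python sketch checks solvability of a small polynomial system modulo each prime in a given list by exhaustive search. It is only an illustration: the search is exponential in the number of variables and is not the polynomial-time method referred to above, and the function names are hypothetical.

```python
from itertools import product


def solvable_mod_p(polys, nvars, p):
    """Return True if every polynomial in polys vanishes at some point of (Z/pZ)^nvars.

    polys is a list of Python callables taking nvars integer arguments.
    """
    for x in product(range(p), repeat=nvars):
        if all(f(*x) % p == 0 for f in polys):
            return True
    return False


def local_obstructions(polys, nvars, primes):
    """Primes from the given list modulo which the system has no solution."""
    return [p for p in primes if not solvable_mod_p(polys, nvars, p)]


# Example system: the single equation x^2 - 2 = 0 in one variable.
example = [lambda x: x * x - 2]
```

Here `local_obstructions(example, 1, [2, 3, 5, 7])` lists the primes among 2, 3, 5, 7 modulo which 2 is not a square; any such prime already rules out an integer solution.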
All this is beyond the frontiers of established work in algebraic geometry, for reasons mentioned in the previous paper, in terms of rigorous proof, and a probabilistic algorithm is required. However this could conceivably be tested using random cases of the knapsack problem and a computer program. This would be time-consuming for the programmer.
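A test harness of the kind mentioned above could start from random instances with a known answer. The sketch below assumes the subset-sum form of the knapsack problem (choose $x_i \in \{0, 1\}$ with $\sum_i w_i x_i = T$), generates random instances, and computes the ground truth by dynamic programming so that the output of any proposed method can be compared against it. The names and the instance distribution are illustrative assumptions, not taken from the paper.

```python
import random


def random_instance(n, bits, rng):
    """Random subset-sum instance: n positive weights below 2**bits and a target."""
    weights = [rng.randrange(1, 2 ** bits) for _ in range(n)]
    target = rng.randrange(1, sum(weights) + 1)
    return weights, target


def subset_sum_exists(weights, target):
    """Ground truth via dynamic programming over the set of reachable sums."""
    reachable = {0}
    for w in weights:
        reachable |= {s + w for s in reachable if s + w <= target}
    return target in reachable


def satisfies_diophantine_system(weights, target, x):
    """Check the equations x_i * (x_i - 1) = 0 for all i and sum(w_i * x_i) = target."""
    return (all(xi * (xi - 1) == 0 for xi in x)
            and sum(w * xi for w, xi in zip(weights, x)) == target)
```

The dynamic program is pseudo-polynomial, so it is suitable only for generating reference answers on small instances.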
4 Conclusion
4.1 Advantages of the Brauer-Manin Invariant
The Brauer-Manin invariant is a powerful invariant to determine whether rational points exist on algebraic varieties over the rational numbers. Theorem 2 means an NP-complete problem can be stated in terms of existence of points on an algebraic variety over a function field of either of the forms Q(η, t), F_q(t), and the invariant might be applied to it. We probably cannot use C(t) since its Brauer group is trivial, but F_q(t) is a global field analogous to the rational numbers. So the invariant can be applied to this problem, and to NP problems in general. A first generalization of the Brauer-Manin invariant was provided by Alexander Skorobogatov using torsors and it is called the étale Brauer-Manin obstruction. He showed [18] that the original invariant was not always sufficient. Bjorn Poonen [14] showed that the new invariant is also not always sufficient. Another generalization was provided by Corwin and Schlank [3] by expressing an algebraic variety as a finite (nondisjoint) union of Zariski open subsets and applying the invariant to each subset. They showed that over real algebraic number fields including the rational numbers, their new invariant is necessary and sufficient. If we use probabilistic arguments, it might be enough to have an invariant which works with reasonable probability even if not always. However we do need an invariant which can be nontrivial for simply-connected algebraic varieties, and that is the case for the last of these. It was proved by Kresch and Tschinkel [10,11] that for surfaces the Brauer-Manin invariant is finitely computable. Even if it requires a long time, it might still be possible to use this invariant to show a result such as NP = co-NP. This at least indicates that there is a plausible argument that NP ⊂ BQP.
4.2 Heuristic Arguments Regarding Non-probabilistic Complexity Classes
If we are interested in plausible arguments, there are arguments that PSPACE coincides neither with NP nor EXP. These are based on the idea of trading space for time or vice versa to simulate P algorithms, or to get a contradiction. That is, a Turing machine might be simulated by another with more time and less space or vice versa. We would expect the exchange rate to involve comparable rates so that we have the same total number of Turing cells in space and time. PSPACE as defined has polynomial space and up to exponential time. We argue that PSPACE reasonably is equivalent to computation with exponential space and polynomial time. This involves what is called the parallel computation thesis.
A PSPACE machine can simulate parallel P by doing the different parallel branches in turn and carrying over a small amount of information needed for the aggregation phase, which might be represented as a binary tree. The reverse process is difficult, but one form of the parallel computation thesis was proved by Goldschlager [6]. It was proved by Chandra, Kozen, and Stockmeyer that PSPACE equals a form of alternating computation in polynomial time. Alternating computation allows the alternating use of universal and existential quantifiers [2]. This is like a 2-player game of perfect information with polynomially many steps. That can be studied in exponential parallel space by exploring all paths in the game tree, representing wins by one player as 1 and by the other as 0, and then alternately using infimum and supremum to amalgamate the results of the paths.
Suppose that we seek a PSPACE algorithm A1 and a P algorithm A2 (with larger polynomially sized space) which gives the same output. PSPACE algorithms by definition have more time than P algorithms but not necessarily more space. If we cannot trade time for space, then this simulation by A2 will not be possible. If we can trade time for space we can make the PSPACE algorithm yield a P argument with arbitrarily great polynomial time and space, and thus simulate A2. This yields a contradiction either way. We can do the same with PSPACE vs. EXP. Suppose that we have an EXP algorithm A3 and a PSPACE algorithm A1 with a polynomially sized space that allows a larger exponentially sized. If we cannot trade time for space then simulation of A3 by A1 is impossible. If we can trade time for space then again we get enough space from EXP to simulate A1, which contradicts the reverse simulation in general. This also bears on the question whether NC (parallel computers using a polynomial amount of parallelism and polylogarithmic time) equals P. Another question is about what trading is needed.
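The game-tree view of alternating computation described above can be sketched in a few lines of Python. The representation is an assumption made for illustration: a leaf is 0 or 1 (a win for one player or the other) and an inner node is a non-empty list of children, with existential and universal levels alternating. The recursion is depth-first, so it uses memory proportional to the depth of the tree while possibly visiting exponentially many leaves.

```python
def evaluate(node, existential=True):
    """Value of an alternating game tree.

    A leaf is the integer 0 or 1. An inner node is a non-empty list of
    children. Existential levels take the supremum (max) of their children,
    universal levels take the infimum (min), and the quantifier type
    alternates from one level to the next.
    """
    if isinstance(node, int):
        return node
    values = (evaluate(child, not existential) for child in node)
    return max(values) if existential else min(values)


# Root is existential, its children are universal.
tree_win = [[1, 0], [1, 1]]   # the second move wins against every reply
tree_loss = [[1, 0], [0, 1]]  # every move has a losing reply
```

Calling `evaluate(tree_win)` corresponds to asking whether there is a first move such that all replies lead to a leaf labeled 1.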
We need to go in two different directions in the proof. This means the current state is not a kind of local maximum. This is implausible because we can cut a tape in two and trade for both halves. There is no simple way to do this for NP vs. P. However we can consider the following. Assume NP = co-NP. NP problems can be represented as logical statements with a second order existential quantifier, such as the coloring for 3-coloring a graph. A rigorous form of this phrasing of NP in terms of 2nd order finite logic was proved by Ronald Fagin [5]. Problems with a bounded number of both existential and universal quantifiers can be re-expressed in terms of a single existential quantifier over a larger set because NP = co-NP allows universal quantifiers to be re-expressed with existential quantifiers. Suppose we look at #P as a comparison set. It might be that #P is equivalent to a form of parallel P which is equivalent to NP. In parallel P we can have an exponential number of parallel computations which can be aggregated in polynomial time to a single output. It might be possible to arrange a general sort of counting from which this single output can be obtained by using a polynomial amount of duplication and combination using Boolean algebra. For #P we might for instance consider the aggregation step as a SAT computation and look at
the exponentially many paths through the SAT which represent monomials and then combine them by #P. Suppose this can be expressed with parallel computation with a bounded number of 2nd order quantifiers. Then it might be possible to re-express this, for instance reducing the count in #P modulo polynomially many polynomially sized primes, using a single 2nd order quantifier. That becomes an NP problem. This would mean that NP = co-NP would imply PSPACE = NP, which we have argued above is impossible. To express #P reduced modulo primes in terms of a bounded number of second order quantifiers, we might have the proposed existence of some vector modulo the prime which represents the accumulation, and a logical step telling us to add the next summand modulo the prime, and then giving the initial and final values of the accumulation. We can look at permanents as a typical #P problem. However this vector seems to have exponential size, which is a quantifier that is too big. So the nonrigorous program fails at this point.
In this paper we have proposed a method of solving NP problems probabilistically. The primary limitation of this study is that it is not possible, as far as we know, to prove the speed of our algorithm using established knowledge in algebraic geometry. It might be possible to calculate examples, but writing this in the form of a computer program would be difficult and time-consuming for one person.
References
1. Ax, J.: Solving diophantine problems modulo every prime. Ann. Math. 85, 161–183 (1967)
2. Chandra, A., Kozen, D., Stockmeyer, L.: Alternation. J. ACM 28, 114–133 (1981)
3. Corwin, D., Schlank, T.: Brauer and étale homotopy obstructions to rational points on open covers (2020). arxiv:math/2006.11699
4. Denef, J.: The Diophantine problem for polynomial rings and fields of rational functions. Trans. Amer. Math. Soc. 242, 391–399 (1978)
5. Fagin, R.: Generalized first-order spectra and polynomial-time recognizable sets. SIAM-AMS Proc. 7, 27–41 (1974)
6. Goldschlager, L.: A universal interconnection pattern for parallel computers. J. ACM 29, 1073–1082 (1982)
7. Hartshorne, R.: Algebraic Geometry. Springer (1977)
8. Jin, J.: Explicit computation of the first étale cohomology on curves. arxiv:math/1707.08825
9. Kim, K.H., Roush, F.W.: A decision procedure for certain abelian varieties over function fields. J. Alg. 163, 424–446 (1994)
10. Kresch, A., Tschinkel, Y.: Effectivity of Brauer-Manin obstructions on surfaces. Adv. Math. 226(5), 4131–4134 (2011)
11. Kresch, A., Tschinkel, Y.: Effectivity of Brauer-Manin obstructions. Adv. Math. 218(1), 1–27 (2008)
12. Milne, J.: Jacobian varieties. https://www.jmilne.org/math/xnotes/JVs.pdf
13. Poonen, B.: Undecidable problems: a sampler. arxiv:math/1204.0299
14. Poonen, B.: Insufficiency of the Brauer-Manin obstruction applied to étale covers. arxiv:math/0806.1312, June 2008
15. Poonen, B., Testa, D., van Luijk, R.: Computing Néron-Severi groups and cycle class groups. arxiv:math/1210.3720
16. An algebraic geometric approach to NP problems. In: Proceedings of FICC 2022. Lecture Notes on Networks and Systems, Springer, New York (2022)
17. Silverman, J.: The arithmetic of elliptic curves. Springer (1986)
18. Skorobogatov, A.: Beyond the Manin obstruction. Inventiones Mathematicae 135(2), 399–424 (1999)
19. Viray, B.: Rational points on varieties and the Brauer-Manin obstruction, March 2023. arxiv:math/2303.17796
Augmented Intelligence Helps Improving Human Decision Making Using Decision Tree and Machine Learning
Mohammed Ali Al-Zahrani
Taif University, Taif, Saudi Arabia
Abstract. A progressive approach is used rather than the regular machine learning algorithms in common use. The analysis works on an athletes' data set whose selected columns hold details about the athletes. The work explains how an athlete can improve his or her abilities by analyzing 120 years of data on Olympic athletes. Selected details of the athletes are examined in a feature-based analysis; the main methods used are a decision tree and KNN. The selected features can best explain what a new athlete needs to focus on to secure a medal at the Olympics. However, the work remains focused on the machine learning technique. In the analysis the decision tree also performed well, but in overall performance the best outputs remain with improved KNN. The basic analysis is based on the age and gender of the athletes, while the major analysis, based on augmented intelligence, reports precision, recall, and F1 score. KNN gave the best results in terms of F1 score.
Keywords: Augmented intelligence · Decision tree · K-NN · Improved decisions
1 Introduction
Machines that progressively improve at decision-making are opening a new horizon for augmented intelligence built on artificial intelligence. In this work, a combined analysis is conducted on a data set using a decision tree and machine learning for an augmented intelligence-based problem. Human problems range from small sets of issues to large ones, and augmented intelligence offers research-based analyses to help people with them. Its applications are diverse, starting with human health and the gadgets that provide health information. Augmented intelligence also supports solution building, business development, and financial applications. This work links human health analysis with athletes' data in order to bring more stability to a running routine. Augmented intelligence already plays a vital role in reporting running statistics to athletes; here we develop charts and analyses that explain more about a healthy running routine. The work first performs a basic analysis dealing with age and the ratio between men and women. In the second phase, the analysis is conducted to obtain the precision, recall, and F1 scores under an augmented intelligence-based analysis, using improved KNN and a decision tree.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 182–191, 2024. https://doi.org/10.1007/978-3-031-54053-0_14
2 Literature Review
Augmented intelligence is pivotal in many real-time problems and in resolving them. The combination of human intelligence and artificial intelligence leads to the development of augmented intelligence. Artificial intelligence has been in use for more than two decades, whereas augmented intelligence is a sub-research area of artificial intelligence that has been in use for about a decade at most [1]. It is used to develop large-scale financial management tools, especially for tasks that humans are unable to handle without a machine. Similarly, contributions to the health sector, including solutions for illnesses that were never handled before, are carried out using augmented intelligence. During the Covid-19 pandemic, many analyses were conducted using augmented intelligence [2]. These solutions help with practical artificial intelligence-based tasks, and diseases are monitored in real time. The next generation of systems revolves around machines, built using artificial intelligence, that read the thoughts of the human brain. Small decisions have added up along the way, from switching lights to maintaining the temperature, and the last decade changed augmented intelligence. Augmented intelligence is playing a vital role in tech-based vehicles, especially electronic vehicles, automatic parking systems, and automatic traffic detection. Machine-based tasks ease life for the next generation, leading to machine versus human power [3]. The internet of things is becoming able to perform various tasks by adopting augmented intelligence. BlueDot, a tool developed using augmented intelligence, helped in preventing the spread of the Zika virus. Google Flu is another tool that was developed to analyze flu symptoms [4]. BlueDot performed better than Google Flu, but the two were created for different purposes.
Over the last two to three years, augmented intelligence played a vital role in recognizing Covid-19 patients during the pandemic. Most efforts aim at building machines that think like a brain; the Turing machine was in some sense an unachieved machine of its era. Mainly, augmented intelligence helps a machine use real-time facts and factors to achieve the required results. Human intelligence is added to machines to resolve human problems; cognitive computing and cloud-based applications are being used by IBM for financial applications [5]. The internet of things has evolved using augmented intelligence: various tech-based devices support human health and map tracking, and Apple Watches and similar devices are examples of augmented intelligence. Augmented intelligence has helped in handling weak nodes of the decision tree; bridging artificial intelligence and human control forms augmented intelligence. The next big step is to verify whether augmented intelligence proves better over time or whether artificial intelligence produces better results in terms of accuracy and similar factors [6]. The analysis may also verify that augmented intelligence leads to the next dimension, which is knowledge discovery.
Knowledge-based analysis will help maintain human abilities and improve databases. Trust issues between augmented intelligence-based applications and their human users are seen in currently developed applications. These trust issues can be resolved through continued use, because the main source of the application is data, and the longer the application holds the user's data the better it performs [7]. Such applications improve over time: as usage time and the number of users increase, the applications and systems become stronger and perform better, since data is their main source. Decision trees were used as a method for complex data before proper machine learning algorithms began to be applied to big data and complex applications. Augmented intelligence is helping human life by developing solutions based on artificial intelligence algorithms and techniques.
3 Analysis for Augmented Intelligence
The decision tree and an augmented intelligence-based machine learning algorithm have been chosen for performance analysis in order to produce high-end results. A new generative augmented intelligence model is created and compared with the decision tree [8]. Machine learning engineering is also involved in the selection of the algorithms: the problem must be understood first, and then a suitable approach or model is applied to it.
3.1 Decision Tree
This method is a non-parametric supervised learning technique, used most of the time for classification tasks; even with large-scale data the decision tree has given proven results. The working of the decision tree depends on the data set on which it is trained and tested. It works between parent and child nodes and relies on entropy to decide how the data is passed to the next node [9]. Until machine learning algorithms came into widespread use, the decision tree was used extensively. The decision tree is a progressive method to obtain facts and actual results. It does not backtrack to change a previous node, which is why it is called a greedy algorithm that works only on the present node.
3.2 Augmented Intelligence
Tasks done under artificial intelligence that enhance human behavior belong to augmented intelligence, in which the machine lets humans benefit from it. Implementing machine learning algorithms sometimes only yields statistical data, whereas augmented intelligence goes a little further: the data is used by a machine, and the machine directly helps humans. The next generation will grow up using augmented intelligence-based machines in the workplace [10].
Further types of augmented intelligence are cognitive augmentation, enhanced intelligence, machine augmented intelligence, and intelligence amplification.
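The entropy criterion mentioned for the decision tree in Sect. 3.1 can be illustrated with a short Python sketch. It computes the Shannon entropy of a set of class labels and the information gain of one candidate threshold split, which is the quantity a greedy tree builder would maximize at each node. This is a generic textbook formulation, assumed here for illustration; the paper does not state which splitting criterion or implementation was used, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature_index, threshold):
    """Entropy reduction from splitting on rows[i][feature_index] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature_index] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature_index] > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A greedy builder evaluates this gain for every candidate feature and threshold at the current node, keeps the best one, and never revisits the choice, which matches the absence of backtracking described in Sect. 3.1.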
3.3 K-Nearest Neighbor
For the augmented model using artificial intelligence, K-nearest neighbor (K-NN) is chosen as the method to produce constructive results, and it is compared with the decision tree. K-NN is a supervised learning algorithm that uses the nearest data points to make a prediction [11]. Its output depends on the neighboring points.
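As a minimal sketch of the idea, the following Python function implements plain K-NN classification with Euclidean distance and majority voting. It is not the "improved KNN" evaluated later, whose modifications are not specified in the text; the feature layout (age, height, weight) and the example values are made up for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of query by majority vote among its k nearest
    training points, using Euclidean distance."""
    neighbors = sorted(zip(train_x, train_y),
                       key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up example points: (age, height in cm, weight in kg) -> medal or none.
train_x = [(24, 180, 75), (25, 182, 78), (40, 170, 90), (42, 168, 95)]
train_y = ["medal", "medal", "none", "none"]
```

In practice the features would usually be rescaled before computing distances, since otherwise the feature with the largest numeric range dominates the vote.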
4 Data Set
The data set holds athletes' records and is analyzed with augmented intelligence using the decision tree and the KNN machine learning method. It was taken from the website Kaggle, and its records describe each athlete's age, weight, and medal, among further categories [12]. The chosen data set thus contains athletes' sports records that can be used to improve a sportsperson's abilities (see Table 1).
Table 1. Data set details
Data Set | Number of Records | Platform | About
120 Years of Olympic History: Athletes and Results | 134732 | Kaggle | The data set is about athletes, their basic bio, and the type of sport they do at the Olympics
Different columns hold different information about the athletes; name, age, sex, height, and weight are the ones that need to be considered. The athletes' data set holds 0.134 million records, and the records are genuine details about the athletes. The data set was selected after comparing data sets on various platforms and was found to be the most suitable for research purposes (see Fig. 1).
Fig. 1. Data set in details [13]
5 Machine Learning Engineering
The term engineering denotes knowing when and what to apply to a specific problem, much as athletes prefer to stay connected with coaches. Deploying machine learning algorithms is quite easy, but understanding which algorithm needs to be implemented for what sort of problem is known as machine learning engineering. Knowing the problem fully and then using the appropriate algorithm is sometimes fifty percent of the task. Which features are important and which features need to be chosen also matters [14]. All these factors make machine learning a separate and unique area of engineering.
5.1 Data-Set Selected Features
The first step is to select the most relevant features from the data set, as these features help to interpret the results. Selecting features involves considerable engineering, and the features are selected based on the purpose of the research. Selecting the most appropriate features needs special attention [15]. The selected features are name, age, sex, height, weight, and medal achieved. The name is included so that a recommendation can be made to an individual player who is able to improve his or her game (see Fig. 2).
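The feature selection step can be sketched with Python's standard `csv` module. The column names below (`Name`, `Age`, `Sex`, `Height`, `Weight`, `Medal`) and the use of `NA` for missing values are assumptions about the layout of the Kaggle file, and the three sample rows are invented placeholders rather than real records. The sketch also derives the Body Mass Index from height and weight, as one possible way to obtain the quantity referred to in the results section.

```python
import csv
import io

# Invented sample rows in the assumed column layout (not real records).
SAMPLE = """ID,Name,Sex,Age,Height,Weight,Team,Sport,Medal
1,Athlete A,M,24,180,80,Team X,Basketball,NA
2,Athlete B,F,27,165,55,Team Y,Swimming,Gold
3,Athlete C,M,NA,175,NA,Team Z,Rowing,NA
"""

FEATURES = ["Name", "Age", "Sex", "Height", "Weight", "Medal"]


def select_features(csv_text):
    """Keep only the selected columns, drop rows with missing numeric
    values, and add a BMI column (kg divided by metres squared)."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = {name: record[name] for name in FEATURES}
        if "NA" in (row["Age"], row["Height"], row["Weight"]):
            continue  # skip incomplete records
        row["Age"] = float(row["Age"])
        row["Height"] = float(row["Height"])
        row["Weight"] = float(row["Weight"])
        row["Medal"] = None if row["Medal"] == "NA" else row["Medal"]
        row["BMI"] = row["Weight"] / (row["Height"] / 100) ** 2
        rows.append(row)
    return rows
```

Whether incomplete records were dropped or imputed in the original study is not stated; dropping them is simply the shortest choice for this illustration.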
Fig. 2. Selected features [16]
5.2 Steps in the Analysis Involved
The selected features help to place athletes into categories, and after analysis it can be determined how their abilities can be enhanced. The enhancement at the moment depends upon the data set of Olympic athletes, and the major feature chosen is age. The analysis reports the percentage of athletes who were under 25, those aged 26 to 30, those aged 31 to 35, and those aged 36 or above [17]. The research indicates that better results are achieved by athletes aged 30 or less (see Fig. 3).
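The age categories can be expressed as a small binning function. The boundaries in the text leave the age of exactly 25 unassigned, so the sketch below makes the explicit assumption that 25 belongs to the youngest group; the group labels and function names are illustrative.

```python
from collections import Counter


def age_group(age):
    """Map an age in years to one of four groups.

    Assumption: an athlete aged exactly 25 is placed in the youngest group.
    """
    if age <= 25:
        return "25 or under"
    if age <= 30:
        return "26-30"
    if age <= 35:
        return "31-35"
    return "36 or above"


def age_distribution(ages):
    """Percentage of athletes falling into each age group."""
    counts = Counter(age_group(a) for a in ages)
    total = len(ages)
    return {group: 100 * count / total for group, count in counts.items()}
```

Applied to the age column of the data set, `age_distribution` yields the kind of percentage breakdown shown in the age ratio analysis.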
Fig. 3. Age ratio selection [18]
6 Proposed Solution Used
The proposed solution starts from the data set, from which the selected features are taken (see Fig. 4). After selecting name, age, sex, height, weight, and medal, the analysis phase begins: the performance of the improved KNN and the decision tree is analyzed on the selected text-based features [19]. The performance of the algorithms and of the improved augmented intelligence-based analysis on the athletes' data set is presented in the results phase. The proposed solution follows step-by-step processing, which helps it perform better. The results can likewise be analyzed to give athletes a direction for improving their abilities and performing better in their game segments.
Fig. 4. Proposed solution
7 Results of Basic Analysis of Athletes and Augmented Intelligence
The results are obtained using improved KNN and the decision tree. The F score is measured to analyze the performance of a specific athlete against his or her name. Augmented intelligence further explains what percentage of athletes can improve their performance according to their Body Mass Index.
7.1 Gender Ratio Analysis
As part of the basic analysis, the ratio between male and female athletes is computed to establish the basic facts and figures about the athletes (see Table 2).
Table 2. Gender ratio analysis
Male | Female
73% | 27%
7.2 Age Ratio Analysis
The age of the athletes is analyzed to find the age at which most athletes achieve their medals. Athletes aged 26 to 30 are more likely to achieve medals than those under 25 or those above 30 (see Table 3).
Table 3. Age ratio analysis
Under 25 Years of Age | 26 to 30 Years | 31 to 35 Years | 36 Years or Above
16% | 68% | 11% | 5%
7.3 Augmented Intelligence Based Analysis Using Improved KNN and Decision Tree
The augmented intelligence analysis uses the selected features to obtain precision and recall with both the decision tree method and improved KNN. The F1 score is then obtained as the main key factor. This analysis gives athletes a key factor for improving their abilities in the game by raising the score against their name. The basic analysis and the augmented intelligence analysis can be compared to produce a result for a single entity [20]; however, this augmented intelligence analysis is produced for the whole data set. Improved KNN obtained a precision of 82.60%, a recall of 96.80%, and an F1 score of 89.11%. The decision tree obtained a precision of 84.5%, a recall of 84.5%, and an F1 score of 84.95% (see Table 4).
Table 4. Proposed solution results
Method | Improved KNN | Decision Tree
Precision | 82.60% | 84.5%
Recall | 96.80% | 84.5%
F1 Score | 89.11% | 84.95%
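For reference, the F1 score of a single class is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. The short Python sketch below evaluates this formula; the function name is illustrative. Applied directly to the precision and recall values reported in Table 4, it gives values close to, but not exactly equal to, the reported F1 scores, which would be expected if the reported scores were averaged over classes or folds. The text does not specify the averaging used, so this is only an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Harmonic means of the precision and recall values in Table 4 (in percent).
knn_f1 = f1_score(82.60, 96.80)
tree_f1 = f1_score(84.5, 84.5)
```

Because the harmonic mean is dominated by the smaller of its two arguments, a method with high recall but lower precision, such as the improved KNN here, still needs a reasonable precision to reach a high F1 score.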
8 Future Work
This work analyzed a data set; extending it can lead to a new dimension in augmented intelligence research. For example, one could analyze the pace at which a runner can comfortably run. Such an analysis, or a device built on it, could guide runners on the phases in which they can improve their run. For instance, when the heart rate is even and the incline is low, the device could prompt the runner to do a progression run. There is considerable room to build such devices, watches, or other products that take augmented intelligence into a new dimension, using artificial intelligence algorithms to provide better solutions that have yet to be produced.
9 Conclusion
This work takes a new direction that may help athletes in real time, fulfilling the purpose of augmented intelligence. Athletes may find that they can improve their segments in the games by using augmented intelligence-based solutions. The results obtained by improved KNN are better than those of the decision tree, although the decision tree also performed well. The F1 score of improved KNN is higher than that of the decision tree; precision and recall are also reported.
Acknowledgment. I acknowledge the efforts of Muhammad Ehtisham and his devotion to research-based tasks in the area of Data Science, dealing with Artificial Intelligence, Machine Learning and Augmented Intelligence.
References
1. Gierl, M.J., Lai, H., Matovinovic, D.: Augmented intelligence and the future of item development. Application of Artificial Intelligence to Assessment, pp. 1–25 (2020)
2. Yoo, S.H., et al.: Deep learning-based decision-tree classifier for COVID-19 diagnosis from chest X-ray imaging. Front. Med. 7, 427 (2020)
3. Madni, A.M.: Exploiting augmented intelligence in systems engineering and engineered systems. Insight 23(1), 31–36 (2020)
4. Long, J.B., Ehrenfeld, J.M.: The role of augmented intelligence (AI) in detecting and preventing the spread of novel coronavirus. J. Med. Syst. 44(3), 1–2 (2020)
5. Marshall, T.E., Lambert, S.L.: Cloud-based intelligent accounting applications: accounting task automation using IBM Watson cognitive computing. J. Emerging Technologies in Accounting 15(1), 199–215 (2018)
6. Toivonen, T., Jormanainen, I., Tukiainen, M.: Augmented intelligence in educational data mining. Smart Learning Environ. 6(1), 1–25 (2019)
7. Sil, R., Roy, A., Bhushan, B., Mazumdar, A.K.: Artificial intelligence and machine learning based legal application: the state-of-the-art and future research trends. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), pp. 57–62. IEEE (2019)
8. del Cerro Velázquez, F., Morales Méndez, G.: Application in augmented reality for learning mathematical functions: a study for the development of spatial intelligence in secondary education students. Mathematics 9(4), 369 (2021)
9. Charbuty, B., Abdulazeez, A.: Classification based on decision tree algorithm for machine learning. J. Applied Science and Technol. Trends 2(01), 20–28 (2021)
10. Zhou, Y., Liu, P., Qiu, X.: Knn-contrastive learning for out-of-domain intent classification. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5129–5141 (2022)
11. Dimitropoulos, N., Togias, T., Zacharaki, N., Michalos, G., Makris, S.: Seamless human–robot collaborative assembly using artificial intelligence and wearable devices. Appl. Sci. 11(12), 5699 (2021)
12. Snowberger, A.D., Lee, C.H.: An investigation into the correlation between a country's total olympic medal count, GDP, and freedom index through history. In: Proceedings of the Korean Institute of Information and Communication Sciences Conference, pp. 495–498. The Korea Institute of Information and Communication Engineering (2021)
13. https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results
14. Amershi, S., et al.: Software engineering for machine learning: a case study. In: 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 291–300. IEEE (2019)
15. Velasco-Mata, J., González-Castro, V., Fernández, E.F., Alegre, E.: Efficient detection of botnet traffic by features selection and decision trees. IEEE Access 9, 120567–120579 (2021)
16. Aulia, P., Herawati, S., Asmendri, A.: Pengembangan Media Flowchart (Bagan Arus) Berbasis Microsoft Visio Pada Mata Pelajaran Fiqih Materi Ketentuan Zakat Kelas VIII Di MTsN 6 Tanah Datar. at-Tarbiyah al-Mustamirrah: Jurnal Pendidikan Islam, 1, pp. 1–24 (2020)
17. Aguiar, S.S., et al.: Master athletes have longer telomeres than age-matched non-athletes. A systematic review, meta-analysis and discussion of possible mechanisms. Experimental Gerontology 146, 111212 (2021)
18. Kern, H., Kühne, S.: Integration of Microsoft Visio and Eclipse Modeling Framework using m3-level-based bridges. In: Proceedings of Second Workshop on Model-Driven Tool and Process Integration (MDTPI) at ECMFA, CTIT Workshop Proceedings, pp. 13–24 (2009)
19. Vicente-Valdez, P., Bernstein, L., Fratoni, M.: Nuclear data evaluation augmented by machine learning. Ann. Nucl. Energy 163, 108596 (2021)
20. Liu, W., Fan, H., Xia, M.: Step-wise multi-grained augmented gradient boosting decision trees for credit scoring. Eng. Appl. Artif. Intell. 97, 104036 (2021)
An Analysis of Temperature Variability Using an Index Model
Wisam Bukaita, Oriehi Anyaiwe, and Patrick Nelson
Lawrence Technological University, Southfield, MI 48073, USA
[email protected]
Abstract. This study addresses temperature variation and volatility over 100 years, showing results on a daily, monthly, and annual basis. Our focus is to show how these variations have changed over long periods and have become more volatile over the past 30 years. Using this information, we developed a comprehensive model, the temperature variation index, which examines temperature volatility and provides predictions for future years. The index reveals that the biggest impact of climate change occurs in February and November, while the lowest impact occurs in the summer. The index serves as a tool to compare temperature data across different time intervals, enabling data scaling and the identification of differences between data points on a newly proposed scale. Data from 1900 to 2021 are examined to analyze temperature distributions and patterns over time. A new mapping technique is introduced, along with algorithms, to illustrate the weather patterns and shifts that have occurred in the past 30 years compared with the last 120 years. This technique provides a visual representation of temperature changes, aiding in the identification of trends and patterns. The results contribute to existing research by offering a detailed analysis of temperature variation and volatility over a long period, and the findings enhance the understanding and management of climate variability.
Keywords: Temperature variation index · Temperature variation · Temperature patterns
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 192–212, 2024. https://doi.org/10.1007/978-3-031-54053-0_15
1 Introduction
Global warming and climate change are under increasing scientific investigation, and the effects of climate change remain controversial, with many positions driven simply by opinion or by politics. Many studies have addressed global warming in the twenty-first century from different perspectives [7, 20, 24, 27]. Most focus on the rise of temperatures and its trend in the coming decades [1, 12, 15, 16, 21, 22, 25]. Other studies model the temperature rise to forecast global warming [6, 13, 14, 17]. Hansen [3] stated that global warming has carried the global temperature to its peak in the past few decades. The global surface temperature has increased by nearly 0.5°C since 1975, and the recent pace of global warming will continue or accelerate. In King's research [7], the signal-to-noise ratio is used to measure the detectability and perceptibility of local climate changes. It includes both the local
change in average temperature and the variability in temperature. In regions with less variable climates, smaller amounts of warming may adversely affect flora and fauna. According to King's research [7], the wealthiest countries will experience an average signal-to-noise ratio of only 0.94, while the poorest countries will experience an average signal-to-noise ratio of 1.3, which is equivalent to 35% more. The signal-to-noise ratio is also used to analyze the impacts of climate change on individual species and ecosystems [12]. In King's research [7], the signal represents the average model warming of annual temperatures at each location, while the noise represents the average model standard deviation of annual temperatures in a preindustrial climate.
Crookston [1] stated that the changes in precipitation and the hot temperatures associated with climate change are potential causes of changes in the populations of tree and animal species. These changes affect the development of trees, including their genetic adaptation, mortality, and growth. The current species-level focus of the Forest Vegetation Simulator is not sufficient. Therefore, several factors influenced by climate change, such as changes in the trees' habitat location, the mean annual precipitation (MAP), and the annual dryness index (ADI), are added to the adjusted Forest Vegetation Simulator model. The key driver used to develop mortality models in the Forest Vegetation Simulator is the maximum stand density, MaxSDI.
Nordhaus's study [11] presents an assessment of climate change and economics that treats the implications of both. According to Nordhaus's study [6], countries have taken few steps to lower greenhouse gas emissions since the first agreement in Kyoto [26] in 1997, and only limited improvements were made after the Copenhagen meeting in December 2009. Nordhaus stated that the scientific view is to keep the increase in global mean temperature to 2°C or less in an efficient manner.
Global temperature projections illustrate a sharp rise in temperature, reaching 3.5°C in 2100 and 5.7°C in 2200 relative to 1900. The Regional Integrated model of Climate and the Economy (RICE), in its RICE-2010 version, is presented in Nordhaus's research [6]. The RICE model modifies the conventional model to include climate investments.
Our goal is to examine these effects using data, to begin to distinguish what is perceived from what is real. Too often we hear "there is no global warming" because it is cooler than normal in the summer. What most people fail to recognize is that it may be cooler, but the variations in temperature are more extreme. Hence, our goal in this paper is to show, and to model, that what we see on a daily basis is not a predictor of climate change; what matters is how these daily occurrences compare with past dates, whether a week, a month, a year, or a decade ago. In fact, we will show that the fall (November) and winter (February) months show the most impact of climate change and that, contrary to popular belief, the summer months show the least.
It has been reported that there is an imbalance in the absorbed and radiated levels of carbon dioxide, methane, nitrous oxide, water vapor, and other components of the Earth's energy that is affecting our ozone layer [18, 19, 23, 24]. As a result of these changes, the health of the ecosystem is suffering perceptible and imperceptible damage, which will grow rapidly in the near future until it reaches a point where it cannot be repaired or restored [13, 22]. One of the consequences of global warming is the increase in the Earth's average surface temperature. Since the 1990s, climate change has not only taken global temperature to its highest level but has also created more volatility in temperatures [27]. This disrupted
balance between maximum and minimum temperatures is causing undue stress on our planet, especially on our ecosystem [28]. Coal and oil are equal sources of carbon dioxide (CO2) emissions, which are the largest factor affecting climate. The effect of non-CO2 greenhouse gases (GHGs) on climate change is as large as the CO2 effect, while the effect of methane (CH4) produced by agricultural practices is half as large as that of CO2. As climate models indicate [2], the global surface temperature will increase by 2–4°C by 2100. As a result of climate change, individual health, agricultural productivity, and air quality will be affected directly and indirectly. The weather data used in this analysis, monthly and annual temperatures, have been recorded since the beginning of the last century by the National Oceanic and Atmospheric Administration (NOAA) [4, 5].
Santos's analysis [8] investigated climate change and global warming and their consequences based on the volume of scientific publications between 1910 and 2020. It treats the number of publications as a key indicator of the multiple factors that affect research output on global warming. The objective of the study was to compare how publications on climate change, global warming, and climate emergency developed over time. Non-scientific publications have also covered climate change, in contrast to the scientific literature, which studies its causes and effects by identifying and addressing important gaps. One of the findings of Santos's research [8] is that the number of publications is increasing in island states such as the Philippines, Fiji, Bahamas, Palau, Micronesia, and Kiribati, which are most susceptible to climate challenges such as increased seismic activity.
2 Data Analysis and Case Study
The temperature model illustrated in this study displays the temperature history in the United States: maximum, minimum, and average temperatures. According to the U.S. Global Change Research Program [9], the temperature rise due to global warming has averaged 0.17°F per decade since 1901, moving to a much higher rate of between 0.32°F and 0.55°F per decade since 1979. To understand the variation and trend of temperatures, the monthly temperatures between 1900 and 2021 are studied over two time spans. The short span covers the last 30 years, 1991–2021, and the long span covers the preceding decades, 1900–1990. The analysis of both time series is presented numerically and graphically, with the associated outcomes, in the following sections.
2.1 Short Term Series
The monthly temperatures recorded between 1991 and 2021 are presented in Fig. 1, which displays the annual variation between extreme temperatures on a monthly basis. For example, in January the temperatures varied between 18.61°F and 49.64°F, whereas in July they ranged between 59.40°F and 89.96°F. Table 1 presents the ranges, highlighting the difference between winter and summer months.
Fig. 1. The temperatures recorded between 1991 and 2021 are illustrated in three different colors representing the minimum, average, and maximum temperatures, which fluctuate between extreme values on a monthly basis. The mean and standard deviation are 65.13°F and 15.91°F for the maximum temperatures, and 41.50°F and 14.04°F for the minimum temperatures.
Table 1. Summer and winter temperature analysis (°F)
Winter months                                    Summer months
Highest minimum temperatures: 39.34–49.64        Highest maximum temperatures: 83.03–89.96
Lowest minimum temperatures: 18.61–28.22         Lowest maximum temperatures: 59.40–63.55
Mean = 42.59°F, Standard Deviation = 2.299°F     Mean = 87.10°F, Standard Deviation = 1.519°F
Using the averages from 1900 to 1990 as a reference, we next examine the average annual temperatures between 1991 and 2021. Figure 2 shows these averages over the last 30 years and how they compare with the average temperature from 1900 to 1990. The figure shows not only a significant increase in average temperatures but also the pattern of the increase from year to year. In fact, over the past 30 years the temperature ranged between 62.68°F and 67.68°F (−2% to 6%), which puts it in the 75th percentile and even above the maximum average temperature of the 1900–1990 period. The annual temperature variation presented in Table 2 lists the minimum, average, and maximum annual temperatures for the period 1991–2021. To present Table 2 visually and make the data easier to read, the numerical values of annual temperature are shown as a heatmap, showing the low values in a lighter
Fig. 2. The average annual temperatures recorded between 1900 and 1990 are categorized by the minimum, the 25th, 50th, and 75th percentiles, and the maximum (from bottom to top). The bands are: minimum average temperature below 61.9°F, 25th percentile below 63.3°F, 50th percentile below 63.8°F, 75th percentile below 64.4°F, and maximum average temperature below 65.9°F. They serve as a reference scale for understanding the temperatures of the last three decades (solid line). The graph also shows the average annual temperatures recorded between 1991 and 2021, and their trend, projected onto the scale of the earlier period.
color, the high values in a darker color, and the middle shade representing moderate temperatures. As shown in Table 2, the temperature ranged between 18.34°F, the lowest minimum, recorded in 2000, and 89.96°F, the highest maximum, recorded in 2012. The minimum temperatures ranged between 18.34°F and 25.11°F, a difference of 6.77°F. The maximum temperatures ranged between 83.03°F and 89.96°F, a difference of 6.93°F. The average temperature ranged between 51.26°F and 55.28°F, a difference of 4.02°F.
3 Temperature Variability Models
This section introduces a newly developed method for constructing the "Index of Temperature Variation" as a measure of temperature volatility. It uses an expression of the widely known "Index of Qualitative Variation" to determine extreme variations in annual and seasonal shifts, given the existing temperature data accumulated prior to 1990.
Table 2. Fluctuations in temperature over thirty years. The data are represented by colors, with warmer temperatures shown in darker shades and cooler temperatures in lighter shades. The darkest shade in the matrix marks 89.96°F (2012), the highest temperature in the table, while the lightest marks 18.34°F (2000), the lowest temperature from 1991–2021. According to these data, 1992 had the smallest range between the annual extremes, with a difference of 60.43°F, while 2011 had the largest, at 69.54°F.
Year    Minimum annual Temp, °F    Average annual Temp, °F    Maximum annual Temp, °F
1991    19.72                      53.16                      86.63
1992    22.60                      52.60                      83.03
1993    20.95                      51.26                      84.22
1994    18.90                      52.87                      86.47
1995    23.56                      52.65                      87.12
1996    18.61                      51.89                      86.29
1997    20.10                      52.20                      85.84
1998    25.11                      54.23                      88.07
1999    23.13                      53.88                      86.86
2000    18.34                      53.27                      87.33
2001    21.31                      53.70                      87.42
2002    23.90                      53.21                      88.79
2003    22.24                      53.26                      88.70
2004    20.26                      53.10                      85.26
2005    22.62                      53.64                      88.50
2006    22.98                      54.25                      89.58
2007    21.33                      53.65                      87.48
2008    19.76                      52.29                      87.04
2009    19.76                      52.39                      85.32
2010    21.31                      52.98                      87.01
2011    19.36                      53.18                      88.90
2012    24.67                      55.28                      89.96
2013    20.61                      52.43                      86.43
2014    18.68                      52.54                      85.80
2015    21.58                      54.40                      85.96
2016    22.59                      54.92                      87.91
2017    23.74                      54.55                      88.56
2018    21.16                      53.52                      88.23
2019    21.96                      52.68                      87.10
2020    24.82                      54.37                      88.41
2021    20.39                      54.51                      87.66
3.1 Index of Qualitative Variation
In this study, the Index of Qualitative Variation (IQV), developed by Wilcox [10] to measure the variability of nominal data across several categories, is modified to measure temperature variability between different time intervals. The basic IQV equation, given in Eq. 1, is the ratio of observed differences to maximum differences.

\[ \mathrm{IQV} = \frac{K\left(10000 - \sum y^{2}\right)}{(K-1)\,10000} \tag{1} \]

where K is the number of categories and y is the percentage of observations in each category.
In this study, the Index of Qualitative Variation is modified as IQVm, given in Eq. 2.

\[ \mathrm{IQV}_{m} = \frac{K\left[\left(\sum y\right)^{2} - \sum y^{2}\right]}{\left(\sum y\right)^{2}(K-1)} \tag{2} \]

where y = T/µ is the ratio of the temperature T recorded in a sub-interval to the average temperature µ within the interval, and K is the number of sub-intervals.
3.2 Index of Temperature Variation
The modified Index of Qualitative Variation, IQVm, is used to derive a new temperature index that expresses the variation of temperatures over the last 30 years compared with the recorded averages over the last century. This new index is called the Index of Temperature Variation, ITV, and is given in Eq. 3.

\[ \mathrm{ITV} = \left(\mathrm{IQV}_{m,\,\text{before 1990}} - \mathrm{IQV}_{m,\,\text{after 1990}}\right) \times 10^{6} \tag{3} \]

An ITV of zero means no variation, a positive ITV means the variation is above the reference value of IQVm, and a negative ITV means it is below the reference value.
The temperature variation throughout the months is computed with the ITV equation to evaluate and demonstrate monthly temperature variability. Temperature performance and trends are analyzed using the modified Index of Qualitative Variation and the Index of Temperature Variation to evaluate temperature volatility. The y value is the ratio of the maximum monthly temperature in a selected year to the average of the maximum temperatures for the corresponding month over all years within the selected interval.
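As a rough illustration of Eqs. 2 and 3, the sketch below computes IQVm from a list of temperatures and ITV from two IQVm values. It is not the authors' code: the function names `iqvm` and `itv` are hypothetical, and the sketch assumes that µ is the mean of the temperatures supplied, so that y = T/µ for each sub-interval.

```python
from typing import Sequence


def iqvm(temps: Sequence[float]) -> float:
    """Modified Index of Qualitative Variation (Eq. 2) for K sub-intervals."""
    k = len(temps)
    if k < 2:
        raise ValueError("Eq. 2 needs at least two sub-intervals")
    mu = sum(temps) / k              # assumed: mu is the mean of the inputs
    y = [t / mu for t in temps]      # ratio y = T / mu for each sub-interval
    sum_y = sum(y)
    sum_y2 = sum(v * v for v in y)
    return k * (sum_y ** 2 - sum_y2) / (sum_y ** 2 * (k - 1))


def itv(iqvm_before: float, iqvm_after: float) -> float:
    """Index of Temperature Variation (Eq. 3), scaled by 10^6."""
    return (iqvm_before - iqvm_after) * 1e6


if __name__ == "__main__":
    # Hypothetical series for illustration only (not data from the study).
    steady = [50.0, 50.0, 50.0, 50.0]
    volatile = [40.0, 60.0, 45.0, 55.0]
    print(iqvm(steady))              # identical values give exactly 1.0
    print(itv(iqvm(steady), iqvm(volatile)))
```

Because Eq. 2 is a ratio of sums of y, the result does not depend on whether y is expressed as a fraction or as a percentage of µ. A more volatile series yields a lower IQVm, so a positive ITV indicates more variation in the later period than in the reference period.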
For example, the value $y_{1900}$ for January 1900 is the ratio of the maximum temperature recorded in January 1900 to the average of the maximum temperatures recorded in January between 1900 and 1990. The y value for each year, $y_{1900}, y_{1901}, y_{1902}, \ldots, y_{1990}$, is computed with the same ratio. The sum of the ratios, $\sum y$, adds the y values of the years from 1900 to 1990, and the sum of the squared ratios, $\sum y^{2}$, adds the $y^{2}$ values of the years within the same interval. The sum of the squared ratios $\sum y^{2}$, the square of the sum $(\sum y)^{2}$, and K, the number of years within the selected interval, are entered into Eq. 2 to compute the modified Index of Qualitative Variation, IQVm. The monthly index is computed once using data recorded before 1990 and again using data recorded after 1990. Table 3 lists the Index of Temperature Variation on a monthly basis, computed with the ITV equation.
Table 3. Monthly values of the Index of Temperature Variation (ITV), computed independently from temperature data recorded before 1990 and compared with the data recorded after 1990 using the IQVm reference values. The monthly index displays significant variation, particularly in February and November, which show a much higher degree of variation than any other month. This is likely due to natural fluctuations in temperature during the transition between seasons. The same information is presented graphically as a bar chart in three colors for three variation categories: low variation in blue, moderate variation in gray, and high variation in red.
Month        IQVm 1900–1990    IQVm 1991–2021    Index of Temperature Variation, ITV
January      0.99993           0.999903          26.78
February     0.999938          0.999834          103.74
March        0.999958          0.999914          44.13
April        0.999986          0.999974          11.53
May          0.999993          0.999981          11.70
June         0.999995          0.999987          8.53
July         0.999997          0.99999           7.19
August       0.999997          0.999991          6.76
September    0.999995          0.999988          6.51
October      0.999987          0.999968          19.36
November     0.999981          0.999876          104.72
December     0.999931          0.999898          32.81
Bar-chart labels: High Variation for February and November; Low Variation for June, July, August, and September.
A comparison of the results in Table 3 shows that the Index of Temperature Variation is at its highest level in February and November, meaning that the temperature has varied over a broader range, and at its lowest level in August and September. This
may be a reason for confusion about the effects of climate change, since it is harder to feel the effect in the winter than in the summer. In addition, we considered the coefficient of variation, a measure of the relative variability of temperatures, to evaluate the Index of Temperature Variation. The coefficient of variation, CV, is the ratio of the standard deviation of the maximum monthly temperatures recorded between 1900 and 1990 to the mean of the maximum temperatures within the same interval, calculated for each month. A lower coefficient of variation indicates that the data have less variability and higher stability. In February, the coefficient of variation is 0.07 before 1990 and 0.075 after 1990. In November, it is 0.042 before 1990 and 0.06 after 1990. In September, it is 0.019 before and 0.022 after 1990, and in August it is 0.015 before and 0.017 after 1990. The coefficients of variation in February and November are the largest monthly values both before and after 1990, and they increased after 1990. The coefficients of variation in August and September are the lowest, which is consistent with the trend of the Index of Temperature Variation derived in this study. As Fig. 3 shows, the temperatures in February and November oscillated over a larger range than the temperatures in August and September. The variability around the average maximum temperature indicates that the temperatures in August and September are more stable and less variable.
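The coefficient of variation described above can be sketched as follows. This is an illustration rather than the authors' computation: the function name is hypothetical, and because the text does not state whether the population or the sample standard deviation was used, the population form (`statistics.pstdev`) is assumed.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(max_temps: Sequence[float]) -> float:
    """Standard deviation divided by the mean of one month's maximum temperatures."""
    mean = statistics.fmean(max_temps)
    if mean == 0:
        raise ValueError("mean is zero; the coefficient of variation is undefined")
    # Assumption: population standard deviation. Use statistics.stdev for the sample form.
    return statistics.pstdev(max_temps) / mean


if __name__ == "__main__":
    # Hypothetical February maxima in deg F, for illustration only.
    print(coefficient_of_variation([40.0, 44.0, 47.0, 41.0, 50.0]))
```

Applying the function to the maxima of one calendar month, once for 1900–1990 and once for 1991–2021, gives the pair of values compared in the text.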
Fig. 3. Maximum temperatures on a monthly basis for February, August, September, and November from 1900–1990, compared with the maximum temperatures recorded in the same months in 1991–2021 and with the average maximum temperature of each month. The amplitude of the oscillations increases in November and February, with little change in August and September.
The modified Index of Qualitative Variation (IQVm) of the temperatures recorded in 1900 is computed from the monthly variation of temperatures shown in Table 4. The table is a sample calculation: it lists the monthly temperatures in 1900 and shows how IQVm, and from it the Index of Temperature Variation (ITV), is computed for that year. The results are used as a reference value to demonstrate the ITV for the next 120 years. IQVm is calculated from the maximum monthly temperature over the sub-intervals, where m is the number of months in the year 1900.
Table 4. Reference value of the Index of Qualitative Variation used to calculate the Index of Temperature Variation for the years following 1900. The index is computed from the monthly temperatures recorded in 1900, with sub-interval m, based on the maximum monthly temperature observed in 1900.
Year    Months, m    Max Temp, T, °F    y = T/µ (×100)    y²
1900    Jan-00       44.26              68.52719          4695.976
1900    Feb-00       42.15              65.26031          4258.908
1900    Mar-00       53.01              82.0747           6736.257
1900    Apr-00       63.36              98.09948          9623.507
1900    May-00       74.89              115.9512          13444.69
1900    Jun-00       83.53              129.3284          16725.84
1900    Jul-00       86.16              133.4004          17795.67
1900    Aug-00       85.84              132.9050          17663.73
1900    Sep-00       77.00              119.2181          14212.96
1900    Oct-00       67.91              105.1442          11055.3
1900    Nov-00       52.27              80.92897          6549.499
1900    Dec-00       44.67              69.16199          4783.381
Totals: µ = 64.59, $\sum y = 1200$, $\sum y^{2} = 127545.7$

\[ \mathrm{IQV}_{m} = \frac{m\left[\left(\sum y\right)^{2} - \sum y^{2}\right]}{(m-1)\left(\sum y\right)^{2}} = 0.994284 \]
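As a check on the reference value, the totals printed in Table 4 can be substituted into Eq. 2 with m = 12 sub-intervals. The arithmetic below uses only those printed totals:

\[ \mathrm{IQV}_{m} = \frac{12\left(1200^{2} - 127545.7\right)}{(12-1)\,1200^{2}} = \frac{12 \times 1312454.3}{11 \times 1440000} = \frac{15749451.6}{15840000} \approx 0.994284 \]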
The annual Index of Temperature Variation, calculated in Appendix A (Table 5) and illustrated in Fig. 4, displays the yearly differences in the index between the two interval models, before and after 1990. The figure also shows the trend of the index in the last 30 years compared with the period between 1900 and 1990. The index ranged between −1936.67 in 1936 and 944.62 in 1990 during the earlier period, and between −624.61 in 2010 and 1490.32 in 1999 during the later period. The index range on an annual basis is wider than on a monthly basis because of the significant difference between the temperatures on the hottest day in summer and the coldest day in winter. Comparing the index before and after 1990, the minimum value of the Index of Temperature Variation increased from −1936.67 to −624.61.
The maximum value of the Index of Temperature Variation likewise increased, from 944.62 to 1490.32.
Fig. 4. The Index of Temperature Variation depicts the temperature variation on an annual basis, with the data demonstrating a significant increase in signal-to-noise ratio extremes after 1990.
4 Temperature Ring Distribution Model
The temperature is analyzed using a graphical model that visualizes the monthly temperature variation between 1900 and 1990. The data are distributed angularly on a timeline, approximating each degree of a circle as one day so that a full year spans 360 degrees, and radially, so that the temperature gradient appears on concentric circles. The model is divided into twelve sectors, each representing one month, arranged clockwise from January to December. The positive y-axis separates December from January, and the negative y-axis separates June from July. The positive x-axis separates March from April, and the negative x-axis separates September from October. The temperatures recorded since 1900 are illustrated in Fig. 5 to display the differences in temperature distributions and patterns from month to month. The maximum monthly temperatures between 1900 and 1990 are plotted in the radial direction, starting from zero at the center, with concentric circles marking 10°F increments out to the circumference. As shown in Fig. 5, the temperatures in January and December are closest to the center of the temperature distribution ring, while the temperatures in July and August are
farthest from the center. Between April and June, the boundary of the temperature distribution moves at an increasing rate, whereas between September and November it moves at a decreasing rate. The temperatures recorded between 1900 and 1990 form an elliptical distribution bounded by the extreme temperatures. The maximum temperature is mapped on the outer bound of the shaded ring, and the minimum temperature on the inner bound. This elliptical ring serves as a reference for investigating any expansion, shrinkage, or shift in temperatures for any evaluated period.
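The angular and radial mapping described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function name `ring_point` is hypothetical, each month is assumed to span a 30-degree sector, a day is placed at one degree per day inside its sector, angles are measured clockwise from the positive y-axis, and the radius equals the temperature in °F.

```python
import math
from typing import Tuple


def ring_point(month: int, day: int, temp_f: float) -> Tuple[float, float]:
    """Map a (month, day, temperature) record to x, y coordinates on the ring.

    Assumptions (not from the source): 30-degree sectors per month, one degree
    per day within the sector, and radius equal to the temperature in deg F.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    if not 1 <= day <= 31:
        raise ValueError("day must be in 1..31")
    # Clockwise angle from the positive y-axis; day 31 is clamped to the sector edge.
    theta_deg = 30.0 * (month - 1) + min(day - 1, 29)
    theta = math.radians(theta_deg)
    # Clockwise from +y: x grows with sin(theta), y with cos(theta).
    return temp_f * math.sin(theta), temp_f * math.cos(theta)
```

With this convention, `ring_point(1, 1, 50.0)` lies on the positive y-axis (the December/January boundary) and `ring_point(4, 1, 60.0)` lies on the positive x-axis (the March/April boundary), matching the sector layout in the text.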
Fig. 5. The shaded elliptical region describes the temperature distribution over the period between 1900 and 1990. Each degree on the circle corresponds to one day. The center of the circle corresponds to a temperature of 0°F, and the temperature increases by 10°F with each concentric circle outward. January and December temperatures lie closest to the center, and July and August temperatures lie farthest from it. The shaded area covers all temperatures recorded on a monthly basis, and the boundaries of the shaded region represent the extreme temperatures. The graphic visualizes how monthly temperatures varied across the months within this time frame.
The temperatures recorded in the last three decades are mapped angularly and radially on the shaded elliptical ring to highlight any angular deviation of temperatures from the original angle and any radial expansion or shrinkage of the temperature distribution. The angular distribution of the temperatures is projected on the shaded ring,
and the radial mapping takes place by connecting the two extremes between each successive month. Compared with the reference temperatures, the monthly temperatures of the last three decades show that the temperature variability has shifted along the year, while the temperature distribution has shifted in its variation and deviates from its extreme boundary to a higher outline. The temperature distribution in Fig. 6 shows that the temperatures of the last three decades shrank angularly, leaving an unfilled gap that begins in mid-December and ends in the first quarter of February.
Fig. 6. For the period between 1991 and 2021, a new temperature distribution pattern emerges, forming an incomplete elliptical ring. The pattern was generated by mapping the average extreme temperatures per month against the corresponding temperatures from the period 1900–1990 and then shifting the months' axes to coincide with the corresponding temperatures, which leaves an unfilled gap in the new ring where temperature variations are missing. In other words, January now looks more similar to February and March than to its 90-year averages; hence, we removed it from the ring. Other months are also beginning to follow the pattern of different months, creating much less seasonality.
The maximum monthly temperatures have risen above the temperatures commonly recorded in the reference period, which produces a shrinkage of the temperature distribution in the angular direction. The shrinkage appears in two opposite angular directions, clockwise and counterclockwise. Figure 7 shows 36 degrees of shrinkage in the clockwise direction and 15 degrees in the counterclockwise direction, both measured from the positive y-axis. The total gap in the temperature distribution is therefore 51 degrees. Moreover, the temperature crept from its lower boundary towards the upper
boundary in the hottest months, June, July, and August. Therefore, the variation of temperatures in mid-July reaches its highest level by shifting the difference between the lower and upper bounds, as shown in Fig. 7.
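The total winter gap reported above is the sum of the two angular shrinkages measured from the positive y-axis. Under the model's approximation of one degree per day, the gap corresponds to roughly 51 days:

\[ \Delta\theta = 36^{\circ} + 15^{\circ} = 51^{\circ} \approx 51\ \text{days} \]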
Fig. 7. Mapping 1991–2021 temperatures on 1900–1990 temperature distribution. The incomplete ring presented in Fig. 6 is mapped with the reference ring displayed in Fig. 5 to compare the most important features in the two distributions. The first feature that is easy to recognize is the missing temperatures on the central top of the figure, and the second feature in the lower portion of the figure displays the shift in the temperature towards the outer boundary.
This research provides a numerical and graphical analysis of the implications of global warming based on temperature variations over the past 120 years. The results show a shrinkage in the temperature distribution between December and February over the past 30 years compared with the distribution of the last century. Figures 6 and 7 illustrate this shrinkage as a gap in the temperature distribution, which directly affects the survival of the plants, insects, bacteria, and animals that used to inhabit these temperatures. Indirectly, it could lead to the extinction of other creatures that depend on these species, as they would be forced to change their habitats and could die out.
5 Conclusion
This research has investigated and analyzed the consequences of global warming, particularly focusing on increasing heat and temperature variability. The Index of Temperature Variation (ITV) was computed on a monthly and annual basis using a newly derived formula. The results reveal that the annual variation is wider than the monthly variation, primarily due to the variability between winter and summer temperatures, while the monthly temperature index depends on temperature instability within the days of the same month. The monthly Index of Temperature Variation ranged between 6.51 and 104.72, with the highest variability observed in February and November, and the lowest variability between August and September. On an annual basis, the Index of Temperature Variation ranged between −1936.67 and 1490.32.
To visualize the differences in temperature distribution between two periods, 1991–2021 and 1900–1990, the temperature distribution in the last 30 years was projected on the graph of temperature distribution recorded within the last 12 decades. The graphical analysis revealed that temperature distribution in the period 1991–2021 had shrunk, showing a hole in temperature distribution during winter months. Surprisingly, this indicates that the greatest change in temperature is occurring during the winter months, rather than the commonly assumed indicators of global warming in the summer.
The research findings highlight that there is now a total gap of 51 degrees in temperature distribution. Climate change is leading to an increase in temperatures that effectively eliminates 51 days of winter. This indicates that our planet’s continuous temperature variation over the past century has experienced missing segments from the temperature map. Moreover, the temperature has gradually shifted from its lower bound towards the upper bound during the hottest months, particularly June, July, and August. Consequently, the variation of temperatures in mid-July reaches its peak by shrinking the difference between the lower and upper bounds. This shift implies that the average temperatures during summer months are being redefined for the next 30 years.
In conclusion, this research emphasizes that merely observing some cooler days in the summer is not sufficient evidence to deny the occurrence of global warming. Instead, it is crucial to analyze and measure variations in temperatures (oscillations or volatility from day to day) to comprehend the ongoing changes in our planet’s steady state of temperatures. These changes are likely to have unknown effects in the coming decade, underscoring the urgency for continued study and action in response to climate change.
Appendix A
Table 5. Index Temperature Variation between 1900 and 2021

Year  ΣY (Y = T/µ)  ΣY²       IQVm      IQVm reference  ITV
1900  1200          127545.7  0.994284  0.994284        0
1901  1200          128566.2  0.99351   0.994284        −773.08
1902  1200          128078.1  0.99388   0.994284        −403.352
1903  1200          127972.6  0.99396   0.994284        −323.364
1904  1200          127539.8  0.994288  0.994284        4.524202
1905  1200          129078.2  0.993123  0.994284        −1160.98
1906  1200          127615.1  0.994231  0.994284        −52.5281
1907  1200          126631.2  0.994976  0.994284        692.8172
1908  1200          126876.5  0.994791  0.994284        506.9709
1909  1200          128313.4  0.993702  0.994284        −581.581
1910  1200          127967.6  0.993964  0.994284        −319.615
1911  1200          127879.9  0.99403   0.994284        −253.184
1912  1200          128612.6  0.993475  0.994284        −808.212
1913  1200          128272.6  0.993733  0.994284        −550.679
1914  1200          128635.8  0.993458  0.994284        −825.844
1915  1200          127398.5  0.994395  0.994284        111.5002
1916  1200          128207.6  0.993782  0.994284        −501.435
1917  1200          128609.1  0.993478  0.994284        −805.612
1918  1200          128416.2  0.993624  0.994284        −659.445
1919  1200          128781.8  0.993347  0.994284        −936.386
1920  1200          128164.6  0.993815  0.994284        −468.871
1921  1200          126325.5  0.995208  0.994284        924.4244
1922  1200          128902.1  0.993256  0.994284        −1027.54
1923  1200          127324.2  0.994451  0.994284        167.8275
1924  1200          128920    0.993242  0.994284        −1041.16
1925  1200          127538.3  0.994289  0.994284        5.626411
1926  1200          127863.6  0.994043  0.994284        −240.808
1927  1200          127099.6  0.994621  0.994284        337.9509
1928  1200          127073.7  0.994641  0.994284        357.5912
1929  1200          129036.2  0.993154  0.994284        −1129.15
1930  1200          128808    0.993327  0.994284        −956.242
1931  1200          127439.2  0.994364  0.994284        80.66099
1932  1200          128400.7  0.993636  0.994284        −647.699
1933  1200          127542.6  0.994286  0.994284        2.362438
1934  1200          127287.5  0.994479  0.994284        195.6186
1935  1200          127921.9  0.993999  0.994284        −284.951
1936  1200          130102.1  0.992347  0.994284        −1936.67
1937  1200          129873.2  0.99252   0.994284        −1763.21
1938  1200          127249.8  0.994508  0.994284        224.2124
1939  1200          127224    0.994527  0.994284        243.7279
1940  1200          129192.2  0.993036  0.994284        −1247.31
1941  1200          127398.1  0.994395  0.994284        111.8274
1942  1200          127824.9  0.994072  0.994284        −211.473
1944  1200          128216.3  0.993776  0.994284        −507.977
1945  1200          127581.8  0.994256  0.994284        −27.3013
1946  1200          126403    0.995149  0.994284        865.7052
1947  1200          128539.7  0.993531  0.994284        −753.011
1948  1200          128586.1  0.993495  0.994284        −788.196
1949  1200          128328.8  0.99369   0.994284        −593.217
1950  1200          127003.2  0.994695  0.994284        410.969
1951  1200          128140.5  0.993833  0.994284        −450.553
1952  1200          128329.7  0.99369   0.994284        −593.902
1953  1200          126669.3  0.994948  0.994284        663.9583
1954  1200          126771.6  0.99487   0.994284        586.4228
1955  1200          128791.9  0.993339  0.994284        −944.066
1956  1200          127634    0.994217  0.994284        −66.8838
1957  1200          127636.5  0.994215  0.994284        −68.8071
1958  1200          128301.3  0.993711  0.994284        −572.387
1959  1200          127836.8  0.994063  0.994284        −220.526
1960  1200          129414.9  0.992868  0.994284        −1416.03
1961  1200          127616    0.99423   0.994284        −53.2734
1962  1200          127958.5  0.993971  0.994284        −312.711
1963  1200          128572    0.993506  0.994284        −777.464
1964  1200          127753.4  0.994126  0.994284        −157.312
1965  1200          127051.3  0.994658  0.994284        374.5418
1966  1200          128474.8  0.99361   0.994284        −673.913
1967  1200          126867    0.994798  0.994284        514.1588
1968  1200          127930.2  0.993992  0.994284        −291.234
1969  1200          128771.4  0.993355  0.994284        −928.566
1970  1200          128444.3  0.993603  0.994284        −680.714
1971  1200          127773.5  0.994111  0.994284        −172.561
1972  1200          128046.3  0.993904  0.994284        −379.215
1973  1200          127781.3  0.994105  0.994284        −178.456
1974  1200          127048.5  0.99466   0.994284        376.6983
1975  1200          127454.3  0.994353  0.994284        69.28268
1976  1200          126759.9  0.994879  0.994284        595.2932
1977  1200          128218.6  0.993774  0.994284        −509.768
1978  1200          129972    0.992445  0.994284        −1838.07
1979  1200          129502    0.992802  0.994284        −1482
1980  1200          128081.7  0.993877  0.994284        −406.06
1981  1200          126520.3  0.99506   0.994284        776.7992
1982  1200          127879.3  0.994031  0.994284        −252.736
1983  1200          128935.3  0.993231  0.994284        −1052.7
1984  1200          127504.8  0.994315  0.994284        30.98756
1985  1200          129179.2  0.993046  0.994284        −1237.48
1986  1200          126480.4  0.995091  0.994284        807.0499
1987  1200          126984.5  0.994709  0.994284        425.139
1988  1200          128040.3  0.993909  0.994284        −374.643
1989  1200          127863.9  0.994043  0.994284        −241.022
1990  1200          126298.8  0.995228  0.994284        944.6298
1991  1200          127090.2  0.994629  0.994284        345.1025
1992  1200          126507.7  0.99507   0.994284        786.3775
1993  1200          127879    0.994031  0.994284        −252.489
1994  1200          127602.8  0.99424   0.994284        −43.2506
1995  1200          126923.6  0.994755  0.994284        471.2925
1996  1200          127685.7  0.994177  0.994284        −106.069
1997  1200          127466.2  0.994344  0.994284        60.24541
1998  1200          126975.9  0.994715  0.994284        431.705
1999  1200          125578.5  0.995774  0.994284        1490.325
2000  1200          127601.1  0.994242  0.994284        −41.986
2001  1200          127261.8  0.994499  0.994284        215.0732
2002  1200          127155.3  0.994579  0.994284        295.8101
2003  1200          127407.3  0.994388  0.994284        104.8812
2004  1200          126730.4  0.994901  0.994284        617.6578
2005  1200          127046.3  0.994662  0.994284        378.3231
2006  1200          126211    0.995295  0.994284        1011.141
2007  1200          127657.7  0.994199  0.994284        −84.8388
2008  1200          127429.1  0.994372  0.994284        88.38435
2009  1200          127166.5  0.994571  0.994284        287.2831
2010  1200          128370.2  0.993659  0.994284        −624.617
2011  1200          127826.6  0.994071  0.994284        −212.761
2012  1200          126159.9  0.995333  0.994284        1049.874
2013  1200          127802.1  0.994089  0.994284        −194.252
2014  1200          127489.6  0.994326  0.994284        42.54319
2015  1200          126588.1  0.995009  0.994284        725.4538
2016  1200          126529.7  0.995053  0.994284        769.6901
2017  1200          126141.8  0.995347  0.994284        1063.57
2018  1200          127799.3  0.994091  0.994284        −192.111
2019  1200          127886.1  0.994026  0.994284        −257.86
2020  1200          126394.9  0.995155  0.994284        871.8553
2021  1200          126980    0.994712  0.994284        428.5696
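The columns of Table 5 appear to be linked by simple arithmetic, and the following sketch makes that link explicit. It rests on assumptions inferred from the tabulated numbers, not on a formula stated in this section: that IQVm follows the usual index of qualitative variation with K = 12 monthly categories, and that ITV is the difference between a year's IQVm and the 1900 reference value, scaled by 10^6. The function names `iqv` and `itv` are illustrative only.

```python
# Sketch: recompute the IQVm and ITV columns of Table 5 from the two sums.
# Assumptions (inferred from the table, not stated in this section):
#   - K = 12 monthly categories per year,
#   - ITV is the difference from the 1900 reference IQVm, scaled by 1e6.

K = 12  # months per year (assumed number of categories)


def iqv(sum_y: float, sum_y_sq: float, k: int = K) -> float:
    """Index of qualitative variation: k/(k-1) * (1 - sum(Y^2) / (sum Y)^2)."""
    return (k / (k - 1)) * (1.0 - sum_y_sq / sum_y ** 2)


def itv(sum_y: float, sum_y_sq: float, ref_sum_y_sq: float, scale: float = 1e6) -> float:
    """Scaled difference between a year's IQV and the reference year's IQV."""
    return scale * (iqv(sum_y, sum_y_sq) - iqv(sum_y, ref_sum_y_sq))


REF_SUM_Y_SQ = 127545.7  # sum of Y^2 in the 1900 row of Table 5

if __name__ == "__main__":
    # A few rows of Table 5: (year, sum of Y^2); the sum of Y is 1200 throughout.
    for year, sum_y_sq in [(1901, 128566.2), (1936, 130102.1), (1999, 125578.5)]:
        print(year,
              round(iqv(1200, sum_y_sq), 6),
              round(itv(1200, sum_y_sq, REF_SUM_Y_SQ), 1))
```

Because the tabulated sums are rounded to one decimal place, values recomputed this way can differ from the printed ITV column by a few hundredths.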
A Novel Framework Predicting Anxiety in Chronic Disease Using Boosting Algorithm and Feature Selection Techniques
N. Qarmiche1,2(B), N. Otmani2, N. Tachfouti3, B. Amara4, N. Akasbi5, R. Berrady6, and S. El Fakir3
1 Laboratory of Artificial Intelligence, Data Science and Emerging Systems, National School of Applied Sciences Fez, Sidi Mohamed Ben Abdellah University, Fez, Morocco [email protected]
2 Biostatistics and Informatics Unit, Department of Epidemiology, Clinical Research and Community Health, Faculty of Medicine and Pharmacy of Fez, Sidi Mohamed Ben Abdellah University, Fez, Morocco
3 Department of Epidemiology, Clinical Research and Community Health, Faculty of Medicine and Pharmacy of Fez, Sidi Mohamed Ben Abdellah University, Fez, Morocco
4 Pneumology Department, Hassan II University Hospital, Fez, Morocco
5 Rheumatology Department, Hassan II University Hospital, Fez, Morocco
6 Department of Gastro, Hepato-Enterology - CHU Hassan II Fez, Sidi Mohammed Ben Abdellah University, Fez, Morocco
Abstract.
Introduction: Chronic diseases represent a major public health challenge and contribute to a substantial number of deaths. Individuals living with chronic illnesses frequently experience anxiety due to the long-term impact on their overall well-being. The development of a predictive model to accurately identify anxiety in these patients is of great importance for healthcare providers. This study aimed to create a predictive model focused on anxiety among patients with chronic diseases within the Moroccan population.
Methods and materials: In this study, a comparative analysis was performed using five different machine learning algorithms. The researchers utilized a cross-sectional dataset comprising 938 patients with chronic diseases. These patients were under monitoring at the Hassan II University Hospital Center in Fez, Morocco, from October 2019 to December 2020. Anxiety levels were evaluated using the validated Moroccan version of the Hospital Anxiety and Depression Scale (HADS).
Results: To assess the models’ performance, several metrics were calculated, including accuracy, AUC, precision, recall, and F1-measure. The CatBoost algorithm achieved the highest accuracy of 0.7 in the evaluation. It also obtained an AUC of 0.68, a precision of 0.7, a recall of 0.55, and an F1-measure of 0.6. Based on its accuracy performance, the CatBoost algorithm is considered the optimal model in this study.
Conclusion: This study created a CatBoost model to predict anxiety in patients with chronic diseases, offering valuable insights for anxiety prevention. Early psychological support and intervention are beneficial for high-risk patients, aiding in anxiety management.
Keywords: Chronic diseases · Prediction · Machine learning · Anxiety · Feature selection
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 213–221, 2024. https://doi.org/10.1007/978-3-031-54053-0_16
1 Introduction
Chronic diseases present a major public health challenge worldwide, as indicated by their significant impact on mortality rates [1]. According to the 2019 Global Burden of Disease study, chronic diseases are among the principal causes of death [3]. In countries such as China, the USA, and Canada, chronic diseases are the primary contributors to poor health, disability, and healthcare expenditures.
Anxiety disorders are characterized by an excessive and irrational fear response, leading to avoidance behaviors, even in the absence of real danger. Anxiety has cognitive, neurobiological, and behavioral aspects and is distinct from depressive mood, although the two can often coexist. When anxiety becomes severe and persistent, it can significantly impair daily functioning and have negative consequences for individuals [2, 3]. Furthermore, individuals with chronic diseases often experience heightened rates of anxiety and depression, leading to increased mortality and reduced quality of life [4].
A predictive model of anxiety risk in patients with chronic illnesses offers a valuable tool for physicians. Such a model allows for proactive evaluation of anxiety symptoms and assists in identifying individuals who may be experiencing anxiety [5]. A literature review covering 2010 to November 2017 identified 16 studies that predicted anxiety using machine learning algorithms [5]. However, there is a lack of predictive models specifically designed for anxiety related to chronic diseases. Only a small number of models have concentrated on predicting anxiety associated with particular chronic diseases, such as cancer [6] and Alzheimer's dementia [7]. This study aimed to develop a model, using a regional Moroccan dataset, to predict anxiety disorder among patients with chronic diseases.
2 Methods and Materials
The study employed two programming environments, Python (version 3.6.4) and SPSS version 24, with the experiments and analysis conducted using the Jupyter Notebook. The primary package utilized for the project was Scikit-learn.
2.1 Data Source
This is a cross-sectional study conducted at the Hassan II University Hospital Center in Fez, Morocco, from October 2019 to December 2020, involving 938 patients with eleven chronic diseases. The diseases included in the study were Stroke (10.64%), Diabetes (25.85%), Viral Hepatitis B (7.23%), Viral Hepatitis C (2.77%), Hypothyroidism (8.51%), Chronic kidney disease (9.36%), Systemic Lupus Erythematosus (10.85%), Rheumatoid arthritis (4.89%), Psoriasis (14.15%), and Ankylosing spondylitis (5.32%). Anonymity and confidentiality of the participants were ensured. Written informed consent was obtained from all participants, and the ethics committee of the Hassan II University Hospital of Fez approved the study. Data collection involved a predefined questionnaire covering sociodemographic and clinical variables related to each disease. Only the
common data, including age, gender, place of residence, marital status, level of education, occupation, monthly income, smoking status, social insurance, lives, comorbidity, age of the disease, and pathology, were used for training and testing the model.
2.2 Outcome Variable
The outcome variable, anxiety, was measured using the Hospital Anxiety and Depression Scale (HADS), which has been validated for use in Morocco [9]. The HADS is a self-administered scale consisting of 14 items, with seven items assessing anxiety (total A) and seven items assessing depression (total D). Each item is scored from 0 to 3, resulting in a score range of 0 to 21 for each subscale. A subscale score of 0 to 7 indicates a normal patient, while a score above 7 indicates a patient experiencing anxiety or depression [10].
2.3 Data Preprocessing
The number of missing values was determined using the Pandas call isnull().sum(). To handle missing data, numeric variables were imputed with the mean, and categorical variables were imputed with the most frequent value. Variables with more than 20% missing data were excluded from the database [8]. The categorical variables were encoded using the LabelEncoder class from the sklearn.preprocessing module. The data were then standardized so that each attribute has zero mean and unit variance. The formula is as follows:
$$z = \frac{x - \mu}{\delta} \tag{1}$$
where µ is the mean of the training samples and δ is the standard deviation of the training samples. We used StandardScaler from sklearn.preprocessing. The Sklearn function train_test_split was utilized to randomly partition the data into two subsets: a training set (80%) and a test set (20%).
2.4 Feature Selection
Feature selection is crucial in machine learning as it helps select relevant features, leading to improved model performance, reduced overfitting, and better data interpretation [9]. In this study, two feature selection techniques were employed: Recursive Feature Elimination with Cross-Validation (RFECV) and Random Forest Importance.
2.5 Model Development and Evaluation
Random Forest is a powerful machine learning algorithm that utilizes an ensemble of decision trees to generate predictions. By aggregating the outputs of multiple individual trees, it enhances accuracy and mitigates overfitting concerns. In a Random Forest model, numerous decision trees are trained on diverse subsets of the training data, and their collective predictions are combined to yield the final prediction [10].
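The preprocessing and feature selection steps of Sects. 2.3 and 2.4 can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the authors' notebook: the column names, the sample size, the forest settings, and the 3-fold cross-validation are all assumptions made for the example, and the categorical imputation uses a plain pandas `fillna` with the column mode in place of an imputer object.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data; column names and values are hypothetical.
df = pd.DataFrame({
    "age": rng.integers(18, 90, n).astype(float),
    "disease_duration": rng.integers(0, 30, n).astype(float),
    "gender": rng.choice(["F", "M"], n),
    "residence": rng.choice(["urban", "rural"], n),
    "monthly_income": rng.choice(["low", "mid", "high"], n),
})
y = (df["age"] + rng.normal(0, 20, n) > 55).astype(int).to_numpy()

# Inject missing values: 5% in "age", 30% in "monthly_income".
df.loc[rng.choice(n, 15, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 90, replace=False), "monthly_income"] = np.nan

# 1. Count missing values and drop variables with more than 20% missing.
missing_share = df.isnull().sum() / len(df)
df = df.loc[:, missing_share <= 0.20].copy()

# 2. Impute: mean for numeric columns, most frequent value for categorical ones.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# 3. Encode categorical variables as integers.
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# 4. Split 80/20, then standardize with the training mean and standard deviation.
X_train, X_test, y_train, y_test = train_test_split(
    df.to_numpy(dtype=float), y, test_size=0.20, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 5. Feature selection: RFECV around a random forest, plus impurity importances.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
selector = RFECV(forest, step=1, cv=3, scoring="accuracy").fit(X_train, y_train)
importances = forest.fit(X_train, y_train).feature_importances_

print("kept by RFECV:", list(df.columns[selector.support_]))
print("importances:", dict(zip(df.columns, importances.round(3))))
```

Fitting the scaler on the training split only matches the definition of µ and δ in Eq. (1) as training-sample statistics and keeps test-set information out of the transformation.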
Support Vector Machines (SVMs) are a classification algorithm widely used in machine learning. They aim to find a hyperplane in a high-dimensional space that maximizes the separation between data points of different classes [11]. SVMs can employ various types of kernels, with the linear kernel being advantageous in high-dimensional datasets. SVMs are renowned for their proficiency in handling complex, non-linear decision boundaries and performing well in high-dimensional spaces [12].
eXtreme Gradient Boosting (XGBoost) is a widely adopted machine learning algorithm employed for both classification and regression purposes. It belongs to the family of ensemble methods called gradient boosting, which combines the predictions of multiple weak models to generate a robust and accurate predictive model. XGBoost stands out for its efficiency, scalability, and performance, rendering it a preferred choice among numerous machine learning practitioners [13].
Category Boosting (CatBoost), an implementation of gradient boosting, combines predictions from multiple weak models to construct a robust predictive model. What sets CatBoost apart is its specialized capability to handle categorical data effectively. It offers high performance, swift learning times, and the ability to handle extensive datasets containing numerous categorical features [14].
AdaBoost is renowned for its simplicity and effectiveness. It operates by iteratively invoking a chosen weak or base learning algorithm across multiple rounds (t = 1, …, T). A key concept behind AdaBoost is the maintenance of a distribution or set of weights over the training set. These weights, denoted as D_t(i), start equally distributed, but in each subsequent round the weights of misclassified examples are increased. This strategy compels the base learner to pay more attention to challenging instances within the training set [15].
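The reweighting idea behind AdaBoost can be illustrated with a single round on a toy example. The sketch below uses the standard binary formulation with labels in {−1, +1}, where the weak learner's vote is α_t = ½ ln((1 − ε_t)/ε_t) and the weights are updated as D_{t+1}(i) ∝ D_t(i) exp(−α_t y_i h_t(x_i)); the five labels and predictions are hypothetical, and the function name `adaboost_round` is illustrative only.

```python
import numpy as np


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step for labels in {-1, +1}.

    Assumes the weighted error lies strictly between 0 and 1.
    """
    miss = y_true != y_pred
    eps = float(np.sum(weights[miss]))             # weighted error of the weak learner
    alpha = 0.5 * np.log((1.0 - eps) / eps)        # vote assigned to this weak learner
    new_w = weights * np.exp(-alpha * y_true * y_pred)  # raise misses, lower hits
    return new_w / new_w.sum(), alpha              # renormalize to a distribution


y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # hypothetical weak learner with one mistake
w0 = np.full(5, 0.2)                    # D_1(i): uniform initial weights

w1, alpha = adaboost_round(w0, y_true, y_pred)
print("updated weights:", w1)
print("alpha:", alpha)
```

After the update, the misclassified example carries half of the total weight, which is what forces the next weak learner to concentrate on it.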
The performance evaluation of the machine learning algorithms in this study involved the utilization of several commonly used metrics, including accuracy, precision, recall, F-measure, and ROC curve. These metrics serve as standard measures to assess the efficacy of classification models and are defined by the following equations:
$$\text{Accuracy} = \frac{TN + TP}{TP + FP + FN + TN} \tag{2}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$
$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \tag{5}$$
In this context, the terms TN, TP, FP, and FN are commonly used in the evaluation of binary classification models. TN represents the number of true negatives, TP represents the number of true positives, FP represents the number of false positives, and FN represents the number of false negatives. These terms help describe the accuracy of the classification process.
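Equations (2) to (5) translate directly into a few lines of code. The sketch below computes the four metrics from confusion-matrix counts; the counts in the usage example are hypothetical and are not taken from the study, and the function name `classification_metrics` is illustrative only.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tn + tp) / (tp + fp + fn + tn)          # Eq. (2)
    precision = tp / (tp + fp)                          # Eq. (3)
    recall = tp / (tp + fn)                             # Eq. (4)
    f_measure = 2 * recall * precision / (recall + precision)  # Eq. (5)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for a test set of 100 patients.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The function assumes at least one predicted positive and one actual positive; otherwise precision or recall is undefined and the division fails.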
The AUC, or area under the receiver operating characteristic (ROC) curve, is a metric used to evaluate the performance of classification models. It is calculated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds and measuring the area under that curve. The AUC ranges from 0 to 1: a higher value indicates a greater ability to distinguish between the positive and negative classes, and a value of 0.5 corresponds to random guessing.
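A short sketch can make the construction of the AUC concrete. It computes the ROC curve and its area with scikit-learn and compares the result with an equivalent reading of the AUC: the fraction of (positive, negative) pairs in which the positive case receives the higher score. The eight labels and scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores for eight patients.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.60])

# TPR against FPR at every threshold, then the area under that curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)

# Same quantity as the share of positive/negative pairs ranked correctly
# (valid here because no two scores are tied).
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = float(np.mean(pos[:, None] > neg[None, :]))

print("AUC from ROC curve:", area)
print("AUC from roc_auc_score:", roc_auc_score(y_true, y_score))
print("share of correctly ranked pairs:", pairwise)
```

The three numbers agree, which is why the AUC can be read as the probability that a randomly chosen positive case is scored above a randomly chosen negative one.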
3 Experiments and Results
3.1 Handling Missing Data
The missing values for “age” and “disease duration” were imputed by replacing them with the mean value. For other variables, the missing values were imputed with the most frequent value using the SimpleImputer function from Sklearn. The variable “monthly income” was excluded from the study due to having more than 20% missing data.
3.2 Feature Selection Results
Based on the results obtained from the RFECV method (see Fig. 1), and the Random Forest Importance technique (see Fig. 2), a set of nine features were identified as significant predictors. These features include age, residency, marital status, education level, profession, medical insurance, comorbidity, disease duration, and pathology.
Fig. 1. Plot number of features vs. cross-validation scores
3.3 Models Evaluation
Table 1 presents the prediction performance (accuracy, AUC, precision, recall, and F1-measure) of the five models used.
Fig. 2. Plot the impurity-based feature importance of the forest
After evaluating the predictive performance of different models using accuracy as the measure (see Fig. 3), the study selected CatBoost as the final model for predicting anxiety in patients with chronic diseases. This chosen model exhibited an accuracy of 0.7, an AUC of 0.68, a precision of 0.7, a recall of 0.55, and an F1-measure of 0.6.

Table 1. Results of the prediction performance (accuracy, AUC, precision, recall, and F1-measure)

Model     Accuracy (test)  AUC   Precision  Recall  F1-measure
RF        0.68             0.66  0.66       0.53    0.59
AdaBoost  0.64             0.63  0.6        0.53    0.56
SVM       0.67             0.65  0.64       0.55    0.59
XG-Boost  0.62             0.6   0.57       0.51    0.54
CatBoost  0.7              0.68  0.7        0.55    0.6
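A comparison of the kind summarized in Table 1 can be sketched as a loop over candidate classifiers. The example below is a hedged illustration, not a reproduction of the study: it runs on synthetic data generated to have 938 rows and 9 features, uses default hyperparameters, and, to stay within scikit-learn alone, substitutes `GradientBoostingClassifier` as a stand-in for the XGBoost and CatBoost models, which come from separate packages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 938-patient, 9-feature dataset (assumption).
X, y = make_classification(n_samples=938, n_features=9, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "SVM": SVC(kernel="linear", random_state=0),
    # Stand-in for XGBoost / CatBoost, which are not part of scikit-learn.
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}


def continuous_scores(model, X):
    """Scores for the ROC curve: decision function if present, else P(class 1)."""
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    return model.predict_proba(X)[:, 1]


results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "auc": roc_auc_score(y_test, continuous_scores(model, X_test)),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

# Select the final model by test accuracy, as done in the text.
best = max(results, key=lambda k: results[k]["accuracy"])
for name, scores in results.items():
    print(name, {k: round(v, 2) for k, v in scores.items()})
print("selected by accuracy:", best)
```

Selecting on a single held-out split mirrors the procedure described in the text; with a dataset of this size, cross-validated estimates would give a more stable ranking of models whose accuracies differ by only a few points.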
Fig. 3. Models accuracy histogram
4 Discussion
This study pioneers the use of a regional dataset to create an anxiety prediction model specifically designed for individuals with chronic diseases in the Moroccan population. To accomplish this, a cross-sectional study was conducted using a dataset of 938 patients with chronic diseases. These patients were under monitoring at the Hassan II University Hospital Center in Fez, Morocco, from October 2019 to December 2020. A comparative analysis of five machine learning algorithms was conducted as part of this study. Among them, the CatBoost model achieved the highest accuracy, 0.7.
The study acknowledges several limitations and provides insights for future research. The results of the predictive model developed for anxiety in chronic diseases remain inconclusive because of the limited scope of the dataset, which covers only 10 specific chronic diseases. Future studies should expand the dataset to include a broader range of chronic diseases.
5 Conclusion
This study developed a CatBoost-based model to predict anxiety among patients with chronic diseases. The model aims to decrease the prevalence of anxiety disorders by accurately predicting anxiety at an early stage. This can lead to improved quality of life for patients with chronic diseases, reduced hospitalization rates, and decreased healthcare costs.
Acknowledgment. We express our sincere gratitude to Professor T. Sqalli Houssaini (Nephrology, Hemodialysis and Transplantation, Hassan II University Hospital, Fez, Morocco), Professor M.F. Belahssen (Neurology Department, Hassan II University Hospital, Fez, Morocco), Professor H. El Ouahabi (Endocrinology Department, Hassan II University Hospital, Fez, Morocco), and Professor H. Baybay (Dermatology Department, Hassan II University Hospital, Fez, Morocco) for their invaluable collaboration in this study.
References
1. Bauer, U.E., Briss, P.A., Goodman, R.A., Bowman, B.A.: Prevention of chronic disease in the 21st century: elimination of the leading preventable causes of premature death and disability in the USA. Lancet 384(9937), 45–52 (2014). https://doi.org/10.1016/S0140-6736(14)60648-6
2. Moser, D.K., Riegel, B., McKinley, S., Doering, L.V., An, K., Sheahan, S.: Impact of anxiety and perceived control on in-hospital complications after acute myocardial infarction. Psychosom. Med. 69(1), 10–16 (2007). https://doi.org/10.1097/01.psy.0000245868.43447.d8
3. DeJean, D., Giacomini, M., Vanstone, M., Brundisini, F.: Patient experiences of depression and anxiety with chronic disease: a systematic review and qualitative meta-synthesis. Ont Health Technol Assess Ser 13(16), 1–33 (2013)
4. Gerontoukou, E.-I., Michaelidoy, S., Rekleiti, M., Saridi, M., Souliotis, K.: Investigation of anxiety and depression in patients with chronic diseases. Health Psychol. Res. 3(2), 2123 (2015). https://doi.org/10.4081/hpr.2015.2123
5. Pintelas, E.G., Kotsilieris, T., Livieris, I.E., Pintelas, P.: A review of machine learning prediction methods for anxiety disorders. In: Proceedings of the 8th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Infoexclusion, in DSAI ’18. New York, NY, USA: Association for Computing Machinery, pp. 8–15 (2018). https://doi.org/10.1145/3218585.3218587
6. Haun, M.W., Simon, L., Sklenarova, H., Zimmermann-Schlegel, V., Friederich, H.-C., Hartmann, M.: Predicting anxiety in cancer survivors presenting to primary care – a machine learning approach accounting for physical comorbidity. Cancer Med. 10(14), 5001–5016 (2021). https://doi.org/10.1002/cam4.4048
7. Byeon, H.: Predicting the anxiety of patients with alzheimer’s dementia using boosting algorithm and data-level approach. International Journal of Advanced Computer Science and Applications (IJACSA), 12(3), Art. no. 3 (2021). https://doi.org/10.14569/IJACSA.2021.0120313
8. Cismondi, F., Fialho, A.S., Vieira, S.M., Reti, S.R., Sousa, J.M.C., Finkelstein, S.N.: Missing data in medical databases: impute, delete or classify? Artif. Intell. Med. 58(1), 63–72 (2013). https://doi.org/10.1016/j.artmed.2013.01.003
9. Torabi, M., Udzir, N.I., Abdullah, M.T., Yaakob, R.: A review on feature selection and ensemble techniques for intrusion detection system. International Journal of Advanced Computer Science and Applications (IJACSA), 12(5), Art. no. 5, 58/31 (2021). https://doi.org/10.14569/IJACSA.2021.0120566
10. Nilashi, M., et al.: Disease diagnosis using machine learning techniques: a review and classification. JSCDSS 7(1), 19–30 (2020)
11. Sheta, A.F., Ahmed, S.E.M., Faris, H.: A comparison between regression, artificial neural networks and support vector machines for predicting stock market index. International Journal of Advanced Research in Artificial Intelligence (IJARAI), 4(7), Art. no. 7 (2015). https://doi.org/10.14569/IJARAI.2015.040710
12. Battineni, G., Chintalapudi, N., Amenta, F.: Machine learning in medicine: performance calculation of dementia prediction by support vector machines (SVM). Informatics in Medicine Unlocked 16, 100200 (2019). https://doi.org/10.1016/j.imu.2019.100200
13. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, in KDD ’16. New York, NY, USA: Association for Computing Machinery, pp. 785–794 (2016). https://doi.org/10.1145/2939672.2939785
14. Daoud, E.A.: Comparison between XGBoost, LightGBM and CatBoost using a home credit dataset. Int. J. Computer and Information Eng. 13(1), 6–10 (2019)
15. Schapire, R.E.: The boosting approach to machine learning: an overview. In: Nonlinear Estimation and Classification, Denison, D.D., Hansen, M.H., Holmes, C.C., Mallick, B., Yu, B. (eds.), in Lecture Notes in Statistics. Springer, New York, NY, pp. 149–171 (2003). https://doi.org/10.1007/978-0-387-21579-2_9
Machine Learning Application in Construction Delay and Cost Overrun Risks Assessment

Ania Khodabakhshian, Umar Malsagov, and Fulvio Re Cecconi

Politecnico di Milano, Via Ponzio 31, 20133 Milan, Italy
Abstract. Construction projects are prone to significant delays and cost overruns due to uncontrollable risks arising from their complex, unique, and uncertain nature. Conventional Risk Management (RM) methods have proven inefficient, time-consuming, and highly subjective, which makes the exploration of innovative, data-driven solutions essential. Artificial Intelligence (AI) is revolutionizing the construction industry by offering improved, optimized, and automated Project Management solutions, which can significantly benefit existing RM processes. This study investigates the application of various Machine Learning (ML) algorithms for delay and cost overrun risk prediction in construction projects. A case study involving NYC school construction projects is used to train and evaluate algorithms such as Decision Trees, Artificial Neural Networks, Extreme Gradient Boosting, and Linear and Ridge regressions. The ultimate goal of this research is to conduct a comparative analysis of the performance and prediction precision of different ML algorithms for delays and cost overruns, two of the most significant construction risks, with respect to each algorithm’s structure and learning process. The results of this study provide automated and precise predictions of risks in new construction projects while also contributing valuable insights into the potential and benefits of ML applications in the construction industry.

Keywords: Machine learning · Risk management · Construction delay · Construction cost overrun · Construction project management
1 Introduction

Delays and cost overruns are persistent challenges in construction projects, posing significant risks to on-time project delivery, cost management, and overall project success [1]. These issues are often attributed to poor preparation, limited information regarding project scope and resources, and a lack of communication among project stakeholders, including architects, engineers, contractors, and government entities [2]. Additionally, unforeseen expenses and inadequate cost estimation further contribute to cost overruns. Risk Management (RM) endeavors to ensure project success and keep projects on track and within budget [3]; nevertheless, it faces difficulties due to the complex nature of the construction sector. Furthermore, conventional RM methods fail to deliver a quick, accurate, and automated assessment of risks that would enable project managers to take preventive or corrective actions to mitigate them [4].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 222–240, 2024. https://doi.org/10.1007/978-3-031-54053-0_17
In recent years, the emergence of artificial intelligence (AI) and machine learning (ML) has offered a promising avenue for addressing these shortcomings of conventional RM methods, enabling more accurate risk identification, assessment, and mitigation processes [5, 6]. ML, a field within AI, empowers computer systems to learn from data and make informed predictions or decisions without explicit programming [7]. ML algorithms analyze large datasets, extract meaningful insights, and continually improve predictions based on accumulated data, making them invaluable for complex problem-solving and decision-making tasks [8]. By leveraging ML techniques, the construction industry has the potential to automate and enhance RM processes while reducing reliance on human expertise. While ML applications in construction risk management are still relatively new, ML applications in the design, engineering, construction, and operation phases have been widely studied in the literature [9, 10] and, if implemented in practice, can contribute significantly to decreasing time and cost overruns in projects. For instance, material availability forecasting has benefited from ML by predicting material arrivals and adjusting construction schedules accordingly, reducing the likelihood of delays caused by material shortages. Similarly, ML algorithms can forecast labor demand, optimizing worker scheduling and mitigating the risk of labor shortages [11]. However, one of the most important issues hindering the widespread adoption of ML in RM practice is the lack of comprehensive and well-documented data in the construction industry [12]. Construction projects are unique, lengthy, and complex, resulting in infrequent data generation and documentation from which the ML models can learn. 
Nevertheless, with the digitalization and data management trends driven by Industry 4.0 technologies, data collection from construction projects is becoming faster, more standardized, and more frequent [3, 13, 14]. As a result, ML models can leverage more extensive databases, significantly enhancing their accuracy and effectiveness in construction research and practice. This study aims to explore the application of various ML techniques for predicting construction delays and cost overruns, utilizing a real-world database of New York City school building construction as a case study. By employing various ML algorithms such as Artificial Neural Network, Decision Tree, XGBoost, and Linear and Ridge Regression, this research endeavors to evaluate the effectiveness of each algorithm, with respect to its unique structure, and identify the most reliable approach. Therefore, this study aims to answer the following questions: (1) What are the most important and influential features in a project when predicting delays and cost overruns? (2) What are the most appropriate ML algorithms to predict delays and cost overruns in large databases with a combination of continuous and categorical values? (3) Which ML algorithm has the best accuracy in predicting delay and cost overrun values? The outcomes of this study not only contribute to the existing knowledge in construction RM but also provide practical and efficient solutions for construction professionals to predict and mitigate the risk of delays and cost overruns, the most common and influential risks in construction projects.
The results of the extensive literature search on conventional RM methods in the industry and on ML-based RM methods are provided in the Literature Review section, which serves as a benchmark for this study and for validating its results. The Methodology section describes the main steps of developing the proposed ML-based model for predicting the delay and cost overrun risks, which is applied to the case study database. The results obtained from each algorithm are presented and compared in the Results section. The Conclusion section presents the advantages and disadvantages of the proposed ML-based approach, the research limitations, and future directions.
2 Literature Review

In order to learn about state-of-the-art ML applications for construction RM and identify the knowledge gaps, an extensive literature review was conducted in the Scopus database. The main keywords used for the search include construction projects, RM, ML, AI, Delay Risk, Cost overrun, and construction management and engineering. The selected articles were critically analyzed to evaluate their quality, relevance, and contribution to the existing knowledge on ML-based RM for construction projects, as well as to reveal the knowledge gaps in the literature and identify potential future research areas. The findings are categorized into two main subsections: a) conventional risk management methods in the construction industry and b) the current state of AI and ML in construction research and their application in construction RM.

2.1 Conventional RM Methods

Risk management is essential in the construction industry to ensure the successful delivery of projects and decrease project vulnerability to uncertainties and external factors [15]. It follows a systematic process consisting of four main steps: a) risk identification, b) risk analysis, c) risk response and mitigation, and d) risk monitoring and control [16–18]. Various methods, such as expert judgment, checklists, brainstorming sessions, probability and impact matrices, sensitivity analysis, and Monte Carlo simulations, have conventionally been used for RM in construction projects. Furthermore, established project management tools and techniques like the Critical Path Method (CPM), Program Evaluation and Review Technique (PERT) analysis, Earned Value Management, and Constructability Review have contributed to the systematic management of risks [19, 20]. 
However, these methods fall short of representing and managing, quickly and efficiently, the uncertainties, the interrelated influences of project variables, and the causal effects on risks that are present in construction projects. Moreover, they assess risks in isolation and with respect to their effect on a single aspect of a project, such as schedule, budget, or quality, assuming independence among activities and risk factors [21]. Finally, they are mainly based on frequentist statistics for modeling risks, which hinders the ability to learn from various sources such as expert opinion, project data, and model simulation. The advantages that novel technologies like ML algorithms bring to the area of RM, and their ability to address the mentioned shortcomings of conventional RM methods, need further research and knowledge gap analysis, which is presented in the next subsection.
2.2 Machine Learning for Construction Risk Management

Machine learning, a subset of AI with trending applications in different industries, focuses on designing algorithms that enable machines to learn from data and make predictions or decisions without explicit programming [12, 22]. In the construction sector, AI applications have led to advancements in process efficiency, accuracy of outcome predictions, objectivity of decisions, and automation of repetitive tasks [23]. AI is a broad umbrella term that covers a plethora of technologies and methods, including knowledge representation and reasoning, information fusion, computer vision, natural language processing (NLP), virtual reality (VR), and robotics [24]. The three groups most used for RM in construction are knowledge representation and reasoning, information fusion (especially Machine Learning), and Natural Language Processing. Knowledge representation and reasoning involve symbolic representations of domain knowledge and predefined rules to construct a knowledge-based system and allow computers to logically comprehend available knowledge and draw sound conclusions [25, 26]. Structural equation modeling (SEM) and Bayesian networks are two useful tools in this field, both of which can describe causal relationships among variables and reason under uncertainty using graph theory [27, 28]. Information fusion is the process of integrating information from multiple sources to obtain a more accurate understanding and to minimize uncertainty, redundancy, and inconsistency in the information [24]. Machine Learning (ML) is one of the most used techniques in information fusion; it learns from patterns in the data to make predictions or decisions. ML application in the RM realm, the focus of this study, can take the form of supervised learning (regression or classification problems) or unsupervised learning (clustering problems), depending on the type of data available from projects [29]. 
Decision Trees, Artificial Neural Networks, Linear and Ridge Regression, Support Vector Machine, Bayesian Networks, and Gradient Boosting are some of the most used ML algorithms for RM [23], some of which are used for this study and will be introduced in the Methodology section. Natural language processing (NLP) enables computers to understand and generate human language, facilitating communication and information extraction, and has been widely used in construction document analysis, accident and safety report analysis, and improved comprehensibility [30, 31]. ML application in construction RM has been the topic of several previous studies, the analysis of which provides a holistic picture of the potential and shortcomings of ML in solving existing RM issues and highlights the research gaps in the state of the art that this study aims to fill. As the results of these studies show, ML can enable evidence-based decision-making in proactive project RM strategies, empowering the construction industry to handle the inherent risks associated with complexity and interdependent delay risk sources [32]. However, the size of the database and the type of available data have significant impacts on the choice of the ML algorithm and the accuracy obtained [33]. In the delay risk assessment literature, Gondia et al. (2020) [32] applied Decision Tree (DT) and Naïve Bayesian classification algorithms to analyze construction project delay risk based on a database of delay-inducing risk sources. The results indicated that the Naïve Bayesian model outperformed the decision tree model for small-sized datasets. Similarly, Erzaij (2021) [34] explored the application of DT and Naïve Bayesian classifiers to develop a predictive tool for identifying and analyzing
delay sources in construction projects, using a dataset of 97 projects, where the decision tree algorithm exhibited better accuracy in predicting delays due to the nature of the data. Sanni-Anibire et al. (2021) [29] developed an ML-based construction delay prediction framework using Multi Linear Regression Analysis, K-Nearest Neighbours, ANN, SVM, and Ensemble methods. In another study, Sanni-Anibire et al. (2022) [35] focused on predicting delay risks in high-rise building projects using ML, drawing on expert input on 36 identified delay risk factors; the best model for predicting delay risk was based on ANNs. In the cost overrun risk assessment literature, Elmousalami (2020) [36] conducted a broad literature review on the most common AI techniques for cost modeling, such as fuzzy logic (FL) models, artificial neural networks (ANNs), regression models, case-based reasoning (CBR), hybrid models, decision tree (DT), random forest (RF), support vector machine (SVM), AdaBoost, scalable boosting trees (XGBoost), and evolutionary computing (EC) such as genetic algorithm (GA). Memon and Rahman (2013) [37] adopted SEM to assess the effects of resource-related factors on project cost, indicating better results compared to multivariate techniques like multiple regression, path analysis, and factor analysis. Rafiei and Adeli (2018) [38] presented an innovative construction cost estimation model using an unsupervised deep Boltzmann machine (DBM) learning approach along with a three-layer back-propagation neural network (BPNN) and SVM, taking into account economic variables and indexes. The reviewed studies indicated the potential of ML techniques to offer accurate predictions for various risk factors in construction projects, including delay risks, dispute outcomes, fatal accidents, and site-specific risk levels [39–41]. 
These findings highlight the value of ML as a tool for evidence-based decision-making, proactive safety management, and cost-saving measures in the construction industry. However, challenges and limitations exist, such as scaling ML algorithms to large datasets, learning network structures, computational costs, and noise sensitivity. Moreover, the change-resistant culture of the construction industry hinders the replacement of conventional, inefficient project management tools with ML-based methods [42]. Additionally, due to the context-driven nature of risks, there is no one-size-fits-all ML model or solution to address the specific risk issues in different studies. Therefore, this study aims to assess the performance of different ML algorithms for predicting delay and cost overrun risks when applied to a case study database of school building construction projects in New York (13570 raw records, reduced to 1489 after cleaning).
3 Methodology

This paper aims to analytically compare the performance of various ML algorithms when predicting delay and cost overrun risks in construction projects. To this end, the five selected algorithms, namely XGBoost, Decision Tree, ANN, and Ridge and Linear Regression, are implemented on a database consisting of 13570 school construction projects in New York, and their results are compared. Like other ML-based applications, this study follows the standard steps of data collection, data cleaning and preprocessing, model description, model training, model testing, and validation. A more detailed description of the research methodology is presented in Fig. 1.
Fig. 1. Research methodology steps
3.1 Data Collection

The success of ML models depends significantly on the quality and quantity of the data used for training and validation. However, data collection is one of the most cumbersome steps in construction projects, as data is not frequently registered and updated, and few open-access databases are available compared with other industries. The case study database of this study is an open-access one sourced from the Capital Project Schedules and Budgets database available on the City of New York’s Open Data Portal (https://data.cityofnewyork.us/Housing-Development/Capital-Project-Schedules-and-Budgets/2xh6-psuq), maintained by the New York City government. Before data cleaning, the database had 13570 rows (projects) and 14 columns (project attributes): Project Geographic District, Project Building Identifier, School Name, Project Type based on funding, Project Description (description of construction/retrofit services and work packages), Project Phase Name, Project Status (completed, ongoing), Project Phase Actual Start Date, Project Phase Planned End Date, Project Phase Actual End Date, Project Budget Amount ($), Final Estimate of Actual Costs Through End of Phase Amount ($), Total Phase Actual Spending Amount ($), and DSF reference Number(s). The substantial number of rows and influencing attributes enables ML algorithms to capture patterns in data more effectively, leading to more accurate predictions. Specific attributes, such as the type of construction work, planned and actual project end dates, and total costs, were considered particularly important for differentiating construction projects and allowing the algorithms to generate meaningful predictions.

3.2 Data Preprocessing

Data Preprocessing consists of Data Cleaning and Data Transformation phases that are conducted to ensure data quality, relevance, reliability, and compatibility with ML algorithms. 
In this phase, missing values, errors, and outliers that adversely impact the ML model’s performance are removed from the database. Moreover, the features or data
columns that are not applicable or relevant, and that would introduce noise and redundancy into the data, are removed. Finally, in the Data Transformation step, the dataset is converted through a rigorous workflow into a numeric format suitable for the ML models. This step is essential since ML algorithms typically operate on numerical data, as they rely on mathematical and statistical methods to identify patterns, learn from data, and make predictions. Non-numeric data, such as strings or textual categorical variables, may not be directly interpretable by algorithms, hindering their ability to discern relationships between variables and predict the target outcomes effectively.

3.3 ML Algorithms’ Description

In this subsection, the five ML algorithms used for the study are briefly introduced. Decision trees are a widely used non-parametric supervised learning algorithm for classification and regression tasks. They consist of a hierarchical structure with a root node, branches, internal nodes (decision nodes), and leaf nodes. Starting from the root node, the tree is constructed by recursively partitioning the data on the available features at the best split points found, creating homogeneous subsets represented by leaf nodes. The Decision Tree Regressor is employed for regression tasks, aiming to predict continuous numerical values. It recursively partitions the data into subsets, minimizing the variance of the target values within each subset [43, 44]. Linear regression is a widely employed ML algorithm that predicts a continuous output variable based on input variables. Often used as a benchmark, it fits a straight line or plane through the data to capture the underlying relationship between inputs and output. The objective is to estimate coefficients that effectively predict the output variable from input variables. These coefficients can then be used to make predictions for new input values. 
The method of Least Squares is commonly used to estimate the coefficients by minimizing the sum of squared errors between predicted and actual values [29]. Ridge Regression, a variation of linear regression, addresses multicollinearity issues caused by highly correlated input variables. It does so by adding a penalty term on the size of the coefficients to the least squares objective [45]. Artificial Neural Networks (ANNs) are deep learning models that mimic the structure and operation of biological neural networks and consist of interconnected layers of neurons that process and transmit information. The input layer receives raw data and passes it to the hidden layers. The hidden layer(s) perform transformations on the input values using weighted connections between nodes, where each weight represents the connection strength; the number of layers is chosen to prevent overfitting. Moreover, an activation function is applied to the weighted sum of a neuron’s inputs to determine its output, introducing non-linearity into the network. The output layer generates the final output values corresponding to the prediction of the response variable. The learning process involves adjusting the weights to minimize the error between the predicted and target outputs. This optimization is commonly achieved using the Gradient Descent algorithm, which updates the weights based on the partial derivatives of the error function with respect to each weight [9, 46, 47].
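The three estimation rules described above can be written compactly. The notation below is an assumption introduced for illustration (the paper does not define symbols): $x_i$ and $y_i$ are the inputs and target of project $i$, $\beta$ the regression coefficients, $\lambda \ge 0$ the ridge penalty strength, $w$ a network weight, $E$ the error function, and $\eta > 0$ the learning rate.

```latex
% Least squares: minimize the sum of squared errors
\hat{\beta}_{\mathrm{LS}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2

% Ridge regression: the same objective plus a penalty on coefficient size
\hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

% Gradient descent: each weight moves against the partial derivative of the error
w \leftarrow w - \eta \, \frac{\partial E}{\partial w}
```

Setting $\lambda = 0$ recovers ordinary least squares, while larger values shrink the coefficients, which is what stabilizes the estimate when input variables are highly correlated.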
Extreme Gradient Boosting (XGBoost) is a versatile ML algorithm for regression and classification tasks. It belongs to the family of Gradient Boosting Machines (GBMs), which combine weak learners, typically Decision Trees, to create a strong learner that can make accurate predictions. The boosting process builds Decision Trees sequentially, training each new weak learner on the residuals of the previous ones to correct their errors, with the aim of minimizing the overall loss function. XGBoost introduces enhancements and optimizations to improve performance and mitigate overfitting through an objective function consisting of a loss function and a regularization term. It employs gradient-based optimization, computing gradients to determine the best splits in each iteration. It utilizes column and row block data structures for efficient tree construction, handles sparse data using a sparsity-aware learning algorithm, captures both linear and non-linear relationships in the data, and incorporates regularization to enhance model robustness [36, 45].

3.4 Model Development

Five models based on the five abovementioned algorithms are developed. The target variables are “Week Delays” and “Total Phase Actual Spending Amount”; the precision with which an algorithm predicts them indicates its performance. The dataset is divided into training and test sets in an 80%–20% proportion, so that the model can be trained on one subset and evaluated on another, providing an unbiased assessment of its performance. The primary goal of model training is to learn the underlying patterns and relationships between the predictor and target variables through an iterative process in which the algorithm minimizes prediction errors by adjusting its internal parameters, while testing enables the evaluation of the model’s generalizability to unseen data. 
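The residual-fitting idea behind gradient boosting described in Sect. 3.3 can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions, not XGBoost itself and not the authors' code: it uses one input feature, depth-one trees (stumps), squared-error loss, and no regularization term. The function names, the learning rate, and the toy data are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-one regression tree: one threshold, one mean per side."""
    best = None
    # Try every distinct value except the largest, so both sides stay non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Toy data invented for illustration: one feature, a step-like target.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]

model = boost(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mean_squared_error(ys, [model(x) for x in xs])
```

Each round shrinks the training residuals a little, which is why the boosted error ends up below the mean-only baseline. XGBoost adds a regularized objective, gradient-based split finding, and the data structures mentioned in Sect. 3.3 on top of this basic loop.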
The input data is scaled and normalized before the training process for the ANN model, so that all features have a similar range and variance and no specific feature dominates the learning process. Several metrics, namely R-squared, Mean Absolute Error (MAE), Mean Squared Error (MSE), and cross-validation (CV) scores, are employed to evaluate the performance of the models in the testing process. R-squared, also known as the coefficient of determination, is a statistical measure typically ranging from 0 to 1, representing the proportion of the variance in the dependent variable that can be explained by the model’s independent variables. Mean Absolute Error (MAE) measures the average magnitude of the errors in the predicted values, irrespective of their direction, while Mean Squared Error (MSE) calculates the average squared difference between actual and predicted values. Cross-validation (CV) is a technique used to evaluate a model’s performance by partitioning the dataset into k subsets or folds. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold used as the test set once. Analyzing these metrics makes it possible to identify each model’s strengths and weaknesses and select the most suitable model. Moreover, using multiple evaluation techniques helps to mitigate the risk of overfitting and supports the selection of a model that generalizes well to new, unseen data. Ultimately, a thorough model testing process strengthens the reliability and validity of the findings of the study. The code for all five ML algorithms can be found in the GitHub link.
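The evaluation metrics defined above can be sketched directly from their definitions. The following is a minimal plain-Python illustration, not the authors' implementation; the function names and the sample values are hypothetical, and the fold generator uses contiguous folds without shuffling as a simplifying assumption.

```python
def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average squared difference."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical delays in weeks: three exact predictions and one 4-week miss.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.0, 4.0, 6.0, 12.0]

print(mae(actual, predicted))   # average error of 1 week
print(mse(actual, predicted))   # the single large miss is squared, giving 4
folds = list(k_fold_indices(5, 2))
```

The sample values show why MAE and MSE can rank models differently: one large miss contributes 4 to the sum of absolute errors but 16 to the sum of squared errors.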
3.5 Analytical Comparison of the ML Algorithms

As the last step, the results obtained from each algorithm are compared to identify the most accurate and best-performing one. The structure and data requirements of each algorithm are briefly stated and analyzed to identify the reasons behind their performance. This step is important because it gives researchers and practitioners in the construction industry a practical overview of how to select an ML algorithm based on the available risk data, database size, application scope and target, and computational complexity of each model. It enables decision-makers to utilize the developed model for estimating the potential delays and cost overruns associated with various projects, thereby allowing for better planning, resource allocation, and risk management. Furthermore, these insights can aid in identifying areas for improvement, optimizing processes, and enhancing overall project performance.
4 Results

The initial database obtained from the City of New York’s Open Data Portal contained 13570 rows of data. However, only 1489 records remained in the database after the data cleaning and filtering phase, the details of which are presented in the next subsection.

4.1 Data Preprocessing, Cleaning and Transformation

Upon inspection, several irrelevant columns were identified, such as “Project School Name,” “DSF Number(s),” “Project Building Identifier,” and “Project Type.” These attributes do not impact the target variable and may cause the ML model to fit the noise in the data rather than the underlying patterns. Rows (projects) containing missing values were dropped from the dataset, since they could introduce inaccuracies or biases in the predictions. The dataset was further filtered to only completed projects that had both actual end dates and total spending costs, enabling the calculation of the delay and cost overrun. This filtering significantly reduced the database size since most projects have “In-Progress” status. Projects with a negative or zero value in any of their time- or cost-related attributes were dropped as errors. Finally, duplicates were removed to maintain the uniqueness and integrity of the dataset. In order to convert date attributes to a suitable format for analysis, they were first transformed into Python datetime objects with the “%m/%d/%Y” format. Then, the difference between “Project Phase Actual End Date” and “Project Phase Planned End Date” was calculated as “Week Delay,” the difference between “Project Phase Actual End Date” and “Project Phase Actual Start Date” was calculated as “Week Duration,” and both were converted into numerical values.
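The date conversion described above can be sketched with the Python standard library. This is an illustrative sketch rather than the authors' code: the column names and the "%m/%d/%Y" format come from the text, while the helper names, the choice to express weeks as days divided by 7, and the sample record are assumptions.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # format named in the text


def weeks_between(start: str, end: str) -> float:
    """Return the signed number of weeks from `start` to `end`."""
    delta = (datetime.strptime(end, DATE_FORMAT)
             - datetime.strptime(start, DATE_FORMAT))
    return delta.days / 7  # assumption: fractional weeks, not rounded


def add_week_features(project: dict) -> dict:
    """Return a copy of `project` with 'Week Delay' and 'Week Duration' added."""
    enriched = dict(project)
    enriched["Week Delay"] = weeks_between(
        project["Project Phase Planned End Date"],
        project["Project Phase Actual End Date"],
    )
    enriched["Week Duration"] = weeks_between(
        project["Project Phase Actual Start Date"],
        project["Project Phase Actual End Date"],
    )
    return enriched


# Hypothetical record, invented for illustration only.
example = {
    "Project Phase Actual Start Date": "01/06/2020",
    "Project Phase Planned End Date": "06/01/2020",
    "Project Phase Actual End Date": "06/29/2020",
}
result = add_week_features(example)
```

For this sample record the phase finishes 28 days after the planned end date, so "Week Delay" is 4.0, and it runs 175 days in total, so "Week Duration" is 25.0.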
The categorical variables were transformed into numerical values using Label Encoding, which is convenient when the order of the categories is not essential. A similar procedure was conducted for the “Project Description” column, which usually contains different types of work separated by a delimiter and is not interpretable by ML algorithms. For this purpose, a one-hot encoding approach was applied, creating ten new columns, one for each specific work package, each taking the value 0 or 1. Finally, projects with no work package identified, with delays exceeding 80% of the duration, or with cost overruns of more than 75% of the total budget were dropped from the database as outliers that would deteriorate the accuracy and generalizability of the ML model. Following the data preprocessing steps, redundant columns such as ‘Project Status Name’ and ‘Project Description’ were removed, as their information had been effectively captured in the new binary variables. As a result of the data cleaning and data transformation phases, the database was reduced to 1489 rows with 17 columns, 10 of which were the work package columns created by one-hot encoding.

4.2 Summary of ML Algorithms’ Results

This subsection presents the outcomes from each ML algorithm, the analysis of which offers a comprehensive understanding of the respective algorithms’ performance in addressing the research problem.

Delay Prediction. Table 1 presents the results obtained from each ML algorithm when predicting the delay. The performance of each algorithm is assessed using four metrics: R2, Mean of Cross-Validation (CV), MSE, and MAE. Based on the obtained results, the XGBoost model outperforms the other algorithms due to its robust gradient-boosting framework that combines multiple decision trees, enhancing accuracy and reducing overfitting. 
This results in the highest R-squared value (0.91), indicating close agreement between the predicted and observed values, as well as the lowest Mean Squared Error (45.77) and Mean Absolute Error (3.5 weeks), signifying superior prediction accuracy. The Decision Tree model follows with slightly lower performance (R-squared of 0.86), which can be attributed to its single-tree structure, making it more prone to overfitting than the XGBoost model. It is worth noting that the ANN model has the highest mean of CV scores (0.78, against 0.75 for XGBoost), indicating an acceptable level of generalizability. However, on the other performance metrics (R-squared, MSE, and MAE), the XGBoost model still outperforms the ANN model. The Linear and Ridge Regression models have the lowest performance, since they assume a linear relationship between the predictors and the target variable, which does not appear to hold for this dataset. Consequently, they do not effectively capture the underlying patterns in the data, leading to reduced prediction accuracy. Figure 2 presents the accuracy of delay risk prediction for the Decision Tree, ANN, and XGBoost algorithms, respectively. Figure 3 presents the feature importance for the same algorithms.
A. Khodabakhshian et al.
Fig. 2. (a). DT prediction accuracy for delays (b). ANN prediction accuracy for delays (c). XGBoost prediction accuracy for delays
Table 1. Comparison of the ML algorithms’ results for delay prediction
Evaluation    DT      ANN      LR       RR       XGBoost
R2            0.86    0.73     0.71     0.71     0.91
Mean of CV    0.68    0.78     0.53     0.53     0.75
MSE           68.1    128.9    145.61   145.3    45.77
MAE           3.68    5.32     8.36     8.34     3.50
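For readers unfamiliar with the metrics reported in Table 1, the following is a minimal sketch of how MSE, MAE, and R-squared are conventionally defined, written in plain Python. The function names and the toy values are hypothetical and are not taken from the study; the cross-validation mean is omitted because it depends on the fold setup, which the text does not specify.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy delays in weeks (illustrative values only).
observed = [2, 4, 6, 8]
predicted = [3, 4, 5, 10]

print(mse(observed, predicted))        # 1.5
print(mae(observed, predicted))        # 1.0
print(r_squared(observed, predicted))  # approximately 0.7
```

When the target is measured in weeks, MAE is in weeks and MSE in squared weeks, which is why the two rows of Table 1 are on different scales.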
Fig. 3. (a). Feature importance for DT model (b). Feature importance for ANN model (c). Feature importance for XGBoost model
Total Cost Prediction. For total cost prediction, the XGBoost model again demonstrates the best performance, as presented in Table 2, achieving the highest R-squared value and the lowest error rates. The gap between the DT and XGBoost models is much wider in MSE (nearly a factor of nine) than in MAE (a factor of about 1.7), which can be explained by the characteristics of these error metrics. The MSE emphasizes larger errors by squaring the differences between predicted and observed values. In this case, the Decision Tree model may have a few large errors that are heavily penalized by the MSE metric. In contrast, the MAE averages the absolute differences between predicted and observed values, weighting each error in proportion to its size. The better MAE of the XGBoost model indicates that, on average, its predictions are closer to the actual values, making it the more accurate model overall for this variable. The Linear and Ridge Regression models yielded better results for predicting total costs than for predicting delays. Because these models assume a linear relationship between variables, and the relationship between the input variables and the “Total Phase Actual Spending Amount ($)” appears to be predominantly linear, they captured the patterns in the data more effectively than the ANN model. Figure 4 presents the accuracy of total cost prediction for the Decision Tree, ANN, and XGBoost algorithms, respectively. Overall, the XGBoost model emerges as the most accurate algorithm for predicting both delays and total costs, which can be attributed to its gradient-boosting framework and its ability to handle complex datasets.
Table 2. Comparison of the ML algorithms’ results for total cost prediction
Evaluation    DT             ANN            LR            RR            XGBoost
R2            0.88           0.83           0.97          0.97          0.98
Mean of CV    0.83           0.97           0.84          0.84          0.97
MSE           30238740540    42462464672    6103773414    6102901345    3465322264
MAE           36972          75820          38064         38127         22166
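The different sensitivity of MSE and MAE to large errors, discussed above, can be illustrated with a small sketch. The two error lists below are hypothetical and chosen only so that both have the same MAE; they do not come from the study.

```python
def mae_of(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def mse_of(errors):
    """Mean squared error of a list of residuals."""
    return sum(e ** 2 for e in errors) / len(errors)


# Two hypothetical models with the same total absolute error.
evenly_spread = [2, 2, 2, 2]   # every prediction is off by 2
one_large_miss = [0, 0, 0, 8]  # three exact predictions, one large miss

print(mae_of(evenly_spread), mse_of(evenly_spread))    # 2.0 4.0
print(mae_of(one_large_miss), mse_of(one_large_miss))  # 2.0 16.0
```

Both lists have an MAE of 2.0, but the single large miss raises the MSE fourfold. A model whose MSE is disproportionately worse than its MAE, relative to another model, is therefore likely to be making a small number of large errors.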
Fig. 4. (a). DT prediction accuracy for total costs prediction (b). ANN prediction accuracy for total costs prediction (c). XGBoost prediction accuracy for total costs prediction
5 Conclusion
This study set out to use ML algorithms to automate and optimize risk prediction within construction projects. Using a case study based on a database of NYC school construction projects, five ML algorithms were trained and critically evaluated. The overarching goal was to gain
insights into the comparative performance of these algorithms and their competitive advantage over traditional RM methods, specifically concerning their precision in predicting delays and cost overruns in construction projects.
The backbone of this research was the quality and volume of the data: the case study provided a wide range of project attributes, from project timelines to budgetary considerations. This data abundance not only strengthened the ML models but also lent depth to the risk analyses, allowing for more nuanced insights and more precise predictions than manual and traditional RM methods could achieve. Moreover, the data acquisition cost is low and the prediction process is far faster than with conventional RM methods. Therefore, in practical terms and time-wise, it is not feasible to run the same kind of simulations with conventional methods, such as the P-I matrix, on such a vast database to validate the findings. ML algorithms work best when abundant data is available, whereas conventional models follow a case-based approach in which data abundance overcomplicates the prediction process.
Five ML algorithms were employed to predict project delays and total costs: Decision Tree, Artificial Neural Network (ANN), XGBoost, Linear Regression, and Ridge Regression. The performance of each model was assessed using R-squared, CV scores, MSE, and MAE. The XGBoost model consistently demonstrated superior performance in predicting both delays and total costs. For delay prediction, it was followed by the Decision Tree and ANN models, while the Linear and Ridge Regression models performed worst, as they assume linear relationships between predictors and target variables, which did not hold for the delay data. For total cost prediction, by contrast, the Linear and Ridge Regression models ranked second to XGBoost in R-squared and MSE (Table 2). The outstanding performance of XGBoost, in particular relative to the ANN, can be attributed to several factors, including:
• Model complexity: The ANN (an MLP regressor) uses a fixed architecture with a predefined number of layers and nodes. This architecture might not be optimal for the specific problem at hand, whereas the XGBoost model can better adapt to complex data patterns due to its gradient boosting framework, which combines multiple decision trees, allowing it to capture non-linear relationships more effectively.
• Training process: The ANN model relies on gradient-based optimization techniques, such as backpropagation, which are sensitive to the choice of hyperparameters, including learning rate, activation functions, and the number of hidden layers. In contrast, the XGBoost model uses a tree-based boosting method, which is generally less sensitive to hyperparameter choices and tends to converge more efficiently.
• Interpretability and Explainability: The ANN model is often considered a “black box” due to its complex structure, making it difficult to understand and interpret its internal decision-making process. This lack of interpretability may hamper the ability to diagnose and improve the model’s performance. On the other hand, the XGBoost model is built upon decision trees, which are inherently more interpretable and allow for a better understanding of the relationships between the input attributes and the target variable.
• Regularization: The XGBoost model incorporates regularization techniques that penalize overly complex models, reducing overfitting and improving generalization.
The developed ML models serve as effective prediction tools for risk management in construction projects. By leveraging the models’ predictive power, project managers can proactively identify potential delays and cost overruns in ongoing projects by
simply inputting relevant variables. This capability allows stakeholders to make informed decisions, adjust project plans, and allocate resources more effectively to mitigate risks, ultimately leading to more successful project outcomes and better project management. As the results indicate, the choice of an appropriate ML algorithm depends on the nature and availability of data, the complexity of the problem to be solved, and the relationships between the input and target variables.
The database used in this study focused specifically on school construction projects in New York and is therefore shaped by local characteristics such as building codes, construction technologies, and regulations. Consequently, the results and the performance of the selected algorithms may not be directly applicable to other types of construction. This limitation arises from the context-dependent nature of risks and construction projects, which restricts the applicability of the developed model in other contexts or locations. Another limitation of this research was the lack of extensive open-access databases on construction risks that could serve as a benchmark for the obtained results. Moreover, the studied database contained many missing values and outliers, whose removal significantly reduced its size.
In general, ML algorithms offer a clear advantage in predicting cost overrun and delay risks in construction projects compared to the conventional RM methods traditionally employed in the industry. First and foremost, ML algorithms can process vast quantities of data at high speed and low data acquisition cost, uncovering intricate patterns and correlations that might be overlooked by human analysis. This data-driven approach supports a more accurate and proactive risk assessment, enabling stakeholders to anticipate potential challenges before they escalate.
Unlike conventional methods, which often rely on subjective judgment and static historical records, ML models can be retrained as new data arrives, so that their predictions remain relevant and are updated on a frequent basis. Additionally, ML algorithms can integrate diverse data sources for better generalizability and accuracy. This holistic perspective, coupled with the adaptability of ML, translates to more informed decision-making, reduced project uncertainties, and a higher likelihood of project completion within budget and time constraints.
On the other hand, ML-based models also come with certain disadvantages compared with conventional risk management methods. One notable limitation is the heavy reliance on data quality and quantity; inaccurate or insufficient data can lead to misleading predictions, potentially exacerbating project risks. Additionally, ML models often operate as “black boxes”, making their decision-making processes opaque and challenging for stakeholders to interpret. This lack of transparency can hinder trust and adoption by industry professionals accustomed to transparent, heuristic-based conventional methods. Furthermore, implementing ML solutions requires specialized expertise and infrastructure, potentially leading to increased initial costs and a steeper learning curve. Unlike traditional methods, which often factor in experiential knowledge and intuitive insights, ML models might miss nuanced, context-specific information that does not translate well into data. Thus, while ML holds promise, it is essential to recognize and navigate its limitations in the RM context.
In light of this study, several avenues for future research can be pursued. One direction is to explore the potential of ensemble models, combining multiple algorithms to leverage their individual strengths and improve prediction accuracy and reliability for
construction project delays and cost overruns. Another suggestion is to broaden the scope of research by including datasets from various regions, industries, and project types to assess the generalizability of findings and develop more robust models applicable to a broader range of construction projects. Additionally, incorporating additional input variables, such as project management practices and team experience, could provide a more comprehensive understanding of factors influencing project delays and cost overruns, leading to improved predictive capabilities. As the available data for these projects continues to grow, comparing the performance of ML algorithms across diverse datasets could yield new insights and improve prediction accuracy in different contexts.
References 1. Priyadarshini, L., Roy, P.: Risk assessment and management in construction industry. In: Das, B.B., Gomez, C.P., Mohapatra, B.G. (eds.) Recent Developments in Sustainable Infrastructure (ICRDSI-2020)—Structure and Construction Management. LNCE, vol. 221, pp. 539–556. Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-8433-3_46 2. Ansari, R., Khalilzadeh, M., Taherkhani, R., Antucheviciene, J., Migilinskas, D., Moradi, S.: Performance prediction of construction projects based on the causes of claims: a system dynamics approach. Sustainability 14(4138) (2022) 3. Khodabakhshian, A., Re Cecconi, F.: Data-driven process mining framework for risk management in construction projects. In: IOP Conference Series: Earth and Environmental Science, World Building Congress 2022, vol. 1101, p. 032023 (2022) 4. Eybpoosh, M., Dikmen, I., Talat Birgonul, M.: Identification of risk paths in international construction projects using structural equation modeling. J. Constr. Eng. Manag. 137(12), 1164–1175 (2011) 5. Chattapadhyay, D.B., Putta, J., Rama Mohan Rao, P.: Risk identification, assessments, and prediction for mega construction projects: a risk prediction paradigm based on cross analytical-machine learning model. Buildings 11(4) (2021) 6. Fan, C.L.: Defect risk assessment using a hybrid machine learning method. J. Constr. Eng. Manag. 146(9) (2020) 7. Rongchen, Z., Xiaofeng, H., Jiaqi, H., Xin, L.: Application of machine learning techniques for predicting the consequences of construction accidents in China. Process Saf. Environ. Prot. 145, 293–302 (2020) 8. Oztemel, E., Gursev, S.: Literature review of Industry 4.0 and related technologies. J. Int. Manuf. 31(1), 127–82 (2020) 9. Akinosho, T.D., Oyedele, L.O., Bilal, M., Ajayi, A.O., Delgado, M.D., Akinade, O.O., et al.: Deep learning in the construction industry: a review of present status and future innovations. J. Build. Eng. 32(101827) (2020) 10. 
Jin, R., Zuo, J., Hong, J.: Scientometric review of articles published in ASCE’s journal of construction engineering and management from 2000 to 2018. J. Constr. Eng. Manag. 145(8), 06019001 (2019) 11. Gumte, K.M., Pantula, P.D., Miriyala, S.S., Mitra, K.: Data driven robust optimization for supply chain planning models. In: 2019 Sixth Indian Control Conference (ICC), Hyderabad, India, pp. 218–23 (2019) 12. Maphosa, V., Maphosa, M.: Artificial intelligence in project management research: a bibliometric analysis. J. Theor. Appl. Inf. Technol. 100(16), 5000–5012 (2022) 13. Regona, M., Yigitcanlar, T., Xia, B., Li, R.Y.M.: Artificial intelligent technologies for the construction industry: how are they perceived and utilized in Australia? J. Open Innov. Technol. Mark. Complex. 8(1) (2022)
Machine Learning Application in Construction Delay
239
14. Kozlovska, M., Klosova, D., Strukova, Z.: Impact of industry 4.0 platform on the formation of construction 4.0 concept: a literature review. Sustainability 13(2638) (2021) 15. Okudan, O., Budayan, C., Dikmen, I.: A knowledge-based risk management tool for construction projects using case-based reasoning. Expert Syst. Appl. 173 (2021) 16. Project Management Institute (PMI): A Guide to the Project Management Body of Knowledge (PMBOK Guide), 6th edn. Project Management Institute, Pennsylvania (2017) 17. Alam, M., Phung, V.M., Zou, P.X.W., Sanjayan, J.: Risk identification and assessment for construction and commissioning stages of building energy retrofit projects. In: Proceedings of the 22nd International Conference on Advancement of Construction. Management and Real Estate, CRIOCM 2017, pp. 753–62 (2019) 18. Keshk, A.M., Maarouf, I., Annany, Y.: Special studies in management of construction project risks, risk concept, plan building, risk quantitative and qualitative analysis, risk response strategies. Alex. Eng. J. 57(4), 3179–3187 (2018) 19. Iromuanya, C., Hargiss, K.M., Howard, C.: Critical risk path method: a risk and contingencydriven model for construction procurement in complex and dynamic projects. In: Transportation Systems and Engineering: Concepts, Methodologies, Tools, and Applications, Hershey, PA, pp. 572–84, IGI Global (2015) 20. Sruthi, M.D., Aravindan, A.: Performance measurement of schedule and cost analysis by using earned value management for a residential building. Mater. Today Proc. 33, 524–532 (2020) 21. Yildiz, A.E., Dikmen, I., Birgonul, M.T., Ercoskun, K., Alten, S.: A knowledge-based risk mapping tool for cost estimation of international construction projects. Autom. Constr. 43, 144–155 (2014) 22. Asadi, A., Alsubaey, M., Makatsoris, C.: A machine learning approach for predicting delays in construction logistics. Int. J. Adv. Logist. 4(2), 115–130 (2015) 23. 
Darko, A., Chan, A.P.C., Adabre, M.A., Edwards, D.J., Hosseini, M.R., Ameyaw, E.E.: Artificial intelligence in the AEC industry: scientometric analysis and visualization of research activities. Autom. Constr. 112(103081) (2020) 24. Pan, Y., Zhang, L.: Roles of artificial intelligence in construction engineering and management: a critical review and future trends. Autom. Constr. 122(103517) (2021) 25. Hon, C.K.H., Sun, C., Xia, B., Jimmieson, N.L., Way, K.A., Wu, P.P.Y.: Applications of Bayesian approaches in construction management research: a systematic review. Eng. Constr. Archit. Manag. 29(5), 2153–2182 (2020) 26. Chen, L., Fong, P.S.W.: Revealing performance heterogeneity through knowledge management maturity evaluation: a capability-based approach. Expert Syst. Appl. 39, 13523–13539 (2012) 27. Leung, M., Shan, Y., Chan, I., Dongyu, C.: An enterprise risk management knowledge-based decision support system for construction firms. Eng. Constr. Archit. Manag. 18(3), 312–328 (2011) 28. Arabi, S., Eshtehardian, E., Shafiei, I.: Using Bayesian networks for selecting risk-response strategies in construction projects. J. Constr. Eng. Manag. 148(8), 1–19 (2022) 29. Sanni-Anibire, M.O., Zin, R.M., Olatunji, S.O.: Machine learning - based framework for construction delay mitigation. J. Inf. Technol. Constr. 26, 303–318 (2021) 30. Zhong, B., Pan, X., Love, P.E.: Hazard analysis: a deep learning and text mining framework for accident prevention. Adv. Eng. Inform. 46(101152) (2020) 31. Choi, S.J., Choi, S.W., Kim, J.H., Lee, E.B.: AI and text-mining applications for analyzing contractor’s risk in invitation to bid (ITB) and contracts for engineering procurement and construction (EPC) projects. Energies 14(15) (2021) 32. Gondia, A., Siam, A., El-Dakhakhni, W., Nassar, A.H.: Machine learning algorithms for construction projects delay risk prediction. J. Constr. Eng. Manag. 146(1) (2020)
240
A. Khodabakhshian et al.
33. Van Liebergen, B.: Machine learning: a revolution in risk management and compliance? J. Financial Transform. 45, 60–67 (2017) 34. Erzaij, K.R., Burhan, A.M., Hatem, W.A., Ali, R.H.: Prediction of the delay in the portfolio construction using naïve Bayesian classification algorithms. Civ. Environ. Eng. 17(2), 673– 680 (2021) 35. Sanni-Anibire, M.O., Zin, R.M., Olatunji, S.O.: Machine learning model for delay risk assessment in tall building projects. Int. J. Constr. Manag. 22(11), 2134–2143 (2022) 36. Elmousalami, H.H.: Artificial intelligence and parametric construction cost estimate modeling: state-of-the-art review. J. Constr. Eng. Manag. 146(1) (2020) 37. Hameed Memon, A., Abdul Rahman, I., Abdul Aziz, A.A., Abdullah, N.H.: Using structural equation modelling to assess effects of construction resource related factors on cost overrun. World Appl. Sci. J. 21, 6–15 (2013) 38. Rafiei, M.H., Adeli, H.: Novel machine-learning model for estimating construction costs considering economic variables and indexes. J. Constr. Eng. Manag. 144(12), 1–9 (2018) 39. Islam, M.S., Nepal, M.P., Skitmore, M., Attarzadeh, M.: Current research trends and application areas of fuzzy and hybrid methods to the risk assessment of construction projects. Adv. Egn. Inform. 33, 112–131 (2017) 40. Kamari, M., Ham, Y.: AI-based risk assessment for construction site disaster preparedness through deep learning-based digital twinning. Autom. Constr. 134(104091) (2022) 41. Anysz, H., Apollo, M., Grzyl, B.: Quantitative risk assessment in construction disputes based on machine learning tools. Symmetry 13(5) (2021) 42. Bilal, M., Oyedele, L.O., Qadir, J., Munir, K., Ajayi, S.O., Akinade, O.O., et al.: Big Data in the construction industry: a review of present status, opportunities, and future trends. Adv. Eng. Inform. 30(3), 500–521 (2016) 43. 
Mistikoglu, G., Halil, I., Erdis, E., Usmen, P.E.M., Cakan, H.: Expert systems with applications decision tree analysis of construction fall accidents involving roofers. Expert Syst. Appl. 42(4), 2256–2263 (2015) 44. Ferreira, D.R., Vasilyev, E.: Using logical decision trees to discover the cause of process delays from event logs. Comput. Ind. 70, 194–207 (2015) 45. Khodabakhshian, A., Rampini, L., Vasapollo, C., Panarelli, G., Cecconi, F.R.: Application of machine learning to estimate retrofitting cost of school buildings. In: Krüger, E.L., Karunathilake, H.P., Alam, T. (eds.) Resilient and Responsible Smart Cities. ASTI, pp. 215–228. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-20182-0_16 46. Mohammadfam, I., Soltanzadeh, A., Moghimbeigi, A., Savareh, B.A.: Use of artificial neural networks (ANNs) for the analysis and modeling of factors that affect occupational injuries in large construction industries. Electron. Physician 7(7), 1515–1522 (2015) 47. Re Cecconi, F., Moretti, N., Tagliabue, L.C.: Application of artificial neutral network and geographic information system to evaluate retrofit potential in public school buildings. Renew. Sustain. Energy Rev. 110, 266–277 (2019)
Process Mining in a Line Production
Cristina Santos1, Joana Fialho1,3(B), Jorge Silva1,2, and Teresa Neto1
1 Polytechnique Institute of Viseu, Viseu, Portugal
[email protected]
2 Huf-Group, Viseu, Portugal
3 CI&DEI, Viseu, Portugal
Abstract. The search for more efficient strategies, cost savings, time optimization, and productivity gains is among the main goals of any successful company. Process mining arises in this context and, although it is not a new concept, its expansion and applicability in the market have recently become notable. From an extensive set of data recorded over time, it is possible to determine the real state of a company’s processes, diagnose failures, and improve the efficiency of these processes. This paper describes a process mining project developed at the company Huf Portuguesa. The machines of a production line record thousands of data items. For a specific production line, data was collected over a time range, cleaned, and processed. Process mining techniques allowed the discovery and analysis of the real state of the production line. All paths were detected, and each path was analyzed individually. Conformance was also analyzed.
Keywords: Production line · Process mining · Process mining tools
1 Introduction
The race for privileged positions in the business world results in constant pressure on organizations. Authenticity, credibility, and leadership are the main goals and motivations of any successful company, leading to a pursuit of business process innovation. This innovation goes beyond simple data analysis or productivity levels; it requires a much deeper approach capable of acting in various areas, improving methods and processes. The search for efficient solutions drives the rise of information systems (IS) that are increasingly tailored to reality. We are in the digital age, where social networks, rankings, data collection and manipulation, and access to algorithms for personal benefit are becoming more and more prevalent. Currently, there are few activities that do not leave any kind of record. With such an extensive and comprehensive collection of data, there is an opportunity to process it and draw relevant conclusions. The events recorded daily in organizations store the information needed to understand, after proper manipulation, the actual behavior of the people and/or processes in those organizations. Manipulating this information can also help identify potential points of failure and deviations.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 241–257, 2024. https://doi.org/10.1007/978-3-031-54053-0_18
C. Santos et al.
This understanding is crucial so that, after a correct analysis, the best decisions can be made to lead to effective improvements.
Huf Portuguesa, a member of the German Huf Group, headquartered in Tondela, Portugal, is an automotive company that produces components for automobiles, such as locks, keys, steering column locks, door handles, and rear door emblem handles with a vision camera. The organization has a diversity of processes in different departments and various information systems (IS) that record data during the execution of these processes. This study focused on a specific process: the series production line of components manufactured by the company. The recorded data (event logs) can be studied and analyzed to detect routines, patterns, and preferences, allowing processes to be improved or redesigned. This technique of discovering, monitoring, and improving real processes by extracting knowledge from data available in information systems is known as process mining.
Process mining can be defined as a technique for extracting information from an event log containing relevant data about the activities performed by an organization or system in a business area. It involves the use of Data Mining combined with process modeling and analysis. The objective of process mining is, therefore, the automatic discovery of a process model by observing events recorded by certain corporate systems.
The paper is organized as follows: in addition to this introductory section, Sect. 2 briefly discusses the area of process mining, Sect. 3 describes the analyzed Huf process, and Sect. 4 deals with the compliance verification. Section 5 concludes the paper.
2 Process Mining
Organizing information is necessary since, as cited in [1], “organizations have to confront uncertainty and disorderly events coming from both the interior and the exterior while still providing a clear, operational, and well-defined conceptual scheme for the participants.” The potential and importance of Information Technologies (IT) and Information Systems (IS) within organizations are well established. The development of IT makes it possible to introduce changes in business and to respond to the evolution of markets. Companies face the challenge of analyzing massive amounts of data gathered daily (Big Data). Big Data is commonly characterized by three features: Volume, Velocity, and Variety. Demchenko et al. (2013) proposed a wider definition of Big Data as the 5 Vs: Volume, Velocity, Variety and, additionally, Value and Veracity [2]. The volume of data, the speed at which it is recorded, its variety, its veracity, and its value reflect the scenario experienced by organizations [3]. Process Mining has changed this area by enabling the fusion of machine learning algorithms and Big Data concepts. It has been described as “the Artificial Intelligence revolution that drives process excellence” [4]. The activities performed by people, machines and software leave traces in so-called event logs [5]. Process mining techniques use these logs to discover, analyze and improve business processes [6]. Process Mining focuses on acquiring specialized information regarding corporate processes. By using the available event logs, it is possible to analyze various factors,
such as associating patterns, identifying the most common paths, exceptions, error-prone regions, what triggered a decision, and who made it [7]. Process Mining can be interpreted as the intersection of two areas, namely data science and process science, as shown in Fig. 1. Data science encompasses the study of data and of algorithms for problem-solving, including data extraction, preparation, and transformation. Process science “is the interdisciplinary study of processes aimed at understanding, influencing, and designing processes” [8]. In other words, it is a generic term for a broad discipline that combines IT with knowledge of management sciences, with its main focus on processes [9].
Fig. 1. Process mining [9]
Figure 2 highlights the phases of process mining. Initially, the extraction of data from Information Systems (IS) is crucial since it forms the basis of the entire process. However, it may be necessary to explore data before applying process mining.
Fig. 2. Process mining stages [9]
After the data collection, it is essential to explore its usefulness, select relevant information, and proceed with data treatment and cleansing. The data must contain
three essential fields (case id, activity, timestamp) to enable its analysis. The case id identifies the process instance (case) to which each event belongs, the activity names the step that was recorded, and the timestamp places the event on the timeline so that the sequence of activities can be detected (see Fig. 3). With the filtered data, the first of the three stages of process mining begins: discovery. This stage involves organizing all data and identifying the real paths of the process. The second stage of process mining is the conformance/compliance verification. Here, it is verified whether the real events follow the predicted path/model. In this phase, a diagnosis can be drawn, leading to the final stage of process mining: applying improvements.
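As a rough illustration of the discovery and conformance stages described above, the following plain-Python sketch groups events by case id, orders them by timestamp, and counts the distinct paths (variants). The event rows, machine names, and expected path are hypothetical; real process mining tools apply far more elaborate discovery algorithms than this simple variant count.

```python
from collections import Counter, defaultdict

# Hypothetical event log rows: (case id, activity, ISO 8601 timestamp).
events = [
    ("P1", "Machine A", "2023-01-01T08:00:00"),
    ("P1", "Machine B", "2023-01-01T08:05:00"),
    ("P1", "Machine C", "2023-01-01T08:09:00"),
    ("P2", "Machine A", "2023-01-01T08:01:00"),
    ("P2", "Machine C", "2023-01-01T08:07:00"),
    ("P3", "Machine B", "2023-01-01T08:12:00"),  # rows may arrive out of order
    ("P3", "Machine A", "2023-01-01T08:02:00"),
    ("P3", "Machine C", "2023-01-01T08:15:00"),
]


def discover_variants(rows):
    """Count how many cases follow each distinct sequence of activities."""
    traces = defaultdict(list)
    for case_id, activity, timestamp in rows:
        traces[case_id].append((timestamp, activity))
    variants = Counter()
    for steps in traces.values():
        # ISO 8601 strings of equal length sort chronologically.
        path = tuple(activity for _, activity in sorted(steps))
        variants[path] += 1
    return variants


variants = discover_variants(events)

# Simple conformance check: how many cases follow the expected path exactly?
expected_path = ("Machine A", "Machine B", "Machine C")
conforming = variants[expected_path]
total_cases = sum(variants.values())

for path, count in variants.most_common():
    print(count, " -> ".join(path))
print(f"{conforming} of {total_cases} cases conform to the expected path")
```

In this toy log, two cases follow the expected path and one skips a machine; deviations of that kind are what the conformance stage is meant to surface.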
Fig. 3. Data collection example
The main advantage of implementing process mining techniques is that they optimize internal company processes and make data profitable. By using the data collected daily, it is possible to generate flowcharts that represent the processes exactly as they occur. Process mining provides a comprehensive and informed view of the actual state of the company, which aids in identifying improvement opportunities, making decisions, increasing productivity, and saving time. Acting based on real data is more reliable than relying solely on studies and assumptions. Process mining can be applied in any area, provided there is a constant, time-stamped record of data. Its applicability provides better visibility into processes to meet the expectations of the company and its potential customers. Time optimization, cost reduction, improved outcomes, and increased productivity are particularly attractive in the industrial sector. In the financial services and banking industry, process mining is a way to gain transparency over processes and financial transactions, detecting issues that may cause losses. The healthcare sector is equally attracted by the possibility of understanding the patient’s journey, improving hospitals’ operational efficiency, and enhancing diagnoses [4]. Several software tools are available for applying process mining algorithms, some open source (e.g. ProM and Apromore) and others commercial (e.g. Disco, Celonis, and ProcessGold, among others). In the literature, ProM [10], Disco [11] and Celonis [12] are the most used. In this experiment, ProM and Celonis were used.
Process Mining in a Line Production
The development of Celonis began in 2011, and today it stands as one of the leading tools in the market, serving globally recognized clients such as Siemens, Uber, Cisco, and Vodafone. One of the significant advantages of Celonis is its real-time integration with Relational Database Management Systems (RDBMS). It also allows event logs to be imported from traditional file formats such as .csv and .xls. Celonis is a cloud-based process mining tool, so it does not require any installation on the user’s computer. It is a paid commercial tool, but it offers some basic features for free; the free version was used in this project.
ProM is an open-source, extensible framework that supports a wide range of process mining algorithms as plug-ins. It is a widely used tool [6]. The framework is flexible with respect to input and output formats, as it supports several of them, e.g. Petri nets and social networks [13]. Plug-ins can be used in a variety of ways and combined to be applied in real-life situations [10]. The software offers more than 1500 plug-ins [14]. It is an independent platform developed at Eindhoven University of Technology by a research group led by Wil van der Aalst [15]. The group actively invites researchers to contribute to the creation and development of new plug-ins, enriching the tool and maintaining the existing ones.
3 Production Line Process
The case study relates to a specific production line where the pieces of a product follow a sequential path from one machine to another. Each machine performs its operation and records data regarding the passage of the piece. Ideally, the first record of the piece occurs at the first machine of the line, the piece goes through all the machines sequentially, and it finally passes through the last machine for validation. The BPMN (Business Process Model and Notation) model represents processes in a standardized way through standard icons. BPMN diagrams use specific symbols and elements to represent the activities, events, gateways, and flows within a process, making it easier for stakeholders to understand and analyze complex processes. Figure 4 shows a BPMN model of the entire production line layout. The notation of this diagram identifies the machines and describes the process logic and the flow of activities. This is an older production line and, although it is composed of a larger number of machines, it was not possible to collect data from all of them: the line includes machines from other suppliers whose data is not accessible. The BPMN model in Fig. 5 corresponds to the sequence of machines that provide records. The data was provided as CSV files and cleaned in Python with functions from the Pandas library. To ensure a more realistic analysis, all records that did not contain data from the first or second machine and from the last machine were excluded. In other words, only nearly complete records of the production line were considered for the study.
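The cleaning rule described above can be sketched in Pandas as follows. This is only an illustration under stated assumptions: the machine labels `FIRST`, `SECOND` and `LAST`, the helper `keep_nearly_complete` and the toy data are hypothetical, and the rule is read here as "keep a piece only if it was recorded at the first or second machine and also at the last machine". The authors' actual script is not given in the paper.

```python
import pandas as pd

# Hypothetical machine labels; the real line layout is confidential.
FIRST, SECOND, LAST = "B", "G", "L"


def keep_nearly_complete(log: pd.DataFrame) -> pd.DataFrame:
    """Keep only pieces recorded at (FIRST or SECOND) and at LAST."""
    # One set of visited machines per serial number.
    machines = log.groupby("serial_number")["operation"].apply(set)
    ok = machines.apply(lambda m: bool(m & {FIRST, SECOND}) and LAST in m)
    return log[log["serial_number"].isin(ok[ok].index)]


# Toy records, invented for illustration only.
raw = pd.DataFrame(
    {
        "serial_number": ["P1", "P1", "P1", "P2", "P2", "P3", "P3"],
        "operation": ["B", "G", "L", "G", "L", "H", "L"],
    }
)
clean = keep_nearly_complete(raw)
kept = sorted(clean["serial_number"].unique())
print(kept)
```

In this toy input, piece P3 is dropped because it has no record at either of the first two machines.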
C. Santos et al.
Fig. 4. Production line layout
Fig. 5. Layout used
3.1 Event Logs
Event logs typically correspond to an extensive database with a vast amount of data. In this study, the final dataset consists of 25 595 cases (number of case ids under analysis) and 181 461 activities (number of records). The data was collected in two periods: from May 22nd (06:02:57) to May 27th (05:53:38) and from May 29th (05:59:59) to May 30th (16:41:34). The information and fields provided by the information systems are generally more extensive than necessary for a particular analysis. In this specific project, the following fields were considered:
• part_number: It identifies the reference of the produced part. Each reference is composed of multiple records of serial_number.
• serial_number: It represents the serial number of each part, a unique value that corresponds to the case id of the event log.
• operation: It identifies the machine that produced the record and corresponds to the activity of the event log.
• transaction_timestamp: It records the date and time when the activity was executed, essentially serving as the timestamp of the event log.
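As a minimal sketch of how the four fields above map onto the three event-log roles, the following Pandas snippet renames the columns, orders the events in time and derives one trace per serial number. The target names `case_id`, `activity` and `timestamp` and all sample values are assumptions made for illustration; they are not taken from the study's data.

```python
import pandas as pd

# Toy extract using the four fields listed above; all values are invented.
raw = pd.DataFrame(
    {
        "part_number": ["R1", "R1", "R1", "R2", "R2"],
        "serial_number": ["S1", "S1", "S2", "S3", "S3"],
        "operation": ["G", "B", "G", "B", "G"],
        "transaction_timestamp": [
            "2024-01-01 06:05:00",
            "2024-01-01 06:00:00",
            "2024-01-01 06:07:00",
            "2024-01-01 06:10:00",
            "2024-01-01 06:12:00",
        ],
    }
)

# Map the source fields onto the three event-log roles.
event_log = raw.rename(
    columns={
        "serial_number": "case_id",
        "operation": "activity",
        "transaction_timestamp": "timestamp",
    }
)
event_log["timestamp"] = pd.to_datetime(event_log["timestamp"])
event_log = event_log.sort_values(["case_id", "timestamp"]).reset_index(drop=True)

n_cases = event_log["case_id"].nunique()  # distinct serial numbers
n_events = len(event_log)                 # one row per recorded activity
traces = event_log.groupby("case_id")["activity"].apply(list).to_dict()
print(n_cases, n_events, traces)
```

Applied to the figures reported above, the same two counts give 181 461 / 25 595, i.e. roughly 7.09 records per case on average.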
3.2 Discovery
The discovery phase is responsible for identifying the actual behavior of the process. It involves identifying the most frequent paths and unusual sequences. This phase was executed using the Celonis tool. The event logs were imported and the discovery process was performed, resulting in the real mapping of the process, as shown in Fig. 6. For each path, it was possible to associate the corresponding serial_number. The diagnostics and conclusions presented here are described in a way that safeguards any confidential information related to the company.
Fig. 6. All paths discovered
The numbers on the lines indicate the frequency of occurrences: the thicker the line, the higher the frequency/probability of following that path or passing through that machine. Figure 7 shows the most frequent path, which corresponds to about 72,7% of all paths. Notice that these event logs do not start at the first machine of the line (B). This happens because the first machine is for validation, which can also be done manually by an operator. In these cases the machine does not register anything, but this path is expected. The remaining paths are presented in descending order of occurrence percentage and divided into three groups. Group A consists of cases that are also expected to occur in the production line. Group B contains cases with a low percentage of occurrence that happen unexpectedly but are only process issues that do not interfere with quality. Finally, Group C corresponds to other identified cases that, despite representing low percentages, are significant to analyze because the company evaluates its quality rate in units per million.
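The two quantities discussed above, the share of each path and the frequency shown on each line of the process map, can be sketched in plain Python. Celonis computes these internally; the snippet below is only an illustration on invented traces, and the machine order B, G, H, J, K, L is an assumption.

```python
from collections import Counter

# Toy traces (ordered machine sequences per serial number); invented data.
traces = {
    "S1": ["G", "H", "J", "K", "L"],
    "S2": ["G", "H", "J", "K", "L"],
    "S3": ["B", "G", "H", "J", "K", "L"],
    "S4": ["G", "H", "J", "L"],
}

# A variant is a distinct sequence; its share is cases on it / all cases.
variants = Counter(tuple(t) for t in traces.values())
shares = {v: round(100 * n / len(traces), 2) for v, n in variants.items()}

# Directly-follows counts: how often machine a is immediately followed by b.
# These correspond to the numbers drawn on the lines of the map.
dfg = Counter((a, b) for t in traces.values() for a, b in zip(t, t[1:]))

top_variant, top_count = variants.most_common(1)[0]
print(top_variant, top_count, shares[top_variant])
print(dfg[("G", "H")], dfg[("J", "L")])
```

Sorting `variants` by count gives the descending order in which the remaining paths are presented.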
Fig. 7. Most frequent path
Group A. Table 1 displays the cases of Group A, with the frequency of occurrences and the percentage they represent.

Table 1. Variants of Group A
Cases     Frequency   Percentage
Case 1    2115        8,26%
Case 2    1163        4,54%
Case 3    952         3,71%

Case 1 in Group A represents the cases that start the process at the first machine of the production line (B). Although this is the desired path, out of the 25 595 cases, only 2 115 start the process at the first machine of the layout, for the reasons explained for the most frequent path. Additionally, as observed in Fig. 8, even though they start at the first machine, there is a division in the paths, which is further analyzed in cases 2 and 3.
Fig. 8. Case 1 (Group A)
Case 2 corresponds to the scenario where the pieces go directly to the 3rd machine (G) (see Fig. 9). This situation is explained by the existence of pieces in the production line that do not utilize this machine (G). Case 3 (see Fig. 10) represents the expected path, where the parts follow the production line sequentially, starting at the first machine of the layout, with no jumps. However, out of all the records analyzed, only 952 paths, approximately 3,71%, behave as the process was designed.
Fig. 9. Case 2 (Group A)
Group B. Table 2 displays the cases of Group B, their frequency of occurrences and percentage. Case 1 represents a set of 1139 paths that do not pass through machine K (see Fig. 11). These cases are common and are related to a process issue associated with component detection. The operator removes the piece from the production line, validates it visually, and then places it back on the line to proceed to the next machine. This results in the absence of a record at that machine. A specific registration process for these cases could be considered. Case 2 (see Fig. 12) corresponds to the loops detected at machine K. In 857 cases (3,35%), the piece appears to pass through the machine twice, even though the first pass did not encounter any errors. The machine issues two confirmation telegrams due to robot programming reasons. Several other cases similar to the loops at machine K were identified (see Fig. 13), with a lower percentage but for the same reason. The machines emit a double OK confirmation signal, with a time difference of seconds, which is less than the machines’ working period. Because two records are emitted, false loops appear in the log.
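The false loops described above could, in principle, be filtered out before discovery by dropping a repeated record when the same piece is confirmed at the same machine again within a few seconds. The sketch below is not part of the original study: the function name `drop_double_confirmations`, the tuple layout and the 10-second window are all assumptions, and the window would have to be chosen below the real working period of each machine.

```python
from datetime import datetime, timedelta


def drop_double_confirmations(events, window=timedelta(seconds=10)):
    """Drop an event when the same case repeats the same machine within `window`.

    `events` is a list of (case_id, machine, timestamp) tuples.
    """
    kept = []
    last_kept = {}  # case_id -> (machine, timestamp) of the last kept event
    for case_id, machine, ts in sorted(events, key=lambda e: (e[0], e[2])):
        prev = last_kept.get(case_id)
        if prev and prev[0] == machine and ts - prev[1] <= window:
            continue  # treated as a duplicate OK telegram
        kept.append((case_id, machine, ts))
        last_kept[case_id] = (machine, ts)
    return kept


# Toy data, invented for illustration: machine K confirms twice, 3 s apart.
t0 = datetime(2024, 1, 1, 6, 0, 0)
events = [
    ("S1", "J", t0),
    ("S1", "K", t0 + timedelta(seconds=60)),
    ("S1", "K", t0 + timedelta(seconds=63)),
    ("S1", "L", t0 + timedelta(seconds=120)),
]
cleaned = drop_double_confirmations(events)
machines = [m for _, m, _ in cleaned]
print(machines)
```

A genuine rework loop, where the piece returns to the machine after a full working period, would fall outside the window and be kept.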
Fig. 10. Case 3 (Group A)
Table 2. Variants of Group B
Cases     Frequency   Percentage
Case 1    1139        4,45%
Case 2    857         3,35%
Fig. 11. Case 1 (Group B)
Group C. Table 3 displays the case information of Group C. Case 1 identifies a set of 340 paths (1,33%) that do not pass through machine H. This is due to operator intervention, where the operator performs the task manually while the machine is performing its function on another piece. Case 2 represents a set of 191 cases, approximately 0,75%, where paths skip machines and do not follow the natural sequence of the production line. This is due to piece revalidation: the piece is rejected by machine L (responsible for quality assurance) and is returned to the production line for quality parameter confirmation.
Fig. 12. Case 2 (Group B)
Case 3 consists of 106 records (0,41%) where the pieces do not pass through machine K and backtrack in the production line sequence. These are rare cases, and there is currently no explanation; it could be a reading issue. Case 4 represents a set of 82 cases (0,32%) where there is no record at machine J. This may occur due to a machine reset, after which the operator performs the assembly manually. Case 5 corresponds to 41 records, approximately 0,16% of the cases, where the pieces do not pass through machine G. After analysis, it was concluded that these are isolated cases resulting from lack of experience of the production line operators.
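The kinds of deviation described for Groups B and C (a missing machine, a repeated machine, a move backwards in the line) can be flagged per trace with a small helper. This is a hedged sketch rather than the authors' procedure: the `EXPECTED` order is an assumed layout, and `diagnose` is a hypothetical name.

```python
# Assumed machine order, for illustration only; the real layout is confidential.
EXPECTED = ["B", "G", "H", "J", "K", "L"]
RANK = {m: i for i, m in enumerate(EXPECTED)}


def diagnose(trace):
    """Report skipped machines and whether the trace loops or backtracks."""
    seen = set(trace)
    steps = list(zip(trace, trace[1:]))
    return {
        "skipped": [m for m in EXPECTED if m not in seen],
        "loops": any(a == b for a, b in steps),
        "backtracks": any(RANK[b] < RANK[a] for a, b in steps),
    }


no_k = diagnose(["G", "H", "J", "L"])                            # no record at K
double_k = diagnose(["G", "H", "J", "K", "K", "L"])              # repeated K
revalidated = diagnose(["G", "H", "J", "L", "H", "J", "K", "L"])  # returns after L
print(no_k, double_k, revalidated, sep="\n")
```

Because the most frequent path starts at the second machine, B appears as skipped in all three toy traces; a real analysis would treat that as expected.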
Fig. 13. Similar cases of Case 2
Table 3. Variants of Group C
Cases     Frequency   Percentage
Case 1    340         1,33%
Case 2    191         0,75%
Case 3    106         0,41%
Case 4    82          0,32%
Case 5    41          0,16%

4 Compliance Verification
The conformance/compliance analysis was performed in ProM and requires two inputs: a Petri net (the process model) and the log file of the process. The Petri net is defined graphically as a directed graph. It consists of two types of nodes: places, represented as circles, and transitions, represented as squares. The connecting arcs are represented by arrows and always link two nodes of different types. The CSV data file was converted to XES (a process mining format adopted by the IEEE Task Force and by the latest version of ProM). Using the Inductive Miner algorithm plugin, which extracts the process model from the logs, a Petri net was generated. In Fig. 14, it is possible to observe the data flow and the machines on the production line; the black box indicates an alternative flow [14].
Fig. 14. Petri net
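The CSV-to-XES conversion mentioned in this section can be illustrated with the Python standard library. The sketch below writes only the bare structure of an XES document (a log of traces, each with events carrying `concept:name` and `time:timestamp` attributes). It is an assumption-laden minimum, not the converter used in the study: the function name `to_xes` and the input layout are hypothetical, and a real export would also declare the XES extensions that tools such as ProM may expect.

```python
import xml.etree.ElementTree as ET


def to_xes(traces):
    """Serialize {case_id: [(activity, iso_timestamp), ...]} as minimal XES."""
    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in traces.items():
        trace = ET.SubElement(log, "trace")
        # The trace-level concept:name carries the case id (serial number).
        ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
        for activity, ts in events:
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string", {"key": "concept:name", "value": activity})
            ET.SubElement(event, "date", {"key": "time:timestamp", "value": ts})
    return ET.tostring(log, encoding="unicode")


# Toy input, invented for illustration.
xes = to_xes(
    {"S1": [("G", "2024-01-01T06:00:00"), ("H", "2024-01-01T06:02:00")]}
)
print(xes)
```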
To analyze conformance, the plugin “Replay a Log on Petri net for Conformance Analysis” was applied, resulting in the diagram shown in Fig. 15.
Fig. 15. Compliance verification
The yellow circles indicate movements outside the model, i.e., non-conformant behavior. When analyzing the diagram, several details are noteworthy:
• The green transitions show that the actual logs followed all the expected transitions.
• The red outline represents activities that are not in compliance.
• The black rectangles indicate decision points for trajectories to be followed after the execution of an activity, with the pink bottom bar indicating cases with divergent executions relative to the model.
• The activities with dark blue boxes represent frequent activities in the process execution.
• The value inside the rectangles indicates the ratio of conformance frequency to non-conformance frequency.
In conclusion, the conformance analysis reinforces the results obtained with the Celonis tool. Most cases conform to the model, starting at the second machine of the production line, and only a small percentage of cases do not conform to the expected behavior.
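To give an intuition for what "movements outside the model" means, the following sketch aligns a single trace with one expected sequence and separates matched steps from events that appear only in the log and steps that appear only in the model. It is a deliberate simplification and an assumption on our part: ProM's replay plugin aligns the log with a Petri net using a cost function, whereas this example treats the model as one fixed sequence and uses a longest-common-subsequence alignment. The name `align` and the machine order are hypothetical.

```python
def align(trace, model):
    """Align a trace with one model sequence via longest common subsequence.

    Returns (sync, log_only, model_only): matched steps, events present only
    in the log, and model steps that the log skipped.
    """
    n, m = len(trace), len(model)
    # lcs[i][j] = length of the LCS of trace[i:] and model[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if trace[i] == model[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    sync, log_only, model_only = [], [], []
    i = j = 0
    while i < n and j < m:
        if trace[i] == model[j]:
            sync.append(trace[i])
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            log_only.append(trace[i])  # event not allowed by the model here
            i += 1
        else:
            model_only.append(model[j])  # model step missing from the log
            j += 1
    log_only.extend(trace[i:])
    model_only.extend(model[j:])
    return sync, log_only, model_only


MODEL = ["B", "G", "H", "J", "K", "L"]  # assumed order, for illustration
sync, log_only, model_only = align(["G", "H", "J", "K", "K", "L"], MODEL)
print(sync, log_only, model_only)
```

In this toy case the repeated K is reported as a log-only move and the unrecorded first machine as a model-only move, which mirrors the two main deviations found in the study.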
5 Conclusions and Future Work
After the discovery and conformance analysis, non-conformances that disrupt the expected process flow were identified. These cases need to be carefully analyzed, as they may be responsible for inefficiencies in the production line, performance losses, misutilization of available resources, or even productivity delays. Consequently, it becomes feasible to consider implementing improvements or devising a plan. It is relevant to consider reprogramming the machines in the future to extract more reliable, concise, and valuable data for process mining and, potentially, to build predictive models. With optimized data extraction, the following could be achieved:
• Forecasting execution times based on shifts.
• Identifying the most efficient sequence of machines/production line.
• Assessing if the required number of operators remains constant concerning the references in production.
• Managing the distribution of the number of operators based on the references in production.
• Predicting potential machine failures and scheduling preventive maintenance.
• Identifying patterns of issues in the production lines.
• Detecting periods of profitability.
• Estimating the daily production numbers to meet customer demands.
Implementing these suggestions across various production lines in the company would not only increase productivity but also minimize errors and faults, leading to more efficient and streamlined processes.
Acknowledgment. This work is funded by National Funds through the FCT - Foundation for Science and Technology, I.P., within the scope of the project Refª UIDB/05507/2020. Furthermore, we would like to thank the Centre for Studies in Education and Innovation (CI&DEI) and the Polytechnic of Viseu for their support.
References
1. Daft, R., Lengel, R.: Information Richness: A New Approach to Managerial Behavior and Organizational Design. JAI Press, Greenwich (1984)
2. Demchenko, Y., Membrey, P., et al.: Addressing big data issues in scientific data infrastructure (2013). https://doi.org/10.1109/CTS.2013.6567203. Accessed 27 July 2023
3. Oliveira, R.: Mineração de Processo com Celonis Framework (2016). https://www.linkedin.com/pulse/minera%C3%A7%C3%A3o-de-processo-com-celonis-framework-rosangela-oliveira/?originalSubdomain=pt. Accessed 08 July 2023
4. UpFlux Process Mining. https://upflux.net/pt/process-mining/. Accessed 02 July 2023
5. Van Der Aalst, W.: Process mining: overview and opportunities. ACM Trans. Manag. Inf. Syst. 3(2), 1–17 (2012)
6. Batista, E., Solanas, A.: Process mining in healthcare: a systematic review. In: 9th International Conference on Information, Intelligence, Systems and Applications, pp. 1–6 (2018)
7. Iervolino, L.: Process Mining: Entenda a realidade dos seus processos (2018). https://www.linkedin.com/in/luigi-iervolino-67981b/recent-activity/articles/. Accessed 20 June 2023
8. vom Brocke, J., van der Aalst, W., et al.: Process Science: The Interdisciplinary Study of Continuous Change (2021). SSRN. https://ssrn.com/abstract=3916817. Accessed 30 June 2023
9. van der Aalst, W.M.P.: Process mining: a 360 degree overview. In: van der Aalst, W.M.P., Carmona, J. (eds.) Process Mining Handbook. LNBIP, vol. 448, pp. 3–34. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-08848-3_1
10. van Dongen, B.F., de Medeiros, A.K.A., Verbeek, H.M.W., Weijters, A.J.M.M., van der Aalst, W.M.P.: The ProM framework: a new era in process mining tool support. In: Ciardo, G., Darondeau, P. (eds.) ICATPN 2005. LNCS, vol. 3536, pp. 444–454. Springer, Heidelberg (2005). https://doi.org/10.1007/11494744_25
11. Günther, C.W., Rozinat, A.: Disco: discover your processes. In: Lohmann, N., Moser, S. (eds.) Demonstration Track of the 10th International Conference on Business Process Management (2012)
12. Badakhshan, P., Geyer-Klingeberg, J., et al.: Celonis process repository: a bridge between business process management and process mining. In: CEUR Workshop Proceedings, vol. 2673, pp. 67–71 (2020)
13. van der Aalst, W.M.P., Song, M.: Mining social networks: uncovering interaction patterns in business processes. In: Desel, J., Pernici, B., Weske, M. (eds.) BPM 2004. LNCS, vol. 3080, pp. 244–260. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-25970-1_16
14. ProM Tools, ProM Documentation. https://promtools.org/prom-documentation/. Accessed 07 July 2023
15. Ailenei, I.: Process mining tools: a comparative analysis. Master thesis. Eindhoven University of Technology (2011)
Digital Transformation of Project Management
Zornitsa Yordanova
University of National and World Economy, Sofia, Bulgaria
Abstract. The study builds on the notion that increasing the effectiveness and efficiency of project management has a significant and positive impact on the behavioral intention to use digital technologies and their actual use, leading to easier adoption. We therefore perform a bibliometric analysis to investigate digital transformation in project management, linking such examples in the literature to efficiency and effectiveness. Several studies have already shown the positive impact of digital transformation, yet literature reviews that cover the topic more broadly and provide an overview of future directions or a grouping and development of methodological approaches are scarce. The paper is based on 431 articles indexed in Web of Science, which simultaneously deal with project management and digitalization/digitization and refer to efficiency and effectiveness as key terms in their titles, abstracts, and keywords. To address the topic, several analyses were performed: co-word analysis, top authors and journals publishing on the topic, evolution mapping of the literature, thematic mapping, and contextual analysis. The contribution of the study is a comprehensive picture of the digital evolution and progress of this management activity over time and a summary of the main trends of its digital transformation. Both researchers and practitioners can apply the insights of this study in transforming obsolete management practices.
Keywords: Project management · Digitalization · Technology management · Digital transformation
1 Introduction
This research builds on the premise that increasing the efficiency and effectiveness of project management has a significant and positive influence on the behavioral intention to use digital technologies and on their actual use, leading to their easier adoption [1]. Technological advances, the digitization of all sectors, and new management practices are affecting project management in all areas. In this research, we aim to explore the field of digital transformation in project management and formulate key trends and clusters for a further narrowed research agenda. Project Management (PM) has proven its practicality, usability, and functionality over the past 50 years in managing complex new activities in all kinds of initiatives and projects. In recent decades, PM, as a formalization and organization of work, has become a parallel structure of the organization in almost every industry to deal with new activities [2].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 258–268, 2024. https://doi.org/10.1007/978-3-031-54053-0_19
Digitalization has been widely implemented as a management practice in project management over the past 20 years. Project Management 4.0 refers to project management in the digital age [3] and offers digitized management tools such as real-time project monitoring, online reports, real-time cost progress indicators, project execution simulation, purchasing knowledge sharing, and more. With the increasing use of PM in business processes on the one hand and technological advancement on the other, PM has been strongly affected by digitalization and digitization [4]. Given the dynamics and changes provoked by the growing demands of technological projects and the expanding range of possible solutions, PM has been at the epicenter of the digital wave of recent years. In this paper, we search for common patterns, trends, and streams and map the extensive literature on the topic to support future research and fill knowledge gaps. We performed several bibliometric analyses to provide a comprehensive overview of the main trends, historical evolution, and main issues related to the digitalization of project management. The findings of the study are relevant to practitioners and academics in all managerial aspects of the subject and call for further research on the areas involved. The paper is organized as follows: Sect. 2 deals with the theoretical background of the study. Section 3 explains the methodology used. Section 4 presents the results and discussion. Section 5 concludes with the main directions for further research and the clarified scope of the digital transformation of project management.
2 Theoretical Background: Digital Transformation of Project Management
Digitalization causes changes for companies due to the adoption of digital technologies in the organization or in the work environment [4]. In this study, we focus on digital transformation as a critical management issue that requires new ways of managerial thinking [5], in project management in particular. Digital transformation and the strategy for its application have been discussed increasingly in recent years. Generally, the basic components of its nature are related to: 1) consumer expectations of data competitive landscape; 2) use of new technologies and 3) changes in value creation [6]. The impact of digital transformation on project management is a very broad and complex research area. It is also undoubtedly a modern and cognitively relevant topic. Broadly, the effects were explored by Kozarkiewicz [7], who concluded that these are process automation, implementation of new technologies, remote cooperation, world sourcing, and, ultimately, new structures. Moreover, all these lead to twofold changes in PM: positive effects (effectiveness, creativity, quick response time) and negative effects (cost, risk, and information noise). In response to these challenges, the author proposes optimizing project delivery processes through constant access to data, IT tools, virtual project teams, constant communication, agile methods, customer orientation, and incremental product delivery. Last but not least, the project manager’s role as a facilitator gets more attention and requires transformation. Regarding the project manager’s role, different qualitative case studies have analyzed project management styles and their influence on the success of the digital transformation of the company [8] and the changing skills and capabilities required for transforming the function [9].
3 Methods
3.1 Data and Scope
The scope of this study was set by a Boolean search in the Web of Science database to extract high-quality publications [10] on digitalization in the management field, and in project management in particular. The selected dataset allows us to perform broad bibliometric analyses that address the evolution of the use of technologies and of digitalization as a means of applying technologies in business processes (project management in particular). The formula used was as follows, limiting the articles in scope as well as the key concepts regarding the results pursued by this study: Results for “project management” OR “management” (Topic) AND “digital transformation” OR “digital tool” OR “digitalization” OR “digitizing” OR “digitalizing” OR “automation” (Topic) AND “efficiency” OR “effectiveness” (All Fields) and Article (Document Types) and English (Languages) and Social Sciences Citation Index (SSCI) (Web of Science Index). The query for replication of the results: https://www.webofscience.com/wos/woscc/summary/f344d105-c7ed-47f48320-9da26ab7d1e0-6e1cb91d/relevance/1. The search returned 431 articles from 226 sources published between 1991 and 2022 (the search was done in January 2023). There are 1623 contributing authors, and only 38 articles are single-authored. The rate of international co-authorship is 32%. The average document age is five years, which shows how recent the topic is. Figure 1 shows the sharp increase in publications in the last two years.
Fig. 1. Publications on digital transformation of project management between 1991–2023
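The Boolean search described in Sect. 3.1 combines three OR-groups with AND. A short sketch of how such a query string could be assembled, with explicit parentheses so that operator precedence is unambiguous, is given below. The field tags `TS` (topic) and `ALL` (all fields) follow our understanding of Web of Science advanced-search syntax and should be treated as assumptions; the helper `or_group` is hypothetical, and the document-type, language and index filters from the text are not encoded here.

```python
def or_group(tag, terms):
    """Build e.g. TS=("a" OR "b") from a field tag and a list of phrases."""
    quoted = " OR ".join(f'"{t}"' for t in terms)
    return f"{tag}=({quoted})"


query = " AND ".join(
    [
        or_group("TS", ["project management", "management"]),
        or_group(
            "TS",
            [
                "digital transformation",
                "digital tool",
                "digitalization",
                "digitizing",
                "digitalizing",
                "automation",
            ],
        ),
        or_group("ALL", ["efficiency", "effectiveness"]),
    ]
)
print(query)
```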
3.2 Bibliometric Analysis
Bibliometric analysis was introduced as a systematic type of structural research by Pritchard [11] and is currently considered one of the most efficient scientific methods for portraying the evolution of a research field from a wide perspective [12]. It facilitates the mapping of current research and identifies knowledge gaps, existing streams of research, and authors’ information, and it helps define a further research agenda, which is the goal of this study. This method is widely used in the domain of digitalization and innovation [13]. Bibliometric analysis is an effective method to explore the emergence of a research domain [14] such as digital transformation, and it has the power to monitor the research status of a particular domain and forecast future research trends [15]. For the purpose of bibliometric analysis, this research used R software and the Biblioshiny package [16].
4 Results and Discussion The results section proposes the most insightful findings from the bibliometric analysis and discussion on the targeted overview of the topic. Generally, they are oriented into terms analysis (word analysis) and time evolution.
Fig. 2. Most commonly used author keywords (on the right), title terms (in the middle) journal keywords (on the left)
The main author keywords in the scoped publications are digital transformation, digitalization, and automation, as shown in Fig. 2. However, terms such as Industry 4.0 and energy efficiency reveal much more about the engineering orientation of PM in recent years. An interesting insight is the frequency of the term COVID-19, which appeared only after 2020 and now largely affects the whole research field of digital transformation of project management. Among the academic organizations with a clear focus on this multidisciplinary topic are Qatar University, the University of Cambridge, the University of Nottingham, and Shandong University (see Fig. 3). From a purely management perspective, most of the research has tried to come up with frameworks, designs, or models in order to speed up the applicability of different digitalization tools and approaches.
Fig. 3. Most commonly used author keywords (on the right), association (in the middle) journal keywords (on the left)
Fig. 4. Topics mapping
Figure 4 presents the general distribution of main terms within the domain literature. Still, digitalization is related much more to knowledge management, efficiency, effectiveness, COVID-19, digital tools, and practices than to any concrete industry. Among the emerging technologies involved in digital transformation, big data, the Internet of Things, and artificial intelligence are at the research epicenter so far. This suggests that robotics, RPA, and other emerging technologies applicable to managerial processes, such as blockchain and machine learning, are still under research. Figure 5, along with journals and keywords, also provides insights into the corresponding authors’ countries. China, the USA, and the UK are the most active on the topic.
Fig. 5. Most commonly used authors’ keywords, most relevant authors, and countries
The journals contributing to the area of digital transformation of project management are among the top-tier journals: Journal of Cleaner Production, Sustainability, Technological Forecasting and Social Change, etc. Figures 5 and 6 show the relation between keywords and journals and the intensity of publishing on such topics; Sustainability, for instance, is the leading journal in this respect (see Fig. 6).
Fig. 6. The most impactful journals in the scope
When it comes to impact and citations, Sustainability and IEEE Transactions on Engineering Management are the top ones. The countries producing such research are predominantly the USA, China, the UK, Germany, and Australia, as shown in Fig. 7. Figure 8 shows the most used terms in titles. In 20% of the cases, these are purely managerial articles. The next two figures (see Fig. 9 and Fig. 10) present the same information for title terms of two and three words. Notably, supply chain and air traffic appear at the top. The next group of two-gram terms is related to the circular economy and energy efficiency. As expected, process automation and robotic processes find their place in the top 10.
Fig. 7. Publication activity per country
Fig. 8. The most used terms in titles (unigram)
Fig. 9. The most used terms in titles (two-grams)
The same trends are seen in the trigram analysis of the most used terms in titles, but with more specific terminology: robotic process automation, digital supply chain, sales force automation, air traffic control, supply chain management, air traffic management, etc.
Fig. 10. The most used terms in titles (three-grams)
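The unigram, two-gram and three-gram counts discussed above were produced with Biblioshiny in R; the underlying idea can be sketched in a few lines of Python. The titles below are invented for illustration, the helper `ngrams` is hypothetical, and no stop-word removal or stemming is applied, which a real analysis would likely add.

```python
import re
from collections import Counter


def ngrams(title, n):
    """Return the n-word terms of a title, lower-cased, punctuation dropped."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]


# Invented titles, for illustration only.
titles = [
    "Robotic process automation in supply chain management",
    "Digital supply chain and air traffic management",
    "Robotic process automation for project management",
]

bigrams = Counter(g for t in titles for g in ngrams(t, 2))
trigrams = Counter(g for t in titles for g in ngrams(t, 3))
print(bigrams.most_common(3))
print(trigrams.most_common(1))
```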
Figure 11 presents the distribution of terms used over time. Industry 4.0 is among the hottest topics recently. An interesting finding from this analysis is the increasing trend of research on sustainability and the healthcare sector.
Fig. 11. Word analysis over time (author keywords)
The historical evolution is automatically divided into publications between 1991 and 2019, those between 2020 and 2021, and those between 2022 and 2023 (see Fig. 12). This distribution shows once more the topicality and the recent growth in publications. Some changes can also be observed in the most used terms. Business process management remains a relevant topic until 2020 and much less so afterwards. The same holds for human factors. Blockchain received most of its research attention in connection with supply chain management and technology. Thereafter, this research shifted towards business model evaluation in the latest period. Although an emerging topic, COVID-19 quickly lost its impact on research in 2022.
Fig. 12. Historical evolution
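The time slicing behind the historical evolution can be illustrated by grouping author keywords into the three periods named above and counting them per period. This is a hedged sketch on invented records, not the Biblioshiny procedure itself; `period_of` and the record layout are hypothetical.

```python
from collections import Counter, defaultdict

# The three time slices reported in the text.
PERIODS = [(1991, 2019), (2020, 2021), (2022, 2023)]


def period_of(year):
    """Return the (start, end) slice containing `year`, or None."""
    for start, end in PERIODS:
        if start <= year <= end:
            return (start, end)
    return None


# Invented (publication year, author keywords) records, for illustration only.
records = [
    (2018, ["business process management", "human factors"]),
    (2020, ["covid-19", "digitalization"]),
    (2021, ["covid-19", "blockchain"]),
    (2022, ["industry 4.0", "sustainability"]),
    (2023, ["industry 4.0"]),
]

by_period = defaultdict(Counter)
for year, keywords in records:
    by_period[period_of(year)].update(keywords)

for period in PERIODS:
    print(period, by_period[period].most_common(2))
```

Comparing the top keywords of consecutive slices is what reveals topics that fade (here the toy "covid-19") or emerge (here "industry 4.0").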
Figure 13 presents the conceptual structure map of the studies within scope. Still, they all can be summarized in a single domain between technological advancement studies (concerning more technical aspects such as big data, artificial intelligence, and machine learning), managerial research (circular economy, sustainability, transformation, decision-making), and industry-focused publications (health).
Fig. 13. Conceptual structure map of the studies within the scope
Clearly, big data, artificial intelligence, and machine learning are already widely used in project management. All these uses summarize the digital transformation of all kinds of endeavors, mostly related to implementing and incorporating emerging technology in all areas of business and life. Project management is a crucial activity in this regard and receives considerable attention from business and IT, academia, and practitioners.
5 Conclusion
In conclusion, the study provides an overview of the main trends in the domain of digital transformation of project management, which we categorize into these groups:
• Managerial topics: decision-making, transformation, development
• Technological topics: big data, artificial intelligence, machine learning, robotics
• Industrial topics: air traffic, health, engineering projects, energy efficiency
• Trending and growing topics: sustainability, circular economy
• Matured and declining topics: COVID-19, supply chain management, blockchain
The research gives practitioners and academics an overview of the currently developed sub-topics around the digital transformation of project management and calls for further research on trending and under-researched topics such as industry verticals in project management, Industry 5.0, and the changing skills and capabilities of project managers. This study builds upon the comprehensively analyzed historical progression of digitalization and technological integration within project management over the past decade [17]. It equips practitioners with a robust empirical foundation to make informed decisions when implementing digital strategies, thereby enhancing the efficacy of their project management practices.
Acknowledgment. This work was financially supported by the UNWE Research Programme.
References
1. Vărzaru, A.A.: An empirical framework for assessing the digital technologies users’ acceptance in project management. Electronics 11(23), 3872 (2022)
2. Aubry, M., Hobbs, B., Thuillier, D.: A new framework for understanding organisational project management through the PMO. Int. J. Proj. Manag. 25(4), 328–336 (2007)
3. Simion, C., Popa, S., Albu, C.: Project management 4.0 - project management in the digital era. In: Proceedings of the 12th International Management Conference “Management Perspectives in the Digital Era”, Bucharest, Romania, 1st–2nd November 2018 (2018)
4. Parviainen, P., Tihinen, M., Kääriäinen, J., Teppola, S.: Tackling the digitalization challenge: how to benefit from digitalization in practice. Int. J. Inf. Syst. Proj. Manag. 5(1), 63–77 (2017)
5. Hassani, R., El Bouzekri El Idrissi, Y., Abouabdellah, A.: Digital project management in the era of digital transformation: hybrid method. In: Proceedings of the 2018 International Conference on Software Engineering and Information Management, pp. 98–103, January 2018
6. Vial, G.: Understanding digital transformation: a review and a research agenda. J. Strateg. Inf. Syst. 28, 118–144 (2019)
7. Kozarkiewicz, A.: General and specific: the impact of digital transformation on project processes and management methods. Found. Manag. 12(1), 237–248 (2020)
8. Khan, M.A.: The impact of project management styles on digital transformation: a case study of an IT services company. Int. J. Project Manag. 4(1), 1–9 (2020)
9. Singh, A., Hess, T.: How chief digital officers promote the digital transformation of their companies. In: Strategic Information Management, pp. 202–220. Routledge (2020)
10. Martín-Martín, A., Thelwall, M., Orduna-Malea, E., Delgado López-Cózar, E.: Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: a multidisciplinary comparison of coverage via citations. Scientometrics 126(1), 871–906 (2021)
11. Pritchard, R.D.: Equity theory: a review and critique. Organ. Behav. Hum. Perform. 4(2), 176–211 (1969)
12. Donthu, N., Kumar, S., Mukherjee, D., Pandey, N., Lim, W.M.: How to conduct a bibliometric analysis: an overview and guidelines. J. Bus. Res. 133, 285–296 (2021)
13. Kraus, S., Jones, P., Kailer, N., Weinmann, A., Chaparro-Banegas, N., Roig-Tierno, N.: Digital transformation: an overview of the current state of the art of research. SAGE Open 11(3), 21582440211047576 (2021)
14. Ellegaard, O., Wallin, J.A.: The bibliometric analysis of scholarly production: how great is the impact? Scientometrics 105(3), 1809–1831 (2015)
15. Chawla, R.N., Goyal, P.: Emerging trends in digital transformation: a bibliometric analysis. Benchmarking Int. J. 29(4), 1069–1112 (2022)
16. Aria, M., Cuccurullo, C.: Bibliometrix: an R-tool for comprehensive science mapping analysis. J. Informetr. 11(4), 959–975 (2017)
17. Yordanova, Z.: Raise the bar: technology and digitalization in project management over the last decade. In: Yang, X.S., Sherratt, S., Dey, N., Joshi, A. (eds.) Proceedings of Seventh International Congress on Information and Communication Technology. LNNS, vol. 465. Springer, Singapore (2023). https://doi.org/10.1007/978-981-19-2397-5_69
Text Analysis of Ethical Influence in Bioinformatics and Its Related Disciplines
Oliver Bonham-Carter
Allegheny College, Department of Computer and Information Science, Meadville, PA 16335, USA
[email protected]
https://www.oliverbonhamcarter.com/
Abstract. Scientific research has played a significant role in driving human progress and technological advancements. However, the importance of ethical considerations in research cannot be overstated. This is particularly crucial in the field of Bioinformatics and its parent disciplines: Biology, Computer Science and Mathematics, which must continually evolve. Ethical norms acting as guidelines for conduct serve to protect individual privacy of participants, ensure fair access to data, promote data integrity and reproducibility of results, and promote responsible algorithm deployment for reproduction. These guidelines also extend to social responsibility and trust of the research itself. Due to the diversity of the discipline, ethical guidelines in Bioinformatics also ensure that the diverse teams follow accepted societal values and obey norms throughout the project to publication so that the deliverable is acceptable to the public and community. In this paper, we present a text analysis study of the reports of ethics in Bioinformatics research, and in related disciplines. A corpus of curated articles from the National Center for Biotechnology Information (NCBI) is analyzed using the BeagleTM text analysis software. The software parses and locates keywords related to ethical themes in the abstracts of articles in the fields of Bioinformatics, Biology, Computer Science, Mathematics, and others. Relationship Networks (RNs) are created to visualize the connectivity of articles and their references. We emphasize the importance of integrating ethics into the research of Bioinformatics, and other areas, to ensure responsible and sustainable advancement.
Our results call for better practice and increased awareness of ethical considerations among research teams. By incorporating ethical perspectives, Bioinformatics can contribute to a trustworthy and inclusive scientific ecosystem, benefiting society as a whole.
Keywords: Text mining · Literature analysis · Relationship networks · Relationship models
1 Introduction
Scientific research has long been a driving force behind human progress, contributing to groundbreaking discoveries and technological advancements. However, the pursuit of knowledge must be balanced with ethical considerations to ensure responsible and accountable practices. In the evolving and largely interdisciplinary field of Bioinformatics, considerations of ethics and responsibility are critical. Ethics establishes the norms in research, ensuring appropriate conduct and procedures, and allowing appropriate advancement of the field. By protecting individual privacy, ensuring data integrity and reproducibility, promoting fair access, and deploying algorithms responsibly, bioinformaticians contribute to a trustworthy and inclusive scientific ecosystem. Resnik [7] highlights the importance of ethical norms in research, as they promote the aims of research, facilitate collaboration, ensure accountability to the public, build public support, and uphold moral and social values. In [28], Vähäkangas identifies that, in the post-genomic era, genetic biomarker research faces ethical challenges due to advancements in genomic technology and data analysis, including method validation, data interpretation, and the risk of premature translation of findings into clinical practice. Ethical considerations in Bioinformatics research also extend to social responsibility and implications. Bioinformatics research impacts society in various ways, and ethical considerations help identify potential risks and benefits, as they can take into account societal values, equity, and justice. By incorporating ethical perspectives, we can avoid research that perpetuates discrimination or exacerbates existing societal inequalities. As discussed in Reijers et al. [24], by upholding ethical standards in Bioinformatics (for example), we may conduct research that advances knowledge while advancing a research area that respects the dignity and rights of individuals, both human and non-human.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): FICC 2024, LNNS 921, pp. 269–289, 2024. https://doi.org/10.1007/978-3-031-54053-0_20
Integrating ethics into Bioinformatics research is not only a moral imperative but also a pathway to responsible and sustainable advancement of Bioinformatics for the betterment of society as a whole. In this paper, we present a text analysis study of the reports of ethics in Bioinformatics research and in its related disciplines. We begin this work by building a corpus of curated articles which are made available by the NCBI (National Center for Biotechnology Information). We then use a text analysis software package, BeagleTM, to parse and locate selected keywords that signal discussion of ethical themes across the abstracts of all articles associated with Bioinformatics and its parent disciplines: Biology, Computer Science, Mathematics and others. We create Relationship Networks (RNs), networks which show how one article is related to its references and to other articles, to discuss the connectivity of articles, which we argue represents an influence of knowledge. We conclude by drawing attention to the many articles that may have missed opportunities to discuss their ethical procedural content. For this, we discuss under-connected RNs, which suggest that more work is necessary to vocalize the ethical themes at play in articles of Bioinformatics and other related disciplines. This communication, we believe, would help to form a better practice for new research teams and would help to advance the evolution of the field appropriately.
1.1 Text Analysis
Text analysis has become an indispensable tool in various fields of research, enabling investigators to extract valuable insights from vast amounts of textual data. In particular, text analysis helps to unveil hidden patterns, understand sentiment and emotions, identify trends, classify information, and enable qualitative analysis, ultimately leading to more robust and nuanced findings. Its methodologies also facilitate the classification and categorization of information. By employing techniques such as text clustering or topic modeling, researchers can organize large amounts of unstructured textual data into meaningful categories. This aids in information retrieval, summarization, and knowledge organization, enabling researchers to conveniently navigate vast datasets such as scientific publications, as in the case of our work. More specifically, these methods are well understood and have been applied to extract information for convenient use (text summarization, document retrieval), assess document similarity (document clustering, key-phrase identification), extract structured information (entity extraction, information extraction) [22], and extract social media information [16]. Since so much of the data that is produced is textual in nature, it is logical to use algorithms and methods from text analysis to determine results. For instance, Manoharan et al. [15] determined Microarray Expression from voluminous data sets. Yoo et al. [31] used Latent Dirichlet Allocation topic modeling to determine dietary patterns from data. Other examples may be found in [10,14,26]. Text analysis is broad in spectrum and one will note that new tools and libraries are available for a host of research tasks. 
For example, libraries for programming languages, such as TM [8] and Rattle [30], exist with excellent community involvement. Web-based platforms and online tools are also conveniently available, such as Textpresso Central [18], Textalyzer for generating statistics about text (https://seoscout.com/tools/text-analyzer), and Lexos for visualizing large text sets (http://lexos.wheatoncollege.edu/), in addition to many others. In this work, we apply text analysis to uncover trends of ethical discussion across a corpus of over one million scientific articles. Since this dataset is made up of text, we chose text analysis methodologies as our approach.
2 Methods
2.1 Textual Data
We created our corpus from the noncommercial publication archives of the National Center for Biotechnology Information (NCBI), https://www.ncbi.nlm.nih.gov/, on the 15th of May 2023. The articles of the repository originated from 3436
respected publishers (e.g., Science, Nature, Elsevier, IEEE, ACM and others) covering diverse subjects. With such diversity of writing, we were satisfied that the corpus would cover many facets of scientific investigation and would provide a breadth of ethical thinking across science. This diversity was especially important to this work, as we were interested in studying articles from disciplines that have some clear association with Bioinformatics, such as Biology, Computer Science, Mathematics, and other related STEM areas. In addition, thanks to the tradition of publishing quality, peer-reviewed work by many of the publishers of this corpus, the articles that we processed for our study were likely themselves top-quality, peer-reviewed sources of factual information. We now discuss the technical details of creating the corpus. From the repository available from NCBI (https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/), commercial and non-commercial reviewed and published scientific articles may be downloaded. We note that the articles are also searchable online using PubMed’s search engine at https://pubmed.ncbi.nlm.nih.gov/. From the FTP site, we obtained tar.gz compressed files, totalling about 19G in disk space. After the extraction process, following the bash code shown in Fig. 1, the uncompressed files totalled about 87G of disk space. The articles were in the nxml (text) format, which is similar to the uncompressed Extensible Markup Language (XML) text format.
# Extract every downloaded archive in the current directory
for x in *.tar.gz
do
    tar -vxf "$x"
done
Fig. 1. The bash script used to untar the archives from NCBI
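To make the nxml format more concrete, the following is a minimal Python sketch of how the abstract of one extracted article could be read. It is an illustration only and not the code used by BeagleTM. The sample document, the function name extract_abstract, and the assumption that the abstract sits in an element named abstract are ours; real nxml files carry far more markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for one extracted .nxml file.
SAMPLE_NXML = """<article>
  <front>
    <article-meta>
      <title-group><article-title>A sample study</article-title></title-group>
      <abstract><p>We discuss <italic>informed consent</italic> in Bioinformatics.</p></abstract>
    </article-meta>
  </front>
  <body><p>Body text is ignored here.</p></body>
</article>"""


def extract_abstract(nxml_text: str) -> str:
    """Return the plain text of the first <abstract> element, or '' if absent.

    Assumption: the abstract is stored in an element named 'abstract'.
    """
    root = ET.fromstring(nxml_text)
    abstract = root.find(".//abstract")
    if abstract is None:
        return ""
    # itertext() walks nested markup such as <p> and <italic>;
    # split/join collapses runs of whitespace into single spaces.
    return " ".join("".join(abstract.itertext()).split())


print(extract_abstract(SAMPLE_NXML))
```

An article without an abstract element yields an empty string here, so later steps can skip it without special handling.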
2.2 Text Analysis
The overarching goal of this project was to quantify and follow the language of ethical consideration in articles from disciplines in science related to Bioinformatics. Several essential points are made below to signify the importance of employing text analysis as a method with the use of advanced keywords, as opposed to other types of methodologies (e.g., statistical models, Artificial Intelligence/Machine Learning, and others). While these other methods have their place in research, it seemed appropriate for our study to concentrate on the basic language which was placed in articles by writers.
1. Curated keywords to locate specific discussion: Ours is a text analysis study where we follow the language built around specific keywords to isolate types of discussions. Here, using keywords of ethical themes, we can categorize and quantify the occurrence of ethically-relevant discussion points.
2. Convenience for studying use of human language: Text analysis is well suited for studies of customized human language (or academic language) which carries specialized meaning for the community. By this we imply that text analysis affords different ways to study the nuances of keywords that lead to relevant discussions across the many articles of different disciplines. When we justify the keywords for this project in Sect. 2.3, we will return to this discussion.
3. Correlation studies: Text analysis is able to explore the multi-faceted use of language unique to each discipline to create numerical data for statistical evaluation such as correlation. Utilizing methods from text analysis, we were able to uncover the intricate linguistic patterns associated with the use and definitions of the keywords that comprise the discipline-specific language(s) of an article. In our project, the terms used to describe ethical content, according to diverse disciplines, as illustrated in Fig. 2, were used to infer the ethical thinking behind the work. Furthermore, using this methodology, where a host of keywords may be used to gain data points from articles, we allow ourselves to study and compare other kinds of traits in textual data.
4. A scaled approach: Each year brings new articles which may be processed and analysed for references to ethical themes. As the corpus grows, the keywords will also have to be checked for effectiveness and changed as necessary when they cease to be effective on the newer textual data.
2.3 Keywords
[Figure: keyword nodes AKW1, AKW2, ..., AKWn and BKW1, BKW2, ..., BKWn, shown with the nodes "Ethical Quality of Field A" and "Ethical Quality of Field B"]
Fig. 2. A list of keywords has been selected from two academic groups (A and B) for a text analysis study of ethical quality in textual data of two fields. Both lists are necessary as the academic language is not standard across the fields: while the words may appear the same, they may carry their own nuances, shades of meaning, or unusual collocations. To avoid complications where keywords are used at cross purposes, our curated list of keywords concerns a host of simple definitions of ethical issues in articles.
[Figure: Calculus terms paired with their meanings outside the discipline: Limit / To bind; Derivative / To extract; Integrate / To incorporate]
Fig. 3. In Mathematics, there are keywords which pertain specifically to the discipline. However, outside of the discipline, such keywords have completely different implications.
Selection of Keywords. As defined in Charmot et al. [5], an academic language is one that is used by teachers and students for the purpose of acquiring new knowledge and skills, imparting new information, describing abstract ideas, and developing conceptual understandings. An academic vocabulary therefore consists of words that are typically learned through exposure to school texts, occur across disciplines, and are found more frequently in academic than nonacademic contexts [13,19]. For example, in Fig. 3, Mathematics has the defined terms limit, derivative and integral, which have different implications outside of the discipline.
Table 1. The keywords for our study were selected by inspection from articles containing discussion of ethical procedures from Bioinformatics, Biology, Computer Science, Mathematics and other disciplines. These keywords were checked for occurrence across 1,380,596 articles in NCBI.
Keywords
analysis            informatics
analytical          informed consent
bioinformatics      liability
biology             mathematics
code of ethics      responsibility
computer science    responsible
ethic               stem
ethical             trust
ethics              whistle-blowing
general science
Here we imply the necessity for multiple terms when determining the content in textual data, because there is no assurance that keywords share conventional definitions across diverse academic fields. As discussed in [19,20], words used to describe academic topics may instead indicate converse definitions from the opposing perspectives of particular fields. For example, in Fig. 2, multiple words
must be selected from each of the two academic areas (i.e., A and B) to complete a text analysis study of the ethical quality of the work. The idea of utilizing different keywords to create a multitude of data points can be noted in Topic Modeling methodologies such as Latent Dirichlet Allocation (LDA), described in [1]. The supporting idea behind LDA is that a keyword belonging to a particular topic, or with a high probability of belonging to a topic, would be desirable for locating the topic in text. Therefore, in keeping with the LDA methodology of employing large lists of words for data exploration, while staying clear of academic vocabulary complications, Table 1 displays the listing of general words that we used to tag discussions of an ethical nature across several different disciplines. To explore the actual usage of specific words, we considered words like ethic, ethical and ethics to be independent of each other. To justify our selection of keywords from Table 1, we take a moment to explain how the individual words were discovered during the reading of prominent research articles across the disciplines of our study.
Keywords by Inspection. Our curated keywords were selected by inspection throughout articles from the disciplines of Bioinformatics, Biology, Computer Science, Mathematics, in addition to other articles from STEM origins. To obtain these generalized words, we read through arbitrarily chosen articles originating from each of the above-mentioned disciplines to locate discussion of ethics in conjunction with research. After finding discussions in articles, keywords were selected with the interest that they would also be able to locate ethical themes in other types of articles. We now address the mechanism used to select keywords of ethical discussion in research. In a selected article by Jamil et al. 
[12], research was introduced to determine the causative factor of obesity-associated insulin resistance and lipid metabolism in the context of acute inflammation. The authors indicated that the research had been approved by an Institutional Ethical Committee (IEC) of the hospital which cared for those who gave data for analysis. In Patnaik et al. [21] and Henze et al. [11], studies of biomarkers and patients were described and, again, we found mention that the work had been approved by an IEC, in addition to other ethics committees. In these articles, we noted the word ethics, in addition to Bioinformatics. For other areas of science (i.e., Biology, Mathematics and STEM), we proceeded in the same way. For example, in Gasparich [9], student training for making ethically sensitive decisions was discussed. In Mathematical research, McKay et al. [17] included language to explain that the ethics of their work had been approved by several specialized Ethics Committees. Reading these diverse papers gave our project a distinct view of how keywords (i.e., taken from abstracts, and from the lists supplied by authors) are used to describe ethical interests.
[Figure: network diagram with article nodes ArticleA and ArticleB and reference nodes RefA1 to RefA7 and RefB1 to RefB3]
Fig. 4. Relationship networks. The larger (red) nodes represent the articles in which keyword occurrences were found. The smaller (blue) nodes represent the supporting documents comprising the article’s bibliography. The edges connecting nodes imply a citation exists between the two publications. (Color figure online)
2.4 Method of Text Analysis: BeagleTM
Since this project concerns processing a large dataset of over a million articles, it was necessary to find a robust text analysis toolkit to complete the work. We had three basic requirements for the tool:
– Data size flexibility: NCBI curates the dataset that we chose for this project. Each month, new articles are automatically added to their publicly available corpus. While it cannot be known how many new articles will be added each month, it was certain that the processing job would never become easier for a tool. It was therefore necessary to have a text analysis tool that would be able to process all articles of the corpus for perhaps years to come.
– Memory limitations: With all the new articles being added, it was essential that the tool would be able to handle the workload of over a million articles and not run out of memory.
– We were interested in building networks out of the results to assist with the visualization of trends.
BeagleTM [2] was originally designed for convenient text analysis in Bioinformatics research [3], where speed and avoiding memory limitations were considerations. We chose this text analysis software for its approach of scanning and analysing the corpus article by article to avoid memory limitations. Rather than having to load an entire corpus into memory before processing can begin, as noted with other advanced tools such as TM for R [8], BeagleTM processes each article in the absence of the others of the set. This single feature serves to avoid the memory limitations which would prevent research over large-scale datasets. NCBI articles are set in the nxml format, which creates an obstacle for other text analysis software packages. We chose BeagleTM for its ability to process and parse files in the nxml format without additional configuration on our part.
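The article-by-article strategy described above can be sketched in Python with a generator, so that only one file is held in memory at any time. This is a minimal illustration under our own assumptions, not BeagleTM's implementation; the function names iter_articles, count_articles_mentioning and demo, and the tiny throwaway corpus, are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_articles(corpus_dir: Path) -> Iterator[Tuple[str, str]]:
    """Yield (file name, file text) for one .nxml file at a time."""
    for path in sorted(corpus_dir.rglob("*.nxml")):
        # Only the current article is read into memory; nothing is
        # accumulated across iterations.
        yield path.name, path.read_text(encoding="utf-8")


def count_articles_mentioning(corpus_dir: Path, term: str) -> int:
    """Count the articles whose text contains `term` (case-insensitive)."""
    hits = 0
    for _name, text in iter_articles(corpus_dir):
        if term.lower() in text.lower():
            hits += 1
    return hits


def demo() -> int:
    """Build a three-file throwaway corpus and count matches for 'ethics'."""
    with tempfile.TemporaryDirectory() as tmp:
        corpus = Path(tmp)
        (corpus / "a.nxml").write_text("<article>ethics approval</article>", encoding="utf-8")
        (corpus / "b.nxml").write_text("<article>protein folding</article>", encoding="utf-8")
        (corpus / "c.nxml").write_text("<article>Ethics committee</article>", encoding="utf-8")
        return count_articles_mentioning(corpus, "ethics")


print(demo())
```

Because the generator never builds a list of all documents, peak memory depends on the largest single article rather than on the size of the corpus, which is the property the requirements above ask for.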
In consideration of keywords, we required software that would allow us to conveniently test and interchange terms without extensive effort. With BeagleTM, we were able to supply an arbitrarily long list of keywords for testing, and alter it as necessary without difficulty. NCBI continually increases the depth of its article repository, which, taken together with our own growing list of keywords, made BeagleTM an ideal tool for our analysis project and its widening data requirements. We were interested in determining the amount of overlapping discussion between articles, as well as the degree of connectivity between articles that follow common themes. Such a study is facilitated by the use of networks. BeagleTM’s crowning feature is its ability to output its results as networks, allowing the investigator to explore article depth and levels of connectivity. Utilizing Relationship Networks (discussed in more detail in Sect. 2.4), we gained a visualized map of the depth of common thinking across articles, in addition to being able to investigate the breadth of their connectivity as a function of our keywords.
Abstract Analysis. While authors are often given space to include selected keywords to help readers locate their article, the keywords are often too vague and do little to convey the importance of the work. Conversely, author-supplied keywords may be too specific and, to the uninitiated, the focus of the article is not clear. An abstract is a short piece of writing that contains a wealth of keywords that are delicately focused on the article’s material. When writing an abstract with a 255-word size limit, the author must choose terms and language carefully to convey exact and particular meaning. It is therefore very likely that any mentioned topics will be largely developed in the body of the article. It is equally unlikely that weak references to topics will be included in abstracts. 
Consequently, this text serves as a reliable source of information, as opposed to relying only on the keywords left by the authors. As BeagleTM processes articles, it reads only the text of the abstracts. Processing this enriched and factual text serves to improve speed and efficiency when working with large sets of textual data. The text of abstracts is likely to offer more factual information concerning the article's aims and could offer more details than those gained from the author-supplied keywords. One caveat is that, when processing ethically conscientious publications, some abstracts may not mention any discussion of ethical procedures. In some cases, the abstracts will include no details concerning data handling, patient privacy, the approval of committees for responsible conduct, or similar. Instead, this information may be contained in a separate subsection of the publication, outside of the abstract. In these cases, our analysis will miss this detail and the article will be incorrectly categorized. To avoid the difficulties that come with incorrectly categorizing articles, we assert that the “ethically conscientious” nature of articles will, for this study, be defined by discussion in abstracts.
Relationship Networks. The determination of how keywords in abstracts are related to other keywords in abstracts can be described visually with nodes and edges in Relationship Networks (RNs). As shown in Fig. 4, the network displays articles which are connected to others on the basis of their language, as tagged by keywords. In Fig. 4, there are larger and smaller types of nodes which represent Articles and References, respectively. The Article nodes represent the corpus publications containing the user-selected keywords. The smaller Reference nodes represent the supporting articles that comprise the bibliography (i.e., the literature review) of an article. As discussed later in Sect. 3, actual Relationship Networks from our results are shown in Fig. 6 and Fig. 7. Since the edges in RNs are formed when one document has a citation to another, the cluster becomes much more connected in terms of informational content. For instance, citations connect the red nodes (i.e., articles) to their supporting references (i.e., blue nodes), as discussed in Fig. 4. In the case of two bridged nodes, the edge signals the influence of one article over the other. In RNs, red nodes represent articles which were located as a result of keyword content in the abstract. A network of publications in which all articles share the same kinds of keywords suggests a type of cluster that is likely to contain much common knowledge. Additionally, when one finds more than two keywords in the abstracts of multiple articles, one may conclude that the abstracts share discussion of similar themes. We therefore argue that when the same subset of keywords is found across articles, there is a strong indication that the articles share an informational overlap. It is understood that densely connected networks are those that have more edges than sparsely connected ones. 
For densely connected RNs, the abundant edges suggest that authors wrote their articles under the influence of other researchers. In such networks that were created by an identical set of terms, the common keywords suggest that the articles already share a common thread. Edges that imply references to other articles suggest evidence of an actual influence of ideas. Interestingly, networks of high connectivity suggest a broad sharing of ideas across the articles of the RN, and for large networks, this connectivity may imply frequent author interactions. Conversely, in RNs with sparse or zero connectivity, we assert that such articles were written without the influence of others. With the idea that nodes represent publications and that edges represent the citations between them, we take a moment to discuss the network featured in Fig. 4. We note the types of edges that we used during the course of our study of RNs.
– Article ↔ Article: Two articles likely share some common research interest or methodology.
– Reference ↔ Reference: Two references likely share some common research or methodology. It is likely that both references may be equally suitable for articles where one serves as a supporting article.
– Article ↔ Reference: The reference supports the article with common information. The reference node appears to have unique information which is not found in any other nodes.
– Article ↔ Reference ↔ Article: Common thread – the reference is strong since it has been included in both articles.
– Reference ↔ Article ↔ Reference: Common thread – the reference has common information to both articles.
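The edge types listed above can be illustrated with a small Python sketch that builds a Relationship Network from bibliographies. It is an assumption-laden illustration rather than the BeagleTM output format: the node names are borrowed from Fig. 4, but the wiring between them, the helper names build_network, shared_references and density, and the use of graph density as a connectivity measure are all our own choices.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical bibliographies; the wiring is invented for illustration
# and does not reproduce the actual edges of Fig. 4.
bibliographies: Dict[str, List[str]] = {
    "ArticleA": ["RefA1", "RefA2"],
    "ArticleB": ["RefB1", "RefA2", "ArticleA"],
}


def build_network(bibs: Dict[str, List[str]]) -> Set[Tuple[str, ...]]:
    """Return an undirected edge set with one edge per citation."""
    edges: Set[Tuple[str, ...]] = set()
    for article, refs in bibs.items():
        for ref in refs:
            edges.add(tuple(sorted((article, ref))))
    return edges


def shared_references(bibs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Find references cited by several articles (Article-Reference-Article)."""
    citing: Dict[str, Set[str]] = {}
    for article, refs in bibs.items():
        for ref in refs:
            citing.setdefault(ref, set()).add(article)
    return {ref: arts for ref, arts in citing.items() if len(arts) > 1}


def density(edges: Set[Tuple[str, ...]]) -> float:
    """Share of possible undirected edges that are present (0.0 to 1.0)."""
    nodes = {node for edge in edges for node in edge}
    n = len(nodes)
    return 0.0 if n < 2 else 2 * len(edges) / (n * (n - 1))


edges = build_network(bibliographies)
print(sorted(edges))
print(shared_references(bibliographies))
print(density(edges))
```

In this toy network, RefA2 forms an Article ↔ Reference ↔ Article thread, and the citation from ArticleB to ArticleA is an Article ↔ Article edge. A density near zero would correspond to the sparse networks discussed above.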
3 Results and Discussion
We applied BeagleTM to parse and locate the keywords of Table 1 throughout the articles of our corpus. An article is recorded by the analysis software when at least one of the provided keywords has been located in its abstract.
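The selection rule described here (an article is recorded when at least one keyword occurs in its abstract) can be sketched as follows. The sketch assumes case-insensitive substring matching, which is an inference from the reported counts (for example, every abstract containing ethics also matches ethic) rather than a description of BeagleTM's actual implementation; the function names and sample abstracts are hypothetical.

```python
def matched_keywords(abstract, keywords):
    """Return the keywords found in the abstract (case-insensitive substring match)."""
    text = abstract.lower()
    return [kw for kw in keywords if kw.lower() in text]


def select_articles(abstracts, keywords):
    """Keep only articles whose abstract contains at least one keyword."""
    selected = {}
    for article_id, abstract in abstracts.items():
        hits = matched_keywords(abstract, keywords)
        if hits:
            selected[article_id] = hits
    return selected


# Hypothetical abstracts, for illustration only.
abstracts = {
    "pmid-1": "We discuss research ethics in Bioinformatics pipelines.",
    "pmid-2": "A protein responsible for stress response is characterized.",
    "pmid-3": "Crystal structures of a membrane transporter.",
}
keywords = ["ethic", "ethics", "bioinformatics", "responsible"]
selected = select_articles(abstracts, keywords)
print(selected)
```

The second abstract shows how a term such as responsible can match text that has no ethical content.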
Fig. 5. A Relationship Network of articles (red nodes) and their supporting references (blue nodes). Two keywords found in abstracts were used to locate these articles: responsible and Bioinformatics. The red edges describe common language and connected knowledge as a function of the keywords. (Color figure online)
After parsing, we discovered that some single keywords, shown in Table 2, occurred very rarely in abstracts and could not support building RNs. For example, the terms liability, code of ethics, mathematics, whistle-blowing, general science and computer science were rare in articles. This result was unexpected since these terms were thought to play significant roles in the discussions of ethical concepts across many fields, such as mathematics and general science. Other terms, such as analysis, ethic, ethical and ethics, were abundant; we had expected to find these and potentially use them to create large RNs. The sizes of these RNs according to article-collection size are indicated in Tables 3 and 4. These RNs appear to be sizable on their own; however, according to the rounded proportions of Table 2, these sizes are actually quite small relative to the size of the corpus at the time of this study. The reduction
O. Bonham-Carter
Table 2. The keywords for our study checked for occurrence across 1,380,596 articles in NCBI. We note the term, the number of articles, and their rounded proportions from the corpus (articles / total count). We note that there are keywords for ethical terms which were very rarely found in the corpus.
Keyword | Articles | Proportion
informatics | 6931 | 0.005
analytical | 5532 | 0.004
ethic | 10810 | 0.008
stem | 21023 | 0.015
liability | 303 | 0.0002
analysis | 168036 | 0.122
bioinformatics | 6305 | 0.005
ethics | 6366 | 0.005
responsible | 16093 | 0.012
code of ethics | 22 | 0
informed consent | 2128 | 0.002
ethical | 5281 | 0.004
mathematics | 296 | 0
responsibility | 1498 | 0.001
whistle-blowing | 1 | 0
general science | 6 | 0
trust | 1868 | 0.001
computer science | 74 | 0
biology | 6777 | 0.005
in size of RNs may suggest that much communication concerning ethical matters in research was omitted from seemingly thousands of corpus articles. Keywords in Pairs. We studied pairs of keywords and the types and sizes of RNs that could be produced as a function of Bioinformatics, Biology and other disciplines. In Fig. 5, we build an RN containing 125 articles using the paired keywords responsible and Bioinformatics. We created another RN containing 181 articles from the pair Biology and responsible. While the numbers of collected articles seem sizable, when one considers that there are 1,380,596 potential articles in which keywords may be found to create RNs, we would have expected to build larger RNs for the above-mentioned pairs. The total number of collected articles for RNs of keyword pairs is given in Tables 3, 4 and 5. We note that these tables are not exhaustive: their pairs were created by permutation of the terms from Table 1, and their counts have been sorted in descending order. Finding information about ethical considerations in methodology was a focus of our study. For instance, we found discussion of ethical content in RN articles when using the terms Bioinformatics and responsible. Randomly selecting an article from the network, we found focused ethical discussion in the article by Gasparich et al. [9]. Likewise, for Bioinformatics and ethics, we uncovered three articles. We randomly selected the article by Jamil et al. [12] from this network and found ethically themed discussion of the method. Our results did not always uncover articles having discussion of ethical methodologies. For example, in one of the uncovered articles, by Tang et al. [27], the words Bioinformatics and responsible were both found. However, the word
Fig. 6. A Relationship Network of articles (red nodes) and their supporting references (blue nodes). The following keywords were used to build this RN: analysis, Bioinformatics and responsible. (Color figure online)
responsible was not used in the context of ethical conduct. Other examples of this phenomenon were found in [6,23,32]. We concluded that, while the term responsible could be used to build RNs containing ethical discussion, in many cases the term was also used in non-ethical discussion. Keywords in Triples. To target articles that were more specifically concerned with disciplines, research and ethical themes, we employed three keywords (i.e., triplets) to build RNs. In Fig. 6, we display a large RN of 72 articles, constructed from the terms analysis, Bioinformatics and responsible. In Fig. 7, we display another RN of 35 articles for the terms analysis, Biology and responsible. The term responsible was still being used in a majority of articles having no ethical discussion. When the term ethical replaced responsible, three articles were recovered in which ethical discussion was found: [4,25,29]. In Table 6, there are other triplets from which RNs could be built. While it is encouraging that RNs can be built from the articles specified in these tables, we note again that the corpus contained over a million articles, and that the numbers collected by the triplets were generally quite small. In some cases, no articles were recovered from keyword combinations. Across all RNs we created, we noted the number of edges in the RN. We were especially interested in the edges that connected articles to other articles (not references, but other articles), which we call here “red to red node” edges. These edges suggested that one article had been written under the influence of another, hence the citation. In the RN of Fig. 6 for the terms analysis, Bioinformatics and responsible (72 articles), discounting the edges to references (i.e., blue nodes), there were many cases of red to red node edges, which were expected. For instance, by inspection, the RN of Fig. 6 contained many edges that connected articles of Bioinformatics themes to articles stemming from Biology, Informatics, and other areas.
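The pair and triplet counts discussed above can be sketched as a co-occurrence tally. The sketch assumes that each article already has a list of matched keywords, and it treats keyword groups as unordered combinations (the tables list each pair once, in alphabetical order); the function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(hits_per_article, size):
    """Count the articles in which each keyword combination of `size` occurs together."""
    counts = Counter()
    for hits in hits_per_article.values():
        # Sort so that each combination has one canonical, alphabetical form.
        for combo in combinations(sorted(set(hits)), size):
            counts[combo] += 1
    return counts.most_common()  # ranked by count, descending


# Hypothetical matched keywords per article.
hits_per_article = {
    "a1": ["analysis", "bioinformatics", "responsible"],
    "a2": ["analysis", "responsible"],
    "a3": ["bioinformatics", "responsible"],
    "a4": ["analysis"],
}
pairs = cooccurrence_counts(hits_per_article, 2)
triplets = cooccurrence_counts(hits_per_article, 3)
print(pairs)
print(triplets)
```

Each added keyword can only shrink the set of matching articles, which is consistent with the triplet counts being smaller than the corresponding pair counts.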
Fig. 7. A Relationship Network of articles (red nodes) and their supporting references (blue nodes). The following keywords were used to build this RN: analysis, Biology and responsible. (Color figure online)
The influence of its parent discipline over Bioinformatics research was expected. For instance, in the RN created by the terms analysis and ethics (1430 articles), we noted that about one-fourth of the article nodes exhibited a red to red node edge. This suggests that about three-fourths of these publications had no influence to or from other articles. We note that it is generally uncommon for articles of ethical distinction to reference other articles in research. This trend was consistent for Biology and Mathematics (we were not able to build RNs for Computer Science). The most ethically influenced RN (ignoring individual disciplines) is shown in Fig. 8. In the network, there are 6366 total articles taken from the corpus for which the terms analysis and ethics were found in abstracts. As noted, there is high coverage of red to red node edges, which suggests a high degree of connection between articles. According to our data shown in Table 3, this RN may be considered to have the largest number of red to red node edges. However, as there are only 6366 articles from the total of 1,380,596 in the corpus, one may posit that much more ethical influence is necessary across many of the research articles. In addition, while there are many red to red node edges in the RN, we note that there are still many red nodes which seem to share little to no influence of ethical thinking from other nodes.
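The proportion of article nodes with a red to red node edge, and the share of the corpus covered by an RN, can be computed as in the sketch below. It assumes edges are given as node-identifier pairs; the function name and the toy network are hypothetical, while the corpus figures are those reported in the text.

```python
def article_link_fraction(article_ids, edges):
    """Fraction of article nodes that have at least one edge to another article."""
    articles = set(article_ids)
    linked = set()
    for u, v in edges:
        # Only article-to-article ("red to red node") edges count.
        if u in articles and v in articles and u != v:
            linked.update((u, v))
    return len(linked) / len(articles) if articles else 0.0


# Hypothetical toy network: four articles, one shared reference.
article_ids = ["A1", "A2", "A3", "A4"]
edges = [("A1", "A2"), ("A1", "R1"), ("A3", "R1")]
fraction = article_link_fraction(article_ids, edges)

# Share of the corpus covered by an RN of 6366 articles (figures from the text).
corpus_share = 6366 / 1_380_596
print(fraction, round(corpus_share, 3))
```

Here A3 and A4 have no article-to-article edge, so half of the article nodes count as linked.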
4 Conclusions
The RNs of our work were generated using the permuted pairs and triplets of keywords from Table 1. The keyword combinations may be found in pairs (Tables 3, 4 and 5) and triplets (Table 6). Using broad nets to catch ethical discussion in the corpus, we address two large RNs. In Fig. 8, we display the RN created from all available articles (1430
Table 3. Part 1: A non-exhaustive list ranking pairs of keywords across 1,380,596 total articles. The counts represent the total number of articles in which the keywords were found together.
Keywords | Articles
ethic, ethics | 6366
bioinformatics, informatics | 6305
ethic, ethical | 5281
analysis, informatics | 3384
analysis, stem | 3270
analysis, bioinformatics | 3252
analysis, responsible | 2732
analysis, ethic | 2328
analysis, analytical | 1722
analysis, ethics | 1430
analysis, biology | 1220
analysis, ethical | 1084
Table 4. Part 2: A non-exhaustive list ranking pairs of keywords across 1,380,596 total articles. The counts represent the total number of articles in which the keywords were found together.
Keywords | Articles
ethical, ethics | 904
ethic, informed consent | 898
ethics, informed consent | 619
responsible, stem | 588
biology, stem | 492
analysis, trust | 427
analysis, informed consent | 412
ethical, informed consent | 377
analysis, responsibility | 309
biology, informatics | 244
ethic, responsible | 228
bioinformatics, biology | 222
ethic, stem | 205
biology, responsible | 181
ethical, stem | 167
informatics, stem | 155
bioinformatics, stem | 152
ethic, responsibility | 147
ethical, responsible | 140
informatics, responsible | 133
bioinformatics, responsible | 125
ethics, responsible | 121
ethical, responsibility | 108
total) in the corpus for which the terms analysis and ethics were found. In Fig. 9, we cast another wide net to find all articles (1084 total) from the corpus having the terms ethical and ethics. In these figures, we note highly connected networks. For instance, there is much red to red node connectivity throughout the networks, suggesting that there is much influence of ideas across the articles. However, when we consider that these RNs represent very minor subsets of
Table 5. A non-exhaustive list of the middle highest ranking pairs of keywords across 1,380,596 total articles. The counts represent the total number of articles in which the keywords were found together.
Keywords | Articles
analytical, ethic | 98
responsibility, responsible | 96
ethic, trust | 95
mathematics, stem | 87
analytical, informatics | 83
analytical, responsible | 78
analytical, bioinformatics | 73
ethical, trust | 63
biology, ethic | 63
analytical, biology | 62
biology, mathematics | 61
ethics, responsibility | 61
analysis, liability | 60
analytical, ethical | 54
analytical, ethics | 53
analysis, mathematics | 52
ethics, trust | 51
analytical, stem | 48
responsibility, trust | 45
ethics, stem | 44
ethic, informatics | 43
biology, ethical | 42
informed consent, responsible | 39
responsible, trust | 39
ethical, informatics | 32
biology, ethics | 27
analytical, responsibility | 24
analytical, trust | 23
code of ethics, ethics | 22
code of ethics, ethic | 22
informed consent, stem | 20
informatics, trust | 20
informed consent, trust | 19
analytical, informed consent | 18
ethics, informatics | 16
code of ethics, ethical | 15
informed consent, responsibility | 15
analysis, computer science | 15
bioinformatics, ethic | 14
liability, responsible | 14
liability, responsibility | 13
stem, trust | 11
bioinformatics, ethical | 10
ethic, liability | 10
Fig. 8. A Relationship Network of articles (red nodes) and their supporting references (blue nodes). The following keywords were used to build this RN: analysis and ethics. These terms signaled seemingly all ethical discussion from any article in the corpus. While there are many red to red node edges, indicating influence, there are still many articles having no influence from others. (Color figure online)
Table 6. A non-exhaustive list of the second highest ranking triplets of keywords across 1,380,596 total articles. The counts represent the total number of articles in which the keywords were found together.
Keywords | Articles
analysis, bioinformatics, informatics | 3252
analysis, ethic, ethics | 1430
analysis, ethic, ethical | 1084
ethic, ethical, ethics | 904
ethic, ethics, informed consent | 619
ethic, ethical, informed consent | 377
bioinformatics, biology, informatics | 222
analysis, ethical, ethics | 200
ethic, ethical, stem | 167
analysis, ethic, informed consent | 163
bioinformatics, informatics, stem | 152
ethic, ethical, responsible | 140
bioinformatics, informatics, responsible | 125
ethic, ethics, responsible | 121
analysis, ethics, informed consent | 116
ethic, ethical, responsibility | 108
ethical, ethics, informed consent | 101
analysis, biology, informatics | 97
analysis, bioinformatics, biology | 92
analysis, responsible, stem | 89
analysis, informatics, stem | 85
analysis, bioinformatics, stem | 85
analysis, informatics, responsible | 74
analytical, bioinformatics, informatics | 73
analysis, bioinformatics, responsible | 72
analysis, biology, stem | 72
analysis, ethical, informed consent | 68
ethic, ethical, trust | 63
ethic, ethics, responsibility | 61
analytical, ethic, ethical | 54
analytical, ethic, ethics | 53
ethic, ethics, trust | 51
analysis, ethic, responsible | 45
ethic, ethics, stem | 44
Fig. 9. A Relationship Network of articles (red nodes) and their supporting references (blue nodes). The following keywords were used to build this RN: ethical and ethics.
the 1.3 million articles which were processed by BeagleTM, it would seem that the networks are actually quite small. The pursuit of knowledge must be balanced with ethical considerations to ensure that responsible and accountable practices are followed. As discussed in Resnik et al. [7] and Reijers et al. [24], the purpose of having general ethics in research is to provide guidance on how to conduct research.
The steps to attain scientific knowledge (i.e., initiation, conduct, completion and publication) are not trivial tasks and must be coordinated, especially in fields as diverse as Bioinformatics. These ethical guidelines direct conduct in seemingly all aspects of the research process, such as cooperating with those who are involved with the project (investigators and participants), handling data, working with materials, and preparing accurate results, in addition to many other chores of investigative work. In other diverse and developing areas such as informatics and STEM, ethical guidelines are necessary for the cultivation of research and for the evolution of the field. Having ethical standards implies that the outcome of the work, whatever it may be, is as close to the actual truth as possible. This in turn implies that the work may be reproduced, and that the new and former results will corroborate each other. Furthermore, since much of scientific research is grant-funded by public sources, reporting ethical guidelines helps to ensure transparency so that knowledge derived from a study will have value to the public. If these tenets of the research process are missing, whether in Bioinformatics or otherwise, then the tradition and quality of research may deteriorate over time. It would seem that much of the effort to bring ethical reasoning into research projects may be prompted by other well-known, ethically conscientious articles in the field. With each additional piece of evidence of connectivity described in the RNs built from ethical terms in our results, one may become increasingly optimistic about the forward development of the field’s research. For example, in well-connected networks with many nodes and edges, there may well be much evidence of a busy communication of ideas across the discipline.
In terms of ethical ideas, high connections might suggest that articles became closely connected thanks to their authors sharing ideas about ethics and other topics to inform the style of writing articles. We note that the habit in research of sharing ideas about ethical procedure is a driving force of the community behind the research. In addition, new researchers will surely benefit from the ethical influence that is likely to stem from the popular articles of the published literature.
5 Future Works
One of the main faults of our project was that our keywords were not always suitable to signal ethical discussion in the literature. For example, our technique is likely to incorrectly categorize articles which are in fact “ethically conscientious.” This is because we scanned only the abstracts of the articles to determine their ethical status. In future work, we will broaden the parsing window of articles beyond the abstracts to locate the details of ethical preparations. This change in technique will involve the addition of Natural Language Processing (NLP) for its AI support in discerning types of discussion. Our more immediate steps in this study are to broaden the list of ethical keywords to improve the detection of ethical language and themes across the
corpus. When working with keywords having multiple definitions, such as the term responsible, we noted many false positives which could have been avoided by better signals from keywords. It has not escaped our attention that a single keyword may not be capable of locating a specific article. To combat this potential failure, we will investigate pairs, triplets and other combinations of keywords which may help to locate articles of very specific themes. Since research follows fashions and trends, we speculate that including discussion of ethical themes in articles may be a recent trend. To learn more about the trends in publishing ethical discussions, we intend to study articles in terms of their date of publication. For instance, if it were known that recent articles in Bioinformatics are more likely to publish information about the ethical conduct of the work, then we might suggest that the field is undergoing a type of evolution. Such a phenomenon in Bioinformatics could be used to infer similar types of development in adjacent disciplines.
References
1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
2. Bonham-Carter, O.: BeagleTM: an adaptable text mining method for relationship discovery in literature. In: Advances in Information and Communication: Proceedings of the 2020 Future of Information and Communication Conference (FICC), Volume 2, pp. 237–256. Springer (2020)
3. Bonham-Carter, O., Bastola, D.R.: A text mining application for linking functionally stressed-proteins to their post-translational modifications. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 611–614. IEEE (2015)
4. Cao, X., et al.: Impact of Helicobacter pylori on the gastric microbiome in patients with chronic gastritis: a systematic review and meta-analysis protocol. BMJ Open 13(3), e050476 (2023)
5. Chamot, A.U., O’Malley, J.M.: The CALLA handbook: Implementing the cognitive academic language learning approach. Addison-Wesley Publishing Company, Reading, MA (1994)
6. Das, M., et al.: In silico investigation of conserved miRNAs and their targets from the expressed sequence tags in Neospora caninum genome. Bioinformatics Biology Insights 15, 11779322211046729 (2021)
7. David, R., et al.: What is ethics in research & why is it important? (2015)
8. Feinerer, I.: Introduction to the tm package: text mining in R (2017)
9. Gasparich, G.E., Wimmers, L.: Integration of ethics across the curriculum: From first year through senior seminar. J. Microbiol. Biol. Educ. 15(2), 218–223 (2014)
10. Gurcan, F., Cagiltay, N.E.: Research trends on distance learning: a text mining-based literature review from 2008 to 2018. Interact. Learn. Environ. 31(2), 1007–1028 (2023)
11. Henze, L., et al.: Towards biomarkers for outcomes after pancreatic ductal adenocarcinoma and ischaemic stroke, with focus on (co)-morbidity and ageing/cellular senescence (SASKit): protocol for a prospective cohort study. BMJ Open 10(12), e039560 (2020)
12. Jamil, K., Jayaraman, A., Ahmad, J., Joshi, S., Yerra, S.K.: TNF-alpha -308G/A and -238G/A polymorphisms and its protein network associated with type 2 diabetes mellitus. Saudi J. Biol. Sci. 24(6), 1195–1203 (2017)
13. Lawrence, J.F., Knoph, R., McIlraith, A., Kulesz, P.A., Francis, D.J.: Reading comprehension and academic vocabulary: exploring relations of item features and reading proficiency. Reading Res. Q. 57(2), 669–690 (2022)
14. Liu, K., Liu, W., He, A.J.: Evaluating health policies with subnational disparities: a text-mining analysis of the urban employee basic medical insurance scheme in China. Health Policy Plann. 38(1), 83–96 (2023)
15. Manoharan, S., Iyyappan, O.R.: A hybrid protocol for finding novel gene targets for various diseases using microarray expression data analysis and text mining. In: Biomedical Text Mining, pp. 41–70. Springer (2022). https://doi.org/10.1007/978-1-0716-2305-3_3
16. Maynard, D., Roberts, I., Greenwood, M.A., Rout, D., Bontcheva, K.: A framework for real-time semantic social media analysis. Web Semantics: Science, Services and Agents on the World Wide Web (2017)
17. McKay, E., Richmond, S., Kirk, H., Anderson, V., Catroppa, C., Cornish, K.: Training attention in children with acquired brain injury: a study protocol of a randomised controlled trial of the TALI attention training programme. BMJ Open 9(12), e032619 (2019)
18. Müller, H.-M., Van Auken, K.M., Li, Y., Sternberg, P.W.: Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature. BMC Bioinform. 19(1), 94 (2018)
19. Nagy, W., Townsend, D.: Words as tools: Learning academic vocabulary as language acquisition. Read. Res. Q. 47(1), 91–108 (2012)
20. Neri, N.C., Retelsdorf, J.: The role of linguistic features in science and math comprehension and performance: a systematic review and desiderata for future research. Educational Research Review, p. 100460 (2022)
21. Patnaik, S.K.: Can microRNA profiles predict corticosteroid responsiveness in childhood nephrotic syndrome? A study protocol. BMJ Paediatrics Open 2(1) (2018)
22. Paynter, R., et al.: EPC methods: an exploration of the use of text-mining software in systematic reviews (2016)
23. Ravichandran, S., Hartmann, A., Del Sol, A.: SigHotSpotter: scRNA-seq-based computational tool to control cell subpopulation phenotypes for cellular rejuvenation strategies (2020)
24. Reijers, W., Wright, D., Brey, P., Weber, K., Rodrigues, R., O’Sullivan, D., Gordijn, B.: Methods for practising ethics in research and innovation: a literature review, critical analysis and recommendations. Sci. Eng. Ethics 24, 1437–1481 (2018)
25. Saibaba, G., Rajesh, D., Muthukumar, S., Sathiyanarayanan, G., Aarthy, A.P., Archunan, G.: Salivary proteome profile of women during fertile phase of menstrual cycle as characterized by mass spectrometry. Gynecol. Minimally Invasive Therapy 10(4), 226 (2021)
26. Takacs, V., O’Brien, C.D.: Trends and gaps in biodiversity and ecosystem services research: a text mining approach. Ambio 52(1), 81–94 (2023)
27. Tang, W., Liang, P.: Comparative genomics analysis reveals high levels of differential retrotransposition among primates from the Hominidae and the Cercopithecidae families. Genome Biol. Evol. 11(11), 3309–3325 (2019)
28. Vähäkangas, K.: Research ethics in the post-genomic era. Environ. Mol. Mutagen. 54(7), 599–610 (2013)
29. Vrijenhoek, T., et al.: Next-generation sequencing-based genome diagnostics across clinical genetics centers: implementation choices and their effects. Europ. J. Human Genetics 23(9), 1142–1150 (2015)
30. Williams, G.J., et al.: Rattle: a data mining GUI for R. R J. 1(2), 45–55 (2009)
31. Yoo, R., et al.: Exploring the nexus between food and veg*n lifestyle via text mining-based online community analytics. Food Quality Preference 104, 104714 (2023)
32. Zhang, Y., Guo, H., Ma, L., Chen, X., Chen, G.: Long noncoding RNA LINC00839 promotes the malignant progression of osteosarcoma by competitively binding to microRNA-454-3p and consequently increasing c-Met expression [retraction]. Cancer Manage. Res. 13, 8007–8008 (2021)
A Multi-Criteria Decision Analysis Approach for Predicting User Popularity on Social Media
Abdullah Almutairi and Danda B. Rawat
EECS Department, Howard University, Washington, DC 20059, USA
{almutairi,db.rawat}@ieee.org
Abstract. Social media platforms like Facebook and Twitter have enabled users to connect and share information on an unprecedented scale. However, the extensive adoption of these platforms has led to only a select few gaining mass popularity. Identification of popular social media influencers is crucial for applications like marketing and recommendations. But manual assessment of user popularity is challenging due to numerous multivariate indicators of influence. In this paper, we propose an intelligent system to predict social media user popularity by applying Multi-Criteria Decision Analysis (MCDA). Our approach selects key popularity criteria based on profile data from Facebook and Twitter, such as follower count, friends and tweets. We construct a comparison matrix that systematically assigns weights to each criterion according to its importance for determining popularity. The weighted criteria are then utilized in a prediction algorithm that collects user data via APIs, normalizes it and computes an overall popularity score. We discuss the system design and key technical details of the platform. The structured weighting technique and flexible data-driven prediction algorithm offer an automated and scalable means to identify top social media influencers. We highlight the ability to customize the criteria comparisons and algorithm for different contexts. The proposed approach advances capabilities for analyzing social media user influence.
Keywords: Social media · Behavior · MCDA method · Predict · Social network
1 Introduction
Data analytics is a recent technological advancement widely used on the Internet to collect and analyze large amounts of data to draw conclusions about a specific topic. Data mining is a technique used to find the most effective solutions by analyzing enormous amounts of data. The question remains: what insights can we gain about people through data mining? Can we predict a person’s popularity based on their online profiles? This paper aims to develop a set of criteria based on gathered data to forecast an individual’s popularity rating, demonstrating the true potential and necessity of social media and the Internet. The remainder of this paper is organized as follows. First, we provide a review of related work on social media popularity prediction using data mining
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): FICC 2024, LNNS 921, pp. 290–303, 2024. https://doi.org/10.1007/978-3-031-54053-0_21
Predicting User Popularity
approaches. Next, we present background information on Multi-Criteria Decision Analysis models. We then state the research goal, objectives, and details of the proposed intelligent popularity prediction system. The system design, architecture, prediction methodology, and algorithm are described. Experimental results on sample datasets are presented to demonstrate the functionality and performance of the approach. We conclude by summarizing the contributions of the paper and identifying future work to enhance the system. This paper aims to showcase an innovative application of Multi-Criteria Decision Analysis for automated social media user popularity quantification that can enable more effective influencer identification.
1.1 Need Analysis
Social media and the Internet have become central to our lives, with people often influenced to follow others based on common interests or connections. However, sometimes the recommended friends may not have much in common. While social networking sites do a decent job of connecting people, a person’s online profile can provide more accurate information. If the information gathered is accurate, almost anything about a person can be predicted, including their level of popularity. This kind of analysis can also help identify trends and similarities between people. By making predictions based on a person’s characteristics and similarities, it may be possible to anticipate significant life events, such as marriage or having a baby. With accurate information, it is possible to predict more than just individual popularity, and this field of analysis has significant potential for the future.
1.2 Problem Statement
Social media has become an essential aspect of modern life, with many individuals and companies using various platforms to share information. Popularity on social media is critical to successfully conveying information, with highly popular individuals, referred to as social media influencers (SMIs), having a large audience on platforms such as YouTube, Facebook, Instagram, and TikTok. To reach a large audience for business objectives, brand and product marketers can use SMIs. It is, therefore, essential to predict the popularity of individual users to identify appropriate collaboration partners. However, social media is fragmented across multiple platforms, making manual assessment of SMI popularity challenging due to many potentially informative metrics. Hence, there is a need to develop advanced autonomous methods of predicting social media popularity rapidly. The ability to follow almost anyone on social media has prompted the question of what makes a person unique and popular. Different opinions exist on the definition of coolness, with some associating it with popularity, while others associate it with good looks. To determine an individual’s popularity, data from various social media platforms needs to be analyzed against a set of criteria. By mining through this data, it is believed that an individual’s popularity can be determined.
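The scoring idea outlined in the abstract (weights taken from a comparison matrix, normalized criteria, one overall score) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' algorithm: it assumes an AHP-style pairwise comparison matrix with row geometric-mean weights and min-max normalization, and the criteria names, comparison values and user data are all hypothetical.

```python
import math


def weights_from_comparisons(matrix):
    """Derive weights from a pairwise comparison matrix via row geometric means
    (an AHP-style approximation, assumed here for illustration)."""
    geo_means = [math.prod(row) ** (1.0 / len(row)) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def min_max_normalize(values):
    """Scale values to the range [0, 1]; a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def popularity_scores(users, criteria, weights):
    """Weighted sum of normalized criteria for each user."""
    columns = {c: min_max_normalize([u[c] for u in users.values()]) for c in criteria}
    return {
        name: sum(w * columns[c][i] for c, w in zip(criteria, weights))
        for i, name in enumerate(users)
    }


# Hypothetical criteria and judgments: followers matter more than friends,
# and friends matter more than tweets.
criteria = ["followers", "friends", "tweets"]
comparisons = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
users = {
    "u1": {"followers": 1000, "friends": 200, "tweets": 50},
    "u2": {"followers": 50000, "friends": 300, "tweets": 1200},
    "u3": {"followers": 200, "friends": 800, "tweets": 10},
}
weights = weights_from_comparisons(comparisons)
scores = popularity_scores(users, criteria, weights)
print(weights)
print(scores)
```

Changing the entries of the comparison matrix changes the weights and therefore the ranking, which is one way the criteria comparisons could be customized for different contexts.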
A. Almutairi and D. B. Rawat
2 Literature Review
According to recent literature, data mining and machine learning are effective in quickly and automatically predicting social media popularity. These approaches can be used to predict popular individuals or identify popular posts. This section provides an overview of current research in both areas.
2.1 Directly Predicting Popular Individuals
Arora et al. [1] propose measuring an influencer index to directly predict popular individuals on social media platforms. This metric refers to the ability to instigate action and receive users’ engagement on a post. The authors use a regression approach to model several features that influence popularity, such as posting rate, audience sentiment, daily engagement velocity, hourly engagement velocity, engagements and outreach, and overall footprint. Different machine learning methods, including Lasso regression, support vector regression (SVR), K-NN regression (KNN), and ordinary least squares (OLS), are then implemented to quantify the cumulative influencer index score. The findings indicate that combining the four machine learning algorithms achieves the greatest prediction accuracy, although KNN performs adequately well on its own. The most critical popularity metrics are engagements and outreach, audience sentiment, daily engagement velocity, and hourly engagement velocity. These results support the use of machine learning for accurately predicting social media popularity. In a similar vein, [2] developed a framework that combines machine learning modules and semantic analysis to predict users’ credibility across different domains and times. For sentiment analysis, they used AlchemyAPI to assess the taxonomies of each user and the website content of the affiliated URL [2]. The study employed several machine learning-based classifiers, such as Naïve Bayes, logistic classifier, tree-based classifier, deep learning classifier, generalized linear model, random forest, and gradient boosted tree, to predict social media features. The generalized linear model yielded the best outcome in predicting highly influential domain-based users. However, [3] validated gradient boosting and demonstrated that it can achieve a balanced accuracy of 64.72% in real-world scenarios.
These conclusions further support the efficacy of machine learning in directly detecting popular individuals on social media platforms. On the other hand, [4] used text mining and data mining methods to quantify the influence of users’ electronic word-of-mouth (e-WOM). The underlying premise is that popular individuals have highly influential e-WOM. The decision tree was selected as the data mining approach because it uses the most relevant input variables to create decision tree models [4]. R programming was also used to reduce the error made by the data mining tools. The results indicate that the most popular social media accounts exhibit a strong association between e-WOM and content length, followings, and number of followers. Decision trees are also effective at rapidly and accurately gauging these metrics to determine popularity.
Predicting User Popularity
2.2 Detecting Popular Posts
[5] propose and test a machine learning-based method of predicting the popularity of social media content. The authors consider both social networks and early adopters for quantifying popularity. Furthermore, they recommend explicitly capturing the cascading effect in a network. In particular, a target user’s activation state is modeled considering the neighbors’ influence and activation state. A novel machine learning model is developed to achieve this by combining two coupled graph neural networks to quantify the interaction between the spread of influence and node activation states [5]. The use of stacked neural network layers helps capture the cascading effect. The developed model accurately identifies popular posts, revealing potential social media influencers (SMIs). Similarly, [6] propose a deep learning-based popularity prediction model. This tool extracts and combines rich information on users, time series, and text content in a data-driven manner [6]. Furthermore, the model comprises three encoders to learn high-level representations of this content. Attention mechanisms are also incorporated to eliminate noise. Notably, the developed tool integrates time embedding to account for the time-varying levels of social media activity and the different time intervals of posts. Empirical assessments confirm the efficacy and accuracy of the proposed deep learning model in predicting popular social media posts. The reviewed articles confirm the efficacy of data mining and machine learning in predicting popular social media users. Both indirect and direct prediction approaches achieve adequately high accuracy and robustness. Therefore, the decision to detect popular individuals or posts should be based on influencer marketing preferences and strategies. Furthermore, the most efficacious machine learning methods include gradient boosting, deep learning, neural networks, generalized linear models, and KNN, as reported by Arora et al. [1].
While current social media popularity prediction techniques have shown promise, they are severely limited by siloed single platform data, lack of a systematic optimization framework for criteria weighting, and inability to adapt predictions to diverse context needs. This hinders their real-world effectiveness for data-driven influencer identification tailored to specific marketing campaigns and audiences. Our proposed Multi-Criteria Decision Analysis approach provides an integrated cross-platform analysis capability lacking in present models. The comparison matrix at the core of our method enables structured, transparent optimization of influence criteria based on statistical techniques like AHP. This alleviates haphazard criteria selection issues that encumber existing techniques. Finally, the customizable algorithm design empowers users with flexibility to tune criteria importance for individualized insights absent in one-size-fits-all predictions of current tools. Together, these multi-platform, optimized and adaptive capabilities significantly advance the state-of-the-art by addressing critical gaps that have obstructed adoption. The proposed research lays the technical foundations for the next generation of tailored, extensible and trustworthy data-driven influence quantification essential for the evolving social media ecosystem.
A. Almutairi and D. B. Rawat
3 Background
3.1 Multi-criteria Decision Analysis (MCDA)
The reviewed literature indicates that MCDA has been extensively studied in various domains, with some investigations focusing on the entire process, while others concentrate on specific techniques like TOPSIS and decision matrices. Moreover, certain studies have compared the effectiveness of these methods in analyzing diverse decision options. This review article provides a chronological evaluation of prior research on the use of MCDA to improve decision-making accuracy in various fields, with a particular emphasis on studies that examine human behavior factors such as emotions and social connections. Furthermore, the literature reviewed may also explore aspects related to social media (see Table 5), including the internet, social networks, and social computing [7].
3.2 Predictions
Predictions are defined as statements about what will happen or might happen in the future [6]. Predictions have the potential to provide companies with consumer-specific information. For instance, Netflix sponsored a competition to improve their current movie rating system. Netflix uses software called Cinematch to analyze viewing habits and recommend other movies that the user may enjoy. When customers log into Netflix, they can rate any movie from one to five stars, allowing the Netflix Cinematch system to learn an individual’s preferences. During the competition, Netflix allowed contestants access to previous ratings for a particular movie, but the actual movie titles were not identified. Instead, each movie was identified by a number. If a competitor could develop an algorithm that improved the current Netflix rating system by 10 percent, they would win the competition [6].
4 Research Goal and Objectives
4.1 Goal
The goal of this paper is to determine how popular a person really is based on their social media profiles. By analyzing social media data and comparing it to set criteria, a popularity rating can be predicted. This goal will be achieved through the following objectives:
1. Specify the criteria for how popular a person is.
2. Use a programming language to interact with established social media APIs and gather data specific to the criteria.
3. Store the gathered data in session variables so that it can be accessed globally.
4. Produce an importance scale to compare and apply weights to all data.
5. Reference the session variables to calculate a popularity rating using an algorithm.
5 Proposed System
This section will provide a detailed discussion of the building and architectural design of the Popularity of a Person Predictor. The architecture design will provide an overview of the entire web tool and its features. The detailed design will include the APIs, extraction system, database, human factors, and data analysis.
5.1 Architectural Design
Figure 1 shows the entire system architecture. The user will be instructed to choose the social media platform they wish to log in with. Once the user is logged in, the web tool will communicate with that social media platform’s API. Information such as username, email, follower count, profile picture, and much other user-specific information can be pulled from social media APIs. This gathered information is then stored inside a database. In order to make a prediction about a person’s popularity, the information must be pulled and analyzed. An algorithm is used to compare data, perform calculations, and make a prediction. Criteria that are not relevant to determining the popularity of an individual are eliminated and not included in the prediction. Finally, the tool outputs the popularity score for that user based on the calculations performed by the algorithm.
5.2 Detailed Design for Proposed System
API. An Application Programming Interface (API) is a set of routines, protocols, and tools for building software applications [8]. Websites grant developers access to their API, which allows developers to write code that can communicate with the API and retrieve the necessary information. For example, Twitter, Facebook, and Google all provide APIs that can be used to access user data needed for analysis.
Data Extraction. API documentation often changes, which affects the information developers have access to. For instance, to develop a Facebook app, a developer must create an app within the Facebook developer console. Once the app is created, the developer is provided with an app ID and app secret, which allow the developer’s code to communicate with the API and access user data. This process is called authentication (Auth), and it provides applications with a secure way to allow access to servers without sharing their credentials [9]. Users specify which social media platform they want to use to log in. Once the user is logged in and authenticated, the tool can communicate with the API to gather user information. To build the user’s profile inside the database, information such as ID, email, display name, and picture data is extracted and saved into a database. The user’s status updates, tweets, about me, friends, followers, and other user-specific information can be accessed and added to the user’s profile by communicating with the API.
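As a rough illustration of the extraction step, the sketch below decodes a JSON response and stores a profile record. It rests on several assumptions: the sample payload and its field names are hypothetical and do not correspond to any real platform API, the helper names `extract_profile` and `save_profile` are invented for this example, Python stands in for the PHP used by the tool, and an in-memory SQLite database stands in for MariaDB.

```python
import json
import sqlite3

# Hypothetical API response; real platform payloads use their own field names.
SAMPLE_RESPONSE = """
{"id": "12345", "email": "user@example.com", "name": "Example User",
 "picture": {"url": "https://example.com/p.jpg"},
 "friends": {"total_count": 250}}
"""


def extract_profile(raw_json):
    """Decode a JSON response and keep only the fields needed for the profile."""
    data = json.loads(raw_json)
    return {
        "id": data["id"],
        "email": data.get("email"),
        "display_name": data.get("name"),
        "picture_url": data.get("picture", {}).get("url"),
        "friend_count": data.get("friends", {}).get("total_count", 0),
    }


def save_profile(conn, profile):
    """Insert or update one profile row using a parameterized SQL query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS profiles ("
        "id TEXT PRIMARY KEY, email TEXT, display_name TEXT, "
        "picture_url TEXT, friend_count INTEGER)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO profiles VALUES "
        "(:id, :email, :display_name, :picture_url, :friend_count)",
        profile,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for MariaDB
profile = extract_profile(SAMPLE_RESPONSE)
save_profile(conn, profile)
row = conn.execute(
    "SELECT display_name, friend_count FROM profiles WHERE id = ?", ("12345",)
).fetchone()
print(row)
```

Parameterized queries are used here so that values returned by an external API are never concatenated into SQL text; the same idea applies to prepared statements in PHP.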
[Figure 1 diagram blocks: Database, Data extraction system, Data Analysis, Criteria, Display]
Fig. 1. Overview of proposed social media influencer prediction system showing data collection via APIs, criteria selection, prediction algorithm, and output of influence scores
Tools for Extraction. The PHP programming language will be used for communicating with and collecting data from the APIs. HTML is used to display content on the web, and HTML and PHP work together to provide functionality, communicate with the database, and display content. Cascading Style Sheets (CSS) are files saved with the .css extension; together with Bootstrap, they determine how the HTML page is styled and laid out. Bootstrap is an HTML, CSS, and JavaScript framework that provides flexible, responsive layouts regardless of the device on which the web application is used. CSS and Bootstrap are used to give the web tool a user-friendly experience.
Database. MariaDB is a drop-in replacement for MySQL, with additional features and better performance. MySQL is a relational database that uses tables to store and organize data. Developers communicate with the database by forming SQL queries, which are used to insert and retrieve records.
Human Factors. Facebook and Twitter specifically contain information regarding how many people are interested in the information a person shares. Information such as how many followers, friends, and likes an individual has can all be used to determine how popular a person is. From the collected data, human factors decide the importance of each piece of data relative to the other data. This information is used to form the criteria for judging each user’s popularity and to output a score.
Data Analysis. Data analysis is a process of collecting large amounts of data to help make a decision or prediction. During this process, all relevant data determined by the criteria will need to be accessed to make a prediction. An algorithm provides a way to sort through and compare data efficiently (see Table 3).
Table 1. Example of multi-criteria decision analysis (absolute values). Each cell gives the importance of the row criterion relative to the column criterion; the last row gives the column sums.
Criterion          Tag. pic  Tag. video  Friends  Ver. FB  Followers  Fav. tweets  Ver. Twitter  Tweets
Tagged pic         1         2           1/3      1/3      1/3        1/3          1/3           2
Tagged video       1/2       1           1/3      1/3      1/3        1/3          1/3           1/2
Number of friends  3         3           1        1        1          2            1/2           3
Verified FB        3         3           1        1        2          2            1             3
Twitter followers  3         3           1        1/2      1          2            1/2           3
Favorited tweets   3         3           1/2      1/2      1/2        1            1/3           3
Verified Twitter   3         3           2        1        2          3            1             3
Tweets             1/2       2           1/3      1/3      1/3        1/3          1/3           1
Column sum         17.0000   20.0000     6.5000   5.0000   7.5000     11.0000      4.3333        18.5000
6 Results
In this section, the results for the criteria are explained. The criteria and the methods used together determine the weight for each element within the criteria.
6.1 Criteria
Table 1 shows the criteria gathered to determine a popularity score. After the criteria are selected, a scale of relative importance, shown in Table 2, is developed. This scale is used to compare all of the criteria to each other and establish their relative importance.
Table 2. Relative importance
Relative importance            Scaled value
Equally important              1
Slightly more important        2
Significantly more important   3
Table 3. Specific criteria used to evaluate social media user influence across platforms
Tagged pic
Tagged Videos
Number of Friends
Verified FB
Number of followers
Favorited Tweets
Verified accounts
Tweets
6.2 Multi-criteria Decision Analysis
Once the scale has been set, all elements of the criteria are placed into a matrix. The Multi-Criteria Decision Analysis (MCDA) matrix is a tool used to make decisions and compare data. It helps to focus on what is important and is logical and consistent. The matrix is used to assign numerical values for the relative importance of each criterion based on the scale. The leftmost column and the topmost row contain the criteria. The remaining cells hold the values of the relative importance of the criterion in that row compared to the criterion in that column. Table 1 gives these values in absolute form, and Table 4 shows them after normalization.
Table 4. Pairwise comparison matrix showing relative importance scoring of criteria for influencer prediction (normalized)
Criterion          Tag. pic  Tag. video  Friends  Ver. FB  Followers  Fav. tweets  Ver. Twitter  Tweets   Average
Tagged pic         0.05882   0.10000     0.05128  0.06667  0.04444    0.03030      0.07692       0.10811  0.0671
Tagged video       0.02941   0.05000     0.05128  0.06667  0.04444    0.03030      0.07692       0.02703  0.0470
Number of friends  0.17647   0.15000     0.15385  0.20000  0.13333    0.18182      0.11538       0.16216  0.1591
Verified FB        0.17647   0.15000     0.15385  0.20000  0.26667    0.18182      0.23077       0.16216  0.1902
Twitter followers  0.17647   0.15000     0.15385  0.10000  0.13333    0.18182      0.11538       0.16216  0.1466
Favorited tweets   0.17647   0.15000     0.07692  0.10000  0.06667    0.09091      0.07692       0.16216  0.1125
Verified Twitter   0.17647   0.15000     0.30769  0.20000  0.26667    0.27273      0.23077       0.16216  0.2208
Tweets             0.02941   0.10000     0.05128  0.06667  0.04444    0.03030      0.07692       0.05405  0.0566
Cells above the diagonal are filled with the user-defined criteria comparisons based on the scaled importance value of each criterion. Cells below the diagonal are auto-generated as the reciprocals of the values that the user defined in the cells above the diagonal. For example, when the user compares “Tagged Pictures” in the leftmost column with “Tagged Videos” in the topmost row, the user determines the significance of “Tagged Pictures” relative to “Tagged Videos.” In this example, “Tagged Pictures” receives a scaled value of 2 because it is “slightly more important” than “Tagged Videos.” If “Tagged Videos” had instead been judged “slightly more important” than “Tagged Pictures,” the scaled value would be the reciprocal, 1/2.
6.3 Justification for Relative Importance
Verified Facebook and Twitter accounts were determined to be equally weighted and ranked highest compared to the rest of the criteria. A verified Twitter or Facebook account indicates an extremely popular individual, usually labeled as a celebrity or athlete. The number of friends and the number of Twitter followers were also weighted equally, but are considered slightly less significant than a verified Facebook or Twitter account. A person may have a lot of Twitter followers and Facebook friends but not a verified account. Ultimately, this person may be very popular amongst their peers but not a celebrity or athlete. Favorited tweets were slightly less important than the number of Facebook friends and Twitter followers, and significantly less important than a verified Facebook or Twitter account. Favorited tweets indicate how many people actually enjoyed what a user is tweeting. Some may enjoy a tweet and never favorite it, some may favorite it instantly, and others may not enjoy the tweets at all. It is safe to assume that the more followers a person has, the more people are exposed to their tweets, which produces more favorited tweets. A large number of tweets does not tell us much about popularity. Someone with a lot of tweets may have a low number of followers. This can also be correlated with a low number of favorited tweets, so the number of tweets can be ranked slightly less important than tagged pictures. Tagged pictures are slightly more important than tweets and videos. Users tend to upload more pictures than videos. The number of tagged pictures correlates with the number of Facebook friends. Vine and Instagram introduced their own video tagging features, which takes away from the Facebook video tagging feature.
6.4 Weight Calculating
The matrix is now filled with data, but the data must be normalized. This is done by dividing each value by the sum of its column. Table 4 shows the table used to calculate the average for each row. The averages are located in the rightmost column of the table highlighted in green, which represents the relative weight for each element of the criteria. These weights can now be applied to the corresponding criteria, and an algorithm is used to compute a “popularity” score.
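The column normalization and row averaging described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the tool itself is written in PHP, Python is used here only for readability, and the names `UPPER`, `build_matrix`, and `weights_from_matrix` are hypothetical. The upper-triangle judgments are transcribed from Table 1, the lower triangle is filled with reciprocals as Sect. 6.2 describes, and exact fractions are used so that rounding does not blur the comparison with the Average column of Table 4.

```python
from fractions import Fraction as F

CRITERIA = [
    "Tagged pic", "Tagged video", "Number of friends", "Verified FB",
    "Twitter followers", "Favorited tweets", "Verified Twitter", "Tweets",
]

# Upper-triangle judgments read from Table 1: UPPER[i][k] compares
# criterion i with criterion i + 1 + k.
UPPER = [
    [F(2), F(1, 3), F(1, 3), F(1, 3), F(1, 3), F(1, 3), F(2)],
    [F(1, 3), F(1, 3), F(1, 3), F(1, 3), F(1, 3), F(1, 2)],
    [F(1), F(1), F(2), F(1, 2), F(3)],
    [F(2), F(2), F(1), F(3)],
    [F(2), F(1, 2), F(3)],
    [F(1, 3), F(3)],
    [F(3)],
]


def build_matrix(upper):
    """Fill a reciprocal pairwise matrix from its upper triangle."""
    n = len(upper) + 1
    m = [[F(1)] * n for _ in range(n)]
    for i, row in enumerate(upper):
        for k, value in enumerate(row):
            j = i + 1 + k
            m[i][j] = value
            m[j][i] = 1 / value  # cell below the diagonal is the reciprocal
    return m


def weights_from_matrix(m):
    """Divide each cell by its column sum, then average each row."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    normalized = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return [sum(row) / n for row in normalized]


matrix = build_matrix(UPPER)
weights = weights_from_matrix(matrix)
for name, w in zip(CRITERIA, weights):
    print(f"{name:18s} {float(w):.4f}")
```

For instance, the first column of Table 1 sums to 17, so the normalized "Tagged pic" versus "Tagged pic" cell is 1/17 ≈ 0.05882, matching the first cell of Table 4. Because every normalized column sums to 1, the resulting weights also sum to 1.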
7 Algorithm
We consider a number of criteria denoted by $n$. Each criterion $i$ has a relative importance $w_i$. We utilize an MCDA matrix, which represents the relative importance of each criterion. The weights for each criterion are calculated and stored in the weights array. Each session variable $v_i$ holds a specific value, and $\mathrm{avg}_i$ is the average value associated with that variable from statistics. To determine the percentage $p_i$, we divide $v_i$ by $\mathrm{avg}_i$ and then divide the result by $\mathrm{avg}_i$. The final score, denoted as score, is calculated by summing the weighted values obtained by multiplying each $p_i$ by the corresponding weight. Finally, the score is multiplied by 100 and can be displayed in the web tool interface.
Let:
– $n$ be the number of criteria
– $w_i$ be the relative importance of criterion $i$
– MCDA_matrix be the matrix representing the relative importance of each criterion
– weights be the calculated weights for each criterion
– $v_i$ be the value of session variable $i$
– $\mathrm{avg}_i$ be the average value related to session variable $i$ from statistics
– $p_i$ be the percentage obtained by dividing $v_i$ by $\mathrm{avg}_i$ and then dividing by $\mathrm{avg}_i$
– score be the final score
Algorithm:
1. Create an empty matrix MCDA_matrix of size $n \times n$.
2. Set MCDA_matrix[i][i] = $w_i$ for each criterion $i$.
3. Calculate the weights by normalizing each column and averaging each row:
$$\mathrm{MCDA\_matrix}[i][j] \leftarrow \frac{\mathrm{MCDA\_matrix}[i][j]}{\sum_{k=1}^{n} \mathrm{MCDA\_matrix}[k][j]}, \qquad j = 1, \dots, n$$
$$\mathrm{weights}[i] = \frac{1}{n} \sum_{j=1}^{n} \mathrm{MCDA\_matrix}[i][j]$$
4. Set score = 0.
5. For each session variable $i$:
$$p_i = \frac{v_i / \mathrm{avg}_i}{\mathrm{avg}_i}$$
$$\mathrm{score} = \mathrm{score} + p_i \times \mathrm{weights}[i]$$
6. Multiply the score by 100: score = score × 100.
7. Obtain the output, which can be displayed in the web interface.
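Steps 4 to 6 of the algorithm can be sketched as follows. This is a minimal illustration, not the authors' PHP code: the function name `popularity_score`, the `divide_twice` flag, and all numbers in the example are hypothetical. The flag exists because this section defines $p_i$ with two divisions by the average, whereas the Conclusion describes dividing each variable by its average once; the default follows this section.

```python
def popularity_score(values, averages, weights, divide_twice=True):
    """Weighted popularity score following steps 4 to 6 of the algorithm.

    values   : session-variable values v_i for one user
    averages : per-user average avg_i for each variable
    weights  : criteria weights taken from the normalized MCDA matrix
    """
    if not (len(values) == len(averages) == len(weights)):
        raise ValueError("values, averages and weights must have equal length")
    score = 0.0
    for v, avg, w in zip(values, averages, weights):
        p = v / avg              # ratio of the user's value to the average
        if divide_twice:
            p = p / avg          # second division, as written in Sect. 7
        score += p * w
    return score * 100


# Hypothetical two-criteria example (not data from the paper).
values = [300, 150]
averages = [200, 100]
weights = [0.6, 0.4]
print(popularity_score(values, averages, weights))
print(popularity_score(values, averages, weights, divide_twice=False))
```

With a single division, a user who sits exactly at the average on every criterion scores 100 whenever the weights sum to 1, which makes the result easy to interpret; with two divisions the score also depends on the magnitude of each average.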
Table 5. Comparison between different related papers
Paper: Proposed method
  Data collection: Facebook and/or Twitter profile data via API
  Objectives: Predict user’s popularity rating
  Methods: Multi-Criteria Decision Analysis (MCDA)
  Outcome: User’s popularity ranking
  Key challenge: Limited to Twitter and Facebook
Paper: Lavanya, T., Miraclin Joyce Pamila, J. C., and Veningston, K. [10]
  Data collection: Online reviews from Twitter
  Objectives: Extract opinion targets and opinion words
  Methods: Word alignment model
  Outcome: Make choice of designing potential consumer-oriented products, e.g., mobile, laptop, and so on
  Key challenge: Large volume of opinionated data generated and available in digital forms
Paper: Meghawat, M., Yadav, S., Mahata, D., Yin, Y., Shah, R. R., and Zimmermann, R. [11]
  Data collection: Multimodal data from social media platforms
  Objectives: Predict social media photos popularity
  Methods: Multimodal approach using visual, textual, and social features
  Outcome: Predict popularity of social media photos in terms of view counts
  Key challenge: No such multimodal dataset exists for the prediction of social media photos
Paper: Niu, X., Li, L., Mei, T., Shen, J., and Xu, K. [12]
  Data collection: Using the data from 30 months in Flickr
  Objectives: Predict popularity of photos in incomplete social network site
  Methods: Collaborative filtering method
  Outcome: A method to predict popularity of photos in incomplete social network site
  Key challenge: Difficulty in getting all of the photos and the whole network due to the large scale of the network in Flickr
8 Future Plans
A current limitation of our model is the reliance solely on Facebook and Twitter data for popularity prediction. To improve coverage and accuracy, we plan to expand the scope to incorporate emerging platforms like Instagram, YouTube, and TikTok which also have large influencer ecosystems. Each additional platform presents unique data ingestion and normalization challenges. For Instagram, collecting engagement metrics on posts requires reverse engineering the feed algorithm. For YouTube, video-specific indicators like view counts, watch time, and subscriber growth rates are relevant. TikTok introduces short viral video dynamics. We plan to develop platform-specific data connectors and criteria evaluation pipelines to handle these varied data types. The prediction algorithm codebase also needs to flexibly incorporate new platforms as modular components. To account for cross-platform effects, network analysis and influence tracing methods will be applied. We envision developing a unified influence scoring system spanning the major social media ecosystems. This expansion will provide more comprehensive assessment of influencer impact and help identify rising stars native to new platforms. The challenges include scaling data processing, handling disparate data formats, and adapting the model to highly dynamic environments like TikTok. But overcoming these will bolster our position as an industry leader in multi-platform social media influencer analytics.
While our current approach relies on an MCDA comparison matrix, we intend to explore more advanced methods for optimizing the criteria weights. One approach is the Analytic Hierarchy Process (AHP) which constructs layered hierarchical criteria models for systematic weight derivation through structured pairwise comparisons. AHP provides ratio-scale priorities for each indicator and consistency measures to improve robustness. Another technique we plan to evaluate is training machine learning models to predict influence labels based on various criteria permutations. The optimized feature importances output from algorithms like gradient boosted trees can inform the weighting. This data-driven approach avoids human subjectivity.
Both AHP and ML-based techniques allow mathematically optimizing the weights versus qualitative assignment. We will research integrating these with the comparison matrix or as a secondary adjustment layer. However, additional validation is needed to verify alignment with true influence. Refining the weighting process aims to boost accuracy and provide sensitivity analysis around criteria impacts. But human-in-the-loop guidance will remain important to balance data-driven insights with experiential context.
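To make the AHP consistency measure concrete, the sketch below computes a consistency ratio for a reciprocal pairwise matrix. It is an illustration under assumptions rather than part of the proposed system: the function names are hypothetical, the principal eigenvalue is approximated by plain power iteration, and the random index values are the ones commonly tabulated for Saaty's method. The consistency index is $CI = (\lambda_{\max} - n)/(n - 1)$ and the ratio is $CR = CI / RI$; a CR below roughly 0.1 is conventionally treated as acceptable.

```python
# Commonly tabulated random consistency index (RI) values by matrix size.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24,
                7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}


def principal_eigen(matrix, iterations=200):
    """Approximate the principal eigenvalue and priority vector by power iteration."""
    n = len(matrix)
    vec = [1.0 / n] * n
    eigenvalue = float(n)
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        # vec sums to 1, so the sum of (matrix * vec) estimates the eigenvalue.
        eigenvalue = sum(product)
        vec = [x / eigenvalue for x in product]
    return eigenvalue, vec


def consistency_ratio(matrix):
    """Return (CR, priorities) for a reciprocal pairwise comparison matrix."""
    n = len(matrix)
    if n not in RANDOM_INDEX:
        raise ValueError("this sketch supports matrices of size 3 to 10")
    lam, priorities = principal_eigen(matrix)
    ci = (lam - n) / (n - 1)  # consistency index
    return ci / RANDOM_INDEX[n], priorities


# Hypothetical 3 x 3 examples: the first is perfectly consistent (4 : 2 : 1),
# the second contains a circular preference and is strongly inconsistent.
consistent = [[1, 2, 4], [1 / 2, 1, 2], [1 / 4, 1 / 2, 1]]
inconsistent = [[1, 2, 1 / 3], [1 / 2, 1, 3], [3, 1 / 3, 1]]
cr_ok, priorities_ok = consistency_ratio(consistent)
cr_bad, _ = consistency_ratio(inconsistent)
print(f"consistent CR = {cr_ok:.3f}, inconsistent CR = {cr_bad:.3f}")
```

The same function could be applied to the 8 × 8 matrix of Table 1 to check whether the pairwise judgments are mutually consistent before the weights are used.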
9 Conclusion
In this paper, we were able to create criteria to predict an individual’s popularity using PHP, HTML/CSS, and Facebook and Twitter API clients. Before working with the API, we needed to determine what information to target to determine popularity. Once the criteria were established, we pulled the necessary data from the API by having the user log in with their Facebook credentials. If applicable, they could also log into their Twitter account. After logging in, the predetermined criteria were gathered from their profile, and we made requests to the selected social media platform’s API to gain access to this information. We referred to each social media platform’s developer documentation to learn about the different types of data we could access. Successful responses were returned as a JSON array, which we had to decode and parse to obtain the necessary information. Each element was stored in a session variable so that it could be accessed later for calculations. After gathering all necessary information, we inserted the data into a matrix called the Multi-Criteria Decision Analysis matrix, or MCDA matrix (see Tables 1 and 4), which is a tool used to make decisions and compare data. It helped us focus on what is important, logical, and consistent. The matrix assigned numerical values for the relative importance of each criterion based on a scale, allowing us to determine how each element should be weighted. Once the weights were calculated, we used them in an algorithm to calculate the user’s score. The session variables were then referenced by the algorithm, making them dynamic. All the session variables were used to compute the score. Each element in the algorithm was divided by the average and set equal to a variable. We found several statistics related to the Facebook and Twitter platforms that gave an average number per user for each variable we used. Each variable was divided by its average to produce a percentage, which was then multiplied by the weight we obtained from the normalized data in the MCDA matrix. The results were then multiplied by 100, and all the variables were added and set equal to a score variable. Once the score variable was calculated, we obtained an output, which could be displayed in the web tool interface.
References
1. Anuja, A., Shivam, B., Chandrashekhar, K., Reema, A., Yogesh, D.: Measuring social media influencer index- insights from Facebook, Twitter and Instagram. J. Retailing Consumer Serv. 49, 86–101 (2019)
2. Bilal, A.-S., Yan, C.K., Omar, A.-K., et al.: Time-aware domain-based social influence prediction. J. Big Data 7 (2020)
3. Salvatore, C., Alessandro, P., Reforgiato Recupero, D., Saia, R., Usai, G.: Popularity Prediction of Instagram Posts (2020)
4. Zohreh, Y.D., Nastaran, H., Saeed, R.: User response to e-WOM in social networks: how to predict a content influence in Twitter. Int. J. Internet Marketing Advertising 13 (2018)
5. Qi, C., Huawei, S., Jinhua, G., Bingzheng, W., Xueqi, C.: Popularity prediction on social platforms with coupled graph neural networks. In: Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM ’20), pp. 70–78. Association for Computing Machinery, New York (2020)
6. Guandan, C., Qingchao, K., Nan, X., Wenji, M.: NPP: a neural popularity prediction model for social media content. Neurocomput. 333, 221–230 (2019)
7. Abdullah, A., Danda, R.: Predict Individuals’ Behaviors from Their Social Media Accounts, Different Approaches: A Survey, 823–836 (2022)
8. Trupthi, M., Suresh, P., Narasimha, G.: Sentiment analysis on Twitter using streaming API. In: 2017 IEEE 7th International Advance Computing Conference (IACC), pp. 915–919 (2017)
9. Adarsh Samuel, R.J., Lenin, F.J., Sunil, R.Y.: Predictive Analytics on Facebook Data, pp. 93–96 (2017)
10. Lavanya, T., Miraclin Joyce Pamila, J.C., Veningston, K.: Online review analytics using word alignment model on Twitter data. In: 2016 3rd International Conference on Advanced Computing and Communication Systems (ICACCS). https://doi.org/10.1109/ICACCS.2016.7586388
11. Meghawat, M., Yadav, S., Mahata, D., Yin, Y., Shah, R.R., Zimmermann, R.: A Multimodal Approach to Predict Social Media Popularity (2018). https://doi.org/10.1109/MIPR.2018.00042
12. Niu, X., Li, L., Mei, T., Shen, J., Xu, K.: Predicting image popularity in an incomplete social media community by a weighted bi-partite graph. In: 2012 IEEE International Conference on Multimedia and Expo, Melbourne, VIC, Australia, pp. 735–740 (2012). https://doi.org/10.1109/ICME.2012.43
A Computationally Inexpensive Method for Anomaly Detection in Maritime Trajectories from AIS Dataset
Zahra Sadeghi and Stan Matwin
Faculty of Computer Science, Dalhousie University, Halifax, Canada
[email protected]
Abstract. Vessel behavior analysis can reveal valuable information for maritime situation awareness. Maritime anomaly detection deals with finding the suspicious activities of vessels in open water using the AIS dataset. In this paper, an inexpensive method is introduced for automatic anomaly detection by utilizing historical sequences of trajectories of three vessel types: tanker, cargo and tug. We propose to project sequential data to a visual space in order to analyze and uncover the incongruent and inconsistent regions through segmentation of saliency maps. In order to evaluate the results, we take a statistical measurement for assigning a degree of anomaly to each data point. The comparison of results indicates that this method achieves effective performance and efficient computational complexity in finding maritime anomalies in an unsupervised manner. In addition, this method is explainable and the results are visually interpretable.
Keywords: Anomaly detection · Outlier detection · Maritime anomalies · AIS dataset · Saliency map
1 Introduction
Automatic Identification System (AIS) is a rich source of big data for research in time series analysis [22]. This dataset contains dynamic, static and metadata information about vessels, including their navigation coordinates, speed, direction and vessel type. This information is transmitted by vessels to base stations as well as to other vessels. AIS information is used to track, monitor and manage vessels’ activities in real time. The historical AIS tracks encompass information about voyages of various vessel types and are collected by the coast guard. Analyzing this dataset is considered a situation assessment process for maintaining situation awareness in open waters [24]. Situation awareness is important in the decision making process [17] because it provides information about vessels’ movement behavior for a range of applications such as identifying high-traffic locations [27], hazardous vessel behavior [5], port arrival estimation [4], collision avoidance [20], etc.
In principle, AIS data analysis is a challenging task. This dataset is inherently unreliable, inaccurate and noisy. Both the AIS transmitters and receivers are noisy equipment. In addition, the entry of input data is done manually, which is error-prone. There are also typically large gaps between consecutive messages because the information is not updated in a timely manner. Furthermore, there is no public annotated AIS dataset with labeled information about anomalous data points. However, many machine learning methods rely heavily on labeled information. These methods are known as supervised learning. In order to alleviate the label requirement, other machine learning approaches have been developed, known as semi-supervised learning and unsupervised learning. Semi-supervised learning methods are less dependent on annotated datasets, but they still require a portion of labeled information to model the data. In contrast, unsupervised learning methods bypass the labeling requirement. These techniques focus on extracting patterns from data by employing statistical and similarity analysis.
In this work, we analyze AIS data to detect abnormal patterns and sudden changes in vessels’ movements in an unsupervised manner. To this end, we propose a novel idea for finding abnormal changes in sequence data by projecting them into a visual space and extracting salient points. Our saliency-based method can detect anomalous spikes in a coherent and interpretable manner and hence can be regarded as a change point detector. With this approach, we can segment time series data into subsequences with similar movement patterns. This paper is structured as follows. Section 2 introduces the problem and reviews related work. In Sect. 3, we present the proposed method. Our results are outlined in Sect. 4. Finally, we conclude in Sect. 5.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): FICC 2024, LNNS 921, pp. 304–317, 2024. https://doi.org/10.1007/978-3-031-54053-0_22
2 Background
Anomaly detection has been explored extensively in machine learning and artificial intelligence. Nevertheless, despite the considerable number of proposed methods, existing methods still transfer poorly to different situations and contexts. This problem is closely related to outlier detection in statistical analysis, and the two terms can be used interchangeably. Anomaly detection has a wide range of applications for identifying risks, faults and threats in domains such as healthcare, transportation, finance, and manufacturing. Identifying anomalous points is of significant value for security reasons. Maritime anomaly detection is critical for safe seafaring, secure maritime navigation and for protecting marine transportation against common threats and illegal activities such as the smuggling of drugs and weapons. As a result, a large body of research has been devoted to the automatic detection of anomalies in the maritime domain. In this paper, we are only concerned with extracting inconsistent behavior in ship movements.
There are three main machine learning approaches to anomaly detection. Supervised learning is the most widespread technique, and it relies on large labeled datasets. With this approach, anomaly detection can be treated as a binary classification task. This approach requires a balanced dataset, which
is usually not the case for outlier detection tasks, in which normal data is far more abundant than abnormal data. The next approach is semi-supervised or weakly supervised learning, which trains a machine learning model on normal training data and decides on the abnormality of test data based on how well they fit the trained model. Typically, a large error is a sign of irregularity in the data. The issue with this approach is that it is not optimized for anomaly detection and can result in many false positive or false negative identifications. This approach is also known as one-class classification. Finally, the last approach is fully unsupervised learning, which attempts to discover the dominant underlying structure or behavior of the data based on the similarity of patterns and by leveraging clustering methods.
In the maritime domain, identification of vessels’ abnormal behavior can be roughly categorized into three major classes: clustering-based [29], model-based and statistical approaches [15]. The first and third classes are usually carried out in an unsupervised manner, while the second class has mostly been studied in supervised or semi-supervised settings. A thorough survey of machine learning approaches for maritime anomaly detection can be found in [26] and [16].
Clustering-based approaches aim to group a stream of data into sub-series on account of their similarities and distances [11]. Typically, Dynamic Time Warping is employed as a time series distance measure [28]. The squared Mahalanobis distance is also utilized to detect outliers in [1]. From the clustering perspective, anomalies are the points or groups of points that cannot properly be assigned to any cluster. Moreover, small-sized clusters can also be considered anomaly groups [1].
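To illustrate the Dynamic Time Warping measure mentioned above, the following is a minimal sketch of the textbook recurrence. It is not the implementation used in [28]; the function name `dtw_distance` and the sample speed sequences are assumptions made for this example.

```python
def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping distance
    between two numeric sequences (illustrative sketch only)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical speed profiles: the same shape, shifted by one time step.
speed_a = [1.0, 2.0, 3.0, 3.0, 2.0]
speed_b = [1.0, 1.0, 2.0, 3.0, 2.0]
print(dtw_distance(speed_a, speed_b))
```

Because the warping path may repeat samples, two sequences that differ only by a local time shift can align at zero cost, which is why the measure is attractive for trajectories sampled at irregular intervals.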
Clustering algorithms such as DBSCAN [7], hierarchical clustering [7], self-organizing maps (SOM) [25], and the Gaussian Mixture Model (GMM) [10] are also applied to identify anomalies in waterways.
Model-based approaches, on the other hand, are centered around describing data with machine learning models. The general trend is to train a model on representative samples. The residual and prediction error of the model can then be employed to detect abnormal points. This approach can be further divided into two categories: shallow learning and deep learning based modeling. Widely studied modeling tools from the first category are autoregressive models such as ARMA [14] and ARIMA [18]. Deep sequential models such as LSTM [2] and autoencoders [12,21] are conventional examples under the umbrella of deep learning modeling.
Statistical approaches explore the statistical properties of data. Abnormal points, in this context, are those with a lower probability of occurrence. Histograms [1], sliding windows [8] and probability [9] are common approaches from this category.
One high-performance and efficient algorithm for anomaly and outlier detection is Isolation Forest (iForest). This algorithm can find anomalies in an unsupervised fashion by creating an ensemble of binary trees, called iTrees. The idea is to create multiple decision trees that randomly partition the data, and to obtain the average path length from each terminating node (i.e. isolated instance) to the root. The shorter the path, the higher the likelihood of being an anomalous point. One disadvantage of this method is that splits are randomly selected and hence it cannot
capture all the outliers accurately [19]. In addition, Isolation Forest is very sensitive to the threshold parameter. Another unsupervised anomaly and outlier detection method is the Local Outlier Factor (LOF) [3], which measures the outlierness (i.e. anomalousness) of each point based on the local density of its k nearest neighbors. Local density is approximated from the reachability distances to the k nearest neighbors, where the reachability distance between two points is the larger of their actual distance and the neighbor's k-distance. The problem with this approach is that it is not accurate and fails to identify many outliers [6].
In this study, we are concerned with finding anomalies in a time series of AIS messages. A major challenge of anomaly detection in AIS datasets is the lack of annotation for each message sent at a particular time, which makes it impossible to apply supervised learning approaches. To alleviate this problem, we take a statistical approach to creating a ground truth for evaluation. To this end, we define an anomaly likelihood for each time point. We then transfer the time series data into a visual space and search for the salient changes in the image of each time series. We apply image processing techniques to extract the changes in the image space and translate them back to the original space to localize the anomalous points.
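The LOF baseline discussed above can be sketched in a few lines. This is only an illustration of the standard definition (k-distance, reachability distance, local reachability density), not the code of [3] or of the authors. The function name `lof_scores`, the one-dimensional sample data and the assumption that all points are distinct are choices made for this example.

```python
def lof_scores(points, k):
    """Local Outlier Factor for distinct 1-D points (illustrative sketch).
    Scores near 1 indicate inliers; larger scores indicate outliers."""
    n = len(points)

    def dist(i, j):
        return abs(points[i] - points[j])

    # indices of the k nearest neighbours of each point (itself excluded)
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist(i, j))[:k]
        for i in range(n)
    ]
    # k-distance: distance to the k-th nearest neighbour
    k_dist = [dist(i, neighbours[i][-1]) for i in range(n)]

    def lrd(i):
        # reachability distance of i from o = max(k-distance(o), d(i, o))
        reach = [max(k_dist[o], dist(i, o)) for o in neighbours[i]]
        return len(reach) / sum(reach)  # assumes distinct points (sum > 0)

    lrds = [lrd(i) for i in range(n)]
    # LOF: mean density of the neighbours relative to the point's own density
    return [
        sum(lrds[o] for o in neighbours[i]) / (len(neighbours[i]) * lrds[i])
        for i in range(n)
    ]


# Hypothetical speed readings; the last value is far from the others.
readings = [0.0, 1.0, 2.0, 3.0, 10.0]
scores = lof_scores(readings, k=2)
print(scores)
```

With these readings the four clustered values receive a score of about 1, while the isolated value receives a clearly larger score. Turning scores into anomaly decisions still requires a threshold, which remains a design choice.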
3 Method
We propose a method for time series analysis with applications to outlier detection and segmentation. The gist of our method is to extract the salient parts of vessel trajectories by projecting them into a visual space. To this end, we borrow the idea of the Hankel matrix representation [23]. In order to create a Hankel matrix, we rearrange each trajectory Ti of length N into subseries of length L, where 2
= 17” feature. The same quality for the RCB Dataset in Fig. 9 is evident while detailed through the subsequent nodes in the decision tree. An RCB shows indications of vulnerability during the crisis or any disruptions (represented by “1” in the model) in the following cases:
• CAR is below 13.0% (node “Sol1 >= 13”).
• CAR is between 13.0% and 17.0%, with liquidity concerns (nodes “Liq3 < 38” and “Liq1 > 56”).
• CAR is greater than or equal to 17.0%, but with concerns about leverage (node “Lev1 < 2.1”), deployment of funds to the lending business (node “Liq3 < 116”), and quality of the loan portfolio (node “AQ1 >= 12”).
Fig. 9. RCB’s EWS decision tree model
4.5 Discussion
Based on the literature reviewed, the VAR model is commonly used for forecasting. Considering this model to accomplish the first objective, the researchers were able to determine significant variables and obtain an NPL ratio forecast for the next four quarters beginning the third quarter of 2022 at the financial system-wide (SW) level and at the industry level, namely UKBs, TBs and RCBs in the case of Philippine banks (see Table 4).

Table 4. VAR model summary

| Dataset | Significant Variables | Forecast Trend | Impression |
|---------|-----------------------|----------------|------------|
| SW | NPL itself, Grt1, Mac9, Mac10 | Decreasing | Favorable |
| UKB | NPL itself, AQ1, Sol1, Mac10 | Decreasing | Favorable |
| TB | NPL itself, Liq2, Prf1, Mac5 | Increasing | Unfavorable |
| RCB | NPL itself, Grt4, Mac9, Mac10 | Increasing | Unfavorable |

For the second objective, the researchers determined that most vulnerability indicators for Philippine banks are bank-specific variables, primarily the solvency measure Sol1 (Capital Adequacy Ratio), as this metric directly indicates the bank’s strength. Other significant variables identified in this study include Lev1 (Debt to Equity Ratio), AQ1 (NPL Ratio), AQ2 (NPL Coverage Ratio), and Prf1 (ROE). In creating the EWS predictive model, most of the datasets (at the industry level) led to using the Decision Tree Model, which had better performance values.
5 Conclusion
The proposed models and the significant variables provided in this study show the dynamics of NPLs during the crisis in the case of Philippine banks. The insights and findings may be considered by the monetary authorities, particularly the BSP, in policymaking aimed at reducing pressure or distress in the financial system through public reports and bank supervision/surveillance, which also pursue safe and sound banking for all players in the financial/banking sector. For the supervision of financial institutions, this paper provided tools for NPL ratio forecasting using the VAR model. It recommended the Decision Tree model as an EWS to assess the adequacy and effectiveness of an individual bank’s credit risk management system (CRMS) and, more generally, to proactively handle threats to the country’s financial stability. An individual bank may refer to this study for its CRMS, particularly in measuring credit risk, peering analysis, and stress testing. The models provided herein can be used to evaluate the borrower’s creditworthiness and may serve as forward-looking factors (FLFs). Before credit granting, a bank generally quantifies or rates a borrower’s credit risk considering the available past and current information through an internal credit risk rating system and/or credit scoring models. Credit risk measurement tools should be capable of considering FLFs to have a holistic credit evaluation of borrowers that do
not solely rely on lagged or historical client information. An individual bank may also conduct a peering analysis vis-à-vis another peer bank with a similar business model, the banking system, and the respective industry regarding its NPL ratio aiming to have a self-assessment on the adequacy and effectiveness of its CRMS. Furthermore, a bank may also refer to the processes in this study for stress testing. Moreover, the tools and insights proposed in this study require the application of sound judgment of the users, especially for the monetary authority in promoting financial stability and bank supervision and for the Philippine banks in observing prudence in their lending operations.
Classification of Academic Achievement in Upper-Middle Education in Veracruz, Mexico: A Computational Intelligence Approach
Yaimara Céspedes-González1(B), Alma Delia Otero Escobar1, Guillermo Molero-Castillo2, and Jerónimo Domingo Ricárdez Jiménez1
1 Universidad Veracruzana, Xalapa, Veracruz, Mexico
{ycespedes,aotero,jricardez}@uv.mx
2 Universidad Nacional Autónoma de México, Mexico City, Mexico
[email protected]
Abstract. Due to current technological development, data accumulates from various sources and is susceptible to analysis and interpretation. In the education field in particular, the analysis of data generated in educational environments, such as the academic achievement of students who reach the end of their studies at different educational levels, becomes relevant. This research presents the academic achievement of upper-middle education students in the state of Veracruz, Mexico, based on the type of school they belong to (State, Federal or Private). For this, the PLANEA-2017 database (National Plan for the Evaluation of Learning) was used, together with computational intelligence algorithms such as random forests and support vector machines. The main result was that the model produced by the random forests achieved better classification accuracy, and the most relevant variables for classifying academic achievement by the type of school evaluated were the total number of students enrolled; the level of academic performance achieved by students in Language and Communication, highlighting the excellent (IV) and insufficient (I) levels, respectively; and the number of students tested in Mathematics.
Keywords: Academic achievement · Random forests · Support vector machines · Upper-middle education
1 Introduction
Given current technological development, data is generated directly in digital format [1]; therefore, it is produced and stored massively and cheaply [2]. This gives rise to a tsunami of data, coming from various sources, which is susceptible to interpretation and whose analysis can have a profound impact on people, organizations, and society in general [3].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 368–382, 2024. https://doi.org/10.1007/978-3-031-54053-0_26
The term big data is used precisely to refer to this explosion of available information [3], which gives rise to the inherent difficulty of handling these large amounts of data;
therefore, challenges arise related to its storage, analysis, and visualization [4, 5]. The computational techniques implemented to perform these analyses must be capable of finding unsuspected relationships and summarizing the data in a novel and understandable way [6], as well as identifying rules, groups, underlying patterns, and correlations that exist in the data, which, once visible, can be understood and used.
Computational intelligence (CI) brings together a wide variety of methods and algorithms that are applied to mimic the power of human reasoning to deal with complex real-world problems [7, 8]. The momentum that computational intelligence has gained in different fields of human activity is evident, especially in cases where the traditional approaches of artificial intelligence and machine learning are ineffective due to [9] high complexity, uncertainty, and the nature of the data. These fields include health, safety, education, and biology, among others.
In the education field, it has become relevant to analyze the information generated in educational environments, such as the academic achievement of students who reach the end of their studies at different educational levels. Academic achievement is the extent to which a student has achieved short- and long-term educational goals. Academic performance is commonly measured through tests or continuous assessments [10].
In Mexico, the National Institute for the Evaluation of Education (INEE, for its acronym in Spanish) has developed the National Plan for the Evaluation of Learning (PLANEA, for its acronym in Spanish) to determine how well Mexican students master key learning [11]. PLANEA measures and evaluates learning achievement in basic and upper-middle education. The exam is applied to a sample of students in all schools in the country.
The exam focuses on two fields of training, which are considered critical for the academic development of students [11]: Language and Communication, and Mathematics. PLANEA has been applied since 2015, and the results obtained in three consecutive years (2015, 2016 and 2017) are currently available [12].
The objective of this research was to classify the academic achievement of upper-middle school students in the state of Veracruz, Mexico, based on the type of sustenance to which the evaluated educational centers belong (State, Federal, or Private). The most recent available PLANEA database (PLANEA-UM-2017) [13] was used to develop the research; it stores the academic achievement results of upper-middle level students for the year 2017. Computational intelligence algorithms were also used.
The PLANEA-UM-2017 data is a useful instrument for measuring the quality of education at the upper-middle level in Veracruz. Analyzing it from a computational intelligence approach yields relevant information for developing evidence-based state public policies, as well as for planning the educational system. More generally, it provides feedback to the entire school community (students, teachers, administrators, parents, and society) regarding the learning that students achieve, and helps recognize the level of challenge that each school must face.
This paper is arranged as follows: Sect. 2 presents the background, computational intelligence, random forests, support vector machines, and related works; Sect. 3 outlines the established method as a proposed solution; Sect. 4 presents the results obtained, based
on the case study, such as the classification of academic achievement of upper-middle school students in the state of Veracruz; and Sect. 5 summarizes the primary findings and potential for future research.
2 Background
Artificial intelligence faces important and growing challenges, such as new ways of searching, optimizing, and solving problems in different fields of knowledge. The path from the traditional to the modern has allowed the emergence of better computational tools such as computational intelligence [9].
Through computational intelligence it is possible to build models, reasoning, machines, and processes based on structured and intelligent behaviors [9]. This type of intelligence adopts methods that tolerate incomplete, imprecise, and uncertain knowledge in complex environments. In this way, they allow approximate, flexible, robust, and efficient solutions [14].
However, problems are difficult to solve because the processes can be too complex for mathematical reasoning, contain uncertainty, or simply because of the nature of the data [15]. Therefore, computational intelligence can be implemented to address problems that affect today's society [16]. The advantage of computational intelligence is that it uses methods that are close to the way human beings reason, i.e., it uses inaccurate and incomplete knowledge, and it can produce control actions adaptively [15].
2.1 Random Forest
Among the algorithms most used to build computational intelligence models, which base their operation on discovering patterns from examples, are decision trees. They can solve forecasting and classification problems by building a hierarchical, efficient, and scalable structure based on the conditions (variables) established in the data. The divide and conquer strategy is used for this purpose.
A tree is represented graphically by a set of nodes, leaves, and branches. The main or root node is the variable from which the classification or forecast process begins. Internal nodes correspond to each of the conditions (variables) associated with a given problem, while each possible response to the conditions is represented by a child node.
The branches leaving each of these nodes are tagged with the possible attribute values. The final nodes correspond to a decision, which coincides with some class (label) of the variable to be classified or a forecast of a continuous value [17].
It is important to mention that decision trees are sometimes susceptible to overfitting, which means that they tend to learn very well from the training data, but their generalization may not be as good. One way to improve the generalizability of decision trees is to combine multiple trees, known as Random Forests (RFs).
Random forests are widely used today. Their goal is to build (assemble) a set of decision trees that each see different slices of the data. No tree uses all the training data; rather, each is trained on a different sample for the same problem. By combining the results, the errors compensate for each other, and it is possible to obtain a prediction (forecast or classification) that generalizes better. Figure 1 shows a general scheme of how random forests work, which consists of four steps:
Fig. 1. General scheme of a random forest
1) Random samples are selected from the data set.
2) A decision tree is built for each sample and a result is obtained.
3) A vote (classification) or average (forecast) is done based on the results.
4) The result with the most votes (classification) or final average (forecast) is selected.
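Steps 3 and 4 can be sketched as follows, assuming the individual trees have already produced their outputs. The helper names (`majority_vote`, `soft_vote`, `average`) and the per-tree outputs are hypothetical and chosen only for illustration; they are not taken from the PLANEA experiments.

```python
from collections import Counter


def majority_vote(labels):
    """Classification, hard voting: the label predicted by most trees."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Classification, soft voting: average the class probabilities
    over all trees and return the class with the highest mean."""
    classes = probabilities[0].keys()
    mean = {c: sum(p[c] for p in probabilities) / len(probabilities)
            for c in classes}
    return max(mean, key=mean.get)


def average(values):
    """Forecast (regression): arithmetic mean of the tree outputs."""
    return sum(values) / len(values)


# Hypothetical outputs of three trees for one school.
tree_labels = ["State", "Private", "State"]
tree_probs = [
    {"State": 0.60, "Private": 0.40},
    {"State": 0.10, "Private": 0.90},
    {"State": 0.55, "Private": 0.45},
]

print(majority_vote(tree_labels))   # label chosen by most trees
print(soft_vote(tree_probs))        # class with highest mean probability
print(average([2.0, 4.0, 6.0]))     # mean of three numeric forecasts
```

In this toy case the two voting schemes disagree: two of three trees predict "State", but one tree is very confident about "Private", so the averaged probabilities favor "Private". This is the practical difference between counting votes and weighting them by confidence.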
For classification problems, the results of the decision trees are combined using different strategies. One of the most common is soft voting, in which the class probabilities predicted by the trees are averaged and the class with the highest average is selected, whereas for regression problems the usual way to combine the results of the decision trees is to take the average (arithmetic mean).
2.2 Support Vector Machines
Support Vector Machines (SVM) are a computational intelligence algorithm used for classification and forecasting problems [18]. The objective of the model is to find an optimal separating hyperplane between two classes, maximizing the margin between them [19]. The algorithm uses support vectors to find the hyperplane that best solves the separation problem [20]; these are the elements of each group closest to the hyperplane, and the margin boundaries that pass through them run parallel to it. Figure 2 shows an example of a separable problem using SVM in a two-dimensional space.
SVM models are divided into two types: linear and non-linear [19]. In linear models, the classes can be divided by a line or hyperplane, while in non-linear models the data is transformed to a higher-dimensional space for its division [21]. This transformation is done through functions known as kernels. Finding the correct transformation for a data set is not an easy task; hence, different kernels are used in an SVM implementation. Some of the most used functions are linear, polynomial, RBF (radial basis function), and sigmoid, among others. The procedure used by SVMs to establish an optimal separation between classes is as follows:
Fig. 2. Hyperplane, support vectors and margin of an SVM in a two-dimensional space
1) The points of both classes closest to the line (border) are identified. These points are called support vectors.
2) The distance between the hyperplanes and the support vectors is calculated. This distance is called the margin (M).
3) The goal is to maximize the margin.
4) The hyperplane whose margin is greatest is considered the optimal hyperplane.
Computationally, only the small fraction of training points that act as support vectors (SV) is needed, since these are used to decide on which side of the separator each test case will be classified. The further away from the hyperplane the support vectors are, the higher the probability of correctly classifying new cases in their respective classes. Interest in applying this algorithm to real problems has therefore not stopped growing, since it is used both for classification and for forecasting (regression).
2.3 Related Works
Several studies have been carried out based on the data from the PLANEA evaluations. In this regard, [22] used the data from PLANEA 2015, focusing on a sample of 137166 middle school students who took the mathematics test. As part of the research, the dimensionality of the data was reduced from 232 to 18 variables, and an interactive visualization system was developed that allowed observing association patterns and rules when combining variables and federal entities. As a result, some representative variables that impact academic achievement were identified, such as academic aspiration, family resources, mother's studies and father's studies.
On the other hand, as preliminary thesis work, [23] analyzed the academic performance, in Language and Communication and Mathematics, of upper-middle education students in Mexico using a partitional clustering algorithm. For this purpose, data from PLANEA 2015 was used, and a user-centered data mining process was followed.
Subsequently, as an extension of the previous work in [24], a user-centered interactive system was developed that allowed generating the view of mineable data with 16 variables and,
14539 records. The algorithm applied was K-means, and the elbow method was also used to analyze the suitable number of groups. As a result, a variety of academic achievements was observed, with Insufficient and Elementary standing out in the evaluated population, while Good and Excellent were achieved by a reduced number of schools.
In [25] the K-means algorithm was used to analyze the results of the PLANEA 2017 test in Language and Communication for upper-middle schools in the 32 states of the Mexican Republic. The algorithm produced three groups, each corresponding to a level of academic achievement: K1 for Good, K2 for Elementary and K3 for Insufficient. In addition, the following variables were analyzed: federal entity, shift, subsystem, sustenance, and level of marginalization of the schools evaluated. These attributes were found to have an important impact on the academic achievement of the students.
In [26] the academic performance (Language and Communication, and Mathematics) of students from autonomous, public, and private upper-middle schools in Mexico was analyzed through descriptive analysis and partitional clustering. The data analyzed were the PLANEA 2017 records. Among the academic achievements, Insufficient and Elementary stood out in the evaluated population, while a smaller number reached acceptable achievements, that is, Good and Excellent. Table 1 shows a summary of the related works.

Table 1. Summary of related works

| Author | Description | Method used |
|--------|-------------|-------------|
| Heredia et al. (2020) [22] | ANCONE system was developed, revealing association patterns and rules. Variables that impact academic achievement were identified. Data used: PLANEA 2015 | The correlation-based feature selection method, J48 and Naive Bayes. The Apriori algorithm, to get association rules |
| Maldonado et al. (2016) [23] | Academic performance was analyzed, highlighting Insufficient and Elementary, while Satisfactory and Excellent were reached in few schools. Data used: PLANEA 2015 | K-means and Elbow method |
| Molero-Castillo et al. (2018) [24] | An interactive system was developed to analyze academic performance, highlighting Insufficient and Elementary. Data used: PLANEA 2015 | K-means and Elbow method |
| Gutiérrez et al. (2020) [25] | Academic variables were analyzed, and it was discovered that these have a major impact on academic achievement. Data used: PLANEA 2017 | K-means |
| Molero-Castillo et al. (2019) [26] | Academic performance was analyzed, highlighting Insufficient and Elementary. Data used: PLANEA 2017 | Partitional clustering |

Based on the above, several of the identified studies used K-means, an unsupervised learning algorithm that groups objects (data vectors) into k groups based on their characteristics. However, other computational learning methods, such as classification using random forests and support vector machines, were not explored; these stand out for their ability to separate classes (labels in the data) robustly, based on supervised learning. In addition, it is necessary to include variables that are relevant to understanding student academic performance and that illustrate the real situation of the educational system.
3 Method
The solution method for classifying academic performance in upper-middle education in Veracruz, Mexico, followed a computational intelligence approach. For this, data obtained through the Ministry of Public Education of Mexico was used. The method was divided into four stages: i) acquisition of the data source, ii) variable selection, iii) classification and iv) validation.
3.1 Data Source
The data source analyzed corresponds to records from the database of the National Plan for the Evaluation of Learning (PLANEA) for upper-middle education, of the Ministry of Public Education. The data source was accessed through the PLANEA institutional page. It provides information on the academic performance of schools and their students in the 2017 period, specifically for the state of Veracruz, Mexico.
In the evaluation carried out in 2017, a total of 1640 upper-middle schools in the state of Veracruz were examined, including state, federal and private schools. The study focused on the upper-middle level because it is considered a critical point in a student's school trajectory: students must possess certain competencies, characteristic of the upper-middle school graduate profile, so that they can continue on to university studies.
PLANEA aims to inform society about the current state of upper-middle education, in terms of student learning achievement, in two areas of competence: Language and Communication (Reading Comprehension) and Mathematics. The evaluation objectives are [11]:
• To know to what extent students achieved mastery of a set of essential learning at the end of compulsory education.
• To offer information for improving teaching processes in schools.
• To inform society about the state of education, in terms of student learning achievement.
• To provide information to educational authorities to monitor, plan, program and operate the educational system and its schools.
It is important to note that these academic achievement assessments are not designed to assess the educational quality of the schools or teacher performance; rather, they are results obtained by the evaluated students at different levels of aggregation, i.e., individual, school, entity, and country. The purpose is to use the information to contribute to the improvement of educational quality.
3.2 Variables Selection
The original data source contains 52 variables, which provide information about the schools that participated in the PLANEA evaluation in upper-middle education in Veracruz, Mexico. However, not all variables provide significant information for classifying the academic achievement in Language and Communication and in Mathematics of the students evaluated, based on the sustenance type of the educational center, i.e., state, federal or private. Therefore, an exploratory data analysis (EDA) was carried out to select carefully the variables used in classification. From this analysis, and from a selection of the variables that are significant from the point of view of academic achievement by levels, a data source composed of 16 variables was obtained, which are listed in Table 2.
The other variables were omitted since they present redundant and insignificant information, such as entity code, name of the federal entity, municipality, school code, extension, marginalization level, percentage of students evaluated in language and communication, percentage of students evaluated in mathematics, similar schools for comparison, percentage of students from similar schools at each academic achievement level in language and communication, percentage of students from similar schools at each academic achievement level in mathematics, among others. All the selected variables contain relevant information about the evaluated school; the shift in which the school carries out its activities; the locality to which the school belongs; the type of sustenance to which the school belongs; the total number of students enrolled in the last grade of upper-middle education; the total number of students scheduled for the evaluation; the total number of students evaluated in Language and Communication; the total number of students evaluated in Mathematics; and achievement levels reached in Language and Communication, and Mathematics.
Table 2. Selected variables
Item | Name | Description
1 | SCHOOL | Name of the school that was evaluated
2 | SHIFT | Corresponds to the shift in which the school carries out its activities
3 | LOCATION | Indicates the name of the locality to which the school belongs
4 | SUSTENANCE | Indicates the type of sustenance to which the school belongs (State, Federal or Private)
5 | ENROLLED | Corresponds to the total number of students enrolled in the last grade of upper-middle education
6 | SCHEDULED | Corresponds to the total number of students scheduled for the evaluation
7 | EVALUATED-LC | Indicates the total number of students evaluated in Language and Communication
8 | EVALUATED-MATH | Indicates the total number of students evaluated in Mathematics
9 | LEVEL I-LC | Number of students achieving level I (Insufficient) in Language and Communication
10 | LEVEL II-LC | Number of students achieving level II (Elementary) in Language and Communication
11 | LEVEL III-LC | Number of students with an achievement level III (Satisfactory) in Language and Communication
12 | LEVEL IV-LC | Number of students with an achievement level IV (Outstanding) in Language and Communication
13 | LEVEL I-MATH | Number of students achieving level I (Insufficient) in Mathematics
14 | LEVEL II-MATH | Number of students achieving level II (Elementary) in Mathematics
15 | LEVEL III-MATH | Number of students with an achievement level III (Satisfactory) in Mathematics
16 | LEVEL IV-MATH | Number of students with an achievement level IV (Outstanding) in Mathematics
3.3 Classification
The academic achievement of upper-middle school students in Veracruz was classified based on the class variable (SUSTENANCE), which identifies the type of support to which the school belongs, i.e., state, federal or private. Once the preparation and selection of variables was completed, a structure composed of 13 independent variables and a class variable (SUSTENANCE), described in Table 2, was established as the input matrix for the algorithms. In addition, two other variables, SCHOOL and LOCATION, which indicate the name of the educational center and the town to which it belongs, were kept as a reference. Subsequently, for the classification and validation process, the input matrix was split into training and test data vectors, as shown in the Python code segment (see Fig. 3).
Fig. 3. Separation of training and test data vectors
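The code in Fig. 3 is not reproduced in this text, so the following is only a minimal sketch of a 70/30 hold-out split, written with the Python standard library. The function name `split_train_test`, the seed, and the stand-in records are assumptions made for illustration and are not the authors' code. With the 1640 schools mentioned in Sect. 3.1, a 30% hold-out gives the 492 test cases used in Sect. 3.4.

```python
import random


def split_train_test(records, test_percent=30, seed=0):
    """Shuffle the records and hold out `test_percent` % of them for testing."""
    shuffled = list(records)               # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_test = len(shuffled) * test_percent // 100
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 1640 school records; real rows would hold the 13
# independent variables plus the SUSTENANCE class label (hypothetical data).
schools = list(range(1640))
train, test = split_train_test(schools)
print(len(train), len(test))  # 1148 492
```

Fixing the seed keeps the split reproducible, which matters when two classifiers are compared on the same test cases.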
Regarding the application of the algorithms, in the case of random forests, the hyperparameters were adjusted to avoid overfitting of the estimators. The final configuration (see Fig. 4) was: 150 estimators; a maximum depth of 14 levels; a minimum of 4 elements to split a node; and a minimum of 2 elements in a terminal node. In the case of support vector machines, the kernel with the best class separation was the linear one (see Fig. 5).
Fig. 4. Configuration of the random forest hyperparameters
Fig. 5. Configuration of the linear kernel in support vector machines
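Figures 4 and 5 are not reproduced in this text. As a hedged sketch, the reported settings can be written down with scikit-learn keyword names. The paper does not state which library was used, so the choice of scikit-learn, the dictionary names, and the helper `build_models` are assumptions made here for illustration.

```python
# Hyperparameters reported in the text, expressed with scikit-learn keyword
# names. Assumption: scikit-learn is chosen only because its parameters map
# directly onto the reported settings; the paper does not name its library.
RF_PARAMS = {
    "n_estimators": 150,      # 150 estimators (trees)
    "max_depth": 14,          # maximum depth of 14 levels
    "min_samples_split": 4,   # minimum number of elements needed to split a node
    "min_samples_leaf": 2,    # minimum number of elements in a terminal node
}
SVM_PARAMS = {"kernel": "linear"}  # kernel reported as giving the best separation


def build_models():
    """Instantiate both classifiers; requires scikit-learn to be installed."""
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC

    return RandomForestClassifier(**RF_PARAMS), SVC(**SVM_PARAMS)
```

Limiting tree depth and requiring a minimum number of elements per split and per leaf are the usual ways to keep individual trees from memorizing the training data, which matches the stated goal of avoiding overfitting.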
With this configuration and the defined training and test data vectors, both algorithms were applied: a) random forests, and b) support vector machines.
3.4 Validation
For the validation, 492 test cases were used (30% of the records, new cases that were not used in the training process). Of these, 405 were correctly classified by the random forests, as shown in Fig. 6(a), while 388 cases were correctly classified by the support vector machines, as shown in Fig. 6(b). The classification matrix summarizes the performance of the algorithms; that is, it allows measuring the accuracy, precision, specificity, and sensitivity of the models obtained for the multiple classes (State, Federal, and Private).
Fig. 6. Classification matrices obtained for random forests (a) and support vector machines (b)
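As a rough illustration of how these measures follow from a classification matrix, the sketch below computes accuracy from the counts reported above and one-vs-rest precision, sensitivity and specificity from a 3x3 matrix. The matrix `toy_cm` is hypothetical (the matrices of Fig. 6 are not reproduced in this text), and the function names are assumptions. The reported counts imply an error of about 17.68% for the random forests, close to the 17.6% quoted in the text.

```python
def accuracy(correct, total):
    """Fraction of test cases that were classified correctly."""
    return correct / total


def per_class_metrics(cm, k):
    """One-vs-rest precision, sensitivity and specificity for class k.

    cm[i][j] is the number of cases of true class i predicted as class j.
    """
    n = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # class k cases predicted as another class
    fp = sum(row[k] for row in cm) - tp   # other classes predicted as class k
    tn = n - tp - fn - fp
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts reported in the text: 492 test cases, 405 (random forests) and
# 388 (support vector machines) correctly classified.
print(f"{accuracy(405, 492):.2%}")  # 82.32%
print(f"{accuracy(388, 492):.2%}")  # 78.86%

# Hypothetical matrix (rows/columns: State, Federal, Private), for illustration only.
toy_cm = [[8, 1, 1],
          [2, 6, 2],
          [0, 3, 7]]
print(per_class_metrics(toy_cm, 0))
```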
4 Results
As a result of the application and validation of the algorithms, it was observed that the model obtained by the random forests achieves a better classification accuracy, 82.32%, than the model obtained by the linear support vector machines, with 78.86%. In addition, its average error was 17.6%, which indicates that the random forest solution was better. Random forests also significantly reduce weaknesses of a single decision tree, such as overfitting. On the other hand, the ROC (receiver operating characteristic) curve, a performance measure for classification problems, is a probability curve, and the AUC (area under the curve) represents the degree of separability of the classes. Figure 7 shows that the model has an 80.5% chance of distinguishing class 1 (State), a 94.95% chance of distinguishing class 2 (Federal), and an 83.09% chance of distinguishing class 3 (Private). This means that the federal educational centers are the institutions with the best classification, which reflects a great classification capacity of the model obtained, followed by private institutions and, finally, state institutions.
Fig. 7. Performance curve of the multiple classifications on academic achievement in upper-middle education in Veracruz
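To make the interpretation of the per-class AUC concrete, the sketch below computes a one-vs-rest AUC directly from its probabilistic definition: the chance that a randomly chosen case of the class receives a higher score than a randomly chosen case of any other class. The scores and labels are hypothetical, and the function name `auc_one_vs_rest` is an assumption; this is not the authors' evaluation code.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC of one class against all others.

    Equals the probability that a randomly chosen positive case gets a higher
    score than a randomly chosen negative case (ties count as one half).
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical model scores for the 'Federal' class on six test schools.
labels = ["State", "Federal", "Private", "Federal", "State", "Private"]
federal_scores = [0.10, 0.80, 0.30, 0.60, 0.70, 0.20]
print(auc_one_vs_rest(federal_scores, labels, "Federal"))  # 0.875
```

Under this reading, the 94.95% reported for the Federal class means that a federal school is ranked above a non-federal one in roughly 95 out of 100 such pairs.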
Regarding validation, it was observed that private institutions (class 3) presented the greatest misclassification, with 55 cases in the random forests and 82 in the support vector machines. State institutions, in contrast, were correctly classified by both algorithms in 359 and 370 cases, respectively. On the other hand, based on the best model obtained (random forests), the variables with the greatest information gain were identified, which shows the importance of each variable in the construction of the model. Table 3 shows that the most relevant variable for classifying academic achievement by the sustenance of the evaluated schools was ENROLLED (13.07%), which refers to the total number of students enrolled in the last grade of upper-middle education in Veracruz. Other important variables were LEVEL IV-LC (11.82%) and LEVEL I-LC (11.13%), which record the number of students reaching the outstanding (IV) and insufficient (I) achievement levels in Language and Communication, respectively. The variable EVALUATED-MATH (10.11%), which describes the number of students evaluated in Mathematics, was also important. Variables with lower importance include the number of students evaluated in Language and Communication (EVALUATED-LC), elementary academic achievement in Language and Communication (LEVEL II-LC), insufficient academic achievement in Mathematics (LEVEL I-MATH), and the number of students scheduled for the assessment (SCHEDULED), among others.
Table 3. Importance of variables based on information gain
Item | Variable | Importance
1 | ENROLLED | 13.07%
2 | LEVEL IV-LC | 11.82%
3 | LEVEL I-LC | 11.13%
4 | EVALUATED-MATH | 10.11%
5 | EVALUATED-LC | 8.76%
6 | LEVEL II-LC | 8.55%
7 | LEVEL I-MATH | 7.49%
8 | SCHEDULED | 6.56%
9 | LEVEL III-LC | 6.23%
10 | LEVEL II-MATH | 6.01%
11 | LEVEL III-MATH | 4.10%
12 | LEVEL IV-MATH | 3.31%
13 | SHIFT | 2.87%
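As a quick sanity check on Table 3, the sketch below stores the reported importances in a dictionary, confirms that they add up to roughly 100% (100.01% after rounding), and groups them by subject area. The grouping by the suffixes -LC and -MATH is an assumption made here for illustration and is not an analysis performed by the authors.

```python
# Importances (in %) as reported in Table 3.
importance = {
    "ENROLLED": 13.07, "LEVEL IV-LC": 11.82, "LEVEL I-LC": 11.13,
    "EVALUATED-MATH": 10.11, "EVALUATED-LC": 8.76, "LEVEL II-LC": 8.55,
    "LEVEL I-MATH": 7.49, "SCHEDULED": 6.56, "LEVEL III-LC": 6.23,
    "LEVEL II-MATH": 6.01, "LEVEL III-MATH": 4.10, "LEVEL IV-MATH": 3.31,
    "SHIFT": 2.87,
}

total = sum(importance.values())

# Illustrative grouping by variable-name suffix (assumption, not in the paper).
lc_share = sum(v for k, v in importance.items() if k.endswith("-LC"))
math_share = sum(v for k, v in importance.items() if k.endswith("-MATH"))
other_share = total - lc_share - math_share

print(f"total={total:.2f}  LC={lc_share:.2f}  MATH={math_share:.2f}  other={other_share:.2f}")
```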
5 Conclusions
Education is one of the pillars of the social life and economic development of a region and a country. The students who are currently enrolled in compulsory education, such as the upper-middle level, will in the immediate future become the workforce and economic force of the federal entities, as in the case of Veracruz. Therefore, academic performance is an important measure of the quality of education in educational systems. Satisfactory academic achievement, in turn, is attained through good education systems, which play a decisive role in improving educational quality as an instrument of transformation and social mobility, in this case for young people in upper-middle education who aspire to pursue a professional career.
Undoubtedly, it is of interest to know to what extent students achieve essential learning in different domains at the end of each educational cycle. The objective is to make a diagnosis of the achievement and knowledge reached by the students. This can support decision-making in the educational field to improve the quality of academic performance. In the case of upper-middle education in Veracruz, there are students who, at the end of their studies, have not obtained the knowledge necessary to pass the entrance exams of the universities of the region, with the consequence that they delay or suspend their university studies. Faced with this situation, diagnosing the knowledge acquired by school-age students is important, because these analyses make it possible to articulate accurate strategies to improve the academic level so that students can continue on to a university career.
In addition, the average error of 17.6% suggests that this approach can serve as a useful tool for classifying the academic achievement reached by students at the upper-middle level, both in Language and Communication and in Mathematics, based on the type of sustenance of the evaluated institutions (state, federal, or private). As future work, to enrich the results obtained, a new analysis is intended with updated information and other algorithms used in computational intelligence, such as deep neural networks (DNN). This may be of great interest due to the heterogeneous behavior of the academic achievement attained by the various state, federal and private institutions established in Veracruz. In addition, it is expected that the results of the PLANEA evaluation carried out in 2022 will be made available to society as soon as possible. This is important information because the COVID-19 pandemic has had a big impact on education in the last few years (2020, 2021, and 2022).
References
1. Agrawal, D., Bernstein, P., Bertino, E., Davidson, S., Dayal, U., Franklin, M., et al.: Challenges and Opportunities with Big Data. A white paper prepared for the Computing Community Consortium committee of the Computing Research Association (2012). http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf. Accessed 15 Mar 2023
2. Fan, J., Han, F., Liu, H.: Challenges of big data analysis. Natl. Sci. Rev. 1(2), 293–314 (2014)
3. van der Aalst, W.M.: Process Mining: Data Science in Action. Springer (2016). https://doi.org/10.1007/978-3-662-49851-4
4. Iqbal, R., Doctor, F., More, B., Mahmud, S., Yousuf, U.: Big data analytics: computational intelligence techniques and application areas. Technol. Forecast. Soc. Change 153, 119253 (2020). https://doi.org/10.1016/j.techfore.2018.03.024
5. Sagiroglu, S., Sinanc, D.: Big data: a review. In: International Conference on Collaboration Technologies and Systems (CTS), pp. 42–47 (2013)
6. Hand, D., Mannila, H., Smyth, P.: Principles of Data Mining. MIT Press (2001)
7. Bezdek, J.C.: What is computational intelligence? United States (1994)
8. Kumar, G., Jain, S., Singh, U.: Stock market forecasting using computational intelligence: a survey. Archives Comput. Methods Eng. 28(3), 1069–1101 (2021)
9. Raj, J.S.: A comprehensive survey on the computational intelligence techniques and its applications. J. ISMAC 1(3), 147–159 (2019)
10. Ward, A., Stoker, H.W., Murray-Ward, M.: Achievement and ability tests - definition of the domain. Educ. Measur. 2, 2–5 (1996)
11. INEE: Plan Nacional para la Evaluación de los Aprendizajes. https://www.inee.edu.mx/evaluaciones/planea/. Accessed 20 Apr 2023
12. Ministry of Public Education: Plan Nacional para la Evaluación de los Aprendizajes. http://planea.sep.gob.mx/. Accessed 21 Feb 2023
13. Plan Nacional para la Evaluación de los Aprendizajes: Bases de datos PLANEA 2017. http://planea.sep.gob.mx/ms/base_de_datos_2017/. Accessed 1 Mar 2023
14. Kruse, R., Borgelt, C., Braune, C., Mostaghim, S., Steinbrecher, M.: Introduction to Computational Intelligence. Springer (2017)
15. Siddique, N., Adeli, H.: Computational Intelligence: Synergies of Fuzzy Logic, Neural Networks and Evolutionary Computing. Wiley (2013). https://doi.org/10.1002/9781118534823
16. Fulcher, J.: Computational intelligence: an introduction. Comput. Intel. Compendium 115, 3–78 (2008)
17. Barrientos, R.E., et al.: Árboles de decisión como herramienta en el diagnóstico médico. Revista médica de la Universidad Veracruzana 9(2), 19–24 (2009)
18. Noble, W.S.: What is a support vector machine? Nat. Biotechnol. 24, 1565–1567 (2006)
19. Birzhandi, P., Kim, K.T., Lee, B., Youn, H.Y.: Reduction of training data using parallel hyperplane for support vector machine. Appl. Artif. Intell. 33(6), 497–516 (2019)
20. Kurani, A., Doshi, P., Vakharia, A., Shah, M.: A comprehensive comparative study of artificial neural network (ANN) and support vector machines (SVM) on stock forecasting. Annals Data Sci. 10(1), 183–208 (2023)
21. Scholkopf, B., et al.: Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans. Signal Process. 45(11), 2758–2765 (1997)
22. Heredia, A., Chi, A., Guzmán, A., Martínez, G.: ANCONE: an interactive system for mining and visualization of students’ information in the context of PLANEA 2015. Computación y Sistemas 24(1), 151–176 (2020)
23. Maldonado, G., Molero-Castillo, G., Rojano-Cáceres, J., Velázquez-Mena, A.: Análisis del logro académico de estudiantes en el nivel medio superior a través de minería de datos centrada en el usuario. Res. Comput. Sci. 125(1), 121–133 (2016)
24. Molero-Castillo, G., Maldonado-Hernández, G., Mezura-Godoy, C., Benítez-Guerrero, E.: Interactive system for the analysis of academic achievement at the upper-middle education in Mexico. Computación y Sistemas 22(1), 223–233 (2018)
25. Gutiérrez, I., Gutiérrez, D., Juan, J.E., Rodríguez, L., Rico, R., Sánchez, M.: Aplicación del algoritmo K-means para el análisis de resultados de la prueba PLANEA 2017. Res. Comput. Sci. 149(8), 407–419 (2020)
26. Molero-Castillo, G., Bárcenas, E., Velázquez-Mena, A., Céspedes-González, Y.: Analysis of academic achievement in higher-middle education in Mexico through data clustering methods. Educ. Syst. Around World. IntechOpen (2019)
Accelerating the Distribution of Financial Products Through Classification and Regression Techniques
A Case Study in the Wealth Management Industry
Edouard A. Ribes(B)
Mines Paristech, Paris, France
[email protected]
Abstract. Financial products mostly consist of instruments used by households to prepare for retirement and/or to transfer wealth across generations. However, their usage remains low. Current technical solutions aimed at boosting financial products’ consumption mostly revolve around robo-advisors automating portfolio management tasks. Nonetheless, there is no tool to automate the distribution of those products, a gap this study aims to bridge. To do so, a private data-set from a French Fintech is leveraged. It describes at a macro level the structure of 1500+ households and their wealth. This information is fed to standard classification algorithms and regression techniques to predict whether or not households are likely to subscribe to a life insurance, a retirement plan or a real estate program over the forthcoming year, and to forecast the associated level of investment. Calibrations show that households’ subscription behavior over the next 12 months towards core investment products can be predicted with a high level of performance (A.U.C > 90%). The information was, however, insufficient to predict the level of investment in those products (R² < 30–40%). Standard classification techniques could thus be used by financial advisors to accelerate client discovery on their existing portfolio. This should result in productivity gains for those professionals and improve the distribution of financial products.
Keywords: Wealth management · Brokerage · Machine learning · Classification · Fintech · Technological change
1 Introduction
Personal banking pertains to the way households manage their finances and leverage the associated eco-system to meet their needs [14,23,53,71]. Those needs have traditionally been articulated around four pillars: payment, loans, investment and insurances. This has given birth to an industry structured around two core activities. On one hand, advisors and brokers distribute/sell financial solutions and products addressing those needs. On the other, asset managers, banks and insurers produce the associated services and process the associated orders.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): FICC 2024, LNNS 921, pp. 383–405, 2024. https://doi.org/10.1007/978-3-031-54053-0_27
Personal banking has, however, been the subject of recent criticism, as the associated services are not only considered expensive but also have not fundamentally changed over the past decades. For instance, investment products (e.g. private retirement plans based on equity), which yield on average a 5% return, have been subject to a 2–3% fee every year [8,61]. As a result, for every $ invested, only 2 or 3 cents are earned by a household, which barely covers inflation (currently running at a 1–2% rate in most mature countries). This long-standing criticism has given birth over the past decade to the Fintech movement [62], an entrepreneurial stream aimed at leveraging automation technology to make personal finance more efficient. And of course, this entrepreneurial movement has had its academic counterpart (see [48] for a definition).
Fintechs nowadays target both households and financial professionals. When it comes to households, the question Fintechs are trying to address is mainly one of speed. Recent evidence, for instance, shows that automation technologies have proven useful in reducing by about 30% the time it takes to get loans and insurances. When it comes to professionals, the question becomes one of productivity: Fintechs aim at reviewing the value chain of a profession [37] and automating as many activities as possible so that professionals can serve more households [16,72]. While examples of Fintech applications are numerous in the literature when it comes to loans and insurances [1,58,73], they are quite sparse when it comes to investment products. Most examples focus on robo-advisors, which, although efficient, are not widely adopted: their usage suffers from black-box considerations [64] and lack of trust [25,57] (for instance, studies show that only 29% (resp. 53%) of baby boomers (resp. Gen X/Y) trust robo-advisors), and it is also infrequent (most financial portfolios associated to investments are only re-balanced once every year).
The value chain associated to the distribution of financial investment products is however full of automation opportunities. In essence, it boils down to four steps [53]. Contacts (i.e. households) are first garnered and qualified (e.g. level of revenue, financial objectives etc.). Based on this qualification, pre-sales activities then occur to match clients’ needs with one or two products based on both financial and fiscal considerations. If interested, households then subscribe to the product. Finally, post subscription, households embark on a cycle of yearly reviews with their financial advisor.
This last step, namely the client review, is, as of today, prone to optimization. An advisor has about 100 clients [33] in their roster¹ and needs to spend several hours on each client. This time is spent contacting them, understanding whether their household structure has changed and assessing whether new products could be useful. This requires at least 100 h of the total of 1750 working hours per year available per advisor². The reason for this poor efficiency is that advisors’ main role is to distribute financial products and that less than 10% of the existing clients actually subscribe to a new one during those reviews. Other activities (i.e. re-balancing of positions in existing financial products, information updates) which occur during those reviews can easily be (if not already are) automated (e.g. via robo-advisors or digital client information files, referred to later as K.Y.Cs). An option to gain in productivity is therefore to accurately target this 10% section of the portfolio which needs new investment products. This is where automation technology, notably through the use of artificial intelligence (A.I.), can become useful, something that this paper will illustrate.
¹ Each client has on average 2 financial products leveraging a mixture of equity and bonds organised on 4 to 5 supports.
² This is based on the assumption of a 40 h working week, 50 weeks per year and 80% of time available for client-related matters.
This illustration will be two-fold. First, this article will show that standard information available in K.Y.Cs can be used to assess which household is likely to subscribe to a financial product over the next year. This will be done by using an anonymised and randomized proprietary data-set from Manymore, a French Fintech firm specialised in productivity tools for financial advisors. Second, this paper will explore whether the same information (i.e. households’ K.Y.C) can also be used to predict the amount they are likely to invest. Using the previously detailed figures, it can be estimated that equipping advisors with this type of tool could boost their productivity by at least 5%, as they would not only be able to target whom to contact in their client roster, but also be able to prepare pre-sale pitches to accelerate the distribution process. Note that the contributions of this article are likely to be broader than this simple use case. This will be further discussed in Sect. 2 in light of the current challenges highlighted by specialists in the field of personal finance.
Now, in terms of structure, this paper will first begin with a small literature review in Sect. 2 highlighting, in more detail, what is currently known about the Fintech movement and what is understood of the field of personal banking. Section 3 will then follow to explain the type of data that was used, the classification techniques which were employed to assess which household is likely to consume an investment product and the regression methods which were employed to predict the amount invested by households. Results will then be reviewed in detail in Sect. 4 and a discussion will be articulated around the limitations of this study as well as areas of future research in Sect. 5. A crisp conclusion will then wrap up the article.
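The productivity estimate of "at least 5%" given earlier in this introduction is not derived explicitly. One possible reading, sketched below, is that an advisor who reviews only the clients flagged as likely subscribers saves the review time spent on everyone else. The one-hour-per-review figure is an assumption taken as the lower bound implied by "at least 100 h" for about 100 clients, and the variable names are hypothetical.

```python
CLIENTS = 100                 # clients per advisor, as stated in the text
HOURS_PER_REVIEW = 1          # assumption: lower bound implied by "at least 100 h"
ANNUAL_HOURS = 1750           # working hours per year per advisor, as stated
LIKELY_SUBSCRIBER_PCT = 10    # "less than 10%" of clients subscribe; upper bound

# Time spent when every client is reviewed versus only the targeted ones.
review_hours = CLIENTS * HOURS_PER_REVIEW
targeted_clients = CLIENTS * LIKELY_SUBSCRIBER_PCT // 100
targeted_hours = targeted_clients * HOURS_PER_REVIEW

saved_hours = review_hours - targeted_hours
saved_share = saved_hours / ANNUAL_HOURS
print(f"{saved_hours} h saved, {saved_share:.1%} of annual working time")
# 90 h saved, 5.1% of annual working time
```

If reviews take several hours each, as the text suggests, the saving scales proportionally, which is consistent with the figure being presented as a lower bound.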
2 Theoretical Background
According to the recent review of [36], personal banking is articulated around four domains: loans, insurances, payment means and investment tools. The methods described in this article are aimed at facilitating the distribution of this latter category of products. The contribution of this paper will therefore be illustrated by first depicting what is known of households’ investment behavior, then by proposing a view of the current challenges acknowledged by academics and practitioners in this field, and finally by highlighting the specific contributions of the proposed methodology.
2.1 Households and Investments
Investment products are structured to help individuals and households accumulate capital to smooth their consumption throughout their life-cycle. The objective is to achieve a stable revenue in the form of a permanent income [34]. While loans are used to transfer future wealth to the present, investment products are used to transfer current wealth to the future. Investments are thus mainly used as a vehicle to accumulate wealth during the active period of life and to smooth consumption during one’s retirement (for instance through pension plans/funds) or even across generations. Empirical studies such as the one of [33] have shown that investments and the associated advice are primarily a topic for households whose head is in his/her 50s and who seeks to prepare for retirement. Those households generally invest about 100 k$ in 1 to 2 products (e.g. a life insurance and a retirement plan) and investments are spread across 5 to 6 supports (e.g. a mixture of equity and bonds). Beyond retirement, households also leverage investment products to transfer wealth across generations. Transmission is generally facilitated through specific products such as life insurances or through products structured to facilitate bequests. Bequests are indeed a tool frequently used by households to tie generations together through specific incentive schemes (the “strategic motive” described by [10]) and/or to serve the altruistic objectives of more experienced individuals [76]. Investments are generally categorized according to their nature: physical investments, which notably encompass households’ primary residence, and financial ones (e.g. pension plans etc.) [22]. On this front, empirical studies [11,20] indicate that 30% to 50% of households own financial investment products (with notable differences across countries) and that about 70% to 80% own physical investment products, notably through their home.
General decumulation patterns associated with investments are simple: physical assets can generate a form of annuity through a reverse mortgage or be sold so that the associated capital is converted into financial investments. Financial investments can then either generate an annuity (although a large portion of households remain reluctant to leverage this option [59]) or be subject to annual withdrawals, which generally amount to about 4% of the total accumulated capital [9] (with variations and adjustments based on individuals’ risk appetite [56,60]). As for bequests, they are generally structured either as a lump sum payment or as the transfer of the full or partial bare ownership of an asset [52]. Bequests are yet heavily regulated, notably to limit an excessive polarization of wealth across households [27,28].
2.2 Investments Challenges
When it comes to investment-related instruments, a number of dynamics are currently at play. First, local regulations are being reinforced [5] to protect households from hazardous providers and avoid situations similar to the subprime financial crisis in 2009. While this generates an increment in labor for professionals distributing investment products and dampens their productivity, it also generates opportunities in terms of technological replacement, since regulatory activities are heavily scripted (see [37] for the seminal theory on labor displacement and replacement). Those opportunities are currently addressed by RegTech actors, service providers specialized in digital technologies automating regulatory tasks [4,13]. Note that this regulatory pressure is increasing the specialisation of actors partaking in the distribution of investment products.
Second, the way households garner wealth (notably through labor-related income) is shifting. Global labor displacement (off/near-shoring) and replacement (through automation technologies) are driving a polarization of households’ income structure [26,41]. As a result, households’ savings capability is, on average, decreasing [77]. The polarisation of the economy at play in mature countries entails an increased proportion of low-income jobs with no savings capability, a shrinking “middle class” who could save to invest and a reinforcement of the economic means of the wealthiest (who could already save about (if not more than) 20% of their income [45]). When it comes to investment products, these stylized facts simply translate into an increased challenge in terms of access. On one hand, wealthy households benefit from the change and invest more heavily; on the other, the remaining households actually exit financial markets. Note that the challenge nowadays appears different from before. In the nineties, the question of access was linked to informational constraints [38] and access boomed with the development of the internet (see for instance [11] for household-level measures of the evolution of stock market participation). Several decades later, participation rates have not only plateaued but are declining [55]. Interestingly, the answer on the distributor side has been to attempt to lower investment fees (currently in the 2–3% range), which are deemed too “high” [33,61].
This spurred massive investments towards robo-advisors [32]. However, the adoption of those tools has so far been limited, and only a handful of technology-fuelled initiatives have been recorded on other investment-related topics [66,67]. Robo-advisors indeed appear to be of use for small transactions and investments and are therefore limited to a certain fringe of households [63]. But the bulk of investments (measured in terms of assets under management, A.U.M) still comes from wealthy households who prefer to leverage physical advisors. Their investment decisions indeed mix objective and subjective components, which are best addressed through human interactions [69]. The third and last main investment challenge encountered by households revolves around portfolio choices and diversification, which is heavily intertwined with notions of financial literacy and behavioral finance. Empirical studies (such as [35]) have indeed shown that households either equipped with basic financial notions or supported by a financial advisor achieve a reasonable level of diversification. However, considering that about 60% of households [3] do not possess enough financial knowledge³ and that only 10 to 15% of households benefit from the services of a financial advisor [75], it would seem⁴ that at least 30 to 40% of the households who can invest are making poor portfolio choices.⁵ At an aggregated level, this translates into a landscape where investing households hold on average only one equity stock, which is generally the stock of their employer [47]. But beyond notions of financial literacy, households are also known to have behavioral biases when it comes to investing. These are associated with personal traits such as risk taking, overconfidence and time preferences [6,51]. This is something that the field of behavioral finance has consistently addressed at an individual level since the eighties [40], and which is now evolving to consider how individual behaviors are shaped by social norms as the field shifts towards social finance [43].
³ This type of measure is however directional, as there is no standard metric or consensus when it comes to assessing financial literacy [74]. Besides, reported figures mainly depict the situation observable in the US, and differences may exist across countries.
E. A. Ribes
2.3 Current Contribution
With respect to the macro-level challenges highlighted in the previous subsection, the tools and methods described in this paper can contribute in two ways. Those tools are indeed aimed at democratizing the services provided by financial advisors. As such, they first contribute to the diversification of households’ portfolios, as more advice leads to more diversification. Second, they act as productivity tools which can free advisors’ time so that they can acquire new clients. At scale, this could potentially lead to an increased participation of households in the financial market. The proposed tools are therefore also a contribution to the access challenge highlighted above. To further illustrate the current contribution of the article, a short discussion follows on the activities of financial advisors and on how the proposed methods could be integrated in their day-to-day work. Looking at both existing benchmarks [33] and the professional experience gathered within Manymore⁶, a financial advisor serves between 70 and 80 households. Each household has 1 to 2 financial products (representing a total of about 110 products managed per advisor). From an advisor’s point of view, those products are divided in two blocks. About 80 products are considered mature. They each represent an investment of about 50–100 k$. Given that advisors are compensated through a retro-commission of 2 to 3% of the invested capital, these products generate a recurring revenue of about 110–120 k$ per year. The activity on those products is mainly associated with re-balancing: advisors indeed need to make sure that, given the latest market variations, the positions of their clients are diversified and
⁴ Considering that financial literacy, the availability of savings and the use of advisors are independent.
⁵ It would be interesting here to further invest in a comprehensive survey to assess the portion of households making poor investment decisions. Typically, households who invest generate a certain level of income and possess a higher level of education (incl. financial literacy).
⁶ Manymore is currently one of the leading software providers for financial advisors in France. It equips more than 2400 advisors with a comprehensive list of solutions enabling regulatory activities as well as financial advice and product distribution. Manymore’s software suite is used to onboard households, capture and maintain their data, and facilitate digital subscription and financial products management.
meet households’ expectations in terms of risks and returns. Then, about 20 to 30 of the products managed by advisors are products which have been subscribed during the year. Investments on this front are on average of 10 k$, and this pool of products generates a revenue for the advisor of about 5 k$ per year. Note that this will be further described in Sect. 3 when the data-set available for the study is depicted. In the meantime, this explains how the average productivity of an advisor (120 k€) can be decomposed. Activity-wise, advisors work about 210 days per year (of course, with variations depending on local regulations). Mature contracts generally require an annual re-balancing which takes at least a day of work, leaving about 150 days of work to distribute 20 to 30 new products (to either new or already known households). The activity of advisors is thus heavily skewed toward the distribution of new products, although the core of their revenue is associated with the products which have been under management for a long time. In light of this structure, the tools proposed in this article can be used in two ways. First, they can be positioned online and made available directly to households. Given a certain level of promotional effort to make sure that households are aware of the service and use it, this could generate qualified leads and reroute households to financial advisors. This mechanism could serve to democratize financial advice and, as previously mentioned, contribute to more diversification at an aggregated level. Second, the tools could be offered directly to advisors so that they can scan their portfolio of existing clients.
This would translate into an early identification of households who can benefit from financial products and the days of work saved here could be repurposed to increase either advisors’ efforts towards households’ education (therefore boosting the overall societal effort towards more financial literacy) or towards clients’ portfolio expansion (which would benefit the overall eco-system).
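As a rough sanity check on the revenue decomposition discussed in this subsection, the quoted figures can be recombined in a few lines. The sketch below is only illustrative: it assumes the midpoints of the quoted ranges (75 k$ per mature product, 25 new products per year) and the lower bound of the retro-commission range (2%), none of which is stated as such in the text. Python is used here for illustration only, as the study itself relies on R.

```python
# Back-of-the-envelope decomposition of an advisor's yearly revenue.
# All inputs are assumptions picked from the ranges quoted in the text.
COMMISSION_RATE = 0.02        # assumed: lower bound of the 2-3% retro-commission

mature_products = 80          # products considered mature
mature_ticket_k = 75.0        # assumed midpoint of 50-100 k$ per product
new_products = 25             # assumed midpoint of 20-30 products per year
new_ticket_k = 10.0           # average first-year investment, in k$


def yearly_revenue_k(n_products: int, ticket_k: float, rate: float) -> float:
    """Yearly revenue in k$ for n_products of average size ticket_k."""
    return n_products * ticket_k * rate


mature_revenue_k = yearly_revenue_k(mature_products, mature_ticket_k, COMMISSION_RATE)
new_revenue_k = yearly_revenue_k(new_products, new_ticket_k, COMMISSION_RATE)
mature_share = mature_revenue_k / (mature_revenue_k + new_revenue_k)

print(f"mature: {mature_revenue_k:.0f} k$, new: {new_revenue_k:.0f} k$, "
      f"share from mature products: {mature_share:.0%}")
```

Under these assumptions the mature block accounts for roughly 120 k$ and the new products for roughly 5 k$, which is in line with the orders of magnitude given in the text and illustrates why revenue is concentrated on long-standing contracts.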
3 Methodology
There are two main goals to this study. The first objective is to test whether or not standard classification techniques can accurately predict the subscription behavior of households over the next twelve months towards specific investment products (namely life insurance, private retirement plans and real estate assets). The second ambition is to assess the extent to which their level of investment during the first year can be forecasted with standard regression techniques. Note that this analysis is based on information which is normally available within financial distributors’ Customer Relationship Management [C.R.M] systems. To better understand the current intent and its applicability, this section describes the kind of data-set that can be found in such information systems (Subsect. 3.1) as well as the methods that were used (Subsect. 3.2) to draw the results detailed in Sect. 4.
3.1 Data
The data used for this study stems from one of Manymore’s databases. This data set, which is privately held, is anonymized. It holds information about 2439 households in France. Within this sample, 1622 (66%) households have detailed information in their “Know Your Client” [K.Y.C] records regarding their assets. Those households have been used as the baseline sample for the study. From a reference standpoint, households were represented by 488 distinct independent financial advisors [IFAs]. Each IFA covers on average 12.7 (±20) households. Note that this is lower than what would be expected in the profession to date, as IFAs traditionally service about 80 to 100 households [33]. This can however be explained by the fact that the software associated with the selected database is primarily used by IFAs to provide in-depth fiscal and financial advice. While this kind of advice comes with the capture of detailed information on households, which comes in handy for this study, its scope is limited to a subset of IFAs’ client roster (about 20%). Among the households considered in the study, 232 (14% of the sample) hold a private retirement plan (referred to as RP in the rest of the study), of which 59 were initiated in the year prior to the study. 870 households (53% of the sample) hold a life insurance (referred to as LI in the rest of the study), of which 202 were subscribed in the year prior to the study. Finally, 1435 households (88% of the sample) hold real estate assets (referred to as ReA in the rest of the study), 88 of which were bought in the year prior to the study. Investments represented, on average, respectively 280 k$ (±869 k$) for LIs, 30.3 k$ (±61 k$) for RPs and 282 k$ (±372 k$) for ReAs. This means, based on the current sample, that for any given household a professional has a 12% (resp. 3.6%) chance to distribute a new LI (resp. RP) and a 5.4% chance to support a new investment in ReAs in one year.
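The yearly base rates quoted above follow directly from the reported counts. A minimal Python sketch of that arithmetic is given below; the variable and function names are illustrative only, and Python is used purely for exposition (the study itself was carried out in R).

```python
# Counts reported for the baseline sample of households with detailed K.Y.C records.
SAMPLE_SIZE = 1622

# product -> (households holding it, of which acquired in the year prior to the study)
holdings = {
    "LI": (870, 202),    # life insurance
    "RP": (232, 59),     # private retirement plans
    "ReA": (1435, 88),   # real estate assets
}


def rates(holders: int, new: int, n: int = SAMPLE_SIZE) -> tuple[float, float]:
    """Return (share of households holding the product, one-year acquisition rate)."""
    return holders / n, new / n


summary = {name: rates(h, new) for name, (h, new) in holdings.items()}

for name, (held, acquired) in summary.items():
    print(f"{name}: held by {held:.1%} of households, "
          f"acquired during the year by {acquired:.1%}")
```

The one-year acquisition rates are low for all three products, which is what makes the associated classification problems imbalanced.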
The data points recorded in Manymore’s system for the 1622 households (indexed by the letter $n$ in the rest of this paper) are of two different natures. On one hand, structural information on the households gets captured. This includes the age of the main representative of the household ($A_n$), the age of his/her partner ($A_n^p$), marital status ($M_n$) and number of children ($C_n$). Additional information is also recorded on existing bequests for fiscal reasons ($\delta_n$). This latter element was recorded as a dummy variable (i.e. $\delta_n \in \{0, 1\}$). The average household was composed of a $E(A_n) = 53.4$ (±14.9) years old individual married to a $E(A_n^p) = 53.4$ (±12.9) years old person with $E(C_n) = 1.6$ (±1.62) children. 181 households ($E(\delta_n) = 11\%$) have bequests in place. On the other hand, various data points are stored pertaining to households’ wealth. This includes a detailed decomposition of the household’s revenues, expenses, assets and loans. Revenue was first analyzed according to three dimensions: the total revenue of the household ($R_n$) [$E(R_n) = 63.3$ k€/year (±113 k€)] and the revenue generated by the main representative of the household ($R_n^c$) (resp. his/her partner ($R_n^p$)) [$E(R_n^c) = 42.6$ k€/year (±96.6 k€), $E(R_n^p) = 14.6$ k€/year (±32 k€)]. The reason behind this split was that, intuitively, households’ revenue levels (which are notoriously correlated to savings capability) as well as internal differences could play a role in their financial product subscription decisions. At a high level, the
average profile shows that the households in the data-set were in the upper end of the French income distribution (top 10% according to the statistics recorded by the French state (I.N.S.E.E.⁷)). Besides, this also shows that households presented a high level of discrepancy between partners in terms of revenue. This is not unusual (French records show that, on average, the associated gap is 22%⁸), but is perhaps exacerbated in the available sample as individuals are at the end of their career or already retired. Revenue was also decomposed at the household level according to its source ($R_n^s$ with $s \in \{1, \dots, 15\}$). Sources were structured according to the referential in place in Manymore’s software. Details are displayed in Table 3 (in Appendix) and show that most of the households’ revenue was generated from standard wages and non-commercial earnings, complemented by some pensions (probably associated with the fact that a non-negligible portion of the data-set was composed of retired households), some level of income from stocks and shares and various allowances. When it comes to expenses, assets and debts, information was captured both at an aggregated household level and according to the expenses, assets and debts classification structure in place within Manymore’s software. In this sample, the average household exhibits a level of expenses of $E = 15.4$ k€ (±43 k€) per year. Over the whole sample, about 43% of households were reported to have contracted loans, with an average number of loans of $L = 0.8$ (±1.5), associated with a charge of about 14.3 k€ per year (±166 k€) [representing about 20 to 25% of households’ revenue]. Additionally, the average household owned $A = 7.1$ (±8.8) assets. The details of expenses and assets are displayed in Tables 4 and 5 (in Appendix). Given that households had a very simple debt pattern (about 1 loan), loans were not further classified according to specific local categories.
Expenses, which represent, on average, about 20–25% of households’ revenue, are primarily composed of common expenses and income taxes. The average household had a total wealth in terms of assets worth 1.3 M€ (±2.7 M€), which was primarily allocated between a primary residence (≈300 k€), some rental properties (≈200 k€), life insurance contracts (≈130 k€) and private equity (≈150–200 k€).
3.2 Methods
As can be seen from Fig. 1, three types of methodology were used in this article. First, standard feature selection techniques were leveraged to mine the data-set and extract valuable information which can be used to predict households’ investment behavior (see Subsect. 3.2). Second, standard classification algorithms were calibrated to the current data-set to understand whether it is feasible to predict if a household will subscribe to a life insurance (LI) or a retirement plan (RP), or buy a real estate asset (ReA) over the next 12 months (see Subsect. 3.2). Finally, regressions were used to evaluate if households’ level of investment in these newly opened contracts or newly
⁷ https://www.insee.fr/fr/statistiques/5431993.
⁸ https://www.insee.fr/fr/statistiques/6047789.
bought assets could be anticipated (see Subsect. 3.2). Note that the aforementioned techniques were implemented using the Caret package [49] available in R.
Fig. 1. Methodology used to calibrate the proposed classification and regression methods
Feature Selection Techniques. The number of variables which can be extracted or generated (through combination) from the current data-set is large. Feature selection is therefore mandatory to avoid potential over-fitting problems and achieve the best prediction/classification performance possible [46]. In light of the existing literature [15], this is nowadays performed through two main techniques: a correlation test [39] or a mutual information test [7]. In this case, both methods were tested and mutual information yielded the best performance. On top of feature selection, additional pre-processing tasks were completed, notably to handle class imbalance in the classification problems at hand (i.e. is the household likely to subscribe to a LI or RP, or to buy a ReA, over the next 12 months?). Class imbalance is indeed a common theme in machine learning [18] and is known to lead to sub-optimal performance when calibrating classification algorithms. Looking back at the statistics displayed in Sect. 3.1, the current data-set, and thus this study, is no stranger to this phenomenon. It is usually tackled through one of three core techniques [65]. A first option consists in down-sampling the training data-set (i.e. randomly sub-sampling the majority class to bring its frequency closer to that of the rarest class). Another option consists in up-sampling the training data (i.e. replicating cases from the minority class). The third option, “ROSE” [54], consists in the generation of artificial data points. In order to find the most suitable alternative, all three methods were tested and the one yielding the best results was kept. Note that additional alternatives exist (e.g. assigning weights to each class [19] or combining up- and down-sampling [17]) but were not tested in this study.
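To make the two pre-processing steps above more concrete, the sketch below scores discrete features by their mutual information with the class label and then balances the classes through random down-sampling. It is a minimal illustration written in plain Python and is not the implementation used in the study (which relied on R): the toy data, the function names and the random seed are all hypothetical, and continuous variables would first need to be discretized before such a score can be computed.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def down_sample(rows, labels, seed=0):
    """Randomly drop rows so that every class is as frequent as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = [(row, label)
                for label, group in by_class.items()
                for row in rng.sample(group, target)]
    rng.shuffle(balanced)
    return [r for r, _ in balanced], [l for _, l in balanced]


# Toy data (hypothetical): one informative and one uninformative binary feature.
labels      = [1, 1, 0, 0, 0, 0, 0, 0]   # 1 = subscribed within 12 months
informative = [1, 1, 0, 0, 0, 0, 0, 0]   # identical to the label
noise       = [0, 1, 0, 1, 0, 1, 0, 1]   # statistically independent of the label

scores = {"informative": mutual_information(informative, labels),
          "noise": mutual_information(noise, labels)}
rows = list(zip(informative, noise))
balanced_rows, balanced_labels = down_sample(rows, labels)

print(scores)
print(Counter(balanced_labels))
```

Features would then be ranked by their score and only the highest-ranked ones kept, while the balanced sample would serve as training data. Up-sampling follows the same logic with sampling with replacement from the minority class.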
Classification and Regression Methods. As highlighted in the recent reviews of [2,78], machine learning techniques dedicated to classification problems are numerous. Yet, three main ones appear to dominate the field (as seen in recent reviews dedicated to financial topics [42,50]), namely random forests [12] [referred to as RF in the rest of this paper], support vector machines [24] [referred to as SVM in the rest of this paper] and neural networks [68] [referred to as NNet in the rest of this paper]. Given their prominence, those three algorithms were pre-selected for the purpose of this study. Note that in light of the non-linearity of the classification problem (which was assessed through a Principal Component Analysis on the training data set), the chosen implementation of the SVM technique relied here on radial/exponential kernels. In addition to the RF, SVM and NNet algorithms, an additional contender technique was selected amongst the ones known to yield the best performance on real-world problems [78], namely conditional inference trees [44] [referred to as CTREE in the rest of this paper]. When it comes to regressions, three types of classical models [21,31] were used to explore whether the amount households are likely to invest in the 12 months following the subscription of a LI or RP (resp. the amount invested in ReA) can easily be explained. First, standard linear regressions were used. Second, non-linear techniques such as random forest regressions [12], support vector regressions (SVR) [24] and neural network regressions (NNR) [70] were also used. Additional techniques (for instance CART etc.) were also tested in this study, but their usage did not yield any performance improvement compared to the ones previously mentioned (see Sect. 4). Those attempts have therefore not been reported here.
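For reference, the radial kernel mentioned above measures the similarity between two feature vectors $x$ and $x'$ as $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. A minimal Python sketch is given below; the bandwidth $\gamma$ and the toy vectors are hypothetical, since the paper does not report the kernel hyper-parameters, and Python is used for illustration only.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical standardized features for three households.
household_a = [0.2, 1.0]
household_b = [0.2, 1.0]
household_c = [2.2, -1.0]

same = rbf_kernel(household_a, household_b)   # identical vectors -> similarity of 1
far = rbf_kernel(household_a, household_c)    # distant vectors -> similarity close to 0

print(same, far)
```

Because the similarity decays smoothly with distance, a support vector machine built on this kernel can draw non-linear decision boundaries, which is the motivation given in the text for preferring it over a linear kernel.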
4 Results
4.1 Key Learnings for Life Insurance (LI) Products
When it comes to predicting the amount invested by households in a newly subscribed life insurance product over the past year, the methods used in this article (linear regression, tree-based model (CART), support vector regression, neural networks and random forest) yielded mixed results, as shown in Table 1 (in Appendix). The available variables were indeed only able to explain about 25% of the variance observed in the data-set, with the best performing model being the random forest. Looking at the input variables fed to the regression, there are four data points yielding a normalized importance above the 25% threshold (see Fig. 2). First, households’ investments in life insurance products appear to be primarily driven by the proportion of financial assets they hold. The more financial products they have, the more they are likely to invest in life insurance instruments. Second, wealthier clients (“client total asset”) have higher savings levels (“epargne”) and thus higher levels of financial investment. Third, the higher the amount invested in a retirement plan (“PER”), the higher the amount invested in life insurance products. Here the correlation is likely to be driven by the amount of savings the household is able to generate.
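The share of variance explained referred to above is the coefficient of determination $R^2$. The paper does not state which definition was used, so the Python sketch below assumes the classical one, $R^2 = 1 - SS_{res}/SS_{tot}$; the toy figures are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical first-year investments (k$) and two sets of model predictions.
observed = [10.0, 20.0, 30.0, 40.0]
perfect = list(observed)          # a model reproducing every observation
mean_only = [25.0] * 4            # a model always predicting the sample mean

r2_perfect = r_squared(observed, perfect)
r2_mean_only = r_squared(observed, mean_only)

print(r2_perfect, r2_mean_only)
```

A model that does no better than predicting the sample mean scores 0 and a perfect model scores 1, so a value around 25% indicates that most of the dispersion in invested amounts remains unexplained by the available variables.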
Fig. 2. Importance of the variables used to predict the amount invested in life insurance by households over the past year (linear model)
When it comes to predicting whether or not a household is likely to subscribe to a life insurance over the next 12 months, the proposed methods are efficient (A.U.C of 80–85%). The selected algorithms yield relatively similar performances, as seen in Table 2 (in Appendix), on a balanced data-set of 320 households (class imbalance was handled through down-sampling). As seen in Fig. 3 (in Appendix), 7 variables have a normalized importance above 25% (when fed to the best-performing algorithm). The pattern behind the algorithm can be summarized in the following fashion: the wealthier a household and the more it is inclined to contract financial products, the more likely it is to subscribe to a life insurance. Besides, the more liquidity it has, the higher the chance that it contracts such a product. So here, statistical techniques simply replicate and industrialize professional knowledge available in the field.
4.2 Key Learnings for Retirement Plans (RP)
The current experiments do not yield statistically significant results when it comes to predicting the amount invested by households in a retirement plan in
France over the past year. The methods used in this article indeed yielded poor results with the current data-set ($R^2$ < 30%), as seen in Table 1 (in Appendix). While predicting the amount invested by households to prepare privately for retirement appears challenging, the likelihood of contracting such a product over the next 12 months can easily be assessed (see Table 2 in Appendix). Standard machine learning methods indeed yield very good results on this type of problem, with an A.U.C > 85% (post down-sampling, to train the models on a balanced data-set made of 96 observations). Analyzing the importance of the variables, as seen in Fig. 5 (in Appendix), shows that, for the best performing algorithm, the likelihood of opening a private retirement plan is highly linked to households’ total revenue (from both professional activities and real estate properties) and existing financial market participation (amount invested in life insurance products, proportion of wealth invested in financial products etc.). Once again, statistical methods simply activate and industrialize professional knowledge, materializing the fact that the higher households’ revenue, the lower the replacement rate offered by state-sponsored public pension schemes in France and the greater the incentive to capitalize and save through private mechanisms.
4.3 Key Learnings for Real Estate Assets (ReA)
Investigating the profile of households buying real estate properties within the current data-set also yields mixed results. On one hand, it proves difficult to predict the amount invested by individuals. When calibrating linear regression or tree-based models to the available information, no statistical technique was able to explain more than about 10% ($R^2$) of the inherent variance in the data, as seen in Table 1 (in Appendix). But when it comes to detecting a statistical pattern to infer whether or not a household will buy a property over the next 12 months, results appear much more interesting, as shown in Table 2 (in Appendix). Once the data-set has been balanced through down-sampling (100 data points), standard algorithms indeed show a “good” performance (expressed in terms of A.U.C) of about 80%. On this front, key variables used to yield a prediction revolve around the current worth of real estate assets owned by the household, the proportion of physical versus financial assets in the household’s portfolio as well as the household’s revenue and savings levels, as shown in Fig. 4 (in Appendix). This aligns with two stylized facts known within the financial advisor profession. First, real estate is considered as households’ primary investment vehicle (as soon as they have some savings capability). Second, this type of investment vehicle is persistent. Rather than diversifying with financial products, households (whose net worth is within the 500 k€ to 1–2 M€ range, as described within this data-set) indeed tend to accumulate wealth within physical assets over time.
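The A.U.C figures reported throughout this section can be read as the probability that a model gives a higher score to a household that did acquire the product than to one that did not. The Python sketch below computes it through this rank-based (Mann-Whitney) formulation; it is an illustration under that standard definition, with hypothetical labels and scores, and not the routine used in the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative one (ties count for 0.5).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical balanced test set: 1 = household acquired the product within 12 months.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]   # predicted acquisition probabilities

value = auc(labels, scores)
print(f"A.U.C = {value:.3f}")
```

An A.U.C of 50% corresponds to random ranking and 100% to a perfect ranking, which gives a scale on which to read the 70–90% range reported in Table 2.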
5 Discussion
5.1 Limitations
The current study presents a couple of limitations. First, the sample used to calibrate the proposed algorithms remains “small” and biased towards a certain
social layer. It only covers 1622 households belonging to the top 1% (income-wise) of the French economy. The households in the sample indeed showcase an average revenue of 113 k$ per year, while public sources⁹ highlight that the median revenue of French households is close to 20 to 25 k$ per year. This has two consequences. On one hand, results displayed here are likely to be inappropriate for the general population. The criteria driving the appetite for a financial product such as a retirement plan are very likely to differ between a household with 100+ k$ of revenue per year and a household with 20 k$ of annual revenue. On the other, there is a question regarding the replicability of the study. The top 1% of the French population indeed contains about 300 k households, of which the available sample covers about 0.5%. The data-set may therefore present some underlying biases, which could not be assessed prior to the study. Second, the data used to fuel the study only encompasses French households. Despite being one of the largest economies in the world, France has several specific regulatory characteristics which may further hinder the portability of the results. France is known to have a very strong public system catering for retirement (in line with the European “Bismarckian” tradition) and benefits from one of the highest public replacement rates across O.E.C.D countries¹⁰ when distributing pensions. This means that the elements driving French households to subscribe to private retirement plans are likely to be very different from the ones driving the decision of an American or an English household, even within the portion of the population with a very high level of income. There are however good chances that results would be structurally similar across Germany, Italy and Spain for similar population groups.
Third, retirement plans and life insurance products come with very specific tax incentives in France, which are more than likely to affect the subscription behavior of households. In France, life insurance comes with specific inheritance tax exemptions up to about 100 k€. This usually tends to determine the size of the major part of households’ investments, especially as 60% of the social fringe encompassed in this study tends to look for means to optimize its succession. A similar mechanism is also at play for retirement plans, as deposits come with a rebate on income taxes up to 20 k€ per year. This again creates a strong incentive with respect to the size of the deposits and their frequency, but only for households with a certain level of income. Those two specificities are inherently intertwined with the French tax system, and when replicating the study in an alternative geography, some attention should be paid to the associated context as it may impact the results (for instance the importance of the ‘tax’ variable highlighted in Fig. 5 (in Appendix)). Finally, the application of standard algorithms to the current data set does not present good predictive capabilities ($R^2$ < 30–40%) with respect to the level of investment performed by households in the core investment media (ReA, LI, RP) available in France. Although more efforts could be invested in leveraging a broader array of techniques, the reason may very well be that the data-set itself is inappropriate. Deposits are indeed of two types:
⁹ https://www.insee.fr/fr.
¹⁰ https://www.oecd.org/.
one-off and recurring. One-off deposits are usually triggered by specific events, for instance the receipt of a lump sum because of an inheritance, and are often subject to constraints (for instance some distributors require an initial upfront deposit to open a contract). It is highly likely that those one-off contextual elements are the key elements determining the level of investment over the first year of a contract. However, exploring this hypothesis would require a more granular data-set encompassing the history of operations (deposits, withdrawals, re-investments, portfolio re-balancing etc.) on contracts.
5.2 Future Research Opportunities
There are four streams of activities that could be undertaken to build upon the current research. First, additional research in the context of France could be useful. To address the topic of portability to other social layers, it could be interesting to expose the proposed algorithms to the public via a software platform used at scale, and thereby analyse a bigger volume of information. An option could be to see if collecting data for 15 to 30k households with a “high earner” profile would cement the current findings or not. This could easily be done by deploying and re-calibrating the algorithms in a financial advice network with more than 2000 counselors (e.g. a large French bank). Second, it would be interesting, still in the French context, to see how the proposed methods perform for households with a different income profile. The topic of retirement planning is indeed vibrant in O.E.C.D countries with Bismarckian retirement systems, since replacement rates offered by public pensions are getting lower due to population ageing [29,30]. This in turn triggers a number of questions around the usage of private retirement plans and their democratization. Third, additional research in the context of France would be necessary to understand the levels of investment displayed by households. As stressed at the end of Subsect. 5.1, this would most likely require another type of data-set. This could be done in the context of France by leveraging anonymous information from private financial data aggregators such as Harvest or Manymore, or by accessing anonymous information stemming from the back office of a large financial product distribution network (for instance a large bank). This effort could however prove complex as it would require blending two types of information. On one hand, it would require granular data on financial contracts, their operations and their ownership structure.
On the other hand, it would require detailed information about the wealth of the contracts’ owners (e.g. their level of revenue etc.). Data collection and reconciliation would therefore present a significant challenge. In light of the considerations developed in this article, a last natural area for future research lies in the extension of the study to other countries. While the usual key markets (USA, UK, Germany, Spain, Italy, Japan) jump to mind, there are a few things to consider. The level of participation of households in financial markets is high in the US in light of the prominence of private retirement systems. Besides, life insurance does not benefit there from the same aura as in Europe. Differences
are therefore expected on this front. When it comes to European countries, regulation differs between countries. The UK has notably implemented the latest MiFID regulations (changes in terms of the compensation schemes associated with the distribution of financial products) and as a result has completely altered its local structure in terms of financial advice and product distribution. This is therefore likely to have an effect on the variables driving households’ investment behavior and decisions. As for the other European countries (Germany, Spain and Italy), local retirement systems, demographic pressure (notably for Japan) and structural differences when it comes to financial institutions are also more than likely to generate some heterogeneity in the proposed results.
6 Conclusion
This article shows that simple information depicting households’ wealth can be used to infer their subscription behavior towards standard financial products (i.e. life insurance and retirement plans) as well as their purchasing intention towards real estate assets. This can be done using standard machine learning techniques such as random forests with a high level of performance (A.U.C > 90%) for French households. However, the data-set used for this study proves insufficient to predict the exact level of investment of those households based on their macro-level characteristics ($R^2$ < 30–40%). The considerations developed in this study suggest that accessing a more granular data-set (for instance, the past behavior of households in terms of deposits and/or withdrawals on already owned financial products) could be of use.
Appendix
Table 1. Regression results - amount invested in private retirement plans, life insurance and real estate assets by households over a 12 months period

Model type          RP - $R^2$   LI - $R^2$   ReA - $R^2$
Linear regression   7.9%         18.4%        10.1%
CART                15.4%        7.6%         4.4%
SVR                 10.5%        18.6%        4.3%
NNR                 11.5%        4.1%         5.6%
Random Forest       13.8%        26.3%        5.6%
Table 2. Classification results - Will a household buy a life insurance/retirement plan/real estate asset over the next 12 months?

Model type      LI - A.U.C   RP - A.U.C   ReA - A.U.C
C Tree          77.8%        89.2%        73.3%
CART            84.6%        82.7%        69.7%
SVM Radial      84.5%        86.6%        74.3%
Random Forest   84.5%        90.4%        78.9%
NNET            92.1%        70.4%        86.3%
Fig. 3. Importance of the variables used to predict whether or not a household is likely to subscribe to a life insurance over the next 12 months.
Fig. 4. Importance of the variables used to predict whether or not a French household is likely to buy real estate over the next 12 months.
Fig. 5. Importance of the variables used to predict whether or not a household is likely to subscribe to a private retirement plan in France over the next 12 months
Table 3. Households revenue decomposition in Manymore’s sample

Revenue sources                     Mean household revenue (k€/year)   Std deviation (k€/year)
Wages ($R_n^1$)                     34.1                               74.1
Non commercial earnings ($R_n^2$)   8.9                                42.2
Commercial earnings ($R_n^3$)       1.7                                13
Agricultural wages ($R_n^4$)        0.5                                18.3
Pensions ($R_n^5$)                  5.12                               18.3
6 Non taxable pensions (Rn )