Lecture Notes on Data Engineering and Communications Technologies 147
Mohamed Ben Ahmed Boudhir Anouar Abdelhakim Bernadetta Kwintiana Ane Didi Rosiyadi Editors
Emerging Trends in Intelligent Systems & Network Security
Lecture Notes on Data Engineering and Communications Technologies Volume 147
Series Editor Fatos Xhafa, Technical University of Catalonia, Barcelona, Spain
The aim of the book series is to present cutting edge engineering approaches to data technologies and communications. It will publish latest advances on the engineering task of building and deploying distributed, scalable and reliable data infrastructures and communication systems. The series will have a prominent applied focus on data technologies and communications with aim to promote the bridging from fundamental research on data science and networking to data engineering and communications that lead to industry products, business knowledge and standardisation. Indexed by SCOPUS, INSPEC, EI Compendex. All books published in the series are submitted for consideration in Web of Science.
More information about this series at https://link.springer.com/bookseries/15362
Mohamed Ben Ahmed · Boudhir Anouar Abdelhakim · Bernadetta Kwintiana Ane · Didi Rosiyadi
Editors
Emerging Trends in Intelligent Systems & Network Security
Editors
Mohamed Ben Ahmed, FST Tangier, Abdelmalek Essaâdi University, Tetouan, Morocco
Boudhir Anouar Abdelhakim, FST Tangier, Abdelmalek Essaâdi University, Tetouan, Morocco
Bernadetta Kwintiana Ane, University of Stuttgart, Stuttgart, Germany
Didi Rosiyadi, Research Center for Informatics at National Research and Innovation Agency (BRIN), Jawa Barat, Indonesia
ISSN 2367-4512  ISSN 2367-4520 (electronic)
Lecture Notes on Data Engineering and Communications Technologies
ISBN 978-3-031-15190-3  ISBN 978-3-031-15191-0 (eBook)
https://doi.org/10.1007/978-3-031-15191-0

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023, corrected publication 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The advancement of intelligent information systems, cyber-security, and networking technologies has become a critical requirement in the era of interconnected digital worlds. With the exponential growth of wireless communications, the Internet of Things, and cloud computing, and the increasingly dominant role played by electronic commerce in every major industry, safeguarding information in storage and in transit over communication networks has become one of the most critical and contentious challenges for technology innovators. This trend opens up significant research activity for academics, research institutes, and their partners (industry, governments, civil society, etc.) to establish essential and intelligent foundations for solving complex computing problems in the active areas of networking, intelligent systems, and security. In the past two decades, hundreds of algorithms have been developed that capitalize on the effectiveness of artificial intelligence. We therefore aim to advance intelligent systems research toward cutting-edge discoveries and to encourage open discussion on recent advances in computer communication and information technologies.

In this context, this book addresses the challenges associated with scientific research and engineering applications for the construction of intelligent systems and their various innovative applications and services. The book also aims to provide researchers, engineers, and practitioners with an integrated view of the problems and to outline new topics in networks and security.

This edition is the result of the work accepted and presented at the Fifth International Conference on Networks, Intelligent Systems and Security (NISS2022), held on March 30–31, 2022, in Bandung, Indonesia. It brings together original research, completed works, and proposed architectures on the main themes of the conference. The goal of this edition is to lay the groundwork of research, innovations, and applications that can support the growth of the next generation of networks and intelligent systems.
We would like to acknowledge and thank the Springer Nature staff for their support and guidance and for the editing of this book. Finally, we wish to express our sincere thanks to professors Fatos Xhafa, Thomas Ditzinger, and Suresh Dharmalingam for their kind support and help in promoting and developing this research.
Committee
Conference General Chair Ahmad Afif Supianto
National Research and Innovation Agency, Indonesia
Conference General Co-chairs
Mohamed Ben Ahmed - FST Tangier, UAE University, Morocco
Anouar Boudhir Abdelhakim - FST Tangier, UAE University, Morocco
Bernadetta Kwintiana Ane - University of Stuttgart, Germany
Conference Chairs
Pavel Kroemer - VSB - Technical University of Ostrava, Czech Republic
Arafat Febriandirza - National Research and Innovation Agency, Indonesia
Technical Program Committee Chair Hilman F. Pardede
National Research and Innovation Agency, Indonesia
Treasurer Chair Purnomo Husnul Khotimah
National Research and Innovation Agency, Indonesia
Program (Paper/Special Session/Poster) Chair Wassila Mtalaa
Luxembourg Institute of Science and Technology (LIST), Luxembourg
Proceedings/Book Chair Heru Susanto
National Research and Innovation Agency, Indonesia
Sponsorship, Event, and Logistics Chair Parthasarathy Subashini
Avinashilingam University, India
Event Publication and Documentation Chairs
Muhammad Yudhi Rezaldi - National Research and Innovation Agency, Indonesia
Lotfi el Achak - Abdelmalek Essaâdi University, Morocco
Zameer Gulzar - S.R. University Warangal, India
IT Support/Web Administrator Chair Andria Arisal
National Research and Innovation Agency, Indonesia
Technical Program Committee Members
Vicky Zilvan - National Research and Innovation Agency, Indonesia
Akbari Indra Basuki - National Research and Innovation Agency, Indonesia
Event Publication and Documentation Members
Abdurrakhman Prasetyadi - National Research and Innovation Agency, Indonesia
Siti Kania Kushadiani - National Research and Innovation Agency, Indonesia
Andri Fachrur Rozie - National Research and Innovation Agency, Indonesia
IT Support/Web Administrator Members
Shidiq Al Hakim - National Research and Innovation Agency, Indonesia
Syam Budi Iryanto - National Research and Innovation Agency, Indonesia
International Advisory Committee
Laksana Tri Handoko - National Research and Innovation Agency, Indonesia
Vincenzo Piuri - University of Milan, Italy
Imre J. Rudas - Obuda University, Hungary
Vaclav Snasel - VSB-TU Ostrava, Czech Republic
Kaoutar el Maghraoui - IBM Thomas J. Watson Research Center, USA
Steering Committee
Didi Rosiyadi - National Research and Innovation Agency, Indonesia
Wahyudi Hasbi - IEEE Indonesia Section, Indonesia
Ilsun You - Soonchunhyang University, South Korea
Ismail Rakip Keras - Karabuk University, Turkey
Technical Program Committee
Abdellah Chehri - University of Quebec in Chicoutimi, Canada
Adel Alti - University of Setif, Algeria
Ahaitouf Ali - FST-Fez, Morocco
Agnes Mindila - Jomo Kenyatta University of Agriculture and Technology, Kenya
Anouar Abtoy - ENSATe-Abdelmalek Essaâdi University, Morocco
Ayoub Elhoucine - INTTIC, Algeria
Abdellatif Medouri - ENSA, Tetouan, Morocco
Abderrahim El Mhouti - FS, Tetouan, Morocco
Ambar Yoganingrum - BRIN, Indonesia
Arafat Febriandirza - BRIN, Indonesia
Arnida L. Latifah - BRIN, Indonesia
Aziz Mahboub - FST, Tangier, Morocco
Azlinah Mohamed - UiTM Shah Alam Malaysia, Malaysia
Ben Ahmed Mohamed - FST, Tangier, Morocco
Belkacem Kouninef - INTTIC, Algeria
Boudhir Anouar Abdelhakim - FST, Tangier, Morocco
Didi Rosiyadi - BRIN, Indonesia
Dmitry Bystrov - Tashkent State Technical University, Uzbekistan
E. M. Dogo - Federal University of Technology Minna, South Africa
El Arbi Abdellaoui Alaoui - E3MI, Morocco
Enrique Arias - University of Castilla-La Mancha, Spain
Es Sbai Najia - FST, Fez, Morocco
Esa Prakasa - BRIN, Indonesia
El Aachak Lotfi - FST, Tangier, Morocco
El Kalkha Hanae - ENSAT, Tangier, Morocco
Fatma Zohra Bessai Mechmache - CERIST, Algeria
Heru Susanto - BRIN, Indonesia
Hilman F. Pardede - BRIN, Indonesia
Ibrahima Niang - LID/UCAD, Senegal
Joel Rodrigues - National Institute of Telecommunications (INATEL), Brazil
Kashif Saleem - King Saud University, Saudi Arabia
Lawrence Nderu - Jomo Kenyatta University of Agriculture and Technology, Kenya
Lindung P. Manik - BRIN, Indonesia
Mariam Tanana - ENSA, Tangier, Morocco
Natalia Kolesnik - Sociological Institute of the RAS, Russia
Norjansalika Janom - Universiti Teknologi Mara (UiTM), Malaysia
Norli Shariffuddin - UiTM Shah Alam Malaysia, Malaysia
Nur Zahrah Farida Ruslan - Universiti Teknologi Mara (UiTM), Malaysia
Olga Sergeyeva - Saint Petersburg University, Russia
Riouch Fatima - INPT, Rabat, Morocco
Shidiq Al Hakim - BRIN, Indonesia
Sanjar Giyasov - Tashkent State Technical University, Uzbekistan
Sibel Senan - Istanbul University
Tarekmohamed El-Fouly - Qatar University, Qatar
Tolga Ensari - Istanbul University-Cerrahpasa, Turkey
Wassila Aggoune-Mtalaa - Luxembourg Institute of Science and Technology, Luxembourg
Youssou Kasse - Université Alioune Diop de Bambey, Senegal
Yasyn Elyusufi - FST, Tangier, Morocco
Zakaria Elmrabet - University of North Dakota, USA
Ziani Ahmed - FST, Tangier, Morocco
Keynote Speakers
Difficult Data Analysis Michał Woźniak
Biography: Michał Woźniak is a professor of computer science at the Department of Systems and Computer Networks, Wroclaw University of Science and Technology, Poland. His research focuses on machine learning, compound classification methods, classifier ensembles, data stream mining, and imbalanced data processing. Prof. Woźniak has been involved in research projects related to the aforementioned topics and has been a consultant for several commercial projects for well-known Polish companies and public administration. He has published over 300 papers and three books. He has received numerous prestigious awards for his scientific achievements, such as the IBM Smarter Planet Faculty Innovation Award (twice) and the IEEE Outstanding Leadership Award, as well as several best paper awards at prestigious conferences. He has served as a program committee chair and member for numerous scientific events and has prepared several special issues as a guest editor.
E-CARGO and Role-Based Collaboration Haibin Zhu
Biography: Haibin Zhu is a full professor and the chair of the Department of Computer Science and Mathematics, founding director of Collaborative Systems Laboratory, and member of the Research Committee, Nipissing University, Canada. He received a B.S. degree in computer engineering from the Institute of Engineering and Technology, China (1983), and M.S. (1988) and Ph.D. (1997) degrees in computer science from the National University of Defense Technology (NUDT), China. He was a visiting professor and a special lecturer in the College of Computing Sciences, New Jersey Institute of Technology, USA (1999–2002), and a lecturer, an associate professor, and a full professor at NUDT (1988–2000). He has accomplished over 200 research works including 28 IEEE Trans. articles, six books, five book chapters, three journal issues, and four conference proceedings.
E-CARGO and Role-Based Collaboration Shi-Jinn Horng
Biography: Shi-Jinn Horng is a chair professor at the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology. He is an expert in distributed and parallel computing, artificial intelligence, and hardware, and has more than 200 published articles. His current research interests include deep learning and big data, biometric recognition, information security, cloud and fault computing, multimedia, and medical applications.
Feature Object Extraction – Fusing Evidence, Not Rolling the Die Kathleen Kramer
Biography: Kathleen Kramer is a professor of Electrical Engineering at the University of San Diego, San Diego, CA. She worked to develop new engineering programs as a founding member of the faculty, eventually became the chair of electrical engineering, and then served as Director of Engineering (2004–2013), providing academic leadership for all of the university's engineering programs. She has also been a Member of Technical Staff at several companies, including ViaSat, Hewlett Packard, and Bell Communications Research. Author or co-author of over 100 publications, she maintains an active research agenda and has recent publications in the areas of multi-sensor data fusion, intelligent systems, and neural and fuzzy systems. Her teaching interests are in the areas of signals and systems, communication systems, and capstone design. She received the B.S. degree in electrical engineering magna cum laude with a second major in physics from Loyola Marymount University, and the M.S. and Ph.D. degrees in electrical engineering from the California Institute of Technology.
AIoT: Cloud or Edge, Big or Small Data, Public or Private Model Ying-Dar Lin
Biography: Ying-Dar Lin is Chair Professor of computer science at National Chiao Tung University (NCTU), Taiwan. He received his Ph.D. in computer science from the University of California at Los Angeles (UCLA) in 1993. He was a visiting scholar at Cisco Systems in San Jose during 2007–2008, CEO at Telecom Technology Center, Taiwan, during 2010–2011, and Vice President of National Applied Research Labs (NARLabs), Taiwan, during 2017–2018. He cofounded L7 Networks Inc. in 2002, later acquired by D-Link Corp. He also founded and directed the Network Benchmarking Lab (NBL) from 2002, which reviewed network products with real traffic and automated tools and was an approved test lab of the Open Networking Foundation (ONF), and spun off O'Prueba Inc. in 2018. His research interests include machine learning for network security, wireless communications, network softwarization, and mobile edge computing. His work on multi-hop cellular was the first along this line, has been cited over 1000 times, and was standardized into IEEE 802.11s, IEEE 802.15.5, IEEE 802.16j, and 3GPP LTE-Advanced. He is an IEEE Fellow (class of 2013), an IEEE Distinguished Lecturer (2014–2017), and an ONF Research Associate (2014–2018), and received the K. T. Li Breakthrough Award in 2017 and the Research Excellence Award in 2017 and 2020. He has served or is serving on the editorial boards of several IEEE journals and magazines, including as Editor-in-Chief of IEEE Communications Surveys and Tutorials (COMST), whose impact factor increased from 9.22 to
29.83 during his term in 2017–2020. He published a textbook, Computer Networks: An Open Source Approach (http://www.mhhe.com/lin), with Ren-Hung Hwang and Fred Baker (McGraw-Hill, 2011).
Communication and Computing Resource Management for Internet of Things Ning Zhang
Biography: Ning Zhang is Associate Professor in the Department of Electrical and Computer Engineering at University of Windsor, Canada. He received the Ph.D. degree in Electrical and Computer Engineering from University of Waterloo, Canada, in 2015. After that, he was a postdoc research fellow at University of Waterloo and University of Toronto, respectively. His research interests include connected vehicles, mobile edge computing, wireless networking, and machine learning. He is a highly cited researcher and has 20 ESI highly cited papers. He serves as an associate editor of IEEE Internet of Things Journal, IEEE Transactions on Cognitive Communications and Networking, and IEEE Systems Journal; and a guest editor of several international journals, such as IEEE Wireless Communications, IEEE Transactions on Industrial Informatics, and IEEE Transactions on Intelligent Transportation Systems. He also serves/served as a TPC chair for IEEE VTC 2021 and IEEE SAGC 2020, a general chair for IEEE SAGC 2021, a track chair for several international conferences and workshops. He received eight Best Paper Awards from conferences and journals, such as IEEE Globecom and IEEE ICC. He also received IEEE TCSVC Rising Star Award for outstanding contributions to research and practice of mobile edge computing and Internet of things service.
Hardware-Software Co-design Approaches for Sustainable AI Kaoutar El Maghraoui
Biography: Kaoutar El Maghraoui is a principal research scientist at the IBM Research AI organization where she is focusing on innovations at the intersection of systems and artificial intelligence. She leads the End-Use experimental AI testbed of the IBM Research AI Hardware Center, a global research hub focusing on enabling next-generation chips and systems for AI workloads. She is currently focusing on the operationalization aspects of AI systems in hybrid cloud environments. Kaoutar has extensive experience and deep expertise in HPC, systems software, cloud computing, and machine learning.
Architectures, Challenges and Opportunities within 6G Emerging Technologies Anouar Boudhir Abdelhakim
Biography: Anouar Boudhir Abdelhakim is currently an associate professor at the Faculty of Sciences and Techniques of Tangier. He is the president of the Mediterranean Association of Sciences and Technologies and an adviser to the Moroccan union against dropping out of school. He received the HDR degree from Abdelmalek Essaadi University. He is the co-author of several papers published in IEEE Xplore, ACM, and highly indexed journals and conferences. He has co-edited several books published in Springer series, and he is a co-founder of a series of international conferences (Smart health17, SCIS'16, SCA18, SCA19, SCA20, NISS18, NISS19, NISS20, NISS21, ICDATA21) held since 2016. He supervises several theses in artificial intelligence, security, and e-healthcare. His key research relates to networking and protocols, ad hoc networks, VANETS, WSN, IoT, big data, AI computer healthcare applications, smart city applications and security applications.
Contents
A Bi-objective Evolutionary Algorithm to Improve the Service Quality for On-Demand Mobility . . . 1
Sonia Nasri, Hend Bouziri, and Wassila Aggoune-Mtalaa
A Dynamic Circular Hough Transform Based Iris Segmentation . . . 9
Abbadullah .H Saleh and Oğuzhan Menemencioğlu
A Game Theoretic Framework for Interpretable Student Performance Model . . . 21
Hayat Sahlaoui, El Arbi Abdellaoui Alaoui, and Said Agoujil
A New Telecom Churn Prediction Model Based on Multi-layer Stacking Architecture . . . 35
Jalal Rabbah, Mohammed Ridouani, and Larbi Hassouni
A Novel Hybrid Classification Approach for Predict Performance Student in E-learning . . . 45
Hanae Aoulad Ali, Chrayah Mohamed, Bouzidi Abdelhamid, Nabil Ourdani, and Taha El Alami
A Proposed Big Data Architecture Using Data Lakes for Education Systems . . . 53
Lamya Oukhouya, Anass El haddadi, Brahim Er-raha, Hiba Asri, and Naziha Laaz
A SVM Approach for Assessing Traffic Congestion State by Similarity Measures . . . 63
Abdou Khadre Diop, Amadou Dahirou Gueye, Khaly Tall, and Sidi Mohamed Farssi
A Tool for the Analysis, Characterization, and Evaluation of Serious Games in Teaching . . . 73
Farida Bouroumane and Mustapha Abarkan
Analysis and Examination of the Bus Control Center (BCC) System: Burulaş Example . . . 84
Kamil İlhan and Muharrem Ünver
Analyze Symmetric and Asymmetric Encryption Techniques by Securing Facial Recognition System . . . 97
Mohammed Alhayani and Muhmmad Al-Khiza'ay
Aspect-Based Sentiment Analysis of Indonesian-Language Hotel Reviews Using Long Short-Term Memory with an Attention Mechanism . . . 106
Linggar Maretva Cendani, Retno Kusumaningrum, and Sukmawati Nur Endah
Big Data and Machine Learning in Healthcare: Concepts, Technologies, and Opportunities . . . 123
Mustafa Hiri, Mohamed Chrayah, Nabil Ourdani, and Taha el alamir
Classification of Credit Applicants Using SVM Variants Coupled with Filter-Based Feature Selection . . . 136
Siham Akil, Sara Sekkate, and Abdellah Adib
Classification of Hate Speech Language Detection on Social Media: Preliminary Study for Improvement . . . 146
Ari Muzakir, Kusworo Adi, and Retno Kusumaningrum
CoAP and MQTT: Characteristics and Security . . . 157
Fathia Ouakasse and Said Rakrak
Combining Static and Contextual Features: The Case of English Tweets . . . 168
Nouhaila Bensalah, Habib Ayad, Abdellah Adib, and Abdelhamid Ibn El Farouk
Comparative Study on the Density and Velocity of AODV and DSDV Protocols Using OMNET++ . . . 176
Ben Ahmed Mohamed, Boudhir Anouar Abdelhakim, Samadi Sohaib, Faham Hassan, and El belhadji Soumaya
Cybersecurity Awareness Through Serious Games: A Systematic Literature Review . . . 190
Chaimae Moumouh, Mohamed Yassin Chkouri, and Jose L. Fernández-Alemán
Design and Implementation of a Serious Game Based on Recommender Systems for the Learning Assessment Process at Primary Education Level . . . 200
Fatima Zohra Lhafra and Otman Abdoun
DLDB-Service: An Extensible Data Lake System . . . 211
Mohamed Cherradi and Anass El Haddadi
Effect of Entropy Reshaping of the IP Identification Covert Channel on Detection . . . 221
Manal Shehab, Noha Korany, Nayera Sadek, and Yasmine Abouelseoud
Explainable Machine Learning Model for Performance Prediction MAC Layer in WSNs . . . 232
El Arbi Abdellaoui Alaoui, Khalid Nassiri, and Stephane Cedric Koumetio Tekouabou
Hadoop-Based Big Data Distributions: A Comparative Study . . . 242
Ikram Hamdaoui, Mohamed El Fissaoui, Khalid El Makkaoui, and Zakaria El Allali
HDFS Improvement Using Shortest Path Algorithms . . . 253
Mohamed Eddoujaji, Hassan Samadi, and Mohammed Bouhorma
Improved Hourly Prediction of BIPV Photovoltaic Power Building Using Artificial Learning Machine: A Case Study . . . 270
Mouad Dourhmi, Kaoutar Benlamine, Ilyass Abouelaziz, Mourad Zghal, Tawfik Masrour, and Youssef Jouane
Improving Speaker-Dependency/Independency of Wavelet-Based Speech Emotion Recognition . . . 281
Adil Chakhtouna, Sara Sekkate, and Abdellah Adib
Improving the Quality of Service Within Multi-objective Customer-Oriented Dial-A-Ride Problems . . . 292
Sonia Nasri, Hend Bouziri, and Wassila Aggoune-Mtalaa
Interpretability Based Approach to Detect Fake Profiles in Instagram . . . 306
Amine Sallah, El Arbi Abdellaoui Alaoui, and Said Agoujil
Learning Styles Prediction Using Social Network Analysis and Data Mining Algorithms . . . 315
Soukaina Benabdelouahab, Jaber El Bouhdidi, Yacine El Younoussi, and Juan M. Carrillo de Gea
Managing Spatial Big Data on the Data LakeHouse . . . 323
Soukaina Ait Errami, Hicham Hajji, Kenza Ait El Kadi, and Hassan Badir
Medication Decision for Cardiovascular Disease Through Fermatean Fuzzy Bipolar Soft Set . . . 332
Kanak Saxena and Umesh Banodha
Microservices: Investigating Underpinnings . . . 343
Idris Oumoussa, Soufiane Faieq, and Rajaa Saidi
Network Slicing User Association Under Optimal Input Covariance Matrix in Virtual Network MVNO . . . 352
Mamadou Diallo Diouf and Massa Ndong
Pedagogical Classification Model Based on Machine Learning . . . 363
Hanane Sebbaq and Nour-eddine El Faddouli
Performance Evaluation of NS2 and NS3 Simulators Using Routing Protocols in Mobile Ad Hoc Networks . . . 372
Boudhir Anouar Abdelhakim, Ben Ahmed Mohamed, Abbadi Aya, Achahbar Salma, and Soufiani Assia
Gamification in Software Development: Systematic Literature Review . . . 386
Oki Priyadi, Insan Ramadhan, Dana Indra Sensuse, Ryan Randy Suryono, and Kautsarina
Robust Method for Estimating the Fundamental Matrix by a Hybrid Optimization Algorithm . . . 399
Soulaiman El Hazzat and Mostafa Merras
SDN Southbound Protocols: A Comparative Study . . . 407
Lamiae Boukraa, Safaa Mahrach, Khalid El Makkaoui, and Redouane Esbai
Simulating and Modeling the Vaccination of Covid-19 Pandemic Using SIR Model - SVIRD . . . 419
Nada El Kryech, Mohammed Bouhorma, Lotfi El Aachak, and Fatiha Elouaai
The New Generation of Contact Tracing Solution: The Case of Morocco . . . 432
Badr-Eddine Soussi Niaimi, Lotfi Elaachak, Hassan Zili, and Mohammed Bouhorma
The Prediction Stock Market Price Using LSTM . . . 444
Rhada Barik, Amine Baina, and Mostafa Bellafkih
Hybrid Movie Recommender System Based on Word Embeddings . . . 454
Amina Samih, Abderrahim Ghadi, and Abdelhadi Fennan
Towards Big Data-based Sustainable Business Models and Sustainable Supply Chain . . . 464
Lahcen Tamym, Lyes Benyoucef, Ahmed Nait Sidi Moh, and Moulay Driss El Ouadghiri
Treatment of Categorical Variables with Missing Values Using PLS Regression . . . 475
Yasmina Al Marouni and Youssef Bentaleb
Type 2 Fuzzy PID for Robot Manipulator . . . 486
Nabil Benaya, Faiza Dib, Khaddouj Ben Meziane, and Ismail Boumhidi
Using Latent Class Analysis (LCA) to Identify Behavior of Moroccan Citizens Towards Electric Vehicles . . . 496
Taoufiq El Harrouti, Mourad Azhari, Abdellah Abouabdellah, Abdelaziz Hamamou, and Abderahim Bajit
Using Learning Analytics Techniques to Calculate Learner's Interaction Indicators from Their Activity Traces Data . . . 504
Lamyaa Chihab, Abderrahim El Mhouti, and Mohammed Massar
Web-Based Dyscalculia Screening with Unsupervised Clustering: Moroccan Fourth Grade Students . . . 512
Mohamed Ikermane and A. El Mouatasim
Correction to: A New Telecom Churn Prediction Model Based on Multi-layer Stacking Architecture . . . C1
Jalal Rabbah, Mohammed Ridouani, and Larbi Hassouni
Author Index . . . 521
A Bi-objective Evolutionary Algorithm to Improve the Service Quality for On-Demand Mobility

Sonia Nasri (1), Hend Bouziri (2), and Wassila Aggoune-Mtalaa (3)

(1) Higher Business School of Tunis, Manouba University, Manouba, Tunisia
(2) Higher School of Economic and Commercial Sciences Tunis, Tunis University, Tunis, Tunisia
(3) Luxembourg Institute of Science and Technology, 4362 Esch/Alzette, Luxembourg
[email protected]
Abstract. This work aims at improving the quality of the service provided to the customers within real-life and customized demandresponsive transportation systems. Therefore, a new bi-objective model is designed to minimize both the total transit time which induces lower costs for the transportation service providers and the total waiting time for the travellers. To solve the new problem, an evolutionary algorithm is proposed based on two perturbation operators. A comparison between the proposed method and a hybrid evolutionary one from the literature is carried out. Preliminary computational experiments show the effectiveness of our method regardless of the complexity of the evolutionary schema operated. Some promising outputs are obtained allowing us to follow up the research for larger-scale transport-on-demand problems.
Keywords: Dial a ride problems · Quality of service · Bi-objective method · Evolutionary algorithm

1 Introduction
This study addresses a new problem of passenger transport belonging to the class of customer-oriented problems, see [12]. This class is a variant of the well-known on-demand transport problem named Dial-A-Ride Problem (DARP) [7]. In this class, a maximal riding time is provided for each customer, and time windows are designed with customer-dependent bounds. Authors such as [14] and [10] have agreed to include more customer specifications within the design of the model. This enables a compromise between the costs of the transportation service and the quality, or simply the improvement of the quality, of service provided to the customers. Several references give further consideration to customer needs through the design of the problem with challenging models and solving methods, see the works of [4,9,15,19] and [6]. This increases the complexity of the problem since it is designed from the customer viewpoint in a real-life configuration.
To address such real-life on-demand mobility problems, we propose in this study a new bi-objective model seeking to assess both the travel costs and the service quality in terms of waiting time. The latter includes new time window constraints that allow for a redefinition of these time windows during the solving process and contribute to their tightening [13]. Then, we propose a new Evolutionary Algorithm for solving the new bi-objective problem. This method is named the Bi-objective Evolutionary Algorithm (BEA). A comparative study of the BEA with an Evolutionary Local Search (ELS) method is conducted. Preliminary tests are carried out on real-life transport-on-demand problems defined in [6]. The results of the hybrid method used for comparison on the same data set can be found in [5]. It applied a mutation and a local search in order to disrupt and improve the solutions. The local search is managed through six neighborhood strategies from the literature. When comparing the BEA method with the more sophisticated ELS, the outputs indicate the effectiveness of the BEA in improving the service quality when the total waiting time is considered as an objective. This paper first presents the definition of the bi-objective customer-oriented Dial-A-Ride Problem. Section 3 presents the main components of the BEA algorithm. The computational results are presented and discussed in Sect. 4. Section 5 concludes the paper.
2 The On-Demand Transport Problem
The on-demand transport problem can be formulated as a customer-oriented problem, see the survey in [12]. It consists in satisfying a set of transport requests using a set of vehicles with a maximum capacity. A demand is a limited number of persons who require to be transported from one place to another. Thus, the number of requests corresponds to that of the pickup nodes, and we have the same number of pickup and delivery sites. In addition, the maximum total duration of the operations is set to limit the total travel time of each vehicle. Each customer has a riding time, which is the time interval between the time of departure and that of arrival, including the waiting times which occur during the journey. An illustrative example of a DARP considering customized specifications is presented in Fig. 1. Two requests (node 1 to 3 and node 2 to 4) are processed by a vehicle. A maximal riding time is specified for each customer. If we assume that the maximal ride time for the first request is equal to 30 min and that it is equal to 1 h and 20 min for the second request, the delivery time windows will be affected accordingly. This is the difference between the problem considered in this work and the classical DARP of [7]. This difference is observed in Fig. 1 through the two solutions provided. Moreover, the major contribution of the customized on-demand transport problem relies on reducing the customers' waiting times while the total riding time of the vehicles is minimized. This is basically the service quality criterion that we aim to improve in addition to the financial criterion considered by transportation companies. Thus, in this work, we propose a bi-objective formulation
Fig. 1. An example of a customer-oriented DARP.
to tackle the customer-oriented DARP. Two objective functions are considered in this work, namely, the ones defining the Total Riding Cost (TRC) and the Total Waiting Time (TWT), which have to be minimized simultaneously. The TRC is given in (1), since each travel cost c_{(i,j)} is generated by a visited arc (i, j) in the vehicle tour. The TWT corresponds to the sum of the times W_i^v each vehicle v had to wait at a visited node i, as expressed by formula (2).

Min TRC = \sum_{v=1}^{m} \sum_{(i,j) \in A} c_{(i,j)}    (1)

Min TWT = \sum_{v=1}^{m} \sum_{i \in N} W_i^v    (2)
The total number of vehicles is m and that of the requests is n. The set of nodes N includes the set of n pickup nodes, that of the deliveries and the depot. The set of arcs (i, j) is denoted by A. The constraints defining feasible routing plans in a customer-dependent DARP are detailed in the work of [11].
3 The Proposed Method
Given a starting solution S produced by an insertion heuristic, the algorithm creates a population (S_1 .. S_np) of np individuals by performing a perturbation and a repairing technique. Next, a move operator is applied on each individual of the current population to generate neighbouring solutions (S'_1 .. S'_np). The best
neighbour is then selected according to the fitness evaluation. Then the process is restarted until a maximum number of iterations is reached. An overview of our proposed evolutionary algorithm is provided in Fig. 2.
Fig. 2. Overview of the evolutionary algorithm.
The idea behind this algorithm is to move the entire population in each generation to reach promising regions in the solution space according to both objective functions.
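The following Python sketch illustrates this population-level schema; it is only an interpretation of Fig. 2, and the helpers initial_solution, perturb_and_repair, move_operator, and fitness are hypothetical placeholders for the insertion heuristic, the perturbation/repair step, the neighbourhood move of Sect. 3.3, and the scalar fitness of Sect. 3.2.

```python
# Sketch of the BEA schema (Fig. 2); helper functions are hypothetical placeholders.
def bea(np_size, max_iter):
    s0 = initial_solution()                                 # insertion heuristic
    population = [perturb_and_repair(s0) for _ in range(np_size)]
    best = min(population, key=fitness)
    for _ in range(max_iter):
        # move the entire population: one neighbour per individual
        neighbours = [move_operator(s) for s in population]
        # keep the best neighbour found in this generation
        generation_best = min(neighbours, key=fitness)
        if fitness(generation_best) < fitness(best):
            best = generation_best
        population = neighbours
    return best
```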
3.1 The Representation of a Solution
A solution in the evolutionary algorithm is a set of vehicle tours encoded by a set of arrays as illustrated in Fig. 3. The number of tours is equal to the number of vehicles. Each array represents a vehicle’s tour satisfying a set of requests. A location in the array corresponds to a pick-up or delivery node in the transportation graph. In Fig. 3, we suppose that seven requests are served by two vehicles. Each satisfied request is represented by a pair of origin and destination nodes such as the pair (5,12) for request 5 in tour 1.
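As a concrete illustration (not taken from the paper), this encoding can be held as one plain list of node indices per vehicle; with n = 7 requests, the pickup node i is assumed to be delivered at node i + 7, so the pair (5, 12) of Fig. 3 reads as pickup 5 and delivery 12.

```python
# Hypothetical encoding of the two tours sketched in Fig. 3 (n = 7 requests, 2 vehicles).
solution = [
    [5, 1, 12, 8, 3, 10, 6, 13],  # vehicle 1: requests 5, 1, 3, 6 (delivery of request i is node i + 7)
    [2, 7, 9, 14, 4, 11],         # vehicle 2: requests 2, 7, 4
]
```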
Fig. 3. The solution representation
3.2 The Fitness Function
To capture the bi-objective feature of the problem, we choose in this work to consider a scalar fitness function defined by (3).

f(S) = \alpha \sum_{v=1}^{m} \sum_{(i,j) \in A} c_{(i,j)} + (1 - \alpha) \sum_{v=1}^{m} \sum_{i \in N} W_i^v    (3)
This fitness function can ensure a good trade-off between the total riding cost, which is related basically to the business perspective, and the total waiting time, which reflects the quality of the service.
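A direct transcription of Eq. (3) could look as follows; the arc-cost dictionary and the per-vehicle waiting times are assumed to be produced by the solution evaluation, and their names are illustrative.

```python
# Scalar fitness of Eq. (3): weighted sum of total riding cost (TRC) and total waiting time (TWT).
def fitness(solution, alpha, cost, waiting):
    # cost[(i, j)]  : travel cost of arc (i, j)
    # waiting[v][i] : waiting time of vehicle v at node i
    trc = sum(cost[(i, j)]
              for tour in solution
              for i, j in zip(tour, tour[1:]))
    twt = sum(waiting[v][i]
              for v, tour in enumerate(solution)
              for i in tour)
    return alpha * trc + (1 - alpha) * twt
```

With alpha = 1 the search reduces to the classical cost-only DARP objective, while alpha = 0 optimizes only the waiting times.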
3.3 The Move Operator
The neighbourhood operator used in this work tries to disrupt a current solution by first duplicating it. Given a solution S and its copy, a set of requests (with their pick-up and delivery points) is removed from each solution. These requests are then assigned to a new solution by exchanging the vehicles. Then, the rest of the tours are assigned to this newly constructed neighbouring solution, ensuring that the constraints of the problem are respected. In the next section, the results found using this new evolutionary algorithm are presented. Indeed, although exact methods [2,3] are efficient for solving complex problems such as scheduling and transportation ones, metaheuristics [1,8,18] and especially hybrid ones [16,17] are also effective and much faster for the customer-oriented DARP instances foreseen.
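One possible reading of this operator is sketched below; the helpers is_pickup, delivery_of, and is_feasible are hypothetical, since the paper does not spell out the removal and reinsertion rules.

```python
import copy
import random

# Simplified sketch of the move operator described above (hypothetical helpers).
def move_operator(solution, n_removed=2):
    neighbour = copy.deepcopy(solution)
    removed = []
    for _ in range(n_removed):
        v = random.randrange(len(neighbour))
        pickups = [node for node in neighbour[v] if is_pickup(node)]
        if not pickups:
            continue
        p = random.choice(pickups)
        d = delivery_of(p)
        neighbour[v] = [node for node in neighbour[v] if node not in (p, d)]
        removed.append((p, d, v))
    for p, d, old_v in removed:
        # reinsert the request into a different vehicle whenever possible
        candidates = [v for v in range(len(neighbour)) if v != old_v] or [old_v]
        neighbour[random.choice(candidates)].extend([p, d])
    return neighbour if is_feasible(neighbour) else solution
```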
4 Experimental Results
To validate the new design of the problem and the bi-objective evolutionary algorithm BEA developed for this purpose, we apply the BEA to real-life instances of Chassaing et al. [5]. The total number of requests varies between 10 and 59, with a total number of vehicles up to 9. The locations are separated into large areas. The time windows are defined separately for the pickup and the destination locations. These time windows are of different sizes according to the customers' preferences. The maximum riding time for each customer depends on the distance he/she has to travel. The number of passengers at a location may be greater than 1 and up to 4. All the vehicles have the same average speed, equal
to 1.33 km per minute. The maximum duration of a vehicle tour is equal to 480. The maximal capacity of the vehicles is equal to 8. The BEA algorithm is run for 1000 iterations and the best results over five runs on all the instances are reported. We compare the results found by the BEA with the ones of the Evolutionary Local Search algorithm [6]. In this preliminary study, we report 10 instances of those transport on-demand problems. Each instance (I) is defined by the total number of vehicles m and the number of requests n. The comparison is based on the best riding cost TRC and the total waiting time TWT obtained. The weight α in the fitness function varies in the interval [0,1]. Table 1 summarizes the computational results obtained with the two evolutionary algorithms on ten instances.

Table 1. Results obtained with the BEA and the ELS methods

I     m   n    ELS TRC    ELS TWT    BEA TRC    BEA TWT
d75   2   10    150.91     398.55     156.11     350.25
d92   2   17    347.01     250.09     352.37     210.85
d52   4   29   1607.74      70.81    1710.22      81.23
d10   4   34   1341.02     377.87    1488.23     311.52
d39   6   38   2030.44     524.72    2012.33     503.61
d36   6   42   2139.52     390.75    2133.17     388.35
d90   6   51   1133.45     355.16    1173.01     331.45
d87   8   54   2714.31     339.88    2738.92     523.44
d12   9   56   3614.27     341.82    3605.12     300.47
d13   9   59   3183.19     441.76    3351.13     439.72
Table 1 clearly shows that the proposed method enhances the service quality, in terms of the total waiting time TWT, for most of the instances. Indeed, the values of TWT obtained with the BEA are lower than the ones obtained with the ELS for 8 out of the 10 instances tested. The BEA method is able to achieve a good compromise between the cost and the quality of service and is sometimes able to improve both of these objectives. This can be observed in instances d36 and d39, where the TRC and TWT are both better for the BEA than those obtained with the ELS. Moreover, there are solutions where the TRC found by the BEA is lower than with the ELS, as in the case of d12, where we have 3614.27 with the ELS against 3605.12 for the BEA. Although the majority of the riding costs found by the ELS are better than those of the BEA, the deviations are not high, as for the instances d75, d92, and d87. These outputs lead us to consider that the results obtained with the BEA are promising when compared with those of the ELS. The latter relies on several neighborhood operators, unlike the single move operator performed
by the BEA. Furthermore, in almost all the tested instances, the improvement of the quality of the service does not affect strongly the riding cost, which was expected beforehand and is in line with our bi-objective solving approach.
5 Conclusion and Prospects
This paper proposed a new bi-objective resolution of a demand-responsive problem dealing with service quality improvement for real-life customer requirements. The solving approach consists of an evolutionary algorithm based on a single move operator. In spite of its simplicity, the BEA algorithm yielded encouraging preliminary results. This leads us to consider a more elaborate dominance-based evolutionary schema. Moreover, additional genetic operators could enhance our results on large real-life instances.
References
1. Aggoune-Mtalaa, W., Aggoune, R.: An optimization algorithm to schedule care for the elderly at home. Int. J. Inf. Sci. Intell. Syst. 3(3), 41–50 (2014)
2. Amroun, K., Habbas, Z., Aggoune-Mtalaa, W.: A compressed generalized hypertree decomposition-based solving technique for non-binary constraint satisfaction problems. AI Commun. 29(2), 371–392 (2016)
3. Bennekrouf, M., Aggoune-Mtalaa, W., Sari, Z.: A generic model for network design including remanufacturing activities. Supply Chain Forum 14(2), 4–17 (2013)
4. Braekers, K., Caris, A., Janssens, G.K.: Exact and meta-heuristic approach for a general heterogeneous dial-a-ride problem with multiple depots. Transp. Res. Part B Methodol. 67, 166–186 (2014)
5. Chassaing: Instances of chassaing (2020). http://fc.isima.fr/~lacomme/Maxime/. Accessed 19 July 2020
6. Chassaing, M., Duhamel, C., Lacomme, P.: An ELS-based approach with dynamic probabilities management in local search for the dial-a-ride problem. Eng. Appl. Artif. Intell. 48, 119–133 (2016)
7. Cordeau, J.F., Laporte, G.: A tabu search heuristic for the static multi-vehicle dial-a-ride problem. Transp. Res. Part B Methodol. 37(6), 579–594 (2003)
8. Djenouri, Y., Habbas, Z., Aggoune-Mtalaa, W.: Bees swarm optimization metaheuristic guided by decomposition for solving max-sat. In: ICAART 2016 - Proceedings of the 8th International Conference on Agents and Artificial Intelligence, vol. 2, pp. 472–479 (2016)
9. Masmoudi, M.A., Braekers, K., Masmoudi, M., Dammak, A.: A hybrid genetic algorithm for the heterogeneous dial-a-ride problem. Comput. Oper. Res. 81, 1–13 (2017)
10. Molenbruch, Y., Braekers, K., Caris, A.: Operational effects of service level variations for the dial-a-ride problem. CEJOR 25(1), 71–90 (2015). https://doi.org/10.1007/s10100-015-0422-7
11. Nasri, S., Bouziri, H.: Improving total transit time in dial-a-ride problem with customers-dependent criteria. In: 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA), pp. 1141–1148 (2017)
12. Nasri, S., Bouziri, H., Aggoune-Mtalaa, W.: Customer-oriented dial-a-ride problems: a survey on relevant variants, solution approaches and applications. In: Ben Ahmed, M., Mellouli, S., Braganca, L., Anouar Abdelhakim, B., Bernadetta, K.A. (eds.) Emerging Trends in ICT for Sustainable Development. ASTI, pp. 111–119. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-53440-0_13
13. Nasri, S., Bouziri, H., Aggoune-Mtalaa, W.: Dynamic on demand responsive transport with time-dependent customer load. In: Ben Ahmed, M., et al. (eds.) SCA 2020. LNNS, vol. 183, pp. 395–409. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-66840-2_30
14. Paquette, J., Bellavance, F., Cordeau, J.F., Laporte, G.: Measuring quality of service in dial-a-ride operations: the case of a Canadian city. Transportation 39(3), 539–564 (2012)
15. Parragh, S.N.: Introducing heterogeneous users and vehicles into models and algorithms for the dial-a-ride problem. Transp. Res. Part C Emerging Technol. 19(5), 912–930 (2011)
16. Rezgui, D., Bouziri, H., Aggoune-Mtalaa, W., Siala, J.C.: An evolutionary variable neighborhood descent for addressing an electric VRP variant. In: Sifaleras, A., Salhi, S., Brimberg, J. (eds.) ICVNS 2018. LNCS, vol. 11328, pp. 216–231. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-15843-9_17
17. Rezgui, D., Bouziri, H., Aggoune-Mtalaa, W., Siala, J.C.: A hybrid evolutionary algorithm for smart freight delivery with electric modular vehicles. In: 2018 IEEE/ACS 15th International Conference on Computer Systems and Applications (AICCSA), pp. 1–8. IEEE (2018)
18. Rezgui, D., Siala, J.C., Aggoune-Mtalaa, W., Bouziri, H.: Towards smart urban freight distribution using fleets of modular electric vehicles. In: Ben Ahmed, M., Boudhir, A.A. (eds.) SCAMS 2017. LNNS, vol. 37, pp. 602–612. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-74500-8_55
19. Zhang, Z., Liu, M., Lim, A.: A memetic algorithm for the patient transportation problem. Omega 54, 60–71 (2015)
A Dynamic Circular Hough Transform Based Iris Segmentation

Abbadullah .H Saleh and Oğuzhan Menemencioğlu

Department of Computer Engineering, Karabuk University, 78050 Karabuk, Turkey
[email protected]
Abstract. Iris segmentation in the case of eye diseases is a very challenging task in the computer vision field. The current study introduces a novel iris segmentation method based on an adaptive illumination correction and a modified circular Hough transform algorithm. Some morphological operations are used to create the illumination correction function, while the pupil localization process is used to set the Hough circle radius range for obtaining the iris circle. The Warsaw BioBase-Disease-Iris dataset V1.0, which contains 684 images, is used to verify the proposed methodology. The results show that the true segmentation rate is 90.5% and that the main problem for iris segmentation is the complete or partial absence of the iris in some diseases such as blindness, rubeosis, synechiae and retinal detachment.
Keywords: Iris segmentation · Hough transform · Illumination correction · Iris disease · Image processing
1 Introduction

Nowadays, human recognition is one of the most critical issues related to computer science applications. Traditional security methods like passwords and cards can be stolen or corrupted; on the other hand, biometrics belong to humans and have many distinctive features, making them suitable for human recognition applications [1]. Some biometrics, like the face, have essential distinctive features such as the locations of face parts, the distances between them, and the face contour. In contrast, others, like the iris, have distinctive parts called iris edges (patterns) [2]. Some biometrics change over time (like the face), while others do not (like the iris, ear, and fingerprint). While fingerprint, palm print, and face are exposed to changes due to accidents (burns, wounds, etc.), others like the iris, ear, and footprint are affected less [3]. The identical twins problem is also considered a critical issue in human recognition systems; fortunately, some biometrics like the iris and DNA do not suffer from this problem. Although the iris is one of the most accurate biometrics [4], it is affected by diseases that impact recognition ability [5]. Furthermore, iris segmentation is the essential step of iris recognition [6], and it is affected by many factors like imaging conditions (illumination and pose variations) and some eye diseases.
1.1 Related Work

Many studies have been introduced in the field of iris segmentation, and many approaches have been designed and evaluated. Mayya and Saii [7] proposed an iris segmentation method using pupil and iris localization, and then they normalized the iris into polar coordinates. They applied experiments on the CASIA iris dataset, including 250 individuals. Kaur et al. [8] used the Hough transform and normalization on four public datasets: CASIA-IrisV4-Interval, IITD.v1, UBIRIS.v2 and UPOL. The main problem of this approach is that the segmented iris circle contains noisy parts (eyelids) in most cases. Zernike moments and a Gabor filter, along with iris pattern methods, were proposed by Naji et al. [9]. They extracted the iris regions with the least noise (eyelashes and eyelids) by segmenting the iris region into eight sub-regions and studying the statistical information within each region. They removed all high-noise sub-regions and merged the resulting iris sub-regions into a unified 120*120 matrix. They applied their experiments on two datasets, CASIA v1 and their own collected dataset, but the problem is that their method removes parts of the iris in all cases. Azam and Rana [10] introduced an iris recognition system using CNN deep networks to extract iris features, while the SVM classifier was used for the classification stage. They first segmented the iris image using Daugman's segmentation method; then, they used CNN and SVM for the recognition task. They obtained a 96.3% recognition rate on the 100 images of the CASIA v1 dataset. In another recent study, Li et al. [11] proposed a new method of iris segmentation using K-means to retrieve the outer border of the iris and a Residual U-Net for semantic segmentation. They applied their results on the CASIA-Iris-Thousand dataset. The main problem of this approach is the computational time. Recently, Trokielewicz et al. [12] used 76 cases of Warsaw-BioBase-Disease for iris segmentation and a deep CNN classifier to segment the iris. They obtained 3.11% as an Equal Error Rate (EER) for their selected dataset. CNNs are a powerful technique for segmenting eye images; however, they take more computational time than traditional methods. In this research, we introduce a novel iris segmentation method that targets people with eye diseases in order to study the effect of disease on the iris segmentation system. The aim is to extract only the iris patterns, with as few noisy parts (eyelids) as possible and without removing any part of the iris patterns.
2 Materials and Methods

Two different segmentation methods are suggested in the current research. The proposed methodology includes two basic steps. The first is the preprocessing step, in which a new illumination correction process is applied, while the second is the segmentation method.

2.1 Dataset

To validate the segmentation methods, the Warsaw BioBase-Disease-Iris dataset V1.0 [13] is used. This dataset includes illumination variation and left-right iris images. Most
of these iris images are taken in the presence of eye diseases (some of those diseases affect only one of the two irises). The Warsaw V1.0 dataset includes 684 near-infrared captured images and 222 corresponding colored eye images. Table 1 includes a detailed explanation of the dataset [14].

Table 1. Warsaw V1.0 dataset details.

Disease                             Right eye   Left eye
Cataract                            20          26
Anterior synechiae                  1           1
Posterior synechiae                 4           2
Anterior chamber lens implant       1           -
Corneal haze                        1           2
Incipient cataract                  1           3
Pseudophakic                        9           9
Silicon oil                         1           1
Cataract, diabetic retinopathy      1           0
Cataract, glaucoma                  1           5
Cataract, glaucoma, iridotomy       1           2
Glaucoma                            4           8
Rubeosis, synechiae                 1           0
Blindness                           1           1
Retinal detachment                  3           4
Distorted pupil                     2           1
Healthy                             7           5
Not defined                         8           7
2.2 Proposed Methodology

Preprocessing. For the preprocessing step, we suggest using an enhancement approach to avoid the illumination problem. We suggest a new adaptive illumination correction method consisting of inferring the illumination variation function and then applying the opposite function to the image to restore it to its original status. The illumination variation starts adding darkness to the image columns from the right border of the iris (for left eyes) until the final column of the image, so we localize the pupil circle and then infer the right iris border based on the pupil location. After that, we
start building the opposite illumination correction function by adding incremental gray levels, starting from 1, until reaching the right border of the image, as Eq. 1 suggests.

C_img = O_img + I_fun,   I_fun(:, j) = { 0 if 0 < j < K − 1;  L otherwise, where L = 1, 2, 3, ... }    (1)

where C_img is the illumination-corrected version of the original image O_img, I_fun is the illumination correction function, j is the image column number, K is the column number of the right iris border, and L is the incremental value that forms the illumination correction function. Figure 1 illustrates the preprocessing of an image, while Algorithm 1 introduces the pseudo code of the suggested illumination correction algorithm.
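As a complement to the pseudo code of Algorithm 1, the following NumPy sketch shows one way Eq. (1) could be implemented; it assumes a grayscale image and a right iris border column already obtained from the pupil localization step.

```python
import numpy as np

# Sketch of the additive illumination correction of Eq. (1).
# image: 2-D grayscale eye image (uint8), k: column index of the right iris border.
def correct_illumination(image, k):
    rows, cols = image.shape
    correction = np.zeros((rows, cols), dtype=np.float32)
    # Columns to the right of the iris border receive incremental gray levels 1, 2, 3, ...
    correction[:, k:] = np.arange(1, cols - k + 1, dtype=np.float32)
    corrected = np.clip(image.astype(np.float32) + correction, 0, 255)
    return corrected.astype(np.uint8)
```

For a right eye, the image would first be flipped horizontally, as noted in the text.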
This additive illumination will correct the left eye’s darkness and reverse the illumination variation, as Fig. 1 shows. For the right eye, a flipping step is needed before the illumination correction.
Fig. 1. The suggested illumination correction method.
Iris Segmentation. In this research, we suggest two new segmentation methods, inspired by the fact that all previous methods use predefined masks and a normalization step, or extract only the specific parts of the iris that have the least noise (eyelashes and eyelids). In our first method, the Morphological-Based Segmentation (MBS) method, we suggest using
a methodology based on illumination compensation and several morphological operations to keep only the iris region and exclude all non-iris regions (eyelashes and eyelids). Figure 2 illustrates a possible output of our first suggested segmentation method. The corrected-illumination version of the eye image is used as the input of this method. Then, a mask is obtained using the gray-scale closing morphological operation with a 50-radius disk structuring element (a huge size) to fuse the image's gray levels and constitute unified illumination regions. This mask is then subtracted from the illumination-corrected version of the image, and the resulting image is thresholded and hole-filled. The small regions are removed to get the final iris region. The MBS method removes eyelashes and eyelids based on the correction methodology and the closing-based mask subtraction approach. The final step in the proposed method is to use the segmented iris region of interest (ROI) as an input to the iris normalization method proposed in [7] to enhance the normalization process. The normalization process, in this case, will not need specific masks added to the result to remove eyelashes and eyelids since our methodology has already removed them. In this way, the problem of defining the mask's ROI in each image is avoided. Algorithm 2 illustrates the pseudo-code of the first proposed iris segmentation method.
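A rough OpenCV/scikit-image sketch of these MBS steps is given below; the disk radius of 50 follows the text, whereas the subtraction direction (closing result minus image, i.e. a bottom-hat style difference that highlights the dark iris), the threshold value, and the minimum region size are illustrative assumptions rather than the authors' exact settings.

```python
import cv2
import numpy as np
from skimage import morphology

# Illustrative sketch of the Morphological-Based Segmentation (MBS) pipeline.
def mbs_segment(corrected, thresh=30, min_size=500):
    # 1. Gray-scale closing with a large disk structuring element (radius 50).
    se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (101, 101))
    mask = cv2.morphologyEx(corrected, cv2.MORPH_CLOSE, se)
    # 2. Difference between the closed mask and the corrected image (highlights the dark iris).
    diff = cv2.subtract(mask, corrected)
    # 3. Threshold, fill holes, and remove small regions.
    _, binary = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    filled = morphology.remove_small_holes(binary.astype(bool), area_threshold=10_000)
    iris_mask = morphology.remove_small_objects(filled, min_size=min_size)
    # 4. Keep only the pixels belonging to the estimated iris region.
    return cv2.bitwise_and(corrected, corrected, mask=iris_mask.astype(np.uint8) * 255)
```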
For the second suggested segmentation method, the Improved Dynamic Circular Hough Transform (IDCHT), the mask subtraction and thresholding stages are the same as in the first method. The next step applies a morphological opening to remove outliers, followed by the circular Hough transform to get the potential circular region (the iris region) of the thresholded image. The center and radius of the iris region are calculated using the Hough circle, and the obtained circle is used to extract the corresponding iris region. Many studies have used the Hough transform for the segmentation stage [15, 16]. Still, the problem with those studies is that they define a fixed radius range in which the Hough algorithm should search for a circle. In contrast, in our study we suggest using the pupil's major axis length (MApup) multiplied by 2 and displaced by ±MApup/4 pixels; this enhances the Hough transform performance and consistently produces correct circles, as Eq. 2 shows:

$\text{HoughCircle\_radius\_range} = 2 \cdot MA_{pup} \pm MA_{pup}/4$  (2)
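The dynamic radius range of Eq. 2 can be fed directly to a circular Hough transform; the sketch below uses OpenCV's HoughCircles, with accumulator parameters chosen arbitrarily for illustration.

```python
import cv2
import numpy as np

def idcht_iris_circle(binary_iris: np.ndarray, pupil_major_axis: float):
    """Sketch of the dynamic Hough radius range (Eq. 2).
    binary_iris: 8-bit mask after the opening step; pupil_major_axis: MApup in pixels."""
    r_center = 2.0 * pupil_major_axis
    r_min = int(r_center - pupil_major_axis / 4.0)
    r_max = int(r_center + pupil_major_axis / 4.0)
    circles = cv2.HoughCircles(binary_iris, cv2.HOUGH_GRADIENT, dp=1,
                               minDist=binary_iris.shape[0],
                               param1=100, param2=20,        # illustrative edge/accumulator thresholds
                               minRadius=r_min, maxRadius=r_max)
    if circles is None:
        return None
    x, y, r = circles[0, 0]          # strongest circle: iris center and radius
    return int(x), int(y), int(r)
```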
Fig. 2. Morphological-Based Segmentation method.
The resulting iris may still contain some parts of the eyelids, so we try to eliminate them by removing the high gray-level pixels of the iris image (since eyelids are brighter than the iris region). To perform this step, we compute the mean gray level µ inside the iris region and threshold the iris image on this mean value. This step needs some morphological post-processing operations, such as region filling, to get the final iris region, illustrated in Fig. 3. Algorithm 3 describes the pseudo-code of the IDCHT method.
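A small sketch of this mean-based eyelid removal, assuming the iris ROI and its binary mask are available as NumPy arrays (names are illustrative):

```python
import numpy as np
from scipy import ndimage

def remove_eyelids(iris_roi: np.ndarray, iris_mask: np.ndarray) -> np.ndarray:
    """Keep only pixels darker than the mean gray level of the iris region, then fill holes."""
    mu = iris_roi[iris_mask].mean()          # mean gray level inside the iris region
    refined = iris_mask & (iris_roi < mu)    # eyelids are brighter, so they are dropped
    return ndimage.binary_fill_holes(refined)
```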
3 Results
We implemented our approaches on the massive Warsaw data; 684 eye images of the Warsaw dataset have been used for segmentation. As a result, 619 images are segmented correctly while 65 samples caused problems, and the true segmentation rate (TSR) is 90.5%, as Eq. 3 shows:

$TSR = \dfrac{\text{Number of true segmentation samples}}{\text{Total number of dataset samples}}$  (3)
Three essential factors have been used to evaluate results: True pupil localization, true iris localization, and rejection of noisy areas (eyelids and eyelashes).
Fig. 3. Improved Dynamic Circular Hough Transform method.
3.1 Comparison of the Segmentation Methods
For evaluating the proposed iris segmentation methods, some challenging samples of the Warsaw dataset have been selected for this comparison, including occlusion by eyelashes or eyelids, rotation, upper and lower occlusion, and illumination variations. Figure 4 includes the comparative results. Our two methods perform significantly better than the Naji et al. [9] segmentation method. The figure also indicates that the second suggested segmentation method performs better than the first one. Comparing our results with Mayya and Saii [7] also shows that our method is better, since their method did not remove eyelashes and eyelids properly, while ours could.
Fig. 4. Comparison of the two suggested methods and previous approaches.
Case 1 represents a left eye with pseudophakic disease and partial occlusion. The suggested first segmentation method (MBS) succeeds partially, since some eyelids still remain, while the second suggested method (IDCHT) succeeds completely, without any eyelashes or eyelids. Mayya and Saii [7] segment the iris, but the eyelids and eyelashes remain in the final segmentation result. For Naji et al. [9], the algorithm completely fails because they used fixed values in their approach; besides, they did not use any illumination correction method. Case 2 includes a left iris with cataract disease. Both suggested methods segment the iris successfully, while for case 3 (cataract disease), which includes heavy occlusion and eyelashes, IDCHT is the only successful method that segments the iris without any eyelids or eyelashes. For the final cases, 4 left eye (cataract disease) and 4 right eye (pseudophakic disease), IDCHT is the best method. Some eye diseases affect the iris segmentation process (see Fig. 5). Retinal detachment with silicone oil (in the right eye of individual 8) causes problems with iris segmentation (Fig. 5-a). Figure 5-b includes an example of multiple eye diseases where the segmentation is completely incorrect due to several problems (uveitis,
secondary glaucoma, cataract, and blindness). Moreover, in the case of rubeosis and synechiae, parts of the iris are occluded by other tissues, making iris segmentation more complicated (see Fig. 5-c). Blindness (Fig. 5-d), associated with many other diseases (glaucoma and cataract), affects the iris segmentation badly. By removing individual 35 of the Warsaw dataset, the TSR rises to 94.3% (i.e., in some eye disease cases the iris segmentation is affected partially, while in others the segmentation is completely affected).
Fig. 5. Partially and completely bad iris segmentation examples in some eye diseases.
Previous studies that used the Warsaw-BioBase-Disease-Iris V1 dataset did not apply any segmentation process; they used some software to segment the images and used the results directly in recognition [17–19]. However, Trokielewicz et al. [12] segment only a partial subset of Warsaw-BioBase-Disease-Iris V1, including 76 individuals. The proposed method has segmented the entire dataset with an accurate and fast methodology. Table 2 includes a detailed comparison of the most recent iris segmentation approaches and the proposed method. It also shows that most previous studies focused on disease-free iris datasets (only a few studies dealt with challenging eye-disease datasets like Warsaw BioBase). The Trokielewicz et al. [12] study focused only on 76 selected cases of the Warsaw dataset; on the other hand, our approach took all cases, and while their approach needed too much time, ours did not. Some previous studies suffered from high computational time [11], while others had too many false positives [7, 8] or false negatives [9].
Table 2. A detailed comparison of the recent iris segmentation methods and the current proposed one.

Study | Methods | Dataset | Results | Notes
Mayya and Saii [7] | Pupil localization with Daugman | CASIA v1 dataset | 98% segmentation rate | High true positive rate but high false positive rate (eyelids and eyelashes)
Kaur et al. [8] | Hough transform | CASIA-IrisV4-Interval, IITD.v1, UBIRIS.v2 and UPOL | Not mentioned | High true positive rate but high false positive rate
Naji et al. [9] | Iris pattern segmentation | CASIA v1 dataset | No segmentation rate mentioned | Bad segmentation (too many false positives and false negatives)
Azam and Rana [10] | Daugman's method | 100 images of CASIA v1 dataset | No segmentation rate mentioned | Low dataset size
Li et al. [11] | K-means, Residual U-Net | CASIA-Iris-Thousand | Accuracy 98.9% | High computational time
Trokielewicz et al. [12] | CNN | 76 selected individuals of Warsaw-Biobase V1 | EER = 3.11% | High computational time, eye diseases challenge
The current research | Novel IDCHT | 684 eye images of 222 individuals of Warsaw-Biobase V1 dataset | TSR = 90.5% | Low computational time, low false positive rate, eye diseases challenge
4 Conclusion
The current research proposed a new iris segmentation method based on a modified version of the circular Hough transform that uses a dynamic radius range to search within. The method corrects the variant illumination using an illumination correction approach and removes the eyelids from the final segmented image using some morphological operations and statistical information about the iris and eyelid regions, since the iris and eyelid regions have different mean values. The results indicate that the proposed IDCHT
methodology segments the Warsaw images successfully with a 90.5% TSR. The results also show that some eye diseases affect the segmentation process. Retinal detachment with silicone oil, rubeosis, synechiae, uveitis, secondary glaucoma, cataract, and blindness are examples of eye diseases that affect iris segmentation and reduce the TSR. For example, by removing individual 35 (right eye, with glaucoma and blindness), the TSR increases by 3.8%. For future work, we will build an iris recognition system using the segmented iris images and study the effect of disease on the performance of iris recognition. The future strategy will focus on segmenting more extensive datasets, like the Warsaw-Biobase V2 and CASIA-V3 Interval datasets, to use them inside an entire iris recognition system.
References 1. Kakkad, V., Patel, M., Shah, M.: Biometric authentication and image encryption for image security in cloud framework. Multiscale and Multidisciplinary Modeling, Experiments and Design 2(4), 233–248 (2019). https://doi.org/10.1007/s41939-019-00049-y 2. Mayya, A.M., Saii, M.: Iris and Palmprint Decision Fusion to Enhance Human Recognition. Int. J. Comput. Sci. Trends Technol. 5 (2017) 3. Rajarajan, S., Palanivel, S., Sekar, K.R., Arunkumar, S.: Mathematics, A.: study on the diseases and deformities causing false rejections for fingerprint authentication. Int. J. Pure Appl. Math. 119, 443–453 (2018) 4. Hu, Q., Yin, S., Ni, H., Huang, Y.: An end to end deep neural network for iris recognition. Procedia Comput. Sci. 174, 505–517 (2020). https://doi.org/10.1016/j.procs.2020.06.118 5. Trokielewicz, M., Czajka, A., Maciejewicz, P.: Iris recognition under biologically troublesome conditions - Effects of aging, diseases and post-mortem changes. In: BIOSIGNALS 2017 10th International Conference Bio-Inspired Syst. Signal Process. Proceedings; Part 10th Int. Jt. Conf. Biomed. Eng. Syst. Technol. BIOSTEC 2017. 4, pp. 253–258 (2017). https://doi. org/10.5220/0006251702530258 6. Huang, J., Wang, Y., Tan, T., Cui, J.: A new iris segmentation method for recognition. Proc. Int. Conf. Pattern Recognit. 3, 554–557 (2004). https://doi.org/10.1109/ICPR.2004.1334589 7. Mayya, A.M., Saii, M.M.: Iris recognition based on weighting selection and fusion fuzzy model of iris features to improve recognition rate. Int. J. Inf. Res. Rev. 03, 2664–2680 (2016) 8. Kaur, B., Singh, S., Kumar, J.: Robust iris recognition using moment invariants. Wireless Pers. Commun. 99(2), 799–828 (2017). https://doi.org/10.1007/s11277-017-5153-8 9. Naji, S.A., Tornai, R., Lafta, J.H., Hussein, H.L.: Iris recognition using localized Zernike features with partial iris pattern. In: Al-Bakry, A.M., et al. (eds.) New Trends in Information and Communications Technology Applications. Communications in Computer and Information Science, vol. 1183, pp. 219–232. Springer, Cham (2020). https://doi.org/10.1007/978-3-03055340-1_16 10. Choudhari, G., Mehra, R.: Iris recognition using convolutional neural network design. Int. J. Innov. Technol. Explor. Eng. 8, 672–678 (2019). https://doi.org/10.35940/ijitee.I1108.078 9S19 11. Li, Y.H.H., Putri, W.R., Aslam, M.S., Chang, C.C.: Robust iris segmentation algorithm in non-cooperative environments using interleaved residual U-Net. Sensors 21, 1–21 (2021). https://doi.org/10.3390/s21041434 12. Trokielewicz, M., Czajka, A., Maciejewicz, P.: Post-mortem iris recognition with deeplearning-based image segmentation. Image Vis. Comput. 94, 103866 (2020) 13. Mateusz, M., Czajka, T. A. P.: Warsaw-BioBase-Disease-Iris v1.0 (2021). http://zbum.ia.pw. edu.pl/EN/node/46. Accessed Apr 1
14. Trokielewicz, M., Czajka, A., Maciejewicz, P.: Database of iris images acquired in the presence of ocular pathologies and assessment of iris recognition reliability for disease-affected eyes. In: Proceedings of 2015 IEEE 2nd International Conference Cybern. CYBCONF 2015. pp. 495–500 (2015). https://doi.org/10.1109/CYBConf.2015.7175984 15. Okokpujie, K., Noma-Osaghae, E., John, S., Ajulibe, A.: An improved iris segmentation technique using circular Hough transform. In: Kim, K.J., Kim, H., Baek, N. (eds.) IT Convergence and Security 2017. Lecture Notes in Electrical Engineering, vol. 450, pp. 203–211. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-6454-8_26 16. Hapsari, R.K., Utoyo, M.I., Rulaningtyas, R., Suprajitno, H.: Iris segmentation using Hough Transform method and Fuzzy C-Means method. J. Phys. Conf. Ser. 1477 (2020). https://doi. org/10.1088/1742-6596/1477/2/022037 17. Trokielewicz, M., Czajka, A., Maciejewicz, P.: Iris Recognition in Cases of Eye Pathology. In: Nait-Ali, A. (ed.) Biometrics under Biomedical Considerations. Series in BioEngineering, pp. 41–69. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-1144-4_2 18. Moreira, D., Trokielewicz, M., Czajka, A., Bowyer, K.W., Flynn, P.J.: Performance of humans in iris recognition: The impact of iris condition and annotation-driven verification. Proceedings of the 2019 IEEE Winter Conf. Appl. Comput. Vision, WACV, pp. 941–949 (2019). https://doi.org/10.1109/WACV.2019.00105 19. Karim, R.A., Mobin, N.A.A.A., Arshad, N.W., Zakaria, N.F., Bakar, M.Z.A.: Early rubeosis iridis detection using feature extraction process. In: Kasruddin Nasir, A.N., et al. (eds.) In ECCE 2019. Lecture Notes in Electrical Engineering, vol. 632, pp. 379–387. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-2317-5_32
A Game Theoretic Framework for Interpretable Student Performance Model
Hayat Sahlaoui1, El Arbi Abdellaoui Alaoui2(B), and Said Agoujil1
1 Department of Computer Science, Faculty of Sciences and Techniques at Errachidia, University of Moulay Ismaïl, Route Meknes, 52000 Errachidia, Morocco
2 Department of Sciences, Ecole Normale Supérieure Meknes, Moulay Ismaïl University, Meknes, Morocco
[email protected]
Abstract. Machine learning is used in many contexts these days, and it has been integrated into the decision-making process in many critical areas, with applications including predicting at-risk students and automating student enrollment. From these applications, it is clear that machine learning models have a major impact on students' professional success; therefore, it is imperative that the student performance model is well understood and free of any bias and discrimination. The kinds of decisions and predictions made by these machine-learning-enabled systems become much more profound and, in many cases, critical to students' professional success. Various higher education institutions rely on machine learning to drive their strategy and improve their students' academic success. Therefore, the need to trust these machine learning-based systems is paramount, and building a model that educational decision-makers who may not be familiar with machine learning can understand is critical. But sometimes, even for experts in machine learning, it becomes difficult to explain certain predictions of the so-called "black box models". Therefore, there is a growing need for easy interpretation of a complex black box model. This study aims to provide a framework for an interpretable student performance model by introducing a local model-agnostic interpretability method, the SHAP value, a novel explanatory technique that explains the predictions of any classifier in an interpretable and faithful way by opening the black-box model and explaining how the final result came out and which parts of the model are responsible for certain predictions. By understanding how the student performance model works, education decision-makers can have a greater advantage and be smarter about what they should be doing.
Keywords: Game theory · SHAP values · Ensemble methods · Machine learning · Student performance prediction
1 Introduction
With the advancement in the field of machine learning, many new complex machine learning models are widely used in various fields, including education.
Especially with the advancement of deep learning, data is used to make some important decisions. But sometimes, even for the experts in machine learning, it becomes difficult to explain certain predictions made by the so-called "black box models" [9]. These machine-learning-enabled systems' conclusions and forecasts become considerably more sophisticated and, in many circumstances, vital to students' professional success. Various higher education institutions are using machine learning to drive their strategy and improve their students' academic success [4]. As a result, it is critical to have faith in these machine-learning-based solutions. We receive higher performance as the complexity of the machine learning model rises, but we lose interpretability [4]. So how can we create interpretable machine learning models that don't compromise on accuracy? Traditional metrics like accuracy, R² score, ROC AUC curves, precision-recall curves, etc., do not give the machine learning practitioner enough confidence in the model performance and reliability [12]. One idea is to use simpler models; this way you can ensure that you have full confidence in the interpretability. Models like regressions are inherently easy to interpret because we can easily review the contribution of each feature and know how each decision was made [7]. Because we have access to all of the splits for each feature, a decision tree is another method that is extremely easy to understand: from the root node to the leaf node, we can plainly see how choices are made, and to explain each prediction we just need to follow the rules based on independent variables and list them [4]. The downside, in general, is that interpretable models have lower performance in terms of accuracy or precision, making them less useful and potentially dangerous for production. Therefore, there is a growing need for easy interpretation of a complex black box model [1]. Interpreting more complex models such as random forest and gradient boosting can be challenging, and the interpretation of certain machine learning models is even more complex; for example, a deep neural network model, which is simply an extreme form of a black box model, may include millions of learnt parameters. We receive higher performance as the complexity of the machine learning model rises, but we lose interpretability [4]. The aim of this study is to provide a framework for an interpretable student performance model by introducing a local model-agnostic interpretability method [7], the SHAP value, which is a novel explanatory technique that explains the predictions of any classifier in an interpretable and faithful way, by opening the black box model and explaining how the end result was arrived at and which parts of the model are responsible for certain predictions. In this study we try to answer the following questions: What is model interpretability? Why is it important? How do we interpret student performance models? How can we create and use more complex models without losing all interpretability? Can we open the black box model and explain how the end result came out, and which features contribute, and how much, to the respective prediction? What are Shapley values? How are Shapley values calculated? What makes SHAP a good model explainer? How does SHAP achieve the explainability of the student performance model? Goals:
– Build a framework for an interpretable student performance model,
– Build an interpretable student performance model that doesn't compromise on accuracy.
– Create and use a more complex student performance model without losing all interpretability.
– Build trust in the student performance model by opening the black box model and explaining how the end result was arrived at and which parts of the model are responsible for certain predictions.
– Better understand which features contribute, and how much, to a given prediction, so we can build models that make sense to people who are not ML practitioners.
The paper is structured as follows. Section 2 describes the approach used to employ ensemble methods to develop a student performance model and enhance its performance, as well as the interpretability framework that aids in explaining the model. Section 3 delves into the experimental techniques, findings, and discussion, giving a detailed account of the factors utilized to predict student grades; the emphasis then shifts to putting the model into practice, testing it, and interpreting it, and the assessment step is expanded by studying how SHAP achieves student performance model explainability using the interpretability framework. Finally, Sect. 4 presents the paper's conclusions.
2 Methodology
This section demonstrates how ensemble techniques are used to develop a student performance model, as well as the foundation for interpreting the model that is built.
2.1 Model Creation
In this part, we use a supervised classification approach to train the model to learn the mapping function from input features to output labels. The testing set was utilized to evaluate the prediction performance of the algorithm's model. We employed ten-fold cross-validation to train and test our model to increase accuracy. We used XGBoost as the classification algorithm, as follows: Extreme Gradient Boosting (XGBoost) is a strong ensemble learning approach and an improved gradient boosting algorithm that avoids overfitting by leveraging parallel processing, coping with missing information, and regularization. According to [3], the objective function of XGBoost at the t-th iteration may be expressed as

$\Theta^{(t)} = \Phi^{(t)} + \Omega^{(t)}, \qquad \Theta^{(t)} = \sum_{i=1}^{n} \Phi(y_i, \hat{y}_i) + \sum_{k=1}^{t} \Omega(f_k)$  (1)

where $n$ is the number of predictions (training instances), and $\hat{y}_i^{(t)}$ can be represented as

$\hat{y}_i^{(t)} = \sum_{m=1}^{t} f_m(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$  (2)

The regularization term $\Omega(f_k)$ is obtained by completing the convex differentiable loss function, and this term is defined by

$\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2$  (3)

The term $\Omega$ is interpreted as a combination of ridge regularization with coefficient $\lambda$ and Lasso penalization with coefficient $\gamma$ (Fig. 1).
Fig. 1. XGBoost algorithm procedure
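A minimal sketch of the training setup described in this subsection, pairing an XGBoost classifier with ten-fold cross-validation; hyperparameter values and the variable names X and y are illustrative, not taken from the paper.

```python
# Sketch: XGBoost evaluated with 10-fold cross-validation on the prepared data.
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X, y: preprocessed feature matrix and encoded performance labels (assumed available)
model = XGBClassifier(n_estimators=300, learning_rate=0.1,
                      reg_lambda=1.0, reg_alpha=0.0)   # ridge (lambda) / lasso (alpha) terms of Eq. 3

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("mean 10-fold accuracy:", scores.mean())
```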
2.2 Model Evaluation
For assessing a model, choosing performance measures is critical [10]. When it comes to a classification issue, such as ours, the accuracy described by Eq. 4 is a commonly utilized statistic. This calculation, like those for the other measures, comes from the confusion matrix in Table 1. In our case, TP stands for true positives, the instances for which the model's positive prediction is correct; TN stands for true negatives, the negative instances correctly predicted; FP stands for false positives, the number of instances incorrectly predicted as positive; and FN stands for false negatives, the number of positive instances incorrectly predicted as negative [11].

$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$  (4)

$\text{Precision} = \frac{TP}{TP + FP}$  (5)

$\text{Recall} = \frac{TP}{TP + FN}$  (6)
However, since it favors the dominant class, this statistic is constrained and does not necessarily reflect reality when data is uneven. As a result, in addition to correctness, certain extra metrics allow for the adaptation of imbalanced data by taking the minority and majority classes into account. We utilized a few of these measures.

Table 1. Confusion matrix
                | Predicted positive | Predicted negative
Actual positive | TP                 | FN
Actual negative | FP                 | TN

2.3 Model Interpretation
The student performance model is nothing more than an algorithm that can learn patterns, it might feel like a black box for educational decision makers. Therefore we need tools for model interpretation Fig. 2. In general, the student performance model needs to obtain predictions and use those predictions and eventual insights to solve a variety of educational problems. But the question is How reliable are these predictions? Are they reliable enough to make critical decisions? Student performance model Interpretation is the degree to which decision makers in education can understand the cause of a decision [6], in other words, they want to understand not only the what part of a decision, the prediction itself, but also the why, the reasoning and the Data used by the model to arrive at this decision. Also, the focus shifts from “What was the conclusion?” to “Why was that conclusion reached?” In this way, decision makers in education can understand the decision-making process of the model [8], i.e. what exactly drives the model to classify or misclassify a student. Why Interpretability of Models is Important? This black-box nature of ML raises many concerns, from privacy (processing of personal data) to security and ethics. The need for interpretability arises from several factors: – Security and domain-specific regulations: Interpretability is a requirement in areas where wrong decisions can result in physical or financial loss, such as Healthcare and Finance. – The right to information: End users have the right to understand what exactly is driving the model in order to make a particular decision. – Avoiding bias and discrimination based on gender or race. – Social acceptance and insights: Interpretability should not be viewed as just a limiting matter. But even in scenarios where a machine learning algorithm is making non-critical decisions, humans are looking for answers. – Debug a model: Models need to be debugged. They can focus on the wrong features, show a great result on testing data, but uncover really bad errors in production.
Fig. 2. Black box model interpretation
Framework for Interpretable Machine Learning: Now that we have an idea of what ML interpretability is and why it matters, let us look at how to interpret an ML model. Machine learning models vary in complexity and performance; therefore, there are different ways to interpret them. To do this, we will do a quick walkthrough of the building blocks of the interpretable machine learning framework that we will use to interpret our student performance model (Fig. 3).
– Model: intrinsic or post hoc:
+ Intrinsic interpretability refers to models that are considered interpretable due to their simple structure, such as linear models or trees.
+ Post hoc interpretability refers to the interpretation of a black box model, such as a neural network or an ensemble model, by applying model interpretability methods after training the model. These interpretations occur after the model has been trained and are not related to its internal design; most approaches to complex models are therefore post hoc.
– Method: model specific or model independent: Model-specific interpretation tools are specific to a single model or a group of models. These tools are highly dependent on how a particular model works and rely on the model's inner workings to draw conclusions; they may involve the interpretation of coefficient weights in generalized linear models, or of weights and biases in the case of neural networks. In contrast, model-independent tools can be used for any machine learning model, no matter how complex. They typically work by analyzing the relationship between feature input-output pairs and do not have access to the model's internal mechanics, such as weights or assumptions.
– Local or global: Does the interpretation method explain a single prediction or the entire model behavior? How does the model behave in general? Which features drive predictions?
Fig. 3. Framework for interpretable machine learning
3 Experiments and Result Analysis
The settings and techniques of the studies are examined in this section. The resulting findings are then evaluated and commented on. We started by providing a detailed explanation of the factors utilized to predict student performance, and then we concentrated on model implementation and interpretation. We lay out the steps for testing and validating the suggested model for predicting student performance. Our machine learning method's training and cross-validation scores are used to assess the outcomes. The accuracy, precision, and recall values for model performance are then recorded.
3.1 Data Description
This section discusses the publicly accessible data from the Kaggle repository (https://www.kaggle.com/aljarah/xAPI-Edu-Data) that was used in the research [2]. These data are utilized to build a model of student performance. There are 480 student observations with 17 attributes in the educational dataset utilized in this study. Table 2 gives a detailed overview of the characteristics used to predict a student's success in Jordanian schools. Personal characteristics, such as gender and place of origin; educational background characteristics, such as class participation, use of digital resources, and
student absence days; institutional characteristics, such as education level and grades; and social characteristics, such as the parent in charge of the student, whether the parent responded to the survey, and the parent's evaluation of the school, are among these variables.

Table 2. Description of the variables that were utilized to predict the performance of the students

N | Feature | Description | Feature type
1 | Nationality | Place of origin | Categorical
2 | Gender | The student gender ('Male' or 'Female') | Categorical
3 | Place of birth | Birth place of the student | Categorical
4 | Relation | Parent in charge of the student ('Mom' or 'Father') | Categorical
5 | Stage ID | Educational level the student belongs to ('Lowerlevel', 'MiddleSchool', 'HighSchool') | Categorical
6 | Grade ID | Grade level the student belongs to (from G-01 to G-12) | Categorical
7 | Section ID | Classroom the student belongs to (A, B, C) | Categorical
8 | Semester | School year semester (First or Second) | Categorical
9 | Topic | Course topic (Math, English, IT, Arabic, Science, Quran) | Categorical
10 | Student absence days | Student absence days (under 7, above 7) | Categorical
11 | Parent answering survey | Whether the parent responded to the surveys or not | Categorical
12 | Parent school satisfaction | The degree of satisfaction of the parent (Good, Bad) | Categorical
13 | Discussion | How many times the student participates in discussion groups | Numerical
14 | Announcements view | How many times the student checks the new announcements | Numerical
15 | Visited resources | How many times the student visits the course content | Numerical
16 | Raised hands | How many times the student raises his/her hand in the classroom | Numerical
3.2 Description of Experimental Protocol
This section explains the experimental strategy for testing and assessing the suggested model's performance in predicting student performance. The experiments were carried out using Python 3.7, especially the scikit-learn module. The computer utilized is an "HP" model with 16 GB of RAM, an Intel Core i7 processor, and an NVIDIA GeForce 930M graphics card.
3.3 Result Analysis
1. Model evaluation: To train our model in this phase, we employed a tenfold cross-validation strategy. We primarily employed the xgboost machine learning technique to develop our student performance prediction model since validation is a vital component in creating effective predictive models. Table 3 shows that the prediction model has a precision and accuracy of above 98%. These findings were
produced by the use of methods and techniques such as SMOTE, hyperparameter optimization, and the cross-validation procedure, all of which indicate the model's dependability. These results indicate that, utilizing ensemble approaches, we can forecast student performance and improve the prediction.

Table 3. Results analysis for various classifiers.

Algorithm | Accuracy (Validation / Test) | Precision (Validation / Test) | Recall (Validation / Test)
XGBoost (XGB) | 0.9935 / 0.9940 | 0.9936 / 0.9950 | 0.9943 / 0.9939
2. Model interpretation: Many different tools allow us to analyze black-box models, and interestingly, each of them looks from a slightly different angle. Our focus is to introduce the SHAP value and to show how SHAP achieves student performance model explainability through the lens of the interpretability framework. The SHAP value is an idea with roots in game theory applied in innovative ways to explore machine learning predictions. The Shapley value was first introduced by the economist Lloyd Shapley in 1953; a common illustration is a ride-share cost-splitting problem: what is the fairest way to split the contribution of each rider, i.e., what would the fare be for each possible subset of riders, and what is the marginal contribution to the cost when each rider joins each possible subset? Projecting this onto machine learning, we can think of the riders as the features in our model, or the columns in the dataset used to train the model, and the total cost of the ride as the model output. The Shapley value is the average marginal contribution of a feature value across all possible coalitions, that is, all possible feature combinations of size i (i goes from 0 to n, where n is the total number of features available). The order in which a feature is added to the model matters and influences the prediction. Lloyd Shapley proposed this approach in 1953 (hence the name "Shapley values" for the phi values measured in this manner). Given a prediction p (the prediction by the complex model), the Shapley value for a specific feature i (out of n total features, where S is a subset of the features) is [5]
$\phi_i(p) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!}\,\bigl[p(S \cup \{i\}) - p(S)\bigr]$  (7)
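Equation 7 can be illustrated with a brute-force computation that enumerates every coalition; this is only feasible for a handful of features, and the predict callable below is a hypothetical stand-in for evaluating the model restricted to a feature subset.

```python
from itertools import combinations
from math import factorial

def shapley_value(i, features, predict):
    """Brute-force Shapley value of feature i per Eq. 7.
    `predict(subset)` returns the model prediction using only that feature subset."""
    n = len(features)
    others = [f for f in features if f != i]
    phi = 0.0
    for size in range(n):
        for S in combinations(others, size):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (predict(set(S) | {i}) - predict(set(S)))
    return phi
```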
By calculating the difference in model prediction with and without feature i, this equation may be used to determine the relevance or effect of a feature; the feature's contribution is simply the change in the model estimate.
– Shap interpretability framework:
Ad Hoc vs. Post Hoc: SHAP is a way of determining interpretability after the fact. The SHAP model explainer is a basic, interpretable model that adds another layer on top of the black box to help others comprehend it. The model's internal architecture has no bearing on the interpretations, which occur after it has been trained.
Methods or Model Types Supported: SHAP uses different explainers that focus on analyzing specific model types. For example, the TreeExplainer can be used for tree-based models, the DeepExplainer for neural network models, and the KernelExplainer for any model. However, the properties of these explainers differ significantly; for example, the KernelExplainer is significantly slower than the TreeExplainer. Therefore, SHAP works particularly well for tree-based models.
Scope: Global vs. Local Explainability: Model interpretability is a wide term for assessing and comprehending the outcomes of machine learning models. It is most typically employed for "black box" models, where it is difficult to show how the model came to a specific conclusion. We can examine black box models using a variety of technologies. The SHAP framework gives a global and a local interpretation, which is a wonderful method to improve the transparency of tree ensemble models. The feature importance plot and summary plot are used in the global interpretation to provide a comprehensive perspective of the important elements influencing the model's predictions. Local interpretation, on the other hand, uses force plots to provide a local explanation, which entails concentrating on a single prediction made by the model at the individual level.
– Feature importance plot: The feature importance of XGBoost for different classes of student performance is plotted in a traditional bar chart, as illustrated in Fig. 4. Based on Fig. 4, the features are ordered from the highest to the lowest effect on the prediction; the ordering takes into account the absolute SHAP value. This plot shows that student absences and visited resources are very important, but it does not show how strong the impact is, whether the impact is positive or negative, or whether it makes the prediction go higher or lower. Student absences have a greater impact than other characteristics, implying that changing this element may have a greater impact than changing others. In contrast, the least relevant characteristics, such as stage ID, section ID, and semester, show no significant impact on student performance across all classes of performers, suggesting that changing these features has no discernible effect on model prediction.
Fig. 4. Traditional feature importance plots
– Summary plot: Another plot, the summary plot (Fig. 5), may be used to understand the relevance or contribution of the characteristics over the whole dataset. All variables are listed in order of global feature relevance, with the most significant first and the least important last. The base value is zero, which is the model's average forecast over all predictions. Anything over zero is considered positive, whereas anything below zero is considered negative, and the hue indicates how strong the impact is (red means high, whereas blue means low). For instance, features with a high value, such as student absence days, raised hands, and visited resources, have a significant and beneficial impact on student performance: the red hue represents a high feature value, and the x-axis position represents a positive SHAP value. The feature has a beneficial influence on student performance when the number of student absence days is less than seven. Participation in class has a favorable impact on student achievement. Furthermore, the utilization of digital resources has a good correlation with student achievement. When the father is in charge, the parent-in-charge feature has a detrimental impact on student achievement (negative SHAP value).
Fig. 5. Summary plots
– Force plot: This plot shows us the main features affecting the prediction of a single observation, and the magnitude of the SHAP value for each feature (Fig. 6). The force plot is another way to see the effect each feature has on the prediction for a given observation [13]. In this plot, the positive SHAP values are displayed on the left side and the negative ones on the right side, as if competing against each other, and the highlighted value is the prediction for that observation. The average value, or base value, of the prediction is 0.5937. For the color coding, red shows the features that contribute to driving the prediction above the base value, and blue shows the features that drive the prediction below the base value. The student absence days and visited resources features can increase the prediction; by contrast, the parent-in-charge and announcements view features can decrease the final output prediction.
Fig. 6. Force plot depicting feature contribution towards a single prediction.
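The three views discussed above (feature importance bar chart, summary plot, force plot) can be produced with the shap package roughly as follows; the sketch assumes a trained tree-based model with a single output (multi-class models return one SHAP array per class), and variable names are illustrative.

```python
import shap

explainer = shap.TreeExplainer(model)        # `model`: the trained XGBoost classifier
shap_values = explainer.shap_values(X_test)  # for multi-class models this is a list per class

shap.summary_plot(shap_values, X_test, plot_type="bar")   # global feature importance (Fig. 4 style)
shap.summary_plot(shap_values, X_test)                    # summary/beeswarm plot (Fig. 5 style)
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0],
                matplotlib=True)                          # force plot for one student (Fig. 6 style)
```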
4 Conclusion
The machine learning-based student performance model is no longer a black box. If we cannot explain the results to others, then what is the benefit of a good model? Interpretability is as important as creating the model. In order to achieve broader acceptance among the population, it is important that machine learning systems provide a satisfactory interpretation of their decisions. The SHAP value can give us insights into the inner workings of student performance models and the kind of insights that can be generated. Machine learning models have already proven to be remarkably accurate and useful; now we need ways to explain the predictions they make, and Shapley values are a great tool to develop an explainable model of student performance. The SHAP value interpretability framework helps build trust in our model, supports an ethical commitment to it, strikes a balance between interpretability and accuracy, and provides global and local interpretation through various visualization packages. Effectively, the SHAP value can show us both the global contribution, through the feature importance and summary plots, and the local feature contribution for each instance of the problem, through the force plot. As Albert Einstein said, "If you can't explain it simply, you don't understand it well enough" [14].
References 1. Adyatama, A.: Interpreting classification model with lime. 2 December 2019. https://algotech.netlify.app/blog/interpreting-classification-model-with-lime/ 2. Amrieh, E.A., Hamtini, T., Aljarah, I.: Mining educational data to predict student’s academic performance using ensemble methods. Int. J. Database Theory Appl. 9(8), 119–136 (2016) 3. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016) 4. Choudhary, A.: Decoding the black box: an important introduction to interpretable machine learning models in Python, 26 August 2019. https://www. analyticsvidhya.com/blog/2019/08/decoding-black-box-step-by-step-guideinterpretable-machine-learning-models-python/ 5. Explain your model with the shap values, 07 November 2019. https:// towardsdatascience.com/explain-any-models-with-the-shap-values-use-thekernelexplainer-79de9464897a 6. Interpretability in machine learning: looking into explainable AI, 21 September 2020. https://www.altexsoft.com/blog/interpretability-machine-learning/
7. Introduction to local interpretability models with lime and shap, 2 March 2020. https://aqngo.com/2020/03/02/local-interpretability.html 8. Jha, A.: Ml model interpretation tools: what, why, and how to interpret, 14 January 2022. https://neptune.ai/blog/ml-model-interpretation-tools 9. Kundu, A.: An overview of how different machine learning interpretability tools are used for interpretation, 29 November 2021. https://towardsdatascience.com/ shap-explained-the-way-i-wish-someone-explained-it-to-me-ab81cc69ef30 10. Lykourentzou, I., Giannoukos, I., Nikolopoulos, V., Mpardis, G., Loumos, V.: Dropout prediction in e-learning courses through the combination of machine learning techniques. Comput. Educ. 53(3), 950–965 (2009) 11. Sahlaoui, H., El Arbi, A.A., Nayyar, A., Agoujil, S., Jaber, M.M., et al.: Predicting and interpreting student performance using ensemble models and Shapley additive explanations. IEEE Access 9, 152688–152703 (2021) 12. Solanki, S.: How to useLIME to understand sklearnmodels predictions?, 21 October 2020. https://coderzcolumn.com/tutorials/machine-learning/how-to-use-lime-tounderstand-sklearn-models-predictions 13. Trevisan, V.: Using Shap values to explain how your machine learning model works learn to use a tool that shows how each feature affects every prediction of the model, 17 January. https://towardsdatascience.com/using-shapvalues-to-explain-how-your-machine-learning-model-works-732b3f40e137#:∼: text=SHAP20values20(SHa 14. Xiaoqiang: Interpretable machine learning, 10 April 2019. https://easyai.tech/en/ blog/interpretable-machine-learning/
A New Telecom Churn Prediction Model Based on Multi-layer Stacking Architecture
Jalal Rabbah(B), Mohammed Ridouani, and Larbi Hassouni
RITM Laboratory, CED Engineering Sciences, Hassan II University, Casablanca, Morocco
[email protected]
Abstract. Customer retention costs four times less than new customer acquisition for telecom operators. This has made churn prediction an important issue for major service providers around the world. For this reason, significant investments are increasingly devoted to the development of new anti-churn strategies. Machine learning is among the new approaches used today in this area. In the proposed work, we constructed a multi-layer stacking network, composed of a set of 8 selected trained models, based on grid search cross-validation. The model we suggested scored the best result in terms of accuracy, 80.1%, higher than any of its base models.
Keywords: Machine learning · Stacknet · Customer churn · Boosting · Bagging
1 Introduction
Churn is where customers switch between service providers anonymously [1]. As a result of growing market pressure in the telecom sector, this problem has become a major concern for telecom firms. Addressing this plague means that it becomes imperative for companies to anticipate ahead of time those customers most likely to quit. Therefore, retaining existing customers is now a critical part of protecting investors' revenue in the telecommunications sector. In response to this reality, telecommunications providers are obliged to implement loyalty programs (Fig. 1) and to invest more in the retention of their customers than in the acquisition of new ones [2]. In contrast to databases of several terabytes, which usually hold multimedia streams, telco databases mostly include many smaller records of network transactions and state events [3]. This makes processing this huge data a big task that requires a lot of work and expertise. The biggest barrier to addressing this issue is the amount of processing time required to detect true churners; this time should be limited to no more than a few days, to prevent the customer from leaving and to allow the marketing department enough time to offer suitable options that can retain the customer. It is worth noting that investing in customer retention is highly crucial, and errors in detecting real possible churners may cost companies a lot of funds. Predicting churn is mainly a classification problem, in which the objective is to forecast churners prior to their leaving. For each subscriber, we consider binary labels
Fig. 1. Customer attrition process
such as Target. It should be pointed out that the proportions are not the same and that only a small number of customers are susceptible to churn in a random sample. These are referred to as unbalanced datasets. To handle these data and arrive at an accurate model, it is necessary to take into account multiple features in order to provide insight into all the possible contributing factors to customer churn. A large number of approaches and requirements exist to choose the most appropriate model. C. A. Hargreaves used the logistic regression model because it needs lower processing time than other advanced machine learning algorithms and its results are also very simple to understand [4]. S. Induja [5] suggested that the Kernelized Extreme Learning Machine (KELM) algorithm should be used for classifying churn schemes in the telecommunications sector; the key technique of the suggested design is data organization using preprocessing with the Expectation Maximization (EM) clustering algorithm. Besides investigating the combination of algorithms and methods, T. Kamalakannan [6] applied a normalized k-means algorithm for preprocessing the dataset, followed by feature selection on the preprocessed data using the mRMR (minimum redundancy and maximum relevance) method; the goal of this work is to select features that correlate strongly with the class (output) and weakly with each other, and an SVM (support vector machine) with PSO (particle swarm optimization) is considered for prediction. M. Gunay and T. Ensari [7] investigated multiple machine learning approaches and proposed a new prediction design. H. Abbasimehr, M. Setak and M. J. Tarokh [8] conducted a comparative evaluation of four main ensemble techniques, namely Boosting, Bagging, Voting and Stacking, using four basic learners (SVM, ANN, Decision Tree and RIPPER (Repeated Incremental Pruning to Produce Error Reduction)); training was carried out using SMOTE, and this research shows significant improvement in predictions using the ensemble learning techniques. To summarize, this research aims to build a predictive churn detection solution that can rapidly identify possible cases with a high level of confidence. The structure of this paper is as follows: Sect. 2 gives a brief overview of the issue. Section 3 presents the analysis methods used in our study. Section 4 outlines the process and design of the research. Section 5 describes the experimental results. Finally, Sect. 6 provides the conclusions of the paper.
2 Description of the Problem
The management of churn is a high-priority issue for telcos, and accurately predicting which customers may churn is a major concern for their teams, for many reasons:
Features: To understand this issue, it is important to consider all the drivers that can affect customers. In fact, it is necessary to assess customer satisfaction along several dimensions, including network connectivity, customer service interactions, consumption and recharge rates, customer category, etc.
Behaviors: Customer behavior changes are difficult to predict and to consider in predictive churn models, as customers may unexpectedly change behavior, whether by switching rating plan or for reasons related to a private choice of the customer, like giving the SIM card to another person, going on a trip, or any other reason.
Data Volume: Telcos generate millions, if not billions, of Call Data Records (CDRs) and Event Data Records (EDRs) each day, depending on the company's size. This data contains a wide variety of pertinent information that is hard to handle using conventional databases, so having the right combination of tools and skills to capture, manage, analyze, and retrieve meaningful insights from these records is essential. Other data related to customer satisfaction, network quality, and information from social networks are also important.
Time: Time is of the essence in churn management in telcos. Results come late in most cases, which limits the time available for marketing to take action and retain the most at-risk customers.
Money: Prediction model performance in such cases is a crucial financial metric. For instance, when a churn prevention algorithm shows a high false-positive rate, it causes significant company losses, which result both from trying to keep customers that were not intending to churn and from the departure of true churners from the operator.
3 Machine Learning Using Stacknets
Within the field of machine learning, building a meta-model that combines multiple machine learning algorithms to achieve better performance than single-algorithm models is called ensemble learning. The obtained "ensemble" is itself a supervised algorithm that can be used to predict churn after training. The statistical hypothesis for which the ensemble is intended to be used is not necessarily included in its components. Several ensemble learning methods exist: Bagging [9] trains weak learners in parallel, while Boosting [10] trains several weak learners sequentially. Stacking was defined by Wolpert [11] as a method for combining models built from various algorithms. Stacknets are a generalization of neural networks [11] intended to enhance the performance of machine learning; the goal of Stacknets is to improve classification accuracy or to decrease learning errors. In this context, we have used a Java implementation of Stacknets [12].
4 Methodology
4.1 Dataset
For this research, we used a telecommunications customer churn dataset composed of 7043 data points with 21 features (Table 1), available on Kaggle [13].
4.2 Data Preprocessing
In this stage, we performed various manipulations of the original data to build the prediction model. First, we normalized the data; we next transformed the Boolean features to include only 0s and 1s and encoded categorical features; then we normalized the features. Table 1 shows a summary of the descriptive statistics for the dataset used.

Table 1. Customer churn fields description

Feature | Count | Mean | Std | Min | 25% | 50% | 75% | Max
gender | 7032 | 0.505 | 0.5 | 0 | 0 | 1 | 1 | 1
Senior citizen | 7032 | 0.162 | 0.369 | 0 | 0 | 0 | 0 | 1
Partner | 7032 | 0.483 | 0.5 | 0 | 0 | 0 | 1 | 1
Dependents | 7032 | 0.298 | 0.458 | 0 | 0 | 0 | 1 | 1
tenure | 7032 | 32.422 | 24.545 | 1 | 9 | 29 | 55 | 72
Phone service | 7032 | 0.903 | 0.296 | 0 | 1 | 1 | 1 | 1
Online security | 7032 | 0.287 | 0.452 | 0 | 0 | 0 | 1 | 1
Online backup | 7032 | 0.345 | 0.475 | 0 | 0 | 0 | 1 | 1
Device protection | 7032 | 0.344 | 0.475 | 0 | 0 | 0 | 1 | 1
Tech support | 7032 | 0.290 | 0.454 | 0 | 0 | 0 | 1 | 1
Streaming TV | 7032 | 0.384 | 0.486 | 0 | 0 | 0 | 1 | 1
Streaming movies | 7032 | 0.388 | 0.487 | 0 | 0 | 0 | 1 | 1
Paperless billing | 7032 | 0.593 | 0.491 | 0 | 0 | 1 | 1 | 1
Monthly charges | 7032 | 64.798 | 30.086 | 18.25 | 35.588 | 70.350 | 89.862 | 118.75
Total charges | 7032 | 2283.3 | 2266.771 | 18.80 | 401.450 | 1397.475 | 3794.7 | 8684.8
Churn | 7032 | 0.266 | 0.422 | 0 | 0 | 0 | 1 | 1
4.3 Feature Engineering
Throughout our process of analysis, we performed a feature selection study (Table 2) to reduce the feature dimensionality and keep a meaningful number of pertinent features for effective learning.

Table 2. Features scoring and selection using different algorithms
Selection methods compared: A1_RFE, A2_Chi-2, A3_Pearson, A4_Logistics, A5_RandomF, A6_LightGBM
Features importance (Top 10): F1_tenure, F2_TotalCharges, F3_TenureG_Tenure_0-12, F4_MonthlyCharges, F5_InetServ_FiberOptic, F6_Contract_MonthToMonth, F7_gender, F8_TenureG_Tenure_gt_60, F9_TenureG_Tenure_48-60, F10_TenureG_Tenure_12-24
We made a combination of several feature selection methods to remove the less significant ones. A filter-based method was first used to compute the absolute value of the Pearson correlation between the target and each attribute. In the second method, we did the same by calculating the Chi-Squared statistic. Our third technique is RFE (recursive feature elimination), which eliminates features recursively after every run. We also used LASSO, which has a built-in selection function that forces the weights of the weakest attributes to zero, in addition to selecting feature importance using Random Forest. We then eliminated the less important features in an attempt to optimize the performance of our solution. Intuitively, the radar graphs (Fig. 2 and Fig. 3) below can explain some cases. For instance, all clients have Multilines_NoPhoneServ as 0, meaning that it gives no useful information to our model. The same goes for DeviceProtection, Multi-lines_Yes and PhoneService.
4.4 Unbalanced Class in Binary Classification
Such a problem is frequently encountered in this kind of use case, when the numbers of observations of the target classes are significantly imbalanced. Telco customer churn is one of those cases: churners are uncommon compared to the majority of telecom customers. Among the best-known techniques used to deal with unbalanced datasets is SMOTE, which stands for Synthetic Minority Oversampling Technique [14].
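A minimal sketch of this rebalancing step with imbalanced-learn's SMOTE; the file name, column name, and split parameters are illustrative assumptions, and only the training split is oversampled so the test set keeps its natural class ratio.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_csv("telco_churn_preprocessed.csv")   # hypothetical preprocessed file
X, y = df.drop(columns=["Churn"]), df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the minority class (churners) in the training split only.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(y_train.value_counts(), y_train_bal.value_counts(), sep="\n")
```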
Fig. 2. Features distribution for churners
Fig. 3. Features distribution for no churners
4.5 Model Building
A large choice of machine learning models is available to design a Stacknet [15–19]. We decided to use a combination of various machine learning algorithms chosen from the scikit-learn machine learning package in Python. Hyperparameter estimation based on the grid search cross-validation approach was used to tune the classifier. This method is preferred to prevent overfitting and to maintain nearly constant prediction performance from training data to unobserved data. To make this choice, X_train is used to train the model, the tuning is then done using each combination of grid hyperparameters, and test scores determine the set of parameters of the best model. Figure 4 gives a general view of the construction of our churner prediction system. The raw data is first cleaned by verifying and substituting missing values and fixing any data type or format errors. We then verified the existence of outliers. These steps are essential in order to have quality data, which will subsequently generate quality results. After the cleaning step comes the step of preparing and optimizing the data, including normalization, rescaling, binarization, and encoding of categorical data using a one-hot encoder. The prepared data is then passed to the feature selection step; we combined a set of techniques to make an in-depth analysis of the features, which allowed us to delete the least relevant features from our analysis. We used diverse models including various algorithms to build a robust and accurate model. We implemented our design using the Python package scikit-learn [20]. In addition, we used a StackNet implementation called pystacknet [16]. Our StackNet architecture is represented in Fig. 4.
4.6 Performance Metrics
The metrics we used are calculated using the confusion matrix as a tool to evaluate the proposed model.
Holdout: This involves dividing the dataset, as represented in Fig. 5, into three separate subsets: the training set, the validation set, and the test set. The first one is used during
Fig. 4. The churn prediction system based on multi-layer stacking
The first set is used during the machine learning model building phase, the second set is used to tune performance, and finally the test set is used for performance evaluation. Cross Validation (k-fold): With this approach, we split the initial data into k subsets and iteratively evaluate performance, each time using one part for validation/test and k-1 parts for training. As most of the data is used to fit the model, this technique has the advantage of clearly reducing bias. Model evaluation metrics: The type of evaluation metric varies depending on the machine learning problem. We have used the following metrics to evaluate our classification model.
• Classification accuracy: this is the proportion of correct predictions out of all predictions.
Accuracy = Number of correct predictions / Total number of predictions (1)
• Receiver operating characteristic curve (ROC): True versus False positive rate for various thresholds for a classifier.
• AUC: Area under the curve is the measure of performance that consists of computing the total two-dimensional area under the entire ROC curve.
• Precision: Indicates the proportion of accurate positive predictions.
Precision = True Positives / (True Positives + False Positives) (2)
• Recall or hit rate:
Recall = True Positives / (True Positives + False Negatives) (3)
• F-measure: balance between precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall) (4)
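As a quick illustration (not taken from the paper), these metrics can be computed directly from the confusion matrix with scikit-learn; y_test, y_pred and y_score are placeholders for the held-out labels, the model's predicted classes and its predicted probabilities.

```python
# Sketch: evaluating a churn classifier with the metrics defined above.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# y_test, y_pred and y_score (positive-class probabilities) are assumed to exist.
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))   # Eq. (1)
print("Precision:", precision_score(y_test, y_pred))  # Eq. (2)
print("Recall   :", recall_score(y_test, y_pred))     # Eq. (3)
print("F1 score :", f1_score(y_test, y_pred))         # Eq. (4)
print("ROC AUC  :", roc_auc_score(y_test, y_score))   # area under the ROC curve
```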
5 Results of Experiments

5.1 Used Algorithms
Fig. 5. Performance comparison
5.2 Performance Evaluation

Fig. 5 shows the performance of the major models of Table 3 applied to the churn dataset.

Table 3. List of algorithms used to build the stacknet churn classifier
Level 1: E1.1: LGBM regressor; E1.2: Bagging; E1.3: AdaBoost; E1.4: CatBoost regressor; E1.5: Random forest
Level 2: E2.1: Gradient boosting; E2.2: Bagging
Meta-learner: Gradient boosting

As we can observe, the algorithms with good accuracy were LightGBM (79.8% accuracy), Adaptive Boosting Classifier (79.5% accuracy) and Bagging Classifier (78.1% accuracy); the model that we built using Stacknet achieved the best accuracy score, 80.1%, confirmed also by the correlation matrix of Fig. 6.
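A two-level stack in the spirit of Table 3 can be approximated with scikit-learn's StackingClassifier. The sketch below is a simplified stand-in for the pystacknet model, not the authors' exact configuration: it uses only scikit-learn estimators (no LightGBM/CatBoost) and illustrative grid-search values.

```python
# Sketch: a two-level stacking ensemble with a gradient-boosting meta-learner.
from sklearn.ensemble import (StackingClassifier, BaggingClassifier,
                              AdaBoostClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import GridSearchCV

level1 = [
    ("bagging", BaggingClassifier(random_state=0)),
    ("adaboost", AdaBoostClassifier(random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
]
# Level-2 learners are stacked on the level-1 outputs; gradient boosting is the meta-learner.
level2 = StackingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("bagging2", BaggingClassifier(random_state=0))],
    final_estimator=GradientBoostingClassifier(random_state=0),
    cv=5,
)
stack = StackingClassifier(estimators=level1, final_estimator=level2, cv=5)

# Grid search over a couple of first-level hyperparameters (values are illustrative).
param_grid = {"rf__n_estimators": [100, 300], "rf__max_depth": [None, 8]}
search = GridSearchCV(stack, param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train); print(search.best_params_, search.best_score_)
```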
Fig. 6. Models output correlation matrix
5.3 Discussion

In our case we used a clean dataset from the internet that contains 7043 data points with 27% churners. The best accuracy we obtained was 80.1% using the proposed Stacknet architecture, whereas precision ranks in second place, with 63% of the detected cases confirmed as real churners. We can improve the performance by adding other features, such as customer care interaction information, or by using more historical data regarding customers and the business ecosystem in general. The next step is to find out when and why churn occurs, for example through a feature importance analysis.
6 Conclusion

This work evaluated a newly developed software implementation of stacked generalization called pystacknet against various types of classification algorithms. From the obtained results, which can be further improved by fine-tuning the Stacknet hyper-parameters, the proposed solution can constitute a satisfactory answer for telecommunication operators, allowing them to be alerted in advance and with great precision about possible churners. This will make it possible to launch targeted loyalty and retention campaigns, but also to save an important budget which would otherwise be lost either to the departure of customers or to the expense of acquiring new ones.
References 1. Umayaparvathi, V., Iyakutti, K.: Applications of Data Mining Techniques in Telecom Churn Prediction. Int. J. Comput. Appl. (0975–8887) 42(20), 5–9 (2012) 2. Idris, A., Khan, A.: Customer churn prediction for telecommunication: Employing various features selection techniques and tree based ensemble classifiers. In: 15th International Multitopic Conference (INMIC), pp. 23–27 (2012)
3. Koutsofios, E.E., North, S.C., Keim, D.A.: Visualizing Large Telecommunication Data Sets. IEEE Comput. Graph. Appl. 19(3) 16–19 (1999) 4. Hargreaves, C.A.: A machine learning algorithm for churn reduction & revenue maximization: an application in the telecommunication industry. Int. J. Fut. Comput. Commun. 8(4) (2019) 5. Induja, S., Eswaramurthy, V.P.: Customers churn prediction and attribute selection in telecom industry using kernelized extreme learning machine and bat algorithms, Int. J. Sci. Res. (IJSR) ISSN 5(12), 2319–7064 (2016) 6. Kamalakannan, T.: Efficient customer churn prediction model using support vector machine with particle swarm optimization. Int. J. Pure Appl. Math. 119(10), 247–254 (2018) 7. Gunay, M., Ensari, T.: New Approach for Predictive Churn Analysis in Telecom. WSEAS Trans. Commun. E-ISSN 18, 2224–2864 (2019) 8. Abbasimehr, H., Setak, M., Tarokh, M.J.: A comparative assessment of the performance of ensemble learning in customer churn prediction. Int. Arab J. Inf. Technol. 11(6), 599–606 (2014) 9. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996) 10. Freund, Y., Schapire, R., Abe, N.: A short introduction to boosting. J. Japan. Soc. Artif. Intell. 14(771–780), 1612 (1999) 11. Wolpert, D.H.: Stacked generalization. Neural Netw. 5(2), 241–259 (1992) 12. Michailidis, M.: Investigating Machine Learning Methods in Recommender Systems. UCL (University College London), Diss (2017) 13. BlastChar, Telco Customer Churn Dataset. https://www.kaggle.com/blastchar/telco-cus tomer-churn . (2018) 14. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16 321–357 (2002) 15. Kim, J., Kim, J. and Kwak, N.: StackNet: Stacking Parameters for Continual learning (2018) arXiv preprint arXiv:1809.02441 16. Michailidis, M.: kaz-Anova/StackNet : https://github.com/kaz-Anova/StackNet 17. Michailidis, M., Soni, A.: H2Oai/Pystacknet, https://github.com/h2oai/pystacknet, pystacknet is a light python version of StackNet 18. Rabbah, J., Ridouani, M., Hassouni, L.: A new classification model based on stacknet and deep learning for fast detection of COVID 19 through X rays images. Fourth Int. Conf. Intell. Comput. Data Sci. (ICDS) 2020, 1–8 (2020) 19. Elaanba, A., Ridouani, M., Hassouni, L.: Automatic detection using deep convolutional neural networks for 11 abnormal positioning of tubes and catheters in chest X-ray images. IEEE World AI IoT Congr. (AIIoT) 2021, 0007–0012 (2021) 20. Pedregosa, F., et al.: Scikit-learn: Machine learning in Python. J. Mach. learn. Res. 12, 2825–2830 (2011)
A Novel Hybrid Classification Approach for Predict Performance Student in E-learning Hanae Aoulad Ali1(B) , Chrayah Mohamed2 , Bouzidi Abdelhamid1 , Nabil Ourdani1 , and Taha El Alami1 1 TIMS Laboratory, FS, Abdelmalek Essaadi University, Tetuan, Morocco
[email protected] 2 TIMS Laboratory, ENSATE, Abdelmalek Essaadi University, Tetuan, Morocco
Abstract. Predicting learners' outcomes at each stage of their educational career has received considerable interest. It provides essential information that can aid and guide universities in making rapid judgments and improvements that will enhance the success of students. In the context of the COVID-19 epidemic, e-learning has boomed, increasing the quantity of digital learning data. As a result, machine learning (ML)-based algorithms for predicting students' performance in virtual classes have been developed. We propose a novel hybrid algorithm for predicting the achievements of learners in online courses. To enhance prediction results, hybrid learning combines many models. Voting is a useful technique that comes in especially handy when no single model is sufficient. We conclude that our approach was the most successful, with an accuracy of 99%. Keywords: Classification · Ensemble classifier · Voting · Online education · Machine learning
1 Introduction

With the rapid growth of the online education business, a growing number of students have been attracted to e-learning. When there is a high number of individuals in a classroom, many instructors start to research how to enhance the quality of online teaching for every learner [1]. Therefore, the idea of a learning outcomes predictive model was developed. The majority of the data used by learning outcomes prediction algorithms comes from confidential records in the back ends of several virtual learning systems [7]. Several local and international universities are being given the opportunity to participate and contribute to the improvement of student outcome prediction algorithms for virtual learning systems [11]. They can create learning outcomes prediction models using the personal information held by online learning systems [3]. The learner action stream sequence is used in the references to determine whether a learner submits a project at a specific time, asks a question at a specific time, finishes the certification at a specific time, and so on. Modern machine learning techniques are used for forecasting learner outcomes in small
student cohorts. As a result, learners are migrating to online classrooms in massive numbers. When a significant number of people are registered in a classroom at the same time, one of the primary difficulties that digital learning systems should address is how to increase the reliability of learning for everyone. We are therefore developing a system for forecasting the achievement of learners in order to forecast learning outcomes. From online educational platforms, we obtain information about learners. Demographic characteristics and web log records are examples of these types of data, used to predict the final grades of learners [12]. Machine learning (ML) has improved to the point where it is now employed in university classrooms for diverse data analysis tasks. According to research, this growing sector in virtual classrooms can provide valuable insights into a range of classroom performance areas, motivating an in-depth examination of the use of ML. The proposed hybrid strategy produces better results compared with the single approaches. This paper presents a method to predict student outcomes using a hybrid approach of Decision Tree and Random Forest. The paper provides a comparison with other existing techniques and shows that the use of a hybrid approach can improve the efficiency of student performance prediction. The proposed hybrid approach gives better results compared to the existing techniques. The rest of the paper is organized as follows: Section 2 describes related work. Section 3 introduces the proposed method. Section 4 presents the applied algorithms. Section 5 explains the results and analysis obtained. Section 6 presents the conclusion of the proposed work.
2 Related Work

In the digital world, e-learning is a brand-new way to study. Vast volumes of high-quality educational resources, such as classroom videos, quizzes, exams and certifications from several reputable universities throughout the world, are accessible via online learning, for example through MOOCs [5]. Learners can use a virtual learning platform to choose courses that are suitable for them at low or no cost, making it much easier for them to study independently. The dependence on online education services has made its way into the traditional education industry [9]. Unlike traditional teacher-student interactions, which take place face-to-face, online education is not constrained by the location of learning or the availability of the teacher. It accepts an unlimited number of learners. It also allows instructors to double-check the quality of their lessons by pre-recording and reviewing educational videos. Because it provides many of the advantages outlined above, online learning is rapidly gaining popularity and attracting a larger number of learners [4]. Since 2012, e-learning has received a lot of interest in the social and academic worlds as a new way of learning. MOOCs have amassed a large number of students and a volume of data on what occurs during a course. As a result, several researchers have used a range of qualitative and quantitative methodologies to analyse the influence of learning strategies on MOOC levels of engagement [16]. To this aim, 140,546 learners' data were analyzed by multiple linear regression. The data demonstrated that learners frequently disregard the linear progression of the course content, whereas age and level were likely to be strongly correlated with the quantity
of the teaching reviewed by learners. Cisel (2014) attempted to identify the indicators that have a substantial effect on student graduation rates in a French xMOOC in a comparable study [8]. Feng et al. (2019) presented a context-based feature interaction network (CFIN). Several tests compared the efficacy of CFIN to well-known algorithms such as LR, support vector machines (SVM), and random forest (RF). Moreover, by integrating CFIN and XGBoost, an ensemble approach was created.
3 Proposed Method

In this article, we present an approach that combines the efficiency of Random Forest and Decision Tree as supervised mechanisms to resolve the categorization challenge. Random forest is an ensemble learning approach that builds a number of decision trees with randomly picked attributes and uses voting to evaluate the accuracy on a test dataset. On the other hand, decision trees are well-known as very effective ML and artificial intelligence techniques that can generate accurate and simple-to-understand models. They are reliable and can handle massive amounts of data in a short amount of time. Therefore, this research suggests an approach to developing a model in a step-by-step way, with the important phases shown in the steps that follow. Phase 1 is data collection; in this research, we used the OULAD dataset. In this phase, multiple pre-processing techniques are further applied, such as missing value adjustment, normalization, outlier detection, and others. The whole method includes four parts: data preprocessing, feature engineering, model selection, and model fusion.

Data Preprocessing Module
Data preprocessing is the process of finding, removing, or correcting abnormal samples in raw data and converting the data into a format that is easier for machines to learn from. In this paper, we summarize it in three aspects:
Data cleaning: find abnormal samples in the raw data and remove or correct them as necessary.
Analyze the distribution of data labels: if the proportion of certain labels in the raw data is too high, a sampling method or label weight modification should be used. This prevents unbalanced samples from influencing the final output of the model.
Numerical feature normalization: if there are numerical features with a skewed distribution, operations such as Z-score standardization, normalization, and log transformation should be carried out according to the situation to make the overall distribution of the data more balanced.

Feature Engineering Module
In some application scenarios, feature engineering is even more important than algorithms. In this article, we summarize it in two aspects (a small encoding sketch is given at the end of this section):
Feature encoding: if the raw data include non-numerical features, they need to be converted into numbers that can be recognized by the computer. We encode different types of variables differently. For label variables, one-hot encoding is used. For ordered variables, ordinal encoding is used. For interval and ratio variables, no coding is required.
Feature selection: if there are too many features in the raw data, the noise in the features will have a negative effect on the learning of the model. In this case, deleting some features with weak correlation should be considered in order to screen out the optimal feature subset and improve the model. In this paper, we use a filtering method based on feature importance. Phase 2 of the proposed system prepares and applies the ML models. In this phase, ensemble methods are applied separately in different experiments. Finally, phase 3, the last step, provides comprehensive feedback, analysis, and comparison among all experiments.
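A minimal illustration of the encoding split described above, assuming a pandas DataFrame with hypothetical column names and category levels (not the exact OULAD schema):

```python
# Sketch: one-hot encoding for nominal (label) variables, ordinal encoding for ordered ones.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

nominal_cols = ["gender", "region"]              # hypothetical nominal columns
ordinal_cols = ["highest_education"]             # hypothetical ordered column
ordered_levels = [["No Formal quals", "Lower Than A Level", "A Level or Equivalent",
                   "HE Qualification", "Post Graduate Qualification"]]  # illustrative order
numeric_cols = ["total_clicks", "score"]         # hypothetical numeric columns

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
    ("ordinal", OrdinalEncoder(categories=ordered_levels), ordinal_cols),
    ("scale", StandardScaler(), numeric_cols),   # interval/ratio variables are only scaled
])
# X_encoded = preprocess.fit_transform(df[nominal_cols + ordinal_cols + numeric_cols])
```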
4 Applied Algorithms

The ensemble approach is a technique in ML in which multiple algorithms are trained to address the same challenge and then combined to achieve exceptional results [2]. This technique offers better prediction when compared to a single traditional algorithm. The idea is to train a set of classifiers (experts) before allowing them to vote.

4.1 Voting
The voting classifier combines many types of ML classifiers, aggregates the output of each classifier fed to it, and uses voting to predict the class label of a new instance. There are two forms of voting: hard and soft. Hard voting uses the majority rule: the class that receives the most votes is chosen (predicted). For soft voting, a forecast is created by averaging the class probabilities of each classifier; the predicted class is the one with the best average probability. We used soft voting in this project. Tree-based ensemble classifiers are also employed as the voting classifier's base estimators [2]. A sketch of such a soft-voting ensemble is given at the end of this section.

4.1.1 Decision Tree
A decision tree is a type of ML algorithm in which each internal node represents a choice between a number of alternative solutions, and each leaf node represents a classification or decision. The first decision tree classification algorithms are old. Decision trees are simple to use and give a visual, easy-to-read representation of one's categorization protocol [11]. Because the decision tree technique is simple to construct and test, displaying the tree may be useful in some scenarios. This is impossible to do with sophisticated algorithms that handle non-linear requirements, such as ensemble approaches.

4.2 Random Forest
The random forest technique is a bagging approach in which deep trees are coupled with bootstrap samples to provide a reduced-variance output. Random forests, furthermore, utilize a different approach to make the numerous fitted trees less correlated [11].
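To make the voting step concrete, here is a minimal soft-voting ensemble over a decision tree and a random forest with scikit-learn; the hyperparameters are illustrative and not the paper's tuned values.

```python
# Sketch: soft voting over the two tree-based learners used in the hybrid model.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X, y are assumed to be the encoded OULAD features and the outcome labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

voting = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=42)),
                ("rf", RandomForestClassifier(n_estimators=300, random_state=42))],
    voting="soft",  # average predicted class probabilities instead of majority labels
)
voting.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, voting.predict(X_test)))
```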
5 Results

5.1 Dataset
To evaluate our approach, we used the Open University Learning Analytics Dataset (OULAD) [10], which contains a series of online-education-related data provided by online education platforms, such as student demographic data, student clickstream data, and course data. Student demographic data is background information such as the student's gender, age, and highest education level. It is unique and is collected by the online education platform when the student registers. Student clickstream data is the type and frequency information of students interacting with the Virtual Learning Environment (VLE) platform in a course, which includes accessing resources, web-page clicks, forum clicks and so on. It reflects how actively students participate in the course. The OULAD dataset contains 22 modules, 32,593 learners, and 10,655,280 data points on student-VLE platform interactions. Students' outcomes are divided into four categories: Distinction (D), Pass (P), Fail (F) and Withdrawn (W). When the student's score is higher than 75 points, the outcome is D. When the student's score is higher than 40 but lower than 75, the outcome is P. When a student completes the course but the score is less than 40 points, the outcome is F. When the student does not complete the course, the outcome is W.

5.2 Validation
Accuracy, Precision, Recall, and F-Measure are the classification metrics used in our study, and these parameters are evaluated for the suggested hybrid approach.

Table 1. Different performance parameters for the selected classifiers
Metric             KNN    GB     XGBoost  RF     DT     RF + DT
Accuracy           0.56   0.85   0.96     0.99   0.97   1.00
Precision          0.58   0.80   0.95     0.99   0.94   1.00
F1 score           0.49   0.66   0.93     0.98   0.94   0.99
Accuracy of train  0.76   0.85   0.97     0.99   0.97   1.00
Accuracy of test   0.98   0.85   0.96     0.99   0.97   1.00
KNN: K-nearest neighbors; GB: Gradient boosting; XGBoost: XGBoost classifier; RF: Random Forest; DT: Decision Tree; RF + DT: hybrid of Random Forest and Decision Tree
Fig. 1. Graphical representation of comparative analysis of ML algorithms
Table 1 and Fig. 2 present the different classification metrics. The results show that our proposed hybrid of Random Forest and Decision Tree performs better than the single Random Forest and Decision Tree algorithms.
6 Discussion

This paper considered the combination of supervised classification algorithms applied to the student dataset in order to predict student performance. The performance of the tree models was systematically evaluated and analyzed to select the optimized model for the study. Our results show that the hybrid method brings a significant improvement: the hybrid of Decision Tree and Random Forest achieved the best results in terms of the accuracy and precision metrics. The comparison revealed that the ensemble technique obtains the best results compared with the single algorithms, as shown in Table 1 and Figs. 1 and 2.
Fig. 2. Confusion matrices of comparative analysis of ML algorithms (panels: Decision Tree, Random Forest, XGBoost, Gradient Boosting, k-nearest neighbors, Voting)
7 Conclusions

Digital learning has received much interest from participants, with the enrollment for each subject far exceeding that of school classes. To address these issues, we had to suggest a way of providing a high standard of virtual education, and that was to construct a learning outcomes forecasting model [9]. Learner demographic information and web
log data are collected by the e-learning website in order to employ the learning outcomes forecasting model for tracking the learning situation in real time [6]. We proposed a hybrid classification model for predicting student performance in online courses. In general, ensemble models combine multiple base models to improve the prediction performance. The hybrid is a useful technique, which comes in especially handy when a single model is not sufficient. Besides, the outcomes demonstrated that the ensemble Voting Classifier reached a better test score of about 99%.
References 1. Albreiki, B., Zaki, N., Alashwal, H.: A systematic literature review of student’performance prediction using machine learning techniques. Education Sciences 11(9), 552 (2021) 2. Gajwani, J., Chakraborty, P.: Students’ performance prediction using feature selection and supervised machine learning algorithms. In: Gupta, D., et al. (eds.) International Conference on Innovative Computing and Communications: Proceedings of ICICC 2020, Volume 1, pp. 347–354. Springer Singapore, Singapore (2021). https://doi.org/10.1007/978-981-155113-0_25 3. Aoulad, A.H, et al.: Prediction MOOC’s for student by using machine learning methods. In: 2021 XI International Conference on Virtual Campus (JICV). IEEE (2021) 4. Mohammad, A., et al.: Towards designing profitable courses: predicting student purchasing behaviour in MOOCs. Int. J. Artif. Intell. Educ. 31(2), 215–233 (2021) 5. Kloft, M., Stiehler, F., Zheng, Z., Pinkwart, N.: Predicting MOOC dropout over weeks using machine learning methods. In: Proceedings of the EMNLP Workshop on Analysis of Large Scale Social Interaction in MOOCs. pp. 60–65 (2014) 6. Jiajun. L., Li. C., Zheng, Li.: Machine learning application in MOOCs: Dropout prediction. In: 2016 11th International Conference on Computer Science & Education (ICCSE). IEEE (2016) 7. Guo, P.J., Reinecke, K.: Demographic differences in how students navigate through MOOCs. In: Proceedings of the First ACM Conference on Learning @ Scale Conference, Atlanta, GA, USA, pp. 21–30 (2014) 8. Tang, C., Ouyang, Y., Rong, W., Zhang, J., Xiong, Z.: Time series model for predicting dropout in massive open online courses. In: Rosé, C..P.., et al. (eds.) Artificial Intelligence in Education. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol. 10948, pp. 353–357. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-938462_66 9. Wenzheng, F., Tang, J., Liu, T.X.: Understanding dropouts in MOOCs. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33 (2019) 10. Kuzilek, J., Hlosta, M., Zdrahal, Z.: Open university learning analytics dataset. Sci. data 4(1), 1–8 (2017) 11. Aoulad Ali, H., et al.: A course recommendation system for MOOCs based on online learning. In: 2021 XI International Conference on Virtual Campus (JICV). IEEE (2021) 12. Panagiotakopoulos, T., et al. : Early dropout prediction in MOOCs through supervised learning and hyperparameter optimization. Electronics 10(14), 1701 (2021)
A Proposed Big Data Architecture Using Data Lakes for Education Systems Lamya Oukhouya1(B) , Anass El haddadi2 , Brahim Er-raha1 , Hiba Asri1 , and Naziha Laaz3 1 LIMA Laboratory, Ibn Zohr University, Agadir, Morocco
{l.oukhouya,b.erraha,h.asri}@uiz.ac.ma
2 DSCI Teams, Abdelmalek Essaadi University, Al-Hoceima, Morocco
[email protected] 3 LaGeS Laboratory, Hassania School of Public Works, Casablanca, Morocco
Abstract. Nowadays, educational data can be defined through the 3Vs of Big Data: volume, variety and velocity. Data sources produce massive and complex data, which makes knowledge extraction with traditional tools difficult for educational organizations. Indeed, the current architecture of data warehouses does not have the capability to store and manage this huge amount of varied data. The same goes for analytical processes, which no longer satisfy business analysts in terms of data availability and speed of query execution. These constraints have implied an evolution towards more modern architectures, integrating Big Data solutions capable of promoting smart learning for students. In this context, the present paper proposes a new big data architecture for education systems covering multiple data sources. Using this architecture, data is organized through a set of layers, from the management of the different data sources to their final consumption. The proposed approach includes a data lake as a means of modernizing decision-making processes, in particular data warehouses and OLAP methods. It will be used as a means of data consolidation for the integration of heterogeneous data sources. Keywords: Big Data Architecture · Data lake · Data warehouse · Education systems
1 Introduction

Data sources are increasingly producing large and complex data. According to the International Data Corporation, the volume of data will reach 175 zettabytes in 2025, or ten times the volume of data produced in 2018 [1]. This massive data brings new challenges to the architecture of decision support systems, in terms of data storage and analysis. Indeed, data warehouses and OLAP have been used for decades in the decision-making process [2]. However, they support only structured data sources. Big data can be defined through multiple characteristics, the "Vs", but we believe that the most important Vs of Big Data are
volume, variety and velocity. Data collected nowadays comes, in a speedy way, from different sources and can take different formats. Hence, traditional data storage techniques do not have the capacity to store and analyze this data. For these reasons, Big Data has become a major concern of many organizations, and the educational environment is no exception. Educational systems are no longer limited to a fixed course schedule or to a traditional classroom setting [4]. The use of new information and communication technologies for education has led to the distance learning industry [5]. This concept, which is increasingly adopted by universities, forms the basis of Educational Big Data (EBD), which is defined by volume, variety and velocity as well as value. This fourth characteristic (4th V) of Big Data makes it possible to extract new knowledge by analyzing large amounts of data, in order to improve the learning and teaching process [6]. Efficient big data analysis depends heavily on good data governance. However, the traditional data warehouse cannot handle the storage of large volumes of educational data, nor can it integrate heterogeneous data sources such as MOOCs and social media, including images and videos [7]. As a result, educational Big Data has driven an evolution of the architecture of decision support systems, integrating new systems together with traditional data warehouses. A hybridization of technologies can be envisioned with distributed NoSQL systems, offering the design of modern data warehouses to manage educational Big Data. With the scalability approach, data warehousing has been able to gain scalability and flexibility [8]; nonetheless, the problem of information silos has yet to be resolved. Indeed, these data silos store blocks of data in different applications, which limits the overall visibility of the data and therefore negatively impacts decision making [9]. In this sense, data lakes are introduced into the architecture of decision support systems as a Big Data storage and analysis solution, not only to break data silos [10], but also to modernize data warehouses [11]. This modernization can be achieved through data consolidation, which means that the data lake is used as a general data source for the data warehouse, allowing the design of a single data repository for the organization. In this article, we propose an architecture for educational Big Data. The purpose of this proposal is firstly to break down information silos by building a data lake. The data lake will also be used as a general data source for the data warehouse. The second objective of this architecture is to set up a modern data warehouse in order to apply analytical tasks, in particular OLAP analyses, to educational data. The article is organized as follows: Sect. 1 presents a general introduction, Sect. 2 discusses the general context of traditional data warehouses, modern data warehouses, and the data lake. Section 3 discusses the literature review, while Sect. 4 presents the proposed educational Big Data architecture. Section 5 presents a discussion about our contribution and the proposed model. Finally, Sect. 6 concludes the paper and highlights future work.
2 Background

2.1 Traditional Data Warehouse
The data warehouse is a strategic environment where business intelligence is collected, logged and stored, while ensuring efficient data management. The design of the data warehouse in an organization can be carried out according to one of the three classifications of data warehouse architecture, namely single-layer, two-layer, and three-layer architecture [12]. The first classification aims to reduce the volume of data stored by eliminating data redundancy. The two-layer classification consists of four consecutive steps to manage and process data, namely data source, data staging, data warehouse and data analysis. The last classification includes the same steps as the previous one with an additional step to materialize the operational data. This step, called data reconciliation, occurs after data sources are integrated and cleaned up. The ETL process is among the fundamental concepts of a data warehouse. It allows data to be extracted from operational and transactional sources, cleaned and transformed into fixed structured formats, before being loaded into the data warehouse. However, with the explosion of data produced by the internet, varieties of data have emerged, and the architecture of traditional data warehouses faces two limitations. The first is horizontal scaling, where adapting storage to the volume of data is very expensive. The second lies in the schema-on-write approach, which applies a fixed schema to the data before the data warehouse is created; such a schema cannot be applied to unstructured data. All in all, traditional data warehouses have increasingly been used in educational establishments. However, in the face of the limits presented previously, they must be modernized in order to manage Big Data.

2.2 Modern Data Warehouse
Educational big data can quickly exceed the capacity of a traditional data warehouse, frequently leading to blockage, or even failure, of the data processing and analysis system. It is for this reason that data storage architectures must have the ability to scale over time to adapt storage and processing to the increasing volume and variety of educational data. In this sense, modern data warehouses are increasingly being used, adopting distributed architectures to manage educational big data. This architecture is based on the scalability approach, using a cluster to distribute large data storage across multiple nodes. These nodes are adjusted gradually with the volume of data collected. Similarly, data processing is also distributed over the cluster. This architecture is more advantageous, especially with the inexpensive machines used to build the cluster. In addition, it ensures fault tolerance. It is therefore possible to build massive data warehouses based on this principle of extensibility of the storage space [13, 14]. These Big Data warehouses can apply ELT processes, in which the data is extracted and loaded into the data warehouse without processing (ELT). We are talking about the schema-on-read strategy, where the schema is applied to the data only at analysis time. Data warehouses with NoSQL systems are a good way to modernize educational organizations. Indeed, the design of a hybrid system between these two technologies is
used in various fields [24, 25]. This is achieved through the implementation of multidimensional conceptual models with NoSQL models at the logical and physical level [26], making it possible to design massive data warehouses capable of handling Big Data. 2.3 The Data Lake The data lake represents a new strategy for managing an organization’s big data. More precisely, it is a reservoir of data which collects and centralizes data in native format [15]. It adopts the schema on read approach, which gives it great flexibility in data storage and processing. The data lake stores all the data in flat format, regardless of their structured, semi-structured or unstructured types, either in streaming or batch. The architecture of a data lake is based on a set of concepts, namely, ingestion, storage, exploration, discovery and governance [12]. Data ingestion allows data to be collected and integrated into the data lake. Data storage must be scalable at a lower cost. It must also provide timely access to data. Data discovery is applied prior to data analysis. This allows the decision maker to derive values from the data in an intuitive way. Data governance is applied across the data lake to standardize information from various sources, ensure accuracy and transparency, and prevent misuse. Data quality and security are included in data governance, and finally data exploration allows the analyst to use data visualization and statistical techniques in order to understand the value of data before proceeding to the exploitation of this data. The implementation of such an architecture in higher education will make the education system in schools smarter and more innovative, through the availability of information and its sharing among a large number of users, including administrators and teachers. The extraction of the maximum value from the educational information is guaranteed by the data lake. However, a strategy to govern data is needed. In this sense, metadata management systems have become a necessity in the data lake, allowing structure and contextualization of the stored information, with the aim of avoiding the transformation of the data lake into a data swamp [16].
3 Literature Review To take advantage of educational data, traditional data warehouses are no longer appropriate. Modern architectures must be used to effectively exploit the wealth of educational settings. In [17], the authors present an educational big data platform using data lake to store data produced by data sources in raw format. The platform is fully developed to perform analytical tasks. It has 5 layers, including an architecture of 4 areas for the data lake namely, raw data area, refined data area, public data area, work area, and sensitive data area. Another work [18] presents a distributed architecture under the Hadoop ecosystem to analyze learning in higher education. This architecture is composed of 3 layers, data access layer uses Scoop to connect to the storage space; data storage layer represents the data lake described by the Hbase system and HDFS files. And finally, the data processing layer based on the Spark cluster, which applies predictive processing to improve student’s performance. In this architecture, the data lake is represented by a
single zone to store all the collected data, however, a multi-zone architecture allows a proper data management. The consolidation of data by the data lake for modern warehousing is not used that much in the literature. However, we have not found such architectures in the education sector. This triggered the idea of proposing an architecture that can be applied in educational field. In this work [19], the authors present a Big Data architecture for social media, using the data lake MongoDB as a data source for the NoSQL data warehouse. The treatments applied to the raw data loaded in this zone are stored separately in another NoSQL data warehouse, then, the data analysis area uses the data warehouse to apply the analytical processes. In the same vein, authors in [20] present a big data warehouse architecture, using the lake of for the grouping of heterogeneous data sources. The key components of this proposal are, Data Highway (data lake), OLAP Cube Computation, Adaptation Component and Data Analysis. The description of the structure of the data sources as well as the data lake is done by a metadata management system. This metadata covers the structural, semantic and evolutionary aspect of the data sources. However, a good metadata management system in data lake must cover other functionalities such as data versioning, data polymorphism, data evolution and data granularity. In the context of data Lakehouse approaches, the authors present a technical architecture to modernize traditional data warehouses [21]. This architecture is composed of a data lake area based on the Hadoop ecosystem, organized in 4 layers, layer Ingestion, Metadata Layer, Landing Data Layer and Governance Layer. Delta Lake architecture with Spark represents the second area, which is characterized by atomic data layer and departmental data layer, both of which are built on Object Storage. This architecture is interesting, because it represents the newest concepts in this field of system hybridization. Nevertheless, the authors have not demonstrated how a traditional data warehouse can adapt its structure to be loaded from a Big Data source.
4 Proposed System Data in educational organizations plays a catalytic role in moving towards Big Data architectures. The design of such an architecture makes it possible to efficiently explore different data sources using a set of data reporting, analysis and visualization tools. The quality attribute for the educational big data architecture proposed in this article is: data engineering, data platform and data visualization. Data engineering collects and integrates data produced by different data sources. The data platform provides security and access to data, as well as scalability and flexibility for data storage and analysis. It also has the ability to be compatible with reference technologies, and can even be enriched to cover multiple use cases. The third attribute data visualization can be associated with existing BI tools of the establishment and offer new ways to explore data. Our proposed Big Data architecture for an intelligent educational system is described in Fig. 1. This architecture is composed of 4 layers: data source layer comprising the different educational data sources, data consolidation layer represented by the data lake, the data warehouse layer (data warehouse, meta-store and the OLAP engine layer for the pre-calculated cubes) and the last layer is the use and consumption of data. The following subsections will detail the interest of each layer.
Fig. 1. Big Data Architecture for educational organizations
4.1 Data Source Layer
The data sources used include all the data generated by educational systems, including administrative management systems, financial systems and research systems. In this context, we can cite: Learning Management Systems, Massive Open Online Courses (MOOC), Open Educational Resources (OER) and Open Course Ware, social media, and linked data [22]. This layer supports three types of data: structured data represented by relational systems; semi-structured data described in JSON, CSV or XML files; and unstructured data such as log files, images, user messages or videos. This data is collected either in batch for RDBMS or as streams for XML or log files. The raw data is loaded into the data lake storage layer explained in the following subsection.

4.2 Data Consolidation Layer
Once the data is collected, it is the data lake's turn to consolidate the data sources. It consolidates the data in a single repository and stores it in flat format for future use. The data lake has great flexibility; however, without an adapted architecture it can turn into a data swamp. In order to avoid this problem, our data lake architecture is organized into zones that form a data highway, from raw data to loading into the data warehouse. ELT processes are used for storage in the data lake, while ETL processes are used for data transformation or data mining processing, as well as for some analytical processes. We distinguish the following zones: 1) Raw data zone: it ensures ingestion of the source data into the data lake in raw format. 2) Data process zone: it integrates, processes and transforms raw data using a set of processes to prepare data for analytical applications. 3) Trusted data zone: this zone contains the data of analytical processes requested by users. It is also used as a data source to feed the data warehouse. At this level, data is highly processed and clean, which makes this area of the data lake trustworthy in terms of data in the educational organization. 4) Access data zone: this zone gives access to all the zones of the data lake, either to feed them from the trusted zone or to perform analytical processes. Finally, the governance zone is applied to the four zones of the data lake. It provides metadata management, data security and quality, as well as data access control. A minimal ingestion sketch over these zones is given below.
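As a rough illustration of how these zones can be laid out on HDFS (the paths, file formats and column names are assumptions, not taken from the paper), a PySpark job can land a raw export in the raw zone and promote a cleaned version towards the trusted zone:

```python
# Sketch: moving data along the data-lake "highway" (raw -> process -> trusted zones).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("edu-data-lake").getOrCreate()

RAW = "hdfs:///datalake/raw/lms_logs"          # hypothetical zone paths
PROCESS = "hdfs:///datalake/process/lms_logs"
TRUSTED = "hdfs:///datalake/trusted/lms_logs"

# 1) Raw zone: land the source export as-is (the E and L of the ELT pattern).
raw_df = spark.read.option("header", True).csv("hdfs:///landing/lms_logs.csv")
raw_df.write.mode("append").parquet(RAW)

# 2) Process zone: apply the transformations requested by an analytical use case.
clean_df = (spark.read.parquet(RAW)
            .dropDuplicates()
            .withColumn("event_date", F.to_date("event_ts")))
clean_df.write.mode("overwrite").parquet(PROCESS)

# 3) Trusted zone: only validated datasets that may feed the data warehouse.
clean_df.filter(F.col("event_date").isNotNull()).write.mode("overwrite").parquet(TRUSTED)
```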
A data lake deserves to be built in educational organizations, especially in scalable projects that require more advanced architectures. In addition, data lakes are very useful in case the purpose of the data has not yet been defined.

4.3 Data Warehouse Layer
Nowadays, the basis of any decision-making system is to enable the production of decision-making or even operational reporting. To achieve this objective, setting up a data warehouse is essential. In this architecture, a modern data warehouse is used to store the data requested by users or applications. In this layer of the architecture, the data warehouse is loaded from the trusted data zone located in the data lake, with the aim of converting to a structured schema represented by a star schema or a snowflake schema. Moreover, metadata management is an important element for the smooth functioning of our architecture. The meta-store includes the different metadata captured from the raw data zone of the data lake to the data warehouse, through the data consumption area. The meta-store is created by a developer and its accessibility is based on a metadata management tool. The metadata collected can be categorized into four types: structuring metadata, contextualization metadata, mapping metadata and the metadata of the OLAP cubes. This metadata is crucial for the logical description of the data warehouse and is used during the pre-calculation process as well as during the execution of queries. Finally, the last component of this layer is the pre-computed OLAP cube. It is used in this architecture to make it easier for the different users of the educational organization to access and explore the big data stored in the data warehouse. In addition, OLAP cubes allow rapid execution of queries on the data sets represented by these precomputed cubes.

4.4 Data Consumption Layer
The consumption of data enables the extraction of information and helps improve decision-making throughout the education system. In this architecture, the exploitation of data is carried out at several levels and in various ways. Self-service can be used as a technology for various analyses, such as reporting, statistical analysis, business intelligence analysis, and machine learning algorithms. Data visualization is another graphical technique for data representation. These analyses and representations can be applied directly to the raw data zone of the lake, the trusted data zone, the data warehouse or the OLAP cubes. On the other hand, educational data mining (EDM) is among the most promising techniques for data consumption in educational organizations. This is achieved through the application of data mining, machine learning and statistics. The objectives of EDM are to predict the future learning behavior of students, to discover or improve domain models, to study the effects of educational support and to advance scientific knowledge about learning and learners [23].

4.5 Potential Application
We propose a potential application of our educational Big Data architecture. We believe that the use of the Hadoop ecosystem, with an original solution for the deployment of this
architecture, is an improvement point. The storage of raw sources in the data lake will be based on HDFS files, and Apache Hive will be assigned to data warehouse storage. The Apache Kylin solution will be adequate to implement the cube engine, since it is capable of producing cubes from Hive data structures. The cubes pre-calculated by Kylin will be stored in the NoSQL HBase database. Concerning the meta-store, we intend to exploit an RDBMS, and potentially Pentaho for the implementation of the ELT processes. A sketch of how trusted-zone data could be exposed to Hive for such cube computation is given below.
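One common way to connect the trusted zone to the Hive warehouse layer (shown here as an assumption, not the authors' implementation) is to register the curated files as a Hive table, which Kylin can then use as the source of its pre-computed cubes; the database, table and column names are hypothetical.

```python
# Sketch: exposing trusted-zone data as a Hive table for downstream OLAP (e.g., Kylin cubes).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("edu-dw-load")
         .enableHiveSupport()          # requires a configured Hive metastore
         .getOrCreate())

trusted = spark.read.parquet("hdfs:///datalake/trusted/lms_logs")  # hypothetical path

# Star-schema style fact table: one row per student, module and day of interaction.
fact = (trusted.groupBy("student_id", "module_id", "event_date")
               .count()
               .withColumnRenamed("count", "nb_clicks"))

# Assumes the edu_dw database already exists in the Hive metastore.
fact.write.mode("overwrite").saveAsTable("edu_dw.fact_student_activity")
```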
5 Discussion The objective of implementing modern architecture in the educational field is to improve the quality of teaching in its various aspects. By applying Big Data analytics, we can predict student scores and results, recommend courses, predict student success, and minimize dropout. Regarding the technologies used, we have the HADOOP ecosystem with all these tools such as Sqoop, HDFS, Yarn, MapReduce, Spark, NoSQL, Hive, Flink, Oozie. The Data Lake is a modern solution for storing and analyzing educational data. The data lake eliminates interoperability between education systems, and silos work by using a single data model, including different sources of education data. However, according to all related works cited above, the use of data lakes for higher education suffers from several problems. The first one is related to the architecture of the data lake where three limitations are highlighted: the lack of a specific raw data storage solution, the loss of data traceability due to the extreme number of areas in the data lake and finally, the limited access to data that are generated from a single area of the data lake. To overcome those problems, we develop a data lake architecture where you can benefit in terms of raw data storage, data lineage, and the ability to exploit data stored in different areas of the data lake. The second limitation represents a bad management of metadata that can be correctly implemented in the architecture of data lakes. In fact, metadata are essential to organize and integrate efficiently data in the data lake. However, existing architectures manage metadata in a task-specific manner such as structuring, contextualization or other tasks. A metadata management system must be generic enough to cover the different use cases a user may encounter, from data integration in the data lake to data analysis. To overcome those problems, we have presented a conceptual metadata management model for the data lake [16], which will be integrated into our architecture in the data warehouse layer presented in the previous section. This model has the ability to handle any case studies the user is facing during the analysis process. Moreover, it is characterized by functionalities that not only manage structuring and contextualizing metadata, but it also integrates other features such as schema evolution, data granularity description and data versioning. This model is a first step toward designing our metadata management system for this architecture, based on the OWL ontology.
6 Conclusion and Future Works

This paper presents a modern architecture for educational organizations. This architecture benefits from the use of the data lake in the consolidation of data sources in order to
modernize the traditional systems used in decision-making processes, in particular data warehouses and OLAP. The introduction of the data lake made it possible to bring the notion of scalability to the data warehouse in order to manage the evolution of data in terms of volume and variety. Our proposed model is organized into layers comprising the major components, namely the data sources, the data lake, the data warehouse, the OLAP engine, the meta-store and the analyzed data. Future work consists in designing our metadata management tool. The latter will be developed based on an OWL ontology, making it possible to manage the eight types of metadata necessary for the efficient exploitation of the data lake by the data warehouse, namely semantic enrichment, data polymorphism, data versioning, usage tracking, categorization, similarity links, metadata properties, and multiple granularity levels.
References 1. Janev, V.: Semantic intelligence in big data applications. In: Jain, S., Murugesan, S. (eds.) Smart Connected World, pp. 71–89. Springer, Cham (2021). https://doi.org/10.1007/978-3030-76387-9_4 2. Bimonte, S., Boussaid, O., Schneider, M., Ruelle, F.: Design and implementation of active stream data warehouses. In: Research Anthology on Decision Support Systems and Decision Management in Healthcare, Business, and Engineering, pp. 288–311. IGI Global (2021) 3. Xu, L.D., Duan, L.: Big data for cyber physical systems in industry 4.0: a survey. Enterp. Inf. Syst. 13(2), 148–169 (2019) 4. Cebrián, G., Palau, R., Mogas, J.: The smart classroom as a means to the development of ESD methodologies. Sustainability 12(7), 3010 (2020) 5. Abdullayev, A.A.: System of information and communication technologies in the education. Sci. World Int. Sci. J. 2, 19–21 (2020) 6. Jha, S., Jha, M., O’Brien, L.: A step towards big data architecture for higher education analytics. In: 2018 5th Asia-Pacific World Congress on Computer Science and Engineering, pp. 178–183. IEEE (2018) 7. Baig, M.I., Shuib, L., Yadegaridehkordi, E.: Big data in education: a state of the art, limitations, and future research directions. Int. J. Educ. Technol. High. Educ. 17(1), 1–23 (2020). https:// doi.org/10.1186/s41239-020-00223-0 8. Petricioli, L., Humski, L., Vrdoljak, B.: The challenges of NoSQL data warehousing. In: E-business Technologies Conference Proceedings, vol. 1, no. 1, pp. 44–48 (2021) 9. Wibowo, M., Sulaiman, S., Shamsuddin, S.M.: Machine learning in data lake for combining data silos. In: Tan, Y., Takagi, H., Shi, Y. (eds.) Data Mining and Big Data. Lecture Notes in Computer Science, vol. 10387, pp. 294–306. Springer, Cham (2017). https://doi.org/10.1007/ 978-3-319-61845-6_30 10. Patel, J.: Bridging data silos using big data integration. Int. J. Database Manage. Syst. 11(3), 1–6 (2019) 11. How, M.: The Modern Data Warehouse in Azure: Building with Speed and Agility on Microsoft’s Cloud Platform, 1st edn. Apress (2020) 12. Blaži´c, G., Pošˇci´c, P., Jakši´c, D.: Data warehouse architecture classification. In: 2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics, pp. 1491–1495. IEEE (2017) 13. Santos, M.Y., Costa, C.: Big Data: Concepts, Warehousing and Analytics. River Publishers (2020)
14. Martins, A., Martins, P., Caldeira, F., Sá, F.: An evaluation of how big-data and data warehouses improve business intelligence decision making. In: Rocha, Á., Adeli, H., Reis, L.P., Costanzo, S., Orovic, I., Moreira, F. (eds.) Trends and Innovations in Information Systems and Technologies. Advances in Intelligent Systems and Computing, vol. 1159, pp. 609–619. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45688-7_61 15. Sawadogo, P., Darmont, J.: On data lake architectures and metadata management. J. Intell. Inf. Syst. 56(1), 97–120 (2021). https://doi.org/10.1007/s10844-020-00608-7 16. Oukhouya, L., Elhaddadi, A., Er-raha, B., Asri, H.: A generic metadata management model for heterogeneous sources in a data warehouse. In: E3S Web of Conferences, vol. 297, p. 01069. EDP Sciences (2021) 17. Munshi, A.A., Alhindi, A.: Big Data Platform for Educational Analytics. IEEE Access 9, 52883–52890 (2021) 18. Alblawi, A.S., Alhamed, A.A.: Big data and learning analytics in higher education: demystifying variety, acquisition, storage, NLP and analytics. In: 2017 IEEE Conference on Big Data and Analytics, pp. 124–129. IEEE (2017) 19. Dabbèchi, H., Haddar, N.Z., Elghazel, H., Haddar, K.: Nosql data lake: a big data source from social media. In: Abraham, A., Hanne, T., Castillo, O., Gandhi, N., Rios, T.N., Hong, T.-P. (eds.) Hybrid Intelligent Systems. Advances in Intelligent Systems and Computing, vol. 1375, pp. 93–102. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-73050-5_10 20. Solodovnikova, D., Niedrite, L.: Change discovery in heterogeneous data sources of a data warehouse. In: Robal, T., Haav, H.-M., Penjam, J., Matuleviˇcius, R. (eds.) Databases and Information Systems. Communications in Computer and Information Science, vol. 1243, pp. 23–37. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57672-1_3 21. Saddad, E., Mokhtar, H.M.O., El-Bastawissy, A., Hazman, M.: Lake data warehouse architecture for big data solutions. Int. J. Adv. Comput. Sci. Appl. 11(8), 417–424 (2020) 22. Ang, K.L.M., Ge, F.L., Seng, K.P.: Big educational data and analytics: survey, architecture and challenges. IEEE Access 8, 116392–116414 (2020) 23. Khan, A., Ghosh, S.K.: Student performance analysis and prediction in classroom learning: a review of educational data mining studies. Educ. Inf. Technol. 26(1), 205–240 (2021). https:// doi.org/10.1007/s10639-020-10230-3 24. Sebaa, A., Chikh, F., Nouicer, A., Tari, A.: Medical big data warehouse: architecture and system design, a case study: improving healthcare resources distribution. J. Med. Syst. 42(4), 1–16 (2018). https://doi.org/10.1007/s10916-018-0894-9 25. Ngo, V.M., Le-Khac, N.-A., Kechadi, M.-T.: Designing and implementing data warehouse for agricultural big data. In: Chen, K., Seshadri, S., Zhang, L.-J. (eds.) Big Data – BigData 2019. Lecture Notes in Computer Science, vol. 11514, pp. 1–17. Springer, Cham (2019). https:// doi.org/10.1007/978-3-030-23551-2_1 26. Sellami, A., Nabli, A., Gargouri, F.: Transformation of data warehouse schema to NoSQL graph data base. In: Abraham, A., Cherukuri, A.K., Melin, P., Gandhi, N. (eds.) Intelligent Systems Design and Applications. Advances in Intelligent Systems and Computing, vol. 941, pp. 410–420. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-16660-1_41
A SVM Approach for Assessing Traffic Congestion State by Similarity Measures Abdou Khadre Diop1(B) , Amadou Dahirou Gueye1 , Khaly Tall2 , and Sidi Mohamed Farssi2 1 TIC4Dev Team, ICT Department, Alioune DIOP University, Bambey, Senegal
[email protected] 2 Polytechnic Higher School, Cheikh Anta DIOP University, Dakar, Senegal
Abstract. Nowadays, we are seeing more and more problems related to the analysis of large data acquired in CCTV systems. Among these, the evaluation of traffic jams, with the aim of detecting and predicting anomalies, is a perfect illustration. The use of support vector machines (SVM) has alleviated the problems of classification, detection and prediction of anomalies in several areas [1, 2]. In this paper, we propose an approach to assessing traffic jam conditions based on similarity measures to produce better classification and prediction of abnormal events in real time, considering the standardized PETS (Performance Evaluation of Tracking and Surveillance) datasets. The results obtained allow the events to be classified by a logistic regression at a rate of 99.88%. Keywords: Support Vector Machine (SVM) · CCTV systems · Similarity measures · Traffic congestion
1 Introduction
Given the regularity of traffic jams and the resurgence of accidents on main roads and even highways, CCTV systems have made an increasingly significant contribution, allowing the acquisition of a large amount of visual data [3, 4]. Faced with the large amount of data acquired through these systems, visual data analysis requires exploring intelligent technologies to help administrators make the right decisions [5]. Among intelligent technologies, machine learning occupies an important place. In artificial intelligence, Support Vector Machines (SVMs) are machine learning algorithms for solving classification, regression, or anomaly detection problems. In the implementation of these algorithms, the data is separated into several classes with a maximum margin, the border being chosen to maximize the distance between the groups of data [6–8]. The difficulty is to determine this optimal frontier: to achieve this, we must formulate the problem as a quadratic optimization problem. The points closest to the border are called support vectors. As the visual data acquired via CCTV systems constitutes big data, the exploitation of SVMs will make it possible to solve the problems of regression, anomaly detection and possibly prediction of certain anomalies present in these systems, such as traffic jams and accidents [6–8]. Considering that the traffic varies depending on the time of day and that similarity measurements between two successively transmitted images make it possible to assess their resemblance, we propose in this article an approach based on similarity measurements in order to evaluate the state of the congestion in a CCTV system. To describe our work, the paper is structured in several parts: the first part briefly introduces the identification of traffic with standardized datasets (PETS); the second part describes our proposed methodology based on similarity measures; the third part presents the implementation and evaluation of the algorithm of the proposed method.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ben Ahmed et al. (Eds.): NISS 2022, LNDECT 147, pp. 63–72, 2023. https://doi.org/10.1007/978-3-031-15191-0_7
2 Traffic Congestion Identification
Many methods have been developed for traffic congestion identification, which is a complicated problem. Many factors affect the traffic congestion state, such as temperature, road infrastructure, traffic flow parameters and so on. The mainly used traffic flow parameters are volume, speed and time occupancy, which are useful for traffic congestion detection [9]. Traffic congestion relates not only to the current intersection but also to the upstream and downstream intersections. For each intersection, it includes the current sample time and the previous sample time's value. To improve the identification rate, a large amount of historical traffic data should be used to find the underlying principles. Many papers in the literature address this problem; in [10], the authors proposed a traffic congestion identification method that takes only the traffic flow parameters into consideration to detect the traffic state. Combined with a parallel SVM method, they identified traffic congestion at a precision rate of 94.6% for a duration time of 63 s. To evaluate the traffic level of a road section in real time while considering road speed, road density, road traffic volume and the rainfall of road sections, the authors of [11] use fuzzy theory. Their proposed SVM-based real-time highway traffic congestion prediction (SRHTCP) model collects the road data, the traffic events reported by road users and the weather data from the national specialized institution. In addition, the SRHTCP model predicts the road speed of the next time period by exploring streaming traffic and weather data. Results showed that the proposed SRHTCP model improves prediction accuracy by 25.6% compared with a prediction method based on a weighted exponential moving average, under the mean absolute relative error measure. In the same context, the authors of [3] propose solutions based on mathematical tools which make it possible to measure the similarity between two successive frames acquired via CCTV systems. The SSIM and the cross-correlation are compared; the SSIM metric provides better performance than the cross-correlation with respect to processing time and probability distributions. Consequently, it is possible to predict traffic jams as soon as the SSIM metric value exceeds a certain threshold. In this paper, we take only the similarity measurements, which exploit the temporal correlation, into consideration to detect the traffic state and eventually to predict abnormal events. Several types of similarity and dissimilarity measures are used in data science. The similarity measure is a way of measuring how data samples are related or close to each other [12, 13]. On the other hand, the dissimilarity measure tells how much the data objects are distinct. Moreover, these terms are often used in clustering, when similar data samples are grouped into different clusters. It is also used in classification, where the
data objects are labeled based on the feature's similarity [14]. The similarity measure is usually expressed as a numerical value. It gets higher when the data samples are more alike. It is often expressed as a number between zero and one by conversion: zero means low similarity and one means high similarity (the data objects are very alike) [15, 16]. For similarity measures there are the Jaccard similarity coefficient, the Dice similarity coefficient and the BF (Boundary F1) score.
2.1 Jaccard Similarity Coefficient
The Jaccard similarity coefficient of two sets A and B (also known as intersection over union or IoU) is expressed as (1):
jaccard(A, B) = |A ∩ B| / |A ∪ B|   (1)
where |A| represents the cardinality of the set A. The Jaccard similarity coefficient is returned as a numeric vector with values in [0, 1]. A similarity of 1 means that the segmentations in two images are a perfect match. The Jaccard index can also be expressed in terms of true positives (TP), false positives (FP) and false negatives (FN) as (2):
jaccard(A, B) = TP / (TP + FP + FN)   (2)
2.2 Dice Similarity Coefficient
The Dice similarity coefficient of two sets A and B is expressed as (3):
dice(A, B) = 2 ∗ |A ∩ B| / (|A| + |B|)   (3)
where |A| represents the cardinality of the set A. The Dice similarity coefficient is returned as a numeric scalar or numeric vector with values in the range [0, 1]. A similarity of 1 means that the segmentations in the two images are a perfect match. The Dice index can also be expressed in terms of true positives (TP), false positives (FP) and false negatives (FN) as (4):
dice(A, B) = 2 ∗ TP / (2 ∗ TP + FP + FN)   (4)
2.3 BF (Boundary F1) Score The BF score measures how close the predicted boundary of an object matches the ground truth boundary. The BF score is defined as the harmonic mean (F1 measure) of the precision and recall values with a distance error tolerance to decide whether a point on the predicted boundary has a match of the ground truth boundary or not.
The BF score is returned as a numeric scalar or vector with values in the range [0, 1]. A score of 1 means that the contours of objects in the corresponding class in prediction and ground truth are a perfect match (5):
score = 2 ∗ (Precision ∗ Recall) / (Precision + Recall)   (5)
In order to train the proposed model, the SSIM (Structural Similarity Index Metric), a metric measuring similarity based on image structures, is considered.
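As an illustration of how these similarity measures can be computed on binarized frames, the following Python sketch implements Eqs. (1) and (3) with NumPy. It is only a minimal illustration: the original study was carried out in Matlab, and the small example masks below are placeholders rather than PETS data. The SSIM value between two frames can be obtained from an image-processing library in the same spirit and appended to the same feature vector.

import numpy as np

def jaccard(a, b):
    # Jaccard index (IoU) of two binary masks, Eq. (1)/(2).
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else np.logical_and(a, b).sum() / union

def dice(a, b):
    # Dice coefficient of two binary masks, Eq. (3)/(4).
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

# Toy masks standing in for two binarized consecutive frames (1 = vehicle pixel).
frame_t = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]])
frame_t1 = np.array([[0, 1, 1], [0, 0, 0], [0, 1, 0]])
print(round(jaccard(frame_t, frame_t1), 3), round(dice(frame_t, frame_t1), 3))  # 0.5 0.667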
3 Methodology Based on the concept of Big Data, the paper proposes a mechanism that uses normalized PETS datasets to collect data and analyze Highway traffic information. The proposed model, based on similarity measures, utilizes SVM to classify several events and to predict abnormal events. The objective of our proposed approach, based on the SVM model, is to classify types of traffic observed in a CCTV system. To do this, we go through a series of operations illustrated in Fig. 1. 3.1 Data Collection This phase allows data acquisition. As there are normalized PETS datasets, we consider these data (video sequences) acquired by CCTV systems. The datasets considered are presented in three forms of traffic: Low Traffic, Medium Traffic and High Traffic. For each of these datasets, we extract the raw images constituting it [17]. The different datasets can be illustrated in Fig. 2.
Fig. 1. Classification algorithm based on the SVM model
Fig. 2. Normalized PETS datasets for different traffic (Low Traffic, Medium Traffic, High Traffic)
3.2 Pre-treatment
This step consists of removing noise from the images, such as colors, spots and discontinuities, through the following operations:
1. Binarization of the raw image;
2. Replace each black pixel surrounded by white pixels with a white pixel;
3. Replace each white pixel between two black pixels, either vertically or horizontally, with a black pixel.
In this same phase, the feature extraction takes place, consisting in determining the feature vectors for the cars present in each raw image of each of the datasets. The use of a sliding window enables the feature vectors to be optimized (see the sketch at the end of this section).
3.3 SVM Learning
In this phase, we build, from a set of images for the different types of datasets, a multimedia database containing the various traffic types cited and their characteristic vectors. Then, a decision model is obtained by the SVM approach.
3.4 Classification
In this phase, we take images of target traffic to extract the feature vectors of the cars that constitute them, then we use the decision model obtained in the learning phase to distinguish the Medium Traffic and High Traffic that can be considered as abnormal events.
3.5 Results
In this phase, alerts can be generated to predict abnormal events.
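The following sketch illustrates the pre-treatment rules of Sect. 3.2 (binarization plus the two pixel-cleaning rules). The binarization threshold and the convention that foreground (vehicle) pixels are coded as black are assumptions made for the example; the paper does not specify them.

import numpy as np

def pretreat(gray, threshold=128):
    # 1. Binarization: 1 = black (foreground), 0 = white (background). Assumed convention.
    img = (gray < threshold).astype(np.uint8)
    h, w = img.shape
    out = img.copy()
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            neighbours = img[i - 1:i + 2, j - 1:j + 2].sum() - img[i, j]
            # 2. A black pixel surrounded by white pixels becomes white.
            if img[i, j] == 1 and neighbours == 0:
                out[i, j] = 0
            # 3. A white pixel between two black pixels (vertically or horizontally) becomes black.
            elif img[i, j] == 0 and ((img[i - 1, j] == 1 and img[i + 1, j] == 1) or
                                     (img[i, j - 1] == 1 and img[i, j + 1] == 1)):
                out[i, j] = 1
    return out

noisy = np.full((5, 5), 255, dtype=np.uint8)  # all-white grayscale frame
noisy[2, 2] = 0                               # one isolated dark (noise) pixel
print(pretreat(noisy).sum())                  # 0: the isolated pixel was removed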
4 Experiments 4.1 Data Sources In this subsection, we are interested in the construction of the corpus for each of the three datasets that we have considered. The methods defined to measure the similarity are subjected to a comparison in order to retain the most efficient ones for the training of the model. Thus, Figs. 3, 4 and 5 show that Jaccard and Dice bring better performance. For the classification of abnormal traffic, we therefore consider in the following the Jaccard and Dice similarity measures in the traffic classification based on the SVM model.
Fig. 3. Similarity measures in low traffic
Fig. 4. Similarity measures in medium traffic
While the SSIM metric makes it possible to predict the abnormal event "High Traffic" beyond a threshold [3], our classification will be based on the SSIM metric together with the Jaccard and Dice similarity measures, which provided better performance than the BF score.
Fig. 5. Similarity measures in high traffic
4.2 Traffic Classification Based on SVM An SVM is a supervised learning algorithm used for many classification and regression problems, such as medical applications of signal processing, natural language processing, speech recognition, and image recognition [11, 18]. The goal of the SVM algorithm is to find a hyperplane representing the largest margin between the two classes, indicated by "Medium Traffic" and "High Traffic" and shown in Fig. 6. The implementation of the proposed algorithm makes it possible to separate the data relating to the "Medium Traffic" event and the data relating to the "High Traffic" event. Then it will be possible to determine the traffic state in a CCTV system (Figs. 6 and 7).
Fig. 6. Traffic classification based on SVM
Figure 6 shows that we can find a hyperplane whose margin, the maximum width of the slab parallel to the hyperplane that does not contain any data points, is maximal. Such a hyperplane shows that our approach to assessing traffic fits into linearly separable problems. To visualize the performance of the multi-class classification model, we use the AUC-ROC curve. ROC stands for Receiver Operating Characteristics and
AUC stands for Area Under the Curve. It is the graph that shows the performance of the classification model at different thresholds. The proposed approach allows "High Traffic" and "Medium Traffic" to be classified with an AUC of 99.8%, as illustrated in Fig. 7.
Fig. 7. Classification “high traffic” and “medium traffic” by logistic regression
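For readers who want to reproduce the classification step, a minimal sketch in Python/scikit-learn is given below. It trains a linear SVM on similarity feature vectors and reports the ROC AUC. The data here are synthetic placeholders (under the assumption that consecutive frames look more alike when traffic is congested); the original experiments were run in Matlab on the PETS corpus, so this is not the authors' implementation.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# One row per pair of consecutive frames, columns = similarity features
# (e.g. Jaccard, Dice, SSIM); y = 0 for "Medium Traffic", 1 for "High Traffic".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.55, 0.08, (200, 3)),   # medium traffic: lower frame similarity
               rng.normal(0.85, 0.08, (200, 3))])  # high traffic: higher frame similarity
y = np.concatenate([np.zeros(200), np.ones(200)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = SVC(kernel="linear", probability=True).fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]
print("AUC:", roc_auc_score(y_te, scores))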
By varying the sliding window, we obtain the AUC values entered in Table 1, together with the processing time and the number of video sequences obtained. It is important to note that the wider the sliding window, the more the processing time and the number of images in the corpus decrease. All the results presented in this paper were obtained with Matlab R2020 installed on a machine with the following characteristics: Processor [Intel(R) Core(TM) i5-4200 CPU @ 2.50 GHz]; Installed memory (RAM) [8.00 GB]; System Type [64-bit Operating System, x64-based processor].
Table 1. AUC values obtained by sliding window

Sliding windows | AUC (%) | Processing time | Number of sequences
3  | 99.56 | 16.40 | 44
4  | 99.92 | 15.42 | 43
5  | 99.89 | 15.38 | 42
6  | 99.92 | 14.45 | 41
7  | 99.92 | 14.44 | 40
8  | 99.85 | 13.81 | 39
9  | 99.87 | 13.66 | 38
10 | 99.88 | 13.13 | 37
20 | 99.80 | 9.57  | 27
5 Conclusion and Future Work
Over the past several decades, traffic congestion identification has become increasingly problematic. This paper presents an SVM approach to assess traffic congestion by similarity measures in order to classify and predict abnormal events. Considering a sliding window variation, the proposed solution provides satisfactory AUC results, which can reach 99% with a large sliding window. To optimize the performance of the proposed solution, it should be deployed in data centers that allow high-performance computation; thus, the processing time can be minimized. The problem addressed in this paper concerns linear problems in machine learning. In future work, we plan to explore unsupervised methods to alleviate the problems encountered. We also plan to explore the hyperparameters to assess the sensitivity of the proposed approach and to compare the different solutions offered.
References 1. Shetty, S., Rao, Y.S.: SVM based machine learning approach to identify Parkinson’s disease using gait analysis. In: 2016 International Conference on Inventive Computation Technologies (ICICT), pp. 1–5 (2016) 2. Yang, Z., Lin, C., Gong, B.: Support vector machines for incident detection in urban signalized arterial street networks. In: 2009 International Conference on Measuring Technology and Mechatronics Automation, pp. 611–616 (2009) 3. Diop, A.K., Gueye, A.D., Tall, K., Farssi, S.M.: Measuring similarity in CCTV systems for a real-time assessment of traffic jams. In: 2020 3rd IEEE International Conference on Information Communication and Signal Processing, pp. 460–464 (2020) 4. Alifi, M.R., Supangkat, S.H.: Information extraction for traffic congestion in social network. In: 2016 International Conference on ICT for Smart Society, pp. 53–58 (2016) 5. Jiang, J., Zhang, X.-P., Loui, A.C.: A new video similarity measure model based on video time density function and dynamic programming. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1201–1204 (2011)
6. Zhao, M., Liu, Y., Sun, D., Cheng, S.: Highway traffic abnormal state detection based on PCA-GA-SVM algorithm. In: 2017 29th Chinese Control and Decision Conference (CCDC), pp. 2824–2829 (2017) 7. Nguyen, H.N., Krishnakumari, P., Vu, H.L., van Lint, H.: Traffic congestion pattern classification using multi-class SVM. In: 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pp. 1059–1064 (2016) 8. Rewadkar, D.N., Aher, C.N.: Vehicle tracking and positioning in GSM network using optimized SVM model. In: International Conference on Current Trends in Engineering and Technology (ICCTET), pp. 187–191 (2014) 9. Chang, A., Jiang, G., Niu, S.: Traffic congestion identification method based on GPS equipped floating car. In: 2010 International Conference on Intelligent Computation Technology and Automation, pp. 1069–1071 (2010) 10. Sun, Z., Feng, J., Liu, W., Zhu, X.: Traffic congestion identification based on parallel SVM. In: 2012 8th International Conference on Natural Computation (ICNC 2012), pp. 286–289 (2012) 11. Tseng, F.-H., Tseng, C.-W., Yang, Y.-T., Chao, H.-C.: Congestion Prediction with big data for real-time highway traffic. In: IEEE Access, vol. 6, pp. 57311–57323 (2018) 12. Adel, N., Crockett, K., Crispin, A., Chandran, D.: FUSE (fuzzy similarity measure) - a measure for determining fuzzy short text similarity using interval type-2 fuzzy sets. In: 2018 IEEE International Conference on Fuzzy Systems (FUZZ), pp. 1–8 (2018) 13. Wongchaisuwat, P.: Semantic similarity measure for Thai language. In: 2018 International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), pp. 1–6 (2018) 14. Song, A., Kim, Y.: Improved iterative error analysis using spectral similarity measures for vegetation classification in hyperspectral images. In: 2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 2662–2665 (2018) 15. Asikuzzaman, M., Pickering, M.R.: Object-based motion estimation using the EPD similarity measure. In: IEEE 2018 Picture Coding Symposium (PCS), pp. 228–2320 (2018) 16. Cross, V., Mokrenko, V., Crockett, K., Adel, N.: Using fuzzy set similarity in sentence similarity measures. In: 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8 (2020) 17. Diop, A.K., Tall, K., Farssi, S.M., Samb, N.A., Tambedou, M.: Cross-correlation to assess the state of traffic jam in CCTV systems. In: International Conference on Electrical, Communication and Computer Engineering (ICECCE), pp. 576–579 (2020) 18. Zhao, J.D., Bai, Z.M., Chen, H.B.: Research on road traffic sign recognition based on video image. In: 2017 10th International Conference on Intelligent Computation Technology and Automation, pp. 110–113 (2017)
A Tool for the Analysis, Characterization, and Evaluation of Serious Games in Teaching Farida Bouroumane(B)
and Mustapha Abarkan
Laboratory of Engineering Sciences, Faculty Polydisciplinaire, Taza, Morocco {farida.bouroumane,mustapha.abarkan}@usmba.ac.ma
Abstract. In recent years, digital gaming has attracted growing interest in the education sector. The serious game is an interesting tool for renewing teaching practices. However, its integration into education is not simple and several problems limit its adoption. The first obstacle is the difficulty experienced by teachers. The second obstacle is the relevance of the choice of the game to adopt. Indeed, the process of adopting a game-based pedagogy requires accompanying teachers with tools that facilitate understanding and effective implementation in a learning situation. In this article, we aim at measuring the usability of a serious game in order to guide the teacher toward effective evaluation of its pedagogical and ludic potential for an adequate insertion in the teaching process. Keywords: Serious games · Usability · Teacher
1 Introduction Right now, with the COVID-19 crisis, digital tools have taken an important place in education as in other areas. Game-based pedagogy has become a common practice in the world of education, especially at a time when the Internet has taken a considerable importance, which has generally made video games a favorite hobby of the new generation. The aim of this pedagogy is not to replace traditional teaching, but to complement it by making the learner benefit from interactivity and motivation. The serious game (SG) is an educational tool of our time and is increasingly present in the world of education that uses the elements of the video game to teach particular knowledge. Now, it covers almost all areas [1–3]: health, education, economics, politics, ecology, and others. Thanks to their numerous strengths[4–6], SG is a potentially useful tool for teachers to train their learners on various issues in a way that differs from traditional teaching. Indeed, it allows teaching knowledge that would be difficult to teach either for reasons of dangerousness, time, cost and others. In addition, it is an entertaining tool based on a problem-solving approach, which allows increasing the engagement of learners and the acquisition of skills. Although several experiments have assured that SGs are very promising to strengthen teaching, they are not always accepted by teachers. It is clear that teachers react with SGs according to their experiences with digital technology. Indeed, young teachers who have grown up with digital technologies are more likely to accept and positively respond to the use of SGs in the classroom. In addition, teachers who © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ben Ahmed et al. (Eds.): NISS 2022, LNDECT 147, pp. 73–83, 2023. https://doi.org/10.1007/978-3-031-15191-0_8
work in the fields of information technology and computer science are more receptive to the adoption of game-based pedagogy than other teachers are. Likewise, research studies [7–10] affirm that the effectiveness of the use of SGs in education depends on the teacher who chooses adapted SGs to the knowledge of his learners and compatible with the pedagogical objectives to be achieved. This explains that the teacher must necessarily test the SG before using it in the classroom. Indeed, the study by Hébert et al. [11] confirms that there is a need for education that takes into account the barriers that teachers face in adopting SGs into their pedagogies. For this, it seems essential to accompany teachers with tools that facilitate the analysis and evaluation of SGs so that they could effectively integrate them in their teaching and offer their learners games according to their educational needs. The purpose of this research work is to propose an approach to analyze and evaluate the ludic and educational potential of a SG. On the other hand, it seeks to put in place a tool describing all the necessary elements to understand SGs for an acceptable and effective adoption in teaching. After this introduction, the second section describes the related works, where research on approaches to the evaluation of SGs is presented. The third section is reserved for the presentation of the principle of the proposed analytical approach. The fourth section is devoted to the used protocol to evaluate this approach. Finally, we discuss in detail the results of the evaluation.
2 Related Works There are several approaches to the evaluation of SGs. However, most of the models aim at analyzing the quality of SGs design phases, and to evaluate their effectiveness in a specific training process. There are a few approaches that can help teachers assess SGs that may be most effective and acceptable in their learning context and in particular field. Marfisi-Schottman et al. [12], proposed the tool LEGADEE (Learning Game Design Environment) to help designers analyze the quality of their SGs to reduce the time and design costs. Similarly, Djelil [13] proposed a methodology for evaluating SGs to reduce the risk of poor design, using analytical and empirical evaluation methods based on the three dimensions: usability, usefulness, and acceptability of SGs. The CEPAJe proposed model by Alvarez and Chaumette [14] that is inspired by the work of Freitas and Oliver [15] allows the teacher to question himself on 5 dimensions in the conception of SGs: the dimensions « context », « Teacher », « Pedagogy », « Learner » and « Game » to maximize the balance between ludic experience and maximizing learning. The authors ÁvilaPesántez et al. [16], assert that the evaluation of SGs consists in ensuring beforehand that the educational objectives have been achieved. On the contrary, the GameFlow model proposed by Sweetser and Wyeth [17] evaluates the pleasure provided by games such as concentration, the challenge of the game, the feeling of control, immersion, and social interaction. The MoPPLiq model is accompanied by a tool called APPLiq proposed by Marne [18] is able to describe the ludic and educational aspects of the scenario of an SG and make this scenario understandable and manipulable by teachers. Giani Petri et al. [19] proposed a new method (MEEGA +) improving the initial version of the MEEGA model to confirm the expected benefits of the games through systematic evaluation. This model is based on the objective, question, and metric approach [20].
These methods and tools provide a framework for evaluating SGs. However, even if teachers are convinced of the potential of SGs in education, they will have difficulty to appropriate them to change their pedagogy and have more difficulty adapting to computer tools to build their own games. The central objective of this work is, therefore, to define a model that will guide teachers in the analysis of SGs without having new roles that require computer skills.
3 Research Method In this section, we present a thought process to teachers who would like to effectively integrate SGs y into their teaching. 3.1 Experimental Study To assess teachers’ knowledge on the value of using SGs during a learning activity, an experimental study for teachers in the Department of Computer Sciences, Economics and Biology of the Multidisciplinary Faculty of Taza, and the Multidisciplinary Faculty of Nador have been conducted. The choice of disciplines is dictated by the nature of the study and its aims. In the mentioned departments, teachers are fairly acquainted with computers and modern teaching technologies. The 36 teachers participating in this experiment are external to this research project and have no experience in SGs. This study sought to capture both positive and negative perceptions of SGs. The experimental study was carried out in two distinct phases in the following way: The first stage is devoted to presenting the operation and use of the three games selected for this study (see Table 1) below. The tested three games during this experimental study showed a positive appreciation for the motivation and commitment of the learners. The second phase of the questionnaires consists of 15 questions divided into 3 sections. The first section deals with teachers’ general assessments. The second section deals with the level of interest in the content. The last section concerns the use of SGs. Respondents have access to online questions; the link is sent to them by email. On this occasion, we asked teachers to mention their age, gender, and skills in the use of general and educational computer software and also to indicate their views on the possible use of serious games. The average age of participants is 35 years with extremes ranging from 30 to 45 years. Also, the majority of participants were male, 66.66% (24 teachers). Group 2 (Leuco’war) was strictly formed by male teachers. This does not mean that the games are especially appreciated by men. But the numbers of women in the selected departments remain in the minority compared to men. We also point out that teachers report having the necessary skills to handle word processing, presentation, and spreadsheet software but no mastery for the use of SGs.
Table 1. The serious games of experimental study.

Game | Thematic
Code combat | Game teaches different programming languages (JavaScript, Python and other)
Leuco'war | Game allows students to visualize the functioning of the immune system following an infection
Cartel euros | Game allows students to learn management
Our study included 36 teachers with a participation rate of 100%. Also, the response rate to the questionnaire was 100% for these participants. Figure 1 presents the results of the experimental study concerning the level of interest in the contents.
Fig. 1. Results of the experimental study concerning the level of interest in the contents.
Participants’ impressions were positive about the level of realism of the serious games used, as well as the interest of the content. Indeed, most of the participants specified the need to adapt the scenario of the game according to the pedagogical objectives to be achieved, the characteristics of the learners, and the method of work of each teacher. Figure 2 presents the results of the experimental study on general assessment.
Fig. 2. Results of the experimental study on general assessment.
Teachers were generally satisfied, the majority had agreed on the need for such tools during a learning situation. They can help to better understand certain concepts and motivate learners. Figure 3 presents the results of the experimental study on the use of SGs.
Fig. 3. Results of the experimental study on the use of serious games.
The majority of teachers had identified technical and material difficulties that can disrupt the use of games, such as installation problems, regular "crashes", access to a sufficient number of computer stations, and others. 3.2 Proposed Approach The adoption of a SG by the teacher is conditioned by the suitability and compatibility of the elements of the SG with the practices and pedagogical objectives of the teacher as well as the characteristics of the learners (skills, expectations, needs, and others). To our knowledge, few studies are interested in teachers' perceptions of the adoption of SGs in their teaching practices. Clearly, if teachers are not motivated, or if they lack a positive attitude toward using a SG in a learning activity, it will be difficult for them to adopt this pedagogy. In general, each serious game has its own interface, and each interface must be adapted to the needs of its use by the teacher, which can influence the process of adoption of SGs by teachers. Usability is a concept that denotes the qualitative property of an interface to be easily usable. Some studies [21, 22] show that the concept of usability encompasses the handling and ease of use of a SG, that is to say that the features and information on the interface of a SG must be easily identifiable, legible, understandable, and consistent with the intended pedagogical objectives to meet the teacher's expectations. Again, according to ISO 9241–11 [23], the concept of usability
refers to the degree to which a product can be used by identified users, and achieve defined goals effectively, efficiently, and satisfactorily in a specified context of use. Based on the proposals expressed, we have constructed a research model (see Fig. 4) based mainly on the definition of usability proposed by ISO 9241–11.
Fig. 4. Proposed structural model.
A game is effective if it fulfills its aim and objective. This is the accuracy and degree of completion with which learners will achieve the learning objectives. So, the content and components of the game must be presented to the learner so that he could perceive them. Thus, the content must be understandable and robust enough to be interpreted reliably by the learner.
Then, a game is efficient if it provides the best performance. Indeed, it is the ability of the game to enable learners to achieve learning objectives with a minimum of effort. So, the content and components of the game must be presented to the learner in an easy way and compatible with the skills, expectations, needs of the teacher and the learner.
Finally, a game satisfies the teacher if it is consistent with what was expected. It is a subjective assessment coming from a comparison between what the act of use brings to the individual and what he expects to receive. So, the content and components of the game must be adaptable and allow easy use for the teacher to perform specific tasks.
Through this study, which is concerned with the concept and criteria of usability of a SG, we defined a process of analysis, characterization and evaluation of a SG to judge its relevance in a specific learning situation. Figure 5 presents a description of the steps for verifying the ludic and pedagogical potential of a SG.
Fig. 5. Steps in the process of verifying a serious game.
In order to be able to judge the effectiveness of a SG for a given pedagogical situation, it is essential to evaluate the criteria of usability of an SG (see Table 2) that make it possible to give a judgment on it.
Table 2. Criteria for analyzing the usability of a serious game.

Aspect | Criteria | Description
Effectively | Adaptability | Measures the ability of the game to react according to the context, the needs, and the characteristics of the learners
Effectively | Precision | Measures the adequacy between the learners' needs and the appropriate choice of words and used symbols in the pedagogical and ludic scenario
Efficiently | Learnability | Measures the ease with which the learner discovers the game features and achieves the learning objectives
Efficiently | Compatibility | Measures the agreement between the characteristics of the learners, the tasks, and the interface and interaction elements. Also, it measures the compatibility of the game with the teacher's pedagogical style, his usual steps as well as his pedagogical organization
Satisfactorily | Flexibility | Measures the ability of the game to adapt to the varied actions of teachers
Satisfactorily | Accessibility | Measures the ability of the game to allow easy use for the teacher to perform specific tasks
4 Evaluation
In this section, we outline the methodology used to assess our approach. This study was carried out in two stages: the first was devoted to the implementation of a tool to analyze, characterize and evaluate an SG; in the second step, we established an evaluation protocol that allows us to measure the usability, usefulness, and acceptability of the tool we propose.
4.1 Presentation of the Analysis Tool
Our tool uses dynamic navigation on a set of questions grouped in a hierarchical structure over three main phases. Each phase proposes a list of criteria. This environment was designed to guide teachers in the choice of games adapted to the prerequisites of their learners and to their educational objectives, without involving complicated procedures requiring computer skills. Each game has an associated list of details such as the classification, the number of published reports, the results of analyses carried out by teachers and others. The teacher can then use these details to appreciate the qualities and usefulness of a game in relation to predefined teaching areas. The games deposited on the tool are stored in a database. The latter also stores other information such as user profiles, user traces, analysis results and others. As shown in Fig. 6, users can perform three types of actions, namely: perform a game analysis to generate an analysis report, import published analysis results to generate a custom report, and compare analysis results published by other teachers.
Fig. 6. Structure of the serious game analysis tool.
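A purely hypothetical sketch of how the entities handled by the tool (games, analysis reports, published results) might be modeled is given below; it is not the authors' implementation, and all class and field names are illustrative.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SeriousGame:
    title: str
    classification: str            # e.g. discipline or game genre (illustrative)
    published_reports: int = 0

@dataclass
class AnalysisReport:
    game_title: str
    teacher: str
    # One score per usability criterion (adaptability, precision, ...).
    criteria_scores: Dict[str, float] = field(default_factory=dict)
    published: bool = False

def compare_reports(reports: List[AnalysisReport], criterion: str) -> List[Tuple[str, float]]:
    # Rank published reports of different teachers on a single criterion.
    rows = [(r.teacher, r.criteria_scores[criterion])
            for r in reports if r.published and criterion in r.criteria_scores]
    return sorted(rows, key=lambda t: t[1], reverse=True)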
4.2 Evaluation Protocol In order for the proposed tool to be usable by teachers, we carried out an experiment in two phases: 1) the actual use of the tool to collect traces of use in order to verify the usability of the tool, 2) the use of a teacher questionnaire to verify the utility and acceptability of the tool. We tested the tool in a real-life situation with a sample of 40 teachers who agreed to participate in such an experiment, 24 teachers from the Multidisciplinary Faculty of Taza and 16 teachers from the Polydisciplinary Faculty of Nador, in the following disciplines: computer science, physics, economics, biology, English and French. Four of the teachers are experienced users of the games and have been using them for 2 years. The others are using the game for the first time. The choice of disciplines was based on teachers' interest in such a subject and also their habits and skills related to the use of computer software in the classroom. We have prepared and sent out a video and user guide to teachers. The duration of the experiment is related to the duration of the tested game and the habit of playing a game by teachers. Some of them played the game twice in order to familiarize themselves with the game content and with operational details such as controls, how to access levels, and how to overcome obstacles. 4.3 Results During the period of use of the tool, a monitoring mechanism kept track of teachers' actions. We noted that 2% of participants analyzed 2 SGs. Some SGs were analyzed several times by different profiles. 73% of teachers managed to draw up a summary of the SG selected, while 27% of users stopped the analysis in the second or third phase. This is due to several reasons, such as lack of precision, or technical and other difficulties. Figure 7 presents the average of the Likert scale for the 12 questions in the questionnaire.
Fig. 7. Averages of the questionnaire results.
The averages of the responses are between 3.35 and 4.55. The mean values of all 12 questions are greater than 3 (the neutral response), suggesting that teachers' attitudes are generally positive.
5 Discussion The obtained results from teacher’s feedback show that more than 70% of summary reports of the SG chosen were effective for teachers, which explains the importance of the proposed analysis process. Only 23% of teachers had difficulties with the chosen SG, this may be due to several reasons, which limited their involvement and therefore their results. The results of the questionnaire show that teachers were generally satisfied and had agreed on the need for such a tool to effectively integrate SGs into their teaching. 60% of participants mentioned the need for professional development opportunities related to game-based pedagogy. In all cases, these teachers plan to reuse the SGs for years to come. Generally, the obtained results from the experiments show that the tool helps and accompanies teachers in the decision-making concerning the adoption of a game-based pedagogy. For this work to be all the more relevant, the integration modalities could be improved to explain the protocol of practical use of SGs to teachers.
6 Conclusion Currently, the SG could be considered an essential tool for learning and teaching. However, the process of adopting a game-based pedagogy requires the accompaniment of teachers by tools facilitating the understanding and effective implementation of this strategy in education. In our study, we implemented a usability-based tool that facilitates the understanding and effective use of SGs in education without requiring teachers to take on new roles that are too complex and require skills in computer science. In general, the usability approach is characterized by the determination of the evaluation criteria to appreciate the ludic and pedagogical dimension of an SG, and it also guides the teacher in verifying that the chosen SG properly targets the learning objectives of the course and facilitates the learner's learning. Overall, the evaluation protocol yielded positive results relative to
the usefulness of the reports generated. We obtained more than 75% satisfaction with the precise evaluation criteria on this report. In addition, the majority of teachers expressed that the use of such an environment helps them to change their view on the use of SGs in education.
References 1. Wang, R., DeMaria, S., Jr., Goldberg, A., Katz, D.: A systematic review of serious games in training health care professionals. Simul. Healthc. 11(1), 41–51 (2016) 2. Laucelli, D.B., Berardi, L., Simone, A., Giustolisi, O.: Towards serious gaming for water distribution networks sizing: a teaching experiment. J. Hydroinf. 21, 207–222 (2019) 3. Batko, M.: Business management simulations - a detailed industry analysis as well as recommendations for the future. Int. J. Serious Games 3, 47–65 (2016) 4. Kaimara, P., Deliyannis, I.: Why should I play this game? The role of motivation in smart pedagogy. In: Daniela, L. (ed.) Didactics of Smart Pedagogy: Smart Pedagogy for Technology Enhanced Learning, pp. 113–137. Springer International Publishing, Cham (2019). https:// doi.org/10.1007/978-3-030-01551-0_6 5. Leblanc, G.L.: Analyse comparative d’une formation en présentiel et d’une formation mixte (jeu sérieux et une journée en présentiel) à Hydro-Québec (2020) 6. Antit, S., et al.: Evaluation of students’ motivation during the gamification of electrocardiogram interpretation learning. Tunis. Med. 98, 776–782 (2020) 7. Becker, K.: Choosing and Using Digital Games in the Classroom. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-12223-6 8. Kangas, M., Koskinen, A., Krokfors, L.: A qualitative literature review of educational games in the classroom: the teacher’s pedagogical activities. Teach. Teach. 23, 451–470 (2017) 9. Mathieu, O.: Conception d’une grille d’analyse des jeux sérieux pour l’enseignement de la science et de la technologie au secondaire en regard des bonnes pratiques reconnues dans le domaine de l’apprentissage par le jeu vidéo (2018) 10. Molin, G.: The role of the teacher in game-based learning: a review and outlook. In: Ma, M., Oikonomou, A. (eds.) Serious Games and Edutainment Applications, pp. 649–674. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-51645-5_28 11. Hébert, C., Jenson, J., Terzopoulos, T.: “Access to technology is the major challenge”: teacher perspectives on barriers to DGBL in K-12 classrooms. E-Learn. Digit. Media 18, 307–324 (2021) 12. Marfisi-Schottman, I., George, S., Tarpin-Bernard, F., Prévot, P.: Comment évaluer la qualité d’un Learning Game pendant sa conception? In: Conférence Technologies de l’Information et de la Communication pour l’Enseignement (TICE), pp. 80–90 (2012) 13. Djelil, F.: Comment évaluer un jeu d’apprentissage en contexte de formation professionnelle. In: Rencontres Jeunes Cherch. en EIAH, pp. 11–16 (2014) 14. Alvarez, J., Chaumette, P.: Présentation d’un modèle dédié à l’évaluation d’activités ludopédagogiques et retours d’expériences. Rech. Prat. pédagogiques en langues spécialité. Cah. l’Apliut 36 (2017) 15. De Freitas, S., Oliver, M.: How can exploratory learning with games and simulations within the curriculum be most effectively evaluated? Comput. Educ. 46, 249–264 (2006) 16. Ávila-Pesántez, D., Rivera, L.A., Alban, M.S.: Approaches for serious game design: a systematic literature review. ASEE Comput. Educ. J. 8(3) (2017) 17. Sweetser, P., Wyeth, P.: GameFlow: a model for evaluating player enjoyment in games. Comput. Entertain. 3(3) (2005)
18. Marne, B.: Modèles et outils pour la conception de jeux sérieux: une approche meta-design, Université Pierre et Marie Curie-Paris VI (2014) 19. Petri, G., von Wangenheim, C.G.: MEEGA+: a method for the evaluation of the quality of games for computing education. Proc. SBGames, 28–31 (2019) 20. Basili, V.R., Caldiera, G., Rombach, H.D.: The Goal, metric, and question Approach. Kaiserslautern, Ger. (1994) 21. Szilas, N., Widmer, D.S.: L’évaluation rapide de jeux d’apprentissage: la clef de voûte de l’ingénierie ludo-pédagogique (Instructional Game Design). Actes l’Atelier « Serious games, jeux épistémiques numériques » 24 (2013) 22. Warren, S., Jones, G., Lin, L.: Usability and play testing: the often missed assessment. In: Annetta, L., Bronack, S.C. (eds.) Serious Educational Game Assessment, pp. 131–146. SensePublishers (2011). https://doi.org/10.1007/978-94-6091-329-7_8 23. ISO 9241–11:2018: Ergonomics of human-system interaction—part 11: usability: definitions and concepts. Int. Organ. Stand. 9241 (2018). https://www.iso.org/obp/ui/#isostdiso
Analysis and Examination of the Bus Control Center (BCC) System: Burulaş Example Kamil İlhan1 and Muharrem Ünver2(B) 1 Asis Electronics and Information Systems Inc., Istanbul, Turkey 2 Faculty of Engineering, Department of Industrial Engineering, Karabuk University, Karabük,
Turkey [email protected]
Abstract. BCC Management System, developed by ASIS CT company, is a web-based application that enables public transportation enterprises to plan public transportation efficiently by defining vehicles, drivers, lines, routes, stops and services, to detect violations (route, stop, etc.) from the data collected from public transportation vehicles and to generate reports, and to allow instant and historical monitoring of vehicles on the map and to monitor in-vehicle camera images. BCC Management System can work in integration with ERP, Electronic Fare Collection Systems (EFCS) and Vehicle Telemetry System (CanBus) currently used by public transportation enterprises. BCC is an intelligent transportation system that enables in-vehicle and road tracking, location tracking, and access to stop and route information by the web user. This system increases the quality of service and life in cities. As a part of smart public transportation, it has been evaluated that it is a system that provides information such as speed, route, location and arrival time for all vehicles used in public transportation in real time. It is an efficient, effective, innovative, dynamic, environmentalist, value-added and sustainable smart transportation platform that integrates with all transportation modes in our country, uses up-to-date technologies, and makes use of domestic and national resources. Keywords: Smart technology · Urban transportation systems · Sustainable transportation components
Abbreviations
BURULAŞ: Bursa Transportation Public Transportation Management Tourism Industry and Trade Inc.
EFCS: Electronic Fare Collection Systems.
DCP: Driver Control Panel.
BCC: Bus Control Center.
CanBus: Vehicle Telemetry Systems.
IT: Information Technologies.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ben Ahmed et al. (Eds.): NISS 2022, LNDECT 147, pp. 84–96, 2023. https://doi.org/10.1007/978-3-031-15191-0_9
1 Introduction The increase in the population living in cities, the new developments in information and communication technologies and the differentiation in the demands and expectations of urban individuals also force urban administrations to change. It is thought that limited public resources and crowded cities will be managed more effectively by making more use of the opportunities offered by smart technologies. In this context, governments and city administrations in different geographies attach importance to smart city applications and projects. In the face of the advantages of smart cities and many smart city projects, academic interest in this field is also increasing with a multidisciplinary approach [1]. Smart technologies increase the quality of service, reduce costs and make longterm plans, thanks to the support of transportation and logistics systems by IT, and the smartening of environmentally friendly, sustainable, safe and integrated transportation systems with real-time data [2]. Smart city policies are also being developed at the national level in Turkey. In particular, investments are made by municipalities in areas such as transportation, public transportation, water management and energy efficiency. This study focuses on big data-based innovative transportation technologies behind smart cities [3].
2 Material and Method
The smart transportation platform we will discuss in our study is the Bus Control Center (BCC), a web application that enables businesses to plan transportation efficiently by defining vehicles, drivers, lines, routes, stops, tasks and services, to detect violations and generate reports from the data collected from buses, and to access and monitor in-vehicle camera images of vehicles that can be viewed on the map [4]. With its real-time monitoring data, this application plays an important role in the operation of the public transportation vehicles in the field in Bursa by providing the management and control of their processes. In addition, it is a system that provides the instant location, speed, daily travel time, instant vehicle tracking and much similar information for the vehicles used in public transportation. With the BCC Management System, the safety of vehicles and operational efficiency are ensured, and accurate planning and regulation is possible thanks to the detailed reports it provides. Some of these reports are:
• Follow-up of all vehicles on a single web screen.
• Vehicle occupancy rates.
• Retrospective monitoring of the system offline.
• Working hours of drivers.
• Online center notification.
• Driver behaviors.
• Movement, stop, towing, speed, idling.
• Vehicle line violation, trip violation report.
• Regular reporting.
• Vehicle optimization.
• Cost/Fuel optimization.
• Stop, line, bus, trip based general passenger reports.
• Dealer, card filling reports.
• Receiving and reporting data on the vehicle with CanBus.
3 Application 3.1 General Features of the Bus Control Center (BCC) System and Determination of Application Areas It is a real-time (online) system. According to the requirement in the system, it is possible to make the arrangement according to the authorization level via web/mobile by the authorized personnel [5]. Vehicle tracking can be performed retrospectively. It is ensured that the vehicle speed and time-dependent location information is monitored and the desired line can be seen together with the directions of return while the vehicle history is monitored, and the traffic density is shown on the map information [6]. It is literally a human and Smart city-oriented transportation system in Turkey with advanced information technologies (Figs. 1, 2, 5, 6 and 7).
Fig. 1. BCC project vehicle tracking screen
It is possible to monitor all movements (where it stops, where it exceeds the speed limits you can set, where passenger boarding takes place, etc.) of a particular vehicle retrospectively or instantaneously. In this way, information such as the departure/arrival times of the transportation vehicles, the routes followed, the departure times, the location information of where the vehicle paused, and the pause times can be seen instantly.
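As a sketch of how such retrospective monitoring can be expressed over stored position records, the following Python fragment scans a vehicle's historical track for speed-limit violations and long pauses. The record layout and thresholds are illustrative assumptions, not the actual BCC data schema.

from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple

@dataclass
class PositionRecord:
    plate: str
    timestamp: datetime
    lat: float
    lon: float
    speed_kmh: float
    ignition_on: bool

def find_speed_violations(track: List[PositionRecord], limit_kmh: float = 50.0) -> List[PositionRecord]:
    # Points of a historical track where the configured speed limit was exceeded.
    return [p for p in track if p.speed_kmh > limit_kmh]

def find_pauses(track: List[PositionRecord], min_seconds: float = 120.0) -> List[Tuple[PositionRecord, float]]:
    # Group consecutive zero-speed records and report pauses of at least min_seconds.
    pauses, start = [], None
    for p in track:
        if p.speed_kmh == 0 and start is None:
            start = p
        elif p.speed_kmh > 0 and start is not None:
            duration = (p.timestamp - start.timestamp).total_seconds()
            if duration >= min_seconds:
                pauses.append((start, duration))
            start = None
    return pauses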
In order to prevent misuse of transportation vehicles, it was ensured that traffic rules were more complied with by monitoring their speed. Thus, the risk of traffic accidents is reduced and the number of traffic fines is reduced. In case of theft or loss of transportation vehicles, the location of the vehicle is determined as soon as possible with GPS data. Detailed reports were obtained about the transportation vehicles and their drivers by accessing the data retrospectively through the BCC system. With these reports, the measurement of the efficiency of the operation in the field and the performance of the drivers were evaluated in the most accurate way with statistical information, and overtime calculations and overpayments were prevented [7].
Fig. 2. BCC project vehicle history monitoring screen
BCC is a smart transportation system that provides the daily operation report of the selected vehicle. It enables us to view the latest status information, ignition opening and closing times, the total idling and stopping time of the vehicle during the day, the distance traveled and the speed-time graph during the day. After selecting one of the license plate,
device ID number, driver or group options of the vehicle you want to receive the report for, you need to select the day you want the report for and click on the Show Report button. The stop-based working statistics report screen image is shown in Fig. 3 [8].
Fig. 3. BCC project stop based operation statistics report screen
In the system, the positions of the vehicles defined and operating on the same line are displayed on an interface screen according to the movement information accessed from their travel location data, so that the distance between two points is shown linearly. The related report screen, which also shows the frequencies of the vehicles belonging to the relevant line on any line route, and the Vehicle Docking Monitoring and Management screen, where we can see the alignment according to the distances between the vehicles on the line stretching between two specific points, are shown in Fig. 4 below. In the determination of docking, only vehicles on the same line are taken as a basis. According to the trip frequency and the variant distance, the system is able to detect the doubling (bunching) with the rate specified by the user in the interface [7].
Fig. 4. BCC project vehicle accordion monitoring and management screen
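A minimal sketch of the docking (bunching) check described above is given below: vehicles of the same line are ordered by their position along the route, and a pair is flagged when the gap between them falls below a user-specified ratio of the expected headway. The position representation, the ratio, and the plate numbers are illustrative assumptions; the BCC implementation itself is not publicly documented.

from typing import Dict, List, Tuple

def detect_bunching(positions_km: Dict[str, float],
                    expected_headway_km: float,
                    threshold_ratio: float = 0.5) -> List[Tuple[str, str, float]]:
    # Flag pairs of consecutive vehicles on the same line whose spacing along
    # the route falls below threshold_ratio * expected_headway_km.
    ordered = sorted(positions_km.items(), key=lambda kv: kv[1])
    alerts = []
    for (plate_a, km_a), (plate_b, km_b) in zip(ordered, ordered[1:]):
        gap = km_b - km_a
        if gap < threshold_ratio * expected_headway_km:
            alerts.append((plate_a, plate_b, round(gap, 2)))
    return alerts

# Example: four buses on the same line, expected spacing of 3 km (hypothetical plates).
print(detect_bunching({"16 A 001": 2.1, "16 A 002": 2.9,
                       "16 A 003": 7.4, "16 A 004": 10.6}, 3.0))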
Multiple lines/vehicles can be selected on the map at the same time. The selected lines are displayed on the route. It can be determined which information will be displayed by the label on the icon showing the vehicle on the map. In this way, you can track your vehicles by license plate, driver name, serial number of vehicle tracking device or team number. If the GPS devices in the transportation vehicles cannot receive information from the GPS satellites during the specified period, the vehicle is shown in black on the map. The vehicle may have entered an enclosed space or there may have been a problem with the system.
Fig. 5. BCC project vehicle tracking map screen image
On the digital map of the city, the movements of transportation vehicles can be tracked from the center on different maps (Google, Here, etc.).
Fig. 6. BCC project map editor screen
3.2 Identification of Lines, Routes, Vehicles and Stops In route definition processes, the order of stops and the kilometers and time between stops are defined. The system should be able to automatically update the time information according to the actual data of a certain time period.
Fig. 7. BCC project line definition screen
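The automatic update of inter-stop times from actual data could, for example, be realized with a simple exponential moving average, as in the following sketch. The EMA rule and the line/stop names are assumptions made for illustration; the paper does not state which update rule BCC uses.

from dataclasses import dataclass, field
from typing import List

@dataclass
class StopLink:
    from_stop: str
    to_stop: str
    distance_km: float
    planned_minutes: float     # current planned travel time between the two stops

@dataclass
class Route:
    line: str
    links: List[StopLink] = field(default_factory=list)

    def update_times(self, observed_minutes: List[float], alpha: float = 0.2):
        # Blend each planned time with the latest observed time (EMA update).
        for link, obs in zip(self.links, observed_minutes):
            link.planned_minutes = (1 - alpha) * link.planned_minutes + alpha * obs

route = Route("38", [StopLink("Terminal", "Stop A", 1.2, 4.0),
                     StopLink("Stop A", "Stop B", 0.9, 3.0)])
route.update_times([5.0, 2.5])
print([round(l.planned_minutes, 2) for l in route.links])  # [4.2, 2.9]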
According to the length and capacity of the transportation vehicle, stop angles can be automatically determined and recorded according to the route, and more than one route (variant) belonging to the same line can be drawn. When variants are requested, the timing is set on a day basis according to the user's request. The data of the transportation vehicles used on the lines in the field in Bursa province are given as an example in Table 1 below.

Table 1. Transportation vehicles (bus) information used in the Burulaş field.

Company | Trademark | Model | Year | Length (m) | Quantity
BURULAŞ | MERCEDES | CONECTO | 2007 | 18 | 10
BURULAŞ | TEMSA | TOURMALINE | 2015 | 12 | 6
BURULAŞ | OTOKAR | KENT 290 | 2011 | 12 | 30
BURULAŞ | MERCEDES | CONECTO | 2007 | 12 | 74
BURULAŞ | GÜLERYÜZ | COBRA | 2018 | 12 | 2
BURULAŞ | BMC | PROCITY | 2018 | 12 | 1
BURULAŞ | ISUZU | CITIBUS | 2012 | 9 | 1
BURULAŞ | KARSAN | ATAK | 2016 | 8 | 12
3.3 Services Planning Drivers can see the tasks assigned to them via the web. In case of change, SMS notification can be made. It is ensured that the spare vehicle and driver duties are defined, the driver job definition is made and the shift plan can be printed out. The data in Table 2 below includes the information of drivers assigned by the Burulaş company to bus transportation vehicles.
Table 2. Burulaş transportation vehicles (bus) working in the field driver information
Name surname | Registry no | T.R. Identification number | Mobile phone | Duty  | Address
A… S…        | 20…         | 5………                        | 532………       | ŞOFÖR | U…… MAH. 2. ….. SOK. NO.19 YI…
A. K…        | 20…         | 2………                        | 532………       | H.M   | I…… MAH. 2……. SK. NO:25/2 Y…./BURSA
A. G…        | 20…         | 2………                        | 533………       | ŞOFÖR | Y…… MAH. 55. (….) SK. NO 3/2 N….
B… Ş…        | 20…         | 1………                        | 535………       | ŞOFÖR | Kİ…… MH. GÖ…… NO: 10 BURSA
B…. Ö…       | 20…         | 1………                        | 533………       | ŞOFÖR | ÜÇ….. MAH. D…… SOK. NO. 4/2 NİL…
C… K…        | 20…         | 1………                        | 533………       | ŞOFÖR | GÜL…. MAH. 3G….. SOK. NO. 1/2 YEAR…
E… E.        | 20…         | 3………                        | 532………       | ŞOFÖR | HA…… MAH. 3. K… SOK. NO. 39/2 OS…
Drivers can follow the tasks assigned to them with the users opened on the BCC website, as can be seen on the sample image in Fig. 8.
Fig. 8. BCC project services planning-campaign operations screen
3.4 Emergency Management
In case of emergency, the driver can send messages as well as alarms to the BCC Web displays. Warning or information messages can be sent to the driver via BCC Web, and the system can give a warning to the driver when a message arrives at the vehicle. Alarms can be sent by the driver via the on-board computer. Emergency alarm types such as Accident, Failure, Traffic Jam, and Ambulance are shown on the emergency screen in Fig. 9. Other sensor-based programming sent to the BCC Web Center as an alarm includes:
• Temperature Alarm Programming: By entering the minimum and maximum temperature values in degrees, an alarm report is generated when values exceed the tolerance. Separate programming is performed for each of the two sensors.
• Non-periodic Programming: Information is received from the sensors at the desired time interval.
• Humidity Alarm Programming: An alarm report is generated when values exceed the tolerance.
• Door Left Open Alarm: When the duration is entered, an alarm report is generated if the door open time exceeds this value. If there is more than one door sensor, separate time information must be entered for each of them.
• Vehicle Moved without Driver Card: The movement of the vehicle without the driver identification unit card is transmitted as an alarm report. This alarm can be started with the "Start" button.
• Vehicle Moved without Driver Card Insertion Warning Sound: The duration is entered in seconds and the buzzer sounds for the desired duration.
• Speed Warning Sound: If the two-stage speed limit and warning time values are entered, an audible warning is given when the limit is exceeded. For audible warnings, a special connection must be made in the vehicle from the buzzer and the km sensor [9].
Fig. 9. BCC project services planning-campaign operations screen
A pop-up warning message appears in front of the BCC Web user related to the vehicle or line. The relevant user can open the instant camera images of the vehicle from the pop-up. Alarms sent by the driver are reported via the emergency button. Retrospective data can be listed by determining the date range of the vehicle for which the Emergency Alarm Report is to be received. Thanks to this feature, when the BCC Vehicle Tracking system is turned on, alarms, audible warning and pop-up windows can be transmitted instantly, as can be seen in Fig. 10. In the emergency temperature and speed pop-up warning message that comes in front of the operator, the alarm can be reported when the temperature and speed values fall below or exceed the desired values, and when the specified time is exceeded if the time information is entered. Data can be listed by determining the date range after selecting one of the license plate, device ID number, driver or group options of the vehicle you want to receive the Temperature and Speed Alarm Report.
Fig. 10. BCC project services planning-campaign operations screen
3.5 Violation Management
Each line has a route and departure times. The vehicles open the line code and travel on the route associated with the departure time at the specified hours. A violation is generated and can be viewed on the BCC Web project screen if the vehicle goes out of the defined route, does not come to the last stop of the relevant voyage, does not leave the garage on time, or deviates onto another road and leaves the route due to reasons such as accidents or traffic [10]. Instant notifications of all violations made by vehicles on the vehicle tracking map screen are shown in Fig. 11 below.
Fig. 11. BCC project violation notification screen
It is possible to display the violation list by entering the entry and exit times of the zones defined for the vehicle, the distance traveled and time spent in the zone, and the license plate, device number, and driver criteria; the stop, motion, and idling alarms within the zones defined for the vehicle can also be reported, as in Fig. 12. The contents of the report, which shows the entry and exit times of the vehicle to the programmed regions or whether it has entered the region within the selected date range, are presented below (a simple rule-based sketch of these checks follows the list):
• It is possible to monitor whether the vehicles have left their lines and whether they have stopped at the stops.
• If the vehicle does not leave the garage on time and does not reach the first time point on time, a notification is sent and the violation is reflected on the report screen.
• Notification and reporting if the vehicle does not leave the first defined stop on time.
• Notification and reporting if the vehicle is out of the defined route.
• Notification and reporting if the vehicle does not arrive at the last stop.
• The vehicle's non-stop (skipped-stop) information is logged at the stops on the route, and reports on these logs can be obtained on the BCC web when requested.
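The following is a purely illustrative sketch of how such rule-based violation checks could be expressed. The actual BCC implementation is not published, so every class, field, and message name here is an assumption made only to make the list above concrete.

```python
from dataclasses import dataclass

@dataclass
class TripStatus:
    """Illustrative per-trip flags derived from GPS and stop logs."""
    left_garage_on_time: bool
    reached_first_stop_on_time: bool
    on_defined_route: bool
    reached_last_stop: bool

def violations(trip: TripStatus) -> list:
    """Return the list of violation messages triggered for one trip."""
    rules = [
        (not trip.left_garage_on_time, "did not leave the garage on time"),
        (not trip.reached_first_stop_on_time, "did not reach the first time point on time"),
        (not trip.on_defined_route, "left the defined route"),
        (not trip.reached_last_stop, "did not arrive at the last stop"),
    ]
    return [message for triggered, message in rules if triggered]
```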
Fig. 12. BCC project violation list report screen
4 Discussion and Recommendations
The BCC system is an example of an intelligent transportation platform that is widely used not only in Turkey but all over the world, as it provides the management and control of the processes of public transportation vehicles. In today's ever-increasing urban mobility, growing competition, rising fuel prices and, more importantly, time becoming the most valuable asset require every institution to manage its resources very carefully and correctly. Through the BCC system, an operator can reduce fuel consumption, reduce accidents, efficiently manage its workforce in the field, prevent driver- and passenger-based abuses, serve passengers better, and stay ahead of the competition with suitable special solutions in the transportation sector [11]. The BCC system provides solutions both for reducing fuel costs, which constitute a serious expense item in the transportation sector, and for increasing the performance and efficiency of the transportation vehicles in the field by ensuring fleet discipline. In transportation vehicles, it can monitor passengers and drivers working in the field 24/7 via the internet from wherever they are [12]. The fuel consumption of the vehicles, the route (variant), and passenger data can be tracked through the system. Fuel savings can be achieved by driving on the right route, using vehicles in line with economical driving methods, and idling less. Information on how the driver drives the vehicle, which is vital along with the savings in fuel costs, and bad driving habits such as sudden acceleration, hard braking, and skidding can be specially reported. Thus, based on these reports, managers can encourage correct driving behavior by developing reward and punishment practices, contributing to the reduction of accidents and to occupational safety. Driver violations can be checked instantly and speeding violations can be prevented. Through the BCC system, information such as the departure times, speed, stop
exit/arrival times of the transportation vehicles, the places where the transportation vehicles stop, and the pause times can be seen instantly. In addition to the system, which is based on the ability to track and control vehicles over the internet with the data received from GPS satellites, the current situation can be followed moment by moment thanks to the cameras integrated into the transportation vehicles.
5 Conclusion
At the current point of the developments in digital technologies, the aim is to meet the expectations and demands of passengers quickly and on the spot by using smart technologies in the solution of transportation problems in smart cities. From this perspective, the BCC smart transportation platform is needed to go beyond classical transportation applications and solve these problems with the help of smart technologies that collect big data with various sensors and vehicles. Burulaş can access status information about all the vehicles used in the field at a glance using the BCC smart transportation platform. It can access all information on the transportation vehicles, such as real-time location, speed, and violations, and it provides the opportunity to monitor the vehicle movements of a selected past period on the map. Also, thanks to BCC's rich reporting menu, summary and detailed information on all movements of all vehicles can be accessed with different report types.
References
1. Köseoğlu, Ö., Demirci, Y.: Smart cities and empowering innovative technologies for packaging policy issues. Int. J. Polit. Stud. 4(2) (2018)
2. Gürsoy, O.: Smart city approach and application opportunities for metropolitan cities in Turkey. Master Thesis, Hacettepe University, Institute of Social Sciences, Department of Political Science and Public Administration, Department of Public Administration (2018)
3. Ersen, H.: Smart Governance. Republic of Turkey Ministry of Environment and Urbanization, November 2020. Press: 07.04.2020
4. Gürler, U.G.: Asis Elektronik Ve Bilişim Sistemleri A.Ş. Bursa Municipality Burulaş BCC Project Functional Analysis and Scope Document, Version V02 (2019). Press: 22.03.2019
5. Bursa Metropolitan Municipality: Smart Municipalism and Smart Urbanism Applications, E-Promotion (2018). https://www.Bursa.Bel.Tr. Accessed 07 Jan 2018
6. Çağatay, D.: Python R&D and Software House, Bus Control Project Draft Document, Version V01 (2020). Press: 17.05.2020
7. Ilhan, K.: Asis Elektronik Ve Bilişim Sistemleri A.Ş. Bursa BCC Project Functional Analysis Document, Version V1.6 (2020). Press: 07.04.2020
8. Burulaş Smart Transportation Platform: Bus Control Center Hardware and Software Purchase Work Technical Specification, Version V3.6 (2018). Press: 21.11.2018
9. Ünal, A.N.: Bahçeşehir University, Sensors, Data Collection and Actuators (2020), November 2020
10. Serin, H.: Asis Elektronik Ve Bilişim Sistemleri A.Ş. Bursa BCC Project Functional Analysis Document, Version V1.1 (2019). Press: 05.09.2019
11. Xsights: Innovation and Change Guide for Smart Cities, Smart Cities Desk Research – Public Technology Platform (2016). http://www.Akillisehirler.Org/Akilli-Cevre/. Accessed 16 Dec 2017
12. Turkish Telecom: Smart Cities, E-News (2018). http://www.Sehirlerakillaniyor.com/. Accessed 14 Jan 2018
Analyze Symmetric and Asymmetric Encryption Techniques by Securing Facial Recognition System Mohammed Alhayani(B)
and Muhmmad Al-Khiza’ay
Al-Kitab University, Altun Kupri, Kirkuk, Iraq {mohammed.yaseen,Mohammed.kazim}@uoalkitab.edu.iq
Abstract. This paper aims to protect the facial recognition system by securing the stored images and preventing unauthorized people from accessing them. Symmetric and asymmetric encryption techniques were proposed for the image encryption process. To ensure that the most efficient technique is used, the results of two popular encryption algorithms are compared: AES was chosen to represent symmetric ciphers and RSA to represent asymmetric ciphers. High-resolution face images are encoded by both algorithms and analyzed with quantitative parameters such as PSNR, histogram, entropy, and elapsed time. According to the proposed criteria, the results showed a preference for AES, as it provided distinguished results in image coding in terms of coding quality and accuracy, processing and execution speed, coding complexity, coding efficiency, and homogeneity. To sum up, symmetric encryption techniques protect the face recognition system faster and better than asymmetric encryption techniques. Keywords: Facial recognition · Symmetric techniques · Asymmetric techniques · Entropy
1 Introduction
This section reviews significant literature related to facial recognition technology. Determining a person's identity through facial recognition technology has become dominant in all government departments and institutions. Facial recognition requires a large number of face images to be collected in storage centers, and hackers can access and manipulate those data. Therefore, a system is needed that is secure and tamper-resistant against data breaches and hacking and that guarantees the continuity of data provision. Symmetric and asymmetric encryption techniques are used to encrypt image data with private and public keys, where the keys are shared between the sender and receiver. Symmetric encryption uses one key, and asymmetric encryption uses two keys (private and public). The AES algorithm was proposed for the symmetric cipher and the RSA algorithm for the asymmetric cipher. After encoding the contents of the images, the cipher images are evaluated using mathematical criteria (PSNR, histogram, entropy, elapsed time). The values of the pixels change in the images. The average elapsed time for the
encryption process is calculated to find out which algorithm is fastest. Based on the resulting evaluations, the best and most efficient algorithm is selected to encode the stored images. The rest of the study is structured as follows. Section 2 presents the literature review of previous studies related to this research. Section 3 describes the methodology and explains the algorithms and criteria as well as how they work. Section 4 presents the experimental results, where the outputs of the algorithms and the values of the proposed criteria are discussed. Section 5 discusses the results and how they are evaluated. Section 6 draws the conclusions.
2 Literature Review
For protecting facial recognition, Mohammad, Omar Farook, et al. [1] and Jain, Yamini, et al. encoded image content using several symmetric and asymmetric coding algorithms according to the criteria of PSNR, NPCR, UACI, and entropy, and presented a detailed comparison of image encryption mechanisms. It was concluded that symmetric chaotic encryption algorithms are more effective and complex in image encryption. Our contribution with respect to these studies is to identify the AES algorithm, a symmetric encryption technique, as the fastest technology according to the execution time criterion [2]. Qi Zhang and Qun Ding [3] were interested in image encryption with the AES algorithm: they encrypted images and then evaluated the encryption process by histogram and key sensitivity. Our contribution with respect to this study is to compare AES with asymmetric coding techniques and to use criteria for the smoothness of pixel rates in images to discover the best algorithm. Memon F. and others studied the performance evaluation of eight focus measures (CURV Curvature, GRAE, Histogram Entropy, LAPM, LAPV, LAPD (Diagonal Laplacian), LAP3, and WAVS). Through criteria such as PSNR and MSE, they evaluated the mentioned focus measures and the quality of the images, and they typically found the LAPD method to be relatively better than the other seven focus factors [3, 4]. Qi Zhang and Qun Ding were interested in image encryption with the AES algorithm and evaluated the encryption process by histogram and key sensitivity; we contributed to this line of work by comparing the AES algorithm with techniques that use two keys in the encryption process, such as the RSA algorithm, and found that it outperforms them.
3 Methodology
This study investigates different methodologies used to encode the images of facial recognition systems, in order to reach the most efficient technique by providing a deeper comparison between the algorithms. Encryption works by chaotically scattering the contents of the image, where the encryption is done with special mechanisms and utilizes encryption keys held by both the sender and the receiver. For decryption, the encryption process is reversed with the key used in the encryption to obtain the original data content. Symmetric encryption technology uses one key, secret or public, for example AES, DES, 3DES, CAST5, IDEA, RC4, RC5, RC6, and Blowfish [5].
Asymmetric encryption methods use two keys, a private and a public one, in the encryption process; examples include RSA, ECC, DSA, Merkle's Puzzles, YAK, and Diffie-Hellman [1]. In this study, the proposed algorithm for the symmetric encryption technique is AES and the asymmetric cipher is RSA.

3.1 AES (Advanced Encryption Standard)
AES is one of the most famous algorithms used around the world that works with symmetric block cipher technology. It was published by the National Institute of Standards and Technology (NIST) in the year 2000 due to the need for new encryption techniques to replace older ones that had security weaknesses [6]. AES has three different key sizes, 128, 192, and 256 bits, with a block size of 128 bits. AES applies a number of rounds to transform the data of each block, according to the length of the key: 10 rounds for a 128-bit key, 12 for 192, and 14 for 256. At the beginning of the encryption process, the block is arranged into a 4 × 4 byte array, and then the rounds are executed. Each round of AES includes a set of operations to hash the clear data content: Substitute Bytes, Shift Rows, Mix Columns, and Add Round Key [7]. When decoding, the process and steps are reversed [8].

3.2 RSA (Rivest Shamir Adleman)
RSA is a well-known algorithm that was described in 1977 by Leonard Adleman, Adi Shamir, and Ronald Rivest at the Massachusetts Institute of Technology, and it works with asymmetric encryption technology, meaning that it has two keys for encryption and decryption. It received a patent for an "encrypted communication system and method" in 1983. The encryption process works by generating a public key and encrypting the message with it, then matching it with a private secret key and sending it to the receiver; the receiver decrypts the message with the private key that he obtained together with the ciphertext from the sender [9]. Below is the pseudo-code of key generation and the encryption process; for the decoding process, the steps are reversed [10].

1. Initialization:
2. Procedure Key generation
3.   Generate public key
4.   Generate private key
5. Procedure Encryption
6.   Step 1: Sensor nodes have public key (i, m); i is public
7.   Step 2: CF = (M^i) mod (m)
8.   Step 3: The packet is encrypted
9. End procedure
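To make the comparison concrete, the following is a minimal sketch of encrypting raw image bytes with AES and with RSA while measuring the elapsed time. It assumes the PyCryptodome library; the key sizes, cipher modes, chunk size, and the input file name are illustrative choices, not the exact configuration used in this study.

```python
import time
from Crypto.Cipher import AES, PKCS1_OAEP
from Crypto.PublicKey import RSA
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad

def aes_encrypt(image_bytes: bytes, key: bytes):
    """Encrypt raw image bytes with AES-128 in CBC mode and report the elapsed time."""
    iv = get_random_bytes(16)
    cipher = AES.new(key, AES.MODE_CBC, iv)
    start = time.perf_counter()
    ciphertext = cipher.encrypt(pad(image_bytes, AES.block_size))
    return iv + ciphertext, time.perf_counter() - start

def rsa_encrypt(image_bytes: bytes, public_key):
    """Encrypt the image in small chunks with RSA-OAEP; RSA can only process
    blocks smaller than the modulus, which is one reason it is much slower."""
    cipher = PKCS1_OAEP.new(public_key)
    chunk = 190  # conservative plaintext size for a 2048-bit key with OAEP
    start = time.perf_counter()
    ciphertext = b"".join(cipher.encrypt(image_bytes[i:i + chunk])
                          for i in range(0, len(image_bytes), chunk))
    return ciphertext, time.perf_counter() - start

if __name__ == "__main__":
    data = open("face.png", "rb").read()   # hypothetical input image
    aes_key = get_random_bytes(16)
    rsa_key = RSA.generate(2048)
    _, t_aes = aes_encrypt(data, aes_key)
    _, t_rsa = rsa_encrypt(data, rsa_key.publickey())
    print(f"AES: {t_aes:.6f} s, RSA: {t_rsa:.6f} s")
```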
3.3 Efficiency Criteria
The efficiency of the image encoding algorithms must be measured to arrive at the most suitable and effective algorithm. The efficiency of an algorithm is measured through special criteria that show the impact of the coding on the images [11].
PSNR (Peak Signal-to-Noise Ratio)
The PSNR criterion can be used to determine the quality of the coding: it analyzes the image quality and shows the pixel change between the normal image and the encrypted image, where the value of PSNR is computed over the pixels and each pixel has eight bits. A lower value indicates a higher quality of the encryption method [12]. The value of PSNR can be calculated as follows [2], where MAX is the maximum value that can be taken by a pixel (equal to 255 for the full value), P is the original plain image, C is the ciphered image, and n, m are the width and height of the images:

PSNR = 10 \times \log_{10}\left(\frac{MAX^{2}}{MSE}\right)   (1)

MSE = \frac{1}{n \times m}\sum_{i=1}^{n}\sum_{j=1}^{m}\left[P(i,j) - C(i,j)\right]^{2}   (2)
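A minimal NumPy sketch of Eqs. (1) and (2) could look as follows; the function names are illustrative.

```python
import numpy as np

def mse(plain: np.ndarray, cipher: np.ndarray) -> float:
    """Mean squared error between the plain image P and the ciphered image C, Eq. (2)."""
    diff = plain.astype(np.float64) - cipher.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(plain: np.ndarray, cipher: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio, Eq. (1); a lower value indicates stronger
    distortion, i.e. better concealment of the original image."""
    return 10.0 * np.log10(max_value ** 2 / mse(plain, cipher))
```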
Histogram Analysis
Histogram analysis was used to extract the mean, maximum, and covariance values of the image [13]. It provides a visual interpretation of the numerical data by showing the number of data points that fall within a specified range in the figure. The graph depicts the frequency distribution of the pixel density values of the image: the pixel density values are scattered in the original images and are expected to be uniform in the encoded images. The histogram is different between the normal image and the encrypted image, thus preventing data leakage, as shown in Table 2.

Entropy
The average information or entropy of an image is a measure of image randomness that is important in the field of image coding. Entropy is large in focused images and small in unfocused images. Also, the entropy picture indicates that homogeneous regions have a lower entropy than heterogeneous regions, and thus the strength and complexity of the coding method can be assessed. The entropy of the image can be calculated over the gray-level probabilities of its pixels, as in the mathematical law [14]:

H_1 = -\sum_{k} p_k \log_2(p_k)   (3)
where H1 is entropy, K is the number of gray levels, and pk is the probability associated with gray level k. Elapsed Time Analysis The speed of the encryption and decryption process is important in measuring the efficiency of an algorithm, when the algorithm accomplishes its tasks in real-time it will increase the reliability and security of the system. The run times of encryption algorithms depend on the characteristics of the encryption algorithm being used, as well as computer resources such as the RAM, CPU, and disk.
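The histogram and entropy measures can likewise be sketched as follows (NumPy assumed; 8-bit grayscale images with 256 gray levels).

```python
import numpy as np

def histogram(gray_image: np.ndarray) -> np.ndarray:
    """Frequency distribution of the pixel intensities, as plotted in Table 2."""
    counts, _ = np.histogram(gray_image, bins=256, range=(0, 256))
    return counts

def entropy(gray_image: np.ndarray) -> float:
    """Shannon entropy H1 = -sum(p_k * log2(p_k)) over the 256 gray levels, Eq. (3)."""
    counts = histogram(gray_image)
    p = counts / counts.sum()
    p = p[p > 0]                      # ignore empty bins to avoid log(0)
    return float(-np.sum(p * np.log2(p)))
```

The elapsed-time criterion can then be obtained by wrapping the encryption and decryption calls with a timer, as in the earlier encryption sketch.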
4 Experimental Result
Two high-resolution face images with different dimensions were used in a gray color system. The first image has fixed dimensions of 4800 × 4800 pixels (as a medium image), and the second image has dimensions of 6000 × 4000 pixels (as a large image). The images were encrypted using the AES and RSA algorithms. The performance of the algorithms for encoding and decoding the images was analyzed according to the criteria proposed in the simulation process. The facial image samples were taken online from the Faces dataset on the Unsplash site, which provides high-resolution photos. All algorithms and simulations are coded in the Python programming language.

Table 1. Natural shapes and sizes of original and encrypted, decrypted images.
Process    | Image                    | Original Image  | AES              | RSA
Encryption | Medium Image (4800x4800) | 919,692 bytes   | 12,802,921 bytes | 7,083,713 bytes
Encryption | Large Image (6000x4000)  | 2,195,948 bytes | 13,337,617 bytes | 10,527,472 bytes
Decryption | Medium Image (4800x4800) | 919,692 bytes   | 919,692 bytes    | 919,692 bytes
Decryption | Large Image (6000x4000)  | 2,195,948 bytes | 2,195,986 bytes  | 2,195,948 bytes
Table 1 shows the two normal images in their encoded and decoded formats, as well as their sizes in bytes. For both the AES and RSA algorithms, the decrypted images return to essentially the same size as the original images, while the encrypted files are larger. According to the visual evaluation, AES entirely hides all of the image details and also restores them to their original state. RSA was successful in encrypting and decrypting the photographs, although the encryption procedure did not hide all of the images' features. However, visual evaluation is not a sufficient criterion to measure and analyze the efficiency of the algorithms; some quantitative analysis must be performed to measure the image encoding performance of the AES and RSA algorithms.
4.1 Histogram Analysis
Table 2 shows the histograms of the original, encoded, and decoded images for the AES and RSA algorithms. The pixel density values are scattered in the original images and should be uniform in the encrypted images. The AES algorithm succeeded in encrypting the images by distributing the pixel density uniformly on the graph, unlike the RSA algorithm, which left the pixel values scattered in the graph.

Table 2. Histogram graphics (histogram plots of the original, AES-processed, and RSA-processed medium and large images for the encryption and decryption processes; the plots are not reproduced here).
4.2 PSNR Analyses
Table 3 reports the PSNR values used to determine the encoding quality by analyzing the image quality; they show the pixel change between the normal image and the encoded image for both algorithms. The output indicates that the PSNR value is lower in AES-encoded images than in RSA-encoded images, which points to the higher quality of the AES encryption method.

Table 3. PSNR values.
Image  | PSNR value of AES | PSNR value of RSA
Medium | 7.4908            | 9.5880
Large  | 7.0829            | 8.3107
4.3 Entropy Analyses The results for entropy measures of random images are presented in Table 4. The consistency and amount of noise in the image are determined by calculating the position of each pixel in the image. The higher the image consistency, the lower the entropy value. The AES and RSA algorithms were used to calculate the entropy of both regular and encoded pictures.
Table 4. Entropy for original and encrypted, decrypted images (entropy results for the Medium Image, Large Image, Medium AES Encrypted, Large AES Encrypted, Medium RSA Encrypted, and Large RSA Encrypted images; the numerical values and plots are not reproduced here).
It can be concluded from the results that the images encoded with the AES algorithm show a high degree of consistency, in contrast to the images encoded with the RSA algorithm, which have a large entropy value. This reflects the strength and complexity of the coding method in the AES algorithm and the homogenization of pixel positions in the encrypted images.

4.4 Elapsed Time Analyses
The results shown in Table 5 refer to the elapsed time, in seconds, of the encoding and decoding operations for both the AES and RSA algorithms. The speed of the AES algorithm exceeded that of RSA in the encryption and decryption processes by a wide margin.

Table 5. Elapsed time for AES and RSA encryption, decryption processes.
Image  | Encryption time of AES | Encryption time of RSA | Decryption time of AES | Decryption time of RSA
Medium | 0.000270               | 79.956264              | 0.006757               | 1131.39515
Large  | 0.000609               | 186.12002              | 0.005206               | 2617.48209
5 Discussion
Through the criteria of PSNR, histogram, entropy, and elapsed time, AES and RSA provided distinct results in encoding face images. The AES algorithm outperformed RSA in terms of coding quality, processing and execution speed, coding complexity, coding efficiency, and consistency. According to this study, the AES algorithm outperforms the RSA algorithm in encoding the stored face recognition system images.
6 Conclusion
To determine the optimum strategy for securing the photos stored in the facial recognition system, we evaluated symmetric and asymmetric coding strategies. AES, a symmetric encryption algorithm, was employed, as well as RSA, an asymmetric encryption system. A comparison of AES with RSA using quantitative and visual criteria revealed that the AES algorithm outperformed RSA in many areas, including speed of implementation, accuracy of results, encryption quality and efficiency, and image uniformity. As a result, AES outperforms RSA in terms of time, accuracy, efficiency, reliability, and processing speed, and it is recommended to use symmetric encryption techniques to secure the images of the face recognition system because they outperform asymmetric techniques.
References
1. Mohammad, O.F., et al.: A survey and analysis of the image encryption methods. 12, 13265–13280 (2017)
2. Jain, Y., et al.: Image encryption schemes: a complete survey. Int. J. Sig. Process. Pattern Recogn. 9(7), 157–192 (2016)
3. Zhang, Q., Ding, Q.: Digital image encryption based on advanced encryption standard (AES). In: 2015 Fifth International Conference on Instrumentation and Measurement, Computer, Communication and Control (IMCCC), pp. 1218–1221. IEEE (2015)
4. Memon, F., et al.: Image quality assessment for performance evaluation of focus measure operators. J. Eng. Technol. 34, 379–386 (2015)
5. Ebrahim, M., Khan, S., Khalid, U.B.: Symmetric algorithm survey: a comparative analysis. Int. J. Comput. Appl. 61(20) (2014)
6. Abdullah, A.J.C., Security, N.: Advanced encryption standard (AES) algorithm to encrypt and decrypt data. 16 (2017)
7. Chen, X.: Implementing AES encryption on programmable switches via scrambled lookup tables. In: Proceedings of the Workshop on Secure Programmable Network Infrastructure, pp. 8–14 (2020)
8. Banik, S., Bogdanov, A., Regazzoni, F.: Compact circuits for combined AES encryption/decryption. J. Crypt. Eng. 9, 69–83 (2019)
9. Albalas, F., et al.: Security-aware CoAP application layer protocol for the internet of things using elliptic-curve cryptography. Int. Arab J. Inf. Technol. 1333, 151 (2018)
10. Fotohi, R., Firoozi Bari, S., Yusefi, M.: Securing wireless sensor networks against denial-of-sleep attacks using RSA cryptography algorithm and interlock protocol. Int. J. Commun. Syst. 33, e4234 (2020)
11. Farrag, S., Alexan, W.: Secure 2D image steganography using Recamán's sequence. In: 2019 International Conference on Advanced Communication Technologies and Networking (CommNet), pp. 1–6. IEEE (2019)
12. Sara, U., et al.: Image quality assessment through FSIM, SSIM, MSE and PSNR—a comparative study. J. Comput. Commun. 7(3), 8–18 (2019)
13. Cho, G.Y., et al.: Evaluation of breast cancer using intravoxel incoherent motion (IVIM) histogram analysis: comparison with malignant status, histological subtype, and molecular prognostic factors. Eur. Radiol. 26(8), 2547–2558 (2016)
14. Lin, G.-M., et al.: Transforming retinal photographs to entropy images in deep learning to improve automated detection for diabetic retinopathy. J. Ophthalmol. 2018(2), 1–6 (2018)
Aspect-Based Sentiment Analysis of Indonesian-Language Hotel Reviews Using Long Short-Term Memory with an Attention Mechanism
Linggar Maretva Cendani, Retno Kusumaningrum(B), and Sukmawati Nur Endah
Abstract. The development of tourism and technology has given rise to many online hotel booking services that allow users to leave reviews on hotels. Therefore, an analytical model that can comprehensively present the aspects and sentiments in user reviews is required. This study proposes the use of a long short-term memory (LSTM) model with an attention mechanism to perform aspect-based sentiment analysis. The architecture used also implements double fully-connected layers to improve performance. The architecture is used simultaneously for aspect extraction and sentiment polarity detection. Using 5200 Indonesian-language hotelreview data points with labels of five aspects and three sentiments, the model is trained with the configuration of hidden units, dropout, and recurrent dropout parameters in the LSTM layer. The best model performance resulted in a microaveraged F1-measure value of 0.7628 using a hidden units parameter of 128, dropout parameter of 0.3, and recurrent dropout parameter of 0.3. Results show that the attention mechanism can improve the performance of the LSTM model in performing aspect-based sentiment analysis. Keywords: Aspect-based sentiment analysis · Long short-term memory · Attention mechanism · Hotel reviews
1 Introduction Tourism is one of the major supporting sectors of the Indonesian economy, which is rapidly developing. From January to October 2021, the number of foreign tourists coming to Indonesia reached 1.33 million [1]. Developments in the tourism sector are accompanied by developments in the field of technology in Indonesia. Indonesia has the fifth highest number of startups in the world with 2300 startups [2]. The combination of the tourism and technology developments in Indonesia has given rise to many digital startups
that provide online hotel booking services. These services allow users to review hotels by discussing various aspects of the hotel and their sentiments, namely positive, negative, and neutral. As these reviews are large in number, an analytical model is required that can thoroughly and comprehensively present the aspects and sentiments in user reviews, such as aspect-based sentiment analysis (ABSA). Sentiment analysis, or opinion mining, comprises three levels: document, sentence, and aspect [3]. ABSA is a sentiment analysis model that is conducted at the aspect level, assuming that each document or sentence has various aspects or targets with their respective sentiment classes [4]. Research on the ABSA of Indonesian-language reviews has been conducted previously. The first study proposed the use of latent Dirichlet allocation, semantic similarity, and long short-term memory (LSTM) for performing ABSA with GloVe word embedding [5]. The F1-score obtained was 85% for aspect detection and 93% for sentiment analysis. A subsequent study used coupled multi-layer attentions and a bi-LSTM architecture with a double-embedding mechanism in FastText to extract aspects and opinions from the Indonesian-language hotel review domain [6]. This study produced the best results with a combination of double embedding and the attention mechanism, with F1-scores of 0.914 and 0.90, respectively, for the extraction of aspects and opinions. Another study proposed the use of bidirectional encoder representations from transformers (BERT) to perform aspect extraction on the Indonesian-language review domain from the TripAdvisor website [7]. This model produced an F1-score of 0.738. By adding the preprocessing stage, the precision value increased to 0.824. Subsequent research proposed the use of the bidirectional GRU model with multi-level attention in the Indonesian-language health service review domain [8]. This resulted in an F1-score of 88% for sentiment classification and 86% for aspect detection. Another study proposed the use of the convolutional neural network (CNN) model with Word2vec word embedding [9]. The F1-score obtained from this research was 92.02%. However, this study had several shortcomings. It failed to provide a method to determine the target opinion and to deal with the problem of negation in sentiment classification. A later study proposed the use of the BERT or m-BERT multilingual language representation model on Indonesian-language hotel review data and produced an F1-score of 0.9765 [10]. However, misclassification and vocabulary problems persisted because it used multilingual BERT, which is not specific to the Indonesian-language domain. Subsequent research proposed the use of LSTM with the addition of double fully connected layers on top [11]. This study used Word2vec word embedding and produced an F1-score of 75.28%. In [9] and [11], the aspect and sentiment extraction processes were conducted in one process.
Some previous ABSA studies used different methods for sentiment classification and aspect detection [5, 8]. Study [9] performed sentiment classification and aspect detection simultaneously with the same method by implementing deep learning with a CNN architecture. Based on research [12], the LSTM architecture outperforms the CNN architecture in processing text data, such as hotel review data, on sentiment analysis tasks, because the CNN architecture does not have a memory cell and therefore cannot store information about sequence data such as text. Furthermore, study [11] used the LSTM architecture to perform ABSA. The LSTM architecture can remember information in long data sequences using memory cells; thus, it can store the context information contained within a sentence. In addition, LSTM can overcome the vanishing gradient problem that occurs in the vanilla recurrent neural network (RNN) architecture during backpropagation [13]. The LSTM model thoroughly processes text data and assigns the same priority to all words. However, only a few words contain the aspect and sentiment information necessary to obtain the aspect and sentiment output. Because the LSTM model does not prioritize words that contain aspects and sentiments, it sometimes fails to recognize the aspects and sentiments in sentences. This study proposes the use of the LSTM model with the attention mechanism proposed in [14] and [15] to perform ABSA. In contrast to research [6], which used FastText, this study uses Word2vec as word embedding. In research [6], the data are labeled with annotations, while this study uses data whose labels are separate from the review text, so that the review text is raw and not annotated; labeling in this way is considered faster. In addition, this study uses the attention mechanism proposed by Raffel and Ellis [13], which requires lower computing power than other types of attention mechanisms. This study also applies double fully-connected layers to the proposed architecture, which has been shown to improve performance, as in study [11]. One thing that distinguishes this research from other ABSA studies is that most other ABSA studies use different methods or networks for aspect/entity extraction and sentiment polarity detection in the architecture, while this study uses the same network simultaneously for aspect classification and sentiment polarity detection. In addition, most other ABSA studies still use the SemEval dataset, which uses annotations as labels; this study uses a novel dataset whose labels are separate from the review text, so there are no annotations on the text data.
2 Methods Research was conducted in two main stages: data preparation and model building. Data preparation comprised four stages: data collection, data cleaning, data preprocessing, and Word2vec model training. Model building comprised four stages: data splitting, model training, testing, and evaluation. The entire process of this study is shown in Fig. 1.
Fig. 1. Research methods
2.1 Data Collection
The data used in this study were hotel user review data obtained from the Traveloka website by crawling the data using the Selenium library. In total, there were 5200 data points comprising 1–5 star hotel reviews. A total of 2500 data points were obtained from studies [12] and [16], and 2700 data points were obtained from study [11]. Data labeling on the dataset had been performed previously. Each data point was labeled with five aspects, namely "makanan" (foods and beverages), "kamar" (rooms), "layanan" (services), "lokasi" (location), and "lain" (others). The label "lain" (others) is given to data points that contain aspects not included in the first four aspects. Each aspect could have a sentiment polarity value of positive, negative, or neutral. The "neutral" sentiment polarity is given to data points that contain opinions that are neither positive nor negative, or, in other cases, where the review provides both positive and negative opinions on the same aspect. Because there were five aspects and each aspect had three possible sentiment polarities, 15 labels were used for each review, where each label had a binary representation of 1 or 0. The representation of the labels on each review, in the form of a sentiment polarity for each aspect, is shown in Fig. 2, while the data distribution for each aspect is shown in Fig. 3.
Fig. 2. Representation of the sentiment and aspect polarity labels on reviews
Fig. 3. Data distribution for each aspect
The review data primarily discussed the "rooms" aspect, with a total of 3278 reviews. However, the distribution of the data as a whole was balanced. The data were divided into two parts: 200 testing and 5000 development data points comprising training and validation data. Examples of hotel review data and their labels can be seen in Table 1 and Table 2, respectively.

Table 1. The examples of hotel review data
No | Review data | Review data (Eng.)
1  | Kamar ok tempat strategis apalagi deket dengan stasiun | The room is ok, the location is strategic, especially close to the station
2  | Kamar kotor ada banyak rambut belum disapu tissue toilet basah | The room is dirty, there is a lot of hair that hasn't been swept, the toilet tissue is wet
3  | Lantai kotor seperti tidak disapu. Kamar mandi bau. Suasana kamarnya agak serem mungkin karena warna catnya gelap jadi kesan pencahayaannya kurang | The floor is dirty like it hasn't been swept. The bathroom is smelly. The atmosphere of the room is a bit scary, maybe because the paint color is dark so the impression of lighting is lacking
Table 2. The examples of hotel review data labels
No | Foods (+ = −) | Rooms (+ = −) | Services (+ = −) | Location (+ = −) | Others (+ = −)
1  | 0 0 0         | 1 0 0         | 0 0 0            | 1 0 0            | 0 0 0
2  | 0 0 0         | 0 0 1         | 0 0 1            | 0 0 0            | 0 0 0
3  | 0 0 0         | 0 0 1         | 0 0 1            | 0 0 0            | 0 0 1
In Table 2, the sign "+" means positive sentiment polarity, the sign "=" means neutral sentiment polarity, and the sign "−" means negative sentiment polarity.
2.2 Data Cleaning
The data obtained at the data collection stage were still noisy. Data cleaning was performed to improve data quality and eliminate the noise caused by missing values and incorrect labels, so that the data could be used in the next process [17].

2.3 Data Preprocessing
Data preprocessing was conducted to prepare the data for use in training and testing [18]. The data preprocessing in this study had six stages (a sketch of the resulting pipeline is given after this list).
• Case folding: converting all letters into lowercase, so that the data were uniform [19].
• Filtering: eliminating words, letters, characters, or other parts of the text data that appeared often, were considered unimportant, and could affect the performance of the resulting model [20]. The filtering process in this study had three main stages: stopword, punctuation, and number removal. These elements were deleted because they do not have a significant effect on sentiment classification.
• Stemming: changing each word to its basic form. Words in documents typically have affixes in the form of prefixes, suffixes, infixes, and confixes [20]. Stemming was performed by removing these affixes.
• Tokenization: converting document data in the form of a collection of sentences into word parts called tokens [21]. The main process of tokenization was to separate words that were combined in a document into single words, so that each document would be in the form of a list of tokens. The tokenization stage was divided into two processes: sentence and word tokenization. The documents, in the form of a collection of sentences, were first broken down into a list of sentences; thereafter, each sentence, comprising a collection of words, was broken down into a list of tokens, where each token was one word.
• Padding: equating the size or length of each document so that the length of all data was the same [11]. The padding size used the longest data size. Other data with a length less than the padding size were extended with a "" token until their length equaled the padding size.
• Vectorization: converting text data into numeric data. Vectorization was performed to reduce the amount of memory required and to make it easier for machines to process the data [11]. Vectorization produced two outputs: (a) vectorized review data and (b) a vectorizer, which is a word dictionary with contents in the form of pairs of words and numbers that serve as numerical representations of the vectorized review data.
The output of the data preprocessing stage was the output of the vectorization stage, i.e., the preprocessed data and the vectorizer.
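The following is a minimal sketch of such a pipeline. It assumes the PySastrawi stemmer for Indonesian and uses a toy stopword list; the actual libraries, stopword list, and padding token used in the study are not specified here, so these are illustrative choices only.

```python
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

stemmer = StemmerFactory().create_stemmer()
stopwords = {"yang", "dan", "di", "ke", "dari"}        # illustrative stopword list

def preprocess(review: str, max_len: int, vocab: dict) -> list:
    """Turn one raw review into a fixed-length list of word indices."""
    text = review.lower()                               # case folding
    text = re.sub(r"[^a-z\s]", " ", text)               # filtering: punctuation/numbers
    tokens = [t for t in text.split() if t not in stopwords]   # tokenization + stopwords
    tokens = [stemmer.stem(t) for t in tokens]          # stemming to base forms
    ids = [vocab.setdefault(t, len(vocab) + 1) for t in tokens]  # vectorization
    ids = ids[:max_len] + [0] * max(0, max_len - len(ids))       # padding (index 0)
    return ids
```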
2.4 Word2vec Model Training
Word2vec is a word-embedding technique that can represent words in vector form with adjustable length and store the context information of words [22]. The Word2vec model training parameters utilized in this study were the parameters with the best results in study [12]. Four parameters were used: Word2vec architecture, evaluation method, vector dimensions, and window size. The parameters and their values used in the Word2vec model training are listed in Table 3.

Table 3. Word2vec model training parameters
No | Parameters            | Values
1  | Word2vec architecture | Skip-gram
2  | Evaluation method     | Hierarchical Softmax
3  | Vector dimensions     | 300
4  | Window size           | 3
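A sketch of how the Word2vec model could be trained with the Table 3 parameters, assuming the gensim library (version 4.x), is given below; the number of epochs and the vocabulary handling are illustrative assumptions, and `tokenized_reviews` and `vocab` are taken to come from the preprocessing stage.

```python
import numpy as np
from gensim.models import Word2Vec

# sg=1 selects the skip-gram architecture, hs=1 hierarchical softmax (Table 3).
w2v = Word2Vec(sentences=tokenized_reviews, vector_size=300, window=3,
               sg=1, hs=1, min_count=1, epochs=10)

# Build the embedding matrix used by the embedding layer (row 0 reserved for padding).
embedding_matrix = np.zeros((len(vocab) + 1, 300))
for word, idx in vocab.items():
    if word in w2v.wv:
        embedding_matrix[idx] = w2v.wv[word]
```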
Word2vec model training was conducted unsupervised on the preprocessed data and produced two outputs, namely the trained Word2vec model and the embedding_matrix, which is a vector dictionary for each word with dimensions of 300.

2.5 Data Splitting
Before training was conducted, the data were first divided via data splitting. Data splitting is a process for dividing data into training, validation, and testing data, so that the data can be used for training and testing. The 5200 data points from the preprocessing stage were divided into two categories: development data, which amounted to 5000 data points, and testing data, which amounted to 200 data points. The development data were further divided into two categories, namely training data with a total of 4500 data points and validation data with a total of 500 data points.

2.6 Model Training
The architecture of the LSTM model with an attention mechanism comprised an input layer, five hidden layers, and an output layer. The input layer accepted input in the form of preprocessed hotel review data. The output layer had 15 neurons with a sigmoid activation function to predict aspects and sentiments. The five hidden layers used in this study were the embedding, LSTM, attention, and double fully connected layers. The proposed LSTM architecture with the attention mechanism is illustrated in Fig. 4.
Fig. 4. Proposed architecture
The embedding layer used the embedding matrix generated via Word2vec model training to translate the input data into Word2vec word vectors with a word dimension of 300. The output from this embedding layer was then processed at the LSTM layer, and hidden state values for all timesteps or LSTM cells were generated. The LSTM model used in this study was proposed by Hochreiter and Schmidhuber [23] and uses memory cells with three gate units: input, forget, and output. Each LSTM memory cell received three inputs: the previous memory (C_{t-1}), i.e., the memory from the previous cell; the previous hidden state (h_{t-1}), i.e., the hidden state generated by the previous memory cell; and the input vector at the current timestep (x_t). Each memory cell then produced two outputs: the cell state (C_t), or current memory, and the hidden state (h_t). The LSTM memory cell architecture is illustrated in Fig. 5.
The hidden states output by the LSTM layer were processed at the attention layer for word weighting, prioritizing words that better represent the sentiments and aspects contained in the sentence. The attention mechanism for text classification cases proposed by Raffel and Ellis [14] was used in this study. This attention mechanism differs from the self-attention used in Transformer architectures such as BERT [24] and requires less computing power; however, it works like other attention mechanisms by prioritizing inputs (in this case, words) that have a greater weight in obtaining the classification outputs, more specifically words that contain information on aspects and their sentiment polarity.
Fig. 5. LSTM memory cell architecture [24]
The attention mechanism used in this study has been tested previously on datasets with long sequences, and can be applied with the LSTM architecture and fully-connected layer. The architecture of the attention mechanism is shown in Fig. 6.
Fig. 6. Attention mechanism architecture [14]
The hidden states of the LSTM layer were first processed to obtain the alignment score (e_t). The formula for the alignment score is given in Eq. (1):

e_t = a(h_t) = \tanh(W_a h_t + b_a)   (1)
The results of the alignment score calculation were then entered into the softmax activation function to obtain the final value of the context vector (\alpha_t). The formula is given in Eq. (2):

\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}   (2)
The result of the context vector calculation was then multiplied by the hidden state (h_t) at the corresponding time step. Thereafter, the value of the attention vector (c) was obtained by calculating the weighted average of the multiplication results. The formula for the attention vector is given in Eq. (3):

c = \sum_{t=1}^{T} \alpha_t h_t   (3)
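As an illustration, the attention layer defined by Eqs. (1)-(3) could be implemented as a custom Keras layer as sketched below (TensorFlow/Keras assumed; the class name is an illustrative choice, not the implementation used in the study).

```python
import tensorflow as tf
from tensorflow.keras import layers

class FeedForwardAttention(layers.Layer):
    """Feed-forward attention: e_t = tanh(W h_t + b), alpha_t = softmax(e_t),
    c = sum_t alpha_t * h_t, following Eqs. (1)-(3)."""

    def build(self, input_shape):
        hidden_dim = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(hidden_dim, 1),
                                 initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(1,), initializer="zeros")

    def call(self, hidden_states):
        # hidden_states: (batch, timesteps, hidden_dim)
        scores = tf.tanh(tf.matmul(hidden_states, self.W) + self.b)  # Eq. (1)
        alphas = tf.nn.softmax(scores, axis=1)                       # Eq. (2)
        return tf.reduce_sum(alphas * hidden_states, axis=1)         # Eq. (3)
```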
The dimensions of the attention vector were the same as those of the hidden units. This attention vector was then processed by double fully connected layers, referring to research [11], in which this addition was shown to increase the F1-score by 10.16%. The first fully connected layer had 1200 neurons with the TanH activation function, while the second fully connected layer had 600 neurons with the ReLU activation function. The two fully connected layers had a dropout rate of 0.2. The output from the fully connected layers was then processed by the output layer to compute the loss function on the aspect and sentiment label values, a vector of 15 binary values. The loss function used in this study was binary cross-entropy [25]. The formula for binary cross-entropy is given in Eq. (4):

L = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]   (4)

The proposed architecture, with its combination of LSTM, attention mechanism, and fully-connected layers, works in a complementary way. The LSTM captures the context and word order in sentences; the attention mechanism then gives priority to the words that contain the most important information in the sentence; and the double fully-connected layers are added to improve the performance of the first two layers and to classify the aspects and their sentiments, so that accurate output is obtained in the form of the aspects and sentiments contained in the input.

2.7 Testing
Testing was conducted to measure the performance of the resulting model. Testing was conducted using the model generated in the model training process and the testing data from the previous data splitting stage. The results of testing were predictive data in the form of a vector of length 15 with decimal numbers, according to the number of labels. For comparison with the labels on the testing data, the predicted data needed to be thresholded by changing the decimal numbers to binary numbers in the form of 0 or 1. Thresholding was done by rounding numbers greater than or equal to 0.5 up to 1, and rounding numbers less than 0.5 down to 0. The predicted data can provide more than one sentiment prediction for an aspect. For example, the “foods” aspect can have a sentiment polarity result of “positive” and
“negative” simultaneously. Because one aspect can only have one sentiment polarity, it was necessary to determine the sentiment classification for each polarity combination that could occur. The sentiment classification for all possible prediction results is shown in Table 4.

Table 4. Sentiment classification for each polarity combination
No | Polarity combination (Positive, Neutral, Negative) | Classification
1  | 1 0 0 | Positive
2  | 1 0 1 | Neutral
3  | 1 1 0 | Neutral
4  | 1 1 1 | Neutral
5  | 0 1 1 | Neutral
6  | 0 1 0 | Neutral
7  | 0 0 1 | Negative
8  | 0 0 0 | None
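A minimal sketch of the thresholding step and the Table 4 mapping could look as follows; the aspect ordering and names follow Fig. 2 and Table 2, and the function name is an illustrative choice.

```python
import numpy as np

ASPECTS = ["foods", "rooms", "services", "location", "others"]

def classify(predictions: np.ndarray) -> dict:
    """Map the 15 sigmoid outputs to one sentiment per aspect (Table 4 rules)."""
    binary = (predictions >= 0.5).astype(int)          # thresholding at 0.5
    result = {}
    for i, aspect in enumerate(ASPECTS):
        pos, neu, neg = binary[3 * i: 3 * i + 3]       # (+, =, -) for this aspect
        if pos and not neu and not neg:
            result[aspect] = "positive"
        elif neg and not neu and not pos:
            result[aspect] = "negative"
        elif not pos and not neu and not neg:
            result[aspect] = "none"
        else:                                          # every remaining combination
            result[aspect] = "neutral"
    return result
```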
2.8 Evaluation
Evaluation was conducted using the micro-averaged F1-measure, commonly referred to as the F1-score. This metric was used because it is very commonly used in multi-label classification tasks, especially on text data. The F1-score gives a combined performance for all classes with the same importance, with more emphasis on the most common labels in the data; therefore, this metric is preferred for multi-label classification tasks. The F1-score calculation was performed by computing the precision and recall values from the confusion matrix table. The confusion matrix was calculated for each aspect because each review could be classified into more than one aspect. In the micro-averaged F1-measure, the total counts of TP (true positive), FP (false positive), and FN (false negative) in the confusion matrix table do not consider each class individually; they are pooled over all classes [26]. The formulas for precision and recall for the calculation of the micro-averaged F1-measure are given in Eqs. (5) and (6):

Precision = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FP_i)}   (5)

Recall = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FN_i)}   (6)
The formula for the micro-averaged F1-measure, or F1-score, itself is given in Eq. (7):

F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}   (7)
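For illustration, the micro-averaged F1-measure of Eqs. (5)-(7) can be computed as sketched below, either with scikit-learn (assumed here) or directly from the pooled confusion-matrix counts; `y_true` and `y_pred` are binary arrays of shape (n_samples, 15).

```python
import numpy as np
from sklearn.metrics import f1_score

def micro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Micro-averaged F1 via scikit-learn."""
    return f1_score(y_true, y_pred, average="micro")

def micro_f1_manual(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Equivalent direct computation from pooled TP/FP/FN counts (Eqs. 5-7)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```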
3 Results and Discussion
Training was conducted using combinations of three parameters to perform hyperparameter tuning, or optimization. The parameters used were the hidden units, dropout, and recurrent dropout, all of which are in the LSTM layer. The LSTM layer provides the output that is used as the input for the attention layer, so the effect of changing these three parameters in the LSTM layer also affects the process of determining the weights in the attention layer, which in turn affects the performance of the resulting model. The hyperparameters and their values used in the training process are listed in Table 5.

Table 5. Model training hyperparameters
No | Parameters        | Values
1  | Hidden units      | 32, 64, 128, 256
2  | Dropout           | 0.2, 0.3, 0.5
3  | Recurrent dropout | 0.2, 0.3, 0.5
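The following sketch shows how the proposed architecture could be assembled and the Table 5 search space explored, assuming TensorFlow/Keras and the FeedForwardAttention layer sketched in Sect. 2.6; the optimizer, number of epochs, batch size, and the exhaustive grid search are illustrative assumptions rather than the study's exact training configuration, and the training arrays are taken to come from the data splitting stage.

```python
import itertools
import tensorflow as tf
from tensorflow.keras import layers

def build_model(vocab_size, embedding_matrix, max_len,
                hidden_units=128, dropout=0.3, recurrent_dropout=0.3):
    """Embedding -> LSTM -> attention -> double fully connected -> 15 sigmoid outputs."""
    inp = layers.Input(shape=(max_len,))
    x = layers.Embedding(
        input_dim=vocab_size, output_dim=300,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False)(inp)
    x = layers.LSTM(hidden_units, return_sequences=True,
                    dropout=dropout, recurrent_dropout=recurrent_dropout)(x)
    x = FeedForwardAttention()(x)
    x = layers.Dropout(0.2)(layers.Dense(1200, activation="tanh")(x))
    x = layers.Dropout(0.2)(layers.Dense(600, activation="relu")(x))
    out = layers.Dense(15, activation="sigmoid")(x)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Illustrative sweep over the Table 5 values; X_train, y_train, X_val, y_val
# are the arrays produced by the data splitting stage.
for units, drop, rec_drop in itertools.product([32, 64, 128, 256],
                                               [0.2, 0.3, 0.5],
                                               [0.2, 0.3, 0.5]):
    model = build_model(len(vocab) + 1, embedding_matrix, max_len,
                        units, drop, rec_drop)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=10, batch_size=32, verbose=0)
```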
Hyperparameter tuning was conducted to determine the effect of each parameter value on the model performance, as well as to obtain a model with the best combination of parameters. Scenario 1 aimed to determine how changing the hidden units parameter could affect the performance of the model. Hidden units refer to the number of neurons in each cell in the LSTM layer. The hidden units parameters used in this scenario were 32, 64, 128, and 256. The graph of the effect of the number of hidden units on the F1-score is shown in Fig. 7.
Fig. 7. Effect of hidden units parameters on the F1-score
The graph shows that the larger the number of hidden units used, the better the performance. However, this pattern ceased after 128 hidden units; thereafter, the F1-score decreased, as observed for 256 hidden units. Based on the graph, the number of hidden units that resulted in the highest F1-score was 128, which produced an F1-score of 0.7648241206, or approximately 76.48%. Scenario 2 was conducted to identify how changing the dropout parameter would impact the performance of the model. Dropout refers to a technique that randomly removes neurons during training. This technique aims to avoid overfitting to improve the network performance. The dropout parameters used in this scenario were 0.2, 0.3, and 0.5. The graph showing the effect on the F1-score of the dropout used can be seen in Fig. 8.
Fig. 8. Effect of dropout parameters on the F1-score
The graph shows that the dropout rate that yielded the best result was 0.3. Dropout with a rate of 0.2 yielded the worst result because it is excessively small, and there was no increase in the generalization process. The dropout with a rate of 0.5 produced slightly better results than that with a rate of 0.2, but still lower than the dropout rate of 0.3. This is because the dropout rate of 0.5 was excessively large, resulting in more neurons being removed, which made training less effective. A dropout of 0.3 was neither excessively large nor small and produced an average F1-score of 0.76448911222 or approximately 76.45%. Scenario 3 was conducted to identify how changing the recurrent dropout parameter would impact the performance of the model. Recurrent dropout refers to the dropout technique applied to neurons between the recurrent units. The recurrent dropout parameters used in this scenario were 0.2, 0.3, and 0.5. The graph showing the effect of recurrent dropout on the F1-score can be observed in Fig. 9.
Fig. 9. Effect of recurrent dropout parameters on the F1-score
The graph shows that the recurrent dropout rate that yielded the best result was 0.3. Recurrent dropout with a rate of 0.5 produced the worst result because recurrent dropout is applied to the neurons between recurrent units and can therefore affect the memory for each word or timestep. A rate of 0.5 is excessively large, causing many neurons to be removed and weakening the model's ability to remember words from previous timesteps. Recurrent dropout with a rate of 0.2 produced a better result, close to that of the rate of 0.3, because the memory of previous timesteps remains relatively strong with a recurrent dropout of 0.2. However, its performance was still worse than that of the rate of 0.3, because a recurrent dropout of 0.2 tends not to improve the generalization process. Recurrent dropout with a rate of 0.3 produced the best result because it yielded an adequate improvement in generalization. A recurrent dropout rate of 0.3 produced an average F1-score of 0.76423785594, or approximately 76.42%. Scenario 4 was the determination of the model with the best parameters. Based on the experimental results in scenarios 1, 2, and 3, the best parameters obtained were 128 hidden units, a dropout rate of 0.3, and a recurrent dropout rate of 0.3. The model using these best parameters produced an F1-score of 0.7628140703517587, or approximately 76.28%. The last scenario, scenario 5, was conducted to compare the performance of the resulting model with that of the previous research model that did not use the attention mechanism technique [11]. As explained in the Introduction, the LSTM architecture outperforms other architectures such as CNN in processing text data because LSTM has a memory cell that can process long sequences; therefore, this study only compares the LSTM architecture with and without the attention mechanism. The results of the comparison assess the effect of using the attention mechanism on the LSTM architecture to perform ABSA on hotel reviews in the Indonesian language. The results of the comparison between the two models are shown in Fig. 10.
Fig. 10. Model performance comparison with previous study
The graph shows that the proposed architecture produced an F1-score of 0.7628140703517587, outperforming the architecture of the previous study, which produced an F1-score of 0.752763819. The model using the attention mechanism technique is proven to be able to give a higher priority value to words that contain aspects and sentiments. The model generated from the proposed architecture with the best parameters can increase the F1-score with the addition by 0.01005025135 or approximately 1.005%. The resulting increase is not very large, but in the case of ABSA, the increase in the F1-score is significant and is sufficient to provide significant results in real-time implementation. Despite obtaining adequate results, some misclassifications still occurred. For example, there were several reviews with negative sentiment classes that were predicted as neutral sentiment classes. This happened because there were some sentences that were excessively short and did not refer to the aspects discussed but contained some sentiment polarity. For example, in the sentence “Bagus, kebersihan kamar kurang,” there are two words that can represent two different sentiments, namely “bagus” for positive sentiments and “kurang” for negative sentiments. However, there is only one aspect that is discussed, namely “kebersihan kamar.” The word “bagus” at the beginning is not clearly intended for a particular aspect, because it seems very general, while the next sentence clearly refers to the aspect of “kebersihan kamar” with a negative sentiment polarity. Therefore, the sentence should give a negative sentiment for the room aspect, but the model gives an incorrect classification because there are words that are too general and are not necessarily in reference to a particular aspect. In addition, there were several positive and negative reviews that were not classified in any sentiment class (None) because the model could not recognize the aspects and sentiments discussed. This can happen if words appear that are not recognized by the model but are actually very representative of the aspects and sentiments discussed, such as the use of foreign languages or languages that are not found in the word dictionary in the Word2vec training process. Because the model does not recognize the word, it cannot detect aspect and sentiment values.
4 Conclusion

From the results of the experiments conducted on the LSTM architecture with an attention mechanism, it can be concluded that the best model is formed using 128 hidden units, a dropout rate of 0.3, and a recurrent dropout rate of 0.3. The best model in this study resulted in an F1-score of 0.7628140703517587, or 76.28%. This value outperforms the models from previous studies that only used the LSTM architecture without an attention mechanism, the best of which produced an F1-score of 0.752764, or approximately 75.28%.

Acknowledgment. The data used in this study were generated at the Intelligent Systems Laboratory at the Department of Informatics, Diponegoro University. The data are available on request from the corresponding author, Retno Kusumaningrum. Please email [email protected] to request access.
References 1. Badan Pusat Statistik: Kunjungan Wisatawan Mancanegara per bulan Menurut Kebangsaan (Kunjungan), Badan Pusat Statistik (2021). https://www.bps.go.id/indicator/16/1470/1/kun jungan-wisatawan-mancanegara-per-bulan-menurut-kebangsaan.html 2. Startup Ranking: Countries - With the top startups worldwide. Startup Ranking (2021). https:// www.startupranking.com/countries 3. Liu, B.: Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers (2012) 4. Ma, Y., Peng, H., Cambria, E.: Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM. In: 32nd AAAI Conference Artificial Intelligence AAAI, vol. 2018, pp. 5876–5883 (2018) 5. Priyantina, R.A., Sarno, R.: Sentiment analysis of hotel reviews using Latent Dirichlet Allocation, semantic similarity and LSTM. Int. J. Intell. Eng. Syst. 12(4), 142–155 (2019). https:// doi.org/10.22266/ijies2019.0831.14 6. Fernando, J., Khodra, M.L., Septiandri, A.A.: Aspect and opinion terms extraction using double embeddings and attention mechanism for Indonesian hotel reviews. In: Proceedings 2019 International Conference Advance Informatics Concepts, Theory, Appl. ICAICTA 2019 (2019). https://doi.org/10.1109/ICAICTA.2019.8904124 7. Yanuar, M.R., Shiramatsu, S.: Aspect extraction for tourist spot review in Indonesian language using BERT. In: 2020 International Conference Artificial Intelligence Information Communication ICAIIC, vol. 2020, pp. 298–302 (2020). https://doi.org/10.1109/ICAIIC48513.2020. 9065263 8. Setiawan, E.I., Ferry, F., Santoso, J., Sumpeno, S., Fujisawa, K., Purnomo, M.H.: Bidirectional GRU for targeted aspect-based sentiment analysis based on character-enhanced tokenembedding and multi-level attention. Int. J. Intell. Eng. Syst. 13(5), 392–407 (2020). https:// doi.org/10.22266/ijies2020.1031.35 9. Bangsa, M.T.A., Priyanta, S., Suyanto, Y.: Aspect-based sentiment analysis of online marketplace reviews using convolutional neural network. IJCCS (Indonesian J. Comput. Cybern. Syst. 14(2), 123 (2020). https://doi.org/10.22146/ijccs.51646 10. Azhar, A.N.: Fine-tuning Pretrained Multilingual BERT Model for Indonesian Aspect-based Sentiment Analysis (2020) 11. Jayanto, R., Kusumaningrum, R., Wibowo, A.: Aspect-based sentiment analysis for hotel reviews using an improved model of long short-term memory, unpublished
12. Muhammad, P.F., Kusumaningrum, R., Wibowo, A.: Sentiment analysis using Word2vec and long short-term memory (LSTM) for Indonesian hotel reviews. Procedia Comput. Sci. 179(2020), 728–735 (2021). https://doi.org/10.1016/j.procs.2021.01.061 13. Smagulova, K., James, A.P.: A survey on LSTM memristive neural network architectures and applications. Eur. Phys. J. Spec. Top. 228(10), 2313–2324 (2019). https://doi.org/10.1140/ epjst/e2019-900046-x 14. Raffel, C., Ellis, D.P.W.: Feed-forward networks with attention can solve some long-term memory problems, pp. 1–6 (2015) 15. Sun, X., Lu, W.: Understanding Attention For Text Classification, no. 1999, pp. 3418–3428 (2020). https://doi.org/10.18653/v1/2020.acl-main.312 16. Nawangsari, R.P., Kusumaningrum, R., Wibowo, A.: Word2vec for Indonesian sentiment analysis towards hotel reviews: an evaluation study. Procedia Comput. Sci. 157, 360–366 (2019). https://doi.org/10.1016/j.procs.2019.08.178 17. Roh, Y., Heo, G., Whang, S.E.: A survey on data collection for machine learning: a big dataAI integration perspective. IEEE Trans. Knowl. Data Eng. 33(4), 1328–1347 (2021). https:// doi.org/10.1109/TKDE.2019.2946162 18. Mujilahwati, S.: Pre-Processing Text Mining Pada Data Twitter, Semin. Nas. Teknol. Inf. dan Komun., vol. 2016, no. Sentika, pp. 2089–9815 (2016) 19. Jurafsky, D., Martin, J.H.: Speech and Language Processing: An Introduction to Natural Language Processing,Computational Linguistics, and Speech Recognition. Prentice Hall (2021) 20. Najjichah, H., Syukur, A., Subagyo, H.: Pengaruh Text Preprocessing Dan Kombinasinya Pada Peringkas Dokumen Otomatis Teks Berbahasa Indonesia, J. Teknol. Inf., vol. XV, no. 1, pp. 1–11 (2019) 21. Rosid, M.A., Fitrani, A.S., Astutik, I.R.I., Mulloh, N.I., Gozali, H.A.: Improving Text Preprocessing for Student Complaint Document Classification Using Sastrawi, IOP Conf. Ser. Mater. Sci. Eng. 874, 1 (2020). https://doi.org/10.1088/1757-899X/874/1/012017 22. Kurniasari, L., Setyanto, A.: Sentiment analysis using recurrent neural network-lstm in bahasa Indonesia. J. Eng. Sci. Technol. 15(5), 3242–3256 (2020) 23. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 13–39 (1997). https://doi.org/10.1007/978-1-4757-5388-2_2 24. Zhang, F., Fleyeh, H., Bales, C.: A hybrid model based on bidirectional long short-term memory neural network and Catboost for short-term electricity spot price forecasting, J. Oper. Res. Soc. 1–25 (2020). https://doi.org/10.1080/01605682.2020.1843976 25. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 2017(Nips), 5999–6009 (2017) 26. Parmar, R.: Common Loss functions in machine learning, Towards Data Science (2018). https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0f fc4d23 27. Shmueli, B.: Multi-Class Metrics Made Simple, Part II: the F1-score. Towards Data Science (2019). https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-thef1-score-ebe8b2c2ca1
Big Data and Machine Learning in Healthcare: Concepts, Technologies, and Opportunities Mustafa Hiri1(B) , Mohamed Chrayah2 , Nabil Ourdani1 , and Taha el alamir1 1 TIMS Laboratory, FS, UAE, Tetuan, Morocco
[email protected] 2 TIMS Laboratory, ENSATE, UAE, Tetuan, Morocco
Abstract. Recently, big data and machine learning have become increasingly important in various environments, including health care, industry, government, scientific research, business organizations, social networking, and natural resource management. In particular, they apply to the field of health care and make it possible to process and analyze many data that humans or traditional computer tools cannot do simultaneously. Big data and machine learning are having a significant influence on the health care business, and their analyses may help the whole industry. The opportunities they will create will serve as a springboard for the future of the health care business. This will lead to a paradigm shift, shifting the industry’s focus from disease-based diagnosis to patient-based diagnosis. This article will discuss big data technology, including its definition, characteristics, technology, relationship, and influence on the healthcare industry. The concept of machine learning technology, as well as its influence on healthcare, will be also explored. Keywords: Big data · Big Data Analytics · Healthcare · Machine learning · Big data opportunities
1 Introduction

With the degradation of our environment and its consequences on health, the health effects associated with the environment are increasingly being integrated into the public debate. Populations are exposed to many pollutants in their environment (food, environmental pollution, work, etc.) and they are the first to be affected by these global health issues. In parallel with the emergence of this environmental issue, the use of new information and communication technologies is increasing, in particular Big Data and Machine Learning. In this study, we examine the effects of Big Data and Machine Learning on health care. Big Data is a term used to describe vast databases with a broad, diverse, and dynamic structure, as well as the challenges of storing, analyzing, and visualizing them for other processes or outcomes. Humanity had generated 5 exabytes (10^18 bytes) of data up to 2003; by 2013, this amount of data was being produced every two days [1]. At present, total global data volume is growing exponentially with the advancement of digital technology (Fig. 1). A massive amount of “Big Data” is rapidly accumulating and gives rise to new forms of statistics. Unstructured, semi-structured, and structured data are all included in the Big Data concept, but unstructured data is the main category [2].
Fig. 1. The world’s total data volume [3].
Digital health data have increased considerably in recent years with the rapid growth of new technologies. Additional data sources have been added due to further medical discoveries and emerging innovations such as the internet, mobile applications, capturing devices, novel sensors, and wearable technology. As a result, the healthcare industry generates massive digital data by combining data from different sources such as medication histories, lab reports, Electronic Health Records (EHRs), and Personal Health Records (PHRs) [4]. The health sector spans the industries that provide medical, preventive, social, and reassurance services to patients, together with the properties and services provided by the commercial system; it includes nursing homes, telemedicine, medical testing, health insurance, outsourcing, charities, and health care facilities [5]. According to Google Trends, the term “Big Data in Healthcare” first emerged in late 2011 (Fig. 2). The increase in interest in this concept can be traced back to a well-known study published by McKinsey & Company in early 2013 [6]. According to that study, healthcare costs account for about 17.6% of GDP, and Big Data analytics could cut healthcare spending by $300 billion to $450 billion [7].
Fig. 2. The Google trend of Big Data in healthcare between 2004 and today [31].
The rest of the paper is organized as follows. Section 2 delves into the topic of Big Data. Big Data Analytics is covered in Sect. 3. Section 4 delves into the topic of Machine Learning. The impact of Big Data on health care will be discussed in Sect. 5. The influence of Machine Learning on health care is discussed in Sect. 6. Section 7 concludes with a summary of the work.
2 The Concept of Big Data

2.1 Big Data Definition

The word “Big Data” has been in use since the 1990s, with credit for its popularization going to John Mashey [8]. In 2000, Francis X. Diebold [9] gave one of the first definitions, describing Big Data as an explosion in the amount of accessible and potentially essential data. The term “Big Data” is now used for datasets that have grown so large that they are challenging to work with in traditional database management systems [10]. It refers to all the digital data produced by using new technologies for personal or professional purposes. This includes company data, sensor data, content published on the web, exchanges on social networks, e-commerce transactions, data transmitted by connected objects, medical data, etc.

2.2 Characteristics of Big Data

Most definitions of Big Data focus on the volume of data stored. However, Big Data has other essential characteristics, such as data variety and velocity, in addition to its size. The six V’s of Big Data (volume, variety, velocity, veracity, variability, and value) form a systematic concept that debunks the idea that Big Data is solely about volume (Fig. 3). Each of the six V’s has its own definition.

• Volume refers to the quantity of data we have access to. Traditionally, it was measured in gigabytes, but it is now estimated in yottabytes. In addition, the Internet of Things is causing data to grow exponentially; such a large amount of data creates a massive storage and analysis problem.
• Variety refers to the heterogeneity and complexity of many structured, semi-structured, and unstructured data sets.
• Velocity is the rate at which data is generated, developed, produced, or updated. It therefore covers both the speed of data creation and the rate of data processing needed to meet demand. The data generated can be batch or real-time data [11].
• Veracity means ensuring that the data is accurate and requires methods to prevent erroneous data from entering the system. Unfortunately, Big Data has an insufficient degree of truthfulness, can never be entirely correct [12], and is difficult to verify because most of the data originate from unknown and unverified sources. As a result, it is critical to have a standard to ensure the quality of the data involved.
• Variability is related to the consistency of data over time. • Value is the process of obtaining useful information from massive data volumes, often known as Big Data analysis [13]. The value of the data helps make appropriate decisions.
Fig. 3. The 6 V’s of big data [32].
Some authors have described more than these 6 “Vs” to describe the characteristics of Big Data [14]. 2.3 Big Data in Healthcare Big Data is described in healthcare as electronic health data sets that are too massive and complicated to manage with traditional software or hardware [16]. Big Data in healthcare is intimidating, not just because of its sheer magnitude but also because of the diversity of data types and the speed it must handle. Big Data in medicine and clinics includes various types and large amounts of data generated by hospitals and new technologies. Some of the sources of Big Data in healthcare are electronic health record data, clinical registries, data generated from the machine/sensor system, administrative databases (for example, claims for services and pharmaceuticals), social media posts, including status updates on Facebook, Twitter feeds (so-called tweets) and other platforms, medical imaging information, and biomarker data, including all the spectrum of ‘Omics’ data [16, 17].
3 Big Data Analytics

Decision-makers have been increasingly interested in learning from previous data to achieve a competitive edge, affecting science and technologies [10]. Data analytics is the method of analyzing data sets and extracting valuable and unknown patterns, relationships, and information using algorithms [10]. Big Data analytics aims to better exploit large sets of data to detect associations between pieces of information, recognize previously unknown patterns and trends, better understand consumer or target desires so as to make a company more responsive and intelligent in its market, and predict marketing phenomena. Organizations ranging from businesses and research institutions to governments are now routinely producing data of unprecedented scope and complexity. Extracting meaningful information and competitive advantages from massive amounts of data has become increasingly important for organizations all over the world, yet it is not easy to do so quickly and efficiently. As a result, analytics has become indispensable in realizing the total value of Big Data to improve business performance and increase market share. In recent years, the tools available to handle the volume, velocity, and variety of Big Data have greatly improved. These technologies are not prohibitively expensive in general, and much of the software is open source [15].
There are many Big Data analysis tools available. Here are some of them:

• Hadoop aids in the storage and analysis of data.
• MongoDB is a database that is used to store constantly changing datasets.
• Talend is a data integration and management platform.
• Cassandra is a distributed database that is used to handle chunks of data.
• Spark is an open-source distributed computing framework that is used to process and analyze vast volumes of data in real-time.
• Kafka is a fault-tolerant storage platform that uses distributed streaming.
• Storm is a real-time open-source computing framework.

3.1 Hadoop

Hadoop is a free and open-source software framework for storing data and running applications on clusters of standard computers. This system provides enormous storage space for all forms of data, immense processing power, and the ability to manage an almost infinite number of activities. This framework, based on Java, is part of the Apache project supported by the Apache Software Foundation. The Hadoop ecosystem includes the Hadoop cluster, MapReduce, HDFS, and several more components such as Apache Hive, HBase, and Zookeeper. The following sections discuss HDFS and MapReduce.

HDFS
The Hadoop Distributed File System (HDFS) is a fault-tolerant storage system included in Hadoop. HDFS can store massive volumes of data, scale up progressively, and sustain significant storage infrastructure failures without losing data [33]. It collaborates with MapReduce to distribute storage and computation over large clusters by combining
storage resources that may scale based on requests and queries while remaining affordable and within budget. Text, photos, videos, and other data types are all accepted by HDFS [34]. HDFS has a master/slave architecture, with a single name node or master server to administer the file system namespace and govern client access to files. Furthermore, the storage layer of the cluster divides the data into ‘blocks,’ which are then redundantly stored throughout the server pool; by default, the HDFS architecture keeps three copies of each block. Its main components are the name node or master node, the data nodes, and the HDFS clients or edge nodes [35]. The HDFS architecture is shown in Fig. 4.
Fig. 4. HDFS (Hadoop Distributed File System) [35]
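As a brief, hedged illustration of interacting with HDFS from Python, the sketch below uses the third-party hdfs package against a WebHDFS endpoint; the name-node address, user, paths, and file names are assumptions made for the example.

```python
from hdfs import InsecureClient

# Connect to the WebHDFS endpoint exposed by the name node (address/user are assumptions).
client = InsecureClient("http://namenode:9870", user="hadoop")

# Upload a local file; HDFS splits it into blocks and replicates each block (3 copies by default).
client.upload("/data/reviews.csv", "reviews.csv")

# List a directory and read part of the file back.
print(client.list("/data"))
with client.read("/data/reviews.csv", encoding="utf-8") as reader:
    print(reader.read()[:200])
```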
MapReduce
The MapReduce framework is the Hadoop ecosystem’s processing pillar. Hadoop MapReduce is a framework for developing applications that process and analyze massive data sets in parallel on commodity hardware clusters in a scalable, reliable, and fault-tolerant manner. MapReduce is a popular technique for batch data processing. The data are split into smaller parts dispersed over several nodes so that intermediate results are obtained; when processing is over, these results are integrated to produce the final results [33, 34]. A MapReduce program is divided into three stages: map, shuffle, and reduce.

• The map’s or mapper’s job is to process the input data. In most cases, the input data comes as a file or directory stored in the Hadoop file system (HDFS). The input file is supplied to the mapper function line by line; the mapper parses the data and generates multiple small pieces of intermediate information.
• The shuffling stage consumes the output of the mapping phase. Its task is to consolidate the related records from the mapping phase output.
• The reduce stage is the last stage in MapReduce. The reducer’s role is to process the data from the shuffle stage. After processing, it generates a new set of output, which is saved in HDFS.

3.2 Spark

Spark is a fault-tolerant, in-memory data analytics engine. It contains numerous components that enable the execution of Scala, Java, Python, and R jobs. The Apache Software Foundation introduced Spark to accelerate Hadoop's computational processing. Spark is intended to handle various workloads, including batch applications, iterative algorithms, interactive queries, and streaming [36]. Spark includes several higher-level tools, such as Spark SQL for SQL, GraphX for graph processing, MLlib for machine learning, and Spark Streaming for streaming data [37].

Spark SQL is a Spark Core component that introduces a data abstraction called SchemaRDD, which supports structured and semi-structured data.
GraphX is a Spark distributed framework for graph processing.
MLlib (Machine Learning Library) is a Spark distributed framework for machine learning. It is designed for large-scale learning environments that benefit from data-parallelism or model-parallelism for storing and processing data and models. MLlib is a collection of fast and scalable implementations of popular learning algorithms such as classification, regression, collaborative filtering, clustering, and dimensionality reduction for common learning scenarios. It also includes primitives for fundamental statistics, linear algebra, and optimization. MLlib supports Java, Scala, and Python APIs and is released as part of the Spark project under the Apache 2.0 license. It is written in Scala and uses native (C++-based) linear algebra libraries on each node [38]. Spark MLlib’s performance has been reported to be up to 100 times higher than MapReduce [37].
Spark Streaming is a Spark-based stream computing platform that includes a robust API and integrates batch, streaming, and interactive query applications.
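As a small illustration of how MLlib is typically used, the following PySpark sketch trains a logistic regression model on a patient-level table; the file name, column names, and the task of predicting a 0/1 clinical label are assumptions made for the example rather than part of the works surveyed here.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-health-demo").getOrCreate()

# Hypothetical patient-level file: numeric feature columns plus a 0/1 "label" column.
df = spark.read.csv("patients.csv", header=True, inferSchema=True)

assembler = VectorAssembler(
    inputCols=["age", "bmi", "glucose", "blood_pressure"],  # assumed column names
    outputCol="features",
)
data = assembler.transform(df).select("features", "label")

train, test = data.randomSplit([0.7, 0.3], seed=42)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(test).select("label", "prediction").show(5)

spark.stop()
```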
4 The Concept of Machine Learning 4.1 Machine Learning Definition Machine Learning is a branch of AI (artificial intelligence) in which computer programs (algorithms) discover predictive power connections from instances in data. Machine learning is essentially the use of computers to apply statistical models to data. Machine Learning employs a larger range of statistical approaches than are commonly employed in medicine. Deep Learning, for example, is based on models that make fewer assumptions about the underlying data and can thus handle more complicated data [19]. The word “Machine Learning” has a broad meaning. For example, the analysis of techniques and methods for recognizing patterns in data is known as Machine Learning. These trends can then either help us understand the present world (e.g., acknowledge infection risk factors) or can help us foresee the future (e.g., predict who will become infected) [25].
Although there are various forms of Machine Learning, most implementations fall into one of three categories: supervised, unsupervised, or reinforcement learning [25].

4.2 Deep Learning

Deep Learning [39–41] (also known as deep neural networks) is a field of machine learning that uses mathematical models inspired by the biological brain to recognize patterns [42]. Deep Learning is a collection of computational approaches that allow an algorithm to program itself by learning from a huge number of instances that illustrate the desired behavior, eliminating the need to specify rules explicitly [29]. Deep Learning is known to be highly successful in many areas such as computer vision, natural language, speech recognition, and time series (e.g., ECG).

4.3 Machine Learning in Healthcare

Machine Learning has been used in a variety of medical fields, including diabetes, cancer, cardiology, and mental health [20]. There are several machine learning approaches that may be applied to healthcare; certain models, such as support vector machines or decision trees, can be used for both classification and regression tasks, whilst others are better suited to a single job (e.g., logistic regression for classification). Machine Learning algorithms are effective at discovering complex patterns in large amounts of data. This capability is ideal for clinical applications, especially those that rely on sophisticated genomic and proteomic measurements. It is frequently employed in the diagnosis and detection of a variety of illnesses. Machine learning algorithms in medical applications can support better judgements about treatment plans for patients and suggest how to create a more effective health-care system [43].
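As a small, hedged illustration of this kind of diagnostic use, the sketch below trains a decision tree on scikit-learn's bundled breast-cancer dataset; the dataset choice and hyperparameters are arbitrary examples, not results from the works surveyed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Bundled tumour-measurement dataset with benign/malignant labels.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```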
5 The Impact of Big Data on Healthcare Many companies, organizations, banking sectors, government sectors, telecommunications, and traffic control use Big Data analytics tools and innovations to improve their business value, decision-making processes, successful marketing, customer loyalty, and increase their profits [5]. Various applications in the healthcare industry have been attempting to benefit from Big Data technology for the past few years. Today’s healthcare organizations are using Big Data to optimize treatment, save lives, deter fraud, minimize waste, and other things. In six ways, the quest for massive volumes of data would significantly impact the healthcare environment (illustrated in Fig. 5). As the healthcare sector advances to the best, improving patient outcomes along these pathways, as detailed below, will be at the heart of the healthcare system and directly impact the patient [18–21]. Reduced Cost: Predictive analysis is one field that significantly lowers healthcare costs. It enables healthcare organizations to reliably forecast the cost of a treatment in terms of staffing and to plan their resources more effectively. According to McKinsey &
Company [22], Big Data could save the healthcare industry from $300 billion to $450 billion per year. Improved results for patients: Big data can boost healthcare quality and productivity [23, 24]. Physicians and other medical professionals may use Big Data to make more precise and accurate diagnoses and treatments, which help healthcare organizations, improve patient outcomes. In addition, using Big Data processing tools, healthcare facilities can provide safer and more accurate responses to unusual illnesses. Health Monitoring: Big Data is critical in assisting medical institutions in actively analyzing their patients’ lives, including cardiovascular, eating, and other things. This form of comprehensive health monitoring is needed to design and implement preventative health care solutions such as pulse rate, sugar level, blood pressure, and other metrics. Increased Security: In the healthcare industry, personal data is precious, and any breach can have disastrous consequences. As a result, an increasing number of businesses are turning to data analytics to help them prevent security threats by detecting suspicious behavior that could indicate a cyber-attack or changes in network traffic. Push Innovation: Big Data helps healthcare organizations accelerate progress by raising the pace at which new medicines and therapies are identified and enhance care quality. The core goal of Big Data in health care is to systematically detect challenges and then find creative ways to assist organizations in reducing total costs. Multiple stakeholders in the medical process, such as healthcare professionals, patients, suppliers, and insurers, may benefit from the process. Improved Patient Engagement: Big Data will also benefit the healthcare industry by increasing patient participation. Patients can accurately monitor their medical records using smartphone applications and smart-watches, such as keeping track of their heart rate after an exercise. All of this information is then processed and stored in the cloud, where doctors can access and study it to keep track of their patients. This virtually removes the need for patients to attend medical facilities for unnecessary tests.
Fig. 5. Big data impacts.
6 The Impact of Machine Learning in Healthcare

The field of artificial intelligence (AI) research in medicine is rapidly expanding. Healthcare AI ventures attracted more investment in 2016 than any other field of the global economy [26]. One of the most critical and influential methods for analyzing highly complex medical data is Machine Learning. With large quantities of medical data being generated, it is essential to use this data efficiently to support the medical and health care sectors all over the world [27]. Work on identifying lymph node metastases from breast pathology [28], diagnosing diabetic retinopathy [29], autism subtyping by clustering comorbidities [30], and other work has sparked much interest in Machine Learning for healthcare. This technology’s arrival has brought with it a slew of advantages (some of them are shown in Fig. 6):

• Cost-cutting in the healthcare sector.
• Improved efficiencies and workflows.
• Improved contact between health care providers and patients.
• Better patient management, for example through tele-health solutions.
• Identifying and diagnosing diseases.
• Predicting diseases.
• Personalized medical treatment.
• Drug discovery and manufacture.
Fig. 6. Some of machine learning impacts.
7 Conclusion and Future Work With the help of Big Data, Machine Learning, and other technology such as Cloud Computing, medical industries will keep up with the technology and remain on top.
Medical professionals and researchers will improve care measures, diagnostic procedures, and preventive steps by analyzing Big Data. As a result, they will also be able to develop medical advancements. This paper attempted to present the numerous benefits that Big Data and Machine Learning technologies can provide to health care systems. It also includes an overview of Big Data’s concept and characteristics, Big Data Analytics, Big Data in health care, and an overview of Machine Learning. Our aim for future work is to create a Big Data analytics framework that responds to current health care trends with the help of Deep Learning.
References 1. Sagiroglu, S., Sinanc, D.: Big data: a review. 42–47 (2013) 2. Poel, H.G.: Big data. Tijdschr. Urol. 7(1), 1–1 (2017). https://doi.org/10.1007/s13629-0170001-x 3. Wu, Y.: South-South Corporation in a Digital World (2020) 4. Anfar, S., Yang, Y., Chen, S.C. Iyengar, S.S.: Computational health informatics in the big data age: a survey. ACM Comput. Surv. 49(1) 1–36 (2016) 5. Thara, D.K., Premasudha, B.G., Ram, V.R., Suma, R.: Impact of big data in healthcare: a survey. In: Proceedings of 2016 2nd International Conferance Contemporary Computing Informatics, IC3I 2016 5, 729–735 (2016) 6. Groves, P., Kayyali, B., Knott, D., Van Kuiken, S.: The big data revolution in healthcare: accelerating value and innovation. Proces. Leng. Nat. (2013) 7. Senthilkumar, S.A., Rai, B, Meshram, A.A., Gunasekaran, A. S, C.: Big data in healthcare management: a review of literature. Am. J. Theor. Appl. Bus. 4, 57 (2018) 8. Yali, C.: The protection of database copyright in the era of big data. J. Phys. Conf. Ser. 1437 (2020) 9. Diebold, F.X.: Big Data Dynamic Factor Models for Macroeconomic Measurement and Forecasting (2000) 10. Elgendy, N., Elragal, A.: Big data analytics: a literature review paper. In: Perner, P. (ed.) ICDM 2014. LNCS (LNAI), vol. 8557, pp. 214–227. Springer, Cham (2014). https://doi.org/ 10.1007/978-3-319-08976-8_16 11. Bonney, S.: HIM’s role in managing big data: turning data collected by an EHR into information. J. AHIMA 84, 62–64 (2013) 12. Ward, J.C.: Oncology reimbursement in the era of personalized medicine and big data. J. Oncol. Pract. 10, 83–86 (2014) 13. Bello-Orgaz, G., Jung, J.J., Camacho, D.: Social big data: recent achievements and new challenges. Inf. Fusion 28, 45–59 (2016) 14. Alexandru, A., Alexandru, C.A., Coardos, D., Tudora, E.: Definition, A.B.D.: Big Data: Concepts, Technologies and Applications in the PublicSector 10, 8–10 (2016) 15. Rajaraman, V.: Big data analytics. Resonance 21(8), 695–716 (2016). https://doi.org/10.1007/ s12045-016-0376-7 16. Raghupathi, W., Raghupathi, V.: Big data analytics in healthcare: promise and potential, 1–10 (2014) 17. Rumsfeld, J.S., Joynt, K.E., Maddox, T.M.: Big data analytics to improve cardiovascular care: promise and challenges. Nat. Rev. Cardiol. 13, 350–359 (2016) 18. Kaur, P., Sharma, M., Mittal, M.: Big data and machine learning based secure healthcare framework. Procedia Comput. Sci. 132, 1049–1059 (2018)
19. Archenaa, J., Anita, E.A.M.: A survey of big data analytics in healthcare and government. Procedia Comput. Sci. 50, 408–413 (2015) 20. Kumar, S., Singh, M.: Big data analytics for healthcare industry: impact, applications, and tools. Big Data Min. Anal. 2, 48–57 (2019) 21. Roy, A.K.: Impact of big data analytics on healthcare and society. J. Biom. Biostat. 7, 1–7 (2016) 22. Kayyali, B., Knott, D., Steve Van, K.: The big-data revolution in US health care: accelerating value and innovation (2013) 23. Chawla, N.V., Davis, D.A.: Bringing big data to personalized healthcare: a patient-centered framework. J. Gen. Intern. Med. 28, 660–665 (2013) 24. Moore, P., Thomas, A., Tadros, G., Xhafa, F., Barolli, L.: Detection of the onset of agitation in patients with dementia: real-time monitoring and the application of big-data solutions. Int. J. Space-Based Situated Comput. 3, 136 (2013) 25. Wiens, J., Shenoy, E.S.: Machine learning for healthcare: on the verge of a major shift in healthcare epidemiology. Clin. Infect. Dis. 66, 149–153 (2018) 26. Buch, V.H., Ahmed, I., Maruthappu, M.: Debate & analysis artificial intelligence in medicine : current trends and future possibilities. Br. J. Gen. Pract. 68, 143–144 (2018) 27. Garg, A., Mago, V.: Role of machine learning in medical research: a survey. Comput. Sci. Rev. 40(2021) 28. Golden, J.A.: Deep learning algorithms for detection of lymph node metastases from breast cancer helping artificial intelligence be seen. JAMA - J. Am. Med. Assoc. 318, 2184–2186 (2017) 29. Gulshan, V., et al.: Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA - J. Am. Med. Assoc. 316, 2402–2410 (2016) 30. Doshi-Velez, F., Ge, Y., Kohane, I.: Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis. Pediatrics 133 , 54–63 (2014) 31. Google Trends Homepage. https://trends.google.com/trends/explore?date=all&q=big%20d ata%20in%20Healthcare. Accessed 1 Feb 2022 32. Andreu-Perez, J., Poon, C.C.Y., Merrifield, R.D., Wong, S.T.C., Yang, G.Z.: Big Data for Health. IEEE J. Biomed. Heal. Informatics 19(4), 1193–1208 (2015) 33. Bhosale, H.S., Gadekar, D.P.: A review paper on big data and hadoop. Int. J. Sci. Res. Publ. 4(10), 1–7 (2014) 34. Ghazi, M.R., Gangodkar, D.: Hadoop, mapreduce and HDFS: a developers perspective. Procedia Comput. Sci. 48, no. C, pp. 45–50, 2015 35. Thanuja Nishadi, A. S.: Healthcare big data analysis using hadoop mapreduce. Int. J. Sci. Res. Publ. 9, 3 (2019) 36. Nibareke, T., Laassiri, J.: Using big data-machine learning models for diabetes prediction and flight delays analytics. J. Big Data, vol. 7, no. 1 (2020) 37. Fu, J., Sun, J., Wang, K.: SPARK-A big data processing platform for machine learning. In: 2016 Interence Conference Industrial Informatics - Computing Technology Intelligent Technology, Industrial Information Integration ICIICII 2016, pp. 48–51 (2016) 38. Meng, X., et al.: MLlib: machine learning in Apache Spark. J. Mach. Learn. Res. 17, 1–7 (2016) 39. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015) 40. Ioffe S, Szegedy C.: Batch Normalization: accelerating deep network training by reducing internal Covariate shift, March 2015 41. Szegedy, C., Vanhouke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision, December 2015
42. Williams, D., Hornung, H., Nadimpalli, A., Peery, A.: Deep learning and its application for healthcare delivery in low and middle income countries. Front. Artif. Intell. 2021(4), 553987 (2021) 43. Shailaja, K., Seetharamulu, B., Jabbar, M.A.: Machine learning in healthcare: a review. In: 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA) (2018)
Classification of Credit Applicants Using SVM Variants Coupled with Filter-Based Feature Selection Siham Akil1(B) , Sara Sekkate2 , and Abdellah Adib1 1
Team: Data Science and Artificial Intelligence, Laboratory of Mathematics, Computer Science and Applications (LMCSA), Faculty of Sciences and Technologies, Hassan II University of Casablanca, Casablanca, Morocco [email protected], [email protected] 2 Higher National School of Arts and Crafts of Casablanca, Hassan II University of Casablanca, Casablanca, Morocco
Abstract. For financial institutions and in particular banks, credit risk management is one of the most crucial issues in financial decision making. Accurate credit scoring models are extremely important for financial agencies to classify a new applicant, in order to decide whether to approve or reject its credit application. This paper presents a credit scoring model based on SVM variants (C-SVM, ν-SVM) combined with two filter feature selection methods, to enhance the preprocessing task and model performance. In this study, a public credit dataset, namely the Australian dataset, has been used in our experiments. Experimental results indicate that our methods are efficient in credit risk analysis: they make the assessment faster and increase the accuracy of the classification.

Keywords: C-SVM · ν-SVM · Credit scoring · Feature selection · ANOVA · FDR

1 Introduction
Credit risk is the particular risk arising from a loan transaction. It corresponds to the probability that a negative event will affect the servicing of the debt to which the debtor has committed. It is an event that can negatively affect the flows that a borrower must pay under a credit agreement. In practice, it refers to the risk that a borrower in default will not be able to pay his debts. In order to ensure that financial institutions are able to meet the requirements of their internal control systems, risk management is of particular importance. Credit scoring, a risk measurement tool that uses historical data and statistical techniques, is one of the main methods of credit risk management. It produces scores, which are ratings that measure the risk of default for potential or actual borrowers.
Financial institutions can use these scores to classify borrowers into risk classes [1]. Efficient scoring models assign high scores to debtors with a low probability of default and low scores to those whose loans are performing badly (high probability of default). However, financial institutions must pay considerable attention to characteristics that can generate major risk. Redundant and irrelevant attributes in the credit scoring model are likely to increase the computational complexity and generate inappropriate results in the risk analysis [2]. In addition, feature selection can filter out redundant features by selecting an optimal subset with the aim of enforcing the stability and generalization of the classifier to improve learnability, efficiency, and convenience [3] while avoiding the curse of dimensionality and overfitting. Currently, support vector machine (SVM) classification is a popular research area that provides an effective solution to classification problems in many fields [2,4], including credit scoring. SVMs are very suitable for different data sets, they can handle both linear and nonlinear data with different kernel functions. Accordingly, to determine the impact of the filter-based feature selection method on the performance of the classifier, we aim to examine the performance of SVMs with regard to credit scoring with diverse SVM kernels (RBF Polynomial, Sigmoid and Linear) on two SVM variants (C-SVM and ν-SVM), in order to compare them to other machine learning classifiers. This paper is structured as follows: Sect. 2 summarizes the related work on credit scoring using SVMs. Section 3 presents an overview of C-SVM and νSVM and the filter-based feature selection. Data used and experimental design are detailed in Sect. 4. Results are reported and discussed in Sect. 5 and finally conclusion and future work are presented in Sect. 6.
2 Related Works
Huang et al. [5] constructed a hybrid SVM-based model with three strategies and compared it against other Machine Learning (ML) techniques such as genetic algorithms, decision trees and back-propagation neural networks. The authors mentioned that SVM may not be the most accurate data mining method, but it is a competitive one for the credit scoring problem compared to other methods. For the prediction of Turkish bank failures, Boyacioglu et al. [6] exploited artificial neural networks (ANN), SVM and multivariate statistical methods to determine the best performing classifier. The experimental results showed that the prediction performance of SVM is satisfying. A comparison between SVM and RF, two advanced ML techniques, and logistic regression (LR) was performed to detect fraud. Bhattacharyya et al. [7] revealed that LR had similar performance with different sampling, while SVM performance increases with a low fraud ratio in the training data. It was also found that RF performs better than SVM and LR. In order to develop a credit risk assessment, Danenas et al. [8] applied a feature selection technique using a relationship-based feature subset selection algorithm as well as Tabu Search within the feature subsets. The study of Subashini et al. [9] improves credit card approval by comparing five ML models: SVM using sequential minimal optimization, LR, Decision Tree (C5.0 and CART) and Bayesian
belief networks to reveal fraud in credit card approval. This process reduces the financial risk that each financial institution needs to manage in order to secure its customers. A weighted feature SVM was proposed by Shi et al. [10] for the credit scoring model and applied to credit risk assessment, in which an embedded feature selection (FS) method (RF) and a filter-based FS method (F-score) are adopted to score the input features according to their importance ranking. The aim of the study of Maldonado et al. [11] is to perform a profit-driven feature selection with a Holdout Support Vector Machine (HOSVM) to extract the most profitable features while minimizing feature costs as a regularization penalty. The paper [12] aims to use SVM and Multilayer Perceptron (MLP) with some baseline algorithms for rating distribution, with a genetic algorithm for feature selection. The overall empirical results confirm that SVM is a robust classifier that performs well for credit assessment problems. Rtayli et al. [13] have conceived a system for identifying credit card risks which utilizes RF as a FS technique and SVM as a ML classifier, so as to obtain a better sensitivity and accuracy of the proposed model. Zhou et al. [14] intended to find out the impact of feature selection techniques on four robust prediction classifiers. The findings showed the efficacy and adeptness of the combination of the least absolute shrinkage and selection operator (LASSO) feature selection method and SVM, which exhibits considerable enhancement and outperforms other competing classifiers. In this study, Akil et al. [2] have shown the importance of filter and embedded feature selection techniques for the credit scoring task, combined with five ML classifiers. These methods were found to be faster, computationally inexpensive, and very effective in generating an optimal subset and increasing model performance. Based on our analytical findings, we concluded that in several previous works SVM outperforms other competitive classifiers [2,9,10,12,14], since it is a suitable technique for binary classification, as it is highly regularized and used to optimize a classification error estimate on the training set, which has largely contributed to its popularity in credit scoring research. In this study we seek to explore its variants (C-SVM and ν-SVM) with two filter FS techniques, Analysis Of Variance (ANOVA) and False Discovery Rate (FDR), in order to determine which features are most effective in characterizing the creditworthiness of credit applicants.
3 Background

3.1 SVM Variants
SVM is an extremely popular ML technique for binary classification problems, first developed by Vladimir Vapnik et al. in the early 1990s [15]. The primary aim of this technique is to find a hyperplane or a set of hyperplanes in an N-dimensional space, where N is the number of features [16], that meaningfully separates the data points, by using different types of kernel functions. Intuitively, the hyperplane that provides the largest distance to the nearest training data point of any class, also known as the functional margin, provides good separation, since generally the larger the margin, the smaller the generalization error of the classifier [17].
Below, we present two variants of SVM, namely C-SVM and ν-SVM.

C-SVM: The C-style soft margin Support Vector Machine (C-SVM) was proposed by Cortes and Vapnik [15]. It is an extension of the standard SVM to non-separable cases and trades off the margin size against the data separation error [18] according to:

\min_{\omega, b, \zeta} \ \frac{1}{2}\omega^{T}\omega + C \sum_{i=1}^{n} \zeta_i
\quad \text{s.t.} \quad y_i(\omega^{T}\phi(x_i) + b) \geq 1 - \zeta_i, \ \zeta_i \geq 0, \ i = 1, \ldots, n   (1)

where ζ_i denotes the distance to the correct margin with ζ_i ≥ 0, i = 1, ..., n, C denotes a regularization parameter, ω denotes the normal vector, φ(x_i) denotes the transformed input space vector, b denotes a bias parameter and y_i denotes the i-th target value. The parameter C controls the trade-off and can range from 0 to infinity. For large values of C, the optimizer will choose a hyperplane with a smaller margin if that hyperplane classifies all training points better. Conversely, a very small value of C will lead the optimizer to look for a separating hyperplane with a larger margin, even if this hyperplane misclassifies a larger number of points; for very small values of C, misclassified examples are often obtained even if the training data are linearly separable. C-SVM was shown to work very well in various real-world applications (Schölkopf and Smola 2002) [19].

ν-SVM: The ν-SVM was proposed by Schölkopf et al. [20]. It essentially uses a parameter ν instead of the parameter C to control the number of support vectors [20]:

\min_{\omega, b, \zeta, \rho} \ \frac{1}{2}\omega^{T}\omega - \nu\rho + \frac{1}{n}\sum_{i=1}^{n} \zeta_i
\quad \text{s.t.} \quad y_i(\langle \omega, x_i \rangle + b) \geq \rho - \zeta_i, \ \zeta_i \geq 0, \ \rho \geq 0   (2)

where ν ∈ R is the trade-off parameter. Thus, we note that the ν-SVM formulation provides a richer interpretation than the C-SVM formulation, which may eventually be useful in real applications [21]. Since the parameter ν corresponds to a fraction, it lies between 0 and 1 and represents a lower and upper bound on the number of examples that are support vectors and that lie on the wrong side of the hyperplane. According to Schölkopf et al. [20], C-SVM with C = 1/(nρ) yields the same solution as ν-SVM if the ν-SVM solution yields ρ > 0. Thus, ν-SVM and C-SVM are basically the same, with different regularization parameters (C and ν) that penalize errors during class separation. However, the ν-SVM has more intuitive interpretations; for example, ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors [18].
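A minimal scikit-learn sketch of the two variants is given below; the synthetic data, kernel choice, and parameter values are illustrative assumptions, not the tuned settings reported later in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, NuSVC

# Synthetic stand-in sized like the Australian credit data (690 applicants, 14 features).
X, y = make_classification(n_samples=690, n_features=14, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

c_svm = SVC(kernel="poly", degree=3, C=1.0).fit(X_tr, y_tr)      # C-SVM: C weights the slack penalty
nu_svm = NuSVC(kernel="poly", degree=3, nu=0.3).fit(X_tr, y_tr)  # nu-SVM: nu bounds margin errors / support vectors
print("C-SVM accuracy:", c_svm.score(X_te, y_te))
print("nu-SVM accuracy:", nu_svm.score(X_te, y_te))
```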
3.2 Filter-Based Feature Selection
The process of feature selection involves selecting the most significant features for a given data set. It can also enhance the performance of a model by selecting a reduced feature subset that is sufficient to predict the output variable well. Fundamentally, a relevance score is assigned to each feature in the dataset. Then, the features are ranked according to their relevance score. Figure 1 illustrates the process of the filter-based feature selection method. In principle, features with a high relevance score are then selected and features with a low relevance score are eliminated [22].
Fig. 1. The process of filter-based feature selection method
Below, we present two filter techniques, namely ANalysis Of VAriance (ANOVA) and False Discovery Rate (FDR).

ANOVA: ANOVA is a univariate statistical test that evaluates each feature individually. It measures whether the means of two or more groups are significantly different from each other. The basis of ANOVA is to obtain the value of the F-ratio, which is the ratio of the between-class variability to the intra-class variability [2]. The value of the F-statistic is considered as the score: the higher the F-ratio, the more the mean values of the corresponding feature differ between the classes [23].

FDR: The false discovery rate was originally introduced in 1995 by Yoav Benjamini and Yosef Hochberg [24]; it is a statistical approach that aims to control the multiple-testing procedure and can be used as a FS method. It is the proportion of false positives among all variables called significant, typically used to adjust for events that appear to be falsely significant [2]. To assess whether an observed score is statistically relevant while testing a null hypothesis, the p-value, which is a measure of confidence, is determined and compared to a confidence level α. By testing k hypotheses at once with a confidence level α, the probability of obtaining a false positive is equal to 1 − (1 − α)^k, potentially leading to a significant error rate in the
experiment. This means that a correction for multiple tests, such as FDR, is needed to adapt the statistical confidence measures to the number of performed tests. Q = V/R = V/(V + S) is usually described as the proportion of false discoveries among the rejections of the null hypothesis, where V is the number of false discoveries and S is the number of true ones (so that R = V + S is the total number of rejections). Then, we define the FDR to be the expectation of Q:

FDR = Q_e = E[Q]   (3)
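As a hedged illustration of how these two filters are commonly applied in practice, the scikit-learn sketch below ranks features with the ANOVA F-test and with an FDR-controlled test; the synthetic data and thresholds are assumptions for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectFdr, f_classif

# Synthetic stand-in for a credit dataset: 690 applicants, 14 features, binary label.
X, y = make_classification(n_samples=690, n_features=14, n_informative=5, random_state=0)

# ANOVA filter: keep the k features with the highest F-ratio.
anova = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_anova = anova.transform(X)

# FDR filter: keep features whose p-values pass a Benjamini-Hochberg correction.
fdr = SelectFdr(score_func=f_classif, alpha=0.05).fit(X, y)
X_fdr = fdr.transform(X)

print("ANOVA kept:", X_anova.shape[1], "features; FDR kept:", X_fdr.shape[1], "features")
```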
4 Experimental Design
In this section, we present the data description and performance measurement. Then, we describe the steps of our experimental process illustrated in Fig. 2, to classify the applicants as creditworthy or non-creditworthy.
Fig. 2. The flowchart of our experiments
Data Description and Performance Measurement: In this study, we use one of the most common data sets in the field of credit scoring to evaluate the performance of the models used, in comparison with previous studies. The Australian credit data set taken from the UCI Machine Learning Repository was used [25]. It consists of 690 applicants with 14 features, covering 383 creditworthy and 307 default cases. The feature values and names are transformed into symbolic data for confidentiality reasons. To assess the performance of our models, we employ the accuracy (ACC), a common performance metric used to assess credit scoring prediction. It is given as the proportion of true predictions among all examined predictions. Formally, accuracy has the following definition:

Accuracy = \frac{TP + TN}{TP + FP + TN + FN}   (4)

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
The Main Steps in the Experimental Process:
Step 1: Prepare and preprocess the data by first checking for missing values and replacing them by the mean or median of the remaining values for numerical variables. For categorical variables, we replace them by the modal value of their category. Second, we normalize the original data using MinMaxScaler; this technique was chosen for its simplicity:

x_{scaled} = \frac{x - \min(x)}{\max(x) - \min(x)}   (5)
It consists of normalizing the variables so that the data fall in the range [0, 1], or [−1, 1] if the data set contains negative values, by subtracting the minimum and dividing by the difference between the original maximum and minimum.
Step 2: Split the dataset into two sets, assigning 70% of the data for training and 30% for testing.
Step 3: Train and test the SVM variants combined with the filter feature ranking methods.
Step 4: Predict the results of our models.
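To make these steps concrete, a minimal end-to-end sketch in scikit-learn is shown below; the file name and column layout of the local copy of the UCI Australian data, as well as the particular filter and kernel, are assumptions for illustration, not the exact configuration tuned in Sect. 5.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import NuSVC
from sklearn.metrics import accuracy_score

# Hypothetical local copy of the UCI Australian credit file: 14 feature columns and a "class" column.
data = pd.read_csv("australian.csv")
X, y = data.drop(columns=["class"]), data["class"]

pipe = Pipeline([
    ("scale", MinMaxScaler()),                # Step 1: Eq. (5), features mapped to [0, 1]
    ("select", SelectKBest(f_classif, k=5)),  # Step 3: ANOVA filter
    ("clf", NuSVC(kernel="poly", nu=0.3)),    # Step 3: one SVM variant
])

# Step 2: 70/30 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
pipe.fit(X_tr, y_tr)

# Step 4: predictions and accuracy (Eq. (4)).
print("accuracy:", accuracy_score(y_te, pipe.predict(X_te)))
```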
5 Results and Discussions
This section presents and discusses the obtained results, summarized in terms of accuracy. Table 1 reports the number of retained features for each FS technique: using ANOVA the number of features was reduced by almost 64.29%, and using FDR the original number of features was reduced by 23%.

Table 1. Reduced number of features using Filter FS.

FS techniques                 ANOVA   FDR
Original number of features   14      14
Reduced number of features    10      5
From Table 2, we notice that both SVM variants, C-SVM and ν-SVM, provide good performance with the polynomial kernel on the Australian credit dataset. Because the classes overlap, separating creditworthy from non-creditworthy applicants is difficult; the polynomial kernel helps here since it not only considers the given features of the input samples to identify their similarity, but also explores combinations of these features. The success of the SVM variants combined with filter FS techniques in providing reliable results is due to several reasons: their effectiveness in learning with a very small number of features, their robustness against model errors, and their computational reliability compared with other machine learning methods, while also yielding a reduced number of features with no loss of information about feature importance in the data set.
Table 2. Results obtained in our experiments while using different kernels.

                      FS techniques
ML classifier         ANOVA     FDR
C-SVM – Polynomial    88.40%    87.95%
C-SVM – RBF           86.57%    86.95%
C-SVM – Sigmoid       85.50%    80.81%
C-SVM – Linear        87.40%    85.50%
ν-SVM – Polynomial    88.19%    89.95%
ν-SVM – RBF           87.40%    85.50%
ν-SVM – Sigmoid       86.57%    86.95%
ν-SVM – Linear        86.57%    86.95%
Table 3 shows the results of our model and a performance comparison with previous studies. In [5], the combination of SVM with a genetic algorithm (GA) gives the best result of that study, with an accuracy of 86.90%, using a wrapper FS that is computationally intensive and also presents a high probability of overfitting and a long training time. The authors in [10] used SVM combined with a filter (F-score) and an embedded (RF) FS technique and obtained 86.7% and 87.7%, respectively. In [2], the authors used SVM combined with three filter FS techniques (ANOVA, FDR and Kendall) and two embedded FS techniques (RF and Gradient Boosting Decision Tree) and got the highest accuracy of 87.92% while using SVM+ANOVA and SVM+RF.

Table 3. Performance comparison with previous studies.

Study                   Methods                           Accuracy
Huang et al. 2007 [5]   SVM + Grid search                 85.51%
                        SVM + Grid search + F-score       84.20%
                        SVM + GA                          86.90%
Shi et al. 2013 [10]    SVM                               85.7%
                        RF-FWSVM                          86.7%
                        FS-FWSVM                          87.7%
Akil et al. 2021 [2]    SVM + Kendall Rank Correlation    85.50%
                        SVM + FDR                         87.43%
                        SVM + GBDT                        87.43%
                        SVM + RF                          87.92%
                        SVM + ANOVA                       87.92%
Our study               C-SVM + FDR                       87.95%
                        ν-SVM + ANOVA                     88.19%
                        C-SVM + ANOVA                     88.40%
                        ν-SVM + FDR                       89.95%
Our experiments with C-SVM and ν-SVM using the polynomial kernel combined with ANOVA and FDR appear to be more efficient than those of previous studies. We can also conclude that ν-SVM is a better choice for this binary classification problem in terms of accuracy and flexibility.
6 Conclusion
This paper considered SVM variants for credit scoring on a real-world dataset from the UCI Machine Learning Repository: C-SVM and ν-SVM, each coupled separately with ANOVA and FDR filter-based feature selection. These combinations have proven to be valuable approaches for credit scoring, as they can correctly classify credit requests as accepted or rejected. The models achieve good accuracy and significantly outperform existing studies that use standard SVM with other feature selection techniques. These promising results encourage us to carry out further investigations using SVM variants; in particular, ν-SVM deserves more attention. Ongoing work aims to use other SVM variants such as Least Squares SVM and Multi-kernel SVM, and to propose hybrid strategies for credit scoring for efficient classification of good and bad customers. Acknowledgements. This work was supported by the Ministry of Higher Education, Scientific Research and Innovation, the Digital Development Agency (DDA) and the CNRST of Morocco (Alkhawarizmi/2020/01).
References 1. Olson, D.L., Wu, D.D.: Credit risk analysis. In: Enterprise Risk Management, pp. 117–136 (2015) 2. Siham, A., Sara, S., Abdellah, A.: Feature selection based on machine learning for credit scoring: an evaluation of filter and embedded methods. In: 2021 International Conference on INnovations in Intelligent SysTems and Applications, INISTA 2021 - Proceedings (2021) 3. Chen, W., Li, Z., Guo, J.: A VNS-EDA Algorithm-Based Feature Selection for Credit Risk Classification. vol. 2020 (2020) 4. Chakhtouna, A., Sekkate, S., Adib, A.: Improving speech emotion recognition system using spectral and prosodic features. In: Abraham, A., Gandhi, N., Hanne, T., Hong, T.P., Nogueira Rios, T., Ding, W. (eds.) ISDA 2021. LNNS, vol. 418, pp. 1–10. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-96308-8 37 5. Huang, C.L., Chen, M.C., Wang, C.J.: Credit scoring with a data mining approach based on support vector machines. Expert Syst. Appl. 33(4), 847–856 (2007) ¨ 6. Boyacioglu, M.A., Kara, Y., Baykan, O.K.: Predicting bank financial failures using neural networks, support vector machines and multivariate statistical methods: a comparative analysis in the sample of savings deposit insurance fund (SDIF) transferred banks in Turkey. Expert Syst. Appl. 36(2 PART 2), 3355–3366 (2009)
7. Bhattacharyya, S., Jha, S., Tharakunnel, K., Westland, J.C.: Data mining for credit card fraud: a comparative study. Decis. Support Syst. 50(3), 602–613 (2011) 8. Danenas, P., Garsva, G., Gudas, S.: Credit risk evaluation model development using support vector based classifiers. Procedia Comput. Sci. 4(June), 1699–1707 (2011) 9. Subashini, B., Chitra, K.: Enhanced system for revealing fraudulence in credit card approval, 2(8), 936–949 (2013) 10. Shi, J., Zhang, S.Y., Qiu, L.M.: Credit scoring by feature-weighted support vector machines. J. Zhejiang Univ. Sci. C 14(3), 197–204 (2013) ´ Verbraken, T., Baesens, B., Weber, R.: Profit-based 11. Maldonado, S., Flores, A., feature selection using support vector machines - general framework and an application for customer retention. Appl. Soft Comput. J. 35, 240–248 (2015) 12. Wang, D., Zhang, Z., Bai, R., Mao, Y.: A hybrid system with filter approach and multiple population genetic algorithm for feature selection in credit scoring. J. Comput. Appl. Math. 329, 307–321 (2018) 13. Rtayli, N., Enneya, N.: Selection features and support vector machine for credit card risk identification. Procedia Manuf. 46, 941–948 (2020) 14. Zhou, Y., Uddin, M.S., Habib, T., Chi, G., Yuan, K.: Feature selection in credit risk modeling: an international evidence. Econ. Res.-Ekonomska Istrazivanja 0(0), 1–31 (2020) 15. Cortes, C., Vapnik, V.: Support vector networks. Mach. Learn. 20, 273–297 (1995) 16. Ben-hur, A., Horn, D.: CrossRef Listing of Deleted DOIs, vol. 1, no. November (2000). https://doi.org/10.1162/15324430260185565 17. Shmilovici, A.: Chapter 12: Support vector machines. In: Data Mining and Knowledge Discovery Handbook, pp. 231–247 (2005) 18. Takeda, A., Sugiyama, M.: Support vector machine as conditional value-at-risk minimization. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1056–1063 (2008) 19. Sch¨ olkopf, B.: Learning with kernels. In: Proceedings of 2002 International Conference on Machine Learning and Cybernetics, vol. 1 (2002) 20. Sch, B., Williamson, R.C., Bartlett, P.L.: Sch¨ olkopf et al. - 2000 - New support vector algorithms.pdf, vol. 1245, pp. 1207–1245 (2000) 21. Chang, C.C., Lin, C.J.: Training ν-support vector classifiers: theory and algorithms. Neural Comput. 13(9), 2119–2147 (2001) 22. Talavera, L.: An evaluation of filter and wrapper methods for feature selection in categorical clustering. In: Famili, A.F., Kok, J.N., Pe˜ na, J.M., Siebes, A., Feelders, A. (eds.) IDA 2005. LNCS, vol. 3646, pp. 440–451. Springer, Heidelberg (2005). https://doi.org/10.1007/11552253 40 23. Isabelle, G., Andr´e, E.: An introduction to variable and feature selection. J. Mach. Learn. Res. 1056–1063 (2003) 24. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate?: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. 57(1), 289–300 (1995) 25. Dua, D., Graff, C.: UCI machine learning repository (2017)
Classification of Hate Speech Language Detection on Social Media: Preliminary Study for Improvement

Ari Muzakir1,4(B), Kusworo Adi2, and Retno Kusumaningrum3

1 Doctoral Program of Information System, School of Postgraduated Studies, Diponegoro University, Semarang, Indonesia
[email protected]
2 Department of Physics, Faculty of Science and Mathematics, Diponegoro University, Semarang, Indonesia
3 Department of Informatics, Faculty of Science and Mathematics, Diponegoro University, Semarang, Indonesia
4 Faculty of Computer Science, Universitas Bina Darma, Palembang, Indonesia
Abstract. This study aims to improve performance in the detection of hate speech on social media in Indonesia, particularly Twitter. Until now, the machine learning approach is still very suitable for overcoming problems in text classification and improving accuracy for hate speech detection. However, the quality of the varying datasets caused the identification and classification process to remain a problem. Classification is one solution for hate speech detection, divided into three labels: Hate Speech (HS), Non-HS, and Abusive. The dataset was obtained by crawling Twitter to collect data from communities and public figures in Indonesia. The optimization process of the classification algorithm is carried out by comparing the SVM algorithm, eXtreme Gradient Boosting (XGBoost), and RandomSearchCV. The experiment was carried out through two stages of activity: the classification algorithm using SMOTE and without SMOTE with Stratified K-Fold CV algorithm. The highest performance accuracy with SMOTE is obtained at 90.7% with the SVM model. Whereas without SMOTE, the highest performance accuracy uses the SVM with a value of 87.8%. Furthermore, with the addition of the Stratified K-Fold CV algorithm for non-SMOTE, the train_score is 97.4%, and the test_score is 95.2%. The results show that SMOTE oversampling can increase the accuracy value in detecting hate speech. Keywords: Hate speech detection · Building datasets · Text classification · Natural language processing · Social media analysis
1 Introduction The number of social media users has recently grown swiftly, which increases the amount of data posted through social media platforms. The resulting impact is an increasingly uncontrolled freedom of expression [1]. This impact is, of
course, also increasingly challenging to identify in the data posted on social media. User tweet data vary and can be grouped into several categories, namely: Hate Speech (HS), non-HS, Abusive, Individual HS, Group HS, Religious HS, Race HS, Physical HS, Gender HS, Other HS, Weak HS, and Medium HS [2]. The technology behind social media platforms allows information to spread very quickly around the world and become a topic of uncontrolled conversation. Worse, social media habits create a hazardous space and a field for hostile groups to attack each other using hate language or other abusive language. In the case of Indonesia, according to a survey conducted by the Indonesian Internet Service Providers Association in 2020, the number of active social media users was recorded at 196.71 million out of 266.91 million users in 2020 (https://apjii.or.id/survei2019x). Certain groups often misuse social media to create provocations, manipulate opinions, slander others, and spread hatred targeting other individuals or groups [3]. There is no universally accepted definition of an opinion or remark that leads to acts of hatred, and no individual aspect of these definitions is fully agreed upon [4]. In addition, a clear description of what makes a free-speech opinion hateful can help research: good annotation makes the detection of hate speech easier and more reliable [5]. Aside from the positive side of the Internet, freedom of expression also brings various harms to social media users, serving as a means for improper actions that adversely affect other users and the wider community [2]. The negative implications of social media became visible in 2017: based on data from the Ministry of Communication and Informatics "Kementerian Komunikasi dan Informatika" (KOMINFO), there were 13,829 cases of dangerous content in the form of hate speech [6]. Other research states that there is still a lot of hateful, offensive, and unpleasant content [7]. This is related to the influence of the language used in unpleasant content posted by users [8]. Many types of language found on social media lead to acts of radicalism, so a medium is needed that can moderate the content posted by users [9, 10]. One study suggested that open problems remain, in particular the need for ways of dealing with opinions that lead to different forms of hate speech [11]. A person's ethnic, cultural, political, religious, and economic status can cause various levels of discrimination, and makes hate crimes directed at a group–anti-Muslim, anti-Indonesian, anti-US, and others–more likely [12]. A survey of internet users, mainly on the social media networks Facebook and Twitter, shows that words that encourage hatred have severe consequences for groups and individuals [13]. For example, anti-Semites have targeted Jewish journalists, and online hatred has mobilized ethnic violence in Myanmar, Kenya, and Sri Lanka [14]. Therefore, mechanisms must be put in place to protect the community from the risk of hate speech. To date, however, social media platforms have struggled against inappropriate behavior that keeps changing to circumvent hate speech detection methods on social networks [15]. A study conducted by MacAvaney et al.
[4] found that the diversity of language, the differing definitions of what constitutes hateful speech, and the limited availability of data for training and system testing are still open challenges. Of course, this has to do with the influence of the language used in content posted
by users that is unpleasant [8]. In Indonesia, some of the opinions posted on social media contain many words that are rated as leading to hateful speech, even though such words are sometimes only expressions of humor or entertainment. Automated hate speech detection therefore faces considerable challenges in identifying content with hateful elements, such as non-standard variations in spelling and grammar.
2 Related Work Research conducted by Ayo et al. [16] proposes a probabilistic clustering model for classifying hate speech. The study found that the main challenges in automated hate speech classification on Twitter are the lack of a generic architecture, inaccuracies, threshold settings, and fragmentation problems. The implemented model is divided into four parts: data representation, data training, classification, and clustering. Feature representation is performed with the TF-IDF algorithm and enhanced with a Bayes classifier, and fuzzy logic rules based on semantics and modules are used in the hate speech classification process. The result of this research is a clear performance improvement in the detection process. Furthermore, Mossie and Wang researched the identification of communities prone to hate speech [17]. This study proposes a hate speech detection approach to identify hatred towards minority groups vulnerable to hate speech. The problems addressed are related to a person's ethnicity, culture, politics, religion, and economic status, which give rise to crimes and hatred against someone. RNN and GRU algorithms are used in this study, and the Word2Vec method is used for grouping hate speech. The experimental results show that the GRU algorithm achieved the best results, and the approach succeeded in identifying vulnerable groups such as the ethnic Tigre. Modha et al. [1] detect and visualize hate speech on social media with a cyber-watchdog approach. The goal is to design a suitable approach to detect and visualize every hate speech attack on social media. Several classification algorithms are used, such as BERT, Linear Regression, SVM, LR, DL-CNN, and attention-based models. The study presents a user interface for visualizing aggressive content. Its findings indicate that there is still a need for better ways of handling opinions that lead to different forms of hate speech; classification errors remain very likely with existing algorithms, so a highly accurate classification method is needed. It was also found that pre-trained models such as BERT could not solve all use cases perfectly. Other research discusses the development of systems for hate speech detection using big data [18]. This study designed a three-layer deep learning method that monitors, detects, and visualizes the emergence of hate speech on social media. The Deep Learning (DL) methods include CNN and RNN, which automatically learn abstract feature representations of the input across multiple layers of the DL architecture for adequate classification. The classification results show better performance compared to the latest methods. Vrysis et al. [19] built a web interface for analyzing hate speech as part of the Preventing Hate against Refugees and Migrants (PHARM) project. The methodology used in this study covers data collection, data format, data analysis, project analysis, and
usability evaluation. Hate speech is detected using a recurrent neural network (RNN) algorithm, and several participants then evaluated the results to identify possible problems or negative aspects. In contrast, the present study modifies a dataset for detecting hate speech and abusive language in Indonesian, such as hatred against a group, religion, race, and gender. This research is a preliminary study that experiments with machine learning algorithms to find a good algorithm for classifying and predicting hate speech. There are two contributions of this research, namely: • We modified a new dataset for the detection of Indonesian hate speech. • We experiment with this dataset to measure the performance of machine learning algorithms with multiple classification models and show which combinations of classification algorithms give the best accuracy. This paper is divided into five sections. Section 2 reviews previous research on hate speech. Section 3 discusses the methodology used. The experiments and test results are discussed in Sect. 4. In Sect. 5, we conclude our work and outline further research plans.
3 Method This section discusses how the dataset is built and the methodology for detecting hate speech using a machine learning approach.
Fig. 1. General workflow diagram of hate speech detection
3.1 Building a Dataset According to Fig. 1, the activities carried out in building the dataset in this study are divided into two processes: data collection and data annotation. Data Sources The dataset in this study is built from Twitter data, retrieved by crawling via the Twitter Streaming API (http://apps.twitter.com). The keywords used in data collection are related to topics that are currently hotly discussed, such as religion, race and gender, politics, and government in Indonesia. User accounts with a significant potential impact, such as daily news accounts (@detikcom, @kompasnewcom, @metro_TV) and several public figure accounts, are used in the data collection process, since these accounts often become gathering places where fans and haters discuss. In addition, we also use several existing Indonesian datasets on hate speech (https://github.com/GigasTaufan/Indonesian-Hate-Speech-Classification) and abusive language (https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection) as additional data. Data Annotation The tweet data obtained through the Twitter Streaming crawling process is saved in Comma Separated Values (CSV) format for the subsequent labeling process. In this study, three labels are used, namely Hate Speech (HS), Non-HS, and Abusive. From the 4130 raw records obtained, we selected the tweets that carried meaningful content and deleted duplicated (looping) tweets; after this selection, only 2043 tweets could be used for the labeling process. The labeling was carried out manually by 40 students at Bina Darma University, male and female, aged between 18–20 years, of Islamic, Christian, Hindu, and Buddhist religions, and originating from several areas of South Sumatra Province: Palembang, Musi Banyuasin, Ogan Komering Ilir, Ogan Komering Ulu, Ogan Ilir, Prabumulih, and Lahat. The tweet data for labeling was distributed via email and WhatsApp messages, with each volunteer checking 52 tweets. The data was then returned for re-checking, over more than one week, by three language experts with linguistic qualifications, namely lecturers from the Indonesian Language Education Study Program at Bina Darma University. In this way, 1477 valid tweets were obtained (the dataset used in this study is available at https://github.com/arimuzakir/Modification-of-Dataset-Indonesian-Hate-Speech-and-Abusive). The label distribution of these tweets is: non-HS 594 tweets, HS 465 tweets, and Abusive 417 tweets. Based on this distribution, the dataset is unbalanced, and an unbalanced dataset can have a negative effect on classification performance [20]. For that reason, we transform the dataset to overcome the class-imbalance problem using SMOTE (Synthetic Minority Oversampling Technique) oversampling, as sketched below.
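A minimal sketch of this oversampling step is shown below, assuming the imbalanced-learn library; the label encoding (0 = non-HS, 1 = HS, 2 = Abusive) follows the paper, while the feature matrix is a placeholder.

```python
# Minimal sketch of SMOTE oversampling with imbalanced-learn.
# Assumption: X holds already-vectorized tweets (e.g. TF-IDF features) and
# y holds the labels 0 = non-HS, 1 = HS, 2 = Abusive with the counts reported above.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1477, 300))                       # placeholder feature vectors
y = np.array([0] * 594 + [1] * 465 + [2] * 417)        # 594 non-HS, 465 HS, 417 Abusive

print("before:", Counter(y))
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))                       # minority classes upsampled to 594
```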
3.2 Hate Speech and Abusive Language Detection This research aims to compare machine learning features and classifiers with several algorithms to find the best performance. The method is divided into three stages: data pre-processing, feature extraction, and classification. Data Pre-processing The pre-processing stage separates out noise and prepares the data for labeling: Twitter data contains many meaningless messages, so pre-processing and labeling are necessary before training and testing [21]. We adopted a pre-processing method with minimal data modification, consisting of five steps: 1) removal of usernames, hashtags, and links; 2) removal of punctuation and numbers; 3) stop word removal; 4) removal of repeated characters; and 5) word normalization. The data labeling is divided into three classes, namely hate speech (HS), non-HS, and Abusive. To normalize words into standard and formal terms, this study refers to the Indonesian standard dictionary "Kamus Besar Bahasa Indonesia (KBBI)". The normalization results are then used with a pre-trained sentiment analysis model to obtain the polarity of each word in the tweet data, in the form of 0 for non-HS, 1 for HS, and 2 for Abusive. This polarity value is then used to manually annotate the label of each tweet that falls into these criteria (non-HS, HS, and Abusive). Features Extraction In the experiments, we use the Term Frequency-Inverse Document Frequency (TF-IDF) score for feature extraction. The frequencies associated with hate speech and abusive speech, such as religion, gender, swearing, and others, are obtained by compiling a list of words into a dictionary of hate speech and abusive terms. For the sentiment analysis of this word list, weighting was carried out using the lexicon method, namely non-HS (0), HS (1), and Abusive (2). As an example, for HS: "Indonesia = Sylvi keliatan bloonnya. English = Sylvi looks whacky", and for abusive: "Indonesia = Kalo tolol tuh coba jangan di tunjukin ke orang. English = If it's stupid, try not showing it to people". Classification and Evaluation This study uses a supervised learning approach to detecting hate speech. We compare three classification configurations to see the best performance, namely SVM, eXtreme Gradient Boosting (XGBoost), and RandomSearchCV, on our modified dataset. Google Colaboratory (https://colab.research.google.com) and Python (https://www.python.org) were used to train, evaluate, and run the hate speech predictions, processing on the Python 3 Google Compute Engine backend (TPU) with 12.69 GB of RAM and 107.77 GB of storage. Evaluation is done using 10-fold cross-validation, and only the f1_weighted score over all classes is measured.
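The sketch below illustrates this classification and evaluation setup with scikit-learn; the toy documents, the hyperparameters, and the use of a linear-kernel SVC in place of a fully tuned model are illustrative assumptions, not the exact configuration of the study.

```python
# Minimal sketch of the TF-IDF + SVM setup evaluated with stratified 10-fold CV
# and f1_weighted scoring; the documents and labels below are toy placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

docs = (["contoh tweet netral"] * 40 +
        ["contoh ujaran kebencian"] * 40 +
        ["contoh kata kasar"] * 40)
labels = [0] * 40 + [1] * 40 + [2] * 40          # 0 = non-HS, 1 = HS, 2 = Abusive

model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1.0))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

scores = cross_val_score(model, docs, labels, cv=cv, scoring="f1_weighted")
print("f1_weighted per fold:", scores.round(3))
print("mean f1_weighted:", scores.mean().round(3))
# XGBoost or a RandomizedSearchCV-tuned classifier can be dropped into the same pipeline.
```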
4 Experiments and Results In this section, we discuss the experimental and analysis results. 4.1 Experiment Results The experiments were divided into two scenarios: testing with SMOTE oversampling, and testing without SMOTE oversampling but applying the Stratified K-Fold CV algorithm to the SVM classifier. In addition, we also make predictions for hate speech and abusive speech using a model built on the TF-IDF features. Table 1 shows the classification results for the compared machine learning algorithms with TF-IDF features, weighted using the f1-weighted score. The best result with SMOTE oversampling was obtained by the SVM algorithm with an accuracy value of 90.7%, and without SMOTE oversampling the same algorithm reached 87.8%. Thus, the experimental results with the SVM algorithm have so far been reliable for detecting hate speech and abusive speech. Meanwhile, when applying the Stratified K-Fold CV algorithm to the SVM classifier on data without SMOTE oversampling, the measurements are stable at around 95% over the 10 folds (see Table 2).

Table 1. F-Measure for TF-IDF features and algorithms

Features   Metric                                 SVM        XGBoost    RandomSearchCV
TF-IDF     Value of accuracy (%) with SMOTE       0.907563   0.851541   0.857143
           Value of accuracy (%) without SMOTE    0.878378   0.847973   0.847973
Table 2. Score metrics with stratified K-Fold cross validation

Stratified Kfold   Accuracy %   Precision %   Recall %   F1-Score %
0                  0.614        0.685         0.614      0.582
1                  0.682        0.648         0.684      0.650
2                  0.614        0.547         0.614      0.581
3                  0.750        0.758         0.750      0.713
4                  0.918        0.924         0.918      0.916
5                  0.945        0.946         0.945      0.946
6                  0.965        0.962         0.965      0.966
7                  0.918        0.913         0.918      0.918
8                  0.979        0.977         0.975      0.979
9                  0.952        0.948         0.952      0.952
4.2 Discussion and Limitation Based on the experimental results described in the previous section, the SVM algorithm is able to correctly detect the words identified at the data pre-processing stage, reaching a high accuracy value of 90.7% compared to the XGBoost and RandomSearchCV algorithms, which reach an accuracy of about 85%. For the process without SMOTE, the accuracy achieved is also relatively high, namely 87.8%, compared to 90.7% with SMOTE. In the following experiment, with the addition of the Stratified K-Fold CV algorithm, the accuracy increased to 97.4% on the training data and 95.2% on the test data (see Fig. 2). In general, the training and test validation results are better on the folds with less skewed data.
Fig. 2. Label correlation
There are several cases where the detection process makes errors, although within a tolerable margin. For that reason, it is necessary to keep curating and extending the lexicon dictionary of words containing hate speech and abusive speech. Along
with the continued development of the language used by social media users, namely slang or abbreviated language, the hate speech and abusive speech dictionaries must be updated continuously. Table 3 shows examples of some errors in the word detection process using the model.

Table 3. Examples of detection errors using the model

1. Input text – Indonesia: "Dasar kamu cantik"; English: "You are beautiful". Detection result: HS. Description: this is a sentence of praise; it should be "Non-Hate Speech".
2. Input text – Indonesia: "hahaa silvy ohh silvy cacingan amat loe ahhh"; English: "hahaha silvy ohh silvy you are very wormy ahhh". Detection result: Abusive. Description: Hate Speech should be detected.
3. Input text – Indonesia: "Sepertinya ada bang jago baru nih Tolong pantau dan ambil tindakan segera"; English: "Looks like a new champion, please monitor and take immediate action". Detection result: Abusive. Description: Hate Speech should be detected.
4. Input text – Indonesia: "_@detikcom dari pusat sdah di intruksikan tidak blh mudik tapi didaerah banyak yg mudik kurangnya…"; English: "@detikcom from the center has been instructed not to go home but in many areas there are shortages…". Detection result: Non-HS. Description: this sentence actually leads to an act of hate speech, but there is a missing word.
5 Conclusion and Future Work This study presents a dataset that can help detect hateful and abusive speech on Twitter social media. The dataset used is the result of modifications and additions to existing data. Furthermore, from the annotated dataset involving several volunteers and Indonesian language experts, testing and validation were carried out using several classification algorithms based on machine learning. The word annotation sourced from Twitter data involved 40 students from various backgrounds, including ethnicity, religion, and gender. Meanwhile, language experts come from academics and practitioners who have the appropriate scientific experience, especially those who are expert in the field of linguistic qualifications. So, we get 1477 valid tweet data with label distribution: non-HS 594 tweets, HS 465 tweets, Abusive: 417 tweets. Based on the results of our experiments, we get the highest accuracy value on the SVM algorithm with the addition of SMOTE oversampling of 90.7%. In comparison,
the investigation without SMOTE obtained an accuracy rate of 87.8% with the same algorithm, namely SVM. We then experimented again on the model without SMOTE, but this time with the addition of the Stratified K-Fold CV algorithm on the SVM classifier; this experiment achieved a better result than before, namely an accuracy of 95%. For future work, we will experiment with an updated dataset enriched with lexicon data for the hate speech and abusive dictionaries. In addition, we will also experiment with a neural network-based approach using a pre-trained BERT language model on Indonesian social media data, since pre-trained language models are among the newest models and are reported to have strong classification performance in the word detection process.
References 1. Modha, S., Majumder, P., Mandl, T., Mandalia, C.: Detecting and visualizing hate speech in social media: a cyber Watchdog for surveillance. Expert Syst. Appl. 161, 113725 (2020). https://doi.org/10.1016/j.eswa.2020.113725 2. Pratiwi, N.I., Budi, I., Jiwanggi, M.A.: Hate speech identification using the hate codes for Indonesian tweets. In: Proceedings of the 2019 2nd International Conference on Data Science and Information Technology, pp. 128–133 (2019) 3. Kapil, P., Ekbal, A.: A deep neural network based multi-task learning approach to hate speech detection. Knowl. Based Syst. 210, 106458 (2020). https://doi.org/10.1016/j.knosys.2020. 106458 4. MacAvaney, S., Yao, H.-R., Yang, E., Russell, K., Goharian, N., Frieder, O.: Hate speech detection: challenges and solutions. PLoS ONE 14(8), e0221152 (2019) 5. Ross, B., Rist, M., Carbonell, G., Cabrera, B., Kurowsky, N., Wojatzki, M.: Measuring the reliability of hate speech annotations: the case of the European refugee crisis. arXiv Prepr. arXiv:1701.08118 (2017) 6. Irawan, D., Yusufianto, A.: Laporan Survei Internet APJII 2019 – 2020, Asos. Penyelenggara Jasa Internet Indones, vol. 2020, pp. 1–146. https://apjii.or.id/survei (2020) 7. Seglow, J.: Hate speech, dignity and self-respect. Ethical Theory Moral Pract. 19(5), 1103– 1116 (2016) 8. Assimakopoulos, S., Baider, F.H., Millar, S.: Online hate speech in the European Union. Springer International Publishing, Cham (2017). https://doi.org/10.1007/978-3-319-72604-5 9. Pelzer, B., Kaati, L., Akrami, N.: Directed digital hate. In: IEEE International Conference on Intelligence and Security Informatics (ISI), pp. 205–210 (2018) 10. Kwok, I., Wang, Y.: Locate the hate: detecting tweets against blacks. Proc. AAAI Conf. Artif. Intell. 27(1), 1621–1622 (2013) 11. Modha, S., Majumder, P.: An empirical evaluation of text representation schemes on multilingual social web to filter the textual aggression. arXiv Prepr. arXiv:1904.08770 (2019) 12. Faulkner, N., Bliuc, A.-M.: ‘It’s okay to be racist’: moral disengagement in online discussions of racist incidents in Australia. Ethn. Racial Stud. 39(14), 2545–2563 (2016) 13. Del Vigna, F., Cimino, A., Dell’Orletta, F., Petrocchi, M., Tesconi, M.: Hate me, hate me not: hate speech detection on Facebook. CEUR Workshop Proc. 1816, 86–95 (2017) 14. Hinduja, S., Patchin, J.W.: Offline consequences of online victimization: school violence and delinquency. J. Sch. Violence 6(3), 89–112 (2007) 15. Gröndahl, T., Pajola, L., Juuti, M., Conti, M., Asokan, N.: All you need is love evading hate speech detection. In: Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security, pp. 2–12 (2018)
16. Ayo, F.E., Folorunso, O., Ibharalu, F.T., Osinuga, I.A., Abayomi-Alli, A.: A probabilistic clustering model for hate speech classification in twitter. Expert Syst. Appl. 173, 114762 (2021). https://doi.org/10.1016/j.eswa.2021.114762 17. Mossie, Z., Wang, J.-H.: Vulnerable community identification using hate speech detection on social media. Inf. Process. Manag. 57(3), 102087 (2020). https://doi.org/10.1016/j.ipm.2019. 102087 18. Paschalides, D., et al.: MANDOLA: a big-data processing and visualization platform for monitoring and detecting online hate speech. ACM Trans. Internet Technol. 20(2), 1–21 (2020). https://doi.org/10.1145/3371276 19. Vrysis, L., et al.: A web interface for analyzing hate speech. Future Internet 13(3), 80 (2021). https://doi.org/10.3390/fi13030080 20. Ganganwar, V.: An overview of classification algorithms for imbalanced datasets. Int. J. Emerg. Technol. Adv. Eng. 2(4), 42–47 (2012) 21. Alfina, I., Sigmawaty, D., Nurhidayati, F., Hidayanto, A.N.: Utilizing hashtags for sentiment analysis of tweets in the political domain. In: Proceedings of the 9th International Conference on Machine Learning and Computing, pp. 43–47 (2017)
CoAP and MQTT: Characteristics and Security

Fathia Ouakasse1,2(B) and Said Rakrak1

1 Laboratory of Computer and Systems Engineering (L2IS), Faculty of Sciences and Techniques, Cadi Ayyad University, Marrakesh, Morocco
[email protected]
2 Private University of Marrakesh, Marrakesh, Morocco
Abstract. There is no doubt that the Internet of Things (IoT) has a significant impact on many aspects of our lives, including how we live, drive, irrigate, consume energy, and manage our confidential and personal data. Data is generated and gathered from different lightweight IoT gadgets and smart devices using two widely used protocols: the Constrained Application Protocol (CoAP) and Message Queuing Telemetry Transport (MQTT). These protocols are based on the publish/subscribe model. Nevertheless, as the use of these emerging protocols increases, the risk of attacks increases as well; indeed, these communications come with many security vulnerabilities. In this paper, we describe these two emerging messaging protocols that address the needs of lightweight IoT nodes, we discuss the protocols and techniques used to manage security in CoAP and MQTT, we reveal some of their security limitations and issues, and we conclude with some future directions. Keywords: CoAP · MQTT · Security · DTLS · TLS
1 Introduction The use of smart gadgets, wireless sensors and RFID tags in various IoT application fields has led to the generation of a large amount of data. IoT technologies have been widely used in many applications in order to measure, control or detect physical and environmental events like temperature, pressure, humidity and pollution levels, as well as other critical parameters. In many IoT application fields, the principle consists of queries sent to sensors to collect data from the measurements or detections. Nevertheless, in recent critical IoT applications, such as industrial process control, healthcare, smart grids and ambient assisted living, the challenge is getting reliable information when an event of interest occurs in order to react in real time. In this context, several IoT application layer protocols are used, most of them based on the publish/subscribe model, such as MQTT, CoAP, AMQP, XMPP and DDS. Furthermore, the use of IoT protocols has led to a fast lifestyle evolution, which has connected not only billions of gadgets and smart devices but also objects and people. As a result, with this rapid growth, there is also a steady increase in the security vulnerabilities of the connected objects.
In several previously published papers, we were interested in application layer protocols, especially MQTT and CoAP. In [1], we described these two emerging messaging protocols, revealed some of their limitations, and quantitatively evaluated their performance in a static environment using the Core network emulator. Then, in [2], we studied the impact of link delay variation on MQTT- and CoAP-based communication performance in a mobile environment. Moreover, in [3] we proposed an adaptive solution for congestion control in CoAP-based group communications, and in [4] in unicast communications. Nevertheless, in none of these papers has the security aspect been addressed or studied. So, in this paper we shed light on the security side of CoAP and MQTT. Indeed, in the past, security concerns were centered on the network infrastructure layers, that is, the physical and network layers. Currently, due to the growing use of networks and the dominance of the Internet, serious vulnerabilities are being revealed in the application layer. Therefore, application layer security has emerged as an essential task in the development process. The remainder of this paper is organized as follows: Sect. 2 presents an overview of the CoAP and MQTT protocols. Sections 3 and 4 shed light on the protocols and techniques used to secure CoAP- and MQTT-based communications, reveal some limitations and vulnerabilities, and review related works. Finally, the conclusion and some perspectives are presented in Sect. 5.
2 Background Both of these protocols implement a lightweight application layer. In Table 1, we draw the IoT protocols layers defining the application layer protocols described in this study and their supporting protocols in layers below [5]. CoAP has been designed by the Internet Engineering Task Force (IETF) to support IoT with lightweight messaging for constrained devices operating in a constrained environment. On the other hand, CoAP runs over UDP. In Fig. 1 an overview architecture of CoAP protocol is drawn.
Fig. 1. An overview architecture of CoAP protocol
Table 1. IoT protocol layers
CoAP utilizes four message types: 1) confirmable; 2) non-confirmable; 3) reset; and 4) acknowledgment, where two of them concern message reliability: the reliability of CoAP relies on the confirmable and non-confirmable messages [6]. On the other hand, MQTT is a lightweight machine-to-machine messaging protocol oriented toward use in the mobile sector. MQTT is an application layer protocol designed for resource-constrained devices and comprises three components: a subscriber, a publisher and a broker. It uses a topic-based publish-subscribe architecture, as shown in Fig. 2.
Fig. 2. MQTT protocol architecture
To guarantee that a message has been received, an acknowledgement exchange mechanism takes place between the client and the broker. This mechanism is associated with a quality of service level specified on each message (a small publish/subscribe sketch follows the list below). o QoS level zero (QoS = 0): the sender sends the message only once and no retries are performed, which means "fire and forget"; it offers a best-effort delivery service, in which messages are delivered either once or not at all to their destination. o QoS level one (QoS = 1): the protocol ensures that a message arrives at its destination at least once. The protocol stores the published message in the publisher's internal buffer until it receives the ACK packet. Once the acknowledgement is received, the message is deleted from the buffer, and the delivery is complete. o QoS level two (QoS = 2): the protocol guarantees that a published message will be delivered "exactly once". Neither loss nor duplication of messages is acceptable, a requirement which is met through a two-step acknowledgement process.
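As an illustration of how these QoS levels are selected in practice, the sketch below uses the Eclipse paho-mqtt Python client; the broker address, topic name, and payloads are placeholder assumptions.

```python
# Minimal publish/subscribe sketch with paho-mqtt, showing how the QoS level is
# chosen per message. Broker address and topic are placeholders; the constructor
# follows the paho-mqtt 1.x style (2.x may require a CallbackAPIVersion argument).
import paho.mqtt.client as mqtt

BROKER = "test.broker.example"   # hypothetical broker host
TOPIC = "sensors/temperature"    # hypothetical topic

def on_message(client, userdata, msg):
    # Called for every message received on a subscribed topic.
    print(f"received on {msg.topic} (qos={msg.qos}): {msg.payload.decode()}")

client = mqtt.Client()
client.on_message = on_message
client.connect(BROKER, 1883)

client.subscribe(TOPIC, qos=1)           # subscription at QoS 1 (at least once)
client.publish(TOPIC, "21.5", qos=0)     # QoS 0: fire and forget
client.publish(TOPIC, "21.6", qos=1)     # QoS 1: at least once
client.publish(TOPIC, "21.7", qos=2)     # QoS 2: exactly once

client.loop_forever()                    # process network traffic and callbacks
```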
Thus, MQTT and CoAP are rapidly emerging and integrating the IoT market as leading lightweight messaging protocols for constrained devices. Each protocol offers unique benefits, and each poses challenges and trade-offs. In order to connect to a MQTT broker, subscribers should have an authentication password and a username. Otherwise and to ensure privacy, encryption of data exchanged across the network can be handled independently from the MQTT protocol using Secure Sockets Layer (SSL) or its successor Transport Layer Security (TLS). However, CoAP is based on Datagram Transport Layer Security (DTLS).
3 CoAP Security 3.1 DTLS Based Communications According to the Federal Information Processing Standards (FIPS) of the National Institute of Standards and Technology (NIST, 2010), three core security principles guide the information security area [7]: o Confidentiality: preserve access control and disclosure restrictions on information, guaranteeing that no one breaks the rules of personal privacy and proprietary information. o Integrity: avoid improper (unauthorized) information modification or destruction. This includes ensuring non-repudiation and information authenticity; thus, communication partners can detect whether a message has been modified during transmission. o Availability: the information must be accessible and usable at all times with reliable access, but of course only for those who have the right of access [8]. With the evolution of threat and attack scenarios in IoT applications, it is not possible for a system to be 100% secure. Basic security services can work against many threats and support many policies, and there is a large core of policies and services on which applications should rely. Figure 3 represents examples of the most frequently encountered attacks on IoT applications and the corresponding security approaches.
Fig. 3. IoT attacks and security approaches
Devices using the CoAP protocol must be able to protect the flow of information of a sensitive nature (for example the health sector).
To secure flows in CoAP, Datagram Transport Layer Security (DTLS), the main security protocol in IoT, specified by the IETF in RFC 6347, was designed to secure end-to-end communication between two devices [9]. DTLS is a version of TLS that takes over the latter's functionalities but uses the transport layer provided by UDP, unlike TLS, which uses TCP. Table 2 shows where DTLS sits in the protocol stack.

Table 2. CoAP-based DTLS protocol stack

Application layer – CoAP
Security – DTLS
Transport layer – UDP
Network layer – IPv6
Physical layer/MAC – IEEE 802.15.4
Unlike network layer security protocols, DTLS at the application layer protects end-to-end communication. It protects data circulating between CoAP nodes and avoids the cryptographic overhead problems that occur in lower layer security protocols. As for the DTLS structure, there are two layers. At the bottom sits the Record protocol. At the top sit three protocols: a) Alert, used for reporting error messages; b) Handshake, responsible for negotiating cryptographic algorithms, compression parameters and the secret key [10]; and c) Application Data, which transports data between connected peers using the parameters negotiated during the Handshake; in some conditions the Change Cipher Spec protocol may replace one of them. The Change Cipher Spec message is used to notify the Record protocol to protect subsequent records with the just-negotiated cipher suite and keys. Figure 4 represents the DTLS structure, including the working mechanism of the Handshake and Record protocols [11]. The Record protocol protects application data by using the keys generated during the Handshake: for outgoing messages, it divides, compresses, encrypts and applies a Message Authentication Code (MAC) to them; for incoming messages, it reassembles, decompresses, decrypts and verifies them [12]. In addition, in order to negotiate keys, CoAP defines four separate security modes: noSec, pre-shared key (PSK), raw public key (RPK), and certificate. In noSec mode, DTLS is disabled. In pre-shared key mode, a list of keys of trusted clients is provided. In raw public key mode, an asymmetric key pair without a certificate is provided, whereas in the certificate mode an asymmetric key pair with an X.509 certificate is provided. Figures 5a and 5b show the diagrams of message exchange for CoAP communication with and without DTLS. In order to manage some types of attacks and threats, the CoAP specification and standard present several measures and mitigations beyond the adoption of DTLS for securing CoAP nodes. A minimal client-side sketch of the pre-shared key mode is given below.
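As an illustration of the pre-shared key mode from the client side, the sketch below uses the Python aiocoap library; the host, resource path, identity, and key are placeholders, and the credential dictionary format is an assumption based on aiocoap's documented DTLS support, so it should be checked against the installed version.

```python
# Minimal sketch of a CoAP GET over DTLS in pre-shared key (PSK) mode using aiocoap.
# Assumptions: aiocoap is installed with DTLS support; host, path, identity and key
# are placeholders; the credential dictionary layout follows aiocoap's documentation
# and may differ between library versions.
import asyncio
from aiocoap import Context, Message, GET

async def main():
    ctx = await Context.create_client_context()

    # Register PSK credentials for every resource under the (hypothetical) coaps host.
    ctx.client_credentials.load_from_dict({
        "coaps://sensor.example/*": {
            "dtls": {"psk": b"secretPSK", "client-identity": b"client_Identity"}
        }
    })

    request = Message(code=GET, uri="coaps://sensor.example/temperature")
    response = await ctx.request(request).response
    print(response.code, response.payload.decode())

asyncio.run(main())
```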
Fig. 4. Handshake and Record protocols mechanism in DTLS structure
Fig. 5a. CoAP Request/Response
3.2 Related Works Several works have proposed mitigation measures to cope with different scenarios in CoAP-based networks, notably mitigations based on two scenarios: access control mechanisms and secure communication. As for the first scenario, access control, the authors in [13] proposed a collection of general use cases for authentication and authorization in constrained environments, then presented a report on the main authorization problems arising during the life cycle of a device and provided a guideline for implementing effective solutions. As for the second scenario, secure communication, the authors in [14] compared the security services provided by IPsec and DTLS and pointed out their respective advantages. In the same context, several works in the literature were interested in various security dimensions using IPsec (Internet Protocol Security). Indeed, IPsec can offer various security services like limited traffic flow confidentiality, an anti-replay mechanism, access control, confidentiality, connection-less integrity, and data origin authentication [15,
Fig. 5b. CoAP Request/Response with DTLS
16]. Furthermore, to secure the CoAP transactions using IPsec, Encapsulating Security Payload Protocol (RFC 2406) IPSec-ESP is used. Moreover, protocol aspect was also highlighted in some papers like in [17] where authors discuss certain IoT protocols like the 802.15.4, 6LoWPAN and RPL along with their security issues and the exploitation of 6LoWPAN in header compression. In [18], authors present the importance of caching and translating communication between different protocols via a proxy.
4 MQTT Security 4.1 TLS Based Communications Contrary to CoAP, which is based on UDP, MQTT relies on the TCP transport protocol. By default, the MQTT protocol allows anybody to subscribe to the broadcasted topic without any authentication and anyone can easily subscribe to any MQTT server available on the Internet. Therefore, basically TCP connections do not use an encrypted communication. To encrypt the whole MQTT communication, many MQTT brokers allow use of TLS instead of plain TCP. It uses TLS/SSL to authenticate clients and encrypt the transferred data. Therefore, in order to cope with security threats, the MQTT standard lists the mechanisms that should be included in MQTT implementations, namely:
o Authentication of users and devices;
o Authorization of access to server resources;
o Integrity of MQTT control packets and application data;
o Privacy of MQTT control packets and application data.
Using MQTT over TLS is not always reliable; one drawback of using MQTT over TLS is that security increases the cost because it needs a great consumption in terms of CPU usage and communication overhead. While the additional CPU usage is typically negligible on the broker, it can be a problem for very constrained devices that are not designed for computation-intensive tasks. Furthermore, in order to ensure a secured communication on the TCP/IP stack at the transport level, TLS uses a cipher suite. A cipher suite consists of cryptography algorithms enabling the exchange of keys, encryption, and the securing of integrity and authenticity via message authentication codes (MACs). In Fig. 6 a TLS key exchange in MQTT-based communication is drawn.
Fig. 6. MQTT Request/Response with TLS
Regarding the MQTT architecture, which is based on three entities (publisher, broker and subscriber), TLS cannot guarantee end-to-end encryption between publishers and subscribers because it implements transport layer encryption between two directly communicating devices. As a result, encryption can be ensured only between publisher and broker or between broker and subscriber. Since the broker is the central unit of the MQTT architecture, it reads and controls all exchanged messages [19]. Thereby, the publisher or subscriber starts by sending a Hello message listing the cipher suites it supports. The broker picks a cipher suite and answers the Hello by sending its certificate (signed by the CA) and a signature over portions of the exchange including the key share. The publisher or subscriber generates its own key share, mixes it with the broker's key share, and thus generates the encryption keys for the session. A minimal client-side sketch of MQTT over TLS is given below.
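The sketch below shows how a client typically enables this TLS-protected connection with the paho-mqtt library; the broker host, port, certificate path, topic, and credentials are placeholder assumptions.

```python
# Minimal sketch of an MQTT connection protected by TLS with paho-mqtt.
# Assumptions: broker host/port, CA certificate path, topic and credentials are
# placeholders; paho-mqtt 1.x constructor style (2.x may require a CallbackAPIVersion).
import paho.mqtt.client as mqtt

client = mqtt.Client()

# Authenticate the client at the broker (username/password are placeholders).
client.username_pw_set("device42", "device-password")

# Enable TLS: the CA certificate verifies the broker certificate, and all
# MQTT control packets are then encrypted on the wire.
client.tls_set(ca_certs="/path/to/ca.crt")

client.connect("broker.example", 8883)        # 8883 is the usual MQTT-over-TLS port
client.loop_start()                           # background network loop

info = client.publish("building/room1/temperature", "22.4", qos=1)
info.wait_for_publish()                       # block until the QoS 1 PUBACK arrives

client.loop_stop()
client.disconnect()
```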
Nevertheless, on some constrained devices, TLS might not be feasible due to insufficient resources: TLS can be very intensive in terms of memory and processor, depending on the ciphers used. 4.2 Related Works Even though MQTT is used everywhere, it was not designed with security in mind. According to a survey conducted by the IoT developer community, MQTT is the second most popular IoT messaging protocol [20]. The rapid growth of MQTT consequently calls for a secure version, and the mitigation process should consider the resource-constrained nature of the devices. For unconstrained devices, TLS/SSL can be used for transport-level encryption [21]. Beyond using TLS, the authors in [22] proposed a mechanism providing end-to-end security through an application layer implementation and hop-by-hop protection using the link layer, which they implemented in an industrial wind park. In addition, the authors in [20] proposed a solution for the enforcement of security policy rules at the MQTT layer, as part of the Model-based Security Toolkit (SecKit), to address the security, data protection, and privacy requirements at the MQTT layer. Furthermore, in [23], the authors implemented a secure version of MQTT and MQTT-SN that replaces the use of SSL/TLS certificates; it is based on lightweight Attribute Based Encryption (ABE) over elliptic curves. The implementations in [24] and [25] focus on authentication and authorization mechanisms for MQTT-based IoT using OAuth. On the other hand, in [26] the authors describe some of the security solutions and improvements typically suggested and implemented in real-life deployments of MQTT. Furthermore, in [27] the authors propose a simple security framework for MQTT (AugMQTT, for short) by incorporating the AugPAKE protocol; AugMQTT does not require any certificate validation or certificate revocation checks on either the publisher/subscriber or the broker side. In the same context, the authors in [28] surveyed many recent advances in MQTT security and listed the significant challenges the IoT-based industry faces when it comes to securing devices from physical as well as logical attacks, contributing an in-depth security survey of IoT-based Industry 4.0.
5 Conclusion The Internet of Things (IoT) now offers the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction, connecting the different devices around us through the use of different protocols. Moreover, IoT-connected devices are predicted to grow to 75 billion by 2025. In many critical application fields, such as industrial processes and health, the connection of devices must therefore be managed to ensure reliable data transmission and secure communication. In this paper, we present a review of the protocols most appropriate for lightweight devices with constrained memory, energy, and computing resources: CoAP and MQTT. We also describe the security protocols used in CoAP- and MQTT-based communications and discuss security issues and related works for both.
Combining Static and Contextual Features: The Case of English Tweets Nouhaila Bensalah1(B) , Habib Ayad1 , Abdellah Adib1 , and Abdelhamid Ibn El Farouk2 1
Team: Data Science & Artificial Intelligence Hassan II, University of Casablanca, 20000 Casablanca, Morocco [email protected] 2 Teaching, Languages and Cultures Laboratory Mohammedia, University of Hassan II, Casablanca, Morocco
Abstract. In recent years, social media networks have emerged as a veritable source for gathering information in various disciplines. This enables the extraction and analysis of users’ feelings, opinions, emotions and reactions to different topics. Pre-trained Neural Language Models (LMs) have yielded outstanding scores on a variety of Natural Language Processing (NLP) related tasks across different languages. In this paper, we propose that the efficiency of contextual word vectors using LMs for social media can be further boosted by incorporating static word embeddings that are specifically trained on social media (e.g., Twitter). We demonstrate that the combination of static word embeddings with contextual word representations is indeed effective for building an enhanced Sentiment Analysis system for English tweets. Keywords: Sentiment classification · Static word embeddings · Contextual word representations · Word2vec · FastText · BERT
1
Introduction
As millions of user opinions have become freely accessible on the web throughout this century, text classification has emerged as one of the most promising areas of research in Natural Language Processing (NLP). Text classification is an instance of supervised Machine Learning where a specialist is required to assign one or multiple predefined labels to a sequence of texts. It has drawn considerable interest from both the academic and industrial communities because of its several applications, including topic modeling [24], Spam Detection [6] and Sentiment Analysis (SA) [19]. In recent years, Sentiment Analysis has emerged as a very active area of study in the NLP industry. Briefly, it addresses whether a given document or sentence contains a positive or a negative opinion. Applications of SA on websites and social media range from product reviews and brand reception to political issues and the stock market [1,18,21]. Recent approaches to SA adopt
Deep Neural Networks, such as Convolutional Neural Networks (CNNs) [15,16] and Recurrent Neural Networks (RNNs) based on Long Short-Term Memory (LSTM) [14] or Gated Recurrent Units (GRU) [9]. Lately, attention-based neural networks have achieved widespread success across a variety of tasks ranging from Image Captioning, Speech Recognition and Question Answering to Machine Translation and SA [2,8]. Word embeddings are a core component of many Natural Language Processing (NLP) tasks and are regarded as valuable word representations, leading to improved overall efficiency across multiple tasks. Word embedding, also known as distributed word representation, refers to techniques that map an input word into a vector in a way that distinguishes it from all other input words. The core idea of word embedding relies on the fact that when words occur in the same contexts, they share a high degree of meaning similarity [13]. Several models have been proposed, such as Word2vec [20], FastText [7], Global Vectors (GloVe) [22] and Bidirectional Encoder Representations from Transformers (BERT) [11], to obtain such vectors from a corpus, and they have brought significant improvements in different NLP tasks [3–5,17,23]. Several challenges arise when it comes to pre-trained LMs like BERT, which have primarily been trained on Wikipedia. In the case of English, many NLP tasks have achieved their top scores through the use of LMs. Nevertheless, BERT was trained on Wikipedia and news articles. As a result, it has never learned the characteristics associated with the language in which most posts on social media are written. Unexpectedly, some works [10,12] found that the LMs outperform other models on social media tasks, including methods such as Word2vec and FastText vectors trained on Twitter. In this paper, we assume that contextual word representations from pre-trained LMs and static word embeddings possess complementary strengths, so better results can be reached when these two sources are combined. However, we expect a weaker improvement for the English language, because, compared to Arabic, English social media vocabulary is often closer to the vocabulary of traditional resources. To evaluate these hypotheses, we propose and test an attention layer to combine the pre-trained with the static word vectors. Our main findings conclude that the combined word embeddings can effectively improve SA performance. The remainder of this paper is organized as follows: In Sect. 2, we describe in detail our proposed model. In Sect. 3, we discuss our findings. Finally, in Sect. 4, we draw our conclusions and outline further research plans.
2
The Proposed Model
BERT-based models can be combined with static word vectors in several ways. However, the tokenization strategy adopted by BERT relies on the fact that many words are split into two or three subwords. This leads to the challenge of combining these two types of contextualized word vectors predicted by BERT with the static word vectors. An alternative solution, introduced by Zhang et al. [25], is to combine the word-piece tokens of a single word into a single vector, either by means of a Convolutional Neural Network or a Recurrent Neural
Network. Thereafter, the resulting word-level vector can be concatenated with the corresponding static word vector. However, in this case, the representations generated by BERT can be degraded through this aggregation process. In this paper, as depicted in Fig. 1, given a text sentence S = w1, w2, ..., wt, ..., wn, where wi refers to the i-th word in the sentence, we rather combine the BERT-derived embeddings at the sentence level with the static word vectors by means of the attention mechanism. Specifically, we get a sentence-level representation p from the fine-tuned BERT model using a GRU layer. In the static embedding layer, the input word wi is mapped into a word vector ei using the static embedding model. The obtained word embeddings are fed into the GRU layer. Next, the attention mechanism is applied on top of the output generated by the GRU, i.e., h1, h2, ..., ht, ..., hn. For each time step t, we first feed ht into a fully-connected network to get ut as a hidden representation of ht, and then measure the importance of the word wt as the similarity of ut with the context vector p, followed by a softmax function to obtain a normalized importance weight αt. Next, we compute the text vector d as a weighted arithmetic mean of h based on the weights α = {α1, α2, ..., αn}.

u_t = tanh(W_w h_t + b_w)                        (1)
α_t = exp(u_t^T p) / Σ_t exp(u_t^T p)            (2)
d = Σ_t α_t h_t                                  (3)
where Ww and bw are the weight matrix and bias, respectively. By merging the sentence vector from the fine-tuned BERT model and the outputs of GRU applied on the static word vectors using the attention mechanism, the unified model can pick the useful features from the static embedding model according to the context generated by the BERT model, and thus preserve the merits of both models. After concatenating the two types of sentence vectors, we apply a Dense layer, followed by a softmax classification layer; see Fig. 1.
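The fusion described by Eqs. (1)-(3) can be prototyped with a short Keras/TensorFlow sketch. This is a minimal illustration, not the authors' exact implementation: the way the BERT sentence vector p is produced is abstracted into a model input, the hidden sizes follow Sect. 3.2 (GRU dimension 100, 300-dimensional static embeddings), and everything else (function name, the 64-unit dense layer) is an assumption.

import tensorflow as tf

def build_fusion_model(seq_len=60, emb_dim=300, gru_units=100, num_classes=2):
    # Static word vectors of the tweet (Word2vec/FastText), precomputed outside the model.
    static_emb = tf.keras.Input(shape=(seq_len, emb_dim), name="static_embeddings")
    # Sentence-level context vector p obtained from the fine-tuned BERT model via a GRU.
    p = tf.keras.Input(shape=(gru_units,), name="bert_sentence_vector")

    # GRU over the static embeddings returns one hidden state h_t per token.
    h = tf.keras.layers.GRU(gru_units, return_sequences=True)(static_emb)

    # Eq. (1): u_t = tanh(W_w h_t + b_w)
    u = tf.keras.layers.Dense(gru_units, activation="tanh")(h)

    # Eq. (2): alpha_t = softmax_t(u_t^T p), with p acting as the context vector.
    scores = tf.keras.layers.Dot(axes=(2, 1))([u, p])        # shape (batch, seq_len)
    alpha = tf.keras.layers.Softmax(axis=-1)(scores)

    # Eq. (3): d = sum_t alpha_t h_t
    d = tf.keras.layers.Dot(axes=(1, 1))([alpha, h])         # shape (batch, gru_units)

    # Concatenate the two sentence vectors, then Dense + softmax classification.
    merged = tf.keras.layers.Concatenate()([d, p])
    merged = tf.keras.layers.Dropout(0.5)(merged)
    hidden = tf.keras.layers.Dense(64, activation="relu")(merged)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(hidden)
    return tf.keras.Model(inputs=[static_emb, p], outputs=outputs)

model = build_fusion_model()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])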
3
Experiments
The dataset, provided by Stanford University, was collected based on tags associated with consumer products, companies, and other topics. The entire dataset, gathered from Twitter between April 6, 2009 and June 25, 2009, contained over 1,600,000 tweets, of which 800,000 are positive and 800,000 are negative. We eliminated the columns that are irrelevant for the specific purpose of conducting Sentiment Analysis. As evaluation criterion we use the F1-score, which is widely adopted in text classification and SA in particular.
Fig. 1. The proposed architecture
3.1
Data Preprocessing
Emojis, hashtags, numbers, special characters and punctuation have been removed. HTML encoding has been replaced with text based on BeautifulSoup. We removed non-ASCII characters, replaced emoticons with text, and turned common contractions into their full form. Lastly, the data were randomly split into 70% for training and 30% for testing.
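The cleaning pipeline described above can be illustrated with a short Python sketch. It is only an approximation of the authors' procedure: the regular expressions, the contraction and emoticon dictionaries and the helper names are assumptions, and BeautifulSoup and scikit-learn are assumed for the HTML decoding and the 70/30 split.

import re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split

CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}  # illustrative subset
EMOTICONS = {":)": "happy", ":(": "sad"}                                  # illustrative subset

def clean_tweet(text: str) -> str:
    text = BeautifulSoup(text, "html.parser").get_text()      # decode HTML entities into text
    text = text.encode("ascii", errors="ignore").decode()     # drop non-ASCII characters and emojis
    for emo, word in EMOTICONS.items():                       # replace emoticons with text
        text = text.replace(emo, " " + word + " ")
    for contraction, full in CONTRACTIONS.items():            # expand common contractions
        text = re.sub(re.escape(contraction), full, text, flags=re.IGNORECASE)
    text = re.sub(r"#\w+|\d+", " ", text)                     # remove hashtags and numbers
    text = re.sub(r"[^a-zA-Z\s]", " ", text)                  # remove special characters and punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

def split_dataset(tweets, labels):
    # Random 70% training / 30% testing split, as described above.
    return train_test_split(tweets, labels, test_size=0.3, random_state=42)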
3.2
Experimental Settings
There are two ways in which word embeddings can be used: pre-trained or trained with the model. In our experiments, we generate the word embeddings by training the Word2Vec (Continuous Bag Of Words (CBOW) and Skip-Gram (SG)) and FastText models on the training dataset. We fix the word embedding dimension to 300. Given the different lengths of tweets, padding and truncating the sentences is required to equalize their lengths. According to the cumulative distribution function of the tweet lengths, about 100% of the processed tweets are composed of 60 tokens or less. As a result, we set the length of processed texts to 60 tokens. We use the pre-trained multilingual BERT model, which consists of a 12-layer transformer with token embeddings of size 768; it was trained by Google on Wikipedia in 104 languages, including English [11].
Baselines: In order to assess the efficiency of the proposed model in analyzing the sentiment of English tweets, we report the performance of a GRU applied only to the static word embeddings (with the parameter return_sequences set to False). Moreover, we show results for multilingual BERT alone, with a GRU layer applied on the obtained representations. In terms of the RNN layer, we set the GRU dimension to 100 and apply dropout after the RNN layer, with the dropout rate set to 0.5. We use the Adam optimizer with a batch size of 16, for 3 epochs, with an early stopping callback. The implementation is done using Keras with TensorFlow as the backend. All experiments are performed on a T4 GPU. An illustrative sketch of this embedding setup is given after Table 1.

Table 1. F1-score of the English SA system using the proposed approach and the baseline models, where numbers are in percentage.

Model                    Embedding   F1-score
The proposed approach    CBOW        81,5
The proposed approach    SG          83
The proposed approach    FastText    85
GRU                      FastText    77,5
MultilingualBERT         -           83,5
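Before discussing Table 1, the embedding and padding setup described above can be sketched as follows. This is a plausible reconstruction rather than the authors' code: it assumes gensim for Word2vec/FastText and the Keras tokenizer and padding utilities, and every hyperparameter other than the 300-dimensional vectors and the 60-token length is an assumption.

from gensim.models import Word2Vec, FastText
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def train_static_embeddings(tokenized_tweets, dim=300):
    # sg=0 selects CBOW, sg=1 selects Skip-Gram; all models are trained on the training tweets only.
    cbow = Word2Vec(sentences=tokenized_tweets, vector_size=dim, sg=0, min_count=1)
    skipgram = Word2Vec(sentences=tokenized_tweets, vector_size=dim, sg=1, min_count=1)
    fasttext = FastText(sentences=tokenized_tweets, vector_size=dim, min_count=1)
    return cbow, skipgram, fasttext

def encode_and_pad(texts, max_len=60):
    # Tweets are padded/truncated to 60 tokens, since about 100% of them are that long or shorter.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    return pad_sequences(sequences, maxlen=max_len, padding="post", truncating="post"), tokenizer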
Table 1 summarizes the performance of the baseline models and the proposed strategy for the English language. In this case, we use three embedding models (CBOW, SG and FastText); the best results are achieved with FastText. The first baseline model applies the GRU on top of the token embeddings of the multilingual BERT model only. The second one applies the GRU on the static word embeddings only, generated using the FastText model.
On the whole, the suggested model’s scores are significantly improved over those of the baseline models. This shows that the key enhancements lie in the fact that the proposed word embeddings (combination of static and contextual ones) are more suitable for the social media genre. Additionally, they also pick up further facets of word meanings. While BERT suffers from rare words, the proposed word embeddings yield rich additional information for rare words in social media. Further, As shown in Table 1, the attention mechanism adopted in this paper is an efficient scheme to unify static and contextual embeddings using FastText and multilingual BERT, respectively.
4
Conclusion
In this article, we have introduced an attention mechanism-based approach towards combining static word embeddings with a BERT-based multilingual language model. Interestingly, our proposed approach surpasses the BERT-based multilingual model given that the latter was trained only on Wikipedia. Therefore, in order to overcome this problem, it would be appropriate to train a linguistic model on relevant social media. Such a strategy is capable of improving overall performance, but in practice it is not necessarily feasible. The use of static word vectors, for instance, can have a significant boost to the processing of emerging words, such as trending hashtags, as it would be too costly to upgrade the Language Models regularly (for many different languages). Likewise, incorporating static word vectors appears to be a valuable and promising approach to improve Language Models. Funding. This work was supported by the Ministry of Higher Education, Scientific Research and Innovation, the Digital Development Agency (DDA) and the CNRST of Morocco (Alkhawarizmi/2020/01).
References 1. Arora, D., Li, K.F., Neville, S.W.: Consumers’ sentiment analysis of popular phone brands and operating system preference using twitter data: A feasibility study. In: 29th IEEE International Conference on Advanced Information Networking and Applications, AINA 2015, 24-27 March 2015, Gwangju, South Korea, pp. 680–686 (2015) 2. Bahdanau, D., Cho, K., Bengio ,Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR (2015) 3. Bensalah, N., Ayad, H., Adib, A., Farouk, A.I.E.: Arabic Sentiment Analysis Based on 1-D Convolutional Neural Network. In: International Conference on Smart City Applications, SCA20 (2020) 4. Bensalah, N., Ayad, H., Adib, A., Farouk, A.I.E.: Combining word and character embeddings for Arabic chatbots. In: Advanced Intelligent Systems for Sustainable Development (AI2SD’2020), pp. 571–578 (2022)
5. Bensalah, Nouhaila, Ayad, Habib, Adib, Abdellah, Ibn El Farouk, Abdelhamid: CRAN: an hybrid CNN-RNN attention-based model for Arabic machine translation. In: Ben Ahmed, Mohamed, Teodorescu, Horia-Nicolai L.., Mazri, Tomader, Subashini, Parthasarathy, Boudhir, Anouar Abdelhakim (eds.) Networking, Intelligent Systems and Security. SIST, vol. 237, pp. 87–102. Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-3637-0 7 6. Blanzieri, E., Bryl, A.: A survey of learning-based techniques of email spam filtering. Artif. Intell. Rev. 29(1), 63–92 (2008) 7. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017) 8. Chen, Z., Yang, R., Cao, B., Zhao, Z., Cai, D., He, X.: Smarnet: Teaching machines to read and comprehend like human. CoRR, abs/1710.02772, 2017 9. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, A meeting of SIGDAT, a Special Interest Group of the ACL, 25-29 October 2014, Doha, Qatar, pp. 1724–1734 (2014) 10. Col´ on-Ruiz, C., Segura-Bedmar, I.: Comparing deep learning architectures for sentiment analysis on drug reviews. J. Biomed. Inform. 110, 103539 (2020) 11. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, 2-7 June 2019, USA, vol.1, pp. 4171–4186 (2019) 12. Gonz´ alez-Carvajal, S., Garrido-Merch´ an, E.C.: Comparing BERT against traditional machine learning text classification. CoRR, abs/2005.13012 (2020) 13. Harris, Z.S.: Distributional structure. In: Papers on Syntax. Synthese Language Library, vol. 14, pp. 3–22, Springer, Dordrecht. (1981) 14. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 15. Kalchbrenner, N., Grefenstette, E., Blunsom, P.: A convolutional neural network for modelling sentences. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, 22-27 June 2014, Baltimore, USA, vol.1, pp. 655–665. (2014) 16. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, A meeting of SIGDAT, a Special Interest Group of the ACL, 25-29 Oct 2014, Doha, Qatar, pp. 1746–1751 (2014) 17. Lagrari, F.E., Ziyati, H., Kettani, Y.E.: An efficient model of text categorization based on feature selection and random forests: case for business documents. In: Advanced Intelligent Systems for Sustainable Development (AI2SD’2018), pp. 465– 476 (2019) 18. Lai, M., Cignarella, A.T., Far´ıas, D.I.H., Bosco, C., Patti, V., Rosso, P.: Multilingual stance detection in social media political debates. Comput. Speech Lang. 63, 101075 (2020) 19. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, pp. 142–150 (2011)
20. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations, ICLR (2013) 21. Pagolu, V.S., Challa, K.N.R., Panda, G., Majhi, B.: Sentiment analysis of twitter data for predicting stock market movements. CoRR, abs/1610.09225 (2016) 22. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 1532–1543. ACL (2014) 23. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013) 24. Wang, S.I., Manning, C.D.: Baselines and bigrams: simple, good sentiment and topic classification. In: The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 8-14 July 2012, Jeju Island, Korea, vol.2, pp. 90–94 (2012) 25. Zhang, Z., et al.: Semantics-aware BERT for language understanding. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The ThirtySecond Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, 7-12 Feb 2020, New York, USA, pp. 9628–9635 (2020)
Comparative Study on the Density and Velocity of AODV and DSDV Protocols Using OMNET++
Ben Ahmed Mohamed1 , Boudhir Anouar Abdelhakim1(B) , Samadi Sohaib2 , Faham Hassan2 , and El belhadji Soumaya2 1 Laboratory List, FSTT, Abdelmalek Essaadi University, Tetouan, Morocco
{mbenahmed,aboudhir}@uae.ac.ma
2 List Laboratory, FSTT, Abdelmalek Essaadi University, Tetouan, Morocco
{sohaib.samadi,hassan.faham,soumaya.elbelhadji}@etu.uae.ac.ma
Abstract. In the last few years there have been many spontaneous events and situations that happen without any planning. Such situations need a flexible, temporary and high-performance ad-hoc network, which in turn requires improved routing protocols to handle multi-hop communications. Multi-hop communication is the most appropriate solution to overcome the limited transmission range of mobile terminals. The choice of which routing protocol to use at this step is critical and may directly affect the efficiency of the network. In this work we compare two of the most popular MANET routing protocols, AODV and DSDV. The comparison is done using the OMNeT++ simulator based on three metrics, namely end-to-end delay, packet delivery ratio and throughput, and the results are obtained by varying the number of nodes and the pause time. Keywords: MANET · AODV · DSDV · Ad hoc · OMNeT++ · Multi-hop
1 Introduction A wireless ad hoc network, or MANET (Mobile Ad Hoc Network) [1], is a technology that differs from traditional wireless communication networks. A wireless ad hoc network does not need the support of fixed equipment [2]. Each node, that is, the user terminal, forms a network by itself, and other user nodes perform data forwarding during communication [3, 4]. This network form breaks through the geographical limitations of traditional wireless cellular networks and can be deployed more quickly, conveniently and efficiently. Networks of this kind can be successfully used in various fields of human activity, for example in military communications, in emergency networks during emergencies, and in remote areas where there is no pre-existing infrastructure. A wireless ad-hoc network is a multi-hop, mobile, peer-to-peer network composed of dozens to hundreds of nodes; it uses wireless communication and has a dynamic topology. Its purpose is to transmit multimedia information flows with quality-of-service
requirements through dynamic routing and mobility management techniques [5]. Routing in this kind of networks can be executed using many types of routing protocols that can be categorized under different metrics. The most general distinction of MANET routing protocols is proactive and reactive. In this paper, we will briefly describe two protocols and we will try to compare their performances for UDP packet transmission through end–to-end delay, throughput and packet delivery ratio as metrics. Actually, our goal is to carry out this analysis to study node density and node pause times effects in AODV and DSDV. We made a number of simulations for different scenarios to compare the protocols performances. The rest of the paper is organized as follows: Sect. 1 - illustrates the necessity and motivation of the research, Sect. 2 - briefly describes the routing protocols used in this research, Sect. 3 - gives the overview of the simulation environment, and finally Sect. 4 - analyzes the results obtained.
2 Related Works
Many researchers have carried out studies to compare the performance of ad-hoc routing protocols and evaluate each of them based on several metrics. These works have been done using different simulators; the details of the literature review are as follows. Albaseer et al. [8] compared the most popular routing protocols, namely AODV, DSR, DSDV and OLSR, using the OMNET++ simulator; the simulation was run with a varying number of nodes and considered three metrics: end-to-end delay, collisions and packet delivery ratio. In their simulation results, they observe that collisions increase as the number of nodes increases and that DSDV shows the highest end-to-end packet delay from 40 nodes onwards. D. Kampitaki and A. Economides [9] investigate the impact of the type of traffic on the performance of three popular MANET routing protocols, OLSR, DSR and AODV, in the OMNET++ simulator. F. T. Al-Dhief et al. [10] present a performance evaluation of the DSR, AODV and DSDV routing protocols to determine which one is more effective. The study was done using the Network Simulator (NS) in terms of packet delivery ratio, average throughput, average end-to-end delay and packet loss ratio with a varying number of nodes. S. K. Gupta et al. [11] presented simulation results for the AODV and DSDV protocols using Network Simulator-2 (NS-2); the results are based on Quality of Service (QoS) metrics such as throughput, drop, delay and jitter.
3 Methods and Techniques In this section, we discuss the simulation setup used to test our scenarios going through the simulator, parameters and performance metrics.
Fig. 1. Ad-Hoc routing protocols [12].
3.1 Routing Protocols Overview
Ad-hoc On-demand Distance Vector (AODV) Protocol. Ad Hoc On-demand Distance Vector (AODV) is one of the most widely used reactive routing protocols (Fig. 1). It builds a routing path before the packet is sent, using three processes: route request (RREQ), route reply (RREP) and route maintenance (RERR). Each node maintains a routing table and processes the different packets accordingly in order to keep its routing information correct and effective [6].
Destination-Sequenced Distance Vector (DSDV) Protocol. The Destination-Sequenced Distance Vector (DSDV) protocol is based on the Bellman-Ford algorithm and is an improvement of the distance vector routing protocol. The routing table of each node includes the destination address, the metric (minimum hop count), the identifier of the destination node, the next hop, and the sequence number associated with the destination node. Each node exchanges its updated routing table with the others; updates to the routing table can be sent to other nodes in two ways: full dump update or incremental update. A short illustrative sketch of this update rule is given after Table 1.
3.2 Simulation Environment
To perform our analysis, we chose OMNeT++ as the main simulator. OMNeT++ is an extensible, modular, component-based C++ simulation library and framework, primarily for building network simulators. It offers an Eclipse-based IDE, a graphical runtime environment, and a host of other tools. OMNeT++ provides a component architecture for models programmed in C++, then assembled into larger components and models using a high-level language (NED) [13]. To use our protocols, we need the INET Framework within OMNeT++. INET contains models for the Internet stack (TCP, UDP, IPv4, IPv6, OSPF, BGP, etc.), wired and wireless link layer protocols (Ethernet, PPP, IEEE 802.11, etc.), support for mobility, MANET protocols, DiffServ, MPLS with LDP and RSVP-TE signaling, several application models, and many other protocols and components [14]. The following table gives the value used for each parameter in this simulation.

Table 1. Simulation parameters.

Parameter                   Value
Operating System            Windows 10 Pro
Simulator                   OMNeT++ V 5.7
Type of mobility model      Random waypoint
Dimension of topology       1600 m * 600 m
Speed                       Uniform(20 m/s, 50 m/s)
Pause time                  20 s, 40 s, 60 s, 80 s, 100 s
MAC layer type              802.11/Mac
Sources (fixed)             2 nodes
Destinations (fixed)        2 nodes
Mobile nodes                8, 16, 32, 40 and 80
Packet size                 512 bytes
Simulation time             100 s
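To make the DSDV mechanism summarized in Sect. 3.1 concrete, the snippet below shows the classic sequence-number update rule applied when a node receives a route advertisement from a neighbor. It is a simplified, self-contained illustration, not the INET implementation; the data structures and names are assumptions.

from dataclasses import dataclass

@dataclass
class Route:
    next_hop: str
    metric: int      # hop count to the destination
    seq_num: int     # destination-issued sequence number

def dsdv_update(table: dict, dest: str, advertised: Route, neighbor: str) -> None:
    """Update the routing-table entry for `dest` from an advertisement sent by `neighbor`.
    Newer sequence numbers always win; for equal sequence numbers the shorter route wins."""
    candidate = Route(next_hop=neighbor,
                      metric=advertised.metric + 1,   # one extra hop through the neighbor
                      seq_num=advertised.seq_num)
    current = table.get(dest)
    if (current is None
            or candidate.seq_num > current.seq_num
            or (candidate.seq_num == current.seq_num and candidate.metric < current.metric)):
        table[dest] = candidate

# Example: node A learns a fresher route to destination D via neighbor B.
table = {"D": Route(next_hop="C", metric=3, seq_num=10)}
dsdv_update(table, "D", Route(next_hop="D", metric=1, seq_num=12), neighbor="B")
print(table["D"])   # Route(next_hop='B', metric=2, seq_num=12)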
3.3 Performance Metrics
The following metrics are used in this paper to analyze the performance of the AODV and DSDV routing protocols:
Packet Delivery Ratio (PDR): the ratio of the total number of packets successfully received by the destination nodes to the number of packets sent by the source nodes throughout the simulation. It is expressed as a percentage (%), and higher PDR values indicate better performance [7].
PDR = (Number of received packets / Number of packets sent) × 100%   (1)
Average End to End Delay (EED): the time taken by a packet to reach its destination, computed as the sum of all delays (propagation, transmission, queuing and processing) divided by the number of packets received. EED is expressed in seconds (s), and lower values indicate better performance.
EED = Delay sum / Number of received packets   (2)
Throughput: the total amount of data transmitted from the source node to the destination node per unit of time. It is expressed in kilobits per second (Kbps), and higher values indicate better performance.
Throughput = (Total received bytes × 8) / (Simulation time × 1024)   (3)
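To make Eqs. (1), (2) and (3) concrete, the short helper below computes the three metrics from the counters collected at the end of a simulation run. It is only an illustrative post-processing sketch, not part of the OMNeT++ model itself; the function names and example numbers are assumptions.

def packet_delivery_ratio(received_packets: int, sent_packets: int) -> float:
    """PDR in percent, Eq. (1)."""
    return received_packets / sent_packets * 100.0

def average_end_to_end_delay(delay_sum_s: float, received_packets: int) -> float:
    """Average EED in seconds, Eq. (2); delay_sum_s aggregates propagation,
    transmission, queuing and processing delays of all received packets."""
    return delay_sum_s / received_packets

def throughput_kbps(received_bytes: int, simulation_time_s: float) -> float:
    """Throughput in kilobits per second, Eq. (3)."""
    return received_bytes * 8 / (simulation_time_s * 1024)

# Example with illustrative numbers: 900 of 1000 packets delivered in a 100 s run.
print(packet_delivery_ratio(900, 1000))       # 90.0 (%)
print(average_end_to_end_delay(18.0, 900))    # 0.02 (s)
print(throughput_kbps(4_600_000, 100))        # ~359.4 (Kbps)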
3.4 Simulation Scenario In this paper, the main goal of our experiment was to analyze the impact of node density and pause time on routing protocols. To do so, we test DSDV and AODV routing protocols over two parameters that we change alternately. We put two fixed source nodes and two fixed destination nodes and vary the number of mobile nodes (8, 16, 32, 40, 80) and fix the pause time at 20 s, then we change the pause time (20s, 40s, 60s, 80s, 100s) while all other parameters are fixed including the number of nodes at 16, and again measure all the metrics needed.
Fig. 2. Simulation example with 16 nodes.
4 Results and Discussions
As stated in Sect. 3.3, the performance of AODV and DSDV has been analyzed with a variation of the number of nodes (8, 16, 32, 40, 80) and of the pause time (20 s, 40 s, 60 s, 80 s, 100 s), with a fixed bitrate value of 2 Mbps. We measured the packet delivery ratio, average end-to-end delay and throughput of AODV and DSDV, and the simulation output is shown in the graphs of Fig. 3 to Fig. 8.
4.1 Node Density Versus AODV and DSDV
Packet Delivery Ratio (PDR): First, we discuss the impact of node density on the packet delivery ratio for AODV and DSDV, and we can see clearly from Fig. 3 that
AODV maintains a higher value of PDR throughout the whole experiment, while DSDV is affected when the number of nodes increases, particularly once it passes 40 nodes.
Fig. 3. Node density versus AODV and DSDV Packet Delivery Ratio analysis.
Average End to End Delay: Fig. 4 shows that EED becomes higher in DSDV when the number of nodes reaches 32 and above, which means packets sent from the sources take longer to reach their destinations, whereas AODV keeps lower EED values even in a dense environment. This is because DSDV requires regular updates of its routing tables, which consume a small amount of bandwidth even when the network is idle, while AODV is an on-demand algorithm, meaning that it builds routes between nodes only as desired by the source nodes.
Fig. 4. Node density versus AODV and DSDV Average End to End Delay analysis.
Throughput: When comparing the routing throughput of each protocol, AODV has a higher throughput value than DSDV, which means that using AODV allows more data to be delivered to the destinations; on the other hand, DSDV's throughput drops to half when the number of nodes reaches 80. In DSDV, throughput decreases because of the periodic updates of routing information, especially when node mobility is at high speed. In AODV, throughput is stable because it does not broadcast any routing information.
Fig. 5. Node density versus AODV and DSDV Throughput analysis.
4.2 Pause Time Versus AODV and DSDV
Packet Delivery Ratio (PDR): When comparing AODV and DSDV, AODV has shown better performance for every single value of pause time, whereas DSDV has shown slightly better results for the higher pause times. In this situation, the problem with DSDV is that its routing information, determined at the start of a simulation, may be old and invalid in the middle of mobility.
Fig. 6. Pause time versus AODV and DSDV Packet Delivery Ratio analysis.
Average End to End Delay: AODV has larger delays for lower pause times but lower delay values for higher node pause times, starting from 40 s; DSDV behaves the opposite way, with lower delay values for lower node pause times compared to AODV.
Fig. 7. Pause time versus AODV and DSDV Average End to End Delay analysis.
Throughput: Here, the average throughput of AODV is clearly better than that of the DSDV routing protocol, which decreases continuously as the value of the pause time increases. As throughput depends on time and DSDV is a table-driven protocol, it requires extra time to set up its routing tables before delivering packets to the next node; thus, its throughput remains lower than that of AODV.
Fig. 8. Pause time versus AODV and DSDV Throughput analysis.
4.3 Comparison Between OMNET++ and NS2 for AODV
While carrying out the simulations and analyzing the results, we also compared our AODV results with those found using the NS2 simulator and summarize them in Table 2.
4.4 Comparison Between OMNET++ and NS2 for DSDV
Concerning the DSDV routing protocol, we recapitulate the comparison results between the two simulators in Table 3.
Table 2. AODV performance in OMNET++ and NS2.

Nbr of nodes              8                  16                 32                 40                 80
                    OMNET++   NS2      OMNET++   NS2      OMNET++   NS2      OMNET++   NS2      OMNET++   NS2
PDR (%)              79,61    70,03     86,64    75,18     84,88    62,71     81,52    57,82     36,73    43,48
EED (s)              0,0198   0,1421    0,0616   0,138     0,7829   0,1409    1,0671   0,1342    1,3201   0,163
Throughput (kbps)    318,44   275,47    346,56   248,02    339,52   246,62    326,08   245,55    146,92   230,57

Pause time (s)            20                 40                 60                 80                 100
                    OMNET++   NS2      OMNET++   NS2      OMNET++   NS2      OMNET++   NS2      OMNET++   NS2
PDR (%)              86.64    68        90.27    70        90.35    73        89.6     79        89.03    84
EED (s)              0.192    0.007     0.021    0.015     0.022    0.019     0.042    0.028     0.019    0.021
Throughput (kbps)    388.546  300       371.588  275       370.073  272       367      271       364.666  268
Table 3. DSDV performance in OMNET++ and NS2.

Nbr of nodes              8                  16                 32                 40                 80
                    OMNET++   NS2      OMNET++   NS2      OMNET++   NS2      OMNET++   NS2      OMNET++   NS2
PDR (%)              93,37    90,72     99,92    99,871    99,73    98,396    98,28    96,57     98,17    94,83
EED (s)              0,1067   0,162     0,037    0,1572    0,0402   0,1712    0,097    0,187     0,0713   0,1687
Throughput (kbps)    373,48   253,83    399,68   231,39    398,92   232,98    393,12   245,15    392,68   251,07

Pause time (s)            20                 40                 60                 80                 100
                    OMNET++   NS2      OMNET++   NS2      OMNET++   NS2      OMNET++   NS2      OMNET++   NS2
PDR (%)              99.92    96.5      99.94    96        99.98    97        99.96    98        99.98    97.8
EED (s)              0.192    0.03      0.021    0.045     0.019    0.04      0.022    0.038     0.019    0.022
Throughput (kbps)    399.68   235       399.76   226       399.92   233       399.84   240       399.92   252
5 Conclusion and Future Work
This paper provides an explanation and simulation analysis of two widely used MANET routing protocols, one reactive and the other proactive (AODV and DSDV), for ad hoc mobile networks. Their performances in networks of different sizes and in mobile scenarios have been studied using simulations developed in OMNET++. It has also presented a comparison of these routing protocols under variation of the pause time and the number of nodes, with performance measured using various metrics including throughput, packet delivery ratio and end-to-end delay. It is noticed that the performance of the AODV routing protocol is better for all the metrics compared to the DSDV routing protocol. In the future, the reasons that cause this variation in the results can be examined in a more precise manner. Another direction for our future work is to further analyze the impact of mobility speed and bitrate on other performance metrics. It is also possible to test the energy consumption of these protocols in order to identify the optimal parameter values that give a lower energy consumption level.
References 1. Murthy, C.S.R., Manoj, B.S.: Ad hoc Wireless Networks: Architectures and Protocols, p. 879. Pearson Education (2004) 2. Abdulleh, M.N., Yussof, S., Jassim, H.S.: Comparative study of proactive, reactive and geographical manet routing protocols. Commun. Netw. 07(02), 125–137 (2015). https://doi.org/ 10.4236/cn.2015.72012 3. Singh, G., Singh, A.: Performance evaluation of Aodv and Dsr routing protocols for Vbr traffic for 150 nodes in manets. Int. J. Comput. Eng. Res. 2(5), 1583–1587 (2012) 4. Kaur, R., Rai, M.K.: A novel review on routing protocols in manets. Undergraduate Acad. Res. J. 103–108 (2012) 5. Talooki, A.N., Rodriguez, J.: Quality of service for flat routing protocols in mobile ad-hoc network. In: ICST, 7–9 Sep 2009 (2009) 6. Perkins, C., Royer, E., Das, S.: Ad hoc on-demand distance vector (AODV) routing. rcf-3561. Ietf, (2003) 7. Elahe, F., et al.: A Study of Black Hole Attack Solutions On AODV Routing Protocol in MANET. Elsevier (2016). https://doi.org/10.1016/C2015-0-04114-4 8. Albaseer, A., Bin Talib, G., Bawazir, A.: Multi-hop wireless network: a comparative study for routing protocols using OMNET++ simulator. J. Ubiquitous Syst. Pervasive Netw. 7(1), 29–33 (2016) 9. Kampitaki, D., Economides, A.A.: Simulation study of manet routing protocols under ftp traffic. Procedia Technol. 17, 231–238 (2014) 10. Taha, F., AL-Dhief, Naseer Sabri, M.S. Salim, S. Fouad, S. A. Aljunid,: Manet routing protocols evaluation: AODV, DSR and DSDV perspective. MATEC Web Conf. 150, 06024 (2018) 11. Gupta, S.K., Saket, R.K.: Performance metric comparison of AODV and DSDV routing protocols in MANETs using NS-2. Int. J. Res. Rev. Appl. Sci. 7(3), 339–350 (2011) 12. Kannammal, K.E., Eswari, K.E.: Behaviour of Ad hoc routing protocols in different terrain sizes. Int. J. Eng. Sci. Technol. 2(5), 1298–1303 (2010)
13. Kaur, P., Kaur, D., Mahajan, R.: Simulation based comparative study of routing protocols under wormhole attack in MANET. Wireless Pers. Commun. 96(1), 47–63 (2017) 14. Steinbach, T., et al.: An extension of the OMNeT++ INET framework for simulating realtime ethernet with high accuracy. In: Proceedings of the 4th International ICST Conference on Simulation Tools and Techniques, pp. 375–382 (2011)
Cybersecurity Awareness Through Serious Games: A Systematic Literature Review Chaimae Moumouh1,2(B), Mohamed Yassin Chkouri1, and Jose L. Fernández-Alemán2
SIGL Laboratory, ENSATE, Abdelmalek Essaadi University, Tetouan, Morocco [email protected], [email protected] Department of Computer Science and Systems, University of Murcia, Murcia, Spain [email protected]
Abstract. Technological innovations represent a challenge for security professionals and ordinary individuals in many industries, including business, e-learning, and health. This increases the need for more understanding and knowledge of cybersecurity. The worldwide digital system, as well as people's privacy, are under constant threat in relevant areas such as health, where patients' data is among the most confidential information. The serious game domain is rapidly expanding, and it focuses on using game elements in non-game situations to reassure learning and motivation among users while maintaining a fun element throughout the procedure. The purpose of this study is to outline existing literature, and investigate the numerous serious games used as a tool to increase cybersecurity skills among employees in different domains.
Keywords: Serious games · Cybersecurity · Systematic literature review
1
Introduction
The digital revolution has been steadily advancing over the previous several years. Individuals, organizations, and the international community all benefit from the migration of activities to digital technology [1]. These capabilities became an aim for growing economies as well as a defining trait of developed ones. Many elements influence the success of applying the aforementioned concepts, including the concept of sustainable development, one of which is the necessity to maintain cybersecurity [2]. Achieving goals connected to future technological growth necessitates guaranteeing the security of information and systems deployed in online communications [3]. The rising cyber threats to individuals, corporations, and authorities and the evolution of cyberattacks have made cybersecurity a worldwide concern [4]. The rising value of personal information makes it more appealing to individuals seeking to breach systems for profit, or to simply cause chaos [5]. In today's world and societies
that depend on digitalization, ensuring cybersecurity through increased awareness is critical for attaining the goals of Sustainable development [6]. However, investment in cybersecurity, as well as how this area should be addressed by the government and industry, has been a source of debate throughout the previous decade [7]. To address security problems in the digital environment, the term cybersecurity is used frequently in the literature. Cybersecurity is a wide phrase with many diverse interpretations. It lacks a precise literal or practical meaning on which all academics may agree [8]. Several studies found that human error can be a major threat to cybersecurity [9]. According to [10] data breaches caused by systems or employee errors are exorbitant, with an estimated cost of e2.74 million. In addition, the Verizon’s annual survey on security breaches [11], indicated that more than 80% of cybersecurity intrusions contain a human factor. Therefore, to avoid or minimize the impact of cyberattacks on company performance, and organizational knowledge, regular security awareness training for all employees should be provided [12]. Employees acquire more knowledge about the security requirements that must be in place to secure critical information, as well as corporate norms, regulations, and processes for better cybersecurity management via Cybersecurity training programs. Thus, employees may be able to properly manage data and react to threats related to cybersecurity, through practice and regular application. Practical learning and training is the most effective way to develop and improve skills among staff [13]. Gamification or serious games uses the elements that make games attractive and interesting, and applies them in non-game environments such as education or in work to improve the player experience. Serious games can be an enjoyable tool to interact employees while raising awareness [14]. Our study’s major goal is to perform a systematic literature review (SLR), regarding the use of serious games to train and raise awareness among employees about cybersecurity. In order to obtain a comprehensive view of the present state of the literature in the topic, a detailed investigation was conducted. This paper summarizes the outcomes of a thorough review of studies about serious games and gamified apps, particularly those aimed at raising security and privacy awareness. The sections of the paper are structured as follows: Sect. 2 explains the methodology used to develop the systematic literature review. In Sect. 3, the results of this analysis are represented, along with answers to the research questions. Section 4, includes a brief conclusion.
2
Methodology
2.1
Systematic Literature Review
In order to get an overview on the realized investigations and the current state of art in a field of interest, a SLR (Systematic Literature Review) is an essential method to conduct. The main purpose of the study is to analyze the existing
and related literature [15]. This methodical procedure will result to finding that will give opportunity for researchers to design future work. The research focuses on serious game related to training and raising awareness about cybersecurity among employees. In this literature review the characteristics and impact of serious games on the users’ knowledge about cybersecurity, will be investigated to determine if serious games can help raise awareness and reduce human errors when it comes to managing and communicating delicate information. Determine the research questions (RQ), is the first step to take when conducting a systematic literature review, after identifying the search string, the appropriate studies are collected based on the link to the study in accordance with the subject under consideration and based on the inclusion and exclusion criteria. The gathered data is then analyzed, and the findings are presented. In conclusion, SLR aims to answer RQ regarding the research topics in general by discovering, and assessing existing research. 2.2
Research Question
Defining research questions can help to keep the research focused and on track, so that the collection of data and papers for the study may be more structured and efficient. Performing an initial overview related to the study subject can be used to formulate the RQ. Table 1 shows the research questions chosen for this study, as well as the justification for each one.

Table 1. Research questions & rationale.

RQ1. What is the distribution through time and across countries of research on the use of serious games to train cybersecurity? Rationale: determine the temporal and geographical evolution of publications on serious games as tools to teach cybersecurity to professionals.
RQ2. Which serious games were created to raise awareness about cybersecurity among employees? Rationale: to investigate whether any serious games about this issue have been created.
RQ3. Are there articles presenting evidence that serious games improved users' cybersecurity knowledge? Rationale: to investigate the usefulness of serious games in increasing understanding about cybersecurity issues using the available literature.
To investigate the usefulness of serious games in increasing understanding about cybersecurity issues using the available literature
Search Strategy
On March 2nd, the prominent international databases were searched for publications, and to make sure that the whole purpose of the study is covered and to find the most relevant papers the following search string was used (“serious game”
AND “training”AND “cybersecurity”). Scopus, Sciendirect, Springer Database, ACM Digital Library and IEEE Explorer digital libraries were determined to perform the literature search, in a period of ten years (from 2011–2021). To extract the data, the search string was performed to the content (title, keywords, and abstract) at every digital library. Filtering was done at the phase of screening for relevant publications using inclusion and exclusion criteria. These criteria are intended to make it easier to extract information from the publications found, and attempts to maximize time efficiency throughout the analysis phase. The presence of constraints based on specified criteria might help to refine the research process. Also, the number of publications to be examined can be reduced by using the aforementioned norms. The inclusion and exclusion criteria are presented bellow. Inclusion Criteria: – IC1. Papers written in english. – IC2. Papers were published between 2011 and 2021. – IC3. The paper discusses the use of serious games to raise cybersecurity awareness. – IC4. Articles available in Full text. Exclusion Criteria: EC1. Book, book series. EC2. Duplicate articles. EC3. Articles not related to serious games and cybersecurity training. EC4. Papers that examine the use of serious games in any topic other than cybersecurity or for users other than employees, as well as those that do not provide enough information. – EC5. The study is not in the domain of serious games for learning about cybersecurity.
39 papers in the form of a full text were gathered at the last step of the search strategy, and the comprehensive discussion for each publication was assessed. The analytical results are then gathered to answer each research question.
3
Results and Discussion
The findings of the systematic literature study, that was carried out to address the research questions stated in Table 1 are presented in this section. In March 2022, the selection procedure was carried out. The number of results retrieved in each phase of the search and screening of articles that will be used as sources for our study are depicted in Fig. 1. The database searches retrieved 397 publications when the previously stated search keywords were used. After discarding duplicates and publications that did not fulfil the first two inclusion criteria, which are articles that were not written in English and were not published between January 2011 and December 2021, 212 publications were included for a metadatabased screening (Title, keywords and abstract). After applying IC3 and EC3,
131 papers out of 212 were eliminated. If complete texts were accessible, the remaining 81 publications were verified. After using IC4 and EC2, 42 articles were removed, and 39 papers focusing on serious gaming and cybersecurity were chosen. Following full-text screening and the use of EC4 and EC5, a total of 21 publications were finally included in the SLR. Table 2 displays the following search results in each database.
Fig. 1. Filtration method using particular criteria.
RQ1: What is the distribution through time and across countries of research on the use of serious games to train employees about cybersecurity? The publication trend of the considered publications from 2011 to 2021 is depicted in Fig. 2. All of the selected studies were published from 2017 to 2021, with 2020 as the year with the highest number of publications (11 articles). Figure 3, illustrates the most productive countries, which were retrieved from the nationality of the main author of each of the selected publications. These graphs depict the temporal and regional distribution of research regarding serious games as tools to raise awareness about cybersecurity. There were 11 nationalities identified, with Germany, and United Kingdom as the most prolific countries
Table 2. The research results.

ID   Database              Number of articles
1    IEEE Explorer         17
2    Scopus                35
3    Springerlink          300
4    Sciendirect           25
5    ACM Digital Library   20
     Total                 397
Fig. 2. Temporal distribution of the selected papers.
in research about serious games and cybersecurity. However, there appears to be a gradual increase in global interest in the topic. Despite conducting a research throughout the previous ten years, only those published in the recent five years (2017–2021) were selected. For more than a decade, the domain of serious games has been quickly expanding, and has taken off in a variety of domains, most notably industry and education [16]. Even though “Serious Games” appeared to have started around 2002, several games were created for serious reasons prior to this date [17]. What may have led to 2017 being the start of the evolution of literature in serious games for cybersecurity awareness is that the year was
Fig. 3. Geographical distribution of the selected papers.
marked as a record-breaking year for cybercrime. According to the Online Trust Alliance (OTA), the frequency of cyberattacks rose dramatically from the preceding year [18]. In addition, the famous ransomware attack known as Wannacry that paralyzed the functioning of 16 hospitals in the United Kingdom happened in 2017 [19]. This may have been a reason that led organisations to find more suitable ways to raise awareness about cybersecurity among their employees. Giving that research found that human errors help in a considerable percentage in causing these assaults, and that some incidents could have been avoided if simple security practices had been followed [20]. On the other hand geographical distribution of studies regarding employees’ awareness on cybersecurity using serious games appears obvious. The majority of publications in the field have been produced by writers from European countries such as Germany [21–23] and the United Kingdom [24,25], Italy [26]. Other countries have contributed with a lower frequency including USA [27], India [28], Switzerland [29], Austria [30] and others more. Even though the exclusion criteria excluded many papers for this literature review, publications from the unmentioned counties were rare. This geographical imbalance might be explained by the fact that cybersecurity is a worldwide issue. Consequently, the most developed countries were undoubtedly the first to address this issue and make significant contributions. RQ2: Which serious games were created to raise awareness about cybersecurity? Various serious games were designed in this scope, namely the training game introduced by W. Robert in his paper. The game adopts a different way of training by allowing the user to take the role of an attacker, which the authors believe has a variety of learning benefits. In the scenario proposed in the paper, the user and a virtual character exchange a conversation on a chat application, where the hacker persuade the user into launching an attack and then describes the different attack possibilities in a guided discourse. The CyberSecurity Challenges (CSCs) is a serious game, designed to raise awareness for cybersecurity topics among industrial software engineers. The game consists of several challenges designed to raise awareness on secure coding guidelines and secure coding on software developers. These challenges are implemented to improve the defensive skills of the participants. In a different study, authors design a cybersecurity training game that will help players learn and grasp basic cybersecurity concepts. The players are exposed to realistic security attack scenarios. How to spot phishing emails, is one of the examples introduced in the game. Users will be able to manage their character and find a solution to the challenges they are presented with. Users will be guided through a tale step by step, and they will be required to make decisions that represent their character’s actions. Finally, a system evaluates the users’ responses and provides them with the correct answer along with an explanation. “Operation Digital Chamaeleon” is a Serious Game that is intended for IT-Security professionals, to help raise their awareness on IT-Security, by applying Wargaming to the field. The game contains the red team that creates assault plans, the blue team creates IT-security solutions, and the white team determines who wins. Employees’ careless actions and errors have
been proven to be responsible for a variety of breaches, resulting in a rise in data loss and in the possibility of identity theft. In this scope, a study investigates the use of a web-based gamification application to drive web users to become cyber aware through a fun method. This approach aims to enhance their online behavior, and thereby improve their cybersecurity behavior, by training them to recognize certain typical web-spoofing attacks. These attacks may cause confusion between a real website and a forged one. As a result, the app is constructed around some of the most commonly visited websites, using screenshots of the genuine sites and their damaged or phished counterparts, with participants asked to identify the legitimate page. Numerous serious games aiming to enhance cybersecurity awareness among employees were found in the selected papers, including Another Week at the Office (AWATO), PROTECT, “Guess Who? - A Serious Game for Cybersecurity Professionals”, and a Unity role-playing quiz application (RPG). Each one addresses a different problem and gives relevant solutions to help minimize human errors. RQ3: Are there articles presenting evidence that serious games improved users’ cybersecurity knowledge? It is difficult to conduct long-term research on the influence of the developed serious games on cybersecurity awareness, due to a variety of factors, including the expansion of IT-security technologies, the evolution of attacks, and more. Nevertheless, some of the aforementioned serious games were tested and evaluated by different users to demonstrate the impact of this training tool. The effectiveness of the CyberSecurity Challenges (CSCs) game was reported by collecting feedback from participants, who declared having fun during their participation in the game. The event was also well received by software engineers, and the game has had continual management approval throughout the years. In the game that offers realistic security attack scenarios such as “How to spot phishing emails”, university students were asked to play and review the experience in order to evaluate the efficiency of the game. The participants were then asked to take a survey and rate different aspects of the game, both the playful and the educational part. Results showed that the game was scored favorably in terms of comprehending cybersecurity challenges and solutions; as a result, the game helps users avoid rote learning. However, the fun element was found to be poor in this game, so the creators were asked to increase the motivation of the learners and include additional game actions and features. The testing of another application showed that users gave high ratings after trying it and found it easy to use. The ratings also suggest that users of the application are likely to benefit more from it with continuous practice and higher scores. The results have also shown the effectiveness of a gamified training approach in improving awareness.
4 Conclusion
This study presents the findings of a thorough literature review that highlights the available research on serious games used to raise cybersecurity awareness among employees. A starting collection of 397 papers was obtained from various
sources. Only 21 studies that fit the inclusion and exclusion criteria were selected. Three research questions were used to analyze these papers. Serious games were found to be attracting the attention of researchers who aim to increase users’ knowledge about cybersecurity. The cyberattacks that increase every year and cause data and financial loss may be one reason the topic of serious games attracts researchers’ attention. Furthermore, the findings enabled us to discuss whether this tool has proved to be effective in raising awareness about cybersecurity. Although numerous developed games were evaluated using surveys taken by users, we believe there is a need for more empirical evaluations of the effectiveness of serious games in the context of training and raising awareness.
References 1. Vu, K., Hanafizadeh, P., Bohlin, E.: ICT as a driver of economic growth: a survey of the literature and directions for future research. Telecommun. Policy 44(2), 101922 (2020) 2. Szczepaniuk, E.K., Szczepaniuk, H.: Analysis of cybersecurity competencies: recommendations for telecommunications policy. Telecommun. Policy 46, 102282 (2021) 3. ENISA: Behavioural Aspects of Cybersecurity C, no. December (2018) 4. Anderson, R., et al.: Savage, Measuring the Chaninging cost of cybercrime. Econ. Inf. Secur. Privacy, 1–32 (2012) 5. Kianpour, M., Kowalski, S.J., Øverby, H.: Advancing the concept of cybersecurity as a public good. Simul. Model. Pract. Theory 116(January), 102493 (2022) 6. UN: Resolution adopted by the general assembly on 6 July 2017, a/res/71/313, work of the statistical commission pertaining to the 2030 agenda for sustainable development (2017) 7. Shackelford, S.J.: Cyber War and Peace: Toward Cyber Peace. Cambridge University Press, Cambridge (2020) 8. Quayyum, F., Cruzes, D.S., Jaccheri, L.: Cybersecurity awareness for children: a systematic literature review. Int. J. Child-Comput. Interact. 30, 100343 (2021) 9. Corallo, A., Lazoi, M., Lezzi, M., Luperto, A.: Cybersecurity awareness in the context of the industrial internet of things: a systematic literature review. Comput. Ind. 137, 103614 (2022) 10. ENISA: Threat Landscape—ENISA. https://www.enisa.europa.eu/topics/threatrisk-management/threats-and-trends 11. Bassett, G., Hylender, C.D., Langlois, P., Pinto, A., Widup, S.: Data breach investigations report, Verizon DBIR Team, Technical report (2021) 12. He, W., et al.: Improving employees’ intellectual capacity for cybersecurity through evidence-based malware training. J. Intell. Capital (2019) 13. Bursk´ a, K.D., Rusˇ n´ ak, V., Oˇslejˇsek, R.: Data-driven insight into the puzzle-based cybersecurity training. Comput. Graph. (2022) 14. Pedreira, O., Garc´ıa, F., Brisaboa, N., Piattini, M.: Gamification in software engineering - a systematic mapping. Inf. Softw. Technol. 57, 157–168 (2015) 15. Dresch, A., Lacerda, D.P., Antunes, J.A.V.: Design science research. In: Dresch, A., Lacerda, D.P., Antunes, J.A.V. (eds.) Design Science Research, pp. 67–102. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-07374-3 4 16. Laamarti, F., Eid, M., El Saddik, A.: An overview of serious games. Int. J. Comput. Games Technol. (2014)
17. Ma, M., Oikonomou, A., Jain, L.C. (eds.): Serious Games and Edutainment Applications. Springer, London (2011). https://doi.org/10.1007/978-1-4471-2161-9 18. Loeb, L.: Cybersecurity incidents doubled in 2017. Study Finds (2018). https:// securityintelligence.com/news/cybersecurity-incidents-doubled-in-2017-studyfinds/ 19. Brandom, R.: UK hospitals hit with massive ransomware attack - The Verge (2017). https://www.theverge.com/2017/5/12/15630354/nhs-hospitals-ransomwarehack-wannacry-bitcoin 20. Hockey, A.: Uncovering the cyber security challenges in healthcare. Netw. Secur. 2020(4), 18–19 (2020) 21. Gasiba, T., Lechner, U., Pinto-Albuquerque, M.: CyberSecurity challenges for software developer awareness training in industrial environments. In: Ahlemann, F., Sch¨ utte, R., Stieglitz, S. (eds.) WI 2021. LNISO, vol. 47, pp. 370–387. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86797-3 25 22. Gasiba, T.E., Lechner, U., Pinto-Albuquerque, M.: Cybersecurity challenges in industry: measuring the challenge solve time to inform future challenges. Information (Switzerland) 11(11), 1–31 (2020) 23. Rieb, A., Lechner, U.: Towards a cybersecurity game: operation digital chameleon. In: Havarneanu, G., Setola, R., Nassopoulos, H., Wolthusen, S. (eds.) CRITIS 2016. LNCS, vol. 10242, pp. 283–295. Springer, Cham (2017). https://doi.org/10.1007/ 978-3-319-71368-7 24 24. Williams, M., Nurse, J.R., Creese, S.: (Smart) watch out! Encouraging privacyprotective behavior through interactive games. Int. J. Hum. Comput. Stud. 132, 121–137 (2019) 25. Coull, N., et al.: The gamification of cybersecurity training. In: Tian, F., Gatzidis, C., El Rhalibi, A., Tang, W., Charles, F. (eds.) Edutainment 2017. LNCS, vol. 10345, pp. 108–111. Springer, Cham (2017). https://doi.org/10.1007/978-3-31965849-0 13 26. Braghin, C., Cimato, S., Damiani, E., Frati, F., Riccobene, E., Astaneh, S.: Towards the monitoring and evaluation of trainees’ activities in cyber ranges. In: Hatzivasilis, G., Ioannidis, S. (eds.) MSTEC 2020. LNCS, vol. 12512, pp. 79–91. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-62433-0 5 27. Wray, R., Massey, L., Medina, J., Bolton, A.: Increasing engagement in a cyberawareness training game. In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) HCII 2020. LNCS (LNAI), vol. 12197, pp. 147–158. Springer, Cham (2020). https://doi.org/ 10.1007/978-3-030-50439-7 10 28. Gupta, S., Gupta, M.P., Chaturvedi, M., Vilkhu, M.S., Kulshrestha, S., Gaurav, D., Mittal, A.: Guess who? - A serious game for cybersecurity professionals. In: Marfisi-Schottman, I., Bellotti, F., Hamon, L., Klemke, R. (eds.) GALA 2020. LNCS, vol. 12517, pp. 421–427. Springer, Cham (2020). https://doi.org/10.1007/ 978-3-030-63464-3 41 29. Smyrlis, M., Fysarakis, K., Spanoudakis, G., Hatzivasilis, G.: Cyber range training programme specification through cyber threat and training preparation models. In: Hatzivasilis, G., Ioannidis, S. (eds.) MSTEC 2020. LNCS, vol. 12512, pp. 22–37. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-62433-0 2 30. Luh, R., Temper, M., Tjoa, S., Schrittwieser, S., Janicke, H.: PenQuest: a gamified attacker/defender meta model for cyber security assessment and education. J. Comput. Virol. Hack. Tech. 16(1), 19–61 (2019). https://doi.org/10.1007/s11416019-00342-x
Design and Implementation of a Serious Game Based on Recommender Systems for the Learning Assessment Process at Primary Education Level Fatima Zohra Lhafra1(B) and Otman Abdoun2 1 Polydisciplinary Faculty of Larache, Abdelmalek Essaadi University, Larache, Morocco
[email protected] 2 Computer Science Department, Faculty of Science, Abdelmalek Essaadi University,
Tetouan, Morocco
Abstract. Assessment is a crucial phase in the learning process. Its purpose is to measure the mastery degree of the acquired knowledge to take the appropriate decisions. The assessment phase accompanies the learner during the whole learning process by different types: diagnostic, formative and summative. In the context of e-learning, the choice of technological tool is still limited and sometimes insufficient to assess the totality of learners’ pre-requisites as well as to cover their different profiles. For this, we need an adaptive assessment process that satisfies the cognitive needs and preferences of each learner. This paper presents a solution that combines recommender systems especially the collaborative filtering technique with serious games as a ludic tool for educational purposes. The effectiveness of the proposed solution is measured through a real experiment with a sample of learners and with a comparative analysis of the obtained results. Keywords: Assessment learning · Serious game · Recommender system · Collaborative filtering · Adaptive E-learning · Primary education
1 Introduction Evaluation is an integral step in the learning process. It allows to ensure and measure the degree of knowledge acquisition of the learners. Generally, it is based on data provided by various sources and measures such as observations, examinations, communications, etc. The field of education has undergone a digital revolution thanks to the integration of several technologies in learning practices. Evaluation is a fundamental step in the learning process. It has also been impacted by this revolution, especially with the use of several methods and technological tools. E-learning systems represent a wide field of technological solutions especially by applying the concept of adaptive learning. Several researches have focused on the proposition of adaptive content, while the efforts dedicated to the adaptive evaluation phase are still limited. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ben Ahmed et al. (Eds.): NISS 2022, LNDECT 147, pp. 200–210, 2023. https://doi.org/10.1007/978-3-031-15191-0_19
The adaptive evaluation process is an effective alternative to traditional evaluation. It is based on the individualisation of knowledge through tests that are adapted to the intellectual level of each learner. Classical assessment is based on a single, uniform test for all learners and for the same duration, despite their differences in achievement and progress from the initial state. The classical assessment process can sometimes cause feelings of stress in learners, which can create psychological barriers to the learning process. To this end, we aim to set up an adaptive assessment process using recommender systems and serious games in order to bring an entertainment aspect to the learning process and to increase the rate of engagement and success of the learners. The idea is to recommend a series of adaptive games; each game evaluates a level of Bloom’s taxonomy, from knowledge to application. Depending on the learner’s response, the difficulty level of the game will be increased or decreased. If the learner fails to reach the objective, he/she will be directed to the remediation module. This paper is organised as follows: the second section presents an overview of learning assessment, the concept of serious games and recommendation systems. The third section presents the proposed approach, and the final section discusses the experimental results.
2 Background 2.1 Learning Assessment Assessment is one of the main phases of the learning act. It accompanies the learner throughout the learning process in order to highlight their level of acquisition and their progress in relation to the initial state. Assessment requires a strict planning process to ensure better decision making. In the literature we find several definitions of learning assessment. According to Sedigheh et al., assessment for learning (AFL) involves both teachers and learners; it aims to measure the learners’ progress and obtain feedback for the following step [1]. Many efforts have been made to improve the evaluation process via e-learning systems. Martin F. et al. described in their work [2] the importance of adopting an effective assessment strategy for online learning. The authors in [3] implemented a method of self-regulated learning assessment for students during an online course by using process mining techniques; the adopted method makes it possible to examine the specific actions performed by the students. Another solution, proposed by Louhab F. et al. [4], aims to present an approach for adaptive formative assessment in a mobile learning context. The choice of test questions is based on the learner’s profile and his or her learning context, and the structure of the following questions is organised according to the results obtained. S. Hubalovsky et al. [5] focused on the effectiveness of adopting adaptive learning at the primary level; the results obtained show that the learning objectives can be achieved effectively. Ogange et al. [6] investigated students’ perceptions of the effectiveness of the different types of assessment used in online learning. The study was conducted in 2015 within the virtual campus of Maseno University in Kenya. The results indicated that learners find peer or technology-based assessment faster than teacher-graded assessment.
The learner’s profile is dynamic, always in evolution according to their cognitive needs and skills. Therefore, we need a solution that takes into account this progression of the learner as well as we can adopt it to any kind of assessment. In addition, the works dedicated to the primary level especially for the evaluation process are still limited, for this reason we have chosen the combination of recommender systems with serious games in order to increase the engagement and motivation of the learners as well as to overcome their psychological and cognitive obstacles related to the evaluation phase. Among the types of assessment, we mention: • Diagnostic assessment: Its purpose is to identify the learner’s needs in terms of knowledge, skills and attitudes. It is usually placed at the beginning of a sequence or educational program. • Formative assessment: a process that takes place during the learning session. It aims to inform the teacher and the learner about the progress of learning in order to make decisions and improve achievements. • Summative assessment: A planned evaluation process at the end of a program, session or course. The results are recorded in a report card with administrative decisions. Its purpose is to guide the learner’s progress according to the results obtained. • Normative assessment: is a measure of a learner’s level in relation to other learners in the class or within the same group. 2.2 Serious Game The purpose of serious games is to combine learning, training and communication skills with the fun dimension of video games. This combination forms a useful content with a pleasant learning approach to attract the learners’ attention. Serious games can develop motivation and confidence in the learner. It has an impact on the development of Leadership styles, skills [7] and argumentation [8]. The benefits of serious games have been the subject of several research studies [9–11]. The exploitation of serious games is not limited to a single learning area. It is applicable to all disciplines and levels. For example, in the field of biology we find the work of S. Mystakidis et al. [12] which presents a design for a serious game within a virtual reality environment to achieve high quality learning. The researchers conducted a mixed-methods study with 148 students for learning biology concept in an American high school. The results show a 13.8% increase in performance with improved learners’ motivation. In the field of accounting, the study of Rodrigo F. et al. [13] presents an analysis of the perceived use of the DEBORAH serious game for accounting students. The study is conducted using confirmatory factor analysis and structural equation modeling. Accounting students showed significant acceptance of this technology. Authors in [14] presents an example of a serious game application implemented on factory planning for high education. The game is called “Factory Planner”. It is evaluated with a pretest and posttest to measure the knowledge acquired for a baccalaureate class. The results obtained highlighted the positive impact of this game on the acquisition of learning objectives and the improvement of the learners’ rate of engagement. For children the study of Dellamarsto. B et al. [15] demonstrate the design of a serious game to engage parents in teaching ethics and morality to their children. The
results show that the proposed game is a fun alternative learning tool. The game consists of 12 stages. After its evaluation, parents affirmed that it was easy to understand and to implement. Other areas of application of serious game include programming language [16] and development of computer assembly skills [17]. E-learning systems have a set of tools to assess learners’ learning, such as: open questions, multiple choice questions, fill-in-the-blank, drop/swipe,…. Generally, these tools are operated in a traditional assessment environment which does not take into consideration the cognitive needs of learners and their preferences. In addition, the adoption of a traditional assessment strategy leads to a sense of stress and imbalance for the learner. To this end, we aim to set up an adaptive assessment process based on serious games. The choice of serious games will allow us to integrate the entertainment aspect into the evaluation process and to motivate the learners to successfully complete it. At the pedagogical level, the assessment tests will be in the form of complex situations in order to lead learners to mobilise their prerequisites to achieve better results. The proposed model will be detailed in the following section. 2.3 Recommendation Systems Recommendation systems are aimed at filtering the information elements that are most relevant to the user. They are designed to understand the user’s behaviour and needs. They have been integrated in several domains, such as: online shops, streaming services, publications,…to ensure better decision making. The research field has taken advantage of these technologies through several studies [18, 19]. The idea is to analyse and explore the interests and preferences of users in order to recommend suitable objects. The most common recommendation techniques are: Content Based Filtering: The content-based recommendation system analyses the attributes of items to generate predictions. It proposes content that is similar to what the user has already viewed, searched for or positively evaluated. It depends only on the learner profile and the description of the items (Fig. 1). Content based filtering does not need the profiles of other users because it has not any influence on the recommendations.
Fig. 1. Content based filtering
Collaborative Filtering: This technique offers recommendations while calculating the similarity between the user’s preferences and those of other users. The idea is to make predictions about user preferences based on the opinions and experiences of other users with a very similar degree of similarity (Fig. 2). The advantage of this technique is the ability to use the scores of other users to assess the usefulness of items. This type of recommendation is widely used in several research studies [20].
Fig. 2. Collaborative filtering
Hybrid Recommendation: The hybrid recommendation system combines collaborative and content-based approaches in order to obtain a better accuracy rate. There are several hybridization techniques such as: Weighted, Switching, Mixed, Cascade, etc.
3 Proposed Approach Assessment is an integral step in the learning process. It requires careful planning to ensure quality learning. Within the context of adaptive learning we have designed an assessment process that takes into account the learner’s profile. The idea is to propose adaptive assessment activities based on the hierarchy of Bloom’s taxonomy. Each assessment activity aims to assess a level of the taxonomy (Fig. 3). For this purpose, we have chosen to adopt serious games to create a fun learning environment and to develop learner engagement. The integration of serious games in the evaluation phase reduces the seriousness and weight of this stage. It leads the learner to live the learning situation and not to undergo it. The aim of this approach is to achieve better results and to develop motivation and self-esteem among learners. The adoption of serious games transforms the assessment process into a learning, playful and enjoyable experience. The attraction of learners and their curiosity towards the concept of serious games favours the acquisition of positive attitudes and makes learning entertaining.
Fig. 3. Proposed approach
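The sketch below illustrates the assessment flow of the proposed approach: games are recommended level by level along Bloom’s taxonomy, the difficulty is adjusted according to the learner’s result, and a failed objective routes the learner to the remediation module. The function names, the pass mark and the plugged-in recommend_game/play/remediate callbacks are illustrative assumptions, not part of the authors’ implementation.

```python
BLOOM_LEVELS = ["knowledge", "understanding", "application", "analysis"]

def run_assessment(learner, recommend_game, play, remediate, pass_mark=0.5):
    """Drive one learner through the adaptive, game-based assessment."""
    difficulty = 1
    results = {}
    for level in BLOOM_LEVELS:
        game = recommend_game(learner, level, difficulty)  # e.g. chosen by collaborative filtering
        score = play(learner, game)                        # learner's normalised result in [0, 1]
        results[level] = score
        if score >= pass_mark:
            difficulty += 1                                 # good answers raise the difficulty
        else:
            difficulty = max(1, difficulty - 1)             # weaker answers lower it
            remediate(learner, level)                       # objective not reached: remediation module
    return results
```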
3.1 Learning Assessment The pedagogical context engenders any learning phase. For this approach, we aimed to present adaptive games to the learners’ profile, either in terms of acquired knowledge or preferences. The proposed assessment example is addressed to primary school learners. It aims to carry out a formative evaluation in order to test the level of acquisition of the vocabulary studied during the didactic unit. The assessments will be programmed in the form of complex situations in order to get the learners to solve them while mobilising their prerequisites. We aim to test four levels of the Bloom taxonomy: knowledge, understanding, application and analysis. Each level represents a game stage. 3.2 Recommendation Process The recommendation process is widely adopted in many sectors to provide relevant information. We used the collaborative filtering technique. It is adopted in many entertainment, advertising and e-commerce systems. In the field of education, the collaborative filtering technique is used in the acquisition process in order to propose adaptive activities as well as it is adopted in the remediation process [21]. The proposed approach focuses on the assessment process in order to recommend the most adaptive activities to the learner’s profile while respecting the acquisition process presented. Such a recommendation will allow us to benefit from the experiences of other learners with similar backgrounds and profiles. The recommendation process will be according to the following table:
Table 1. Steps of recommendation process

Step 1 | Give each activity a score and develop the matrix (Learner * Assessment activity)
Step 2 | Calculate the similarity rate between learners using the Pearson correlation equation
Step 3 | Choose a number of learners (K) with a high similarity rate with the active learner
Step 4 | Calculate the prediction
In order to assign a weight to each evaluation activity, we defined the set of parameters that characterise the evaluation process, such as: the level of the Bloom taxonomy concerned, the duration of the activity, the rate of frequentation and the degree of appreciation. We have adopted the formula proposed by [22] to measure the score S for each assessment activity:

S = (1/2) (Sat + I)   (1)

I = BT + R + 2B + 2F · Sat,  with  B = e^(−t)   (2)
where:
BT: the level of the Bloom taxonomy concerned by the assessment; it can take the values 1, 2, 3 for knowledge, understanding or application.
R: the result of the learner after completing this activity; it takes the value 0 for failure and 1 for success.
T: the duration of the selected activity (the t in Eq. 2); it is used in the calculation of the function B.
F: how often the activity is selected and viewed (frequency).
Sat: learner satisfaction, given explicitly by each learner after completing the assessment activity; it takes values between 1 and 10 to reflect the degree of satisfaction.
Once the weights of the evaluation activities have been calculated using the function S, we organize the results in the form of a matrix M (learner * game) with a number of rows L corresponding to the number of learners, L = (learner 1, learner 2, learner 3, … learner n), and a number of columns C associated with the number of serious games planned for the evaluation process, C = (game 1, game 2, … game n). The table below (Table 2) shows this combination:
Table 2. Matrix combination learner * game
            Game 1   Game 2   Game 3   Game 4   Game n
Learner 1     –        3        –        –        5
Learner 2     –        4        9        7        8
Learner 3     –        –        8        –        –
Learner 4     7        6        –        1        2
The resulting matrix shows the degree of appreciation of the serious games according to the weight calculated through the function S. On the other hand, we can notice that some learners have not yet consulted or evaluated a certain number of games, for example learner 1 towards activities 1, 3 and 4. In this case, the collaborative filtering technique, especially the neighbourhood concept, will allow us to calculate the similarity rate between these learners in order to exploit their experiences and preferences regarding the programmed games and make the appropriate recommendation. There are several formulas used to calculate the similarity between learners, such as: Spearman rank correlation, Kendall’s correlation, Pearson correlation, entropy, adjusted cosine similarity and mean squared differences. We have selected the Pearson correlation formula because it is the most widely used and offers the most effective results [22]. We calculate the similarity rate between two learners a and b according to Eq. (3):

Corr(a, b) = Σ_{i∈I} (r_ai − r̄_a)(r_bi − r̄_b) / √( Σ_{i∈I} (r_ai − r̄_a)² · Σ_{i∈I} (r_bi − r̄_b)² )   (3)

where:
I: the set of activities evaluated by the two learners.
r_ai: rating given to activity i by learner a.
r_bi: rating given to activity i by learner b.
r̄_a: average of the ratings given by learner a.
r̄_b: average of the ratings given by learner b.
After calculating the similarity rate using the Pearson correlation formula, we choose a number K of learners with the closest values of this similarity in order to provide a relevant prediction. The prediction formula is as follows (Eq. 4):

Pred(a, i) = r̄_a + Σ_{b∈K} (r_bi − r̄_b) · Corr(a, b) / Σ_{b∈K} Corr(a, b)   (4)
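To make the computation concrete, the following Python sketch implements the scoring function (Eqs. 1–2), the Pearson similarity (Eq. 3) and the prediction (Eq. 4) on a toy rating matrix in the spirit of Table 2. The function names, the value of K and the example ratings are illustrative assumptions and are not taken from the authors’ implementation (which was built with Scratch).

```python
import math

def compute_score(sat, bt, r, t, f):
    """Score S of an assessment activity (Eqs. 1-2): sat = satisfaction (1-10),
    bt = Bloom level (1-3), r = result (0/1), t = duration, f = frequency."""
    b = math.exp(-t)                       # B = e^(-t)
    i = bt + r + 2 * b + 2 * f * sat       # interest term I (Eq. 2)
    return 0.5 * (sat + i)                 # Eq. 1

def pearson(ra, rb):
    """Pearson correlation between two learners over their co-rated games (Eq. 3)."""
    common = [g for g in ra if g in rb]
    if len(common) < 2:
        return 0.0
    ma = sum(ra[g] for g in common) / len(common)
    mb = sum(rb[g] for g in common) / len(common)
    num = sum((ra[g] - ma) * (rb[g] - mb) for g in common)
    den = math.sqrt(sum((ra[g] - ma) ** 2 for g in common) *
                    sum((rb[g] - mb) ** 2 for g in common))
    return num / den if den else 0.0

def predict(active, others, game, k=1):
    """Predicted rating of `game` for the active learner (Eq. 4).
    Note: the denominator follows Eq. 4; many implementations use |Corr| instead."""
    neighbours = sorted((o for o in others if game in o),
                        key=lambda o: pearson(active, o), reverse=True)[:k]
    mean_a = sum(active.values()) / len(active)
    num = sum((o[game] - sum(o.values()) / len(o)) * pearson(active, o) for o in neighbours)
    den = sum(pearson(active, o) for o in neighbours)
    return (mean_a + num / den) if den else mean_a

# Toy rating matrix in the spirit of Table 2 (missing entries are simply absent).
learner1 = {"game2": 3, "gameN": 5}
learner2 = {"game2": 4, "game3": 9, "game4": 7, "gameN": 8}
learner4 = {"game1": 7, "game2": 6, "game4": 1, "gameN": 2}

print(round(compute_score(sat=8, bt=2, r=1, t=0.5, f=3), 2))   # weight of one activity
print(predict(learner1, [learner2, learner4], "game3"))        # predicted rating for game 3
```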
4 Results and Discussion The proposed approach aims to ensure adaptive assessment to the level and preferences of each learner. The idea is to recommend a series of assessment activities for each learner. The proposed activities are structured according to the hierarchy of the Bloom taxonomy (knowledge, understanding and application). To carry out the experimentation process, we developed a set of games (Fig. 4) through the Scratch program which is a free visual-based programming language. We chose this program because it was appropriate for the target audience with whom we will be conducting the real-life experiment.
Fig. 4. Some proposed assessment activities
First of all, the real experience will be dedicated to the learners of the second year of the primary cycle, within a public school of the regional academy Tangier- TetouanElhouceima in Morocco. The idea is to bring in a comparative study to measure the effectiveness of the proposed approach. We will select a sample of learners and offer them in the first instance a series of classic, random assessment activities. Then, we will keep the same sample to recommend a series of adaptive evaluation activities according to the collaborative filtering technique and respecting Bloom’s hierarchy. The results of this experiment will form a stable basis for the evaluation process via the recommendation systems, taking into consideration the pedagogical context of the teaching/learning operation.
5 Conclusion Evaluation is an important phase in the learning process. For this purpose, within the context of adaptive learning, we have proposed a solution based on the combination of serious games and recommender system while respecting the pedagogical guidelines. This solution will be tested in real practice with a sample of learners to measure its effectiveness. Through the proposed approach, we aimed to make the assessment process more adaptive to the learners’ needs by using the collaborative filtering technique. This technique takes advantage from the experiences of other learners who have followed the same learning process. While the integration of serious games aims to ensure the fun aspect and to make the learner more engaged in the learning process.
References 1. Sardareh, A.S., Saad M.M.R.: Defining assessment for learning: a proposed definition from a sociocultural perspective. Life Sci. J. 10 (2013) 2. Martin, F., Kumar, S.: Frameworks for assessing and evaluating e-Learning courses and programs. In: Piña, A.A., Lowell, V.L., Harris, B.R. (eds.) Leading and Managing e-Learning. ECTII, pp. 271–280. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-61780-0_19 3. Cerezo, R., Bogarín, A., Esteban, M., Romero, C.: Process mining for self-regulated learning assessment in e-learning. J. Comput. High. Educ. 32(1), 74–88 (2019). https://doi.org/10. 1007/s12528-019-09225-y 4. Louhab, F.Z., Bahnasse, A., Talea, M.: Towards an adaptive formative assessment in contextaware mobile learning. Procedia Comput. Sci. 135, 441–448 (2018) 5. Hubalovsky, S., Hubalovska, M., Musilek, M.: Assessment of the influence of adaptive Elearning on learning effectiveness of primary school pupils. Comput. Hum. Behav. 92, 691– 705 (2019) 6. Ogange, B., Agak, J., Okelo, K., Kiprotich, P.: Student perceptions of the effectiveness of formative assessment in an online learning environment. Open Praxis 10, 29 (2018). https:// doi.org/10.5944/openpraxis.10.1.705 7. Sousa, M.J., Álvaro, R.: Leadership styles and skills developed through game-based learning. J. Bus. Res. 94, 360–366 (2019) 8. Noroozi, O., Dehghanzadeh, H., Talaee, E.: A systematic review on the impacts of game-based learning on argumentation skills. Entertainment Comput. 35, 100369 (2020) 9. Mouaheb, H., Fahli, A., Moussetad, M., Eljamali, S.: The serious game: what educational benefits? Procedia. Soc. Behav. Sci. 46, 5502–5508 (2012) 10. All, A., Castellar, E.N.P., Looy, J.V.: Digital game-based learning effectiveness assessment: reflections on study design. Comput. Educ. 167, 104160 (2021) 11. Krath, J., Schürmann, L., von Korflesch, H.F.O.: Revealing the theoretical basis of gamification: A systematic review and analysis of theory in research on gamification, serious games and game-based learning. Comput. Hum. Behav. 125, 106963 (2021) 12. Mystakidis, S., Cachafeiro, E., Hatzilygeroudis, I.: Enter the serious E-scape room: a costeffective serious game model for deep and meaningful E-learning. In: 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–6 (2019) doi: https://doi.org/10.1109/IISA.2019.890067 13. Rodrigo, F.M., Fernanda, F.O.M., Hwang, Y.: Understanding technology acceptance features in learning through a serious game. Comput. Hum. Behav. 87, 395–402 (2018) 14. Severengiz, M., Seliger, G., Krüger, J.: Serious game on factory planning for higher education. Procedia Manufact. 43, 239–246 (2020). https://doi.org/10.1016/j.promfg.2020.02.148 15. Dellamarsto, B., Kevin, S., Panji, A., Jeklin, H., Andry, C.: Designing serious games to teach ethics to young children. Procedia Comput. Sci. 179, 813–820 (2021) 16. Priyaadharshini, M., Natha, M.N., Dakshina, R., Sandhya, S., Bettina, S.R.: Learning analytics: game-based learning for programming course in higher education. Procedia Comput. Sci. 172, 468–472 (2020) 17. Bourbia, R., Gouasmi, N., Hadjerisb, M., Seridi, H.: Development of serious game to improve computer assembly skills. Procedia. Soc. Behav. Sci. 141, 96–100 (2014) 18. Singhal, A., Sinha, P., Pant, R.: Use of deep learning in modern recommendation system: a summary of recent works. Int. J. Comput. Appl. 180(7), 17–22 (2017). https://doi.org/10. 5120/ijca2017916055 19. 
Zhang, H., Yang, H., Huang, T., Zhan, G.: DBNCF: personalized Courses Recommendation system based on DBN in MOOC environment. In: International Symposium on Educational Technology, pp 106–108 (2017). DOI: https://doi.org/10.1109/ISET.2017.33
20. Khanal, S.S., Prasad, P.W.C., Alsadoon, A., Maag, A.: A systematic review: machine learning based recommendation systems for e-learning. Educ. Inf. Technol. 25(4), 2635–2664 (2019). https://doi.org/10.1007/s10639-019-10063-9 21. Lhafra, F.Z., Abdoun, O.: Hybrid approach to recommending adaptive remediation activities based on assessment results in an E-learning system using machine learning. In: Kacprzyk, J., Balas, V.E., Ezziyyani, M. (eds.) Advanced Intelligent Systems for Sustainable Development (AI2SD’2020): Volume 1, pp. 679–696. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-030-90633-7_57 22. Bourkoukou, O., El Bachari, E., El Adnani, M.: A recommender model in E-learning environment. Arab. J. Sci. Eng. 42(2), 607–617 (2016). https://doi.org/10.1007/s13369-0162292-2
DLDB-Service: An Extensible Data Lake System Mohamed Cherradi(B) and Anass El Haddadi Data Science and Competetive Intelligence Team (DSCI), ENSAH, Abdelmalek Essaadi University (UAE), Tetouan, Morocco {m.cherradi,a.elhaddadi}@uae.ac.ma
Abstract. Big Data, as a topic of research innovation, still poses numerous research challenges, particularly in terms of data diversity. The wide variety of data sources results in data silos, which correspond to a set of raw data that is only accessible to a part of the company, isolated from the rest of the organization. In this context, the Data Lake concept has been proposed as a powerful solution to handle big data issues by providing schema-less storage for raw data. However, putting heterogeneous data “as-is” into a data lake with no metadata management system will result in a “data swamp”, a collection of undocumented data, poorly designed, or inadequately maintained data lake. To avoid this gap, we propose DLDB-Service (stands for Data Lake Database management Service). The contribution is conceived as a data lake management system with advanced metadata handling over raw data derived from diverse data sources. Indeed, DLDB-Service allows different users to create a data lake, merge heterogeneous data sources into a data lake regardless of format, annotate raw data with information semantics, data querying, and visualize data statistics. During the demo, we’ll go over each component of the DLDB-Service. Furthermore, the proposed solution will be used in real-world scenarios to demonstrate its usefulness in terms of scalability, extensibility, and flexibility. Keywords: Data lake · Big data · Data management systems
1 Introduction The 21st century is characterized by an expanding increase in the amount of data generated in the world. However, while big data offers a great opportunity for a wide range of industries, it also poses a significant challenge. Indeed, their N-V properties (such as volume, variety, veracity, etc.) exceed the capabilities of conventional solutions. Even as data warehouses are still effective and productive for structured and semi-structured data. Then, unstructured data presents significant issues for them, because the majority of data worldwide is unstructured [1, 2]. As a result, the data lake concept was developed to handle big data challenges, particularly issues caused by data variety. Thus, the emergence of the data lake concept in the last decades has seen increasing interest [3], as you can see in Fig. 1. Further, “schema-on read” is one of the fundamental properties of a data lake, which means that raw data is loaded without any transformation, i.e., stored as-is in its native format, and processed until required, described by on-demand queries. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ben Ahmed et al. (Eds.): NISS 2022, LNDECT 147, pp. 211–220, 2023. https://doi.org/10.1007/978-3-031-15191-0_20
Fig. 1. Evolution of data lake concept (Number of paper per year on Scopus).
The challenging topic of big data is addressed by data lakes by answering the question, “How can we make simple use of the massive volume of heterogeneous data and extract the hidden information and knowledge from raw data?”. Beyond storage, the major challenge of big data is to extract quality value through advanced analysis of large, fast, and varied data. There is a lot of data around nowadays, but it’s often divided into data silos with no interconnections. However, significant information is frequently attainable through the integration and analysis of data silos. To meet this need, data lakes have been conceived as big data repositories that hold raw data and enable on-demand integration through metadata specifications [4–6]. Data lakes ingest raw data from a number of sources in their original format, acting as storage depots for the heterogeneous data and allowing diverse users to access, analyze, and explore it. Indeed, the information schema definition and mapping are not explicitly required as a constraint. However, extracting metadata from heterogeneous data remains a crucial step in the data ingestion layer, ensuring data governance [7]. Therefore, without any efficient metadata management system, data lakes suffer from turning into data swamps, or useless data. Whenever data comes from heterogeneous sources with different models, metadata is mandatory to maintain track of the data life cycle. Thereby, metadata captures information about the original data, such as information schema, semantics, and lineage, among other things. Thus, the main motivation of this research is to propose an extensible solution able to manage heterogeneous data in a well-structured way that gives us the possibility of extracting relevant information hidden in the lake. Then, prevent the data lake from turning into a data swamp. To the best of our knowledge, data lake literature reviews are concise and limited to a single use case. Thus, we examine in this paper an extensible metadata management system capable of managing various use cases. Then, we propose the data lake pipeline architecture correspond to DLDB-Service. Eventually, the main asset of our architecture is that it allows all types of users to alter the data lake according to their own requirements.
The remainder of this paper is organized as follows. Section 2 represents the literature review. Section 3 depicts the proposed architecture of our data lake metadata management system. Section 4 presents the system overview demo of DLDB-Service. Section 5 discusses the results obtained. Finally, we conclude by some remarks and perspectives in Sect. 6.
2 Related Works Within this section, we present the necessary background for reader to well-understand the rest of our paper. 2.1 Data Lake Concept The concept of a data lake was first introduced in 2010 by James Dixon [8]. Indeed, Dixon said a famous saying about data lakes, “If a data warehouse is a storage space of cleaned, filtered, and structured data easily accessible for consumption, a data lake is a vast body of water in its natural state”. As a relatively new concept, there is no conventional definition or recognized design for data lakes. Several contributions in the literature have been proposed to clarify the data lake concept and avoid any confusion or ambiguity for many practitioners and researchers. Thereby, “data lake” is commonly considered as a synonym or a marketing label for Hadoop technology [9–11]. According to the literature review, data lakes are also known as data reservoirs and data hubs [12, 13]. Data lakes are defined in a variety of ways; some academics believe them to be a simple data repository [14, 15]. Others believe that a data lake is a full system that includes everything from data storage to visualization, as well as the data analyses stage [16–18]. However, some authors, such as [10, 19, 20], regard data lakes as the equivalence of Apache Hadoop. Gartner [21] defines a data lake as a storage mechanism linked to lowcost technologies. A few years later, the idea of low-cost technology has changed, and data lakes remain an inspirational solution for creating innovation in organizations by leveraging their data. Thus, the concept of a data lake is tied to proprietary technologies like Microsoft Azure, IBM Cloudera, HortonWorks, MapR, and others [22, 23]. 2.2 Metadata Management Existing data lake system proposals lack information about the essential metadata management features and methodologies for efficient and effective information retrieval, which makes implementations difficult to execute. However, today’s metadata management systems frequently satisfy the functional and technical requirements of the company. Therefore, the proposal for a generic, scalable, flexible metadata system capable of handling large-scale data efficiently and beneficially remains a high-priority contribution for the majority of companies. Before going into the technical details of our contribution, we wish to describe the related research work with our research topic. Among one of the principal issues investigated in the literature is metadata management, Indeed, metadata management is a critical lever in preventing the data lake from devolving into a data swamp [10, 24]. This is ideal for preliminary research; With the
increasing amounts of heterogeneous data sources, without any data management solution, data lakes undoubtedly risk creating data swamps i.e., useless data, undocumented, out of control, and unsuccessfully maintained due to a lack of efficient data governance strategy. In this context, according to [25, 26], metadata management system provides business and technical users experience with flexible access to the data quality. Furthermore, the metadata management system plays an essential role in valuing data lakes and then ensuring data governance. Yet, recent literature evaluations look at data governance as one of the most important aspects of data lakes, comprising all of the techniques and policies needed to maintain data quality and control agreement.
3 DLDB-Service Architecture In this section, we describe the proposed data lake architecture followed by DLDBService to manage the different data sources that exist in the lake. As we can see in Fig. 2, the architecture of DLDB-Service classified roughly into four functional layers, such as data ingestion, storage, analysis, and exploration. In the rest of this section, we’ll go through each layer (illustrated in Fig. 2) and the capabilities it provides in more detail. 3.1 Data Ingestion The first layer of the Big Data Lake Architecture is data ingestion, which takes charge of the heterogeneous data coming from different sources. Moreover, it is designed to load data from a variety of heterogeneous data sources into a data lake without the need for data transformation. Yet, this layer is responsible for loading data “as-is” into the data lake, as we can see in Fig. 2. Nonetheless, DLDB-Service loads data in its native format, regardless of whether the type of data source. This avoid the need for time-consuming data preparation and processing. As a result, one of the most significant benefits of the ingestion layer is that the loading process does not require any data transformation step. However, for further use of the data stored in the lake, the DLDB-Service extracts semantic and descriptive metadata to facilitate data navigation and the extraction of useful knowledge for decision-making. Furthermore, different methods can be used for the data ingestion, such as batch, real-time, or hybrid. 3.2 Data Storage This layer is composed of three repositories, such as raw data, metadata, and a replication repository. Indeed, the raw data repository, unlike the other repositories, is not linked with any metadata system; It contains raw data “as-is” in its native format. However, metadata storage corresponds to a storage space that prevents the accumulation of raw data not to turn into a data swamp. Nevertheless, the replication process ensures fault tolerance by copying enterprise data to multiple locations. The primary goal of data replication is to improve data availability and accessibility, as well as system robustness and consistency.
Fig. 2. The proposed data lake architecture for DLDB-Service.
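As a rough illustration of the ingestion and storage layers described above, the following sketch copies a source file “as-is” into a raw-data zone organised by format and records basic descriptive metadata in a simple catalogue. The directory layout and names (raw_zone, catalogue.json, ingest) are hypothetical and indicate only one possible realisation; they are not part of the actual DLDB-Service implementation.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

RAW_ZONE = Path("raw_zone")          # raw data kept as-is, in its native format
CATALOGUE = Path("catalogue.json")   # minimal metadata repository

def ingest(source: str, lake: str = "default") -> dict:
    """Copy a source file into the lake without transformation and record its metadata."""
    src = Path(source)
    fmt = src.suffix.lstrip(".").lower() or "unknown"
    target_dir = RAW_ZONE / lake / fmt              # cluster files by format
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / src.name
    shutil.copy2(src, target)                       # "as-is": no parsing, no schema

    entry = {
        "lake": lake,
        "name": src.name,
        "format": fmt,
        "size_bytes": target.stat().st_size,
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "source_path": str(src.resolve()),          # provenance, useful for lineage
    }
    catalogue = json.loads(CATALOGUE.read_text()) if CATALOGUE.exists() else []
    catalogue.append(entry)
    CATALOGUE.write_text(json.dumps(catalogue, indent=2))
    return entry

# Example: ingest("sales_2021.csv", lake="marketing")
```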
3.3 Data Analysis After preparing the data to be clean, the data analysis phase will take place. Indeed, it allows for the analysis and modeling of the collected data to extract insights that support decision-making. Furthermore, data analysis ensures that existing processes are
discoverable, accessible, interoperable, and reusable in order to simplify data exploration in data lakes and make them more interactive. Data analysis’ goal is to extract usable information from data and make decisions based on that knowledge. Perhaps all you need to do is analyze the past and predict the future to expand your business. 3.4 Data Exploration Data exploration is the initial phase in data analysis, and it is used to explore and visualize data in order to gain insights from the data lake. Moreover, users may also better understand the broad picture and gain insights faster by using interactive dashboards. Furthermore, data exploration allows users to make better decisions and gain an efficient understanding of the different data sources stored in the lake. More than the metadata management option and its major components, DLDBService ensures data quality by covering a set of metrics such as completeness, consistency, and validity. Aside from the features we’ve shown, the DLDB-Service provides data lineage, which corresponds to a cartographic process, making the data possible to know their origin and evolution over time. This traceability is captured by describing the information’s source and eventual destination, as well as all of the transformations it has encountered along the way.
4 Technical Demonstration Our contribution, proposed as a basic approach, tends towards a realistic construction of a data lake. Indeed, DLDB-Service offers a flexible, extensible framework for data lake systems data management issues. To emphasize the relevance and importance of metadata in assuring data governance for data lakes, we focus on the metadata of any type of data, whatever its format: structured, semi-structured, or unstructured data. Although, DLDB-Service concentrates on the data ingestion layer, metadata management, and query processing. For a better user experience, DLDB-Service is a web-based application, provides a set of really important data management components. To show the utility of DLDB-Service, Fig. 3, presents the different components of DLDB-Service. The technical demonstration will include the common use cases needed by the community interested in using the data lake. Indeed, DLDB-Service allows different users to create their own data lake database that supports different types of data sources, as illustrated in Fig. 2(a). Then, the data source is structured according to its format and extension. Thereby, a renaming technique is applied for more readability and makes it easier to search. After structuring the data into well-named clusters according to the category of the data source, a tooltip and a badge are displayed alongside to indicate the number of datasets existing in each cluster. Hence, the user can create a database or add a data source regardless of its format, as well as change and remove the database. DLDB-Service covers all CRUD actions on the database. On the other hand, as shown in Fig. 2(b), DLDB-Service has a special interface for statistics on data lakes and their data sources, as well as their formats. Moreover, a very important component concerning the management of metadata, respectively the interaction with the different data sources, is supported by DLDB-Service, as you can
see in Fig. 2(c) (respectively, Fig. 2(d)). Furthermore, an essential component supported by DLDB-Service ensures a visualization in tree form of the data lakes and their many clusters, confirming the data provenance and lineage of each data source in the lake, as depicted in Fig. 2(e).
Fig. 3. Different components of DLDB-Service.
5 Discussion DLDB-Service is a web-based application that aims to create a better user experience. Users can create a data lake and ingest heterogeneous data source using the 3-click principle. Nevertheless, the step of understanding the data stored in the lake remains a crucial stage to avoid the transformation of the data lake into a data swamp. In this context, our proposed system covers a metadata management that ensures data quality and maintains the challenges imposed by data lakes. Furthermore, to fully understand each type of data exist in the lake, a metadata interception spider is applied, it displays a collection of descriptive metadata that aids in fully comprehending the lake’s contents.
Our system also includes graphical charts that interact dynamically with the data lake, as well as the ability to search the data existing in the lake. Thus, DLDB-Service takes a data-centric approach and follows a data-driven process, which refers to an architecture in which data is the primary asset. However, the majority of data lake contributions are rather abstract and focus on a particular layer of the data lake architecture described in Sect. 3. Yet, according to [27, 28], there is no standard data lake architecture, because it depends on business requirements as well as on the expertise and skills of those in charge of its production. In our case, we designed a scalable data lake architecture capable of interrogating massive data based on the four layers explained above. This gives the data lake powerful flexibility and presents a major advantage because the data is stored as-is in its native format. Thereby, to avoid data swamping, the proposed architecture includes an efficient spider capable of capturing descriptive metadata from any type of data via an application programming interface called Apache Tika. This paper should not be taken as complete. Rather, it is the beginning of a new strategy that deserves to be supplemented by further efforts aimed at properly managing data lakes.
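Since the paper names Apache Tika as the interface behind the metadata-interception spider, the sketch below shows how descriptive metadata could be captured with it. It assumes the tika-python bindings (which start a local Tika server); the exact integration used by DLDB-Service is not described in the paper, so the helper name and the selected metadata keys are illustrative.

```python
from tika import parser  # pip install tika; the library launches a local Tika server (Java required)

def capture_metadata(path: str) -> dict:
    """Extract descriptive metadata from a file of any format supported by Tika."""
    parsed = parser.from_file(path)
    meta = parsed.get("metadata") or {}
    return {
        "content_type": meta.get("Content-Type"),
        "author": meta.get("Author") or meta.get("dc:creator"),
        "created": meta.get("Creation-Date") or meta.get("dcterms:created"),
        "text_preview": (parsed.get("content") or "")[:200],
        "raw_metadata": meta,   # everything else is kept for the metadata repository
    }

# Example: capture_metadata("reports/annual_report.pdf")
```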
6 Conclusion In this paper, the proposal for a data lake is based on an extensible metadata management system named DLDB-Service, which facilitates the ingestion, storage, metadata management, and interaction with heterogeneous data sources that exist in the lake. Subsequently, data lakes and the methodology of how to structure, analyze, and interact with big data are changing the way companies manage and handle heterogeneous data sources. Furthermore, the future of data lakes may be bright for controlling and maintaining big data. However, an important future perspective is to move to the world of machine learning to add intelligence to our proposed system and subsequently extract the hidden knowledge in the lake to make effective decisions.
References 1. Miloslavskaya, N., Tolstoy, A.: Big data, fast data and data lake concepts. In: 7th Annual International Conference on Biologically Inspired Cognitive Architectures (BICA 2016), NY, USA. Procedia Comput. Sci., vol. 88, pp. 300–305, (2016). https://doi.org/10.1016/j. procs.2016.07.439 2. Sawadogo, P., Darmont, J.: On data lake architectures and metadata management. J. Intell. Inf. Syst. 56(1), 97–120 (2020). https://doi.org/10.1007/s10844-020-00608-7 3. Cherradi, M., EL Haddadi, A., Routaib, H.: Data lake management based on DLDS approach. In: Ben Ahmed, M., Teodorescu, H.-N., Mazri, T., Subashini, P., Boudhir, A.A. (eds.) NISS 2021. SIST, vol. 237, pp. 679–690. Springer, Singapore (2022). https://doi.org/10.1007/978981-16-3637-0_48 4. Terrizzano, I.G., Schwarz, P., Roth, M., Colino, J.E.: Data wrangling: the challenging yourney from the wild to the lake. In: CIDR, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA (2015) 5. Walker, C., Alrehamy, H.: Personal data lake with data gravity pull. In: IEEE Fifth International Conference on Big Data and Cloud Computing, pp. 160–167, Aug 2015. https://doi. org/10.1109/BDCloud.2015.62
6. Quix, C., Hai, R., Vatov, I.: GEMMS - A Generic and Extensible Metadata Management System for Data Lakes (2016) 7. Nicole, L.: Data lake governance: A big data do or die. SearchCIO. https://searchcio.techta rget.com/feature/Data-lake-governance-A-big-data-do-or-die. Accessed 19 Jan 2022 8. Dixon, J.: Pentaho, Hadoop, and Data Lakes. Dixon’s Blog, 14 Oct 2010. https://jamesdixon. wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/. Accessed 19 Jan 2022 9. Aburawi, Y., Albaour, A.: Big Data: Review Paper. Int. J. Adv. Res. Innov. Ideas Educ. 7, 2021 (2021) 10. Fang, H.: Managing data lakes in big data era: what’s a data lake and why has it became popular in data management ecosystem. In: IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), pp. 820–824. Jun 2015. https://doi. org/10.1109/CYBER.2015.7288049 11. Zhao, Y.: Metadata Management for Data Lake Governance. Doctoral thesis in Computer Science and Telecommunications (2021) 12. Mandy, C., Ferd, S., Nhan, N., van Ruud, K., van der Ron, S.: Governing and Managing Big Data for Analytics and Decision Makers (2014). Accessed 19 Jan 2022 13. Ganore, P.: Introduction To The Concept Of Data Lake And Its Benefits – ESDS BLOG. ESDS Marketing Team at ESDS Software Solutions, 06 Feb 2015. https://www.esds.co.in/ blog/introduction-to-the-concept-of-data-lake-and-its-benefits/ (Accessed 19 Jan 2022) 14. Kathiravelu, P., Sharma, A.: A dynamic data warehousing platform for creating and accessing biomedical data lakes. In: Wang, F., Yao, L., Luo, G. (eds.) DMAH 2016. LNCS, vol. 10186, pp. 101–120. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57741-8_7 15. Tardío, R., Maté, A., Trujillo, J.: An iterative methodology for defining big data analytics architectures. IEEE Access 8, 210597–210616 (2020) https://doi.org/10.1109/ACCESS. 2020.3039455 16. Nogueira, I.D., Romdhane, M., Darmont, J.: Modeling data lake metadata with a data vault. In: Proceedings of the 22nd International Database Engineering & Applications Symposium, New York, USA, pp. 253–261, June 2018. https://doi.org/10.1145/3216122.321 6130 17. Bhandarkar, M.: AdBench: a complete benchmark for modern data pipelines. In: Nambiar, R., Poess, M. (eds.) TPCTC 2016. LNCS, vol. 10080, pp. 107–120. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54334-5_8 18. McPadden, J., et al.: A scalable data science platform for healthcare and precision medicine research (Preprint). J. Med. Internet Res. 21 (2018). https://doi.org/10.2196/13043 19. O’Leary, D.E.: Embedding AI and crowdsourcing in the big data lake. Intell. Syst. IEEE Communication. Syst. 29(5) 70–73 (2014). https://doi.org/10.1109/MIS.2014.82 20. Laurent, A., Laurent, D., Madera, C.: Introduction to Data Lakes: Definitions and Discussions, pp. 1–20 (2020). https://doi.org/10.1002/9781119720430.ch1 21. Gartner: Gartner Says Beware of the Data Lake Fallacy. Gartner. https://www.gartner. com/en/newsroom/press-releases/2014-07-28-gartner-says-beware-of-the-data-lake-fal lacy (Accessed 19 Jan 2022) 22. Madera, C., Laurent, A.: The next information architecture evolution: the data lake wave. In: Proceedings of the 8th International Conference on Management of Digital EcoSystems, New York, USA, pp. 174–180, Nov 2016. https://doi.org/10.1145/3012071.3012077 23. Joseph, S.: The Intelligent Data Lake. Azure Data Lake. https://azure.microsoft.com/en-us/ blog/the-intelligent-data-lake/ (Accessed 19 Jan 2022) 24. 
Inmon, B.: Data Lake Architecture: Designing the Data Lake and Avoiding the Garbage Dump, 1st edition. Technics Publications (2016) 25. Couto, J., Borges, O., Ruiz, D., Marczak, S., Prikladnicki, R.: A mapping study about data lakes: an improved definition and possible architectures. In: The 31st International Conference
on Software Engineering and Knowledge Engineering, pp. 453–458, July 2019. https://doi. org/10.18293/SEKE2019-129 26. Zgolli, A., Collet, C., Madera, C.: Metadata in Data Lake Ecosystems. In: book: Data Lakes, pp. 57–96 (2020). https://doi.org/10.1002/9781119720430.ch4 27. Cherradi, M., El Haddadi, A.: Data Lakes: A Survey Paper. In: book: Innovations in Smart Cities Applications, vol. 5, pp.823–835 (2022). https://doi.org/10.1007/978-3-030-941918_66 28. Couto, J., Borges, O.T., Ruiz, D.D., Marczak, S., Prikladnicki, R.: A mapping study about data lakes: an improved definition and possible architectures. In: Conference: The 31st International Conference on Software Engineering and Knowledge Engineering (2019). https:// doi.org/10.18293/SEKE2019-129
Effect of Entropy Reshaping of the IP Identification Covert Channel on Detection

Manal Shehab1(B), Noha Korany1, Nayera Sadek1, and Yasmine Abouelseoud2

1 Department of Electrical Engineering, Faculty of Engineering, Alexandria University, Alexandria 21544, Egypt
{eng-manal.shehab,noha.korany,nayera.sadek}@alexu.edu.eg
2 Department of Engineering Mathematics, Faculty of Engineering, Alexandria University, Alexandria 21544, Egypt
[email protected]
Abstract. A randomly generated IP Identification (IP ID) header field can be exploited as a covert channel. This research aims to prompt the developers of network intrusion prevention systems to consider the role of IP ID covert channel entropy adaptation in detection. To this end, the paper proposes a new method for reshaping the entropy of the IP ID covert channel (REIPIC). It iteratively treats the frequencies of occurrence pattern of the IP ID covert values within a sliding window of a specific size so that it resembles its counterpart in the normal case. Then, an entropy-based SVM is used to evaluate the impact of REIPIC on IP ID covert channel detection. Finally, the SVM examines the capability of REIPIC to reach an iteration at which the IP ID covert channel can no longer be identified. REIPIC contributes a new frequencies of occurrence matrix and a conceptual framework that could be reused in other pattern adaptation applications.
Keywords: IP Identification · Covert channel · Entropy · Detection · Pattern · Adapt · Intrusion · REIPIC · Support vector machine · SVM
1 Introduction
IP Identification (IP ID) is a header field that identifies the IP packet so that its fragments can be distinguished during reassembly at the receiving side [1]. Its size equals 16 bits in IPv4 and 32 bits in IPv6. A randomly generated IP ID field can be used as a hidden data carrier, resulting in a storage covert channel (CC) [2–4]. Entropy measures the average uncertainty in a random variable [5]. Consider an IP ID array of n values acting as a discrete random variable ID, where the probability of occurrence of the value IDm within the n values is denoted by Pm. Then, the entropy of ID, denoted by H(ID), is given by Eq. (1) [5]:

$H(ID) = -\sum_{m=1}^{n} P_m \log P_m$   (1)
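To make Eq. (1) concrete, the following minimal Python sketch (an illustration added here, not part of the original paper) estimates the entropy of one window of IP ID values from their empirical frequencies of occurrence; the window contents and the base-10 logarithm are assumptions for the example.

```python
from collections import Counter
from math import log10

def ipid_window_entropy(ipid_window):
    """Estimate H(ID) for one window of IP ID values, as in Eq. (1).

    Each distinct IP ID value contributes -P_m * log(P_m), where P_m is its
    empirical probability of occurrence inside the window (base-10 log assumed).
    """
    n = len(ipid_window)
    counts = Counter(ipid_window)  # frequency of each distinct IP ID value
    return -sum((c / n) * log10(c / n) for c in counts.values())

# Example: a hypothetical window of 8 IP ID values (16-bit integers)
window = [0x1A2B, 0x9C01, 0x1A2B, 0x77F0, 0x0042, 0x9C01, 0x5D5D, 0x1A2B]
print(ipid_window_entropy(window))
```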
The entropy feature has been shown to be usable on its own to detect the IP ID CC clearly with a support vector machine (SVM) [6]. The reason is that embedding data within the
IP ID bits changes the IP ID values and their frequencies of occurrence pattern within the detection sample. Since the IP ID entropy feature reflects the average amount of information provided by each occurring IP ID value [7], the entropy is impacted by the resulting changes in the frequencies of occurrence pattern of the IP ID covert values. Therefore, treating the frequencies of occurrence pattern of the IP ID CC could reshape its entropy feature and, consequently, affect its detection. However, this effect has not been sufficiently studied. As a result, this paper proposes a new method for reshaping the entropy feature of the IP ID covert channel (REIPIC) in order to investigate this effect on the IP ID covert channel detection (CCD). The importance of this study is to draw the attention of the developers of network intrusion detection and prevention systems (NIDPS) to this effect. The REIPIC method contributes by presenting the frequencies of occurrence pattern of a dataset at a specified non-overlapping window in matrix form. Additionally, REIPIC applies its reshaping effect through sequential stages of a new frequencies of occurrence treatment (FOT) technique. Then, an entropy-based SVM is used to examine the effect of the REIPIC reshaping method on the IP ID CCD at each stage. Finally, the SVM is employed to check whether REIPIC can reach a FOT stage at which the IP ID CC pattern cannot be distinguished from the normal one. REIPIC is applied offline to adapt the IP ID CC dataset before its transmission starts. Thus, REIPIC does not add any delay to the IP ID CC and has no effect on its transmission performance. The paper uses the linear kernel function for the SVM due to its simplicity and high prediction speed. The non-overlapping sliding detection window of size W, expressed as a number of consecutive packets in time, is used to extract the IP ID entropy feature. Real-time detection limits the W size, to avoid causing high packet delay that may result in the packet's lifetime expiring. CCD is calculated as a measure of the detection performance according to Eq. (2) [8]:

$CCD = \dfrac{\text{Number of covert observations classified as covert}}{\text{Total number of covert observations}}$   (2)
The rest of this paper is organized as follows: Sect. 2 describes the proposed REIPIC method algorithm and its FOT technique. Section 3 presents the experiment and its settings. Section 4 analyses the results of applying the REIPIC method to IP ID CC detection. Finally, Sect. 5 concludes this work and suggests possible future research directions.
2 The Proposed REIPIC Method
2.1 Parameters and Notations
The parameters of the proposed REIPIC method and its FOT technique are summarized in Table 1.
Table 1. Parameters and notations
S: Number of the FOT stage
Gn: Normal IP ID dataset
Gc0: Base IP ID covert dataset before any FOT
GcS−1: Input covert dataset to the Sth FOT
GcS: Output covert dataset from the Sth FOT
C: Covert ciphertext array of L-bits elements
L-covert bits map: The L covert bits positions within the IP ID bits
TW: Treatment window size
N: Number of the non-overlapping TW windows in the dataset
fmax: Maximum IP ID frequency in all the TW windows of Gn
fcS−1: Maximum IP ID frequency in all the TW windows of GcS−1
Rn: Frequencies reshaping matrix of Gn with fmax columns
RcS−1: Frequencies matrix of GcS−1 with fcS−1 columns
fmax-occurrence treatment: FOT adaptation at the specified TW and fmax; it treats RcS−1 of GcS−1 to resemble Rn at TW
W: Detection window sizes vector used by SVM
DT: Pre-assumed SVM classifier detection threshold = CCD rate at which the IP ID CC could not be identified
i: Iteration number in the fmax-occurrence treatment
2.2 Frequencies of Occurrence Matrix
REIPIC contributes by presenting the frequencies pattern of the normal dataset Gn at the non-overlapping sliding window of size TW in a matrix form Rn, whose specifications are as follows:
• The Rn size equals N × fmax, where N is the number of the non-overlapping TW windows within Gn, and fmax is the maximum frequency of an IP ID value in all the TW windows along Gn.
• The matrix element nJf resides at the Jth row and the fth column within Rn, and its value represents the number of IP ID values that occurred f times within the Jth TW window in Gn.
The relation between the frequencies of occurrence of the IP ID values within any Jth TW window of an IP ID dataset is given by Eq. (3):

$\sum_{f=1}^{f_{max}} n_{Jf} \times f = TW$   (3)
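As an illustration (not part of the original paper), the following Python sketch builds the frequencies-of-occurrence matrix described above for a dataset split into non-overlapping windows of size TW and checks Eq. (3) for every row; the function and variable names are assumed for the example.

```python
from collections import Counter

def occurrence_matrix(ipids, tw, f_max):
    """Build R (N x f_max): R[J][f-1] = number of IP ID values that occur
    exactly f times inside the J-th non-overlapping window of size tw.

    Assumes f_max is at least the maximum frequency observed in any window,
    as in the paper's definition of the reshaping matrix.
    """
    n_windows = len(ipids) // tw
    matrix = []
    for j in range(n_windows):
        window = ipids[j * tw:(j + 1) * tw]
        freqs = Counter(window).values()  # occurrence count of each distinct value
        row = [sum(1 for f in freqs if f == k) for k in range(1, f_max + 1)]
        # Eq. (3): the sum over f of n_Jf * f must equal TW for every window
        assert sum((k + 1) * row[k] for k in range(f_max)) == tw
        matrix.append(row)
    return matrix
```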
The Rn matrix is called a reshaping matrix at the specified reshaping parameters TW and f max , because it is extracted from a normal dataset at these specifications. Rn at
fmax = 1 is a one-column array whose values all equal TW, while Rn at fmax = 2 is a two-column matrix, where the first and the second column represent the number of IP ID values that occurred f = 1 and f = 2 times, respectively, within the normal dataset Gn.
Example: Assume the size of Gn is 6000 IP IDs. To apply the two-occurrence treatment at TW = 1500 IP IDs, then fmax = 2 and N = 4. Assume that the extracted Rn from Gn in this case is:

$R_n = \begin{bmatrix} 1498 & 1 \\ 1500 & 0 \\ 1496 & 2 \\ 1500 & 0 \end{bmatrix}$

Then each row in Rn should validate Eq. (3); e.g., the first row in Rn fulfils 1498 × 1 + 1 × 2 = 1500 = TW.
2.3 REIPIC Algorithm
REIPIC is applied through sequential FOT stages. Each FOT stage uses its specified non-overlapping treatment window size TW and fmax. The reason for using more than one FOT sequentially is to treat the IP ID frequencies within the used TW sizes and the overlapped zones between them along the final treated covert dataset. FOT at a specified fmax is called the fmax-occurrence treatment, and it can include one or more iterations. Any iteration is also considered a complete FOT stage that uses an appropriate treatment window size TW = TWi,fmax, with i standing for the iteration number within the fmax-occurrence treatment. Generally, the Sth FOT stage treats the RcS−1 matrix of the GcS−1 dataset to resemble the Rn matrix of Gn at TWi,fmax and fmax. The main reshaping parameters of each FOT stage are the used TWi,fmax and its corresponding fmax within the normal dataset Gn, which determine the number of rows and columns of the Rn reshaping matrix, respectively. REIPIC governs the TWi,fmax and fmax values used at each FOT stage by certain rules based on the normal dataset Gn.
I - Rules governing the TWi,fmax size at the FOT stage:
• At S = 1, fmax = 1, GcS−1 is the base IP ID covert dataset Gc0 which needs treatment, and TW1,1 is the maximum window size that has no repeated IP ID values in any of its non-overlapping windows within Gn.
• The first iteration in the fmax-occurrence treatment (i.e., i = 1) uses TW1,fmax, as follows:
– TW1,fmax should be the largest treatment window size in all the iterations of the fmax-occurrence treatment, and TW1,fmax should be greater than TW1,(fmax−1).
– TW1,fmax is the largest possible window size that has fmax as the maximum frequency of an IP ID value in all or most of its non-overlapping windows along Gn.
– Iteration number i > 1 within the fmax-occurrence treatment uses a TWi,fmax that validates TW1,(fmax−1) < TWi,fmax < TW(i−1),fmax.
II - Rules governing the fmax value at the FOT stage:
• Apply FOT iterations within the fmax-occurrence treatment until reaching a treated covert dataset GcS for which the GcS CCD < DT for each W size in W that does not exceed TWi,fmax.
• Increment fmax if the GcS CCD > DT for any W in W greater than TWi,fmax.
FOT stages are terminated when the GcS CCD ≤ DT for all W sizes in W.
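The following short Python sketch (an illustration added here, with assumed helper names such as choose_tw, apply_fot and evaluate_ccd) summarizes how these rules drive the stages: iterate within an fmax-occurrence treatment while detection persists at window sizes up to TWi,fmax, increment fmax when detection persists at larger windows, and stop once the CCD is at or below DT for every detection window size.

```python
def reipic(gc, gn, w_sizes, dt, choose_tw, apply_fot, evaluate_ccd):
    """Sketch of the REIPIC control loop; gc is the covert dataset, gn the
    normal reference dataset, and the three callables are assumed helpers."""
    f_max = 1
    while True:
        i = 1
        while True:
            tw = choose_tw(gn, f_max, i)        # rule set I: pick TW_{i,fmax} from Gn
            gc = apply_fot(gc, gn, tw, f_max)   # one FOT stage (Sect. 2.4)
            ccd = {w: evaluate_ccd(gc, gn, w) for w in w_sizes}
            if all(ccd[w] <= dt for w in w_sizes if w <= tw):
                break                           # this f_max no longer detectable below TW
            i += 1                              # otherwise run another iteration
        if all(ccd[w] <= dt for w in w_sizes if w > tw):
            return gc                           # CCD <= DT for every W: final dataset
        f_max += 1                              # detection persists at larger W
```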
Fig. 1. Block diagram of the REIPIC method algorithm
Figure 1 presents the block diagram of the REIPIC method algorithm. As can be seen from Fig. 1, the REIPIC method algorithm works as follows:
1. Apply the Sth FOT stage to treat GcS−1 at fmax and TWi,fmax to obtain GcS, while preserving the values of the embedded L hidden bits of each element of the ciphertext message array C within the bits of its corresponding IP ID value in the GcS dataset (guided by the bit positions of the L-covert bits map).
2. For all values in W, the SVM employs the entropy feature to evaluate the GcS CCD.
3. Check the condition that the GcS CCD rate ≤ DT for all W sizes in W that do not exceed TWi,fmax:
• If no, the next treatment iteration uses i ← i + 1 with the same value of fmax.
• If yes, check the condition that the GcS CCD rate ≤ DT for all W sizes in W that are greater than TWi,fmax. If it is not fulfilled, proceed to the next FOT stage by incrementing fmax. Otherwise, GcS is the final IP ID covert dataset.
2.4 Frequencies of Occurrence Treatment (FOT)
The FOT stage number S treats the RcS−1 matrix of the covert dataset GcS−1 so that it resembles the Rn matrix of Gn, resulting in GcS. Before starting the adaptation, both the RcS−1 and Rn matrices need to have the same dimensions. Since GcS−1 and Gn are array datasets of the same size, RcS−1 and Rn have the same number of rows, which is Ni,fmax, but they could have different numbers of columns due to the difference between the frequencies patterns of their datasets GcS−1 and Gn at TWi,fmax. Assume that fcS−1 and fmax are the numbers of columns in the RcS−1 and Rn matrices, respectively. Thus, if fcS−1 > fmax, then Rn is temporarily extended by extra columns of zero elements to have fcS−1 columns, and vice versa. FOT at any stage S is governed by Eq. (3), since it ties together the IP ID frequencies within each TWi,fmax window along the dataset, as follows:
• Assume that nJfc and nJfn are the values of the element in the Jth TWi,fmax row and the fth column in the RcS−1 and the Rn matrix, respectively.
• If nJfn − nJfc = d, then replacing the d extra IP ID values that occur with frequency f within the Jth TWi,fmax window in the covert GcS−1 dataset, for each frequency f, adjusts all the frequencies within this window to resemble those in the Jth TWi,fmax window in the normal Gn dataset according to Eq. (3).
Therefore, the FOT adaptation of the RcS−1 matrix to resemble Rn runs row by row, starting from the element in the first row and the last column, which has the highest f. Then, the adaptation shifts to the element in the (f−1)th column in the same way, until reaching full adaptation for all the frequencies of the IP ID values within the Jth TWi,fmax window of the covert dataset GcS−1. Figure 2 illustrates the FOT pseudocode at the specified S, TWi,fmax and fmax. As shown in Fig. 2, line 1 defines the J counter to start the adaptation from the first TWi,fmax window in GcS−1 until the last one, and thus from the first row in the RcS−1 matrix. Then, lines 2 and 3 create variables called covert and cipher to carry the content of the elements in the Jth TWi,fmax window within GcS−1 and the C array, respectively. Then, line 4 creates the decremented counter f to adapt the elements of the RcS−1 matrix starting from the element in the last column of the Jth TWi,fmax row. Then, lines 5 and 6 read the values of the elements nJfc and nJfn in RcS−1 and Rn, respectively. Then, line 7 computes their difference d, and line 8 checks if d > 0, after which line 9 creates a counter k. After that, lines 10 to 26 are applied to replace the d extra IP ID values occurring with the fth frequency within the Jth TWi,fmax window in GcS−1 using the Replace_Embed function. The frequency of the resulting covert[k] is also checked within the Jth TWi,fmax window to confirm that it does not equal f; otherwise it is submitted again to the Replace_Embed function, as indicated in lines 14 to 18. So, FOT treats the
IP ID covert dataset while preserving the hidden bits within the treated IP ID values. Line 19 recomputes the RcS−1 matrix after each updated IP ID value in the Jth TWi,fmax window of the covert dataset GcS−1 to register this update. Lines 29 and 30 save GcS−1 and RcS−1 in the new variables GcS and RcS, respectively, after finishing all the required adaptation updates of the S stage.
Input: GcS−1, RcS−1, Rn, C, TWi,fmax, fmax, Ni,fmax, L-covert bit map
Output: GcS, RcS
1   FOR J = 1 TO Ni,fmax DO
2     covert ← GcS−1[(J−1)×TW] TO GcS−1[(J×TW)−1];
3     cipher ← C[(J−1)×TW] TO C[(J×TW)−1];
4     FOR f = fmax TO 1 STEP −1 DO
5       nJfc ← RcS−1(J, f);
6       nJfn ← Rn(J, f);
7       d ← nJfn − nJfc;
8       IF (d > 0) THEN DO
9         FOR k = 1 TO TW
10          ID ← covert[k];
11          H ← number of occurrences of ID in covert;
12          IF (H == f) THEN
13            EXECUTE Replace_Embed(cipher[k], L-covert bit map) to get covert[k]
14            H ← number of occurrences of covert[k] in covert;
15            WHILE (H == f) DO
16              EXECUTE Replace_Embed(cipher[k], L-covert bit map) to get covert[k]
17              H ← number of occurrences of covert[k] in covert;
18            EndWHILE
19            Compute Occurrence Matrix of covert to get RcS−1
20            d ← d − 1;
21            IF (d == 0) THEN
22              Break
23            EndIF
24          EndIF
25        EndFOR
26      EndIF
27    EndFOR
28  EndFOR
29  GcS ← GcS−1;
30  RcS ← RcS−1;
RETURN GcS, RcS
Fig. 2. Pseudocode of the FOT technique at the Sth stage
3 Experiment and Settings
Gn and Gnc are two normal, randomly generated IP ID datasets of IP packets produced by the same source operating system. Gn is used as a normal IP ID reference to feed the SVM with normal observations, while the 16-bit IP ID values of the Gnc dataset are used to carry 8 hidden bits of the ciphertext message C (i.e., L = 8) in their last 8 bits, resulting in the base covert channel Gc0. The first three FOT stages apply the one-occurrence treatment
using TW 1,1 > TW 2,1 > TW 3,1 , respectively as follows: 1 - Gc0 is treated at TW 1,1 = 1024 IP IDs resulting in Gc1 . 2 - Treatment of Gc1 at TW 2,1 = 1000 IP IDs obtains Gc2 . 3 - Treating Gc2 using TW 3,1 = 840 IP IDs yields to Gc3 . 4 - Finally, Gc3 is treated using the two-occurrence treatment at TW 1,2 = 2002 IP IDs to get Gc4 . The used W sizes in W are shown in Table 2 with W max = 1024 IP IDs to consider the real-time detection limitations. The 5-fold cross-validation scheme is used to protect against overfitting. The pre-assumed DT is 20%.
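For context, a minimal Python/scikit-learn sketch of the entropy-based detection step described here could look as follows; the window size, the synthetic data, and the feature extraction are illustrative assumptions, not the authors' code.

```python
import numpy as np
from collections import Counter
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def entropy_features(ipids, w):
    """One entropy value per non-overlapping detection window of size w (base-10 log assumed)."""
    feats = []
    for j in range(len(ipids) // w):
        window = ipids[j * w:(j + 1) * w]
        p = np.array(list(Counter(window).values())) / w
        feats.append(-np.sum(p * np.log10(p)))
    return np.array(feats).reshape(-1, 1)

# Hypothetical normal (gn) and covert (gc) IP ID sequences; detection window W = 550
gn = np.random.randint(0, 2**16, 60000)
gc = np.random.randint(0, 2**16, 60000)
X = np.vstack([entropy_features(gn, 550), entropy_features(gc, 550)])
y = np.hstack([np.zeros(len(X) // 2), np.ones(len(X) // 2)])

# Linear-kernel SVM with 5-fold cross-validation, as in the experiment settings
print(cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean())
```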
4 Results and Discussion
4.1 REIPIC Effect on Entropy Reshaping
The REIPIC conceptual framework succeeded in reshaping the IP ID CC entropy. Figure 3 shows the effect of the FOT stages of the REIPIC method on reshaping the entropy observations of the Gc0 dataset gradually compared with Gn at W = 550 IP IDs.
Fig. 3. Effect of the FOT stages of REIPIC on reshaping the entropy observations of Gc0 dataset gradually compared with Gn at W=550 IP IDs
Figure 3 shows that the Gn entropy is steady at 2.74036 for all observations at W = 550 IP IDs. On the other hand, Gc0, which does not receive any treatment, has the highest detectable difference from Gn in its entropy observations, which fluctuate and equal the normal entropy value in only 4 of the 37 observations. Then, the iterative FOT stages enhance the entropy feature of Gc1 to Gc3 progressively. Finally, the last FOT-treated covert dataset, Gc4, has all its entropy observations fully reshaped to equal those of Gn. To sum up, Fig. 3 shows that the FOT stages can reach an iteration that completely adapts the IP ID CC entropy observations at a specified detection window size.
Table 2. Effects of the REIPIC sequential FOT stages for adapting the base covert dataset Gc0 dataset gradually (from Gc1 to Gc4 ) on the IP ID CCD
IP ID CCD (%) per W size (IP IDs)
W | Gc0 (base) | Gc1 (fmax = 1) | Gc2 (fmax = 1) | Gc3 (fmax = 1) | Gc4 (fmax = 2)
50 | 0.98 | 0 | 0 | 0 | 0
64 | 3.12 | 0 | 0 | 0 | 0
100 | 6.37 | 0 | 0 | 0 | 0
128 | 11.87 | 0 | 0 | 0 | 0
150 | 17.65 | 0 | 0 | 0 | 0
200 | 30.4 | 1.96 | 0 | 0 | 0
250 | 49.38 | 2.47 | 0 | 0 | 0
256 | 37.5 | 0 | 0 | 0 | 0
300 | 57.35 | 4.41 | 1.47 | 0 | 0
350 | 68.97 | 6.9 | 1.72 | 0 | 0
400 | 70.59 | 5.88 | 0 | 0 | 0
450 | 84.44 | 17.78 | 6.67 | 0 | 0
500 | 90 | 22.5 | 0 | 0 | 0
512 | 87.5 | 0 | 0 | 0 | 0
550 | 89.19 | 37.84 | 16.22 | 8.11 | 0
600 | 97.06 | 35.3 | 11.76 | 5.88 | 0
650 | 96.8 | 41.94 | 16.13 | 6.45 | 6.45
700 | 100 | 51.72 | 24.14 | 10.34 | 3.45
750 | 100 | 48.15 | 14.81 | 7.41 | 0
800 | 100 | 52 | 24 | 16 | 4
850 | 100 | 62.5 | 37.5 | 20 | 8.33
900 | 100 | 63.64 | 36.36 | 18.18 | 13.64
950 | 100 | 71.43 | 52.38 | 33.33 | 19.05
1000 | 100 | 80 | 0 | 0 | 0
1024 | 100 | 0 | 0 | 0 | 0
4.2 REIPIC Effect on the IP ID CC Detection
The REIPIC effect on the IP ID CCD at each FOT stage and at the specified W is examined. Table 2 presents the REIPIC effects on the IP ID CCD at each W size. The main conclusions from Table 2 can be outlined as follows:
1. Sequential FOT stages reduce the IP ID CCD gradually.
2. Using only one treatment stage in REIPIC is usually insufficient.
3. The second and third stages are one-occurrence treatment iterations, as the Gc1 and Gc2 CCD > DT = 20% for W < TW1,1 = 1024 and TW2,1 = 1000 IP IDs, respectively.
4. The 4th treatment stage is upgraded to the two-occurrence treatment, because the Gc3 CCD rate = 33.33% > DT = 20% at W = 950 IP IDs > TW3,1 = 840 IP IDs.
5. Using a specified TW size at a FOT stage eliminates the IP ID CCD at the W sizes that equal this TW or one of its divisors. Moreover, the recursive nature of the REIPIC algorithm causes the elimination of the IP ID CCD at these W sizes to be inherited by the next treatment stages, as can be seen from the results in Table 2. For example, Gc1, Gc2, Gc3 and Gc4 have CCD = 0 at W = TW1,1 = 1024 IP IDs and its divisors.
6. REIPIC proves its ability to reshape the IP ID CC dataset so that it acts normally, since the Gc4 CCD ≤ DT for all W sizes. Thus, Gc4 is the last FOT stage.
5 Conclusion and Future Work
The sequential FOT stages of REIPIC succeeded in gradually reshaping the entropy feature of the IP ID CC. This progressively reduces the IP ID CC detection and recursively eliminates the detection completely at the detection window sizes that equal the used treatment window sizes and their divisors. Thus, REIPIC could be used to increase the secrecy of sensitive transmitted data, as in some banking and military applications. REIPIC can adapt an IP ID CC that does not consume almost all of the IP ID bits, as some bits are needed to control the required randomization in the FOT adaptation. REIPIC has no impact on either the hidden message within the IP ID covert values or the packet performance within the network. Future work for NIDPS includes optimizing the detection window sizes for the best real-time detection of the REIPIC-reshaped IP ID CC, as well as detecting the IP packet and fragment anomalies that result from implementing the IP ID CC in the packet injection fashion. Furthermore, it is very important to exploit the general REIPIC conceptual framework in other random-variable pattern adaptation applications.
Data Availability. Experiment datasets are freely available at https://doi.org/10.13140/RG.2.2.27831.44962.
Conflict of Interest. Authors declare that they have no conflict of interest.
References 1. Postel, J.: Internet protocol. In: IETF (RFC 791) (1981) 2. Abdullaziz, O., Tor, G.V., Ling, H., Wong, K.: AIPISteg: an active IP identification based steganographic method. J. Netw. Comput. Appl. 63, 150–158 (2016) 3. Shehab, M.: New encryption and steganographic methods for data hiding in the IP packets or their fragments. M.S. thesis, Electrical Engineering Department, Faculty of Engineering, Alexandria University, Egypt (2011) 4. Shehab, M., Korany, N.: New steganographic method for data hiding in the IP ID field. In: 8th International Conference on Electrical Engineering (ICEENG 2012), vol. 8, Article 12, pp. 1–14 (2012). Springer, Cairo. https://doi.org/10.21608/iceeng.2012.30646
5. Cover, T., Thomas, J.: Introduction and preview. In: Elements of Information Theory, 2nd edn, p. 5. Wiley, Hoboken (2006) 6. Shehab, M., Korany, N., Sadek, N.: Evaluation of the IP identification covert channel anomalies using support vector machine. In: 2021 IEEE 26th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks, Portugal (2021) 7. Groebner, D., Shannon, P., Fry, P., Smitt, K.: Graph, charts, and tables - describing your data. In: Business Statistics: A Decision-Making Approach, 8th edn, pp 31–35 (2011) 8. Shen, Y., Huang, L., Lu, X., Yang, W.: A novel comprehensive steganalysis of transmission control protocol/internet protocol covert channels based on protocol behaviors and support vector machine. Secur. Commun. Netw. 8(7), 1279–1290 (2015)
Explainable Machine Learning Model for Performance Prediction MAC Layer in WSNs

El Arbi Abdellaoui Alaoui1(B), Khalid Nassiri1,2,3, and Stephane Cedric Koumetio Tekouabou3

1 Department of Sciences, Ecole Normale Supérieure, Moulay Ismaïl University, Meknès, Morocco
[email protected]
2 Perception, Robotics, and Intelligent Machines Research Group (PRIME), University of Moncton, Moncton, Canada
3 Center of Urban Systems (CUS), Mohammed VI Polytechnic University (UM6P), Hay Moulay Rachid, 43150 Ben Guerir, Morocco
Abstract. Wireless Sensor Networks (WSNs) are used to gather data in a variety of sectors, including smart factories, smart buildings, and so on, to monitor surroundings. Different medium access control (MAC) protocols are accessible to sensor nodes for wireless communications in such contexts, and they are critical to improving network performance. The proposed MAC layer protocols for WSNs are all geared on achieving high packet reception rates. The MAC protocol is adopted and utilized throughout the lifespan of the network, even if its performance degrades over time. Based on the packet reception rate, we use supervised machine learning approaches to forecast the performance of the CSMA/CA MAC protocol in this study. Our method consists of three steps: data gathering trials, offline modeling, and performance assessment. According to our findings, the XGBoost (eXtreme Gradient Boosting) prediction model is the most effective supervised machine learning approach for improving network performance at the MAC layer. In addition, we explain predictions using the SHAP (SHapley Additive exPlanations) approach.
Keywords: Wireless sensor networks · MAC protocols · Machine learning · Shap value

1 Introduction
Networks that incorporate wireless sensor nodes to collect data are used in smart cities, Industrial 4.0, smart health monitoring, and other Internet of Things applications [10,11]. Each sensor node may perceive a variety of data, including humidity, pressure, temperature, vibration, pollution, and so on. Sensor nodes
are frequently employed in a variety of Wireless Sensor Networks (WSNs) applications due to the variety of data kinds gathered. A wireless sensor network (WSN) is a distributed system made up of sensor nodes that share certain properties. They have limited energy (battery-powered device), compute, and storage resources. These nodes are positioned throughout the perceived region and work together to send data to a centralized special node known as a base station or sink for processing and decision support. The sensor nodes function independently once placed in the interest area, and their assignment is determined by the application needs. Because of this applicationdependency, a given solution (such as a routing protocol) cannot be used in all types of applications. WSNs also vary from regular networks and ad hoc networks in a number of ways. The sensor nodes are limited in their resources. They are battery-powered and have limited storage and compute capabilities. Proposing adaptive solutions to WSNs has attracted industry players and researchers during the past two decades. In the data collection process, they handle challenges such as energy efficiency and data dependability. Machine learning (ML) is an artificial intelligence technology that is used to enhance network performance [1,5,8]. There are three types of machine learning algorithms: supervised methods, unsupervised techniques, and reinforcement learning techniques [1]. To develop the system model expressing the learnt link between the input, output, and system parameters, supervised ML methods need a labeled training data set. The aim is for the system to figure out what the general rule is for mapping inputs and outputs. Unlike the preceding category, unsupervised ML algorithms discover the structure of the input without using a labeled data set. The purpose is to use the similarity between the input samples to categorize the sample sets into distinct groups (i.e. clusters). The system learns by interacting with its surroundings through reinforcement learning methods (i.e., online learning). Resource limits in terms of energy, computation and storage of sensor nodes, dynamic topology, and communication connection failures must all be considered while designing WSNs. WSN capabilities such as energy-efficiency in routing and energy-efficient cluster formations are addressed using machine learning methods [6,9]. We employ supervised machine learning approaches in this research to evaluate and forecast performance at the MAC layer of WSNs. First, we gather data via testing, and then we use several machine learning methods to evaluate MAC Layer performance based on packet delivery ratio.
2 Methodology
In this section, we present a technique for forecasting the Packet Reception Rate (PRR). A high-level explanation of the recommended method is shown in Fig. 1. This approach is largely concerned with executing the many phases of the supervised regression procedure. It comprises, first and primarily, a novel enrichment strategy for enhancing data quality while also taking into consideration other
relevant criteria. Second, machine learning methods may be utilized to predict the PRR. Then, we set the parameters for each model. Finally, we provide acceptable estimating approaches based on regression success measures. The computational architecture of our packet reception rate (PRR) prediction approach is shown in Fig. 1:
– Data preparation, which removes noise from the acquired data, decreases variability, and splits the database into two groups (four-fifths for training and one-fifth for testing);
– Separation of the data into two sets: training and test;
– Model training;
– Model performance testing, which generates predictions on the test set and measures the predictive accuracy of each machine learning algorithm using three error metrics: R2, MAE, and MSE;
– Model explanation, which presents four separate graphs to determine how much each predictor contributes, positively or adversely, to the target variable (local interpretability, standard summary, summary and dependency plots).
(Fig. 1 block diagram: dataset → preprocessing steps (normalization, data transformation, feature selection) → training/testing split with cross-validation → ensemble model training and optimization → model evaluation → final model → packet reception rate (PRR) prediction and explainability.)
Fig. 1. Global ensemble-based system for packet reception rate (PRR) prediction
2.1 Preprocessing
Preparing the data set for successful learning necessitates data pre-processing. When all characteristics are scaled to the same range of values, a neural network
learns optimally. This is advantageous when the inputs are on very different scales, as they are with the WSN data set. The data is standardised in our implementation since this approach allows us to obtain better performance.
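A minimal sketch of this step in Python with scikit-learn, assuming the MAC-layer statistics have been loaded into a pandas DataFrame with the Table 1 columns and PRR as the target (the file name is hypothetical; this is illustrative, not the authors' code), could be:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical file name for the 5-second-granularity MAC statistics
df = pd.read_csv("mac_layer_5s.csv")
X = df.drop(columns=["PRR"])   # predictors (e.g. packetLoss, PLR, throughput, IPI, ...)
y = df["PRR"]                  # target: packet reception rate

# Four-fifths for training, one-fifth for testing, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardise features so that all inputs share the same scale
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```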
2.2 Models Development
Four machine learning algorithms, namely XGBoost Regressor (XGBR), Decision Trees (DT), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN), were chosen to estimate the prediction of the Packet Reception Rate (PRR). These algorithms were selected mainly because their basic principles are considerably divergent, their settings are notably varied, and their learning samples are clearly different in the literature.
2.2.1 XGBoost Regressor (XGBR)
The Extreme Gradient Boosting (XGBoost) technique [4] is similar to the Gradient Boosting algorithm, but it is more efficient and quicker since it combines linear and tree models. This is on top of the fact that it can do parallel computations on a single system. A gradient boosting approach builds trees in a sequential order so that a gradient descent step may be used to minimize a loss function. In contrast to Random Forest, the XGBoost method creates the tree in parallel. Essentially, statistics may be generated in parallel for the information contained in each column corresponding to a variable. The importance of the variables is calculated in the same way as for random forests, by calculating and averaging the values by which a variable decreases the impurity of the tree at each step. Figure 2 shows the algorithm process. At each iteration of the gradient boosting algorithm, the residual is used to update the previous predictor, as seen in Fig. 2, in order to optimize the specified loss function. According to [4], at the tth iteration, the objective function of XGBoost can be represented as:

$\Theta(\tau) = \Phi(\tau) + \Omega(\tau)$   (1)

XGBoost is based on the use of a set of regression trees. Each tree provides a prediction, and these are then aggregated according to a chosen method (sum, weighted average) to finally provide a more relevant prediction. This refines the prediction by segmenting further, without overfitting. Mathematically, the model of tree boosting can be written:

$\hat{y}_i = \sum_{i=1}^{m} f_m(x_i), \quad f_m \in W$   (2)
Fig. 2. Algorithm process of XGBoost
where W represents the set of possible regression trees, and m represents the number of trees. Thus, the objective function is written:

$\Theta(t) = \sum_{i=1}^{n} \Phi(y_i, \hat{y}_i) + \sum_{k=1}^{t} \Omega(f_k)$   (3)
Equation (4) gives the corresponding output $\hat{y}_i^{(t)}$ of the ensemble model given the objective function of Eq. (3):

$\hat{y}_i^{(t)} = \sum_{i=1}^{m} f_m(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$   (4)
The regularization term Ω(fk) allows the complexity of the model to be controlled and overfitting to be avoided. The general idea is that it should allow a simple and efficient model at the same time. This term is defined as follows [4]:

$\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2$   (5)

where T is the number of leaves in the regression tree and ω is the vector of values assigned to each of its leaves. This penalization or regularization limits the adjustment of the tree added at each step and helps to avoid over-adjustment. Especially when observations with significant errors are present, increasing the number of iterations can cause a degradation of the overall performance rather than an improvement. The term Ω is interpreted as a combination of ridge regularization with coefficient λ and a Lasso penalty with coefficient γ. [4,14] provide more information about the XGBoost algorithm's process.
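To illustrate how such a model could be fitted and scored on the standardized split prepared earlier (a sketch assuming the xgboost Python package; the hyperparameters shown are examples, not the authors' settings):

```python
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Gradient-boosted regression trees for PRR prediction (example hyperparameters)
model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=6, random_state=42)
model.fit(X_train_std, y_train)

y_pred = model.predict(X_test_std)
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R2 :", r2_score(y_test, y_pred))
```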
2.2.2 Decision Trees (DT)
The decision tree is a regression/classification algorithm whose popularity rests largely on its simplicity. A decision tree is composed of a root node through which data is entered, leaf nodes that correspond to a classification, and questions and answers that condition the next question. It is an interactive process of rule induction that leads to a well-justified assignment. The branching of the nodes involves the calculation of different criteria according to the chosen algorithm. There are different algorithms for the construction of decision trees, such as ID3, C4.5, CHAID, CART and many others.
2.2.3 Support Vector Machine (SVM)
SVR is a non-linear kernel-based regression method which consists of locating a regression hyperplane with the smallest structural risk in a so-called high-dimensional feature space [REF]. Given a set of training data {(x1, y1), . . . , (xp, yp)}, where xi ∈ Rn, i = 1, . . . , p, denotes the input vector and yi ∈ R, i = 1, . . . , p, designates the corresponding target value, the SVR estimating function takes the following form:

$f(x) = \sum_{i=1}^{p} (\alpha_i - \alpha_i^*) \langle \psi(x_i) \cdot \psi(x) \rangle + b = \sum_{i=1}^{p} (\alpha_i - \alpha_i^*) k(x_i, x) + b$   (6)
where b ∈ R is an offset, k(xi, x) is a kernel function which represents the inner product ⟨ψ(xi) · ψ(x)⟩, and αi and αi∗ are nonzero Lagrange multipliers. The most commonly employed kernel function is the radial basis function (RBF).
2.2.4 K-Nearest Neighbors (KNN)
KNN is a regression/classification approach that may also be used for estimation problems. The KNN algorithm is a form of case-based reasoning. It begins with the idea of making judgments by searching for one or more comparable situations that have previously been addressed in memory. Unlike other regression approaches (decision trees, neural networks, and so on), there is no learning stage that involves creating a model from a learning sample. The model is made up of the learning sample, which is linked to a distance function and a class choice function depending on the classes of the nearest neighbors.
2.3 Interpretation of ML Models Using SHAP Values
Based on game theory, the SHAP value [3,7] is a method used to describe the prediction for each individual instance, in addition to deciding how each feature influences our prediction. The main purpose of the SHAP value is not only to calculate the contribution of each feature to the prediction, but also to explain
and describe the prediction of an instance [2]. Furthermore, there is a basic game-theoretic question that the SHAP value comes to answer: assuming players with different skills cooperate to obtain a profit, how can that profit be distributed equitably between them? Another thing to consider is that each player may be ranked according to what they added to the group's overall result beyond what the other members contributed, so which player ranks highest in the model prediction, and if all of them joined the group at the same time, how can we determine which of these reward distributions is right and just? This predicament led to the formulation of SHAP values, which can be summarized at a high level as "look for the marginal contribution of each player, i.e., the contribution obtained when adding the player to the group, averaged over every possible sequence".
2.4 Interpretation of ML Models Using SHAP Values
Since XGBoost, like the other algorithms, makes it difficult to clarify and explain the prediction, an explainable artificial intelligence method must be added, so we must look for the capability of our approach to unbox black-box models. We should not use the old, classic methods, such as feature importance measures, that provide only a global interpretation of the model. For this reason, we used the Shapley Additive Explanation, or SHAP value [13]. The model's prediction is expressed as the sum of the values assigned to each variable, so the method belongs to the class of additive variable attribution methods. This method constitutes an interpretable approximation of the original model. The representation is defined as a linear relationship between binary variables:

$g(z') = \phi_0 + \sum_{i=1}^{n} \phi_i z'_i$   (7)
where z′ is the vector corresponding to the absence/presence of the n explanatory variables of the created neighboring instance z in the prediction obtained; z′ has values in {0, 1}M [13];

$\phi_i = \sum_{S \subseteq n \setminus \{i\}} \frac{|S|!\,(M - |S| - 1)!}{M!}\,[f_x(S \cup \{i\}) - f_x(S)]$   (8)
with M the number of variables, S a set of variables, fx the prediction function at time x, fx (S) = E[f (x)|xS ], i is the ith variable. Thanks to the Shap value, we can determine the effect of the different variables of a prediction for a model that explains the deviation of this prediction from the base value. Kernel SHAP consists in explaining the difference between a predicted value and the mean predicted value. More precisely, it assigns to each explanatory variable a SHAP value, a value corresponding to its contribution to the difference obtained between the predicted value and the predicted mean value [12].
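As an illustration of how such SHAP explanations can be produced for the fitted XGBoost model from the earlier sketches (assuming the shap Python package; not the authors' code):

```python
import shap

# TreeExplainer computes SHAP values efficiently for tree ensembles such as XGBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test_std)

# Global importance: mean absolute SHAP value per feature (bar plot), and the
# summary plot showing how high/low feature values push the prediction up or down
shap.summary_plot(shap_values, X_test_std, feature_names=list(X.columns), plot_type="bar")
shap.summary_plot(shap_values, X_test_std, feature_names=list(X.columns))
```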
3 Experiments and Results Analysis
3.1 WSN Dataset
The dataset used in this work is taken from Zenodo. The 802.15.4 MAC layer performance datasets consist of 12 files, corresponding to measurements from the different observation interval granularities used in the experiments, located in the "Training data" folder. The folder contains a fine-granularity dataset with short-term MAC statistics over a time interval of 5 s, and several derived coarse-granularity datasets with long-term MAC statistics over 10, 15, 20, 25, 30, 35, 40, 45, 50 and 55 s. Table 1 indicates the features that have been taken into account.

Table 1. Details about WSN dataset
N | Attribute | Description | Type
1 | NumOfReceived | The number of received frames during a particular observation interval | Numerical
2 | PRR | The Packet Reception Rate, i.e. the percentage of received frames within a particular observation interval | Numerical
3 | packetLoss | The number of erroneous frames within a particular observation interval | Numerical
4 | PLR | The Packet Loss Rate, i.e. the percentage of lost frames within a particular observation interval | Numerical
5 | throughput | The aggregated throughput of all sending nodes within a particular observation interval | Numerical
6 | IPI | The Inter-Packet-Interval of the transmitter expressed as X/128 seconds, where X is the value in the 'IPI' column | Numerical
7 | Density | The number of nodes that were active during the experiment observation interval | Numerical
8 | COR | The Channel Occupancy Ratio, which indicates the level of interference generated; e.g. a level of 20 indicates an interference pattern generated 20% of a time period, i.e. transmitting a modulated carrier for 2 ms followed by an 8 ms idle period, repeated during the experiment | Numerical
3.2 Results
We compare the validation and test results for several methods to demonstrate the performance of our model. On these two scores, we calculated the MSE, MAE, RMLSE, and R2 of each algorithm. In Table 2, the values of the efficiency metrics are combined. The graphic diagrams that refer to the performance associated with each metric are illustrated.

Table 2. Comparison of techniques
Regressor | MSE | MAE | R2
XGB | 0.0002 | 0.0091 | 0.9965
DT | 0.0017 | 0.0274 | 0.9694
SVM | 0.0131 | 0.0972 | 0.7762
KNN | 0.0025 | 0.0304 | 0.9561

3.3 Results of Model Prediction Interpretation
In the variable importance plot, features are vertically sorted by their average impact on the model output, allowing for global interpretability and clarifying the whole structure of the model. Based on Fig. 3(a), the most important feature for predicting PRR, with an average impact greater than 0.10, is packetLoss, followed by
PLR in second position and NumOfReceived in third rank, etc. Density is the least powerful indicator with a mean absolute SHAP value of almost zero. In Fig. 3(a), the importance of the variables is calculated by averaging the absolute value of the Shap values. The Fig. 3(b) figure represents the overall importance of the variables calculated by the Shap values. In Fig. 3(b), the Shap values are plotted for each variable in their order of importance, with each dot representing a Shap value, red dots representing high values of the variable and blue dots representing low values of the variable. Thanks to the fact that the values are calculated for each example of the dataset, it is possible to represent each example by a point and thus have an additional information on the impact of the variable according to its value. For example, “packetLoss” which is the most important variable, has a negative impact when the value of this variable is high.
Fig. 3. Feature importance values were calculated by averaging the SHAP values of each attribute.
4 Conclusion
The functioning of networks has a significant impact on our everyday life. Predicting the MAC layer performance protocol is a chore that may be avoided in WSNs, which are being more incorporated into IoT applications. In this research, we offer a performance prediction for the MAC layer in WSNs with CSMA/CA based on packet delivery rate.
References 1. Alsheikh, M.A., Lin, S., Niyato, D., Tan, H.-P.: Machine learning in wireless sensor networks: algorithms, strategies, and applications. IEEE Commun. Surv. Tutorials 16(4), 1996–2018 (2014) 2. Ariza-Garzon, M.J., Arroyo, J., Caparrini, A., Segovia-Vargas, M.J.: Explainability of a machine learning granting scoring model in peer-to-peer lending. IEEE Access 8, 64873–64890 (2020) 3. Arjunan, P., Poolla, K., Miller, C.: EnergyStar++: towards more accurate and explanatory building energy benchmarking. Appl. Energy 276(June), 115413 (2020) 4. Chen, T., Guestrin, C.: Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016) 5. Di, M., Joo, E.M.: A survey of machine learning in wireless sensor netoworks from networking and application perspectives. In: 2007 6th International Conference on Information, Communications & Signal Processing, pp. 1–5. IEEE (2007) 6. F¨ oerster, A., Murphy, A.L.: Machine learning across the wsn layers (2010) 7. Gao, X., Lin, C.: Prediction model of the failure mode of beam-column joints using machine learning methods. Eng. Failure Anal. (1239), 105072 (2020) 8. Kim, T., Vecchietti, L.F., Choi, K., Lee, S., Har, D.: Machine learning for advanced wireless sensor networks: a review. IEEE Sensors J. (2020) 9. Praveen Kumar, D., Amgoth, T., Annavarapu, C.S.R.: Machine learning algorithms for wireless sensor networks: a survey. Inf. Fusion 49, 1–25 (2019) 10. Lin, D., Wang, Q., Min, W., Jianfeng, X., Zhang, Z.: A survey on energy-efficient strategies in static wireless sensor networks. ACM Trans. Sensor Networks (TOSN) 17(1), 1–48 (2020) 11. Messai, M.-L., Seba, H.: A survey of key management schemes in multi-phase wireless sensor networks. Comput. Netw. 105, 60–74 (2016) 12. Rodr´ıguez-P´erez, R., Bajorath, J.: Interpretation of compound activity predictions from complex machine learning models using local approximations and shapley values. J. Med. Chem. 63(16), 8761–8777 (2020) 13. Slack, D., Hilgard, S., Jia, E., Singh, S., Lakkaraju, H.: Fooling LIME and SHAP: adversarial attacks on post hoc explanation methods. In: AIES 2020 - Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 180–186. Association for Computing Machinery, Inc., New York, February 2020 14. Zhang, D., Qian, L., Mao, B., Huang, C., Huang, B., Si, Y.: A data-driven design for fault detection of wind turbines using random forests and xgboost. IEEE Access 6, 21020–21031 (2018)
Hadoop-Based Big Data Distributions: A Comparative Study

Ikram Hamdaoui1(B), Mohamed El Fissaoui2, Khalid El Makkaoui1, and Zakaria El Allali1

1 LaMAO laboratory, MSC team, FPD, Mohammed First University, Nador, Morocco
[email protected]
2 LMASI laboratory, FPD, Mohammed First University, Nador, Morocco
Abstract. Approximately 2.5 quintillion bytes of various forms (structured, semi-structured, or unstructured) of data are generated every day. Indeed, big data technology has come to solve the limitations of traditional methods, which can no longer handle and process large amounts of data in various forms. Hadoop is an open-source big data solution created to store, process, and manage a huge volume of different types of data. Many companies developed their own Hadoop distributions based on the Hadoop ecosystem in the last decade. This paper presents the most popular Hadoop distributions, including MapR, Hortonworks, Cloudera, IBM InfoSphere BigInsights, Amazon Elastic MapReduce, Azure HDInsights, Pivotal HD, and Qubole. Then it provides readers with a deep, detailed comparison of these distributions.
Keywords: Batch processing · Big data distributions · Cloud computing · Hadoop · Stream processing

1 Introduction
Every day, 2.5 quintillion bytes (2.5 × 10^18 bytes) of various types of data are created [1,2] from billions of connected devices, social media, etc., and by 2025 it is expected to reach 175 zettabytes (1.75 × 10^23 bytes) [3]. Every hour, Facebook receives 293 thousand posts and 510 thousand comments, whereas Instagram creates about 95 million posts daily [4], while six thousand tweets are sent every second on Twitter, with 500 million tweets sent daily [5]. These massive amounts of data come in different forms: structured, semistructured, and unstructured data [6]; traditional methods like relational database management system (RDBMS) [7] are not capable of handling and processing them. Therefore, new storage and analysis methods are required. With the advent of big data, many companies and businesses have embraced it and profited greatly from it. By properly utilizing big data, businesses can extract valuable data and insights that can be converted into practical business strategies and decisions, resulting in various advantages. Despite its benefits, big
data comes with many challenges concerning storage, processing, etc., [8,9]. To overcome big data issues, there are several solutions, such as Snowflake [10,11], Kubernetes [12], etc. The most used solution is Apache Hadoop [13]. Hadoop is an open-source software framework created by Doug Cutting and Mike Cafarella [14] at the Apache Software Foundation in 2005 to solve big data problems and assist businesses in processing, managing, and storing massive amounts of data in a distributed way. There are three versions of Hadoop (see Fig. 1): • Hadoop 1: consists of two core components; HDFS for storage and MapReduce [15] for resource management and data processing. • Hadoop 2: contains three main components; the HDFS for storage, an additional main component called YARN (Yet Another Resource Negotiator) for resource management and task scheduling while MapReduce for data processing. • Hadoop 3: which includes new features such as multiple name nodes, which eliminates single point failure problem that the previous versions had, and the enhancement of the timeline services, which improved the scalability and reliability of the timeline service, resolving the scalability issue that previous versions had, etc.
Fig. 1. Hadoop versions.
Despite the fact that Hadoop [16] and its major components are completely free and open-source, it is difficult for large companies to rely on the open-source framework. Rather than manually setting up the Hadoop environment, which is difficult, vendors offer their own Hadoop distributions that adopt and develop the open-source Hadoop by adding their own management and troubleshooting tools on top of it in order to provide their own enhanced Hadoop distribution. Their primary goal is to improve performance and functionality while also providing more scalable, secure, and comprehensive big data solutions. Several companies provide cloud-based Hadoop distribution services, such as Amazon EMR [17], Microsoft Azure’s HDInsight [18], and others provide stand-alone (on-premise)
Hadoop distributions, such as MapR [19], while there are hybrid distributions like Cloudera [20], etc. (see Fig. 2).
Fig. 2. Hadoop distributions.
Although there are more similarities than differences among Hadoop distributions, there are some minor differences. Customers may find it difficult to choose between Hadoop distributions because they all provide a Hadoop platform. Thus, in this paper, we compare cloud-based (Amazon EMR, Microsoft HDInsight, Qubole), stand-alone based distributions (MapR, Pivotal HD, and IBM InfoSphere BigInsights) and Hybrid (Cloudera and Hortonworks) Hadoop distributions and their benefits and drawbacks. The comparison considers several factors: licenses, dependability, data access, and manageability. In this paper, the second section presents eight Hadoop distributions and some of their features. The third section presents a comparison between the eight Hadoop distributions based on various factors as well as their advantages and disadvantages and discusses the comparison. Finally, the fourth section concludes the paper.
2 Hadoop Distributions
Hadoop distributions improved open-source Hadoop and served as a solution to big data challenges. With so many Hadoop distributions available, it might be difficult to select the right one for business requirements. This article will concentrate on some of the most well-known Hadoop distributions.
2.1 Stand-Alone Hadoop Distributions
The term “stand-alone” refers to software or computer programs installed and run on a user’s or organization’s computers without requiring a network connection. Some of the most well-known stand-alone Hadoop distributions are as follows:
• MapR: founded in 2009 by John Schroeder and M.C. Srivas, MapR is one of the fastest Hadoop distributions. It is dependable and can handle real-time performance with ease. MapR offers its own NoSQL database (MapR-DB), which outperforms HBase in terms of performance and scalability. It provides multi-node direct access, which enables access to Hadoop via the Network File System (NFS). The MapR File System (MapRFS) is another MapR feature that allows it to include components such as Pig and Hive without any Java dependencies. MapR has other components (as shown in Fig. 3). MapR has high availability, data protection, and disaster recovery capabilities.
Fig. 3. MapR architecture.
• IBM InfoSphere BigInsights: InfoSphere BigInsights [21] is a software platform that is built on top of Apache Hadoop 1 and was developed by IBM experts in 2011. The BigInsights package extends the Apache Hadoop architecture with analytics and visualization algorithms. It contains several extra-value components, such as the BigInsights console, which allows for central cluster management via a web-based interface, the BigInsights workflow scheduler, which assigns resources to jobs based on job priority, and BigSheets, which is a browser-based data manipulation and visualization tool that allows non-programmers and users to access, explore, and analyze data. It also has Avro, a system for serializing data, and other components. • Pivotal HD: a Hadoop 2 distribution that was developed by the Pivotal Software Inc. company in 2013. It provides the benefit of big data analytics without the overhead and complexity of a custom-built project. In addition to Hadoop components, Pivotal HD offers its own Pivotal components, such as HAWQ, which is an advanced database service that is faster than any other Hadoop-based query interface in rendering queries, and Pivotal Command Center, which includes both a command line and a graphical user interface to install, manage, and monitor the Pivotal HD cluster. Another Pivotal component is the Pivotal Data Loader, which is a high-performance data ingest tool for Pivotal HD clusters, etc. (see Fig. 4).
Fig. 4. Pivotal HD architecture.
2.2 Hybrid Hadoop Distributions
A hybrid Hadoop distribution refers to Hadoop distributions that can be installed and run on a user’s computer or in the cloud. Some of the most wellknown hybrid Hadoop distributions are as follows: • Cloudera CDH: Cloudera is one of the first to market a Hadoop distribution. It was founded in 2008 by Google, Yahoo, and Facebook experts and then it completely merged with Hortonworks in 2019. Cloudera is at the top of the list of big data providers for making Hadoop a dependable business platform. It supports multi-cluster management as well as the addition of new services to an existing Hadoop cluster. It has added a number of proprietary features to the Hadoop core version, including Cloudera Manager, a tool for creating, managing, and maintaining clusters as well as managing task processing [22], Cloudera Search, which is designed to make product searching easier, and Impala, which is capable of real-time processing (see Fig. 5).
Fig. 5. Cloudera architecture.
• Hortonworks was founded in 2011 by Yahoo and Benchmark Capital. It is a free open enterprise data platform that can be easily downloaded and integrated into a variety of applications, and it also supports Windows and Linux platforms. It was the first company to offer a Hadoop 2 distribution that was ready for production. Hortonworks aims to increase the Hadoop
platform's usability and to create a partner ecosystem that would hasten enterprise Hadoop adoption, as well as improve the performance of the Hadoop storage layer. It developed Apache Ambari, a tool that provisions, manages, and monitors Apache Hadoop clusters, making Hadoop management easier, and it includes other components such as Apache Tez, Apache Accumulo, etc. (as shown in Fig. 6).
Fig. 6. Hortonworks architecture.
2.3 Cloud-Based Hadoop Distributions
Cloud-based services [23,24] offer services such as storage, networking, and virtualization options through the internet and without the need to purchase additional hardware. Some of the most well-known cloud-based Hadoop distributions are as follows: • Amazon Elastic MapReduce (EMR): a cloud-based big data platform that makes provisioning and managing Hadoop clusters simple and secure. It enables complex financial analyses to be performed and allows the use of machine learning for the improvement of processing methods. It uses Amazon Simple Storage Service (S3) for storing input and output data and Amazon Elastic Compute Cloud (EC2) for computation to process big data across a Hadoop cluster of virtual servers. It is easy to use, secure and reliable. • Microsoft Azure HDInsight: an open-source cloud-based Hadoop distribution platform that makes processing large amounts of data in a customizable environment simple, fast, and affordable. It is used for data warehousing, batch processing, machine learning, etc. In addition to Apache Hadoop 2 components such as YARN, Hive, Pig, Ambari, etc., Microsoft Azure HDInsight offers its own features, such as Azure PowerShell for managing Apache Hadoop clusters, the Microsoft Avro Library for data serialization, etc.
• Qubole: is a complete and autonomous Hadoop 2.0 distribution platform, founded in 2011 by Ashish Thusoo and Joydeep Sen Sarma and launched in 2012. Qubole provides faster access to secure, dependable, and trustworthy datasets for machine learning and analytics. In addition to Apache Hadoop components such as MapReduce, Hive, etc., Qubole offers other features such as Apache Airflow, which enhances performance and user experience while facilitating the administration of Airflow clusters, Cascading, which is used to develop big data applications on Hadoop, Presto, which is a distributed SQL query engine for big data that provides high performance, etc.
3 A Comparison of Hadoop Distributions
Choosing the right Hadoop distribution depends entirely on the business requirements, the problems faced, and the features needed. In this paper, we compared various Hadoop distributions based on several factors.
3.1 Comparing Factors
In our comparison study, we used the following factors:
• Cloud/Hybrid/Stand-alone: whether the Hadoop distribution is based on the cloud, a hybrid, or a stand-alone platform.
• Documentation: whether the Hadoop distribution comes with documentation or a user guide that gives details about it and how to install or use the software.
• Data ingestion: transporting data from one or several sources to a medium for further processing and storage; data can be ingested in batches or in real time.
• Operating system: which system software the Hadoop distribution supports.
• Unique features: what makes the Hadoop distribution unique.
• Storage services: which components or services offer storage capacity to the Hadoop distribution.
• SQL: whether the Hadoop distribution supports SQL natively or through an interface.
• NoSQL: which NoSQL database the Hadoop distribution uses.
• Disaster recovery: whether the Hadoop distribution supplies disaster recovery to prevent data loss.
• Use case: whether the Hadoop distribution is used for Machine Learning (ML) [25, 26], Deep Learning (DL), Data Analysis (DA), Internet of Things (IoT), or Data Warehousing (DW).
3.2 Comparing Hadoop Distributions
Table 1. Hadoop distributions comparison
3.3 Advantages and Disadvantages of Hadoop Distributions
Table 2 shows some of the advantages and disadvantages of the eight chosen Hadoop distributions.
Table 2. Hadoop distributions advantages and disadvantages
3.4 Discussion
After comparing different Hadoop distributions using various factors, we discovered that the majority of Hadoop vendors have produced their own versions based on Apache Hadoop and related open-source projects. Some of them offer on-premise platforms, while most of them provide a cloud-based or hybrid Hadoop
distribution. Although each one of the eight chosen Hadoop distributions has its own unique features, and some of them offer their own storage services, they all offer documentation for their product and support both batch and stream processing, etc. Thus, there is no clear winner in the market between them since they are all focused on offering critical big data system qualities.
4 Conclusion
With the growing demand for big data technologies, a number of companies have created their own Hadoop distributions, and the competition between them is intense. When deciding on a Hadoop distribution, one crucial element to consider is whether you want a stand-alone or a cloud-based solution. This paper discussed and compared eight Hadoop distributions: MapR, Hortonworks, Cloudera, IBM InfoSphere BigInsights, Amazon Elastic MapReduce, Azure HDInsight, Pivotal HD, and Qubole. We listed their unique features as well as their advantages and disadvantages.
References 1. Gupta, Y.K., Kumari, S.: A study of big data analytics using apache spark with python and scala. In: 3rd International Conference on Intelligent Sustainable Systems (ICISS), pp. 471–478 (2020) 2. Williams, L.: Data DNA and diamonds. Eng. Technol. 14(3), 62–65 (2019) 3. Janev, V.: Semantic intelligence in big data applications. In: Smart Connected World, pp. 71–89. Springer, Cham (2021). https://doi.org/10.1007/978-3-03076387-9 4 4. Alani, M.M.: Big data in cybersecurity: a survey of applications and future trends. J. Reliable Intell. Environ. 7(2), 85–114 (2021). https://doi.org/10.1007/s40860020-00120-3 5. Sarkar, S.: Using qualitative approaches in the era of big data: a confessional tale of a behavioral researcher. J. Inf. Technol. Case Appl. Res. 23(2), 139–144 (2021) 6. Praveen, S., Chandra, U.: Influence of structured, semi-structured, unstructured data on various data models. Int. J. Sci. Eng. Res. 8(12), 67–69 (2017) 7. Sumathi, S., Esakkirajan, S.: Fundamentals of Relational Database Management Systems, vol. 47. Springer, Cham (2007) 8. Sivarajah, U., Kamal, M.M., Irani, Z., Weerakkody, V.: Critical analysis of Big Data challenges and analytical methods. J. Bus. Res. 70, 263–286 (2017) 9. Tariq, R.S., Nasser, T.: Big data challenges. J. Comput. Eng. Inf. Technol. 04(03) (2015) 10. Dageville, B., et al.: The snowflake elastic data warehouse. In: Proceedings of the 2016 International Conference on Management of Data, pp. 215–226 (2016) 11. Bell, F., Chirumamilla, R., Joshi, B.B., Lindstrom, B., Soni, R., Videkar, S.: The snowflake data cloud. In: Snowflake Essentials, Apress, Berkeley, CA, pp. 1–10 (2022) 12. Nguyen, N., Kim, T.: Toward highly scalable load balancing in kubernetes clusters. IEEE Commun. Mag. 58(7), 78–83 (2020)
13. Mitchell, I., Locke, M., Wilson, M., Fuller, A.: Fujitsu Services Limited.: Big data: the definitive guide to the revolution in business analytics. Fujitsu Services Ltd, London (2012) 14. Singh, V.K., Taram, M., Agrawal, V., Baghel, B.S.: A literature review on Hadoop ecosystem and various techniques of big data optimization. Advances in Data and Information Sciences, pp.231–240 (2018) 15. Belcastro, L., Cantini, R., Marozzo, F., Orsino, A., Talia, D., Trunfio, P.: Programming big data analysis: principles and solutions. J. Big Data 9(1), 1–50 (2022). https://doi.org/10.1186/s40537-021-00555-2 16. Kaur, M., Goel, M.: Big Data and Hadoop: a review. Communication and Computing Systems, pp. 513–517 (2019) 17. Singh, A., Rayapati, V.: Learning big data with Amazon elastic MapReduce. Packet Publishing Ltd. (2014) 18. Webber-Cross, G.: Learning Microsoft Azure. Packet Publishing Ltd. (2014) 19. Oo, M.N., Thein, T.: Forensic investigation on MapR hadoop platform. In: 2018 1st IEEE International Conference on Knowledge Innovation and Invention (ICKII), pp. 94–97. IEEE (2018) 20. Menon, R.: Cloudera Administration Handbook. Packet Publishing Ltd. (2014) 21. Ebbers, M., de Souza, R.G., Lima, M.C., McCullagh, P., Nobles, M., VanStee, D., Waters, B.: Implementing IBM InfoSphere BigInsights on IBM System X. IBM Redbooks (2013) 22. Achari, S.: Hadoop essentials. Packet Publishing Ltd. (2015) 23. El Makkaoui, K., Ezzati, A., Beni-Hssane, A., Motamed, C.: Cloud security and privacy model for providing secure cloud services. In: 2016 2nd International Conference on Cloud Computing Technologies and Applications (CloudTech), pp. 81– 86. IEEE (2016) 24. El Makkaoui, K., Beni-Hssane, A., Ezzati, A.: Cloud-elgamal and fast cloud-RSA homomorphic schemes for protecting data confidentiality in cloud computing. Int. J. Digital Crime Forensics (IJDCF) 11(3), 90–102 (2019) 25. Ouhmad, S., El Makkaoui, K., Beni-Hssane, A., Hajami, A., Ezzati, A.: An electronic nose natural neural learning model in real work environment. IEEE Access 7, 134871–134880 (2019) 26. El Mrabet, M.A., El Makkaoui, K. and Faize, A. : Supervised machine learning: a survey. In: 2021 4th International Conference on Advanced Communication Technologies and Networking (CommNet), pp. 1-10 (2021)
HDFS Improvement Using Shortest Path Algorithms
Mohamed Eddoujaji1, Hassan Samadi2(B), and Mohammed Bouhorma3(B)
1 Doctoral Studies Research Center Engineering Science and Technology, National School of Applied Sciences, Abdelmalek Essaâdi University, Tangier, Morocco
2 Doctoral Studies Research Center, National School of Applied Sciences, Abdelmalek Essaâdi University, Tangier, Morocco [email protected]
3 Faculty of Sciences and Techniques, Doctoral Studies Research Center, Abdelmalek Essaâdi University, Tangier, Morocco [email protected]
Abstract. In the previous version of this article, "data processing on distributed systems: storage challenge", we presented a new approach for the storage, management and exploitation of distributed data in the form of small files, such as the real-time localization message exchanges of port activity. With this approach, we managed to optimize information management by more than 30% using the classic HADOOP/YARN/HDFS architecture [11]. In a HADOOP ecosystem with several data processing nodes, reaching the right node, the one containing the desired data, in optimal time is a major challenge and an important research avenue for researchers and scientists [15]. In this paper, we show that the marriage between mathematical algorithms and computing can give very encouraging and important results. Indeed, one of the principles that applies here is graph theory, especially the computation of the shortest path to reach, in an optimal way, data held on a few nodes in an architecture of a few hundred or even thousands of nodes [16]. After research and comparison, Dijkstra's algorithm was chosen for calculating the shortest path in a HADOOP/HDFS system. Keywords: Hadoop · HDFS · Algorithms · Small files · Dijkstra · Shortest path
1 Introduction
Maritime transport dominates global trade, carrying about 90% of total trade volume. The maritime transport sector generates daily millions of data points from seaports, social media feeds, ship and parcel forwarding, logistics, couriers and much more [16]. Big Data helps port transport companies improve their potential by gathering data and analyzing it to find valuable and useful information. Big Data also helps to monitor the movements of ships by capturing their data, and it helps port carriers find the shortest routes to deliver ships.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ben Ahmed et al. (Eds.): NISS 2022, LNDECT 147, pp. 253–269, 2023. https://doi.org/10.1007/978-3-031-15191-0_25
One of the major
problems facing the shipping industry is the straying of ships on their way from source to destination. This problem has led the industry to turn to smarter technologies like data analytics [17]. Ports are no longer limited to the docking of ships. Stakeholders across the port ecosystem are now collaborating in new data-driven ways. When we think of the maritime industry, we often think of port authorities, ship operators and terminal operators. As the global shipping industry has become more complex, various players are now changing information in real time, including freight and logistics companies, storage providers, rail and barge operators, trucking companies and suppliers. sensors for pipelines, cranes, berths and roads [12]. As a result, these players have adopted smarter solutions to increase their productivity and be more efficient. For example, CMA CGM, operator of the third largest shipping company in the world, uses Traxens technology to equip container ships. This tool transforms containers into smart connected objects now with information on their position, which makes it possible to make a prediction of the time of arrival [14]. Today, ports must constantly improve quality, increase productivity and reduce costs for shipping companies in order to be internationally competitive. Other cities like Singapore, Shanghai, Hong Kong, and Taiwan have taken similar smart port initiatives, which has helped to speed up ship turnaround times and make port maintenance more efficient [20].
2 HADOOP Platform
Hadoop was designed to efficiently manage large files, especially when traditional systems face limitations in analyzing the new dimension of data caused by its exponential growth. However [22], Hadoop is not deployed to handle only large files: the heterogeneity and diversity of information from multiple sources (intelligent devices, IoT, internet users, log files and security events) has become the normal input of Hadoop architectures [14, 23]. In today's world, most domains permanently and constantly generate a very large amount of information in the form of small files. Many domains store and analyze millions and millions of small files, such as multimedia data mining [2], astronomy [3], meteorology [4], signal recognition [5], climatology [6, 7], energy and e-learning [8], without forgetting the astronomical volume of information processed by social networks; Facebook stores more than 350 million images every day [9]. In biology, the human genome generates up to 30 million files which do not exceed 190 KB on average [10] (Fig. 1).
Fig. 1. HADOOP 2.0 main components (MapReduce and other data processing engines running on YARN cluster resource management, over HDFS redundant, reliable storage)
2.1 HDFS – Reliable Storage
The Hadoop Distributed File System (HDFS) provides the ability to store large amounts of data reliably on commodity hardware. Although there are file systems with better performance, HDFS is an integral part of the Hadoop framework and has already reached the level of a de facto standard. It was designed for large data files and is well suited for rapid data ingestion and bulk processing [13, 19] (Fig. 2).
Fig. 2. Storage & Replication of blocks in HDFS
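To make the storage model of Fig. 2 concrete, the following toy Python sketch only illustrates the idea of cutting a file into fixed-size blocks and replicating each block on several DataNodes. The block size, replication factor, node names and the round-robin placement are illustrative assumptions; they are not HDFS's actual defaults or placement policy.

```python
# Toy illustration of the block/replica model of Fig. 2 (not real HDFS behaviour).
BLOCK_SIZE_MB = 128          # assumed block size
REPLICATION = 3              # assumed replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # hypothetical DataNodes

def place_blocks(file_size_mb):
    """Split a file into blocks and assign each block to REPLICATION DataNodes."""
    n_blocks = -(-file_size_mb // BLOCK_SIZE_MB)   # ceiling division
    placement = {}
    for block_id in range(n_blocks):
        # naive round-robin placement; real HDFS is rack- and load-aware
        placement[block_id] = [DATANODES[(block_id + r) % len(DATANODES)]
                               for r in range(REPLICATION)]
    return placement

if __name__ == "__main__":
    for block, nodes in place_blocks(400).items():   # a 400 MB file -> 4 blocks
        print(f"block {block}: replicas on {nodes}")
```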
2.2 DIJKSTRA's Algorithm Presentation
Graph theory is a branch of mathematics and computer science that involves modeling different real-life problems as graphs [16]. One of the most classic uses is modeling a road network between different cities, where one of the main issues is the optimization of the distances between two points [24]. To find the shortest path, Dijkstra's algorithm is often used; let us review how it works [18].
Algorithm example (Fig. 3): Imagine that we are trying to find the shortest path between node S (Source) and node D (Destination).
Fig. 3. Dijkstra’s algorithm example (Phase 0)
Throughout the algorithm we keep in memory the shortest path from S to each of the other nodes in the HADOOP network [32]. We always repeat the same process (macro algorithm):
1. We choose the accessible vertex of minimum distance as the vertex to be explored.
2. From this vertex, we explore its neighbors and update the distance of each one. We only update a distance if it is smaller than the one we had before.
3. We repeat until we reach the end point or until all the vertices have been explored.
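As a concrete illustration of these three steps, here is a minimal, self-contained Python sketch of the textbook algorithm. The example graph and its weights are invented for illustration; this is not the adapted variant selected later in Sect. 2.4.

```python
import heapq

def dijkstra(graph, source):
    """Textbook Dijkstra: shortest distance from `source` to every reachable node.
    `graph` maps each node to {neighbour: non-negative edge cost}."""
    dist = {source: 0}
    visited = set()
    heap = [(0, source)]                       # (distance, node)
    while heap:
        d, node = heapq.heappop(heap)          # step 1: closest accessible vertex
        if node in visited:
            continue
        visited.add(node)
        for neighbour, cost in graph.get(node, {}).items():
            new_d = d + cost                   # step 2: relax the neighbours
            if new_d < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_d
                heapq.heappush(heap, (new_d, neighbour))
    return dist                                # step 3: all vertices explored

# Hypothetical weighted graph in the spirit of Figs. 3-9 (weights are made up).
example = {"S": {"A": 2, "B": 5}, "A": {"B": 2, "D": 6}, "B": {"D": 3}, "D": {}}
print(dijkstra(example, "S"))                  # {'S': 0, 'A': 2, 'B': 4, 'D': 7}
```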
2.3 DIJKSTRA’s Algorithm Examples
Fig. 4. Dijkstra’s algorithm example (Phase 1)
Fig. 5. Dijkstra’s algorithm example (Phase 2)
Fig. 6. Dijkstra’s algorithm example (Phase 3)
Fig. 7. Dijkstra’s algorithm example (Phase 4)
Fig. 8. Dijkstra’s algorithm example (Phase 5)
We will then repeat the execution of the shortest-path algorithm for the remaining nodes until the following final diagram is obtained [27] (Figs. 4, 5, 6, 7, 8 and 9):
Fig. 9. Dijkstra’s algorithm example (Final result)
2.4 SubAlgorithm Dijkstra(Dijkstra_type) Based on the research done previously, we can conclude that several improvements and adaptations have been made to the Dijkstra algorithm, however for our case [26, 27], the most suitable is:
3 Article Sections
The remaining parts of this article are organized as follows: Sect. 3 describes the related work and some previous results, Sect. 4 describes our proposed approach, and Sect. 5 presents our experimental work and results. Finally, Sect. 6 concludes this research and gives our expectations.
4 Proposed Work
4.1 Little Reminder
Dijkstra's algorithm, developed by Edsger Dijkstra in 1956 and published in 1959 [16], is mainly concerned with finding the shortest path using a graph theory algorithm [30]. The principle is very simple: between a source point and a destination point, we compute the optimal (minimum, non-negative) cost to reach the next node, and so on [29]. The best known application of this algorithm is its use in routing and other graph computation applications. However, since the HADOOP network is a large network containing gigantic masses of data, we plan to use this algorithm to optimize the storage of information on a network of nodes as well as its retrieval, by optimizing search and analysis times [16]. The best-known subjects in which the algorithm appears as a leader are navigation systems and online route systems (Google Maps) [16, 28, 30]. Generally, the network of HADOOP nodes is represented as a weighted graph [25]. Then, to store and retrieve information, Dijkstra's algorithm is applied to compute the optimal cost, then find the shortest path between two consecutive nodes and therefore find the most optimized path between the NameNode (S) and the DataNodes (N) [31]. In other words, if we have a weighted network with hundreds of nodes, Dijkstra's algorithm will help us determine the best route to use [28]; taking into consideration the number of vertices (nodes) in our network, we can efficiently calculate the optimal route to access information, regardless of the size of the graph.
4.2 The Proposed Approach for Small File Management
See Fig. 10.
Fig. 10. Hadoop based algorithm for shortest path
5 Experimental
5.1 Experimental Environment (RatioTable)
One of the most important tasks is to assign a ratio to each path to a device, whether for reads or writes. The calculation of this ratio mainly involves several factors, namely CPU capacity, memory size, type of disk access (HDD, SSD, NL-SAS), data volume and even the number of nodes in the network in question [25]. Indeed, the HADOOP ecosystem allows the use of several types of hardware; being independent of this layer is one of the major advantages of the technology: we can use old servers, new machines, virtual machines, mix a little of everything, and add or remove what we want [33]. The following table summarizes the weightings (ratios) of each node. The ratio is calculated as shown in Table 1.
Table 1. RatioTable example

Node          CPU   Memory   Disk     Ratio rate (WordCount)   Ratio rate (Grep)
VM0 (master)  8     64 GB    500 GB   –                        –
VM1 (slave)   4     32 GB    500 GB   4                        3
VM2 (slave)   4     16 GB    500 GB   2                        1.5
VM3 (slave)   2     8 GB     500 GB   1                        1
VM4 (slave)   2     8 GB     500 GB   1                        1
Previously, we defined how we calculate these rates [2]; to avoid repetition, we simply recall them here. For the WordCount job: A(ComputeCapacity) = 3 × C(ComputeCapacity) and B(ComputeCapacity) = 1.5 × C(ComputeCapacity). For the Grep job: A(ComputeCapacity) = 2.5 × C(ComputeCapacity) and B(ComputeCapacity) = 1.5 × C(ComputeCapacity).
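To illustrate how such ratios can drive the shortest-path search of Sect. 4, the sketch below turns the WordCount ratio rates of Table 1 into edge weights of a small cluster graph and asks Dijkstra (via networkx) for the cheapest DataNode reachable from the NameNode. The cost formula (1/ratio plus a fixed hop through a hypothetical site switch) and the topology are assumptions made for illustration only; they are not the exact weighting used in this work.

```python
# Sketch only: ratio rates of Table 1 turned into edge weights for a Dijkstra search.
import networkx as nx

wordcount_ratio = {"VM1": 4, "VM2": 2, "VM3": 1, "VM4": 1}       # from Table 1

G = nx.DiGraph()
for vm, ratio in wordcount_ratio.items():
    switch = "switch-A" if vm in ("VM1", "VM2") else "switch-B"  # hypothetical topology
    G.add_edge("NameNode", switch, weight=0.1)                   # assumed network hop cost
    G.add_edge(switch, vm, weight=1.0 / ratio)                   # stronger node -> cheaper edge

costs = nx.single_source_dijkstra_path_length(G, "NameNode", weight="weight")
best = min(wordcount_ratio, key=costs.get)
print(costs)                          # e.g. {'NameNode': 0, 'switch-A': 0.1, 'VM1': 0.35, ...}
print("preferred DataNode:", best)    # VM1 (highest ratio, hence lowest path cost)
```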
5.2 Phases Algorithms
5.3 Results
After defining the ratio of each node, we start our experiment using the same data set as in the NISS21 conference paper [2] (Fig. 11):
Fig. 11. Distribution of File Sizes in our experiment (number of small files per size range, in KB: 0–128, 128–512, 512–1028, 1024–4096, 4096–8192)
Time taken for read and write operations for the above Datasets, based on Memory A (Fig. 12).
(Chart: time taken for read and write operations versus number of files, comparing plain HDFS with the HFSA approach.)
Memory Consumption
We note that the memory consumed when using the Dijkstra algorithm is slightly higher than that consumed by the HFSA approach [1], which is quite logical since the execution of the algorithm requires a little more resources, whether memory or CPU.
Table 2. Comparison of the NameNode memory usage

Dataset #   Number of small files   Normal HDFS (MB)   Dijkstra algorithm (MB)   HFSA algorithm (MB)
1           5000                    980                268                       500
2           10000                   1410               390                       900
3           15000                   2010               530                       1200
4           20000                   2305               620                       1230
Fig. 12. Memory usage by the NameNode
The memory gain comes from keeping metadata for a single combined file and not for every single small file, which explains the reduction of the memory used by the proposed approach (Table 2).

Performance Comparison

Writing Test

Performance evaluation: writing time

Dataset #   Number of small files   Normal HDFS   HFSA algorithm   Dijkstra algorithm
0           2500                    1100          1000             700
1           5000                    2200          1700             1500
2           10000                   3800          2600             2640
3           15000                   5500          3500             3550
4           20000                   7200          4400             4480
Results of the writing time are shown in Fig. 13:
Fig. 13. Performance evaluation: writing time
For the data sets used, we notice that the benefit of the Dijkstra algorithm becomes significant only once the volume of data is large; with small data volumes we gain nothing. This aligns perfectly with our objective, since the targeted data mass will be thousands of GB.
Reading Time

Performance evaluation: reading time

Dataset #   Number of small files   Normal HDFS   HFSA algorithm   Dijkstra algorithm
0           2500                    390           285.7            320
1           5000                    877.5         567.2            580
2           10000                   1568.7        729.1            735
3           15000                   1984.4        1029.8           1000
4           20000                   3112.3        1332.9           1200
Reading test using the Dijkstra algorithm.
(Chart: reading time consumption in seconds versus data set size, for Normal HDFS, the HFSA algorithm and the Dijkstra algorithm.)
Fig. 14. Performance evaluation: reading time
For reading too, the results of the algorithm become more significant and more appealing as the number of files increases and the data size grows (Fig. 14). This proves once again that the use of the algorithm is of great importance once the volume of data is large, where the approach presented in the paper [1] reaches its limitations.
5.4 Conclusion and Future Works
We investigated whether the use of shortest path algorithms can bring advantages and improvements to our approach. The answer is affirmative and the results are encouraging, especially for a large volume of data, something expected once the communication with the devices (IoT)
will be started, and also once the number of nodes of our platform grows to a few hundred. Calculating the shortest path then presents itself as an ideal solution. However, Dijkstra's algorithm assumes the existence of two points, one of departure and one of arrival; this can be a brake and an inconvenience when applying the algorithm to many nodes distributed over several sites containing thousands and thousands of small files. The next step will consist mainly of two lines of work:
• The use of an algorithm improved over Dijkstra's: A* appears to be a great alternative [33], and neural networks [34] can also be tested and their results compared. All of these open a huge domain for launching the first AI approaches for the HADOOP platform using IoT in the port industry [28].
• The distribution of files and their placement is a second very motivating research topic. Processing large amounts of data (Big Data) on distributed systems (Blockchain, Cloud, several nodes, multi-site), with several heterogeneous sources of information (mobile devices, tablets, IoT…), presents a rich field for researchers and for scientists too; the ball is in everyone's court: challenge the world, amaze it!
References 1. Eddoujaji, M.: Data processing on distributed systems: storage challenge. In: NISS 2021, July 2021 2. Eddoujaji, M.: Improving hadoop through data placement strategy. AJATIT J. (2022) 3. Hadoop official site. http://hadoop.apache.org/ 4. Bende, S., Shedge, R.: Dealing with small files problem in hadoop distributed file system. Procedia Comput. Sci. 79, 1001–1012 (2016) 5. Cai, X., Chen, C., Liang, Y., An optimization strategy of massive small files storage based on HDFS. In: 2018 Joint International Advanced Engineering and Technology Research Conference (JIAET 2018) (2018) 6. Niazi, S., Ronström, M., Haridi, S., Dowling, J.: Size matters: improving the performance of small files in Hadoop. Paper Submission, Middleware 2018. ACM, Rennes (2018) 7. Mir, M.A., Ahmed, J.: An optimal solution for small file problem in Hadoop. Int. J. Adv. Res. Comput. Sci. (2017) 8. Alange, N., Mathur, A., Small sized file storage problems in hadoop distributed file system. In: Second International Conference on Smart Systems and Inventive Technology (ICSSIT 2019). IEEE Xplore (2019) 9. Andreiana, A.-D., B˘adic˘a, C., Ganea, E.: An experimental comparison of implementations of Dijkstra’s single source shortest path algorithm using different priority queues data structures. In: 24th International Conference on System Theory, Control and Computing (ICSTCC) (2020) 10. Ahada, M.A., Biswasa, R.: Dynamic merging based small file storage (DMSFS) architecture for efficiently storing small size files in hadoop. Procedia Comput. Sci. 132, 1626–1635 (2018) 11. Verma, D., Messon, D., Rastogi, M., Singh, A.: Comparative study of various approaches of Dijkstra algorithm. In: International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2021)
12. Rattanaopas, K., Kaewkeeree, S.: Improving hadoop mapreduce performance with data compression: a study using wordcount job. In: Conference Paper. IEEE (2017) 13. El-Sayed, T., Badawy, M., El-Sayed, A.: SFSAN approach for solving the problem of small files in Hadoop. In: 13th International Conference on Computer Engineering and Systems (ICCES) (2018) 14. Zheng, T., Guo, W., Fan, G.: A method to improve the performance for storing massive small files in Hadoop. In: The 7th International Conference on Computer Engineering and Networks (CENet2017), Shanghai (2017) 15. Carns, P.H., Ligon III, W.B., Ross, R.B., Thakur, R.: PVFS: a parallel file system for Linux clusters. In: Proceedings of the 4th Annual Linux Showcase and Conference, pp. 317–327. USENIX Association (2000) 16. Alange, N., Mathur, A.: Small sized file storage problems in hadoop distributed file system. In: 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT). IEEE (2019) 17. Zhu, Z., Li, L., Wu, W., Jiao, Y.: Application of improved Dijkstra algorithm in intelligent ship path planning. In: 2021 33rd Chinese Control and Decision Conference (CCDC). IEEE (2021) 18. Shah, A., Padole, M.: Optimization of hadoop MapReduce model in cloud computing environment. In: Conference Paper. IEEE (2019) 19. Candra, A., Budiman, M.A., Hartanto, K.: Dijkstra’s and A-star in finding the shortest path: a tutorial. In: International Conference on Data Science, Artificial Intelligence, and Business Analytics (DATABIA) (2020) 20. Gbadamosi, O.A., Aremu, D.R.: Design of a modified Dijkstra’s algorithm for finding alternate routes for shortest-path problems with huge costs. In: Conference Paper. IEEE (2020) 21. Nurhayati, B., Amrizal, V.: Big data analysis using hadoop framework and machine learning as decision support system (DSS). In: International Conference on Cyber and IT Service Management (CITSM) (2018) 22. Tchaye-Kondi, Y.Z.: Study of distributed framework hadoop and overview of machine learning using apache mahout. In: IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC) (2019) 23. Bogdan, P.: Dijkstra algorithm in parallel-case study. IEEE (2015) 24. Aridhi, S., Benjamin, V., Lacomme, P., Ren, L.: Shortest path resolution using Hadoop. In: 10ème Conférence Francophone de Modélisation, Optimisation et Simulation (2018) 25. Aridhi, S., Lacomme, P., Ren, L., Vincent, B.: A MapReduce-based approach for shortest path problem in large-scale networks. Eng. Appl. Artif. Intell. J. (2015) 26. Hamilton Adoni, W.Y., Nahhal, T., Aghezzaf, B., Elbyed, A.: The MapReduce-based approach to improve the shortest path computation in large-scale road networks: the case of A* algorithm. Springer Open, J. BIG DATA (2018) 27. Luo, M., Hou, X., Yang, S.X.: A multi-scale map method based on bioinspired neural network algorithm for robot path planning. In: Conference Paper. IEEE (2018)
Improved Hourly Prediction of BIPV Photovoltaic Power Building Using Artificial Learning Machine: A Case Study Mouad Dourhmi1 , Kaoutar Benlamine2 , Ilyass Abouelaziz4 , Mourad Zghal1 , Tawfik Masrour3 , and Youssef Jouane1(B) 1
3
LINEACT CESI, Strasbourg, France [email protected] 2 ICube Laboratory, INSA de Strasbourg, Strasbourg, France L2M3S Laboratory, ENSAM Meknes, Moulay Ismail University, Meknes, Morocco 4 LINEACT CESI, Reims, France
Abstract. In the energy transition, controlling energy consumption is a challenge for everyone, especially for BIPV (Building Integrated Photovoltaics) buildings. Artificial Intelligence is an efficient tool to analyze fine prediction with a better accuracy. Intelligent sensors are implemented on the different equipments of a BIPV building to collect information and to take decision about the energy in order to reduce its consumption. This paper presents the implementation of a machine learning model of short and medium term hourly energy production of photovoltaic panels in BIPV buildings on several sites. We selected the data influencing the energy efficiency of the PV panels, with the measurement of variable importance score for each model. Indeed, we have developed and compared several machine learning models of hourly prediction independently of the building location taking into account the weather forecast data on site such as DHI, DNI and GHI and the same in clear sky condition. Five methods are tested and evaluated to determine the best prediction: Random Forest (RF), Artificial Neural Networks (ANN), Support Vector Regression (SVR), Decision Trees regression (DTR), and linear regression. The methods are evaluated based on their ability to predict photovoltaic energy production at hourly and daily resolution. Keywords: Photovoltaic photovoltaic (BIPV)
1
· Neural network · Building-integrated
Introduction
Solar energy production is known to be directly related to on-site weather conditions. It varies throughout the day with changes in solar irradiation. Photovoltaic power forecasts require a prior knowledge of meteorological parameters. Different models and parameters known to have an influence, such as opacity and c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ben Ahmed et al. (Eds.): NISS 2022, LNDECT 147, pp. 270–280, 2023. https://doi.org/10.1007/978-3-031-15191-0_26
solar irradiance, have been used for photovoltaic power prediction [1,2]. Previous research has suggested that the prediction of renewable energy production balancing supply and demand in electric power management [3]. This approach will be taken into account in the input data of my project, but this approach remains incomplete and out of alignment with the geometric and structural data of the building. In recent years, many prediction methods for photovoltaic electrical systems have been developed using statistical models and machine learning algorithms. Several prediction methods have had different levels of success in improving accuracy and reducing complexity in terms of computational cost. These methods can be classified as direct and indirect. In other words, an electrical model of a photovoltaic system is combined with a statistical model that converts the numerical weather prediction data into solar energy with a shortterm predictive horizon for the physical model. Indeed, in our project we will have to integrate this approach as well. On the other hand, Dong et al. [4] have developed a hybrid prediction method for hourly solar irradiance by integrating self-organizing map, support vector regression and particle swarm optimization approaches. Another alternative lies on power prediction using the indirect method using simulation tools, such as TRNSYS, PVFORM and HOMER, but these approaches remain limited and do not directly integrate the BIM data in the design phase. There is also limited applicability for typical buildings that do not have solar meters to measure different forms of solar irradiance, such as direct normal irradiance (DNI) or global horizontal irradiance (GHI). Several studies have investigated methods that improve the prediction of building-integrated photovoltaic (BIPV) power using artificial learning methods but very difficult to link with all direct normal et horizontal irradiance [5]. Although many studies have suggested the prediction of PV power generation in a building using various prediction algorithms and hybrid models based on the direct method [10], these last have limited ability to maintain high hourly prediction performance of PV energy production in the short term because these models mainly depend on predicted weather data, which does not include solar irradiance, building geometry and orientation of building, and even less on deployed PV technologies. Therefore, all three input characteristics, such as BIM data, weather information, and the configuration of the forecasting model must be taken into account to improve the hourly accuracy of the forecasts. Another approach based on feature engineering and machine learning manages to efficiently improve the short-term predictions of photovoltaic energy production in BIPV buildings with only four typical types of weather conditions [6]. Chuyao Wang et al. proposes a case study with the development of ANN Model for an air-conditioned room in comparison with the conventional method according to the weather conditions [7]. This last proposes an investigation on the integration strategy of a hybrid BIPV/T facade with an adaptive control method based on ANN model to improve the indoor thermal environment of a building space having two types of PV technology, transparent and opaque, on their facade. In this work we propose an improvement of the hourly prediction of the building photovoltaic energy BIPV using an artificial learning device with a finer accuracy of prediction for one in
the season: a comparative study of real cases of residential building in Strasbourg in FRANCE is proposed with broader climate data. Initially, we started by processing different climate data variables with direct, indirect and diffuse solar radiation data taking into account the importance of each variable. Indeed, we propose the comparison of the performance of five algorithms and we choose the most suitable for the complexity and complementary between our variables.
2 Materials and Methods
2.1 Proposed BIPV Study of Case
In this work we exploit the hourly data Photovoltaic power production of a residential Building located in the north east of the city of Strasbourg in FRANCE, at a latitude of 48.61N and a longitude of 7.787E, as shown in Fig. 1. The capacity of this photovoltaic production system is about 54 kW and the specifications of the BIPV system are shown in Fig. 1. The power of the BIPV system was measured in hours and collected in a database. There are 40 photovoltaic panels of monocrystalline silicon technology with an area of 124.6 m. 2.2
Description of Machine Learning Models
In this section, we describe the different models that we used in our application, namely: support vector regression, decision tree regression, artificial neural networks, linear regression and random forest. Table 1 illustrates the different models that we used and their configuration and selected parameters. In what follows, we give a brief description of each model. Table 1. The parametric conditions for each of the five machine learning models Model
Configuration parameters
Selected parameter
Support Vector Regression (SVR)
Kernel type
RBF
Regularization
20
Gamma
1
Regression precision (epsilon)
0.1
Decision Tree Regression
random state
42
Artificial Neural Networks
Activation function
Relu
Number of hidden layers
4
Data division (training testing)
80% 20%
n estimators
200
random state
0
Linear Regression Random Forest
Support Vector Regression (SVR) is a regression function that is generalized by Support Vector Machines which is a machine learning model used for data classification.
Panel system parameters Total power 24.3 KW Module surface 124.6 m Model of panel LG320N1C-G4 Panel power 320 W number of cells per panel (6X10) 60 Losses due to shading 0% Losses due to heating 8.5 % Open-circuit voltage Voc 40.9 V % short circuit current ISC 10.1 A %
Fig. 1. The BIPV building study with the photovoltaic generation system installed on the roof of residential building with its associated BIM model and the characteristics of photovoltaic panels installed in the roof of BIPV building
Decision tree regression Decision tree can be used for regression or classification problem, it builds models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, each representing values for the attribute tested. The leaf node represents a decision on the numerical goal. The highest decision node in a tree that matches the best predictor called the root node. Decision trees can handle both categorical and numeric data. Artificial neural networks are a computational model that mimics the way nerve cells work in the human brain. It can be viewed as weighted directed graphs, that are commonly organized in layers. It basically consists of three
layers: input layer, hidden layer and output layer. These layers have many nodes that mimic biological neurons in the human brain that are interconnected and contain activating function. The first layer receives the raw input signal from the external world. Each successive layer gets the output from the layer preceding it. The output at each node is called its activation or node value. The last level produces the output of the system. Linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). When there is only one input variable, the method is called simple linear regression. When there are multiple input variables, the statistical literature often refers to the method as multiple linear regression. Different techniques can be used to prepare or train the linear regression equation from data, the most common of which is called ordinary least squares. It is therefore common to refer to a model prepared in this way as ordinary least squares linear regression or simply least squares regression. Random forest is a robust machine learning algorithm that can be used for a variety of tasks, including regression and classification. It is an ensemble method, which means that a random forest model is made up of a large number of small decision trees, called estimators, which each produce their own predictions. The random forest model combines the predictions of the estimators to produce a more accurate prediction.
3 Empirical Evaluations
3.1 Datasets
The forecast weather dataset containing historical weather data was obtained from the National Solar Radiation Database (NSRDB) and the Photovoltaic Geographic Information System (PVGIS). Twenty weather variables were used as predictors: outdoor air temperature (T), Relative Humidity (RH), Pressure (P), Precipitable Water (PW), Wind Direction (WD), Wind Speed (WS), DNI: Diffused Normal Irradiance, DHI: Diffused Horizontal Irradiance, GHI: Global Horizontal Irradiance, Cloud Type (CL), Dew Point (DP), Surface Albedo (SA), Solar Zenith Angle (SZA), as shown in Table 2. The set of meteorological variables consists of data collected at 1 h intervals. In this work, we have preprocessed the set of these variables and selected only the most relevant ones such as DNI, DHI and GHI, with a resolution of 1 day over a 24 h season, in order to perform the short-term hourly prediction at random for 1 day in the season as shown in Fig. 3.
Table 2. Characteristics of online weather forecasting information. Parameter
Description
UoM
T
External air temperature
C
DHI
Diffused horizontal irradiance w/m
DNI
Diffused normal irradiance
w/m
GHI
Global horizontal irradiance
w/m
Clearsky DNI –
w/m
Clearsky GHI –
w/m
Clearsky DHI –
w/m
CL
Cloud type
–
DP
Dew point
C
RH
Relative humidity
P
Pressure
mbar
PW
Precipitable water
cm
SA
Surface Albedo
–
SZA
Solar Zenith Angle
Degree
WD
Wind direction
Degree
WS
Wind speed
m/s
Using all precedent data from PVGIS and NSRDB, the correlation between photovoltaic output power and meteorological parameters and time values is analyzed, and the results are shown in Fig. 2. In fact, we used the path analysis method to analyze the degree of correlation between weather conditions and BIPV power output (PP-BIPV). Indeed, this identification of the correlation between weather factors and PP-BIPV, can facilitate the problem solving of independent variables indirectly influencing on the dependent variables, which will allow us to select the main variables as the model input. In the path analysis theory, the direct path coefficient represents the direct influence of the independent variable itself on the dependent variable, and the indirect path coefficient represents the indirect influence of the independent variable on the dependent variable through other variables. Considering exhaustively the interaction between the variables, the correlation coefficient in the Fig. 2 shows this dependence and independence between the variables. In path analysis theory, the absolute value of the correlation coefficient determines the degree of correlation and influence. The absolute value of the correlation coefficient greater than 0.2 between the weather variables and PP-BIPV production are selected as inputs in the prediction model. Apart from pressure (P), wind speed (WD) and precipitable water (PW), all other variables such as relative humidity (RH), solar zenith angle (SZA), temperature, Clearsky DHI, Clearsky DNI, dew point (DP), DHI, DNI and GHI have a correlation coefficient greater than 0.2. After
the path analysis, the characteristic variables for each sample include only the input data at ten inputs with 10 dimensions.
Fig. 2. Correlation matrix between weather variables and PP-BIPV
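The |correlation| > 0.2 screening described above can be expressed in a few lines of pandas. The file name and column names below are assumptions made for illustration; the real study works on the NSRDB/PVGIS exports and the measured hourly BIPV power (PP-BIPV).

```python
# Sketch of the correlation-based input screening (threshold |r| > 0.2).
import pandas as pd

df = pd.read_csv("bipv_hourly.csv")            # hypothetical hourly weather + power file
target = "PP_BIPV"                             # assumed column name for the BIPV output power

corr = df.corr()[target].drop(target)          # correlation of every variable with PP-BIPV
selected = corr[corr.abs() > 0.2].index.tolist()

print(corr.sort_values(key=abs, ascending=False))
print("Model inputs:", selected)               # expected to retain GHI, DNI, DHI, T, SZA, RH, ...
```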
3.2 Error Metrics
The performance of the prediction model is affected by the configuration and type of model. In order to select an appropriate prediction model we evaluated five machine learning models, namely Random forest (RF), an artificial neural network (ANN), a Linear regression (LR), a support vector regression (SVR), and a Decision tree regression (DTR). To compare different models, several indicators can be computed to express their performance. In our case, we used Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and coefficient of determination (R2). These metrics are expressed as follows:

MAE = \frac{1}{N}\sum_{i=1}^{N} |y_i - x_i|   (1)

MSE = \frac{1}{N}\sum_{i=1}^{N} (y_i - x_i)^2   (2)

RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - x_i)^2}   (3)

R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - x_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}   (4)

where y_i is the prediction, x_i the true value, \bar{y} is the mean of the observed data and N is the number of samples.
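A scikit-learn sketch of the evaluation loop is given below: the five models are configured with the settings reported in Table 1 and scored with the metrics of Eqs. (1)–(4). The synthetic data, the hidden-layer widths of the ANN, and every hyper-parameter not listed in Table 1 are assumptions; in the study the inputs are the ten selected weather variables and the hourly PP-BIPV measurements.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Placeholder data standing in for the ten weather inputs and the hourly PP-BIPV target.
X, y = make_regression(n_samples=2000, n_features=10, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "SVR": SVR(kernel="rbf", C=20, gamma=1, epsilon=0.1),        # Table 1 settings
    "DTR": DecisionTreeRegressor(random_state=42),
    "ANN": MLPRegressor(hidden_layer_sizes=(64, 64, 64, 64),     # 4 hidden layers (widths assumed)
                        activation="relu", max_iter=1000),
    "LR": LinearRegression(),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)                    # Eq. (1)
    mse = mean_squared_error(y_test, y_pred)                     # Eq. (2)
    rmse = np.sqrt(mse)                                          # Eq. (3)
    r2 = r2_score(y_test, y_pred)                                # Eq. (4)
    print(f"{name}: MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```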
Table 3. Errors of different machine learning models
3.3
Errors MAE
MSE
RMSE
R2
SVR
0.59
1.35
1.16
0.94
DTR
9,56.10−2 6,6.10−2
ANN
7,68.10−2 5,30.10−2 23.10−2 −2
6,93.10
−2
25,7.10−2 99,7.10−2 27,9.10
−2
99,7.10−2 99,6.10−2
LR
9,66.10
RF
5,85.10−2 3,34.10−2 18,2.10−2 99,8.10−2
Results and Interpretation
Photovoltaic power forecasts require prior knowledge of meteorological parameters. Different parameters could influence the model, such as solar irradiance and opacity, and have been used for PV power prediction [1,8]. Other work based on RNN models estimated that total daily hourly PV energy production following periods of clear and slightly cloudy weather [11–13]. Dongkyu Lee et al. estimated BIPV predictions with less than 10% MAPE and 0.20 CV(RMSE), knowing that the predictor parameters they used are reduced compared to our current work [5]. Indeed, they used seven meteorological variables as predictors with better prediction performances for clear and slightly cloudy weather. Our approach is rather seasonal with much higher prediction performances up to 99.7% of R2 for ANN model. In machine learning energy prediction models, the performance and choice of the model will depend on i) type and characteristic of the model, ii) input variables such as climate data, building location, PV panel technology deployed. Indeed, the PV power produced by the BIPV depends mainly on several key variables such as the temperature T, DHI, DNI GHI and other variables have an indirect influence on the PP-BIPV production like temperature T and the SZA. The Fig. 3 shows the relationship of some variables such as T, SZA, and DHI on the hourly PV production over a period of
Fig. 3. Characteristics of several parameters such as solar irradiation, Climate data and BIPV for one week per hour in the case of winter season
one week. We observe a certain periodicity in the influence of the variables on the production because we have periods of nights in the days sampled. On the other hand, the DHI parameter follows the variation of the production by comparing it the temperature T and the SZA, of course the diffuse irradiance has always been a parameter of the so-called indirect methods of sizing the installation of PV panels in BIPVs. Figure 4 shows the results of the hourly BIPV forecasts obtained with the six different machine learning models with the four types of typical daily sky conditions. The results for each model show that RF outperformed the other models with an average hourly R2 of 99.8% during one season, followed by ANN and DTR (R2 = 99.7%), LR (R2 = 99.6%) and SVR (R2 = 94%). Table 3 lists the mean hourly errors (RMSE), MAE, MSE and R2 of each machine learning model for prediction during the four seasons from January 2019 to December 2019 without missing data. In addition, other parameters can complement the DHI parameter in the power production influence combination for PV power prediction, such as DNI and GHI as shown Fig. 4. Opacity is one of the parameters that is difficult to identify in our case study, it is for this reason that we push the proposed models further in the hourly prediction, since we will integrate the complementary parameters such as clearsky DHI, clearsky GHI, and clearsky DNI. It is known that the DNI parameter derives from GHI, the majority of works take into consideration only the values of GHI. Lara-Fanego et al. have developed predictive models of these parameters by statistical approaches offering a detailed description of the current and future state of the atmosphere and allowing a more precise disaggregation of global solar irradiance into its components [9]. Previous research has suggested models using a deep convolutional neural network (CNN) structure and an input signal decomposition algorithm for the prediction of photovoltaic power generation [14]. These obtained results are comparable to our results with a limited validity of the model from 1 h to 5 h with a correlation coefficient (R = 97.28%), for four weather conditions partial cloudy, cloudy-rainy, heavy-rainy, and sunny days. Indeed, we will evaluate the relevance of these two parameters GHI and DNI with and without clear sky in our prediction model. Figure 4 shows the results of hourly PP-BIPV predictions for one day of the season obtained using the five machine learning models with DNI, DHI, and GHI parametric variables. All models had better prediction performance during all four seasons except the SVR model. We note that the influence of DHI on prediction is better distributed in winter and fall than in spring. On the other hand, the influence of the three variables DNI, DHI and GHI is less pronounced in spring and very evident in summer, which is explained by the fact that the learning model makes combinations of influence of variables almost by season hence the reason to expose our prediction results by season as shown Fig. 4a. Diffused Normal Irradiance has almost kept its weight on the prediction models exclusively in summer. To complete our analysis we have evaluated the same parameters but this time in clearsky conditions as shown in Fig. 4b. Indeed, clearsky DHI consolidates its influence on the prediction especially in winter and Autumn. On the other hand, during spring the combination of the three clearsky variables DNI, clearsky DHI
and clearsky GHI is taken into account in the prediction to correct the lack of influence of these variables without clearsky. The encouraging results presented earlier can be explained by the concept of favorable combination of hour-by-hour predictor variables in a season for ANN model as clearly demonstrated in the spring season.
Fig. 4. Characteristics of: a) DHI, GHI and DNI b) clearsky DHI, clearsky DNI and clearsky GHI, parameters data and power BIPV hourly for one day in the season
4 Conclusion
In this research, we performed the PV prediction with the French weather forecast database to analyze the prediction data. The influential variables were introduced, and several models are proposed that can optimize both PV and weather variables. To compare the performances, various machine learning models are tested, and comparative experiments were performed. The proposed RF and ANN models provides the best results not only in terms of performance but also in finding an accurate prediction per season.
References 1. Wang, W., Rivard, H., Zmeureanu, R.: Floor shape optimization for green building design. Adv. Eng. Inform. 20, 363–378-4 (2006). https://doi.org/10.1016/00222836(81)90087-5 2. May, P., Ehrlich, H.-C., Steinke, T.: ZIB structure prediction pipeline: composing a complex biological workflow through web services. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 1148–1158. Springer, Heidelberg (2006). https://doi.org/10.1007/11823285 121 3. Zhao, P., Wang, J., Xia, J., Dai, Y., Sheng, Y., Yue, J.: Performance evaluation and accuracy enhancement of a day-ahead wind power forecasting system in China. Renew. Energy 43, 234–241 (2012) 4. Yang, D., Dong, Z.: Operational photovoltaics power forecasting using seasonal time series ensemble. Sol. Energy 166, 15 (2018). https://doi.org/10.1109/HPDC. 2001.945188
5. Lee, D., Jeong, J., Yoon, S.H., Chae, Y.T.: Improvement of short-term BIPV power predictions using feature engineering and a recurrent neural network. Energies 12, 3247 (2019). https://doi.org/10.3390/en12173247 6. Lee, D., Jeong, J.-W., Choi, G.: Short term prediction of PV power output generation using hierarchical probabilistic model. Energies 14, 2822 (2021). https:// doi.org/10.3390/en14102822 7. Wang, C., Ji, J., Yu, B., Xu, L., Wang, Q., Tian, X.: Investigation on the operation strategy of a hybrid BIPV/T fa¸cade in plateau areas: an adaptive regulation method based on artificial neural network. Energy 239, Part A, 122055 (2022) 8. Lara-Fanego, V., Ruiz-Arias, A.J., Pozo-V´ azquez, D., Santos-Alamillos, F.J., Tovar-Pescador, J.: Evaluation of the WRF model solar irradiance forecasts in Andalusia (Southern Spain). Sol. Energy 86, 2200–2217 (2018) 9. Lara-Fanego, V., Ruiz-Arias, J.A., Pozo-V´ azquez, D., Santos-Alamillos, F.J., Tovar-Pescador, J.: Evaluation of the WRF model solar irradiance forecasts in Andalusia (Southern Spain). Sol. Energy 86(8), 2200–2217 (2012). ISSN 0038092X, https://doi.org/10.1016/j.solener.2011.02.014 10. Pretto, S., Ogliari, E., Niccolai, A., Nespoli, A.: A new probabilistic ensemble method for an enhanced day-ahead PV power forecast. IEEE J. Photovolt. 12(2), 581–588 (2022). https://doi.org/10.1109/JPHOTOV.2021.3138223 11. Schratz, P., Muenchow, J., Iturritxa, E., Richter, J., Brenning, A.: Performance evaluation and hyperparameter tuning of statistical and machine-learning models using spatial data. arXiv 1830-11266 (2018) 12. Gandoman, F.H., Raeisi, F., Ahmadi, A.: A literature review on estimating of PVarray hourly power under cloudy weather conditions. Renew. Sustain. Energy Rev. 36, 579–592 (2016) 13. Rodr´ıguez, F., Fleetwood, A., Galarza, A., Font´ an, L.: Predicting solar energy generation through artificial neural networks using weather forecasts for microgrid control. Renew. Energy 126, 855–864 (2018) 14. Korkmaz, D., Acikgoz, H., Yildiz, C.: A novel short-term photovoltaic power forecasting approach based on deep convolutional neural network. Int. J. Green Energy 18(5), 525–539 (2021). https://doi.org/10.1080/15435075.2021.1875474
Improving Speaker-Dependency/Independency of Wavelet-Based Speech Emotion Recognition Adil Chakhtouna1(B) , Sara Sekkate2 , and Abdellah Adib1 1
2
Team Computer Science, Artificial Intelligence & Big Data, MCSA Laboratory, Faculty of Sciences and Technologies of Mohammedia, Hassan II University of Casablanca, Mohammedia, Morocco [email protected], [email protected] Higher National School of Arts and Crafts of Casablanca, Casablanca, Morocco
Abstract. The investigation of human emotions through speech signals is a complex process due to the difficulties inherent to their characterization. The performance of the emotion recognition rates depends on the extracted features, the nature of the chosen database and also the underlying language their contain. In this contribution, a new feature combination based on Mel Frequency Cepstral Coefficients (MFCC), Discrete Mel frequency Cepstral Coefficients (DMFCC) and Stationary Mel frequency Cepstral Coefficients (SMFCC) is performed. The extracted features from the speech utterances of the Berlin Emotional database (EMODB) are given to linear-based Support Vector Machine (SVM) for the classification. The proposed approach is evaluated in two distinct patterns, Speaker-Dependent (SD) and Speaker-Independent (SI) where the recognition rates of 91.4% and 80.9% are reached respectively in SD and SI experiments. Different setups in Speech Emotion Recognition (SER) show that our suggested method performs better than other SER state-of-the-art. Keywords: Speech Emotion Recognition · Support Vector Machine · Multilevel Wavelet Transform · DMFCC · SMFCC · MFCC · Feature selection
1
Introduction
Speech Emotion Recognition (SER) is an emerging research area that focuses on detecting the emotional responses of humans with the help of advanced technologies. Human speech contains a variety of valuable information about the speakers, such as their feelings, gender, age and other personal traits. There is a huge difference in the way people convey their emotions, that is, the same utterance may be spoken differently depending on the given situation.
Emotion recognition can be done in multiple forms, such as facial expressions, electroencephalography (EEG) signals, linguistic features and acoustic features. SER is a very challenging task that requires multidimensional approaches. The classification of emotions is mainly approached in the psychological community with the assumption that emotions can be represented in a three-dimensional space spanned by the activation, valence and power dimensions [1]. In order to provide robust SER applications adapted to real-world scenarios, determining which features to extract from the speech signal to analyze its underlying emotions is a key factor in improving the performance of any system. Relevant studies [2] have found that the combination of prosodic and spectral features such as Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coefficients (LPC), pitch and energy is interesting for SER. Most of the research on SER has focused extensively on the Speaker-Dependent (SD) situation, while Speaker-Independent (SI) SER has been covered in a limited number of works. Accordingly, to make a meaningful contribution to the SER literature, in this study, MFCC coefficients are extracted from the Discrete Wavelet Transform (DWT) and Stationary Wavelet Transform (SWT) sub-bands of the speech utterances. Detailed experiments and analysis of the SD and SI patterns for SER using the EMODB database are performed. For classification, a Support Vector Machine (SVM) fed by a feature-level fusion of DMFCC, SMFCC and MFCC is used. The content of this manuscript is presented as follows: the latest literature on SER is reviewed in Sect. 2 and a concise explanation of the discrete and stationary wavelet transforms is offered in Sect. 3. Afterwards, we present in Sect. 4 in detail the suggested emotion recognition scheme. The experimental setup and results are discussed in Sect. 5. Finally, the conclusion in Sect. 6 summarizes the current work's main contributions and possible guidelines for future research.
2
Speech Emotion Recognition Literature
The literature on SER has tended to focus on selecting the most appropriate attributes and classifiers. Prevailing approaches have used global features, also known as statistical features, and SVM, Gaussian Mixture Model (GMM) or a Deep Neural Network (DNN) as classifiers. Among these approaches, Gomes and El-Sharkawy [3] implemented a compound system of GMM and Universal Background Model (UBM) for SER; features comprising pitch, formants, energy and MFCC were extracted using the OpenSMILE toolkit (https://www.audeering.com/research/opensmile/). The proposed method enhances the performance of emotion recognition to 75.54% on all four emotional states of the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. Bandela and Kumar [4] presented a Semi-Nonnegative Matrix Factorization (semi-NMF) with Singular Value Decomposition (SVD) to optimize the number of speech features. The considered features are MFCC, Linear Predictive Cepstral Coefficients (LPCC) and Teager Energy Operator-Autocorrelation (TEO-AutoCorr). For model testing, the obtained accuracies
are 90.12% and 83.2% with SVM, and 89.3% and 78% with K Nearest Neighbor (KNN), on the EMODB and IEMOCAP databases respectively. Bojanić et al. [5] offered a novel system for the redistribution of calls in call centers based on SER. The system employed a set of features such as short-term pitch, zero crossing rate, energy, the voicing probability, 12 MFCC and their related statistics to characterize speech cues. Furthermore, a recognition rate of 91.5% is achieved using a linear Bayes classifier on the Govorne Ekspresije Emocija i Stavova (GEES) Serbian corpus. In [6], the authors implemented a variety of experimental setups for cross-lingual and within-lingual emotion recognition, observing the results of training and testing an SVM classifier on four different language corpora (Italian, English, German and Urdu). The proposed approach attained a high accuracy of 70.98% when training the model with the EMODB, EMOVO, and SAVEE databases and testing it on the URDU database. In addition to all the above-mentioned studies, a powerful handcrafted feature extraction is proposed in this study. First, baseline MFCC are extracted from the raw speech signals, then DMFCC and SMFCC are derived from the DWT and SWT sub-band coefficients of the speech utterances. Boruta [7] feature selection is employed to choose the most discriminating features. Finally, a linear SVM is used for the classification task.
3
Multilevel Wavelet Transform
Multilevel Wavelet Transform (MWT) analysis has been successfully applied in many different disciplines to analyze non-stationary signals. MWT provides more valuable properties compared to other transforms: it allows perfect reconstruction when analyzing and synthesizing signals and it is localized in both the temporal and frequency domains [8]. The core idea of the MWT is the deployment of filters to provide more information about voice signals. Two forms of MWT are investigated in this study, DWT and SWT.
3.1 1-D Discrete Wavelet Transform (DWT)
The Discrete Wavelet Transform (DWT) applies the transform step to successively decompose a signal into multiple separate components in distinct frequency bands. The input signal is basically generated from a unique function called the principal or mother wavelet; once this function is decomposed, wavelet coefficients are produced. The DWT is based on two types of functions: the wavelet function, called the mother wavelet, which is specified by the high-pass filter, and the scaling function, which is specified by the low-pass filter. Given an initial sequence of samples x[k] as input, detail cDi and approximation cAi coefficients are formed by cascading filters [9]; Eq. (1) explains this process:

x[k] ≡ [cAn, cDn, cDn−1, ..., cD2, cD1]    (1)

where n denotes the level of decomposition.
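As a concrete illustration of Eq. (1), the following minimal Python sketch (an assumption for illustration only: it relies on the PyWavelets package and uses a synthetic signal in place of a real speech utterance) performs a four-level decomposition with the db4 wavelet that is adopted later in this paper.

```python
import numpy as np
import pywt

# Synthetic stand-in for a one-second speech signal sampled at 16 kHz.
x = np.random.randn(16000)

# Four-level DWT with the Daubechies-4 (db4) mother wavelet.
# The returned list mirrors Eq. (1): [cA4, cD4, cD3, cD2, cD1].
coeffs = pywt.wavedec(x, "db4", level=4)
for name, c in zip(["cA4", "cD4", "cD3", "cD2", "cD1"], coeffs):
    print(name, len(c))
```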
3.2
1-D Stationary Wavelet Transform (SWT)
The baseline DWT algorithm can be modified to produce the SWT, which is constructed in such a way that the resulting output remains unchanged by avoiding the sub-sampling process of the input signals. Unlike the DWT, the scaling (decimation) process is ignored in the SWT. We only need to apply suitable high-pass and low-pass filters to the data at the output of each level to obtain two different sequences at the upcoming level [10]. In the SWT, the size of the resulting sub-band coefficients is identical to the size of the input signal. For each level, detail cDi and approximation cAi coefficients are produced. Equation (2) defines the n-level decomposition of the signal x[k]:

x[k] ≡ [(cAn, cDn), (cAn−1, cDn−1), ..., (cA2, cD2), (cA1, cD1)]    (2)

4
Methods
In this section, we propose a SER system that leverages the use of MWT to model the time and frequency information of speech signals. The proposed SER framework consists of two main steps, as shown in Fig. 1: decomposition of the speech signals using MWT, then MFCC extraction from the raw speech signals and from the sub-band coefficients of the DWT and SWT. The feature sets are concatenated and then selected to form the final feature set. More details are described in the next subsections.
4.1 Feature Extraction and Selection
In order to test the efficiency of the proposed SER system in the SD and SI patterns, features are extracted based on MFCC from both the raw speech signals and the decomposed sub-band coefficients of the DWT and SWT.
A- Mel Frequency Cepstral Coefficients (MFCC)
MFCC [11] are the most universally used features in a large number of SER related studies. MFCC play a vital role in ensuring the correct transmission of information in speech. They are also compatible with the capabilities of human perception. MFCC extraction can be done by a sequence of successive steps. In this study, the audio signal is first filtered using a pre-emphasis function with a coefficient of 0.97. Then, the speech signal is divided into a number of frames of fixed duration. The frames are formed with a 50% overlap with each other to gain further information about the speech utterances, and a Hamming window function is applied to each frame to avoid the spectral leakage problem. In order to obtain the amplitude spectrum for each windowed frame, the Fast Fourier Transform (FFT) is employed to convert the signal to the frequency spectrum. Afterwards, the frequency spectrum is aggregated into the so-called Mel bands using overlapping triangular filter banks to obtain the Mel spectrum. Finally, a set of cepstral coefficients is produced using the Discrete Cosine Transform (DCT).
In our work, 13 coefficients have been extracted for MFCC and 11 related statistics are calculated, including mean, variance, maximum, minimum, mean absolute value, median, kurtosis, skewness, mean/maximum, mean/minimum and standard deviation.
B- Discrete Mel Frequency Cepstral Coefficients (DMFCC)
As previously mentioned in Sect. 3, the purpose of using MWT is to split the speech signal into a series of multiple separated components in different frequency bands. In this study, the DWT is performed with a four-level decomposition; four detail sub-bands D1, D2, D3, D4 and one approximation sub-band A4 are obtained, and each sub-band is divided into equally sized frames. To produce the DMFCC features, the MFCC features are derived from each frame of the DWT sub-band coefficients of the raw speech signal.
Fig. 1. Diagram of the proposed SER system.
Among the several existing types of wavelets, the fourth-order Daubechies wavelet (DB4) was chosen. Figure 2 synthesizes the extraction of the DMFCC features and their corresponding statistics used in our study.
C- Stationary Mel Frequency Cepstral Coefficients (SMFCC)
In order to obtain the Stationary Mel Frequency Cepstral Coefficients (SMFCC) features illustrated in Fig. 3, a four-level wavelet decomposition was explored for the SWT. Eight sub-bands of detail and approximation coefficients (D1, D2, D3, D4, A1, A2, A3 and A4) were generated using DB4. Then MFCC features are extracted from each frame of the SWT sub-band coefficients to derive the required SMFCC.
Fig. 2. The DMFCC extraction process. (D1 , D2 , D3 , D4 ) produced details sub-bands coefficients, A4 approximation sub-band coefficient, Hi and Li denote the high-pass and low-pass filters respectively.
The statistical values were calculated for all frames of each sound file as well as for its four-level decomposition by SWT and DWT. After concatenating all the characteristics obtained (MFCC, DMFCC and SMFCC), we get a total feature vector of size (13 × 5 × 9) + (13 × 8 × 9) + (13 × 11) = 1664.
D- Boruta feature selection
Due to the high number of features extracted from data in many applications related to Machine Learning (ML), feature selection is often a decisive phase to reduce and eliminate unnecessary features. Boruta [7] is a wrapper algorithm that allows a precise and stable identification of the important and irrelevant attributes of a given feature set by repeatedly running the random forest classification algorithm. In order to improve the performance of our proposed system, Boruta feature selection is performed to select the most appropriate features.
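A compact sketch of the whole feature extraction stage is given below. It is only an illustration of the described pipeline, not the authors' implementation: it assumes the librosa, PyWavelets and SciPy packages, uses a single set of nine frame-level statistics for every block (the paper uses eleven statistics for the baseline MFCC block and nine for the wavelet blocks), and passes the original sampling rate to the sub-band MFCC computation for simplicity.

```python
import numpy as np
import librosa
import pywt
from scipy import stats

# Frame-level statistics applied to each of the 13 MFCC trajectories.
STATS = [np.mean, np.var, np.max, np.min, np.median, np.std,
         lambda c: np.mean(np.abs(c)), stats.skew, stats.kurtosis]

def mfcc_stats(signal, sr, n_mfcc=13):
    """13 MFCCs per frame, summarised over frames by the statistics above."""
    mfcc = librosa.feature.mfcc(y=np.asarray(signal, dtype=float), sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([np.apply_along_axis(f, 1, mfcc) for f in STATS])

def extract_features(signal, sr, wavelet="db4", level=4):
    feats = [mfcc_stats(signal, sr)]                    # baseline MFCC block

    # DMFCC: MFCCs of the DWT sub-bands (A4 and D1..D4).
    for band in pywt.wavedec(signal, wavelet, level=level):
        feats.append(mfcc_stats(band, sr))

    # SMFCC: MFCCs of the SWT sub-bands (Ai and Di for each level);
    # pywt.swt needs the signal length to be a multiple of 2**level.
    padded = np.pad(np.asarray(signal, dtype=float),
                    (0, (-len(signal)) % (2 ** level)))
    for cA, cD in pywt.swt(padded, wavelet, level=level):
        feats.append(mfcc_stats(cA, sr))
        feats.append(mfcc_stats(cD, sr))

    return np.concatenate(feats)   # one fixed-length vector per utterance
```

With nine statistics everywhere, this sketch yields 13 × 9 × (1 + 5 + 8) = 1638 values; reproducing the exact 1664-dimensional vector of the paper only requires using the eleven statistics listed above for the baseline MFCC block.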
4.2
Classification
The SVM is a supervised ML algorithm designed for classification and regression tasks. The power of SVM lies in its ability to separate the input data into several classes by mapping them into a high-dimensional space using different kernels such as the Radial Basis Function (RBF), linear and polynomial kernels. In this study, for the classification task, our model is trained using an SVM with a linear kernel.
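A possible realisation of this selection-plus-classification stage with scikit-learn and the boruta package is sketched below; the hyper-parameters are illustrative assumptions, not the values used by the authors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from boruta import BorutaPy   # pip install Boruta

def boruta_svm_accuracy(X, y, seed=0):
    """Boruta feature selection followed by a linear SVM with 10-fold CV."""
    X, y = np.asarray(X), np.asarray(y)
    rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=seed)
    selector = BorutaPy(rf, n_estimators="auto", random_state=seed)
    selector.fit(X, y)                      # decides which features are relevant
    X_sel = selector.transform(X)           # e.g. 1664 -> 557 columns in the paper

    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    return cross_val_score(clf, X_sel, y, cv=cv).mean()
```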
Fig. 3. The SMFCC extraction process. (D1 , D2 , D3 , D4 ) are details sub-bands coefficients, (A1 , A2 , A3 , A4 ) are approximations sub-bands coefficients, Hi and Li stands for the high-pass and low-pass filters respectively.
5
Experimental Setups and Results
In this section, we present the different experimental setups used in our work regarding the used database, the Speaker-Dependency/Independency patterns and various evaluation metrics like the confusion matrix and the accuracy.
5.1 Speaker-Dependent (SD)
The proposed method was evaluated on the acted EMODB benchmark database [12]. It was generated by ten professional speakers (5 men and 5 women), aged from 21 to 34 years old, producing ten native German statements (5 short and 5 long). Each recorded utterance was obtained at a sampling rate of 48 kHz and then downsampled to 16 kHz. The EMODB includes seven basic emotions: anger, boredom, fear, happiness, sadness, disgust and neutral. There are two main patterns to
explore for SER: SD and SI. For the SD experiments, in order to secure the validity and significance of our results, a 10-fold cross-validation scheme is applied to fit our model. For each iteration the accuracy is calculated and then the emotion recognition rate is obtained by averaging all accuracies. The number of selected features using the Boruta algorithm is reduced from 1664 to 557 attributes for both SD and SI. After applying the linear SVM to the selected features, the confusion matrix of the suggested model on SD SER is displayed in Table 1. The correctly recognized samples for each emotion are located on the diagonal of the confusion matrix. In addition, the distribution of the majority of the diagonal values was expected with high percentages. The anger, fear and sadness emotions are easily classified, achieving the highest accuracies of 93.7%, 94.2% and 96.7% respectively. To a lesser extent, neutral, boredom and disgust are recognized at 92.4%, 90.1% and 89.1% respectively. Finally, happiness has the lowest recognition rate of 81.6%, since 14% of its samples are misclassified as the anger state.
Table 1. The EMODB confusion matrix of the proposed model after feature selection for SD mode (%).
Emotion     Anger   Boredom   Disgust   Fear    Happiness   Neutral   Sadness
Anger        93.7     0         0        0.7      5.5         0         0
Boredom       0      90.1       0        0        0           8.6       1.2
Disgust       4.3     2.1      89.1      2.1      0           0         2.1
Fear          1.4     0         1.4     94.2      2.8         0         0
Happiness    14.0     0         0        2.8     81.6         1.4       0
Neutral       0       7.5       0        0        0          92.4       0
Sadness       0       3.2       0        0        0           0        96.7

5.2 Speaker-Independent (SI)
In order to evaluate the robustness of the proposed SER system in real-life applications, the SI scenario is more reliable since no prior knowledge about the speaker is available. In this section, SI experiments were performed using the Leave-One-Subject-Out (LOSO) test. In the case of the EMODB database, at each iteration 9 speakers are used to train the linear SVM model and the 10th speaker is kept for the test. The confusion matrix in the SI pattern was obtained by averaging the accuracies over the ten iterations. The average recognition rate is 80.9%. Accuracies of 86.6%, 82.7%, 86.9%, 73.9%, 66.1%, 79.7% and 88.7% were achieved for anger, boredom, disgust, fear, happiness, neutral and sadness respectively. Emotions like sadness, disgust and anger are recognized with the highest accuracies. The accuracy of boredom and fear is improved by 8.7% and 7.3% respectively when compared to the accuracy without feature selection. The largest misclassification concerns happiness, with 28.1% of its samples confused with the anger state.
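The LOSO protocol can be expressed directly with scikit-learn's LeaveOneGroupOut splitter. The sketch below is an illustration only (feature matrix, labels and speaker identifiers are placeholders); it returns one accuracy per held-out speaker, whose mean corresponds to the reported SI recognition rate.

```python
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def loso_scores(X, y, speaker_ids):
    """Train on nine EMODB speakers, test on the held-out one, for every speaker."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    scores = cross_val_score(clf, X, y, groups=speaker_ids, cv=LeaveOneGroupOut())
    return scores            # len(scores) == number of distinct speakers

# Example usage (shapes only): loso_scores(features, labels, speakers).mean()
```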
5.3
Comparison with SER Studies
In this section, we are going to compare our obtained results on the EMODB database in both SD and SI modes with other previous studies. Table 2 summarizes several approaches used for SD/SI SER in terms of feature extraction, selected features, classifiers and the recognition rate.

Table 2. Comparison with the state-of-the-art for SD/SI SER using EMODB.

Mode | Reference | Feature extraction - Number of features | Feature selection - Selected features | Classifier | Accuracy
SD | [13] | Mel Frequency Magnitude Coefficient (MFMC) - 360 | No | SVM (10-fold CV) | 81.50%
SD | [14] | Formants, energy, pitch, spectral, MFCC, LPC and PLP - 286 | Sequential Forward Floating Selection (SFFS) - 15 | Naive Bayes (10-fold CV) | 82.26%
SD | [15] | Wavelets features - 7680 | Neighborhood Component Analysis (NCA) - 1024 | SVM (10-fold CV) | 89.16%
SD | Proposal | SMFCC, DMFCC and MFCC - 1664 | Boruta - 557 | SVM (10-fold CV) | 91.4%
SI | [16] | Wavelet Packet Coefficients (WPC) - 480 | SFFS - 150 | SVM (LOSO) | 79.5%
SI | [17] | INTERSPEECH 2010 feature set (OpenSMILE) - 1582 | Principal Component Analysis (PCA) - 100 | SVM (LOSO) | 77.49%
SI | Proposal | SMFCC, DMFCC and MFCC - 1664 | Boruta - 557 | SVM (LOSO) | 80.9%
It is noticeable from Table 2 that the proposed feature extraction based on SMFCC, DMFCC and MFCC yields a clear improvement over the existing SER systems. In addition, it is clear that the Boruta feature selection method provides an additional benefit to improve the recognition rate of the proposed system.
6
Conclusion and Future Work
SER remains an interesting area for research, and improving the performance of SER systems is still a great challenge. Our contribution in this paper is the extraction of new features named SMFCC and DMFCC: MFCC features were extracted from the SWT and DWT sub-band coefficients. Then, the Boruta feature selection algorithm was adopted to retrieve the most suitable features. The experimental results on the benchmark EMODB database allow us to compare the performance of our proposed model with other previous SER works. Recognition rates of 91.4% and 80.9% were achieved on the SD and SI patterns respectively. The accuracy of the suggested method could be improved in future work by extracting other features using MWT. In order to evaluate the robustness of
our SER system, we will seek to study the impact and the dependency of the language on the extracted features (SMFCC, DMFCC and MFCC) using additional classification algorithms. In the mid-term, since deep learning approaches are strong on large amounts of data, the implementation of powerful classifiers such as Bi-LSTM and Multiplicative LSTM is expected to enhance the SER performance.
Acknowledgements. This work was supported by the Ministry of Higher Education, Scientific Research and Innovation, the Digital Development Agency (DDA) and the CNRST of Morocco (Alkhawarizmi/2020/01).
References 1. Schlosberg, H.: Three dimensions of emotion. Psychol. Rev. 61(2), 81 (1954) 2. Chakhtouna, A., Sekkate, S., Adib, A.: Improving speech emotion recognition system using spectral and prosodic features. In: Abraham, A., Gandhi, N., Hanne, T., Hong, T.P., Nogueira Rios, T., Ding, W. (eds.) ISDA 2021. LNNS, vol. 418, pp. 1–10. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-96308-8_37 3. Gomes, J., El-Sharkawy, M.: i-Vector algorithm with gaussian mixture model for efficient speech emotion recognition. In: 2015 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 476–480. IEEE (2015) 4. Bandela, S.R., Kishore, K.T.: Speech emotion recognition using semi-NMF feature optimization. Turkish J. Electr. Eng. Comput. Sci. 27(5), 3741–3757 (2019) 5. Bojanić, M., Delić, V., Karpov, A.: Call redistribution for a call center based on speech emotion recognition. Appl. Sci. 10(13), 4653 (2020) 6. Latif, S., Qayyum, A., Usman, M., Qadir, J.: Cross lingual speech emotion recognition: Urdu vs. western languages. In: 2018 International Conference on Frontiers of Information Technology (FIT), pp. 88–93. IEEE (2018) 7. Kursa, M.B., Rudnicki, W.R., et al.: Feature selection with the Boruta package. J. Stat. Softw. 36(11), 1–13 (2010) 8. Burrus, C.S., Gopinath, R.A., Guo, H., Odegard, J.E., Selesnick, I.W.: Introduction to wavelets and wavelet transforms: a primer. Englewood Cliffs (1997) 9. Sekkate, S., Khalil, M., Adib, A., Ben Jebara, S.: An investigation of a feature-level fusion for noisy speech emotion recognition. Computers 8(4), 91 (2019) 10. Nason, G.P., Silverman, B.W.: The stationary wavelet transform and some statistical applications. In: Antoniadis, A., Oppenheim, G. (eds.) Wavelets and Statistics. LNS, vol. 103, pp. 281–299. Springer, New York (1995). https://doi.org/10.1007/ 978-1-4612-2544-7_17 11. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Sig. Process. 28(4), 357–366 (1980) 12. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B., et al.: A database of German emotional speech. Interspeech 5, 1517–1520 (2005) 13. Ancilin, J., Milton, A.: Improved speech emotion recognition with MEL frequency magnitude coefficient. Appl. Acoust. 179, 108046 (2021) 14. Karimi, S., Sedaaghi, M.H.: Robust emotional speech classification in the presence of babble noise. Int. J. Speech Technol. 16(2), 215–227 (2013) 15. Sönmez, Y.Ü., Varol, A.: A speech emotion recognition model based on multi-level local binary and local ternary patterns. IEEE Access 8, 190784–190796 (2020)
16. Wang, K., Su, G., Liu, L., Wang, S.: Wavelet packet analysis for speaker-independent emotion recognition. Neurocomputing 398, 257–264 (2020) 17. Kanwal, S., Asghar, S.: Speech emotion recognition using clustering based GA-optimized feature set. IEEE Access 9, 125830–125842 (2021)
Improving the Quality of Service Within Multi-objective Customer-Oriented Dial-A-Ride Problems

Sonia Nasri1, Hend Bouziri2,3, and Wassila Aggoune-Mtalaa3(B)

1 Higher Business School of Tunis, Manouba University, Manouba, Tunisia
2 Higher School of Economic and Commercial Sciences Tunis, Tunis University, Tunis, Tunisia
3 Luxembourg Institute of Science and Technology, L-4362 Esch/Alzette, Luxembourg
[email protected]

Abstract. Seeking a trade-off between the service offered to customers and the interests of the system provider is a challenging task within Dial-A-Ride Problems. Continuous reflection is required, especially in the resolution of transport problems focusing on the requirements of the customers. In this paper, we address a customer oriented Dial-A-Ride Problem which minimizes the total transport costs while improving the quality of service provided to the customers. A multi-objective formulation is proposed in order to minimize the total travel costs and the total waiting times. Real-life instances of the problem are solved using the branch and bound method embedded in the CPLEX solver, producing exact solutions of the problems and, in the worst cases, lower bounds of the search space. Keywords: On-demand transport · Mobility · Pick up and delivery · Dial-A-Ride · Customer-dependent DARP · Quality of service · Exact optimization methods
1
Introduction
Mobility-as-a-Service (MaaS) solutions continue to evolve, enabling equitable access to employment, social life, markets and services. In this way, transport systems should be designed to enhance the transport quality and optimize the economic profitability. The problem which we address in this paper is a particular variant of an on-demand transport service, commonly known as the Dial-A-Ride Problem (DARP). In this paper, we focus mainly on a bi-objective customer oriented formulation to optimize both the riding costs and the quality of service.
1.1 Presentation of DARP
This problem was firstly defined by Cordeau in [9] and considered by Parragh et al. [31] as a variant of the pickup and delivery problem. Further, authors such
as Cordeau et al. [11] and Rekiek et al. [33] referred to it as a transport on-demand problem. Also, the DARP is treated as a door-to-door transportation system by Melachrinoudis et al. [21] and Rekiek et al. [33]. The problem is proved to be NP-Hard by Healy et al. [15], where the aim is to minimise the travel costs under vehicle and customer request constraints. A request here is a set of passengers who demand to be transported from one location to another. In all DARP variants, standard constraints related to the requests and the vehicles have to be satisfied. In particular, time windows are fixed for each request and customers have to specify the pickup or the delivery time windows. In addition, a maximal riding time is proposed for all the requests, ensuring a time limit for each passenger riding aboard a vehicle. Thus, users have inconvenience times (time spent aboard a vehicle) which must respect the defined maximal ride times. Given the fact that each request has a pickup node and a delivery one, the vehicles visit the nodes respecting both the precedence and the pairing of origin and destination. Moreover, each couple of origin and destination nodes must be visited by the same vehicle. For each request, a service duration corresponds to the time for loading and unloading passengers. Furthermore, the number of passengers who are picked up from various locations is lower than the vehicle capacity. All the vehicles begin their tour from the predefined depot and return to it. In addition, the vehicles have a common predefined maximal tour duration to respect. Customers' access to mobility has evolved rapidly due to technological advancements and changing customer preferences. Mobility consumers consider several factors such as price, waiting time, travel time, convenience, and traveller experience when selecting their modes of transport [37]. These factors should be well addressed to ensure the improvement of the quality of service. This quality is measured in [16] through waiting time and riding time metrics according to customers' preferences, for improving both the service quality and the distances travelled by the vehicles. Thus, one way to address today's transport demands is to be able to align with the specific transport needs of the customers. This is what is highlighted by Paquette et al. [27], fostering the modelling of mobility on-demand problems according to a customer-oriented perspective.
1.2 The Customer Oriented Design of the Quality of Service in DARPs
In the survey of Nasri et al. in [25], two main fields of research related to the quality of service levels in DARPs are emphasized. In the first group of works, the DARP is qualified as classical when it aims at minimizing the travel costs ensuring a common design of the service quality for all the users, (e.g. Parragh et al. [30], Paquette et al., [29], Lehu´ed´e et al., [18], Jorgensen et al. [17] and Melachrinoudis et al. [21]). In the second trend of works, the quality of service is studied from a customer-oriented point of view. In this paper, we consider a customer-oriented service quality in which each customer specifies his/her own riding time and the time windows are designed
incorporating more customers' preferences to increase their satisfaction. The latter is well addressed in a customer oriented DARP model introduced by Nasri et al. in [24]. In their model, the authors defined fitting time windows which are designed to minimize the waiting time of the passengers aboard a vehicle. The outputs of such a customized DARP model showed a positive impact on the level of service provided to the customer. Therefore, we intend to include such a customized design of time windows in a new bi-objective DARP model. In other words, we seek possible interesting trade-offs between costs and the service quality in terms of waiting time.
1.3 Contributions
Our contribution is twofold: we firstly propose a new bi-objective customer oriented DARP model in which the travel cost and the total waiting time have to be minimized separately. In this model, customized time windows are considered for reducing unnecessary waiting times. Secondly, this bi-objective problem is solved using the branch and bound method of the CPLEX optimizer tool. The solutions are computed based on a lexicographic order of the two objectives considered, namely the total travel cost and the total waiting time. The implementation of the model with the total waiting time as the prioritized objective to minimize is the one which produces the best results.
2
Literature Review
In this section, we give a brief review of how the design of the service quality has been treated in the mono-objective and multi-objective DARPs of the literature. We give particular attention to the works which focused on the service quality from a customer oriented viewpoint.
2.1 The Quality of Service in DARPs
The quality of service is generally defined as the degree of conformity to customers’ requirements. As an example, in the work of Braekers et al. [5], the time windows and a maximum riding time are specified for each customer request. A customized design of the quality of service consists in using a maximum riding time for designing the time windows as in the works of Molenbruch et al. and Chassaing et al. [7,22]. In the problem proposed by [32], time windows are defined for the locations of the destinations. They are expressed by many customer-dependent terms such as the service duration, the transit time and the maximal ride time. Various definitions of the service quality are found in DARPs. A set of service quality specifications are cited in [29] and differently expressed. For instance, the time windows are designed as constraints in the works of Melachrinoudis et al. [20], Parragh et al. [32], Cordeau [8] and Rekiek et al. [33]). Many expressions of maximum ride time are also found in [12]. The total waiting time can be
considered either during the customer ride or before his/her pickup [12] and [17]. Whatever the design of the quality of service, the authors in [28] emphasized the benefit of addressing both the costs and the service quality terms in a common model. This aims at obtaining a good trade-off between the total transportation costs for the system provider and the service quality for the customers. Different operational scenarios were investigated to help transport providers in their decision making, by varying several parameters affecting both the service quality and the economic profitability [19,22]. The time windows and the route duration are factors that considerably influence both the operating cost and the quality of the transport system for diverse types of demand distributions [26]. Therefore, their relaxation leads to a decreased operating cost and fewer vehicles needed for the transportation. More precisely, wider time windows and larger route durations provide more flexibility to the system provider, but a lower level of service for customers.
2.2 Customized Service Quality in Multi-objective DARPs
One efficient way of addressing the total costs and the quality of service at the same time is to propose a multi-objective model for the DARP. In this field of research, few related works have been proposed. One can cite the model developed by Guerriero et al. in [14] where, in addition to the traditional capacity constraints of the vehicles, the pairing, the precedence, and the time windows conditions were also treated. The authors aimed at optimizing the weighted and normalized maximum total ride time and the total waiting time. To ensure the satisfaction of both the drivers and the customers, the authors proposed a bi-objective set partitioning model taking into account simultaneously the total waiting time of the drivers at the nodes and the maximum total ride time. Moreover, a bi-objective dial-a-ride problem for patient transportation was considered by Molenbruch et al. in [23]. It aimed to minimize the total distance travelled in a DARP while taking into account the trade-off between the operational efficiency and the quality of service. This trade-off is expressed through an additional objective minimizing the total user ride time. The authors incorporated more real-life characteristics into the general DARP benchmarks [8,10], related to a real patient transportation system in Belgium. The real-life simulations showed the impact on the costs of a service provider while respecting the constraints related to the maximum ride time and the time windows. The Multi-criteria Dial-a-Ride Problem proposed by Atahran et al. in [3] involved three objective functions which expressed the total costs of the transportation, the quality of service measured through the users' dissatisfaction, and the impact of the CO2 emitted by the transportation system on the environment. To minimize the value of the users' dissatisfaction, the authors computed the optimal starting time of service at each pickup and delivery location. Furthermore, they proposed two parameters to penalize the earliness and tardiness at the pickup and delivery locations and a penalty term for the users' waiting time.
In this present work we propose to improve the quality of service by designing customer oriented time windows and assessing the total waiting times in a biobjective framework.
3 The Multi-objective Customer Oriented DARP
3.1 Description of the Problem
The Multi-Objective Customer oriented DARP consists in satisfying a set of a priori known transportation requests using a set of homogeneous vehicles while enhancing customer oriented variables. Some common characteristics are aligned with those of the classical DARP [9].
– A request corresponds to a limited number of people who require to be transported from one location to another.
– The number of passengers related to a request is the same at the pickup and delivery sites.
– The set of requests corresponds to that of the pickup nodes in a transportation network.
– The depot is concerned by the departure and the arrival of all the vehicles and no demand is assigned to it.
– A maximal total duration is fixed to limit the total vehicle travel time.
– Each customer has a riding time which is the time interval from the time of his/her pickup to that of the delivery, including waiting times which occurred during the vehicle route.
– A maximal capacity is supposed for all the vehicles.
– A service duration is devoted to the loading of people at a node. This time is the same at the pickup and the delivery sites.
In this bi-objective model of DARP, some customer-oriented constraints are imposed when designing the maximal riding time and the time windows. These constraints are properly modelled considering each customer demand of transport. Indeed, a maximal riding time, which is the maximal time spent aboard a vehicle, is fixed for each customer. The time windows are imposed at each of the origin and the destination nodes. Besides, these time windows are updated based on the customers' expectations, including the effective riding time and the service time. These customer oriented constraints allow reducing unnecessary waiting times when satisfying the requests of transportation. The waiting time at a node is the one spent by the customer between the arrival of the vehicle at this node and the effective pick up operation. An example of a produced waiting time at a node is illustrated in Fig. 1. A time window is indicated by its lower and upper bound. The vehicle's arrival time and departure time are also presented.
Fig. 1. A time schedule at a visited node
A beginning of service is defined for respecting the time window. The waiting time is the one between the vehicle's arrival time at a node and the beginning of service at this node. The vehicle's departure time starts after the loading or unloading of people within the service time. In the following section, we focus only on the constraints which are related to the design of the service quality in the proposed bi-objective customized DARP. More precisely, we define the maximal riding time, the time windows, and the time schedule for the transportation of the customers.
3.2 Mathematical Formulation of the Problem
The customer oriented bi-objective DARP is defined on a complete directed graph G = (N, A), where N is the set of vertices and A is the set of arcs. The set of nodes i ∈ N is defined as {0...2n+1}, which includes the depot in two copies (i = 0 at the departure and i = 2n+1 at the arrival), the set of pickup nodes {1...n} and the set of delivery nodes {n+1...2n}. The set of arcs is defined as {(i,j) with i,j ∈ N and i ≠ j}. We assume that a fleet of homogeneous vehicles is available at the depot to satisfy a given number of requests. A request is defined by an origin node i ∈ {1...n} and a destination one (i+n) ∈ {n+1...2n}.
3.2.1 Notations Used in the Formulation
The problem under consideration, which is a variant of the DARP, is NP-Hard [15]. It can be modelled as a mixed integer linear program (MILP) using the following decision variables: the first one is a routing variable whereas the others are scheduling variables.
– x^v_(i,j) is equal to 1 if and only if the vehicle v goes from node i to node j, 0 otherwise.
– B_i^v, Ar_i^v, W_i^v, and R_i^v are real variables which indicate respectively the starting time of the service, the arrival time, the waiting time, and the riding time at node i if the latter is visited by a vehicle v.
The notations used in the problem formulation are listed in Table 1.
Table 1. The problem notations

Parameter        Description
n                Total number of requests
m                Total number of vehicles
t_i              Service time at node i ∈ N
tr_(i,j)         Transit time on an arc (i,j) ∈ A
mrt_i            Maximal riding time for a request i ∈ {1...n}
[inf_i, sup_i]   Time window at node i ∈ N
The vehicle's arrival time Ar_i^v at node i is the time spent from the depot to this node i. Note that this arrival is allowed before the beginning of service B_i^v but not after. Therefore, only a positive value for the waiting time W_i^v at a node i is considered. It is equal to the time between the arrival time Ar_i^v and the starting of the service B_i^v ∈ [inf_i, sup_i], where inf_i and sup_i are the earliest and the latest service times respectively. The departure time at node i is D_i^v = B_i^v + t_i. In this problem the service time t_i is equal to the passengers' loading time. Given that no demand is assigned at the depot, the departure time at the depot is equal to the beginning of service; thus B_0^v = D_0^v and B_{2n+1}^v = D_{2n+1}^v. In addition, the arrival time at the depot is initialized as B_0^v = Ar_0^v as well as B_{2n+1}^v = Ar_{2n+1}^v. Note that B_0^v is initially equal to B_i^v − tr_(0,i) and B_{2n+1}^v is initially equal to B_i^v − tr_(i,2n+1). Moreover, some parameters of the requests are supposed to be the same at the origin and the destination, as for instance t_i = t_{i+n} and mrt_i = mrt_{i+n}.
3.2.2 A Bi-Objective Model of the Customer Oriented DARP
In this problem, we consider two objectives to minimize. The first objective function (1) consists of the total travel costs generated by the visited arcs in the vehicle tours. A travel cost c_(i,j) on a visited arc (i,j) is the sum of the transit time tr_(i,j) and the service time t_i at node i. The second objective function (2) to minimize is the sum of the total waiting times spent during the vehicle tours.

Min f1(S) = Σ_{v=1}^{m} Σ_{(i,j)∈A} c_(i,j) ∗ x^v_(i,j)    (1)

Min f2(S) = Σ_{v=1}^{m} Σ_{(i,j)∈A} W_i^v    (2)
3.2.2.1 The Requests Scheduling Time Constraints
The constraints (3) ensure the precedence of the starting service times between the nodes and the inequalities (4) stipulate that the pickup stations precede the delivery stations in a route.

B_j^v ≥ B_i^v    ∀v ∈ {1...m}, ∀(i,j) ∈ A    (3)

B_{i+n}^v ≥ B_i^v + R_i^v + t_i    ∀v ∈ {1...m}, ∀i ∈ {1...n}    (4)
Constraint (5) expresses that the arrival time at a node j is the sum of the arrival time at the predecessor node i, the service duration, the transit time on the arc (i,j), and the waiting time at node i.

Ar_j^v ≥ Ar_i^v + t_i + tr_(i,j) + W_i^v    ∀i,j ∈ {1...n}, ∀v ∈ {1...m}    (5)
3.2.2.2 Computation of the Customer Oriented Variables
The customer oriented riding time is expressed by (6). Time windows are computed by (7)–(10) using the riding time, the service duration (which is the time for loading people), and the initial selected time windows. Moreover, the waiting times are computed in (11).

tr_(i,i+n) ≤ R_i^v ≤ mrt_i    ∀i ∈ {1...n}, ∀v ∈ {1...m}    (6)

Min(inf_(i+n) − mrt_i − t_i, inf_i) ≤ B_i^v + M ∗ (1 − var)    ∀i ∈ {1...n}, ∀v ∈ {1...m}    (7)

B_i^v − M ∗ var ≤ Min(sup_(i+n) − t_i − R_i^v, sup_i)    ∀i ∈ {1...n}, ∀v ∈ {1...m}    (8)

Min(inf_i − mrt_i − t_i, inf_(i+n)) ≤ B_(i+n)^v + M ∗ (1 − var)    ∀i ∈ {1...n}, ∀v ∈ {1...m}    (9)

B_(i+n)^v − M ∗ var ≤ Min(sup_i − t_i − R_i^v, sup_(i+n))    ∀i ∈ {1...n}, ∀v ∈ {1...m}    (10)

W_i^v ≥ B_i^v − Ar_i^v    ∀i ∈ {1...n}, ∀v ∈ {1...m}    (11)
Lastly, Eq. (12) specifies the binary nature of the decision variables x^v_(i,j). Finally, inequalities (13) ensure the positivity of the scheduling decision variables.

x^v_(i,j) ∈ {0, 1}    ∀(i,j) ∈ A, ∀v ∈ {1...m}    (12)

B_i^v ≥ 0, Ar_i^v ≥ 0, W_i^v ≥ 0, R_i^v ≥ 0    ∀i ∈ N, ∀v ∈ {1...m}    (13)
As compared with a classical DARP, here the time windows are designed for the origin nodes and for the destination nodes separately. Equations (7) and (8) successively present the lower and upper bounds for the beginning of service at the origins. Equations (9) and (10) define the lower and upper bounds for the beginning of service at the destinations. The new bounds of the time windows are calculated based on the initial intervals [inf_i, sup_i] and [inf_(i+n), sup_(i+n)]. The new bounds are the minimal values between the initial time windows and those adjusted to the customers' specifications. The latter are the effective customer riding time R_i^v, the service time t_i, and the maximal ride time mrt_i, which are used in this redefinition. In this way, a new beginning of service is calculated,
leading to a minimization of the waiting time expressed by (11). To achieve the linearisation of the time window constraints, we use a large constant M and a boolean variable var in a further implementation of the model. Figure 2 shows a time schedule at a node i in a tour of a vehicle v before and after the computation of the time window. In the second case, we show an enhanced lower bound inf_i and a better upper bound sup_i at node i. Thus, for the same arrival time Ar_i^v computed by (5), an enhanced value of B_i^v ∈ [inf_i, sup_i] can be obtained through the time window Eqs. (7)–(10). Note that we have assumed that B_i^v = inf_i at the node i visited by the vehicle v.
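To make the effect of Eqs. (7)–(11) tangible, the short Python sketch below evaluates the tightened bounds for one hypothetical request. All numeric values are illustrative, and the big-M linearisation terms are omitted, i.e. the indicator variable var is assumed to activate the customized bound.

```python
def tightened_time_windows(inf_i, sup_i, inf_d, sup_d, t_i, mrt_i, r_iv):
    """Customer-oriented bounds following Eqs. (7)-(10), big-M terms dropped."""
    pickup = (min(inf_d - mrt_i - t_i, inf_i),      # Eq. (7): lower bound at the origin
              min(sup_d - t_i - r_iv, sup_i))       # Eq. (8): upper bound at the origin
    delivery = (min(inf_i - mrt_i - t_i, inf_d),    # Eq. (9): lower bound at the destination
                min(sup_i - t_i - r_iv, sup_d))     # Eq. (10): upper bound at the destination
    return pickup, delivery

def waiting_time(arrival, start_of_service):
    """Eq. (11): the waiting time at a node is never negative."""
    return max(0.0, start_of_service - arrival)

# Hypothetical request: origin window [100, 200], destination window [150, 260],
# service time 5, maximal riding time 90, effective riding time 40.
print(tightened_time_windows(100, 200, 150, 260, 5, 90, 40))
print(waiting_time(arrival=110, start_of_service=125))   # -> 15.0
```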
Fig. 2. A case of a reduced waiting time at a visited node i in the tour of v.
In addition, the new beginning of service impacts the departure time at node i and the arrival time at the next visited node j. As a consequence, both the customer and the service provider may benefit from the reduction of the waiting times as well as of the unnecessary delays during a trip. The equations of the model are further implemented and tested on real instances of DARPs in the next experimental section. Although metaheuristics [1,13,36] and especially hybrid ones [34,35] are effective and fast for solving complex problems such as scheduling and transportation ones, we prefer here to have a first experimentation with an exact resolution [2,4] to find the optimal solution of each of the customer oriented DARP instances foreseen.
4
Preliminary Tests
In this section, we propose an exact resolution of the problem under consideration by CPLEX. We compare the results obtained with the lower bounds found by CPLEX when solving a mono-objective version of the customer-oriented DARP [24]. The models were implemented in the Optimization Programming Language
(OPL) within the CPLEX Studio IDE 12.6 solver. To solve the models, the CPLEX solver uses the lexicographic method. The models were tested on a computer with a 10th generation Intel Core i7 processor and 8 GB of RAM. The two types of problems are applied on the benchmark test instances taken from the work of Chassaing in [6]. These instances (I) are mobility on-demand problems built upon realistic geographical data where pickup and delivery points are geolocated sites defined by a geographical information system. Large distances separate the locations and the networks are of different complexities. These instances include customer oriented values with parameters related to the quality of service. Indeed, maximal riding times are specified for each customer as well as time windows which are imposed on the pickup and delivery nodes. For these requests, the time windows are of different sizes depending on the customers' preferences. The maximal riding times are also expressed according to the distances between the requests of the customers. A homogeneous fleet of vehicles is available at the depot with a maximal capacity of 8 persons. A maximal total tour duration equal to 480 is considered. There are 96 instances, but only 12 examples are reported for this experimentation. The selected instances are those solved by CPLEX within a limit of four hours. In Table 2, we present the total costs and the total waiting times obtained in the case of the mono-objective and the bi-objective formulations of the customer oriented DARP. These two quantities are denoted f1(S) for the total travel costs and f2(S) for the total waiting time. In the Mono-objective case, only f1(S) is the objective function. In the bi-objective case, two types of results are reported depending on the lexicographic order adopted for the two objectives. In the Bi-objective1 formulation, the prioritized objective is the total costs f1(S). In the Bi-objective2 columns, the results are obtained when the total waiting time f2(S) is minimized in priority. In the last columns, we highlight the total waiting times improved when compared to the ones obtained in the mono-objective tests. What is remarkable in Table 2 is that, as compared with the mono-objective case, one can observe an improvement in the total waiting time obtained in the bi-objective cases regardless of the priority given to the objective functions. When the total costs are the prioritized objective, the improvements are obtained on seven out of the twelve instances tested. In the case when the total waiting time is the objective to minimize in priority, these total waiting times are improved on all the instances except one. As an example, for the instances d75, d52, and d70 the best total waiting times obtained are respectively (f2(S) = 80.66), (f2(S) = 385.66), and (f2(S) = 203.44) for the Bi-objective2 configuration. Moreover, in the case that the prioritized objective is the total costs, these costs are reduced as compared with the mono-objective case five times out of the twelve instances tested. When this cost is not improved, the total waiting times are reduced on six instances. In total only two instances are not improved in one objective or the other. This shows that implementing a bi-objective model of the problem is beneficial in almost all the cases. One can conclude that separating the objectives to optimize is important when addressing a customer-oriented DARP. Furthermore, when the total waiting time is the prioritized objective to minimise, this leads to the best results from the customer perspective. Therefore, a multi-objective formulation should be privileged when addressing customer oriented DARPs.
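The lexicographic resolution can be illustrated independently of OPL with a small two-stage scheme: the priority objective is minimized first, its optimal value is then frozen as a constraint, and the secondary objective is minimized next. The Python sketch below applies this idea to a deliberately tiny toy selection problem; it is not the DARP model itself, and the route costs, waiting times and the PuLP/CBC tooling are assumptions made only for the illustration.

```python
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum, value, PULP_CBC_CMD

# Toy data: three candidate routes with a travel cost and a total waiting time;
# exactly two routes must be selected (purely illustrative numbers).
cost = {"r1": 10.0, "r2": 12.0, "r3": 11.0}
wait = {"r1": 30.0, "r2": 5.0, "r3": 12.0}

def solve_lexicographic(priority, tol=1e-6):
    """Minimize the objectives in the given order, freezing each optimum."""
    frozen = {}
    for stage, name in enumerate(priority):
        prob = LpProblem(f"stage{stage}_{name}", LpMinimize)
        x = {r: LpVariable(f"x_{r}_{stage}", cat=LpBinary) for r in cost}
        f = {"f1": lpSum(cost[r] * x[r] for r in cost),    # total travel cost
             "f2": lpSum(wait[r] * x[r] for r in cost)}    # total waiting time
        prob += f[name]                        # objective of the current stage
        prob += lpSum(x.values()) == 2         # toy feasibility constraint
        for prev, best in frozen.items():      # keep earlier objectives at their optima
            prob += f[prev] <= best + tol
        prob.solve(PULP_CBC_CMD(msg=False))
        frozen[name] = value(prob.objective)
    return frozen

print(solve_lexicographic(("f1", "f2")))   # Bi-objective1: costs first
print(solve_lexicographic(("f2", "f1")))   # Bi-objective2: waiting time first
```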
Table 2. Results obtained by CPLEX in the mono-objective and bi-objective cases.

                   Mono-objective         Bi-objective1          Bi-objective2
I     n   m      f1(S)      f2(S)       f1(S)      f2(S)       f1(S)      f2(S)
d75   10  2      105.39     160.29      105.39     140.98      135.851    80.66
d92   17  2      314.9      136.021     306.89     219.22      391.805    91.10
d93   20  2      402.14     737.00      473.44     191.13      496.55     123.12
d94   23  2      273.89     808.34      394.42     758.76      454.45     485.22
d55   28  4      1500.06    95.68       1500.62    138.13      1601.23    144.25
d52   29  4      1481.30    412.79      1527.96    498.56      1534.55    385.66
d10   34  4      1331.15    271.34      1301.84    245.33      1402.12    233.46
d39   38  6      1955.61    123.48      2073.39    112.54      2199.47    101.77
d70   39  6      2000.30    283.58      1987.05    301.44      2223.56    203.44
d82   39  6      1830.12    176.32      1920.25    170.35      2001.23    101.55
d08   42  7      1682.52    185.58      1732.96    235.47      1994.27    180.41
d36   42  6      2130.52    459.16      2197.05    387.46      2596.01    405.33
5
Discussion and Conclusion
In this paper we introduced a new problem called the bi-objective customer oriented DARP. Indeed, the model consists of two separate objectives: one for the total costs of the transport and the other for the total waiting time, which should be reduced for a better quality of the service provided to the customers. The results of the tests performed on real-world instances of customer oriented DARPs showed that when the total waiting time is the objective which has to be minimised before the total costs, these waiting times were reduced in almost all the cases. Two major contributions of our bi-objective modelling were obtained. Firstly, by the use of the bi-objective model one has the possibility to look for a better quality of service for the customer with a customized design of the time windows. Secondly, this type of modelling helps us to find a balance between minimizing the costs and improving the quality of service. Therefore, the search for trade-offs between the interests of the service provider and those of the customer should be addressed with two or more separate objectives. Last, a metaheuristic based approach would help find near-optimal solutions of the problem with a good level of service and a lower computational effort than with the current exact resolution.
Interpretability Based Approach to Detect Fake Profiles in Instagram
Amine Sallah1, El Arbi Abdellaoui Alaoui2(B), and Said Agoujil1
1 Department of Computer Science, Faculty of Sciences and Techniques, Moulay Ismail University, Errachidia, Morocco
2 Department of Sciences, Ecole Normale Supérieure, Moulay Ismail University, Meknes, Morocco
[email protected]
Abstract. The explosive rise of OSNs, as well as the vast quantity of personal data they hold, has attracted attackers and imposters looking to steal personal information, propagate fake news, and carry out destructive actions. Researchers, on the other hand, have begun to look at effective approaches for detecting anomalous behaviors and phony accounts using account characteristics and classification algorithms. Creating fake accounts, which is used to increase the popularity of an account in an immoral way, is one of the security challenges in these networks that has become a major concern for users. An attacker can affect the security and privacy of legitimate users by spreading spam, malware, and disinformation. Advertisers utilize these channels to reach out to a certain demographic. The number of false accounts is growing at an exponential rate, and in this study, we propose an architecture for detecting phony accounts in social media, particularly Instagram. We are utilizing Machine Learning techniques such as Bagging and Boosting in this study to make better predictions about detecting bogus accounts based on their profile information. We utilized the SMOTE technique to balance the two groups of data, which enables us to get the equal number of persons for each class. The approaches for understanding complicated Machine Learning Models to understand the reasoning behind a model choice, such as SHAP values and LIME, were also included in this article. The XGBoost and Random Forest models have a combined accuracy of 96%. An online fake detection method has been built to identify rogue accounts on Instagram, as shown below. Keywords: Fake account · Machine Learning Classification · Interpretability · Shap Values
· Instagram · Smote

1 Introduction
Online Social Networks (OSNs) such as Facebook and Instagram have grown in popularity and importance in today’s world [8]. OSNs are used to acquire popularity and promote companies in addition to being utilized as a means of c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ben Ahmed et al. (Eds.): NISS 2022, LNDECT 147, pp. 306–314, 2023. https://doi.org/10.1007/978-3-031-15191-0_29
communication. At first look, an account's popularity is determined by measures such as follower count or attributes of shared material such as the number of likes, comments, or views. As a result, users of any social network may have a temptation to artificially manipulate metrics in order to get greater advantages from OSNs. Regrettably, this completely destroys trust [4]. Many individuals attempt to obtain popularity quickly by taking advantage of this process by purchasing false following profiles on the illicit market [1]. Another example of a firm that sells false followers is likes.io. For just $4, the consumer may purchase 100 bot-generated followers. A genuine individual has the ability to construct several fake identities. In contrast to the actual world, where various laws and regulations are applied to uniquely identify oneself (for example, when granting a passport or a driver's license), entrance to social media does not impose such requirements. In this paper, we look at distinct Instagram accounts in detail and use machine learning approaches to determine if they are phony or genuine. The following is an example of a faked profile (see Fig. 1). It has a large number of follows but a small number of followers, no profile picture, and no posts, as can be observed. There are also several pieces that look at Instagram from the viewpoint of a fake account. The authors in this article [5] have investigated different Instagram accounts, in particular to evaluate an account as fake or real based on certain features (N of posts, N of followers) using machine learning techniques, such as logistic regression and the Random Forest algorithm. The authors in this article [2] presented a study related to the recognition of fake and automated accounts that lead to a false engagement on Instagram. Following intensive hand labeling, 1002 authentic accounts and 201 fraudulent accounts were acquired for the dataset. During this gathering process, the following and follower counts, media counts, media posting dates or frequency, media comments, the presence of a profile image, and the profile's username are all taken into account. The authors proposed a cost-sensitive attribute reduction technique based on genetic algo-
Fig. 1. Example of a fake Instagram account
rithms to select the optimal attributes for binary classification using machine learning algorithms. Accuracy, recall, and F1 score are used as performance metrics, and parameter tuning is performed by GridSearch. Artificial intelligence (AI) has a lot of potential for improving both private and public life. The automatic discovery of patterns and structures in vast troves of data is a basic component of data science, and it is now driving applications in a variety of fields. It is better to have a precise model, but it is also better to get an explainable model, especially for making efficient and transparent decisions [13]. Explainability can facilitate the understanding of various aspects of a model, leading to insights that can be utilized by various stakeholders, such as:
– Data scientists can benefit from explainability while debugging a model or searching for methods to enhance performance.
– Consumers want clarity about how choices are made and how they may affect them.
2 Methodology
In this section, we describe our proposed approach. Specifically, we explain how each technique is used in the process of identifying fake profiles on Instagram; in addition, the interpretability of the Machine Learning models will help us understand the decision-making process and improve the performance of the models. The process consists of five main mining steps. In the first step, a dataset containing the relevant profile features is available. In the second step, we apply a set of pre-processing operations on the data before analyzing it. In the third step, we train several classification models to select the most appropriate algorithm to model the target variable. In the fourth step, since a predictive model is never perfect and can make prediction errors, it is essential to measure its performance before launching it into production. In the last step, we deploy the optimal model on a web server to predict the classes of new Instagram profiles. The methodology to detect a fake profile on Instagram is described in Fig. 2.
2.1
Dataset and Features
We applied our approach to an Instagram account dataset taken from Kaggle. The target variable, which indicates whether an account is fake or not, takes two values: 0 (not fake) and 1 (fake). Table 1 lists the features that have been taken into account. There were no missing values in the provided dataset.
2.2
Data Preprocessing
Data preprocessing in machine learning is an essential step that contributes to improve data quality and supports the extraction of meaningful knowledge from data. Data preprocessing includes various operations, each operation aims to enhance the performance of machine learning models [11]. The details of these preprocessing operations are briefly described in this sub-section.
Fig. 2. Design approach to detect fake account
Normalization. Most of the time in Machine Learning, the features of a dataset come with different orders of magnitude. This difference in scale can lead to lower performance. To compensate for this, preparatory processing of the data exists, notably Feature Scaling, which includes Normalization and Standardization. Min-Max Scaling can be applied when the data vary on different scales. At the end of this transformation, the features are mapped into a fixed range [0, 1].
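As an illustration only (not part of the original study), a minimal scikit-learn sketch of this Min-Max scaling step is given below; the feature values and the [0, 1] range are assumptions chosen for the example.

    from sklearn.preprocessing import MinMaxScaler
    import numpy as np

    # Hypothetical feature matrix: one row per profile, columns such as #posts and #followers
    X_train = np.array([[7389.0, 15338538.0], [12.0, 150.0], [0.0, 3.0]])
    X_test = np.array([[100.0, 2500.0]])

    scaler = MinMaxScaler(feature_range=(0, 1))      # x' = (x - min) / (max - min)
    X_train_scaled = scaler.fit_transform(X_train)   # min and max are learned on the training split
    X_test_scaled = scaler.transform(X_test)         # the same min/max are reused for new data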
2.3 Over-sampling
It often happens in Machine Learning classification that one of the classes is in the minority in relation to the overall population. This can be a problem because most classification algorithms rely on accuracy to build their models: seeing that the vast majority of observations belong to the same category, you may end up with a model that is not very intelligent and will always predict the dominant class [14]. One of the problems in data classification is therefore the unbalanced distribution of data, in which some classes contain many more items than others, and resampling is a way to address it. The negative impact of the imbalanced classes on the constructed model was handled by the Synthetic Minority Oversampling Technique (SMOTE). SMOTE [6,7] is the best-known over-sampling algorithm. It works as follows: the algorithm runs through all the observations of the minority class, searches for the k nearest neighbors of each one, and then randomly synthesizes new data points between the observation and one of these neighbors.
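A minimal sketch of this oversampling step, using the SMOTE implementation of the imbalanced-learn library, could look as follows; the synthetic dataset and the 5-neighbour setting are illustrative assumptions rather than the authors' exact configuration.

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # Hypothetical imbalanced dataset standing in for the Instagram profile features
    X, y = make_classification(n_samples=1000, n_features=11, weights=[0.85, 0.15], random_state=42)

    smote = SMOTE(k_neighbors=5, random_state=42)   # k nearest neighbours of each minority observation
    X_res, y_res = smote.fit_resample(X, y)         # both classes now contain the same number of samples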
Table 1. Details of the profile features

Feature                  Maximum value    Description
profil pic               1.00             User has a profile picture or not
nums/length username     0.92             Ratio between the number of numeric characters and the length of the username
fullname words           12.00            Number of words in the full name
nums/length fullname     1.00             Percentage of numeric characters in the full name
name==username           1.00             Are the username and full name the same?
description length       150.00           Biography length in characters
external URL             1.00             Has an external URL or not?
private                  1.00             Private or not
#posts                   7389.00          Number of publications
#followers               15338538.00      Number of subscribers
#follows                 7500.00          Number of following
After applying oversampling, all classifiers are trained on an equal number of training samples per class.
2.4
Models Construction
Based on recent work in Machine Learning, in this paper we introduce ensemble classifiers (Random Forest, AdaBoost, ...) that combine multiple decision trees for fake account recognition on Instagram. By combining ensembles of classifiers with different parameters, we reduce the classification error and improve the robustness of the classifier [9,10]. Machine learning models for binary classification problems predict a binary result (one class out of two possible classes): real or fake account. To build the model, we use cross-validation; this technique consists of dividing the original data sample into k samples, of which one is used as a test sample while the remaining (k − 1) form the learning set. Subsequently, each set is used in turn as a test sample.
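The following scikit-learn sketch illustrates this training scheme under stated assumptions (a synthetic dataset stands in for the resampled profiles, and the hyperparameters are arbitrary); a GridSearchCV could be wrapped around each model in the same way for hyperparameter tuning.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X_res, y_res = make_classification(n_samples=1000, n_features=11, random_state=42)  # stand-in data
    models = {
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)   # k = 5 folds
    for name, model in models.items():
        scores = cross_val_score(model, X_res, y_res, cv=cv, scoring="accuracy")
        print(name, scores.mean())    # each fold is used once as the test sample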
2.5 Interpretation of Machine Learning Models
How do we understand the decisions suggested by these models so that we can trust them? One of the ethical issues that is very often raised in our modern world is that of transparency. Indeed, it is necessary to be able to explain the prediction or decision of an artificial intelligence to a customer or simple user of the AI [3]. Moreover, the explanation of a complex learning algorithm also allows the model to be optimized by predetermining the important variables that identify a fake account. The SHAP method (SHapley Additive exPlanations) was proposed by Lundberg and Lee [12]. It allows, for all types of models, to give
a posteriori explanations on the contribution of each variable. To compute the Shapley values of the features for an instance, we simulate different combinations of feature values, knowing that we can have combinations where a feature is totally absent. For each combination, we calculate the difference between the predicted value and the mean of the predictions on the real data. The general expression of the Shapley value ϕi is:

ϕi = Σ_{S ⊆ N\{i}} [ |S|! (M − |S| − 1)! / M! ] (fx(S ∪ {i}) − fx(S))    (1)

where M is the number of features, S is a subset of features, fx is the prediction function for the instance x, fx(S) = E[f(x) | xS], and i is the i-th feature. The SHAP approach is additive, so a prediction can be written as the sum of the different effects of the variables (the Shap values ϕi) added to the base value ϕ0, the base value being the average of all the predictions on the dataset:

f(x) = ypred = ϕ0 + Σ_{i=1}^{M} ϕi zi    (2)

where ypred is the predicted value of the model for this example, and z ∈ {0, 1}^M, with zi = 1 when the variable is observed and zi = 0 when it is unknown.
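In practice such values can be obtained with the shap library; the sketch below is illustrative only and assumes a tree-based classifier trained on a synthetic stand-in dataset.

    import shap
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=11, random_state=42)   # stand-in data
    model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

    explainer = shap.TreeExplainer(model)        # efficient Shapley value estimation for tree ensembles
    shap_values = explainer.shap_values(X)       # one value per feature and per instance
    shap.summary_plot(shap_values, X)            # global feature importance plot (cf. Fig. 4)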
3 Results and Analysis
In this section, we determine the results of the classification models. After creating the models using the training data sets, we apply the models to new data. We want to know how well they perform, so we will exploit the performance parameters mentioned in Table 2.

Table 2. Resultant analysis for various classifiers.

                          without SMOTE                  with SMOTE
Classifier                Accuracy  Precision  Recall    Accuracy  Precision  Recall
Adaboost                  0.94      0.91       0.93      0.96      0.97       0.96
k-Nearest Neighbors       0.88      0.90       0.90      0.92      0.91       0.92
Random Forest             0.94      0.95       0.95      0.95      0.95       0.95
XGBoost                   0.91      0.91       0.91      0.96      0.95       0.96
Decision tree             0.90      0.90       0.90      0.91      0.91       0.91
Support Vector Machine    0.88      0.90       0.91      0.93      0.93       0.92
Table 2 shows the performance of the different algorithms on the Accuracy, Precision and Recall metrics. Experimental results show that SMOTE
significantly improves the measurements by generating new data for the minority class of the imbalanced dataset. Analysing Table 2, we can see that for these metrics XGBoost provides 96% Accuracy with SMOTE. A detection model is all the more efficient as its ROC curve is close to the upper left corner: a high detection rate for a low false classification rate. Analysing the different ROC curves of our classifiers given in Fig. 3, we find that overall the Random Forest, XGBoost and AdaBoost methods are more efficient, as they give the best AUCs (0.993, 0.995 and 0.989 respectively) compared to the individual classifiers KNN, SVM and Decision Tree (0.971, 0.984 and 0.929 respectively).
Fig. 3. ROC curves comparison of different classifiers
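For reference, ROC curves and AUC values of this kind are typically computed as in the following scikit-learn sketch; the data, split and model settings are assumptions for illustration, not the exact experimental setup.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_curve, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=11, random_state=42)   # stand-in data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

    proba = clf.predict_proba(X_te)[:, 1]     # predicted probability of the "fake" class
    fpr, tpr, _ = roc_curve(y_te, proba)      # points of the ROC curve
    print("AUC =", roc_auc_score(y_te, proba))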
Figure 4 depicts the global importance of the variables calculated from the Shap values. It is a method aimed at identifying which parameter, or set of parameters, influenced the prediction (classification): each point represents a Shap value for one example, the red points represent high values of the variable and the blue points represent low values of the variable. We can see from Fig. 4 that the feature #followers has a strong impact on the decision. For example, high values of the profil pic variable have a high positive contribution to the prediction, while low values have a high negative contribution. Figure 5 shows the main features affecting the prediction of a single observation, and the magnitude of the SHAP value for each feature. Let us first note that E(f(X)) = 0.5147 represents the average of the predictions, and that f(x) = 0.88 represents the prediction for the instance studied. The force plot is another way to see the effect each feature has on the prediction for a given observation. In this plot the positive SHAP values are displayed on the left side and the negative ones on the right side. For example, the feature description length has a positive impact, which pushes the predicted value higher.
Fig. 4. SHAP summary plot illustrates the feature importance influential in the classification
Fig. 5. Force plot
4 Conclusion
In this paper, we studied different Instagram accounts and tried to assess whether an account is fake or real using Machine Learning. We proposed an approach to identifying fake profiles on Instagram based on profile features using Machine Learning algorithms. First, a comparative analysis of all classifiers was performed on the test set. When we classify the dataset without oversampling the data, our model attains 91% accuracy. Next, the experiment was performed on the balanced data; the proposed model then attains 96% accuracy, which is comparatively better than on the imbalanced data. In order to find the best combination of the hyperparameters of a classifier, an optimization by Grid search is proposed. We also interpreted the decisions made by our model using the SHAP values. This helps decision-makers to trust the model and to know how to incorporate its recommendations with other decision factors. Our goal is to understand why the machine learning model made a certain prediction. As a perspective of this work, our future work will focus on enriching the database, while addressing content-based and emotion-based features. Data Availability Statement. The data used to support the findings of this study are available from the corresponding author upon request.
References 1. Aggarwal, A., Kumaraguru, P.: What they do in shadows: twitter underground follower market. In: 2015 13th Annual Conference on Privacy, Security and Trust, PST 2015, vol. i, pp. 93–100 (2015) 2. Akyon, F.C., Kalfaoglu, M.E.: Instagram fake and automated account detection. In: Proceedings - 2019 Innovations in Intelligent Systems and Applications Conference, ASYU 2019 (2019) 3. Carvalho, D.V., Pereira, E.M., Cardoso, J.S.: Machine learning interpretability: a survey on methods and metrics. Electronics 8(8), 83 (2019) 4. Castellini, J., Poggioni, V., Sorbi, G.: Fake twitter followers detection by denoising autoencoder. In: Proceedings - 2017 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2017, pp. 195–202 (2017) 5. Dey, A., Reddy, H., Dey, M., Sinha, N.: Detection of fake accounts in instagram using machine learning. Int. J. Comput. Sci. Inf. Technol. 11(5), 83–90 (2019) 6. Elreedy, D., Atiya, A.F.: A comprehensive analysis of synthetic minority oversampling technique (SMOTE) for handling class imbalance. Inf. Sci. 505, 32–64 (2019) 7. Fern´ andez, A., Garc´ıa, S., Herrera, F., Chawla, N.V.: SMOTE for learning from imbalanced data: progress and challenges. In: Marking the 15-year Anniversary (2018) 8. Tankovska, H.: Number of daily active Instagram Stories users from October 2016 to January 2019. Statista 9. Kramer, O., et al.: On ensemble classifiers for nonintrusive appliance load monitoring. In: Corchado, E., Sn´ aˇsel, V., Abraham, A., Wo´zniak, M., Gra˜ na, M., Cho, S.-B. (eds.) HAIS 2012. LNCS (LNAI), vol. 7208, pp. 322–331. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28942-2 29 10. Latha, C.B.C., Jeeva, S.C.: Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques. Inf. Med. Unlocked 16, 100203 (2019) 11. Liu, H.: Feature engineering for machine learning and data analytics (2018) 12. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 4766–4775 (2017) 13. Shirataki, S., Yamaguchi, S.: A study on interpretability of decision of machine learning. In: Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017, 2018-Janua, pp. 4830–4831 (2017) 14. Xiaolong, X., Chen, W., Sun, Y.: Over-sampling algorithm for imbalanced data classification. J. Syst. Eng. Electron. 30(6), 1182–1191 (2019)
Learning Styles Prediction Using Social Network Analysis and Data Mining Algorithms Soukaina Benabdelouahab1,2(B) , Jaber El Bouhdidi1 , Yacine El Younoussi1 , and Juan M. Carrillo de Gea2 1 National School of Applied Science of Tetuan, Abdelmalek Essaadi University Tetuan,
Tetouan, Morocco [email protected], {jaber.elbouhdidi, yacine.elyounoussi}@uae.ac.ma 2 Department of Informatics and Systems, Faculty of Computer Science, University of Murcia, Murcia, Spain [email protected]
Abstract. One of the most interesting directions of digital educational technology is adaptive e-learning systems. Learning styles are used in adaptive e-learning systems to provide helpful suggestions and regulations for learners to improve their learning performance and optimize the educational process. Recently, the research trend is to detect learning styles without disturbing the users. In contrast to the old method, which included students filling out a questionnaire, many ways to automatically detecting learning styles have been presented. These methods are based on analyzing behavior data collected from students’ interactions with the system using various data mining tools. Simultaneously, recent research has embraced the use of social network analysis in improving online teaching and learning, with the goal of analyzing user profiles, as well as their interactions and behaviors, to better understand the learner and his needs in order to provide him with appropriate learning content. The aim of our research was to determine if the automatic detection of learning styles can be done using the learner social network analysis and data mining algorithms, our research was implemented with Sakai learning management systems, which allow us to examine the performance of our approach. Keywords: Learning styles · Social network analysis · Data mining
1 Introduction The rapid developments in information and communication technology have made profound effects on people’s lives [1]. Teachers, educators, and educational institutions all across the world have noticed its growing importance in education [2]. Online education offers students a flexible educational option that allows them to complete their education at their own speed and on their own time [3]. E-learning was increasing at a rate of about 15.4% each year in educational institutions all across the world [4]. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ben Ahmed et al. (Eds.): NISS 2022, LNDECT 147, pp. 315–322, 2023. https://doi.org/10.1007/978-3-031-15191-0_30
In the context of forming adequate learning content, a learning style describes the attitudes and behaviors which determine an individual's preferred way of learning [5]. According to research, each person has his own unique learning style. Knowing learning styles can assist educators in identifying and resolving student learning issues. This will motivate learners to study more effectively since the educator will be able to tailor educational materials to the students' learning styles. A test or questionnaire is the usual method of determining learning styles. Despite their reliability, these tools have certain issues that make identifying learning styles difficult [6]. Students' lack of desire to complete a questionnaire and their lack of self-awareness of their learning preferences are two of these issues. As a result, various approaches for automatically detecting learning styles have been presented [7], which aim to solve these problems. Online social networking has become an unavoidable aspect of modern life [8]. As more students now spend so much of their time on online social networking sites, it may become a useful tool for improving student learning. In order to properly use online social networks in learning, students' individual variations in learning styles should be taken into account [9]. Social networks are important sources of online interactions and content sharing, subjectivity, influences, feelings, opinions and sentiment expressions borne out in text, reviews, blogs, discussions, news, remarks, reactions, or some other documents [10]. The social network allows for the efficient collecting of vast amounts of data, which poses significant computational issues [11]. However, users can now obtain important, accurate, and helpful knowledge from social network data thanks to the usage of effective data mining tools [12]. The main objective of the proposed method is to transform data extracted from social networks to identify learner preferences and predict learner styles, with the purpose of providing learning resources adequate to the user needs, so that it can be used in a learning management system. The rest of the paper is organized as follows: Sect. 2 explains the concept of learning style; Sect. 3 depicts the interplay between learning styles and social networks; Sect. 4 shows our proposal; and Sect. 5 presents our conclusions.
2 Learning Styles In an adaptive e-learning system, the learner model (LM) is the most significant component [13], because of its ability to depict the learner’s characteristics, based on which the educational system makes suggestions [14]. One of the most important characteristics of an LM is that everyone has their different learning style (LS) [15]. A LS refers to the preferred way of grasping and treating information, and it is one of the primary components of the LM since it can define the learner’s behavior when interacting with the e-learning environment [16]. We can divide learners into a variety of learning style categories based on their behavior during the learning process; these categories can then be used to create a learning style model (LSM). However, there are a variety of learning style models available, notably Kolb and Felder-Silverman, which is the most often used due to its capacity to quantify students’ learning styles [17]. Generally, a LSM classifies students based on where they fall on a scale that describes how they receive and process information. In this paper, we consider Felder–Silverman
learning style model (FSLSM), because of many reasons, among which we can note that FSLSM is the most frequently used learning style model. Shockley and Russell [18] examined the use of learning style models in adaptive learning systems over the last decade and discovered that the FSLSM model is the most popular (50%), far outnumbering Kolb's model (8.6%). Furthermore, FSLSM provides more thorough descriptions than other learning style models, as well as dependability data. LSs are divided into four aspects by the FSLSM: processing (active/reflective), perception (sensing/intuitive), input (visual/verbal) and understanding (sequential/global).
• Processing: this component outlines how information is interpreted and transformed into knowledge. This dimension's learning types are:
– Active: In situations where they must be passive, active learners do not learn much. They are good at working in groups and are willing to try new things.
– Reflective: In settings where there is little chance to think about the material being delivered, reflective learners do not learn much. They like to work alone or with only one other person, and they are theoreticians.
• Perception: this dimension is concerned with the type of information that a learner wants to receive. This dimension's learning types are:
– Sensing: sensing learners favor facts, data, and experimentation. They like to solve problems using tried-and-true ways and abhor "surprises". They are patient when it comes to details, but they dislike complications. Sensors are effective at remembering things and being cautious, but they can be slow.
– Intuitive: Intuitive people favor theories and principles. They enjoy new ideas and despise repetition. Detail bores them, so they embrace complications. Intuitors are quick and good at learning new concepts, but they can be sloppy.
• Input: this dimension evaluates how learners like to get information from outside sources. This dimension's learning types are:
– Visual: visual learners remember images, diagrams, flow charts, time lines, videos, and demonstrations the best.
– Verbal: verbal learners remember much of what they hear, and they remember even more of what they hear and subsequently say.
• Understanding: this component defines the process through which pupils gain understanding. This dimension's learning types are:
– Sequential: sequential learners solve problems using linear reasoning processes and can work with information that they only have a rudimentary understanding of.
– Global: global learners make intuitive leaps and may not be able to describe how they arrived at solutions. They may also have a hard time comprehending only a portion of the material.
FSLSM includes 44 questions in its high operational index of learning style (ILS) instrument: To detect both the preference and the degree of preference, 11 questions were asked for each dimension, each with two possible replies.
3 The Relationship Between Learning Styles and Social Networks

Supatra Wanpen [19] surveyed the use of social networking sites together with the Index of Learning Styles (ILS), both administered to 379 college students from five different faculties. The study examines the links between student learning styles and their use of online social networks. The findings should help instructors in successfully planning the usage of social networks to boost their students' learning. Wanpen's findings provide information regarding the online social networking sites that students use, their online social networking interactions, and their learning style preferences, as well as the links that exist between students' learning styles and various online social networking activities. The simple correlations shown in Table 1 indicate that there are numerous significant links between the ILS and the scales from the online social network survey when it comes to students' learning styles. The results show that the active/reflective aspect has relationships with many scales of online social network use, namely: Follow pages created by others, Chat & send messages through social network sites, Start a thread and/or express opinions in a discussion forum, and Read a discussion thread written and commented by others. There is also a relationship between the sensing/intuitive aspect and the Chat & send messages through social network sites scale.

Table 1. Simple correlation coefficients for students' learning styles in association with the use of social network interaction.

Learning style preferences                                                Active/     Sensing/    Visual/   Sequential/
                                                                          Reflective  Intuitive   Verbal    Global
Profile post and comment
  Own profile          Update status                                      0.07        0.00        0.05      0.04
                       Comment on a post                                  0.06        0.48        0.00      0.01
  Others' profiles     Post on others' profiles                           0.07        0.01        0.02      0.03
                       Comment on a post                                  0.07        0.18        0.08      0.03
Notes or blogs         Write own notes/blogs                              0.05        0.03        0.01      0.01
                       Read notes/blogs written by others                 0.04        0.46        0.00      0.01
Groups                 Create own groups                                  0.04        0.08        0.03      0.02
                       Join groups created by others                      0.11        0.10        0.07      0.01
Pages                  Create a page                                      0.06        0.07        0.03      0.01
                       Follow pages created by others                     0.18        0.05        0.02      0.01
Chats & Messages       Chat & send messages through social network sites  0.16        0.11        0.02      0.01
                       Use other chat programs or messengers              0.00        0.01        0.06      0.09
Forums                 Start a thread and/or express opinions in a discussion forum   0.17   0.01   0.07   0.04
                       Read a discussion thread written and commented by others       0.14   0.00   0.03   0.04
4 Proposed Method

Currently, the use of social networks is a trend among students, and with this emergence the data produced by social networks is used in several fields, including the field of education. Social networks are spaces where users can share their interests, opinions and preferences, as well as their interactions with different subjects. Researchers agree that data extracted from a user's social media profile can yield important information about him, which can then be used to better understand the user. In our model, we tried to use the information extracted from social networks to detect the learning styles of a user based on the model proposed by Wanpen [19], who studied the relationship between the interaction of users with social networks and their learning styles based on FSLSM. Data mining is a process that is as automatic as possible, starting from basic data available in a decision-making Data Warehouse. The main objective of Data Mining is to create an automatic process that has data as its starting point and decision-making support as its end. Figure 1 explains our model, which can be divided into the following tasks:
Task 1: Extract social network data using an algorithm based on the social network API.
Task 2: Determine how students interact with social networks.
Task 3: Apply data mining prediction models on the data warehouse.
Task 4: Predict the learning style.
Fig. 1. Learning styles prediction model based on data mining techniques and social networks.
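One possible minimal sketch of Tasks 3 and 4, assuming that the interaction scales of Table 1 have already been extracted into a feature table and that the active/reflective label is known for a small training sample, is given below; the feature names, values and the decision-tree classifier are hypothetical illustrations, not the authors' implementation.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical warehouse extract: one row per learner, columns are interaction scales
    data = pd.DataFrame({
        "follow_pages":   [12, 0, 7, 3, 9, 1],
        "chat_messages":  [150, 10, 80, 5, 120, 8],
        "forum_threads":  [4, 0, 2, 1, 5, 0],
        "read_threads":   [30, 2, 15, 3, 22, 4],
        "active_reflective": ["active", "reflective", "active", "reflective", "active", "reflective"],
    })
    X = data.drop(columns="active_reflective")
    y = data["active_reflective"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)
    clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)   # Task 3: train a prediction model
    print(clf.predict(X_te))                                        # Task 4: predict the learning style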
5 Conclusion Thanks to social network analysis, it is possible to extract different parameters from the student activity in social networks [20]. This type of analysis could help teachers to better understand their students’ behavior, and consequently, help them to get better results [21]. The massive size of social network datasets requires automated data processing in order to analyze them in a reasonable amount of time [22]. Interestingly, data mining techniques also require huge data sets to mine remarkable patterns from data [23]. Social network sites appear to be perfect sites to mine with data mining tools. The main objective of the proposed method is to use data generated by social network analysis integrating data mining techniques to predict student learning style model. The proposed model is, however, not free from some significant limitations. Undoubtedly, the most obvious limitation of this study derives from using the Felder Silverman model of learning styles; however, there are other models to establish the paradigms of learning. In this sense, it would be interesting to incorporate new models of learning Styles, such as Kolb Model. Finally, it is necessary to generate pedagogical metrics to evaluate the impact of this technology in a learning environment. In addition, it is essential to generate experiments in real environments, which allow measuring the contribution of this technology in the learning processes.
References 1. De La Hoz-Rosales, B., Ballesta, J.A.C., Tamayo-Torres, I., Buelvas-Ferreira, K.: Effects of Information and communication technology usage by individuals, businesses, and government on human development: an international analysis. IEEE Access 7, 129225–129243 (2019). https://doi.org/10.1109/ACCESS.2019.2939404 2. Ratheeswari, K.: Information communication technology in education. J. Appl. Adv. Res. 3, S45–S47 (2018). https://doi.org/10.21839/jaar.2018.v3is1.169 3. Ali, W.: Online and remote learning in higher education institutes: a necessity in light of COVID-19 pandemic. High. Educ. Stud. 10(3), 16 (2020). https://doi.org/10.5539/hes.v10 n3p16 4. Dao Thi Thu, H., Duong Hong, N.: A survey on students’ satisfaction with synchronous Elearning at public universities in vietnam during the COVID-19. In: 2021 5th International Conference on Education Multimedia Technology, pp. 196–202 (2021) 5. Ha, N.T.T.: Effects of learning style on students achievement. Linguist. Cult. Rev. 5(S3), 329–339 (2021). https://doi.org/10.21744/lingcure.v5ns3.1515 6. Feldman, J., Monteserin, A., Amandi, A.: Automatic detection of learning styles: state of the art. Artif. Intell. Rev. 44(2), 157–186 (2014). https://doi.org/10.1007/s10462-014-9422-6 7. Azzi, I., Jeghal, A., Radouane, A., Yahyaouy, A., Tairi, H.: A robust classification to predict learning styles in adaptive E-learning systems. Educ. Inf. Technol. 25(1), 437–448 (2019). https://doi.org/10.1007/s10639-019-09956-6 8. Kuss, D.J., Griffiths, M.D.: Social networking sites and addiction: ten lessons learned. Int. J. Environ. Res. Public Health 14(3), 311 (2017). https://doi.org/10.3390/ijerph14030311 9. Zachos, G., Paraskevopoulou-Kollia, E.A., Anagnostopoulos, I.: Social media use in higher education: a review. Educ. Sci. 8(4), 194 (2018). https://doi.org/10.3390/educsci8040194 10. Adedoyin-olowe, M., Gaber, M.M., Stahl, F.: [SNA] A survey of data mining techniques for social network analysis. Int. J. Res. Comput. Eng. Electron. 3(6), 1–8 (2014). https://jdmdh. episciences.org/18/pdf%5cnhttp:/jdmdh.episciences.org/18/ 11. Chang, V.: A proposed social network analysis platform for big data analytics. Technol. Forecast. Soc. Change 130(January), 57–68 (2018). https://doi.org/10.1016/j.techfore.2017. 11.002 12. Wu, X., Zhu, X., Wu, G.-Q., Ding, W.: Data mining with big data Xindong. In: Ieeexplore.Ieee.Org, pp. 1–26 (2014). https://ieeexplore.ieee.org/abstract/document/6547630/ 13. Uddin, M., Ahmed, N., Mahmood, A.: A learner model for adaptable e-learning. Int. J. Adv. Comput. Sci. Appl. 8(6), 139–147 (2017). https://doi.org/10.14569/ijacsa.2017.080618 14. Tarus, J.K., Niu, Z., Mustafa, G.: Knowledge-based recommendation: a review of ontologybased recommender systems for e-learning. Artif. Intell. Rev. 50(1), 21–48 (2017). https:// doi.org/10.1007/s10462-017-9539-5 15. Lwande, C., Muchemi, L., Oboko, R.: Identifying learning styles and cognitive traits in a learning management system. Heliyon 7(8), e07701 (2021). https://doi.org/10.1016/j.hel iyon.2021.e07701 16. Kolb, A.Y., Kolb, D. A.: Experiential learning theory as a guide for experiential educators in higher education. ELTHE A J. Engag. Educ. 1(1), 7–45 (2017). https://nsuworks.nova.edu/ elthe/vol1/iss1/7 17. Ramírez-Correa, P.E., Rondan-Cataluña, F.J., Arenas-Gaitán, J., Alfaro-Perez, J.L.: Moderating effect of learning styles on a learning management system’s success. Telemat. Inf. 34(1), 272–286 (2017). https://doi.org/10.1016/j.tele.2016.04.006 18. 
Shockley, D.R.: Learning styles and students’ perceptions of satisfaction in community college web-based learning environments (2005)
19. Wanpen, S.: The relationship between learning styles and the social network use of tertiary level students. Procedia - Soc. Behav. Sci. 88, 334–339 (2013). https://doi.org/10.1016/j.sbs pro.2013.08.514 20. Giunchiglia, F., Zeni, M., Gobbi, E., Bignotti, E., Bison, I.: Mobile social media usage and academic performance. Comput. Human Behav. 82, 177–185 (2018). https://doi.org/10.1016/ j.chb.2017.12.041 21. Greenhow, C., Askari, E.: Learning and teaching with social network sites: a decade of research in K-12 related education. Educ. Inf. Technol. 22(2), 623–645 (2015). https://doi. org/10.1007/s10639-015-9446-9 22. Serrat, O.: Knowledge Solutions: tools, methods, and approaches to drive organizational performance. In: Knowledge Solution Tools, Methods, Approaches to Drive Organzation Performing, pp. 1–1140 (2017). https://doi.org/10.1007/978-981-10-0983-9 23. Wang, R., et al.: Review on mining data from multiple data sources. Pattern Recogn. Lett. 109, 120–128 (2018). https://doi.org/10.1016/j.patrec.2018.01.013
Managing Spatial Big Data on the Data LakeHouse Soukaina Ait Errami1(B) , Hicham Hajji1 , Kenza Ait El Kadi1 , and Hassan Badir2 1 School of Geomatics and Surveying Engineering, IAV Hassan II Institute, Rabat, Morocco
{s.aiterrami,h.hajji,k.aitelkadi}@iav.ac.ma
2 Computer Science Department, National School of Applied Sciences, Tangier, Morocco
[email protected]
Abstract. The objective of this paper is to propose some of the best storage practices for using Spatial Big data on the Data Lakehouse. In fact, handling Big Spatial Data showed the limits of current approaches to store massive spatial data, either traditional such as geographic information systems or new ones such as extensions of augmented Big Data approaches. Our article is divided into four parts. In the first part, we will give a brief background of the data management system scene. In the second part, we will present the Data LakeHouse and how it responds to the problems of storage, processing and exploitation of big data while ensuring consistency and efficiency as in data warehouses. Then, we will recall the constraints posed by the management of Big Spatial Data. We end our paper with an experimental study showing the best storage practice for Spatial Big data on the Data LakeHouse. Our experiment shows that the partitioning of Spatial Big data over Geohash index is an optimal solution for the storage. Keywords: Data architecture · Storage · Spatial big data
1 Introduction

With the increased use of geolocation services and location devices comes an enormous growth of spatial data sources, in addition to growth in their volume and variety. Traditional spatial data warehouses were unable to handle these data: the storage exceeded their capacity and vertical scaling was expensive; the variety of the data ran into the data warehouse integration bottleneck; and another constraint was the velocity of the data, as the data comes not only in batches but also in streams, while data warehouses were unable to support data streaming [1]. Thus, Data lakes have emerged as a flexible store-all and analyze-all platform to keep up with the big data era requirements [2]. Handling spatial big data presents extra requirements due to the highly expensive computation cost and the diversity of data types and sources (batch and stream data sources). But this flexibility and unlimited storage capability compromised the consistency and the governance that were present in data warehouses [3]. Thus, a new system emerged trying to solve all of these issues: the Data Lakehouse.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ben Ahmed et al. (Eds.): NISS 2022, LNDECT 147, pp. 323–331, 2023. https://doi.org/10.1007/978-3-031-15191-0_31
2 Background and Related Works The data decision support ecosystem went from different stages as the data volume grows and the data analytic needs grow. The data warehouses helped in integrating the different internal and external data sources of the enterprise into one system which eases high level decisions [4]. However, with the grown amounts of data and their acquisition speed, the traditional data warehouses were unable to deal neither with the huge data volumes nor the ETL process that starts representing a bottleneck in the data pipeline, nor with the speed of data as they were non compatible with data streaming [5]. To solve these issues, Data Lake was adopted as a central repository where data flows from different sources as it is, and in the raw data state, which enable a reduced acquisition-to-storage time, and a store-then-process methodology in addition to a theoretically unlimited storage and computing capabilities due to the use of the distributed architecture [6]. However, the Data Lakes itself showed some consistency issues especially those related to lack of governance, schema issues when merging tables and metadata handling [7]. Given this, a new architecture emerged trying to solve all these issues without compromising the benefits of each architecture.. There are still few papers tackling the data lakehouse systems [7, 8]. [7] dived deep in the cloud management side of the data lakehouse, while [8] introduced the data lakehouse system and its added values compared to data warehouses and data lakes. On the other side, our paper tackles the management of spatial data specifically.
3 Data LakeHouse The Data Lakehouse emerged as a new paradigm trying to combine the benefits of the Data Lake and the Data Warehouse into a unified centralized system. In fact, the Data Lakehouse brings the consistency and governance of Data Warehouses to the flexible Data Lakes by adding a governance layer on top of the Data Lake [8]. The governance layer adds ACID properties by implementing different protocols, it also adds metadata handling as the metadata is the core component of data lakes to keep visible and accessible all of its data and enforce schemas to protect data consistency [8]. The governance layer enables also data indexing for faster data accessibility. In addition to that, it supports BI and data science [7, 8]..
4 Spatial Big Data on the LakeHouse Spatial data is known to be compute-intensive and especially with the explosion of its amount due to the proliferation of sensor and location-aware devices, thus to get the full potential of this data, big data techniques were adopted to support the increasing storage and computational needs. The existence of different methods and devices for capturing spatial data imposes the deal with heterogeneous spatial data types. This heterogeneous spatial data nature in addition to the geometric computational complexity for spatial data analytics make the Data LakeHouse a suitable approach to deal with this data, as it will as previously mentioned allow the storage without heavy and costly ETL process, neither any other sorts
of preprocessing and without compromising the governance and consistency benefits of a Data Warehouse. The Data Lakehouse gives the opportunity to store heavy spatial data and partition it spatially. The metadata handling also makes the Data LakeHouse suitable for the complicated nature of spatial big data.
5 Experimental Study In this section, we will experiment the optimization avenues for Spatial big data in terms of storage and query processing in a spatial Lakehouse system. Our architecture is built using the delta lake system upon the AWS cloud. 5.1 The Spatial Lakehouse Architecture The architecture that we propose for handling Spatial big data on the Lakehouse is built using the Databricks LakeHouse based on the delta governance layer with the Geomesa distributed library to handle the spatial dimension upon the Apache Spark framework with S3 cloud object storage (Fig. 1).
Fig. 1. Data Lakehouse
5.2 Storage and Query Optimization For the storage we used delta table that is composed from data objects stored in Parquet format and their metadata in a delta transactional log [9]. Parquet storage format is an open-source columnar storage which ease the skipping of irrelevant data during queries by reading only the needed columns [10]. For a better performance, parquet files are separated to row groups, and each group contains separated columns, parquet format also stores metadata related to every row group which helps minimizing IO by reading into these metadata before going through the data for query purposes, and makes schema evolution easier. In addition to that, parquet storage uses compression techniques such as dictionary encoding that helps fastening the queries especially for columns with small values redundancy as it stores in the metadata of the concerned columns the repetitive
values [10]. In addition to that there is some other techniques, such for columns with numerical values, the variation can be stored rather than the value per-se [10]. Parquet format helps in query optimization as the Spark Catalyst optimizer pushes the filters to the data file level then fetching into the parquet metadata before data scanning helps reducing considerably the query time (Fig. 2).
Fig. 2. Parquet files [10]
The delta log is the layer that assure transactional support and ACID compliance, and its composed of JSON files and checkpoints summarizing the transactions stored into those JSON files to fasten the reading through the transactional metadata [10]. The Z-ordering technique and the partitioning are also used to optimize storage on delta lake. Z-order as a multidimensional clustering technique helps optimizing the storage for Spatial data efficiently. Z-ordering techniques work well with high cardinality columns which makes it a good technique for Spatial point data with Latitude and longitude. On the other hand, partitioning technique is mostly used over low cardinality columns to partition the delta table into partitions based on that column values [11] (Fig. 3).
Fig. 3. Z-order curve
For more query optimization, data skipping techniques such as columns’ stats collecting helps reducing the data scanning time. In addition to that, caching helps also accelerating the data access.
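A minimal PySpark sketch of such a query is given below; the Delta table path and the filter bounds are assumptions, and the delta-spark package is assumed to be configured. The .where() predicates are evaluated against the Parquet column statistics first, and cache() keeps the filtered result in memory for repeated access.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.format("delta").load("s3://bucket/ved_delta")      # hypothetical table path

    result = df.where((F.col("latitude").between(42.2, 42.3)) &
                      (F.col("longitude").between(-83.8, -83.7)))      # filters pushed down to file/row-group stats
    result.cache()        # keep the filtered data in memory for subsequent queries
    print(result.count())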
5.3 Input Datasets Our Input dataset is the Vehicle Energy Dataset (VED), a large-scale dataset collected using 383 cars composed from GPS trajectories and energy consumption related data for every record. The dataset was collected from Nov, 2017 to Nov, 2018 over an accumulated distance of approximately 374,000 miles in Ann Arbor, Michigan, USA [12]. However, the data volume is 3 Gb only within this research and doesn’t reach a sufficiently large volume to analyze query performance at scale, and that’s why we created a synthetic dataset. The synthetic dataset has a size of 1 287 GB on csv format that converts to a maximum of 209.27 Gb only on delta format which shows clearly the compression capabilities and storage optimization of the delta format. 5.4 Setup and Hardware Our experiment was conducted on Databricks platform over the AWS cloud using the latest runtime 8.3 (Scala 2.12, Spark 3.1.1). The cluster configuration was using 4 workers of type i3.xlarge with 30.5 Gb memory and 4 cores each, and a similar one as driver. The cluster mode is standard and the Auto scaling was disabled to ensure the similar hardware capabilities over all the tests. 5.5 Analysis and Results To find the most suitable storage optimization technique with Spatial point big data, we used different optimization layouts and we tested them with a range query. We started our comparison with the base case of the data neither partitioned nor z-ordred (Fig. 4), the figure below shows a sample of different parquet files of the unpartitioned dataset. Thus, the scanning process during the query runtime will go through a large number of files to get the results as there is no spatial distinction between those files.
Fig. 4. Samples from parquet files of the non-partitioned dataset
The next tested optimization layouts are based on partitioning over the geohash or the H3. The geohash algorithm divides the earth into a rectangular grid of multi-levels on different levels and thus it converts the latitude and longitude to a string identifier
representing the code of the rectangle where those coordinates are located into [13]. Whereas the H3 index is based on a hexagonal grid [14]. The figures show a sample of data from each partition for the two tested layouts (Fig. 5).
Fig. 5. Samples from parquet files of the partitioned dataset. ((a) Partitioning with geohash, (b) Partitioning with H3)
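A minimal sketch of how such a geohash-partitioned layout can be written is shown below; the sample coordinates, the precision of 5 characters, the pygeohash package and the output path are assumptions made for illustration.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType
    import pygeohash as pgh

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(42.277, -83.738), (42.301, -83.715)], ["latitude", "longitude"])

    geohash = F.udf(lambda lat, lon: pgh.encode(lat, lon, precision=5), StringType())
    (df.withColumn("geohash", geohash("latitude", "longitude"))
       .write.format("delta")
       .partitionBy("geohash")                    # one directory of Parquet files per geohash cell
       .mode("overwrite")
       .save("s3://bucket/ved_delta_geohash"))    # hypothetical output path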
The last tested optimization layouts are based on the Z-order over the geohash column, the H3 column and then on the pair (Longitude, Latitude) (Fig. 6).
Fig. 6. Samples from parquet files of Z-ordered dataset ((a) Z-order with geohash, (b) Z-order with h3, (c) Z-order with (latitude, longitude))
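For the Z-ordered layouts, Delta Lake exposes the OPTIMIZE command with a ZORDER BY clause (available on Databricks and in recent open-source Delta releases); a hedged sketch, with a hypothetical table path, is given below.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Rewrites the data files so that rows close in (latitude, longitude) end up in the same files
    spark.sql("OPTIMIZE delta.`s3://bucket/ved_delta` ZORDER BY (latitude, longitude)")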
The range query pseudo code is as follows:
Initiation: long_value1, long_value2, lat_value1, lat_value2
Get parquet file longitude and latitude columns metadata
While not at the last parquet file
Begin
  If latitude, longitude columns range values within searched range
  Begin
    Get first row's latitude, longitude values
    If long_value1

45 nodes) the AODV protocol will have the highest throughput (Fig. 11).

Table 12. Throughput vs Node Speed

                   Throughput
Node speed (m/s)   AODV   DSDV   OLSR   DSR
15                 0.84   0.95   0.67   0.45
20                 0.85   0.96   0.65   0.46
25                 0.85   0.95   0.65   0.44
30                 0.83   0.94   0.66   0.47
Table 13. EED vs Node Speed

                   Average end-end delay
Node speed (m/s)   AODV   DSDV     OLSR   DSR
15                 0.04   0.009    0.1    0.63
20                 0.14   0.002    0.22   0.67
25                 0.06   0.0007   0.05   0.74
30                 0.12   0.003    0.15   0.76
Table 14. End-End Delay vs Number of Nodes

            Energy (Joul)
Nbr Nodes   AODV     DSDV     OLSR     DSR
15          0.9796   0.9968   0.9534   0.9941
25          0.9797   0.9726   0.9328   1
35          0.9806   0.9697   0.9412   0.9992
45          0.9800   0.9858   0.9123   0.9918
50          0.9922   0.9839   0.9637   0.9999
We focus on the variations of throughput by varying the speed of the node (Table 13, Fig. 12). The results show that the throughput almost does not change, so we can say that the throughput does not have a strong relationship with the node speed. Among the four protocols, DSDV is still the best.
Fig. 13. End-End Delay vs Number of nodes
Fig. 14. End-End Delay vs Speed
According to the results obtained, the end-to-end delay is higher for the two OLSR and DSR protocols, on the other hand, the other two protocols AODV and DSDV have the smallest end-to-end delay, and it is almost negligible for DSDV.
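For reference, the metrics compared in these tables are usually derived from the per-packet simulation trace; the short sketch below illustrates the computation of packet delivery ratio, throughput and average end-to-end delay on hypothetical packet records that are not taken from the paper.

    # Hypothetical records: (packet id, send time in s, receive time in s or None if the packet was lost)
    packets = [(1, 0.10, 0.14), (2, 0.20, 0.26), (3, 0.30, None), (4, 0.40, 0.47)]
    packet_size_bits = 512 * 8
    sim_time = 1.0   # simulated seconds

    received = [(s, r) for _, s, r in packets if r is not None]
    pdr = len(received) / len(packets)                            # packet delivery ratio
    throughput = len(received) * packet_size_bits / sim_time      # bits per second
    avg_delay = sum(r - s for s, r in received) / len(received)   # average end-to-end delay
    print(pdr, throughput, avg_delay)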
From Table 15 and Fig. 14, we can say that the end-to-end delay of the DSDV protocol is almost equal to zero and does not vary when varying the speed, whereas it is higher for the DSR protocol and increases as the node speed increases.

Table 15. End-End Delay vs Speed

        Average end-end delay
Speed   AODV   DSDV    OLSR   DSR
15      0.10   0.009   0.13   0.62
20      0.17   0.009   0.21   0.68
25      0.10   0.007   0.09   0.72
30      0.15   0.054   0.18   0.78
Fig. 15. Energy vs Number of Nodes
Fig. 16. Energy vs Speed
Table 16. Energy vs Speed

                   Energy (Joul)
Node speed (m/s)   AODV     DSDV     OLSR     DSR
15                 0.9885   0.9965   0.9812   0.9796
20                 0.9954   0.9941   0.9920   0.9971
25                 0.9991   0.9950   0.9412   0.9918
30                 0.9897   0.9970   0.9900   0.9823
Regarding the variety of energy consumption with respect to the number of nodes (Table 15 Fig. 15), the energy consumed is generally very high but comparing it among these four protocols, OLSR consumes less energy and DSR consumes almost all the energy. From Table 16 Fig. 16, The energy consumption is generally very high for the four protocols and by varying the speed it sometimes increases and it sometimes decreases but we have noticed that for low speeds the OLSR protocol consumes less energy and for high speeds DSR becomes better.
5 Conclusion and Future Work In our research we tried to compare the protocols AODV, DSDV, OLSR and DSR using two simulators NS2 and NS3 using specific parameters. We focused on four metrics Packet Delivery Ratio, Throughput, End-to-End Delay and Energy consumed. We first compared the results obtained in the same simulator but by varying one of the parameters in our case, we sometimes varied the number of nodes and sometimes the speed of the node. For the NS2 simulator, all the results found show that the DSDV protocol is the best among other protocols in all metrics. And even by changing the two variables, number of nodes and speed of the node and by analyzing the results obtained from the NS3 simulator, we noticed this time that the AODV protocol is the performed with regard to the packet delivery ration and throughput but this is not the case in end to end delay, in the latter the two protocols DSDV and DSR are the best. And for the energy consumed the DSDV protocol consumes less than the others. NS3 is more powerful, performed and flexible compared to NS2. Therefore, it is recommended to use NS3 simulator for obtain the performed results. Future researches can compare MANET routing protocols with two others open source simulators for example OPNET and OMNET++.
Gamification in Software Development: Systematic Literature Review Oki Priyadi1(B) , Insan Ramadhan1 , Dana Indra Sensuse1 , Ryan Randy Suryono1,2 , and Kautsarina1,3 1 Faculty of Computer Science, Universitas Indonesia, Depok, Indonesia
{oki.priyadi,insan.ramadhan,dana}@ui.ac.id, [email protected], [email protected] 2 Faculty of Engineering and Computer Science, Universitas Teknokrat Indonesia, Bandar Lampung, Indonesia 3 Ministry of Communication and Information Technology Republic of Indonesia, Jakarta, Indonesia
Abstract. Since 2010, gamification has become a trend for motivating people in various ways. Gamification is a new concept in software engineering that influences the software development process. Along with the increase of gamification frameworks, gamification is a means to increase motivation and the chances of successful project completion. The issue with gamification adoption is the dearth of a methodology or framework for developing a gamification system. To address this challenge, we employed the Kitchenham approach in this systematic literature study on gamification in software development over the last five years. This study aims to understand gamification framework design, gamification aspects, and the software development mechanism. We emphasize that the most important factor in making gamification succeed is using the correct game elements. The results show that the point is the most widely used gamification element in the software development area, followed by level, badge, social engagement, challenge-quest, leaderboard, voting, and betting. These findings will be used as inputs for further research into integrating the correct elements and framework in developing a gamification system for the software development industry. Another result showed that none of the studied papers used a common framework to build gamification in software development, which implies that the designers used their own frameworks and built gamification from scratch. Consequently, this remains a challenge and an opportunity to be resolved in future research. We suggest that further studies develop mature frameworks that can be used in the software development area. Keywords: Gamification · SLR · Software Development · Gamifying
1 Introduction The enhancement of software development by design characteristics drawn from (video) games, commonly known as gamification, has been popular in recent years [1]. Gamification has become a hot topic of discussion in various fields. As a basic concept,
gamification is defined as the application of game design elements to non-game contexts [2]. Considered a new idea in software engineering, gamification will change how software is developed. A software development project is a multi-party, multi-tool, multi-technique process [3]. The opportunity to work in a team and developer motivation are essential elements in the success of an IT project [4]. Routine labor, such as updating information in the progress documentation, is not challenging or appealing to the developer, which is why some programmers skip this process. In contrast, the project manager needs to ensure the project's progress. Project managers can use up-to-date and timely information to identify risks, difficulties, and modifications required to complete a project successfully [5]. Recent studies in the field of software engineering combine current gamification design methodologies with precise software development methods [6–9]. The increase in gamification frameworks includes gamification elements used in software development. However, methodologies or generic frameworks for building gamification systems are still lacking. Reference [10] found that 83% of gamification implementations in e-government were successful; on the other hand, 17% of cases failed because the designers used the wrong element in the wrong place. Based on those studies, we attempt to identify which elements are suitable for implementation in information systems development and which frameworks are used in applying gamification in information systems development. There are some relevant SLR studies about the implementation of gamification. Ref [11] maps out the field of gamification in software development and highlights research approaches to evaluate the quality of software gamification. Different from that research, we emphasize the gamification frameworks that have been used in software development. Ref [10] studies how gamification has been used in e-government. Instead, our research focuses on what gamification elements have been used and how the frameworks are implemented in software development. To answer the research problems outlined earlier, we used the Kitchenham technique of systematic literature review (SLR). Kitchenham is a well-known method that provides guidelines for a systematic literature review. Kitchenham is different from PRISMA and other SLR methods in that it is made specifically for computer science. This SLR approach has been used successfully in some studies, such as Suryono et al. (2019) on Peer to Peer (P2P) lending problems and potential solutions [12], Purwandari et al. (2019) on e-Government [10], and Jamra et al. (2020) on e-commerce security [13]. By implementing the SLR over reputable digital academic databases, such as Scopus, ProQuest, ACM Digital Library, IEEE Xplore, and SpringerLink, this study attempts to determine the relevant game elements and a proper framework for the software development process. The rest of the paper is organized as follows. The second section examines gamification and software development literature. The methodologies and processes used to perform the research are described in Sect. 3. The main findings from the extraction and synthesis of the studies are presented in Sect. 4. Finally, Sect. 5 presents the research results and future directions.
2 Relevant Theory 2.1 Gamification Human motivation and behavior are influenced by gamification [14]. Gamification’s primary goal is to boost user motivation to complete tasks or use technology, predicted to improve the number and quality of outcomes from these activities [11] and expand the number of activities or the speed with which they are processed [15]. This approach has been employed in various fields, including education [16] and business [17]. Popular gamification elements are leaderboards, points, badges, levels, and progress bars. Gamification applies game mechanics and elements to non-game situations [18]. It motivates users involved in a project by increasing their engagement and loyalty through stages of selection processes and game mechanisms [19]. Some elements are required to play the game. As indicated in Fig. 1, the elements employed in the play are separated into three major components [20].
Fig. 1. Software development gamification concept [20]
On the contrary, gamification does not always improve people’s desire to do something. According to the data in a previous study, only 38% of e-government gamification research from 2013 to 2018 successfully improved motivation or user engagement. Furthermore, due to system misuse and lack of publication, 8% of research failed [10]. In one situation, the gamification mechanism is quite adequate, but in another one, it is ineffective. Thus, we need the correct elements at the right time and in the right conditions.
2.2 Software Development The task of a software developer is to develop units and software modules according to customer needs within time and budget constraints [7]. There are some steps needed to build the final product. This process is tailored to the life cycle of the chosen software development project, resulting in differences in the team's composition, function, work unit, and routine tasks. The duties of each role in software development are described in Table 1.
Table 1. Software development project task [21]
Role | Activities
Programmer | Developing; running tests for a new feature; correcting vulnerabilities; examining the code
Analyst | Defining and analyzing the requirements; analyzing and commenting on change requests; running the test; writing documentation; advising programmers and testers
Tester | Developing both manual and automated tests; registering defects; re-examining the defects
Table 1 summarizes the responsibilities of each software developer role. Not all roles are included in this table; only typical roles are shown for simplicity.
3 Methodology The research was conducted in the steps described in Fig. 2. First, we identified the need for an SLR and reviewed the protocol at the 'Planning' stage. Using the objectives and key points as a guide, we started our research. After completing the goals, research questions, and scope, we went on to the 'Conduct the review' stage, where we designed the information retrieval procedure. We searched for scientific publications (papers and conference proceedings) in Scopus, ProQuest, ACM Digital Library, IEEE Xplore, and SpringerLink; they were chosen because they are pertinent databases. For all those databases, the following combined search string was used: (gamification OR gamifying OR gamify) AND ("software development" OR "project planning" OR "project assessment" OR "software requirements" OR "software design" OR "software risk" OR "software configuration" OR "software process" OR "software testing" OR "software integration" OR "software maintenance"), applied to publications between 2016 and 2021.
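For reproducibility, the combined query can be assembled programmatically. The sketch below is our own illustration (the helper names are ours, not from the paper); in practice the 2016–2021 year filter is applied through each database's own search facets rather than inside the query string.

```python
# Sketch: rebuild the combined Boolean search string used across Scopus,
# ProQuest, ACM Digital Library, IEEE Xplore and SpringerLink.
GAMIFICATION_TERMS = ["gamification", "gamifying", "gamify"]
SE_TERMS = [
    "software development", "project planning", "project assessment",
    "software requirements", "software design", "software risk",
    "software configuration", "software process", "software testing",
    "software integration", "software maintenance",
]

def build_query() -> str:
    """Return the combined search string (year filter applied separately)."""
    left = " OR ".join(GAMIFICATION_TERMS)
    right = " OR ".join(f'"{t}"' for t in SE_TERMS)
    return f"({left}) AND ({right})"

print(build_query())
```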
Fig. 2. Process model for systematic literature review
We compiled the findings and used a selection process that followed the inclusion criteria. First, the publications are in English. Second, they were published from 2017 to 2021. Third, the title contains strings related to the research goals, and the contents are relevant to the work described in this research. The excluded items are those which may skew the analysis results, such as duplicates and publications that are not appropriate to this study. Table 2 shows the detailed inclusion and exclusion criteria applied in phases.
Table 2. Inclusion and exclusion criteria.
Stages | Criteria for inclusion | Criteria for exclusion
Initiation | Same as keywords; papers in 2017–2021 | Not using English
First | Title and abstract similar to Boolean search | Duplicates
Second | Papers that answered the research question | Papers that cannot be accessed
Once the proper corpus of documents has been selected, we start the iterative steps. The iterative part handles additional Boolean searches built from new keywords found for the software development area. The steps of the research implementation can be seen in Table 3. We make the necessary adjustments and transformations to the data pool, then investigate and assess the findings to determine whether the objectives have been met or additional iteration procedures are still required. Finally, we write a review report based on the study results.
4 Results and Discussion 4.1 Publication We found 23 papers related to gamification in software development. The criteria outlined in the methodology section were used in selecting the papers. We had to eliminate certain papers from the results since they could not be accessed. As a result, only papers satisfying the requirements were selected.
Table 3. Selected papers by the stage
Source | Initial stage | Stage 1 | Stage 2
Scopus | 195 | 136 | 9
ACM Digital Library | 15 | 5 | 1
IEEE | 12 | 1 | 1
SpringerLink | 148 | 24 | 11
ProQuest | 261 | 17 | 1
Total | 631 | 183 | 23
Table 4. Final papers selected
Source | Final papers | Papers
IEEE | 1 | [22]
SpringerLink | 11 | [23–33]
ProQuest | 1 | [34]
ACM Digital Library | 1 | [35]
Scopus | 9 | [36–44]
Total | 23 |
SpringerLink has the most matched papers (eleven), followed by Scopus (nine), ProQuest, IEEE, and the ACM Digital Library (one paper each), as summarized in Table 4. All 23 papers were published from 2017 to 2021. The distribution of the final papers by publication year is shown in Fig. 3. The year 2020 has the largest number of studies, which means that the growth of gamification in the software development industry that year was notable. No papers were found for 2021 because we conducted and wrote this study at the beginning of that year.
4.2 RQ1 What Gamification Elements Have Been Used Our primary area of research is the application of gamification elements or mechanics to software engineering. The following is a list of gamification elements and mechanics we discovered: point, level, leaderboard, badge, challenge and quest, social engagement, voting, and betting. It is essential to note that different studies may refer to gamification aspects differently. We classified the game elements utilized in the papers into one or more of the categories given above. In the proposal of [22], the authors built a gamification engine for all software development processes. The engine supports the use of the following elements: points, badges, levels, feedback, game dialogue, quests, rankings, social networks, voting, and betting. Points can be used to track each user's successful completion of tasks. For example, a programmer can earn points for coding 400 lines, or a system analyst can earn points for setting up a list of software requirements. Other elements include experience, redeemability, karma, skill, and reputation points. In most cases, Experience Point (XP) totals never decrease because XP determines the player's level. Points can encourage users to accomplish tasks assigned by the organization. Paper [34] found that some tasks were more challenging than others, yet users received the same points for them, so the authors suggest that points should scale according to the degree of difficulty.
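As a purely illustrative sketch of how these elements can be combined (our own example; the task types, point values, badge name, and level thresholds are invented for illustration and do not come from the cited papers), consider the following:

```python
# Points scaled by task difficulty (as suggested in [34]), non-decreasing XP,
# a level derived from accumulated XP, and a badge marking an achievement.
from dataclasses import dataclass, field

BASE_POINTS = {"commit": 10, "code_review": 15, "write_requirement": 20}

@dataclass
class Developer:
    name: str
    xp: int = 0
    badges: list = field(default_factory=list)

    def complete_task(self, task_type: str, difficulty: float) -> int:
        earned = round(BASE_POINTS[task_type] * difficulty)  # scale by difficulty
        self.xp += earned               # XP never decreases, so the level is stable
        if self.xp >= 100 and "Centurion" not in self.badges:
            self.badges.append("Centurion")  # a badge visualizes an achievement
        return earned

    @property
    def level(self) -> int:
        # Each level requires roughly twice the XP of the previous one.
        level, threshold, xp = 1, 50, self.xp
        while xp >= threshold:
            xp -= threshold
            threshold *= 2
            level += 1
        return level

dev = Developer("alice")
dev.complete_task("commit", difficulty=1.5)   # awards 15 points
```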
Fig. 3. Distribution by year
Proposal [23] uses gamification for improving Scrum adoption. The game elements used are experience points, levels, feedback, pop-ups, badges, gems, and trading for real rewards. Users obtain their level based on the experience points they earn, and the levels typically get exponentially harder. A time limit for development tasks is a crucial success factor for the working time of coding results. Badges are also used to visualize some achievements. Paper [24] developed a gamified attacker-defender game to assess cyber security using the following elements: points, levels, and challenges. The game rules divide users into attackers and defenders: the attacker must create a mutant to introduce a bug into the software, and the defender needs to write a test case to find that bug. The rules also map behaviors to achievements, and every achievement is rewarded with points.
Paper [25] does not mention the game elements it used. Paper [26] uses gamification for fostering good coding practices; the elements used include points, scores, ranking, and feedback. This paper claims that gamification can improve product quality and project results. Paper [28] studies how developers make use of source code from Stack Overflow; points, levels, social engagement, and voting are used in that paper. Furthermore, proposal [30] automatically generates and recommends personalized challenges for gamification; the elements used comprise points, badges, leaderboards, challenges, and levels.
Table 5. Elements used by primary papers
Element | Papers | Freq
Point | [22–24, 26–34] | 12
Level | [22–24, 26–32, 34] | 11
Badge | [22, 23, 30, 32, 34] | 5
Social Engagement | [22, 23, 26, 28] | 4
Leaderboard | [22, 30, 34] | 3
Challenge and Quest | [22, 24, 30] | 3
Voting | [22, 28] | 2
Betting | [22] | 1
Figure 4 shows the results of elements used by the papers found. As we can see in this graph, the point is the most widely used gamification element in the software development area, followed by level, badge, social engagement, challenge-quest, leaderboard, voting, and betting. Table 5 displays the elements or mechanisms of gamification used in each paper. Unfortunately, the success level of using gamification elements is not mentioned in any selected papers. The available sources only mentioned the success of the application of their gamification framework and did not specify them based on each gamification element. The success of the gamification framework will be described in the following subsection.
Fig. 4. Distribution by element
4.3 RQ2 What Gamification Framework Has Been Used There are some well-known frameworks in gamification, such as Mechanic, Dynamic, Aesthetic (MDA); Mechanic, Dynamic, Emotional (MDE); Octalysis; and Sustainable Gamification Design (SGD) [45]. Nevertheless, none of the papers in our study used them; all papers use their own approaches. Even though the growth of gamification in the software development industry has been notable, the adoption of gamification in software development is slower than in other domains [45]. Regarding the proposal of [22], the authors made a CASE-tools framework covering all software development project activities. This framework has the advantage of being customizable: the game elements used for the project can be adjusted, and the rules of the game can also be built easily. Proposal [34] made a comprehensive gamification framework drawing on different disciplines. The SGM framework can store project data and serve as a knowledge base for future projects. It provides project tracking and metric data over time, such as the full line count of code, number of defects, and overall project performance, and it is flexible enough to be modified using EGS, the expert system for gamification implementation. Paper [23] uses the Scrum Hero framework to assist in project administration, while paper [24] uses the PenQuest framework, which is designed for cyber security assessment. Paper [25] does not mention the framework used. Another paper made a framework for intelligent city gamification called the automatic procedural challenge generator framework. The level of success from using a gamification framework is reported in eleven (11) sources [22–24, 26, 30, 32, 34, 36, 37, 41, 43]; the rest of the publications make no mention of a success indicator. Their success criteria were based on different aspects. In [41], the success claim of the framework was based on the performance of employees, which was better when compared to before implementing the gamified software. Ref [43] claimed engagement improvement while using their application. Ref [26] mentioned that the gamified model made it interesting for the developers, and they will be
more motivated to pay attention to software quality. In [22], the framework's design of behaviors, achievements, and gamification rules allows the institution's gamified work environment to be highly customizable. In abstracted (IT) infrastructures, the gamified approach was successful in training people, assessing risk mitigation techniques, and computing new attacker/defender scenarios [24].
5 Conclusion One of the conclusions of this paper is that most research works on gamification in software development applied the point as a gamification element and as a reward system for the users. The level is the second most used element, followed by the badge, social engagement, challenge-quest, leaderboard, voting, and betting. We also emphasize that the most important factor for succeeding with gamification in software development is to utilize the appropriate game features to increase user engagement. Although some tasks were more challenging than others, users received the same points for the different challenges; therefore, we suggest that points be adjusted or scaled according to difficulty to encourage users to accomplish their tasks. Another result of this paper highlights that even though the growth of gamification in the software development industry has been notable, software development has been slower to adopt gamification than other domains. Most papers implementing gamification in software development focused on software requirements, while the other software processes do not implement gamification yet; applying gamification to these other software process areas is therefore preferable. In addition, several papers referring to gamification do not explain the gamification implementation in their works. Using a framework makes implementation easier; however, none of the papers used a common framework, so gamification had to be implemented from scratch. Thus, it remains a challenge and an opportunity to be resolved in future research. We suggest that future studies develop mature frameworks applicable to software engineering. Acknowledgment. This research was supported by the E-Government and E-Business Laboratory, Faculty of Computer Science, Universitas Indonesia.
References 1. Huotari, K., Hamari, J.: A definition for gamification: anchoring gamification in the service marketing literature, Electron. Mark. 27, 21–31 (2017). https://doi.org/10.1007/s12525-0150212-z 2. Deterding, S., Sicart, M., Nacke, L., Ohara, K., Dixon, D.: Gamification. using game- design elements in non-gaming contexts. In: CHI’11 Extended Abstracts on Human Factors in Computing Systems, pp. 2425–2428 (2011) 3. Bourque, P., Fairley, R.E.: The Guide to the Software Engineering Body of Knowledge (SWEBOK Guide), 3.0. IEEE Computer Society (2014) 4. Caccamese, A., Bragantini, D.: The hidden pyramid. In: PMI® Global Congress 2015, Orlando, Florida, USA, PA: Project Management Institute (2015)
5. Project Management Institute: A Guide to the Project Management Body of Knowledge (PMBOK® Guide) 6th edn., PMI (2017) 6. Naik, N., Jenkins, P.: Relax, it’s a game: utilising gamification in learning agile scrum software development. In: 2019 IEEE Conference on Games (CoG), London, UK, pp. 1–4, (2019). https://doi.org/10.1109/CIG.2019.8848104 7. Machuca, L.V., Hurtado, G.P.G.: Toward a model based on gamification to influence the productivity of software development teams. In: 2019 14th Iberian Conference on Information Systems and Technologies (CISTI), Coimbra, Portugal, pp. 1–6. (2019). https://doi.org/10. 23919/CISTI.2019.8760813 8. Platonova, V., Berzisa, S.: Gamification framework for software development project processes Vide. Tehnologija. Resource - Environment, Technology, Resources 2, 114–118 (2017) 9. Jaramillo, S.G., Cadavid, J.P., Jaramillo, C.M.Z.: Adaptation of the 6D gamification model in a software development course. In: 2018 XIII Latin American Conference on Learning Technologies (LACLO), Sao Paulo, Brazil, pp. 85–88 (2018). https://doi.org/10.1109/LACLO.2018.00030 10. Purwandari, B., Sutoyo, M.A.H., Mishbah, M., Dzulfikar, M.F.: Gamification in egovemment: a systematic literature review. In: 2019 Fourth International Conference on Informatics and Computing (ICIC), pp. 1–5 (2019). https://doi.org/10.1109/ICIC47613.2019.898 5769 11. Pedreira, O., Garc´ıa, F., Brisaboa, N., Piattini, M.: Gamification in software engineering – a systematic mapping. Inf. Softw. Technol. 57, 157–168 (2015). https://doi.org/10.1016/j.inf sof.2014.08.007 12. Suryono, R.R., Purwandary, B., Budi, I.: Peer to Peer (P2P) lending problems and potential solutions: a systematic literature review. Procedia Comput. Sci. (2019). https://doi.org/10. 1016/j.procs.2019.11.116 13. Jamra, R.K., Jati, B.A, Kautsarina, Suryono, R.R.: Systematic review of issues and solutions for security in E-commerce. In: International Conference on Electrical Engineering and Informatics (Celtics) (2020). https://doi.org/10.1109/ICELTICs50595.2020.9315437 14. Cunningham, C., Zickerman, G.: Gamification by Design: Implementing Game Mechanics in Web and Mobile Apps, 1st edn. O’Reilly Media, Inc. (2011) 15. Hamari, J., Koivisto, J., Sarsa, H.: Does gamification work?–a literature review of empirical studies on gamification. In: Proceedings 47th Hawaii International Conference System Science Hawaii, pp. 3025–3034. IEEE (2014). https://doi.org/10.1109/HICSS.2014.377 16. Hamari, J., Shernoff, D., Rowe, E., Coller, B., Edwards, T.: Challenging games help students learn: an empirical study on engagement, flow, and immersion in game-based learning, Comput. Human Behav 54, 170–179 (2016). https://doi.org/10.1016/j.chb.2015.07.045 17. Hamari, J.: Transforming homo economicus into homo ludens: a field experiment on gamification in a utilitarian peer-to-peer trading service, Electronic Commerce Research, and Applications, Volume 12, Issue 4 (2013) 18. Arai, S., Sakamoto, K., Washizaki, H., Fukazawa, Y.: A gamified tool for motivating developers to remove warnings of bug pattern tools. In: 6th International Workshop on Empirical Software Engineering in Practice, November 2014. https://doi.org/10.1109/iwesep.2014.17 19. Management Association: Gamification Concepts, Methodologies, Tools, and Applications. IGI Global (2015). https://doi.org/10.4018/978-1-4666-8200-9 20. Landsell, J., Hagglundd, E.: Towards a Gamification Frame-work: Limitations and opportunities when gamifying business processes: Technical report (2016). 
https://umu.diva-portal. org/smash/get/diva2:929548/FULLTEXT01.pdf. Accessed 16 Mar 2021 21. International Institute of Business Analysis, A Guide to the Business Analysis Body of Knowledge (BABOK), 3rd edn., IIBA (2015)
22. Pedreira, O., Garc´ıa, F., Piattini, M., Cortinas, A., Pena, A.C.: An architecture for software engineering gamification. In: Tsinghua Science and Technology, vol. 25, no. 6, pp. 776–797 (2020). https://doi.org/10.26599/TST.2020.9010004 23. Marques, R., Costa, G., Mira da Silva, M., Gonçalves, D., Gonçalves, P.: A gamification solution for improving Scrum adoption. Empir. Softw. Eng. 25(4), 2583–2629 (2020). https:// doi.org/10.1007/s10664-020-09816-9 24. Luh, R., Temper, M., Tjoa, S., Schrittwieser, S., Janicke, H.: PenQuest: a gamified attacker/defender meta model for cyber security assessment and education. J. Comput. Virology Hacking Tech. 16(1), 19–61 (2019). https://doi.org/10.1007/s11416-019-00342-x 25. Erdogmus, T., Czermak, M., Baumsteiger, D., et al.: How to support clients and vendors in IT outsourcing engagements: the different roles of third-party advisory services. J. Inf. Technol. Teach. Cases 8(2), 184–191 (2018). https://doi.org/10.1057/s41266-018-0038-6 26. Foucault, M., Blanc, X., Falleri, J.-R., Storey, M.-A.: Fostering good coding practices through individual feedback and gamification: an industrial case study. Empir. Softw. Eng. 24(6), 3731–3754 (2019). https://doi.org/10.1007/s10664-019-09719-4 27. Marinho, M., Sampaio, S., Moura, H.: Managing uncertainty in software projects. Innovations Syst. Softw. Eng. 14(3), 157–181 (2017). https://doi.org/10.1007/s11334-017-0297-y 28. Wu, Y., Wang, S., Bezemer, C.-P., Inoue, K.: How do developers utilize source code from stack overflow? Empir. Softw. Eng. 24(2), 637–673 (2018). https://doi.org/10.1007/s10664018-9634-5 29. Xu, H., König, L., Cáliz, D., Schmeck, H.: A generic user interface for energy management in smart homes. Energy Inf. 1(1), 1–63 (2018). https://doi.org/10.1186/s42162-018-0060-0 30. Khoshkangini, R., Valetto, G., Marconi, A., Pistore, M.: Automatic generation and recommendation of personalized challenges for gamification. User Model. User-Adap. Inter. 31(1), 1–34 (2020). https://doi.org/10.1007/s11257-019-09255-2 31. Visser, W.F.: A blueprint for performance-driven operations management. Mining, Metallurgy Exploration 37(3), 823–831 (2020). https://doi.org/10.1007/s42461-020-00199-5 32. Ebbers, F., Zibuschka, J., Zimmermann, C., Hinz, O.: User preferences for privacy features in digital assistants. Electron. Mark. 31(2), 411–426 (2020). https://doi.org/10.1007/s12525020-00447-y 33. Sillaber, C., Waltl, B., Treiblmaier, H., Gallersdörfer, U., Felderer, M.: Laying the foundation for smart contract development: an integrated engineering process model. IseB 19(3), 863–882 (2020). https://doi.org/10.1007/s10257-020-00465-5 34. Chow, I., Huang, L.: A Software Gamification Model for Cross-Cultural Software Development Teams. Proceedings of the International Conference on Management Engineering, Software Engineering and Service Sciences (ICMSS 2017), pp. 1–8. Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3034950.3034955 35. Ren, W., Barrett, S., Das, S.: Toward gamification to software engineering and contribution of software engineer. In: Proceedings of the 2020 4th International Conference on Management Engineering, Software Engineering and Service Sciences, pp. 1–5. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3380625.3380628 36. García, F., Pedreira, O., Piattini, M., Pena, A.C., Penabad, M.: A framework for gamification in software engineering. J. Syst. Softw. 132, 21–40, ISSN 0164-1212 (2017). https://doi.org/ 10.1016/j.jss.2017.06.021 37. 
Morschheuser, B., Hassan, L., Werder, K., Hamari, J.: How to design gamification? A method for engineering gamified software, Inf. Softw. Technol. 95, 219–237 (2018). ISSN 0950-5849. https://doi.org/10.1016/j.infsof.2017.10.015 38. Alhammad, M.M., Moreno, A.M.: Gamification in software engineering education. J. Syst. Softw. 141, 131–150 (2018). ISSN 0164-1212. https://doi.org/10.1016/j.jss.2018.03.065
39. Porto, D.P., Jesus, G.M., Ferrari, F.C., Pinto, S.C., Fabbri, F.: Initiatives and challenges of using gamification in software engineering. J. Syst. Softw. 173 (2021). ISSN 0164-1212. https://doi.org/10.1016/j.jss.2020.110870 40. Herranz, Guzmán1, J.G., Seco1, A.A., Larrucea, X.: Gamification for software process improvement: a practical approach, The Institute of Engineering and Technology (2018). https://doi.org/10.1049/iet-sen.2018.5120 41. Memar, N., Krishna, A., McMeekin, D., Tan, T.: Investigating information system testing gamification with time restrictions on testers performance. Australasian J. Inf. Syst. 24 (2020). https://doi.org/10.3127/ajis.v24i0.2179 42. Lowry, P. B., Petter, S., Leimeister, J. M.: Desperately seeking the artefacts and the foundations of native theory in gamification. Eur. J. Inf. Syst. 29(4) (2020) 43. Nowostawski, M., McCallum, S., Mishra, D.: Gamifying research in software engineering, Computer Application in Engineering Education, vol. 26, no. Twenty-Fifth Anniversary Special Issue of Computer Applications in Engineering Education Innovation in Engineering Education with Digital Technologies, pp. 1641–1652 (2018) 44. Muñoz, M., Negrón, A.P.P., Mejia, J., Hurtado, G.P.G., Alvarez, M.C.G., Hernández, L.: Applying gamification elements to build, The Institution of Engineering and Technology, no. Gamification and Persuasive Games for Software (2018) 45. Ivanova, S., Georgiev, G.: Towards a justified choice of gamification framework when building an educational application. In: 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 594–599 (2019). https:// doi.org/10.23919/MIPRO.2019.8757085
Robust Method for Estimating the Fundamental Matrix by a Hybrid Optimization Algorithm Soulaiman El Hazzat1(B) and Mostafa Merras2 1 LSI, Department of Computer Science, Polydisciplinary Faculty of Taza, Sidi Mohamed Ben
Abdellah University, Fez, Morocco [email protected] 2 IMAGE Laboratory, High School of Technology, Moulay Ismaïl University, B.P. 3103, Toulal, Route d’Agouray, Meknes, Morocco
Abstract. In this paper, we present a new method for estimating the fundamental matrix by a hybrid algorithm that combines the Genetic Algorithm with Levenberg-Marquardt. Compared to classical optimization methods, the estimation of the fundamental matrix by this approach can avoid being trapped in a local minimum and converges quickly to an optimal solution without initial estimates of the elements of the fundamental matrix. Several experiments are implemented to demonstrate the validity and performance of the presented approach. The results show that the proposed technique is both accurate and robust compared to classical optimization methods. Keywords: Fundamental matrix · Computer vision · Genetic algorithm · Levenberg-Marquardt · Nonlinear optimization
1 Introduction The fundamental matrix is the key to many problems of computer vision [1, 10–13]. It encapsulates all the information about the movements of the camera. In general, the estimation of this matrix is done by two techniques. The first is linear, allowing the direct calculation of the fundamental matrix from point matches by means of a linear criterion [6]. This method suffers from two flaws, related to the lack of constraint on the rank of the sought matrix and the lack of normalization of the linear criterion, which leads to significant errors in the estimation of this matrix. In order to overcome these difficulties, nonlinear techniques were proposed. However, the minimization algorithms used, namely RANSAC and Levenberg-Marquardt [2–4], also suffer from a convergence problem because minimization by these algorithms requires a very careful initialization step (if the initialization is very far from the optimum, then it is difficult to converge to an optimal solution). Similarly, the nonlinear criterion is not convex and contains many complex local minima, so these algorithms are easily trapped in a local minimum. This has prompted researchers to test approaches based on genetic algorithms (GA) [7]. These approaches are classified as nonlinear stochastic optimization methods
and are much less sensitive to the initial solution and are expected to be able to determine the global minimum in all cases. In this paper, we present a new technique for estimating the fundamental matrix by a hybrid optimization algorithm that combines the Genetic Algorithm and Levenberg-Marquardt (GA-LM). In this genetic approach, the fundamental matrix estimation procedure is done in two steps. First, we propose a GA to find a good approximation of the solution. Then, a nonlinear minimization algorithm (Levenberg-Marquardt) is applied to refine the results. The rest of this paper is organized as follows. In Sect. 2, we define the fundamental matrix. Section 3 describes the proposed GA-LM algorithm. Section 4 presents the experimental results and analysis. The conclusion is presented in Sect. 5.
2 Fundamental Matrix
Let m_i(x_i, y_i) and m'_i(x'_i, y'_i) be the projections of the scene point M_i into the left and right images. Then m'_i lies on the epipolar line associated with m_i, which is called the epipolar constraint; this constraint plays a very important role in stereoscopic vision. When the intrinsic parameters of the cameras are not known, the epipolar constraint can be represented algebraically by a matrix of size 3 × 3, called the fundamental matrix F [8]. So we can express the epipolar constraint between two corresponding points (m_i, m'_i) by the following relation:
$$ m_i'^{\top} F\, m_i = 0 \qquad (1) $$
The above equation can be rewritten as a linear system in the 9 elements of the matrix F and the coordinates of the points m_i and m'_i:
$$ U_i\, f = 0 \qquad (2) $$
with
$$ U_i = [\, x'_i x_i \;\; x'_i y_i \;\; x'_i \;\; y'_i x_i \;\; y'_i y_i \;\; y'_i \;\; x_i \;\; y_i \;\; 1 \,], \qquad f = [\, f_{11} \; f_{12} \; f_{13} \; f_{21} \; f_{22} \; f_{23} \; f_{31} \; f_{32} \; f_{33} \,]^{\top}. $$
Eight matches are sufficient to solve the system of Eq. (2) [9]. This method does not require an iterative computation. However, it suffers from two problems: the absence of a constraint on the rank of F and the lack of normalization. To remedy these problems, a nonlinear computation is used.
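As an illustration of this linear step, the following NumPy sketch (our own; the array names pts and pts_p for the matched points m_i and m'_i are assumptions) builds one row U_i per correspondence, solves Eq. (2) by SVD, and then enforces the rank-2 constraint that the plain linear method ignores.

```python
import numpy as np

def linear_fundamental(pts: np.ndarray, pts_p: np.ndarray) -> np.ndarray:
    """Linear estimate of F from n >= 8 matches pts (m_i) and pts_p (m'_i), each n x 2."""
    x, y = pts[:, 0], pts[:, 1]
    xp, yp = pts_p[:, 0], pts_p[:, 1]
    # One row U_i per correspondence, following Eq. (2).
    U = np.column_stack([xp * x, xp * y, xp, yp * x, yp * y, yp,
                         x, y, np.ones(len(pts))])
    # f is the right singular vector associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(U)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2, the constraint the linear criterion does not impose.
    Uf, S, Vtf = np.linalg.svd(F)
    S[2] = 0.0
    return Uf @ np.diag(S) @ Vtf
```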
3 Estimation of the Fundamental Matrix by a GA-LM Algorithm In this section, we propose a hybrid GA-LM algorithm for the estimation of the fundamental matrix, by minimizing a nonlinear cost function that measures the geometric distance between each point of a matching pair and its epipolar line (Fig. 1), given by formulas (3) and (4):
$$ d_i = d(m_i, F^{\top} m'_i) = \frac{\left| m_i^{\top} F^{\top} m'_i \right|}{\sqrt{(F^{\top} m'_i)_1^2 + (F^{\top} m'_i)_2^2}} \qquad (3) $$
$$ d'_i = d(m'_i, F m_i) = \frac{\left| m_i'^{\top} F\, m_i \right|}{\sqrt{(F m_i)_1^2 + (F m_i)_2^2}} \qquad (4) $$
where (F m_i)_k and (F^T m'_i)_k denote, respectively, the k-th element of the vectors F m_i and F^T m'_i.
Fig. 1. Geometric distance
The main idea of the method is to find the fundamental matrix that minimizes the sum of the squared distances given above (for n matches), using the following cost function:
$$ \min_{F} \; \sum_{i=1}^{n} \left( d_i^2 + d_i'^2 \right) \qquad (5) $$
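The NumPy sketch below (our own illustration; the names m and m_p for the homogeneous points m_i and m'_i are assumptions) evaluates this symmetric geometric-distance cost for a candidate vector f.

```python
import numpy as np

def epipolar_cost(f: np.ndarray, m: np.ndarray, m_p: np.ndarray) -> float:
    """Cost of Eq. (5) for f (9 values) and homogeneous points m, m_p of shape n x 3."""
    F = f.reshape(3, 3)
    l_p = m @ F.T                       # epipolar lines F m_i in the right image
    l = m_p @ F                         # epipolar lines F^T m'_i in the left image
    num = np.abs(np.sum(m_p * l_p, axis=1))                 # |m'_i^T F m_i|
    d_p = num / np.sqrt(l_p[:, 0] ** 2 + l_p[:, 1] ** 2)    # Eq. (4)
    d = num / np.sqrt(l[:, 0] ** 2 + l[:, 1] ** 2)          # Eq. (3)
    return float(np.sum(d ** 2 + d_p ** 2))                 # Eq. (5)
```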
In order to minimize relation (5) by GA-LM, we start from a randomly generated initial population: for a desired initial population, a number P of individuals is generated at random. Each individual represents a potential solution to the F estimation problem and is composed of genes representing the elements of the fundamental matrix. For this, we concatenate the elements of F (f_11, f_12, f_13, f_21, f_22, f_23, f_31, f_32, f_33) into a single vector (Table 1) to form the genes of the individuals, and we choose a binary coding of these individuals.
Fig. 2. GA-LM for the estimation of F.
Table 1. The j-th individual of the population.
f_11^j | f_12^j | f_13^j | f_21^j | f_22^j | f_23^j | f_31^j | f_32^j | f_33^j
The genetic algorithm performs the optimization by repeating the genetic operations until the stopping criterion is satisfied. At the end of the GA execution, the best individual is passed as input to the LM algorithm, and a refined solution is thus obtained. If the error obtained after this refinement is acceptable, the results are final; if not, the minimization process starts again from the GA, which includes the last best individual in the initialization of the new population. The steps of our algorithm for estimating F are given in the flowchart in Fig. 2.
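A compact sketch of this hybrid loop is shown below. It is our own reading of the flowchart, not the authors' code: the population size, mutation scale, and generation count are illustrative, the GA uses real-valued (rather than binary) coding for brevity, and SciPy's Levenberg-Marquardt solver stands in for the LM refinement step.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(f, m, m_p):
    """Stacked geometric distances d_i and d'_i for homogeneous points m, m_p (n x 3)."""
    F = f.reshape(3, 3)
    l_p, l = m @ F.T, m_p @ F
    num = np.abs(np.sum(m_p * l_p, axis=1))
    return np.concatenate([num / np.hypot(l[:, 0], l[:, 1]),
                           num / np.hypot(l_p[:, 0], l_p[:, 1])])

def ga_lm(m, m_p, pop_size=60, generations=300, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.normal(size=(pop_size, 9))               # random initial population
    for _ in range(generations):
        cost = np.array([np.sum(residuals(ind, m, m_p) ** 2) for ind in pop])
        parents = pop[np.argsort(cost)[: pop_size // 2]]                  # selection
        children = parents + rng.normal(scale=0.05, size=parents.shape)   # mutation
        pop = np.vstack([parents, children])
    best = min(pop, key=lambda ind: np.sum(residuals(ind, m, m_p) ** 2))
    refined = least_squares(residuals, best, args=(m, m_p), method="lm")  # LM step
    return refined.x.reshape(3, 3)
```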
4 Experiments To demonstrate the robustness of our approach, we use a sequence of images taken by a video camera. Figure 3 shows two of the images.
Fig. 3. Two pairs of images were used.
We find matched points between each pair of images (Fig. 4). These points will be used as input data for our optimization algorithm.
Fig. 4. The matches obtained between two pairs of images.
In order to identify any improvement due to the estimation of F by GA-LM, we first run the GA and LM separately to estimate F. Then we compute the mean geometric distance of the points to their epipolar lines to test the quality of the fundamental matrix for each of these three optimization algorithms. Table 2 shows the fundamental matrices estimated between each image pair by the three optimization algorithms and their characteristics.
Table 2. Experimental results for the three algorithms.
Metric | LM | GA | GA-LM
Optimal geometric distance | 9.2359 | 4.0237 | 1.1234
Mean distance | 8.6578 | 3.6730 | 0.9865
Number of generations | 320 | 300 | 270
Figure 5 represents the values of the cost function over the generations. The analysis of Table 2 and the reading of Fig. 5 show that the estimation of the fundamental matrix by GA-LM is more robust and accurate compared to GA and LM alone. This is logical: the LM algorithm is a local search method that requires an initialization step to find the optimum, and it can easily fall into a local optimum because the cost function is nonlinear and has many local minima. This encouraged us to test the genetic approach, a global search method that does not require initialization and converges quickly toward a global solution from a population of initial solutions. However, this population of initial solutions is chosen randomly, without a priori knowledge of the elements of the fundamental matrix, so the GA alone may fail to converge to the optimal solution; the LM refinement compensates for this.
Fig. 5. Convergence of algorithms over generations.
5 Conclusion In this paper, we have presented a fundamental matrix estimation technique based on a hybrid optimization algorithm that combines GA and LM; the proposed approach converges to an optimal solution without initial estimates of the fundamental matrix. The obtained results show the performance of the proposed technique in terms of convergence and accuracy compared to the single optimization methods.
References 1. El Hazzat, S., Merras, M., El Akkad, N., Saaidi, A., Satori, K.: Enhancement of sparse 3D reconstruction using a modified match propagation based on particle swarm optimization. Multimedia Tools Appl. 78(11), 14251–14276 (2018). https://doi.org/10.1007/s11042-0186828-1 2. Levenberg, K.: A method for the solution of certain non-linear problems in least squares. Quart. Appl. Math. 164–168 (1944) 3. Fischler, M., Bolles, R.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981) 4. Luong, Q.T., Faugeras, O.: The fundamental matrix: theory, algorithms and stability analysis. Int. J. Comput. Vision 17(1), 43–76 (1996) 5. Merras, M., El Hazzat, S., Saaidi, A., Satori, K., Nazih, A.G.: 3D face reconstruction using images from cameras with varying parameters. Int. J. Autom. Comput. 14(6), 661–671 (2016). https://doi.org/10.1007/s11633-016-0999-x 6. Hartley, R.: In defence of the 8-point algorithms. In: Proceeding of the International Conference on Computer Vision, pp. 1064–1070. IEEE Computer Society Press (1995) 7. Holland, J.H.: Adaptation in Natural and Artificial Systems. MIT Press (1992)
8. Faugeras, O. D., Luong, Q. -T., Maybank, S. J.: Camera self-calibration: theory and experiments. In: Sandini, G. (ed.) ECCV 1992. LNCS, vol. 588, pp. 321–334. Springer, Heidelberg (1992). https://doi.org/10.1007/3-540-55426-2_37 9. Chai, J., Ma, S.D.: Robust epipolar geometry estimation using genetic algorithm. Pattern Recogn. Lett. 19, 829–838 (1998) 10. El Hazzat, S., Merras, M., El Akkad, N., Saaidi, A., Satori, K.: 3D reconstruction system based on incremental structure from motion using a camera with varying parameters. Vis. Comput. 34(10), 1443–1460 (2017). https://doi.org/10.1007/s00371-017-1451-0 11. El Hazzat, S., Merras, M., El Akkad, N., Saaidi, A., Satori, K.: Silhouettes based-3D object reconstruction using hybrid sparse 3D reconstruction and volumetric methods. In: Bhateja, V., Satapathy, S.C., Satori, H. (eds.) Embedded Systems and Artificial Intelligence. AISC, vol. 1076, pp. 499–507. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-09476_47 12. Merras, M., El akkad, N., El hazzat, S., Saaidi, A., Nazih, A.G., Satori, K.: Method of 3D mesh reconstruction from sequences of calibrated images. In: International Conference on Multimedia Computing and Systems (ICMCS) (2014) 13. El Akkad, N., El hazzat, S., Saaidi, A., Satori, K.: Reconstruction of 3D scenes by camera self-calibration and using genetic algorithms. 3D Res. 7(1),1–17 (2016). https://doi.org/10. 1007/s13319-016-0082-y
SDN Southbound Protocols: A Comparative Study Lamiae Boukraa1(B) , Safaa Mahrach2 , Khalid El Makkaoui3 , and Redouane Esbai1 1
LMASI laboratory, FPD, Mohammed First University, Nador, Morocco [email protected] 2 IR2M laboratory, FST, Hassan First University, Settat, Morocco 3 LaMAO laboratory, MSC team, FPD, Mohammed First University, Nador, Morocco
Abstract. Software-Defined Network (SDN) is an emerging technology in computer networks. SDN simplifies the design, control, and management of next-generation networks (e.g., 5G, Cloud computing, and Big data) by separating the existing network into a centralized control plane (CP) and a remotely programmable data plane (DP). The SDN southbound interface (SBI) connects the CP and the DP. Since the emergence of the SDN, many SBI protocols have been suggested; nowadays, OpenFlow is the most widely used. This paper provides readers with a deep comparative study on SDN southbound protocols, namely OpenFlow, ForCES, and P4. Keywords: Software-Defined Network (SDN) · Southbound Interface (SBI) · NetConf · OpenFlow · ForCES · P4
1 Introduction
As demand for online services (e.g., cloud computing [1,2], big data applications [3], and automated networking platforms for IoT [4]) continues to grow, the amount of equipment has increased, and the complexity of its administration grows with it; this is why traditional networks encounter big problems. In 2011, the Open Networking Foundation (ONF) suggested SDN, which supports network programming and automation for network operations [5]. SDN is considered a key technology for improving the management and control of large-scale networks. SDN separates the control plane (CP) from the data plane (DP) to make traditional networks more flexible, dynamic, and programmable. Moreover, it allows applications and network services to control the abstract infrastructure directly [6,7]. Figure 1 depicts the SDN architecture, which has three principal planes: the DP, CP, and management plane (MP) [5,9]. The bottom layer is called the "data plane". It consists of physical network elements, usually switches, that forward packets according to flow rules defined
Fig. 1. SDN architecture.
by the controller. The interface between the DP and the CP is named the southbound interface (SBI). The most commonly utilized SBI is OpenFlow [8]. The middle layer, called the “control plane,” consists of one or more controllers that control and manage the devices in the DP and define traffic flows established on the network policy. The controllers employ East and West interfaces for exchanging information on the inter-domain network [9]. The top layer is called the “management plane”. It contains several SDN applications, traffic engineering, load balancing, firewalls, etc., which are created to execute typical control and management techniques. The MP interacts with the CP by employing a northbound interface [9]. The article focuses on the SBI, linking the DP and the CP. The main purpose of this interface is to transmit messages by the controller to the DP devices and deliver information about these devices. It is used to determine the network topology, define the network flows, and execute the requests sent by the MP. The paper aims to introduce a comprehensive detail of different SDN SBI. We offer a comparative study between three protocols, namely, ForCES [10], OpenFlow [11], and P4 [12]. In this paper, Sect. 2 gives a comprehensive background of SDN, including the multilayers and interfaces. Section 3 presents in detail an overview of SBI interfaces. Section 4 discusses a comparative study of the three southbound protocols. The paper closes with a conclusion.
2 SDN Background
In traditional network architecture, the control functionalities and data forwarding are integrated into the same devices, so the hardware and software evolution of network devices is tightly coupled. Network management becomes heavy and difficult to perform, error-prone, and time-consuming. SDN is an emergent network paradigm that isolates the CP from the DP. It can simplify network management, define network flows, improve network
capacity and enable virtualization within the network [7]. In what follows, we present an overview of the SDN architecture, the different components, and their functionalities. We then discuss the different SDN interfaces and some of their properties.
2.1 SDN Architecture
The SDN architecture comprises three planes: the DP, CP, and MP [5,9], as shown in Fig. 1. The DP, also known as the forwarding plane, includes software and hardware switches, routers, etc. It is responsible for handling packets in the data path using the rules received from the controller. Forwarding elements communicate with the controller through an open interface, usually OpenFlow. Each switch keeps one or more flow tables whose entries are delivered by the controller; each flow table entry is composed of an entry identifier, statistics, and an action. Actions include, but are not limited to, forwarding, modifying, and dropping packets. The CP includes one or more controllers implementing a network operating system (NOS) to ease network management. Controllers provide a global view of the network, which helps populate switch flow tables and provide network status information to the MP for application development. Controllers communicate with the other planes using southbound, northbound, and east/westbound interfaces. POX [13], Ryu [14], OpenDaylight [15], and FloodLight [16] are some of the open and popular controllers that support the OpenFlow standard interface; several other controllers have been proposed in [13–17]. The MP is also known as the application plane. Using SDN, we do not need dedicated devices such as load balancers, routers, firewalls, policy-enforcement boxes, etc.; the SDN MP allows us to implement these functionalities as software applications on a server. As a result, network managers can easily and dynamically control the DP through the MP and CP.
2.2 SDN Interfaces
Application programming interfaces (APIs) play an essential role in SDN and provide the interaction between the planes [18]. Figure 1 shows the various interface APIs and some of their properties. Southbound API: allows communication between the CP and DP. It is used to get information about network elements, push configuration to network elements, and install flow entries. It also offers an abstraction model of the DP to the CP. A wide range of traditional network elements creates major issues, namely
heterogeneity, specific architecture for switching fabric, and language specifications. SDN addresses these challenges by supplying an open and standardized SBI. Openflow is the standard SBI considered between CP and DP. Northbound API: plays a vital role in application development and provides a common communication interface between the SDN controller and the MP. It also helps provide information about network elements for application developers, making network management easy and dynamic. The northbound APIs support a wide variety of applications. There are various SDN Northbound APIs for managing diverse types of applications through an SDN controller. East/Westbound API: due to an exponential expansion in network devices and large-scale networks, distributed controllers are becoming necessary. In such a distribution, each controller manages its domain with forwarding elements and needs to report information about its domain for a consistent overview of the entire network. Eastbound APIs are employed to share information between different distributed controllers. Westbound APIs make the legacy network devices communicate with controllers. We discuss and review some example solutions for SBI [10–12] in the following sections.
3 Overview of SDN Southbound Protocols
NetConf [19] is a protocol that provides mechanisms for installing, handling, and deleting network device configurations. It uses a remote procedure call (RPC) mechanism to facilitate communication; one of NetConf's key aspects is that it uses XML, which directly lowers cost and allows fast access to new features. NetConf was suggested before SDN, but it also comes with a specific API, just like OpenFlow; the API allows receiving and sending configuration datasets [20,21]. Since the SDN's emergence, many southbound protocols have been proposed, and the most widely employed SBI is OpenFlow. Some protocols depend on OpenFlow, like OVSDB [22] and the OpenFlow configuration protocol [23], and some protocols are independent of OpenFlow or are parallel proposals, e.g., ForCES and P4. In what follows, we present three southbound protocols: OpenFlow, ForCES, and P4.
3.1 OpenFlow
In 2012, the ONF proposed the OpenFlow protocol, which is considered the first standard interface for separating the network CP and DP.
A. OpenFlow Evolution The OpenFlow protocol [11] was created and is considered the standard SBI between the CP and DP. The controller utilizes this interface for communication with the forwarding elements. OpenFlow has developed from version 1.0, with only 12 matching fields and one flow table (which directly impacted OpenFlow scalability), to version 1.5, with 41 matching fields and many new functionalities. Version 1.1 introduced several tables and used a fixed-length format for match fields. In version 1.2, the format was changed to a TLV (Type-Length-Value) format to provide more flexibility; this version also allows multiple controllers to operate in master/slave mode. Version 1.3 offered a counter table containing several counter entries called counter identifiers; a packet was either forwarded to a particular port or dropped. Version 1.4 introduced the synchronized table: the DP devices transmit packets according to the policy determined by the MP, so if a policy update is issued by the controller, all controlled devices will receive the update. Version 1.5 improves synchronization between multiple switches. B. OpenFlow Architecture Figure 2 represents a basic OpenFlow architecture consisting of end hosts, a controller, and OpenFlow switches [24]. An OpenFlow switch includes one or more flow tables, which fulfill packet lookup and forwarding, and a secure channel for communicating with the controller via the OpenFlow protocol. The OpenFlow pipeline has numerous flow tables, group tables, and meter tables. Flow tables consist of flow entries that determine how packets will be processed. Flow entries mostly contain the following: – Matching rules are utilized to match inbound packets; matching fields can contain the packet header information (Ethernet Src/Dst, IP Src/Dst, etc.). – Actions define how the packets should be processed. – Counters are employed to assemble statistics for a particular flow, like the number of received packets, bytes, and flow duration.
Fig. 2. OpenFlow architecture.
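To make the structure of a flow entry concrete, the following Python sketch (our own simplified abstraction, not the OpenFlow specification or any controller API) models a flow table whose entries carry match fields, an action, and counters, and hands a table miss to the controller.

```python
from dataclasses import dataclass, field

@dataclass
class FlowEntry:
    match: dict                  # e.g. {"eth_dst": "00:aa:bb:cc:dd:ee"}
    action: str                  # e.g. "output:2" or "drop"
    priority: int = 0
    packets: int = 0             # counters gathered for flow statistics
    bytes: int = 0

@dataclass
class FlowTable:
    entries: list = field(default_factory=list)

    def lookup(self, pkt: dict) -> str:
        for e in sorted(self.entries, key=lambda e: -e.priority):
            if all(pkt.get(k) == v for k, v in e.match.items()):
                e.packets += 1
                e.bytes += pkt.get("len", 0)
                return e.action
        return "packet-in"       # table miss: send the packet to the controller

table = FlowTable()
table.entries.append(FlowEntry({"eth_dst": "00:aa:bb:cc:dd:ee"}, "output:2", 10))
print(table.lookup({"eth_dst": "00:aa:bb:cc:dd:ee", "len": 64}))   # -> "output:2"
```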
After matching an incoming packet, a specific action is performed to transmit it to one or more ports; if no match is found, the packet is sent to the controller using the Packet IN message. This message includes the information of the input port, the packet header, and the ID of the buffer where the packet is stored. The controller sends a Packet OUT message to answer the Packet IN message. This message contains the buffer ID of the corresponding Packet IN message and the actions to be performed: transfer to a particular port, drop, update, or add entries to the flow tables [7]. Figure 3 describes the exchange of these messages between the CP and the DP.
Fig. 3. OpenFlow discovery messages.
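As a minimal illustration of this packet-in/packet-out exchange, the sketch below uses the Ryu controller (one of the controllers mentioned above) with OpenFlow 1.3. It is our own example and deliberately simplistic: it assumes the switch sends unmatched packets to the controller and simply floods them back out instead of installing flow entries.

```python
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3

class FloodOnMiss(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def packet_in_handler(self, ev):
        msg = ev.msg                               # packet-in: header, buffer id, ...
        dp = msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser
        in_port = msg.match['in_port']             # input port reported by the switch
        actions = [parser.OFPActionOutput(ofp.OFPP_FLOOD)]
        data = msg.data if msg.buffer_id == ofp.OFP_NO_BUFFER else None
        out = parser.OFPPacketOut(datapath=dp, buffer_id=msg.buffer_id,
                                  in_port=in_port, actions=actions, data=data)
        dp.send_msg(out)                           # packet-out answers the packet-in
```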
To select the optimum path for a packet through the forwarding elements, the controller starts the procedure by transmitting an LLDP frame encapsulated in a packet-out command to all switches. By default, LLDP is designed as a single-hop protocol [26]. When a switch receives a packet-out message, it sends LLDP frames out of all of its ports. An OpenFlow neighbor receiving these frames will look for a matching flow entry in its local tables; if such an entry does not exist, it will send the packet to the controller with a packet-in to update the topological view. By periodically sending packet-out messages to all switches, the controller keeps its topology data up to date. However, mishandling of the LLDP payload inside these packets can cause serious problems in the discovery process. The SSL protocol is used to secure communication between the controller and the switches. OpenFlow's discovery is thus based on packet-out and packet-in messages sent from the controller and the switch, respectively; LLDP is used locally in the DP's switches so that the protocol functions as intended, without changing or modifying any of its fields.
3.2 ForCES
In 2015, the Internet Engineering Task Force (IETF) concluded its work on Forwarding and Control Element Separation (ForCES), a protocol that splits the DP from the CP. ForCES establishes a framework and an associated protocol that standardize the exchange of information between the DP and the CP.
Figure 4 depicts the ForCES framework, in which the elements of the CP are called Control Elements (CEs) and those of the DP Forwarding Elements (FEs). ForCES supplies a method by which CEs can connect to and manage multiple FEs using the ForCES protocol. This protocol covers both the management of the communication channel and the control messages themselves [26].
Fig. 4. ForCES framework.
Figure 5 illustrates a ForCES-modeled FE and, in particular, the directed graph of LFBs. ForCES defines an object-oriented model, also called a modeling language; the ForCES modeling language is used to develop XML-expressed models that describe the datapath's resources in very fine detail. The language follows a building-block approach in which each block is called a Logical Functional Block (LFB). Each LFB implements a specific function and can receive, transmit, and modify packets. LFBs are then interconnected in a directed graph to form an FE. An LFB model describes all mandatory and optional information that should be exchanged between the forwarding element and the control element. This means that the development and underlying implementation of an LFB are transparent to the FE, so that a customer can programmatically and dynamically edit and control the packet rules and processing policies [10].
Fig. 5. ForCES modeled FE.
A useful mechanism supplied by ForCES is events: an LFB event is a function capable of informing the controller of various network events, e.g., timer expiration, link failure, or topology changes. ForCES provides a flexible way to describe a network function based on the LFB abstraction with an XML file. To acquire available neighbor information directly from the hardware of the network elements, LLDP is utilized locally in the DP switches or between the controller and the DP, allowing the protocol to function as it is supposed to without altering or changing its fields [26]. ForCES is expected to be faster and to consume less bandwidth for the same number of configuration messages owing to the nature of its messages. ForCES provides rich features, but it lacks open-source support.
3.3 P4
P4 is a high-level language for programming protocol-independent packet processors [27]. It is also considered a protocol between the controller and the network devices, used to represent the way packets are processed by the DP. P4 was first published in 2014. P4 presents an adaptable mechanism for packet analysis and header field matching; the goals of P4 are:
– Reconfigurability: permitting programmers to modify the way switches process packets.
– Protocol independence: since switches do not need to be tied to specific network protocols, P4 enables a programmable switch to define new header formats with new field names and types.
– Target independence: the packet processing capability must be decoupled from the underlying hardware.
A P4 program follows an abstract forwarding model [28] in which switches forward packets via a programmable parser and a set of match+action table resources, divided between ingress and egress, as shown in Fig. 6.
Fig. 6. Concept of P4.
– P4 headers: header types describe the format of the packet headers. An Ethernet header, e.g., contains the source MAC address, the destination MAC address, and the EtherType.
– Parser: specifies how the headers of each incoming packet are identified.
– Match-action tables: tables are the basic units of the match-action pipeline. The tables define the rules to run, the input fields to use, and the actions that can be applied.
– P4 deparser: reassembles the headers into a properly formed network packet that can be sent through an output port of the forwarding device.
Figure 7 depicts the forwarding model, which consists of two main operations. The configuration operations program the parser, set the order of the match+action stages, and select the header fields processed by each stage; the configuration defines which protocols are handled and how the switch can process packets. The population operations add and remove entries in the match+action tables specified during configuration; the population defines the policy applied to packets [28, 29].
Fig. 7. The abstract forwarding model.
Incoming packets are first processed by the parser, which identifies and extracts fields from the header. The extracted header fields are passed to the match+action tables, which may modify the packet header. The packet can then be forwarded, replicated (for multicast or for the CP), or dropped; it is finally passed to the egress match+action stage, which can modify it on a per-instance basis [28]. Google and Barefoot Networks have standardized the P4Runtime API. P4Runtime was designed to ensure communication between the switches and the controller. The principal objective of P4Runtime is to manage P4 objects at run time and to dynamically provision switches with the appropriate P4 program. P4Runtime is a protocol-independent API, allowing vendors to adopt it without complexity. P4 discovery is based on packet-out and packet-in messages sent by the controller and the switch, respectively, as in OpenFlow. The Media Access Control Security (MACsec) protocol is used locally in the DP switches to ensure the confidentiality, integrity, and authenticity of Ethernet (IEEE 802) frames using symmetric encryption and hash functions [30].
4 Discussion
OpenFlow, ForCES, and P4 follow the basic idea of separating the DP and CP in network elements; NetConf, on the other hand, does not aim at separating the CP and DP, its objective being to reduce complexity and enhance performance. ForCES relies on Logical Functional Blocks (LFBs) connected in a directed graph that describes the flow logic; it also specifies how the forwarding behaviour is constructed from a set of LFBs. Switches that support OpenFlow and P4 contain one or more flow tables through which the controller communicates with the switch; both OpenFlow and P4 are built around match fields and actions. NetConf uses a simple RPC-based mechanism for communication between a network administrator (CP) and a network device (DP). NetConf and ForCES are both represented as XML schemas: the schema describes the structure of RPCs for NetConf and of LFB library documents for ForCES. XML was chosen because it has the advantage of being both machine-readable and human-readable, with broadly available tool support. NetConf, OpenFlow, and P4 use the TCP protocol, whereas ForCES uses SCTP (Stream Control Transmission Protocol), which offers a range of reliability levels. One of the main differences between NetConf, ForCES, and OpenFlow is that OpenFlow has to be modeled and standardized each time a new feature is added, while NetConf and ForCES provide extensibility without repeated standardization. ForCES is expected to be faster and to use less bandwidth for the same number of configuration messages owing to the nature of its messages, but it lacks open-source support; OpenFlow and P4 consume considerable bandwidth for Packet IN and Packet OUT messages. OpenFlow is the most widespread standard architecture and southbound protocol for SDN. It consists of SDN switches with a fixed-function DP supervised by a central SDN controller; this fixed-function design makes it complex to introduce variations in the DP. The P4-programmable switch is a newer concept intended to make the network more flexible, dynamic, programmable, and scalable, and thus to solve the inflexibility of fixed-function DPs. Programming Protocol-independent Packet Processors (P4) is a programming language for describing how packets are processed by the DP; at the controller level, Python is used. Since OpenFlow and P4 have the same objective and principle, we found minimal difference in performance between them (Table 1).
Table 1. Summary of comparison between NetConf, OpenFlow, ForCES, and P4.

Southbound interface | Standardizing body | Determinant used          | Protocol | Extensibility | Security
NetConf              | IETF               | RPC-Remote procedure      | TCP      | Yes           | SSH
OpenFlow             | ONF                | Match fields and actions  | TCP      | No            | SSL, LLDP
ForCES               | IETF               | Logical Functional Block  | SCTP     | Yes           | LLDP
P4                   | ONF                | Match fields and actions  | TCP      | Yes           | P4runtime, MACsec

5 Conclusion
This paper presented a detailed overview of SDN technology, the SDN architecture, and the SDN interfaces. It focused on the SBI, which links the DP and the CP and which OpenFlow has dominated. Most of the research work on the SBI consists of extensions or enhancements to the OpenFlow protocol, which has become the de facto standard, although several other southbound interface solutions that do not depend on OpenFlow are also available. This paper has clearly shown that OpenFlow relies on SDN switches with a fixed-function DP, whereas the P4-programmable switch solves the inflexibility of fixed-function DPs: it is a concept intended to make the network more flexible, dynamic, programmable, and scalable.
References
1. Zhang, Q., Cheng, L., Boutaba, R.: Cloud computing: state-of-the-art and research challenges. J. Internet Serv. Appl. 1(1), 7–18 (2010). https://doi.org/10.1007/s13174-010-0007-6
2. El Makkaoui, K., Ezzati, A., Beni-Hssane, A., Ouhmad, S.: Fast Cloud-Paillier homomorphic schemes for protecting confidentiality of sensitive data in cloud computing. J. Ambient. Intell. Humaniz. Comput. 11(6), 2205–2214 (2020)
3. Rossi, F.D., Rodrigues, G.D.C., Calheiros, R.N., Conterato, M.D.S.: Dynamic network bandwidth resizing for big data applications. In: 13th International Conference on e-Science (e-Science), pp. 423–431. IEEE (2017)
4. Yu, W., Liang, F., He, X., et al.: A survey on the edge computing for the Internet of Things. IEEE Access 6, 6900–6919 (2017)
5. Ahmad, S., Mir, A.H.: Scalability, consistency, reliability and security in SDN controllers: a survey of diverse SDN controllers. J. Netw. Syst. Manage. 29(1), 1–59 (2021)
6. Maleh, Y., Qasmaoui, Y., El Gholami, K., Sadqi, Y., Mounir, S.: A comprehensive survey on SDN security: threats, mitigations, and future directions. J. Reliable Intell. Environ. 1–39 (2022)
7. Latif, Z., Sharif, K., Li, F., et al.: A comprehensive survey of interface protocols for software defined networks. J. Netw. Comput. Appl. 156, 102563 (2020)
8. Limoncelli, T.A.: Openflow: a radical new idea in networking. Commun. ACM 55(8), 42–47 (2012)
9. Haleplidis, E.: Overview of RFC7426: SDN layers and architecture terminology. IEEE Softwareization (2017)
10. Haleplidis, E., Salim, J.H., Halpern, J.M., et al.: Network programmability with ForCES. IEEE Commun. Surv. Tutor. 17(3), 1423–1440 (2015)
11. OpenFlow Switch Consortium: OpenFlow switch specification version 1.0.0 (2009). http://www.openflowswitch.org/documents/openflow-spec-v1.0.0.pdf. Accessed 13 Mar 2022
12. Vörös, P., Kiss, A.: Security middleware programming using P4. In: Tryfonas, T. (ed.) HAS 2016. LNCS, vol. 9750, pp. 277–287. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-39381-0_25
13. POX controller. https://noxrepo.github.io/pox-doc/html. Accessed 13 Mar 2022
14. RYU controller. https://ryu-sdn.org. Accessed 13 Mar 2022
15. OpenDaylight controller. https://www.opendaylight.org. Accessed 13 Mar 2022
16. Floodlight controller. https://floodlight.atlassian.net/wiki/spaces/floodlightcontroller/overview. Accessed 13 Mar 2022
17. ONOS controller. https://opennetworking.org/onos. Accessed 13 Mar 2022
18. Shin, M.K., Nam, K.H., Kim, H.J.: Software-defined networking (SDN): a reference architecture and open APIs. In: 2012 International Conference on ICT Convergence (ICTC), pp. 360–361. IEEE (2012)
19. Schönwälder, J., Björklund, M., Shafer, P.: Network configuration management using NETCONF and YANG. IEEE Commun. Mag. 48(9), 166–173 (2010)
20. Dallaglio, M., Sambo, N., Cugini, F., Castoldi, P.: Management of sliceable transponder with NETCONF and YANG. In: 2016 International Conference on Optical Network Design and Modeling (ONDM), pp. 1–6. IEEE (2016)
21. Valenčić, D., Mateljan, V.: Implementation of netconf protocol. In: 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 421–430. IEEE (2019)
22. Čejka, T., Krejčí, R.: Configuration of open vSwitch using OF-CONFIG. In: NOMS 2016–2016 IEEE/IFIP Network Operations and Management Symposium, pp. 883–888. IEEE (2016)
23. Narisetty, R., Dane, L., Malishevskiy, A., et al.: OpenFlow configuration protocol: implementation for the OF management plane. In: 2013 Second GENI Research and Educational Experiment Workshop, pp. 66–67. IEEE (2013)
24. Mahrach, S., El Mir, I., Haqiq, A., Huang, D.: SDN-based SYN flooding defense in cloud. J. Inf. Assur. Secur. 13(1), 30–39 (2018)
25. Haleplidis, E., Joachimpillai, D., Salim, J.H., et al.: ForCES applicability to SDN-enhanced NFV. In: 2014 Third European Workshop on Software Defined Networks, pp. 43–48. IEEE (2014)
26. Tarnaras, G., Haleplidis, E., Denazis, S.: SDN and ForCES based optimal network topology discovery. In: Proceedings of the 2015 1st IEEE Conference on Network Softwarization (NetSoft), pp. 1–6. IEEE (2015)
27. Huang, D., Chowdhary, A., Pisharody, S.: Software-Defined Networking and Security: From Theory to Practice. CRC Press, Boca Raton (2018)
28. Bosshart, P., Daly, D., Gibb, G., et al.: P4: programming protocol-independent packet processors. ACM SIGCOMM Comput. Commun. Rev. 44(3), 87–95 (2014)
29. Mahrach, S., Haqiq, A.: DDoS flooding attack mitigation in software defined networks. Int. J. Adv. Comput. Sci. Appl. 11(1), 693–700 (2020)
30. Hauser, F., Schmidt, M., Häberle, M., Menth, M.: P4-MACsec: dynamic topology monitoring and data layer protection with MACsec in P4-based SDN. IEEE Access 8, 58845–58858 (2020)
Simulating and Modeling the Vaccination of Covid-19 Pandemic Using SIR Model - SVIRD

Nada El Kryech(B), Mohammed Bouhorma, Lotfi El Aachak, and Fatiha Elouaai

Computer Science, Systems and Telecommunication Laboratory (LIST), Faculty of Sciences and Technologies, University Abdelmalek Essaadi Tangier, Tangier, Morocco
[email protected], {mbouhorma,lelaachak,felouaai}@uae.ac.ma
Abstract. Over the past decade, the emergence of new infectious diseases in the world has become a serious problem requiring special attention. These days the COVID-19 epidemic is affecting not only the health sector but also the economy. It is therefore of great importance to build models that appropriately describe and predict the spread of the epidemic in order to improve its control and to adopt appropriate strategies to avoid, or at least mitigate, its spread. Different modeling methods have been proposed to build epidemiological models; among them is the agent-based model, which makes it possible to reproduce the real daily behavior of individuals and was used in our previous article [1]. In the same context, this article simulates the spread of COVID-19 using the stochastic SIR model (Susceptible - Infected - Recovered) and its extension SVIRD (Susceptible - Vaccinated - Infected - Recovered - Death), which takes the vaccination parameter into consideration. Results: for a network of 50 citizens, we ran a combination of simulations over the four parameters of the SVIRD model. The results show that the more connected a population is, the higher vaccination rates need to be to effectively protect it; that the relationship between vaccination and infection rates looks more like an exponential decay; and that infection rates scale linearly with death rates for very low and very high numbers of connections.
Keywords: Covid-19 · Model · Disease · SIR model · Reproduction rate · Transmission rate · Simulation · Infected · Vaccinated
1 Introduction
The COVID-19 epidemic continues to worry the whole world and to impact the economic sector in the first place. In a previous article [1], we showed the importance of modeling and understanding the spread of the disease and studied two essential parameters, namely whether a person is susceptible or infected and the effect of distance and mask wearing in a closed space, using an agent-based model. In this article, we formulate and theoretically analyze a mathematical model of the transmission mechanism of COVID-19 that incorporates the vaccination of individuals, using the SIR model (Susceptible - Infected - Recovered) and its extension SVIRD (Susceptible - Vaccinated - Infected - Recovered - Death), under two hypotheses, a low-vaccination population and a high-vaccination population, in order to reduce the spread of COVID-19.
2 Overview
In the previous article [1], the authors presented the implementation of an agent-based model, parameterized by the contamination ratio, the population size, and behaviors such as mask wearing, to understand the circulation of the infectious disease. In this paper we use a stochastic model, the SIR model and its extension SVIRD, which takes the vaccination parameter into consideration, to understand the effect of the vaccination rate in different scenarios and to analyze the relation between the vaccination rate and the infection rate so as to decrease the spread of Covid-19. In this context, many studies [2] have implemented the SIR model and variations based on it, such as SIRD, SIRV, MSIR, SEIR, etc. [9]. In this paper we propose the SVIRD [11] model to study the effect of varying the vaccination parameter on the transmission of the Covid-19 disease. This work starts with a theoretical argumentation using two scenarios, a low-vaccination population and a high-vaccination population, to understand the relation between vaccination and infection rates, and then simulates distinct cases using a simulator tool developed in Angular and based on the SVIRD model.
3 Method and Model Description
The SIR model is a compartmental mathematical model of infectious diseases in which the population is divided into several categories, as follows:
• Susceptible (S): represents healthy individuals.
• Infected (I): represents infected individuals.
• Recovered (R): represents recovered individuals after being infected.
3.1 SIR Model
The SIR model [12] is based on two concepts: compartments and rules. Compartments divide the population into the various possible disease states mentioned above. The rules specify the proportion of individuals moving from one compartment to another. The SIR model can therefore be represented by the diagram in Fig. 1:
Fig. 1. SIR model representation
β represents the transmission rate, i.e., the rate at which healthy individuals become infected, and γ the recovery rate, i.e., the rate at which infected individuals recover. Mathematically, the SIR model is given by the following equation system:

dS(t)/dt = −β S(t) I(t)    (1)
dI(t)/dt = β S(t) I(t) − γ I(t)    (2)
dR(t)/dt = γ I(t)    (3)
The derivatives d/dt make it possible to know whether the functions S(t), I(t), and R(t) are increasing or decreasing, and thus to describe their evolution over time. The term S(t)I(t) represents the number of contacts between healthy individuals and infected individuals; β being the transmission rate, there are therefore βS(t)I(t) newly infected individuals. These are subtracted from the healthy individuals (1) and added to the infected individuals (2). Similarly, among the infected people, some will recover: γ being the recovery rate, there are γI(t) newly cured people, who are removed from the infected people (2) and added to the removed or recovered people (3). At the level of the SIR model we also define an important parameter, the reproduction rate R0 [12]. The reproduction rate is the average number of secondary cases produced by an infectious individual during its period of infection. Here we take R0 = 3, which means that each contaminated person will contaminate three others (Fig. 2).
Fig. 2. Diagram to illustrate the reproduction rate R0
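As a numerical illustration of Eqs. (1)-(3), the short Python sketch below integrates the system with SciPy. It is not the simulator developed in this work; the values of β, γ, and the initial state are examples only, chosen so that R0 = β/γ = 3 as above.

import numpy as np
from scipy.integrate import odeint

def sir_rhs(y, t, beta, gamma):
    # Right-hand side of Eqs. (1)-(3), with y = (S, I, R) as population fractions.
    S, I, R = y
    dS = -beta * S * I
    dI = beta * S * I - gamma * I
    dR = gamma * I
    return dS, dI, dR

gamma = 0.1                  # illustrative recovery rate (per day)
beta = 3 * gamma             # so that R0 = beta / gamma = 3
y0 = (42 / 50, 8 / 50, 0.0)  # example: 42 susceptible and 8 infected out of 50
t = np.linspace(0, 60, 61)   # days

S, I, R = odeint(sir_rhs, y0, t, args=(beta, gamma)).T
print(f"peak infected fraction: {I.max():.2f} on day {t[I.argmax()]:.0f}")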
A vaccination policy aiming to vaccinate 100% of the population is almost impossible; for that reason it is necessary to find the right balance and vaccinate a part of the population large enough to slow down and then stop the epidemic. This is what the SVIRD extension of the SIR model makes possible.
3.2 SVIRD Extension of the SIR Model
The SVIRD model is an extension of the SIR model that has five compartments (Susceptible - Vaccinated - Infected - Recovered - Death); we add two new compartments, as follows:
• Vaccinated (V): represents vaccinated individuals.
• Death (D): represents individuals who died after being infected.
The SVIRD model can therefore be represented by the diagram in Fig. 3:
Fig. 3. SVIRD model representation
In the SVIRD model we add two new rates: α, which represents the death rate, and v, the vaccination rate, i.e., the rate of vaccination against the Covid-19 virus. Mathematically, the SVIRD model is given by the following equation system:

dS(t)/dt = −β S(t) I(t) + v R(t)    (4)
dI(t)/dt = β S(t) I(t) − γ I(t)    (5)
dR(t)/dt = γ I(t) − v R(t)    (6)
dD(t)/dt = α I(t)    (7)
The term vR(t) represents the number of individuals who have been vaccinated and who move directly from the susceptible to the recovered population: in Eq. (4) we add the term vR(t) to the susceptible population, and we subtract it in Eq. (6) so as to count only the population that recovered after being infected and has immunity against Covid-19. Finally, αI(t) in Eq. (7) is the dead population.
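For completeness, the sketch below extends the previous SIR code with vaccination and death. It follows the verbal description of this section (susceptible individuals are vaccinated at rate v and become immune, infected individuals die at rate α); this is one common reading of an SVIRD model and not necessarily the exact form of Eqs. (4)-(7), whose vaccination term is written as vR(t). All parameter values are illustrative.

import numpy as np
from scipy.integrate import odeint

def svird_rhs(y, t, beta, gamma, v, alpha):
    # One possible SVIRD reading: S -> V by vaccination (assumed fully immune),
    # S -> I by contact, I -> R by recovery, I -> D by death.
    S, V, I, R, D = y
    dS = -beta * S * I - v * S
    dV = v * S
    dI = beta * S * I - gamma * I - alpha * I
    dR = gamma * I
    dD = alpha * I
    return dS, dV, dI, dR, dD

beta, gamma, v, alpha = 0.3, 0.1, 0.05, 0.02        # illustrative rates
y0 = np.array([36, 6, 8, 0, 0]) / 50                # 36 S, 6 V, 8 I out of 50
t = np.linspace(0, 60, 61)
S, V, I, R, D = odeint(svird_rhs, y0, t, args=(beta, gamma, v, alpha)).T
print(f"final recovered fraction: {R[-1]:.2f}, final dead fraction: {D[-1]:.3f}")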
4 Sample and Analysis
After understanding the SIR model, we will analyze and assess the effects of the highly contagious and deadly Covid-19 epidemic in a small population with different levels of vaccination.
4.1 Low Vaccinated Population

We consider the following table:

Table 1. The starting matrix at t = 0

Number of individuals | Susceptible | Vaccinated | Infected | Recovered | Death
50                    | 36          | 6          | 8        | 0         | 0
We consider a network of individuals with a low vaccinated population (12%), as given in Table 1. The network can be represented by the diagram in Fig. 4, where the blue nodes represent the susceptible population, the red nodes the vaccinated population, the yellow nodes the population infected with Covid-19, the nodes that turn green the recovered individuals, and the gray nodes the dead ones. At time t = 0 we have the following network:
Fig. 4. Representation of 50 individuals at t = 0 with low vaccinated population
After launching the simulation, some of the nodes turn gray and green; at t = 5 we obtain the following diagram and network:
Fig. 5. Representation of 50 individuals at t = 5 with low vaccinated population
We explain the above network representation by the following diagram (Fig. 6):
Fig. 6. Diagram of 50 individuals at t = 5 with low vaccinated population
As the diagram shows, at t = 5 we observe a high number of contaminations and deaths. Increasing the time to t = 17, we obtain the following result:
Fig. 7. Representation of 50 individuals at t = 17 with low vaccinated population
In parallel, the transformation of the nodes is explained by the following diagram (Fig. 8):
Fig. 8. Diagram of 50 individuals at t = 17 with low vaccinated population
At t = 17, we observe a high number of recovered people and a decrease in the infection and mortality rates. In the network of nodes shown in Figs. 4, 5 and 7, we can observe how the Covid-19 epidemic spreads rapidly through almost the entire unvaccinated population. Additionally, we can track the number of individuals in a particular state in the histogram at the bottom of each network. As the epidemic spreads unhindered, most individuals die or recover and therefore acquire immunity. Individuals who die are obviously no longer part of the network, so their connections to other individuals are removed. Following the mathematical approach of the SIR model, we obtain the following information at the end of the simulation: the infection and mortality rates are very high in this population. The disease reached a high percentage of all individuals, causing the death of a large part of them; among the unvaccinated population these rates are even higher, with over 90% infected and over 40% dead. The Covid-19 disease has spread through our network unhindered. Subsequently, we will see what happens if a large part of the population is vaccinated, knowing that R0 = 0.8333.
4.2 High Vaccinated Population

We consider the following table:

Table 2. The starting matrix at t = 0

Number of individuals | Susceptible | Vaccinated | Infected | Recovered | Death
50                    | 16          | 30         | 4        | 0         | 0
As shown in Table 2, we now consider a network of individuals with a high vaccinated population (65%); the network is as follows (Fig. 9):
Fig. 9. Representation of 50 individuals at t = 0 with High vaccinated population
By increasing the time to t = 5, we obtain the following results (Fig. 10):
Fig. 10. Representation of 50 individuals at t = 5 with High vaccinated population
Fig. 11. Diagram of 50 individuals at t = 5 with High vaccinated population
With a high number of vaccinated individuals, we observe a decrease in the contamination rate, as shown in the diagram (Fig. 11). By increasing the time to t = 17, we obtain the following result (Figs. 12 and 13):
Fig. 12. Representation of 50 individuals at t = 17 with High vaccinated population
Fig. 13. Diagram of 50 individuals at t = 17 with High vaccinated population
We can still see the disease spreading among unvaccinated people, but we can also observe how vaccinated people stop the spread. If an infected individual is connected to a majority of vaccinated individuals, the likelihood of the disease spreading is greatly reduced. Unlike in the poorly vaccinated population, the disease stops spreading not because too many individuals have died, but rather because it runs out of steam quickly, so that a majority of the initial population, susceptible but in good health, remains completely spared. This is called herd immunity. Here the reproduction rate R0 is 0.6666667. The decrease in the infection and mortality rates for the entire population is partly explained by the fact that a smaller fraction of the population was susceptible to the disease in the first place, but it is thanks to herd immunity, together with the large vaccinated population, that we manage to limit the spread of the virus; in conclusion, the vaccine plays a crucial role.
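As a side note that is not derived in this paper, a standard back-of-the-envelope relation links the reproduction rate to the vaccination coverage needed for herd immunity: the critical fraction to vaccinate is

p_c = 1 − 1/R0

so that, for example, with R0 = 3 (the value used in Sect. 3.1) one gets p_c = 1 − 1/3 ≈ 0.67, i.e., roughly two thirds of the population, which is consistent with the observation above that only a highly vaccinated network stops the spread.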
5 Simulator Description
5.1 Simulation Parameters
This solution runs simulations for the possible scenarios according to the following inputs:
– Number of citizens: number of persons in the environment to be simulated.
– Parameter Beta: represents the transmission rate.
– Parameter Gamma: represents the recovery rate.
– Initial infection: number of initially infected people at t = 0.
– Vaccination rate: number of initially vaccinated people.
– R0: the reproduction rate, which is the disease spread parameter.
5.2 Simulation Process
When running a given simulation, the population is chosen by country name; in our example we choose Morocco, with about 35 million citizens according to [9], and we then set up the simulation parameters as follows (Table 3):
Table 3. Initial simulation parameters

Number of individuals | Beta | Gamma | Infected | Recovered | Death
35276786              | 0.25 | 0.083 | 2000     | 0         | 0
We set up the initial parameters above as shown in Figs. 14 and 15:
Fig. 14. Screenshot of setting parameters in the simulator
The simulation is designed to imitate the spread of Covid-19 using the SIR and SVIRD models in order to analyze the effect of high and low vaccination rates on controlling the virus in an entire country: the tool contains real data on the citizens of each country, and several simulations can be run. The simulation takes into account the mortality ratio of each country and then produces a diagram that describes the Covid-19 spread according to the given input parameters. The tool also allows simulations with other countries, such as Germany and France, and describes the underlying SIR model used for the analysis. A short sketch of how these parameters would drive the model numerically is given below.
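For illustration only, the Table 3 inputs can be reproduced outside the Angular tool with a few lines of Python; this fragment merely mimics the simulator's inputs and is not its implementation.

import numpy as np
from scipy.integrate import odeint

def sir_rhs(y, t, beta, gamma):
    S, I, R = y
    return -beta * S * I, beta * S * I - gamma * I, gamma * I

N = 35276786                            # Morocco's population used in Table 3
beta, gamma = 0.25, 0.083               # Table 3 values, i.e. R0 = beta/gamma ≈ 3
y0 = ((N - 2000) / N, 2000 / N, 0.0)    # 2000 initially infected
t = np.linspace(0, 365, 366)            # one year, in days
S, I, R = odeint(sir_rhs, y0, t, args=(beta, gamma)).T
print(f"peak infected fraction: {I.max():.2%}")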
Fig. 15. Screenshot of the view projected by the simulator after execution.
6 Conclusion
By varying the number of connections, we can draw the following conclusions:
• More connections in the network lead to significantly higher infection and death rates.
• Infection rates scale linearly with death rates for very low and very high numbers of connections.
• The relationship between vaccination and infection rates looks more like an exponential decay.
• The more a population is connected, the higher the vaccination rates must be to effectively protect the population.
We illustrate these conclusions in the two following figures, which relate the reproduction rate R0, the infection (transmission) rate, and the vaccination rate (Figs. 16 and 17).
Fig. 16. 3D diagram of R0, vaccination rate, and the number of connections in the network
Fig. 17. 3D diagram of R0, infection rate, and the number of connections in the network
More importantly, they show how important it is to continue vaccination programs around the world. Acknowledgments. This project is subsidized by the MENFPESRS and the CNRST as part of the program to support scientific and technological research related to “COVID-19” (2020). Also, we acknowledge financial support for this research from CNRST.
References
1. Almechkor, M., El Aachak, L., Elouaai, F., Bouhorma, M.: Development of a simulator to model the spread of coronavirus infection in a closed space. In: Ben Ahmed, M., Rakıp Karaş, İ., Santos, D., Sergeyeva, O., Boudhir, A.A. (eds.) SCA 2020. LNNS, vol. 183, pp. 1220–1230. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-66840-2_93
2. Kopfová, J., Nábělková, P., Rachinskii, D., Rouf, S.: Dynamics of SIR model with vaccination and heterogeneous behavioral response of individuals modeled by the Preisach operator. J. Math. Biol. 83(2), 1–34 (2021). https://doi.org/10.1007/s00285-021-01629-8
3. Weiss, H.: The SIR model and the Foundations of Public Health. MATerials MATemàtics, Publicació electrònica de divulgació del Departament de Matemàtiques de la Universitat Autònoma de Barcelona (2013)
4. Kuniya, T., Wang, J., Inaba, H.: A multi-group SIR epidemic model with age structure. Discrete Continuous Dyn. Syst. 21, 3515 (2016)
5. Yorozu, Y., Hirano, M., Oka, K., Tagawa, Y.: Electron spectroscopy studies on magneto-optical media and plastic substrate interface. IEEE Transl. J. Magn. Jpn. 2, 740–741 (1987). [Digests 9th Annual Conf. Magnetics Japan, p. 301, 1982]
6. Zamana, G., Khana, A.: Dynamical aspects of an age-structured SIR endemic model. Comput. Math. Appl. 72, 1690–1702 (2016)
7. Purkayastha, S., et al.: A comparison of five epidemiological models for transmission of SARS-CoV-2 in India. BMC Infectious Disease 21, 1–23 (2021)
8. Angeli, M., Neofotistos, G., Mattheakis, M., Kaxiras, E.: Modeling the effect of the vaccination campaign on the COVID-19 pandemic. Chaos Solitons Fractals 154, 111621 (2022)
9. https://www.worldometers.info/world-population/morocco-population/
10. https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology
11. https://www.medrxiv.org/content/10.1101/2021.06.17.21258837v1.full
12. http://images.math.cnrs.fr/Modelisation-d-une-epidemie-partie-1.html
13. Schuster, M.D., Memarsadeghi, N.: Infection modeling case study: discrete spatial susceptible-infected-recovered model. Comput. Sci. Eng. 23, 83–88 (2021)
14. Amaral, F., Casaca, W., Oishi, C.M., Cuminato, J.A.: Simulating immunization campaigns and vaccine protection against COVID-19 pandemic in Brazil. IEEE Access 9, 126011–126022 (2021)
15. https://www.medrxiv.org/content/10.1101/2021.02.08.21251383v1.full-text
The New Generation of Contact Tracing Solution: The Case of Morocco

Badr-Eddine Soussi Niaimi(B), Lotfi Elaachak, Hassan Zili, and Mohammed Bouhorma

Faculty of Sciences and Techniques of Tangier, LIST Lab, Abdelmalek Essaâdi University, Tangier, Morocco
{bsoussiniaimi,lelaachak,mbouhorma}@uae.ac.ma
Abstract. Nowadays, the entire world is struggling to adapt to and survive the global pandemic, and most countries have had a hard time keeping up with the new mutations of COVID-19. Taking preventive measures to control the spread of the virus, including lockdowns, curfews, social distancing, masks, and vaccines, is not enough to stop it; using new technologies to adapt the prevention measures and enhance the existing ones is more efficient. Most countries developed their non-pharmaceutical intervention (NPI) measures, mainly contact tracing solutions, at the beginning of the pandemic. Using those mobile applications, the authorities were able to reduce the spread of the virus. Nevertheless, the virus is evolving, mutating, and becoming more and more dangerous, and these mobile applications have become less effective in facing the constant changes of the pandemic situation. To that end, the need for enhancing and evolving contact tracing has become more urgent; the goal is to control the spread of the new variants and keep up with the rapid changes happening around the world. In this paper, we present a detailed view of the new solution built to take contact tracing to a new level, empowered by Bluetooth Low Energy technology for communication, advanced encryption for security and data privacy, and secure storage and data management, in order to have a system capable of slowing the spread of the COVID-19 variants and saving lives.
Keywords: Nonpharmaceutical interventions · Contact tracing · COVID-19 · BLE · Data-privacy · Secure storage
1 Introduction
We are now living a new phase of the crisis: as COVID-19 sadly extends its presence all over the world, the virus continues to take lives and spreads faster than ever [17]. We are fighting an invisible enemy, a microscopic creature that can be transmitted when people breathe, or when aerosols or droplets containing the virus come into direct contact with, or are inhaled through, the nose, eyes, or mouth [3]. As a result, the virus can be transmitted from one person to another without being noticed; from that perspective, the use of a new method of contact tracing as a non-pharmaceutical intervention (NPI)
to suppress the epidemic is the most efficient way to keep up with the rapid changes of the pandemic situation [18, 21]. The new contact tracing solution can notify users if there is a risk of exposure. Moreover, the system determines the threat level based on the degree of exposure, and the mobile application provides instructions to follow in order to stay safe and protect others from getting infected. All of that is done without exposing any personal information about the user or the contacts gathered by the mobile application [19]. The application also gives insight into the current pandemic situation and key figures about vaccination coverage and COVID-19 cases. The new contact tracing solution automates the process of tracking people's interactions and reveals any potential infection at the earliest possible stage. The purpose of this study is to evolve the current solution, build a better and more attractive contact tracing application for the users, and enhance the contact tracing results. Furthermore, the ecosystem offers its users advanced features and functionalities to help them in their daily life during the pandemic.
2 State of the Art
After researching contact tracing solutions on the internet, we came across many solutions in many countries, including Health Code in China [11] and the mobile contact tracing applications released in Germany [12], Australia [13], the United Kingdom [14], Singapore [15], and Israel [16]. In Morocco, we have Wiqaytna, which was built with the cooperation of the health ministry, the interior ministry, the digital development agency (ADD), and the national telecommunications regulatory agency (ANRT) [4]. Wiqaytna is a contact tracing application that uses Bluetooth to log the users' encounters and store them in the users' local storage; it is designed so that these data are sent manually to the authorities when the user is infected. The key limitation is that most users are not aware of their degree of exposure at an early stage. Moreover, the OMICRON variant breakout proved that keeping track using this manual implementation is inefficient [20]. Also, the users' commitment to keeping the application up and running is a key factor for better contact tracing. As COVID-19 keeps mutating and evolving, the authorities are forced to take new measures, such as sanitary passports and travel authorizations, on the one hand; on the other hand, informing citizens about their current state or exposure should be integrated as soon as possible, so that users can protect their friends, family, and close ones. Therefore, the Wiqaytna solution needs a major update and enhancement to meet those requirements. Overall, the Wiqaytna solution does not contain critical vulnerabilities; however, some issues should be fixed and improved. First of all, the application does not work properly in the background on iOS (it does not collect IDs when it is not in the foreground [24]). Moreover, the application uses the test UUID of Singapore's TraceTogether for the Bluetooth service. Furthermore, the application also contains a vulnerable dependency on iOS. There is also an interesting contact tracing application called COCOA, developed by the Japanese Ministry of Health, Labour and Welfare on 19 June 2020 [5]; this contact tracing mobile application was used for 2.8% of positive cases [5], and it uses Bluetooth to collect information on close contacts. As a result, the contacts will be alerted if they were
in close contact with an infected user. Unfortunately, the number of downloads was insufficient; consequently, the help provided by the mobile application to combat the coronavirus is considered limited. In Morocco's context, the evolution of the existing solution is mandatory to keep up with the rapid spread of the virus. The new variants of COVID-19 are defying all odds; vaccines are helping people to have less severe symptoms, but the spread is the real issue and must be dealt with as soon as possible. From that perspective, the Wiqaytna solution must be replaced with the newly proposed solution for several reasons. First of all, the new solution is fully automated and will not require any human intervention. Furthermore, the security and privacy mechanisms are more advanced. Moreover, the new solution offers many interesting and useful features, which we will shed light on in the results section.
3 Method
When two users come across each other, the solution exchanges an encrypted key (UUID) based on physical proximity. The key is generated from the device's unique identifier (UDID); it does not identify the user personally, but it allows us to retrieve the Firebase Cloud Messaging (FCM) token in order to contact the user in question in real time by sending a push notification. After launching the mobile application, a discovery process begins using Bluetooth Low Energy, alongside the advertising process. When two users discover each other's devices, the exchange process starts: the devices exchange the encrypted keys and save them with the date-time of the event in a local database (NoSQL). The list of saved contact keys is sent to the authorities when the user is declared COVID-19 positive or scans a PCR test with a positive result. The list is sent to the server side of the system in order to alert the users through push notifications. The received notifications allow the mobile application to update the exposure indicator and give the user the instructions to follow to stay safe and stop any possible spread of the virus.
3.1 System Architecture
After the nearby connections are collected using BLE and stored in a local database (NoSQL/ObjectBox), the user is asked to send the list to the authorities in case of a positive COVID-19 test. The mobile application communicates with the server using REST API calls, sending JSON data over the HTTPS protocol; the API is secured with a token-based authentication system. Once the API retrieves the Firebase Cloud Messaging (FCM) tokens using the encrypted keys from the API database, notifications can be sent to those users through the FCM service, which is secured with the server key generated by Firebase. The server side is based on the Docker platform for containerization, in order to have seamless scalability, better performance, and agility. The mobile application database is also used to store the user's wallet. This part of the database is isolated, so the users' QR codes (sanitary pass, PCR, authorization …) are secured. This feature has been added to the system in order to motivate the users
to install the mobile application. Therefore, the user’s data are secured and will not be shared with the server-side (Fig. 1).
Fig. 1. Overview of the system architecture.
3.2 The Proposed Algorithms
The system contains two main parts: the mobile part, which is in charge of collecting nearby connections using BLE and storing the collected keys in a local NoSQL database, and the server part, which generates the encrypted keys, keeps track of the FCM tokens, and processes the list of contacts sent by the users.
Algorithm 1: The Discovery Process
The first algorithm represents the discovery process: the mobile application launches browsing and advertising every 5 s; if a device is found, the key exchange process is triggered, using the keys generated by the API based on the device's unique identifier.
Function BLE Browsing and advertising:
    For each interval of 5 seconds:
        Nearby service starts advertising for peers.
        Wait for 200 microseconds.
        Nearby service starts browsing for peers.
        If a device is found then:
            Send the encrypted key.
            Wait for the other device's encrypted key.
            If received the encrypted key:
                Save the data to the local database.
            Else:
                Retry connection.
            End If
        End If
    End of for each
End of function
Algorithm 2: Notifying the User of Exposure
The second algorithm is executed on the server side in real time when the user sends the contacts list. The contacts list contains the encrypted keys and the date-time of each contact (a JSON list); the data are parsed and then matched with the database records to retrieve the FCM tokens. The tokens are used to send push notifications to the users in order to alert them of their exposure and update their exposure indicator.

Function Contact alerting:
    For each contact in the List:
        If the date is between today and (today – 14) days then:
            Get the FCM token from the database by the encrypted key
            Send push notification with the encrypted key.
        Else:
            Continue.
        End If
    End of for each
End of function
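A compact Python sketch of this server-side matching is given below. The helper names (key_to_token, send_push) are hypothetical placeholders for the database lookup and the FCM call; they are not part of the deployed system described in this paper.

from datetime import datetime, timedelta

def alert_contacts(contact_list, key_to_token, send_push):
    # contact_list : [{"key": "<encrypted key>", "date": "2022-01-15T10:30:00"}, ...]
    # key_to_token : mapping from encrypted key to FCM token (a database in the real system)
    # send_push    : placeholder wrapping the FCM push-notification call
    now = datetime.utcnow()
    window_start = now - timedelta(days=14)
    for contact in contact_list:
        contact_date = datetime.fromisoformat(contact["date"])
        if window_start <= contact_date <= now:
            token = key_to_token.get(contact["key"])
            if token:
                send_push(token, {"type": "exposure", "key": contact["key"]})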
3.3 Assessment of Concerns for General Contact Tracing Applications
Contact tracing applications have been seen as a new way to combat COVID-19 and stop it from spreading, as the global pandemic has affected over 200 countries
shortly after the first outbreak in December 2019. Manual contact tracing using an army of detectives has been proven ineffective in many countries [1, 2, 7–9]; therefore, automated contact tracing is the alternative. Nevertheless, potential violations of privacy remain the first and main concern [6], because of the mass-scale installation of the first versions, which were developed rapidly to replace the manual tracing method; in addition, insufficient knowledge of how to use the mobile application stopped many users from using it. Therefore, we have developed a new solution to solve these problems: the new mobile application does not require any user intervention or personal information to work. In addition, the mobile application contains many features that can be used in daily life (a QR code wallet) and keeps users updated with the latest COVID-19 statistics. In the following section, we take a closer look at how the security measures were implemented to stop any potential violation of privacy or data leak.
3.4 Privacy Policy and Data Confidentiality
In this section we shed light on the most important part of any system, namely security and data confidentiality. All the communications between the mobile application and the server side are encrypted with HTTPS (HTTP over Secure Sockets Layer (SSL)/Transport Layer Security (TLS)). Moreover, the users can disable the contact tracing feature whenever they want. The data are stored in a local database and are not shared with the server side without the user's permission. The collected data contain only the encrypted keys, without any personal information that could be used to identify the users or be traced back to its source. When the user opens the mobile application, the server side sends a universally unique identifier (UUID) that will be used for every interaction with other users through BLE; therefore, the users do not have to register or provide any personal information. The users exchange the encrypted keys provided by the server side through BLE, and those keys are stored in a secure local NoSQL database (ObjectBox) alongside the QR codes. The QR codes scanned by the users are stored in the same local database but in a different document, and are not shared with any third party. Only the PCR test's date is used to identify the date of contamination in order to filter the contacts list with high accuracy (Fig. 2).
Fig. 2. The process of new registration.
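The exact key-derivation scheme is not detailed in this paper; the sketch below only illustrates the idea of an opaque, UUID-shaped key derived server-side from the UDID, so that the key itself reveals nothing about the user while the server can still map it back to an FCM token. The secret value and function name are assumptions.

import hashlib
import hmac
import uuid

SERVER_SECRET = b"change-me"   # hypothetical server-side secret, never sent to devices

def anonymous_key(udid: str) -> str:
    # HMAC the device identifier and shape the digest as a UUID string.
    digest = hmac.new(SERVER_SECRET, udid.encode(), hashlib.sha256).digest()
    return str(uuid.UUID(bytes=digest[:16]))

# Example: the same UDID always yields the same opaque key.
print(anonymous_key("00000000-AAAA-BBBB-CCCC-DDDDDDDDDDDD"))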
We display in the mobile application some detailed statistics on COVID-19 to keep the users informed of the current pandemic situation (Fig. 5). To respect the privacy of the users, all personal information can be deleted from the local database (Fig. 6). For the backend part, the principal role is the storage of the combinations of UUID, UDID, and FCM tokens (Fig. 7), as well as the statistics of the application installs (Fig. 8), and eventually, sending notifications in real-time to users in case of exposure with the goal to alert the user and update the exposure indicator (Fig. 9).
Fig. 3. The user’s Qr-codes wallet.
The Dashboard contains insights based on the data collected from the mobile applications. The data cannot be traced back to any user and does not contain any personal information.
Fig. 4. The home page of the mobile application.
Fig. 5. The COVID-19 statistics page.
Fig. 6. Mobile application settings.
Fig. 7. The registered users list from the backend side.
Fig. 8. The Back-End’s dashboard.
Fig. 9. The system alerts after receiving a notification.
4 Discussion of Results
The system has not been deployed for public usage yet; however, the API and the back office are deployed in a virtual private cloud (VPC) for testing purposes. In order to test and evaluate the performance and the capabilities of our solution, we have built an ecosystem configured with the following parameters:
Server configuration:
– CPU cores: 1 CPU
– Memory: 1.5 GB
– Bandwidth: IN: 8.51 GB, OUT: 2.65 GB
– Operating system: Ubuntu 20.04
Emulators:
– Count: 13 502 mobile devices
– Operating systems: iOS 14.5, Android (10 and 12)
During the development of the new contact tracing ecosystem, we considered the data privacy and data confidentiality concerns raised by the previous solution [22, 23]. Moreover, the use of BLE minimizes battery consumption on the mobile side, allowing the mobile application to run for hours without draining the device's battery. Furthermore, the tests have confirmed that neither security risks nor technical issues are present: the communication between the devices and the server is secure and stable even when the server is overloaded. In the future, the system can be enhanced by the integration of a machine learning layer, whose role will be to analyze user behavior with the goal of improving the user experience.
5 Conclusion
Among all the NPIs to combat COVID-19, contact tracing remains the least intrusive measure in the given circumstances. In this paper, we have given an overview of how we built a safe, easy-to-use, and efficient contact tracing system. We have used the latest technologies in order to obtain the most accurate results possible, in addition to securing every interaction and data transmission. With the goal of motivating citizens to install the new mobile application, we have added many pertinent features, and we have integrated features that make the system flexible and capable of adapting to any future measures taken by the authorities. The new system was built on the successes and failures of the previous solutions. Moreover, we have enhanced the security mechanisms to earn the users' trust.
Acknowledgments. This project is subsidized by the MENFPESRS and the CNRST as part of the program to support scientific and technological research related to "COVID-19" (2020). Also, we acknowledge the financial support for this research from CNRST.
References
1. Ortega-García, J.A., Ruiz-Marín, M., Cárceles-Álvarez, A., Campillo, I., López, F., Claudio, L.: Social distancing at health care centers early in the pandemic helps to protect population from COVID-19. Environ. Res. 189, 109957 (2020). https://doi.org/10.1016/j.envres.2020.109957
2. Bergwerk, M., Gonen, T., Lustig, Y., et al.: Covid-19 breakthrough infections in vaccinated health care workers. N. Engl. J. Med. 385, 1474–1484 (2021)
3. Jones, R.M.: Relative contributions of transmission routes for COVID-19 among healthcare personnel providing patient care. J. Occup. Environ. Hyg. 17, 408–415 (2020). https://doi.org/10.1080/15459624.2020.1784427
4. Launch of "Wiqaytna", Mobile Application for Notification of Exposure to Coronavirus Responsible for the "COVID-19" Disease. https://www.add.gov.ma/lancement-de-wiqaytnaapplication-mobile-de-notification-dexposition-au-coronavirus-responsable-de-la-maladiecovid-19. Accessed 01 Jan 2022
5. Ministry of Health Labour and Welfare: COVID-19 Contact-Confirming Application (2021). https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/cocoa_00138.html. Accessed 15 Jan 2022
6. Leins, K., Culnane, C., Rubinstein, B.I.: Tracking, tracing, trust: contemplating mitigating the impact of COVID-19 through technological interventions. Med. J. Australia 1 (2020)
7. Chappell, B.: Coronavirus: Sacramento county gives up on automatic 14-day quarantines (2020). https://www.npr.org/sections/health-shots/2020/03/10/813990993/coronavirussacramento-county-gives-up-on-automatic-14-day-quarantines. Accessed 03 Jan 2022
8. Gould, T., Mele, G., Rajendran, P., Gould, R.: Anomali threat research identifies fake COVID-19 contact tracing apps used to download malware that monitors devices, steals personal data. https://www.anomali.com/blog/anomali-threat-research-identifies-fake-covid19-contact-tracing-apps-used-to-monitor-devices-steal-personal-data. Accessed 05 Jan 2022
9. Latif, S., et al.: Leveraging data science to combat COVID-19: a comprehensive review. IEEE Trans. Artif. Intell. 1, 85–103 (2020)
10. Dierks, T., Rescorla, E.: Transport Layer Security Protocol. Network Working Group, RFC 5246 (2008). http://tools.ietf.org/html/rfc5246. Accessed 02 Jan 2022
11. Paul Mozur, R.Z., Krolik, A.: In Coronavirus fight, China gives citizens a color code, with red flags (2020). https://www.nytimes.com/2020/03/01/business/china-coronavirus-surveillance.html. Accessed 07 Jan 2022
12. Koch-Institut, R.: Corona Warn App (2020). https://www.coronawarn.app/en/. Accessed 02 Jan 2022
13. A. Department of Health: COVIDSafe. https://www.health.gov.au/resources/apps-andtools/covidsafe-app. Accessed 03 Jan 2022
14. Department of Health and Social Care: Next phase of NHS coronavirus (COVID-19) app announced (2020). https://www.gov.uk/government/news/next-phase-of-nhs-coronavirus-covid-19-app-announced. Accessed 17 Jan 2022
15. Bay, J., et al.: BlueTrace: a privacy-preserving protocol for community-driven contact tracing across borders. Government Technology Agency-Singapore, Technical report (2020)
16. Ministry of Health: Hamagen (2020). https://govextra.gov.il/ministry-of-health/hamagenapp/. Accessed 13 Jan 2022
17. Kwok, K.O., Tang, A., Wei, V.W.I., Park, W.H., Yeoh, E.K., Riley, S.: Epidemic models of contact tracing: systematic review of transmission studies of severe acute respiratory syndrome and middle east respiratory syndrome. Comput. Struct. Biotechnol. J. (2019). https://doi.org/10.1016/j.csbj.2019.01.003
18. Sun, K., Viboud, C.: Impact of contact tracing on SARS-CoV-2 transmission. Lancet Infect Dis. (2020). https://doi.org/10.1016/S1473-3099(20)30357-1
19. Rowe, F.: Contact tracing apps and values dilemmas: a privacy paradox in a neo-liberal world. Int. J. Inf. Manag. (2020). https://doi.org/10.1016/j.ijinfomgt.2020.102178
20. Centers for Disease Control and Prevention: About variants of the virus that causes COVID-19 (2021). What You Need to Know About Variants. https://www.cdc.gov/coronavirus/2019ncov/transmission/variant.html. Accessed 07 Mar 2022
21. Vaughan, A.: The problems with contact-tracing apps. New Sci. (2020). https://doi.org/10.1016/s0262-4079(20)30787-9
22. Kapa, S., Halamka, J., Raskar, R.: Contact tracing to manage COVID-19 spread—balancing personal privacy and public health. Mayo Clin. Proc. (2020). https://doi.org/10.1016/j.mayocp.2020.04.031
23. Parker, M.J., Fraser, C., Abeler-Dörner, L., Bonsall, D.: Ethics of instantaneous contact tracing using mobile phone apps in the control of the COVID-19 pandemic. J. Med. Ethics 46, 427–431 (2020). https://doi.org/10.1136/medethics-2020-106314
24. Mesbahi, A., Mesbahi, A., Lopes, G.: COVID-19 contact tracing app Wiqaytna mobile application security review, 15 June 2020. https://blog.ostorlab.co/covid19-wiqaytna-mobileapplication-review.html. Accessed 07 Mar 2022
The Prediction Stock Market Price Using LSTM

Rhada Barik(B), Amine Baina, and Mostafa Bellafkih

National Institute of Posts and Telecommunications, Rabat, Morocco
{barik.rhada,baina,mbella}@inpt.ac.ma
Abstract. In this paper, we focus on the applicability of recurrent neural networks, particularly Long Short-Term Memory (LSTM) networks, to predicting the NASDAQ and S&P 500 stock market prices. Daily stock exchange rates of the NASDAQ and S&P 500 from January 4, 2010, to January 30, 2020, are used to construct a robust model. By building models with various LSTM configurations, these configurations can be tested and compared. We used two evaluation measures, the coefficient of determination R2 and the Root Mean Squared Error (RMSE), in order to judge the relevance of the results.
Keywords: Time series · Stock market prediction · Long Short Term Memory LSTM
1 Introduction Forecasting the stock market to aid investment decisions has long been a challenge in academia and the asset management industry. Because of the environment's complexity and uncertainty, modeling must account for a variety of elements, including incomplete, noisy, and heterogeneous data, with nearly 80% of it in unstructured form [1, 2]. There are three types of forecasting horizons: short-term forecasting for less than one year (seconds, minutes, days, weeks, or months), medium-term forecasting for one to two years, and long-term forecasting for more than two years. Stock price forecasting techniques can be divided into three categories [3, 4]: • Fundamental analysis is a sort of investment analysis that studies sales, incomes, management control, and a variety of other economic aspects that affect profitability and business, to make long-term forecasts that investors can use to make investment decisions. • Technical analysis is a variety of rules and indicators used to find and explain the regularity of historical price fluctuations. Prices react gradually to new information, according to technical analysts. This approach works well for making short-term predictions.
• Time series forecasting uses a chronological sequence of observations of a selected variable to predict near-future data from its past values. It mainly contains two classes of algorithms: linear models (AR, ARMA, ARIMA) and non-linear models such as ARCH, GARCH, TAR, and deep learning algorithms (LSTM, RNN, ...). In this study, the forecasting method used is time series forecasting, with the LSTM (Long Short-Term Memory) network as the deep learning network of choice. One of the most successful RNN architectures for fixing the vanishing gradient problem in a neural network is the Long Short-Term Memory (LSTM) (see Fig. 1). In the hidden layer of the network, LSTM introduces [5] the memory cell, a computational unit that substitutes typical artificial neurons. Thanks to their memory cells, such networks can correlate memories and inputs that are remote in time, making them well suited to capturing the structure of stock data dynamically through time with strong prediction ability [6, 14].
Fig. 1. Long short-term memory neural network [7]
2 Related Works A large number of studies from numerous disciplines have attempted to resolve this problem, offering a diverse range of solutions. Here we cite some of the most important research on the topic of using LSTM to predict the stock market, including: an LSTM-based method for stock returns prediction on the China stock market (Chen, Zhou, and Dai, 2015) [15], and stock market price movement prediction with LSTM neural networks (Nelson, Pereira, and de Oliveira, 2017) [6]. Table 1 summarizes details of this previous research.
Table 1. A summary of major studies on stock market prediction

Articles | Input features | Used method | Performance measure
Chen et al. 2015 | China stock market | LSTM | Accuracy
Nelson et al. 2017 | Brazilian stock exchange (bova11, bbdc4, CIEL3, ITUB4, PETR) | LSTM | Accuracy, recall, precision, f-measure
Moghaddam et al. 2016 | NASDAQ and S&P500 | ANN | R2
Our study | NASDAQ and S&P500 | LSTM | R2, RMSE
The goal of this paper is to investigate the applicability of LSTM networks to the problem of stock market price prediction, to evaluate their performance in terms of the Root Mean Squared Error (RMSE) [19] and the coefficient of determination R2 [11] using real-world data, and to see if there is any gain in changing various LSTM configurations such as the number of layers, the number of neurons, and the input parameters.
3 Methodology The approach employed in this paper is based on the use of Long Short-Term Memory (LSTM) to predict the close price of stock indexes such as the NASDAQ and S&P 500. In order to improve the model, several LSTM configurations are evaluated and compared. In this section, we discuss the methodology of our system. It is divided into the steps shown in Fig. 2:
Fig. 2. The steps of the proposed approach: Raw data → Data Processing → Model → Train/Test → Results → Evaluation
3.1 Raw Data
The historical stock data for the NASDAQ and S&P500 is obtained from Yahoo's financial section; this historical data is used for the prediction of future stock prices. The whole data set covers the period from January 4, 2010, to January 30, 2020, and is divided into two parts. The first part (January 4, 2010, to December 31, 2019) is used to determine the specification of the models and parameters. The second part, 21 days (from January 1 to January 30, 2020), is reserved for out-of-sample evaluation of performance among the various forecasting models.
3.2 Data Processing
Historic price data for particular stocks are gathered in the format of a time series of candles (open, close, high, low, and volume). To prepare the dataset, a pre-processing of the collected data is carried out. We began by eliminating the invalid samples, then transforming the raw data before entering it into training using normalization, one of the data representation methods. It is based on the MinMaxScaler [17] of the sklearn library [8], presented in Eq. (1), which maps each sample value Dataj to a new value in the range 0 to 1:

Dataj_new = (Dataj − min(Data)) / (max(Data) − min(Data))   (1)

This scaling is performed because the LSTM model works best on values in the 0–1 range.
3.3 Model
Our network model is composed of a sequential input layer followed by two LSTM layers, a dense layer, and finally a dense output layer with a linear activation function (see Fig. 3).
Fig. 3. The architecture of our neural network model: Input Layer → LSTM (1) → LSTM (2) → Dense Layer → Output Layer
The following code of the model implemented using Keras [9] is:
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Units follow the best configuration retained later in the paper (two LSTM layers of
# 50 units and a dense layer of 25 units); the input window length and feature count
# below are placeholders, not values specified in the paper.
timesteps, n_features = 60, 5
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(timesteps, n_features)))
model.add(LSTM(units=50, return_sequences=False))
model.add(Dense(units=25))
model.add(Dense(units=1))
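The compilation and training calls corresponding to the settings reported in Sect. 3.4 could then look as follows; this is a hedged sketch, and x_train/y_train stand for the scaled input windows and target close prices rather than variables defined in the paper:

model.compile(optimizer="adam", loss="mean_squared_error")  # ADAM optimizer, MSE loss
history = model.fit(
    x_train, y_train,
    epochs=100,            # 100 epochs, as chosen in Sect. 3.4
    batch_size=10,         # batch size 10, as chosen in Sect. 3.4
    validation_split=0.2,  # 80% training / 20% validation
)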
3.4 Training
To train our model, we used the ADAM optimization algorithm [10] and the Mean Squared Error as our loss function. We separated the dataset into two parts: 80% for training and 20% for validation [16]. In this study, the number of epochs was set to 100 and the batch size to 10. For our experiment, we employed various sets of parameters with different numbers of neurons in the two LSTM layers, then selected the best LSTM architecture and trained it with different numbers of neurons in the dense layer. This method of changing the number of neurons in the hidden layers is inspired by the study of Moghaddam et al. (2016) [18].
3.5 Compared Methods
As inputs to the LSTM models, we worked with historical stock data parameters such as volume, high, low, open, and close to predict the close price of the NASDAQ and the S&P 500. M1: LSTM with Open and Close price as learning features; M2: LSTM with learning features (High, Low, Close); M3: M2 with the feature Open; M4: M2 with the feature Volume; M5: M1 with three extra features added (High, Low, Volume).
3.6 Performance Evaluation
We evaluate the efficiency and the performance of the prediction models by using two measures: the root mean squared error RMSE [19] and the coefficient of determination R2 [11].
They are commonly used in forecasting to verify experimental results. They were determined as follows:

RMSE = sqrt( (1/N) * Σ_{n=1}^{N} (y_tar − y_pred)^2 )   (2)

R2 = 1 − Σ (y_tar − y_pred)^2 / Σ (y_tar − ȳ_tar)^2   (3)

where y_tar is the actual exchange rate price on the kth day, y_pred is the predicted exchange rate price on the kth day, ȳ_tar is the mean of the actual prices, and N is the total number of data samples.
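For illustration, a small NumPy sketch of these two measures is given below, assuming y_tar and y_pred are arrays of actual and predicted prices (the function names are ours, not the authors'):

import numpy as np

def rmse(y_tar: np.ndarray, y_pred: np.ndarray) -> float:
    # Eq. (2): square root of the mean of squared errors.
    return float(np.sqrt(np.mean((y_tar - y_pred) ** 2)))

def r2_score(y_tar: np.ndarray, y_pred: np.ndarray) -> float:
    # Eq. (3): one minus the ratio of residual to total sum of squares.
    ss_res = np.sum((y_tar - y_pred) ** 2)
    ss_tot = np.sum((y_tar - np.mean(y_tar)) ** 2)
    return float(1.0 - ss_res / ss_tot)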
4 Results
4.1 Simulation Tools
For the experimental validation, we used Python in JupyterLab. JupyterLab [12] is a web-based interactive development environment for Jupyter notebooks [13], code, and data. JupyterLab is flexible: the user interface can be configured and arranged to support a wide range of workflows in data science, scientific computing, and machine learning [12].
4.2 Experimental Results
In this section, several models for NASDAQ and S&P500 index prediction were developed and validated. Then the optimized model is selected based on its prediction capability. The evaluation of the forecasting performances of the prediction models is based on the two performance measures RMSE and R2. In Tables 2 and 3, the prediction performance metrics obtained on the testing data are shown for each of the NASDAQ and S&P500 stocks, which tells how well each of the prediction models defined in Sect. 3.5 (Compared Methods) performs.

Table 2. Comparative results using different parameters with Nasdaq

Methods | Features | Nasdaq RMSE | R2
M1 | Open, Close | 0.1128 | 0.8614
M2 | High, Low, Close | 0.0771 | 0.9352
M3 | High, Low, Open, Close | 0.0799 | 0.9304
M4 | High, Low, Volume, Close | 0.0731 | 0.9418
M5 | High, Low, Volume, Open, Close | 0.0702 | 0.9463
Table 3. Comparative results using different parameters with S&P500

Methods | Features | S&P500 RMSE | R2
M1 | Open, Close | 0.0867 | 0.8950
M2 | High, Low, Close | 0.0854 | 0.9253
M3 | High, Low, Open, Close | 0.0799 | 0.9304
M4 | High, Low, Volume, Close | 0.0837 | 0.9282
M5 | High, Low, Volume, Open, Close | 0.0788 | 0.9363
Fig. 4. Real and predicted NASDAQ index values for 21 days
Figure 4 shows the real and predicted NASDAQ close prices of the five models for 21 days. We chose to work with the NASDAQ in the rest of the experiments because the models behaved similarly on both indexes, NASDAQ and S&P500, as seen in Tables 2 and 3. The forecasting performances of the models using testing data are shown in Tables 4 and 5.
Table 4. Comparative results using different architectures of neurons in the two LSTM layers

Architecture | Nasdaq RMSE | R2
LSTM(80,80), Dense | 0.0720 | 0.9436
LSTM(70,70), Dense | 0.0747 | 0.9393
LSTM(60,60), Dense | 0.0729 | 0.9421
LSTM(50,50), Dense | 0.0702 | 0.9463
LSTM(40,40), Dense | 0.0761 | 0.9370
LSTM(30,30), Dense | 0.0741 | 0.9401
LSTM(20,20), Dense | 0.0809 | 0.9286
LSTM(10,10), Dense | 0.0876 | 0.9164
Table 5. Comparative results using different numbers of neurons in the dense layer

Architecture | Nasdaq RMSE | R2
LSTM(50,50), Dense 80 | 0.0728 | 0.9422
LSTM(50,50), Dense 70 | 0.0817 | 0.9272
LSTM(50,50), Dense 60 | 0.0739 | 0.9404
LSTM(50,50), Dense 50 | 0.0737 | 0.9407
LSTM(50,50), Dense 40 | 0.0735 | 0.9411
LSTM(50,50), Dense 30 | 0.0717 | 0.9439
LSTM(50,50), Dense 25 | 0.0702 | 0.9463
LSTM(50,50), Dense 20 | 0.0734 | 0.9413
LSTM(50,50), Dense 10 | 0.0726 | 0.9426
5 Discussion
The comparison between the five models presented in Tables 2 and 3 shows that the model M5 with five features (High, Low, Volume, Open, Close) has the best forecasting performance on the two indexes, the NASDAQ and the S&P500, since it provides the lowest RMSE values and the highest R2 values (0.0788 and 0.9363, respectively, for the S&P500). This is similar to one of the findings of the study of Chen et al. (2015) [15], namely the use of five input features in the model, while in our study the model (M5) was optimized by adjusting different values for the number of neurons in the two
LSTM layers and different values for the number of neurons in the dense layer, as shown in Tables 4 and 5. According to Table 4, eight networks with different structures were generated, trained, and tested to select the optimized model by changing the number of neurons in the two LSTM hidden layers. It turns out that RMSE and R2 had the most desirable values when the number of neurons in each of the two LSTM layers was 50, which provides the lowest RMSE value, 0.0702, and the highest coefficient of determination R2, 0.9463. As a result of Table 4, the LSTM(50,50) model was selected for use with different numbers of neurons in the dense layer. In Table 5, different numbers of neurons in the dense layer are used with the LSTM(50,50) selected from Table 4. It appears that a network with 25 neurons in the dense layer had the most desirable values, with an RMSE of 0.0702 and an R2 of 0.9463.
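A hedged sketch of the kind of sweep that could produce the comparisons in Tables 4 and 5 is given below; build_model, the data arrays, and the dense-layer size used during the first sweep are illustrative assumptions rather than the authors' code, and the metric helpers rmse and r2_score are those sketched in Sect. 3.6:

from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_model(lstm_units, dense_units, timesteps, n_features):
    # Two LSTM layers of equal width, one hidden dense layer, one output neuron.
    model = Sequential()
    model.add(LSTM(lstm_units, return_sequences=True, input_shape=(timesteps, n_features)))
    model.add(LSTM(lstm_units, return_sequences=False))
    model.add(Dense(dense_units))
    model.add(Dense(1))
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

def sweep_lstm_units(x_train, y_train, x_test, y_test, dense_units=25):
    """Train one model per LSTM width (as in Table 4) and report RMSE/R2 on the test window."""
    results = {}
    for units in (10, 20, 30, 40, 50, 60, 70, 80):
        model = build_model(units, dense_units, x_train.shape[1], x_train.shape[2])
        model.fit(x_train, y_train, epochs=100, batch_size=10, verbose=0)
        y_hat = model.predict(x_test).ravel()
        results[units] = (rmse(y_test, y_hat), r2_score(y_test, y_hat))
    return results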
6 Conclusion and Future Work
The task addressed in this paper is to predict the close price of two different indexes, the NASDAQ and the S&P500. This paper has shown that an LSTM can predict the close price of these two indexes, and it can be seen from the results that all the models perform similarly on the two datasets, Nasdaq and S&P 500. According to the results obtained in testing, the Root Mean Squared Error value of 0.0702 and the coefficient of determination R2 value of 0.9463 indicate that the LSTM with five features (Open, High, Low, Volume, Close) has the best performance in terms of both R2 and RMSE. This suggests that the more features we have as input, the better the model performs. A network with two LSTM layers of 50 neurons each and a dense layer with 25 neurons is the optimized network, with a validation R2 of 0.9463 and an RMSE of 0.0702. Moreover, when we compare our LSTM model's coefficient of determination R2 with the result obtained by the study of Moghaddam et al. (2016) [18], whose back-propagation neural network (four prior working days, with 20-40-20 neurons in the hidden layers) reached an R2 value of 0.9408, we obtain the highest score with 0.9463. We intend to continue investigating ways to improve the model and its predictions by studying changes in the LSTM network architecture and different approaches for preprocessing the input data, as well as adding new features.
References 1. Vuppala, K.: Text Analytics for Quant Investing. BlackRock (2015) 2. Squirro: Use of unstructured data in financial services, White Paper (2014) 3. Selvin, S., Vinayakumar, R., Gopalakrishnan, E.A., et al.: Stock price prediction using LSTM, RNN and CNN-sliding window model. In: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1643–1647. IEEE (2017)
4. Devadoss, A.V., Ligori, T.A.A.: Forecasting of stock prices using multi layer perceptron. Int. J. Comput. Algorithm 2, 440–449 (2013) 5. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 6. Nelson, D.M.Q., Pereira, A.C.M., De Oliveira, R.A.: Stock market’s price movement prediction with LSTM neural networks. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 1419–1426. IEEE (2017) 7. Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069 (2015) 8. Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 9. Chollet, F. (2016). Keras. https://github.com/fchollet/keras 10. Kingma, D.P., Jimmy, B.A.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 11. Nagelkerke, N.J.D., et al.: A note on a general definition of the coefficient of determination. Biometrika 78(3), 691–692 (1991) 12. Granger, B., Grout, J.: JupyterLab: Building Blocks for Interactive Computing. In: Slides of Presentation Made at SciPy 2016 (2016) 13. Kluyver, T., et al.: Jupyter Notebooks—a publishing format for reproducible computational workflows. Positioning and Power in Academic Publishing: Players, Agents and Agendas, p. 87 (2016). https://doi.org/10.3233/978-1-61499-649-1-87 14. OLAH. Christopher. Understanding lstm networks (2015) 15. Chen, K., Zhou, Y., Dai, F.: A LSTM-based method for stock returns prediction: a case study of China stock market. In: 2015 IEEE International Conference on Big Data (Big Data), pp. 2823–2824. IEEE (2015) 16. Chong, E., Han, C., Park, F.C.: Deep learning networks for stock market analysis and prediction: methodology, data representations, and case studies. Expert Syst. Appl. 83, 187–205 (2017) 17. Scikit-learn. Min Max Scaler. https://scikit-learn.org/stable/modules/generated/sklearn.pre processing.MinMaxScaler.html 18. Moghaddam, A.H., Moghaddam, M.H., Esfandyari, M.: Stock market index prediction using artificial neural network. J. Econ. Fin. Adm. Sci. 21(41), 89–93 (2016) 19. Willmott, C.J., Matsuura, K.: Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 30(1), 79–82 (2005)
Hybrid Movie Recommender System Based on Word Embeddings Amina Samih(B) , Abderrahim Ghadi, and Abdelhadi Fennan Faculty of Sciences and Techniques, List Laboratory University Abdelmalek Essaadi, Tangier, Morocco [email protected], {aghadi,afennan}@uae.ac.ma
Abstract. A recommender system is an application intended to offer a user items that may be of interest to him or her according to his or her profile; recommendations have been applied successfully in various fields. Recommended items include movies, books, travel and tourism services, friends, research articles, research queries, and much more. Hence the presence of recommender systems in many areas, in particular movie recommendation. The problem of film recommendation has become more interesting because of the rich data and context available online, which has quickly advanced research in this field. Therefore, it is time to overcome traditional recommendation methods (traditional collaborative filtering and content-based filtering), which suffer from many drawbacks such as the cold start problem and data sparsity. In this article, we present a solution for these limitations by proposing a hybrid recommendation framework to improve the quality of online film recommendation services. We use users' ratings and movie features in order to combine two models into the framework, based on the word2vec and KNN algorithms respectively. Keywords: Recommender systems · Hybridization · KNN · Word2vec
1 Introduction With the ease of access to the Internet, we are increasingly exposed to a great deal of information. However, this overload of data can become problematic. In order to remedy this problem, recommendation systems are used. There are mainly three categories of recommendation systems [2]: content-based filtering, collaborative filtering, and hybrid approaches. Recommendations have been applied successfully in a variety of areas; recommendable items include movies, books, travel and tourism services, friends, research articles, research queries, and many more [1]. Hence the presence of recommendation systems in many areas, particularly movie recommendation [3]. Among the first film recommender systems proposed, we cite [4, 5] and their respective approaches "Movielens" and "Each movie". Although they represent a primary version of movie recommender systems, these approaches have inspired a plethora of approaches these days. We cite the Netflix Prize challenge [6] as an example. This popular competition aimed to improve
the accuracy of Netflix's recommender system by ten percent, with a prize of one million dollars at stake. When we access the Netflix service, its recommendation system makes it as easy as possible to find the TV shows or movies we might enjoy. The goal of a movie recommendation system is thus to provide users with relevant movies according to their preferences. It helps to minimize the time users spend searching for films of interest to them and to find films that they are likely to like but might not have paid attention to. The importance of this category of recommendation pushes us to think that it is time to overcome the limitations of traditional recommendation algorithms (collaborative filtering, content-based filtering) by proposing a hybrid solution. In this regard, this paper proposes a hybrid recommendation approach based on two models: the first uses KNN, and the second uses word2vec. The rest of this paper is organized as follows: Sect. 2 briefly explains the related work carried out on content-based and collaborative recommendation systems and their drawbacks. The proposed approach based on KNN and Word2vec for a movie recommender system is explained in Sect. 3. In Sect. 4, the experimental results obtained on the Movielens dataset are described. Finally, we conclude our work and highlight future work in Sect. 5.
2 Related Work This section presents an analysis of recent related work on recommender systems, including recommendations based on content filtering and recommendations based on collaborative filtering.
2.1 Recommendation Based on Content Filtering
Content-based film recommendation relies on three major parts: the content analyser, learning the user profile, and filtering movies [1]. • Content analyser: the preprocessing manager transforms film information from its original format to another, more abstract and helpful format. • User profile: it receives pre-processed movie information from the content analyser and generalizes it to build user preferences. • Filtering: filters relevant movies by matching the representation of the user profile to movies that are candidates for the recommendation. Indeed, this recommendation is based on user profiles using characteristics extracted from the content of elements that the user has consulted in the past. So, the filtering is done independently of other users and their profiles and adapts to changes in each user's profile [7]; a movie can be recommended without having been evaluated. Many movie recommender systems have been proposed using content-based filtering; Eddy et al. built a recommendation framework using movie genre similarity extracted from user history. [8] proposed a new idea: they tried to find the correlation between movie descriptions in order to classify similar ones by using a simple comparator to
achieve comparison and matching of tags and genres of films. In [9], the authors implement a recommendation framework that uses genomic tags to find similarity between films; they use both principal component analysis (PCA) and Pearson correlation to address redundancy and complexity issues and obtain more accurate recommendations. By analyzing the aforementioned approaches, three glaring limitations emerge. • Limited content analysis: these approaches require a detailed description of the items and a very well-organized user profile before the recommendation can be made to users, which is not always the case. • Content over-specialization: users are limited to obtaining recommendations similar to items already defined in their profiles. • Non-immediate onboarding of a new user: a user must evaluate many items before the system can interpret their preferences and provide them with relevant recommendations.
2.2 Recommendation Based on Collaborative Filtering
Let us consider a recommendation system (RS) for films. The RS can face a situation in which we do not know the characteristics of a particular film, but we know how some users have evaluated it. Now, if two users named "Amina" and "Mouna" saw a movie called B, and later Amina watches another movie called D and likes it, we can recommend it to Mouna. This approach comes from the collaborative filtering method [10]. The collaborative movie recommendation system searches for the similarity between users and movies to make predictions. In many cases, the user rating structure helps determine similarity. The data available in this type of system can be collected in an explicit way (stars, likes, dislikes, ...) or in an implicit way (clicks, movies watched, comments, ...) and exploited in several ways. These methods are classified into two prominent families: memory-based algorithms and model-based algorithms [1, 11]. The notion of collaboration has achieved great success in the recommendation area, particularly in film recommendation systems; prevalent online sites such as YouTube and Netflix have used this approach. It is extensively used in the movie area, as it is easily implemented. In [12], the authors implement a film recommendation system that merges the collaborative filtering algorithm with K-means, based on user segmentation with the prediction of ratings to make more accurate recommendations. Based on the same idea, [13] proposed clustering both users and items and then using the clusters in the collaborative filtering approach to determine similarity. Users' feedback plays a vital role in the quality of collaborative recommendation services; in this vein, [14] provides a dynamic collaborative filtering method based on the positive feedback of users in real time, where the RS produces new recommendations at each data update. However, with the use of collaborative filtering in the approaches mentioned earlier, glaring limitations emerge: • Cold start is when a recommendation system does not have sufficient information about a user or item to make relevant predictions.
• Sparsity is a situation where users evaluate only a small fraction of the total number of items available in a database. This leads to a sparse matrix with a high rate of missing values. • Gray sheep problem: users with different, out-of-the-ordinary tastes will not have many similar users.
3 Proposed Approach Inspired by the works cited in the previous section, and aiming to overcome existing limitations in terms of data sparsity and the cold start problem, we propose an approach to implement a hybrid recommendation system.
Fig. 1. An overview of the proposed approach.
3.1 Hybrid Recommendation A hybrid recommendation is a combination of two or more different recommendation techniques. The most popular hybrid approaches combine content-based and collaborative filtering, using both the content of the item and the ratings of all users [15]. Hybrid recommendation systems combine two or more recommendation strategies in different ways to take advantage of their complementary strengths [16]. There are several ways to hybridize, and the research community has not reached a consensus [17]. However, Burke [18] identified seven different ways of hybridizing: 1. Weighted hybridization: the score or prediction obtained by each technique is combined into a single result. 2. Switching hybridization: the system switches between the two recommendation techniques depending on the situation.
3. Mixed hybridization: the recommendations from the two techniques are merged into a single list. 4. Feature combination hybridization: the data from the two techniques are combined and transmitted to a single recommendation algorithm. 5. Feature augmentation hybridization: the result of one technique is used as input to the other technique. 6. Cascade hybridization: a recommendation technique is used to produce a first ranking of candidate items, and a second technique then refines the list of recommendations. 7. Meta-level hybridization: a first technique is used, but differently than in feature augmentation, not to produce new features but to produce a model; in the second stage, the entire model is used as the input for the second technique [10].
3.2 Word2vec
Word2vec [20] is a famous word embedding algorithm. It is based on two-layer neural networks and attempts to learn vector representations of the words composing the input text. Words that share similar contexts are represented by close numerical vectors. Word2vec has two neural architectures, called CBOW and Skip-Gram. The first receives the context of a word as input, that is, the terms surrounding it in a sentence, and tries to predict the word in question. The second takes a word as input and predicts its context (see Fig. 2).
Fig. 2. Word2vec neural architectures
3.3 K Nearest Neighbor One of the most popular collaborative filtering methods is the K nearest neighbor (KNN) algorithm. KNN is a classification algorithm where k is a parameter that specifies the number of neighbors used; its operation can be likened to the following analogy: "tell me who your neighbors are, and I will tell you who you are…". To use KNN, the system must
have a similarity measure to distinguish between users who are near and those who are far. Depending on the problem studied, this measure of similarity can be derived [10] from the Euclidean distance, the cosine measure, the Pearson correlation, etc. Then, when the system is asked for new information, such as a search for movies that might appeal to an individual, it will find the k users closest to the target user. Finally, a majority vote is taken to decide which film will be recommended. For example, in the case of an explicit information system, the system could recommend the film with the highest rating, while for implicit information, the system might recommend the most popular movie within the neighborhood.
3.4 The Overall Approach
We used mixed hybridization (see Sect. 3.1), based on the fusion of the results of two models based on KNN and word2vec.
A. Model 1 (based on KNN)
The first model (see Fig. 1, Model 1) uses item-based collaborative filtering based on KNN. This model can be summarized as follows: (i) for a film f that is a candidate for recommendation, we determine the closest neighbours (similar films) by calculating its similarity with the other available films; (ii) the predicted rating of the current user Ux for the film f is calculated from the ratings that Ux has assigned to the neighbors of the film f. We make inferences about movies by applying the KNN (K nearest neighbor) algorithm. We use similarity information (cosine similarity) between movies; the function takes a film entered by the user as input and launches the KNN model, which calculates the distance between the input film and the films stored in our database. Given a defined number of neighbours, it ranks these calculated distances to return a list of recommended movies close to the input movie.
B. Model 2 (based on Word2vec)
We extend the idea of [21], which applies the word2vec method to non-textual e-commerce data. The goal of Model 2 is to recommend a set of movies to the active user based on his or her past watching behavior. The model uses the word2vec [20] algorithm, which takes all genres of movies in order to recommend a list of movies whose genres are similar to the genres of movies already seen by each user, so our recommendation in this step is content-based. For example, a user interested in watching films of the genre 'Drama' might have a watching pattern of similar films (Fig. 1, Model 2). We represent each of these movie genres by a vector; then, we can find similar movies based on the word2vec model. So, if a user is checking out a film, then
we will recommend him or her similar films by using the vector similarity score between movies. We defined two recommendation functions in this model based on the vectors extracted from the word2vec model after training: the first takes the vector (X) of a movie genre as input and returns the six most similar movies, and the second recommends films to each user according to the films that the user has already seen. We use the average of the vectors of the genres of films that the user has already seen, and we use the resulting vector as input to our second recommendation method to find similar movies. In this regard, we merge all the results obtained into one list of recommended movies.
C. Mixed Hybridization
We launch each model independently; then, we merge the two resulting lists into a single recommendation list to display to the active user. In this regard, we use both content-based filtering and collaborative filtering based on word2vec and KNN, respectively; these methods have different advantages and disadvantages, so that no single solution can answer all the problems [22]. For this reason, we decided to use the two techniques and combine them to produce a better recommendation, which overcomes many limitations such as cold start, data sparsity, and limited content analysis.
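As a rough illustration of Model 1 described above, the item-based KNN step could be sketched as follows with scikit-learn; the sparse item-user matrix and the function name are assumptions made for the example, not the authors' implementation:

from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def similar_movies_knn(item_user_matrix: csr_matrix, movie_index: int, n_recommendations: int = 10):
    """Item-based KNN with cosine distance: return (movie index, distance) pairs closest to movie_index."""
    knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=n_recommendations + 1)
    knn.fit(item_user_matrix)
    distances, indices = knn.kneighbors(item_user_matrix[movie_index], n_neighbors=n_recommendations + 1)
    # The first neighbour is the query movie itself, so it is dropped.
    return list(zip(indices.ravel()[1:], distances.ravel()[1:]))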
4 Experiments We tested the approach on the latest MovieLens dataset, which contains 100,000 ratings applied to 9,000 distinct movies by 600 users. We use two files: the first one, rating.csv, contains the ratings given to movies; the second file, movie.csv, contains movie information, where each line presents one id for one movie title and its genres. To present the results of our framework, we start with data preprocessing.
Data Preprocessing
Step 1 for Model 1: To overcome noisy patterns and tackle "memory errors", we filter our rating data frame to keep only popular movies and active users (more details in the next paragraph). To create the user-film matrix, we need to check how many ratings we have and how many are missing. Some users do not assign a rating to a film; the number of non-existent ratings is 5,830,804. We visualize the ratings on a log scale to know which rating value (from 0 to 5) is most frequent (see Fig. 3).
Fig. 3. Visualization of ratings from movielens dataset on log scale
In general, only a few users are interested in giving a rating to films (rating data sparsity), so we also filter the users based on their rating frequency; the goal is to find users (active users) who have given more than 30 ratings, and we find 501 active users. So, in the first model, we build our final ratings data, which have a reduced sparsity compared to the original ratings, based on the ratings of the most rated movies and the ratings of active users.
Step 2 for Model 2: We started our data preparation for this model by converting the movieId to string datatype; then, we split our data into 80% of users for training and 20% for testing (we choose active users, as we will extract the genres of movies already watched by each user to build the vocabulary of the word2vec model). We create sequences of watched films by the users in the dataset for both the train and test data.
Building the KNN Model
After preparing our dataset, we present the final ratings in an item-user matrix and fill the empty cells with zeros, as the model calculates the distance between every two points. We map the index of each movie to its title by using a constructed mapper. We create a sparse matrix based on these final ratings to make the calculations more efficient. In the beginning, we aim to detect whether the movie entered by the user exists in the database or not; then, to make an effective recommendation, we use fuzzy matching. We consider an active user who enters, for example, 'Father of the Bride Part II (1995)' as the input film. The model calculates the distance between the movie "Father of the Bride Part II (1995)" and the movies in the dataset, ranks these distances, and returns the ten most similar movies as part of the recommendations that our system produces.
Building the word2vec Model
We created our labelled data to train the word2vec model. Below we use the genres of movies; our model has 659 unique words as a vocabulary extracted from the past watching of users, and each of their vectors has a size of 100. The embedding vectors are created by applying the word2vec model to the genres of movies in MovieLens. The first function in this model takes a genre vector ('genreX') as input and returns the top 6
similar movies. In the second function, we take the average of all the vectors of the genres of the movies that the user has seen so far, and we use this resulting vector to find interesting movies.
Final Phase: Merging the Results of Model 1 and Model 2
In the end, we present all the results obtained from both Model 1 and Model 2 for the input film 'filmX'; then, as mentioned before, since our system is based on mixed hybridization, we merge all the results into a single list without any redundancy, applying a filtering process before presenting the final recommended movies to the active user.
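A hedged sketch of Model 2 along these lines is given below, using gensim's Word2Vec on per-user sequences of genre tokens; apart from the 100-dimensional vectors and the top-6 cut-off, the parameter values and names are assumptions made for illustration:

import numpy as np
from gensim.models import Word2Vec

def train_genre_word2vec(user_genre_sequences, vector_size=100):
    """Train word2vec on lists of genre tokens built from each user's watch history."""
    return Word2Vec(sentences=user_genre_sequences, vector_size=vector_size,
                    window=5, min_count=1, sg=1, epochs=10)

def recommend_from_history(model, watched_genres, topn=6):
    """Average the vectors of already-watched genres and return the most similar genre tokens."""
    vectors = [model.wv[g] for g in watched_genres if g in model.wv]
    profile = np.mean(vectors, axis=0)
    return model.wv.similar_by_vector(profile, topn=topn)

The genre tokens returned this way would still have to be mapped back to candidate movies before being merged, without duplicates, with the list produced by Model 1.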
5 Conclusion
In this paper, we proposed an approach that improves the quality of movie recommendation by combining two techniques, content-based filtering and collaborative filtering, using word2vec and KNN respectively, to draw on their advantages and overcome their limitations. The proposed approach can be applied to any area of recommendation (e-commerce recommendation, restaurant recommendation, etc.); it is based on two basic models that we launch independently. The first model uses the KNN algorithm to predict ratings. The second model applies the word2vec algorithm to non-textual data, particularly movie genres; then, we use the vectors generated by the algorithm to produce a list of movies. Finally, we merge the results obtained from both Model 1 and Model 2 into one single list based on mixed hybridization, which we present to the active user. In the future, we plan to enhance this work by carrying out a robust evaluation against other models to validate our results, by using more movie features and users' demographic data, and by working on the explanation part: we will add a more suitable style of explanation to satisfy users, and we will propose an advanced parameter that takes into consideration the independence between models in order to evaluate the future work.
References 1. Karimi, M., Jannach, D., Jugovac, M.: News recommender systems survey and roads ahead. Sci. Direct J. 1–4 (2018) 2. Samih, A., Adadi, A., Berrada, M.: Towards a knowledge based explainable recommender systems. In: Proceedings of the 4th International Conference on Big Data and Internet of Things, ser. BDIoT’19. New York, NY, USA: Association for Computing Machinery (2019). https://doi.org/10.1145/3372938.3372959 3. Samih, A., Ghadi, A., Fennan, A.: Deep graph embeddings in recommender systems: a survey. J. Theor. Appl. Inf. Technol. 99(15) (2021). https://doi.org/10.5281/zenodo.5353504 4. Herlocker, J., Konstan, J., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1999, pp. 230– 237 (1999) 5. Samih, A., Ghadi, A., Fennan, A.: ExMrec2vec: explainable movie recommender system based on Word2vec. Int. J. Adv. Comput. Sci. Appl. 12(8) (2021)
6. De Pessemier, T., Vanhecke, K., Dooms, S., Deryckere, T., Martens, L.: Extending user profiles in collaborative filtering algorithms to alleviate the sparsity problem. In: Filipe, J., Cordeiro, J. (eds.) WEBIST 2010. LNBIP, vol. 75, pp. 230–244. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22810-0_17 7. Seyednezhad, M., Cozart, K., Bowllan, J., Smith, A.: A review on recommendation systems: context-aware to social-based. IEEE J. 9–20 (2018) 8. Pal, A., Parhi, P., Aggarwal, M.: An improved content based collaborative filtering algorithm for movie recommendations. In: 2017 Tenth International Conference on Contemporary Computing (IC3), pp. 1–3 (2017). https://doi.org/10.1109/IC3.2017.8284357 9. Ali, S.M., Nayak, G.K., Lenka, R.K., Barik, R.K.: Movie recommendation system using genome tags and content-based filtering. In: Kolhe, M.L., Trivedi, M.C., Tiwari, S., Singh, V.K. (eds.) Advances in Data and Information Sciences. LNNS, vol. 38, pp. 85–94. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-8360-0_8 10. Arnautu, O.: Un système de recommandation de musique. Master's thesis, La Faculté des arts et des sciences, Université de Montréal (2012) 11. Benouaret, I.: Un système de recommandation contextuel et composite pour la visite personnalisée de sites culturels, pp. 19–20 (2018) 12. Ahuja, R., Solanki, A., Nayyar, A.: Movie recommender system using K-Means clustering and K-Nearest neighbor. In: 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence), pp. 263–268 (2019). https://doi.org/10.1109/CONFLUENCE.2019.8776969 13. Wasid, W., Ali, R.: An improved recommender system based on multicriteria clustering approach. Procedia Comput. Sci. 131, 93–101 (2018). https://www.sciencedirect.com/science/article/pii/S1877050918305659. ISSN 1877-0509 14. Kharita, M.K., Kumar, A., Singh, P.: Item-based collaborative filtering in movie recommendation in real time. In: 2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC), pp. 340–342 (2018). https://doi.org/10.1109/ICSCCC.2018.8703362 15. Prasad, R., Kumari, V.A.: Categorial review of recommender systems. Int. J. Distrib. Parallel Syst. (IJDPS) 3(5), 70–79 (2012) 16. Cano, E., Morisio, M.: Hybrid recommender systems: a systematic literature review. IEEE J. 2–3 (2017) 17. BenTicha, S.: Recommandation Personnalisée Hybride, pp. 51–54 (2018) 18. Burke, R.: Hybrid recommender systems: survey and experiments. User Model User-Adapt. Interact. 12(4), 331–370 (2002) 19. Renaud-Deputter, S.: Système de recommandations utilisant une combinaison de filtrage collaboratif et de segmentation pour des données implicites. Collection Sciences – Mémoires (2013). http://hdl.handle.net/11143/6599 20. Mikolov, T.: Efficient Estimation of Word Representations in Vector Space, Arxiv (2013) 21. Joshi, P.: Building a Recommendation System using Word2vec: A Unique Tutorial with Case Study in Python, available online (2019) 22. Samih, A., Ghadi, A., Fennan, A.: Translational-randomwalk embeddings-based recommender systems: a pragmatic survey. In: Kacprzyk, J., Balas, V.E., Ezziyyani, M. (eds.) Advanced Intelligent Systems for Sustainable Development (AI2SD'2020). AI2SD 2020. Advances in Intelligent Systems and Computing, vol. 1418. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-90639-9_77
Towards Big Data-based Sustainable Business Models and Sustainable Supply Chain Lahcen Tamym1(B), Lyes Benyoucef2, Ahmed Nait Sidi Moh3, and Moulay Driss El Ouadghiri1
1 Moulay Ismail University of Meknès, I&A Laboratory, Meknes, Morocco [email protected], [email protected]
2 Aix-Marseille University, University of Toulon, CNRS, LIS, Marseille, France [email protected]
3 Jean Monnet University, LASPI Laboratory, Roanne, France [email protected]
Abstract. Nowadays, in a hotly competitive environment, enterprises that do not integrate sustainability metrics into their operations, decision-making process, or business model (BM) cannot be considered successful sustainable businesses. Thus, in order to achieve their economic, environmental, and social sustainability, enterprises must integrate their supply chain network (SCN) and their indirect stakeholders (i.e., social and environmental) into their sustainable business models (SBMs). As an emergent technology, Big Data (BD) offers enterprises the ability to achieve sustainable development goals. To this end, in this paper, we present an overview of how enterprises are planning to integrate sustainability into their BMs and expand it to cover the stakeholders involved in their business processes. In addition, we give some benefits of leveraging BD in enterprises' SBMs, referred to as BD-based SBMs. Keywords: Sustainable business models · Supply chain sustainability · Big data · Industry 4.0 · Sustainability
1 Context and Motivations
In the last decades, the introduction of new, innovative, and sustainable business models (SBMs) has received increased academic attention [1], and successful enterprises are those that introduce new or innovative business models (BMs) [2]. Furthermore, the conventional enterprise's interest in making profits and value focuses only on economic performance, while, in the current business environment, enterprises that do not integrate sustainability metrics into their operations, decision-making process, or BM cannot be considered successful sustainable businesses [3]. Typically, enterprises' SBMs must meet the sustainable development goals launched by the United Nations. The main objective of an SBM is not only economic success; it is equally intended to create ecological and social values
[4]. Besides, the value offered by enterprises must consider ecological and social value generation [5,6], while value capture mechanisms transform the ecological and social value generated into revenue. Within Industry 4.0 (I4.0), the concepts of SBMs and supply chain sustainability (SCS) can be seen as game-changers of business, integrating ecological and social aspects in addition to economic rationales [7,8]. Indeed, SCS focuses on managing environmental, social, and economic impacts from upstream to downstream business operations. This leads to efficient governance and tracking of goods and services throughout their life-cycles [9]. The main intention behind SCS is the creation, protection, and growth of long-term environmental, social, and economic value for all stakeholders involved in the enterprise's focal business [10]. Thus, when changing their organizational culture to meet the needs of the current business based on these concepts, enterprises face a strong challenge in becoming more sustainable and rethinking their BMs in parallel. Several studies in the recent literature try to address this challenge, even though placing this idea into practice is a hard task [2]. To this end, in this review study, we present an overview of how enterprises are planning to make their BMs more sustainable, besides taking into account the sustainability of their supply chains (SCs) in a general context. Thus, a deep study is performed from the focal enterprise's SBM to the SCS level. In addition, enterprises have begun to collect and analyze data on a wide range of sustainability-related factors, such as energy and resource use, greenhouse gas emissions, and logistics performance. Hence, leveraging and integrating BD into their SBMs will enable them to generate the necessary data to derive insights that will help guide their sustainability-related initiatives and improve overall resource efficiency and sustainable development. Thus, we give some related BD-based SBMs that we have identified in the literature. The following motivations are identified: 1. Enterprises' sustainability is an emergent topic. 2. Integrating SBMs with SCS and societal and ecological stakeholders will make enterprises' businesses innovative and more sustainable. 3. Coupling BD with enterprises' SBMs will revolutionize the current business and make it meet the needs of economic, social, and environmental sustainable development. The rest of the paper is organized as follows: Sect. 2 presents an overview of BMs and SBMs and the benefits of integrating SBMs with SCN sustainability. Section 3 shows the potential of leveraging BD technology within enterprises' SBMs. Finally, Sect. 4 concludes the paper with some future work directions.
2 Overview
2.1 Sustainability and Supply Chain
The aim of sustainability, derived from the U.S. National Environmental Policy Act of 1969 (NEPA), is to "create and maintain conditions, under which
humans and nature can live in productive conformity, that allow fulfilling the social, economic, and other needs of present and future generations." Sustainability can be defined as the "rearrangement of technological, scientific, environmental, economic and social resources in such a way that the resulting system can be controlled in a state of momentary and spatial equilibrium" [11]. Moreover, sustainability is recognised as "the evolution that meets the present requirements without compromising the ability of future generations to meet their own needs; there are multiple dimensions of sustainable evolution: economy, environment, and society." Sustainability covers many industries and their related activities, such as the SC. For instance, several authors have tried to integrate energy consumption into the total cost of production using time-of-use (TOU) tariffs in scheduling problems. In addition, a methodology was developed to prepare a sustainable value stream mapping (Sus-VSM) [12]. This methodology includes various metrics to evaluate a manufacturing line's economic, environmental, and societal sustainability performances. Metrics were selected to assess the process, water consumption, raw material usage, energy consumption, potential hazards concerning the work environment, and the physical work done by the employees. Dornfeld [13] stated that appropriate metrics to measure sustainability in the SC are enablers of technology in design processes and suggested, respectively, carbon footprint, energy consumption, and pollution, among other metrics. Moreover, to design sustainable and reliable power systems under uncertainty, a novel multi-objective fuzzy robust optimization approach was proposed [14]. The objective is to determine the optimal number, location, capacity, and technology of the generation production sites and the electricity generated and transmitted through the network while minimizing the sustainability and reliability costs of the system. Thus, the approach solves the problem by simultaneously improving both sustainability and reliability and capturing uncertain factors. A case study in Vietnam is used to demonstrate the efficiency of the developed approach. Recently, [15] tried to answer "how sustainable manufacturing and the use of the new technologies can enable I4.0 to positively impact all the sustainability dimensions in an integrated way and support the implementation of the I4.0 agenda". They proposed a conceptual framework formed by the principles and technological pillars of I4.0, sustainable manufacturing scope, and sustainability dimensions, guided by an analysis of 35 papers from 2008 to 2018 selected by a systematic approach. They concluded that "the field is legitimated but not consolidated; however, it is evolving based on the development of new BMs and value-creation-chain integration. Moreover, research gaps and opportunities exist for field development, becoming more mature and having a significant contribution to fully developing the agenda of I4.0." [16] presented a systematic literature review approach with the help of three major digital scientific databases, i.e., IEEE, Web of Science, and Scopus, to find out the current research progress and future research potential of I4.0 technologies to achieve manufacturing sustainability. They stated that "the majority of studies focused on a general discussion about the I4.0 concepts and theories. However, very few studies discuss the role of
shop floor activities and different technologies for achieving manufacturing sustainability." Moreover, "in management, case studies, and simulation approaches were used to validate the proposed approaches. Nevertheless, there is a lack of studies on social and environmental issues which can contribute to achieving sustainable SCs".
2.2 Business Models Vs Sustainable Business Models
Generally, a BM can be seen as the design of the mechanisms and strategies that an organization adopts to create, deliver, and capture value for its involved stakeholders, including suppliers, business partners, and customers, to name a few [17,18]. Simply put, a BM describes the specific actions that an enterprise takes to achieve its business goals. An SBM, in contrast, is the analysis and management of an enterprise's sustainable value proposition (SVP) to all its stakeholders and of how this value is being created and delivered [19]; moreover, it describes how the enterprise captures economic value while maintaining or regenerating natural and social capital beyond its organizational boundaries [3]. Therefore, typical BMs are designed to create value for the owner of the business and shareholders, which makes them insufficient and incomplete, due to the complexity of the current business [20]. This complexity resides in the effects that a business has on its ecosystem, society, and the business itself. Accordingly, enterprises have to consider their ecosystem, SC, and other external stakeholders. Indeed, they have to generate value for the environment and society by accounting for the effects that their businesses have on them [5]. Thus, they should increase sustainable practices within their business by making their value creation (VC) models include the full impacts on direct and indirect stakeholders [19], for instance by reducing, reusing, and recycling materials or products within the enterprise's SC, or by sending waste to be used otherwise [21]. Based on a deep study conducted by [22], a synthesis of the different kinds of BMs and SBMs available in the literature is presented; Table 1 summarizes the key features of and comparisons between BM and SBM models. Within their study, in order to identify and validate the existing SBM patterns, they performed a classification of BMs and SBMs by developing, testing, and applying a new multi-method and multi-step approach centered on an expert review process that combines a literature review, a Delphi survey, and physical card sorting, following the notion of patterns as problem-solution combinations. As a result, they offered a rich taxonomy that classified 45 SBM patterns, and these patterns were assigned to 11 groups along ecological, social, and economic sustainability dimensions, together with the potential contributions they have on VC and sustainability-oriented BMs.
2.3 Sustainable Business Models and Sustainable Supply Chain
The focus of the SBM concept is the ecological, social, and economic VC of a focal enterprise [19], while sustainable SC management (SSCM) focuses on managing goods, information, and capital flows and the relationships between business partners
Table 1. Existing sustainable business models

Business model pattern: Social Business Model: No dividends
Features: Offering products and services to "base of the pyramid" and low-income groups; encouraging new social businesses and extending social target groups' benefit
Examples: Grameen-Veolia

Business model pattern: Hybrid model/Gap-exploiter model
Features: Reducing the use of virgin, finite resources and diminishing waste and pollution; designing a durable product that contains short-lived consumable parts; selling the long-lasting products and remanufacturing their short-lived parts
Examples: Epson's EcoTank printer

Business model pattern: Product Design
Features: Designing products that meet the societal and political expectations and that support the circular economy; replacing inefficient and harmful product designs by offering responsible and sustainable long-lasting products that increase users' eco-efficiency and are reusable, repairable, and/or recyclable
Examples: Xella Denmark

Business model pattern: Co-Product Generation
Features: Reducing unused surplus material, waste and production costs; exploiting the by-products resulting from product generation in producing new additional products or selling them directly on the market, which allows companies to reduce waste, optimize material flows, and increase revenues
Examples: Industrial symbiosis of Kalundborg in Denmark

Business model pattern: Product Recycling, Remanufacturing/Next Life Sales, Repair, Reuse
Features: Solving the problem of increasing resource scarcities, wastage, and pollution by exploiting products that have been used and still have use-value for others; leveraging the back flow of products or product components in remanufacturing, repairing or replacing processes, by disassembling products to reuse their components in "as new" products; this allows retaining the value contained in products and creating new revenue sources
Examples: Bionic Yarn, Apple Certified Refurbished, Agito Medical, Godsinlösen

Business model pattern: Green Supply Chain Management
Features: Improving the efficiency and transparency of the enterprise's SC network in terms of using natural resources and avoiding risks and harms along the network; sourcing raw materials that are eco-friendly and reducing or even eliminating toxic materials; making commitments between partners, including suppliers, to greening SC management
Examples: IKEA, "IKEA Way on Purchasing Products, Materials and Services"

Business model pattern: Market-Oriented Social Mission or One-Sided Social Mission
Features: Expanding BMs to meet the "base of the pyramid" and low-income groups that are often excluded from particular forms of consumption due to price barriers or the non-existence of markets for these groups; offering opportunities to excluded social target groups to engage as a productive and paid workforce
Examples: Arbeiterkind (Working-class Child), Grameen Bank

Business model pattern: Two-Sided Social Mission
Features: Integrating social target groups, including neglected groups, as customers or productive partners; improving social business through two-sided platforms to match suppliers and users of products and services; offering platforms to match two social target groups (the production side and the consumption side), where the production-side group offers free production support for the consuming social target group
Examples: Was hab' ich? (What do I have?), an online interactive platform

Business model pattern: Product-oriented Services
Features: Finding ways to offer complex products or new eco-friendly technologies in a broader sense, because of several barriers related to their diffusion; finding ways to convince users to switch from old and inefficient products to new, more eco-friendly versions
Examples: Tesla, e-mobiles as products
and this focal enterprise, including suppliers and customers, to improve the sustainability performance of the SC network (SCN). Both the SBM and SSCM concepts share the main idea of developing an enterprise's sustainability by integrating and considering direct and indirect business stakeholders [23]. Specifically, SBMs maintain linkages between social, environmental, and economic issues along SCNs, and in the creation, capture, and delivery of the value proposition [19,24]. Moreover, SBM management goes beyond the organization-centric VC perspective to include the inter-organizational perspective of the whole SCN and other involved stakeholders [25], for instance by extending this value to cover customers, suppliers, distributors, employees, financial stakeholders, societal stakeholders, and the natural environment [26], to name a few. Figure 1 illustrates the sustainable value that an enterprise can create beyond its shareholders by looking at other external stakeholders such as financial and societal stakeholders. Moreover, by contributing to SCSM, enterprises have to generate value for the environment and society, beyond long-term financial competition. Thus, strengthening the relationship between an enterprise's SBM and its SCN will encourage enterprise sustainability and SCS; it will also ensure the enterprise's long-term survival and thrivability and make it more attractive to eco-minded consumers as well as to potential employees [4].
Fig. 1. Sustainable enterprise’s value proposition to the involved stakeholders and sustainable value offers by stakeholders to the enterprise.
To this end, enterprises that integrate their SBM and SCSM are more likely to succeed in building a more sustainable business. This integration enables an efficient trade-off between economic growth needs, social inclusion, and the mitigation of environmental impacts. Modern enterprises and their global value chains face many risks related to society and the environment; thus, managing both the SBM and SCS ensures that these risks are identified [10].
3 Big Data-driven Sustainable Business Models
Sustainability and BD are two concepts that have received great attention by the scientific community as a promising subject, and many significant methodological innovations to solve sustainability challenges using Big Data Analytics (BDA) technology have been proposed [27]. Moreover, sustainability analytics (SA) refers to the application of BDA to achieve sustainable development [28]. SA enables enterprises to streamline their operations, provide important insights, and move towards green goods and services, as well as sustainable commercial value [28]. Indeed, enterprises begin to collect and analyze data on a wide range of sustainability-related factors, such as energy and resource use, greenhouse gas emissions, and logistics performance. The leveraging of BDA on these generated data gives the necessary insights to guide their sustainability-related initiatives and improve overall resources efficiency and sustainable development. In addition, benefiting from the most recent BDA tools and technologies, enterprises can perform real-time (or near real-time) sustainability analysis on massive volumes of data and maintain linkages between economic, social, and environmental sustainability elements. To this end, in order to recognize the opportunities that BD provides to society, environment, and economy, the United Nations created the “BD for sustainable development” program in 2017 to fulfill the sustainable development goals (SDGs) [29,30], also known as grand challenges (GC) [31]. Digitizing and sustaining their BMs, enterprises are increasingly aware of BD technology benefits [32]. Thus, leveraging BDA on data generated in various sources such as (Internet of Things, enterprise information systems, web content, social media, environmental reports, etc.) will significantly transform the enterprise’s SBM. In addition, the integration of BDA in enterprise’s SBM will improve its internal business processes, enrich its products and services by making them meet environmental and social needs, enrich customer experiences, as well as monetize their internal data [33]. Based on the literature study, we identified four types of BD-based SBMs; namely, data users, data suppliers, delivery Networks, data facilitators [34,35]. Indeed, in the context of technological evolvement, BD for SBMs represent a crucial phenomenon in terms of capturing value for enterprises [36]. Therefore, these four BD-based SBMs are mapped on four BM dimensions, such as value proposition, value architecture, Value network, and value finance [34]. Figure 2 illustrates the BD-based BMs types and the value network that each type offers to customers. BD-based sustainable business will enable enterprises and their direct or indirect stakeholders to create and capture sustainable
value in the context of sustainable business. To this end, integrating BD within enterprises' BMs improves and optimizes their decision-making and internal processes [37] and enables them to predict and proactively handle critical events. These increasing value creation and value proposition efforts will offer data and related valuable insights to the enterprise and its supply-chain partners to enhance economic, social, and environmental sustainability. Accordingly, analyzing and processing data to make information available to all involved stakeholders and SC partners in sustainability is a crucial value offer. For instance, optimization issues related to CO2 reduction, energy consumption, and greening logistics processes can be addressed using BD-based SBMs. This can be done in the circular-economy context, using recycling information or product/service information derived from the data shared and available across multiple business stakeholders [38]. Thus, BD-based SBMs are leading factors that shape and transform product design and service delivery to meet sustainability needs. Transforming such data through the phases of a BD-based SBM enables enterprises to gain knowledge and wisdom that constitute a new source of income and help green their activities.
Fig. 2. BD-based sustainable business models
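As a purely illustrative sketch of the kind of analysis such a BD-based SBM could expose to supply-chain stakeholders, the snippet below aggregates a hypothetical logistics dataset into a CO2-per-unit indicator per partner. The column names, values, and emission factor are assumptions for illustration and are not taken from this study.

```python
import pandas as pd

# Hypothetical logistics records shared by supply-chain partners.
# Column names, values, and the emission factor are illustrative assumptions.
shipments = pd.DataFrame({
    "partner":       ["supplier_A", "supplier_A", "carrier_B", "carrier_B"],
    "fuel_litres":   [380.0, 260.0, 150.0, 95.0],
    "units_shipped": [10_000, 7_000, 4_000, 2_500],
})

EMISSION_FACTOR = 2.68  # kg CO2 per litre of diesel (assumed generic value)
shipments["co2_kg"] = shipments["fuel_litres"] * EMISSION_FACTOR

# Sustainability KPI per partner: kg of CO2 per unit shipped, the kind of
# indicator a BD-based SBM could share with all supply-chain stakeholders.
kpi = shipments.groupby("partner").agg(co2=("co2_kg", "sum"),
                                       units=("units_shipped", "sum"))
kpi["kg_co2_per_unit"] = kpi["co2"] / kpi["units"]
print(kpi["kg_co2_per_unit"])
```

In practice such indicators would be computed over the streaming, multi-source data mentioned above rather than a small in-memory table; the sketch only makes the value offer concrete.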
4 Conclusion and Future Work Directions
The conventional enterprises’ business focuses on making profits and economic value only. While, the current world is moving towards sustainable development thus, the enterprises that do not integrate sustainability metrics into their operations, decision-making process, or BM cannot be considered as successful sustainable businesses. Consequently, this paper presented an overview of SBMs
and their contribution to SC sustainability and societal and environmental stakeholders. In addition, we show how BD as an emergent technology will improve enterprises’ SBMs to meet sustainable development needs. Indeed, in the I4.0 context, SBMs and SCS can be seen as game-changers of business in integrating ecological and social aspects in addition to economic rationales. Finally, as future work, we expect to explore more about the role of BD and its related technologies in enhancing enterprises’ BMs and SCS. Also, studying a real-life scenario will be one of our primary concerns in the near future.
References 1. Bouncken, R.B., Fredrich, V.: Business model innovation in alliances: successful configurations. J. Bus. Res. 69(9), 3584–3590 (2016) 2. Minatogawa, V.L.F., et al.: Operationalizing business model innovation through big data analytics for sustainable organizations. Sustainability 12(1), 227 (2020) 3. Schaltegger, S., Hansen, E.G., Lüdeke-Freund, F.: Business models for sustainability: origins, present research, and future avenues. Organ. Environ. 29(1), 3–10 (2016) 4. Bocken, N., Short, S., Rana, P., Evans, S.: A literature and practice review to develop sustainable business model archetypes. J. Clean. Prod. 65, 42–56 (2014) 5. Evans, S., Fernando, L., Yang, M.: Sustainable value creation—from concept towards implementation. In: Stark, R., Seliger, G., Bonvoisin, J. (eds.) Sustainable Manufacturing. SPLCEM, pp. 203–220. Springer, Cham (2017). https://doi. org/10.1007/978-3-319-48514-0_13 6. Sinkovics, N., Sinkovics, R.R., Yamin, M.: The role of social value creation in business model formulation at the bottom of the pyramid-implications for MNES? Int. Bus. Rev. 23(4), 692–707 (2014) 7. Chalmeta, R., Santos-deLeón, N.J.: Sustainable supply chain in the era of industry 4.0 and big data: a systematic analysis of literature and research. Sustainability 12(10), 4108 (2020) 8. Müller, J.: Data-based sustainable business models in the context of industry 4.0 (2021) 9. Tamym, L., Moh, A.N.S., Benyoucef, L., Ouadghiri, M.D.E.: Goods and activities tracking through supply chain network using machine learning models. In: Dolgui, A., Bernard, A., Lemoine, D., von Cieminski, G., Romero, D. (eds.) APMS 2021. IAICT, vol. 630, pp. 3–12. Springer, Cham (2021). https://doi.org/10.1007/978-3030-85874-2_1 10. Tamym, L., Benyoucef, L., Nait Sidi Moh, A., El Ouadghiri, M.D.: A big data based architecture for collaborative networks: supply chains mixed-network. Comput. Commun. 175, 102–111 (2021) 11. de Ron, A.J.: Sustainable production: the ultimate result of a continuous improvement. Int. J. Prod. Econ. 56–57, 99–110 (1998). Production Economics: The Link Between Technology And Management 12. Faulkner, W., Templeton, W.D., Gullett, D.E., Badurdeen, F.: Visualizing sustainability performance of manufacturing systems using sustainable value stream mapping (sus-vsm) (2012) 13. Dornfeld, D.A.: Moving towards green and sustainable manufacturing. Int. J. Precision Eng. Manuf.-Green Technol. 1(1), 63–66 (2014). https://doi.org/10.1007/ s40684-014-0010-7
14. Tsao, Y.C., Thanh, V.V.: A multi-objective fuzzy robust optimization approach for designing sustainable and reliable power systems under uncertainty. Appl. Soft Comput. 92, 106,317 (2020) 15. Machado, C.G., Winroth, M.P., da Silva, E.H.D.R.: Sustainable manufacturing in industry 4.0: an emerging research agenda. Int. J. Prod. Res. 58(5), 1462–1484 (2020) 16. Jamwal, A., Agrawal, R., Sharma, M., Giallanza, A.: Industry 4.0 technologies for manufacturing sustainability: a systematic review and future research directions. Appl. Sci. 11(12), 5725 (2021) 17. Sjödin, D., Parida, V., Jovanovic, M., Visnjic, I.: Value creation and value capture alignment in business model innovation: a process view on outcome-based business models. J. Prod. Innov. Manag. 37(2), 158–183 (2020) 18. Johnson, E.A.J.: Business model generation: a handbook for visionaries, game changers, and challengers by alexander osterwalder and yves pigneur. hoboken, nj: John wiley & sons, 2010. 281+iv pages. us$34.95. J. Product Innovation Manage. 29(6), 1099–1100 (2012) 19. Evans, S., Vladimirova, D., Holgado, M., Van Fossen, K., Yang, M., Silva, E.A., Barlow, C.Y.: Business model innovation for sustainability: towards a unified perspective for creation of sustainable business models. Bus. Strateg. Environ. 26(5), 597–608 (2017) 20. Latifi, M.A., Nikou, S., Bouwman, H.: Business model innovation and firm performance: Exploring causal mechanisms in smes. Technovation 107, 102,274 (2021) 21. Fratila, D.: 8.09 - environmentally friendly manufacturing processes in the context of transition to sustainable production. In: Comprehensive Materials Processing, pp. 163–175 (2014) 22. Lüdeke-Freund, F., Carroux, S., Joyce, A., Massa, L., Breuer, H.: The sustainable business model pattern taxonomy-45 patterns to support sustainability-oriented business model innovation. Sustain. Prod. Cons. 15, 145–162 (2018) 23. Lüdeke-Freund, F., Gold, S., Bocken, N.: Sustainable business model and supply chain conceptions - towards an integrated perspective, pp. 337–363 (2016) 24. Boons, F., Lüdeke-Freund, F.: Business models for sustainable innovation: stateof-the-art and steps towards a research agenda. J. Clean. Prod. 45, 9–19 (2013). Sustainable Innovation and Business Models 25. Gold, S., Seuring, S., Beske, P.: Sustainable supply chain management and interorganizational resources: a literature review. Corp. Soc. Responsib. Environ. Manag. 17(4), 230–245 (2010) 26. Freudenreich, B., Lüdeke-Freund, F., Schaltegger, S.: A stakeholder theory perspective on business models: value creation for sustainability. J. Bus. Ethics 166, 3–18 (2020) 27. Lv, Z., Iqbal, R., Chang, V.: Big data analytics for sustainability. Futur. Gener. Comput. Syst. 86, 1238–1241 (2018) 28. Deloitte: Sustainability analytics, the three-minute guide. Technical report, Deloitte Development LLC (2012) 29. Zhang, D., Pan, S.L., Yu, J., Liu, W.: Orchestrating big data analytics capability for sustainability: a study of air pollution management in China. Inf. Manage. 59, 103231 (2019) 30. Nations, U.: Transforming our world: the 2030 agenda for sustainable development. https://sdgs.un.org/2030agenda. Accessed 25 Jul 2021 31. Ye, L., Pan, S.L., Wang, J., Wu, J., Dong, X.: Big data analytics for sustainable cities: an information triangulation study of hazardous materials transportation. J. Bus. Res. 128, 381–390 (2021)
32. Akter, S., Wamba, S.F., Gunasekaran, A., Dubey, R., Childe, S.J.: How to improve firm performance using big data analytics capability and business strategy alignment? Int. J. Prod. Econ. 182, 113–131 (2016) 33. Baecker, J., Engert, M., Pfaff, M., Krcmar, H.: Business strategies for data monetization: deriving insights from practice, pp. 972–987 (2020) 34. Wiener, M., Saunders, C., Marabelli, M.: Big-data business models: a critical literature review and multiperspective research framework. J. Inf. Technol. 35(1), 66–91 (2020) 35. Schroeder, R.: Big data business models: challenges and opportunities. Cogent Social Sci. 2(1), 1166,924 (2016) 36. Teece, D.J., Linden, G.: Business models, value capture, and the digital enterprise. J. Organ. Des. 6(1), 1–14 (2017). https://doi.org/10.1186/s41469-017-0018-x 37. Ehret, M., Wirtz, J.: Unlocking value from machines: business models and the industrial internet of things. J. Mark. Manag. 33, 111–130 (2017) 38. Müller, J.M., Veile, J.W., Voigt, K.I.: Prerequisites and incentives for digital information sharing in industry 4.0 - an international comparison across data types. Comput. Ind. Eng. 148, 106,733 (2020)
Treatment of Categorical Variables with Missing Values Using PLS Regression Yasmina Al Marouni(B)
and Youssef Bentaleb
Engineering Sciences Laboratory, ENSA, Ibn Tofail University, Kenitra, Morocco {yasmina.almarouni,youssef.bentaleb}@uit.ac.ma
Abstract. Partial Least Squares Regression (PLSR) is a data analysis method that allows the prediction and estimation of complex models via latent and manifest variables. Moreover, PLSR can be used with either small or large sample sizes. However, it was designed to handle quantitative variables only, which is a limiting restriction. This manuscript presents PLS1 for Categorical Predictors with Missing Values (PLS1-CAP-MV), an adaptation of PLS for Categorical Predictors, to handle categorical data with missing values. The relevance of PLS1-CAP-MV is demonstrated and supported through its application to an actual data set, and a comparison between the variable importance in projection (VIP) of PLS-NIPLAS and PLS1-CAP-MV is performed. The data pertain to the security of children on the Internet. Keywords: Partial Least Square (PLS) regression · Categorical data · Missing values · VIP · NIPLAS
1 Introduction and Literature Review The Internet is likely to play an ever larger role in the lives of millions of children. While emerging technologies and digital solutions provide important opportunities for children to continue to learn, play, and connect with others, these same tools may also expose them to a multitude of risks. Long before the pandemic began, online sexual exploitation, harmful content, misinformation, and cyber-harassment were all threats to children's rights, safety, and psychological well-being. The purpose of this paper is to analyze the online behavior of children and then to determine the precautions to be taken to protect them against cybercrime. To that effect, 490 children from different cities in Morocco were interviewed. The data collected are incomplete, the sample size is small, and the questionnaire contains 34 variables, most of which are qualitative.
Consequently, the challenge is to find the most appropriate statistical regression method that can handle:
1. Small sample size.
2. Large number of variables.
3. Presence of qualitative variables.
4. Presence of data with missing values.
Thus, this regression method should make it possible not only to predict, analyze, and study data with missing values, but also to reduce dimensions and process data of different types. This brings us back to partial least squares regression (PLSR). PLSR is a data analysis method proposed by Wold, Albano, Dunn III, Esbensen, Hellberg, Johansson, and Sjöström in 1983, and was mainly developed by Svante Wold [10]. PLSR is both a linear regression tool and a dimension reduction tool [4, 5, 11]; in addition, it can be an effective method for prediction in big, high-dimensional regressions [8]. Moreover, PLSR is distinguished by its ability to be used when the number of variables is greater than the number of observations and when the sample size is small. It can also be used in the absence of distributional hypotheses and in the presence of problems arising from multi-collinearity [10]. Furthermore, the PLS algorithm can use the Nonlinear estimation by Iterative Partial Least Square (NIPLAS) algorithm to process data with missing values without estimating the missing values or deleting rows with incomplete data [10]. Nevertheless, PLSR was designed to treat only quantitative variables [10]. To benefit from the advantages of PLSR and apply it to categorical variables, some methods have been proposed:
• A common method consists of replacing each qualitative variable with its indicator matrix [10]; the model then considers categories as separate variables [6].
• PLS for Categorical Predictors (PLS-CAP), an adaptation of PLSR [9]. With this method, a model can be obtained using the original variables' names, and variables need not be quantitative.
A comparative study was made to highlight the PLSR adaptations and extensions used to handle non-quantitative data [2]. Based on the results of this comparison, we chose to focus on adapting PLS-CAP to treat categorical data with missing values. This method was chosen because it can handle all types of data (nominal, ordinal, or categorical), and with these characteristics the study data can be processed. In this paper, we propose an adaptation of PLS1 to process categorical data with missing values and one response variable. The proposed method is named PLS1 for Categorical Predictors with Missing Values (PLS1-CAP-MV). We then compare the results of applying PLS1-NIPLAS with the common method and PLS1-CAP-MV, and finally we conclude.
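For readers unfamiliar with the common method mentioned above, the short sketch below illustrates, on invented toy data, how a qualitative predictor can be replaced by its indicator (dummy) matrix while keeping missing values missing. It is only an illustration of the idea, not code from the study, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with one qualitative predictor and a missing value
# (invented values, not the study data).
X = pd.DataFrame({"city": ["Rabat", "Sale", None, "Rabat"],
                  "age":  [10, 11, 12, 10]})

# Common method: the qualitative variable is replaced by its indicator
# (dummy) matrix, so each category becomes a separate 0/1 column; the
# missing observation stays missing in every dummy column.
dummies = pd.get_dummies(X["city"], prefix="city", dtype=float)
dummies[X["city"].isna()] = np.nan
X_common = pd.concat([dummies, X[["age"]]], axis=1)
print(X_common)
```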
2 PLS1 for Categorical Predictors with Missing Values PLS1 for Categorical Predictors with Missing Values (PLS1-CAP-MV) algorithm is based on PLS-CAP [9] algorithm and PLS1-NIPLAS [10] to handle categorical variables with missing values. The diagram below illustrates how the PLS1-CAP-MV method works.
Diagram 1. PLS1-CAP-MV implementation model.
In fact, we start by replacing the original matrix with a matrix containing only quantitative variables; the quantification is based on Hayashi's first quantification method [3]. Next, the resulting matrix is used as input of the PLS1-NIPLAS algorithm. The steps followed are:
1. Create a new matrix initialized with the quantitative variables of the initial matrix.
2. For each qualitative variable of the initial matrix, replace each qualitative value with a numerical value while retaining the missing values.
3. Concatenate the resulting quantified variables with the new matrix.
4. Use the transformed matrix as input of the PLS1-NIPLAS algorithm.
In the following, we detail the PLS1-CAP-MV algorithm using the notation below:
• $X = \{x_1, \dots, x_k\}$: the matrix of k variables observed on n individuals.
• $y$: the vector of the response variable.
• $u_1$: a vector initialized with the response variable.
• $x_{li}^{ql}$: the value of observation i of the qualitative vector $x_l^{ql}$.
• $x_l^{qqs}$: the vector of quantified values without the rows containing missing values.
• $x_{li}^{qq}$: the value of observation i of the vector $x_l^{qq}$.
• $u_{1i}^{(e)}$: true if the value of individual i of the vector $u_1$ exists.
• $x_{ji}^{(e)}$: true if the value of individual i of $x_{ji}$ exists.
• $X^{qt}$: the matrix of quantitative variables.
• $X^{qq}$: the matrix of quantified variables.
• $X_0$: the juxtaposition of $X^{qt}$ and $X^{qq}$.
• $X_1$: the residual of the regression of $X_0$ on $t_1$.
• $y_1$: the residual of the regression of $y$ on $t_1$.
• $p_1$: the regression coefficient of $t_1$ in the regression of $X_0$ on $t_1$.
• $c_1$: the regression coefficient of $t_1$ in the regression of $y$ on $t_1$.
The algorithm starts by initializing $u_1$ with the response variable and then follows the steps below.
Step 1: Calculation of the quantified values. The objective of this step is to replace every qualitative variable with missing values by a quantitative variable with the same missing values. For each $l \in \{1, \dots, L\}$:
$$x_l^{qq(e)} = G_l^{(e)} \left( G_l^{(e)\top} G_l^{(e)} \right)^{-1} G_l^{(e)\top} u_{1l} \qquad (1)$$
with:
• $x_l^{qq(e)}$: the vector of quantified values of $x_l^{ql}$ restricted to the observations where $x_l^{ql}$ exists;
• $G_l^{(e)}$: the dummy (indicator) matrix associated with $x_l^{qq(e)}$;
• $u_{1l}$: the vector containing the values $u_{1i}$ of $u_1$ for the observations where $x_{li}^{ql}$ exists.
The resulting quantified variable is $x_l^{qq}$ = {for each observation i: $x_{li}^{qq}$ if the value exists, NA otherwise}.
Step 2: Deduce the new quantified matrix:
$$X_0 = X^{qt} \mid X^{qq} \qquad (2)$$
This new matrix $X_0$ is the juxtaposition of $X^{qt}$ and $X^{qq}$. $X_0$ is used as input to the classic PLS1-NIPLAS method. $X_0$ contains the same missing values as the original matrix; therefore, the choice of the method for handling missing values remains flexible.
Step 3: Calculation of the first component $t_1$ and the weight vector $w_1$:
$$t_1 = X_0 w_1 / (w_1^\top w_1) \qquad (3)$$
For each observation i:
$$t_{1i} = \frac{\sum_{j:\, x_{ji}\ \text{exists}} w_{1j} x_{ji}}{\sum_{j:\, x_{ji}\ \text{exists}} w_{1j}^{2}} \qquad (4)$$
with
$$w_{1j} = \frac{\tilde{w}_{1j}}{\left( \sum_{j=1}^{p} \tilde{w}_{1j}^{2} \right)^{1/2}} \qquad (5)$$
The coefficient $w_{1j}$ is the normalization of $\tilde{w}_{1j}$, which is given by
$$\tilde{w}_{1j} = \frac{\sum_{i:\, x_{ji}\ \text{and}\ u_{1i}\ \text{exist}} x_{ji} u_{1i}}{\sum_{i:\, x_{ji}\ \text{and}\ u_{1i}\ \text{exist}} u_{1i}^{2}} \qquad (6)$$
In addition:
• $t_{1i}$ represents the slope of the least squares line, passing through the origin, of the scatter plot $(w_{1j}, x_{ji})$;
• $\tilde{w}_{1j}$ represents the slope of the least squares line, passing through the origin, of the scatter plot $(u_{1i}, x_{ji})$.
Step 4: Calculation of $p_1$:
$$p_1 = X_0^\top t_1 / (t_1^\top t_1) \qquad (7)$$
Step 5: Calculation of $X_1$:
$$X_1 = X_0 - t_1 p_1^\top \qquad (8)$$
Step 6: Calculation of $c_1$:
$$c_1 = y^\top t_1 / (t_1^\top t_1) \qquad (9)$$
Step 7: Calculation of $u_1$:
$$u_1 = y / c_1 \qquad (10)$$
Step 8: Regression equation. After these steps, a simple regression of $y$ on $t_1$ is performed:
$$y = y_1 + c_1 t_1 \qquad (11)$$
The coordinates of the vectors $w_1$, $t_1$, $p_1$ and $c_1$ represent slopes of least squares lines passing through the origin; thus, they can be computed on data with missing values. If the explanatory power of the regression of $y$ on $t_1$ is too low, a second component $t_2$ is constructed in the same way as $t_1$, and $y$ is then regressed on $t_1$ and $t_2$ [10]:
$$y = y_2 + c_1 t_1 + c_2 t_2 \qquad (12)$$
Steps 3 to 8 are repeated in the same manner to calculate additional components until their maximum number is reached. The number of components is usually determined by cross-validation [10].
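To make the procedure concrete, here is a minimal Python sketch of the two stages described above, written under the assumption of a pandas DataFrame of predictors and a numeric response. It illustrates Eqs. (1)-(10) for a single component only and is not the authors' implementation; function and variable names are illustrative.

```python
import numpy as np
import pandas as pd

def quantify(col: pd.Series, u1: pd.Series) -> pd.Series:
    """Step 1: each category of a qualitative column is replaced by the mean
    of u1 over the observations of that category (the projection of Eq. (1));
    missing values stay missing."""
    scores = u1.groupby(col).mean()
    return col.map(scores)

def pls1_component(X0: pd.DataFrame, y: pd.Series):
    """One component of PLS1 with missing values (Eqs. (3)-(10)): every
    coefficient is the slope of a least-squares line through the origin,
    computed over the available (non-missing) pairs only."""
    u1 = y.astype(float)
    # w~_1j (Eq. 6): slope of x_j on u1 over the pairs where x_ji exists
    w = X0.mul(u1, axis=0).sum() / X0.notna().mul(u1 ** 2, axis=0).sum()
    w = w / np.sqrt((w ** 2).sum())                        # normalisation (Eq. 5)
    # t_1i (Eq. 4): slope over the predictors available for observation i
    t = (X0 * w).sum(axis=1) / (X0.notna() * (w ** 2)).sum(axis=1)
    p = X0.mul(t, axis=0).sum() / (t ** 2).sum()           # Eq. (7)
    c = (y * t).sum() / (t ** 2).sum()                     # Eq. (9)
    return w, t, p, c

# Assumed usage (df: raw predictors of mixed types, y: numeric response):
# X0 = pd.concat([df.select_dtypes("number")] +
#                [quantify(df[col], y) for col in df.select_dtypes("object")],
#                axis=1)
# w1, t1, p1, c1 = pls1_component(X0, y)
```

Further components would be obtained by deflating $X_0$ and iterating, with the number of components chosen by cross-validation as stated above.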
3 Comparative Case Study 3.1 Data Description The study sample resulted from the participation of 490 students (aged between 10 and 12) from different cities in Morocco in 2017. Although the sample size was not large, this did not influence the results, as PLSR can be used with a small sample [9]. The research focused on the following variables.
• Explanatory variables are as follows: city, sector (private, public), sex, level of study, French language, English language, Spanish language, computer available, tablet available, smartphone available, internet use, use with adult, internet search language, internet usage frequency, tools of internet access, average hours of internet use, use video games, frequency of use of video games, number hour games, Instagram access, WhatsApp access, twitter access, Facebook access, Snapchat access, YouTube access, use of the same email address, instruction from adult, place of instruction, source of instruction, chat with strangers, sharing personal data, request personal information, request camera. • Explained variable is victim internet crime. The number of variables was high, which was more conducive to the choice of a multidimensional approach. The principal aims of the study were: • Analyze the safe behavior of children on the internet. • Identify the key factors that can be manipulated to protect children from the threats of cybercrime. • Make children aware of the dangers of cybercrime. 3.2 Matrix Quantification The first stage of PLS1-CAP-MV was the quantification of the input matrix, the algorithm used was developed under python. The two tables successively presented extracts of data before and after quantification (Tables 1 and 2). Table 1. Extract from the original matrix
Table 2. Extract from the quantified matrix
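Since the table extracts themselves appear as images in the original, the toy example below, with invented values, illustrates what the quantification stage does to a categorical column: each category is mapped to a numerical score while missing values stay missing. The column names and scores are assumptions for illustration only, not the study data.

```python
import pandas as pd

# Invented extract playing the role of Table 1 (original matrix).
original = pd.DataFrame({
    "sex":          ["F", "M", None, "F"],
    "internet_use": ["yes", "yes", "no", None],
    "age":          [10, 12, 11, 10],
})
y = pd.Series([1, 0, 0, 1], name="victim_internet_crime")

# Quantified matrix (role of Table 2): every category is replaced by a
# numerical score (here, the mean response within the category) and the
# missing values remain missing.
quantified = original.copy()
for col in ["sex", "internet_use"]:
    quantified[col] = original[col].map(y.groupby(original[col]).mean())
print(quantified)
```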
All qualitative values have been transformed into numerical values and missing values remain missing, considering both figures. Accordingly, the choice of the method for handling missing values became more flexible. 3.3 PLS1-NIPLAS Application The second phase of PLS1-CAP-MV was the application of PLS1-NIPLAS using the resulting matrix of Stage 1 as input. In the next section, the results of the comparison of Variable Importance in Projection (VIP) for independent variables between PLS1-CAP-MV and PLS1-NIPLAS with the common method for processing categorical data will be presented. Subsequently, we will highlight some results from the PLS1-CAP-MV application. In fact, and as a reminder, the common method consists of replacing each qualitative variable with its indicator matrix, and then apply the PLS method [10]. 3.3.1 Variable Importance in Projection (VIP) for Independent Variables Variable Importance in Projection (VIP) for independent variables is used to avoid the two main problems encountered in a multivariate analysis [1]. Those problems arise when the number of predictors becomes larger than the sample size and when the multicollinearity between the predictor variables appears. The resolution of those problems by VIP was done by selecting important variables in projection [1]. Most works selected variables with the value of VIP score more than a constant value such as 1 [12], or 2 [7]. In this paper, the constant value is 0.8. The two figures below showed the VIP after the application of the common method and PLS1-CAP-MV respectively.
Fig. 1. VIP after application of PLS1-NIPLAS with the common method for processing categorical data
Fig. 2. VIP after PLS1-CAP-MV application
The two figures reflect the relative importance of the predictors in the prediction model through the first factors. The first figure shows that the number of relevant predictors is very high and that all retained variables represent categories of the initial variables, such as 'City Casablanca' instead of 'City'. The interest of the study was to analyze variables, not categories; hence, the rest of the paper contains PLS1-CAP-MV results only. We conclude from Fig. 2 that the most relevant predictor, with the highest VIP, is "Request for Personal Information". The remaining predictors with significant VIP (>= 1) are: Request_cam, Share_pers data, Chat_with_stranger, Instructon_source, Snapchat_access, Facebook_access, WhatsAapp_access, Instagram_access, Nbr_hour_game, Internet_use_avg_hrs, Smartphone_available, Study_level and City. The VIP score of the variable Instruction_place is close to 0; consequently, this variable could be removed from the model because it is not relevant for predicting the response variable. PLS1-CAP-MV reliability: Figure 1 is the result of applying PLS1-NIPLAS using a software-native method, whereas Fig. 2 is the result of PLS1-NIPLAS after applying the quantification method. Comparing the two illustrations, we find that the VIP of the variables is related to the VIP of their categories. The resemblance between the VIPs obtained with the two different methods shows that the quantification method is reliable, as is PLS1-CAP-MV.
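As a reference for how VIP scores such as those in Figs. 1 and 2 can be computed from a fitted PLS1 model, a small sketch follows. The matrix names and shapes are assumptions, and the 0.8 threshold is the one used in this paper.

```python
import numpy as np

def vip_scores(W: np.ndarray, T: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Standard VIP for a PLS1 model with A components.
    W: (p, A) predictor weights, T: (n, A) scores, c: (A,) y-loadings."""
    p, _ = W.shape
    ss = (c ** 2) * (T ** 2).sum(axis=0)          # y-variance explained per component
    w2 = (W / np.linalg.norm(W, axis=0)) ** 2     # squared normalised weights
    return np.sqrt(p * (w2 * ss).sum(axis=1) / ss.sum())

# Selection rule of this paper: keep predictors whose VIP exceeds 0.8
# (other works use 1 or 2 as the threshold).
# selected = [name for name, v in zip(predictor_names, vip_scores(W, T, c)) if v >= 0.8]
```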
3.3.2 Loadings and Weights of Predictors
The weights and loadings are quite similar and serve similar interpretive uses. The weights of the predictors indicate how much they are correlated with the PLS response, and the factor loadings show how the PLS factors are constructed from the centered and scaled predictors. The resulting plots of weights and loadings are shown in the figures below (Figs. 3 and 4).
Fig. 3. Predictor weight after PLS1-CAP-MV application
Fig. 4. Predictor loading after PLS1-CAP-MV application
The six factors represent the six latent variables; the number six was determined by cross-validation. The figures show that the weight and the loading of the six latent variables depend on a set of predictors. This does not conflict with the VIP chart, as all variables with a significant VIP contribute significantly to the construction of one or more latent variables. In addition, most of the variables that contribute significantly to the construction of the latent variables have a significant VIP. 3.3.3 X-Scores Versus Y-Scores Figure 5 plots the response scores against the predictor scores, also referred to as X-Y scores graphs, and further explores the characteristics of the PLS model. It
examines X scores versus Y scores, explores how successive factors were selected by partial least squares, and also reveals outliers.
Fig. 5. X-Y scores plots after PLS1-CAP-MV application
Note that the figure above shows the X-Y scores for the 6 factors. For the first factor, we noted that the data were divided into two separate groups. For the remaining factors, we found a resemblance in the distribution of the data among their charts. Summary From the PLS output above, the researcher would conclude that, contrary to the belief of some, being a victim of cybercrime is not gender dependent. However, it is closely associated with the request for personal information, it also depends on other predictors such as camera request and the use of certain social network, in addition to the time spent on the internet and video games. The results obtained are not at odds with logic. When a user requests a child’s personal information, his intentions can be malicious, which exposes the child to the risk of being a victim of cybercrime. Given the criticality of the subject, the author suggests considering all important variables as a precaution to protect children from cybercrime.
4 Conclusion The purpose of the current study was to suggest an adaptation of the PLS1 algorithm to handle categorical variables with missing values. The proposed method was compared with the common PLS adaptation through a case study. These experiments confirmed that the new method (PLS1-CAP-MV) is better adapted to the case study because it enables the analysis of variables, unlike the common method, which requires analyzing categories instead of variables. The two essential steps of PLS1-CAP-MV are: (1) construct a new matrix with only quantitative variables while retaining data with missing values; (2) use this matrix as input of the PLS1-NIPLAS algorithm.
Acknowledgements. The authors would like to thank the CMRPI (Moroccan Centre polytechnic research and innovation www.cmrpi.ma) for its support.
References 1. Akarachantachote, N., et al.: Cut off threshold of variable importance in projection for variable selection. Int. J. Pure Appl. Math. 94(3):307–322 (2014). https://doi.org/10.12732/ijpam.v94 i3.2 2. Al Marouni, Y., Bentaleb Y.: State of art of pls regression for non quantitative data and in big data context. In: Proceedings of the 4th International Conference on Networking, Information Systems amp; Security, NISS2021, New York, NY, USA, 2021. Association for Computing Machinery (2021). ISBN 9781450388719. https://doi.org/10.1145/3454127.3456615 3. Hayashi, C.: On the prediction of phenomena from qualitative data and the quantification of qualitative data from the mathematicostatistical point of view. Ann. Inst. Statist. Math. 3(6998), 1952 (1952). https://doi.org/10.1007/BF02949778 4. Wold, H.: Soft modelling by latent variables: the non-linear iterative partial least square (niplas) approach. J. Appl. Probab. 12, 117–142 (1975). https://doi.org/10.1017/s00219002 00047604 5. Helland, I.: On the structure of partial least square regression. Commum. Stat. Simul. Comput. 17(2), 581–607 (1988). https://doi.org/10.1080/03610918808812681 6. Jakobowiz, E., Derquenne, C.: A modified pls path modeling algorithm handling reflective categorical variables and a new model building strategy. Comput. Stat. Data Anal. 51(36663678), (2007). https://doi.org/10.1016/j.csda.2006.12.004 7. Pärez-Enciso, M., Tenenhaus, M.: Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (pls-da) approach. Hum. Genet. 112(5), 581–592 (2003). https://doi.org/10.1007/s00439-003-0921-9 8. Cook, R., Liliana, F.: Big data and partial least-squares prediction. Can. J. Stat. 46, 2017 (2017). https://doi.org/10.1002/cjs.11316 9. Russolillo, G., Lauro, C.N.: A proposal for handling categorical predictors in PLS regression framework. In: Fichet, B., Piccolo, D., Verde, R., Vichi, M. (eds.) Classification and Multivariate Analysis for Complex Data Structures. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 343–350. Springer, Heidelberg (2011). https://doi.org/10.1007/ 978-3-642-13312-1_36 10. Tenenhaus, M.: 1998. Editions Technip, La régression PLS - Théorie et pratique (1998) 11. Naes, T., Martens, H.: Comparaison of prediction methods for multicollinear data. Commum. Stat. Simul. Comput. 14, 545–576 (1985). https://doi.org/10.1080/03610918508812458 12. Chen, Y.: Statistical approaches for detection of relevant genes and pathway in analysis of gene expression data (2008)
Type 2 Fuzzy PID for Robot Manipulator Nabil Benaya1 , Faiza Dib1(B) , Khaddouj Ben Meziane2 , and Ismail Boumhidi3 1 LRDSI Laboratory, FSTH, Abdelmalek Essaadi University, Tetouan, Morocco
[email protected], [email protected]
2 Higher Institute of Engineering and Business (ISGA), Fez, Morocco
[email protected]
3 FSDM, Sidi Mohammed Ben Abdellah University, Fez, Morocco
[email protected]
Abstract. This paper develops a robust controller for a nonlinear dynamic robot manipulator subjected to external disturbances. We propose an approach combining type-2 fuzzy logic with the proportional-integral-derivative (PID) controller to control the joint-angle positions of a two-link robot manipulator, applying an optimization mechanism to the PID gains. The type-2 fuzzy system is applied to improve the efficiency and robustness of the traditional PID controller, introducing a certain degree of intelligence into the control strategy. To verify the effectiveness of the approach developed in this study, simulation results show that the T2FL-PID produces a better response than the T1FL-PID and PID controllers. The feasibility and performance of this controller have been validated in simulation on the control of the two-link robot manipulator. Keywords: Fuzzy logic type 2 · PID · Two-link robot · Arm manipulator
1 Introduction The degree of evolution of a robot is linked to several factors, such as the information introduced into its artificial server and the pressure of economic forces. Robot control consists of transcribing the task and its operational variables into generalized variables, then managing the movements of the different axes, the motorizations, and the servo-controls, in complete safety for the robot and the environment in which it operates [1]. Two regulation tasks are used depending on the application of the robot. The first is proprioceptive regulation, which consists in enabling the tool to effectively follow, in position and orientation, the requested trajectories; this regulation uses proprioceptive data from the robot: the torques requested from the actuators and the positions and speeds of the different articular variables. The second is exteroceptive regulation, which integrates data from the world outside the robot, such as the vision of the work scene taken by a camera and the quantities describing the interaction of the robot with its environment, such as force sensors [2].
Several works relating to the use of fuzzy logic confirm that fuzzy regulators have advantages over classical regulators such as a very short processing time and mathematical precision, despite these advantages, classical regulators have several limitations, in the case where fairly large variations of the disturbance factors act on the system to be regulated and the nonlinear systems. The classical controller does not react optimally to unknown disturbances and the adaptability and robustness of this type of regulator are therefore limited. The fuzzy logic control is a nonlinear control with robustness properties. It is very interesting to explore its potential for the control of nonlinear systems. Several works have been carried out to improve the control of robot manipulators. Wang et al. employed the adaptive incremental sliding mode as a control method [3]. The Adaptive neural combined with the backstepping control is applied for a flexible-joint robot manipulator in [4]. Yin et al. proposed the fuzzy logic and the sliding mode control to guarantee good trajectory tracking for serial robotic manipulators [5]. A robust feedback linearization technique is applied in [6] to control the robot manipulators. The fuzzy PID controller is used in papers [7, 8]. For the PID regulator to remain a powerful and robust controller, it is necessary to develop it by supervising these gains using a fuzzy adaptation mechanism, allowing a certain degree of intelligence to be incorporated into the regulation strategy. The majority of research work using fuzzy logic is generally founded on the first classical approach type-1 fuzzy logic. The fuzzy logic type 2 can generate data that is difficult to model by type-1 fuzzy logic techniques. Note that the parametric uncertainties, the neglected dynamics, and the external disturbances cause abrupt variations of the operating point. To take into account all the data, in particular, those generated by these disturbing factors, we use type-2 fuzzy modeling. This is characterized by fuzzy membership functions. For type-2 fuzzy systems, each entry has a degree of membership defined by two functions of type-1, one upper and one lower membership function so that all data is between the two functions. In the case of type-1 fuzzy sets, the degree of membership of an element is an ordinary number that belongs to the interval [0,1] while for a type-2 fuzzy set the degree of membership is a blurry set [9]. Unfortunately, these type-2 fuzzy sets are more difficult to define and use than type-1 fuzzy sets. But, their good handling of uncertainties, not supported by type-1 fuzzy sets, justified their use. Currently, Type -2 Fuzzy Systems have been used in several applications, such as decision making, resolution of fuzzy relationships, process monitoring, mobile robot control, and data processing [9]. Different types of PID controllers using fuzzy logic type 2 for various systems have been proposed. In [10] the authors proposed the control power system by using the fractional order PID controller with the fuzzy type-2 f to enhance the dynamic stability of this system. In [11] the PID based on interval type-2 fuzzy controller is optimized by hybrid algorithm-differential evolution and neural network for proton exchange membrane fuel cell air feeding system. In [12] the authors have applied the interval type-2 fuzzy PID to guarantee a high accuracy for the electro-optical tracking system. Nayak et al. 
in [13] applied the Crow Search Algorithm to optimally tune the scaling factors of a fuzzy PID controller for an automatic generation control power system. Yun et al. [14] realized a position controller using a quantitative factor-scale factor fuzzy PID for a two-link robot manipulator. Zhong et al. [15] realized a method for accurate tracking of the trajectory at the manipulator's end-effector using a fast
terminal sliding mode controller with a fuzzy adaptive PID. Dhyani et al. [16] applied an evolving fuzzy PID controller to guarantee optimal trajectory tracking for a 7-DOF redundant manipulator. Angel and Viola [17] applied a fractional-order PID to a parallel robotic manipulator. Aftab and Luan proposed fuzzy PID controllers to solve the reactor power control problem [18]. Arteaga-Pérez demonstrated the ability of PID controllers to regulate and stabilize robot manipulators [19]. This paper aims to develop an effective control strategy that associates the PID regulator with a supervisor composed of type-2 fuzzy logic rules. The approach offers the possibility of combining the mathematical precision of the PID algorithm with the adaptability, flexibility, and simplicity of the type-2 fuzzy linguistic formalism, in order to eliminate the error between the robot output and the desired trajectory despite the presence of disturbances and the nonlinearity of the system. The organization of this paper is as follows: the dynamics of the robot manipulator model studied are given in Sect. 2. A review of type-2 fuzzy logic and the proposed control scheme is presented in Sect. 3. Section 4 shows the simulation results and discussion. The conclusion is presented in Sect. 5.
2 System Modeling of Two-Link Robot Manipulator The conception and control of robots necessitate the calculation of certain mathematical models. In order to successfully design a correct model, one must go through the following different models:
• the direct and inverse geometric models,
• the direct and inverse kinematic models,
• the dynamic models.
In this work, we are interested in a two-link manipulator robot, shown in the following figure (Fig. 1).
Fig. 1. Two link robot manipulator.
The mathematical model representing the dynamics of the robot above is given by:
$$M(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) = \tau \qquad (1)$$
where $q$, $\dot{q}$ and $\ddot{q}$ are respectively the angular vectors of position, speed and acceleration, with $q = [q_1\ q_2]^T$ the vector of the two joints of the robot. $M(q)$ is the inertia matrix given by:
$$M(q) = \begin{bmatrix} I_1 + I_2 + 4 m_2 l_1^2 + 4 m_2 l_1 l_2 \cos(q_2) & I_2 + 2 m_2 l_1 l_2 \cos(q_2) \\ I_2 + 2 m_2 l_1 l_2 \cos(q_2) & I_2 \end{bmatrix} \qquad (2)$$
$C(q,\dot{q})$ is the matrix of Coriolis and centrifugal effects, described by:
$$C(q,\dot{q}) = \begin{bmatrix} -2 m_2 l_1 l_2 \dot{q}_2 \sin(q_2) & -2 m_2 l_1 l_2 (\dot{q}_1 + \dot{q}_2) \sin(q_2) \\ 2 m_2 l_1 l_2 \dot{q}_1 \sin(q_2) & 0 \end{bmatrix} \qquad (3)$$
$G(q)$ is the gravity vector, given by:
$$G(q) = \begin{bmatrix} m_2 g l_2 \sin(q_1 + q_2) + m_1 g l_1 \sin(q_2) \\ m_2 g l_2 \sin(q_1 + q_2) \end{bmatrix} \qquad (4)$$
We define the terms of each matrix used in the control law: $m_1$, $m_2$ are the masses of the first and second segments of the robot manipulator; $l_1$, $l_2$ are the respective lengths of the first and second segments; $I_1$, $I_2$ are the respective inertias of the first and second segments; $g$ is the gravity term. The parameters of the system used in the simulations are given in Table 1 [20].
Table 1. Physical properties of the robot manipulator.
Link     Mass (mi / kg)   Length (li / m)   Moment of inertia (Ii / kg m^2)
Link 1        4                0.5                    0.2
Link 2        4                0.5                    0.2
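A minimal numerical sketch of this model follows, using the Table 1 parameters. The paper's own simulations were done in Matlab/Simulink, so this Python version is only an illustration; in particular, the row ordering of the Coriolis matrix is taken from the reconstruction of Eq. (3) above.

```python
import numpy as np

# Table 1 parameters and gravity
m1 = m2 = 4.0      # kg
l1 = l2 = 0.5      # m
I1 = I2 = 0.2      # kg m^2
g = 9.81           # m/s^2

def dynamics(q, dq):
    """Inertia, Coriolis, and gravity terms of Eqs. (2)-(4)."""
    q1, q2 = q
    dq1, dq2 = dq
    M = np.array([
        [I1 + I2 + 4*m2*l1**2 + 4*m2*l1*l2*np.cos(q2), I2 + 2*m2*l1*l2*np.cos(q2)],
        [I2 + 2*m2*l1*l2*np.cos(q2),                   I2],
    ])
    C = np.array([
        [-2*m2*l1*l2*dq2*np.sin(q2), -2*m2*l1*l2*(dq1 + dq2)*np.sin(q2)],
        [ 2*m2*l1*l2*dq1*np.sin(q2),  0.0],
    ])
    G = np.array([m2*g*l2*np.sin(q1 + q2) + m1*g*l1*np.sin(q2),
                  m2*g*l2*np.sin(q1 + q2)])
    return M, C, G

def step(q, dq, tau, dt=1e-3):
    """One Euler integration step of M(q)q'' + C(q, q')q' + G(q) = tau (Eq. (1))."""
    q, dq, tau = (np.asarray(v, dtype=float) for v in (q, dq, tau))
    M, C, G = dynamics(q, dq)
    ddq = np.linalg.solve(M, tau - C @ dq - G)
    return q + dt*dq, dq + dt*ddq
```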
3 Proposed Controller Design The main task of this study is to show the performance of a self-adjusting PID control designed using type-2 fuzzy logic with two inputs, the error and the variation of the error; the three outputs KP, KI, and KD are tuned in real time for each joint. The mathematical model of the PID corrector is given as follows:
$$\tau_{pid} = K_P\, e(t) + K_I \int e(t)\, dt + K_D \frac{de(t)}{dt} \qquad (5)$$
The error is given by:
$$e(t) = q(t) - q_d(t) \qquad (6)$$
where $\tau_{pid}$ is the applied torque for each joint, $e(t)$ is the position error in radians of each joint, and $q_d(t)$ is the requested position in radians. Type-2 fuzzy sets were proposed by Lotfi Zadeh as an extension of type-1 fuzzy sets in which each element of a type-2 membership function is a fuzzy number in [0,1] [21]. Type-2 fuzzy sets are used in situations where uncertainty arises. An interval type-2 fuzzy set is a type-2 fuzzy set whose secondary membership functions are type-1 sets [22, 23]. The design of a type-2 fuzzy system is given in Fig. 2 [24]:
Fig. 2. Structure of a type-2 fuzzy system.
For a type-1 fuzzy logic system, the output processing block is reduced to defuzzification only. When an input is applied to a type-1 fuzzy system, the inference mechanism gives each corresponding rule a type-1 output fuzzy set. Defuzzification then calculates a real output from the fuzzy sets delivered by each rule [21]. In the case of a type-2 fuzzy system, each rule output is of type 2, and there are several methods that can convert a type-2 set into a type-1 set. This operation is called "type reduction" instead of defuzzification, and the resulting type-1 set is called the "reduced set" [25]. The center-of-sets type-reduced set is
$$Y_{\cos}(X) = [y_l, y_r] = \int_{y^1 \in [y_l^1, y_r^1]} \cdots \int_{y^M \in [y_l^M, y_r^M]} \int_{f^1 \in [\underline{f}^1, \bar{f}^1]} \cdots \int_{f^M \in [\underline{f}^M, \bar{f}^M]} 1 \Bigg/ \frac{\sum_{i=1}^{M} f^i y^i}{\sum_{i=1}^{M} f^i} \qquad (7)$$
where
$$y_l = \frac{\sum_{i=1}^{M} f_l^i y_l^i}{\sum_{i=1}^{M} f_l^i}, \qquad y_r = \frac{\sum_{i=1}^{M} f_r^i y_r^i}{\sum_{i=1}^{M} f_r^i}$$
A real output value is then obtained by calculating the centroid of $Y_{\cos}(X)$. The membership functions of the outputs for each gain (KP, KI, KD) are defined by the terms N, Z, P. Figure 3 shows the membership functions for the outputs and inputs of the type-2 fuzzy PID.
Fig. 3. Membership functions of the inputs and the outputs of type-2 fuzzy logic controller.
The membership function for the inputs and the outputs of the type-1 fuzzy logic controller is illustrated in Fig. 4 and Fig. 5 respectively.
Fig. 4. Membership functions of the inputs
Fig. 5. Membership functions of the outputs
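The sketch below illustrates how such a type-2 fuzzy supervisor can schedule the PID gains. It is a simplified illustration, not the authors' controller: the triangular terms, footprint-of-uncertainty width, rule base, base gains, and spans are all assumptions, and the full Karnik-Mendel iteration of Eq. (7) is replaced by an average-of-firing-intervals type reduction.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership value."""
    return max(min((x - a) / (b - a + 1e-9), (c - x) / (c - b + 1e-9)), 0.0)

def it2_membership(x, a, b, c, spread=0.2):
    """Interval type-2 term built from a type-1 triangle: the upper and
    lower membership functions bound the footprint of uncertainty."""
    upper = tri(x, a - spread, b, c + spread)
    lower = 0.8 * tri(x, a, b, c)
    return lower, upper

# Linguistic terms N, Z, P on a normalised error / error-change universe (assumed).
TERMS = {"N": (-2.0, -1.0, 0.0), "Z": (-1.0, 0.0, 1.0), "P": (0.0, 1.0, 2.0)}
# Toy rule base: crisp gain-correction consequent for each (error, d_error) pair.
RULES = {("N", "N"): -1.0, ("N", "Z"): -0.5, ("N", "P"): 0.0,
         ("Z", "N"): -0.5, ("Z", "Z"):  0.0, ("Z", "P"): 0.5,
         ("P", "N"):  0.0, ("P", "Z"):  0.5, ("P", "P"): 1.0}

def fuzzy_gain_correction(e, de):
    """Crisp correction in [-1, 1]; the firing interval of each rule is
    averaged instead of running the full Karnik-Mendel type reduction."""
    num = den = 0.0
    for (te, tde), y in RULES.items():
        le, ue = it2_membership(e, *TERMS[te])
        ld, ud = it2_membership(de, *TERMS[tde])
        f_low, f_up = min(le, ld), min(ue, ud)   # firing interval of the rule
        f = 0.5 * (f_low + f_up)
        num, den = num + f * y, den + f
    return num / den if den else 0.0

def t2fl_pid(e, de, ie, base=(5.0, 1.0, 0.5), span=(2.0, 0.5, 0.2)):
    """PID torque (Eq. (5)) with gains scheduled by the fuzzy supervisor;
    base gains and spans are placeholder values."""
    corr = fuzzy_gain_correction(e, de)
    kp, ki, kd = (b + s * corr for b, s in zip(base, span))
    return kp * e + ki * ie + kd * de
```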
4 Simulations Results We present the simulation results by displaying the angles of two joints q1 and q2, and also their tracking errors. The simulation is carried out according to the diagram developed with Matlab/Simulink, presented in Fig. 6. The design of the PID adjusted by the type-2 fuzzy logic is illustrated in Fig. 7, which gives us the three outputs corresponding to the three parameters of the PID regulator (KP, KI, KD) (Fig. 9).
Fig. 6. The complete model of robot manipulator with T2FL-PID.
Fig. 7. Structure of a type-2 fuzzy PID.
Fig. 8. The error of joint q1.
Fig. 9. The position of joint q1.
With the proposed command, the joint q1 is stabilized in a very short time. From Fig. 8 we can notice that the classical regulator application (PID) requires more time and more oscillations and the tracking error always oscillates between −0.1 and 0.7 and does not converge toward zero. However, the application of type 1 fuzzy logic on the PID controller slightly improves the response and allows convergence towards the
desired values in time, and allows the tracking error to vary between −0.1 and 0.2. On the other hand, the proposed command eliminates the error perfectly between the desired position and the system output, which demonstrates the good performance of the proposed command in terms of error elimination and oscillation damping.
Fig. 10. The error of joint q2.
Fig. 11. The position of joint q2.
According to the results of the simulation presented in Figs. 10 and 11, the classical PID regulator presents a large oscillation for the joint q2 and the tracking error always oscillates between −0.1 and 0.2 and does not converge toward zero, when one applies fuzzy logic type 1 on the PID regulator, the convergence time is reduced with low damping of the oscillations and the error oscillates between –0.1 and 0.1. But, when the gains of the PID regulators are adjusted by fuzzy logic type 2, the response is improved compared to the other approaches, consequently, the proposed control device gives a better response in convergence time with a damping of the oscillations and the cancellation of tracking error.
5 Conclusion The main goal of the approach presented in this article is to set up a type 2 fuzzy regulator which makes it possible to counter the effect of the disturbance and the nonlinearity of the system in order to tune the three gains of the PID regulator (Kp, Ki, Kd) during the variation of setpoints over time. The comparative results have demonstrated the better performance and accuracy of the developed approach compared to the other controllers the PID and the type 1 fuzzy PID, in eliminating the persistent oscillations and tracking to the desired values optimally without error.
References 1. Trana, M.D., Kang, H.J.: Adaptive terminal sliding mode control of uncertain robotic manipulators based on local approximation of a dynamic system. Neurocomputing 228, 231–240 (2017) 2. Lund, S.H.J., Billeschou, P., Larsen, L. Bo.: High-bandwidth active impedance control of the proprioceptive actuator design in dynamic compliant robotics. Actuators 8, 2–33 (2019)
3. Wang, Y., Zhang, Z., Li, C., Buss, M.: Adaptive incremental sliding mode control for a robot manipulator. Mechatronics 82, 1–14 (2022) 4. Cheng, X., Zhang, Y., Liu, H., Wollherr, D., Buss, M.: Adaptive neural backstepping control for flexible-joint robot manipulator with bounded torque inputs. Neurocomputing 458, 70–86 (2021) 5. Yin, X., Pan, L., Cai, S.: Robust adaptive fuzzy sliding mode trajectory tracking control for serial robotic manipulators. Robot. Comput.-Integr. Manufact. 72, 1–15 (2021) 6. Perrusquía, A.: Robust state/output feedback linearization of direct drive robot manipulators: a controllability and observability analysis. Eur. J. Control. 64, 1–10 (2022) 7. Amit, K., Shrey, K., Bahadur, K.: PD/PID-fuzzy logic controller based tracking control of 2-link robot manipulator. i-Manager’s J. Instrum. Control Eng. 7, 18–25, (2019) 8. Das, S., Pan, I., Das, S., Gupta, A.: A novel fractional-order fuzzy PID controller and its optimal time-domain tuning based on integral performance indices. Eng. Appl. Artif. Intell. 25, 430–442 (2012) 9. Lathamaheswari, M., Nagarajan, D., Kavikumar, J., Broumi, S.: Triangular interval type-2 fuzzy soft set and its application. Complex Intell. Syst. 6(3), 531–544 (2020). https://doi.org/ 10.1007/s40747-020-00151-6 10. Abdulkhader, H.K., Jacob, J., Mathew, A.T.: Robust type-2 fuzzy fractional order PID controller for dynamic stability enhancement of power system having RES-based microgrid penetration. Electr. Power Energy Syst. 110, 357–371 (2019) 11. AbouOmar, M.S., Su, Y., Zhang, H., Shi, B., Lily, W.: Observer-based interval type-2 fuzzy PID controller for PEMFC air feeding system using novel hybrid neural network algorithmdifferential evolution optimizer. Alex. Eng. J. 61, 7353–7375 (2019) 12. Tong, W., Zhao, T., Duan, Q., Zhang, H., Mao, Y.: Non-singleton interval type-2 fuzzy PID control for high precision electro-optical tracking system. ISA Trans. 120, 258–270 (2022) 13. Nayak, J.R., Shaw, B., Sahu, B.K., Naidu, K.A.: Application of optimized adaptive crow search algorithm based two degrees of freedom optimal fuzzy PID controller for AGC sys-tem. Eng. Sci. Technol. Int J. 32, 1–14 (2022) 14. Yun, J., et al.: Self-adjusting force/bit blending control based on quantitative factor-scale factor fuzzy-PID bit control. Alex. Eng. J. 61, 4389–4397 (2022) 15. Wang, C., Dou, W.: Fuzzy adaptive PID fast terminal sliding mode controller for a redundant manipulator Guoliang Zhong. Mech. Syst. Signal Process. 159, 1–16 (2021) 16. Dhyani, A., Panda, M.K., Jha, B.: Design of an evolving fuzzy-PID controller for optimal trajectory control of a 7-DOF redundant manipulator with prioritized sub-tasks. Expert Syst. Appl. 162, 1–12 (2020) 17. Angel, L., Viola, J.: Fractional-order PID for tracking control of a parallel robotic manipulator type delta. ISA Trans. 79, 172–188 (2018) 18. Aftab, A., Luan, X.: A fuzzy-PID series feedback self-tuned adaptive control of reactor power using nonlinear multipoint kinetic model under reference tracking and disturbance rejection. Ann. Nucl. Energy 166, 1–13 (2022) 19. Arteaga-Pérez, M.A.: An alternative proof to the asymptotic stability of PID controllers for regulation of robot manipulators. IFAC J. Syst. Control 9, 1–8 (2019) 20. Zuo, Y., et al.: Neural network robust H1 tracking control strategy for robot manipulators. Appl. Math. Model. 34, 1823–1838 (2010) 21. Castillo, O., Melin, P.: A review on interval type-2 fuzzy logic applications in intelligent control. Inf. Sci. 279, 615–631 (2014) 22. 
Lahlou, Z., Ben Meziane, K., Boumhidi, I.: Sliding mode controller based on type-2 fuzzy logic PID for a variable speed wind turbine. Int. J. Syst. Assur. Eng. Manag. 10(4), 543–551 (2019). https://doi.org/10.1007/s13198-019-00767-z
23. Ben Meziane, K., Naoual, R., Boumhidi, I.: Type-2 fuzzy logic based on PID controller for AGC of Two-area with three source power system including advanced TCSC. Procedia Comput. Sci. 148, 455–464 (2019) 24. Ben Meziane K., Dib F., Boumhidi, I.: Design of type-2 fuzzy PSS combined with sliding mode control for power system SMIB. In: 1st International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), pp. 1–6 (2020) 25. Castillo, O., Martinez, A.I., Martinez, A.C.: Evolutionary computing for topology optimization of type-2 fuzzy systems. In: Analysis and Design of Intelligent Systems using Soft Computing Techniques. Advances in Soft Computing. vol. 41, pp. 63–75 (2007)
Using Latent Class Analysis (LCA) to Identify Behavior of Moroccan Citizens Towards Electric Vehicles Taoufiq El Harrouti1(B) , Mourad Azhari2 , Abdellah Abouabdellah1 , Abdelaziz Hamamou3 , and Abderahim Bajit3 1
Department of Computer Science, Logistics and Mathematics (ILM) Engineering Science Laboratory, National School of Applied Sciences, Ibn Tofail University Kenitra, Kenitra, Morocco [email protected] 2 Center of Guidance and Planning, Rabat, Morocco 3 Laboratory of Advanced Systems Engineering ISA, National School of Applied Sciences, Ibn Tofail University, Kenitra, Morocco [email protected]
Abstract. Electric Vehicles (EV) represent an efficient way to reduce the effects of the gas emissions polluting our environment. In this context, Latent Class Analysis (LCA), as an unsupervised Machine Learning method, is used to identify people's behavior towards ecological phenomena, particularly the EV as an alternative to the usual way of mobility. This method can detect group profiles (clusters) from the manifest variables. In this paper, we use the LCA method to identify the behavior of Moroccan citizens towards the EV. The results show that the LCA method can partition the selected sample into two classes: one group more interested in the EV and another group less interested in ecological transport. Keywords: Electric Vehicles (EV) · Latent Class Analysis (LCA) · Probability · Profiles · Groups · Clustering
1 Introduction
The transport sector in Morocco is among the most energy-consuming sectors, with a share of 41%, and it causes environmental nuisances, especially GHG emissions, estimated at 23% [1]. Since the energy consumed in this sector is largely imported, the energy bill is heavy. All these problems have led us to conduct this study to review the expectations of Moroccan consumers and their opinions about the use of EVs instead of conventional vehicles. Although the EV is seen by the majority of researchers and operators as an optimal alternative, it cannot be denied that it has long been present in the
automotive world: it looks back on a century of timid existence in car markets worldwide, with prices that have risen exponentially [2]. In this sense, a survey was conducted as the subject of this study to collect the revealed preferences of Moroccans on EVs, and it was quite legitimate to ask questions about the possibility of adopting this green mode of transport in place of traditional transport, and to what extent the EV is seen as an optimal solution by the Moroccan consumer [3,4]. In order to answer these questions, a questionnaire was used, aimed at a population restricted to the Rabat-Salé agglomeration; this choice is supported by the fact that the inhabitants of this region are already confronted with the issue of electric mobility, since the tramway has been among the modes of transport used there since 2011, constituting an electric transport network extending over about 30 km. LCA is an approach that can be applied to complex datasets. It organizes manifest variables that represent an unobservable construct in several areas, such as education, demography, economy, health, politics, ecology, etc. [5–9]. This paper is divided into four sections: after this introduction, we present the proposed method, LCA; the third section is devoted to the preprocessing and description of the data; the fourth section analyzes the results; and we end with a conclusion.
2 Proposed Method: Latent Class Analysis
LCA is an unsupervised machine learning method used for detecting different qualitative subgroups of populations that often share some common characteristics [10,11]. It consists in constructing latent (non-observable) classes based on the observed (manifest) responses of individuals. The model includes a latent variable X with K categories, each category representing a class. Each class comprises a homogeneous group of individuals sharing the same interests, values, characteristics, and behaviors. Membership in a class Ck is calculated from the probability of observing a response Yj given the class Ck, and the number of classes is determined using selection criteria (AIC, BIC, etc.) [12]. The mathematical model for LCA can be expressed as follows. Let yj represent element j of a response pattern y. Let us create an indicator function I(yj = rj) that equals 1 when the response to variable j = rj, and equals 0 otherwise. Then, the probability of observing a particular vector of responses is calculated as [8,13]:

P(Y = y) = \sum_{c=1}^{C} \gamma_c \prod_{j=1}^{J} \prod_{r_j=1}^{R_j} \rho_{j, r_j \mid c}^{I(y_j = r_j)}    (1)
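To make Eq. (1) concrete, here is a minimal numerical sketch (not the authors' implementation; the class prevalences and item-response probabilities below are invented for illustration) that computes the probability of one response pattern for a two-class model with three binary items:

import numpy as np

# Hypothetical two-class model (C = 2) with J = 3 binary items (R_j = 2).
gamma = np.array([0.65, 0.35])             # class prevalences gamma_c

# rho[c, j, r] = P(item j takes response r | class c), invented values.
rho = np.array([
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],  # class 1
    [[0.2, 0.8], [0.3, 0.7], [0.4, 0.6]],  # class 2
])

def pattern_probability(y, gamma, rho):
    """P(Y = y) as in Eq. (1): sum over classes of gamma_c times the product
    of the response probabilities picked out by the indicator I(y_j = r_j)."""
    total = 0.0
    for c in range(gamma.shape[0]):
        prob_given_c = 1.0
        for j, r in enumerate(y):
            prob_given_c *= rho[c, j, r]
        total += gamma[c] * prob_given_c
    return total

print(pattern_probability((0, 0, 1), gamma, rho))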
3 Description of Data
Our target population consists of all the inhabitants of the Rabat-Salé agglomeration, from which we drew a significant, representative sample. A questionnaire
was elaborated in the form of a guided interview with the respondents via a web platform. We used the R programming language to process and analyze the responses. The respondents answered a questionnaire of 18 questions (see Table 1).

Table 1. Questions of the questionnaire
Q.1 Sex
Q.2 Age
Q.3 Residence
Q.4 Education level
Q.5 Profession
Q.6 Income category
Q.7 Personal vehicle
Q.8 Fuel type
Q.9 Vehicle type
Q.10 Fiscal power
Q.11 EV information
Q.12 Attitude on the environment
Q.13 Estimated route
Q.14 Purchase decision
Q.15 EV recharge location
Q.16 Recharge time of EV
Q.17 Charging frequency
Q.18 Purchase criteria

4 Experimental Results
4.1 Multiple Correspondence Analysis Application
We used the Multiple Correspondence Analysis (MCA) method to describe the associations between the variables/modalities and the individuals in our data [14]. The cumulative projected inertia indicates that the first five axes explain 29.84% of the information, while the first two axes alone explain 14.20% of the variations observed in our sample (see Fig. 1).
Fig. 1. Variance explained with first five dimensions
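The following rough sketch (not the authors' code) shows how this kind of exploratory step can be approximated in Python: it one-hot encodes the categorical answers and applies a truncated SVD to the centered indicator matrix, which approximates MCA without the full correspondence-analysis scaling; the column names and values are hypothetical.

import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Hypothetical survey answers: one row per respondent, categorical columns.
df = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "age": ["18-30", "31-45", "46-60", "31-45"],
    "purchase_decision": ["yes", "no", "dont_know", "yes"],
})

# Indicator (one-hot) matrix of the modalities, then centered.
indicator = pd.get_dummies(df)
centered = indicator - indicator.mean()

svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(centered)        # respondent coordinates on the axes

print(svd.explained_variance_ratio_)        # share of variance per axis
print(coords[:2])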
The results obtained using the MCA method illustrate the associations between the modalities that can build different profiles (see the biplot in Fig. 2) [14].
Fig. 2. The biplot illustrates the associations between individuals and variables/modalities.
4.2 Interpretation of MCA Results
The explained information using two dimensions is less than 15%. Moreover, we can identify two profiles: Dimension 1: The contributions concern young (18–30 years: 15%) and old (60 years and over: 7.1%) respondents. They have a moderate salary (601–1000 Euros: 12.5%) or a high salary (1001–1200 Euros: 5.5%). The people who have a car represent (7.5%) and those without (2.5%). This group consists of women (8%) and men (4%) who work in liberal professions, as government officials, and as employees. Observing the figure above, we can see that axis 1 discriminates young people (18–30 years) on the right and older people (60 years and over) on the left. This axis also contrasts young and old, vehicle owners versus non-owners, women versus men, average salary versus higher salary, and liberal professions versus employees and government officials. Dimension 2: The most important contributions are those of the liberal professions (18%), government officials (5%) and employees (3.75%). This group contains vehicles of different fiscal powers: 8–11 hp (8%), 11 hp and over (4%), and 5–7 hp (3.9%). These people have revenues varying from high
(7.4%) to very high (5.1%). They are young (30–45 years) and consider environmental aspects when deciding to purchase an EV. Axis 2 separates the fiscal powers (8 hp and more) at the top from those of 5–7 hp at the bottom. In brief, we see that the respondents who decide to purchase an EV are close to the modalities that contributed to axis 1, while those who are against the purchase of an EV are close to the modalities that contributed to axis 2. The MCA method describes the profiles of the groups formed with the qualitative variables, but the information explained by the first two axes remains insufficient to draw relevant conclusions. This problem can be solved by using an unsupervised machine learning method such as LCA.
4.3 LCA Method Application
Table 2 presents the results of the LCA method applied to the collected data [12]. Table 2. Conditional item response probabilities Questions Modalities
Class 1 Class 2 Questions Modalities 0,65
0,35
0,84
0,41
0,65
0,35 0,60
M F
0,16
0,59
601–1000 Euros: 2
0,28
0,28
Q.2
18–30: 1
0,00
0,40
1001–1200 Euros: 3
0,29
0,06
31–45: 2
0,47
0,50
46–60: 3
0,49
0,08
60 and more: 4
0,04
0,00
RABAT: 1
0,54
0,43
SALE: 2
0,35
0,35
Others: 3
0,11
0,22
YES: 1
0,96
NO: 2
0,04
Diesel: 1
Q.4 Q.5
Q.6
Q.7
Q.8 Q.9
0–600 Euros: 1
0,05
Q.1
Q.3
Q.10
Class 1 Class 2
1201 and more: 4
0,38
0,06
Q.11
YES INFO: 1
0,73
0,60
NO INFO: 2
0,27
0,40
Q.12
YES ENV: 1
0,98
0,96
NO ENV: 2
0,02
0,04
Q.13
0–15 km: 1
0,59
0,72
0,52
16–30 km: 2
0,13
0,04
0,48
31–45 km: 3
0,16
0,14
0,87
0,65
46–60 km: 4
0,09
0,06
Essence: 2
0,10
0,30
Hybrid: 3
0,03
0,00
Q.14
60 and more: 4
0,03
0,03
YES purchase 1
0,10
0,42
Others: 4
0,00
0,05
NO purchase: 2
0,36
0,20
City car: 1
0,41
0,77
I don’t know
0,27
0,29
BERLINE: 2
0,33
0,01
Useful: 3
0,09
0,09
Others: 4
0,17
0,13
5–7: 1
0,71
0,89
8–10: 2
0,27
0,08
11 and more: 3
0,02
0,03
Secondary level: 1
0,99
0,93
High level: 2
0,01
0,07
Government official: 1 0,85 Employee: 2
0,06
Others: 3
0,09
Others: 3
0,27
0,09
Q.15
Home charging: 1
0,88
0,82
Public charging station: 2 0,12
0,18
Q.16
During the night with a reduced rate: 1
0,62
During the day: 2
0,49
0,38
Q.17
One liver: 1
0,82
0,90
0,51
Twice: 2
0,18
0,10
Purchase price: 1
0,85
0,85
0,63
Maintenance price: 2
0,13
0,15
0,17
Price of consumption: 3
0,00
0,00
0,20
Respect for the environment: 4
0,02
0,00
Q.18
4.4 Analysis and Discussion
The survey aims to know the characteristics of respondents, their opinion on the EV and the possibility of its purchase by them. It subsequently seeks significant correlations and topics related to modes of travel, as well as the choices and constraints of this mode of transportation. Class 1: People Less Interested in EV. Class 1 represents 65% of the surveyed population, with a high proportion of men (84%). This class includes Moroccan citizens in the 31–45 (prob = 0.47) and 46–60 (prob = 0.49) age brackets, from two cities: Rabat (prob = 0.54) and Salé (prob = 0.35). These individuals are mostly civil servants (prob = 0.99) who have a monthly revenue that places them in the middle class (prob = 0.28) or the upper class (prob = 0.38). This group of individuals owns vehicles (prob = 0.96) of the diesel fuel type (prob = 0.87) and uses vehicles of the city car (prob = 0.41) and sedan (prob = 0.33) types. These individuals have vehicles with a fiscal power between 5 and 7 (prob = 0.71) or between 8 and 10 horsepower (prob = 0.27). These vehicles travel an estimated average distance of 15–30 km daily (prob = 0.72). Individuals in this class are informed about EVs and find that this type of technology will improve the quality of the environment with a high probability (prob = 0.98). Vehicle users in this class are against the purchase of an EV (prob = 0.36); however, some respondents decide to purchase an EV (prob = 0.10) and prefer home charging with a frequency of once a day (prob = 0.88). The latter find that the purchase price is a determining factor for the acquisition of an EV (prob = 0.85). Figure 3 illustrates the profile of these individuals according to the manifest variables.
Fig. 3. Class 1: people less interested in EV
Class 2: People More Interested in EV. Class 2 represents 35% of all respondents. Unlike class 1, this class is characterized by a female representation of 59%. These young people (prob = 0.9), residing in two cities: Rabat (prob = 0.43) and Salé (prob = 0.35), are civil servants (prob = 0.63) with a high school level of education (prob = 0.93) and an average income (prob = 0.60). In contrast to class 1, this group of individuals owns diesel vehicles (prob = 0.52) of the city car type (prob = 0.77) with a fiscal power varying between
5 and 7 (prob = 0.89) and travels an estimated distance of 15 km per day (prob = 0.72). Respondents in this group have information about EVs (prob = 0.6) and find that this type of transportation will contribute to air quality improvement (0.43). In contrast to class 1, individuals in this class decide to accept the offer to purchase an EV (prob = 0.42), considering an adequate purchase price (prob = 0.85). These EV users prefer home charging (prob = 0.82). Figure 4 illustrates the profile of these individuals based on the manifest variables.
Fig. 4. People more interested in EV
5 Conclusion
This study was devoted to assessing the inclination of Moroccan citizens towards the electrification of the conventional vehicle fleet, by adopting a strategy of partial replacement that increases over time. We found that the ground is fertile for establishing this policy, as shown by class 2, which represents 35% of the population; decision makers are therefore called upon to meet the needs of this category to ensure an encouraging start to EV purchases and subsequently put the country on a promising path of electric mobility, which will have a positive effect on the reduction of CO2 emissions. This points us towards another avenue of research: the search for optimal locations of charging stations and the optimization of electrical energy from green sources such as solar and wind.
References 1. Chachdi, A., Rahmouni, B., Aniba, G.: Socio-economic analysis of electric vehicles in Morocco. Energy Procedia 141, 644–653 (2017) 2. Liberto, C., Valenti, G., Orchi, S., Lelli, M., Nigro, M., Ferrara, M.: The impact of electric mobility scenarios in large urban areas: the Rome case study. IEEE Trans. Intell. Transp. Syst. 19(11), 3540–3549 (2018) 3. Dombrowski, U., Engel, C.: Impact of electric mobility on the after sales service in the automotive industry. Procedia CIRP 16, 152–157 (2014)
4. El Harrouti, T., Abouabdellah, A., Serrou, D.: Impact of electric mobility on the sustainable development of the country, case study in Morocco. In: 2020 IEEE 13th International Colloquium of Logistics and Supply Chain Management (LOGISTIQUA), pp. 1–6. IEEE, December 2020 5. Boulay, A.M., Bulle, C., Bayart, J.B., Deschênes, L., Margni, M.: Regional characterization of freshwater use in LCA: modeling direct impacts on human health. Environ. Sci. Technol. 45(20), 8948–8957 (2011) 6. Skrúcaný, T., Kendra, M., Stopka, O., Milojević, S., Figlus, T., Csiszár, C.: Impact of the electric mobility implementation on the greenhouse gases production in central European countries. Sustainability 11(18), 4948 (2019) 7. Garnett, B.R., Masyn, K.E., Austin, S.B., Miller, M., Williams, D.R., Viswanath, K.: The intersectionality of discrimination attributes and bullying among youth: an applied latent class analysis. J. Youth Adolesc. 43(8), 1225–1239 (2014) 8. Acharoui, Z., Alaoui, A., Azhari, M., Abarda, A., Ettaki, B., Zerouaoui, J.: Using latent class analysis to identify political behavior of Moroccan citizens on social media. Int. J. Innov. Technol. Explor. Eng. (IJITEE) 9(6) (2020) 9. Nylund-Gibson, K., Choi, A.Y.: Ten frequently asked questions about latent class analysis. Transl. Issues Psychol. Sci. 4(4), 440 (2018) 10. Hagenaars, J.A., McCutcheon, A.L. (eds.): Applied Latent Class Analysis. Cambridge University Press, Cambridge (2002) 11. Vermunt, J.K., Magidson, J.: Latent class models for classification. Comput. Stat. Data Anal. 41(3–4), 531–537 (2003) 12. Abarda, A., et al.: Solving the problem of latent class selection. In: Proceedings of the International Conference on Learning and Optimization Algorithms: Theory and Applications, pp. 1–6, May 2018 13. Abarda, A., Dakkon, M., Azhari, M., Zaaloul, A., Khabouze, M.: Latent transition analysis (LTA): a method for identifying differences in longitudinal change among unobserved groups. Procedia Comput. Sci. 170, 1116–1121 (2020) 14. Choulakian, V.: Analyse factorielle des correspondances de tableaux multiples. Revue de statistique appliquée 36(4), 33–41 (1988)
Using Learning Analytics Techniques to Calculate Learner’s Interaction Indicators from Their Activity Traces Data Lamyaa Chihab1(B) , Abderrahim El Mhouti2 , and Mohammed Massar1 1 LPMIC, FSTH, Abdelmalek Essaadi University, Tetouan, Morocco
[email protected] 2 LPMIC, FS, Abdelmalek Essaadi University, Tetouan, Morocco
Abstract. Learning Analytics is an emerging discipline in the field of learning. The objective of this work is to focus on learning analytics and their e-learning applications, and to present the main interaction indicators. Once this topic is addressed, many questions arise: How can we define learning analytics? What is an indicator and what are its types? Our work aims to determine a new classification of indicators. The results of this study are complemented by the conclusions of a focused literature review. Keywords: Learning analytics · E-learning · Interaction indicators
1 Introduction Learning Analytics is a rapidly evolving research discipline that uses the knowledge generated by data analysis to support learners and optimize both the learning process and the learning environment (Few 2013). Learning Analytics is about collecting the traces that learners leave behind and using them to improve learning, and it is driven by the availability of massive records of such data. Student-student and student-environment interactions generate many digital traces. Exploiting the traces of user interaction provides different levels of feedback for different types of users (teachers, students, student groups, designers). The objective of this work is to better understand not only whether Learning Analytics helps to understand student behavior, but also to identify the interaction indicators and their types. The remainder of the paper is organized as follows: the first part presents the research background; the second part presents LA techniques and their applications in e-learning systems; the third part presents LA techniques to obtain interaction indicators in e-learning environments; the fourth part presents a synthesis and discussion; and finally, the fifth part presents a conclusion and future work.
2 Research Background Learning Analytics is an emerging field used to improve learning and education; these tools aim to collect learners' interaction traces, analyze them,
and propose a display of the analysis results to different users (Dabbebi 2019). An indicator is the implementation and result of an analysis process. It is defined as a pedagogically significant observable, calculated or based on observable data, and testifying to the quality of interaction, activity and learning (Siemens 2011). Despite the strong interest in indicators and their relationship to learning analytics, there is a lack of research on this topic. Therefore, in this work, we collect a set of the main indicators, and a classification of the indicators related to collaboration is proposed.
3 LA Techniques and Their Application in E-Learning Systems Georges Siemens proposed in 2011, at the first international conference dedicated to the subject, the following definition: Learning Analytics is the evaluation, analysis, collection and communication of data relating to learners and their learning context, with a view to understanding and optimizing learning and its environment (Detroz 2018). In other words, Learning Analytics collects the digital traces left by learners whenever a student interacts, connects or exploits the resources made available to him. These traces give clues about students' learning strategies; according to the logic of Learning Analytics, they are automatically stored on a server before being processed by algorithms capable of analyzing them and establishing student profiles (Nouri et al. 2019). Each student's online activities can be recorded and used for learning analytics, including access to online management systems, access to course resources, searching for information in the online library, taking online exams, writing homework or conducting exchanges with colleagues, provided ethics and confidentiality are respected. There are no restrictions on the content that can be collected or used for learning analysis and research (Clow 2012). The learning analytics process can be conceptualized as a sequence of interdependent steps or stages (Saqr 2018) (Fig. 1).
Fig. 1. The learning analytics cycle (Clow 2012): learners, data, metrics, intervention
In Clow (2012), Learning Analytics is portrayed as a cycle that starts with learners taking part in formal or informal online learning activities. Through their activities, students generate a large amount of data that is recorded on the online learning platforms. Raw data is then transformed into metrics about learning processes that can inform appropriate interventions. For example, Mohammed Saqr (2018) used Learning Analytics to guide educators and their learners to identify gaps in collaborative learning and to identify opportunities for early support and intervention in order to understand learner behavior (Saqr 2018).
4 LA Techniques to Get Interaction Indicators in E-Learning Environments 4.1 Learner Interaction Indicators It is difficult to find a common definition for the word "indicator", although the term is used very often in all kinds of reports, journal articles, etc. in different fields. It was A. Dimitracopoulou et al. (2004) who, in her various research works on interaction in collaborative platforms, tried to clarify the definition and design of the indicators she developed and built. An indicator according to Dimitracopoulou and Bruillard (2006) is a variable in the mathematical sense with several characteristics. It is a variable that takes values represented by a numerical, alphanumeric or even graphical form, etc. The value has a status: it can be raw (without a defined unit), calibrated or interpreted. The status identifies a specific feature of the type of support provided to users. Each indicator is independent of or dependent on other variables such as time, or even other indicators. A learning indicator according to Dimitracopoulou (2005) is defined by the following attributes: – The concept: this is the element we want to characterize (for example, division, work sharing, the intensity of collaboration, participation rate, etc.). – Dependencies: the indicator can be dependent on or independent of external variables (such as time or content). – The value: the indicator takes values that can be numeric, literal or graphic. It may be a basic gross value, a calculated value or an interpreted value. – The objective: the indicator may be used to diagnose, help, monitor, understand, evaluate, and warn. – Participants in a technology-enabled learning environment: an indicator refers to participants in a technology-enabled learning environment (LE). These participants can be a student, a group of students, a virtual community, or a teacher. – Interaction Analysis Indicator (IIA) target users: an indicator is addressed to users of interaction analysis indicators (IIA). It may be addressed to individual students, a group of students, a teacher, or a researcher. It should be noted that although the indicator concept is the same, the form of the values or the status can differ depending on the intended user.
– An indicator usually has a specific time of appropriate use: some indicators take their values on the fly (during the interaction), while others take their values at the end of the interaction process and can therefore only be exploited afterwards by the intended users. – Validity: indicator values can be within a validity field and be valid for a given time. – Environment: the data used for the calculation may come from activities in collaborative or individual environments. The indicators developed in learning research are varied and numerous. We will cite only a few examples, classified by the axes identified by Mazza, to illustrate their potential: – Behavioral indicators: the Echo360 Lectopia application allows you to post video, audio, etc. The system automatically records some student access data to course units. From these recordings, Rob Phillips et al. identify key indicators that define a learner's profile: good, repentant, etc. These profiles are determined by an algorithm based on the time between course posting and student access to the courses. – Cognitive indicators: in the project "knowledge mapping", the company Educlever, with research laboratories in Grenoble, mapped the skills of cycle 3 students in French and mathematics (Sonia Mandin, Nathalie Guin, 2015). The idea is to display statistics of skill acquisition calculated from the puzzles. The objective is to allow students to see their progress and to give teachers a tool to easily identify successes and failures. – Social indicators on student collaboration: A. Dimitracopoulou et al. (2004) present a rich variety of indicators measuring student interactions. They were developed as part of the ICALTS consortium in different learning environments. These indicators were chosen because of their representativeness in the context of the learning activities that we consider as requiring the development of learning indicators. They are also chosen for the quality of their documentation, which allows for re-description using the trace-transformation approach. Only a few examples will be given (Table 1):
Table 1. Main indicators (A. Dimitracopoulou 2004) Indicator
Concept
Conversation and action balance
This indicator reflects the balance between the production of problem-solving and dialogue-related actions
Division of labor
This indicator, intended specifically for researchers, helps to identify the role played by each participant in the collaborative learning process (continued)
Table 1. (continued)
Indicator
Concept
Actor’s degree centrality [Social network analysis]
The degree centrality of an actor in a social network represents the number of links that this actor maintains with other actors
Network degree centralization (CD) [SNA] The centralization of a social network measures the degree to which the activity of the network depends on the activity of a specific member or a very small group of members
The density of a social network measures the degree of activity of this network in relevancy the link that’s measured
Activity level
This indicator reflects the level of activity of groups using an online educational project manager. It shows the contributions of different role groups (students, teachers) to the production of files and messages.
Collaboration level in the group
The indicator refers to contributions made by users of the learning environment when participating in a reasoned discussion
Collaborative activity function (CAF)
This indicator is used to represent both the number of shares and the number of active agents during the resolution of a given problem
Work amount
it measures the total work done by the group to generate the solution. This measure is inferred from the number of participants' contributions, the size of these contributions, and their elaboration
Argumentation
it measures the degree of discussion within the group. It can be calculated from the depth of the analysis tree (DepthTree), interactivity, initiative, and work
Coordination
it shows the degree of intercommunication among the group members. It is calculated from argumentation, coordination messages, and initiative
Cooperation
it examines how the process of argumentation is developed through the calculation of the average attitude of the individuals
Collaboration
it gives an estimate of the collaboration in the group during the trial. It is inferred from argumentation, coordination and cooperation (continued)
Table 1. (continued) Indicator
Concept
Initiative
it quantifies the degree of participation and involvement in a work, and the responsibility given by each type of contribution
Creativity
it quantifies the degree of complexity, originality, and richness of the ideas involved in developing the text of each contribution
Elaboration
this attribute is linked to the previous one and quantifies the amount of work required to elaborate the text of a contribution
Conformity
it quantifies the degree of agreement expressed by the contribution with respect to the contribution it is related to
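As a hedged illustration of how the SNA-based indicators in Table 1 can be computed from interaction traces (this is not taken from the cited works; the interaction graph and the library choice, networkx, are assumptions):

import networkx as nx

# Hypothetical interaction graph: an edge means two learners interacted.
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"),
    ("bob", "carol"), ("carol", "dave"),
])

# Actor's degree centrality: links an actor maintains, normalized by (n - 1).
print(nx.degree_centrality(G))

# Network density: degree of activity of the whole network for this relation.
print(nx.density(G))

# Network degree centralization: how much activity depends on a single actor.
n = G.number_of_nodes()
degrees = [d for _, d in G.degree()]
centralization = sum(max(degrees) - d for d in degrees) / ((n - 1) * (n - 2))
print(centralization)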
4.2 Classification of Interaction Indicators Researchers participating in ICALTS JEIRP propose a classification according to 8 categories: – High level indicators, related to collaboration quality, modes, state, structure – Elaborated indicators, related to collaboration quality [in text production-oriented systems, based on argumentation] – Elaborated indicators, related to argumentation quality [in text production-oriented systems based on argumentation] – Low level indicators, related to argumentation quality [in text production-oriented systems based on argumentation] – Indicators related to awareness, in action-based systems – Indicators that concern Participation assessment – Content related/dependent indicators – Cognitive indicators related to strategies (processed manually)
5 Synthesis and Discussion When a student connects and exploits the resources at his disposal, he leaves behind traces giving clues about his learning strategies; according to the logic of Learning Analytics, these traces are automatically stored on a server before being processed by algorithms capable of analyzing them and establishing his individual profile. The indicators are elaborated from these traces; they are observable and testify to the quality of interaction, activity and learning. These indicators can describe specific and timely information, such as the number of clicks, the number of accesses to a resource, and the time spent. Many studies have been published on indicators, proposing classifications according to a single basis. In this sense, we present in this part a classification according
to two categories: the first one based on actions and the second one based on a production-oriented system. As mentioned above, the indicators developed in the framework of research on learning are numerous and varied. In the following figure, we include only a few examples classified according to the two proposed categories (Fig. 2):
Fig. 2. Classification of the indicators by action and production
The first category is based on actions; the indicators chosen in this class are calculated from actions taken by learners (number of accesses to an online resource, number of attempts for each response, time spent on each activity, success rate during an activity). These are traces that help to better understand learner behaviour in an online learning environment. The second category represents all indicators oriented towards production.
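A minimal sketch of how such action-based indicators could be derived from an activity-trace log, assuming a hypothetical log layout (the column names and values are not from the paper):

import pandas as pd

# Hypothetical activity-trace log exported from a learning environment.
log = pd.DataFrame({
    "student": ["s1", "s1", "s2", "s2", "s2"],
    "resource": ["course_1", "quiz_1", "course_1", "course_1", "quiz_1"],
    "duration_min": [12, 8, 5, 7, 15],
    "success": [None, 1, None, None, 0],     # graded activities only
})

# Action-based indicators: number of accesses, time spent, success rate.
indicators = log.groupby("student").agg(
    accesses=("resource", "count"),
    time_spent_min=("duration_min", "sum"),
    success_rate=("success", "mean"),        # NaN rows are ignored by mean
)
print(indicators)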
6 Conclusion and Future Work Our intention in this work was to propose a new classification of interaction indicators according to two categories, the first one related to learners' actions and the second one production-oriented. Thus, we briefly presented a set of definitions of Learning Analytics and its application in e-learning, and we determined the main interaction indicators. During this work we faced a major difficulty: a lack of references and documentation, despite the frequent use of the term "indicator" in different types of reports, articles, etc. In our future work, we would first like to continue the literature review to complete the framework of interaction indicators by analyzing the most recent empirical studies on Learning Analytics. We would also like to address the importance of collaborative learning while highlighting the exploitation of interaction traces.
References Cherigny, F., et al.: L’analytique des apprentissages avec le numérique Groupes thématiques de la Direction du numérique pour l’Éducation (DNE-TN2). Diss. Direction du numérique pour l’éducation (2020) Clow, D.: The learning analytics cycle: closing the loop effectively. In: Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, pp. 134–138 (2012) Dabbebi, I.: Conception et génération dynamique de tableaux de bord d’apprentissage contextuels. Université du Maine, Diss (2019) Detroz, P., et al.: Implémentation des Learning Analytics dans l’enseignement supérieur. Ministère de l’Enseignement Supérieur de la Fédération Wallonie-Bruxelles (2018) Djouad, T.: Ingénierie des indicateurs d’activités à partir de traces modélisées pour un Environnement Informatique d’Apprentissage Humain. Diss. Université Claude Bernard-Lyon I, Université Mentouri-Constantine (2011) Dimitrakopoulou, A., et al.: State of the art of interaction analysis for metacognitive support & diagnosis (2006) Dimitracopoulou, A., Designing collaborative learning systems: current trends & future research agenda." Computer supported collaborative learning,: The next 10 years! Routledge 2017, 115–124 (2005) Luengo, V.: Projet anr hubble human observatory based on analysis of e-LEarning traces (2014) Lund, K., Mille, A.: Traces, traces d’interactions, traces d’apprentissages : définitions, modèles informatiques, structurations, traitements et usages.“ Analyse de traces et personnalisation des environnements informatiques pour l’apprentissage humain. Hermès, 21–66 (2009) Mandin, S., Guin, N., Lefevre, M.: Modèle de personnalisation de l’apprentissage pour un EIAH fondé sur un référentiel de compétences. 7ème Conférence sur les Environnements Informatiques pour l’Apprentissage Humain-EIAH’2015 (2015) Mattern, S.: Mission control: a history of the urban dashboard. Places J. (2015) Mazza, R.: Monitoring students in course management systems: from tracking data to insightful visualizations. Stud. Commun. Sci. 6(2) (2006) Nouri, J., et al.: Efforts in Europe for data-driven improvement of education–a review of learning analytics research in six countries (2019) Phillips, R., et al.: Exploring learning analytics as indicators of study behaviour. EdMedia+ Innovate Learning. Association for the Advancement of Computing in Education (AACE) (2012) Saqr, M.: Using Learning Analytics to Understand and Support Collaborative Learning. Stockholm University, Diss. Department of Computer and Systems Sciences (2018) Fesakis, G., Petrou, A., Dimitracopoulou, A.: Collaboration activity function: an interaction analysis tool for supported collaborative learning activities. In: IEEE International Conference on Advanced Learning Technologies, 2004. Proceedings. IEEE (2004) Siemens, G., et al.: Open Learning Analytics: an integrated & modularized platform. Diss. Open University Press (2011)
Web-Based Dyscalculia Screening with Unsupervised Clustering: Moroccan Fourth Grade Students Mohamed Ikermane1(B)
and A. El Mouatasim2
1 Laboratory LABSI, University Ibn Zohr Agadir, Agadir, Morocco
[email protected] 2 Department of Mathematics, Informatics and Management, Poly-disciplinary Faculty, Ibn
Zohr University, Code Postal 638, Ouarzazate, Morocco
Abstract. The interest in learning disability (LD) is relatively new in Morocco, not only in research but also in student detection techniques. The purpose of this study is to provide a novel dyscalculia screening method. It comprised 56 fourth-grade students (aged 8 to 11.5 years, SD = 0.74, 48% females) from two Moroccan primary public schools in Guelmim, Rabouate Assahrij, and Al Massira. The identification is based on a web-based tool that uses Raven's Colored Progressive Matrices (CPM) test of nonverbal intelligence and the Trends in International Mathematics and Science Study (TIMSS) assessment. As a consequence, it was shown that 46 of the pupils evaluated had a significant probability of having dyscalculia and need further diagnosis. They are characterized by weak nonverbal intelligence as well as a low TIMSS score (averages of 25.87 and 6.63, respectively), while having a long answering duration when compared to the group of normal students (6 min and 12 s longer than the average time for regular students). Furthermore, we discovered that, in contrast to normal students, the higher the nonverbal intelligence score of suspected dyscalculia students, the longer it takes them to answer the mathematical achievement test. Keywords: Learning disability · Morocco · Dyscalculia · Non-verbal intelligence · Hierarchical clustering · Primary school · Raven's Colored Progressive Matrices · CPM · International Mathematics and Science Study · TIMSS
1 Introduction Dyscalculia is a specific learning disorder characterized by difficulties in learning fundamental mathematical abilities that cannot be explained by low intellect or poor training [1]. Often known as "dyslexia or blindness of numbers", dyscalculia is difficult to identify, even though it affects 6 to 7% of the population [2]. Its manifestations and practical consequences in daily life differ, and it can last throughout one's life [3]. Nonetheless, it is necessary to differentiate between two forms of dyscalculia:
• Acalculia: when the individual has a problem with numbers as a result of a traumatic occurrence, such as an injury or a stroke; • Developmental dyscalculia: relates to a childhood cognitive disorder that impairs the normal learning of arithmetical abilities. Dyscalculia is sometimes used as a catch-all word for all facets of arithmetic difficulties [4]. According to A. El Madani [5], 3.6% to 7.7% of Moroccan pupils do not comprehend the notion of numbers and do not possess any knowledge of arithmetic logic. Morocco was ranked 75th out of 79 nations in the worldwide study "Programme for International Student Assessment" (PISA) [6], which assesses pupils' ability in reading, maths, and science, and was positioned fifth among the lowest performers in mathematics among PISA-participating nations. Because developmental dyscalculia can significantly impact everyday living, this disorder needs to be diagnosed early in childhood. Dyscalculia screening tools for pupils therefore seem necessary, but screening is still a laborious and time-consuming procedure; however, when machine learning techniques and appropriate dyscalculia assessments are combined, the diagnosis can be a quick and highly accurate process [7]. The current work aims to give a web-based assessment for developmental dyscalculia detection with unsupervised clustering that is tailored to the specificities of the Moroccan educational system.
2 Related Works Many studies have been conducted to identify dyscalculic Arabic-speaking students. R. Abdou et al. [8] found that using a translated and modified version (adapted to the specificity of the Egyptian educational system) of the Test of Mathematical Abilities, Third Edition (TOMA-3), on a group of 170 Egyptian school-age children, both normal and dyscalculic, is reliable and highly valid for diagnosing dyscalculia. Another research work [9] involving 315 students in Kuwaiti primary schools concluded that employing an assessment of the mathematical abilities expected of primary pupils is an effective screening tool for dyscalculia, as well as for assessing the underlying ability deficiencies associated with dyscalculia. From 2012 to 2018, a survey [10] was performed among 14,605 Moroccan primary school students across 63 public and 16 private schools in Casablanca, Nouaceur, Larrach, and Temara, Morocco. Speech therapists, neuropsychologists, psychologists, psychomotor therapists, and child psychiatrists collaborated in a multidisciplinary team. As a consequence, they discovered that 3% of the sample had a specific arithmetic skill disorder (dyscalculia).
3 Non-verbal Intelligence and Mathematical Performances Non-verbal intelligence refers to the capacity to comprehend and analyze visual information and to solve issues using visual reasoning, as well as to identify correlations, similarities, and differences between forms, patterns and abstract concepts [11]. It is a strong predictor of mathematical achievement, with correlation values ranging from 0.40 to 0.63 between
non-verbal intelligence and academic performance, and it has a significant impact on pupils' mathematical success throughout the early years of school [12, 13]. It is highly linked to the understanding of quantitative concepts and to arithmetic abilities [14].
4 Materials and Methods 4.1 Participants The current study included 56 fourth-grade pupils (aged 8 to 11.5 years, SD = 0.74, 48.21% females) from two Moroccan primary public schools in Guelmim, Rabouate Assahrij and Al Massira. The experiment was carried out in the primary school's multimedia room, which included ten PCs with the dyscalculia web application hosted on a XAMPP server. 4.2 Measures Non-verbal Intelligence It was assessed using Raven's Colored Progressive Matrices (CPM), which is one of the most often utilized nonverbal ability assessments in numerous research studies [15]. This cross-cultural assessment is intended for children aged 5 to 11.6 and for older people. It is made of three sets, A, AB, and B, each containing 12 multiple-choice picture-completion questions. Every correct answer was given one point. The total score is then converted to an IQ percentile. Mathematical Achievement We utilized the Trends in International Mathematics and Science Study (TIMSS) to assess mathematical achievement. It is a worldwide evaluation of student achievement in mathematics and science in the fourth and eighth grades. The following are the fourth-grade mathematics assessment domains: • Numbers: includes dealing with numbers, expressions, simple equations, as well as fractions and decimals. • Measurement and Geometry: pupils should be able to measure length, solve problems involving length, mass, capacity, and time, and calculate the areas and perimeters of basic polygons. • Data: students should be able to read and recognize various types of data presentations, as well as apply data to solve issues [16]. We tailored the TIMSS test to the official Moroccan primary school curriculum [17], with the assistance of primary school teachers, to suit the learning achieved in the Moroccan fourth grade. The test comprises 18 questions in total, six in each assessment domain (Numbers, Measurement and Geometry, and Data).
4.3 Screening Model We deployed a web-based application with three phases to screen for developmental dyscalculia: personal data, the Raven's CPM test, and the TIMSS mathematics assessment (Fig. 1).
Fig. 1. The structure of dyscalculia screening web-based application
This web application was developed using HTML5/CSS3 and the PHP programming language, while the database was created with MySQL (Fig. 2).
Fig. 2. Home page of dyscalculia screening website, where the subject must fill in his personal information before starting the test.
5 Results and Discussion The initial phase in our unsupervised hierarchical clustering approach is to determine the right number of clusters to employ, and for this we used two methods:
5.1 Dendrogram In hierarchical clustering, a dendrogram is a technique used to determine the right number of clusters; it is a sort of tree diagram that illustrates the hierarchical clustering relationships between data points. The dendrogram begins by displaying a set of 56 lines indicating the original data (students), with the remaining nodes showing the clusters to which the data belong and vertical lines reflecting the Euclidean distance between merged clusters. To find the number of clusters, we look for the longest vertical distance that does not cross any horizontal line. From our dendrogram, we can see that 2 is a good number of clusters, due to the distance between the two clusters (Fig. 3).
Fig. 3. Dendrogram representing the hierarchical clustering of the dataset points
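A minimal sketch of this step (illustrative only, not the authors' code; the feature matrix below is randomly generated in place of the real 56 students' CPM score, TIMSS score and answering time):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Placeholder for the real (56, 3) matrix of CPM score, TIMSS score, time.
rng = np.random.default_rng(0)
X = rng.normal(size=(56, 3))

# Ward linkage on Euclidean distances, as used in the study.
Z = linkage(X, method="ward")

dendrogram(Z)
plt.xlabel("students")
plt.ylabel("Euclidean distance between merged clusters")
plt.show()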
5.2 Silhouette Score Silhouette analysis can also help decide the best number of clusters. It is used to analyze the separation distance between the resulting clusters. This measure has a range of [-1, +1]; a value near +1 indicates that the sample is located far from the neighboring clusters. Using the 'Ward' method for our hierarchical clustering algorithm, 2 seems to be the best number of clusters to use for our dataset (Table 1).

Table 1. The silhouette coefficient for each number of clusters using hierarchical clustering
Number of clusters   Silhouette coefficient
2                    0.6717764328877011
3                    0.4568352130316559
4                    0.4762747112913596
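The selection step itself can be sketched as follows (again with a random placeholder matrix instead of the real data); the Ward-linkage agglomerative clustering and the silhouette score come from scikit-learn:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Placeholder features: CPM score, TIMSS score, answering time per student.
rng = np.random.default_rng(0)
X = rng.normal(size=(56, 3))

for k in (2, 3, 4):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")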
5.3 Dataset Clustering The results we obtained from this experiment, in which we employed the hierarchical algorithm with the Ward technique, are presented in three dimensions: the CPM score indicates the total number of correct answers in the non-verbal intelligence test, the TIMSS score represents the result of the TIMSS assessment, and the TIMSS answering time is the time needed to finish the test (Fig. 4).
Fig. 4. Dataset points clustering using a hierarchical clustering algorithm.
After analyzing our clustered dataset, we found that the first cluster, which we named "High Risk of Dyscalculia", has a significant probability of having dyscalculia and needs further diagnosis. This group includes 46 students and is characterized by low non-verbal intelligence (min = 2, max = 33, SD = 4.965) as well as low performance in the mathematics achievement test, with a score between 2 and 11, but a longer answering time of 16 min 67 s on average. The second cluster, named "Normal Students", contains 10 subjects and has features that are far superior to the first cluster, with high scores in both the CPM test (minimum of 27) and the TIMSS test (between 10 and 16) and a
Fig. 5. Dataset features statistics for normal and dyscalculia high-risk students
shorter average response time of 10 min 55 s (6 min 12 s less than the dyscalculic group) (Fig. 5). We observed that, among the dyscalculia high-risk subjects, the higher their non-verbal intelligence score, the longer they take to answer the mathematical achievement test, and that there is a clear correlation between non-verbal intelligence and mathematics performance. For the normal students, a strong positive correlation was also found between non-verbal intelligence and the mathematical achievement score, and the better they score on the CPM test, the less time they need to finish the TIMSS test (Fig. 6).
Fig. 6. The correlation heatmap of the dyscalculia highly risked and of the normal students
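A hedged sketch of this per-group profiling step (not the authors' code; the column names and the few rows of data are invented), using pandas to produce the kind of per-cluster statistics summarized in Fig. 5 and the correlation matrices visualized in Fig. 6:

import pandas as pd

# Hypothetical results table: one row per student, with the cluster label
# assigned by the hierarchical clustering step.
df = pd.DataFrame({
    "cpm_score":       [25, 30, 12, 33, 28, 9],
    "timss_score":     [6, 12, 4, 14, 11, 3],
    "answer_time_min": [17.0, 10.5, 18.2, 9.8, 11.0, 16.4],
    "cluster":         ["high_risk", "normal", "high_risk",
                        "normal", "normal", "high_risk"],
})

# Per-cluster descriptive statistics (as in Fig. 5).
print(df.groupby("cluster").agg(["mean", "min", "max", "std"]))

# Per-cluster correlation matrices (as in the Fig. 6 heatmaps).
for name, group in df.groupby("cluster"):
    print(name)
    print(group[["cpm_score", "timss_score", "answer_time_min"]].corr())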
6 Conclusion The current study examined dyscalculia identification utilizing a web-based tool that assessed nonverbal intelligence and math skills. Our web-based screening model will help to correctly pre-diagnose whether students are at risk of developmental dyscalculia or not. With the help of machine learning techniques and the CPM and TIMSS assessments, this pre-diagnosis can be a speedy and reliable process.
References 1. Kaufmann, L., von Aster, M.: The diagnosis and management of dyscalculia. Dtsch. Arztebl. Int. 109(45), 767–778 (2012) 2. Berch, D., Mazzocco, M.: Why Is Math So Hard for Some Children? The Nature and Origins of Mathematical Learning Difficulties and Disabilities. Paul H. Brookes Publishing Co, Baltimore, MD (2007) 3. Osmon, D.C., Smerz, J.M., Braun, M.M., Plambeck, E.: Processing abilities associated with math skills in adult learning disability. J. Clin. Exp. Neuropsychol. 28, 84–95 (2006) 4. Ardila, A., Rosselli, M.: Acalculia and dyscalculia. Neuropsychol. Rev. 12, 179–231 (2002). https://doi.org/10.1023/A:1020329929493 5. Madani, A.: (title in Arabic). AZ-editions, Rabat. ISBN 9789920391054 (2020)
6. OECD: PISA 2018 Results (Volume II): Where All Students Can Succeed, PISA, OECD Publishing, Paris (2019). https://doi.org/10.1787/b5fd1b8f-en 7. Giri, N., et al.: Detection of dyscalculia using machine learning. In: 2020 5th International Conference on Communication and Electronics Systems (ICCES), pp. 1–6 (2020) 8. Abdou, R.M., Hamouda, N.H., Fawzy, A.M.: Validity and reliability of the Arabic dyscalculia test in diagnosing Egyptian dyscalculic school-age children. Egypt. J. Otolaryngol. 36(1), 1–5 (2020). https://doi.org/10.1186/s43163-020-00020-6 9. Everatt, J., Mahfoudhi, A., Al-Manabri, M., Elbeheri, G.: Dyscalculia in Arabic speaking children: assessment and intervention practices. In: The Routledge International Handbook of Dyscalculia and Mathematical Learning Difficulties (1st ed.), pp. 183–192. Routledge. https://doi.org/10.4324/9781315740713 (2015) 10. Leqouider, Z., Zakaria Abidli, L.K., El Turk, J., Touri, B., Khyati, A.: Prevalence of learning disability in the Moroccan context: epidemiological study. J. Southwest Jiaotong Univ. 56(6), 1020–1035 (2021). https://doi.org/10.35741/issn.0258-2724.56.6.89 11. Kuschner, E.S.: Nonverbal intelligence. In: Volkmar, F.R. (ed.) Encyclopedia of Autism Spectrum Disorders. Springer, New York, NY (2013). https://doi.org/10.1007/978-1-4419-16983_354 12. Tikhomirova, T., Voronina, I., Marakshina, J., Nikulchev, E., Ayrapetyan, I., Malykh, T.: The relationship between non-verbal intelligence and mathematical achievement in high school students. In: SHS Web of Conferences, vol. 29, p. 02039 (2016). https://doi.org/10.1051/shsconf/20162902039 13. Tikhomirova, T., Misozhnikova, E., et al.: A cross-lag analysis of longitudinal associations between non-verbal intelligence and math achievement. In: ITM Web of Conferences, vol. 10, p. 02007 (2017). https://doi.org/10.1051/itmconf/20171002007 14. Pina, V., Fuentes, L.J., Castillo, A., Diamantopoulou, S.: Disentangling the effects of working memory, language, parental education, and non-verbal intelligence on children's mathematical abilities. Front. Psychol. (2014). https://doi.org/10.3389/fpsyg.2014.00415 15. Smirni, D.: The raven's coloured progressive matrices in healthy children: a qualitative approach. Brain Sci. 10, 877 (2020). https://doi.org/10.3390/brainsci10110877 16. Mullis, I.V.S., Martin, M.O. (eds.): TIMSS 2019 Assessment Frameworks. Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://TIMSSandpirls.bc.edu/TIMSS2019/frameworks/ (2017) 17. [Title in Arabic] (2021). https://men-gov.ma/wpcontent/uploads/2021/07/Curriculum-_Primaire_2021-Final-28-juillet_men-gov.ma_.pdf. Accessed 10 Mar 2022
Correction to: A New Telecom Churn Prediction Model Based on Multi-layer Stacking Architecture Jalal Rabbah
, Mohammed Ridouani
, and Larbi Hassouni
Correction to: Chapter “A New Telecom Churn Prediction Model Based on Multi-layer Stacking Architecture” in: M. Ben Ahmed et al. (Eds.): Emerging Trends in Intelligent Systems & Network Security, LNDECT 147, https://doi.org/10.1007/978-3-031-15191-0_4
In the original version of the book, the following correction has been incorporated: In Chapter 4, the second author’s name has been changed from “Mohammed Ridouan” to “Mohammed Ridouani”. The chapter has been updated with the change.
The updated original version of this chapter can be found at https://doi.org/10.1007/978-3-031-15191-0_4
Author Index
A Abarkan, Mustapha, 73 Abdelhakim, Boudhir Anouar, 176, 372 Abdelhamid, Bouzidi, 45 Abdellaoui Alaoui, El Arbi, 21 Abdoun, Otman, 200 Abouabdellah, Abdellah, 496 Abouelaziz, Ilyass, 270 Abouelseoud, Yasmine, 221 Adi, Kusworo, 146 Adib, Abdellah, 136, 168, 281 Aggoune-Mtalaa, Wassila, 1, 292 Agoujil, Said, 21, 306 Akil, Siham, 136 Al Marouni, Yasmina, 475 Alami, Taha El, 45 alamir, Taha el, 123 Alaoui, El Arbi Abdellaoui, 232, 306 Alhayani, Mohammed, 97 Ali, Hanae Aoulad, 45 Al-Khiza’ay, Muhmmad, 97 Asri, Hiba, 53 Assia, Soufiani, 372 Aya, Abbadi, 372 Ayad, Habib, 168 Azhari, Mourad, 496 B Badir, Hassan, 323 Baina, Amine, 444 Bajit, Abderahim, 496 Banodha, Umesh, 332
Barik, Rhada, 444 Bellafkih, Mostafa, 444 Ben Meziane, Khaddouj, 486 Benabdelouahab, Soukaina, 315 Benaya, Nabil, 486 Benlamine, Kaoutar, 270 Bensalah, Nouhaila, 168 Bentaleb, Youssef, 475 Benyoucef, Lyes, 464 Bouhdidi, Jaber El, 315 Bouhorma, Mohammed, 253, 419, 432 Boukraa, Lamiae, 407 Boumhidi, Ismail, 486 Bouroumane, Farida, 73 Bouziri, Hend, 1, 292 C Cendani, Linggar Maretva, 106 Chakhtouna, Adil, 281 Cherradi, Mohamed, 211 Chihab, Lamyaa, 504 Chkouri, Mohamed Yassin, 190 Chrayah, Mohamed, 123 D de Gea, Juan M. Carrillo, 315 Dib, Faiza, 486 Diop, Abdou Khadre, 63 Diouf, Mamadou Diallo, 352 Dourhmi, Mouad, 270
522 E Eddoujaji, Mohamed, 253 El Aachak, Lotfi, 419 El Allali, Zakaria, 242 El Faddouli, Nour- eddine, 363 El Fissaoui, Mohamed, 242 El Haddadi, Anass, 53, 211 El Harrouti, Taoufiq, 496 El Hazzat, Soulaiman, 399 El Kryech, Nada, 419 El Makkaoui, Khalid, 242, 407 El Mhouti, Abderrahim, 504 El Mouatasim, A., 512 El Ouadghiri, Moulay Driss, 464 Elaachak, Lotfi, 432 Elouaai, Fatiha, 419 Endah, Sukmawati Nur, 106 Er-raha, Brahim, 53 Errami, Soukaina Ait, 323 Esbai, Redouane, 407 F Faieq, Soufiane, 343 Farouk, Abdelhamid Ibn El, 168 Farssi, Sidi Mohamed, 63 Fennan, Abdelhadi, 454 Fernández-Alemán, Jose L., 190 G Ghadi, Abderrahim, 454 Gueye, Amadou Dahirou, 63 H Hajji, Hicham, 323 Hamamou, Abdelaziz, 496 Hamdaoui, Ikram, 242 Hassan, Faham, 176 Hassouni, Larbi, 35 Hiri, Mustafa, 123 I Ikermane, Mohamed, 512 ˙Ilhan, Kamil, 84 J Jouane, Youssef, 270 K Kadi, Kenza Ait El, 323 Kautsarina,, 386 Korany, Noha, 221 Kusumaningrum, Retno, 106, 146
Author Index L Laaz, Naziha, 53 Lhafra, Fatima Zohra, 200 M Mahrach, Safaa, 407 Masrour, Tawfik, 270 Massar, Mohammed, 504 Menemencio˘glu, O˘guzhan, 9 Merras, Mostafa, 399 Moh, Ahmed Nait Sidi, 464 Mohamed, Ben Ahmed, 176, 372 Mohamed, Chrayah, 45 Moumouh, Chaimae, 190 Muzakir, Ari, 146 N Nasri, Sonia, 1, 292 Nassiri, Khalid, 232 Ndong, Massa, 352 O Ouakasse, Fathia, 157 Oukhouya, Lamya, 53 Oumoussa, Idris, 343 Ourdani, Nabil, 45, 123 P Priyadi, Oki, 386 R Rabbah, Jalal, 35 Rakrak, Said, 157 Ramadhan, Insan, 386 Ridouani, Mohammed, 35 S Sadek, Nayera, 221 Sahlaoui, Hayat, 21 Saidi, Rajaa, 343 Saleh, Abbadullah .H, 9 Sallah, Amine, 306 Salma, Achahbar, 372 Samadi, Hassan, 253 Samih, Amina, 454 Saxena, Kanak, 332 Sebbaq, Hanane, 363 Sekkate, Sara, 136, 281 Sensuse, Dana Indra, 386 Shehab, Manal, 221 Sohaib, Samadi, 176 Soumaya, El belhadji, 176
Author Index Soussi Niaimi, Badr-Eddine, 432 Suryono, Ryan Randy, 386
T Tall, Khaly, 63 Tamym, Lahcen, 464 Tekouabou, Stephane Cedric Koumetio, 232
523 U Ünver, Muharrem, 84 Y Younoussi, Yacine El, 315 Z Zghal, Mourad, 270 Zili, Hassan, 432