Intelligent Information and Database Systems: 14th Asian Conference, ACIIDS 2022, Ho Chi Minh City, Vietnam, November 28–30, 2022, Proceedings, Part II (Lecture Notes in Artificial Intelligence) 303121966X, 9783031219665

This book constitutes the refereed proceedings of the 14th Asian Conference on Intelligent Information and Database Systems, ACIIDS 2022, held in Ho Chi Minh City, Vietnam, during November 28–30, 2022.


English Pages 772 [766] Year 2022


Table of contents:
Preface
Organization
Contents – Part II
Contents – Part I
Machine Learning and Data Mining
Machine Learning or Lexicon Based Sentiment Analysis Techniques on Social Media Posts
1 Introduction
2 Literature Review
3 Methodology
3.1 Comparison Between Sentiment Analysis Methods
4 Results and Discussion
5 Conclusion
References
A Comparative Study of Classification and Clustering Methods from Text of Books
1 Introduction
2 Related Works
3 Project Gutenberg
4 Natural Language Processing
4.1 Word Weighting Measures
5 Machine Learning Methods
5.1 Algorithms for Classification
5.2 Algorithm for Clustering
5.3 Measures of the Quality
6 Proposed Approach
7 Experiments
7.1 Experimental Design and Data Set
7.2 Results of Experiments
8 Conclusions
References
A Lightweight and Efficient GA-Based Model-Agnostic Feature Selection Scheme for Time Series Forecasting
1 Introduction
2 Related Works
2.1 Feature Selection Methods
2.2 GA-Based Feature Selection
3 GA-Based Model-Agnostic Feature Selection
3.1 Problem Formulation
3.2 Overview
3.3 GA-Based Feature Selector
3.4 Training Data Generator
4 Performance Evaluation
4.1 Evaluation Settings
4.2 Impact of GA-Based Feature Selector
4.3 Impact of Training Data Generator
5 Conclusion
References
Machine Learning Approach to Predict Metastasis in Lung Cancer Based on Radiomic Features
1 Background
2 Materials and Methods
2.1 Data
2.2 Radiomics Features
2.3 Classification Workflow
3 Feature Selection Challenges
3.1 Multiple ROIs from the Same Patient
3.2 Response Variable Type
3.3 Small Differences Between Classes
4 Results
5 Discussion and Future Work
References
Covariance Controlled Bayesian Rose Trees
1 Introduction
2 Algorithm
2.1 Hierarchical Clustering
2.2 Bayesian Rose Trees
2.3 Constraining BRT Hierarchies
2.4 Parameterisation
2.5 Depth Level as a Function of the Likelihood
2.6 Hierarchy Outside of Defined Clusters
3 Method Comparison
4 Conclusions
References
Potential of Radiomics Features for Predicting Time to Metastasis in NSCLC
1 Background
2 Materials and Methods
2.1 Data
2.2 Radiomics Features
2.3 Data Pre-processing and Unsupervised Analysis
2.4 Modeling of Metastasis Free Survival
3 Results
4 Discussion and Future Work
References
A Survey of Network Features for Machine Learning Algorithms to Detect Network Attacks
1 Introduction
2 Background Study
3 Literature Survey
4 Shortcoming of Existing Literature
5 Recommendations
References
The Quality of Clustering Data Containing Outliers
1 Introduction
1.1 The Structure of the Paper
2 State of Art
3 Clustering Data Containing Outliers
3.1 Clustering Algorithms: Hierarchical AHC vs Partitional K-Means
3.2 Clustering Quality Indices
3.3 Outlier Definition
3.4 Outlier Detection Algorithms
4 Experiments
4.1 Data Description
4.2 Methodology
4.3 Experimental Environment
4.4 Results
4.5 Discussion
5 Summary
References
Aggregated Performance Measures for Multi-class Classification
1 Introduction
2 Method
2.1 Classification of a Single Data Point
2.2 Aggregation Over Classes and Thresholds
2.3 Normalisation
2.4 The Case of Specificity
2.5 The Compound Measure of Accuracy
3 Discussion
References
Prediction of Lung Cancer Survival Based on Multiomic Data
1 Introduction
2 Materials and Methods
2.1 Data Used in the Study
2.2 Feature Definition and Pre-selection
2.3 Variable Importance Study
2.4 Classification of Data
3 Results
3.1 Aggregation and Dimensionality Reduction
3.2 Predictive Potential of Various -Omics Datasets
3.3 Variable Importance Study in a Multiomic Dataset
4 Discussion
References
Graph Neural Networks-Based Multilabel Classification of Citation Network
1 Introduction
2 Related Works
3 Dataset Description
4 Experiments
5 Multilabel Classification Approach
6 Conclusion and Future Works
References
Towards Efficient Discovery of Partial Periodic Patterns in Columnar Temporal Databases
1 Introduction
2 Related Work
3 The Model of Partial Periodic Pattern
4 Proposed Algorithm
4.1 3P-ECLAT Algorithm
5 Experimental Results
5.1 Evaluation of Algorithms by Varying minPS
5.2 Evaluation of Algorithms by Varying Per
5.3 Scalability Test
5.4 A Case Study: Finding Areas Where People Have Been Regularly Exposed to Hazardous Levels of PM2.5 Pollutant
6 Conclusions and Future Work
References
Avoiding Time Series Prediction Disbelief with Ensemble Classifiers in Multi-class Problem Spaces
1 Introduction
2 Time Series Analysis Life-Cycle
3 Prediction Disbelief in Acceptance Tests of Forecasting Models
4 Discussion
5 Conclusions
References
Speeding Up Recommender Systems Using Association Rules
1 Introduction
2 Preliminaries
2.1 Factorization Machines
2.2 Association Rules
2.3 Related Works
3 FMAR Recommender System
3.1 Problem Definition
3.2 Factorization Machine Apriori Based Model
3.3 Factorization Machine FP-Growth Based Model
4 Evaluation for FMAR
4.1 Performance Comparison and Analysis
5 Conclusions and Future Work
References
An Empirical Experiment on Feature Extractions Based for Speech Emotion Recognition
1 Introduction
2 Literature Review
3 Dataset
4 Feature Extraction
5 Methodology
5.1 Input Preparation
5.2 Classification Models
6 Experimental Results
7 Conclusion and Discussion
References
Parameter Distribution Ensemble Learning for Sudden Concept Drift Detection
1 Introduction
2 Methods
2.1 BO-ERICS Phase
2.2 Ensemble Phase
3 Experiments and Discussion
3.1 Datasets
3.2 Evaluation
3.3 Results
3.4 Discussion
4 Conclusions
References
MLP-Mixer Approach for Corn Leaf Diseases Classification
1 Introduction
2 Related Work
2.1 Literature Review
2.2 MLP-Mixer
2.3 Deep Learning
3 Methods
3.1 Data Requirements, Collection and Preparation
3.2 Configure the Hyperparameters
3.3 Build a Classification Model
3.4 Define an Experiment and Data Augmentation
3.5 The MLP-Mixer Model Structure
3.6 Build, Train, and Evaluate the MLP-Mixer Model
4 Experiment and Result
4.1 Image Segmentation
4.2 Experiment Results (Train and Evaluate Model)
4.3 Discussion
5 Conclusion
References
A Novel Neural Network Training Method for Autonomous Driving Using Semi-Pseudo-Labels and 3D Data Augmentations
1 Introduction
2 Related Work
3 A Novel Training Method with Semi-Pseudo-Labeling and 3D Augmentations
3.1 Semi-Pseudo-Labeling
3.2 3D Augmentations
3.3 An Example of Training with Semi-Pseudo-Labeling and 3D Augmentations
4 Experiments
4.1 Argoverse
4.2 In-House Highway Dataset
5 Conclusion
References
Machine Learning Methods for BIM Data
1 Introduction
2 BIM Data - IFC Files
3 Machine Learning Techniques for BIM
3.1 Learning Semantic Information - Space Classification
3.2 Semantic Enrichment of BIM Models from Point Clouds
3.3 Building Condition Diagnosis
3.4 BIM Enhancement in the Facility Management Context
3.5 Knowledge Extraction from BIM
4 Conclusions
References
Self-Optimizing Neural Network in Classification of Real Valued Experimental Data
1 Introduction
2 Self Optimizing Neural Network
2.1 SONN Formalism
2.2 Fundamental Coefficient of Discrimination
2.3 Structure of the Network and the Weight Factor
2.4 Network Response
3 Experiment and Results
3.1 Dataset
3.2 Data Preparation
3.3 Classification
4 Conclusion
References
Analyzing the Effectiveness of the Gaussian Mixture Model Clustering Algorithm in Software Enhancement Effort Estimation
1 Introduction
2 Backgrounds
2.1 The FPA Overview
2.2 The Gaussian Mixture Model Clustering Algorithm
2.3 The k-means Clustering Algorithm
3 Research Methodology
3.1 Dataset Pre-processing
3.2 Determine the Number of Clusters
3.3 Evaluation Criteria
4 Results and Discussions
5 Conclusion
References
Graph Classification via Graph Structure Learning
1 Introduction
2 Related Works
3 Proposed Method: GC-GSL
3.1 Extracting Topological Attribute Vector
3.2 Rooted Subgraph Mining
3.3 Neural Network Graph Embedding
3.4 Computational Complexity
4 Experiments
4.1 Results
4.2 Discussions
5 Conclusion
References
Relearning Ensemble Selection Based on New Generated Features
1 Introduction
2 Related Works
3 The Proposed Framework
3.1 Generation of Diverse Base Classifiers
3.2 Relearning Base Classifiers
3.3 Feature Generation Based on Learned and Relearned Base Classifiers
3.4 Learning Second-Level Base Classifier Based on New Vector of the Features
3.5 Selection Base Classifiers Based on Second-Level Classification Result
4 Experiments
4.1 Experimental Setup
4.2 Results
5 Discussion
6 Conclusions
References
Random Forest in Whitelist-Based ATM Security
1 Introduction
2 Related Work
3 Test Procedure
4 Data Pre-processing
5 Data Classification
6 Results
7 Conclusions
References
Layer-Wise Optimization of Contextual Neural Networks with Dynamic Field of Aggregation
1 Introduction
2 Contextual Neural Networks
3 Layer-Wise Patterns of Connections Grouping in CxNNs
4 Experiments and Results
5 Conclusions
References
Computer Vision Techniques
Automatic Counting of People Entering and Leaving Based on Dominant Colors and People Silhouettes
1 Introduction
2 Related Work
3 New Approach for People Detection, Counting, and People Annotation
4 Test Results
5 Conclusions
References
Toward Understanding the Impact of Input Data for Multi-Image Super-Resolution
1 Introduction
2 Literature Review
2.1 Single-Image Super-Resolution
2.2 Multi-Image Super-Resolution
3 Proposed Experimental Setup
3.1 Models Exploited for Multi-Image Super-Resolution
3.2 Test Data
3.3 Testing Procedure
4 Experimental Results
5 Conclusions and Future Work
References
Single-Stage Real-Time Face Mask Detection
1 Introduction and Related Works
2 Proposed Model
2.1 Choosing Anchor Boxes for Model
2.2 Architecture
3 Implementation
3.1 Datasets
3.2 Training and Validation Processes
3.3 Web Application Implementation
4 Result
4.1 Evaluation Metrics
4.2 Comparisons and Discussions
5 Conclusion
References
User-Generated Content (UGC)/In-The-Wild Video Content Recognition
1 Introduction
2 Other Work
3 Databases
3.1 The Full Sets
3.2 The Sub Sets
3.3 Video Indicators
4 Modelling
5 Results
6 Conclusions and Further Work
References
A Research for Segmentation of Brain Tumors Based on GAN Model
1 Introduction
2 Related Works
3 Our Proposed Method
3.1 Overview
3.2 Data Structure and Preprocessing
3.3 Building Architecture
3.4 Modifying Loss Function
4 Implementation and Results
5 Discussion and Comparison
5.1 Evaluation Metrics
5.2 Evaluation and Discussion
6 Conclusion
References
Tracking Student Attendance in Virtual Classes Based on MTCNN and FaceNet
1 Introduction
2 Related Work
3 Background
3.1 Face Detection
3.2 Face Recognition
4 A Proposed Student Monitoring System
5 Experiments and Result Analysis
5.1 Datasets
5.2 Parameter Settings
5.3 Results
6 Conclusion and Future Research Directions
References
A Correct Face Mask Usage Detection Framework by AIoT
1 Introduction
2 Background and Related Works
3 System Design and Implementation
3.1 System Design
3.2 Implementation
3.3 Training Procedure
4 Experiments
4.1 Dataset
4.2 Results
4.3 Discussion
5 Conclusions
References
A New 3D Face Model for Vietnamese Based on Basel Face Model
1 Introduction
2 Related Works
2.1 Basel Face Model
2.2 Weakly-Supervised Learning Method
3 Proposal
3.1 Face Image Collecting
3.2 Generating New Mean Face Shape
3.3 Training Session
4 Experiments
4.1 Measurements
4.2 Discussion
5 Conclusion
References
Domain Generalisation for Glaucoma Detection in Retinal Images from Unseen Fundus Cameras
1 Introduction
2 Related Work
3 Methodology
3.1 Input Standardisation
3.2 Histogram Matching
4 Experimental Design
4.1 Datasets
4.2 Image Preprocessing
4.3 Network Training and Testing
5 Results and Discussion
6 Conclusion
References
BRDF Anisotropy Criterion
1 Introduction
2 Bidirectional Reflectance Distribution Function
3 Anisotropy Criterion
4 Experimental Textures
4.1 Wood UTIA BTF Database
5 Results
6 Conclusion
References
Clustering Analysis Applied to NDVI Maps to Delimit Management Zones for Grain Crops
1 Introduction
2 Materials and Methods
2.1 Acquisition of Satellite Images
2.2 Calculation of NDVI Matrices
2.3 Consolidation of NDVI Data
2.4 Clustering the Consolidated Data
2.5 Visualization of Clusters (Zones of Homogeneous Management) on the Map
3 Results
4 Conclusion
References
Features of Hand-Drawn Spirals for Recognition of Parkinson's Disease
1 Introduction
2 Method Description
3 Data Acquisition
4 Feature Extraction
5 Experiments and Results
6 Multimodal Diagnostic System
7 Conclusions
References
FASENet: A Two-Stream Fall Detection and Activity Monitoring Model Using Pose Keypoints and Squeeze-and-Excitation Networks
1 Introduction
2 Related Work
3 Methodology
3.1 Preprocessing
3.2 FASENet: The Proposed Architecture
4 Datasets
5 Results and Discussion
6 Conclusion and Future Work
References
Pre-processing of CT Images of the Lungs
1 Introduction
2 Related Work
3 Methods
4 Conclusion
References
Innovations in Intelligent Systems
Application of Hyperledger Blockchain to Reduce Information Asymmetries in the Used Car Market
1 Introduction
2 Hyperledger Framework
3 Implementation and Simulation
3.1 Hyperledger Fabric Network Implementation
3.2 Simulation Process Experiments with Accessing Historical Data
4 Conclusions and Limitations
References
Excess-Mass and Mass-Volume Quality Measures Susceptibility to Intrusion Detection System’s Data Dimensionality
1 Introduction
2 Related Works
3 Algorithms and Datasets
3.1 Algorithms
3.2 Datasets
4 Empirical Research
4.1 Research Agenda
4.2 Research Results
4.3 Results Discussion
5 Summary
References
A Homomorphic Encryption Approach for Privacy-Preserving Deep Learning in Digital Health Care Service
1 Introduction and Cryptographic Backgrounds
1.1 Homomorphic Encryption
1.2 CKKS
2 The Proposed Solution
2.1 System Architecture
2.2 Data Selection and Features
3 Implementation
3.1 Data Preparation
3.2 Training Model
3.3 Predicting Model
4 Test Results and Evaluations
5 Conclusions and Discussion
References
Semantic Pivoting Model for Effective Event Detection
1 Introduction
2 Related Work
3 Event Detection
4 Proposed Model
4.1 Label Semantic Learner
4.2 Trigger Classifier
5 Experiments
5.1 Dataset and Evaluation Metrics
5.2 Baselines
5.3 Implementation Details
5.4 Experimental Results
5.5 Ablation Studies
5.6 Analysis and Discussion
6 Conclusion
References
Meet Your Email Sender - Hybrid Approach to Email Signature Extraction
1 Introduction
2 Related Work
3 Methodology
3.1 Early Stage - Phase 1
3.2 Development of Signature Extractor - Phase 2
4 Data Sets and Results
4.1 Dataset-1
4.2 Dataset-2
4.3 Dataset-3
5 Conclusion and Future Work
References
Performance of Packet Delivery Ratio for Varying Vehicles Speeds on Highway Scenario in C-V2X Mode 4
1 Introduction
2 C-V2X
2.1 C-V2X Physical Layer
2.2 C-V2X MAC Layer
3 System Models
4 Simulation Parameters and Results
5 Conclusions
References
Extensions of the Diffie-Hellman Key Agreement Protocol Based on Exponential and Logarithmic Functions
1 Introduction
2 Related Work
3 Preliminaries
4 Extensions of Diffie-Hellman Protocol
5 The Key Agreement Protocol on Permutation Group
6 Conclusions
References
Music Industry Trend Forecasting Based on MusicBrainz Metadata
1 Introduction
2 Related Works
3 Forecasting Methods
4 The Experiment
4.1 Dataset: MusicBrainz
4.2 Data Processing
5 Results
5.1 Releases as Single and Album
5.2 Medium Used for Releases
5.3 Album Total Time Length and Tracks
6 Conclusions
References
AntiPhiMBS-TRN: A New Anti-phishing Model to Mitigate Phishing Attacks in Mobile Banking System at Transaction Level
1 Introduction
2 Related Works
3 Proposed Anti-phishing Model AntiPhiMBS-TRN
3.1 Architecture of Anti-phishing Model AntiPhiMBS-TRN
3.2 Verification of Proposed Anti-phishing Model AntiPhiMBS-TRN
4 Results and Discussion
5 Conclusion and Future Work
References
Blockchain-Based Decentralized Digital Content Management and Sharing System
1 Introduction
2 Related Work
2.1 Blockchain
2.2 Decentralized Storage System
2.3 Decentralized Data Sharing
3 Digital Content Sharing Model
3.1 Digital Content Sharing Between Data Provider and Data Owner
3.2 Digital Content Sharing Between Data Owner and Data User
4 Security Analysis
4.1 Confidentiality
4.2 Integrity
4.3 Anonymity
4.4 Non-repudiation
4.5 Scalability
4.6 Availability
5 Conclusion and Future Work
References
Analysis of Ciphertext Behaviour Using the Example of the AES Block Cipher in ECB, CBC, OFB and CFB Modes of Operation, Using Multiple Encryption
1 Introduction
2 Analysis of Characteristics of Ciphertext Behaviour in the Modes of Operation
2.1 Analysis of Characteristics of Ciphertext Behaviour in ECB Mode
2.2 Analysis of Characteristics of Ciphertext Behaviour in CBC Mode
2.3 Analysis of Characteristics of Ciphertext Behaviour in OFB Mode
2.4 Analysis of Characteristics of Ciphertext Behaviour in CFB Mode
3 Conclusion
References
Vulnerability Analysis of IoT Devices to Cyberattacks Based on Naïve Bayes Classifier
1 Introduction
2 CVSS System
3 Classification in Data Mining
4 ROC Curves
4.1 Confusion Matrix
4.2 Area Under the Curve (AUC)
5 Naïve Bayes Classifier
5.1 Bayes Theorem
5.2 An Assumption of Naïve Bayes
6 Classification Framework
7 Network Scanning
8 Naïve Bayes Classification Algorithm
9 Conclusion
References
Symmetric and Asymmetric Cryptography on the Special Linear Cracovian Quasigroup
1 Introduction
2 The Symmetric Cipher on SLn(Z)
3 The Symmetric Cipher on the Quasigroup KSLn(Z)
4 The Asymmetric Cipher on SLn(Zm)
5 The Asymmetric Cipher on the Quasigroup KSLn(Zm)
6 Conclusions
References
Multi-pass, Non-recursive Acoustic Echo Cancellation Sequential and Parallel Algorithms
1 Introduction
1.1 Elimination of the Echo
2 Acoustic Echo Cancellation Algorithms
2.1 Non-recursive Algorithms for Acoustic Echo Cancellation Problem
2.2 Algorithms from Least Mean Square Family
3 Multi-pass Algorithms from Least Mean Square Family in Acoustic Echo Cancellation Problem
3.1 SeqPipelineLMS Sequential Algorithm
3.2 Model of the Parallel ParPipelineLMS Algorithm with an Iteration Rate Equal to the Signal Sampling Rate
3.3 Parallel ParPipelineLMS Algorithm Model, for which the Determination Time of a Single Iteration of the LMS TLMS Algorithm is much Shorter than the Quotient Value 1/f
4 Experimental Studies
4.1 Computational Experiments of ParSeqPipelineLMS Class Algorithms for Their Application in Real-Time Systems
4.2 A Model of the Parallel-Parallel ParParPipelineLMS Algorithm
5 Summary and Conclusions
References
Effective Resource Utilization in Heterogeneous Hadoop Environment Through a Dynamic Inter-cluster and Intra-cluster Load Balancing
1 Introduction
2 Related Works
3 Problem Statement
3.1 Jobs
3.2 Nodes
3.3 Cluster
4 Effective Load Balancing Policy in Heterogeneous Hadoop Clusters
4.1 Clustering of Nodes
4.2 Jobs Ranking
4.3 Dynamic Inter-cluster and Intra-cluster Load Balancing
5 Experimental Results
6 Conclusion
References
Intelligent Audio Signal Processing – Do We Still Need Annotated Datasets?
1 Introduction
2 Audio-Related Research
2.1 Machine Learning Applied to Audio-Related Topics
2.2 Audio Datasets
3 Examples of Audio-Related Work Performed
4 Conclusion
References
Multimodal Approach to Measuring Cognitive Load Using Sternberg Memory and Input Diagrammatic Reasoning Tests
1 Introduction
2 Background and Related Works
2.1 Cognitive Load and Its Theory
2.2 Methods Employed to Measure Cognitive Load
2.3 Cognitive Load and Sternberg Memory Tasks
2.4 Cognitive Load and Input Diagrammatic Reasoning Tasks
2.5 Psychophysiological Techniques
3 Experiment Setup
3.1 Participants
3.2 Sternberg Memory Tasks
3.3 Input Diagrammatic Reasoning Tasks
3.4 Biometric Sensors Used
3.5 Metrics Used
4 Analysis of Experiment Results
5 Discussion and Conclusions
References
Chaining Electronic Seals
1 Introduction
2 Previous Work
3 Signature Chaining
4 Proof-of-Concept Schemes
4.1 Scheme I
4.2 Scheme II with Hidden Fingerprints
4.3 Scheme III - Bilateral Use
4.4 Scheme IV - Chaining with a Hidden Inner Signature
5 Conclusion
References
Author Index

LNAI 13758

Ngoc Thanh Nguyen · Tien Khoa Tran · Ualsher Tukayev · Tzung-Pei Hong · Bogdan Trawiński · Edward Szczerbicki (Eds.)

Intelligent Information and Database Systems 14th Asian Conference, ACIIDS 2022 Ho Chi Minh City, Vietnam, November 28–30, 2022 Proceedings, Part II

Lecture Notes in Artificial Intelligence
Subseries of Lecture Notes in Computer Science

Series Editors
Randy Goebel - University of Alberta, Edmonton, Canada
Wolfgang Wahlster - DFKI, Berlin, Germany
Zhi-Hua Zhou - Nanjing University, Nanjing, China

Founding Editor
Jörg Siekmann - DFKI and Saarland University, Saarbrücken, Germany


More information about this subseries at https://link.springer.com/bookseries/1244


Editors
Ngoc Thanh Nguyen - Wrocław University of Science and Technology, Wrocław, Poland; Nguyen Tat Thanh University, Ho Chi Minh City, Vietnam
Tien Khoa Tran - Vietnam National University, Ho Chi Minh City, Vietnam
Ualsher Tukayev - Al-Farabi Kazakh National University, Almaty, Kazakhstan
Tzung-Pei Hong - National University of Kaohsiung, Kaohsiung, Taiwan
Bogdan Trawiński - Wrocław University of Science and Technology, Wrocław, Poland
Edward Szczerbicki - University of Newcastle, Newcastle, NSW, Australia

ISSN 0302-9743  ISSN 1611-3349 (electronic)
Lecture Notes in Artificial Intelligence
ISBN 978-3-031-21966-5  ISBN 978-3-031-21967-2 (eBook)
https://doi.org/10.1007/978-3-031-21967-2
LNCS Sublibrary: SL7 – Artificial Intelligence

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

ACIIDS 2022 was the 14th event in a series of international scientific conferences on research and applications in the field of intelligent information and database systems. The aim of ACIIDS 2022 was to provide an international forum for researchers with scientific backgrounds in the technology of intelligent information and database systems and its various applications.

The ACIIDS 2022 conference was co-organized by the International University - Vietnam National University HCMC (Vietnam) and the Wrocław University of Science and Technology (Poland) in cooperation with the IEEE SMC Technical Committee on Computational Collective Intelligence, the European Research Center for Information Systems (ERCIS), Al-Farabi Kazakh National University (Kazakhstan), the University of Newcastle (Australia), Yeungnam University (South Korea), Quang Binh University (Vietnam), Leiden University (The Netherlands), Universiti Teknologi Malaysia (Malaysia), Nguyen Tat Thanh University (Vietnam), BINUS University (Indonesia), the Committee on Informatics of the Polish Academy of Sciences (Poland), and Vietnam National University, Hanoi (Vietnam).

ACIIDS 2022 was originally scheduled to be held in Almaty, Kazakhstan, during June 6–9, 2022. However, due to the unstable political situation, the conference was moved to Ho Chi Minh City, Vietnam, and was conducted as a hybrid event during November 28–30, 2022. The ACIIDS conference series is already well established, having taken place at various locations throughout Asia (Vietnam, South Korea, Taiwan, Malaysia, Thailand, Indonesia, and Japan) since 2009. The 12th and 13th events were planned to take place in Phuket (Thailand); however, the global COVID-19 pandemic resulted in both editions being held online in virtual space. We were, therefore, pleased to be able to hold ACIIDS 2022 in person, whilst still providing an option for people to participate online.
These two volumes contain 113 peer-reviewed papers selected for presentation from 406 submissions, with each submission receiving at least 3 reviews in a single-blind process. Papers included in this volume cover the following topics: data mining and machine learning methods, advanced data mining techniques and applications, intelligent and contextual systems, natural language processing, network systems and applications, computational imaging and vision, decision support and control systems, and data modeling and processing for Industry 4.0. The accepted and presented papers focus on new trends and challenges facing the intelligent information and database systems community. The presenters at ACIIDS 2022 showed how research work can stimulate novel and innovative applications. We hope that you find these results useful and inspiring for your future research work.

We would like to express our sincere thanks to the honorary chairs for their support: Arkadiusz Wójs (Rector of Wrocław University of Science and Technology, Poland) and Zhanseit Tuymebayev (Rector of Al-Farabi Kazakh National University, Kazakhstan). We would like to express our thanks to the keynote speakers for their world-class plenary speeches: Tzung-Pei Hong from the National University of Kaohsiung (Taiwan), Michał Woźniak from the Wrocław University of Science and Technology (Poland), Minh-Triet Tran from the University of Science and the John von Neumann Institute, VNU-HCM (Vietnam), and Minh Le Nguyen from the Japan Advanced Institute of Science and Technology (Japan).

We cordially thank our main sponsors: International University - Vietnam National University HCMC, Hitachi Vantara Vietnam Co., Ltd, the Polish Ministry of Education and Science, and Wrocław University of Science and Technology, as well as all of the aforementioned cooperating universities and organizations. Our special thanks are also due to Springer for publishing the proceedings and to all the other sponsors for their kind support. We are grateful to the Special Session Chairs, Organizing Chairs, Publicity Chairs, Liaison Chairs, and Local Organizing Committee for their work towards the conference. We sincerely thank all the members of the international Program Committee for their valuable efforts in the review process, which helped us to select the highest quality papers for the conference. We cordially thank all the authors for their valuable contributions and the other conference participants. The conference would not have been possible without their support. Thanks are also due to the many experts who contributed to making the event a success.

November 2022

Ngoc Thanh Nguyen
Tien Khoa Tran
Ualsher Tukeyev
Tzung-Pei Hong
Bogdan Trawiński
Edward Szczerbicki

Organization

Honorary Chairs
Arkadiusz Wójs - Wrocław University of Science and Technology, Poland
Zhanseit Tuymebayev - Al-Farabi Kazakh National University, Kazakhstan

Conference Chairs
Tien Khoa Tran - International University - Vietnam National University HCMC, Vietnam
Ngoc Thanh Nguyen - Wrocław University of Science and Technology, Poland
Ualsher Tukeyev - Al-Farabi Kazakh National University, Kazakhstan

Program Chairs
Tzung-Pei Hong - National University of Kaohsiung, Taiwan
Edward Szczerbicki - University of Newcastle, Australia
Bogdan Trawiński - Wrocław University of Science and Technology, Poland

Steering Committee
Ngoc Thanh Nguyen (Chair) - Wrocław University of Science and Technology, Poland
Longbing Cao - University of Science and Technology Sydney, Australia
Suphamit Chittayasothorn - King Mongkut’s Institute of Technology Ladkrabang, Thailand
Ford Lumban Gaol - Bina Nusantara University, Indonesia
Tu Bao Ho - Japan Advanced Institute of Science and Technology, Japan
Tzung-Pei Hong - National University of Kaohsiung, Taiwan
Dosam Hwang - Yeungnam University, South Korea
Bela Stantic - Griffith University, Australia
Geun-Sik Jo - Inha University, South Korea
Hoai An Le-Thi - University of Lorraine, France
Toyoaki Nishida - Kyoto University, Japan
Leszek Rutkowski - Częstochowa University of Technology, Poland
Ali Selamat - Universiti Teknologi Malaysia, Malaysia

Special Session Chairs
Van Sinh Nguyen - International University - Vietnam National University HCMC, Vietnam
Krystian Wojtkiewicz - Wrocław University of Science and Technology, Poland
Bogumiła Hnatkowska - Wrocław University of Science and Technology, Poland
Madina Mansurova - Al-Farabi Kazakh National University, Kazakhstan

Doctoral Track Chairs
Marek Krótkiewicz - Wrocław University of Science and Technology, Poland
Marcin Pietranik - Wrocław University of Science and Technology, Poland
Thi Thuy Loan Nguyen - International University - Vietnam National University HCMC, Vietnam
Paweł Sitek - Kielce University of Technology, Poland

Liaison Chairs
Ford Lumban Gaol - Bina Nusantara University, Indonesia
Quang-Thuy Ha - VNU-University of Engineering and Technology, Vietnam
Mong-Fong Horng - National Kaohsiung University of Applied Sciences, Taiwan
Dosam Hwang - Yeungnam University, South Korea
Le Minh Nguyen - Japan Advanced Institute of Science and Technology, Japan
Ali Selamat - Universiti Teknologi Malaysia, Malaysia

Organizing Chairs
Van Sinh Nguyen - International University - Vietnam National University HCMC, Vietnam
Krystian Wojtkiewicz - Wrocław University of Science and Technology, Poland


Publicity Chairs
Thanh Tung Tran - International University - Vietnam National University HCMC, Vietnam
Marek Kopel - Wrocław University of Science and Technology, Poland
Marek Krótkiewicz - Wrocław University of Science and Technology, Poland

Webmaster
Marek Kopel - Wrocław University of Science and Technology, Poland

Local Organizing Committee
Le Van Canh - International University - Vietnam National University HCMC, Vietnam
Le Hai Duong - International University - Vietnam National University HCMC, Vietnam
Le Duy Tan - International University - Vietnam National University HCMC, Vietnam
Marcin Jodłowiec - Wrocław University of Science and Technology, Poland
Patient Zihisire Muke - Wrocław University of Science and Technology, Poland
Thanh-Ngo Nguyen - Wrocław University of Science and Technology, Poland
Rafał Palak - Wrocław University of Science and Technology, Poland

Keynote Speakers Tzung-Pei Hong Michał Woźniak Minh-Triet Tran Minh Le Nguyen

National University of Kaohsiung, Taiwan Wrocław University of Science and Technology, Poland University of Science and John von Neumann Institute, VNU-HCM, Vietnam Japan Advanced Institute of Science and Technology, Japan

Special Sessions Organizers ACMLT 2022: Special Session on Awareness Computing Based on Machine Learning Yung-Fa Huang Rung Ching Chen

Chaoyang University of Technology, Taiwan Chaoyang University of Technology, Taiwan

ADMTA 2022: Special Session on Advanced Data Mining Techniques and Applications Chun-Hao Chen Bay Vo Tzung-Pei Hong

Tamkang University, Taiwan Ho Chi Minh City University of Technology, Vietnam National University of Kaohsiung, Taiwan

AIIS 2022: Special Session on Artificial Intelligence in Information Security Shynar Mussiraliyeva Batyrkhan Omarov

Al-Farabi Kazakh National University, Kazakhstan Al-Farabi Kazakh National University, Kazakhstan

BMLLC 2022: Special Session on Bio-modeling and Machine Learning in Prediction of Metastasis in Lung Cancer Andrzej Świerniak Rafał Suwiński

Silesian University of Technology, Poland Institute of Oncology, Poland

BTAS 2022: Special Session on Blockchain Technology and Applications for Sustainability Chien-wen Shen Ping-yu Hsu

National Central University, Taiwan National Central University, Taiwan

CIV 2022: Special Session on Computational Imaging and Vision Manish Khare Prashant Srivastava Om Prakash Jeonghwan Gwak

Dhirubhai Ambani Institute of Information and Communication Technology, India NIIT University, India HNB Garhwal University, India Korea National University of Transportation, South Korea

DMPI-APP 2022: Special Session on Data Modelling and Processing: Air Pollution Prevention Marek Krótkiewicz Krystian Wojtkiewicz Hoai Phuong Ha Jean-Marie Lepioufle

Wrocław University of Science and Technology, Poland Wrocław University of Science and Technology, Poland UiT The Arctic University of Norway, Norway Norwegian Institute for Air Research, Norway

DMSN 2022: Special Session on Data Management in Sensor Networks Khouloud Salameh Yannis Manolopoulos Richard Chbeir

American University of Ras Al Khaimah, UAE Open University of Cyprus, Cyprus Université de Pau et des Pays de l’Adour (UPPA), France

ICxS 2022: Special Session on Intelligent and Contextual Systems Maciej Huk Keun Ho Ryu Rashmi Dutta Baruah Tetsuji Kuboyama Goutam Chakraborty Seo-Young Noh Chao-Chun Chen

Wroclaw University of Science and Technology, Poland Ton Duc Thang University, Vietnam Indian Institute of Technology Guwahati, India Gakushuin University, Japan Iwate Prefectural University, Japan Chungbuk National University, South Korea National Cheng Kung University, Taiwan

IPROSE 2022: Special Session on Intelligent Problem Solving for Smart Real World Doina Logofătu Costin Bădică Florin Leon Mirjana Ivanovic

Frankfurt University of Applied Sciences, Germany University of Craiova, Romania Gheorghe Asachi Technical University of Iași, Romania University of Novi Sad, Serbia

ISCEC 2022: Special Session on Intelligent Supply Chains and e-Commerce Arkadiusz Kawa Bartłomiej Pierański

Łukasiewicz Research Network – The Institute of Logistics and Warehousing, Poland Poznan University of Economics and Business, Poland

ISMSFuN 2022: Special Session on Intelligent Solutions for Management and Securing Future Networks Grzegorz Kołaczek Łukasz Falas Patryk Schauer Krzysztof Gierłowski

Wrocław University of Science and Technology, Poland Wrocław University of Science and Technology, Poland Wrocław University of Science and Technology, Poland Gdańsk University of Technology, Poland

LPAIA 2022: Special Session on Learning Patterns/Methods in Current AI Applications Urszula Boryczka Piotr Porwik

University of Silesia, Poland University of Silesia, Poland

LRLSTP 2022: Special Session on Low Resource Languages Speech and Text Processing Ualsher Tukeyev Orken Mamyrbayev

Al-Farabi Kazakh National University, Kazakhstan Al-Farabi Kazakh National University, Kazakhstan

MISSI 2022: Satellite Workshop on Multimedia and Network Information Systems Kazimierz Choroś Marek Kopel Mikołaj Leszczuk Maria Trocan

Wrocław University of Science and Technology, Poland Wrocław University of Science and Technology, Poland AGH University of Science and Technology, Poland Institut Supérieur d’Electronique de Paris, France

MLLSCP 2022: Special Session on Machine Learning in Large-Scale and Complex Problems Jan Kozak Przemysław Juszczuk Barbara Probierz

University of Economics in Katowice, Poland University of Economics in Katowice, Poland University of Economics in Katowice, Poland

MLND 2022: Special Session on Machine Learning Prediction of Neurodegenerative Diseases Andrzej W. Przybyszewski Jerzy P. Nowacki

Polish-Japanese Academy of Information Technology, Poland Polish-Japanese Academy of Information Technology, Poland

MMAML 2022: Special Session on Multiple Model Approach to Machine Learning Tomasz Kajdanowicz Edwin Lughofer Bogdan Trawiński

Wrocław University of Science and Technology, Poland Johannes Kepler University Linz, Austria Wrocław University of Science and Technology, Poland

PPiBDA 2022: Special Session on Privacy Protection in Big Data Approaches Abdul Razaque Saleem Hariri Munif Alotaibi Fathi Amsaad Bandar Alotaibi

International Information Technology University, Kazakhstan University of Arizona, USA Shaqra University, Saudi Arabia Eastern Michigan University, USA University of Tabuk, Saudi Arabia

SIOTBDTA 2022: Special Session on Smart IoT and Big Data Technologies and Applications Octavian Postolache Madina Mansurova

ISCTE-University Institute of Lisbon, Portugal Al-Farabi Kazakh National University, Kazakhstan

Senior Program Committee Ajith Abraham Jesus Alcala-Fdez Lionel Amodeo Ahmad Taher Azar

Machine Intelligence Research Labs, USA University of Granada, Spain University of Technology of Troyes, France Prince Sultan University, Saudi Arabia

Thomas Bäck Costin Badica Ramazan Bayindir Abdelhamid Bouchachia David Camacho Leopoldo Eduardo Cardenas-Barron Oscar Castillo Nitesh Chawla Rung-Ching Chen Shyi-Ming Chen Simon Fong Hamido Fujita Mohamed Gaber Marina L. Gavrilova Daniela Godoy Fernando Gomide Manuel Grana Claudio Gutierrez Francisco Herrera Tzung-Pei Hong Dosam Hwang Mirjana Ivanovic Janusz Jeżewski Piotr Jedrzejowicz Kang-Hyun Jo Jason J. Jung Janusz Kacprzyk Nikola Kasabov Muhammad Khurram Khan Frank Klawonn Joanna Kolodziej Józef Korbicz Ryszard Kowalczyk Bartosz Krawczyk Ondrej Krejcar Adam Krzyzak Mark Last Hoai An Le Thi

Leiden University, Netherlands University of Craiova, Romania Gazi University, Turkey Bournemouth University, UK Universidad Autonoma de Madrid, Spain Tecnologico de Monterrey, Mexico Tijuana Institute of Technology, Mexico University of Notre Dame, USA Chaoyang University of Technology, Taiwan National Taiwan University of Science and Technology, Taiwan University of Macau, Macau SAR Iwate Prefectural University, Japan Birmingham City University, UK University of Calgary, Canada ISISTAN Research Institute, Argentina University of Campinas, Brazil University of the Basque Country, Spain Universidad de Chile, Chile University of Granada, Spain National University of Kaohsiung, Taiwan Yeungnam University, South Korea University of Novi Sad, Serbia Institute of Medical Technology and Equipment ITAM, Poland Gdynia Maritime University, Poland University of Ulsan, South Korea Chung-Ang University, South Korea Systems Research Institute, Polish Academy of Sciences, Poland Auckland University of Technology, New Zealand King Saud University, Saudi Arabia Ostfalia University of Applied Sciences, Germany Cracow University of Technology, Poland University of Zielona Gora, Poland Swinburne University of Technology, Australia Virginia Commonwealth University, USA University of Hradec Králové, Czech Republic Concordia University, Canada Ben-Gurion University of the Negev, Israel University of Lorraine, France

Kun Chang Lee Edwin Lughofer Nezam Mahdavi-Amiri Yannis Manolopoulos Klaus-Robert Müller Saeid Nahavandi Grzegorz J. Nalepa Ngoc Thanh Nguyen Dusit Niyato Yusuke Nojima Manuel Núñez Jeng-Shyang Pan Marcin Paprzycki Bernhard Pfahringer Hoang Pham Tao Pham Dinh Radu-Emil Precup Leszek Rutkowski Juergen Schmidhuber Björn Schuller Ali Selamat Andrzej Skowron Jerzy Stefanowski Edward Szczerbicki Ryszard Tadeusiewicz Muhammad Atif Tahir Bay Vo Gottfried Vossen Dinh Duc Anh Vu Lipo Wang Junzo Watada Michał Woźniak Farouk Yalaoui Sławomir Zadrożny Zhi-Hua Zhou

Sungkyunkwan University, South Korea Johannes Kepler University Linz, Austria Sharif University of Technology, Iran Open University of Cyprus, Cyprus Technical University of Berlin, Germany Deakin University, Australia AGH University of Science and Technology, Poland Wrocław University of Science and Technology, Poland Nanyang Technological University, Singapore Osaka Prefecture University, Japan Universidad Complutense de Madrid, Spain Fujian University of Technology, China Systems Research Institute, Polish Academy of Sciences, Poland University of Waikato, New Zealand Rutgers University, USA INSA Rouen, France Politehnica University of Timisoara, Romania Częstochowa University of Technology, Poland Swiss AI Lab IDSIA, Switzerland University of Passau, Germany Universiti Teknologi Malaysia, Malaysia Warsaw University, Poland Poznań University of Technology, Poland University of Newcastle, Australia AGH University of Science and Technology, Poland National University of Computing and Emerging Sciences, Pakistan Ho Chi Minh City University of Technology, Vietnam University of Münster, Germany Vietnam National University HCMC, Vietnam Nanyang Technological University, Singapore Waseda University, Japan Wrocław University of Science and Technology, Poland University of Technology of Troyes, France Systems Research Institute, Polish Academy of Sciences, Poland Nanjing University, China

Program Committee Muhammad Abulaish Bashar Al-Shboul Toni Anwar Taha Arbaoui Mehmet Emin Aydin Amelia Badica Kambiz Badie Hassan Badir Zbigniew Banaszak Dariusz Barbucha Maumita Bhattacharya Leon Bobrowski Bülent Bolat Mariusz Boryczka Urszula Boryczka Zouhaier Brahmia Stephane Bressan Peter Brida Piotr Bródka Grażyna Brzykcy Robert Burduk Aleksander Byrski Dariusz Ceglarek Somchai Chatvichienchai Chun-Hao Chen Leszek J. Chmielewski Kazimierz Choroś Kun-Ta Chuang Dorian Cojocaru Jose Alfredo Ferreira Costa Ireneusz Czarnowski Piotr Czekalski Theophile Dagba Tien V. Do

South Asian University, India University of Jordan, Jordan Universiti Teknologi PETRONAS, Malaysia University of Technology of Troyes, France University of the West of England, UK University of Craiova, Romania ICT Research Institute, Iran École Nationale des Sciences Appliquées de Tanger, Morocco Warsaw University of Technology, Poland Gdynia Maritime University, Poland Charles Sturt University, Australia Białystok University of Technology, Poland Yildiz Technical University, Turkey University of Silesia, Poland University of Silesia, Poland University of Sfax, Tunisia National University of Singapore, Singapore University of Žilina, Slovakia Wroclaw University of Science and Technology, Poland Poznan University of Technology, Poland Wrocław University of Science and Technology, Poland AGH University of Science and Technology, Poland WSB University in Poznań, Poland University of Nagasaki, Japan Tamkang University, Taiwan Warsaw University of Life Sciences, Poland Wrocław University of Science and Technology, Poland National Cheng Kung University, Taiwan University of Craiova, Romania Federal University of Rio Grande do Norte (UFRN), Brazil Gdynia Maritime University, Poland Silesian University of Technology, Poland University of Abomey-Calavi, Benin Budapest University of Technology and Economics, Hungary

Rafał Doroz El-Sayed M. El-Alfy Keiichi Endo Sebastian Ernst Nadia Essoussi Usef Faghihi Dariusz Frejlichowski Blanka Frydrychova Klimova Janusz Getta Daniela Gifu Gergo Gombos Manuel Grana Janis Grundspenkis Dawit Haile Marcin Hernes Koichi Hirata Bogumiła Hnatkowska Bao An Mai Hoang Huu Hanh Hoang Van-Dung Hoang Jeongkyu Hong Yung-Fa Huang Maciej Huk Kha Tu Huynh Sanjay Jain Khalid Jebari Joanna Jędrzejowicz Przemysław Juszczuk Krzysztof Juszczyszyn Mehmet Karaata Rafał Kern Zaheer Khan Marek Kisiel-Dorohinicki

University of Silesia, Poland King Fahd University of Petroleum and Minerals, Saudi Arabia Ehime University, Japan AGH University of Science and Technology, Poland University of Carthage, Tunisia Université du Québec à Trois-Rivières, Canada West Pomeranian University of Technology, Szczecin, Poland University of Hradec Králové, Czech Republic University of Wollongong, Australia Alexandru Ioan Cuza University of Iași, Romania Eötvös Loránd University, Hungary University of the Basque Country, Spain Riga Technical University, Latvia Addis Ababa University, Ethiopia Wrocław University of Business and Economics, Poland Kyushu Institute of Technology, Japan Wrocław University of Science and Technology, Poland Vietnam National University HCMC, Vietnam Posts and Telecommunications Institute of Technology, Vietnam Quang Binh University, Vietnam Yeungnam University, South Korea Chaoyang University of Technology, Taiwan Wrocław University of Science and Technology, Poland Vietnam National University HCMC, Vietnam National University of Singapore, Singapore LCS Rabat, Morocco University of Gdańsk, Poland University of Economics in Katowice, Poland Wroclaw University of Science and Technology, Poland Kuwait University, Kuwait Wroclaw University of Science and Technology, Poland University of the West of England, UK AGH University of Science and Technology, Poland

Attila Kiss Shinya Kobayashi Grzegorz Kołaczek Marek Kopel Jan Kozak Adrianna Kozierkiewicz Dalia Kriksciuniene Dariusz Król Marek Krótkiewicz Marzena Kryszkiewicz Jan Kubicek Tetsuji Kuboyama Elżbieta Kukla Marek Kulbacki Kazuhiro Kuwabara Annabel Latham Tu Nga Le Yue-Shi Lee Florin Leon Chunshien Li Horst Lichter Tony Lindgren Igor Litvinchev Doina Logofătu Lech Madeyski Bernadetta Maleszka Marcin Maleszka Tamás Matuszka Michael Mayo Héctor Menéndez

Eötvös Loránd University, Hungary Ehime University, Japan Wrocław University of Science and Technology, Poland Wrocław University of Science and Technology, Poland University of Economics in Katowice, Poland Wrocław University of Science and Technology, Poland Vilnius University, Lithuania Wrocław University of Science and Technology, Poland Wrocław University of Science and Technology, Poland Warsaw University of Technology, Poland VSB -Technical University of Ostrava, Czech Republic Gakushuin University, Japan Wrocław University of Science and Technology, Poland Polish-Japanese Academy of Information Technology, Poland Ritsumeikan University, Japan Manchester Metropolitan University, UK Vietnam National University HCMC, Vietnam Ming Chuan University, Taiwan Gheorghe Asachi Technical University of Iasi, Romania National Central University, Taiwan RWTH Aachen University, Germany Stockholm University, Sweden Nuevo Leon State University, Mexico Frankfurt University of Applied Sciences, Germany Wrocław University of Science and Technology, Poland Wrocław University of Science and Technology, Poland Wrocław University of Science and Technology, Poland Eötvös Loránd University, Hungary University of Waikato, New Zealand University College London, UK

Mercedes Merayo Jacek Mercik Radosław Michalski Peter Mikulecky Miroslava Mikusova Marek Milosz Jolanta Mizera-Pietraszko Dariusz Mrozek Leo Mrsic Agnieszka Mykowiecka Pawel Myszkowski Huu-Tuan Nguyen Le Minh Nguyen Loan T. T. Nguyen Quang-Vu Nguyen Thai-Nghe Nguyen Thi Thanh Sang Nguyen Van Sinh Nguyen Agnieszka Nowak-Brzezińska Alberto Núñez Tarkko Oksala Mieczysław Owoc Panos Patros Maciej Piasecki Bartłomiej Pierański Dariusz Pierzchała Marcin Pietranik Elias Pimenidis Jaroslav Pokorný Nikolaos Polatidis Elvira Popescu Piotr Porwik Petra Poulova Małgorzata Przybyła-Kasperek Paulo Quaresma

Universidad Complutense de Madrid, Spain WSB University in Wrocław, Poland Wrocław University of Science and Technology, Poland University of Hradec Králové, Czech Republic University of Žilina, Slovakia Lublin University of Technology, Poland Opole University, Poland Silesian University of Technology, Poland IN2data Data Science Company, Croatia Institute of Computer Science, Polish Academy of Sciences, Poland Wrocław University of Science and Technology, Poland Vietnam Maritime University, Vietnam Japan Advanced Institute of Science and Technology, Japan Vietnam National University HCMC, Vietnam Korea-Vietnam Friendship Information Technology College, Vietnam Cantho University, Vietnam Vietnam National University HCMC, Vietnam Vietnam National University HCMC, Vietnam University of Silesia, Poland Universidad Complutense de Madrid, Spain Aalto University, Finland Wrocław University of Business and Economics, Poland University of Waikato, New Zealand Wroclaw University of Science and Technology, Poland Poznan University of Economics and Business, Poland Military University of Technology, Poland Wrocław University of Science and Technology, Poland University of the West of England, UK Charles University in Prague, Czech Republic University of Brighton, UK University of Craiova, Romania University of Silesia in Katowice, Poland University of Hradec Králové, Czech Republic University of Silesia, Poland Universidade de Evora, Portugal

David Ramsey Mohammad Rashedur Rahman Ewa Ratajczak-Ropel Sebastian A. Rios Keun Ho Ryu Daniel Sanchez Rafał Scherer Donghwa Shin Andrzej Siemiński Dragan Simic Bharat Singh Paweł Sitek Krzysztof Slot Adam Słowik Vladimir Sobeslav Kamran Soomro Zenon A. Sosnowski Bela Stantic Stanimir Stoyanov Ja-Hwung Su Libuse Svobodova Jerzy Świątek Andrzej Świerniak Julian Szymański Yasufumi Takama Zbigniew Telec Dilhan Thilakarathne Satoshi Tojo Diana Trandabat Bogdan Trawiński Maria Trocan Krzysztof Trojanowski Ualsher Tukeyev

Wrocław University of Science and Technology, Poland North South University, Bangladesh Gdynia Maritime University, Poland University of Chile, Chile Chungbuk National University, South Korea University of Granada, Spain Częstochowa University of Technology, Poland Yeungnam University, South Korea Wrocław University of Science and Technology, Poland University of Novi Sad, Serbia Universiti Teknologi PETRONAS, Malaysia Kielce University of Technology, Poland Łódź University of Technology, Poland Koszalin University of Technology, Poland University of Hradec Králové, Czech Republic University of the West of England, UK Białystok University of Technology, Poland Griffith University, Australia University of Plovdiv "Paisii Hilendarski", Bulgaria Cheng Shiu University, Taiwan University of Hradec Králové, Czech Republic Wrocław University of Science and Technology, Poland Silesian University of Technology, Poland Gdańsk University of Technology, Poland Tokyo Metropolitan University, Japan Wrocław University of Science and Technology, Poland Vrije Universiteit Amsterdam, Netherlands Japan Advanced Institute of Science and Technology, Japan Alexandru Ioan Cuza University of Iași, Romania Wrocław University of Science and Technology, Poland Institut Superieur d'Electronique de Paris, France Cardinal Stefan Wyszyński University in Warsaw, Poland Al-Farabi Kazakh National University, Kazakhstan

Olgierd Unold Jørgen Villadsen Thi Luu Phuong Vo Wahyono Wahyono Paweł Weichbroth Izabela Wierzbowska Krystian Wojtkiewicz Xin-She Yang Tulay Yildirim Drago Zagar Danuta Zakrzewska Constantin-Bala Zamfirescu Katerina Zdravkova Vesna Zeljkovic Aleksander Zgrzywa Jianlei Zhang Zhongwei Zhang Adam Ziębiński

Wrocław University of Science and Technology, Poland Technical University of Denmark, Denmark Vietnam National University HCMC, Vietnam Universitas Gadjah Mada, Indonesia Gdańsk University of Technology, Poland Gdynia Maritime University, Poland Wrocław University of Science and Technology, Poland Middlesex University London, UK Yildiz Technical University, Turkey University of Osijek, Croatia Łódź University of Technology, Poland Lucian Blaga University of Sibiu, Romania Ss. Cyril and Methodius University in Skopje, Macedonia Lincoln University, USA Wroclaw University of Science and Technology, Poland Nankai University, China University of Southern Queensland, Australia Silesian University of Technology, Poland

Contents – Part II

Machine Learning and Data Mining

Machine Learning or Lexicon Based Sentiment Analysis Techniques on Social Media Posts . . . 3
David L. John and Bela Stantic

A Comparative Study of Classification and Clustering Methods from Text of Books . . . 13
Barbara Probierz, Jan Kozak, and Anita Hrabia

A Lightweight and Efficient GA-Based Model-Agnostic Feature Selection Scheme for Time Series Forecasting . . . 26
Minh Hieu Nguyen, Viet Huy Nguyen, Thanh Trung Huynh, Thanh Hung Nguyen, Quoc Viet Hung Nguyen, and Phi Le Nguyen

Machine Learning Approach to Predict Metastasis in Lung Cancer Based on Radiomic Features . . . 40
Krzysztof Fujarewicz, Agata Wilk, Damian Borys, Andrea d’Amico, Rafał Suwiński, and Andrzej Świerniak

Covariance Controlled Bayesian Rose Trees . . . 51
Damian Pęszor and Eryka Probierz

Potential of Radiomics Features for Predicting Time to Metastasis in NSCLC . . . 64
Agata Wilk, Damian Borys, Krzysztof Fujarewicz, Andrea d’Amico, Rafał Suwiński, and Andrzej Świerniak

A Survey of Network Features for Machine Learning Algorithms to Detect Network Attacks . . . 77
Joveria Rubab, Hammad Afzal, and Waleed Bin Shahid

The Quality of Clustering Data Containing Outliers . . . 89
Agnieszka Nowak-Brzezińska and Igor Gaibei

Aggregated Performance Measures for Multi-class Classification . . . 103
Damian Pęszor and Konrad Wojciechowski

Prediction of Lung Cancer Survival Based on Multiomic Data . . . 116
Roman Jaksik and Jarosław Śmieja

Graph Neural Networks-Based Multilabel Classification of Citation Network . . . 128
Guillaume Lachaud, Patricia Conde-Cespedes, and Maria Trocan

Towards Efficient Discovery of Partial Periodic Patterns in Columnar Temporal Databases . . . 141
Penugonda Ravikumar, Venus Vikranth Raj, Palla Likhitha, Rage Uday Kiran, Yutaka Watanobe, Sadanori Ito, Koji Zettsu, and Masashi Toyoda

Avoiding Time Series Prediction Disbelief with Ensemble Classifiers in Multi-class Problem Spaces . . . 155
Maciej Huk

Speeding Up Recommender Systems Using Association Rules . . . 167
Eyad Kannout, Hung Son Nguyen, and Marek Grzegorowski

An Empirical Experiment on Feature Extractions Based for Speech Emotion Recognition . . . 180
Binh Van Duong, Chien Nhu Ha, Trung T. Nguyen, Phuc Nguyen, and Trong-Hop Do

Parameter Distribution Ensemble Learning for Sudden Concept Drift Detection . . . 192
Khanh-Tung Nguyen, Trung Tran, Anh-Duc Nguyen, Xuan-Hieu Phan, and Quang-Thuy Ha

MLP-Mixer Approach for Corn Leaf Diseases Classification . . . 204
Li-Hua Li and Radius Tanone

A Novel Neural Network Training Method for Autonomous Driving Using Semi-Pseudo-Labels and 3D Data Augmentations . . . 216
Tamás Matuszka and Dániel Kozma

Machine Learning Methods for BIM Data . . . 230
Grażyna Ślusarczyk and Barbara Strug

Self-Optimizing Neural Network in Classification of Real Valued Experimental Data . . . 241
Alicja Miniak-Górecka, Krzysztof Podlaski, and Tomasz Gwizdałła

Analyzing the Effectiveness of the Gaussian Mixture Model Clustering Algorithm in Software Enhancement Effort Estimation . . . 255
Vo Van Hai, Ho Le Thi Kim Nhung, Zdenka Prokopová, Radek Silhavy, and Petr Silhavy

Graph Classification via Graph Structure Learning . . . 269
Tu Huynh, Tuyen Thanh Thi Ho, and Bac Le

Relearning Ensemble Selection Based on New Generated Features . . . 282
Robert Burduk

Random Forest in Whitelist-Based ATM Security . . . 292
Michal Maliszewski and Urszula Boryczka

Layer-Wise Optimization of Contextual Neural Networks with Dynamic Field of Aggregation . . . 302
Marcin Jodłowiec, Adriana Albu, Krzysztof Wołk, Nguyen Thai-Nghe, and Adrian Karasiński

Computer Vision Techniques

Automatic Counting of People Entering and Leaving Based on Dominant Colors and People Silhouettes . . . 315
Kazimierz Choroś and Maciej Uran

Toward Understanding the Impact of Input Data for Multi-Image Super-Resolution . . . 329
Jakub Adler, Jolanta Kawulok, and Michal Kawulok

Single-Stage Real-Time Face Mask Detection . . . 343
Linh Phung-Khanh, Bogdan Trawiński, Vi Le-Thi-Tuong, Anh Pham-Hoang-Nam, and Nga Ly-Tu

User-Generated Content (UGC)/In-The-Wild Video Content Recognition . . . 356
Mikołaj Leszczuk, Lucjan Janowski, Jakub Nawała, and Michał Grega

A Research for Segmentation of Brain Tumors Based on GAN Model . . . 369
Linh Khanh Phung, Sinh Van Nguyen, Tan Duy Le, and Marcin Maleszka

Tracking Student Attendance in Virtual Classes Based on MTCNN and FaceNet . . . 382
Trong-Nghia Pham, Nam-Phong Nguyen, Nguyen-Minh-Quan Dinh, and Thanh Le

A Correct Face Mask Usage Detection Framework by AIoT . . . 395
Minh Hoang Pham, Sinh Van Nguyen, Tung Le, Huy Tien Nguyen, Tan Duy Le, and Bogdan Trawinski

A New 3D Face Model for Vietnamese Based on Basel Face Model . . . 408
Dang-Ha Nguyen, Khanh-An Han Tien, Thi-Chau Ma, and Hoang-Anh Nguyen

The Domain Generalisation for Glaucoma Detection in Retinal Images from Unseen Fundus Cameras . . . 421
Hansi Gunasinghe, James McKelvie, Abigail Koay, and Michael Mayo

BRDF Anisotropy Criterion . . . 434
Michal Haindl and Vojtěch Havlíček

Clustering Analysis Applied to NDVI Maps to Delimit Management Zones for Grain Crops . . . 445
Aliya Nugumanova, Almasbek Maulit, and Maxim Sutula

Features of Hand-Drawn Spirals for Recognition of Parkinson’s Disease . . . 458
Krzysztof Wrobel, Rafal Doroz, Piotr Porwik, Tomasz Orczyk, Agnieszka Betkowska Cavalcante, and Monika Grajzer

FASENet: A Two-Stream Fall Detection and Activity Monitoring Model Using Pose Keypoints and Squeeze-and-Excitation Networks . . . 470
Jessie James P. Suarez, Nathaniel S. Orillaza Jr., and Prospero C. Naval Jr.

Pre-processing of CT Images of the Lungs . . . 484
Talshyn Sarsembayeva, Madina Mansurova, Adai Shomanov, Magzhan Sarsembayev, Symbat Sagyzbayeva, and Gassyrbek Rakhimzhanov

Innovations in Intelligent Systems

Application of Hyperledger Blockchain to Reduce Information Asymmetries in the Used Car Market . . . 495
Chien-Wen Shen, Agnieszka Maria Koziel, and Chieh Wen

Excess-Mass and Mass-Volume Quality Measures Susceptibility to Intrusion Detection System’s Data Dimensionality . . . 509
Arkadiusz Warzyński, Łukasz Falas, and Patryk Schauer

A Homomorphic Encryption Approach for Privacy-Preserving Deep Learning in Digital Health Care Service . . . 520
Tuong Nguyen-Van, Thanh Nguyen-Van, Tien-Thinh Nguyen, Dong Bui-Huu, Quang Le-Nhat, Tran Vu Pham, and Khuong Nguyen-An

Semantic Pivoting Model for Effective Event Detection . . . 534
Hao Anran, Hui Siu Cheung, and Su Jian

Meet Your Email Sender - Hybrid Approach to Email Signature Extraction . . . 547
Jelena Graovac, Ivana Tomašević, and Gordana Pavlović-Lažetić

Performance of Packet Delivery Ratio for Varying Vehicles Speeds on Highway Scenario in C-V2X Mode 4 . . . 559
Teguh Indra Bayu, Yung-Fa Huang, and Jeang-Kuo Chen

Extensions of the Diffie-Hellman Key Agreement Protocol Based on Exponential and Logarithmic Functions . . . 569
Zbigniew Lipiński and Jolanta Mizera-Pietraszko

Music Industry Trend Forecasting Based on MusicBrainz Metadata . . . 582
Marek Kopel and Damian Kreisich

AntiPhiMBS-TRN: A New Anti-phishing Model to Mitigate Phishing Attacks in Mobile Banking System at Transaction Level . . . 595
Tej Narayan Thakur and Noriaki Yoshiura

Blockchain-Based Decentralized Digital Content Management and Sharing System . . . 608
Thong Bui, Tan Duy Le, Tri-Hai Nguyen, Bogdan Trawinski, Huy Tien Nguyen, and Tung Le

Analysis of Ciphertext Behaviour Using the Example of the AES Block Cipher in ECB, CBC, OFB and CFB Modes of Operation, Using Multiple Encryption . . . 621
Zhanna Alimzhanova, Dauren Nazarbayev, Aizada Ayashova, and Aktoty Kaliyeva

Vulnerability Analysis of IoT Devices to Cyberattacks Based on Naïve Bayes Classifier . . . 630
Jolanta Mizera-Pietraszko and Jolanta Tańcula

Symmetric and Asymmetric Cryptography on the Special Linear Cracovian Quasigroup . . . 643
Zbigniew Lipiński and Jolanta Mizera-Pietraszko

Multi-pass, Non-recursive Acoustic Echo Cancellation Sequential and Parallel Algorithms . . . 656
Maciej Walczyński


Effective Resource Utilization in Heterogeneous Hadoop Environment Through a Dynamic Inter-cluster and Intra-cluster Load Balancing . . . 669
Emna Hosni, Wided Chaari, Nader Kolsi, and Khaled Ghedira

Intelligent Audio Signal Processing – Do We Still Need Annotated Datasets? . . . 682
Bozena Kostek

Multimodal Approach to Measuring Cognitive Load Using Sternberg Memory and Input Diagrammatic Reasoning Tests . . . 693
Patient Zihisire Muke, Zbigniew Telec, and Bogdan Trawiński

Chaining Electronic Seals: An eIDAS Compliant Framework for Controlling SSCD . . . 714
Przemysław Błaśkiewicz and Mirosław Kutyłowski

Author Index . . . 733

Contents – Part I

Advanced Data Mining Techniques and Applications

Textual One-Pass Stream Clustering with Automated Distance Threshold Adaption . . . 3
Dennis Assenmacher and Heike Trautmann

Using GPUs to Speed Up Genetic-Fuzzy Data Mining with Evaluation on All Large Itemsets . . . 17
Chun-Hao Chen, Yu-Qi Huang, and Tzung-Pei Hong

Efficient Classification with Counterfactual Reasoning and Active Learning . . . 27
Azhar Mohammed, Dang Nguyen, Bao Duong, and Thin Nguyen

Visual Localization Based on Deep Learning - Take Southern Branch of the National Palace Museum for Example . . . 39
Chia-Hao Tu and Eric Hsueh-Chan Lu

SimCPSR: Simple Contrastive Learning for Paper Submission Recommendation System . . . 51
Duc H. Le, Tram T. Doan, Son T. Huynh, and Binh T. Nguyen

Frequent Closed Subgraph Mining: A Multi-thread Approach . . . 64
Lam B. Q. Nguyen, Ngoc-Thao Le, Hung Son Nguyen, Tri Pham, and Bay Vo

Decision Support and Control Systems

Complement Naive Bayes Classifier for Sentiment Analysis of Internet Movie Database . . . 81
Christine Dewi and Rung-Ching Chen

Portfolio Investments in the Forex Market . . . 94
Przemysław Juszczuk and Jan Kozak

Detecting True and Declarative Facial Emotions by Changes in Nonlinear Dynamics of Eye Movements . . . 106
Albert Śledzianowski, Jerzy P. Nowacki, Andrzej W. Przybyszewski, and Krzysztof Urbanowicz


Impact of Radiomap Interpolation on Accuracy of Fingerprinting Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 Juraj Machaj and Peter Brida Rough Set Rules (RSR) Predominantly Based on Cognitive Tests Can Predict Alzheimer’s Related Dementia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 Andrzej W. Przybyszewski, Kamila Bojakowska, Jerzy P. Nowacki, Aldona Drabik, and BIOCARD Study Team Experiments with Solving Mountain Car Problem Using State Discretization and Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 Amelia B˘adic˘a, Costin B˘adic˘a, Mirjana Ivanovi´c, and Doina Logof˘atu A Stable Method for Detecting Driver Maneuvers Using a Rule Classifier . . . . . 156 Piotr Porwik, Tomasz Orczyk, and Rafal Doroz Deep Learning Models Using Deep Transformer Based Models to Predict Ozone Levels . . . . . . . . . . . . . 169 Manuel Méndez, Carlos Montero, and Manuel Núñez An Ensemble Based Deep Learning Framework to Detect and Deceive XSS and SQL Injection Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 Waleed Bin Shahid, Baber Aslam, Haider Abbas, Hammad Afzal, and Imran Rashid An Image Pixel Interval Power (IPIP) Method Using Deep Learning Classification Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 Abdulaziz Anorboev, Javokhir Musaev, Jeongkyu Hong, Ngoc Thanh Nguyen, and Dosam Hwang Meta-learning and Personalization Layer in Federated Learning . . . . . . . . . . . . . . 209 Bao-Long Nguyen, Tat Cuong Cao, and Bac Le ETop3PPE: EPOCh’s Top-Three Prediction Probability Ensemble Method for Deep Learning Classification Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
222 Javokhir Musaev, Abdulaziz Anorboev, Huyen Trang Phan, and Dosam Hwang Embedding Model with Attention over Convolution Kernels and Dynamic Mapping Matrix for Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 Thanh Le, Nam Le, and Bac Le Employing Generative Adversarial Network in COVID-19 Diagnosis . . . . . . . . . 247 Jakub Dere´n and Michał Wo´zniak


SDG-Meter: A Deep Learning Based Tool for Automatic Text Classification of the Sustainable Development Goals . . . . . . . . . . . . . . . . . . . . . . . 259 Jade Eva Guisiano, Raja Chiky, and Jonathas De Mello The Combination of Background Subtraction and Convolutional Neural Network for Product Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272 Tin-Trung Thai, Synh Viet-Uyen Ha, Thong Duy-Minh Nguyen, Huy Dinh-Anh Le, Nhat Minh Chung, Quang Qui-Vinh Nguyen, and Vuong Ai-Nguyen Strategy and Feasibility Study for the Construction of High Resolution Images Adversarial Against Convolutional Neural Networks . . . . . . . . . . . . . . . . . 285 Franck Leprévost, Ali Osman Topal, Elmir Avdusinovic, and Raluca Chitic Using Deep Learning to Detect Anomalies in Traffic Flow . . . . . . . . . . . . . . . . . . . 299 Manuel Méndez, Alfredo Ibias, and Manuel Núñez A Deep Convolution Generative Adversarial Network for the Production of Images of Human Faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313 Noha Nekamiche, Chahnez Zakaria, Sarra Bouchareb, and Kamel Smaïli ECG Signal Classification Using Recurrence Plot-Based Approach and Deep Learning for Arrhythmia Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327 Niken Prasasti Martono, Toru Nishiguchi, and Hayato Ohwada Internet of Things and Sensor Networks Collaborative Intrusion Detection System for Internet of Things Using Distributed Ledger Technology: A Survey on Challenges and Opportunities . . . . 339 Aulia Arif Wardana, Grzegorz Kołaczek, and Parman Sukarno An Implementation of Depth-First and Breadth-First Search Algorithms for Tip Selection in IOTA Distributed Ledger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351 Andras Ferenczi and Costin B˘adic˘a Locally Differentially Private Quantile Summary Aggregation in Wireless Sensor Networks . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364 Aishah Aseeri and Rui Zhang XLMRQA: Open-Domain Question Answering on Vietnamese Wikipedia-Based Textual Knowledge Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377 Kiet Van Nguyen, Phong Nguyen-Thuan Do, Nhat Duy Nguyen, Tin Van Huynh, Anh Gia-Tuan Nguyen, and Ngan Luu-Thuy Nguyen


On Verified Automated Reasoning in Propositional Logic . . . . . . . . . . . . . . . . . . . 390 Simon Tobias Lund and Jørgen Villadsen Embedding and Integrating Literals to the HypER Model for Link Prediction on Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403 Thanh Le, Tuan Tran, and Bac Le A Semantic-Based Approach for Keyphrase Extraction from Vietnamese Documents Using Thematic Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416 Linh Viet Le and Tho Thi Ngoc Le Mixed Multi-relational Representation Learning for Low-Dimensional Knowledge Graph Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428 Thanh Le, Chi Tran, and Bac Le Learning to Map the GDPR to Logic Representation on DAPRECO-KB . . . . . . . 442 Minh-Phuong Nguyen, Thi-Thu-Trang Nguyen, Vu Tran, Ha-Thanh Nguyen, Le-Minh Nguyen, and Ken Satoh Semantic Relationship-Based Image Retrieval Using KD-Tree Structure . . . . . . . 455 Nguyen Thi Dinh, Thanh The Van, and Thanh Manh Le Preliminary Study on Video Codec Optimization Using VMAF . . . . . . . . . . . . . . 469 Syed Uddin, Mikołaj Leszczuk, and Michal Grega Semantic-Based Image Retrieval Using RS -Tree and Knowledge Graph . . . . . . . 481 Le Thi Vinh Thanh, Thanh The Van, and Thanh Manh Le An Extension of Reciprocal Logic for Trust Reasoning: A Case Study in PKI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496 Sameera Basit and Yuichi Goto Common Graph Representation of Different XBRL Taxonomies . . . . . . . . . . . . . 507 Artur Basiura, Leszek Kotulski, and Dominik Ziembi´nski Natural Language Processing Development of CRF and CTC Based End-To-End Kazakh Speech Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
519 Dina Oralbekova, Orken Mamyrbayev, Mohamed Othman, Keylan Alimhan, Bagashar Zhumazhanov, and Bulbul Nuranbayeva


A Survey of Abstractive Text Summarization Utilising Pretrained Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532 Ayesha Ayub Syed, Ford Lumban Gaol, Alfred Boediman, Tokuro Matsuo, and Widodo Budiharto A Combination of BERT and Transformer for Vietnamese Spelling Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545 Trung Hieu Ngo, Ham Duong Tran, Tin Huynh, and Kiem Hoang Enhancing Vietnamese Question Generation with Reinforcement Learning . . . . 559 Nguyen Vu and Kiet Van Nguyen A Practical Method for Occupational Skills Detection in Vietnamese Job Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571 Viet-Trung Tran, Hai-Nam Cao, and Tuan-Dung Cao Neural Inverse Text Normalization with Numerical Recognition for Low Resource Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582 Tuan Anh Phan, Ngoc Dung Nguyen, Huong Le Thanh, and Khac-Hoai Nam Bui Detecting Spam Reviews on Vietnamese E-Commerce Websites . . . . . . . . . . . . . 595 Co Van Dinh, Son T. Luu, and Anh Gia-Tuan Nguyen v3MFND: A Deep Multi-domain Multimodal Fake News Detection Model for Vietnamese . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608 Cam-Van Nguyen Thi, Thanh-Toan Vuong, Duc-Trong Le, and Quang-Thuy Ha Social Networks and Recommender Systems Fast and Accurate Evaluation of Collaborative Filtering Recommendation Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623 Nikolaos Polatidis, Stelios Kapetanakis, Elias Pimenidis, and Yannis Manolopoulos Improvement Graph Convolution Collaborative Filtering with Weighted Addition Input . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635 Tin T. Tran and Václav Snasel Combining User Specific and Global News Features for Neural News Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648 Cuong Manh Nguyen, Ngo Xuan Bach, and Tu Minh Phuong


Polarization in Personalized Recommendations: Balancing Safety and Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661 Zakaria El-Moutaouakkil, Mohamed Lechiakh, and Alexandre Maurer Social Multi-role Discovering with Hypergraph Embedding for Location-Based Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675 Minh Tam Pham, Thanh Dat Hoang, Minh Hieu Nguyen, Viet Hung Vu, Thanh Trung Huynh, and Quyet Thang Huynh Multimedia Application for Analyzing Interdisciplinary Scientific Collaboration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 688 Veslava Osi´nska, Konrad Por˛eba, Grzegorz Osi´nski, and Brett Buttliere CORDIS Partner Matching Algorithm for Recommender Systems . . . . . . . . . . . . 701 Dariusz Król, Zuzanna Zborowska, Paweł Ropa, and Łukasz Kincel Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717

Machine Learning and Data Mining

Machine Learning or Lexicon Based Sentiment Analysis Techniques on Social Media Posts David L. John and Bela Stantic(B) School of Information and Communication Technology, Griffith University, Brisbane, Australia [email protected], [email protected] Abstract. Social media provides an accessible and effective platform for individuals to offer thoughts and opinions across a wide range of interest areas. It also provides a great opportunity for researchers and businesses to understand and analyse a large volume of online data for decision-making purposes. Opinions on social media platforms, such as Twitter, can be very important for many industries due to the wide variety of topics and large volume of data available. However, extracting and analysing this data can prove to be very challenging due to its diversity and complexity. Recent methods of sentiment analysis of social media content rely on Natural Language Processing techniques on a fundamental sentiment lexicon, as well as machine learning oriented techniques. In this work, we evaluate representatives of different sentiment analysis methods, make recommendations and discuss advantages and disadvantages. Specifically we look into: 1) variation of VADER, a lexicon based method; 2) a machine learning neural network based method; and 3) a Sentiment Classifier using Word Sense Disambiguation, Maximum Entropy and Naive Bayes Classifiers. The results indicate that there is a significant correlation among all three sentiment analysis methods, which demonstrates their ability to accurately determine the sentiment of social media posts. Additionally, the modified version of VADER, a lexicon based method, is considered to be the most accurate and most appropriate method to use for the semantic analysis of social media posts, based on its strong correlation and low computational time. Obtained findings and recommendations can be valuable for researchers working on sentiment analysis techniques for large data sets. 
Keywords: Sentiment analysis · Social media · Natural language processing · Machine learning

1 Introduction

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 3–12, 2022. https://doi.org/10.1007/978-3-031-21967-2_1

Sentiment analysis, also referred to as opinion mining, is the field of study which focuses on the analysis and quantification of people’s sentiments, opinions, attitudes, emotions and appraisals. It employs natural language processing techniques to determine whether data, in the form of written text, is positive, negative, or neutral. Sentiment analysis is often used to analyse people’s feelings
towards certain entities and their attributes, such as businesses, services, products, organisations, individuals, issues, topics or events [8,9].

Most sentiment classification tasks rely largely on a fundamental sentiment lexicon: a group of lexical features, i.e., words or terms, “which are generally labelled according to their semantic orientation as either positive or negative” [6]. These are known as opinion words, where positive opinion words are used to express desired states while negative opinion words are used to express undesired states. Examples of positive opinion words include “beautiful”, “good”, “wonderful”, “joy”, “great”, “excellent” and “amazing”; examples of negative opinion words include “bad”, “awful”, “poor”, “terrible”, “stop”, “evil” and “sad”. In addition to individual words, there are also opinion phrases or idioms, for example “back against the wall” or “on cloud nine”. Individually, none of the words in these (and other similar) idioms are particularly positive or negative opinion words, but as a whole these phrases can be perceived to be positive or negative. These, collectively, are referred to as the opinion lexicon and are a vital and instrumental part of sentiment analysis [5,6,10].

Opinion words can further be categorised into base-type and comparative-type opinion words; all of the examples listed above fall into the category of base-type opinion words. Comparative-type opinion words differ in that they are used to express a comparative opinion in relation to a particular topic, object or statement. Examples include “better”, “worse”, “best” and “worst”, which are comparative forms of their base terms, “good” and “bad”. In other words, comparative-type opinion words do not directly express an absolute opinion or sentiment on an object, but only a comparative opinion or sentiment among two or more objects. For example, “A is better than B”.
Here, this sentence does not express an opinion of whether A or B is good or bad; it only states that, in comparison to B, A is better, or, in comparison to A, B is worse. Therefore, even though a positive or negative classification can be given to comparative opinion words (based on whether they represent a desirable or undesirable state), they cannot be used in the same way as base-type opinion words.

To date, the vast majority of sentiment analysis has been carried out on written text, with the intent of predicting the sentiment of a given written statement. In this context, sentiment analysis can be considered a primary research field of Natural Language Processing (NLP) which, with the aid of machine learning techniques, aims to identify and extract certain insights from written text. It should be noted, however, that in recent years there has been an increased focus on multimodal sentiment analysis, specifically in the form of images, videos, audio and textual data, to classify and quantify emotion and sentiment [9].

Although opinion words and phrases (the opinion lexicon) are essential for sentiment analysis, they are not without their shortcomings. Completely accurate sentiment analysis is a much more complicated problem and, given this, several shortcomings and challenges are highlighted below.

1. Orientation and Context: A positive or negative opinion word may have opposite orientations when applied to different domains or contexts within a sentence. Here, orientation is whether an opinion word should be


considered as positive, negative or neutral. For example, the term “lazy” usually would indicate a negative sentiment, i.e. “He is so lazy”. However, in the domain of tourism, “lazy” can be considered a positive term when referring to a relaxing environment, desirable for a holiday. Another example is “The Great Barrier Reef”. In general, the word “great” would usually indicate a positive sentiment but, given it is the name of the reef, it should be considered neutral in this context. Therefore, orientations of opinion words can be domain or context specific, which gives rise to many complications and issues when aiming to provide accurate sentiment analysis.

2. Sentences containing opinion words which don’t express any sentiment: This phenomenon can happen in several types of sentences, mainly involving questions and conditional sentences. For example, “Are the new Apple headphones any good?” and “If I can find a good pair of headphones I will buy them.” These both contain the positive opinion word “good” but neither of them expresses positive sentiment towards any specific pair of headphones. However, this is not to say that all conditional sentences or questions express no sentiment. For example, “Does anybody know where I can get these terrible headphones repaired?” does express a negative sentiment towards a pair of headphones.

3. Sarcasm and irony: The common use of sarcasm and irony in social media posts presents substantial issues for accurate sentiment analysis. These sentences can be hard to analyse regardless of whether they contain opinion words or not, for example “What a great pair of headphones, they stopped working after one day!”. Sarcasm and irony are not as common in consumer reviews or in economic contexts but are common amongst political discussions, which makes analysis of political opinions difficult to accurately quantify [9].

4. Statements with no opinion words: Arguably, the most significant challenge when using the opinion lexicon involves sentences which contain no opinion words but can imply a positive or negative sentiment. For example, “This car uses a lot of fuel” implies a negative sentiment towards the car as it uses a lot of resources (and therefore would cost the driver more money per week). Sentences such as these are usually objective sentences, as they express some factual information. Another example is, “After wearing these shoes for only three days, they already have a hole in the bottom”, which expresses a negative (and factual) opinion about the quality of the shoes. Both of these examples contain no opinion words but nevertheless still express some sort of sentiment.

All these shortcomings present significant challenges when attempting to obtain an accurate analysis of sentiment from textual data [9]. In this work, in order to give guidelines to researchers, we identify and evaluate typical representatives of different sentiment analysis methods, test their performance with regard to both accuracy and time complexity, make recommendations, and discuss advantages and disadvantages.
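The gap between counting opinion words and actually understanding sentiment can be made concrete with a minimal lexicon-based scorer. The word lists and function below are purely illustrative (they are not any method evaluated in this paper), but they reproduce shortcomings 2 and 4 above:

```python
# Minimal illustration of a bare opinion-lexicon scorer. The word lists are
# the example opinion words from the text; the scoring rule is a toy.
POSITIVE = {"beautiful", "good", "wonderful", "joy", "great", "excellent", "amazing"}
NEGATIVE = {"bad", "awful", "poor", "terrible", "stop", "evil", "sad"}

def naive_lexicon_score(text: str) -> float:
    """Difference of positive and negative word counts, per token."""
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / max(len(tokens), 1)

# Works for a plain opinion statement...
print(naive_lexicon_score("The headphones are excellent"))
# ...but also labels a question as positive (shortcoming 2)...
print(naive_lexicon_score("Are the new Apple headphones any good?"))
# ...and scores a factual complaint with no opinion words as neutral
# (shortcoming 4).
print(naive_lexicon_score("This car uses a lot of fuel"))
```

The last two calls show why context-aware methods, rather than raw word counting, are needed.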

2 Literature Review

Sentiment analysis techniques have been widely used to analyse social networks, from determining the public’s opinions towards certain topics, issues or events to aiding businesses and organisations in improving their services, and they have proven to be an important and useful tool. The following reviews the effectiveness of prevailing methods by which social media content may be semantically analysed, and also provides a comparative review of some of these methods.

The paper by [4] proposed a Sentiment Classifier, a semantic lexicon based method for analysing sentiment, which is implemented in [7]. This approach utilises a Word Sense Disambiguation (WSD) sentiment classifier that uses the SentiWordNet [3] database, which calculates positive and negative scores of a post based on the positivity and negativity of the individual words in that post. The sentiment analysis results present a pair of values, indicating positive and negative sentiment scores, of the document-based scores for individual words. The larger of these two values is used as the sentiment value for the post, i.e., if a post has a positive score of 0.7 and a negative score of 0.3, the sentiment score for this post will be 0.7.

To address the problem of sentiment analysis during critical events such as natural disasters or social movements, [12] applied machine learning based sentiment analysis to Twitter. Here, Bayesian network classifiers were used to perform sentiment analysis on two data sets: the 2010 Chilean earthquake and the 2017 Catalan independence referendum. The effectiveness of this approach was evident when it compared favourably with support vector machines (SVM) and random decision forests. The resulting networks also allowed for the identification of relationships amongst words, thus offering some useful, qualitative information to assist in understanding the main social and historical features of the events being analysed.
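The max-of-pair rule described above can be sketched as follows (the helper function is ours, for illustration only, not code from [4] or [7]):

```python
def classify(pos_score: float, neg_score: float):
    """Pick the larger of the two document-level scores as the post's
    sentiment, as in the example above where a positive score of 0.7 and a
    negative score of 0.3 yield a sentiment value of 0.7."""
    if pos_score >= neg_score:
        return "positive", pos_score
    return "negative", neg_score

print(classify(0.7, 0.3))
print(classify(0.2, 0.5))
```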
Another study, by [13], employed advanced machine learning sentiment analysis techniques on Twitter data to investigate user experiences and potentially changing environmental conditions at the Great Barrier Reef. The Twitter API was used to extract only those tweets which had been geo-tagged around the Great Barrier Reef geographic area. Before implementing large scale analytics on the extracted Twitter data, development of algorithms and training of models were carried out on human-annotated posts. The tweets were then classified based on their polarity into positive and negative sentiment categories, with the results revealing a relatively high accuracy when compared to the annotated data, confirming the suitability of the developed algorithms for future implementation.

A comparison of various sentiment analysis techniques was carried out by [6]. They presented the sentiment analysis tool VADER (Valence Aware Dictionary for sEntiment Reasoning) and compared its effectiveness for Twitter data analysis against eleven typical state-of-the-art benchmarks. These included sentiment analysis tools such as ANEW, LIWC and SentiWordNet, machine learning oriented techniques relying on Naïve Bayes Classifiers, the General Inquirer, Support Vector Machine (SVM) algorithms and Maximum Entropy. Using a parsimonious


rule-based model to assess the sentiment of tweets, VADER was shown to deliver the best results when compared to these state-of-the-art sentiment analysis tools. VADER was also successfully applied to Twitter data relating to the tourism industry [1], where the authors noted that, for tourism specific results, most other sentiment analysis methods perform better in classifying positive sentences than negative or neutral sentences. Additionally, VADER was also found to be more cost effective and efficient, which is another advantage when analysing the sentiment of millions of Twitter posts. A lexicon based method for the Chinese language, which takes the length of the post into consideration, was also proposed [5].

In this paper, the accuracy of three different sentiment analysis techniques is investigated by comparing: 1) a sentiment analysis tool calculated using proprietary methodology from the Big Data and Smart Analytics lab at Griffith University, which is built on top of the sentiment analysis tool VADER as described by [6]; 2) a machine learning technique using a neural network as implemented in [13]; and 3) a Sentiment Classifier [7] using Word Sense Disambiguation, Maximum Entropy and Naive Bayes Classifiers.
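As a rough illustration of what such a parsimonious rule-based model looks like, the snippet below applies a few VADER-style heuristics (intensifier boosting, negation flipping, exclamation emphasis) to a toy lexicon. All valences, multipliers and word lists here are invented for illustration; the real VADER tool uses a human-validated lexicon of over 7,000 terms and considerably more elaborate rules:

```python
# Toy sketch of VADER-style heuristics [6]; numbers and word lists are
# illustrative only, not VADER's actual lexicon or constants.
LEXICON = {"great": 3.1, "good": 1.9, "terrible": -2.1, "bad": -2.5}
INTENSIFIERS = {"very": 0.3, "extremely": 0.4}
NEGATORS = {"not", "never"}

def toy_vader(text: str) -> float:
    tokens = text.lower().split()
    total = 0.0
    for i, tok in enumerate(tokens):
        word = tok.strip("!?.,")
        if word not in LEXICON:
            continue
        valence = LEXICON[word]
        if i > 0:
            prev = tokens[i - 1].strip("!?.,")
            if prev in INTENSIFIERS:      # "very good" scores above "good"
                valence *= 1 + INTENSIFIERS[prev]
            elif prev in NEGATORS:        # "not good" flips polarity
                valence *= -0.74
        total += valence
    if text.count("!"):                   # punctuation adds emphasis
        total *= 1 + 0.1 * min(text.count("!"), 3)
    # squash into [-1, 1], loosely analogous to a compound-score normalisation
    return max(-1.0, min(1.0, total / 4.0))

print(toy_vader("The service was very good!"))
print(toy_vader("good service"))
print(toy_vader("The service was not good"))
```

Even this sketch shows why rule-based scoring is cheap to run at scale: each post requires only a single pass over its tokens.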

3 Methodology

The textual data source used for sentiment analysis was a collection of original public tweets (not including retweets) related to the cryptocurrency Dogecoin, posted in May 2021. Given recent events and the subsequent excitement on social media surrounding this particular cryptocurrency, this was deemed an appropriate topic for the analysis of sentiment [2]. The tweets were extracted worldwide through the public Twitter API, using code developed in the Big Data and Smart Analytics lab at Griffith University, and contained only references to the cryptocurrency Dogecoin. The filters employed on Dogecoin tweets selected those which mentioned the terms “dogecoin”, “dogearmy”, “dogecoin-Rise”, “dogeEurope” or “dogecoins”, and which had a text length larger than 150 characters, to avoid short messages that usually do not carry significant semantic content. A collection of 10,000 tweets was randomly extracted to evaluate time complexity, while 1,000 posts between 23:04:03 AEST and 23:52:30 AEST on 7 May 2021 were used for visualisation. Although the full collection consists of millions of posts, this relatively small dataset was chosen to enable visualisation.

As mentioned above, the three methods used to comparatively analyse the sentiment of these tweets were: a modified version of VADER [6]; a machine learning technique using a neural network [13]; and a Sentiment Classifier implemented in [7]. VADER is a comprehensive, lexicon and simple rule-based model for sentiment analysis, specifically designed for analysing sentiment on social media (https://github.com/cjhutto/vaderSentiment#citation-information). It combines a lexicon with a series of intensifiers, punctuation transformations and emoticons, along with some heuristics, to calculate a sentiment score for textual data. The sentiment lexicon used by VADER comprises over 7,000 terms, including their associated sentiment intensity measures, which
have been validated by humans, and is specifically adapted to sentiment in small textual contexts such as Twitter posts [6]. Specifically, in this work an in-house modified VADER which improves the time complexity was considered; the modification reduces the computational cost of accessing the lexicon.

The second method for sentiment analysis is a machine learning technique using a neural network, implemented in the study by [13], where user experiences and potentially changing environmental conditions at the Great Barrier Reef are investigated. Neural networks employing a set of algorithms designed to recognise patterns were used to label or cluster data; the patterns they recognised were subsequently translated by untangling and breaking down the complex relationships. As the name suggests, neural networks differ from other machine learning techniques in that they are modelled on, and inspired by, the neurons in the brain [11]. The neural network used here was trained using human annotation of posts relevant to tourism [13].

The third method used for comparative sentiment analysis was the semantic, lexical method implemented in [7], called the Sentiment Classifier. This method relies on a Word Sense Disambiguation (WSD) classifier using the SentiWordNet [3] database, together with Maximum Entropy and Naive Bayes Classifiers, and was trained using Twitter posts relevant to movie reviews (see footnote 2). This Sentiment Classifier calculates the positive and negative scores of a post based on the positivity and negativity of individual words. The result of the sentiment analysis is a pair of values, indicating the positive and negative sentiments of the document-based scores for individual words. The larger of these two values is used as the sentiment value for each tweet.

For each of these methods, the total value of sentiment for each post is calculated. To enable comparison of the sentiment values calculated, a common scale is required. Therefore, sentiment values are normalised so that the sentiment always lies between −1 (the lower limit for negative sentiment) and +1 (the upper limit for positive sentiment). For this normalisation, the methods proposed in [5] were used.

3.1 Comparison Between Sentiment Analysis Methods

To quantitatively determine the relations between the three sentiment analysis methods, multiple regression analysis is used to test the correlation of these three plots. The regression model used here is shown in Eq. 1:

Y_S = β_0 + Σ_{i=1}^{N} β_i X_i + t    (1)
Comparison of Different Sentiment Analysis Techniques    9

where YS represents the sentiment plot from one analysis method to which the other two sentiment plots are being compared. Therefore N = 2, and so X1 and X2 represent the plots from the remaining two sentiment analysis methods. β0 is a constant (the y-intercept), βi is the coefficient of Xi, and t is a random error component. If the p-value (probability value) is less than the chosen α level (typically p < 0.05) for any two sentiment analysis plots, this implies that a significant correlation exists between the two plots. The pair of sentiment analysis methods with the lowest p-value was considered to have the best correlation in analysing the semantic content of tweets relating to Dogecoin.

2 See https://github.com/kevincobain2000/sentiment_classifier for further explanation and the code used.
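The model in Eq. 1 can be fitted with ordinary least squares; in the sketch below, synthetic series stand in for the real sentiment plots, and two-sided p-values are obtained from a normal approximation to the t distribution (adequate for 1,000 observations).

```python
import math
import numpy as np

def multiple_regression(y, X):
    """Fit y = b0 + sum_i b_i * X_i + t by OLS; return coefficients and p-values."""
    A = np.column_stack([np.ones(len(y)), X])          # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]                          # residual degrees of freedom
    sigma2 = resid @ resid / dof                       # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    t_stats = beta / se
    # Two-sided p-values via a normal approximation to the t distribution.
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return beta, p_values

# Synthetic stand-ins for the three sentiment plots (1,000 posts each).
rng = np.random.default_rng(42)
x1 = rng.uniform(-1, 1, 1000)
x2 = rng.uniform(-1, 1, 1000)
y_s = 0.9 * x1 + 0.1 * x2 + rng.normal(0, 0.1, 1000)

beta, p = multiple_regression(y_s, np.column_stack([x1, x2]))
# p[i] < 0.05 indicates a significant correlation with the i-th regressor.
```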

4 Results and Discussion

Figure 1 shows the normalized sentiment scores obtained for a selected sample (i.e. 50) of the 1,000 Twitter posts analysed by the three different sentiment analysis methods. Visual inspection of Fig. 1 shows that the three plots frequently overlap and trend in the same direction for many of the Twitter posts.

Fig. 1. Sentiment values of Twitter posts using three sentiment analysis methods

An interesting feature of the data in Fig. 1 is the limited range and variation of normalized sentiment values for the Neural Network method compared to the other two methods (namely the Modified VADER and Sentiment Classifier methods). While the directional trends of the Neural Network data are similar to the other methods across much of the graph, the changes in its normalized sentiment values tend to be smaller and less extreme than those of the other two methods. This suggests that the Neural Network method may be less sensitive to increases and decreases in sentiment, relative to the other methods. This is further reflected in the standard deviations and the minimum/maximum sentiment values over the entire 1,000 data points for each sentiment analysis method (shown in Table 1).

10

D. L. John and B. Stantic

Table 1. Standard deviations, minimum and maximum sentiment values for the three sentiment analysis methods

Sentiment analysis method   Std. Dev.   Min.     Max.
Modified VADER              0.450       −0.972   0.972
Neural network              0.181       −0.629   0.762
Sentiment classifier        0.412       −0.922   0.922

The reason for the lower sentiment deviation of the neural network method is likely the way the network was trained. Given that the network was trained using human annotation of posts relevant to tourism (as implemented in [13]), this may explain why using this network on posts relating to Dogecoin does not give a significant spread of sentiment values. To quantitatively determine the relationships between the sentiment values of all 1,000 Twitter posts for the three sentiment analysis methods, the correlation between these variables was tested using multiple regression analysis. The regression model is shown in Eq. 1.

Table 2. Multiple regression analysis results for comparison between three sentiment analysis methods

Comparison   Coeff.   Std. Error   t        P-value
MV vs. NN    1.124    0.069        16.252   7.301E-53
MV vs. SC    0.200    0.030        6.572    7.996E-11
NN vs. SC    0.058    0.013        4.645    3.863E-06

The multiple regression analysis results are shown in Table 2. These results indicate that there is a statistically significant correlation among the sentiment values of all three sentiment analysis methods. The most significantly correlated pair is the modified version of VADER and the machine learning technique using a neural network, with a p-value of 7.301E-53. These statistically significant results of multiple regression analysis (p ≪ 0.001), accompanied by a clear visual correlation among the three plots, show that these three different sentiment analysis methods calculate similar results, providing strong evidence that all three can calculate an accurate sentiment value for textual data, based on the analysis of Twitter posts relating to Dogecoin. To determine the most beneficial method to use for sentiment analysis from a time complexity point of view, the time taken to calculate sentiment values of 10,000 posts was also considered. The modified version of VADER, despite running only on a CPU-based server (Intel Xeon CPU E5-2609 v3 @ 1.90 GHz, 8-core processor, 8234 MB memory), took only 32 s, while the Neural Network (machine learning) approach and the Sentiment Classifier approach, which needed to be run on a GPU (GeForce RTX 2080), required 1 h 2 min and over 6 h to complete, respectively. The time complexity of the Neural Network and the Sentiment Classifier was also influenced by the learning process itself. Given that the modified version of VADER has been shown to have the strongest correlation with both of the other two methods (MV vs. NN p = 7.301E-53 and MV vs. SC p = 7.996E-11, as shown in Table 2) and has the shortest computational time, it is reasonable to suggest that it is the most appropriate of the three methods for this type of analysis. Additionally, while machine learning methods and fine-tuning a domain-specific lexicon (in this case relating to Dogecoin) can provide a more accurate analysis, the difference does not appear significant enough to justify the time-consuming and costly annotation of domain-related posts [13]. Therefore, the modified version of VADER, a lexicon-based method, is considered the most appropriate method for the semantic analysis of Twitter posts because of its accuracy and low computational complexity.

5 Conclusion

Social media sentiment analysis has been shown to provide a significant opportunity for researchers and businesses to understand and analyse a large volume of online data. Accurately quantifying the semantic content of social media platforms, such as Twitter, in order to make predictions highlights the usefulness and effectiveness of this analysis technique. In this paper, a comparison of three different sentiment analysis techniques for short text messages was carried out using: 1) an adaptation and modification of the general sentiment analysis tool called VADER; 2) machine learning using a neural network; and 3) a Sentiment Classifier using Word Sense Disambiguation, Maximum Entropy and Naive Bayes classifiers. The relations between the three sentiment analysis methods, obtained by multiple regression analysis, provide evidence that these methods achieve a statistically significant correlation and accuracy in their quantification of sentiment in Twitter posts. Based on its strong correlation with the other methods and its low time complexity, it was concluded that the modified version of VADER is the most accurate and most appropriate method to use. These findings and recommendations can be valuable for researchers who need sentiment polarity analysis for short text messages and large data sets.

References

1. Alaei, A.R., Becken, S., Stantic, B.: Sentiment analysis in tourism: capitalizing on big data. J. Travel Res. 58(2), 175–191 (2019)
2. Adamczyk, A.: What's behind dogecoin's price surge - and why seemingly unrelated brands are capitalizing on its popularity (May 2021). https://www.cnbc.com/2021/05/12/dogecoin-price-surge-elon-musk-slim-jim.html
3. Baccianella, S., Esuli, A., Sebastiani, F.: SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In: LREC, vol. 10, pp. 2200–2204 (2010)
4. Cagan, T., Frank, S.L., Tsarfaty, R.: Generating subjective responses to opinionated articles in social media: an agenda-driven architecture and a Turing-like test. In: Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media, pp. 58–67. Association for Computational Linguistics, Baltimore, Maryland (June 2014). https://doi.org/10.3115/v1/W14-2708, https://aclanthology.org/W14-2708
5. Chen, J., Becken, S., Stantic, B.: Lexicon based Chinese language sentiment analysis method. Comput. Sci. Inf. Syst. 16(2), 639–655 (2019)
6. Hutto, C., Gilbert, E.: VADER: a parsimonious rule-based model for sentiment analysis of social media text. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 8 (2014)
7. Kathuria, P.: Sentiment classification using WSD, maximum entropy and naive bayes classifiers (2014). https://github.com/kevincobain2000/sentiment_classifier
8. MonkeyLearn: Sentiment analysis: the go-to guide (2021). https://monkeylearn.com/sentiment-analysis/
9. Liu, B.: Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press (2020)
10. Liu, B., et al.: Sentiment analysis and subjectivity. Handbook Nat. Lang. Process. 2(2010), 627–666 (2010)
11. Pathmind: A beginner's guide to neural networks and deep learning (2021). https://wiki.pathmind.com/neural-network
12. Ruz, G.A., Henríquez, P.A., Mascareño, A.: Sentiment analysis of Twitter data during critical events through Bayesian networks classifiers. Futur. Gener. Comput. Syst. 106, 92–104 (2020)
13. Stantic, B., Mandal, R., Chen, J.: Target sentiment and target analysis. Report to the National Environmental Science Program (2020). https://nesptropical.edu.au/wp-content/uploads/2020/02/NESPTWQ-Project5.5-TechnicalReport1.pdf

A Comparative Study of Classification and Clustering Methods from Text of Books

Barbara Probierz(B), Jan Kozak, and Anita Hrabia

Department of Machine Learning, University of Economics in Katowice, 1 Maja, 40-287 Katowice, Poland
{barbara.probierz,jan.kozak,anita.hrabia}@ue.katowice.pl

Abstract. Book collections in libraries are an important means of information, but without proper assignment of books into appropriate categories, searching for books on similar topics is very troublesome for both librarians and readers. This is a difficult problem due to the analysis of large sets of real text data, such as the content of books. For this purpose, we propose to create an appropriate model system, the use of which will allow for automatic assignment of books to appropriate categories by analyzing the text from the content of the books. Our research was tested on a database consisting of 552 documents. Each document contains the full content of the book. All books are from Project Gutenberg in the Art, Biology, Mathematics, Philosophy, or Technology category. Well-known techniques of natural language processing (NLP) were used for the proper preprocessing of the book content and for data analysis. Then, two different machine learning approaches were used: classification (as supervised learning) and clustering (as unsupervised learning) in order to properly assign books to selected categories. Measures of accuracy, precision and recall were used to evaluate the quality of classification. In our research, good classification results were obtained, even above 90% accuracy. Also, the use of clustering algorithms allowed for effective assignment of books to categories.

Keywords: Natural language processing · Book classification · Clustering · Text analysis · Machine learning

1 Introduction

Effective book organization is now very important not only for libraries and bookstores, but also for many on-line collections. Most large library collections are based on the division of books into specific categories related to the content of the books. Currently, many new books appear on the market, and assigning them to the appropriate category without knowing the content can be very problematic. Books with mislabeled categories may also not be found when searched for by category, reducing their availability to readers [28]. Machine learning techniques combined with natural language processing (NLP) methods help to easily assign books to the appropriate category.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 13–25, 2022. https://doi.org/10.1007/978-3-031-21967-2_2

14

B. Probierz et al.

Our hypothesis is that, based on the text of a book, it can be efficiently and automatically categorized into appropriate categories. In this article, we define categorization as assigning a book to one of five categories, i.e. art, biology, mathematics, philosophy and technology, regardless of the machine learning method used. The aim of this article is to develop and test an approach to automatic book categorization by analyzing the text from the content of the books. First, natural language processing (NLP) techniques were used to transform the text of a book into a decision table containing information about the occurrence of words in the text. Then, two different machine learning approaches were used: classification (as supervised learning) and clustering (as unsupervised learning), in order to properly assign books to selected categories and to find groups of thematically similar books. The algorithms used for classification were CART, Bagging, Random Forest, AdaBoost and SVM, while the K-means algorithm was used for clustering. Accuracy, precision and recall measures were used to evaluate the quality of the classification. An additional aim is to see how the quality of the approach is affected by the size of the training set and the size of the test set. The database of real, full-content books on which this approach was tested was obtained through Project Gutenberg (PG) [13] and consisted of the categories Art, Biology, Mathematics, Philosophy and Technology. We used the categories from PG as decision classes; the previously mentioned categories were chosen as examples, and all documents stored in these categories were used. It should also be added that the distribution of documents across categories is not equal, which made classification even more difficult. Analyses related to word weighting measures and word frequency were also carried out. This article is organized as follows. Sect. 1 introduces the problem addressed in this article.
Sect. 2 provides an overview of related work on book classification and clustering, in particular related to Project Gutenberg, which is described in Sect. 3. In Sect. 4, we present NLP methods for data preprocessing and word weighting measures. Sect. 5 describes classic machine learning algorithms for classification and clustering; in addition, the quality measures, i.e. accuracy, recall and precision, are described. In Sect. 6, we explain our methodology based on the use of NLP to analyze a text document and the use of machine learning algorithms for classification or clustering. In Sect. 7, we present and discuss the experimental results. Finally, in Sect. 8, we conclude with general remarks on this work and point out a few directions for future research.

2 Related Works

Research related to the analysis of text documents is very extensive and covers many aspects, e.g. understanding, translation or natural language generation by computer [12]. A very important problem in NLP is understanding the text being read and classifying documents based on it [30]. For this purpose, researchers are trying to develop a model for automatic document categorization using NLP and machine learning methods such as classification or clustering [26].

A Comparative Study of Classification and Clustering Methods from Text

15

In order to improve the efficiency of clustering text documents, a new preselection of centroids for the K-means algorithm has been proposed [18]. A combination of classification and clustering methods was used by the authors of [14] to develop a method for classifying scientific articles based on the subject of clusters. This solution aims to group scientific articles into categories in order to overcome the difficulties faced by users looking for scientific papers. In addition, several studies have proposed performing clustering along with classification using the similarity of distance measures [20,23]. Some authors [1] have focused on promoting the efficiency of grouping and classifying texts with a newly proposed measure. Other authors [11] describe an advanced classification model based on grouping documents into large sets of documents in a distributed environment. There is also research [16] in which a graph-based methodology has been proposed that uses sentence sensitivity ranking to extract key phrases from documents.

3 Project Gutenberg

Project Gutenberg1, initiated by Michael Hart in 1971, is an attempt to make electronic texts available in the public domain in an easily accessible and useful form [3]. It is an online library of free e-books. Project Gutenberg, in many ways pioneering, was the first information provider on the Internet and is the oldest digital library. As the internet became popular in the mid-nineties, the project gained momentum and an international dimension. The project promotes digitization in text format, which means that a book can be copied, indexed, searched, analyzed and compared with other books. Project Gutenberg, as a volunteer effort, aims to digitize, archive and distribute literary works. The mission of the project is to encourage all interested people to create e-books and to help in their dissemination. In addition, the project partners want to make as many e-books available as possible, in as many formats and languages as possible, so that the whole world can read them. Currently, Project Gutenberg is an online collection of over 60,000 free e-books, consisting mainly of novels, plays, and collections of poetry and short stories, but also biographies, histories, cookbooks, textbooks and magazines that are no longer copyrighted in the United States [8]. Regardless of when the texts were digitized, all books are recorded with the same formatting rules so they can be easily read on any computer, operating system or software, including mobile phones and e-book readers. Any person or organization can convert them to different formats without restriction, only respecting the copyright laws of the country concerned. Project Gutenberg also publishes books in well-known formats such as HTML, XML, and RTF. There are also Unicode files [19].

1 https://www.gutenberg.org/.

4 Natural Language Processing

Natural language processing (NLP) refers to computer systems that attempt to understand human language. For this purpose, tools and techniques are developed that allow computer systems to perform tasks involving real data expressed in natural language [9]. The input can be text or spoken language [17]. Applications of NLP cover many fields of study, such as machine translation, speech recognition, and decision making based on information in natural language. The first step of NLP is to extract sentences from a string. The second stage is the tokenization process, i.e. division into tokens [29]. It works by breaking up a text into simple parts, usually single words or short sets of consecutive words. In this case, the n-gram model is used, where a gram is a single word and n is the number of words that make up the sequence [27]. Additionally, when preprocessing real text data, a normalization process must be performed on all the words present in the text, which depends on the grammatical structure of the language of the text [21]. Popular methods of normalization in NLP are stemming and lemmatization, which transform various word forms into their basic form [21]. Stemming is a method of extracting the stem and ending of words and then replacing similar words with the same base form [2]. Lemmatization, on the other hand, aims to obtain the lemma of a word, most often by removing inflectional endings and returning the lemma [15].

4.1 Word Weighting Measures

One of the NLP methods used in the process of text analysis is its statistical model, which is based on the representation of documents as vectors of terms. In this case, a term is the vector component, while the vector values are the respective term weights. The vector representation can use one of three measures of word weighting. The first representation is the Term Frequency (TF) measure, which determines the similarity between documents by counting the words appearing in the documents [22]. In this way, the degree of affiliation of a term to the document is determined, taking into account the frequency of the term in the entire set. The TF measure is calculated using the formula:

TF_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)

where ni,j is the raw count of a term ti in the document dj and the denominator is the sum of the raw count of all nk,j terms in the document dj . The second representation is the Term Frequency - Inverse Document Frequency (TF-IDF) measure, which is used to reduce the weights of less significant words in a set of words. In addition to the term weight in relation to the complete set of documents, the TF-IDF measure also determines the corresponding term weight locally [22]. In this way, the weights of words appearing several

A Comparative Study of Classification and Clustering Methods from Text

17

times in one document are lower than the weights of words appearing in multiple documents. TF-IDF is the term frequency TF_{i,j} multiplied by the inverse document frequency IDF_i, which is expressed by the formula:

(TF-IDF)_{i,j} = TF_{i,j} × IDF_i    (2)

where the Inverse Document Frequency (IDF) is the ratio of the number of processed documents n_d to the number of documents containing at least one occurrence of the given term, |{d : t_i ∈ d}|. IDF is expressed by the formula:

IDF_i = log( n_d / |{d : t_i ∈ d}| )    (3)

The last representation in the described model is the Binary representation, which specifies the occurrence of the term in the document [4]. If the term appears in a given document, the attribute takes the value 1; if not, it takes the value 0.
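Measures (1)-(3) and the Binary representation can be computed directly; the three short word lists below are an invented toy corpus standing in for tokenized books.

```python
import math
from collections import Counter

docs = [["cat", "dog", "cat"],
        ["dog", "fish"],
        ["cat", "fish", "fish", "fish"]]
n_d = len(docs)                                    # number of documents

def tf(term, doc):
    """Eq. (1): raw count of the term divided by the total count in the document."""
    return Counter(doc)[term] / len(doc)

def idf(term):
    """Eq. (3): log of n_d over the number of documents containing the term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(n_d / df)

def tf_idf(term, doc):
    """Eq. (2): TF multiplied by IDF."""
    return tf(term, doc) * idf(term)

def binary(term, doc):
    """Binary representation: 1 if the term occurs in the document, else 0."""
    return 1 if term in doc else 0

vocab = sorted({w for doc in docs for w in doc})   # ['cat', 'dog', 'fish']
vector = [tf_idf(t, docs[0]) for t in vocab]       # TF-IDF vector of document 0
```

A word such as "cat", which occurs in two of the three documents, gets IDF log(3/2); a word occurring in every document would get IDF 0, which is exactly the down-weighting of ubiquitous words described above.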

5 Machine Learning Methods

Currently, there are many machine learning methods, but we focus on two different techniques: supervised learning and unsupervised learning. The supervised learning method (we use classification) learns an algorithm from a training set in which decision classes are defined (in our case, the categories to which we assign books), and then tests the algorithm on test data. The unsupervised learning method (we use clustering) learns an algorithm from the input data (book content) without decision information (an assigned category); the algorithm is supposed to find patterns and then extract clusters as groups of objects similar to each other.

5.1 Algorithms for Classification

Classification is one of the main tasks of machine learning. A classification model, called a classifier, assigns objects from a given set to one of the predefined classes. There are many classic machine learning classifiers that do well with the task of classifying real text data. One of the first to be developed was the Classification and Regression Trees (CART) algorithm, used to construct decision trees based on a split criterion and proposed by Breiman et al. in 1984 [6]. The main task of the split criterion is to divide the data into two parts that are as homogeneous, or as equal in size, as possible. Another group of classifiers are algorithms consisting of many classifiers, creating ensemble methods. Breiman proposed the Bagging [5] classifier because he thought a set of multiple classifiers would be better than a single CART classifier. Bagging builds the classifier multiple times on the basis of random samples created from the entire training set; the results obtained from the classifiers are then combined by means of majority voting. An improved form of bagging is the method proposed by Breiman in 2001 called Random Forest [7]. In this method, the single classifiers are


decision trees, for which the attributes considered when selecting the test for each node are randomized. In this way, each split is made on the basis of a different set of attributes. As an alternative to bagging, Schapire proposed boosting in 1990, which led to the AdaBoost [25] classifier; it combines weak classifiers into a group in order to obtain a better ensemble of classifiers. The classifiers should also include the Support Vector Machine (SVM) method proposed by Vladimir Vapnik [10], which analyzes the data by identifying patterns and then assigning them to one of two classes.

5.2 Algorithm for Clustering

Clustering is the division of a set of objects into groups (clusters) of objects. The number of clusters must be specified when running the algorithm. Objects are grouped on the basis of the similarity of their features, for which the distances between the feature vectors in the vector space are calculated. One well-known clustering algorithm is the K-means algorithm, which groups objects into k groups of equal variance by selecting centroids (the centers of the clusters) so as to minimize a criterion known as inertia, i.e. the within-cluster sum of squares [24].

5.3 Measures of the Quality

For each of the proposed methods, the quality of the classification can be assessed by calculating appropriate measures, i.e. accuracy, recall and precision. All three measures are based on the confusion matrix for multiple classes shown in Table 1. Rows correspond to the actual class and columns correspond to the predicted classes. The elements on the diagonal (TP_i for i ∈ {A, B, C, D, E}) in Table 1 show the number of correctly classified objects for each class i ∈ {A, B, C, D, E}. The remaining entries in the table show the mistakes made.

Table 1. Confusion matrix for multiple classes

Actual \ Predicted   A      B      C      D      E
A                    TP_A   E_AB   E_AC   E_AD   E_AE
B                    E_BA   TP_B   E_BC   E_BD   E_BE
C                    E_CA   E_CB   TP_C   E_CD   E_CE
D                    E_DA   E_DB   E_DC   TP_D   E_DE
E                    E_EA   E_EB   E_EC   E_ED   TP_E

The accuracy of the classification determines how many texts, of all those classified, were classified correctly. Accuracy is calculated as the sum of correct classifications (TP_i for i ∈ {A, B, C, D, E}) divided by the total number of classifications (the sum of all entries in Table 1). Unfortunately, in the case of large data sets, accuracy does not always describe the classification process well, so it is worth extending the quality assessment with the additional measures of recall and precision. Recall determines the proportion of correctly predicted positive objects (TP_i for i ∈ {A, B, C, D, E}) among all actually positive objects (TP_i + Σ_{j≠i} E_ij, i.e. the row of that class in the confusion matrix), including those that were incorrectly classified as negative. Note that recall will be 1 if the classifier misses no positive object, even if negative objects are classified as positive. For example, recall for class A is:

recall_A = TP_A / (TP_A + E_AB + E_AC + E_AD + E_AE)    (4)

Precision determines how many of the positively predicted objects (TP_i for i ∈ {A, B, C, D, E}) are actually positive (TP_i + Σ_{j≠i} E_ji, i.e. the column of that class in the confusion matrix). Precision, like accuracy and recall, should take values as close to 1 as possible. For example, precision for class A is:

precision_A = TP_A / (TP_A + E_BA + E_CA + E_DA + E_EA)    (5)
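These definitions can be verified numerically on a small hypothetical confusion matrix (three classes for brevity; rows are actual classes, columns are predicted):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
cm = np.array([[50,  3,  2],
               [ 4, 40,  6],
               [ 1,  5, 44]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions over all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # TP_i over its row, as in Eq. (4)
precision = np.diag(cm) / cm.sum(axis=0)  # TP_i over its column, as in Eq. (5)
```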

6 Proposed Approach

The proposed approach automatically assigns books to the appropriate categories based on text analysis of the books' content, applying NLP techniques and machine learning algorithms. In the first step of our approach, the content of all the books is read along with their actual categories, and then the data preprocessing described in Sect. 4 is performed. By using one of the measures (TF-IDF, TF, Binary) to analyze the content of books, we obtain word weight vectors, which we transform into a decision table. The columns in the table are words selected by the word weighting measures. Each row corresponds to one book, and the value at the intersection of a row and a column is the word's score for that book (depending on the word weights selected). The last column is the decision, in which the appropriate category for the book is written. This produces a table with 552 rows (the number of books in the collection) and approximately 30,000 features (the number of words). The decision table prepared in this way is used for supervised learning (classification) and for unsupervised learning (clustering). For supervised learning, we selected the classic classifiers described in Sect. 5.1. For this purpose, it is necessary to use a training set containing the books (their full content) and categories (serving as decision classes). For unsupervised learning, we chose the K-means algorithm described in Sect. 5.2, in which it is necessary to specify the number of centroids for which


the algorithm creates clusters. This is important because we want to match the clustered documents with the actual decision classes in order to check the algorithm's operation. After the algorithms have been trained, it is possible to predict categories for new documents: each book is simply subjected to the same NLP processing using the same word weighting measure, and the algorithm then assigns the book to one of the known categories (decision classes or clusters).
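A minimal sketch of this pipeline with scikit-learn; the six tiny "books" and their category labels are invented placeholders rather than Project Gutenberg data, and TF-IDF, Random Forest and K-means stand in for the full set of measures and algorithms.

```python
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

books = [  # stand-ins for full book contents
    "integral derivative theorem proof algebra",
    "equation matrix vector theorem proof",
    "cell organism species evolution gene",
    "gene protein cell membrane species",
    "painting sculpture canvas gallery color",
    "color brush canvas portrait gallery",
]
categories = ["Mathematics", "Mathematics", "Biology", "Biology", "Art", "Art"]

# Build the decision table: rows are books, columns are TF-IDF word weights.
X = TfidfVectorizer().fit_transform(books)

# Supervised: train a classifier on books with known categories.
clf = RandomForestClassifier(random_state=0).fit(X, categories)
pred = clf.predict(X)

# Unsupervised: group the same vectors into k = 3 clusters.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

A new book would be vectorized with the same fitted `TfidfVectorizer` (so the same vocabulary and weights are used) and passed to `clf.predict` or the fitted K-means model.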

7 Experiments

The aim of our experiments was to test whether, on the real data set (prepared by us on the basis of the books from Project Gutenberg), the proposed approach achieves good performance as measured by various measures of classification quality, such as accuracy, precision and recall. In addition, we have shown that the use of clustering algorithms allows for effective assignment of books to categories. Word weighting measures from NLP techniques were used to analyze the content of the books. The word frequency for each category was also analyzed according to the selected measure.

7.1 Experimental Design and Data Set

To carry out the experiments, we had to build a dataset consisting of the content of the books and the categories assigned to them. For this purpose, we selected 5 categories from Project Gutenberg. The categories were chosen to correspond to the main themes and currents in science. In the next step, we downloaded all the books that were in these categories. Many of the books, especially in the case of Mathematics, were written in LaTeX, requiring the documents to be preprocessed by removing content related to licensing and TeX instructions. Books containing only numbers, such as the digits of π or e, were also removed. In this way, a dataset containing a total of 552 books with a size of 171.2 MB was created. The exact categories and the numbers of books assigned to them are given in Table 2.

Table 2. Number of books in the dataset by category (decision class).

Category      Number of instances   Memory occupancy [MB]
Art           71                    21.6
Biology       66                    29.4
Mathematics   95                    28.7
Philosophy    102                   42.8
Technology    218                   48.7
The datasets prepared in this way were evaluated against three word weighting measures (TF, TF-IDF and Binary) and six machine learning algorithms (five classifiers: CART, Bagging, Random Forest, AdaBoost and SVM; plus the K-means clustering algorithm). The results obtained are described in Sect. 7.2.
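The influence of the train/test split can be examined with a loop of the following shape; the feature matrix here is random stand-in data with the class sizes of Table 2, and a single decision tree stands in for the five algorithms.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(552, 20))                        # stand-in feature vectors
y = np.repeat(np.arange(5), [71, 66, 95, 102, 218])   # class sizes as in Table 2

results = {}
for test_size in (0.9, 0.7, 0.5, 0.3, 0.1):           # splits 10-90 ... 90-10
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    results[test_size] = clf.score(X_te, y_te)        # accuracy on the test part
```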

7.2 Results of Experiments

Due to the large number of combinations of test conditions, the results were aggregated. In this section, they are broken down into two parts: results for the word weighting measures, and data set analysis. In fact, experiments were performed with 75 different settings of the proposed approach for classification and 3 settings for clustering. For the three word weighting measures, we checked 5 classifiers and one clustering algorithm. In addition, we checked the impact of the division into training and test data on the classification results, where the splits were as follows: 10–90, 30–70, 50–50, 70–30 and 90–10. The best results were obtained with the 90–10 split, where 90% is the training set and 10% is the test set, and these are presented in this section.

Table 3. Evaluation of quality of classification depending on the word weighting measures and algorithms.

Accuracy    CART   Bagging   Random Forest   AdaBoost   SVM
TF          0.68   0.88      0.86            0.75       0.80
TF-IDF      0.79   0.88      0.93            0.73       0.91
Binary      0.70   0.91      0.84            0.68       0.86

Recall      CART   Bagging   Random Forest   AdaBoost   SVM
TF          0.63   0.83      0.79            0.74       0.67
TF-IDF      0.75   0.83      0.88            0.66       0.87
Binary      0.63   0.87      0.76            0.58       0.78

Precision   CART   Bagging   Random Forest   AdaBoost   SVM
TF          0.61   0.87      0.87            0.76       0.91
TF-IDF      0.76   0.83      0.90            0.70       0.88
Binary      0.69   0.97      0.92            0.70       0.92
Results Related to the Word Weighting Measures: Table 3 shows the results of the word weighting measures for the selected classification algorithms. For the classifiers, we used a training set containing 90% of all books and a test set containing the remaining 10%. This is a model case, which allows us to present the differences between specific word weighting measures. The results are broken down by classification quality measure. In terms of accuracy, the best measure is TF-IDF: only for the Bagging and AdaBoost algorithms are slightly better results obtained with the other measures, and the average values are TF: 0.794, TF-IDF: 0.848, Binary: 0.798. The situation is analogous for recall, where the mean values for TF, TF-IDF and Binary are 0.732, 0.798 and 0.724, respectively. On the other hand,

B. Probierz et al.

in the case of precision, the best results are obtained for Binary; the average values are TF: 0.804, TF-IDF: 0.814, Binary: 0.840. Thus, TF-IDF is a measure whose application allows for consistently good results. With the Bagging, Random Forest and SVM algorithms, it allows prediction with an accuracy of around 90%; for Random Forest and SVM, recall and precision are also close to 90%. The results for the K-means algorithm are slightly different. We know that there are actually 5 classes in our set, so we set the number of centroids to 5. Thus, after running the algorithm, each of the 552 documents is assigned to one of 5 groups (clusters). However, more informative results are obtained by comparing the clusters produced by the K-means algorithm with the actual decision classes (see Fig. 1), where rows correspond to the actual classes of the books and columns to the clusters.
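This cluster-versus-class comparison amounts to a contingency table plus a summary score. The sketch below is an illustrative pure-Python version (all names are ours; the clustering itself can be performed with any K-means implementation, e.g., the scikit-learn one cited in [24]):

```python
def contingency(true_labels, cluster_ids, n_classes, n_clusters):
    """Cross-tabulate actual classes (rows) against cluster ids (columns),
    as in the comparison of Fig. 1."""
    table = [[0] * n_clusters for _ in range(n_classes)]
    for c, k in zip(true_labels, cluster_ids):
        table[c][k] += 1
    return table

def purity(table):
    """Fraction of documents that fall into their cluster's majority class:
    a simple measure of how well clusters match the decision classes."""
    total = sum(sum(row) for row in table)
    return sum(max(col) for col in zip(*table)) / total
```

A purity close to 1 means the unsupervised grouping reproduces the actual categories.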

Fig. 1. Results of assigning books to categories for the word weighting measure with K-means algorithm.

As shown in Fig. 1, for the TF measure the books were assigned to only two clusters. For the Binary measure, only the math books were assigned to one cluster, and most books were assigned to another single cluster (the last one in Fig. 1). For the TF-IDF measure, the documents can be clearly grouped using unsupervised learning methods, and the obtained results are close to reality; additionally, as with the Binary measure, many books were assigned to the last cluster. Data Analysis: Data analysis was also carried out to investigate the problem. In Fig. 2 we can see the relationship between the category of a book and its size (calculated as memory occupancy). Although books from the Technology category are the most numerous, their total memory usage is very similar to that of the Philosophy category. When calculating the average memory occupancy per book, it can be seen that the Biology and Philosophy categories occupy more memory than Art, Mathematics and, above all, Technology. Word frequency was also analysed according to the chosen measure. Sample graphs for the first 500 words in the Biology category are presented; the results are similar for the other categories. As can be seen in Fig. 3, for the TF and TF-IDF measures there is a narrow group of words that stands out from the rest. For the TF-IDF


Fig. 2. Number of instances and memory occupancy by category.

measure, the differences between the values are larger, but the characteristics are similar to those of the TF measure. In the case of the Binary measure, where only the information about whether a given word exists is recorded, the obtained frequencies are very similar, even within the range of the first 500 words.

Fig. 3. Crucial tokens based on TF, TF-IDF, Binary weight for Biology (top 500).

8 Conclusions

The purpose of this article was to develop a method for automatically classifying books based on their content. Three word weighting measures based on natural language processing were selected for content analysis, and machine learning algorithms, both supervised and unsupervised, were used. Accuracy, precision and recall were selected to assess the quality of the classification. By combining methods from various fields and preprocessing complex data, we wanted to show a scheme that can be used to classify texts (not only books) into relevant categories. The observations resulting from the conducted experiments allowed us to confirm the hypothesis put forward by the authors. On the basis of the experiments carried out, it was confirmed that applying NLP to predict a book's category from an analysis of its content makes it possible to achieve an accuracy, precision and recall of about 90%. This is achieved when using the TF-IDF measure and the Random Forest algorithm. However, it should be noted that higher precision can be achieved with the Binary measure and the Bagging algorithm. Additionally, in the case of the TF-IDF measure, it is possible to obtain similar results using both the classification and the clustering algorithms. In the case of book clustering, good results were obtained by


combining the K-means algorithm (with the number of clusters corresponding to the actual number of decision classes) with the TF-IDF measure. The resulting clusters largely correspond to the actual categories of the books. In the future, it is worthwhile to investigate carefully the use of different measures depending on the size of the training and test sets. The range of words sufficient to classify books should also be investigated; the analyses indicate that this should be possible even for a small group of selected words. Additionally, it would be worth investigating how memory occupancy affects the performance of the algorithms. From the observations made, it appears that the size of a book may affect the classification results, since the number of words determines the number of attributes.

References

1. Amer, A.A., Abdalla, H.I.: A set theory based similarity measure for text clustering and classification. J. Big Data 7(1), 1–43 (2020). https://doi.org/10.1186/s40537-020-00344-3
2. Amirhosseini, M.H., Kazemian, H.: Automating the process of identifying the preferred representational system in neuro linguistic programming using natural language processing. Cogn. Process. 20(2), 175–193 (2019)
3. Bean, R.: The use of Project Gutenberg and hexagram statistics to help solve famous unsolved ciphers. In: Proceedings of the 3rd International Conference on Historical Cryptology HistoCrypt 2020, pp. 31–35. No. 171. Linköping University Electronic Press (2020)
4. Bedekar, P.P., Bhide, S.R.: Optimum coordination of directional overcurrent relays using the hybrid GA-NLP approach. IEEE Trans. Power Delivery 26(1), 109–119 (2010)
5. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
6. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman & Hall, New York (1984)
7. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
8. Brooke, J., Hammond, A., Hirst, G.: GutenTag: an NLP-driven tool for digital humanities research in the Project Gutenberg corpus. In: Proceedings of the Fourth Workshop on Computational Linguistics for Literature, pp. 42–47 (2015)
9. Chowdhury, G.G.: Natural language processing. Ann. Rev. Inf. Sci. Technol. 37(1), 51–89 (2003)
10. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
11. Devi, S.A., Kumar, S.S.: A hybrid document features extraction with clustering based classification framework on large document sets. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 11(7) (2020)
12. Eichstaedt, J.C., et al.: Closed- and open-vocabulary approaches to text analysis: a review, quantitative comparison, and recommendations. Psychol. Methods 26(4), 398 (2021)
13. Hart, M.: Project Gutenberg literary archive foundation (1971)
14. Jalal, A.A., Ali, B.H.: Text documents clustering using data mining techniques. Int. J. Electr. Comput. Eng. 11(1), 664–670 (2021)
15. Jivani, A.G., et al.: A comparative study of stemming algorithms. Int. J. Comp. Tech. Appl. 2(6), 1930–1938 (2011)


16. Kannan, G., Nagarajan, R.: Text document clustering using statistical integrated graph based sentence sensitivity ranking algorithm. In: IOP Conference Series: Materials Science and Engineering, vol. 1070, p. 012069. IOP Publishing (2021)
17. Kent, A., Williams, J.G.: Encyclopedia of Computer Science and Technology: Volume 27, Supplement 12: Artificial Intelligence and ADA to Systems Integration: Concepts, Methods, and Tools. CRC Press (2021)
18. Lakshmi, R., Baskar, S.: DIC-DOC-K-means: dissimilarity-based initial centroid selection for document clustering using K-means for improving the effectiveness of text document clustering. J. Inf. Sci. 45(6), 818–832 (2019)
19. Lebert, M.: Le Projet Gutenberg (1971–2008). Project Gutenberg (2008)
20. Lin, Y.S., Jiang, J.Y., Lee, S.J.: A similarity measure for text classification and clustering. IEEE Trans. Knowl. Data Eng. 26(7), 1575–1590 (2013)
21. Lovins, J.B.: Development of a stemming algorithm. Mech. Transl. Comput. Linguistics 11(1–2), 22–31 (1968)
22. Luhn, H.P.: A statistical approach to mechanized encoding and searching of literary information. IBM J. Res. Dev. 1(4), 309–317 (1957)
23. Oghbaie, M., Mohammadi Zanjireh, M.: Pairwise document similarity measure based on present term set. J. Big Data 5(1), 1–23 (2018). https://doi.org/10.1186/s40537-018-0163-2
24. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
25. Schapire, R.E.: The strength of weak learnability. Mach. Learn. 5, 197–227 (1990)
26. Selivanova, I., Kosyakov, D., Dubovitskii, D., Guskov, A.: Expert, journal, and automatic classification of full texts and annotations of scientific articles. Autom. Docum. Math. Lingu. 55(4), 178–189 (2021)
27. Wang, K., Thrasher, C., Viegas, E., Li, X., Hsu, B.J.P.: An overview of Microsoft Web N-gram corpus and applications. In: Proceedings of the NAACL HLT 2010 Demonstration Session, pp. 45–48 (2010)
28. Wanigasooriya, A., Silva, W.P.D.: Automated text classification of library books into the Dewey Decimal Classification (DDC) (2021)
29. Webster, J.J., Kit, C.: Tokenization as the initial phase in NLP. In: COLING 1992 Volume 4: The 15th International Conference on Computational Linguistics (1992)
30. Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 7370–7377 (2019)

A Lightweight and Efficient GA-Based Model-Agnostic Feature Selection Scheme for Time Series Forecasting

Minh Hieu Nguyen1, Viet Huy Nguyen1, Thanh Trung Huynh2, Thanh Hung Nguyen1, Quoc Viet Hung Nguyen2, and Phi Le Nguyen1(B)

1 School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
{hieu.nm2052511m,huy.nv184120}@sis.hust.edu.vn, {hungnt,lenp}@soict.hust.edu.vn
2 Griffith University, Brisbane, Australia
{h.thanhtrung,henry.nguyen}@griffith.edu.au

Abstract. Time series prediction, which uses historical data of multiple features to predict the future values of features of interest, is widely used in many fields. One of the critical issues in a time series prediction task is how to choose appropriate input features. This paper proposes a novel approach to select a sub-optimal feature combination automatically. Our proposed method is model-agnostic and can be integrated with any prediction model. The basic idea is to use a Genetic Algorithm to discover a near-optimal feature combination; the fitness of a solution is calculated based on the accuracy obtained from the prediction model. In addition, to reduce the time complexity, we introduce a strategy to generate the training data used in the fitness calculation. The proposed strategy aims to satisfy two objectives at the same time: minimizing the amount of training data, thereby saving the model's training time, and ensuring the diversity of the data to guarantee the prediction accuracy. The experimental results show that our proposed GA-based feature selection method can improve the prediction accuracy by an average of 28.32% compared to other existing approaches. Moreover, by using the proposed training data generation strategy we can shorten the time complexity by 25.67% to 85.34%, while the prediction accuracy is degraded by only 2.97% on average.

Keywords: Time series prediction · Feature selection · Genetic algorithm

1 Introduction

Time series prediction is a critical problem applied in many fields, including weather forecasting, environmental indicator prediction, and stock forecasting. The common formulation of such problems is to use the historical data of many features to predict the value of one or more features of interest in the future. For example,

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 26–39, 2022.
https://doi.org/10.1007/978-3-031-21967-2_3

Efficient GA-Based Model-Agnostic Feature Selection for Time Series

one can use PM2.5 data from the past, along with additional supporting features (such as CO2, SO2, etc.), to forecast the PM2.5 level in the following few days. In the past, time series prediction was often handled by complicated simulation models. In [5], the authors used a distributed hydrological model to simulate rainfall-runoff and forecast real-time river flow in the River Uruguay basin, Brazil. The authors in [18] implemented a real-time air quality forecasting solution leveraging the MM5-CMAQ-EMIMO air quality modeling system. An online coupled meteorology and chemistry model was used to predict air quality in [22]. The drawback of these methods is the need for specialist expertise to build and use the models. Moreover, the created models may not adapt to environmental dynamics. Recently, the data-driven approach has emerged as a potential alternative. The main idea is to leverage collected data to train a machine learning model. This way, the model can capture implicit characteristics of the data and exploit the learned knowledge to forecast future trends. Among the most common machine learning models for time series prediction are recurrent neural networks (RNNs), including LSTM [9]. In [20], the authors used LSTM with PM2.5 and other gas concentrations as input features to predict air quality indicators in Taiwan. Liu et al. [12] combined LSTM and KNN for real-time flood forecasting in China. Some researchers also proposed deep models that utilize both spatial information and temporal data for flood prediction [17]. In [15], the authors used GRU and CNN networks to predict the water level at eight o'clock on each of the five days ahead. Other architectures, such as graph neural networks [16] and Extreme Learning Machines [19], were also exploited to enhance the prediction accuracy.
The performance (i.e., prediction accuracy and time complexity) of a data-driven solution is determined by two main factors: the input features and the prediction model. Although many works have been devoted to constructing prediction models, feature selection has received much less attention. In practice, the number of features in a prediction problem is usually very large. For example, in a PM2.5 prediction task, the number of available features is large, and using all of them may significantly increase the training time [13]. Besides, unrelated input features may confuse the model and thus degrade the prediction accuracy. As a result, choosing appropriate input features is essential for enhancing accuracy and reducing training time. Traditional approaches often use filtering methods, which rely on the correlation between the input features and the feature of interest (i.e., the output feature) [8]. However, such approaches can only capture linear relationships between features and miss non-linear correlations. Later embedded approaches attempt to determine optimal features by integrating a feature selection module [6,7]. However, in this approach, all features still must pass through the model, increasing the model size due to the growing number of weights. By contrast, without feeding all features into the prediction model, wrapper methods can learn non-linear relationships by evaluating feature combinations using the accuracy of the prediction model [10,14]. Although this approach can provide near-optimal feature combinations, it suffers from a significant time complexity.

M. H. Nguyen et al.

Motivated by the aforementioned observations, in this study we propose a novel feature selection framework that allows us to quickly select a near-optimal feature combination and train the prediction model using the selected features. Our approach is model-agnostic and can be applied with any prediction model. The key ideas behind our approach are as follows. First, we seek a sub-optimal input feature combination using a genetic algorithm (GA). In our GA-based feature selection algorithm, each individual represents a feature combination whose fitness is determined by the accuracy of the prediction model. Second, we propose a training data generation strategy in which the fitness of each individual is calculated on only a subset of the original training dataset. The sub-dataset varies over the generations and is chosen to guarantee diversity. This way, we can shorten the time for calculating the fitness (i.e., training the prediction model) while preserving the prediction model's accuracy. The main contributions of our paper are as follows.

– We propose a model-agnostic feature selection framework that can dynamically determine the input features for time series prediction problems.
– We present a GA-based method to search for a near-optimal feature combination.
– We design a training data generation strategy that helps to reduce the training time while ensuring the accuracy of the prediction model.
– We perform extensive experiments on a real dataset to evaluate the effectiveness of the proposed method.

The remainder of the paper is organized as follows. We briefly introduce the related works in Sect. 2. Section 3 formulates the targeted problem and presents our proposed approach. We report our experimental results in Sect. 4 and conclude the paper in Sect. 5.

2 Related Works

2.1 Feature Selection Methods

Feature selection methods are generally divided into filter methods, wrapper methods, and embedded methods [11]. Filter methods resemble a data pre-processing step: they assess the importance of input features using criteria such as information-based and statistics-based measures. However, different criteria usually embody different assumptions about the correlation between features. Therefore, in [8], A. U. Haq et al. combine multiple feature-ranking methods to remove redundant features. While filter methods use the statistical properties of the training data directly to evaluate features, embedded methods perform feature selection as part of the model construction process. In [6], the authors propose a novel attention-based feature selection to find the optimal correlation between hidden features and the supervision target. Han et al. proposed an embedded feature selection method that combines group lasso tasks and an auto-encoder [7]. By extracting both the linear and non-linear information among features, the proposal can select the optimal features.


The wrapper methods rely on the predictive accuracy of a predefined learning algorithm to evaluate the quality of the selected features. Many elaborate algorithms, such as evolutionary and swarm intelligence algorithms, apply wrapper methods for feature selection [3,21].

2.2 GA-Based Feature Selection

Some existing works leverage GA to select feature combinations; in these approaches, a fitness function is used to evaluate the goodness of the selected features [10,14]. In [10], the authors proposed an approach called IGDFS, which utilizes a GA-based wrapper method for feature selection. The fitness is defined by the classification accuracy produced by machine learning models. Moreover, to reduce the computing complexity, only the top k most important features, ranked by information gain, are retained before going through the GA-based wrapper method. However, it is hard to determine what k should be, especially when the number of features is very large. Besides, removing redundant features before the wrapper step may discard valuable feature combinations. In [14], to identify an optimal feature subset for a credit risk assessment problem, the authors used neural networks to evaluate the fitness of individuals in the GA wrapper. They also used a filter method to reduce the number of input features before the GA-based wrapper phase. Although some existing works exploit GA to select input features, most of them focus on classification problems. Moreover, they all require training the model on the full data to calculate the fitness, leading to a huge time complexity.

3 GA-Based Model-Agnostic Feature Selection

This section proposes a novel framework to automatically determine optimal input features and train the prediction model with the selected features. We first formulate the targeted problem in Sect. 3.1. We then describe the overview of our approach in Sect. 3.2. Sections 3.3 and 3.4 present the details of our GA-based feature selection and the training data generator.

3.1 Problem Formulation

We are given a dataset consisting of l supporting features and k features of interest, denoted as X_1, ..., X_l and X_{l+1}, ..., X_{l+k}, respectively. We focus on a time series prediction task which takes the information of the features at m previous timesteps and produces the values of the features of interest at the following n timesteps. Such a prediction task can be represented as follows.

Input: x_i, x_{i+1}, ..., x_{i+m-1}

Output: ỹ_{i+m}, ỹ_{i+m+1}, ..., ỹ_{i+m+n-1} = argmax_{y_{i+m}, y_{i+m+1}, ..., y_{i+m+n-1}} p(y_{i+m}, y_{i+m+1}, ..., y_{i+m+n-1} | x_i, x_{i+1}, ..., x_{i+m-1})


Fig. 1. Overview of the proposed training mechanism

where x_{i+j} (j = 0, ..., m−1) is the input vector at the j-th timestep, and y_{i+m+k} (k = 0, ..., n−1) is a vector representing the values of the features of interest at the (i+m+k)-th timestep. Suppose that f is the prediction model. Our objective is to determine a combination of features among the l supporting features to feed into f so as to achieve sub-optimal accuracy.

3.2 Overview
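As background for the pipeline described next, the (m input, n output) samples from the problem formulation can be built from a raw multivariate series with a simple sliding window. The sketch below is illustrative (all names are ours):

```python
def make_windows(series, m, n, target_idx):
    """Split a multivariate series (list of per-timestep feature vectors)
    into (m input steps, n output steps) samples.

    target_idx lists the column indices of the features of interest,
    so each output step keeps only those columns.
    """
    X, Y = [], []
    for i in range(len(series) - m - n + 1):
        X.append(series[i:i + m])                       # x_i ... x_{i+m-1}
        Y.append([[row[j] for j in target_idx]          # y_{i+m} ... y_{i+m+n-1}
                  for row in series[i + m:i + m + n]])
    return X, Y
```

The prediction model f is then trained on the pairs (X, Y).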

Figure 1 presents the overview of our proposed method, which comprises a GA-based feature selector, a training data generator, and a prediction model. Each individual in the GA-based feature selector represents a feature combination; the accuracy of the prediction model when utilizing the corresponding feature combination therefore determines an individual's goodness. The GA-based method runs numerous iterations (i.e., generations), which perform genetic operations (i.e., crossover and mutation) to produce new individuals. In each generation, the individuals with better fitness remain while the others are removed from the population. By repeating this process, the fitness of the population improves over the generations. Finally, the individual with the best fitness value is selected. Although the GA-based feature selector may identify a sub-optimal solution, it suffers from a considerable time complexity, as we need to train the prediction model to calculate the fitness of every individual. To this end, the training data generator lessens the fitness calculation time by reducing the size of the training data for each individual. A challenge is how to reduce the training data while assuring the prediction accuracy. Our idea is to choose a sub-dataset that has the least overlap with the previously trained data. The details of the GA-based feature selector and the training data generator are described in Sects. 3.3 and 3.4, respectively.

3.3 GA-Based Feature Selector

A GA-based method typically consists of two main components: individual encoding, and genetic operation definition. In our GA-based feature selector,


Algorithm 1: GA-based feature selection
Input: population size (N), crossover probability (pc), mutation probability (pm), max generations (genmax), size of one sub-dataset (s), original training data (originalTrainingData), testing data (testingData).
Output: sub-optimal feature combination, prediction model trained with the selected feature combination.

Pop = Initialization();
m = 0;
while m < genmax do
    p = random(0, 1);
    if p < pc then
        PopC = crossover(); Pop.append(PopC);
    if p < pm then
        PopM = mutation(); Pop.append(PopM);
    /* Fitness calculation */
    for each indi in Pop do
        d_{m+1} = TrainingDataGenerator(originalTrainingData, s, indi, m + 1);
        predictionModel.train(indi, d_{m+1});
        fitness = predictionModel.test(testingData);
    Pop = selection(Pop);
A* = argmax_{A in Pop} fitness(A);
f* = predictionModel.train(A*);
Return: A*, f*

each individual is a binary string of length l that encodes a combination of the supporting features. Specifically, the i-th gene takes the value 1 or 0, indicating whether the feature X_i is selected. A genetic algorithm adopts a fitness value to represent how "good" an individual is. Let A = {a_1, ..., a_l} be an individual; we define the fitness of A as the accuracy achieved by the prediction model when using the features selected by A. In each generation, we determine all the newly generated individuals. Then, for each such individual, we use the training data generator to create a training dataset and train the prediction model. The accuracy obtained from the model is used as the individual's fitness. The initial population is generated randomly. The crossover and mutation algorithms work as follows. Let A = {a_1, a_2, ..., a_l} and B = {b_1, b_2, ..., b_l} be two parents to be crossed. We randomly select m genes of A and B and swap them to produce two offspring, where m is a random number in the range from 1 to l/2. Moreover, to retain good features, we propose a heuristic algorithm for selecting the crossover parents as follows. Let pc be the crossover probability and N be the population size; then, we choose N × pc of the N individuals to crossover. The first (N × pc)/2 of them are the individuals with the best fitness values; the other (N × pc)/2 parents are randomly chosen among the remaining N − (N × pc)/2 individuals.
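The encoding, crossover, and parent-selection heuristic above (together with the mutation operator described next in the text) can be sketched as follows. This is an illustrative implementation with names of our own choosing; the authors' exact implementation is not given in the paper:

```python
import random

def crossover(a, b, rng=random):
    """Swap m randomly chosen gene positions (1 <= m <= l/2) between
    two parent bit-strings, producing two offspring."""
    l = len(a)
    m = rng.randint(1, max(1, l // 2))
    idx = rng.sample(range(l), m)
    c, d = a[:], b[:]
    for i in idx:
        c[i], d[i] = b[i], a[i]
    return c, d

def mutate(a, rng=random):
    """Invert the bits of a randomly chosen gene segment."""
    l = len(a)
    i, j = sorted(rng.sample(range(l), 2))
    return a[:i] + [1 - g for g in a[i:j + 1]] + a[j + 1:]

def pick_parents(pop, fitness, pc, rng=random):
    """Heuristic parent selection: half of the N*pc parents are the fittest
    individuals, the rest are drawn at random from the remainder."""
    n_parents = int(len(pop) * pc)
    ranked = sorted(range(len(pop)), key=lambda i: fitness[i], reverse=True)
    best = ranked[:n_parents // 2]
    rest = rng.sample(ranked[n_parents // 2:], n_parents - n_parents // 2)
    return [pop[i] for i in best + rest]
```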


For mutation, we randomly select a gene segment of a parent and invert the values of its genes (i.e., change bit 0 to 1, and vice versa). To increase the diversity of the individuals, we use two selection operators: a best-individual selection operator and a random selection operator. Specifically, 50% of the best individuals are selected for the next generation, and the remaining 50% are chosen randomly.

3.4 Training Data Generator

In this section, we describe our strategy to generate the data for calculating the fitness of an individual in a generation. Recall that the fitness is the accuracy of the prediction model when using the features selected by the individual. The intuition behind our approach is that we try to reduce the size of the training data for every individual while guaranteeing its diversity. Specifically, at every generation, the training data for each individual is a fixed-length subset of the original data. The subset is selected so that data that has been trained on more often in previous generations is assigned a lower priority in the current generation. Before going into details, we define some notation. We denote by S the length of the original training dataset, and by s the length of a sub-dataset used to calculate the fitness of one individual. Note that s is a hyperparameter that trades off the training time against the prediction model's accuracy. In general, the greater s, the higher the accuracy but the larger the training time. The impact of s will be investigated in Sect. 4. The main purpose of the training data generator is to reduce the running time of the framework while ensuring high prediction accuracy. To decrease the running time, a naive way is to divide the original training dataset into fixed sub-datasets and use each of them to calculate the fitness of one individual. However, this method leaves many data segments that are never trained on (see Fig. 2(a)). As a result, the temporal relationship of the data is not fully learned, thereby degrading the final model's accuracy. We therefore propose a novel method to generate the sub-datasets as follows. For each individual A, we store all sub-datasets that have been used to calculate the fitness of A. Specifically, let us denote by D_A^m the set of all sub-datasets that have been used to train the model for calculating the fitness of A up to the m-th generation. Suppose D_A^m = {d_1, ..., d_m}, where d_i is the sub-dataset used in the i-th generation. We select the sub-dataset for A in the (m+1)-th generation as follows. For each timestep t, with t ranging from 1 to S − s + 1, we determine a sub-dataset of length s starting from timestep t, denoted d_{m+1}^t. For each d_{m+1}^t, we define its overlap degree with D_A^m as the total length of the segments shared by d_{m+1}^t and the d_i (i = 1, ..., m). Finally, we choose the sub-dataset d_{m+1}^t whose overlap degree with D_A^m is minimal. Figure 3 illustrates our algorithm, and Algorithm 2 gives the pseudo-code.
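The minimal-overlap selection rule can be sketched as follows. This is an illustrative, 0-indexed variant (the paper indexes timesteps from 1), and all names are ours:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two index ranges [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def next_subdataset(S, s, used_starts):
    """Pick the start index t of the length-s window in [0, S - s] whose
    total overlap with the previously used windows is minimal.

    used_starts: start indices of the sub-datasets d_1, ..., d_m already
    used to compute this individual's fitness.
    """
    best_t, best_ov = 0, float("inf")
    for t in range(S - s + 1):
        ov = sum(overlap(t, t + s, u, u + s) for u in used_starts)
        if ov < best_ov:
            best_t, best_ov = t, ov
    return best_t
```

With S = 10 and s = 4, a window already used at index 0 pushes the next selection to index 4, so successive generations gradually cover the whole series.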

4 Performance Evaluation

In this section, we evaluate the effectiveness of our proposed method concerning two metrics: training time and prediction accuracy. The prediction accuracy


Algorithm 2: Training data generator
Input: original training data (originalTrainingData), size of one sub-dataset (s), individual that needs to calculate the fitness (indi).
Output: sub-dataset for calculating the fitness of individual indi.

D = getUsedDatasets(indi); // get all the sub-datasets used to train indi
S = getLength(originalTrainingData);
m + 1 = getCurrentGeneration();
for t in range(1, S - s + 1) do
    d_{m+1}^t = originalTrainingData[t, t + s − 1];
    for d_i in D do
        overlap_t += overlapSegment(d_{m+1}^t, d_i);
t* = argmin_{t in (1, S−s+1)} overlap_t;
return d_{m+1}^{t*};

Fig. 2. Comparison between the fixed division strategy and our proposal.

Fig. 3. Illustration of sub-dataset selection strategy. The selected sub-dataset is the one whose total overlapped segments to the learned sub-datasets is the minimum.

is evaluated by MAE (Mean Absolute Error); the lower the MAE, the more accurate the prediction result. We aim to answer the following two research questions.

1. How much does our GA-based feature selection method improve the prediction accuracy compared to existing approaches?

34

M. H. Nguyen et al.

Table 1. Evaluation datasets extracted from the Hanoi dataset

  Data set  Date range
  #1        2016/01/01–2016/06/01
  #2        2016/01/01–2017/01/01
  #3        2016/01/01–2017/06/01
  #4        2016/01/01–2018/01/01
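The nested datasets of Table 1 can be produced by simple date-based slicing; a sketch under an assumed record layout of (date, row) pairs:

```python
from datetime import date

# Cut-off dates matching Table 1's date ranges
CUTOFFS = [date(2016, 6, 1), date(2017, 1, 1),
           date(2017, 6, 1), date(2018, 1, 1)]

def nested_datasets(records):
    """records: (date, row) pairs sorted by date.
    Returns datasets #1..#4; each is a prefix of the next."""
    return [[r for r in records if r[0] < cut] for cut in CUTOFFS]
```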

2. How much can our training data generation strategy shorten the training time while preserving the prediction accuracy? We first describe the experimental settings in Sect. 4.1, then answer the questions through the experimental findings in Sects. 4.2 and 4.3. 4.1

Evaluation Settings

We take the PM2.5 prediction task as a case study. We use the air quality dataset collected from an air quality monitoring station in Hanoi, Vietnam, from January 2016 to January 2018 [1]. Besides PM2.5, the dataset comprises 15 other supporting features, including temperature, CO, NO, NO2, NOx, O3, PM10, RH, SO2, etc. From the original dataset, we generate the four datasets used in the experiments as follows. We extract three sub-datasets, #1, #2, and #3, which contain the data of the first half-year, the first year, and the first one and a half years, respectively (see Table 1). Dataset #4 is exactly the same as the original Hanoi dataset. For prediction models, we select three baselines, namely Linear Regression (LR), XGBoost [4], and LSTM [9]. The reason for choosing these models is that they are representative of the three most popular approaches: regression, ensemble learning, and deep learning. 4.2

Impact of GA-Based Feature Selector

This section evaluates the prediction accuracy gained when using the features selected by our proposed GA-based feature selector. Specifically, we compare the performance of our proposal with four other feature selection methods. The first is to use all possible features. The second is to use only the targeted feature, namely PM2.5. The third is to use the embedded feature selection method of the Random Forest algorithm [2], and the last is to choose features based on the Spearman correlation with the targeted feature. The results are shown in Fig. 4. Overall, the line charts show that our proposed GA-based feature selector produces the best results among the compared feature selection methods, regardless of which model is used. Moreover, our proposal achieves the lowest MAE on all four datasets. In particular, the proposed feature selection approach improves the accuracy by 6.9% up to 66.88%, with an average of 28.32%, compared to the existing ones.
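The correlation-based baseline can be reproduced with a small rank-correlation routine. A sketch, assuming no tied values (the paper's exact implementation is not given):

```python
def ranks(xs):
    """1-based ranks; ties are not handled (sketch assumption)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1.0
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def top_features(features, target, k):
    """Rank candidate features by |Spearman correlation| with the target."""
    scores = {name: abs(spearman(vals, target))
              for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```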


Fig. 4. Comparison of the proposed GA-based feature selection and other approaches

4.3

Impact of Training Data Generator

In this section, we investigate the effectiveness of our proposed training data generation strategy. To ease the presentation, we define size as the ratio (%) of the length of one sub-dataset (which is used for calculating the fitness of an individual) to the whole training dataset's length. We vary the size and see how much the training time can be reduced. The experimental results are shown in Table 2; the columns "MAE change" and "training time change" report the performance gaps in terms of MAE and training time when using sub-datasets compared to using the full data. As can be observed, our proposal significantly reduces the training time as the size decreases from 100 down to 10: the smaller the size, the more the training time shrinks. Indeed, in Table 2(d), the LSTM model running on all training data (size = 100) of the two-year dataset takes over 66 h (240,283 s) to produce the final result, but when we set size = 50, the training time decreases by almost 40% compared to the case of size = 100. Particularly, when size = 10, using any model on any dataset, the training time could be lowered by 25.67% up to 85.34% (see Table 2(c)).

Table 2. Performance of the proposed approach with various settings of the sub-dataset's size

(a) Evaluation on dataset #1

  model    size  MAE   MAE change (%)  training time (s)  training time change (%)
  LR       10    5.37  +6.76           5920               69.31
           20    5.11  +1.59           8136               57.82
           30    5.33  +5.96           10847              43.77
           40    5.30  +5.37           11410              40.85
           50    4.97  -1.19           13863              28.13
           100   5.03  -               19289              -
  XGBoost  10    6.40  +6.49           495                70.38
           20    6.31  +4.99           837                49.91
           30    6.30  +4.83           903                45.96
           40    6.25  +3.99           984                41.11
           50    6.22  +3.49           1175               29.68
           100   6.01  -               1671               -
  LSTM     10    3.96  +6.74           17927              82.59
           20    3.94  +6.20           31961              68.96
           30    3.85  +3.77           39284              61.85
           40    3.92  +5.66           49827              51.62
           50    3.83  +3.23           60395              41.35
           100   3.71  -               102982             -

(b) Evaluation on dataset #2

  model    size  MAE   MAE change (%)  training time (s)  training time change (%)
  LR       10    5.06  +1.00           7162               74.18
           20    5.11  +2.00           11738              57.68
           30    5.00  -0.20           16273              41.34
           40    4.95  -1.20           17927              35.37
           50    4.98  -0.60           16726              39.70
           100   5.01  -               27739              -
  XGBoost  10    5.92  +7.05           823                69.68
           20    5.16  -6.99           1021               62.38
           30    5.76  +4.16           1325               51.18
           40    5.77  +4.34           1748               35.59
           50    5.62  +1.66           1942               28.45
           100   5.53  -               2714               -
  LSTM     10    3.79  +6.16           26841              84.51
           20    3.67  +2.80           42947              75.21
           30    3.71  +3.92           79372              54.19
           40    3.69  +3.36           84829              51.04
           50    3.65  +2.24           119273             31.16
           100   3.57  -               173261             -

(c) Evaluation on dataset #3

  model    size  MAE   MAE change (%)  training time (s)  training time change (%)
  LR       10    3.78  +4.71           10228              72.83
           20    3.72  +3.05           15439              58.99
           30    3.76  +4.16           17925              52.39
           40    3.71  +2.77           19057              49.38
           50    3.65  +1.11           22714              39.67
           100   3.61  -               37650              -
  XGBoost  10    3.56  +2.01           946                70.88
           20    3.63  +4.01           1264               61.10
           30    3.55  +1.72           1310               59.68
           40    3.54  +1.43           1723               46.97
           50    3.53  +1.15           2415               25.67
           100   3.49  -               3249               -
  LSTM     10    3.31  +2.48           31081              85.34
           20    3.39  +4.95           76923              63.73
           30    3.28  +1.55           91836              56.70
           40    3.35  +3.72           118291             44.22
           50    3.21  -0.62           131739             37.88
           100   3.23  -               212071             -

(d) Evaluation on dataset #4

  model    size  MAE   MAE change (%)  training time (s)  training time change (%)
  LR       10    3.93  +2.88           11851              71.22
           20    3.97  +3.93           15504              62.34
           30    3.91  +2.36           14413              64.99
           40    3.90  +2.09           18083              56.08
           50    3.85  +0.79           23071              43.97
           100   3.82  -               41173              -
  XGBoost  10    4.43  +3.75           1295               74.96
           20    4.40  +3.04           1872               63.81
           30    4.39  +2.81           2534               51.01
           40    4.18  -2.11           3129               39.50
           50    4.25  -0.47           3471               32.89
           100   4.27  -               5172               -
  LSTM     10    3.71  +6.30           49737              79.30
           20    3.63  +4.10           91996              61.71
           30    3.67  +5.16           113155             52.91
           40    3.51  +0.57           134699             43.94
           50    3.55  +1.72           141018             41.31
           100   3.49  -               240283             -

Besides, we also compare the prediction accuracy obtained with our proposed training data generation strategy against that obtained using the full data for training. As Table 2 shows, the prediction accuracy of the case size = 100, where the models are trained with the full data, is the best. However, with our proposal, the accuracy remains reliable as the size decreases, staying close to the results given by training with the full data. Specifically, our proposal increases the MAE by only 2.97% on average compared to the case of using the full data for calculating the fitness; even in the worst case, the MAE produced by our proposal increases by only 7.05%. Notably, there are some cases where our proposed method outperforms the one using the full training data. Indeed, in Table 2(b), at size = 20 with the XGBoost model, the MAE achieved by our proposal is better than that of the case size = 100 by 6.99%. In summary, while our proposed training data generation strategy shortens the running time of the framework, it keeps the prediction accuracy almost the same as when using the full data for training.

Moreover, we also study the performance of the proposed solution against the one using fixed sub-datasets. The line charts in Fig. 5 and Fig. 6 describe the accuracy achieved by the best individual over the generations. Overall, the results demonstrate that in every case the proposed solution achieves the best accuracy. Notably, Fig. 5(c) and Fig. 6(b) show the outstanding performance of the proposal over the one using fixed sub-datasets in terms of prediction accuracy. Besides, Figs. 5(b), 6(a), and 6(c) show that the proposal has a shorter convergence time: our proposal achieves the best result at the 10-th generation, while the one using fixed sub-datasets begins to converge only at the 13-th and 14-th generations.

Fig. 5. Comparison of the proposed training strategy and the one using fixed sub-datasets with size = 10.

Fig. 6. Comparison of the proposed training strategy and the one using fixed sub-datasets with size = 20.
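The metric and the two derived columns of Table 2 are simple percentage gaps relative to the full-data (size = 100) row. For instance, an MAE of 5.37 against a full-data MAE of 5.03 gives a +6.76 entry, and a training time of 5920 s against 19289 s gives a 69.31% reduction. A sketch:

```python
def mae(y_true, y_pred):
    """Mean Absolute Error, the accuracy metric used in this paper."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mae_change(mae_sub, mae_full):
    """'MAE change (%)' column: positive means worse than full data."""
    return 100.0 * (mae_sub - mae_full) / mae_full

def time_saving(time_sub, time_full):
    """'training time change (%)' column: reduction w.r.t. full data."""
    return 100.0 * (1.0 - time_sub / time_full)
```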

5

Conclusion

This paper proposed a novel feature selection framework for time series prediction problems. Specifically, we exploited a GA to search for a sub-optimal feature combination; the fitness of each individual in the GA is defined by the prediction model's accuracy. To reduce the time complexity, we introduced a training data generation strategy that reduces the training data size while ensuring the diversity of the data used for each genetic individual. We took PM2.5 prediction as the case study and evaluated the proposed approach regarding prediction accuracy and time complexity. The experimental results showed that our approach improved the accuracy by 6.9% up to 66.88%, with an average of 28.32%, compared to the existing ones. Moreover, by reducing the data size used in calculating the fitness, we can shorten the training time by 25.67% to 85.34%, while the prediction accuracy declines by only 2.97% on average.

38

M. H. Nguyen et al.

Acknowledgement. This research is funded by Hanoi University of Science and Technology under grant number T2021-PC-019.

References

1. Hanoi dataset. https://bit.ly/hanoi-pm25. Accessed Nov 2020
2. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
3. Brezočnik, L., Fister, I., Podgorelec, V.: Swarm intelligence algorithms for feature selection: a review. Appl. Sci. 8(9) (2018)
4. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system (2016)
5. Collischonn, W., Haas, R., Andreolli, I., Tucci, C.E.M.: Forecasting River Uruguay flow using rainfall forecasts from a regional weather-prediction model. J. Hydrol. 305(1), 87–98 (2005)
6. Gui, N., Ge, D., Hu, Z.: AFS: an attention-based mechanism for supervised feature selection. In: AAAI, vol. 33(01) (2019)
7. Han, K., Wang, Y., Zhang, C., Li, C., Xu, C.: Autoencoder inspired unsupervised feature selection. In: ICASSP, pp. 2941–2945. IEEE (2018)
8. Haq, A.U., Zhang, D., Peng, H., Rahman, S.U.: Combining multiple feature-ranking techniques and clustering of variables for feature selection. IEEE Access 7, 151482–151492 (2019)
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
10. Jadhav, S., He, H., Jenkins, K.: Information gain directed genetic algorithm wrapper feature selection for credit rating. Appl. Soft Comput. 69, 541–553 (2018)
11. Li, J., et al.: Feature selection: a data perspective. ACM Comput. Surv. 50, 1–45 (2016)
12. Liu, M., et al.: The applicability of LSTM-KNN model for real-time flood forecasting in different climate zones in China. Water 12(2), 440 (2020)
13. Nguyen, M.H., Le Nguyen, P., Nguyen, K., Le, V.A., Nguyen, T.H., Ji, Y.: PM2.5 prediction using genetic algorithm-based feature selection and encoder-decoder model. IEEE Access 9, 57338–57350 (2021)
14. Oreski, S., Oreski, G.: Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert Syst. Appl. 41, 2052–2064 (2014)
15. Pan, M., et al.: Water level prediction model based on GRU and CNN. IEEE Access 8, 60090–60100 (2020)
16. Qi, Y., Li, Q., Karimian, H., Liu, D.: A hybrid model for spatiotemporal forecasting of PM2.5 based on graph convolutional neural network and long short-term memory. Sci. Total Environ. 664, 1–10 (2019)
17. Hua, R., Fanga, F., Pain, C.C., Navon, I.M.: Rapid spatio-temporal flood prediction and uncertainty quantification using a deep learning method. J. Hydrol. 575, 911–920 (2019)
18. San José, R., Pérez, J.L., Morant, J.L., González, R.M.: European operational air quality forecasting system by using MM5-CMAQ-EMIMO tool. Simul. Model. Pract. Theory 16(10), 1534–1540 (2008)
19. Shiri, J., Shamshirband, S., Kisi, O.: Prediction of water-level in the Urmia Lake using the extreme learning machine approach. Water Resour. Manag. 30, 5217–5229 (2016)
20. Tsai, Y., Zeng, Y., Chang, Y.: Air pollution forecasting using RNN with LSTM. In: Proceedings of IEEE DASC/PiCom/DataCom/CyberSciTech, pp. 1074–1079 (2018)
21. Xue, B., Zhang, M., Browne, W.N., Yao, X.: A survey on evolutionary computation approaches to feature selection. IEEE Trans. Evol. Comput. 20, 606–626 (2016)
22. Yahya, K., Zhang, Y., Vukovich, J.M.: Real-time air quality forecasting over the southeastern United States using WRF/Chem-MADRID: multiple-year assessment and sensitivity studies. Atmos. Environ. 92, 318–338 (2014)

Machine Learning Approach to Predict Metastasis in Lung Cancer Based on Radiomic Features

Krzysztof Fujarewicz, Agata Wilk, Damian Borys, Andrea d'Amico, Rafał Suwiński, and Andrzej Świerniak

1 Department of Systems Biology and Engineering, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
[email protected]
2 Maria Sklodowska-Curie National Research Institute of Oncology, Gliwice Branch, Wybrzeze Armii Krajowej 15, 44-102 Gliwice, Poland

Abstract. Lung cancer is the most common cause of cancer-related death worldwide. One of the most significant negative prognostic factors is the occurrence of metastasis. Recently, a promising way to diagnose cancer samples is to use image data (PET, CT, etc.) and the so-called radiomic features calculated on the basis of these images. In this paper we present an attempt to use radiomic features to predict metastasis for lung cancer patients. We applied and compared three feature selection methods and two classification methods: logistic regression and support vector machines. The accuracy obtained by the best classifier confirms the potential of radiomic data in the prediction of metastasis in lung cancer.

Keywords: Lung cancer · Machine learning · Metastasis · Radiomic features

1

Background

Lung cancer is the most common cause of cancer-related death worldwide [3]. One of the most significant negative prognostic factors is the occurrence of metastasis, which in lung cancer is located primarily in bones, brain and liver. The most common histological subtype is non-small-cell lung carcinoma, accounting for 85% of all lung cancer cases [8]. Advanced NSCLC is more likely to metastasize, leading to severe symptoms and a decrease in overall survival. The presence of distant metastases is one of the most predictive factors of poor prognosis [12]. Distant metastases (distant cancer) refer to cancers that have spread via blood or lymphatic vessels from the original location (the primary tumor) to distant organs or lymph nodes. The main cause of cancer death is associated with metastases, which are mainly incurable. Thus, distant cancer is resistant to treatment intervention. Even though cancer researchers have made a lot of effort to understand the appearance of metastases, few preclinical studies about metastases were translated to clinical practice. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2022  N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 40–50, 2022. https://doi.org/10.1007/978-3-031-21967-2_4

Radiomics-Based Prediction of Metastasis

41

In this paper we present an attempt to use radiomic features to predict metastasis for lung cancer patients. We applied and compared three feature selection methods: based on fold change, the t-test, and the Wilcoxon test. We then check the classification potential of two classification methods: logistic regression and support vector machines. The accuracy obtained by the best classifier confirms the potential of radiomic data in the prediction of metastasis in lung cancer.

2

Materials and Methods

2.1

Data

Data were collected retrospectively at the Maria Sklodowska-Curie National Research Institute of Oncology (NIO), Gliwice Branch, from patients treated for non-small-cell lung cancer. From a larger cohort, we selected patients for whom PET/CT images had been collected as part of the routine diagnostic procedure. Patients with metastasis present at the moment of diagnosis were excluded, resulting in a total of 131 patients included in this analysis, for 38 of whom metastasis was recorded later. More detailed cohort characteristics are shown in Table 1, including: age; sex; the Zubrod score, a performance score describing the patient's general quality of life [11]; and the TNM classification, a globally recognized standard describing the site and size of the primary tumour (T), regional lymph node involvement (N), and distant metastases (M) [2]. All patients gave informed consent, and the data collection was approved by the ethical committee at NIO. 2.2

Radiomics Features

Acquired pre-treatment PET/CT images were preprocessed in order to save the data in an appropriate format for the subsequent radiomics analysis [1,4,6,7,9,10]. Manually generated regions of interest (ROIs), obtained with the automatic support of the PET image lesion segmentation tool of the MIM Software v7.0, were preprocessed in a similar way to produce output files in the .nrrd format. Radiomics features were extracted from the target lesions (described by the ROIs) using a program based on the pyRadiomics package for Python (v3.0.1), including: first-order features (energy, entropy, minimum, percentiles, maximum, mean, median, interquartile range, range, standard deviation, skewness, kurtosis, etc.); shape features (volume, surface area, sphericity, etc.); and higher-order statistics texture features, including the Gray-Level Co-occurrence Matrix (GLCM), Gray-Level Dependence Matrix (GLDM), Gray-Level Run Length Matrix (GLRLM), Gray-Level Size Zone Matrix (GLSZM), and Neighbouring Gray Tone Difference Matrix (NGTDM). Only the original images, without additional filters, were used in this analysis. This procedure gave us a total of 105 radiomic features.
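For intuition, a few of the first-order features listed above can be computed directly from the ROI voxel intensities. A stdlib sketch (pyRadiomics applies additional discretisation and conventions, so this is illustrative only):

```python
import statistics

def first_order_features(voxels):
    """A handful of first-order radiomic features of a list of voxel values."""
    n = len(voxels)
    mean = statistics.fmean(voxels)
    sd = statistics.pstdev(voxels)  # population std, as a simple choice
    return {
        "Mean": mean,
        "Median": statistics.median(voxels),
        "Energy": sum(v * v for v in voxels),
        "Range": max(voxels) - min(voxels),
        "Skewness": sum((v - mean) ** 3 for v in voxels) / (n * sd ** 3),
    }
```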

42

K. Fujarewicz et al.

Table 1. Characteristic of the patient cohort. For age, mean and range are shown. T, N, M—TNM classification at the moment of imaging: T—primary tumor size, N—spread to lymph nodes, M—distant metastases. Zubrod score—ECOG performance score. Eventual metastasis—MFS status.

  n = 131
  Age                  61.83 (43–81)
  Sex                  Male: 34 (26.0%); Female: 95 (74.0%)
  Zubrod score         0: 39 (29.8%); 1: 91 (69.5%); 2: 1 (0.8%)
  T                    T0: 4 (3.1%); T1: 13 (9.9%); T2: 31 (23.6%); T3: 43 (32.8%); T4: 40 (30.5%)
  N                    N0: 57 (43.5%); N1: 11 (8.4%); N2: 49 (37.4%); N3: 14 (10.7%)
  M                    M0: 131 (100%); M1: 0 (0%)
  Subtype              Squamous: 90 (68.7%); Large cell: 29 (22.1%); Adeno: 10 (7.6%); Other: 2 (1.6%)
  Eventual metastasis  Yes: 38 (29.0%); No: 93 (71.0%)

2.3

Classification Workflow

Validation Scenario. We used Monte Carlo cross-validation modified to account for multiple ROIs measured for a single patient. In each of 500 iterations, we created a random stratified partition, with 70% of patients assigned to the training set and 30% to the test set. Next, we reconstructed the training and test sets for individual ROIs based on patient assignments. Feature Selection. As the feature values varied considerably in orders of magnitude, we standardized all variables using z-score transformation, the parameters for which were estimated based on the training set. We employed three


selection methods belonging to the filter type: fold change, the two-sample t-test, and the Mann-Whitney (Wilcoxon rank-sum) test. Classification Methods. For selected feature numbers between 1 and 20, we performed binary classification against the occurrence of metastasis. We tested two approaches: statistical classification using a logistic regression model (LogReg), and machine-learning-based classification using a support vector machine (SVM) with a radial kernel function. We measured model performance in terms of accuracy achieved on the test set, with median values and confidence intervals calculated from the 500 cross-validation iterations. Information Leakage Avoidance. An information leak occurs in the overall validation scenario when observations from the test set are used in any way while building the classification model [13]. The risk of an information leak applies to all stages of the data analysis, including the feature selection stage. Such a leak can be a source of strong optimistic bias when assessing classification quality. It is especially visible for "small sample data", i.e., when the number of observations is small or comparable to the number of features. To avoid information leakage, in each Monte Carlo validation iteration we used the overall workflow [5] presented in Fig. 1.
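One iteration of this validation scenario, including the patient-level stratified split, the ROI-level set reconstruction, and the train-only fitting of z-score parameters that prevents the information leak, could look as follows (a sketch with a single illustrative feature "x"; names are not from the authors' code):

```python
import random
import statistics

def monte_carlo_iteration(patients, labels, rois, seed):
    """One Monte Carlo CV iteration: a stratified 70/30 split at the
    patient level, ROI-level sets reconstructed from it, and z-score
    parameters estimated on the training set only (no information leak)."""
    rng = random.Random(seed)
    train_patients = set()
    for cls in set(labels.values()):
        group = sorted(p for p in patients if labels[p] == cls)
        rng.shuffle(group)
        train_patients.update(group[: round(len(group) * 7 / 10)])
    train = [r for r in rois if r["patient"] in train_patients]
    test = [r for r in rois if r["patient"] not in train_patients]
    mu = statistics.fmean(r["x"] for r in train)
    sd = statistics.stdev(r["x"] for r in train)
    for r in train + test:  # test set transformed with train parameters
        r["z"] = (r["x"] - mu) / sd
    return train, test
```

Note that all ROIs of a given patient end up on the same side of the split, which is the point of splitting at the patient level.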

3

Feature Selection Challenges

Although in theory the classification workflow seems quite straightforward, in practice adapting the problem and data to its principles presents some major difficulties. Let us examine some of the most prominent ones emerging in this study. 3.1

Multiple ROIs from the Same Patient

As already mentioned, in cases where the lungs contained multiple tumor sites, more than one ROI was generated, resulting in between 1 and 11 (with a median value of 2) data points corresponding to a single patient. Information leakage resulting from the similarity of ROIs can relatively easily be countered; still, internal heterogeneity (Fig. 2) is another concern. Moreover, many features are highly correlated, as seen in this example. Significant variance in the ROIs, as well as the inevitable presence of outliers, makes it difficult to observe any trends related to metastatic status, for example in PCA (Fig. 3). Several approaches to this issue may be considered, the most obvious of which is retaining only one ROI, corresponding to the largest or primary tumor, from each patient (Fig. 4). Although this appears to solve the problem of outliers, the approach is burdened with significant information loss, as all but one ROI per patient are ignored.
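The "largest ROI per patient" reduction mentioned above is straightforward; a sketch in which a hypothetical "volume" field stands in for whatever size measure is used:

```python
def largest_roi_per_patient(rois):
    """rois: list of dicts with at least 'patient' and 'volume' keys.
    Keep only the largest ROI for each patient, one simple way to avoid
    multiple dependent data points per patient."""
    best = {}
    for r in rois:
        p = r["patient"]
        if p not in best or r["volume"] > best[p]["volume"]:
            best[p] = r
    return list(best.values())
```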


An optimal approach would involve utilising all ROIs, but this requires the development of a data integration strategy, which usually proves to be a major challenge. 3.2

Response Variable Type

While classification pertains specifically to categorical response variables, the situation here is much more complicated. At the moment of image acquisition, none of the patients had detected metastasis (Table 1); it only occurred at some point later, as seen in Fig. 5a. Furthermore, as is typical for studies relying on clinical observation, the follow-up is not always conclusive, resulting in right-censored data. A negative status may therefore mean several different outcomes, only one of them being an actual lack of metastasis. Looking at the censoring times of patients labeled as metastasis-negative (Fig. 5b), we can see a considerable group of individuals for whom censoring occurs right after their inclusion in the study; most likely they were transferred to a different institution for therapy. Given the categorical nature of classification, the truly negative cases are indistinguishable from the censored ones. While based solely on status the latter data points fall into the negative category, in reality they contribute virtually no information (particularly for very short censoring times) and may even introduce bias or outright error. A simple solution to this issue might be the exclusion of patients with negative status and short censoring times; the threshold must, however, be a compromise between increasing information reliability and maintaining a reasonable dataset size. A somewhat related way of adapting the task to a typical classification framework is a modification of the problem formulation, for example: Which patients will develop metastasis within a year? This way, all observations with censoring times shorter than a year are filtered out; observations where metastasis appears within a year belong to the positive class, and the rest to the negative class (Fig. 6). 3.3

Small Differences Between Classes

In addition to all the aforementioned issues, for most features the differences between classes are subtle, especially when taking into account only the standard radiomics features. Let us look at the example of the first-order feature 90Percentile, which ranked relatively high in all analyzed cases. Figure 7 depicts the value distributions for this relatively strong feature: the classes mostly overlap.

Fig. 1. Data processing workflow that prevents the information leak. Note that the data partitioning is the very first stage, and feature selection is done only on the basis of the training set.

Fig. 2. Feature values (z-score transformed) for each ROI for a single patient. Column annotation shows feature classes.

Fig. 3. Principal component analysis on z-score transformed data for all ROIs.

Fig. 4. Principal component analysis on z-score transformed data for the largest ROI of each patient.

Fig. 5. Time aspect of the response variable.


Fig. 6. Principal component analysis on z-score transformed data for the largest ROI of selected patients. Classes represent onset of metastasis within a year.

Fig. 7. Example feature value distributions

4

Results

Table 2 presents the rank order of radiomic features obtained using the three feature selection methods. The fold-change method selected different features than the two other (similar) filter methods: the t-test (parametric) and the Wilcoxon test (non-parametric). The top feature given by the fold-change method was Contrast, while the two top-ranked features selected by the t-test and the Wilcoxon test were the Mean and Median features.


Table 2. Top 10 features for each selection method based on all observations.

  Rank  FoldChange                              FC value  Ttest                                   t statistic  Wilcox                                  p-value (·10^-5)
  1     NGTDM Contrast                          0.240     FirstOrder Mean                         3.594        FirstOrder Median                       1.779
  2     GLCM ClusterShade                       -0.461    FirstOrder Median                       3.566        FirstOrder Mean                         1.942
  3     GLSZM LargeAreaLowGrayLevelEmphasis     0.702     FirstOrder RootMeanSquared              3.561        FirstOrder RootMeanSquared              2.012
  4     FirstOrder Variance                     1.380     FirstOrder 10Percentile                 3.489        FirstOrder 90Percentile                 3.056
  5     FirstOrder RobustMeanAbsoluteDeviation  1.323     FirstOrder 90Percentile                 3.452        FirstOrder Maximum                      5.749
  6     GLSZM SmallAreaLowGrayLevelEmphasis     1.322     FirstOrder Maximum                      3.287        FirstOrder RobustMeanAbsoluteDeviation  8.910
  7     FirstOrder Range                        1.308     FirstOrder Minimum                      3.042        FirstOrder 10Percentile                 9.543
  8     FirstOrder InterquartileRange           1.306     GLSZM SmallAreaLowGrayLevelEmphasis     3.022        FirstOrder Variance                     16.778
  9     FirstOrder MeanAbsoluteDeviation        1.306     GLSZM SmallAreaEmphasis                 2.985        FirstOrder MeanAbsoluteDeviation        17.805
  10    FirstOrder Maximum                      1.287     FirstOrder RobustMeanAbsoluteDeviation  2.979        FirstOrder Range                        29.947
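A filter-type ranking such as the t-test column of Table 2 can be sketched as follows. Welch's unequal-variance form is assumed here, since the paper does not specify the variant:

```python
import statistics

def t_statistic(a, b):
    """Welch two-sample t statistic (assumed variant)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / (va / na + vb / nb) ** 0.5

def rank_features(x_pos, x_neg):
    """x_pos/x_neg: dict mapping feature name -> values per class.
    Returns feature names ordered by decreasing |t|."""
    scores = {f: abs(t_statistic(x_pos[f], x_neg[f])) for f in x_pos}
    return sorted(scores, key=scores.get, reverse=True)
```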

Fig. 8. Medians of accuracy achieved using various selection and classification methods.

Figure 8 presents the accuracy obtained by logistic regression (left panel) and SVM (right panel) with the three feature selection methods mentioned above. The maximal accuracy, about 0.715, was achieved by logistic regression with one (top-ranked) feature selected by the fold-change method. The accuracies obtained with the t-test and Wilcoxon test were broadly similar, and slightly better when used together with the SVM classification method. Figure 9 presents confidence intervals estimated from the results of the 500 iterations of the MC cross-validation scenario used in our analysis. The confidence intervals are similar and overlap to a large extent, with slightly better (higher) values obtained for the fold-change selection method.

Fig. 9. Medians and 95% confidence intervals for accuracy achieved using various selection and classification methods.

5

Discussion and Future Work

In this paper we presented an attempt to use radiomic features to predict metastasis for lung cancer patients. We applied and compared three feature selection methods and two classification methods: logistic regression and support vector machines. The accuracy obtained by the best classifier confirms the potential of radiomic data in the prediction of metastasis in lung cancer. In our future work we plan to examine other, more sophisticated methods of feature selection and classification. We also plan to check the stability [14] of the obtained ranked feature lists. In addition, from a clinical point of view, an investigation of the impact of therapy on metastasis would be of particular interest. Acknowledgement. This work was supported by the Polish National Science Centre, grant number UMO-2020/37/B/ST6/01959, and Silesian University of Technology statutory research funds. Calculations were performed on the Ziemowit computer cluster in the Laboratory of Bioinformatics and Computational Biology created in the


EU Innovative Economy Programme POIG.02.01.00-00-166/08 and expanded in the POIG.02.03.01-00-040/13 project.

References

1. Aerts, H.J.W.L., et al.: Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat. Commun. 5, 4006 (2014). https://doi.org/10.1038/ncomms5006
2. Brierley, J.D., Gospodarowicz, M.K., Wittekind, C.: TNM Classification of Malignant Tumours. John Wiley & Sons (2017)
3. Cruz, C.S.D., Tanoue, L.T., Matthay, R.A.: Lung cancer: epidemiology, etiology, and prevention. Clin. Chest Med. 32(4), 605–644 (2011). https://doi.org/10.1016/j.ccm.2011.09.001
4. d'Amico, A., Borys, D., Gorczewska, I.: Radiomics and artificial intelligence for PET imaging analysis. Nucl. Med. Rev. Cent. East. Eur. 23(1), 36–39 (2020). https://doi.org/10.5603/NMR.2020.0005
5. Fujarewicz, K., et al.: Large-scale data classification system based on Galaxy server and protected from information leak. In: Nguyen, N.T., Tojo, S., Nguyen, L.M., Trawiński, B. (eds.) ACIIDS 2017. LNCS (LNAI), vol. 10192, pp. 765–773. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54430-4_73
6. Gillies, R.J., Kinahan, P.E., Hricak, H.: Radiomics: images are more than pictures, they are data. Radiology 278(2), 563–577 (2016). https://doi.org/10.1148/radiol.2015151169
7. van Griethuysen, J.J.M., et al.: Computational radiomics system to decode the radiographic phenotype. Cancer Res. 77(21), e104–e107 (2017). https://doi.org/10.1158/0008-5472.CAN-17-0339
8. Inamura, K.: Lung cancer: understanding its molecular pathology and the 2015 WHO classification. Front. Oncol. 7, 193 (2017). https://doi.org/10.3389/fonc.2017.00193
9. Kumar, V., et al.: Radiomics: the process and the challenges. Magn. Reson. Imaging 30(9), 1234–1248 (2012). https://doi.org/10.1016/j.mri.2012.06.010
10. Lambin, P., et al.: Radiomics: extracting more information from medical images using advanced feature analysis. Eur. J. Cancer 48(4), 441–446 (2012). https://doi.org/10.1016/j.ejca.2011.11.036
11. Oken, M.M., et al.: Toxicity and response criteria of the Eastern Cooperative Oncology Group. Am. J. Clin. Oncol. 5(6), 649–656 (1982). https://doi.org/10.1097/00000421-198212000-00014
12. Popper, H.H.: Progression and metastasis of lung cancer. Cancer Metastasis Rev. 35(1), 75–91 (2016). https://doi.org/10.1007/s10555-016-9618-0
13. Simon, R., Radmacher, M.D., Dobbin, K., McShane, L.M.: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl. Cancer Inst. 95(1), 14–18 (2003). https://doi.org/10.1093/jnci/95.1.14
14. Student, S., Fujarewicz, K.: Stable feature selection and classification algorithms for multiclass microarray data. Biol. Direct 7(1), 33 (2012). https://doi.org/10.1186/1745-6150-7-33

Covariance Controlled Bayesian Rose Trees 1,2(B) Damian Peszor  1

and Eryka Probierz1,2

Silesian University of Technology, Akademicka 2A, 44-100 Gliwice, Poland {damian.peszor,eryka.probierz}@polsl.pl 2 Infotower sp. z o.o., Wincentego Pola 16, 44-100 Gliwice, Poland https://www.polsl.pl, https://infotower.pl

Abstract. This paper aims to present a modified version of Bayesian Rose Trees (BRT). The classical BRT approach performs data clustering without restricting the resulting hierarchy to a binary tree. The proposed method allows for constraining the resulting hierarchies on the basis of additional knowledge. Thanks to this modification, it is possible to analyse not only the raw structure of the data but also the nature of a cluster. This allows an automatic interpretation of the resulting hierarchies while differentiating between clusters of different magnitudes, or those that extend significantly through one pair of dimensions while being coherent in a different one. On the basis of the resulting modifications, it is possible to analyse the depth level as a function of likelihood. The developed method maximises customisation possibilities and enables comparative analysis between the natures of clusters. It can be applied to the clustering of different types of content, e.g. visual or textual, or in a modern approach to the construction of container databases.

Keywords: Hierarchical clustering · Bayesian Rose Tree · Hierarchy constraining

1 Introduction

Contextual analysis of data allows better results at lower costs [1], which has led to many such applications in industry, science, and medicine, especially in artificial intelligence methods, from contextual neural networks, through expert systems and ensembles of classifiers, to decision trees [2–7]. Contextual processing is characterised by the multi-step analysis of chunks of data and evolution of the decision space after model training. Such a decision space is analysed by following the trained scan-path [5,6], with a typical length from 3 to 9 [8,9]. This can be hard to apply in the case of binary hierarchies as obtained by classical hierarchical clustering methods, which are limited by the binary structure of branches, making such chunks indistinguishable.

Supported by organization Infotower sp. z o.o., Wincentego Pola 16, 44-100 Gliwice, Poland.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 51–63, 2022. https://doi.org/10.1007/978-3-031-21967-2_5

Bayesian Rose Trees (BRT) allow one to


create hierarchical structures from the data [10], solving this problem with rose trees, in which each node has a variable number of children. Each partial tree is defined by a high marginal likelihood over the data that belong to it. This means that although BRT has higher computational complexity, it allows for a better representation of the data structure and differentiation between clusters. Among the various modifications of Bayesian Rose Trees, the evolutionary approach [11] is particularly noteworthy. The authors of that study point out that, once created, an organisation very often evolves, inducing the need to reorganise the model. For this purpose, evolutionary hierarchical clustering, derived from binary trees, was used. Tree evolutions were transferred to multibranch trees, and evolutionary patterns over time were analysed. The research was carried out using data derived from journal articles and reports to extract keywords or topics. Smoothness and efficiency were emphasised in the research conducted; each tree in a sequence should be similar in structure and content to the one at the previous time point. In addition, a good match between documents at a given point in time is also taken into account. A similar topic was analysed in another study [12], which dealt with the construction of an automatic taxonomy based on keywords that were extracted using BRT. The problem of deriving a taxonomy from a set of key phrases is an important issue because it allows for increased flexibility and ease of characterising the selected content, as well as selecting other additions to a site, such as advertisements. To create a taxonomy, however, it is not enough to have just a set of keywords; contextual information or additional knowledge is also needed. Another interesting use of BRT is presented in a paper on a causality model constructed from review text data [13]. The purpose of that investigation was to extract features that would allow for an understanding of the evaluation factors and the interrelationships between them. This approach makes it possible to create a hierarchical structure with different topic depths, which the authors further enrich with the semantics of concept labels. The results indicate that the model achieves high accuracy while its structure is close to the real one. Most of the research conducted using this method is focused on text analysis in the form of keywords or search engine queries [14]. Similarly, the authors focused on queries and their representation. This was to provide the user with better query suggestions, so that the given user could get better recommendations and increased personalisation, and thus more satisfaction with the given solution. The authors point out that tasks are usually not a flat structure but a set of interconnected subtasks, for which reason they chose BRT. This approach was also used to extend the bag-of-words model [15]. It allowed the use of hierarchical concept labelling. The authors first applied a de-noising algorithm, and then the generated labels were based on BRT to achieve high accuracy. The ongoing research mainly focuses on text analysis and the acquisition of well-fitting models of interrelationships between different semantic constructs. However, among the articles analysed and the review carried out, no publications addressed the use of BRT for automatic data analysis using additional domain constraints.

2 Algorithm

2.1 Hierarchical Clustering

To discover the hierarchical nature of a given set of data, one has to employ an algorithm that will not only cluster such data but also do so hierarchically. Such hierarchies are subject to assumptions about the way the data is organised. One such method, Bayesian Hierarchical Clustering (BHC) [16], is a common tool used to find a hierarchy in a given data set. BHC is representative of a common approach to the problem, wherein clusters at one level of the hierarchy are joined together as a higher-level cluster based on some measure of similarity. Since information about the number of clusters is not available, BHC is a bottom-up approach: it finds larger and larger clusters on its way to unifying all data as a single, all-encompassing cluster. Such an approach results in a hierarchy represented as a binary tree of clusters. An important issue arises with such a representation. Since there is a limited number of children at any given node, a very coherent cluster of very similar data points has to be represented as either a balanced tree, a cascade, or something in between. Similarly, a data set containing a few clusters that are easily distinguishable from each other, while each contains a closely related set of data points, has to be represented in the same manner. It is therefore impossible to tell, based on the resulting hierarchy, which of the tree branches correspond to coherent clusters and which correspond to loosely related data that has to be organised in the same manner due to the restrictions of the binary tree. Figure 1 shows a fragment of a cascade of nodes, a typical result of BHC. In this case, the actual relations between clusters are impossible to discern.
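The forced-binary structure can be illustrated with a minimal sketch (pure Python, not the BHC algorithm itself; the mean-distance criterion and function names are illustrative): any bottom-up scheme that joins exactly two clusters per step performs n − 1 binary merges for n points, so a coherent group and a loosely related one end up with the same kind of branching.

```python
# Sketch: bottom-up clustering that repeatedly joins the two closest
# clusters (by mean) always yields a binary merge tree with n - 1 merges.
def binary_agglomerate(points):
    clusters = [(x,) for x in points]          # one singleton per point
    merges = []
    while len(clusters) > 1:
        # pick the pair of clusters whose means are closest
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: abs(
                sum(clusters[ab[0]]) / len(clusters[ab[0]])
                - sum(clusters[ab[1]]) / len(clusters[ab[1]])
            ),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

print(len(binary_agglomerate([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])))   # 5 merges for 6 points
```

Two tight triplets and six scattered points both produce exactly five binary merges; the tree shape alone cannot reveal which branches are coherent clusters.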

Fig. 1. Fragment of the cascade of nodes typical for BHC clustering.


2.2 Bayesian Rose Trees

Bayesian Rose Trees (BRT) [10,17] is a method that solves the issue arising from the usage of binary trees by allowing multiple children for every cluster in a rose tree structure [18]. This results in a multilevel hierarchy that models similar data points as children of a node representing the cluster, while relations between clusters are represented by the entirety of the constructed tree. The resulting hierarchical representation is one of many possible options, therefore the BRT algorithm uses the likelihood ratio (Eq. 1) in order to determine which pair of clusters to merge when building the tree using the bottom-up approach.

L(T_m) = \frac{p(\mathrm{leaves}(T_m) \mid T_m)}{p(\mathrm{leaves}(T_i) \mid T_i)\, p(\mathrm{leaves}(T_j) \mid T_j)}    (1)

The likelihood ratio is composed of tree likelihoods, wherein T_m denotes the tree resulting from the merge of trees T_i and T_j. The tree likelihood is given by Eq. 2, wherein f(leaves(T)) denotes the marginal probability of a cluster containing the given set of data points, children(T) denotes the set of trees being children of a given tree T, and leaves(T) denotes the set of data points at the leaves of T. The π_T parameter corresponds to the prior probability of T being coherent rather than subdivided. It is a hyperparameter of the BRT method.

p(\mathrm{leaves}(T) \mid T) = \pi_T f(\mathrm{leaves}(T)) + (1 - \pi_T) \prod_{T_i \in \mathrm{children}(T)} p(\mathrm{leaves}(T_i) \mid T_i)    (2)
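As an illustration, the recursion of Eq. 2 and the ratio of Eq. 1 can be sketched as follows; the dict-based tree encoding, the toy marginal, and the default π_T = 0.5 are assumptions made for this example, with the real f(D) coming from the chosen probabilistic model.

```python
import math

# Tree likelihood (Eq. 2) and likelihood ratio (Eq. 1); trees are encoded
# as {"children": [...], "leaves": [...]}, and f_marginal stands in for the
# model's marginal probability f(leaves(T)).

def tree_likelihood(tree, f_marginal, pi_t=0.5):
    if not tree["children"]:                       # leaf cluster
        return f_marginal(tree["leaves"])
    product = 1.0
    for child in tree["children"]:
        product *= tree_likelihood(child, f_marginal, pi_t)
    # Eq. 2: mixture of "one coherent cluster" and "product of subtrees"
    return pi_t * f_marginal(tree["leaves"]) + (1.0 - pi_t) * product

def likelihood_ratio(merged, t_i, t_j, f_marginal, pi_t=0.5):
    # Eq. 1: evidence for the merged tree against keeping Ti and Tj apart
    return tree_likelihood(merged, f_marginal, pi_t) / (
        tree_likelihood(t_i, f_marginal, pi_t)
        * tree_likelihood(t_j, f_marginal, pi_t)
    )

def toy_marginal(leaves):
    # toy stand-in: tighter 1-D clusters get a higher marginal probability
    m = sum(leaves) / len(leaves)
    v = sum((x - m) ** 2 for x in leaves) / len(leaves)
    return math.exp(-v) * 0.5 ** len(leaves)

t_i = {"children": [], "leaves": [1.0]}
t_j = {"children": [], "leaves": [1.1]}
merged = {"children": [t_i, t_j], "leaves": [1.0, 1.1]}
print(likelihood_ratio(merged, t_i, t_j, toy_marginal))
```

With the toy marginal, two nearby points yield a ratio close to 1, so the merge is roughly as plausible as keeping the clusters apart.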

2.3 Constraining BRT Hierarchies

The added benefit of clustering multiple data points as children of a single tree node is the ability to interpret the hierarchy obtained from BRT. However, clusters at the same tree depth might vary considerably, between clusters that correspond to almost the entirety of the data and clusters that represent only a small, outlying fraction. Therefore, the structure of the data itself is not enough to understand the nature of a cluster. This is evident whenever BRT is used to obtain the structure of the data for further automatic processing. A way to define the properties of clusters under tree partitioning is needed to interpret the results automatically. This relates to another, quite common requirement. In many cases, hierarchical clustering is used to obtain information about the presence of clusters that might be subject to an overarching hierarchy, but are themselves defined by the requirements of the user. A typical such requirement is the range of possible variance of the cluster, which can easily be constructed based on domain knowledge, wherein a cluster represents a given property of the data, e.g., a period. Although BRT extends its usage to probability distributions that are not parameterised by covariance or are asymmetric, in the scope of this paper it is assumed that the underlying probability distribution is symmetric and parameterised by covariance. This approach can be extended to different probability distributions. Trees resulting from BRT are not subject to


covariance-based requirements and therefore produce results that do not necessarily correspond to the demands of the user.

Let user requirements be represented in the form of L different cluster characteristics over N-dimensional data as a hypermatrix K of order 3, such that K = [k_{i,j,l}]. K is therefore a concatenation along the third dimension [19] of covariance matrices such that k_{\cdot,\cdot,l} = \Sigma_l, wherein i \in [1 \ldots N], j \in [1 \ldots N], l \in [1 \ldots L], as represented by Eq. 3.

K = [k_{i,j,l}] = [\,\Sigma_1 \mid \Sigma_2 \mid \cdots \mid \Sigma_L\,], \quad
\Sigma_l = \begin{bmatrix}
\sigma_{1,l}^2 & \sigma_{1,2,l} & \cdots & \sigma_{1,N,l} \\
\sigma_{2,1,l} & \sigma_{2,l}^2 & \cdots & \sigma_{2,N,l} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{N,1,l} & \sigma_{N,2,l} & \cdots & \sigma_{N,l}^2
\end{bmatrix}    (3)

Note that for the common case of 1-dimensional data (N = 1), K is in fact simply a vector of variances. Given Eq. 3, let the third dimension of K be defined as a sequence of minimal requirements for a cluster characteristic to correspond to a given level of the overarching clustering hierarchy, wherein \Sigma_1 corresponds to the highest assumed level and \Sigma_L corresponds to the lowest assumed level of the hierarchy. Therefore, all clusters of level l_x incorporate clusters of level l_y : l_y > l_x and are incorporated by clusters of level l_z : l_z < l_x. Such a hierarchical approach results in constraints on the covariance matrices, so that the modules of covariances never increase as the depth level l increases, as in Eq. 4.

|k_{i,j,l-1}| \ge |k_{i,j,l}| \ge |k_{i,j,l+1}|    (4)

Let the cluster of level l be defined as a rose tree T whose leaves D = leaves(T) are such that some element of the covariance matrix of D is greater than or equal to the corresponding element of K, with l being the highest level among those which fulfil that criterion.

l(D) = \max\,(l \in [1 \ldots L]) : \exists i : i \in 1 \ldots N \wedge \exists j : j \in 1 \ldots N \wedge |\mathrm{cov}(D)_{i,j}| \ge |k_{i,j,l}|    (5)

Then the four possible merge operations defined for BRT (join, two symmetrical absorbs, and collapse), which are compared by BRT using the likelihood ratio between the possible merged cluster and existing ones (Eq. 1), can be controlled by an additional factor depending on cluster depth levels.

– Join, wherein T_m = {T_i, T_j}, is always available.
– Absorb, wherein T_m = ch(T_i) ∪ {T_j}, requires that l(T_m) = l(T_i).
– Absorb, wherein T_m = {T_i} ∪ ch(T_j), requires that l(T_m) = l(T_j).
– Collapse, wherein T_m = ch(T_i) ∪ ch(T_j), requires that l(T_m) = l(T_i) = l(T_j).
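A hedged sketch of this gating for the 1-D case: the level function is supplied by the caller (Eq. 5 defines it from covariance thresholds; the banded variant below, with its day/event-scale thresholds, is one possible concretisation, not the paper's exact definition), and each non-join operation is admitted only when the level equalities above hold.

```python
# Gating of BRT merge operations by cluster depth level (1-D sketch).

def toy_level(points, bands=(9e8, 1e6)):
    # bands: decreasing variance minima per level (cf. Eq. 4); illustrative
    m = sum(points) / len(points)
    v = sum((x - m) ** 2 for x in points) / len(points)
    for level, minimum in enumerate(bands, start=1):
        if v >= minimum:
            return level
    return len(bands) + 1          # tighter than every defined level

def allowed_merges(t_i, t_j, level):
    # t_i, t_j: trees encoded as {"children": [...], "leaves": [...]}
    merged = t_i["leaves"] + t_j["leaves"]
    same_i = level(merged) == level(t_i["leaves"])
    same_j = level(merged) == level(t_j["leaves"])
    return {
        "join": True,              # always available
        "absorb_into_i": same_i,   # Tm = ch(Ti) ∪ {Tj}
        "absorb_into_j": same_j,   # Tm = {Ti} ∪ ch(Tj)
        "collapse": same_i and same_j,
    }

t_i = {"children": [], "leaves": [0.0]}
t_j = {"children": [], "leaves": [500.0]}
print(allowed_merges(t_i, t_j, toy_level))
```

Here two points 500 s apart keep the merged variance below the lowest threshold, so all four operations remain admissible; a merge that would push the variance into a coarser band would disable the absorbs and the collapse.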


Note that for the classical BRT, k_{i,j,l} = 0, or alternatively L = 0. The proposed algorithm is, therefore, a generalisation of BRT.

2.4 Parameterisation

Consider the implications that arise from the above modifications to the BRT algorithm. Since BRT builds clusters that correspond to an underlying probability distribution, constraining the results using K actually corresponds to defining the range of parameterisations for clusters created by BRT. For single-dimensional (N = 1) data, such as timestamps of different events, the hypermatrix K consists of L variances, one for each level of depth of the resulting hierarchy. Each cluster is therefore restricted to a range of variances. Due to the restrictions on the merge operations, a cluster that approaches the maximum variance of its level (the minimum variance of the next level) will not grow further to encompass additional data if this would result in an increase of variance above the threshold. Instead, the cluster will be preserved as part of a lower-level cluster which corresponds to higher variance. This decreases the impact of the parameters used for the priors of the underlying probability distributions, as well as of π_T, which can be difficult to estimate correctly, especially in the case of varying requirements for different depth levels of the resulting hierarchy. Consider multidimensional (N > 1) data, such as a set of points in spacetime. The covariance matrices Σ contain on their diagonal the variance along each dimension. Since each depth level corresponds to minimal values for each variance, one can constrain the resulting clusters in either one or many dimensions simultaneously. One can therefore define a cluster of level x as a cluster containing data points coherent in time, while a cluster of level y : y < x acts as a supercluster that collects different time-based clusters based on the location of data points in space. Each depth level might correspond to a different dimension of the data, including repetitions, according to the needs of the user.
Covariance matrices Σ of which K is composed are, however, not restricted to their diagonals, as each k_{i,j,l} : i ≠ j corresponds to cov(D_i, D_j). This allows identifying clusters that extend significantly through a pair of dimensions while not exceeding the threshold in any single one, or those that are coherent in a pair of dimensions while being significantly elongated along a single dimension. The flexibility of the proposed approach, along with the simplicity of defining K using domain knowledge, can be considered crucial in terms of extending the usability of the BRT algorithm.

2.5 Depth Level as a Function of the Likelihood

BRT selects two clusters to merge on the basis of Eq. 1, the ratio of likelihoods defined by Eq. 2, wherein f(D) is defined by Eq. 6.

f(D) = \int f(\theta \mid \eta) \prod_{x \in D} f(x \mid \theta)\, d\theta    (6)


Note that the vector of parameters of the underlying probability distribution, θ, is marginalised out using a corresponding prior f(θ|η) with hyperparameter η. For the common case of modelling clusters using the multivariate normal distribution, a Gaussian-inverse-Wishart is used, being a conjugate prior for unknown covariance and mean. In such a case, Eq. 7 holds, wherein the prior normalisation Z is directly dependent on the posterior Λ, as in Eq. 8 [20].

f(D) = \frac{Z_n}{Z_0 (2\pi)^{nd/2}}    (7)

Z_n = \Gamma_d\!\left(\frac{\nu_n}{2}\right) 2^{\nu_n d/2} (2\pi)^{d/2} \kappa_n^{-d/2} |\Lambda_n|^{-\nu_n/2}    (8)

wherein the posterior depends on the scatter matrix S, as in Eq. 9, which in turn is the sample covariance matrix not normalised by the factor of (n − 1), as in Eq. 10.

\Lambda_n = \Lambda_0 + S + \frac{\kappa_0 n}{\kappa_0 + n} (\bar{x} - \mu_0)(\bar{x} - \mu_0)^T    (9)

S = (n - 1)\Sigma    (10)
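For the 1-D case, the posterior update of Eqs. 9–10 reduces to a few lines; the prior hyperparameters lambda0, kappa0, and mu0 below are illustrative values, not ones prescribed by the paper.

```python
# 1-D sketch of the posterior update of Eqs. 9-10 (Gaussian with a
# conjugate Normal-inverse-gamma prior, the N = 1 case of Gaussian-
# inverse-Wishart); prior hyperparameters are illustrative.

def posterior_lambda(data, lambda0=1.0, kappa0=1.0, mu0=0.0):
    n = len(data)
    xbar = sum(data) / n
    s = sum((x - xbar) ** 2 for x in data)   # Eq. 10: S = (n-1) * sample variance
    shift = kappa0 * n / (kappa0 + n) * (xbar - mu0) ** 2
    return lambda0 + s + shift               # Eq. 9

print(posterior_lambda([1.0, 2.0, 3.0]))     # 1 + 2 + 3 = 6.0
```

The returned Λ_n then feeds the normaliser Z_n of Eq. 8, and hence the marginal likelihood of Eq. 7.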

One can therefore notice that the threshold on the marginal likelihood can be calculated from the covariance matrices Σ_l of which K is composed. While this approach seems more elegant, it comes with the drawback that the thresholds k_{i,j,l} are applied in a unified form, which greatly reduces the customisation capabilities of the proposed solution.

2.6 Hierarchy Outside of Defined Clusters

As presented, the proposed solution assumes that the structure of the tree is based on the BRT approach, but the clusters are constrained by additional criteria. This results in a hierarchy that allows for additional, data-emergent clusters that aggregate two or more clusters defined by K while remaining at the same depth level, as illustrated in Fig. 2. Since each cluster is described by its depth level, this general solution does not prevent one from using only those clusters that are defined by user requirements; however, it makes it possible to notice or process additional information, which might be the main reason for selecting BRT as a clustering method. That is, one can distinguish between different branches that correspond to the same depth level but are separate due to the data being spread around two distinguishable data points. Note that since every such cluster is described by its depth level, one can process the resulting hierarchy in a way that depends on the nature of the cluster, e.g., including the internal differences between clusters of which each corresponds to a day (and thus allowing one to find trips a few days long) but ignoring clusters of a higher level l which correspond to different events during the day and are described by variances corresponding to an hour.

58

D. Peszor and E. Probierz 

Fig. 2. Part of a hierarchy that consists of many clusters of the same depth level. D denotes the depth level of a given cluster.

If, however, one does not plan on including additional structural data, it might be beneficial to construct the tree in a way that assumes that each cluster's parent is of a lower depth level and each child is of a higher depth level. In such a case, the requirements below are used to obtain a hierarchy defined by K, without additional hierarchical data. It is important to note that in such a case, wherein a join operation is unavailable, it is replaced by an absorb operation. However, the fact that the pair of clusters should be merged remains, therefore the higher likelihood of the join and the absorb operations is used as the likelihood of the latter.

– Join, wherein T_m = {T_i, T_j}, requires that l(T_m) = l(T_i) ∧ l(T_m) = l(T_j).
– Absorb, wherein T_m = ch(T_i) ∪ {T_j}, requires that l(T_m) = l(T_i). Note that in this case L(T_m) = max(L(ch(T_i) ∪ {T_j}), L({T_i, T_j})).
– Absorb, wherein T_m = {T_i} ∪ ch(T_j), requires that l(T_m) = l(T_j). Note that in this case L(T_m) = max(L({T_i} ∪ ch(T_j)), L({T_i, T_j})).
– Collapse, wherein T_m = ch(T_i) ∪ ch(T_j), requires that l(T_m) = l(T_i) = l(T_j).

3 Method Comparison

Consider a minimal example comparing the results of the different methods. A simple, one-dimensional dataset, presented in Table 1, illustrates the differences between the proposed method and the state of the art. While a more realistic dataset could be used, the resulting trees would be too large to present. The data points in this exemplary dataset represent events during a single weekend, which can be clustered based on the timestamp of the event included, e.g., in the metadata of a photo taken. None of the presented differences depend on the dimensionality of the data, the size of the dataset, or its meaning. Figure 3 represents the hierarchy obtained using BHC with the Normal-Inverse-Wishart model parameterised with g = 10000 and s = 1e−9, and with BHC parameterised by α = 0.5. Note that one could use a different set of parameters to obtain different trees. The values used were selected to present the effect of the proposed method rather than to best represent the data.


Table 1. Minimal, 1-dimensional dataset of event dates used in presented cases.

Node label | Event date                   | Timestamp
N0         | Friday, 10 Dec. 2021 20:00   | 1639166400
N1         | Saturday, 11 Dec. 2021 12:00 | 1639224000
N2         | Saturday, 11 Dec. 2021 19:00 | 1639249200
N3         | Saturday, 11 Dec. 2021 20:00 | 1639252800
N4         | Sunday, 12 Dec. 2021 18:00   | 1639332000
N5         | Sunday, 12 Dec. 2021 19:00   | 1639335600
N6         | Sunday, 12 Dec. 2021 20:00   | 1639339200

As one can see, the resulting binary tree is not a full cascade; however, one cannot tell whether the relationship between the N0 node and the N10 cluster is the same as that between N4 and N7, or different.

Fig. 3. Hierarchy tree generated using BHC and the example dataset.

Figure 4 represents the hierarchy obtained with the same dataset and the same parameters using the BRT algorithm. One can notice that the two very similar data points N2 and N3 were grouped, while all other data points were not clustered, even though the difference between N4 and N5, or between N5 and N6, is the same; this is despite the scale factor being relatively small. One could avoid such an effect by using more appropriate parameters; however, their values would have to be tuned rather than selected based on additional knowledge, in this case the day/night cycle in which humans operate. Consider applying contextual knowledge to the task. It is easy to say that one might expect a cluster that represents a single event during a day, such as a social gathering, which would be characterised by events that are described by


Fig. 4. Hierarchy tree generated using BRT and the example dataset.

a normal distribution with a standard deviation of, say, σ = 1000 s. Notice that multiple events in a day can be grouped by the day itself, so σ = 30000 can represent the day-level cluster. For the sake of completeness, assume that a cluster of σ = 500000 will represent a month. A basic understanding of the normal distribution is all that is required to define appropriate parameters. Using the covariance matrix (in the 1D example, a variance vector) based on these standard deviations, the covariance controlled BRT yields the hierarchy illustrated in Fig. 5.

Fig. 5. Hierarchy tree generated using CCBRT and the example dataset.
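The three standard deviations named above translate directly into the 1-D variance vector K that constrains CCBRT; a small sketch using the Table 1 timestamps (the construction and the variable names are illustrative, not part of the paper's implementation):

```python
# Build the 1-D variance vector K from the stated standard deviations;
# Eq. 4 requires the entries to be non-increasing with depth level.
sigma_month, sigma_day, sigma_event = 500000.0, 30000.0, 1000.0
K = [sigma_month ** 2, sigma_day ** 2, sigma_event ** 2]   # [2.5e11, 9e8, 1e6]

def variance(points):
    m = sum(points) / len(points)
    return sum((x - m) ** 2 for x in points) / len(points)

saturday = [1639224000, 1639249200, 1639252800]   # N1, N2, N3 from Table 1
v = variance(saturday)
# Saturday's spread exceeds the event-level minimum but not the day-level
# one, so CCBRT will not keep these three points as a single event cluster:
print(K[2] <= v < K[1])
```

This is exactly the kind of domain-driven threshold that parameter tuning on g, s, or α cannot express directly.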

As one can see, the additional constraints on the generated trees usefully represent clusters of the predicted nature. In the aforementioned case, the BRT found a useful distinction between N1 and N9, despite N9 and N20 both being of depth level 2 (as denoted by the letter 'D'). One therefore does not lose such information by providing constraints for CCBRT, as opposed to classical BRT. In some cases, however, that might not be useful information for the processing of the tree, in which case the pruned variant gives the result shown in Fig. 6.


Fig. 6. Hierarchy tree generated using pruned CCBRT and the example dataset.

4 Conclusions

In this article, a modified constrained hierarchical clustering algorithm based on BRT was presented. The classical BRT approach was enhanced with a constraining method based on ranges of covariances. Such an extension is essential when analysing real-world datasets, as it allows the data to be represented and explained by simple models. In the work cited above [10–12], the authors draw attention both to the increasing ease of computation and to the increasing complexity of analysis. It is an unquestionable advantage of BRT that it allows for better data representation; however, simply modelling the data may not be sufficient in many cases where additional processing is done automatically. This is especially true when there is a need for comparative analysis of the obtained clusters. This problem can be solved using the proposed modification, which, based on ranges of covariance, allows for a deeper and more meaningful contextual analysis of the data. It is also extremely important that the possibility of constraining the resulting hierarchies is formulated in a way that allows the use of contextual domain knowledge by experts. The proposed solution makes it possible to define the range of the resulting hierarchies by means of understandable constraints described by covariance matrices, which makes the solution useful in a wide range of applications. Another problem that appears in BRT articles is the need to reorganise the model with changing data [11]. In the case of the proposed solution, wherein the nature of clusters is partially defined by user requirements, the influx of data does not disrupt the entirety of the hierarchy, but instead modifies only a subtree, thus resulting in a stable system, not dependent on a single data point. BRT solutions offer great opportunities in autonomic taxonomy construction, text-based causality modelling, and hierarchy extraction.
The hierarchical nature of the clustering allows for use in computer graphics, wherein a multiresolution representation of shape is needed to provide the best quality without overhead. Similarly, the hierarchical structure is useful for recognising planes and obstacles under noisy circumstances in computer vision algorithms such as [21,22], or when controlling a hierarchical shape by animation using segmented surfaces as in [23]. However, these need some constraints to be useful, which can be applied using the proposed solution. To increase the


scope of their application, the issues of complexity and computational difficulty should be taken into account. This is a path of further investigation.

Acknowledgements. The research presented in this paper is co-financed by the EU Smart Growth Operational Programme 2014-2020 under the project POIR.01.01.01-00-1111/19 "Development of the advanced algorithms for multimedia data selection and the innovative method of those data visualization for the platform supporting customer service in the tourism industry". The work of Eryka Probierz was supported in part by the European Union through the European Social Fund as a scholarship Grant POWR.03.02.00-00-I029, and in part by the Silesian University of Technology through a grant: the subsidy for maintaining and developing the research potential in 2022 for young researchers in data collection and analysis. The work of Damian Peszor was supported in part by Silesian University of Technology through grant number BKM-647/RAU6/2021 "Detection of a plane in stereovision images without explicit estimation of disparity with the use of correlation space".

References

1. Huk, M.: Measuring the effectiveness of hidden context usage by machine learning methods under conditions of increased entropy of noise. In: 2017 3rd IEEE International Conference on Cybernetics (CYBCONF), Exeter, pp. 1–6 (2017)
2. Kwiatkowski, J., Huk, M., et al.: Context-sensitive text mining with fitness leveling genetic algorithm. In: 2015 IEEE 2nd International Conference on Cybernetics (CYBCONF), Gdynia, Poland, pp. 1–6 (2015). ISBN: 978-1-4799-8321-6
3. Mizera-Pietraszko, J., Huk, M.: Context-related data processing in artificial neural networks for higher reliability of telerehabilitation systems. In: 17th International Conference on E-health Networking, Application & Services (HealthCom 2015). IEEE Computer Society (2015)
4. Huk, M., Szczepanik, M.: Multiple classifier error probability for multi-class problems. Eksploatacja i Niezawodnosc - Maint. Reliabil. 51(3), 12–16 (2011)
5. Privitera, C.M., Azzariti, M., Stark, L.W.: Locating regions-of-interest for the Mars Rover expedition. Int. J. Remote Sens. 21, 3327–3347 (2000)
6. Huk, M.: Training contextual neural networks with rectifier activation functions: role and adoption of sorting methods. J. Intell. Fuzzy Syst. 37(6), 7493–7502 (2019)
7. Szczepanik, M., Jóźwiak, I.: Fingerprint recognition based on minutes groups using directing attention algorithms. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2012. LNCS (LNAI), vol. 7268, pp. 347–354. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-29350-4_42
8. Miller, G.A.: The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychol. Rev. 63(2), 81–97 (1956)
9. Huk, M.: Weights ordering during training of contextual neural networks with generalized error backpropagation: importance and selection of sorting algorithms. In: Nguyen, N.T., Hoang, D.H., Hong, T.-P., Pham, H., Trawiński, B. (eds.) ACIIDS 2018. LNCS (LNAI), vol. 10752, pp. 200–211. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75420-8_19
10. Blundell, C., Teh, Y.W., Heller, K.A.: Bayesian rose trees. In: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), Catalina Island, California, USA, pp. 65–72 (2010)


11. Liu, S., Wang, X., Song, Y., Guo, B.: Evolutionary Bayesian rose trees. IEEE Trans. Knowl. Data Eng. 27(6), 1533–1546 (2014)
12. Song, Y., Liu, S., Liu, X., Wang, H.: Automatic taxonomy construction from keywords via scalable Bayesian rose trees. IEEE Trans. Knowl. Data Eng. 27(7), 1861–1874 (2015)
13. Ogawa, T., Saga, R.: Text-based causality modeling with a conceptual label in a hierarchical topic structure using Bayesian rose trees. In: Proceedings of the 54th Hawaii International Conference on System Sciences (HICSS), pp. 1101–1110. Honolulu (2021)
14. Mehrotra, R., Yilmaz, E.: Extracting hierarchies of search tasks & subtasks via a Bayesian nonparametric approach. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 285–294. Association for Computing Machinery, New York (2017)
15. Jiang, H., Xiao, Y., Wang, W.: Explaining a bag of words with hierarchical conceptual labels. In: World Wide Web, pp. 1–21 (2020)
16. Heller, K.A., Ghahramani, Z.: Bayesian hierarchical clustering. In: Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pp. 297–304. Association for Computing Machinery, Bonn (2005)
17. Blundell, C., Teh, Y.W., Heller, K.A.: Discovering non-binary hierarchical structures with Bayesian rose trees. In: Mixture Estimation and Applications. John Wiley and Sons, Hoboken (2011)
18. Meertens, L.: First steps towards the theory of rose trees. Working paper 592 ROM-25, IFIP Working Group 2.1 (1988)
19. Yamada, K.: Hypermatrix and its application. Hitotsubashi J. Arts Sci. 6, 34–44 (1965)
20. Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis, 3rd edn. Chapman & Hall/CRC, Boca Raton (2013)
21. Peszor, D., Wojciechowska, M., Wojciechowski, K., Szender, M.: Fast moving UAV collision avoidance using optical flow and stereovision. In: Nguyen, N.T., Tojo, S., Nguyen, L.M., Trawiński, B. (eds.) ACIIDS 2017. LNCS (LNAI), vol. 10192, pp. 572–581. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54430-4_55
22. Peszor, D., Wojciechowski, K., Szender, M., Wojciechowska, M., Paszkuta, M., Nowacki, J.P.: Ground plane estimation for obstacle avoidance during fixed-wing UAV landing. In: Nguyen, N.T., Chittayasothorn, S., Niyato, D., Trawiński, B. (eds.) ACIIDS 2021. LNCS (LNAI), vol. 12672, pp. 454–466. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73280-6_36
23. Peszor, D., Wojciechowski, K., Wojciechowska, M.: Automatic markers' influence calculation for facial animation based on performance capture. In: Nguyen, N.T., Trawiński, B., Kosala, R. (eds.) ACIIDS 2015. LNCS (LNAI), vol. 9012, pp. 287–296. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-15705-4_28

Potential of Radiomics Features for Predicting Time to Metastasis in NSCLC

Agata Wilk1,2(B), Damian Borys1,2, Krzysztof Fujarewicz1, Andrea d'Amico2, Rafał Suwiński2, and Andrzej Świerniak1

1 Department of Systems Biology and Engineering, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
[email protected]
2 Maria Sklodowska-Curie National Research Institute of Oncology, Gliwice Branch, Wybrzeze Armii Krajowej 15, 44-102 Gliwice, Poland

Abstract. Lung cancer is the deadliest malignancy, with distant metastasis being a major negative prognostic factor. Recently, interest has been growing in imaging data as a source of predictors due to the low invasiveness of its acquisition. Using a cohort of 131 patients and a total of 356 ROIs, we built a Cox regression model which predicts metastasis and the time to its occurrence based on radiomic features extracted from PET/CT images. We employed several variable selection methods, including filtering based on correlation, univariate analysis, recursive elimination and LASSO, and obtained a C-index of 0.7 for the best model. This result shows that radiomic features have great potential as predictors of metastatic relapse, knowledge of which could aid clinicians in planning treatment.

Keywords: Lung cancer · Survival analysis · Metastasis · Radiomics features

1 Background

Lung cancer is the most common cause of cancer-related death worldwide [2,5]. In 2020 alone, it was responsible for over 2.2 million new cases (11.4% of all, second by a small margin only to breast cancer) and nearly 1.8 million deaths (constituting 18% of all deaths caused by cancer) [14]. The 5-year survival rate for lung cancer is between 10 and 20%, which can in many cases be attributed to late detection. One of the most significant negative prognostic factors is the occurrence of metastasis, which in lung cancer is located primarily in the bones, brain and liver [12]. Metastatic tumors are practically incurable due to their resistance to available treatment options. The ability to predict metastasis and, even more importantly, the time to its onset would grant clinicians a crucial advantage in planning therapy timing and intensity.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 64–76, 2022. https://doi.org/10.1007/978-3-031-21967-2_6


Although several approaches based on clinical features have been proposed [3], interest is growing in other sources of predictors, in particular imaging data, due to the low invasiveness of its acquisition [6,17]. Computed tomography and positron emission tomography (PET/CT) images are routinely acquired during diagnostics. They provide valuable knowledge about tumor location, size and shape which can then be used for clinical decision-making. However, being high-resolution images, they contain a much greater amount of information, most of it undetectable by the human eye, which remains unused in practice. There are, however, certain limitations on taking full advantage of that information. The size of a dataset necessary to fit a model far exceeds any realistically obtainable patient cohort. A compromise solution can be found in radiomics. By first applying feature extraction to the original images, the data is reduced to a smaller number of variables which can serve as a starting point for further processing, for example by means of statistical inference or machine learning. While the number of radiomic features that can be extracted is virtually unlimited, image pre-processing and custom-constructed variables may make standardization, comparison between studies and practical application difficult. Here we present a model predicting time to metastasis in non-small-cell lung cancer based on 105 standard radiomic features extracted from unmodified images. We explore the relationships in the data using unsupervised analysis. We employ several approaches to variable selection, in both single- and multi-step procedures, determining both the optimal number and set of features, for which we construct a Cox regression model.

2 Materials and Methods

2.1 Data

Data was collected retrospectively at the Maria Sklodowska-Curie National Research Institute of Oncology (NIO), Gliwice Branch, from patients treated for non-small-cell lung cancer between 2009 and 2017. From a larger cohort, we selected patients for whom PET/CT images were collected as part of the routine diagnostic procedure. A sample image is shown in Fig. 1. Patients with metastasis present at the moment of diagnosis were excluded, resulting in a total of 131 patients included in this analysis. The patients were treated with radio-chemotherapy, primarily in sequential form, with 1 to 6 cycles of platinum-based chemotherapy followed by radiotherapy with a total dose of 60–70 Gy. For 38 individuals, metastasis was recorded at a later time. More detailed cohort characteristics are shown in Table 1. All patients gave informed consent, and the data collection was approved by the ethical committee at NIO.

2.2 Radiomics Features

Acquired pre-treatment PET/CT images were preprocessed in order to save the data in a format appropriate for subsequent radiomics analysis [1,4,8–11]. Manually generated regions of interest (ROIs), created with the automatic support of the PET image lesion segmentation tool in MIM Software v7.0, were preprocessed in a similar way to produce output files in the .nrrd format. Radiomics features were extracted from the target lesions (described by ROIs) using a program based on the pyRadiomics package for Python (v3.0.1), including:

- First order features (energy, entropy, minimum, percentiles, maximum, mean, median, interquartile range, range, standard deviation, skewness, kurtosis, etc.);
- Shape features (volume, surface area, sphericity, etc.);
- Higher order statistics texture features, including:
  - Gray-Level Co-occurrence Matrix (GLCM);
  - Gray-Level Dependence Matrix (GLDM);
  - Gray-Level Run Length Matrix (GLRLM);
  - Gray-Level Size Zone Matrix (GLSZM);
  - Neighboring Gray Tone Difference Matrix (NGTDM).

Only the original images, without additional filters, were used in this analysis. This procedure gave us a total of 105 radiomics features. For cases with multiple tumor sites in the lung, more than one ROI was generated. Their number varied between 1 and 11 per patient, with a median value of 2, resulting in a total of 356 ROIs. ROIs within a single patient showed high variation, which is demonstrated in Fig. 2.

Table 1. Characteristics of the patient cohort (n = 131). For age, the mean and range are shown. T, N, M: TNM classification at the moment of imaging (T: primary tumor size, N: spread to lymph nodes, M: distant metastases). Zubrod score: ECOG performance score. Eventual metastasis: MFS status.

  Age                  61.83 (43–81)
  Sex                  Male: 34 (26.0%); Female: 95 (74.0%)
  Zubrod score         0: 39 (29.8%); 1: 91 (69.5%); 2: 1 (0.8%)
  T                    T0: 4 (3.1%); T1: 13 (9.9%); T2: 31 (23.6%); T3: 43 (32.8%); T4: 40 (30.5%)
  N                    N0: 57 (43.5%); N1: 11 (8.4%); N2: 49 (37.4%); N3: 14 (10.7%)
  M                    M0: 131 (100%); M1: 0 (0%)
  Subtype              Squamous: 90 (68.7%); Large cell: 29 (22.1%); Adeno: 10 (7.6%); Other: 2 (1.6%)
  Eventual metastasis  Yes: 38 (29.0%); No: 93 (71%)

2.3 Data Pre-processing and Unsupervised Analysis

As the feature values varied considerably in orders of magnitude, we standardized all variables using the z-score transformation. We explored relationships between features and samples in terms of Pearson's correlation, hierarchical clustering, and principal component analysis.

2.4 Modeling of Metastasis Free Survival

We used Cox regression to assess the effect of radiomics features on time to metastasis. We verified the proportional hazards assumption with Schoenfeld residuals. We employed several methods of model reduction, including pre-filtering based on univariate analysis (with a cutoff threshold for the p-value equal to 0.2), pre-filtering based on correlations between variables, recursive elimination based on the Akaike Information Criterion, and LASSO. We assessed the performance of the models using the C-index and a time-dependent ROC curve (for time = 900 days) for resubstitution on the training set, and visualized the results on Kaplan-Meier plots. We repeated the whole procedure using only the largest ROI (according to voxel volume) for each patient. We compared Kaplan-Meier survival curves for patient groups created by splitting the selected features along the median. Analyses were performed using the R environment for statistical computing, version 4.0.3 released on 10 October 2020 (R Foundation for Statistical Computing, Vienna, Austria, http://www.r-project.org), with the packages survival (version 3.2-7) [15,16], glmnet (version 4.1-1) [7,13] and survivalROC (version 1.0.3).
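The C-index used throughout was computed with the R survival package; purely as an illustrative sketch (ours, not the study's code), Harrell's concordance index over observed times, event indicators and predicted risk scores can be written as:

```python
def c_index(times, events, risks):
    """Harrell's C-index: the fraction of comparable pairs in which the
    higher predicted risk corresponds to the earlier observed event.
    A pair (i, j) is comparable when the subject with the shorter
    follow-up time actually had an event (events[i] == 1); ties in
    predicted risk count as 0.5."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# A model that assigns higher risk to every earlier event is perfectly concordant.
print(c_index([1, 2, 3, 4], [1, 1, 1, 1], [4, 3, 2, 1]))  # prints 1.0
```

Pairs whose earlier member is censored (event = 0) are skipped, mirroring the standard definition, so censored patients only ever enter as the later member of a pair.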


Fig. 1. An example of a PET/CT image.

Fig. 2. Correlations between ROIs calculated based on all radiomics features for a single patient.

3 Results

Hierarchical clustering did not result in the emergence of distinct groups aligned with metastasis-free survival (MFS) or subtype. Moreover, no discernible relationship was found between radiomic features and stage (T and N from the TNM staging system). This confirms that radiomic features carry information not explained by clinical characteristics alone. It can also be observed that certain groups of features are highly correlated (Fig. 4). Some examples of (unsurprisingly) correlated features are: LeastAxisLength, MinorAxisLength, Maximum3DDiameter, Maximum2DDiameterSlice, Maximum2DDiameterRow, Maximum2DDiameterColumn, and SurfaceArea (shape features); or Median, Mean, RootMeanSquared, 90Percentile, 10Percentile and Maximum (first order features) (Fig. 3).
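Correlation-based redundancy removal of this kind is used in this study as a pre-filtering step; as a minimal greedy sketch (our illustration, not the study's R code, assuming non-constant feature vectors), a feature is kept only if it is not too strongly correlated with any feature already kept:

```python
def pearson(a, b):
    # Pearson correlation coefficient of two equal-length sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def correlation_filter(features, threshold=0.9):
    """Greedily keep a feature only if |r| with every already kept feature
    stays below the threshold. `features` maps name -> list of values."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold
               for k in kept):
            kept.append(name)
    return kept
```

With a threshold of 0.9, one member of each strongly correlated pair (such as Median and Mean above) would be dropped in favour of whichever appears first.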

Fig. 3. Hierarchical clustering based on Euclidean distances.

Of all the approaches to feature selection, four yielded similar quality: recursive elimination with univariate analysis preselection, LASSO, recursive elimination with univariate analysis and correlation preselection, and LASSO with correlation preselection. Table 2 outlines the selection results. While the best overall performance was achieved by recursive elimination with pre-filtering based on univariate analysis, the LASSO selection yielded a model only slightly worse with considerably fewer predictors. A smaller number of variables reduces the risk of overfitting, making the LASSO selection seem more favorable. The selected predictors are presented in Table 3.


Fig. 4. Correlations between features.

Table 2. Model performance for different approaches to feature selection - all ROIs.

  Pre-filtering                                 Selection              Number of features  C-index
  Univariate analysis                           Recursive elimination  13                  0.71
  None                                          LASSO                  7                   0.70
  Feature correlations and univariate analysis  Recursive elimination  5                   0.67
  Feature correlations                          LASSO                  5                   0.67


Table 3. Hazard ratios with 95% confidence intervals and statistical significance for the selected predictors - using all ROIs.

  Feature class  Feature                            Hazard ratio       p-value
  Shape          Maximum2DDiameterColumn            1.7 [1.08–2.8]     0.023
  First Order    Minimum                            1.7 [1.37–2.0]     < 0.001
  GLCM           InverseVariance                    1.0 [0.81–1.3]     0.765
  GLRLM          RunLengthNonUniformity             1.1 [0.75–1.6]     0.599
  GLRLM          RunLengthNonUniformityNormalized   1.7 [1.25–2.3]     < 0.001
  GLSZM          SmallAreaEmphasis                  1.2 [0.93–1.7]     0.143
  GLSZM          SmallAreaLowGrayLevelEmphasis      1.1 [0.82–1.4]     0.627

In the multivariate model, three features show statistical significance. These are Maximum2DDiameterColumn, Minimum and RunLengthNonUniformityNormalized. Figure 5 shows the comparison of Kaplan-Meier plots: observed (red) and simulated (blue). As visible from the Kaplan-Meier plot, after around 900 days the number of events is relatively small. Figure 6 shows a time-dependent ROC curve for that moment. The AUC equal to 0.758 is relatively high for this kind of model. For the dataset with only one ROI per patient, without pre-filtering based on correlation, large numbers of features were selected (see Table 4). The observed C-indices were very good, reaching 0.88 for the simple univariate pre-selection with recursive elimination; however, with 131 data points in the training set, the model is almost certainly overfitted. The multi-step selection consisting of correlation- and univariate-based filtering followed by recursive elimination seems much more promising, as it achieves a high C-index with 8 predictors. The selected features in this model are presented in Table 5. Yet again, the majority belong to the texture-based features, with all except for Dependence Entropy reaching statistical significance. Figure 7 visualizes the prognostic ability of the selected features. Most of the splits produce two groups with distinct metastasis-free survival profiles. Figure 8 demonstrates the result of the fit.

4 Discussion and Future Work

In this paper we presented an attempt to use radiomic features to predict time to metastasis for non-small-cell lung cancer patients. We applied a Cox regression model together with several variable selection strategies. The performance of the best model confirms the potential of radiomic data for predicting metastasis in lung cancer.


Fig. 5. Kaplan-Meier plot for observed and simulated metastasis-free survival - multivariate model for all ROIs. (Color figure online)

Fig. 6. ROC curve at t = 900 days.


Table 4. Model performance for different approaches to feature selection - one (largest) ROI per patient.

  Pre-filtering                                 Selection              Number of features  C-index
  Univariate analysis                           Recursive elimination  27                  0.88
  None                                          LASSO                  15                  0.79
  Feature correlations and univariate analysis  Recursive elimination  8                   0.77
  Feature correlations                          LASSO                  2                   0.67

Table 5. Hazard ratios with 95% confidence intervals and statistical significance for the selected predictors - using one ROI per patient.

  Feature class  Feature                            Hazard ratio       p-value
  GLCM           InverseVariance                    2.26 [1.02–4.97]   0.044
  GLSZM          LowGrayLevelZoneEmphasis           2.22 [1.34–3.67]   0.002
  GLSZM          SmallAreaEmphasis                  1.71 [1.00–3.21]   0.048
  GLSZM          HighGrayLevelZoneEmphasis          4.02 [1.34–12.07]  0.013
  GLSZM          SizeZoneNonUniformityNormalized    0.10 [0.02–0.61]
  First Order    Kurtosis                           0.37 [0.21–0.65]

|TN_{k,t}| = \sum_{o_i \in O - O_k} \begin{cases} 1, & \text{if } p_k(o_i) > t \\ 0, & \text{otherwise} \end{cases} \quad (9)

This allows us to define the cardinality of the set of all false positives |FP_{k,t}| for a given class as in Eq. 10, knowing that O_k is composed of all true positives and false negatives.


D. Pęszor and K. Wojciechowski

|FP_{k,t}| = |O| - |O_k| - |TN_{k,t}| \quad (10)

2.2 Aggregation Over Classes and Thresholds

Let us now consider one of the performance measures, namely the True Positive Rate as defined in Eq. 2. It can be redefined for a given class k at some threshold t as in Eq. 11.

TPR_{k,t} = \frac{|TP_{k,t}|}{|TP_{k,t}| + |FN_{k,t}|} \quad (11)

This allows us to redefine the intuitive aggregation of TPR over classes, based on Eq. 4, as in Eq. 12.

TPR_t = \frac{1}{|K|} \sum_{k=1}^{|K|} TPR_{k,t} = \frac{1}{|K|} \sum_{k=1}^{|K|} \frac{|TP_{k,t}|}{|TP_{k,t}| + |FN_{k,t}|} \quad (12)

While a domain might indicate that the value of t should be higher than 1, using any single number does not solve the problem of an uninformative gradient. In fact, a higher t tends to exacerbate the problem, due to the broader range of parameterisations resulting in the same value. Instead of focusing on a single value of t, we propose to aggregate the measure over all possible values, so that a change of p_k(o_i) results in a change of the value of the given measure. Note that the effect on TPR of lower t values should never be smaller than the effect of higher t, as in Eq. 13.

\frac{f(t)}{f(t+1)} > 1 \quad (13)

If this is fulfilled, one can aggregate the measures over different t using a simple sum, as in Eq. 14.

M = \sum_{t=1}^{|K|-1} f(t) \, M_t \quad (14)

Let us define f(t) = t^{-m} for any m ≥ 0. The parameter m corresponds to the influence of further positions on the measure. In the general case, we assume that the value m = 1 is quite universal, although specific domains might benefit from adjusting it to their demands. For the measure of Sensitivity, Eq. 14 therefore becomes Eq. 15. Note that t = |K| would mean that every data point is considered a True Positive regardless of p_k(o_i), so there is no point in including it, as it carries no information.

TPR = \sum_{t=1}^{|K|-1} \frac{TPR_t}{t^m} = \frac{1}{|K|} \sum_{t=1}^{|K|-1} \sum_{k} \frac{|TP_{k,t}|}{|O_k| \, t^m} \quad (15)

Aggregated Performance Measures for Multi-class Classification


Substituting Eq. 7 into Eq. 15, one can change the order of the summations over classes and over thresholds, getting Eq. 16.

TPR = \frac{1}{|K|} \sum_{k} \frac{1}{|O_k|} \sum_{t=1}^{|K|-1} \frac{1}{t^m} \sum_{o_i \in O_k} \begin{cases} 1, & \text{if } p_k(o_i) \le t \\ 0, & \text{otherwise} \end{cases} \quad (16)

Similarly, the order of the summations over thresholds and over the data points of a given class can be changed, leading to a modification of the weight of a single data point in the calculated measure, as in Eq. 17.

TPR = \frac{1}{|K|} \sum_{k} \frac{1}{|O_k|} \sum_{o_i \in O_k} \sum_{t=1}^{|K|-1} \begin{cases} \frac{1}{t^m}, & \text{if } p_k(o_i) \le t \\ 0, & \text{otherwise} \end{cases} \quad (17)

wherein the conditional case can be represented by a different initialisation of the summation, as in Eq. 18.

TPR = \frac{1}{|K|} \sum_{k} \frac{1}{|O_k|} \sum_{o_i \in O_k} \sum_{t=p_k(o_i)}^{|K|-1} \frac{1}{t^m} \quad (18)

One can notice that the last summation in Eq. 18 is actually a difference between two harmonic numbers, as presented in Eq. 19, wherein the vacuous summation defines H^m_0. For generality, we use H^m to denote the generalised harmonic number of order m.

TPR = \frac{1}{|K|} \sum_{k} \frac{1}{|O_k|} \sum_{o_i \in O_k} \left( H^m_{|K|-1} - H^m_{p_k(o_i)-1} \right) \quad (19)

This can then be simplified to the form of Eq. 20.

TPR = H^m_{|K|-1} - \frac{1}{|K|} \sum_{k} \frac{1}{|O_k|} \sum_{o_i \in O_k} H^m_{p_k(o_i)-1} \quad (20)

Notice that the summations do not consider the same data point twice; that is to say, the calculation is actually linear (O(|O|)) in terms of complexity. The values of H^m can be precalculated in an array, which allows for easy access. The entire value is therefore quite efficient to calculate, despite the added complexity behind the aggregation over multiple thresholds.

2.3 Normalisation

While the inner sum of Eq. 20 is enough to control the direction of change in the hyperparameter space, it might be beneficial to use a normalised version of the aggregated TPR, especially when one wants to combine it with other measures, as in the case of the F1-score. The minimal value of the aggregated TPR is achieved in the pessimistic case, whenever ∀i ∧ k : o_i ∈ O_k, p_k(o_i) = |K|. It is easy to notice that in this particular case the aggregated TPR is zero. The maximum value is achieved whenever perfect classification occurs, that is to say, when ∀i ∧ k : o_i ∈ O_k, p_k(o_i) = 1, in which case the summation equals zero. Since the first summand depends on the class count, the normalised aggregated TPR is defined as in Eq. 21.

\widehat{TPR} = 1 - \frac{1}{|K| \, H^m_{|K|-1}} \sum_{k} \frac{1}{|O_k|} \sum_{o_i \in O_k} H^m_{p_k(o_i)-1} \quad (21)
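A minimal sketch (ours, not the authors' implementation) of the normalised aggregated TPR from Eq. 21, with the harmonic numbers precomputed in an array as suggested above:

```python
from collections import defaultdict

def harmonic_numbers(n, m=1.0):
    # H[j] = sum_{i=1..j} 1/i^m; H[0] = 0 corresponds to the vacuous sum.
    H = [0.0] * (n + 1)
    for i in range(1, n + 1):
        H[i] = H[i - 1] + 1.0 / i ** m
    return H

def aggregated_tpr(labels, positions, n_classes, m=1.0):
    """Normalised aggregated TPR (Eq. 21).
    labels[i]    -- true class of data point o_i,
    positions[i] -- 1-based position p_k(o_i) of that class in the
                    predicted ordering for o_i."""
    H = harmonic_numbers(n_classes - 1, m)
    per_class = defaultdict(list)
    for y, p in zip(labels, positions):
        per_class[y].append(p)
    inner = sum(sum(H[p - 1] for p in ps) / len(ps)
                for ps in per_class.values())
    return 1.0 - inner / (n_classes * H[n_classes - 1])

print(aggregated_tpr([0, 1], [1, 2], 2))  # prints 0.5
```

A perfect ranking (every true class first) yields 1, and the pessimistic case (every true class last) yields 0, matching the normalisation argument above.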

2.4 The Case of Specificity

We take a similar approach in the case of the True Negative Rate defined in Eq. 3. Redefined for a given class k at a given threshold t, it becomes Eq. 22.

TNR_{k,t} = \frac{|TN_{k,t}|}{|TN_{k,t}| + |FP_{k,t}|} \quad (22)

We aggregate it over classes as in Eq. 23.

TNR_t = \frac{1}{|K|} \sum_{k=1}^{|K|} TNR_{k,t} = \frac{1}{|K|} \sum_{k=1}^{|K|} \frac{|TN_{k,t}|}{|TN_{k,t}| + |FP_{k,t}|} \quad (23)

This allows us to aggregate over different values of the threshold in Eq. 24, as in the case of Sensitivity.

TNR = \sum_{t=1}^{|K|-1} \frac{TNR_t}{t^m} = \frac{1}{|K|} \sum_{t=1}^{|K|-1} \sum_{k} \frac{|TN_{k,t}|}{(|O| - |O_k|) \, t^m} \quad (24)

Again, we substitute Eq. 9 into Eq. 24 and change the order of summations, getting Eq. 25.

TNR = \frac{1}{|K|} \sum_{k} \frac{1}{|O| - |O_k|} \sum_{t=1}^{|K|-1} \frac{1}{t^m} \sum_{o_i \in O - O_k} \begin{cases} 1, & \text{if } p_k(o_i) > t \\ 0, & \text{otherwise} \end{cases} \quad (25)

Changing the order of the summations over thresholds and over the data points outside of a given class again leads to the weight factor in the condition, as in Eq. 26.

TNR = \frac{1}{|K|} \sum_{k} \frac{1}{|O| - |O_k|} \sum_{o_i \in O - O_k} \sum_{t=1}^{|K|-1} \begin{cases} \frac{1}{t^m}, & \text{if } p_k(o_i) > t \\ 0, & \text{otherwise} \end{cases} \quad (26)

This can be represented by changing the range of the summation, as in Eq. 27.

TNR = \frac{1}{|K|} \sum_{k} \frac{1}{|O| - |O_k|} \sum_{o_i \in O - O_k} \sum_{t=1}^{p_k(o_i)-1} \frac{1}{t^m} \quad (27)


The last summation in Eq. 27 is a harmonic number, as presented in Eq. 28, wherein the vacuous summation defines H_0.

TNR = \frac{1}{|K|} \sum_{k} \frac{1}{|O| - |O_k|} \sum_{o_i \in O - O_k} H^m_{p_k(o_i)-1} \quad (28)

Notice again that the values of H^m can be precalculated in an array, similarly to the per-class multiplier. The computational complexity can therefore be defined as O((|K| - 1)|O|). It is also useful to note that the calculations of the aggregated Sensitivity and the aggregated Specificity can easily be combined.

The normalisation of the aggregated TNR is somewhat more involved than in the case of TPR. For most cases, it is enough to assume that the minimal value is achieved when ∀i ∧ ∀k : o_i ∉ O_k, p_k(o_i) = 1. In this case, the minimum value would be zero. Similarly, the maximum value would be achieved when ∀i ∧ ∀k : o_i ∉ O_k, p_k(o_i) = |K|, which is to say that all incorrect classes would be classified as least probable. That leads to the normalised aggregated TNR as in Eq. 29.

\widehat{TNR} = \frac{1}{|K| \, H^m_{|K|-1}} \sum_{k} \frac{1}{|O| - |O_k|} \sum_{o_i \in O - O_k} H^m_{p_k(o_i)-1} \quad (29)
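A matching sketch for Eq. 29 (again ours, not the authors' code); since the Specificity side needs the position of every incorrect class, the input here is the full predicted ordering of classes for each data point:

```python
def harmonic_numbers(n, m=1.0):
    # H[j] = sum_{i=1..j} 1/i^m; H[0] = 0 corresponds to the vacuous sum.
    H = [0.0] * (n + 1)
    for i in range(1, n + 1):
        H[i] = H[i - 1] + 1.0 / i ** m
    return H

def aggregated_tnr(labels, rankings, n_classes, m=1.0):
    """Normalised aggregated TNR (Eq. 29).
    rankings[i] lists all class indices for o_i, most probable first,
    so p_k(o_i) = rankings[i].index(k) + 1."""
    H = harmonic_numbers(n_classes - 1, m)
    n = len(labels)
    total = 0.0
    for k in range(n_classes):
        n_k = sum(1 for y in labels if y == k)
        s = sum(H[r.index(k)]          # H[p_k(o_i) - 1]
                for y, r in zip(labels, rankings) if y != k)
        total += s / (n - n_k)
    return total / (n_classes * H[n_classes - 1])
```

For |K| = 2 the extremes 0 and 1 are attainable; for larger |K| the bounds are generally not reached, since not all classes can be ranked first or last within a single consistent ordering.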

A careful reader will, however, notice that not all classes can be placed in the first position, nor can all be placed last. Perfect normalisation is therefore dependent on the sizes of the classes; the proposed normalisation will, in the general case, achieve neither 0 nor 1.

2.5 The Compound Measure of Accuracy

Accuracy is often used as a single-value measure of the quality of classification. This is especially true in the case of optimisation, where increasing specificity and sensitivity at the same time might not be possible. Accuracy is therefore an easy way to present the overall score. Let us, therefore, use the same approach for Accuracy, which is a bit more complicated in its form. We start with a redefinition of Eq. 1 for a given class k and threshold t, as presented in Eq. 30.

ACC_{k,t} = \frac{|TP_{k,t}| + |TN_{k,t}|}{|TP_{k,t}| + |TN_{k,t}| + |FP_{k,t}| + |FN_{k,t}|} \quad (30)

We aggregate over |K| classes and |K| - 1 threshold values as in Eq. 31.

ACC = \sum_{t=1}^{|K|-1} \frac{ACC_t}{t^m} = \frac{1}{|K|} \sum_{t=1}^{|K|-1} \sum_{k} \frac{|TP_{k,t}| + |TN_{k,t}|}{|O| \, t^m} \quad (31)

In this case, we change the order of summations and then use both Eq. 7 and Eq. 9 in Eq. 31, getting Eq. 32.


ACC = \frac{1}{|K||O|} \sum_{k} \sum_{t=1}^{|K|-1} \frac{1}{t^m} \left( \sum_{o_i \in O_k} \begin{cases} 1, & \text{if } p_k(o_i) \le t \\ 0, & \text{otherwise} \end{cases} + \sum_{o_i \in O - O_k} \begin{cases} 1, & \text{if } p_k(o_i) > t \\ 0, & \text{otherwise} \end{cases} \right) \quad (32)

Then we include the threshold factor in the conditional summations and remove the conditions by changing the ranges of the summations, as in Eq. 33.

ACC = \frac{1}{|K||O|} \sum_{k} \left( \sum_{o_i \in O_k} \sum_{t=p_k(o_i)}^{|K|-1} \frac{1}{t^m} + \sum_{o_i \in O - O_k} \sum_{t=1}^{p_k(o_i)-1} \frac{1}{t^m} \right) \quad (33)

This leads us to harmonic numbers in Eq. 34, wherein the vacuous summation defines H_0.

ACC = \frac{1}{|K||O|} \sum_{k} \left( \sum_{o_i \in O_k} \left( H^m_{|K|-1} - H^m_{p_k(o_i)-1} \right) + \sum_{o_i \in O - O_k} H^m_{p_k(o_i)-1} \right) \quad (34)

As in the case of Sensitivity, we can remove the constant factor from the summation, as in Eq. 35.

ACC = \frac{H^m_{|K|-1}}{|K|} + \frac{1}{|K||O|} \sum_{k} \left( \sum_{o_i \in O - O_k} H^m_{p_k(o_i)-1} - \sum_{o_i \in O_k} H^m_{p_k(o_i)-1} \right) \quad (35)

Reversing the order of summations allows us to represent the result as in Eq. 36.

ACC = \frac{H^m_{|K|-1}}{|K|} + \frac{1}{|K||O|} \sum_{o_i \in O} \left( \sum_{k : o_i \notin O_k} H^m_{p_k(o_i)-1} - \sum_{k : o_i \in O_k} H^m_{p_k(o_i)-1} \right) \quad (36)

Note that the first inner sum is actually the partial sum of generalised harmonic numbers with the second inner sum removed; it can therefore be precalculated. This is clear if we rewrite it for m ∈ ℕ, where the sum of n generalised harmonic numbers can be represented as \sum_{i=1}^{n} H^m_i = (n+1) H^m_n - H^{(m-1)}_n, obtaining Eq. 37.

ACC = \frac{H^m_{|K|-1}}{|K|} + \frac{1}{|K||O|} \sum_{o_i \in O} \left( (|K|+1) H^m_{|K|} - H^{(m-1)}_{|K|} - 2 H^m_{p_{k : o_i \in O_k}(o_i)-1} \right) \quad (37)


This allows us to represent it as in Eq. 38.

ACC = \frac{H^m_{|K|-1} + (|K|+1) H^m_{|K|} - H^{(m-1)}_{|K|}}{|K|} - \frac{2}{|K||O|} \sum_{o_i \in O} H^m_{p_{k : o_i \in O_k}(o_i)-1} \quad (38)

Despite the quite involved general-case formula, this can be evaluated with linear (O(|O|)) complexity. It is important to note that, for hyperparameter optimisation, the constant part of Accuracy or other measures does not have to be calculated at all, as the difference between compared classifiers is fully enclosed in the inner summation. The minimal value of Accuracy occurs when each data point's actual class has been assigned the last position, that is, ∀i, k : o_i ∈ O_k, p_k(o_i) = |K|. In such a case, we get Eq. 39.

ACC = \frac{\frac{1}{|K|^m} + |K| H^m_{|K|} - H^{(m-1)}_{|K|}}{|K|} \quad (39)

To put the minimal value at 0, we remove this factor and get Eq. 40.

\widehat{ACC} = \frac{2 H^m_{|K|-1}}{|K|} - \frac{2}{|K||O|} \sum_{o_i \in O} H^m_{p_{k : o_i \in O_k}(o_i)-1} \quad (40)

Since the maximum value depends on the number of classes, we can further assume that the best result is the one in which ∀i, k : o_i ∈ O_k, p_k(o_i) = 1; the appropriately scaled, normalised aggregated Accuracy therefore takes the form presented in Eq. 41.

\widehat{ACC} = 1 - \frac{1}{|O| \, H^m_{|K|-1}} \sum_{o_i \in O} H^m_{p_{k : o_i \in O_k}(o_i)-1} \quad (41)
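As a closing sketch (ours, not the authors' code): Eq. 41 depends only on the position assigned to each data point's true class, so the normalised aggregated Accuracy reduces to a single linear pass:

```python
def harmonic_numbers(n, m=1.0):
    # H[j] = sum_{i=1..j} 1/i^m; H[0] = 0 corresponds to the vacuous sum.
    H = [0.0] * (n + 1)
    for i in range(1, n + 1):
        H[i] = H[i - 1] + 1.0 / i ** m
    return H

def aggregated_accuracy(true_positions, n_classes, m=1.0):
    """Normalised aggregated Accuracy (Eq. 41).
    true_positions[i] is the 1-based position of the true class of o_i
    in the predicted ordering."""
    H = harmonic_numbers(n_classes - 1, m)
    s = sum(H[p - 1] for p in true_positions)
    return 1.0 - s / (len(true_positions) * H[n_classes - 1])

print(aggregated_accuracy([1, 3], 3))  # prints 0.5
```

Note how an improvement of the true class from, say, position 3 to position 2 raises the score even though top-1 accuracy is unchanged; this is exactly the extra gradient information the aggregation is meant to expose.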

3 Discussion

In this paper, we present a modification to commonly used performance measures that leverages the ordering of classes in multinomial classification. Instead of using a one-vs-all approach, we aggregate the performance measure over multiple thresholds on what counts as a correct classification, that is, on the position at which the proper class must appear for the classification to be recognised as correct. This leads to two different effects.

The first effect is especially important in many real-life scenarios wherein the final decision is made by humans and the machine learning approach is used to narrow the range of possibilities to consider. Such scenarios include facial recognition, neurodegenerative disease diagnosis, and many others. In such a case, the human operator receives a few best-fitting classes and can examine the case further. For example, recognising the few most probable identities based on facial recognition among thousands of classes allows investigators to use additional


information (such as clothing) to pinpoint the suspect. In the case of disease diagnosis, it allows clinicians to consider extra tests that will differentiate between a few potential diseases, or between the best-fitting "healthy" class and a slightly less probable disease that would remain ignored under binary classification. An approach that considers not only the most probable result but also further ones leads to an algorithm that positions the correct result as high as possible, even if for some reason it is not the best fit; otherwise, the correct result might end up in a position that does not allow recognition by a human operator.

The second effect is crucial for the process of machine learning. In the one-vs-all approach to classification, adjustments to hyperparameters that result in a change of the position of the correct label in the resulting ordering are ignored. The only shift that actually matters is either improving the position of the correct label to the first one or demoting it from the first to any other. Therefore, the difference between two values of a hyperparameter might either be non-existent, even when the prediction of the correct class has improved from, e.g., the 100th to the 2nd position, or be driven by fluctuations of a few edge cases. The true change of performance of multinomial classification is therefore hidden, and only a limited amount of information is used to control the adjustment of hyperparameters, leading to an algorithm that finds a way to differentiate a specific data set rather than capture the nature of the problem. Leveraging information about further positions might result in faster convergence and solutions that are less dependent on the training data.

It should be noted that the presented approach does not stand in opposition to the trend of aggregating multiple measures into one, as in the case of the F-score, or even to the extension of ROC curves to the multi-class case.

Acknowledgements.
The research described in the paper was supported by grant no. WND-RPSL.01.02.00-24-00AC/19-011 "An innovative system for the identification and re-identification of people based on a facial image recorded in a short video sequence in order to increase the security of mass events", funded under the Regional Operational Programme of the Silesian Voivodeship in the years 2014–2020. The work of Damian Pęszor was supported in part by the Silesian University of Technology (SUT) through grant number BKM-647/RAU6/2021 "Detection of a plane in stereovision images without explicit estimation of disparity with the use of correlation space".


Prediction of Lung Cancer Survival Based on Multiomic Data

Roman Jaksik(B) and Jarosław Śmieja

Department of Systems Biology and Engineering, Silesian University of Technology, Gliwice, Poland
[email protected]

Abstract. Lung cancer is the leading cause of cancer death among both men and women, which mainly results from the low effectiveness of screening programs and the late occurrence of symptoms, which are usually associated with advanced disease stages. Lung cancer shows high heterogeneity that has repeatedly been associated with its molecular background, providing the possibility to utilize machine learning approaches to aid both the diagnosis and the development of personalized treatments. In this work we utilize multiple -omics datasets in order to assess their usefulness for predicting 2-year survival of lung adenocarcinoma, using clinical data of 267 patients. By utilizing mRNA and microRNA expression levels, positions of somatic mutations, changes in DNA copy number and DNA methylation levels, we developed multiple single- and multi-omics-based classifiers. We also tested various data aggregation and feature selection techniques, showing their influence on the classification accuracy measured by the area under the ROC curve (AUC). The results of our study show not only that molecular data can be effectively used to predict 2-year survival in lung adenocarcinoma (AUC = 0.85), but also that information on gene expression changes, methylation and mutations provides much better predictors than copy number changes and data from microRNA studies. We also compared the classification performance of different dimensionality reduction methods on the most problematic copy number variation dataset, concluding that gene and gene set aggregation provides the best classification results.

Keywords: Multiomic data · Machine learning · Next generation sequencing · Lung cancer

1 Introduction

Despite significant advancements in cancer diagnosis and treatment, lung cancer prognosis remains very poor, accounting for the highest number of cancer-related deaths worldwide [1]. This mainly results from inadequate screening programs as well as the late occurrence of symptoms, which usually manifest themselves only in advanced cancer stages. Non-small cell lung cancer (NSCLC) contributes to approximately 85% of all

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 116–127, 2022. https://doi.org/10.1007/978-3-031-21967-2_10


reported lung cancer cases [1]. Of these, adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) are the two major subtypes, which differ significantly in epidemiology [2] and molecular background [3, 4]. Lung cancers have been extensively studied using various omics approaches, including genomics [5], transcriptomics [6], proteomics [7] and metabolomics [8], as well as their combinations in so-called multiomic studies [9, 10]. However, the high heterogeneity of the tumor, exhibited by these approaches, is challenging from the clinical perspective. Nevertheless, significant differences in the molecular background make it possible to utilize machine learning approaches to aid both the diagnosis process and the development of personalized treatments. Since their conception, machine learning methods have been extensively used to predict treatment outcomes for various cancers, first with simple applications of statistical decision theory [11] and neural networks for data analysis [12] and treatment outcome prognosis [13]. An overview of these early methods can be found, e.g., in [14]. Subsequently, a wide range of tools have been developed, including Bayesian networks, support vector machines, random forests, various types of neural networks, logistic regression, LASSO regression and many others. Recently, deep learning methods have been gaining increasing popularity in the analysis of cancer-related epigenomic data [15], RNA-seq data [16] and next-generation sequencing data (a review for non-small cell lung cancer can be found in [17]). While initially most methods were used to analyze single -omics data, integrative approaches combining data of different types (so-called multiomics) have become increasingly popular in recent years [18, 19]. All these efforts are focused on discovering potential predictors of the risk of developing cancer and on treatment outcome prognosis.
Thus they aim to facilitate the identification of target groups for screening, the determination of screening intervals and risk stratification [20–22], as well as the choice of the most promising treatment modes [23]. This is needed if precision oncology is to enter daily clinical practice. Despite a flood of papers in this area, there are apparent obstacles to the successful application of the methods developed. They result from incomparable datasets, incomplete data, etc., leading to inconsistent conclusions about the quality of the methods developed - e.g., performance from 0.767 to 0.94, measured by the area under the ROC curve (AUC), for a single method, obtained by different groups (see a comparison in [24]). One of the main weak points is the feature selection used for machine learning [25]. Feature selection is the process of automatically selecting the attributes in the data that are most relevant to the predictive modeling problem. Feature selection is not the same as dimensionality reduction, though both approaches have the same goal of reducing the number of attributes in the dataset. Dimensionality reduction methods do this by creating new combinations of attributes (e.g., Principal Component Analysis - PCA), while feature selection methods either include or exclude attributes present in the data without changing them. Feature selection methods aid the development of predictive models by selecting features that increase the accuracy of the predictive model, removing non-informative or redundant features. In most cases simple statistical methods such as Pearson's correlation coefficient, the t-test or ANOVA are sufficient to conduct feature selection. However, some -omics datasets, especially those yielding sparse feature matrices, require additional filtering or dimensionality reduction. DNA sequencing,


which is primarily used to identify somatic mutations associated with cancer, provides such sparse datasets. The main reason is that mutations very rarely occur at the exact same genomic position in multiple cancer genomes, requiring aggregation methods in order to provide data suitable for machine learning algorithms. Selection of relevant features is not the only challenge in a multiomic-based cancer classification project. Additional challenges involve the treatment of missing data, the integration of data from multiple methods, the selection of the data pre-processing methodology, as well as the classification methodology, including the validation strategy. The main goal of this study is to identify which -omics dataset, or combination of them, provides the most relevant information for the prognosis of lung cancer survival. A complementary goal is to determine the best feature aggregation strategy for the -omics datasets that are characterized by a sparse predictor matrix.

2 Materials and Methods

2.1 Data Used in the Study

Multiomic data for a total of 518 lung adenocarcinoma (LUAD) patients were downloaded from The Cancer Genome Atlas project through the GDC Data Portal at https://portal.gdc.cancer.gov. The set included the following pre-processed data:

1) gene expression levels obtained using RNA sequencing (RNA-seq), represented as read counts associated with individual genes and samples;
2) microRNA expression levels obtained using short RNA sequencing (miRNA-seq), represented as read counts associated with individual microRNAs and samples;
3) positions of somatic mutations obtained using whole exome sequencing (WES), represented as GRCh38 reference genome coordinates with supporting annotation data;
4) DNA methylation levels obtained using Illumina Infinium HumanMethylation450 BeadChip microarrays, represented as values in a range between 0 and 1 associated with specific CpG islands;
5) copy number variation (CNV) data obtained using Affymetrix SNP 6.0 microarrays, represented as GRCh38 reference genome intervals and segment mean statistics (log2(CN/2), where CN is the observed copy number of a particular region).

Additionally, we obtained clinical data for each patient, which included the survival time measured in days from the initial diagnosis. The patients were divided into two groups: those who died within two years from diagnosis and those who survived longer. We removed from the study all cases for which data from one of the five platforms was unavailable (N = 78) and those for which we could not reliably determine 2-year survival (N = 173), i.e., where the last follow-up was conducted earlier than two years from diagnosis and the patient was alive at that time. This resulted in 267 cases used for further analysis, out of which 178 were labeled as survivors and 89 as non-survivors.
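The exclusion and labeling rule described above can be sketched as a small function. This is an illustrative Python sketch (the study itself was conducted in R), and the argument names are assumptions, not actual TCGA clinical field names.

```python
def label_case(vital_status, days_to_death, days_to_last_followup,
               cutoff_days=2 * 365):
    """Assign a 2-year survival label, or None if the case is excluded.

    Sketch of the selection rule from the text; the field names are
    hypothetical, not the actual TCGA column names.
    """
    if vital_status == "dead":
        return "non-survivor" if days_to_death <= cutoff_days else "survivor"
    # Alive at last follow-up: usable only if followed for at least 2 years.
    if days_to_last_followup >= cutoff_days:
        return "survivor"
    return None  # censored before 2 years -> excluded from the study
```

For example, a patient who died at day 400 is a non-survivor, while a patient last seen alive at day 300 cannot be labeled and is excluded.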


2.2 Feature Definition and Pre-selection

Data acquired with the RNA-seq method provided 60 483 features, representing expression levels of coding and non-coding genes. This list was narrowed down to the 1000 features with the lowest p-value obtained in a comparison between survivors and non-survivors, based on a regression model (DESeq2 method [26]). A similar procedure was used for the microRNA-seq data, which initially provided information on 1881 features associated with expression levels of individual microRNAs. Data from the WES method provided 229 439 unique missense somatic mutations, aggregated into 18 409 genes, and later into 35 774 gene sets defined in MSigDB v.7.5 [27], associated with various intercellular processes and signaling pathways. For each gene or gene set, the number of cases from both patient groups (survivors and non-survivors) was determined in which at least one somatic missense mutation was observed. Using a two-proportions z-test, we compared the fractions of mutated genes/gene sets in both patient groups and selected the 1000 with the lowest p-value. Data from the copy number variation (CNV) study provided a total of 132 327 genomic ranges with altered copy numbers across all samples. Since the ranges are very rarely identical, we used the CNregions function from the iClusterPlus R package to aggregate them, using partial overlap information (frac.overlap = 0.9). This resulted in 3122 features across all samples, which we further associated with gene and gene set definitions, similarly as for the mutation data, including the preselection based on the two-proportions z-test. The methylation dataset contained a total of 485 577 genomic locations (CpG islands) for which methylation levels (represented as beta values) were measured across all samples. Using the beta regression test from the betareg R package [28], we identified positions at which the methylation level differs significantly between the two patient groups (survivors and non-survivors). Based on the test results, we selected the 1000 sites with the lowest p-value. The p-values were used predominantly for ranking purposes. However, to show the total number of discriminant features in each dataset, we additionally applied the Benjamini-Hochberg correction for multiple testing.

2.3 Variable Importance Study

The variable importance study was conducted using two distinct approaches: Least Absolute Shrinkage and Selection Operator (LASSO) regression, using the glmnet R package [29], and, additionally, the Boruta feature ranking and selection algorithm, which is based on the random forest approach and implemented in the Boruta R library [30]. Both methods were repeated 1000 times for the preselected features from each of the five -omics methods, in a 10-fold cross-validation. LASSO was executed on each -omics dataset independently, while Boruta was executed on the combined datasets. In both cases, features with variance equal to zero and highly correlated features (correlation coefficient > 0.9) were excluded from the study using the findCorrelation function from the caret R package [31].
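The zero-variance and correlation filters can be illustrated with a short sketch. This is a hedged Python approximation of the filtering step (the paper uses caret::findCorrelation; the greedy keep-the-first drop order below is an assumption, since caret's implementation chooses which member of a pair to drop based on mean absolute correlations).

```python
import numpy as np

def filter_features(X, cutoff=0.9):
    """Drop zero-variance columns, then greedily drop one column from each
    highly correlated pair (|r| > cutoff). Returns indices of kept columns."""
    X = np.asarray(X, dtype=float)
    keep = [j for j in range(X.shape[1]) if X[:, j].std() > 0]
    corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    dropped = set()
    for a in range(len(keep)):
        for b in range(a + 1, len(keep)):
            if a in dropped or b in dropped:
                continue
            if corr[a, b] > cutoff:
                dropped.add(b)  # keep the first of the pair, drop the second
    return [keep[i] for i in range(len(keep)) if i not in dropped]
```

On a toy matrix with a constant column and a perfectly correlated pair, only one representative of the pair and the variable columns survive.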


2.4 Classification of Data

The process of building a predictive model was performed in the R environment with the use of the caret R package [31]. The predictive models were obtained using the following methods:

1) Support Vector Machine (SVM) with linear and radial kernels
2) Linear Discriminant Analysis (LDA)
3) Random Forest (RF)
4) Neural Network
5) Generalized Linear Model (GLM)
6) Partial Least Squares (PLS).

The quality of the models was evaluated using 10-fold cross-validation. Since the training set was unbalanced (178 cases classified as survivors and 89 as non-survivors), we applied a down-sampling process for case selection in each cross-validation loop, which reduces the number of cases from the larger group (in this case "survivors") so that both groups are represented by the same number of cases. As a result, the model is trained on a smaller number of cases, which affects its predictive ability, but allows for a greater balance between sensitivity and specificity.
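The down-sampling step can be sketched as follows; a minimal Python illustration (the paper used caret's built-in down-sampling inside the cross-validation loop, applied to the training fold only).

```python
import random

def down_sample(labels, seed=0):
    """Randomly drop cases from the majority class so both classes end up
    with the same size. Returns the sorted indices of the retained cases."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    big, small = (pos, neg) if len(pos) > len(neg) else (neg, pos)
    kept = rng.sample(big, len(small))  # subsample the majority class
    return sorted(kept + small)
```

Applied to the 178/89 split described above, this yields 89 cases of each class (178 in total) for training.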

3 Results

3.1 Aggregation and Dimensionality Reduction

This study utilizes five -omics datasets with significantly different data types that in total provide information on over 622 thousand genomic and transcriptomic features. Reducing their number is therefore necessary, not only to reduce the calculation time and the complexity of the predictive models, but also, in some instances, to increase the information value of individual features. This problem most commonly occurs in mutation studies, where individual mutations only very rarely occur at the same genomic position, requiring aggregation techniques. CNV data is, however, even more problematic, since the input dataset consists of various genomic coordinates at which the DNA copy number is altered (additional copies can be gained or lost). CNV regions are very rarely identical between multiple samples, which makes it difficult to define numerical features shared across all of the samples in the experiment. One approach is to aggregate regions based on partial overlap, as shown in Fig. 1A (OA method). Each of the smaller regions has a specific copy number value attached to it, either as an absolute number of copies or, as in our case, as a segment mean variable, defined as log2(CN/2), where CN is the absolute copy number. It is worth noting that while this approach is based on dividing the genome into segments identical for each sample, not all of the genomic regions are represented by a numerical value. Sections that do not contain regions altered in at least one sample are omitted.
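The OA idea, splitting the genome at the union of all observed breakpoints and attaching the segment mean log2(CN/2) to each piece, can be sketched as follows. This is an illustrative Python sketch, not the CNregions implementation used in the paper.

```python
import math

def union_segments(regions_per_sample):
    """Split the genome at every breakpoint observed in any sample (the OA
    aggregation of Fig. 1A, sketched). Each region is a (start, end) pair."""
    points = sorted({p for regs in regions_per_sample for r in regs for p in r})
    return list(zip(points[:-1], points[1:]))

def segment_mean(cn):
    """Segment mean as defined in the text: log2(CN/2), so that the normal
    diploid state (CN = 2) maps to 0."""
    return math.log2(cn / 2)
```

Two overlapping regions (0, 10) and (5, 20) from two samples thus yield three shared segments: (0, 5), (5, 10) and (10, 20).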


Fig. 1. Classification accuracy of patient survival using various dimensionality reduction methods on the CNV dataset: A) Diagrams showing the various aggregation methods utilized in the study. Blue and purple rectangles represent regions with altered copy number; the r1-r5 labels mark genomic regions associated with numerical data used as predictors in the classification models. With no aggregation, the majority of features are associated with a single sample only. The OA, GA and GsA aggregation techniques make it possible to utilize the overlap between observed regions and the association between genes, in order to create features with non-zero values in a larger fraction of samples. B) Area Under the Curve (AUC) statistics obtained using 10-fold Lasso regression repeated 1000 times for individual datasets. Whiskers represent standard deviation.

Another method is to split the copy number ranges into much smaller sections defined by the gene regions (GA method). However, in some conditions (where the total length of copy number-altered regions is very small) this approach may result in a sparse feature matrix, where most of the values will be equal to zero. This may happen for genes which do not overlap with at least one region that shows altered copy number (r2 in the example). It is also worth noting that genes located close to each other are likely to show highly correlated signals (e.g. r4 and r5). By further aggregating the CNV statistics obtained for individual genes across specific gene sets (GsA method), both the high correlation and the low variance problems can be avoided. This approach can additionally reduce the total number of features in situations where the number of aggregated regions is very high. Principal Component Analysis (PCA) represents another class of dimensionality reduction methods, which makes it possible to reduce the number of features to the number of samples tested. PCA is a very popular method in machine learning that transforms the data by finding the best linear combinations of the original variables, so that the variance, or spread, along each new variable is maximized.
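The GsA step can be sketched as follows; an illustrative Python sketch in which the per-set statistic is taken as the mean of the member genes' values. The averaging choice is an assumption, since the text does not specify the exact aggregation function.

```python
def aggregate_gene_sets(gene_values, gene_sets):
    """GsA sketch: summarize a per-gene CNV statistic over each gene set.

    gene_values: dict mapping gene -> segment mean (or other statistic)
    gene_sets:   dict mapping gene-set name -> list of member genes
    Sets with no measured member genes are dropped.
    """
    out = {}
    for name, genes in gene_sets.items():
        vals = [gene_values[g] for g in genes if g in gene_values]
        if vals:
            out[name] = sum(vals) / len(vals)  # assumed aggregation: mean
    return out
```

Because each feature now pools several neighboring genes, the resulting matrix is both denser and less internally correlated than the gene-level one.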


Figure 1B shows the Area Under the Curve (AUC) statistics obtained for a specific number of predictors, using various dimensionality reduction techniques, prior to classifying LUAD cases into survivors (at least two years after the initial diagnosis) and non-survivors. AUC measures the ability of a classifier to distinguish between classes and is used as a summary of the Receiver Operating Characteristic (ROC) curve. The higher the AUC value, the better the classifier. The performance of simple overlap aggregation is relatively low, reaching an AUC of just ~0.65. PCA transformation leads to significant information loss, since a two-class classifier with an AUC in the vicinity of 0.5 shows no predictive capabilities. Methods based on gene and gene set aggregation show the highest performance of ~0.7, for just 20–25 predictors. Gene set aggregation (GsA) allowed us to obtain the highest AUC, and while the difference compared to gene-level aggregation (GA) is not substantial, we decided to use this approach in the -omics dataset comparison.

3.2 Predictive Potential of Various -Omics Datasets

The main goal of this work is to determine which -omics dataset provides the most useful information to differentiate LUAD patients who will survive longer than 2 years after the initial diagnosis from those who will die prematurely. To reach this goal, we utilized Lasso regression combined with 10-fold cross-validation to divide the dataset into small groups and test the classification performance using a variable number of predictors. Figure 2A shows that the methylation dataset provides the highest number of features.
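The AUC used throughout these comparisons can be computed directly from the ranks of the classifier scores (the Mann–Whitney formulation: the probability that a random positive case is scored above a random negative one); a minimal sketch:

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive case is
    scored above a randomly chosen negative one; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that ranks all positives above all negatives scores 1.0, a reversed ranking scores 0.0, and a constant score gives the uninformative 0.5.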
However, the mRNA set yields a much higher number of features that show statistically significant differences between survivors and non-survivors (adjusted p-value < 0.05, based on the Benjamini-Hochberg method) than the other datasets, where such features either do not exist (CNV and mutations) or their number is too small to be noticeable on the plot (methylation and microRNA). This goes hand in hand with the predictive potential of features belonging to different -omics datasets. The highest AUC can be observed for 47 predictors from the mRNA dataset, while the lowest maximum AUC value for a single dataset can be observed in the CNV and microRNA sets. It is also worth noting that the highest performance for the CNV, miRNA and mRNA sets is located between 22 and 47 features. A further increase in the number of features results in a decrease in classifier performance. This is typical for a dataset with a high number of features compared to the number of samples, which, due to the small number of training samples, leads to overfitting. The mutation dataset is exceptional in this regard, since it did not reach a maximum for fewer than 100 features, the range used to create Fig. 2A. However, the shape of the curve indicates that the plateau is likely at a level similar to that obtained for 100 features.

3.3 Variable Importance Study in a Multiomic Dataset

Lasso regression, used in Sect. 3.2, is a valuable tool to study the predictive potential of various distinct datasets. However, we found it to be inadequate for the case where multiple different -omics datasets are combined into a single feature matrix. This is likely due to the data transformations involved, which usually favor one of the methods, resulting in a feature ranking that is dominated by predictors from a single -omics dataset. A classifier created using such predictors does not utilize the potential


Fig. 2. Classification accuracy of patient survival for various -omics datasets: A) Bar plot showing the number of available and selected features in each of the -omics datasets. Light blue segments represent features that show statistically significant differences between survivors and non-survivors (adjusted p-value < 0.05); B) Area Under the Curve (AUC) statistics obtained using 10-fold Lasso Regression repeated 1000 times for individual datasets. Whiskers represent standard deviation.

of including multiple different datasets. To overcome this limitation we utilized the Boruta algorithm. Boruta is a feature ranking and selection method based on the random forest algorithm, with the main advantage of clearly deciding whether a variable is important or not for a particular classification problem. We conducted the Boruta analysis using the combined feature matrices used in Sect. 3.2. It yielded 43 attributes confirmed to be important for discriminating LUAD survivors from non-survivors. The method was executed 1000 times, providing as many variable importance scores for each of the attributes, which are summarized in Fig. 3A. Figure 3A lists only the attributes confirmed as important by the method, which include 29 methylation features (CpG sites), 13 mRNA features (defined by ENSEMBL gene IDs) and 1 miRNA (hsa-miR-130b). We used those features to test the classification performance using seven different approaches, the results of which are shown in Fig. 3B in the form of ROC curves. The method which performed best was Random Forest, achieving an AUC of 0.851. The worst method was the neural network, which likely involved too few neurons in the hidden layer for the classifier to reach its full potential for this


Fig. 3. Results of multiomic data classification: A) variable importance study obtained using the Boruta method; B) ROC curves and corresponding AUC statistics obtained for the features shown in panel A using various classification methods with 10-fold CV; C) Survival curves for patients classified as 2-year survivors and non-survivors, obtained using the Random Forest classifier evaluated with leave-one-out cross-validation.

set of predictors. Figure 3C shows the survival curves obtained by classifying each of the patients into survivors and non-survivors with the Random Forest classifier evaluated using leave-one-out cross-validation. While in some instances patients who survived longer than 5 years were classified as non-survivors, the difference in survival time between the two groups is substantial (p-value < 0.0001).
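The core idea behind Boruta, comparing each real feature against the best of its shuffled "shadow" copies, can be sketched as follows. This is an illustrative Python sketch: Boruta itself uses random-forest importances and a statistical test over repeated runs, whereas here any scoring function can be plugged in (absolute correlation below, as a stand-in).

```python
import numpy as np

def abs_corr(col, y):
    """Stand-in importance score: absolute Pearson correlation with labels."""
    return abs(np.corrcoef(col, y)[0, 1])

def important_features(X, y, importance, n_rounds=100, seed=0):
    """Shadow-feature sketch: in each round, shuffle every column to build
    'shadow' features, then count a real feature as a hit only if its
    importance beats the best shadow importance. Returns hit fractions."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    hits = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        shadows = rng.permuted(X, axis=0)  # each column shuffled independently
        best_shadow = max(importance(shadows[:, j], y)
                          for j in range(X.shape[1]))
        for j in range(X.shape[1]):
            hits[j] += importance(X[:, j], y) > best_shadow
    return hits / n_rounds  # fraction of rounds each feature "won"
```

A feature whose hit fraction stays near 1 across rounds is confirmed important; one that rarely beats its own noise copies is rejected.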

4 Discussion

The results of our study show not only that molecular data can be effectively used to predict 2-year survival in lung adenocarcinoma, but also that information on gene expression changes (RNA-seq), methylation and mutations provides much better predictors than CNV and data from miRNA-seq studies. We were also able to show the classification performance obtained using different dimensionality reduction methods on the most problematic CNV dataset, concluding that gene and gene set aggregation provides the best classification results. While the overall precision of the classifiers constructed using both single-omics and multiomic datasets is relatively low compared to some other, similar approaches [24, 32], our primary goal was not to build a classifier but rather to test the usefulness of various -omics datasets and of selection and dimensionality reduction techniques. For the


same reason we did not use a separate test dataset to validate the classifiers, one that would either originate from a different study or be selected from among all the cases we used. It is also difficult to compare the outcomes of our work to the most similar approach, presented in [32], due to different clustering of patient groups based on survival statistics. Due to variability in the times of the last follow-up, we decided to focus on 2-year survival, excluding cases where the last follow-up was conducted before 2 years and the patient was alive at that point. In such cases it is impossible to determine whether the actual survival time reached at least two years from the initial diagnosis. A significant limitation of this study is the small number of cases studied (N = 267) compared to the number of evaluated predictors. However, our dataset is relatively large compared to other multiomic classification approaches that focus on lung cancer [33–37], where the number of cases ranged between 28 and 168. It should also be noted that, due to the complexity of the learning and testing process and the long computation time, an additional classifier tuning loop, inside which its parameters are changed, has been omitted. By including this step it is potentially possible to further improve the classification results.

Acknowledgements. This work was supported by the Polish National Science Centre, grant number UMO-2020/37/B/ST6/01959, and by Silesian University of Technology statutory research funds. Calculations were performed on the Ziemowit computer cluster in the Laboratory of Bioinformatics and Computational Biology, created in the EU Innovative Economy Programme POIG.02.01.00–00-166/08 and expanded in the POIG.02.03.01–00-040/13 project.

References

1. Gridelli, C., et al.: Non-small-cell lung cancer. Nat. Rev. Dis. Primers 1, 15009 (2015)
2. O'Brien, T.D., Jia, P., Aldrich, M.C., Zhao, Z.: Lung cancer: one disease or many. Hum. Hered. 83, 65–70 (2018)
3. Yang, Y., Wang, M., Liu, B.: Exploring and comparing of the gene expression and methylation differences between lung adenocarcinoma and squamous cell carcinoma. J. Cell. Physiol. 234, 4454–4459 (2019)
4. Relli, V., Trerotola, M., Guerra, E., Alberti, S.: Distinct lung cancer subtypes associate to distinct drivers of tumor progression. Oncotarget 9, 35528–35540 (2018)
5. Borczuk, A.C., Toonkel, R.L., Powell, C.A.: Genomics of lung cancer. Proc. Am. Thorac. Soc. 6, 152–158 (2009)
6. Xiong, Y., Feng, Y., Qiao, T., Han, Y.: Identifying prognostic biomarkers of non-small cell lung cancer by transcriptome analysis. Cancer Biomark. 27, 243–250 (2020)
7. Cheung, C.H.Y., Juan, H.F.: Quantitative proteomics in lung cancer. J. Biomed. Sci. 24, 37 (2017)
8. Qi, S.A., et al.: High-resolution metabolomic biomarkers for lung cancer diagnosis and prognosis. Sci. Rep. 11, 11805 (2021)
9. Cancer Genome Atlas Research Network: Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–550 (2014)
10. Cancer Genome Atlas Research Network: Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519–525 (2012)


11. Simes, R.J.: Treatment selection for cancer patients: application of statistical decision theory to the treatment of advanced ovarian cancer. J. Chronic Dis. 38, 171–186 (1985)
12. Astion, M.L., Wilding, P.: Application of neural networks to the interpretation of laboratory data in cancer diagnosis. Clin. Chem. 38, 34–38 (1992)
13. Bryce, T.J., Dewhirst, M.W., Floyd, C.E., Jr., Hars, V., Brizel, D.M.: Artificial neural network model of survival in patients treated with irradiation with and without concurrent chemotherapy for advanced carcinoma of the head and neck. Int. J. Radiat. Oncol. Biol. Phys. 41, 339–345 (1998)
14. Cruz, J.A., Wishart, D.S.: Applications of machine learning in cancer prediction and prognosis. Cancer Inform. 2, 59–77 (2007)
15. Nguyen, T.M., et al.: Deep learning for human disease detection, subtype classification, and treatment response prediction using epigenomic data. Biomedicines 9 (2021)
16. Huang, Z., et al.: Deep learning-based cancer survival prognosis from RNA-seq data: approaches and evaluations. BMC Med. Genomics 13, 41 (2020)
17. Wang, Y., Lin, X., Sun, D.: A narrative review of prognosis prediction models for non-small cell lung cancer: what kind of predictors should be selected and how to improve models? Ann. Transl. Med. 9, 1597 (2021)
18. Schulz, S., et al.: Multimodal deep learning for prognosis prediction in renal cancer. Front. Oncol. 11, 788740 (2021)
19. Zhu, W., Xie, L., Han, J., Guo, X.: The application of deep learning in cancer prognosis prediction. Cancers 12 (2020)
20. Ten Haaf, K., et al.: Risk prediction models for selection of lung cancer screening candidates: a retrospective validation study. PLoS Med. 14, e1002277 (2017)
21. Ten Haaf, K., van der Aalst, C.M., de Koning, H.J., Kaaks, R., Tammemagi, M.C.: Personalising lung cancer screening: an overview of risk-stratification opportunities and challenges. Int. J. Cancer 149, 250–263 (2021)
22. Yeo, Y., et al.: Individual 5-year lung cancer risk prediction model in Korea using a nationwide representative database. Cancers 13 (2021)
23. Tufail, A.B., et al.: Deep learning in cancer diagnosis and prognosis prediction: a minireview on challenges, recent trends, and future directions. Comput. Math. Methods Med. 2021, 9025470 (2021)
24. Gao, Y., Zhou, R., Lyu, Q.: Multiomics and machine learning in lung cancer prognosis. J. Thorac. Dis. 12, 4531–4535 (2020)
25. Laios, A., et al.: Feature selection is critical for 2-year prognosis in advanced stage high grade serous ovarian cancer by using machine learning. Cancer Control 28, 10732748211044678 (2021)
26. Love, M.I., Huber, W., Anders, S.: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014)
27. Subramanian, A., et al.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550 (2005)
28. Cribari-Neto, F., Zeileis, A.: Beta regression in R. J. Stat. Softw. 34, 1–24 (2010)
29. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010)
30. Kursa, M., Rudnicki, W.: Feature selection with the Boruta package. J. Stat. Softw. 36, 1–13 (2010)
31. Kuhn, M.: Building predictive models in R using the caret package. J. Stat. Softw. 28, 1–26 (2008)
32. Malik, V., Dutta, S., Kalakoti, Y., Sundar, D.: Multi-omics integration based predictive model for survival prediction of lung adenocarcinoma. In: 2019 Grace Hopper Celebration India (GHCI), pp. 1–5 (2019)


33. Jayasurya, K., et al.: Comparison of Bayesian network and support vector machine models for two-year survival prediction in lung cancer patients treated with radiotherapy. Med. Phys. 37, 1401–1407 (2010)
34. Sun, T., et al.: Comparative evaluation of support vector machines for computer aided diagnosis of lung cancer in CT based on a multi-dimensional data set. Comput. Methods Programs Biomed. 111, 519–524 (2013)
35. Hyun, S.H., Ahn, M.S., Koh, Y.W., Lee, S.J.: A machine-learning approach using PET-based radiomics to predict the histological subtypes of lung cancer. Clin. Nucl. Med. 44, 956–960 (2019)
36. Wang, D.D., Zhou, W., Yan, H., Wong, M., Lee, V.: Personalized prediction of EGFR mutation-induced drug resistance in lung cancer. Sci. Rep. 3, 2855 (2013)
37. Emaminejad, N., et al.: Fusion of quantitative image and genomic biomarkers to improve prognosis assessment of early stage lung cancer patients. IEEE Trans. Biomed. Eng. 63, 1034–1043 (2016)

Graph Neural Networks-Based Multilabel Classification of Citation Network

Guillaume Lachaud(B), Patricia Conde-Cespedes(B), and Maria Trocan(B)

ISEP - Institut Supérieur d'Électronique de Paris, 10 rue de Vanves, 92130 Issy-les-Moulineaux, France
{glachaud,pconde,maria.trocan}@isep.fr

Abstract. There is an increasing number of applications where data can be represented as graphs. Besides, it is well known that artificial intelligence approaches have become a very active and promising research field, mostly due to deep learning technologies. However, popular deep learning architectures were designed to treat mostly image and text data. Graph Neural Networks are the branch of machine learning that builds neural networks for graph data. In this context, many authors have recently proposed to adapt existing approaches to graphs and networks. In this paper we train three models of Graph Neural Networks on an academic citation network of Computer Science papers, and we explore the advantages of turning the problem into a multilabel classification problem.

Keywords: Graph neural networks · Citation network · Multilabel classification

1 Introduction

Graphs are a kind of structured data that have become one of the pillars of our society. They are ubiquitous, appearing for example in biology with protein interaction graphs, in chemistry with molecules, in epidemiology, telecommunications and finance, and in sociology with social networks. Graphs are expressive enough to model complex networks such as interactions between billions of users. As technology develops and the collected data becomes more structured, the role of graphs in shaping our world will increase.

While traditional machine learning focused on providing good statistical models with a strong theoretical background, deep learning employs a data-driven approach. The first neural network models can be traced back as far as 70 years ago; however, they lacked the computational resources and data to be efficient. With the rise of faster computers and the use of Graphical Processing Units (GPUs), alongside the accumulation of data on the Internet, deep learning has seen tremendous success in computer vision [12] and natural language processing [3].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 128–140, 2022. https://doi.org/10.1007/978-3-031-21967-2_11


Although Convolutional Neural Networks (CNNs) perform well on image tasks and Recurrent Neural Networks (RNNs) are the standard for manipulating text sequences, neither of these architectures is designed to exploit the relations between nodes in graphs. This led researchers to build new architectures of neural networks dedicated to handling graph data: Graph Neural Networks (GNNs). In just a few years the field of GNNs has vastly grown and incorporates knowledge from machine learning and graph theory [24].

In this paper, we perform a multiclass classification on the citation network of the Computer Science (CS) papers published on arXiv, extracted from the Open Graph Benchmark repository [9] and referred to as ogbn-arxiv. We train three graph neural networks and select the best-performing one to examine the misclassification errors. We attribute the errors to two main causes: in many cases, the second most likely class predicted by the model is the real class; in addition, the model has difficulty classifying some of the less represented classes. We propose turning the task into a multilabel classification task. In many cases, classification tasks require neural networks to produce multiple labels: images contain several elements, text can be related to different topics, and nodes in a graph may be related to several classes via different neighbors. Formally speaking, given an input feature vector $x$, multilabel classification aims at finding a prediction of a label vector $y$, where each element takes the value 0 or 1, indicating whether the input belongs to a class or not. Multilabel classification for each type of data led to the development of specialized network architectures [13,16,21].

The paper is divided as follows: Sect. 2 presents the history of GNNs and some of the existing graph datasets that are used as benchmarks, and details the architecture of the GNNs we use in the paper. Sect. 3 describes ogbn-arxiv in detail; Sect. 4 presents the results of our experiments; Sect. 5 considers a multilabel classification approach to mitigate the errors of the model. Sect. 6 concludes the work and offers future areas of improvement.

2 Related Works

Graph neural networks first emerged in the context of extending recurrent neural networks to handle structured data [18]. Most of the leading approaches now follow a structure similar to the one introduced in [6]: the hidden representation of a node is updated using the hidden representations of its neighbors. The main differences between architectures are usually in how the representations are used, e.g. aggregating the features in GraphSAGE [7], and in the weighting of the neighbors, e.g. using fixed weights in Graph Convolutional Networks (GCN) [11], or assigning attention scores in Graph Attention neTworks (GAT) [20]. For complete surveys of graph neural networks, see [23,24]. One of our motivations for selecting GraphSAGE, GCN and GAT is that, since they are all variations on a more general architectural framework, experiments done with these models will to some extent generalize to other GNNs.


We use the following notation to describe the models: a graph is denoted $G = (V, E)$, where $V$ is the set of vertices and $E$ the set of edges. The adjacency matrix of $G$ is $A$. $N$ represents the number of nodes in the graph, i.e. the cardinality of $V$. The set of neighbors of a node $u$ is written $\mathcal{N}_u$ (the neighborhood can include the node itself). The activation matrix of layer $l$ is $H^{(l)} \in \mathbb{R}^{N \times D}$, where $D$ is the dimensionality of the layer. We use $h_u^{(l)}$ to denote the activation vector of node $u$ at layer $l$. The weight matrix at layer $l$ is $W^{(l)}$. By convention, $H^{(0)} = X$, the initial feature matrix of the nodes. $\sigma$ represents an activation function, e.g. a ReLU (Rectified Linear Unit), defined as $x \mapsto \max(0, x)$.

Graph Convolutional Networks

GCNs (Graph Convolutional Networks) were introduced in [11] in 2017 for semi-supervised learning on graphs. The main idea is based on using spectral graph wavelet transforms (also called spectral convolutions on graphs) and their approximation using Chebyshev polynomials, both introduced in [8] and refined in [2]. The authors in [11] used a first-order approximation of the spectral convolutions and approximated the largest eigenvalue of the graph Laplacian to be equal to 2, arguing that the neural network would adapt during training. With these assumptions, they defined a graph convolutional layer by the following propagation rule:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \quad (1)$$

where $\tilde{A} = A + I_N$ is the adjacency matrix of the graph with added self-connections, and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. A GCN is usually composed of several of these layers, with the activation function of the last layer being a softmax to output probabilities.

Graph Attention Networks

Inspired by attention mechanisms in deep learning, which were first developed in natural language processing for dealing with sequences [1], GATs (Graph Attention neTworks) were introduced in 2018 in [20] and use the following layer-wise propagation rule:

$$h_u^{(l+1)} = \sigma\left(\sum_{v \in \mathcal{N}_u} \alpha_{uv}^{(l)} W^{(l)} h_v^{(l)}\right) \quad (2)$$

$\alpha_{uv}^{(l)}$ is the normalized attention coefficient at layer $l$ of node $v$ with respect to $u$; it indicates how important the features of node $v$ are to node $u$. The coefficients are obtained by computing the attention coefficients and then performing a softmax for normalization. More formally, with $e_{uv}^{(l)}$ the attention coefficients at layer $l$ and $a$ the attention mechanism, we have

$$e_{uv}^{(l)} = a\left(W^{(l)} h_u^{(l)}, W^{(l)} h_v^{(l)}\right) \quad (3)$$

$$\alpha_{uv}^{(l)} = \frac{\exp(e_{uv}^{(l)})}{\sum_{w \in \mathcal{N}_u} \exp(e_{uw}^{(l)})} \quad (4)$$
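As a rough illustration, the GCN propagation rule of Eq. (1) and the attention normalization of Eqs. (3)-(4) can be written out with dense NumPy arrays. This is only a sketch on a toy graph with identity weights, not the sparse implementation used in the experiments; the function names and the numerical values are ours.

```python
import numpy as np

def gcn_layer(A, H, W):
    """Eq. (1): H' = sigma(D~^-1/2 A~ D~^-1/2 H W), with ReLU as sigma."""
    A_tilde = A + np.eye(A.shape[0])                    # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))     # D~_ii = sum_j A~_ij
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)               # ReLU

def gat_attention(e_u):
    """Eqs. (3)-(4): softmax of the raw attention scores e_uv over the
    neighbors of node u, giving the normalized coefficients alpha_uv."""
    exp_e = np.exp(e_u - e_u.max())                     # max-shift for stability
    return exp_e / exp_e.sum()

# Toy 3-node star graph with one-hot input features and identity weights.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
H1 = gcn_layer(A, np.eye(3), np.eye(3))
alpha = gat_attention(np.array([0.5, 1.5, -0.2]))       # made-up raw scores
print(H1.shape, round(alpha.sum(), 6))  # (3, 3) 1.0
```

The coefficients returned by `gat_attention` always sum to one over the neighborhood, which is what makes Eq. (2) a weighted average of the neighbors' transformed features.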

The attention mechanism can be any function that takes as input two vectors with the same dimension as the product $W^{(l)} h_u^{(l)}$ and outputs a real value. For example, $a$ can be a feed-forward neural network.

GraphSAGE

GCN and GAT can perform both transductive and inductive learning, that is, use the observed data to predict the behaviour of new data (transductive), or use the observed data to infer general rules as to the behaviour of the data (inductive). Graphs are particularly challenging for inductive learning because it requires learning the structures of subgraphs. Contrary to GCN and GAT, GraphSAGE (Graph SAmple and aggreGatE), introduced in 2017, was specifically designed to tackle the challenge of inductive learning [7]. The network uses the following forward propagation rules for layer $l+1$ and for each node $v \in V$:

$$h_{\mathcal{N}_v}^{(l+1)} = \mathrm{aggregate}_l\left(\{h_u^{(l)}, \forall u \in \mathcal{N}_v\}\right) \quad (5)$$

$$h_v^{(l+1)} = \sigma\left(W^{(l)} \, \mathrm{concatenate}(h_v^{(l)}, h_{\mathcal{N}_v}^{(l+1)})\right) \quad (6)$$

$$h_v^{(l+1)} = \frac{h_v^{(l+1)}}{\|h_v^{(l+1)}\|_2} \quad (7)$$

$h_{\mathcal{N}_v}^{(l+1)}$ is an activation vector for the neighborhood of node $v$. The concatenate function concatenates the two vectors for the matrix multiplication with $W^{(l)}$. The authors in [7] argue that the aggregator function should be symmetric, so as to be independent of the ordering of the neighbors, trainable, and have high representational capacity. An aggregator can be a simple mean aggregator, which, when injected in Eq. 5, gives

$$h_{\mathcal{N}_v}^{(l+1)} = \sigma\left(W \cdot \frac{1}{|\mathcal{N}_v| + 1}\left(h_v^{(l)} + \sum_{u \in \mathcal{N}_v} h_u^{(l)}\right)\right) \quad (8)$$

More aggregators, such as an LSTM one, are described in [7].
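A GraphSAGE layer with the mean aggregator can be sketched as follows. This is a toy dense implementation with our own function names, combining the mean aggregation of Eq. (8) with the L2 normalization of Eq. (7); it is not the sampled, mini-batched version of [7].

```python
import numpy as np

def graphsage_mean_layer(neighbors, H, W):
    """One GraphSAGE layer with the mean aggregator: average the node's
    own vector with its neighbors' (Eq. 8), multiply by W, apply ReLU,
    then L2-normalize (Eq. 7)."""
    H_next = []
    for v, Nv in neighbors.items():
        mean_vec = (H[v] + sum(H[u] for u in Nv)) / (len(Nv) + 1)
        h = np.maximum(W @ mean_vec, 0.0)          # sigma = ReLU
        norm = np.linalg.norm(h)
        H_next.append(h / norm if norm > 0 else h)
    return np.stack(H_next)

# Tiny triangle graph with 2-dimensional features and identity weights.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
H0 = np.array([[1., 0.], [0., 1.], [1., 1.]])
H1 = graphsage_mean_layer(neighbors, H0, np.eye(2))
print(np.round(np.linalg.norm(H1, axis=1), 6))  # every row has unit norm
```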


While the neighborhood function $\mathcal{N}: v \to 2^V$ can be defined arbitrarily, in practice we draw a uniform sample of fixed size from the neighbors of the node. This draw is performed for every layer. In order to compare GNN architectures, it is essential to have high-quality datasets and well-designed benchmarks.
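The per-layer sampling step can be sketched in plain Python; the `fanout` parameter name is ours.

```python
import random

def sample_neighbors(neighbors, node, fanout, rng=random):
    """Draw a uniform sample of at most `fanout` neighbors of `node`.
    Called once per layer, so different layers may see different samples."""
    nbrs = neighbors[node]
    if len(nbrs) <= fanout:
        return list(nbrs)
    return rng.sample(nbrs, fanout)

neighbors = {0: [1, 2, 3, 4, 5]}
random.seed(0)
sampled = sample_neighbors(neighbors, 0, fanout=3)
print(len(sampled))  # 3
```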

3 Dataset Description

The first graph benchmark datasets were quite small, having only a few thousand nodes [17]. This posed a problem when trying to compare GNNs, because the graphs were too small to create meaningful differences between the methods. In the last few years, there has been a trend towards standardizing the benchmarking of GNNs [4]. In this vein, [9] introduced an ensemble of graph datasets of varying size and type: directed, undirected, homogeneous, heterogeneous, etc. Furthermore, they provided a benchmark framework to ease the process of comparing methods. These datasets tackle the three types of graph tasks, which are node-, edge- and graph-level property prediction. The dataset we use in this paper, ogbn-arxiv, comes from their paper.

A leaderboard is available at https://ogb.stanford.edu/docs/leader_nodeprop/ to see the best-performing models on ogbn-arxiv. Many of the leading submissions are based on variations of GATs. Two notable approaches in the leaderboard are presented in [19] and [10]. The first proposes a GNN that spreads information further than a node's first neighbors using multi-hop neighborhood aggregation. The second does not use GNNs, but instead relies on a multi-layer perceptron followed by two post-processing steps that propagate errors from the training data.

ogbn-arxiv is a directed homogeneous graph. It has 169,343 nodes, which represent papers from the arXiv Computer Science repository, and 1,166,243 directed edges, which correspond to citations between papers. Each node has a date attribute, corresponding to the year of publication, which ranges between 1991 and 2020², and a 128-dimensional feature vector. The feature vector represents the average of the word embeddings of all the words in the title and the abstract of the paper, which the authors of [9] obtained by applying a word2vec [15] model that they trained on the Microsoft Academic Graph (MAG) [22].

Each paper is labelled by the authors and the arXiv moderators; these labels are assigned a number between 0 and 39, representing the 40 categories of the arXiv CS (Computer Science) repository. Each category is denoted by two letters; their full names can be found at https://arxiv.org/corr/subjectclasses.
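As an illustration of how such node features are built, the sketch below averages word vectors over the words of a title. The embedding table, its dimensionality and its values are made up; the real features come from a 128-dimensional word2vec model trained on MAG.

```python
import numpy as np

def paper_feature(text, embeddings, dim):
    """Average the embeddings of all known words in the title + abstract.
    `embeddings` maps word -> 1-D vector; unknown words are skipped."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy 4-dimensional "word2vec" table (hypothetical values).
emb = {"graph": np.array([1., 0., 0., 0.]),
       "neural": np.array([0., 1., 0., 0.]),
       "networks": np.array([0., 0., 1., 0.])}
x = paper_feature("Graph neural networks for citations", emb, dim=4)
print(x)  # mean of the 3 known word vectors; unknown words are ignored
```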

² There are 10 papers whose publication date is before 1991, which is the year arXiv was publicly released.


In [9], the authors proposed splitting the dataset with respect to the year of publication of the papers, on the basis that this reflects one of the real-world applications of GNNs, which is to predict the category of new papers using only already published papers; furthermore, they argue it is a more challenging task than randomly splitting between train, validation and test sets. Thus, the split is the following: the train set consists of all papers published before 2018; the validation set contains all the papers published in 2018; and the test set comprises all the papers from 2019 (inclusive) onwards.
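The year-based split can be sketched as follows; `split_by_year` is a hypothetical helper operating on a toy node-to-year mapping rather than on the OGB split indices.

```python
def split_by_year(paper_years):
    """Split node ids by publication year: train < 2018, valid == 2018,
    test >= 2019, following the ogbn-arxiv protocol."""
    train, valid, test = [], [], []
    for node, year in paper_years.items():
        if year < 2018:
            train.append(node)
        elif year == 2018:
            valid.append(node)
        else:
            test.append(node)
    return train, valid, test

years = {0: 2015, 1: 2018, 2: 2019, 3: 2020, 4: 1999}
train, valid, test = split_by_year(years)
print(train, valid, test)  # [0, 4] [1] [2, 3]
```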

4 Experiments

In this section, we use a single-class classification approach. A deep analysis of the misclassifications will lead us to explore a multilabel approach in Sect. 5.

For all the experiments, we use a 24 GB NVIDIA RTX GPU. The code is written in Python, PyTorch and PyTorch Geometric [5]. We use the OGB (Open Graph Benchmark) [9] package to get the ogbn-arxiv dataset. Our choice of GCN, GraphSAGE and GAT is based on the following observations: they are special cases of Message Passing Neural Networks [6], which are one of the dominant forms of graph neural networks. Furthermore, they form the basis of most of the leading architectures in the leaderboard on the OGB datasets, available at https://ogb.stanford.edu/docs/leader_nodeprop/.

We train 10 instances of GCN, GraphSAGE and GAT for 500 epochs each. Because GNNs are prone to over-smoothing when using too many layers [14], each of our models has 3 layers with a hidden layer size of 256 units. We use dropout to mitigate overfitting. Because GraphSAGE and GCN only handle undirected graphs, we consider the citation network as an undirected network.

Average results of training are presented in Table 1. The GAT outperforms the other models. This is consistent with the leaderboard for ogbn-arxiv, where GAT-based models occupy the first places.

Table 1. Training results for the arXiv dataset

Method     Validation accuracy  Testing accuracy
GCN        73.33 ± 0.09         72.06 ± 0.2
GraphSAGE  71.94 ± 0.1          70.77 ± 0.26
GAT        73.66 ± 0.11         72.41 ± 0.19


In the rest of the paper, in keeping with the results obtained in our own experiments and by other researchers, we focus on the results given by the GAT model.

Fig. 1. Subset of the confusion matrix

To analyze the misclassification errors made by the model, we need to go beyond the accuracy score and look at the confusion matrix on the test set, which will help us see where it fails to generalize. The value at row ci and column cj indicates the number of times the model has assigned the label cj to a node from ci , divided by the total number of nodes of category ci in the test set. The rows have been normalized and each one adds up to 100. The higher the values outside the diagonal are, the more the model made mistakes. The categories were ordered in such a way that the small classes occupy the first columns while the middle of the matrix is for the most populated classes and the rest of the columns represent mostly middle-sized classes. A subset of the full confusion matrix is displayed in Fig. 1 with the categories that are discussed in the rest of the paper.
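The row-normalized confusion matrix described above can be computed as in the small sketch below, on made-up labels; the helper name is ours.

```python
import numpy as np

def confusion_matrix_percent(y_true, y_pred, n_classes):
    """Confusion matrix where entry (i, j) is the percentage of class-i
    nodes that the model labelled as class j; each non-empty row sums to 100."""
    C = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    row_sums = C.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1          # avoid division by zero for empty classes
    return 100.0 * C / row_sums

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
C = confusion_matrix_percent(y_true, y_pred, 3)
print(C[0])  # class 0: two thirds correct, one third predicted as class 1
```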


We first observe that the size of the category in the training set, as represented by the "train size" column in Table 3, is not a sufficient indicator of poor performance. The model achieves low accuracy on categories such as ar (Hardware Architecture) and pf (Performance) but successfully classifies nodes from gt (Computer Science and Game Theory) and sc (Symbolic Computing), despite the fact that these categories have approximately the same number of nodes in the training set. This suggests that some categories display more cohesiveness than others, and that the network is able to detect this pattern.

Still on the topic of categories with little representation, we see systematic misattribution for nodes in the mm (Multimedia) and gr (Graphics) categories, which are classified as cv. Considering that the three subjects likely share a similar terminology, and that the initial features of the nodes were based on the words in the title and the abstract, there is little hope, without changing the features, of correctly predicting these classes.

Next, we are faced with subject areas that are intrinsically interdisciplinary, which means they exploit ideas from other areas of research. The most eminent representative of these categories is hc (Human-Computer Interaction). By design, HCI tends to capitalize on advances in various fields, e.g. computer vision and natural language processing, and study the impact, positive or negative, they can have on users. In ogbn-arxiv, this is reflected in two ways: hc nodes have neighbors that can belong to other classes, and two hc nodes can have vastly different features.

Finally, the error which is the key factor in driving down the accuracy is the confusion of categories within a group of similar categories. This is exemplified by the categories cv (Computer Vision), lg (Machine Learning), ai (Artificial Intelligence), cl (Computation and Language, mostly natural language processing) and ne (Neural and Evolutionary Computation). About 30% of ai nodes in the test set are incorrectly attributed to one of the above classes, while 20% of lg nodes and 35% of ne nodes are similarly misclassified. All these categories mutually fuel the research of the others. The misclassification stems from a combination of two causes mentioned earlier: many nodes from these categories share a similar terminology, e.g. papers on neural networks have similar characteristics, and the nodes cite papers from all the areas in the group.

Considering the overlapping themes of some categories, as well as the interdisciplinary content of some papers, a multilabel classification approach is preferable to the single-label classification task. Firstly, it allows a finer-grained categorization of papers, distinguishing papers in the robotics field that have a computer vision component from those that have a natural language processing component. Secondly, it helps concentrate on the bigger errors made by the neural networks: those in which the true category is not in the top predictions.

5 Multilabel Classification Approach

Instead of focusing on only the top prediction of the model, we retrieve the three most likely predicted classes of our GAT model for each node in the test set. The set of estimated probabilities is usually obtained by applying a softmax activation function to the last layer of the neural network; in the case of multiclass classification, to make a prediction, we simply output the category associated with the highest probability. We compute the number of times the correct category is the prediction (accuracy, or top 1), as well as the number of times it appears in the two (top 2) or three (top 3) categories with the highest estimated probabilities (Table 2).

Overall, while the model achieves 72.4% accuracy, the right category is in the two highest predictions 87.3% of the time, an increase of almost 15 percentage points. In the top 3, this number rises to 92.4%. Results for each category are presented in Table 3, alongside the relative size of the category in the training set (given in percentage) and its population in the test set. The arXiv categories in bold are the ones discussed in the text. Additionally, a representation of the top 3 predictions for some nodes is presented in Fig. 2.

Table 2. Top 3 scores on training, validation and test sets

Dataset  Top 1  Top 2  Top 3
Train    79.31  90.36  94.14
Valid    73.62  87.76  92.73
Test     72.27  87.25  92.36
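The top-k scores can be computed directly from the softmax outputs. The sketch below uses a made-up probability matrix for three nodes over four classes; `top_k_accuracy` is our own helper name.

```python
import numpy as np

def top_k_accuracy(probs, labels, k):
    """Fraction of nodes whose true label is among the k classes with the
    highest estimated probabilities (top-1 is the usual accuracy)."""
    top_k = np.argsort(probs, axis=1)[:, -k:]          # k most likely classes
    hits = [labels[i] in top_k[i] for i in range(len(labels))]
    return np.mean(hits)

# Toy softmax outputs for 3 nodes over 4 classes (made-up values).
probs = np.array([[0.70, 0.20, 0.05, 0.05],
                  [0.10, 0.30, 0.40, 0.20],
                  [0.25, 0.25, 0.30, 0.20]])
labels = np.array([0, 1, 3])
print(top_k_accuracy(probs, labels, 1))  # 1/3: only node 0 is right
print(top_k_accuracy(probs, labels, 2))  # 2/3: node 1's label is in its top 2
```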

We see that, within a group of non-mutually exclusive categories, some classes attract most of the predictions, such as cv and cl in the group of artificial intelligence-related categories. This leads to poor accuracy scores for the lg and ai classes. However, when we look at the three highest estimated probabilities, the network gets most of the lg and ai samples right. For example, node 1 in Fig. 2 belongs to the lg category, which is the second prediction of the model. Similarly, nodes 2, 3, 5 and 7 all belong to the ai category, which is the second or the third prediction of the model. Additionally, the top three predictions are either related to the true category, or to the category of the neighbors. For example, node 1 has neighbors that belong to the ai, lg, cv and cl categories. This means that the model is properly learning from the information contained in the neighbors. Nodes with neighbors from different categories than their own will rarely be classified in the correct category, but the top predictions of the model will most often be related to the


Table 3. Top 3 category predicted by the GAT model. The train size represents the percentage of nodes in the training set that are from each category. The test column indicates the number of nodes from the test set that are in each category.

Subject  Top 1  Top 2  Top 3  Train size (%)  Test
cv       91.83  98.10  98.99  10.99           10477
cy       17.13  43.12  59.33   1.12             654
lg       69.27  91.30  96.38   7.69           10740
cg       75.72  85.62  90.42   1.64             313
it       90.56  96.28  97.54  17.91            2849
dm       26.02  54.65  78.81   1.71             269
cl       92.74  97.06  98.27   4.77            4631
pl       47.41  74.09  81.87   1.39             386
ai       49.14  71.13  82.68   5.70            1455
hc       20.10  36.82  52.89   0.77             622
ds       69.87  86.85  92.43   5.97            1414
dl       76.17  81.78  85.05   1.21             214
ni       55.12  84.32  91.12   4.46            1250
         55.45  73.18  80.00   1.02             220
cr       67.04  82.18  87.91   3.15            1869
sd       77.47  89.89  94.95   0.50             475
dc       52.73  75.12  83.07   3.23            1246
ma        6.28  27.62  62.76   0.43             239
lo       67.94  90.18  94.13   3.96             733
et       57.89  74.64  81.82   0.44             209
ro       70.91  88.77  94.63   1.83            2066
mm       22.46  52.94  66.31   0.42             187
si       68.40  82.61  88.18   3.14            1041
sc       83.10  88.73  88.73   0.52              71
gt       74.16  87.40  91.55   2.76             627
ce       28.36  44.78  52.99   0.42             134
sy       63.72  79.47  84.96   2.06             419
na       33.33  55.56  70.37   0.48              54
se       62.00  76.24  81.81   1.69             808
gr       11.33  56.16  77.34   0.22             203
ir       46.52  77.47  90.13   1.48             892
pf       10.00  34.17  52.50   0.26             120
cc       51.59  71.88  87.25   2.47             345
ms       57.83  79.52  84.34   0.30              83
db       63.41  74.43  83.37   1.78             481
ar       45.98  65.52  74.71   0.27              87
ne       44.90  64.97  82.96   1.42             628
oh        0.00   1.96   3.92   0.33              51
os        8.33  25.00  50.00   0.08              36
gl        0.00   0.00   0.00   0.02               5

content of the paper. This suggests that focusing on a single category is not sufficient to properly classify a paper, and that a better way is to look at the first two or three predictions to get a meaningful categorization of the paper. Node 1 is a paper from the ai category, but it cites papers from the cv category; thus it is likely to contain a sizable amount of information related to computer vision, even if it is not the main theme of the paper. We also observe that the challenges faced by interdisciplinary categories remain when we look at the top three predictions: the model correctly has hc in its top three predictions in only 53% of the cases. Node 4 and node 7 in Fig. 2 illustrate the situation. Node 4 only has lg neighbors, while node 7 only has cv neighbors. Furthermore, hc is not in the first three predictions of the model.


Fig. 2. Top 3 predictions for a few nodes in the graph. The pie chart represents the probability assigned by the model to the first three categories. For each node with a pie chart, the label of the first prediction is the one on top, the second prediction the one in the middle and the third prediction the one at the bottom. The nodes without pie charts are the neighbors of the nodes on which we do the predictions, and have their true label written inside them.

6 Conclusion and Future Works

In this paper we trained three common graph neural networks (GCN, GraphSAGE and GAT) on the ogbn-arxiv graph for node classification. While typical classification tasks assume that objects belong to a unique category, we found that many misclassification errors come from the fact that some papers share common features with several related categories. This led us to reframe the problem as a multilabel classification problem, where a node might belong to more than one category with a given probability. For instance, a paper in the robotics category might tackle a computer vision problem, while another one might deal with a natural language processing task. We found that, when considering the top three predicted classes, the real class was present in more than 92% of the cases. In addition, we observed that the categories in the top predictions are usually related to the true category, or to the category of the neighbors of the paper. These facts validate the multilabel approach.

Some perspectives for future work include performing a similar analysis on bigger datasets to generalize our findings. The multilabel approach is likely to extend to other domains, as objects in social networks or other real-world data do not usually belong exclusively to one class. Furthermore, different sets of features can be explored to improve discrimination between classes.


References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015, Conference Track Proceedings (2015)
2. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 2016, pp. 3844–3852. Curran Associates Inc., Red Hook, NY, USA (2016)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
4. Dwivedi, V.P., Joshi, C.K., Laurent, T., Bengio, Y., Bresson, X.: Benchmarking graph neural networks. arXiv:2003.00982 [cs, stat] (2020)
5. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. arXiv:1903.02428 [cs, stat] (2019)
6. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017. Proceedings of Machine Learning Research, vol. 70, pp. 1263–1272. PMLR (2017)
7. Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 2017, pp. 1025–1035. Curran Associates Inc., Red Hook, NY, USA (2017)
8. Hammond, D.K., Vandergheynst, P., Gribonval, R.: Wavelets on graphs via spectral graph theory. Appl. Comput. Harmonic Anal. 30(2), 129–150 (2011). https://doi.org/10.1016/j.acha.2010.04.005
9. Hu, W., et al.: Open graph benchmark: datasets for machine learning on graphs. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H.T. (eds.) Advances in Neural Information Processing Systems, vol. 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 6–12 December 2020, Virtual (2020)
10. Huang, Q., He, H., Singh, A., Lim, S.N., Benson, A.R.: Combining label propagation and simple models out-performs graph neural networks. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. OpenReview.net (2021)
11. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017, Conference Track Proceedings. OpenReview.net (2017)
12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017). https://doi.org/10.1145/3065386


13. Lanchantin, J., Sekhon, A., Qi, Y.: Neural message passing for multi-label classification. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds.) ECML PKDD 2019. LNCS (LNAI), vol. 11907, pp. 138–163. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-46147-8_9
14. Li, G., Müller, M., Thabet, A., Ghanem, B.: DeepGCNs: can GCNs go as deep as CNNs? In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9266–9275. IEEE, Seoul, Korea (South) (2019). https://doi.org/10.1109/ICCV.2019.00936
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Bengio, Y., LeCun, Y. (eds.) 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, 2–4 May 2013, Workshop Track Proceedings (2013)
16. Nam, J., Kim, J., Loza Mencía, E., Gurevych, I., Fürnkranz, J.: Large-scale multilabel text classification — revisiting neural networks. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) ECML PKDD 2014. LNCS (LNAI), vol. 8725, pp. 437–452. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-44851-9_28
17. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collective classification in network data. AI Mag. 29(3), 93 (2008). https://doi.org/10.1609/aimag.v29i3.2157
18. Sperduti, A., Starita, A.: Supervised neural networks for the classification of structures. IEEE Trans. Neural Netw. 8(3), 714–735 (1997). https://doi.org/10.1109/72.572108
19. Sun, C., Wu, G.: Adaptive graph diffusion networks with hop-wise attention. arXiv:2012.15024 [cs] (2020)
20. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018, Conference Track Proceedings. OpenReview.net (2018)
21. Wang, J., Yang, Y., Mao, J., Huang, Z., Huang, C., Xu, W.: CNN-RNN: a unified framework for multi-label image classification. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2285–2294. IEEE, Las Vegas, NV, USA (2016). https://doi.org/10.1109/CVPR.2016.251
22. Wang, K., Shen, Z., Huang, C., Wu, C.H., Dong, Y., Kanakia, A.: Microsoft Academic Graph: when experts are not enough. Quant. Sci. Stud. 1(1), 396–413 (2020). https://doi.org/10.1162/qss_a_00021
23. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, P.S.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32(1), 4–24 (2021). https://doi.org/10.1109/TNNLS.2020.2978386
24. Zhang, Z., Cui, P., Zhu, W.: Deep learning on graphs: a survey. IEEE Trans. Knowl. Data Eng. 34(1), 249–270 (2022). https://doi.org/10.1109/TKDE.2020.2981333

Towards Efficient Discovery of Partial Periodic Patterns in Columnar Temporal Databases

Penugonda Ravikumar 1,4(B), Venus Vikranth Raj 4, Palla Likhitha 1, Rage Uday Kiran 1,2,3, Yutaka Watanobe 1, Sadanori Ito 2, Koji Zettsu 2, and Masashi Toyoda 3

1 The University of Aizu, Fukushima, Japan
  [email protected]
2 National Institute of Information and Communications Technology, Tokyo, Japan
  {ito,zettsu}@nict.go.jp
3 The University of Tokyo, Tokyo, Japan
  [email protected]
4 IIIT-RK Valley, RGUKT-Andhra Pradesh, Vempalli, India

Abstract. Finding partial periodic patterns in temporal databases is a challenging problem of great importance in many real-world applications. Most previous studies focused on finding these patterns in row temporal databases. To the best of our knowledge, no existing study aims to find partial periodic patterns in columnar temporal databases. One cannot ignore the importance of the knowledge that exists in very large columnar temporal databases, because real-world big data is widely stored in them. With this motivation, this paper proposes an efficient algorithm, Partial Periodic Pattern-Equivalence Class Transformation (3P-ECLAT), to find the desired patterns in a columnar temporal database. Experimental results on synthetic and real-world databases demonstrate that 3P-ECLAT is not only memory and runtime efficient but also highly scalable. Finally, we present the usefulness of 3P-ECLAT with a case study on air pollution analytics.

Keywords: Pattern mining · Periodic patterns · Columnar databases

1 Introduction

The big data generated by real-world applications is naturally stored in row or columnar databases. Row databases help the user write data quickly, while columnar databases allow the user to execute fast (aggregate) queries. Thus, row databases are suitable for Online Transaction Processing (OLTP), while columnar databases are suitable for Online Analytical Processing (OLAP). As the objective of knowledge discovery in databases falls under OLAP, this paper aims to find partial periodic patterns in columnar databases.

This research was funded by JSPS Kakenhi 21K12034.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 141–154, 2022. https://doi.org/10.1007/978-3-031-21967-2_12


Partial periodic pattern mining [4] is an important knowledge discovery technique in data mining. It involves discovering all patterns in a temporal database that satisfy the user-specified minimum periodic-support (minPS) and periodicity (per) constraints. The minPS controls the minimum number of periodic occurrences of a pattern in a database. The per controls the maximum inter-arrival time of a pattern in the database. A classical application is air pollution analytics. It involves identifying the geographical areas in which people were regularly exposed to harmful air pollutants, say PM2.5. A partial periodic pattern discovered in our air pollution database is as follows:

{1591, 1266, 1250} [periodic-support = 23, periodicity = 3 h].

The above pattern indicates that the people living close to the sensors 1591, 1266, and 1250 were frequently and regularly (i.e., at least once every 3 h) exposed to harmful levels of PM2.5. The produced information may help users for various purposes, such as alerting local authorities and introducing new pollution control policies. (This application is further discussed as a case study in the latter parts of this paper.)

Uday et al. [4] described the Partial Periodic Pattern-growth (3P-growth) algorithm to find the desired patterns in a temporal database. It is a depth-first search algorithm that can find partial periodic patterns only in a row database; in other words, this algorithm cannot find partial periodic patterns in a columnar database. One can find partial periodic patterns by transforming a columnar temporal database into a row database. However, such a naïve transformation process must be avoided due to its high computational cost. With this motivation, this paper aims to develop an efficient algorithm that can find partial periodic patterns in a columnar database. Finding partial periodic patterns in columnar databases is non-trivial and challenging for the following reasons:

1. Zaki et al. [8] first discussed the importance of finding frequent patterns in columnar databases and described a depth-first search algorithm, called Equivalence Class Transformation (ECLAT), to find frequent patterns in a columnar database. Unfortunately, we cannot directly use this algorithm to find periodic-frequent patterns in a columnar database, because ECLAT completely disregards the temporal occurrence information of an item in the database.

2. The space of items in a database gives rise to an itemset lattice. The size of this lattice is 2^n − 1, where n represents the total number of items in the database. This lattice represents the search space for finding interesting patterns. Reducing this vast search space is a challenging task.

This paper proposes a novel and generic ECLAT algorithm that addresses the above two issues. Our algorithm finds the desired patterns by taking into account the frequency and temporal occurrence information of the items in the database (see Fig. 1). The contributions of this paper are as follows: (i) This paper proposes a novel algorithm to find partial periodic patterns in a columnar temporal database. We


Fig. 1. Search space of the items a, b, and c: (a) itemset lattice and (b) depth-first search on the lattice
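For the three items of Fig. 1, the 2^n − 1 search space can be enumerated directly. The following is a small illustrative sketch (the function name is ours, not part of the paper):

```python
from itertools import combinations

def itemset_lattice(items):
    """Enumerate every non-empty itemset over `items` (the full search space)."""
    return [set(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

lattice = itemset_lattice(['a', 'b', 'c'])
print(len(lattice))  # 2^3 - 1 = 7 itemsets
```

For n = 3 items this yields 7 itemsets, matching the lattice of Fig. 1(a); the count doubles with every additional item, which is why pruning matters.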

call our algorithm Partial Periodic Pattern-Equivalence Class Transformation (3P-ECLAT). (ii) To the best of our knowledge, this is the first algorithm that aims to find partial periodic patterns in a columnar temporal database. A key advantage of this algorithm over the state-of-the-art algorithms is that it can also be employed to find partial periodic patterns in a row (horizontal) database. (iii) Experimental results on synthetic and real-world databases demonstrate that our algorithm is not only memory and runtime efficient but also highly scalable. We also show that 3P-ECLAT outperforms the state-of-the-art algorithm even when finding partial periodic patterns in a row database. (iv) Finally, we describe the usefulness of our algorithm with a case study on air pollution data.

The rest of the paper is organized as follows. Section 2 reviews the work related to our method. Section 3 introduces the partial periodic pattern model. Section 4 presents the proposed algorithm. Section 5 shows the experimental results. Section 6 concludes the paper with future research directions.

2 Related Work

Agrawal et al. [1] introduced the concept of frequent pattern mining to extract useful information from transactional databases. It has been used in many domains, and several other algorithms have been developed. Luna et al. [5] conducted a detailed survey on frequent pattern mining and presented the improvements made over the past 25 years. However, frequent pattern mining is inappropriate for identifying patterns that appear regularly in a database.

Tanbeer et al. [7] introduced the periodic-frequent pattern model to discover temporal regularities in a database. Amphawan et al. [2] extended this model to find the top-k periodic-frequent patterns in a database. Amphawan et al. [3] also discussed a novel measure, named approximate periodicity, to discover periodic-frequent patterns in a transactional database. Nofong et al. [6] proposed a novel two-stage approach to discover periodic-frequent patterns efficiently. The widespread adoption and industrial application of this model have been hindered by the following limitation: "Since the objective of the maxPer constraint is to find all patterns whose maximum inter-arrival time is no more than maxPer, the model discovers only those patterns that have exhibited


Table 1. Row database

ts | items    ts | items
 1 | ace       7 | bcd
 2 | bc        8 | bf
 3 | bdef      9 | abcd
 4 | abef     10 | cd
 5 | acdf     11 | abcd
 6 | abcd     12 | abcd

Table 2. Columnar database

ts | a b c d e f    ts | a b c d e f
 1 | 1 0 1 0 1 0     7 | 0 1 1 1 0 0
 2 | 0 1 1 0 0 0     8 | 0 1 0 0 0 1
 3 | 0 1 0 1 1 1     9 | 1 1 1 1 0 0
 4 | 1 1 0 0 1 1    10 | 0 0 1 1 0 0
 5 | 1 0 1 1 0 1    11 | 1 1 1 1 0 0
 6 | 1 1 1 1 0 0    12 | 1 1 1 1 0 0

Table 3. List of timestamps (TS-list) of each item

item | TS-list
a    | 1, 4, 5, 6, 9, 11, 12
b    | 2, 3, 4, 6, 7, 8, 9, 11, 12
c    | 1, 2, 5, 6, 7, 9, 10, 11, 12
d    | 3, 5, 6, 7, 9, 10, 11, 12
e    | 1, 3, 4
f    | 3, 4, 5, 8
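The transformation from the row database of Table 1 to the per-item TS-lists of Table 3 can be sketched in a few lines of Python (the function and variable names are ours; the paper's own implementation is not shown here):

```python
def to_ts_lists(row_db):
    """Transform a row temporal database [(ts, items), ...] into a
    columnar representation: item -> ordered list of timestamps."""
    ts_lists = {}
    for ts, items in row_db:
        for item in items:
            ts_lists.setdefault(item, []).append(ts)
    return ts_lists

# Row database of Table 1
row_db = [(1, 'ace'), (2, 'bc'), (3, 'bdef'), (4, 'abef'), (5, 'acdf'),
          (6, 'abcd'), (7, 'bcd'), (8, 'bf'), (9, 'abcd'), (10, 'cd'),
          (11, 'abcd'), (12, 'abcd')]
print(to_ts_lists(row_db)['e'])  # [1, 3, 4], matching Table 3
```

Because transactions arrive in timestamp order, each TS-list is built already sorted, which is what the later intersection steps rely on.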

full periodic behavior in the database. In other words, this model fails to discover all those interesting patterns that have exhibited partial periodic behavior in the database." When confronted with this problem in real-world applications, researchers have tried to find partially occurring periodic-frequent patterns using constraints such as periodic-ratio, standard deviation, and average period. Unfortunately, these extended models require too many input parameters and are not practicable on large databases, as the generated patterns do not satisfy the downward closure property.

Uday et al. [4] described a novel model to discover partial periodic patterns in a temporal database. Unlike the studies mentioned above, this model is easy to use, as it requires only two input parameters, and the generated patterns satisfy the downward closure property. A pattern-growth algorithm, called 3P-growth, was also described to find the desired patterns in a temporal database. Unfortunately, this algorithm can find partial periodic patterns only in row databases. In this context, this paper aims to advance the state of the art by proposing an algorithm to find partial periodic patterns in a columnar database. Overall, the proposed algorithm is novel and distinct from existing studies.

3 The Model of Partial Periodic Pattern

Let I be the set of items. Let X ⊆ I be a pattern (or an itemset). A pattern containing β, β ≥ 1, items is called a β-pattern. A transaction t_k = (ts, Y) is a tuple, where ts ∈ R+ represents the timestamp at which the pattern Y occurred. A temporal database TDB over I is a set of transactions, i.e., TDB = {t_1, ..., t_m}, m = |TDB|, where |TDB| is the number of transactions in TDB. For a transaction t_k = (ts, Y), k ≥ 1, such that X ⊆ Y, it is said that X occurs in t_k (or t_k contains X), and such a timestamp is denoted as ts^X. Let TS^X = {ts^X_j, ..., ts^X_k}, j, k ∈ [1, m] and j ≤ k, be the ordered set of timestamps at which X has occurred in TDB. The number of transactions containing X in TDB is defined as the support of X and denoted as sup(X). That is, sup(X) = |TS^X|.


Example 1. Let I = {a, b, c, d, e, f} be the set of items. A hypothetical row temporal database generated from I is shown in Table 1. Without loss of generality, this row temporal database can be represented as the (binary) columnar temporal database shown in Table 2. The temporal occurrences of each item in the entire database are shown in Table 3. The set of items 'b' and 'c', i.e., {b, c}, is a pattern. For brevity, we represent this pattern as 'bc'. This pattern contains two items; therefore, it is a 2-pattern. The temporal database contains 12 transactions; therefore, m = 12. The minimum and maximum timestamps in this database are 1 and 12, respectively; therefore, ts_min = 1 and ts_max = 12. The pattern 'bc' appears at the timestamps 2, 6, 7, 9, 11, and 12. Therefore, the list of timestamps containing 'bc' is TS^bc = {2, 6, 7, 9, 11, 12}. The support of 'bc' is sup(bc) = |TS^bc| = 6.

Definition 1 (Periodic appearance of pattern X). Let ts^X_j, ts^X_k ∈ TS^X, 1 ≤ j < k ≤ m, denote any two consecutive timestamps in TS^X. An inter-arrival time of X is denoted as iat^X = (ts^X_k − ts^X_j). Let IAT^X = {iat^X_1, iat^X_2, ..., iat^X_k}, k = sup(X) − 1, be the list of all inter-arrival times of X in TDB. An inter-arrival time of X is said to be periodic (or interesting) if it is no more than the user-specified period (per). That is, iat^X_i ∈ IAT^X is said to be periodic if iat^X_i ≤ per.

Example 2. The pattern 'bc' first appeared at the timestamps 2 and 6. The difference between these two timestamps gives an inter-arrival time of 'bc', that is, iat^bc_1 = 4 (= 6 − 2). Similarly, the other inter-arrival times of 'bc' are iat^bc_2 = 1 (= 7 − 6), iat^bc_3 = 2 (= 9 − 7), iat^bc_4 = 2 (= 11 − 9), and iat^bc_5 = 1 (= 12 − 11). Therefore, the resultant IAT^bc = {4, 1, 2, 2, 1}. If the user-specified per = 2, then iat^bc_2, iat^bc_3, iat^bc_4, and iat^bc_5 are considered periodic occurrences of 'bc' in the data. In contrast, iat^bc_1 is not considered a periodic occurrence of 'bc' because iat^bc_1 > per.

Definition 2 (Period-support of pattern X). Let ÎAT^X ⊆ IAT^X be the set of all inter-arrival times in IAT^X that are no more than per. That is, if ∃ iat^X_k ∈ IAT^X : iat^X_k ≤ per, then iat^X_k ∈ ÎAT^X. The period-support of X is denoted as PS(X) = |ÎAT^X|.

Example 3. Continuing with the previous example, ÎAT^bc = {1, 2, 2, 1}. Therefore, the period-support of 'bc' is PS(bc) = |ÎAT^bc| = |{1, 2, 2, 1}| = 4.

Definition 3 (Partial periodic pattern X). A pattern X is said to be a partial periodic pattern if PS(X) ≥ minPS, where minPS is the user-specified minimum period-support.

Example 4. Continuing with the previous example, if the user-specified minPS = 4, then 'bc' is a partial periodic pattern because PS(bc) ≥ minPS. The complete set of partial periodic patterns discovered from Table 3, including the 1-patterns (see Fig. 2), is shown in Fig. 3 (the patterns without a strikethrough mark).
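The period-support computation of Definitions 1 and 2 amounts to counting inter-arrival times no larger than per. A minimal sketch reproducing Example 3 (function name is ours):

```python
def period_support(ts_list, per):
    """PS(X): number of inter-arrival times of X that are <= per."""
    return sum(1 for prev, cur in zip(ts_list, ts_list[1:])
               if cur - prev <= per)

ts_bc = [2, 6, 7, 9, 11, 12]          # TS^bc from Example 1
print(period_support(ts_bc, per=2))   # 4, so bc is partial periodic for minPS = 4
```

The first gap (6 − 2 = 4) exceeds per = 2 and is excluded, exactly as in Example 2.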


Fig. 2. Finding partial periodic patterns: (a) after scanning the first transaction, (b) after scanning the second transaction, (c) after scanning the entire database, and (d) the final list of partial periodic patterns sorted in descending order of their PS (i.e., the size of their TS-lists), with minPS = 4 and per = 2

Definition 4 (Problem definition). Given a temporal database (TDB) and the user-specified period (per) and minimum period-support (minPS) constraints, find all partial periodic patterns in TDB that have period-support no less than minPS. The period-support of a pattern can be expressed as a percentage of (|TDB| − 1), and per can be expressed as a percentage of (ts_max − ts_min). In this paper, we employ the above absolute definitions of period and period-support for brevity.

4 Proposed Algorithm

This section first describes the procedure for finding one-length partial periodic patterns (or 1-patterns) and transforming a row database into a columnar database. Next, we explain the 3P-ECLAT algorithm, which discovers the complete set of partial periodic patterns in columnar temporal databases. The 3P-ECLAT algorithm employs Depth-First Search (DFS) and the downward closure property (see Property 1) of partial periodic patterns to effectively reduce the vast search space.

Property 1 (The downward closure property [7]). If Y is a partial periodic pattern, then ∀X ⊂ Y and X ≠ ∅, X is also a partial periodic pattern.

4.1 3P-ECLAT Algorithm

Finding One-Length Partial Periodic Patterns. Algorithm 1 describes the procedure to find 1-patterns using the 3P-list, which is a dictionary. We now describe this algorithm's working using the row database shown in Table 1. Let minPS = 4 and per = 2. We scan the complete database once to generate the 1-patterns and to transform the row database into a columnar database. The scan on the first transaction, "1 : ace", with ts_cur = 1 inserts the items a, c, and e into the 3P-list. The timestamps of these items are set to 1 (= ts_cur). Similarly, the PS and TS_l values of


Fig. 3. Mining partial periodic patterns using DFS

these items are also set to 0 and 1, respectively (lines 5 and 6 in Algorithm 1). The 3P-list generated after scanning the first transaction is shown in Fig. 2(a). The scan on the second transaction, "2 : bc", with ts_cur = 2 inserts the new item b into the 3P-list by adding 2 (= ts_cur) to its TS-list; simultaneously, its PS and TS_l values are set to 0 and 2, respectively. On the other hand, 2 (= ts_cur) is added to the TS-list of the already existing item c, with its PS and TS_l set to 1 and 2, respectively (lines 7 and 8 in Algorithm 1). The 3P-list generated after scanning the second transaction is shown in Fig. 2(b). A similar process is repeated for the remaining transactions in the database. The final 3P-list generated after scanning the entire database is shown in Fig. 2(c). The items e and f are pruned from the 3P-list (using Property 1; lines 10 and 11 in Algorithm 1) as their PS values are less than the user-specified minPS. The remaining items in the 3P-list are partial periodic patterns and are sorted in descending order of their PS values. The final 3P-list generated after sorting is shown in Fig. 2(d).

Finding Partial Periodic Patterns Using the 3P-list. Algorithm 2 describes the procedure for finding all partial periodic patterns in a database. We now describe the working of this algorithm using the newly generated 3P-list. We start with item b, the first pattern in the 3P-list (line 2 in Algorithm 2). We record its PS, as shown in Fig. 3(a). Since b is a partial periodic pattern, we move to its child node bc and generate its TS-list by intersecting the TS-lists of b and c, i.e., TS^bc = TS^b ∩ TS^c (lines 3 and 4 in


Algorithm 1. PartialPeriodicItems(TDB: temporal database, minPS: period-support, per: period)
1: Let 3P-list = (X, TS-list(X)) be a dictionary that records the temporal occurrence information of each pattern in TDB. Let TS_l be a temporary list that records the timestamp of the last occurrence of each item, and let PS be a temporary list that records the periodic-support of each item. Let ts_cur denote the timestamp of the current transaction.
2: for each transaction t ∈ TDB and each item i ∈ t do
3:   if ts_cur is i's first occurrence then
4:     Insert i and its timestamp into the 3P-list.
5:     Set TS_l[i] = ts_cur and PS[i] = 0.
6:   else
7:     Add i's timestamp to the 3P-list.
8:     if (ts_cur − TS_l[i]) ≤ per then
9:       Set PS[i] = PS[i] + 1.
10:    Set TS_l[i] = ts_cur.
11: for each item i in the 3P-list do
12:   if PS[i] < minPS then
13:     Remove i from the 3P-list.
14: Consider the remaining items in the 3P-list as partial periodic items. Sort these items in descending order of their periodic-support. Let L denote this sorted list of partial periodic items.
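A minimal Python rendering of Algorithm 1 (our own sketch, not the authors' released code; names are ours) reproduces the sorted 3P-list of Fig. 2(d) on the database of Table 1:

```python
def partial_periodic_items(tdb, min_ps, per):
    """Single database scan: build the 3P-list (item -> TS-list) while
    maintaining each item's period-support, then prune and sort (Algorithm 1)."""
    ts_list, last_ts, ps = {}, {}, {}
    for ts_cur, items in tdb:
        for i in items:
            if i not in ts_list:                 # first occurrence of i
                ts_list[i] = [ts_cur]
                last_ts[i], ps[i] = ts_cur, 0
            else:
                ts_list[i].append(ts_cur)
                if ts_cur - last_ts[i] <= per:   # periodic inter-arrival time
                    ps[i] += 1
                last_ts[i] = ts_cur
    survivors = [i for i in ts_list if ps[i] >= min_ps]
    survivors.sort(key=lambda i: ps[i], reverse=True)  # descending PS order
    return {i: ts_list[i] for i in survivors}, ps

tdb = [(1, 'ace'), (2, 'bc'), (3, 'bdef'), (4, 'abef'), (5, 'acdf'),
       (6, 'abcd'), (7, 'bcd'), (8, 'bf'), (9, 'abcd'), (10, 'cd'),
       (11, 'abcd'), (12, 'abcd')]
three_p_list, ps = partial_periodic_items(tdb, min_ps=4, per=2)
print(list(three_p_list))  # ['b', 'c', 'd', 'a'] -- e and f are pruned
```

With minPS = 4 and per = 2 this yields PS values b:8, c:7, d:7, a:4, while e and f (PS = 2 each) are pruned, matching the walkthrough above.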

Algorithm 2). We record the PS of bc, as shown in Fig. 3(b), and verify whether bc is a partial periodic or an uninteresting pattern (line 5 in Algorithm 2). Since bc is a partial periodic pattern, we move to its child node bcd and generate its TS-list by intersecting the TS-lists of bc and d, i.e., TS^bcd = TS^bc ∩ TS^d. We record the PS of bcd, as shown in Fig. 3(c), and identify it as a partial periodic pattern. We once again move to its child node bcda and generate its TS-list by intersecting the TS-lists of bcd and a, i.e., TS^bcda = TS^bcd ∩ TS^a. As the PS of bcda is less than the user-specified minPS, we prune the pattern bcda from the partial periodic pattern list, as shown in Fig. 3(d). A similar process is repeated for the remaining nodes in the set-enumeration tree to find all partial periodic patterns. The final list of partial periodic patterns generated from Table 1 is shown in Fig. 3(e). This approach of finding partial periodic patterns using the downward closure property is efficient because it effectively reduces the search space and the computational cost.

5 Experimental Results

In this section, we first compare 3P-ECLAT against the state-of-the-art algorithm 3P-growth [4] and show that our algorithm is not only memory and runtime efficient but also highly scalable. Next, we describe the usefulness of our algorithm with a case study on air pollution data; please note that 3P-growth ran out of memory on this database. The 3P-growth and 3P-ECLAT algorithms were developed in Python 3.7 and executed on an Intel i5 2.6 GHz,


Table 4. Statistics of the databases

S.No | Database   | Type      | Nature | Transaction length (min/avg/max) | Total transactions
1    | Kosarak    | Real      | Sparse | 2 / 9 / 2499                     | 990000
2    | T20I6d100k | Synthetic | Sparse | 1 / 20 / 47                      | 199844
3    | Congestion | Real      | Sparse | 1 / 58 / 337                     | 17856
4    | Pollution  | Real      | Dense  | 11 / 460 / 971                   | 1438

Algorithm 2. 3P-ECLAT(3P-list)
1: for each item i in the 3P-list do
2:   Set pi = ∅ and X = i;
3:   for each item j that comes after i in the 3P-list do
4:     Set Y = X ∪ j and TS^Y = TS^X ∩ TS^j;
5:     Calculate the period-support of Y;
6:     if period-support(Y) ≥ minPS then
7:       Add Y to pi; Y is considered partial periodic;
8:       Store the period-support of the partial periodic pattern Y;
9:   3P-ECLAT(pi)
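Algorithm 2 can likewise be sketched as a recursive DFS in which a child's TS-list is the intersection of its parents' TS-lists, and low-PS candidates are pruned via the downward closure property. The code below is our illustrative rendering (names are ours, not the authors' released code) and reproduces the walkthrough of Fig. 3: bcd is kept while bcda is pruned.

```python
def three_p_eclat(three_p_list, min_ps, per):
    """DFS over the sorted 3P-list; child TS-lists come from intersecting
    parent TS-lists, and PS(Y) < minPS prunes the whole subtree."""
    def ps_of(ts):
        return sum(1 for p, c in zip(ts, ts[1:]) if c - p <= per)

    patterns = {}
    items = list(three_p_list)

    def dfs(prefix, prefix_ts, start):
        for idx in range(start, len(items)):
            j = items[idx]
            cand = prefix + (j,)
            cand_ts = sorted(set(prefix_ts) & set(three_p_list[j]))
            if ps_of(cand_ts) >= min_ps:        # extend only periodic candidates
                patterns[cand] = ps_of(cand_ts)
                dfs(cand, cand_ts, idx + 1)

    for idx, i in enumerate(items):
        ts = three_p_list[i]
        if ps_of(ts) >= min_ps:
            patterns[(i,)] = ps_of(ts)
            dfs((i,), ts, idx + 1)
    return patterns

# 3P-list of Fig. 2(d): items sorted in descending PS order
three_p_list = {'b': [2, 3, 4, 6, 7, 8, 9, 11, 12],
                'c': [1, 2, 5, 6, 7, 9, 10, 11, 12],
                'd': [3, 5, 6, 7, 9, 10, 11, 12],
                'a': [1, 4, 5, 6, 9, 11, 12]}
result = three_p_eclat(three_p_list, min_ps=4, per=2)
print(result[('b', 'c', 'd')])  # 4, while ('b', 'c', 'd', 'a') is pruned
```

Here TS^bcd = {6, 7, 9, 11, 12} gives PS = 4 ≥ minPS, whereas TS^bcda = {6, 9, 11, 12} gives PS = 2 < minPS, so the bcda branch is never explored further.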

8GB RAM machine running the Ubuntu 18.04 operating system. The experiments were conducted using synthetic (T20I6d100K) and real-world (Congestion and Pollution) databases. The statistics of these databases are shown in Table 4. The complete evaluation results, databases, and algorithms have been made available on GitHub¹ to support the repeatability of our experiments. We do not provide the Congestion database on GitHub for confidentiality reasons.

5.1 Evaluation of Algorithms by Varying minPS

In this experiment, we evaluate the performance of the 3P-growth and 3P-ECLAT algorithms by varying only the minPS constraint in each database; the Per value in each database is fixed to a particular value. The minPS in the T20I6d100K, Congestion, and Pollution databases has been set at 60%, 50%, and 50%, respectively. Figure 4 shows the number of partial periodic patterns generated in the T20I6d100K, Congestion, and Pollution databases at different minPS values. It can be observed that an increase in minPS has a negative effect on the generation of partial periodic patterns, because many patterns fail to satisfy the increased minPS.

Figure 5 shows the runtime requirements of the 3P-growth and 3P-ECLAT algorithms in the T20I6d100K, Congestion, and Pollution databases at different minPS values. It can be observed that even though the runtime requirements of both

¹ https://github.com/udayRage/pykit_old/tree/master/traditional/3peclat


Fig. 4. Patterns evaluation of 3P-growth and 3P-ECLAT algorithms at constant Per

Fig. 5. Runtime evaluation of 3P-growth and 3P-ECLAT algorithms at constant Per

the algorithms decrease with the increase in minPS, the 3P-ECLAT algorithm completed the mining process much faster than the 3P-growth algorithm in both sparse and dense databases at any given minPS. More importantly, the 3P-ECLAT algorithm was several times faster than the 3P-growth algorithm, especially at low minPS values.

Figure 6 shows the memory requirements of the 3P-growth and 3P-ECLAT algorithms in the T20I6d100K, Congestion, and Pollution databases at different minPS values. It can be observed that although an increase in minPS decreases the memory requirements of both algorithms, the 3P-ECLAT algorithm consumed considerably less memory in all databases at all minPS values. More importantly, 3P-growth consumed a huge amount of memory, especially at low minPS values in all of the databases, and ran out of memory in the Pollution database.

5.2 Evaluation of Algorithms by Varying Per

The first graph in Fig. 7 shows the number of partial periodic patterns generated in the Congestion database at different Per values. It can be observed that an increase in Per increases the number of partial periodic patterns for both algorithms.


Fig. 6. Memory evaluation of 3P-growth and 3P-ECLAT algorithms at constant Per

Fig. 7. Evaluation of 3P-growth and 3P-ECLAT algorithms using Congestion database

The second graph in Fig. 7 shows the runtime requirements of the 3P-growth and 3P-ECLAT algorithms in the Congestion database at different Per values. It can be observed that although the runtime requirements of both algorithms increase with Per, the 3P-ECLAT algorithm consumes less runtime than the 3P-growth algorithm.

The third graph in Fig. 7 shows the memory requirements of the two algorithms in the Congestion database at different Per values. Although the memory requirements of both algorithms increase with Per, the 3P-ECLAT algorithm consumes considerably less memory than the 3P-growth algorithm. The minPS was set at 23% during this evaluation. Similar results were obtained on the remaining databases; however, we confine this experiment to the Congestion database due to page limitations.

5.3 Scalability Test

The Kosarak database was divided into five portions of 0.2 million transactions each in order to compare the performance of 3P-ECLAT against 3P-growth. We investigated the performance of the 3P-growth and 3P-ECLAT algorithms after accumulating each portion with the previous ones. Figure 8 shows the runtime


Fig. 8. Scalability of 3P-growth and 3P-ECLAT

and memory requirements of both algorithms at different database sizes (in increasing order) when minPS = 1% and Per = 1%. Two observations can be drawn from these figures: (i) the runtime and memory requirements of the 3P-growth and 3P-ECLAT algorithms increase almost linearly with the database size; (ii) at any given database size, 3P-ECLAT consumes less runtime and memory than the 3P-growth algorithm.

5.4 A Case Study: Finding Areas Where People Have Been Regularly Exposed to Hazardous Levels of PM2.5 Pollutant

The Ministry of the Environment, Japan, has set up a sensor network system, called SORAMAME, to monitor air pollution throughout Japan (see Fig. 9(a)). The raw data produced by these sensors, i.e., a quantitative columnar database (see Fig. 9(b)), is transformed into a binary columnar database by marking an entry as 1 if the raw value is ≥ 15 (see Fig. 9(c)). The transformed data is provided to the 3P-ECLAT algorithm (see Fig. 9(d)) to identify all sets of sensor identifiers at which pollution levels are periodically high (see Fig. 9(e)). The spatial locations of the interesting patterns generated from the Pollution database are visualized in Fig. 9(f). It can be observed that most of the sensors in this figure are situated in the southeast of Japan. Thus, it can be inferred that people working or living in the southeastern parts of Japan were periodically exposed to high levels of PM2.5. Such information may be useful to ecologists in devising policies to control pollution and improve public health. Please note that more in-depth studies, such as finding highly polluted areas on weekends or in particular time intervals of a day, can also be carried out efficiently with our algorithm.
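The thresholding step of Fig. 9(b)-(c) can be sketched as follows. The sensor identifiers and readings are hypothetical; only the ≥ 15 cutoff comes from the description above:

```python
def to_binary_transaction(readings, threshold=15):
    """Turn one timestamped row of raw sensor values into the set of
    sensor ids whose PM2.5 reading meets the hazard threshold."""
    return {sid for sid, value in readings.items() if value >= threshold}

row = {'s1': 9.2, 's2': 21.7, 's3': 15.0, 's4': 3.8}   # hypothetical readings
print(sorted(to_binary_transaction(row)))              # ['s2', 's3']
```

Applying this row by row yields exactly the kind of binary columnar temporal database that 3P-ECLAT consumes, with sensor identifiers playing the role of items.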


Fig. 9. Finding partial periodic patterns in Pollution data. The terms s_1, s_2, ..., s_n represent sensor identifiers and PS represents periodic-support

6 Conclusions and Future Work

This paper has proposed an efficient algorithm, named Partial Periodic Pattern-Equivalence Class Transformation (3P-ECLAT), to find partial periodic patterns in columnar temporal databases. The performance of 3P-ECLAT was verified by comparing it with the 3P-growth algorithm on different real-world and synthetic databases. Experimental analysis shows that 3P-ECLAT exhibits high performance in partial periodic pattern mining and can obtain all partial periodic patterns faster and with less memory than the state-of-the-art algorithm. We have also presented a case study to illustrate the usefulness of the generated patterns in a real-world application. As part of future work, we would like to investigate parallel algorithms to find periodic and fuzzy partial periodic patterns in very large temporal databases. We will also try to extend our model to distributed environments and develop novel pruning techniques to further reduce the computational cost.

References

1. Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. In: SIGMOD, pp. 207–216 (1993)
2. Amphawan, K., Lenca, P., Surarerks, A.: Mining top-k periodic-frequent patterns from transactional databases without support threshold. In: Papasratorn, B., Chutimaskul, W., Porkaew, K., Vanijja, V. (eds.) IAIT 2009. CCIS, vol. 55, pp. 18–29. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-10392-6_3


3. Amphawan, K., Surarerks, A., Lenca, P.: Mining periodic-frequent itemsets with approximate periodicity using interval transaction-ids list tree. In: 2010 Third International Conference on Knowledge Discovery and Data Mining, pp. 245–248 (2010)
4. Kiran, R.U., Shang, H., Toyoda, M., Kitsuregawa, M.: Discovering partial periodic itemsets in temporal databases. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, SSDBM '17 (2017)
5. Luna, J.M., Fournier-Viger, P., Ventura, S.: Frequent itemset mining: a 25 years review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 9(6), e1329 (2019)
6. Nofong, V.M., Wondoh, J.: Towards fast and memory efficient discovery of periodic frequent patterns. J. Inf. Telecommun. 3(4), 480–493 (2019)
7. Tanbeer, S.K., Ahmed, C.F., Jeong, B.-S., Lee, Y.-K.: Discovering periodic-frequent patterns in transactional databases. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS (LNAI), vol. 5476, pp. 242–253. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01307-2_24
8. Zaki, M.J.: Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng. 12(3), 372–390 (2000)

Avoiding Time Series Prediction Disbelief with Ensemble Classifiers in Multi-class Problem Spaces

Maciej Huk

Faculty of Information and Communication Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
[email protected]

Abstract. Time series data is everywhere: it comes, e.g., from IoT devices, financial transactions, and medical and scientific observations. Time series analysis provides powerful tools and methodologies for modeling many kinds of related processes. Predictions based on such models are often of great value for many applications. But even the most accurate prediction will be useless if potential users do not want to accept and further use it. The article presents the problem of prediction disbelief and its relation to acceptance tests of predictions during the lifecycle of time series analysis. The main contribution of the paper is the classification and modeling of possible types of organization of acceptance tests of the outcomes of forecasting tools. This is done in the form of ensembles of classifiers working contextually in multi-class problem spaces, which allows one to formulate, analyze, and select the best methods of avoiding the influence of the prediction disbelief problem during the time series analysis lifecycle.

Keywords: Time series · Ensemble of classifiers · Selective exposure · Context

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 155–166, 2022. https://doi.org/10.1007/978-3-031-21967-2_13

1 Introduction

Many classes of physical and artificial phenomena can be described by time series. As time series data can be used to characterize, model, and predict the behavior of such processes, time series analysis very often helps to provide precise and valuable information for many important engineering, business, medical, and scientific purposes [1]. This is why time series data are processed to improve satellite imaging by advanced filtering and real-time sensor calibration [2], and to predict solar activity [3–5] and changes within Earth's environment [6–8]. They are also used in medicine for patient monitoring [9] and breast cancer classification [10]. There are also numerous examples of financial time series data exploration for stock and resource price forecasting [11–13], corporate decision support [14], as well as price manipulation and fraud detection [15]. One can also observe many time series analysis applications in engineering, such as car fault and traffic prediction [16, 17], mining subsidence prediction [18], and power load forecasting for efficient power station control [19]. Wireless sensor networks, web, IoT,

156

M. Huk

and mobile phone networks are likewise improved by using time series data mining for wireless sensor energy-efficiency enhancement [20], top mobile games prediction [21], cyber-security anomaly detection [22] and cognitive radio communication control [23]. The numerous examples of time series analysis applications listed above indicate that time series usage provides powerful tools and methodologies for modeling data streams as well as for process forecasting and control. This suite includes a wide and diverse set of methods. There are statistical techniques such as sequential Monte Carlo [24], the Fourier transform [25], Holt–Winters, ARIMA and higher-order statistics (e.g. skewness and kurtosis) [26]. Researchers also use a broad family of heuristic and machine learning methods such as evolutionary algorithms and genetic programming [27], artificial neural networks and support vector machines [28, 29], fuzzy algorithms [30], ensembles of classifiers [31] and their various hybrids [32, 33]. The statistical and heuristic models mentioned above are designed to deal with various problems connected with time series analysis. The most important are irregular sampling in the temporal dimension, high data rates and a flood of irrelevant attributes, irregular data availability, and the need to use multiple time series with different numbers of samples. This underpins the high effectiveness and usability of existing solutions and is the source of their successful use in many complex applications. But there are more aspects of time series data usage to be considered than the problems addressed by the statistical and heuristic methods mentioned above. A time series researcher must deal with a range of issues arising during the whole time series analysis life-cycle – including the acceptance tests of the prepared models and predictions by their final users (e.g. business staff or engineers).
Improper organization of such tests can increase the influence of human-related errors stemming from prediction disbelief and the selective exposure phenomenon [34]. The remaining parts of this paper are organized as follows. In the next section, preliminary knowledge about the time series analysis life-cycle is presented. In Sect. 3, the problem of prediction disbelief is discussed and a contextual model of competing ensembles is used to propose a feasible solution. Sect. 4 discusses the proposed architectures, and Sect. 5 concludes the paper with the main results.

2 Time Series Analysis Life-Cycle

Even if time series analysis can be used for various tasks such as series comparison and audio noise level reduction, from the general computer science perspective a time series model is an algorithm that allows one to compute future values of selected time-related variables correlated with a given physical or artificial process with finite but acceptable precision. Building and using the model involves its implementation, regardless of its notation in some programming or modeling language. Thus one can see that modeling a time series and creating a prediction is a special case of building and deploying a software system, and as such it can be described in terms of the software engineering life-cycle [35, 36]. The time series analysis life-cycle consists of four main elements:

1. Problem analysis and model design,
2. Model implementation (observable data collection, model development, training and testing),
3. Prediction generation (using the model for extrapolation),
4. Prediction acceptance, usage and verification [37].

After the verification the

Avoiding Time Series Prediction Disbelief with Ensemble Classifiers

157

problem analysis can be started again if needed. Leaving aside the question whether one should understand the elements of the time series analysis life-cycle as blocks of the waterfall or the spiral model, it is worth noting that each of those steps is connected with a set of characteristic properties and time series related issues. During the problem analysis and model design phase, questions must be answered about data sources and feature selection, as well as eventual sensor placement and calibration, data sampling parameter settings and dealing with the energy-accuracy trade-off in wireless sensor networks. The data processing and confidence estimation methods must also be properly selected, according to the expected data flow and volume, its semantics, completeness, the goals of the processing, and the expected accuracy and confidence of the outcome predictions [38, 39]. In general, building a time series analysis architecture is a complex and experience-demanding task, and there is no single recipe for making the needed choices. The second phase – model implementation – includes collecting the time series data and using it to select the values of the model's parameters (training) as well as for further testing. Here questions arise about relevant feature and training data selection, as well as about which metric to use for model testing and accuracy estimation. Again, there is no best recipe for training and testing. For example, one can select among more than fifty measures of the quality of classification models (e.g. Macro Averaged Accuracy, F1 score or Modified Confusion Entropy) [40, 41]. In turn, the third phase integrates the trained model and the selected input data into the prediction. It is interesting that, unlike software integration, prediction generation does not look complicated or costly, as the trained model and its input data fit together by design at the syntax level.
This can be misleading, as only an accurate prediction can be viewed as the proper output of the integration phase. Thus the integration will fail if the used input data lie outside the region for which the model generates accurate predictions. In effect, the true goal of the integration is obtaining predictions with an associated confidence. The confidence level can indicate the certainty or degree of reliability that may be associated with the prediction. This can be crucial where model outputs are used in making potentially costly or health-related decisions. And the last phase is even more interesting. If the prediction is precise enough, its usage can lead to the intended goals – e.g. earnings after a successful investment, loss minimization after an early warning about an earthquake, or precise measurements after proper experiment planning. But this can happen only when the prediction is used to control the given process, and unfortunately this is not always the case. Just as in the software product life-cycle the product release should be preceded by user acceptance tests, in the case of time series analysis the prerequisite for using the forecasting model and its prediction is their prior acceptance by the stakeholders and/or final users. Thus it is of great importance how the acceptance tests of the forecasting model are organized, because a negative decision at this stage can block usage of the predictions independently of their accuracy, certainty and value, with all possible consequences. The reasons for this phenomenon should therefore be analyzed, and preventive measures defined.
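To make two of the points above concrete – picking a quality measure for model testing and attaching a confidence level to a prediction – here is a minimal pure-Python sketch. The function names, the use of the maximum class probability as a confidence proxy, and the threshold value are illustrative assumptions, not taken from the paper:

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight,
    so rare classes count as much as frequent ones."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def gated_prediction(probs, threshold=0.8):
    """Return (class, confidence) only when the confidence clears the
    threshold; otherwise abstain (None), deferring a costly decision."""
    conf = max(probs)
    return (probs.index(conf), conf) if conf >= threshold else (None, conf)
```

Such a gate is one simple way a downstream system can refuse to act on low-confidence forecasts in costly or health-related settings.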


3 Prediction Disbelief in Acceptance Tests of Forecasting Models

The situation presented in the previous section – where time series analysis is used for the generation of a prediction together with its certainty estimation, and despite high estimated accuracy and certainty the potential user does not accept the result – will further be called prediction disbelief. As suggested above, the problem of prediction disbelief is strongly connected with the selective exposure phenomenon [34] and with the organization of the user acceptance tests of prediction models performed in the last phase of the time series analysis life-cycle. The identified kinds of organization of the related acceptance tests, and the main properties of the accompanying processes, can be modeled and visualized using the paradigm of a non-weighted ensemble of classifiers. Within this method, the forecasting model is represented as a multi-class classifier that outputs predictions and their estimated confidence. This is possible because any real-valued output can be projected onto a finite number of intervals mapped as classes. Also, each stakeholder or user taking part in the acceptance tests is represented by an analogous black-box classifier (in reality, a complicated neural network classifier). Such a user-related classifier has an additional input representing the personal reasoning bias resulting from the human susceptibility to the selective exposure phenomenon – preferring a result that was set before the reasoning started [38]. Such a hybrid set of base classifiers can be used to form different architectures of classifier ensembles. During this analysis, five general ensemble architectures were identified that can represent organizations of the user acceptance tests of forecasting models. All the related solutions are not only theoretical but were observed by the author as used in practice by data-driven and AI-related companies. The considered types of architectures of the acceptance tests of forecasting models are: 1.
Oracle – a degenerate ensemble that includes only the forecasting model (M). This architecture represents situations when users assume that the predictions of the model M are valid without any additional analysis of its output, (Fig. 1a)
2. Court – an ensemble of binary user classifiers (Uc) that produce only accept/reject answers in response to the analyzed data and the outcomes of the forecasting model (M). In this model, the personal bias can cause a situation in which the predictions of the forecasting model have little or no influence on the acceptance of its prediction. Such an organization can lead to a high level of human-related bias, because it is concentrated on judging the model M even without test forecasts being made by the Uc, (Fig. 1b)
3. Competing ensembles – a two-level heterogeneous ensemble in which two inner ensembles act as the base classifiers of the top-level ensemble. One of the inner ensembles includes only the user classifiers Uc, which produce their own test forecasts for the given input data. Their predictions are then fused, e.g. by voting, giving a user-related prediction. The second inner model is built from the predictor M, which in the general case can also be an ensemble of classifiers. Finally, the outcomes of both inner ensembles are compared (not fused) in the top-level ensemble, which produces an accept/reject decision. The user-based ensemble is contextually treated as the reference model, and differences between the reference prediction and the outcomes of M result in a lack of acceptance of the forecasting model M, (Fig. 1c)

Fig. 1. Basic ensemble architectures of acceptance tests of the forecasting model M done by human users Uc1, …, UcN. Types: Oracle (a), Court (b), Competing ensembles (c).
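The accept/reject logic of the Court and Competing ensembles architectures can be sketched as follows. This is a minimal illustration; the function names and the majority-vote fusion are assumptions, not the paper's implementation:

```python
from collections import Counter

def court(user_votes):
    """Court (Fig. 1b): users cast binary accept (1) / reject (0) votes
    on the model's prediction; a simple majority decides acceptance.
    The model's own output need not influence the votes at all."""
    return sum(user_votes) > len(user_votes) / 2

def competing_ensembles(user_preds, model_pred):
    """Competing ensembles (Fig. 1c): the users' own predictions are
    fused by majority vote into a reference prediction; the model M is
    accepted only when its prediction matches that reference."""
    reference = Counter(user_preds).most_common(1)[0][0]
    return model_pred == reference
```

Note how, in both sketches, a perfectly accurate `model_pred` can still be rejected – which is exactly the prediction disbelief scenario discussed above.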

4. Biased cooperation – a single-level ensemble that includes both the predictor classifier M and the user classifiers Uc as base classifiers. The base models create their own predictions, and fusing is used to build the target prediction and confidence from the outcomes of all base classifiers. The ensemble output is regarded as the best available prediction, and the model M is accepted/rejected depending on the confidence of the fused result. In such a solution, even if all the base classifiers are treated as equally important (the context of the nature of the classifiers is ignored) and the selective exposure problem is limited, the user-related bias can still be crucial. This is due to the high disproportion between the number of user classifiers Uc and the number of considered forecasting models M (typically one), (Fig. 2d)
5. Balanced cooperation – built similarly to the Biased cooperation model, but the number of user classifiers Uc is equal to or lower than the number of considered forecasting models M. Again, the ensemble output is regarded as the best available prediction, and the accept/reject decision for model M is based only on the confidence of the fused result. This decreases the influence of selective exposure by forcing the Ucs to formulate possible predictions, and it balances both types of base classifiers. (Fig. 2e)

Fig. 2. Improved ensemble architectures of acceptance tests of the forecasting model M (or a set of models M1, …, MK) done with human users (Uc). Types: Biased cooperation (d) and Balanced cooperation (e).


In the proposed ensemble architectures of acceptance tests of a forecasting model (except the Oracle and Court models), it is assumed that the base classifiers function in multi-class decision spaces. This includes both the human users (Uc) and the time series prediction models (M).
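The fusing step of the cooperation architectures can be sketched as follows, assuming probability-vector outputs averaged with equal weight and an arbitrary acceptance threshold (both are assumptions made for illustration only):

```python
def fuse(predictions):
    """Average per-class probability vectors from all base classifiers
    (user classifiers Uc and forecasting models M are treated alike)."""
    n, k = len(predictions), len(predictions[0])
    return [sum(p[i] for p in predictions) / n for i in range(k)]

def balanced_cooperation(user_preds, model_preds, accept_threshold=0.6):
    """Balanced cooperation (Fig. 2e): fuse user and model predictions
    with equal weight; the accept/reject decision for M depends only on
    the confidence of the fused result, not on a head-to-head vote."""
    fused = fuse(user_preds + model_preds)
    confidence = max(fused)
    predicted_class = fused.index(confidence)
    return predicted_class, confidence, confidence >= accept_threshold
```

Under this sketch, Biased cooperation differs only in the ratio of user classifiers to model classifiers passed in, not in the fusing logic itself.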

4 Discussion

The proposed architectures of acceptance tests of forecasting models suggest that organizations such as Court and Competing ensembles (Fig. 1) should be avoided. Especially the Court architecture – because the user's personal, contextual bias can cause a situation in which the outcomes of the judged model have little or no influence on the acceptance of its prediction. As a result, the prediction itself is in most cases compared only with prior user opinions (selective exposure). In turn, the Competing ensembles model forces users to group together and create their fused prediction in opposition to the prediction of the evaluated forecasting model. Such a situation can prompt stakeholders/users to act contextually – together against the forecasting model, as if it were their opponent. Because both predictions are not fused but compared in the top-level ensemble, the considered model is rejected whenever they differ. And such a binary decision process is not the best possible [42]. Contrary to the Court and Competing ensembles architectures, the Biased and Balanced cooperation organizations (Fig. 2) create situations in which users do not act against the evaluated prediction, but concentrate on analyzing the input data to create their own predictions and confidences. Additionally, as users are involved in the creation of the target prediction, they will more easily accept even unexpected results, so their personal bias can be decreased. In fact, both cooperation architectures reorganize the acceptance tests of forecasting models into the cooperative generation of a target prediction that is a mixture of the time series analysis outcome and the users' experience. If possible, the non-contextual Balanced cooperation organization should be used, and the automatic fusing mechanism should be run only after all Uc classifiers have generated their predictions.
The above analysis indicates that prediction disbelief during acceptance tests of a time series prediction can arise from at least two factors. The first one is the selective exposure phenomenon, whose importance can be limited by redirecting the users' attention from the prediction evaluation task to the prediction generation task. The second one is the improper organization of the acceptance tests, which can focus the users' attention on competition with the forecasting model rather than on analysis of its output. Here the context of the model's nature (human/non-human) can have a negative influence on the results. Thus the time series researcher should also properly plan the final phase of the time series analysis life-cycle. If this is not done early enough, the stakeholders/users of the time series analysis can form ensemble structures similar to the Court or Competing ensembles architectures, which can have a negative impact on the proper evaluation of the prediction and its usage. It is also worth noting that the proposed architectures use multi-class base classifiers [42, 43]. In effect, the basic properties of the considered complex architectures can be partially described and compared using known formulas for the probability PE of classification errors of ensembles. After extending the formula for PE for binary classifiers, given by Kuncheva [44], this can also be done for multi-class base classifiers [42].


Thus, under the assumption that L base classifiers make negatively correlated mutual errors with probability $P_S$ when they are using a K-dimensional decision space (K ≥ 2), the ensemble classification error probability $P_E$ is:

$$P_E = L! \sum_{k_1=0}^{L} \sum_{k_2=0}^{L} \cdots \sum_{k_K=0}^{L} \delta\!\left(\sum_{i=1}^{K} k_i,\ L\right) \left[ H_E + (1 - H_E)\left(1 - \frac{1}{1 + H_D}\right) \right] \prod_{i=1}^{K} \frac{P_i^{k_i}}{k_i!} \tag{1}$$

where $H_E$ and $H_D$ are helper functions

$$H_E = H\!\left(\sum_{i=2}^{K} H(k_i - k_1)\right) \quad \text{and} \quad H_D = \sum_{i=2}^{K} \delta(k_1, k_i), \tag{2}$$

and the probabilities of giving votes for each of the K classes sum to one:

$$\sum_{i=1}^{K} P_i = 1 \tag{3}$$

Function $H_E$ has value zero when the correct class gets the highest number of votes and one otherwise; function $H_D$ gives the number of ties with the correct class. They are based on the Heaviside and Kronecker delta functions $H(x)$ and $\delta(x, y)$, respectively:

$$H(x) = \begin{cases} 1 & : x > 0 \\ 0 & : x \le 0 \end{cases}, \qquad \delta(x, y) = \begin{cases} 1 & : x = y \\ 0 & : x \ne y \end{cases} \tag{4}$$

Using formula (1) and statistical simulations within the R-4.0.1 environment, it was verified that usage of the Biased and Balanced cooperation architectures can lower the requirements for base classifier accuracy and decrease the ensemble error probability (K > 2) in comparison to the Court and Competing ensembles solutions, which are binary in nature (K = 2, L ≥ 2 and K ≥ 2, L = 2, respectively). In Fig. 3 one can see that even a small ensemble built from 7 base classifiers with error probability PS = 0.5 will benefit from a 40% decrease of the error probability PE when the number of considered classes is changed from two to three. The benefit rises to 60% when the number of classes is changed from two to seven. And independently of the number of base classifiers, the two-class problem forces the ensemble to use base classifiers with PS < 0.5 to get PE < 0.5 (Fig. 4). Operating within a more-than-two-class problem space does not have such a limitation, and the ensemble can have PE < 0.5 even when the base classifiers have PS > 0.5. This can make the creation of such hybrid ensembles easier, because users will not have to be experts to jointly create a proper evaluation of the forecasting model.
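Formula (1) can be checked numerically. The sketch below enumerates vote counts over the multinomial distribution, assuming independent base-classifier votes with the error probability PS spread uniformly over the K − 1 incorrect classes – a simplifying assumption (the paper considers negatively correlated errors, so exact values differ), but the qualitative benefit of K > 2 is already visible:

```python
from itertools import product
from math import factorial, prod

def ensemble_error(L, K, ps):
    """Ensemble error probability P_E per formula (1): class 1 is the
    correct class; each of L voters errs with probability ps, uniformly
    over the K-1 wrong classes; ties are broken uniformly at random."""
    p = [1.0 - ps] + [ps / (K - 1)] * (K - 1)  # per-class vote probabilities P_i
    pe = 0.0
    for counts in product(range(L + 1), repeat=K):
        if sum(counts) != L:                                # delta(sum k_i, L)
            continue
        k1 = counts[0]
        he = 1 if any(ki > k1 for ki in counts[1:]) else 0  # H_E
        hd = sum(1 for ki in counts[1:] if ki == k1)        # H_D
        err = he + (1 - he) * (1 - 1.0 / (1 + hd))
        mult = factorial(L) * prod(pi ** ki / factorial(ki)
                                   for pi, ki in zip(p, counts))
        pe += err * mult
    return pe
```

For L = 3, K = 2, ps = 0.1 this reproduces the classic majority-vote error 3·0.1²·0.9 + 0.1³ = 0.028; for L = 7 and ps = 0.5 the error falls below 0.5 as soon as K > 2, in line with the qualitative behavior described above.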


Fig. 3. Results of simulation of error probability PE of Balanced Cooperation ensemble as a function of the error probability PS of seven base classifiers (L = 7), with negatively correlated mutual errors for different numbers of classes K.

Fig. 4. Error probability PE of Court ensemble as a function of the error probability PS of different numbers of base classifiers (L ≥ 1), with negatively correlated mutual errors for numbers of classes K = 2.

5 Conclusions

Time series analysis methods are of great value for many important areas of business, medicine, science and engineering. To achieve that, various statistical and heuristic tools have been developed that are designed to deal with the many issues arising during time series data collection and processing. This work highlights that time series analysis has its own characteristic life-cycle, and that it is important to properly plan also the phase of acceptance tests after prediction model generation. Against this background, the prediction disbelief phenomenon is described and discussed. It occurs when an accurate prediction with a high confidence level is rejected by its potential users. The main causes for that are the improper organization of the acceptance tests of the forecasting model, by focusing the users' attention on the prediction evaluation, and


by making its acceptance dependent on selective exposure. This is shown with five proposed general ensemble models of the organization of acceptance tests for forecasting models: Oracle, Court, Competing ensembles, Biased cooperation and Balanced cooperation. The last two of the proposed ensemble architectures – Biased and Balanced cooperation – are recommended for use in the time series analysis domain, as they help to avoid prediction disbelief. Those ensembles are constructed with the use of multi-class base classifiers, which decreases their complexity and error probability in comparison to ensembles using binary base classifiers. This also lowers the requirements on the experience of the users performing the acceptance tests, making their work easier and potentially more accurate. Finally, the presented problem is especially interesting because in many other cases context information helps to construct more precise solutions [45–51], while the context considered above – additional knowledge about the nature of a base classifier – can have a negative influence on the outcomes. Thus it can also be treated as a warning that the contextual nature of a system does not guarantee that the usage of given context information will be beneficial for the result of data processing.

References 1. Rahardja, D.: Statistical methodological review for time-series data. J. Stat. Manag. Syst. 23(8), 1445–1461 (2020) 2. Petitjean, F., Weber, J.: Efficient satellite image time series analysis under time warping. Geosc. Remote Sens. Lett. 11(6), 1143–1147 (2014) 3. Wonkook, K., Tao, H., Dongdong, W., Changyong, C., Shunlin, L.: Assessment of long-term sensor radiometric degradation using time series analysis. IEEE Trans. Geosci. Remote Sens. 52(5), 2960–2976 (2013) 4. Orfila, A., Ballester, J.L., Oliver, R., Alvarez, A., Tintoré, J.: Forecasting the solar cycle with genetic algorithms. Astron. Astrophys. 386, 313–318 (2002) 5. Mirmomeni, M., Lucas, C., Moshiri, B., Araabi, B.N.: Introducing adaptive neurofuzzy modeling with online learning method for prediction of time-varying solar and geomagnetic activity indices. Expert Sys. with App. 37(12), 8267–8277 (2010) 6. Lhermitte, S., Verbesselt, J., Verstraeten, W.W., Coppin, P.: A comparison of time series similarity measures for classification and change detection of ecosystem dynamics. Remote Sens. Environ. 115(12), 3129–3152 (2011) 7. Rivero, C.R., Pucheta, J., Laboret, S., Herrera, M., Sauchelli, V.: Time series forecasting using bayesian method: application to cumulative rainfall. IEEE Lat. Am. Trans. 11(1), 359–364 (2013) 8. Saulquin, B., Fablet, R., Mercier, G., Demarcq, H., Mangin, A., FantondAndon, O.H.: Multiscale event-based mining in geophysical time series: characterization and distribution of significant time-scales in the sea surface temperature anomalies relatively to ENSO periods from 1985 to 2009. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 7(8), 3543–3552 (2014) 9. Lehman, L.W.H., et al.: A physiological time series dynamics-based approach to patient monitoring and outcome prediction. IEEE J. Biomed. Health Inform. 19(3), 1068–1076 (2015) 10. Uniyal, N., et al.: Ultrasound RF Time Series for Classification of Breast Lesions. IEEE Trans. Med. Imaging 34(2), 652–661 (2015) 11. 
Nogales, F.J., Contreras, J., Conejo, A.J., Espinola, R.: Forecasting next-day electricity prices by time series models. IEEE Tran. Power Syst. 17(2), 342–348 (2002)


12. Huarnga, K., HuiKuang, Y.: A type 2 fuzzy time series model for stock index forecasting. Phys. A 353, 445–462 (2005) 13. Chand, S., Chandra, R.: Cooperative coevolution of feed forward neural networks for financial time series problem. In: 2014 International Joint Conference on Neural Networks (IJCNN), pp. 202–209. IEEE, Beijing (2014) 14. Yatao, L., Fen, Y.: Multivariate time series analysis in corporate decision-making application. In: 2011 International Conf. on Information Technology, Computer Engineering and Management Sciences (ICM), pp. 374–376. IEEE, Nanjing (2011) 15. Yi, C., Yuhua, L., Coleman, S., Belatreche, A., McGinnity, T.M.: Adaptive hidden markov model with anomaly states for price manipulation detection. IEEE Trans. Neural Netw. Learn. Syst. 26(2), 318–330 (2015) 16. Fujimaki, R., Nakata, T., Tsukahara, H., Sato, A, Yamanishi, K.: Mining abnormal patterns from heterogeneous time-series with irrelevant features for fault event detection. J. Statist. Anal. Data Mining. 2(1), 1–17 (2009) 17. Pascale, A., Nicoli, M.: Adaptive Bayesian network for traffic flow prediction. In: Statistical Signal Processing Workshop (SSP), pp. 177–180. IEEE (2011) 18. Peixian, L., Zhixiang, T., Lili Y., Kazhong D.: Time series prediction of mining subsidence based on genetic algorithm neural network. In: 2011 International Symposium on Computer Science and Society (ISCCS), pp. 83–86. IEEE, Kota Kinabalu (2011) 19. Hao, Q., Srinivasan, D., Khosravi, A.: Short-term load and wind power forecasting using neural network-based prediction intervals. IEEE Trans. Neural Netw. Learn. Syst. 25(2), 303–315 (2014) 20. Arbi, I.B., Derbel, F., Strakosch, F.: Forecasting methods to reduce energy consumption in WSN. In: 2017 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), pp. 1–6 (2017) 21. Ramadhan, A., Khodra, M.L.: Ranking prediction for time-series data using learning to rank (Case Study: Top mobile games prediction). 
In: 2014 International Conference of Advance Informatics: Concept, Theory and Application (ICAICTA), pp. 214–219. IEEE, Bandung (2014) 22. Zhan, P., Xu, H., Luo, W., Li, W.: A novel network traffic anomaly detection approach using the optimal ϕ-DTW. In: 2020 IEEE 11th International Conference on Software Engineering and Service Science (ICSESS), pp. 1–4 (2020) 23. Xiaoshuang, X., Tao, J., Wei, C., Yan, H., Xiuzhen, C.: Spectrum prediction in cognitive radio networks. Wirel. Commun. 20(2), 90–96 (2013) 24. Matsumoto, T, Yosui, K.: Adaptation and change detection with a sequential monte carlo scheme. IEEE Trans. Syst., Man, Cybern. B. 37(3), 592–606 (2007) 25. Bo-Tsuen, C., Mu-Yen, C., Min-Hsuan, F., Chia-Chen, C.: Forecasting stock price based on fuzzy time-series with equal-frequency partitioning and fast Fourier transform algorithm. In: 2012 Computing, Communications and Applications Conference (ComComAp), pp. 238– 243. IEEE, Hong Kong (2012) 26. Hilas, C.S., Rekanos, I.T., Goudos, S.K., Mastorocostas, P.A., Sahalos, J.N.: Level change detection in time series using higher order statistics. In: 16th International Conference on Digital Signal Processing, pp. 1–6. IEEE, Santorini-Hellas (2009) 27. Dabhi, V.K., Chaudhary, S.: Time series modeling and prediction using postfix genetic programming. In: 2014 Fourth International Conference on Advanced Computing & Communication Technologies (ACCT), pp. 307–314. IEEE, Rohtak (2014) 28. Harphama, C., Dawson, C.W.: The effect of different basis functions on a radial basis function network for time series prediction: a comparative study. Neurocomputing 69(16), 2161–2170 (2006)


29. Li, Y., Kong, X., Fu, H., Tian, Q.: Contextual modeling on auxiliary points for robust image reranking. Front. Comp. Sci. 13(5), 1010–1022 (2018). https://doi.org/10.1007/s11704-0187403-7 30. Bas, E., Egrioglu, E., Aladag, C.H., Yolcu, U.: Fuzzy-time-series network used to forecast linear and nonlinear time series. Appl. Intell. 43(2), 343–355 (2015). https://doi.org/10.1007/ s10489-015-0647-0 31. Rahman, M.M., Santu, S.K.K., Islam, M.M., Murase, K.: Forecasting time series - a layered ensemble architecture. In: 2014 International Joint Conference on Neural Networks (IJCNN), pp. 210–217. IEEE, Beijing (2014) 32. SadeghiNiaraki, A., Mirshafiei, P., Shakeri, M., Choi, S.-M.: Short-term traffic flow prediction using the modified Elman recurrent neural network optimized through a genetic algorithm. IEEE Access 8, 217526 (2020) 33. Miranian, A., Abdollahzade, M.: Developing a local least-squares support vector machinesbased neuro-fuzzy model for nonlinear and chaotic time series prediction. IEEE Trans. Neural Netw. Learn. Syst. 24(2), 207–218 (2013) 34. Matsumoto, T.: Connectionist interpretation of the association between cognitive dissonance and attention switching. Neural Netw. 60, 119–132 (2014) 35. Ruparelia, N.: Software development lifecycle models, Hewlett Packard enterprise. ACM SIGSOFT Softw. Eng. Notes 35(3), 8–13 (2010) 36. Stefanou, C.J.: System Development Life Cycle, pp. 329–344. Encyclopedia of Information Systems, Elsevier (2003) 37. Ashmore, R., Calinescu, R., Paterson, C.: Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges. arXiv:1905.04223, pp. 1–36 (2019) 38. Bhattacharyya, S.: Confidence in predictions from random tree ensembles. In: 11th International Conference on Data Mining (ICDM), pp. 71–80. IEEE, Vancouver (2011) 39. Yamasaki, T., Maeda, T., Aizawa, K.: SVM is not always confident: telling whether the output from multiclass SVM is true or false by analysing its confidence values. 
In: 16th International Workshop on Multimedia Signal Processing, pp. 1–5. IEEE, Jakarta (2014) 40. Haghighi, S., Jasemi, M., Hessabi, S., Zolanvari, A.: PyCM: Multiclass confusion matrix library in Python. J. Open Source Softw. 3(25), 729 (2018) 41. Delgado, R., Núñez-González, J.D. Enhancing Confusion Entropy (CEN) for binary and multiclass classification. PLoS ONE. 14(1), e0210264 (2019) 42. Huk, M., Szczepanik, M.: Multiple classifier error probability for multi-class problems. Eksploatacja i Niezawodno´sc´ – Maint. Reliab. 51(3), 12–20 (2011) 43. Wang, X., Davidson, N.: The upper and lower bounds of the prediction accuracies of ensemble methods for binary classification. In: Ninth International Conference on Machine Learning and Applications, pp. 373–378. IEEE, Washington, DC (2010) 44. Kuncheva, L.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, New York (2004) 45. Szczepanik, M., Jó´zwiak, I.: Fingerprint recognition based on minutes groups using directing attention algorithms. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) Artificial Intelligence and Soft Computing. LNCS (LNAI), vol. 7268, pp. 347–354. Springer, Heidelberg (2012). https://doi.org/10.1007/9783-642-29350-4_42 46. Tripathi, A.M., Baruah, R.D.: Contextual anomaly detection in time series using dynamic Bayesian network. In: Nguyen, N.T., Jearanaitanakij, K., Selamat, A., Trawi´nski, B., Chittayasothorn, S. (eds.) Intelligent Information and Database Systems. LNCS (LNAI), vol. 12034, pp. 333–342. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-42058-1_28 47. Huk, M.: Backpropagation generalized delta rule for the selective attention Sigma-if artificial neural network. Int. J. Appl. Math. Comput. Sci. 22(2), 449–459 (2012)


48. Dragulescu, D., Albu, A.: Expert system for medical predictions. In: Proceedings of 4th International Symposium on Applied Computational Intelligence and Informatics, Timisoara, Romania, pp. 123–128. IEEE (2007) 49. Mikusova, M., Abdunazarov, J., Zukowska, J.: Modelling of the movement of design vehicles on parking space for designing parking. Commun. Comput. Inf. Sci. 1049, 188–201 (2019) 50. P¸eszor, D., Paszkuta, M., Wojciechowska, M., Wojciechowski, K.: Optical flow for collision avoidance in autonomous cars. In: Nguyen, N.T., Hoang, D.H., Hong, T.-P., Pham, H., Trawi´nski, B. (eds.) Intelligent Information and Database Systems. LNCS (LNAI), vol. 10752, pp. 482–491. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75420-8_46 51. Wereszczy´nski, K., et al.: Cosine series quantum sampling method with applications in signal and image processing. arXiv:2011.12738, pp. 1–64 (2020)

Speeding Up Recommender Systems Using Association Rules

Eyad Kannout(B), Hung Son Nguyen, and Marek Grzegorowski

Institute of Informatics, University of Warsaw, Warsaw, Poland
{eyad.kannout,son,m.grzegorowski}@mimuw.edu.pl

Abstract. Recommender systems are considered one of the most rapidly growing branches of Artificial Intelligence. The demand for more efficient techniques to generate recommendations is becoming urgent. However, many recommendations become useless if there is a delay in generating and showing them to the user. Therefore, we focus on improving the speed of recommendation systems without impacting their accuracy. In this paper, we suggest a novel recommender system based on Factorization Machines and Association Rules (FMAR). We introduce an approach to generate association rules using two algorithms: (i) apriori and (ii) frequent pattern (FP) growth. These association rules are utilized to reduce the number of items passed to the factorization machines recommendation model. We show that FMAR significantly decreases the number of new items that the recommender system has to evaluate and, hence, decreases the time required for generating the recommendations. On the other hand, while building the FMAR tool, we concentrate on striking a balance between prediction time and the accuracy of the generated recommendations, to ensure that the accuracy is not significantly impacted compared to using factorization machines without association rules.

Keywords: Recommendation system · Association rules · Apriori algorithm · Frequent pattern growth algorithm · Factorization machines · Prediction time · Quality of recommendations

1 Introduction

Throughout the past decade, recommender systems have become an essential feature in our digital world due to their great help in guiding users towards the items they are most likely to enjoy. Recently, recommendation systems have taken an ever larger place in our lives, especially during the COVID-19 pandemic, when many people all over the world switched to online services to reduce direct interaction with each other. Many researchers do not expect life to return to normal even after the epidemic. All these factors have made recommender systems inevitable in our daily online journeys. Many online services try to boost their sales by implementing recommendation systems that estimate users' preferences or ratings in order to generate personalized offers and thus recommend items that are interesting for the users. Recommendation systems can be built using different techniques which leverage the rating history and possibly some other information, such as users' demographics and items' characteristics. The goal is to generate more relevant recommendations. However, these recommendations might become useless if the recommendation engine does not produce them within a proper time frame. Recently, the factorization machine has become a prevalent technique in the context of recommender systems due to its capability of handling large, sparse datasets with many categorical features. Although many studies have proved that factorization machines can produce accurate predictions, we believe that the prediction time should also be considered while evaluating this technique. Therefore, in this paper, we work on a novel approach that incorporates association rules into generating recommendations with the factorization machines algorithm, in order to improve the efficiency of recommendation systems. It is worth noting that the factorization machine model is used to evaluate our method and compare the latency of FM before and after using the association rules. However, in practice, our method can be combined with any other recommendation engine to speed up its recommendations.

Research co-funded by Polish National Science Centre (NCN) grant no. 2018/31/N/ST6/00610.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 167–179, 2022. https://doi.org/10.1007/978-3-031-21967-2_14
The main contributions of this paper are as follows: 1) proposing a method that uses the apriori algorithm or the frequent pattern growth (FP-growth) algorithm to generate association rules which suggest items for every user based on the rating history of all users; 2) utilizing these association rules to create a shortlisted set of items for which predictions need to be generated; 3) employing a factorization machines model to predict missing user preferences for the shortlisted set of items and evaluate the top-N produced predictions. The remainder of this paper is organized as follows. In Sect. 2, we provide background information on the factorization machines algorithm and association rules, in addition to reviewing some related works. In Sect. 3, we describe the problem we study in this paper and present FMAR, a novel recommender system that utilizes factorization machines and association rules to estimate users' ratings for new items. Section 4 evaluates and compares FMAR with a traditional recommender system built without employing association rules. Finally, in Sect. 5, we conclude the study and suggest possible future work.

2 Preliminaries

In this section, we briefly summarize the background on factorization machines and association rules.

2.1 Factorization Machines

In linear models, the effect of one feature depends only on its own value, while in polynomial models, the effect of one feature depends on the values of the other features. Factorization machines [1] can be seen as an extension of a linear model which efficiently incorporates information about feature interactions, or they can be considered equivalent to polynomial regression models where the interactions among features are taken into account by replacing the model parameters with factorized interaction parameters [2]. However, polynomial regression is prone to overfitting due to the large number of parameters in the model. Needless to say, it is computationally inefficient to compute weights for each interaction, since the number of pairwise interactions scales quadratically with the number of features. Factorization machines elegantly handle these issues by finding a latent vector of size k for each feature; the weight of any combination of two features can then be represented by the inner product of the corresponding feature vectors. Therefore, factorization machines factorize the interaction weight matrix W ∈ R^{n×n}, which is used in polynomial regression, as a product V V^T, where V ∈ R^{n×k}. So, instead of modeling all interactions between pairs of features by independent parameters as in polynomial regression (see Eq. 1), we can achieve this using factorized interaction parameters, also known as latent vectors, in factorization machines (see Eq. 2).

y(x) = w_0 + \sum_{i=1}^{N} w_i x_i + \sum_{i=1}^{N} \sum_{j=i+1}^{N} w_{ij} x_i x_j    (1)

where w_0 ∈ R is the global bias, w_i models the strength of the i-th variable, and w_{ij} models the interaction between the i-th and j-th variables.

y(x) = w_0 + \sum_{i=1}^{N} w_i x_i + \sum_{i=1}^{N} \sum_{j=i+1}^{N} \langle v_i, v_j \rangle x_i x_j    (2)

where \langle v_i, v_j \rangle models the interaction between the i-th and j-th variables by factorizing it, V ∈ R^{n×k}, and \langle \cdot, \cdot \rangle is the dot product of two vectors of size k. This advantage is very useful in recommendation systems, since the datasets are mostly sparse, which adversely affects the ability to learn the feature interaction matrix, as learning it depends on the feature interactions being explicitly recorded in the available dataset.
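The pairwise term in Eq. (2) can be evaluated in O(nk) rather than O(n²) time using the well-known reformulation from [1]; a minimal NumPy sketch (function and variable names are ours, for illustration only):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine score for one feature vector (Eq. 2).

    x: features of length n, w0: global bias, w: linear weights (n,),
    V: latent factor matrix of shape (n, k).
    """
    linear = w0 + w @ x
    # O(nk) reformulation of sum_{i<j} <v_i, v_j> x_i x_j:
    # 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    xv = V.T @ x
    pairwise = 0.5 * float(np.sum(xv ** 2 - (V ** 2).T @ (x ** 2)))
    return float(linear) + pairwise
```

The reformulated sum is algebraically identical to the naive double loop over feature pairs, which is easy to verify numerically on random inputs.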

2.2 Association Rules

The basic idea of association rules [3,4] is to uncover all relationships between elements from massive databases. These relationships between the items are extracted using every distinct transaction. In other words, association rules try
to find global or shared preferences across all users, rather than an individual's preferences as in collaborative filtering-based recommender systems. At a basic level, association rule mining [3–5] analyzes data for patterns or co-occurrences using machine learning models. An association rule consists of an antecedent, which is an item found within the data, and a consequent, which is an item found in combination with the antecedent. Various metrics, such as support, confidence, and lift, identify the most important relationships and quantify their strength. The support metric [3–5] gives an idea of how frequent an itemset is across all transactions (see Eq. 3); the itemset here includes all items in the antecedent and the consequent. The confidence [3–5] indicates how often the rule is true, i.e., the percentage of occurrences of the consequent given that the antecedent occurs (see Eq. 4). Finally, the lift [5] is used to discover and exclude weak rules that have high confidence; it is calculated by dividing the confidence by the unconditional probability of the consequent (see Eq. 5). Various algorithms exist to create association rules using these metrics, such as Apriori [4,6], AprioriTID [4,6], Apriori Hybrid [4,6], AIS (Artificial Immune System) [7], SETM [8] and FP-growth (frequent pattern growth) [4,9]. In the next section, we provide more details about how we use these metrics to find the association rules used to improve the prediction time in the recommender system.

Support({X} → {Y}) = (Transactions containing both X and Y) / (Total number of transactions)    (3)

Confidence({X} → {Y}) = (Transactions containing both X and Y) / (Transactions containing X)    (4)

Lift({X} → {Y}) = Confidence({X} → {Y}) / (Fraction of transactions containing Y)    (5)
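All three metrics can be computed directly from a transaction list; the sketch below is our own illustrative helper (not part of FMAR) and assumes transactions and rule sides are represented as Python sets:

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent (Eqs. 3-5).

    transactions: list of item sets; antecedent/consequent: sets of items.
    """
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ant = sum(1 for t in transactions if antecedent <= t)
    cons = sum(1 for t in transactions if consequent <= t)
    support = both / n
    confidence = both / ant if ant else 0.0
    # lift = confidence / P(consequent)
    lift = confidence / (cons / n) if cons else 0.0
    return support, confidence, lift
```

For example, `rule_metrics(tx, {"bread"}, {"milk"})` measures how often baskets with bread also contain milk, relative to milk's overall frequency.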

2.3 Related Works

Over the past decade, many algorithms aimed at improving the accuracy of recommendations have been proposed. However, while reviewing the research literature on recommendation systems and what has been done to improve prediction time, we find that there is a research gap in this area, even though the speed of recommendation, besides accuracy, is a major factor in real-time recommender systems. Xiao et al. [10] worked on increasing the speed of recommendation engines. They observed that the dimension of the item vector in a collaborative filtering algorithm is usually very large when calculating the similarity between two items. To solve this problem, they introduced methods to create a set of expert users by selecting small parts of the user data. One of these methods selects expert users according to the number of types of products they
have purchased before. In comparison, another method calculates the similarities between users and then selects expert users based on how frequently a user appears in other users' k-most-similar-users sets. The results show that using expert users in an item-based collaborative filtering algorithm increases the speed of generating recommendations while preserving accuracy very close to the original results. Tapucu et al. [11] carried out experiments to check the performance of user-based, item-based, and combined user/item-based collaborative filtering algorithms. Different aspects were considered in their comparisons, such as the size and sparsity of datasets, execution time, and k-neighborhood values. They concluded that the scalability and efficiency of collaborative filtering algorithms need to be improved: the existing algorithms can deal with thousands of users within a reasonable time, but modern e-commerce systems need to scale to millions of users and hence expect even better prediction time and throughput from recommendation engines. Based on these findings, we believe there is room for further improvements concerning the prediction time and efficiency of recommendation systems.

3 FMAR Recommender System

In this section, we formally state the problem that we aim to tackle and introduce the details of a novel recommender system based on Factorization Machines and Association Rules (FMAR). We first formalize the problem, then describe our proposed model, which comes in two versions depending on the algorithm used to generate the association rules: (i) a factorization machine apriori-based model, and (ii) a factorization machine FP-growth-based model.

3.1 Problem Definition

In many recommender systems, the elapsed time required to generate the recommendations is crucial. Moreover, in some systems, any delay in generating the recommendations can be considered a failure of the recommendation engine. The main problem we address in this paper is minimizing the prediction latency of the recommender system by incorporating association rules into the process of creating it. The main idea is to use the association rules to decrease the number of items whose ratings need to be estimated, and hence the time that the recommender system requires to generate the recommendations. Our goal is also to ensure that the accuracy of the final recommendations is not impacted after filtering the items using the association rules.

3.2 Factorization Machine Apriori Based Model

In this section, we introduce the first version of FMAR, a hybrid model that utilizes the factorization machines [1] and apriori [4,6] algorithms to speed up the process of generating recommendations. Firstly, we use
the apriori algorithm to create a set of association rules based on the rating history of all users. Secondly, we use these rules to create users' profiles, which recommend a set of items for every user. Then, when we need to generate recommendations for a user, we find all products that have not been rated before by this user and, instead of generating predictions for all of them, filter them using the items in the user's profile. Finally, we pass the shortlisted set of items to a recommender system to estimate their ratings using a factorization machines model. In the context of association rules, it is worth noting that while generating the rules, all unique transactions from all users are considered as one group, whereas while calculating the similarity matrix in collaborative filtering algorithms, we need to iterate over all users and each time compute similarities using the transactions corresponding to a specific user. What we need to do to improve the recommendation speed is to generate predictions for part of the items instead of all of them. Next, we introduce the algorithms that we use to generate the association rules and users' profiles (cf. Algorithm 1).

Algorithm 1. Association Rules Generation Using Apriori Algorithm
1: Extract favorable reviews  ▷ ratings > 3
2: Find frequent item-sets F  ▷ support > min_support
3: Extract all possible association rules R
4: for each r ∈ R do
5:   Compute confidence and lift
6:   if (confidence < min_confidence) or (lift < min_lift) then
7:     Filter out this rule from R
8:   end if
9: end for
10: Create users' profile using rules in R

Algorithm 2. Users' Profile Generation
1: for each user do
2:   Find high-rated items based on rating history
3:   Find the rules whose antecedents are a subset of the high-rated items
4:   Recommend all consequents of these rules
5: end for
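The profile-building step of Algorithm 2 can be sketched as follows (names are ours; dropping already-liked items from the profile is our convenience, not part of the pseudocode):

```python
def build_user_profiles(rules, user_liked):
    """Per-user recommended items from association rules (cf. Algorithm 2).

    rules: iterable of (antecedent, consequent) item-set pairs,
    user_liked: dict mapping user id -> set of highly rated items.
    """
    profiles = {}
    for user, liked in user_liked.items():
        recommended = set()
        for antecedent, consequent in rules:
            if antecedent <= liked:  # antecedent is a subset of the liked items
                recommended |= consequent
        # drop items the user already rated highly (our addition, not in Alg. 2)
        profiles[user] = recommended - liked
    return profiles
```

At prediction time, only the not-yet-rated items that appear in a user's profile are passed on to the factorization machines model.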

After generating the users' profiles, we can use them to improve the recommendation speed for any user by generating predictions for a subset of not-yet-rated items instead of all of them. The filtering is simply done using the recommended items extracted for every user from the association rules (cf. Algorithm 2). Moreover, the filtering criteria can be enhanced by using the recommended items of the closest n neighbors of the target user. The similarity between users can be calculated using Pearson correlation or cosine similarity measures.
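For the neighbor-based enhancement mentioned above, cosine similarity over user rating vectors is straightforward to compute; a minimal sketch (our naming, assuming zeros mark unrated items):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two users' rating vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0
```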


On the other hand, it is noteworthy that the association rules in our experiments are generated using the entire dataset, which means that these rules try to find global or shared preferences across all users. However, another way to generate the association rules is to split the dataset based on users' demographics, such as gender, items' characteristics, such as genre, or even contextual information, such as weather and season. Thus, if we are producing recommendations for a female user in the winter season, we can use dedicated rules extracted from historical ratings given by females in winter. Following this strategy, we can generate multiple sets of recommendation rules which can later be used to filter the items at prediction time. Obviously, the rule sets generated after splitting the dataset will be smaller, so the prediction latency can be minimized by selecting a smaller set of rules. This feature is very useful when we want to make a trade-off between the speed and quality of recommendations. Lastly, it is important to note that several experiments were conducted in order to select appropriate values for the hyper-parameters used in the previous algorithms, for instance min_support = 250, min_confidence = 0.65, number of epochs in FM = 100, and number of factors in FM = 8. Multiple factors were taken into consideration while selecting these values, including accuracy, the number of generated rules, and memory consumption.

Algorithm 3. FP-Tree Construction
1: Find the frequency of 1-itemsets
2: Create the root of the tree (represented by null)
3: for each transaction do
4:   Remove the items below the min_support threshold
5:   Sort the items in descending order of frequency support
6:   for each item do  ▷ starting from the highest frequency
7:     if the item does not exist in the branch then
8:       Create a new node with count 1
9:     else
10:      Share the same node and increment the count
11:    end if
12:  end for
13: end for

3.3 Factorization Machine FP-Growth Based Model

In this section, we introduce the second version of FMAR, in which the FP-growth [4,9] algorithm is employed to generate the association rules. In general, the FP-growth algorithm is considered an improved version of the apriori method, which has two major shortcomings: (i) candidate generation of itemsets, which could be extremely large, and (ii) computing support for all candidate itemsets, which is computationally inefficient since it requires scanning the database many times. What makes the FP-growth algorithm different from the apriori algorithm is that no candidate generation is required. This is achieved
by using the FP-tree (frequent pattern tree) data structure, which stores all data in a concise and compact way. Moreover, once the FP-tree is constructed, we can directly use a recursive divide-and-conquer approach to efficiently mine the frequent itemsets without any need to scan the database over and over again. Next, we introduce the steps followed to mine the frequent itemsets using the FP-growth algorithm. We divide it into two stages: (i) FP-tree construction, and (ii) mining the frequent itemsets (cf. Algorithm 3 and Algorithm 4).

Algorithm 4. Mining Frequent Itemsets
1: Sort 1-itemsets in ascending order of frequency support
2: Remove the items below the min_support threshold
3: for each 1-itemset do  ▷ starting from the lowest frequency
4:   Find the conditional pattern base by traversing the paths in the FP-tree
5:   Construct the conditional FP-tree from the conditional pattern base
6:   Generate frequent itemsets from the conditional FP-tree
7: end for
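The tree-building stage (Algorithm 3) can be sketched as follows; this is an illustrative reimplementation with our own naming, and the mining stage (Algorithm 4) is omitted for brevity:

```python
from collections import Counter

class FPNode:
    """A node of the FP-tree: an item, its count, and parent/children links."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_support):
    """Construct an FP-tree (cf. Algorithm 3) from a list of transactions."""
    freq = Counter(item for t in transactions for item in t)
    root = FPNode(None, None)
    root.count = 0
    for t in transactions:
        # keep frequent items only, most frequent first (ties broken by name)
        items = sorted((i for i in t if freq[i] >= min_support),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item in node.children:      # shared prefix: reuse node, bump count
                node.children[item].count += 1
            else:                          # new branch
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, freq
```

Because transactions sharing a frequent prefix share a path, the tree stays compact even for large databases, which is what makes the subsequent mining stage efficient.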

After finding the frequent itemsets, we generate the association rules and users' profiles in the same way as in the FM apriori-based model. Regarding the hyper-parameters of the FP-growth algorithm, we used min_support = 60 and min_confidence = 0.65. Finally, in order to generate predictions, we employ a factorization machines model created using the publicly available software tool libFM [12]. This library provides an implementation of factorization machines together with three learning algorithms: stochastic gradient descent (SGD) [1], alternating least-squares (ALS) [13], and Markov chain Monte Carlo (MCMC) inference [14].
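For reference, libFM consumes data in the libSVM-style sparse text format, one line per rating (target followed by index:value pairs). A minimal writer for one-hot user/item encoding (the index layout below is our own choice, not prescribed by the tool):

```python
def to_libfm_line(rating, user_id, item_id, n_users):
    """One libSVM-style line: 'target idx:val idx:val'.

    Users occupy feature indices [0, n_users); items follow, both one-hot.
    """
    return f"{rating} {user_id}:1 {n_users + item_id}:1"

# e.g. user 3 rating item 7 with 4 stars, 943 users total (MovieLens 100K)
line = to_libfm_line(4, 3, 7, 943)  # -> "4 3:1 950:1"
```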

4 Evaluation for FMAR

In this section, we conduct comprehensive experiments to evaluate the performance of the FMAR recommender system. In our experiments, we used the MovieLens 100K dataset (https://grouplens.org/datasets/movielens/), which was collected by the GroupLens research project at the University of Minnesota. MovieLens 100K is a stable benchmark dataset that consists of 1682 movies and 943 users who provide 100,000 ratings on a scale of 1 to 5. It is important to note that, in this paper, we are not concerned with users' demographics and contextual information, since the association rules are generated based only on rating history.

4.1 Performance Comparison and Analysis

Fig. 1. MAE comparison
Fig. 2. NDCG comparison

In order to provide a fair comparison, we use several metrics and methods to evaluate the FMAR and FM recommender systems, such as Mean Absolute Error

(MAE), Normalized Discounted Cumulative Gain (NDCG), and the Wilcoxon Rank-Sum test. Firstly, we selected 50 users who made a significant number of ratings in the past. For every user, a dedicated testing set was created by arbitrarily selecting 70% of the ratings made by this user. The training set was constructed using the rest of the records in the entire dataset, which are not used in the testing sets; it is used to generate the association rules and build the factorization machines model. For each evaluation method, we created two sets of items for every user: the first one, called original, contains all items in the testing set, while the second one, called shortlisted, is created by filtering the original set using the association rules. Finally, we pass both sets to the factorization machines model to generate predictions and evaluate both versions of FMAR, i.e., the apriori-based and FP-growth-based FM models, by comparing them with the standard FM model operating on the complete data. In the first experiment, we calculate the mean absolute error (MAE) of both recommendation engines. The main goal is to show that the quality of recommendations is not significantly impacted after filtering the items in the testing set using the association rules. Figure 1 compares the mean absolute error of the predictions made using the FM model with the FM apriori model and the FM FP-growth model for 50 users. We use a box plot, a standardized way of displaying the distribution of data, to show how the values of mean absolute error are spread out for the 50 users. This graph encodes five characteristics of the distribution: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The results of this experiment show that the MAE of both versions of FMAR, FM apriori and FM FP-growth, is very close to the MAE of the FM recommender system.
The average MAE over the 50 users is 0.71 for the FM model, 0.80 for FMAR (apriori model), and 0.78 for FMAR (FP-growth model). In the second experiment, we evaluate FMAR by comparing its recommendations with FM using Normalized Discounted Cumulative Gain (NDCG), a measure of ranking quality often used to assess the effectiveness of recommendation systems and web search engines. NDCG is based on the
assumption that highly relevant recommendations are more useful when they appear earlier in the recommendation list. So, the main idea of NDCG is to penalize highly relevant recommendations that appear lower in the list by reducing the graded relevance value logarithmically proportional to the position of the result. Figure 2 compares the accuracy of the FM model with the FM apriori model and the FM FP-growth model for 50 users and shows the distribution of the results. It is worth noting that in this test we calculate NDCG using the highest 10 scores in the ranking generated by FM or FMAR. The results show that both versions of the FMAR model always have higher NDCG values than the FM model, which means that true labels are ranked higher by predicted values in the FMAR model than in the FM model for the top 10 scores in the ranking. In the third evaluation method, we run the Wilcoxon Rank-Sum test on the results of the previous experiments. Firstly, we apply the test to the results of the first experiment in order to verify that the difference in MAE between FM and FMAR is not significant and hence can be discarded. So, we pass two sets of samples of MAE for FM and FMAR. Table 1 shows the p-values for comparing FMAR using the apriori model and the FP-growth model with the FM model. In both cases, we got p-value > 0.05, which means that the null hypothesis is accepted at the 5% significance level, and hence the difference between the two sets of measurements is not significant. On the other hand, we apply the Wilcoxon Rank-Sum test to the results of the second experiment to check whether the difference in NDCG between FM and FMAR is significant. Table 1 shows that p-value < 0.05 for comparing both models of FMAR with the FM model. This means the null hypothesis is rejected at the 5% significance level (the alternative hypothesis is accepted), and hence the difference is significant.
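The NDCG@10 measure used in the second experiment can be computed as follows (an illustrative sketch with our own naming; the graded relevance would be the user's true ratings):

```python
import numpy as np

def ndcg_at_k(relevance, scores, k=10):
    """NDCG@k: gains of the top-k predicted items, discounted logarithmically
    by position and normalized by the ideal (sorted-by-relevance) ranking."""
    relevance = np.asarray(relevance, dtype=float)
    top = np.argsort(scores)[::-1][:k]               # top-k items by predicted score
    discounts = np.log2(np.arange(2, len(top) + 2))  # positions 1..k -> log2(2..k+1)
    dcg = np.sum(relevance[top] / discounts)
    ideal = np.sort(relevance)[::-1][:k]
    idcg = np.sum(ideal / np.log2(np.arange(2, len(ideal) + 2)))
    return float(dcg / idcg) if idcg > 0 else 0.0
```

A perfect ranking yields 1.0; placing relevant items lower pushes the value towards 0.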
Since the FMAR model has higher NDCG than the FM model, we can conclude that FMAR outperforms FM for the highest top 10 predictions.

Table 1. Wilcoxon Rank-Sum Test

Model                                  | p-value
MAE (FM vs FMAR-Apriori model)         | 0.29
MAE (FM vs FMAR-FP Growth model)       | 0.13
NDCG (FM vs FMAR-Apriori model)        | 1.74e−08
NDCG (FM vs FMAR-FP Growth model)      | 0.04
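The Wilcoxon rank-sum test itself is available in SciPy as scipy.stats.ranksums; a sketch on synthetic per-user MAE samples (the data below are generated for illustration and are not the paper's measurements):

```python
import numpy as np
from scipy.stats import ranksums

# Hypothetical per-user MAE samples for 50 users (synthetic, for illustration)
rng = np.random.default_rng(42)
mae_fm = rng.normal(loc=0.71, scale=0.05, size=50)
mae_fmar = rng.normal(loc=0.80, scale=0.05, size=50)

stat, p_value = ranksums(mae_fm, mae_fmar)
# p_value > 0.05 would mean we fail to reject the null hypothesis at the 5% level
print(f"statistic={stat:.3f}, p-value={p_value:.4g}")
```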

In the last experiment, we compare FM and FMAR in terms of the speed of their operation, measured as the number of predictions performed by the factorization machines model. The main idea is to estimate the time necessary to prepare recommendations for every tested user under both evaluated approaches.

Fig. 3. Comparison of the speed of methods (estimated by the number of predictions made by the factorization machine model, i.e., the lower the better)

Figure 3 shows the distribution of the results of this experiment for the selected 50 test users. Observably, the number of items that we need to predict with FMAR is significantly lower due to using the association rules for filtering. The results show that the FMAR model can generate predictions for any user at least four times faster than the FM model. Finally, it is noteworthy that generating the rules is part of the training procedure. It is therefore a one-time effort, and there is no need to regenerate or update the association rules frequently in FMAR. Consequently, the computational cost of the training procedure for each method, including extracting the association rules, is not considered in our comparisons. In the final analysis, all previous experiments showed that, after applying our method, the factorization machines could perform significantly faster with no drop in quality considering the MAE and NDCG measures.

5 Conclusions and Future Work

This article introduces FMAR, a novel recommender system which methodically incorporates association rules into generating recommendations with the factorization machines model. Our study evaluates two approaches to creating association rules based on the users' rating history, namely the apriori and frequent pattern growth algorithms. These rules are used to decrease the number of items passed to the model for rating estimation, reducing the prediction latency of the recommender system. To evaluate our proposed model, we conducted comprehensive experiments on the MovieLens 100K dataset using the libFM tool [12], which provides implementations of factorization machines and the associated learning algorithms. Moreover, we presented multiple evaluation methods to compare the performance of FMAR against a recommender system built using the factorization machines algorithm alone. The experimental results show that FMAR improves the efficiency of the recommender system, while its accuracy remains very close to the results produced by the standard recommender system. In future work, we plan to incorporate more information into the process of producing the association rules, such as users' demographics, items' characteristics, and contextual information. Another important aspect to consider is to
evaluate our proposed model using different recommender systems and datasets of different sizes. We are also interested in creating a web interface where FMAR is used to generate recommendations for users. In this scenario, we are particularly interested in employing more advanced algorithms to generate the association rules, such as AprioriTID [4,6], Apriori Hybrid [4,6], AIS (Artificial Immune System) [7] and SETM [8]. We believe that using these algorithms would help to further improve the performance and accuracy of FMAR. As a result, FMAR could be evaluated under different settings selected in a user-friendly web interface. Furthermore, we plan to account for changes in users' behavior and preferences by periodically updating the association rules based on recent changes in the rating history. Another direction of future work is to utilize the generated association rules to solve the cold-start problem in recommendation systems, where new users have no (or very few) past ratings. Finally, we plan to use distributed stream processing engines, like Apache Flink, to examine parallel implementations of FMAR, where the process of extracting the rules and generating the recommendations scales to vast streams or large-scale datasets.

References

1. Rendle, S.: Factorization machines. In: Proceedings of the IEEE International Conference on Data Mining, pp. 995–1000 (2010). https://doi.org/10.1109/ICDM.2010.127
2. Freudenthaler, C., Schmidt-Thieme, L., Rendle, S.: Factorization machines: factorized polynomial regression models (2011)
3. Haotong, W.: Data association rules mining method based on improved apriori algorithm. In: Proceedings of the 4th International Conference on Big Data Research (ICBDR 2020). Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3445945.3445948
4. Satyavathi, N., Rama, B., Nagaraju, A.: Present state-of-the-art of dynamic association rule mining algorithms. Int. J. Eng. Adv. Technol. 9(1), 6398–6405 (2019). https://doi.org/10.35940/ijeat.A2202.109119
5. Bao, F., Mao, L., Zhu, Y., Xiao, C., Xu, C.: An improved evaluation methodology for mining association rules. Axioms 11, 17 (2022). https://doi.org/10.3390/axioms11010017
6. Merry, K.P., Singh, R.K., Kumar, S.S.: Apriori-hybrid algorithm as a tool for colon cancer microarray data classification. Int. J. Eng. Res. Dev. 4, 53–57 (2012)
7. Khurana, K., Sharm, S.: A comparative analysis of association rule mining algorithms. J. Sci. Res. Publ. 3(5) (2013)
8. Saxena, A., Rajpoot, V.: A comparative analysis of association rule mining algorithms. In: IOP Conference Series: Materials Science and Engineering, vol. 1099 (2021). https://doi.org/10.1088/1757-899X/1099/1/012032
9. Zeng, Y., Yin, S., Liu, J., Zhang, M.: Research of improved FP-growth algorithm in association rules mining. Sci. Program. (2015). https://doi.org/10.1155/2015/910281
10. Xiao, W., Yao, S., Wu, S.: Improving on recommend speed of recommender systems by using expert users. In: Chinese Control and Decision Conference (CCDC), pp. 2425–2430 (2016). https://doi.org/10.1109/CCDC.2016.7531392
11. Tapucu, D., Kasap, S., Tekbacak, F.: Performance comparison of combined collaborative filtering algorithms for recommender systems. In: 2012 IEEE 36th Annual Computer Software and Applications Conference Workshops, pp. 284–289 (2012). https://doi.org/10.1109/COMPSACW.2012.59
12. Rendle, S.: Factorization machines with libFM. ACM Trans. Intell. Syst. Technol. 3(3), Article 57, 1–22 (2012). https://doi.org/10.1145/2168752.2168771
13. Rendle, S., Gantner, Z., Freudenthaler, C., Schmidt-Thieme, L.: Fast context-aware recommendations with factorization machines. In: Proceedings of the 34th ACM SIGIR Conference on Research and Development in Information Retrieval (2011). https://doi.org/10.1145/2009916.2010002
14. Freudenthaler, C., Schmidt-Thieme, L., Rendle, S.: Bayesian factorization machines. In: Proceedings of the NIPS Workshop on Sparse Representation and Low-rank Approximation (2010)

An Empirical Experiment on Feature Extractions Based for Speech Emotion Recognition

Binh Van Duong (1,3), Chien Nhu Ha (1,3), Trung T. Nguyen (2,3), Phuc Nguyen (4), and Trong-Hop Do (1,3)

1 Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam
{18520505,18520527}@gm.uit.edu.vn, [email protected]
2 University of Technology, Ho Chi Minh City, Vietnam
[email protected]
3 Vietnam National University, Ho Chi Minh City, Vietnam
4 New York University, Abu Dhabi, UAE
[email protected]

Abstract. In recent years, the virtual assistant has become an essential part of many applications on smart devices. In these applications, users talk to virtual assistants to give commands, which makes speech emotion recognition an important problem for improving the service quality of virtual assistants. However, speech emotion recognition is not a straightforward task, as emotion can be expressed through various features. A deep understanding of these features is crucial to achieving good results in speech emotion recognition. To this end, this paper conducts empirical experiments on three kinds of speech features (Mel-spectrogram, Mel-frequency cepstral coefficients, and Tempogram) and their variants for the task of speech emotion recognition. Convolutional Neural Networks, Long Short-Term Memory, a Multi-layer Perceptron Classifier, and Light Gradient Boosting Machine are used to build classification models for the emotion classification task based on the three speech features. Two popular datasets, The Ryerson Audio-Visual Database of Emotional Speech and Song and The Crowd-Sourced Emotional Multimodal Actors Dataset, are used to train these models.

Keywords: MFCCs · Mel-spectrogram · Tempogram · Speech emotion

1 Introduction

Lately, many researchers have studied Machine Learning and Deep Learning methods to recognize human emotions in writing, and they have achieved considerable results [1,2]. Speech resources have also begun to be employed. In particular, speech emotion recognition is a promising field for scientific research and practical applications. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2022  N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 180–191, 2022. https://doi.org/10.1007/978-3-031-21967-2_15


Speech is far different from writing, which makes preprocessing a challenging phase when dealing with this kind of data. In particular, the feature extraction phase, which obtains quality features for the experiments, still has many aspects to be explored, because even a brief speech audio clip carries a great deal of information. This paper uses two English speech datasets: The Ryerson Audio-Visual Database of Emotional Speech and Song, and the Crowd-Sourced Emotional Multimodal Actors Dataset. It presents experimental results that compare the effect of each extracted feature, and it also aims to provide a resource for choosing features for the speech emotion recognition problem. The extracted features Mel-spectrogram, Mel-frequency cepstral coefficients, and Tempogram are used to find the most suitable feature for each dataset, and Deep Learning and Machine Learning based models are run on the extracted features for comparison. The research focuses on comparing the effectiveness of different kinds of speech features for the speech emotion recognition task. In Sect. 2, some related research is introduced. In Sect. 3, the datasets are described. The feature details and extraction are presented in Sect. 4. The methodologies used for the experiments are given in Sect. 5. The experimental results are presented in Sect. 6, and Sect. 7 concludes and discusses the work.

2 Literature Review

This section lists previous research related to feature selection and extraction, as well as model selection, for the speech emotion recognition problem. Tin Lay Nwe et al. [3] compared the effectiveness of Log frequency power coefficients (LFPC) with traditional features such as Linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC), using a discrete hidden Markov model (HMM), specifically a 4-state ergodic HMM, as the classifier. The features were extracted from their designated dataset, which contains a total of 720 utterances from 12 different speakers (6 male, 6 female), all speaking in their native languages, Mandarin and Burmese. There are 6 category labels in this dataset: Anger, Disgust, Fear, Joy, Sadness, and Surprise. After conducting various experiments that demonstrate the superiority of LFPC over LPCC and MFCC, the authors concluded that LFPC is a better choice of feature parameters for emotional speech classification than the traditionally used features, and that higher accuracy can be achieved by classifying emotions as more generalized groups rather than individual emotions. Their results also outperformed human subjects' evaluations of the dataset. The authors in [4] combined five different features, including Mel-frequency cepstral coefficients, Chromagram, Mel-scale spectrogram, Tonnetz


representation, and Spectral contrast features as inputs to a 1-dimensional Convolutional Neural Network (CNN) for the classification of different emotions, using samples from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Berlin (EMO-DB), and Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets. Although all models in that study followed a general CNN structure, the authors constructed, modified, and tested various classifier versions for each dataset to obtain the best model. Although the method of combining the mentioned features is not yet optimized, implementing such techniques allowed the researchers to encapsulate some of the most important aspects of the audio files, leading to a richer representation of the samples and improving the overall performance of the classifier models. The authors also used 5-fold cross-validation, an incremental methodology, and data augmentation techniques to achieve more generalized results. For the RAVDESS and IEMOCAP datasets, the final outcomes of the proposed system exceeded the classification accuracy of all other pre-existing frameworks. For the EMO-DB dataset, the final results outperformed all pre-existing frameworks except the study conducted by Zhao et al., and the system compares favorably with that one in terms of generality, simplicity, and applicability. The research on Spectral-Temporal Receptive Fields and MFCC balanced feature extraction for noisy speech recognition [5] compared Mel-frequency cepstral coefficients with the authors' proposed Spectral-temporal receptive fields feature, and also experimented with combinations of different features, using a hidden Markov model for the experiments. It can be seen that comparisons of different speech features are actively being researched.
However, those studies used only one model for classification, and the datasets used were quite limited, leading to a lack of experimental results. This research learns from the previous ones and extends them by using various types of models to run the experiments. Moreover, the three different types of speech features and their variants provide many more experimental results. This paper hopefully contributes more empirical results on the effectiveness of features in the speech emotion recognition problem, which could be a valuable resource for researchers working on speech emotion recognition.

3 Dataset

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). RAVDESS [6] is an English speech dataset created with the participation of 24 professional actors (12 male, 12 female). It is a multimodal dataset, so it can be employed in different research areas. RAVDESS consists of 7356 files: 1440 speech audio files, 1012 song audio files, 2880 videos of the actors' faces while recording the speech audio, and 2024 videos of the actors' faces while recording the song audio. The actors speak with a North American accent. In this paper, the 1440 speech audio files are used for extracting speech features. The audio was created by recording the speech of each actor when he


or she made 60 different recordings of the 2 sentences Kids Are Talking by the Door and Dogs Are Sitting by the Door with 7 different emotional expressions: calm, happy, sad, angry, fearful, surprise, and disgust. The actors were also required to record at two intensity levels: normal and strong.

Crowd-Sourced Emotional Multimodal Actors Dataset (CREMA-D). CREMA-D [7] is a multimodal dataset including 7440 videos and audio files recorded from 91 actors (48 male, 43 female) between the ages of 20 and 74, coming from a variety of races and ethnicities: African American, Asian, Caucasian, Hispanic, and Unspecified. CREMA-D provides both videos and audio; however, this paper only employs the audio from CREMA-D. In CREMA-D, the actors spoke 12 sentences: It's eleven o'clock, That is exactly what happened, I'm on my way to the meeting, I wonder what this is about, The airplane is almost full, Maybe tomorrow it will be cold, I think I have a doctor's appointment, Don't forget a jacket, I think I've seen this before, The surface is slick, I would like a new alarm clock, and We'll stop in a couple of minutes, with 6 emotional expressions (anger, disgust, fear, happy, neutral, and sad) at different intensity levels.

Dataset Statistics. Because of the variety of races and ethnicities in CREMA-D, it is difficult to obtain good results from the models. However, the models can learn different accents from the data, so models trained on it could serve a large group of users in practice. In RAVDESS, the actors have only a North American accent, so models trained on this dataset would apply to a smaller group of users in practice, but the experimental results can be better (Fig. 1).

Fig. 1. Emotions distribution: (a) RAVDESS, (b) CREMA-D.

4 Feature Extraction

Audio data is not as simple to deal with as image data - one of the most popular data types in the Machine Learning/Deep learning field. If we consider each pixel


of an image as a feature, we can eventually end up with an acceptable result. That approach, however, is usually impossible to apply to audio data. Thus, many feature extraction methods for audio have been proposed to assist with exploring it. In this paper, we employ three feature extraction methods: Mel-spectrogram, MFCCs, and Tempogram.

Mel-spectrogram. A spectrum is a visual way of representing the signal strength without accounting for time. To represent the strength of the signal over time, the spectrogram was devised: a spectrogram is like a picture of sound, in which the frequencies that make up the sound vary with time [8]. The Mel term refers to a slight adjustment of the frequency scale [9]. In place of the normal frequency scale, the Mel frequency is generated by stretching low frequencies and compressing high frequencies (see Eq. 1). This simulates how humans perceive sound in practice: human beings tend to be more sensitive to low frequencies and less able to recognize the difference between high ones.

f_Mel = 2595 · log10(1 + f_Hz / 700)    (1)
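Equation 1 and its inverse can be sketched in plain Python (a small illustrative helper; the constants 2595 and 700 are those of Eq. 1):

```python
import math

def hz_to_mel(f_hz):
    """Eq. 1: map a frequency in Hz onto the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of Eq. 1, useful for placing Mel filter-bank edges."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)
```

A 1000 Hz interval near 0 Hz spans roughly 1000 mel, while the same interval between 7 and 8 kHz spans far fewer mel, reflecting the compression of high frequencies described above.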

Mel-Frequency Cepstral Coefficients (MFCCs). To the best of our knowledge, MFCCs are usually the first choice when looking for interesting structure in a sound, especially in speech processing [10]. As the name suggests, this method extracts cepstral coefficients from Mel-frequency filters. The word "cepstral" is wordplay derived by reversing the first four letters of "spectral". To see where that wordplay comes from, consider how the "cepstrum" is computed:

C(x(t)) = F^{-1}[ log( F(x(t)) ) ]    (2)

In the above equation, x(t) is the time-domain signal and F{·} is the Fourier transform operator. What we actually do is take the logarithm of a spectrum and then apply the inverse Fourier transform. Hence we could call it the spectrum of a spectrum or, in a fancier way, the "cepstrum". After producing the cepstrum, we use Mel filters to collect the coefficients. The common number of filters is 13 or a multiple of 13 (e.g., 26, 39). Tempogram. This feature is useful for studying sounds from musical instruments [11]. Roughly speaking, the Tempogram is a representation of tempo, determined by the number of beats per unit of time, and it varies with time. In our experiments, we extract features using librosa1. Each feature yields a two-dimensional matrix with a fixed number of rows and a number of columns that varies with the length of the audio input.
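Equation 2 can be sketched with NumPy as a real-cepstrum helper (illustrative only; the full MFCC pipeline additionally applies Mel filtering and a discrete cosine transform):

```python
import numpy as np

def real_cepstrum(x):
    """Spectrum of a spectrum (Eq. 2): inverse FFT of the log spectrum.
    A small epsilon guards against log(0) on silent frames."""
    spectrum = np.fft.fft(x)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)
    return np.fft.ifft(log_magnitude).real
```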

1 https://librosa.org/doc/latest/index.html.


5 Methodology

5.1 Input Preparation

To train the models, four emotions (angry, fear, happy, and sad) were chosen for classification, because these emotions appear in both datasets and the differences between them are fairly clear. This helps evaluate the models better than using the other emotions. Models such as the simple Deep Learning network, Light Gradient Boosting Machine, and Multi-layer Perceptron Classifier accept one-dimensional input. The other models use two-dimensional input with a fixed length of 3 s per audio clip, so an audio signal with a duration longer than this fixed length is pruned, and a shorter one is padded, before moving to the feature extraction stage (Fig. 2).

Fig. 2. Input processing: (a) one-dimensional input, (b) two-dimensional input.
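The 3-second prune-or-pad step can be sketched as follows (a plain-Python illustration; the sample rate `sr` is an assumed parameter, not stated in the paper):

```python
def fix_length(signal, sr=22050, duration=3.0):
    """Prune or zero-pad a 1-D audio signal to exactly `duration` seconds."""
    target = int(sr * duration)
    if len(signal) >= target:
        return signal[:target]                                # prune long clips
    return list(signal) + [0.0] * (target - len(signal))      # pad short clips
```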

5.2 Classification Models

The parameters of each model shown in this section are the best versions after experiments and fine-tuning. The models' architectures are set up with the help of the public libraries and frameworks Sklearn (https://scikit-learn.org/stable/) and Tensorflow (https://www.tensorflow.org/).

Multi-layer Perceptron Classifier (MLP) is a feed-forward neural network that mimics the information transformation of neurons in the human brain. MLP's parameters: alpha = 0.01, batch size = 256, epsilon = 1e-08, hidden layer sizes = (300), learning rate = 'adaptive', max iter = 1000.

Light Gradient Boosting Machine (Light-GBM) is an open-source framework developed by Microsoft (https://lightgbm.readthedocs.io/en/latest/). It is designed to be distributed and efficient, with the following advantages: faster training speed and higher efficiency, lower memory usage, better accuracy, support for parallel, distributed, and GPU learning, and the ability to handle large-scale data. It appears in many winning solutions of different competitions [12], and published experimental results show that Light-GBM outperforms other gradient boosting algorithms while using less memory [13]. Parameters: learning rate = 0.01, boosting type = 'gbdt', objective = 'multiclass', metric = 'multi logloss', num leaves = 100, min data = 100, max depth = 50, num class = 4, max bin = 150, bagging freq = 100, feature fraction = 0.6, bagging fraction = 0.6.
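The hyperparameters listed above can be collected into plain dictionaries and unpacked into scikit-learn's `MLPClassifier` and LightGBM's `train` (the keyword spellings below follow the two libraries' conventions; treat the exact names as an assumption of this sketch):

```python
# Paper-reported hyperparameters, written with scikit-learn / LightGBM keyword names.
MLP_PARAMS = dict(
    alpha=0.01, batch_size=256, epsilon=1e-08,
    hidden_layer_sizes=(300,), learning_rate="adaptive", max_iter=1000,
)

LGBM_PARAMS = dict(
    learning_rate=0.01, boosting_type="gbdt", objective="multiclass",
    metric="multi_logloss", num_leaves=100, min_data=100, max_depth=50,
    num_class=4, max_bin=150, bagging_freq=100,
    feature_fraction=0.6, bagging_fraction=0.6,
)

# Usage sketch:
#   from sklearn.neural_network import MLPClassifier
#   clf = MLPClassifier(**MLP_PARAMS)
#   import lightgbm as lgb
#   booster = lgb.train(LGBM_PARAMS, lgb.Dataset(X_train, y_train))
```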


Simple Deep Learning (sDL) Architecture. This model consists of 4 dense layers with 13, 128, 128, and 4 nodes, respectively. The relu activation function is applied to the first three layers, and softmax to the last layer. The optimizer is adam, and the learning rate is 0.001 (Fig. 3).

Fig. 3. Visual simple Deep Learning’s architecture.

Convolutional Neural Network + Long Short-Term Memory (LSTM). This model (Conv-1d + LSTM) is built with four one-dimensional convolutional (Conv-1d) layers; the result of each layer is fed into a batch normalization layer and then a one-dimensional max-pooling layer. To avoid overfitting during training, a dropout layer is added before the features are fed into an LSTM layer with 256 units. Besides, the model can fine-tune the learning rate using the ReduceLROnPlateau callback when a metric has stopped improving for 20 epochs. The optimizer is adam.

Convolutional Neural Network. This model's architecture is similar to Conv-1d + LSTM, but the four Conv-1d layers are replaced by two Conv-2d layers without the LSTM layer, and the optimizer is SGD (Fig. 4).

Fig. 4. Visual architectures of models: (a) Conv-1d + LSTM, (b) Conv-2d.

Three Bi-classifiers Stacking Model. This model is based on independent bi-classifiers, each trained for one pair of emotions, such as angry-happy and fear-sad. The bi-classifiers are MLP, Light-GBM, Conv-1d + LSTM, and Conv-2d.


6 Experimental Results

In addition to evaluating the three types of features (Mel-spectrogram, MFCCs, and Tempogram) on the designed models, varying the dimension of the MFCCs on several models yields additional comparative numbers. This is an advantage when comparing and discussing the strengths and weaknesses of different parameter sets as well as of the implemented features. Table 1 presents the results of the most parameter-optimized models described in Sect. 5: it shows the results of the three sorts of features on RAVDESS using Light-GBM and MLP. It is noteworthy that the Combined feature (synthesized from MFCCs, Mel-spectrogram, and Tempogram) brings the best scores for both accuracy and F1-score. The Combined feature with the Light-GBM model returns 59% for both accuracy and F1-score, and, most noticeably, the Combined feature with the MLP model reaches approximately 70% for both metrics. The results from the three separate features also differ: MFCCs and Mel-spectrogram perform more efficiently in both models, and MFCCs is the best of the three features. A "NaN" value in the result tables means that the experimental result is not available because of limited computation resources. Experimental results on CREMA-D are presented in Table 2. In this comparison, the best performance still belongs to the Combined feature, with 61% accuracy and 60% F1-score for both the Light-GBM and MLP classifiers. Additionally, Mel-spectrogram with the Light-GBM classifier yields similarly competitive rates. The Tempogram feature still returns relatively low figures, though with small improvements compared to the MFCCs-based methods.
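The Combined feature evaluated above is, in essence, a concatenation of the per-clip feature vectors (a minimal sketch; the exact fusion used in the paper may differ):

```python
def combine_features(*feature_vectors):
    """Concatenate per-clip vectors (e.g. MFCCs, Mel-spectrogram, and
    Tempogram statistics) into one 'Combined feature' vector."""
    combined = []
    for vector in feature_vectors:
        combined.extend(vector)
    return combined
```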
With MLP, the Tempogram extraction yields 55% accuracy and 54% F1-score, figures that are slightly higher than those obtained with MFCCs.

Table 1. Results on RAVDESS.

                    Light-GBM             MLP
                    Accuracy   F1 score   Accuracy   F1 score
  MFCCs             0.54       0.52       0.63       0.62
  Mel-Spectrogram   0.52       0.53       0.55       0.53
  Tempogram         0.47       0.46       0.43       0.42
  Combined feature  0.59       0.59       0.70       0.69

Table 2. Results on CREMA-D.

                    Light-GBM             MLP
                    Accuracy   F1 score   Accuracy   F1 score
  MFCCs             0.54       0.53       0.55       0.51
  Mel-Spectrogram   0.61       0.60       0.58       0.57
  Tempogram         0.52       0.50       0.55       0.54
  Combined feature  0.61       0.60       0.61       0.60

Table 3. Results on RAVDESS + CREMA-D.

                    Light-GBM             MLP
                    Accuracy   F1 score   Accuracy   F1 score
  MFCCs             0.57       0.57       0.55       0.53
  Mel-Spectrogram   0.62       0.61       0.56       0.55
  Tempogram         0.51       0.49       0.51       0.51
  Combined feature  0.65       0.64       0.62       0.62

Table 3 shows the experimental results from training models on the combination of RAVDESS and CREMA-D. The combined features give the best result, with 65% accuracy and 64% F1 score for Light-GBM, 62% accuracy and F1 score for MLP. Tables 4, 5, and 6 provide the result from models with different number of MFCCs coefficients. Models trained on MFCCs outperform the others with results greater than 70% in accuracy and F1 score. MFCCs dominants the other features, and the number of MFCCs coefficients also affects the models’ performances. Table 4. Results on RAVDESS. sDL Conv-1d + LSTM Conv-2d Accuracy F1 score Accuracy F1 score Accuracy F1 score MFCCs-13

0.44

0.41

0.71

0.72

0.64

0.64

MFCCs-26

0.65

0.66

0.66

0.67

0.64

0.64

MFCCs-39

0.65

0.66

0.65

0.65

0.71

0.71

Mel-Spectrogram 0.47

0.49

0.44

0.42

0.55

0.55

Tempogram

0.43

0.43

0.4

0.38

0.35

0.26

Combined feature 0.71

0.71

NaN

NaN

NaN

NaN


Table 5. Results on CREMA-D.

                    sDL                   Conv-1d + LSTM        Conv-2d
                    Accuracy   F1 score   Accuracy   F1 score   Accuracy   F1 score
  MFCCs-13          0.55       0.55       0.56       0.56       0.58       0.57
  MFCCs-26          0.50       0.50       0.56       0.56       0.64       0.64
  MFCCs-39          0.55       0.55       0.55       0.56       0.64       0.64
  Mel-Spectrogram   0.46       0.46       0.44       0.38       0.46       0.42
  Tempogram         0.53       0.53       0.42       0.36       0.34       0.32
  Combined feature  0.60       0.60       NaN        NaN        NaN        NaN

Table 6. Results on RAVDESS + CREMA-D.

                    sDL                   Conv-1d + LSTM        Conv-2d
                    Accuracy   F1 score   Accuracy   F1 score   Accuracy   F1 score
  MFCCs-13          0.58       0.54       0.59       0.57       0.61       0.61
  MFCCs-26          0.58       0.57       0.54       0.53       0.64       0.64
  MFCCs-39          0.53       0.52       0.51       0.45       0.58       0.59
  Mel-Spectrogram   0.45       0.45       0.45       0.44       0.50       0.49
  Tempogram         0.54       0.53       0.38       0.32       0.48       0.49
  Combined feature  0.60       0.60       NaN        NaN        NaN        NaN

Table 7. Results on RAVDESS + CREMA-D using bi-classifiers of Light-GBM and MLP.

                    Light-GBM             MLP
                    Accuracy   F1 score   Accuracy   F1 score
  MFCCs             0.35       0.23       0.44       0.44
  Mel-Spectrogram   0.51       0.44       0.56       0.56
  Tempogram         0.52       0.38       0.48       0.48

Table 8. Results on RAVDESS + CREMA-D using bi-classifiers of sDL, Conv-1d + LSTM, and Conv-2d.

                    sDL                   Conv-1d + LSTM        Conv-2d
                    Accuracy   F1 score   Accuracy   F1 score   Accuracy   F1 score
  MFCCs             0.46       0.46       0.46       0.44       0.45       0.39
  Mel-Spectrogram   0.52       0.52       0.27       0.18       0.42       0.38
  Tempogram         0.52       0.50       NaN        NaN        0.36       0.32


From the experimental results synthesized in Tables 1, 2 and 3, the Combined feature is the best approach to this kind of classification. This means that each feature has a distinct informative representation, as described in Sect. 4, and that these features can be combined to boost the models' performance. Regarding the single-feature approach, we highly recommend using MFCCs or Mel-spectrogram in this context. Tempogram is also a valuable feature, although it requires more fine-tuning to compete with the other extraction methods; the reason Tempogram is not a preferable choice is that its extracted information is closely related to acoustic repetition, which is more often seen in musical beat research. In terms of the MFCCs feature, it is evident that the extracted dimension, i.e., the number of MFCC coefficients, should be tuned to obtain the best result. We also conducted a binary approach using only two kinds of emotion, with the hope of achieving better results (Tables 7 and 8). However, this strategy produced even worse numbers; therefore, we concentrate on four types of emotion as an ideal quantity for further research.

7 Conclusion and Discussion

The experimental results obtained in this research provide a better understanding of the features used in the speech emotion recognition problem. The research furnishes the details of each feature extraction and the basic knowledge of the research field. It can be seen that utilizing different features at once (like the Combined feature) can substantially improve the models. In the future, this work can be extended in both feature selection and modeling: many other kinds of speech features, and ways of combining them, might be employed, and the models' architectures can be developed further. Furthermore, a Vietnamese emotional speech dataset could be created to supply a data resource for the speech emotion recognition problem.

References 1. Luu, S.T., Nguyen, H.P., Van Nguyen, K., Nguyen, N.L.-T.: Comparison between traditional machine learning models and neural network models for Vietnamese hate speech detection. In: 2020 RIVF International Conference on Computing and Communication Technologies (RIVF), pp. 1–6. IEEE (2020) 2. Van Huynh, T., Nguyen, V.D., Van Nguyen, K., Nguyen, N.L.-T., Nguyen, A.G.-T.: Hate speech detection on Vietnamese social media text using the Bi-GRU-LSTMCNN model. arXiv preprint arXiv:1911.03644 (2019) 3. Nwe, T.L., Foo, S.W., De Silva, L.C.: Speech emotion recognition using hidden Markov models. Speech Commun. 41(4), 603–623 (2003) 4. Issa, D., Demirci, M.F., Yazici, A.: Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 59, 101894 (2020) 5. Wang, J.-C., Lin, C.-H., Chen, E.-T., Chang, P.-C.: Spectral-temporal receptive fields and MFCC balanced feature extraction for noisy speech recognition. In: 2014 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1–4. IEEE (2014)


6. Livingstone, S.R., Russo, F.A.: The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PloS ONE 13(5), e0196391 (2018) 7. Cao, H., Cooper, D.G., Keutmann, M.K., Gur, R.C., Nenkova, A., Verma, R.: CREMA-D: crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 5(4), 377–390 (2014) 8. Mulimani, M., Koolagudi, S.G.: Acoustic event classification using spectrogram features. In: 2018 IEEE Region 10 Conference, TENCON 2018, pp. 1460–1464 (2018). https://doi.org/10.1109/TENCON.2018.8650444 9. Stevens, S.S., Volkmann, J., Newman, E.B.: A scale for the measurement of the psychological magnitude pitch. J. Acoust. Soc. Am. 8(3), 185–190 (1937) 10. Tiwari, V.T.: MFCC and its applications in speaker recognition. Int. J. Emerg. Technol. 1, 01 (2010) 11. Tian, M., Fazekas, G., Black, D.A.A., Sandler, M.: On the use of the tempogram to describe audio content and its application to music structural segmentation. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 419–423. IEEE (2015) 12. Tran, K.Q., Duong, B.V., Tran, L.Q., Tran, A.L.-H., Nguyen, A.T., Nguyen, K.V.: Machine learning-based empirical investigation for credit scoring in Vietnam’s banking. In: Fujita, H., Selamat, A., Lin, J.C.-W., Ali, M. (eds.) IEA/AIE 2021. LNCS (LNAI), vol. 12799, pp. 564–574. Springer, Cham (2021). https://doi.org/ 10.1007/978-3-030-79463-7 48 13. Ke, G., et al.: LightGBM: a highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems 30, pp. 3146–3154 (2017)

Parameter Distribution Ensemble Learning for Sudden Concept Drift Detection

Khanh-Tung Nguyen1(B), Trung Tran1, Anh-Duc Nguyen2, Xuan-Hieu Phan2, and Quang-Thuy Ha2

1 Electric Power University, Hanoi, Vietnam
[email protected], [email protected]
2 Vietnam National University (Hanoi), VNU-University of Engineering and Technology, Hanoi, Vietnam
{20021336,hieupx,thuyhq}@vnu.edu.vn

Abstract. Concept drift is a big challenge in data stream mining (including process mining) since it seriously decreases the accuracy of a model in online learning problems. Model adaptation to changes in data distribution before making new predictions is very necessary. This paper proposes a novel ensemble method called E-ERICS, which combines multiple Bayesian-optimized ERICS models into one model and uses a voting mechanism to determine whether each instance of a data stream is a concept drift point or not. The experimental results on the synthetic and classic real-world streaming datasets showed that the proposed method is much more precise and more sensitive (shown in F1-score, precision, and recall metrics) than the original ERICS models in detecting concept drift, especially a sudden drift. Keywords: Concept drift · Data stream · Ensemble learning · Bayesian optimization

1 Introduction

Advanced technologies and their modern devices have been generating more and more data streams. These streams are precious resources if we use them appropriately [5]. One of the most important properties of streaming data is that its distribution always changes over time (i.e., concept drift) [11]. By definition, according to [17],

Concept = P(X, y),    (1)

where P(X, y) is the joint probability function of X and y, X is a d-dimensional feature vector, and y is the target label. According to [10], concept drift between two timestamps t and v occurs if and only if:

P_t(X, y) ≠ P_v(X, y).    (2)

A concept drift is called a real concept drift when:

P_t(y|X) ≠ P_v(y|X).    (3)

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 192–203, 2022. https://doi.org/10.1007/978-3-031-21967-2_16


Otherwise, it is a virtual concept drift. In classification problems, as shown in [2], real concept drift affects the decision boundary, while virtual drift does not. Therefore, drift detection models should mainly be concerned with real concept drifts. Concept drifts can also be divided into 4 types based on the way the concept changes [18]. Let C0 be the old concept and C1 the new concept of the data. The properties of each type are reflected in its name (see Fig. 1, with C0 shown as the blue blocks and C1 as the orange blocks):

- Sudden Drift: the concept of the data changes abruptly from C0 to C1.
- Gradual Drift: C1 appears more and more frequently while C0 gradually disappears.
- Incremental Drift: the old concept changes into the new concept in small steps; there are a number of transitive concepts between C0 and C1.
- Reoccurring context: C0 reoccurs after a period with concept C1.

Fig. 1. Four types of concept drift (Original figure in [18] - pp. 1)

Concept drift detection is essential in data stream mining, since if we cannot point out the drift, predictions made by a model based on past data will not adapt to new data [11]. Several methods have been introduced to deal with concept drift [5]. Early methods, such as DDM [8] and EDDM [3], work by checking whether the change in the error rate is higher than a pre-defined threshold. Some other methods use a windowing mechanism: for instance, to detect concept drift, ADWIN [4] calculates the error rate difference between two windows of changing size, and KSWIN [15] measures the distribution distance between the data in two sliding windows using the Kolmogorov-Smirnov test. Additionally, several recent methods have been developed based on unsupervised learning, or even automatic learning, such as LD3 [9] and Meta-ADD [18]. Ensemble learning, such as model averaging, bagging, and boosting, is often used to take full advantage of its base models while minimizing their drawbacks. Dietterich et al.


[6] gave the definition for ensemble learning as "a learning method that constructs a set of classifiers (base-learners) and then classifies new data points by taking a (weighted) vote of their predictions". In this definition, a base learner is a simple ML model, and a vote is represented by a real number called a weight. An ensemble model sums up the votes of N base learners by calculating the sum of their contributions. In some binary classification tasks, vote i has the value a_i ∈ {0, 1} (a_i = 1 if base learner i predicts the instance as a positive point and a_i = 0 otherwise) and the weight w_i. Given a threshold T (in most cases, T = 0.5), the ensemble model recognizes an instance as positive if it satisfies (4) and as negative otherwise:

P(output = 1 | a) ≥ T  ⇔  ( Σ_{i=1}^{N} a_i w_i ) / ( Σ_{i=1}^{N} w_i ) ≥ T.    (4)

A number of frameworks use the ensemble approach for detecting concept drift. For example, RACE [13] combines various algorithms, such as the J48 Decision Tree, MLP, and SVM, as base learners to detect only recurring drift. ElStream [1] is an ensemble approach in which a base classifier can only vote if it passes a threshold. ERICS (Effective and Robust Identifications of Concept Shift, by Haug and Kasneci) [10] is a framework that detects only real concept drift by approximating P_t(y|X) with a predictive model whose parameter θ changes over time. This approach satisfies two properties: model-awareness and explainability. θ is treated as a random variable, i.e., θ ∼ P(θ; ψ). After several transformation steps (as proved in [10]), formula (3) turns into:

| h[P(θ; ψ_v)] − h[P(θ; ψ_t)] + D_KL[P(θ; ψ_v) || P(θ; ψ_t)] | > 0    (5)

Then, the moving average of (5) over M consecutive timestamps is calculated:

MA_t = Σ_{i=t−M+1}^{t} |h[P(θ; ψ_i)] − h[P(θ; ψ_{i−1})] + D_KL[P(θ; ψ_i) || P(θ; ψ_{i−1})]|   (6)

After that, Haug and Kasneci [10] "measure the total change of (6) in a sliding window of size W":

Σ_{j=t−W+1}^{t} |MA_j − MA_{j−1}| > α_t ⇔ Drift at t,   (7)

with α_t being the threshold, updated whenever a new instance arrives. If a concept drift occurs, α_t is updated by:

α_t = Σ_{j=t−W+1}^{t} |MA_j − MA_{j−1}|   (8)

Parameter Distribution Ensemble Learning for Sudden Concept Drift Detection

195

Otherwise, with Δ_Drift being the distance from t to the nearest detected drift point and β a fixed number (0 < β < 1), we have the following update:

α_t = α_{t−1} − α_{t−1} ∗ β ∗ Δ_Drift   (9)

Experiments in [10] showed that ERICS outperforms traditional concept drift detection methods in terms of average delay, precision, recall, and F1-score. However, ERICS has the drawback of detecting many false drifts. This research is conducted to improve on this weakness. Our main contributions are summarized as follows:

– We introduce a Bayesian optimization pipeline to find hyperparameter values that increase the precision, recall, and F1-score of ERICS [10] models.
– We propose a novel framework, E-ERICS (Ensemble of ERICS models), an improved version of ERICS that ensembles the ERICS models with the highest performance achieved in the Bayesian optimization phase, to detect concept drift (especially sudden concept drift) more precisely and sensitively than the original ERICS model does.

The remainder of this paper is organized as follows. Section 2 introduces the methods used in E-ERICS. Section 3 demonstrates the effectiveness of the new methods and presents our experiments and the results compared with ERICS. Finally, conclusions are given in Sect. 4.
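A simplified, self-contained sketch of the drift test in (6)–(9) follows. The per-timestep change scores would come from Eq. (5); the `detect_drift` function and its toy inputs are illustrative, not the authors' implementation.

```python
# Sketch of the ERICS-style drift rule in Eqs. (6)-(9).
# scores[i] stands in for |h[P(θ; ψ_i)] − h[P(θ; ψ_{i−1})] + D_KL[...]| from Eq. (5).
def detect_drift(scores, M, W, alpha0, beta):
    """Return the timestamps at which a drift is flagged."""
    n = len(scores)
    # Eq. (6): moving average of the change measure over M timestamps.
    ma = [sum(scores[max(0, t - M + 1):t + 1]) / M for t in range(n)]
    alpha, drifts, last_drift = alpha0, [], None
    for t in range(1, n):
        lo = max(1, t - W + 1)
        # Eq. (7): total absolute change of the moving average in a window of size W.
        total = sum(abs(ma[j] - ma[j - 1]) for j in range(lo, t + 1))
        if total > alpha:
            drifts.append(t)
            last_drift = t
            alpha = total                         # Eq. (8): reset the threshold
        elif last_drift is not None:
            delta = t - last_drift                # distance to nearest detected drift
            alpha = alpha - alpha * beta * delta  # Eq. (9): decay the threshold
    return drifts
```

On a stream whose change scores jump (e.g. from 0 to 5), the window sum exceeds α shortly after the jump and a drift is flagged; afterwards α decays so the detector stays sensitive.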

2 Methods

In general, E-ERICS is a concept drift detection model that uses ensemble learning with a voting mechanism in which the base learners are ERICS models [10] whose hyperparameter values are adjusted using Bayesian optimization. In the vote, the weight of the base learner with the highest F1-score is twice that of the others. Figure 2 shows the overall design. E-ERICS is divided into two phases:

– BO-ERICS phase (Bayesian optimization on ERICS): run Bayesian optimization on the ERICS hyperparameters (M, W, β).
– Ensemble phase: ensemble the selected models found in BO-ERICS.

Further details of these two phases are presented in Sects. 2.1 and 2.2.

2.1 BO-ERICS Phase

In this phase, we use Bayesian optimization on some ERICS hyperparameters to obtain models with high performance to use as base learners in the Ensemble phase. As mentioned in Sect. 1, ERICS [10] uses many hyperparameters, but the three most influential ones are the moving average size (M), the sliding window size (W), and the update rate of the threshold (β). Therefore, this research focuses on choosing better moving


Fig. 2. E-ERICS Diagram

average size, sliding window size, and update rate values for the ERICS model (BO-ERICS). For simplicity, in this study, we optimize the F1-score and use the Gaussian process [16] as the surrogate model of the optimization procedure.

Procedure:
• Step 1: Let D be the domain of values of (M, W, β) surrounding the grid-searched values (in [10]) for the dataset.
• Step 2: Initialize several ERICS models with random values of (M, W, β) from D.
• Step 3: From the results of the above models, Bayesian optimization automatically proposes new values of (M, W, β) that are expected to have a higher value of the acquisition function. Running more base models makes it possible to find more accurate models to use in the Ensemble phase.

2.2 Ensemble Phase

This is the crucial phase of our method; its purpose is to generate an ensemble model with better performance than any base model generated in the BO-ERICS phase. We take the best base learners (those with the highest F1-scores) found in the BO-ERICS phase as input and apply ensemble learning to improve the


performance of our model by reducing the number of false positive points and recovering additional true positives.

Initial idea: We intend to build a model that satisfies the following. At an arbitrary point in the data stream, on the one hand, a concept drift detected by the best model of BO-ERICS and by no other model is no longer considered a concept drift point. On the other hand, any concept drift detected by all of the remaining models is added to the drift points of the best model. In small preliminary experiments, the ensemble model with four base learners achieved the best performance, so we use four base models in the Ensemble phase.

Procedure: Select the four base models with the highest F1-scores found in the BO-ERICS phase, where M_0 is the best model (i.e., the model with the highest F1-score overall); we call the others "good models". Ensemble voting, with the weight of M_0's vote being twice that of the three remaining votes, is then applied to each data point, and points with 3 or more votes are claimed by E-ERICS as concept drift points. In this way, the initial problem is transformed into a binary classification task that classifies each instance in a data stream into one of two classes: drift point or non-drift point. Methodologically, the above procedure is equivalent to improving the model with the highest F1-score of the BO-ERICS phase (M_0) using the 3 other high-F1-score base models generated during hyperparameter optimization (M_1, M_2, M_3):

• Drift points detected only by M_0 and by none of M_1, M_2, M_3 are eliminated.
• Drift points not detected by M_0 but detected simultaneously by all of M_1, M_2, M_3 are added to the drift point set.

Algorithm 1: E-ERICS Ensemble
Input: Best found model M0, 3 "good models" M1, M2, M3
Output: Concept drift points
1: s1, s2, s3, s0 ← sets of detected concept drift points in M1, M2, M3, M0
2: s ← (s0 ∩ s1) ∪ (s0 ∩ s2) ∪ (s0 ∩ s3)
3: s ← s ∪ (s1 ∩ s2 ∩ s3)
4: return s
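The ensemble rule of Algorithm 1 can be expressed directly with set operations; the drift-point sets below are toy examples, not values from the experiments.

```python
# Algorithm 1 as set operations: s0..s3 are the drift points detected by
# the best model M0 and the three "good models" M1-M3.
def e_erics_ensemble(s0, s1, s2, s3):
    # Keep M0's drifts only if confirmed by at least one other model.
    s = (s0 & s1) | (s0 & s2) | (s0 & s3)
    # Add drifts found simultaneously by all three good models.
    return s | (s1 & s2 & s3)

# Toy example: 50 is detected only by M0 (eliminated); 80 is detected by
# all three good models (added); 10 is confirmed by M1 (kept).
result = e_erics_ensemble({10, 50}, {10, 80}, {80, 90}, {80})
```

This is exactly the weighted vote of Eq. (4) with w_0 = 2 and w_1 = w_2 = w_3 = 1 and a 3-vote threshold.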

3 Experiments and Discussion

Our models are evaluated on a Google Colab virtual machine with two Intel(R) Xeon(R) CPU @ 2.20 GHz processors, 12.69 GB of RAM, and Python 3.7.12.


3.1 Datasets

Synthetic datasets: We use the scikit-multiflow framework [12] to generate two datasets. Each contains 100,000 samples generated randomly with SEAGenerator and HyperplaneGenerator and saved in .csv files.

• SEA: Each sample has 3 attributes; if attr1 + attr2 ≤ threshold, the data point belongs to class 0, otherwise to class 1, with 10% noisy data. Four sudden concept drifts are generated by alternating between the four classifier functions (with different thresholds) available in the package.
• Hyperplane: The task of determining the relative location of a point to a given hyperplane in 20-dimensional space, with 10% noisy data. Incremental drift points are generated throughout the dataset by changing the hyperplane equation after every 100 data points.

Real-world datasets: To prove the efficiency of our models on real-world problems, we use 4 datasets from the UCI Machine Learning Repository [7]. In all the datasets below, the attributes are numeric or encoded in numeric form:

• Spambase: A stream of 4,601 emails, including spam and normal mail; each sample has 57 real-valued attributes, such as the frequency of special words and characters, the length of the longest capital run, etc.
• Adult: A dataset on personal income, deciding whether income exceeds $50K/yr based on census data, with 48,842 samples of 14 attributes each.
• KDD: We select 100,000 samples of 3 different distribution functions from the KDDCup1999_train dataset and divide them into 5 parts with an equal number of instances, such that a sudden concept drift always occurs between two consecutive parts and never occurs at other points.
• Dota: A dataset on the Dota 2 game with 102,944 samples of 116 attributes; the target class is whether the team with the given features won the game.
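The SEA concept described above can be mimicked without the framework. This is a simplified stand-in for scikit-multiflow's SEAGenerator; the attribute range and the noise mechanism are assumptions for illustration.

```python
import random

# A SEA-style stream: three uniform attributes in [0, 10]; the label is 0
# iff attr1 + attr2 <= threshold, with a fraction of noisy labels.
# A sudden drift corresponds to switching the threshold mid-stream.
def sea_stream(n, threshold, noise=0.1, seed=0):
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(0, 10) for _ in range(3)]
        y = 0 if x[0] + x[1] <= threshold else 1
        if rng.random() < noise:
            y = 1 - y  # label noise
        yield x, y
```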
In ERICS [10], Haug and Kasneci trained the online predictive models (Probit and VFDT) in batches, with a batch_size of 10 for Spambase, 50 for Adult, and 100 for KDD, Dota, and the synthetic datasets. We conduct our experiments under the same conditions.

3.2 Evaluation

We evaluate the performance of three models, namely the original ERICS model in [10], the best model found after the BO-ERICS phase, and the E-ERICS model, using prequential evaluation (interleaved test-then-train), "in which each sample serves two purposes: first the sample is used for testing, by making a prediction and updating the metrics, and then used for training (partially fitting) the model" [11]. This method uses all the samples for training and is memory-free, since no holdout set has to be kept.
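The prequential scheme can be sketched as follows; the majority-class learner here is a placeholder, purely to illustrate the test-then-train order.

```python
from collections import Counter

# Prequential (interleaved test-then-train) evaluation: each sample is
# first used to test the model, then to train it.
def prequential_accuracy(stream):
    counts, correct, total = Counter(), 0, 0
    for x, y in stream:
        y_pred = counts.most_common(1)[0][0] if counts else 0  # test first
        correct += (y_pred == y)
        total += 1
        counts[y] += 1                                         # then train
    return correct / total
```

On a stream of ten identical samples with label 1, only the very first prediction (made before any training) is wrong, so the accuracy is 9/10.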


• To calculate the precision of the models, a drift detection is counted as a true positive (TP_1) if it occurs less than 50 batches after an actual drift point; otherwise it is a false positive (FP_1). This evaluation method is similar to that in [19]. We ignore all drifts detected in the first 80 batches, as in [10].

Precision = |TP_1| / (|TP_1| + |FP_1|)   (10)

• To calculate recall, a false negative (FN_2) is counted if, 50 batches after an actual drift point, the model has not detected any drift point at all. Any actual drift detected within the range of 50 batches is counted as a true positive (TP_2). (Note that TP_2 = TP_1.)

Recall = |TP_2| / (|TP_2| + |FN_2|)   (11)

• The F1-score is calculated as the harmonic mean of the precision and recall above:

F1 = 2 / (1/Precision + 1/Recall)   (12)
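One way to implement the matching rules behind Eqs. (10)–(12) is sketched below; ties and multiple detections per actual drift are handled as a simplification.

```python
# Drift-detection metrics: a detection is a true positive if it falls
# within `tol` batches after an actual drift (Eq. 10); an actual drift
# with no detection in that range yields a false negative (Eq. 11).
def drift_metrics(actual, detected, tol=50):
    tp = {d for d in detected if any(0 <= d - a <= tol for a in actual)}
    fp = set(detected) - tp
    fn = {a for a in actual if not any(0 <= d - a <= tol for d in detected)}
    precision = len(tp) / (len(tp) + len(fp)) if detected else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)   # Eq. (12)
    return precision, recall, f1
```

For example, with actual drifts at batches 100 and 300 and detections at 110 and 500, the detection at 110 is a true positive, 500 is a false positive, and the drift at 300 is missed, giving precision = recall = F1 = 0.5.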

3.3 Results

Experiment: We use the BayesianOptimization framework [14] for the BO-ERICS phase. For each dataset, it initializes 10 hyperparameter value sets and then automatically generates 350 new sets that are expected to increase the F1-score of the base model (running too many base models costs too much time, while running too few makes it difficult to find "good" ones to use in the Ensemble phase). The results of our experiments are shown in Tables 1 and 2 and Fig. 3. Due to differences in the runtime environments, settings, and evaluation metrics, the results we obtain when running the original ERICS model differ from the results reported in [10].

3.4 Discussion

The above results show that, thanks to the improvement of the important hyperparameters (M, W, and β), some models in the BO-ERICS phase achieved superior results compared with the original ERICS model proposed by Haug and Kasneci [10] in all cases. E-ERICS, with ensemble learning, exploits the strengths of the single models: it eliminates a number of false positive points found in BO-ERICS while adding a few more true positives, increases the F1-score by approximately 4% on the SEA and KDD datasets, and gives better results on all evaluation metrics than the BO-ERICS phase alone in most cases, especially in detecting sudden concept drift.
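The structure of the BO-ERICS search loop can be sketched as follows. The paper uses the BayesianOptimization package [14] with a Gaussian-process surrogate; this stand-in proposes candidate points at random, and `run_erics` is a hypothetical evaluation function returning an F1-score.

```python
import random

# Skeleton of the BO-ERICS phase: evaluate n_init random (M, W, beta)
# points, then n_iter proposed points, and keep the best four models
# (M0 and three "good models") for the Ensemble phase.
def bo_erics_search(run_erics, bounds, n_init=10, n_iter=350, seed=0):
    rng = random.Random(seed)
    def sample():
        return {k: rng.uniform(lo, hi) for k, (lo, hi) in bounds.items()}
    trials = []
    for i in range(n_init + n_iter):
        params = sample()  # a real BO step would maximize the acquisition function here
        trials.append((params, run_erics(**params)))
    return sorted(trials, key=lambda t: t[1], reverse=True)[:4]
```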


Table 1. The hyperparameter search space of each dataset and the hyperparameters used in ERICS [10] (optimized with grid search) and in the model with the best performance in the BO-ERICS phase

Dataset    | M: search space | M: in ERICS | M: best in BO-ERICS | W: search space | W: in ERICS | W: best in BO-ERICS | -logβ: search space | -logβ: in ERICS | -logβ: best in BO-ERICS
SEA        | (20, 90)  | 75  | 35 | (10, 70) | 50 | 50 | (2, 5) | 4 | 3.535
Hyperplane | (20, 110) | 100 | 53 | (10, 70) | 50 | 10 | (2, 5) | 4 | 2.019
KDD        | (20, 90)  | 50  | 26 | (8, 80)  | 50 | 24 | (2, 5) | 4 | 3.677
Spambase   | (10, 70)  | 35  | 53 | (10, 60) | 25 | 46 | (2, 5) | 3 | 2.032
Dota       | (20, 90)  | 75  | 38 | (10, 75) | 50 | 10 | (2, 5) | 4 | 3.179
Adult      | (10, 60)  | 50  | 24 | (10, 60) | 50 | 31 | (2, 5) | 3 | 3.096

Fig. 3. Detected concept drifts in the three models on the SEA, Hyperplane, Spambase, KDD, Dota, and Adult datasets (red lines: actual sudden concept drift points; red zone: actual consecutive incremental concept drift points; blue lines: detected drift points) (Color figure online)

Table 2. Precision, recall, and F1-score of the three models — the original model in [10] (Orgn), the best-found model in the BO-ERICS phase (BO), and the E-ERICS model (E) — on SEA, Hyperplane, Spambase, KDD, Dota, and Adult, together with their averages. [The numeric cells of this rotated table are not recoverable from the extraction.]


4 Conclusions

In this article, we have presented a new method called E-ERICS, which combines Bayesian optimization with ensemble learning. The experimental results show that the E-ERICS model achieves better results than any ERICS model found using Bayesian optimization alone and than ERICS [10], especially on datasets with sudden concept drifts. In future work, we will apply further changes to the model, such as improving the core (Probit, VFDT, etc.) of each base model, to make E-ERICS more effective.

References
1. Abbasi, A., Javed, A.R., Chakraborty, C., Nebhen, J., Zehra, W., Jalil, Z.: ElStream: an ensemble learning approach for concept drift detection in dynamic social big data stream learning. IEEE Access 9, 66408–66419 (2021)
2. Althabiti, M., Abdullah, M.: Streaming data classification with concept drift. Bioscience Biotechnology Research Communications 12(1) (2019)
3. Baena-Garcia, M., del Campo-Avila, J., Fidalgo, R., Bifet, A., Gavalda, R., Morales-Bueno, R.: Early drift detection method. In: Fourth International Workshop on Knowledge Discovery from Data Streams, vol. 6, pp. 77–86 (2006)
4. Bifet, A., Gavalda, R.: Learning from time-changing data with adaptive windowing. In: Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 443–448. SIAM (2007)
5. de Barros, R.S.M., de Carvalho Santos, S.G.T.: An overview and comprehensive comparison of ensembles for concept drift. Information Fusion 52, 213–244 (2019)
6. Dietterich, T.G.: Ensemble methods in machine learning. In: International Workshop on Multiple Classifier Systems, pp. 1–15. Springer, Berlin, Heidelberg (2000)
7. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml. Last accessed 30 March 2022
8. Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detection. In: Bazzan, A.L.C., Labidi, S. (eds.) SBIA 2004. LNCS (LNAI), vol. 3171, pp. 286–295. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-28645-5_29
9. Gulcan, E.B., Can, F.: Implicit concept drift detection for multi-label data streams. arXiv preprint arXiv:2202.00070v1 (2022)
10. Haug, J., Kasneci, G.: Learning parameter distributions to detect concept drift in data streams. In: 2020 25th International Conference on Pattern Recognition (ICPR). IEEE (2021)
11. Imbrea, A.: Automated machine learning techniques for data streams. arXiv preprint arXiv:2106.07317v1 (2021)
12. Montiel, J., Read, J., Bifet, A., Abdessalem, T.: Scikit-multiflow: a multi-output streaming framework. The Journal of Machine Learning Research 19(72), 1–5 (2018)
13. Museba, T., Nelwamondo, F., Ouahada, K., Akinola, A.: Recurrent adaptive classifier ensemble for handling recurring concept drifts. Applied Computational Intelligence and Soft Computing 2021, 1–13 (2021)
14. Nogueira, F.: Bayesian Optimization: open source constrained global optimization tool for Python (2014). https://github.com/fmfn/BayesianOptimization
15. Raab, C., Heusinger, M., Schleif, F.M.: Reactive soft prototype computing for concept drift streams. Neurocomputing 416, 340–351 (2020)
16. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. The MIT Press (2006)


17. Webb, G.I., Hyde, R., Cao, H., Nguyen, H.L., Petitjean, F.: Characterizing concept drift. Data Min. Knowl. Disc. 30(4), 964–994 (2016). https://doi.org/10.1007/s10618-015-0448-4
18. Yu, H., Liu, T., Lu, J., Zhang, G.: Automatic learning to detect concept drift. arXiv preprint arXiv:2105.01419v1 (2021)
19. Yu, S., Abraham, Z.: Concept drift detection with hierarchical hypothesis testing. In: Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 768–776. SIAM (2017)

MLP-Mixer Approach for Corn Leaf Diseases Classification

Li-Hua Li1 and Radius Tanone1,2(B)

1 Department of Information Management, Chaoyang University of Technology, Taichung City, Taiwan
{lhli,s11014903}@gm.cyut.edu.tw
2 Fakultas Teknologi Informasi, Universitas Kristen Satya Wacana, Salatiga, Indonesia

Abstract. Corn is one of the staple foods in Indonesia. However, corn leaf diseases pose a threat to corn farmers seeking to increase production, and farmers find it difficult to identify the type of disease affecting a corn leaf. Given the continuing growth of corn cultivation, common corn leaf diseases need to be prevented to increase production. Using an open dataset, the modern MLP-Mixer model is trained on a relatively small dataset and then used to predict the classification of diseases that attack corn leaves. This experiment uses an MLP-Mixer with basic multi-layer perceptrons applied repeatedly across feature channels, which makes the MLP-Mixer model resource-efficient when classifying corn leaf disease. In this research, a well-designed method is proposed, ranging from the preparation of corn leaf disease images to pre-training and model evaluation. Our model achieves a test accuracy of 98.09%. This result reflects a new trend in image classification that can be a solution for computer vision problems in general. Furthermore, the high precision achieved in this experiment can be carried over to small devices such as smartphones, drones, or embedded systems. Based on the images obtained, these results can be a solution for corn farmers in recognizing the types of leaf diseases, supporting smart farming in Indonesia.

Keywords: MLP-Mixer · Image classification · Corn leaf diseases

1 Introduction

The world's population keeps increasing [1] and, undoubtedly, this will be followed by an increasing need for food. Agriculture is very popular in Indonesia, where the population is more than 273 million and, among them, 10 million people are farmers [2]. In Indonesia, the national corn production in 2018 was in surplus and was even exported to the Philippines and Malaysia [3]. Seeing this potential, the quality of corn [4] needs to be maintained from planting to harvest. In fact, there are several diseases of corn leaves [5, 6, 7], including Common Rust, Leaf Blight, and Gray Leaf Spot. On large areas of land [8] and with possibly varying conditions of corn leaves, farmers can have difficulty recognizing the types of diseases that may

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 204–215, 2022. https://doi.org/10.1007/978-3-031-21967-2_17


affect the harvest of corn. Given these difficulties in classifying the types of diseases on corn leaves, information technology can be one solution to help farmers recognize them. Information technology continues to develop, giving rise to new trends such as Industry 4.0 [9, 10]. Globally, the number of Internet users [11] increased from only 413 million in 2000 to over 3.4 billion in 2016 [12]. Furthermore, the use of modern devices such as drones, robots, and embedded systems is becoming more common, including in agriculture. This makes the trend of developing smart agriculture easier to implement in the future, and applicable research supporting this development needs to be expanded. In Indonesia, the corn-producing granary area is divided into several provinces, of which ten are the largest corn producers. With so many corn-producing areas, however, the high susceptibility to disease needs to be addressed properly by farmers to maintain and even increase corn production. The increasingly rapid development of Agriculture 4.0 must be followed by innovations on maize farmland in Indonesia. Various research on information technology for agricultural land has been carried out, involving drones, robotics, IoT, and so on, and research on mobile technology for corn farming is a good opportunity to address the existing problems. Given the spread of leaf diseases in corn, applying mobile technology to identify them can be a breakthrough in helping farmers increase production. In the era of machine learning (ML) and deep learning (DL), many models have been developed and used for image classification, including KNN, CNN, and so on.
Since Google Research's Brain Team introduced the MLP-Mixer for vision [13], applying this model to corn leaf disease image classification has become the main target of this research. This research focuses on developing a modern MLP-Mixer model pre-trained with a relatively small dataset, training it, and then using it to predict the classification of a disease on corn leaves. The MLP-Mixer model uses fewer resources than other algorithms, so in future developments it can be used on smaller devices. The purpose of this research is to help farmers work their agricultural land more quickly and efficiently by recognizing diseases on corn leaves.

2 Related Work

2.1 Literature Review

Studies on corn leaf diseases have been carried out by several researchers [14, 15]. One example is an optimized dense convolutional neural network (CNN) architecture (DenseNet) [6] for corn leaf disease recognition and classification, which reached an accuracy of 98.06% by optimizing the DenseNet model; several parameters are well tuned so that it produces higher accuracy than other models. Javanmardi et al. [14] describe a novel approach for extracting generic features using a deep convolutional neural network. An artificial neural network (ANN), cubic support vector machine (SVM), quadratic SVM, weighted k-nearest neighbor (KNN), boosted trees, bagged trees, and linear discriminant analysis (LDA) were used to classify


the extracted features. Models trained with CNN-extracted features outperformed models trained with only simple features in terms of classification accuracy of corn seed varieties. Misra et al. [15] describe a real-time method for detecting corn leaf disease based on a deep convolutional neural network; tuning the hyperparameters and adjusting the pooling combinations on a GPU-powered system improves performance, and the model achieves an accuracy of 88.46% when recognizing corn leaf diseases, demonstrating the method's feasibility. Furthermore, Yu et al. [16] proposed a method for accurately diagnosing three common diseases of corn leaves (gray spot, leaf spot, and rust) based on k-means clustering and an improved deep learning model. The impact of various k values (2, 4, 8, 16, 32, and 64) and models (VGG-16, ResNet18, Inception v3, VGG-19, and the improved deep learning model) on corn disease diagnosis is investigated in that paper; after detecting the disease on corn leaves, the model is modified to obtain more accurate results. Many papers have been written about the classification of leaf diseases [17, 18] of corn, rice [19], potatoes [20], and other crops. This study aims to find new approaches to computer vision problems, specifically disease classification on corn leaves, so that the model can be implemented on various devices in the future for agricultural advancement.

2.2 MLP-Mixer

Fig. 1. MLP-Mixer architecture [13]

This paper addresses deep learning and computer vision by focusing on the MLP-Mixer [13], a new model developed by Google Research. MLP-Mixer is an MLP-based model with a simple design that is resource-efficient at computation time. Its layers can mix features at a given spatial location, mix features between different spatial locations, or do both at the same time; N × N convolutions (for N > 1) and pooling can also perform both. Furthermore, neurons in deeper levels have a broader receptive field [21, 22]. MLP-Mixer, on the other hand, has a straightforward structure that includes per-patch linear embeddings, Mixer layers, and a classifier head. There is


one token-mixing MLP and one channel-mixing MLP in each Mixer layer, each of which is made up of two fully connected layers with GELU (Gaussian Error Linear Unit) as the activation function. The complete structure of the MLP-Mixer is shown in Fig. 1. A sequence of S non-overlapping image patches, each projected to a chosen hidden dimension C, is fed into the Mixer. As a result, a two-dimensional real-valued input table X ∈ R^{S×C} is created. If the original input image has resolution (H, W) and each patch has resolution (P, P), then the number of patches is S = HW/P². All patches are projected linearly with the same projection matrix. The Mixer consists of several layers of the same size, and each layer consists of two MLP blocks. The first is the token-mixing MLP: it acts on the columns of X (i.e., it is applied to the transposed input table X^T), maps R^S → R^S, and is shared across all columns. The second is the channel-mixing MLP: it acts on the rows of X, maps R^C → R^C, and is shared across all rows [13]. Each MLP block has two fully connected layers and a nonlinearity applied to each row of the input data tensor individually. Except for the initial patch projection layer, every layer in the Mixer accepts an input of the same size. This "isotropic" design is most similar to Transformers, or deep RNNs in other domains, which likewise use a fixed width; typical CNNs, in contrast, have a pyramidal structure with lower-resolution input but more channels in the deeper layers [13]. With its simpler architecture and a different way of working than a CNN, the MLP-Mixer can in the future be made into a lightweight model to run on mobile devices such as Android [23] and iOS phones and on embedded systems using TensorFlow Lite tools [24].

2.3 Deep Learning

In the current development of information technology, the role of Artificial Intelligence has become very important in helping humans solve problems.
In fact, Artificial Intelligence (AI) is supported by machine learning and deep learning [25]. Many experts have offered interpretations of AI and ML, among them IBM [26, 27, 28]. According to IBM, deep learning tries to emulate the capabilities of the human brain; although it has not yet reached perfection, data is processed in such a way that it can produce high accuracy. Many deep learning algorithms can solve problems such as classification and object detection, among them famous classification algorithms such as convolutional neural networks. However, as demonstrated in this experiment, the MLP-Mixer can also be used to solve classification problems in computer vision.

3 Methods

To conduct this research, we design the research flowchart shown in Fig. 2. The stages from beginning to end are explained in the following subsections.


Fig. 2. Process flow of building MLP-Mixer model

3.1 Data Requirements, Collection and Preparation

At this stage, data are collected from Kaggle [29]; the total number of images is 14,632, each 256 × 256 pixels. The dataset is then split into a training set of 11,704 images and a test set of 2,928 images. Four classes are processed in data preparation, with an input image size of 32 × 32 pixels. Next, the data understanding stage serves to understand the structure of the dataset for further processing. After that, data preparation is carried out, i.e., the stage in which the data used to build the model are prepared; at this point, it is determined whether or not the existing dataset meets the required standards.

3.2 Configure the Hyperparameters

Several items need to be set, such as the weight decay, batch size, number of epochs, and dropout rate. Additionally, input images are resized to a specific size, and the size of the patches to be extracted from the input images is set, along with the size of the data array, the number of hidden units, and the number of blocks. The detailed hyperparameter settings are listed in Table 1.

Table 1. Configure the hyperparameters

Parameters                                       | Size
weight_decay                                     | 0.0001
batch_size                                       | 128
num_epochs                                       | 200
dropout_rate                                     | 0.5
image_size (inputs will be resized to this size) | 64
patch_size                                       | 8
num_patches = (image_size // patch_size) ** 2    | 16
embedding_dim                                    | 256
num_blocks                                       | 4
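The settings in Table 1 can be collected in a single configuration dictionary; this is a sketch, and the key names are assumptions rather than the authors' code.

```python
# Hyperparameter settings from Table 1 gathered in one place.
config = {
    "weight_decay": 0.0001,
    "batch_size": 128,
    "num_epochs": 200,
    "dropout_rate": 0.5,
    "image_size": 64,     # inputs are resized to image_size x image_size
    "patch_size": 8,
    "num_patches": 16,    # number of patches per image, as listed in Table 1
    "embedding_dim": 256,
    "num_blocks": 4,
}
```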


According to Table 1, an important parameter in the MLP-Mixer is the number of patches. In this experimental setting, we use 16 patches per image. These patches are projected linearly into an H-dimensional latent (hidden) representation and passed on to the Mixer layer, as in the MLP-Mixer structure.

3.3 Build a Classification Model

At this stage, we carry out several processes, including data augmentation and patch creation and generation (covering the number of patches, the embedding dimension, and the batch size). The further steps are processing x using the module blocks, applying global average pooling to generate a representation tensor of shape (batch size, embedding dimension), applying dropout, and creating the model.

3.4 Define an Experiment and Data Augmentation

To build the classification model, we implement several methods that build a classifier from the processing blocks that have been set. At this stage, we create an Adam optimizer with weight decay, compile the model, create a learning rate scheduler callback and an early stopping callback, and fit the model. Also at this stage, we compute the mean and the variance of the training data for normalization. The final step is implementing patch extraction as a layer.

3.5 The MLP-Mixer Model Structure

The MLP-Mixer is an architecture based entirely on multi-layer perceptrons (MLPs), with two types of MLP layers: one is applied to corn leaf image patches independently, and the other mixes the per-location features. In our model architecture, the original 32 × 32 input image is resized to 64 × 64 and converted into 16 patches. Each of these patches is processed in a fully connected layer and N × (Mixer Layer).
After that, the patches are processed by a Global Average Pooling and a Fully-connected layer before the final layer outputs a class for the corn leaf disease classification. In accordance with the original architecture, this model uses two MLP heads. The overall architectural design of the MLP-Mixer model for corn leaf diseases can be seen in Fig. 3.
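The two MLP types described in Sect. 3.5 (cross-patch token mixing and per-location channel mixing) can be sketched as follows; layer normalization and GELU are simplified away, and all weight shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
P, C, H = 16, 256, 128  # patches, embedding_dim, hidden width (illustrative)

def mlp(x, w1, b1, w2, b2):
    # ReLU stands in for the GELU used in the original MLP-Mixer.
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def mixer_layer(x, tok, ch):
    x = x + mlp(x.T, *tok).T  # token mixing: mixes information across patches
    x = x + mlp(x, *ch)       # channel mixing: mixes per-location features
    return x                  # LayerNorm omitted for brevity

tok = (rng.standard_normal((P, H)) * 0.02, np.zeros(H),
       rng.standard_normal((H, P)) * 0.02, np.zeros(P))
ch = (rng.standard_normal((C, H)) * 0.02, np.zeros(H),
      rng.standard_normal((H, C)) * 0.02, np.zeros(C))

x = rng.standard_normal((P, C))   # token matrix from the patch projection
out = mixer_layer(x, tok, ch)
print(out.shape)                  # -> (16, 256)
```

Global average pooling over the patch axis (`out.mean(axis=0)`) then yields the `(embedding_dim,)` vector fed to the classification head.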


L.-H. Li and R. Tanone

Fig. 3. The modified architecture of MLP-Mixer for corn leaf diseases classification

3.6 Build, Train, and Evaluate the MLP-Mixer Model
The next stage is to train the model for a number of epochs. After the model is trained, the last step is to evaluate it. Evaluation is done by inspecting the training and validation accuracy as well as the training and validation loss. In this research, we evaluated the model using the metrics of precision, recall, accuracy, and F1 score [30, 31], defined in Eqs. 1, 2, 3 and 4:

Precision = TP / (TP + FP)                             (1)

Recall = TP / (TP + FN)                                (2)

Accuracy = (TP + TN) / (TP + TN + FP + FN)             (3)

F1 = (2 × Precision × Recall) / (Precision + Recall)   (4)

Based on these formulas, True Positive (TP) is the number of corn leaves correctly sorted into their corn leaf class. False Positive (FP) is the number of corn leaves assigned to a class to which they do not belong. True Negative (TN) is the number of correctly detected negative corn leaf samples. Finally, False Negative (FN) is the number of corn leaves classified as negative that are actually positive.
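Equations 1–4 can be computed directly from the four counts; the counts below are illustrative, not taken from the paper's experiments:

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy and F1 from the raw counts (Eqs. 1-4)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Illustrative counts for a single class:
p, r, a, f1 = classification_metrics(tp=90, fp=10, tn=95, fn=5)
print(round(p, 3), round(r, 3), round(a, 3), round(f1, 3))
# -> 0.9 0.947 0.925 0.923
```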

4 Experiment and Result
In the process of building a model using the MLP-Mixer for corn leaf disease classification, the stages introduced previously are described in this section. They include data preparation and building, training, and evaluating the MLP-Mixer model.


4.1 Image Segmentation
The corn leaf disease images, covering 4 classes, are obtained from an open database. The corn leaf diseases are divided into 4 classes: (1) healthy, (2) common rust, (3) blight, and (4) gray leaf spot. In the future, leaf datasets can be collected from the cameras of smartphones, drones, or other embedded devices. Next, the images are mapped into these classes as shown in Fig. 4.

Fig. 4. Image segmentation

4.2 Experiment Results (Train and Evaluate Model)
For the training experiments, we take 20% of the dataset. We ran three experiments to train the MLP-Mixer model to classify diseases on corn leaves, and the model's performance improved as the number of epochs increased (see Table 2). In our last training experiment, the MLP-Mixer model achieved an accuracy of 98.09% after 200 epochs with a learning rate of 0.0001. Table 2 shows the details of the number of epochs used in model training. As illustrated in Fig. 5, model evaluation is critical in this experiment. Figure 5 shows the training and validation accuracy, as well as the training and validation loss, for this model. Higher accuracy can be achieved by increasing the embedding dimension, increasing the number of mixer blocks, and training the model for a longer period of time. In addition, when capturing images from devices with varying image quality, parameter settings can be adjusted by increasing the size of the input images and using different patch sizes.

Table 2. Model training

No.  Model                    Epochs  Accuracy
1    MLP-Mixer from scratch   100     97.54%
2    MLP-Mixer from scratch   150     97.92%
3    MLP-Mixer from scratch   200     98.09%

Fig. 5. Training and validation accuracy/loss.

From these results, it can be seen that the MLP-Mixer is able to perform training with a small number of parameters, which clearly saves resources and processing time. To measure the performance of the MLP-Mixer model, we use the metrics of precision, recall, accuracy, and F1 score; the outcome is listed in Table 3.

Table 3. Model performance metrics.

Label          Precision  Recall  F1-score  Support
0              1.00       1.00    1.00      763
1              0.95       0.97    0.96      657
2              1.00       1.00    1.00      744
3              0.97       0.96    0.97      764
Accuracy                          0.98      2928
Macro avg.     0.98       0.98    0.98      2928
Weighted avg.  0.98       0.98    0.98      2928


Table 3 illustrates the model's performance in classifying each label. For precision, Label 0 and Label 2 images are recognized 100% correctly, while Label 1 images are recognized with 95% and Label 3 images with 97% precision. For recall, again, Label 0 and Label 2 images are all recognized 100% correctly, while Label 1 and Label 3 images are recognized with 97% and 96%, respectively. For the F1-score, Label 1 has the lowest value, 0.96, followed by Label 3 with 0.97, while the other labels have a value of 1.00. From Table 3 it is clear that the overall performance is above 95% and the average F1-score is 98%, outperforming the compared corn disease models. The confusion matrix of the trained MLP-Mixer model is shown in Fig. 6: the model correctly predicts 761 out of 763 Label 0 images, 635 out of 657 Label 1 images, 742 out of 744 Label 2 images, and 734 out of 764 Label 3 images.

Fig. 6. Confusion matrix.
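As a sanity check, the per-class recall values in Table 3 can be reproduced from the confusion-matrix counts quoted above:

```python
# Diagonal of the confusion matrix (correct predictions) and class supports,
# as reported for Fig. 6 and Table 3.
correct = {0: 761, 1: 635, 2: 742, 3: 734}
support = {0: 763, 1: 657, 2: 744, 3: 764}

for label in correct:
    recall = correct[label] / support[label]
    print(label, round(recall, 2))
# -> 0 1.0, 1 0.97, 2 1.0, 3 0.96  (matches the Recall column of Table 3)
```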

From the above experiments and outcomes, we can conclude that the MLP-Mixer is a stable classification model for corn disease.

4.3 Discussion
The experiments show promising results for corn leaf classification using the MLP-Mixer model. Some of the advantages gained from using the MLP-Mixer are as described in the original paper: the architecture is simplified by identical layer sizes, with each layer consisting of only two MLP blocks and accepting input of the same size. Another significant point is that all image patches are projected linearly using the same projection matrix. The model also has a small number of parameters, 218,123 in total, which undoubtedly helps reduce the cost and time of the computational process in image classification. Furthermore, the accuracy results can be improved by using suitable parameters and a longer training time.


5 Conclusion
The conclusion that can be drawn from this research is that, to support smart agriculture in Indonesia, especially on agricultural land, it is necessary to develop applicable research that facilitates decision-making. The MLP-Mixer model is a new approach to problem-solving in computer vision, particularly for the classification of diseases on corn leaves. This experiment shows that the classification of corn leaf diseases using the MLP-Mixer model achieves an accuracy of 98.09%, indicating a new research direction towards smart agriculture in Indonesia. The MLP-Mixer is a new alternative to the neural network architectures that have long been used for image classification. Because the model requires few resources, it is a good candidate for future deployment on smart devices such as smartphones, drones, or embedded systems. In addition, this study points to the potential transition from Indonesia's traditional agriculture to smart agriculture.

References
1. Rentschler, J., Salhab, M., Jafino, B.A.: Flood exposure and poverty in 188 countries. Nat. Commun. 13(1), 3527 (2022). https://doi.org/10.1038/S41467-022-30727-4
2. Timmer, C.P.: The Corn Economy of Indonesia, p. 302 (1987)
3. Kementerian Pertanian - Kementan Pastikan Produksi Jagung Nasional Surplus. https://www.pertanian.go.id/home/?show=news&act=view&id=3395. Accessed 13 Jan 2022
4. Hamaisa, A., Estiasih, T., Putri, W.D.R., Fibrianto, K.: Physicochemical characteristics of jagung bose, an ethnic staple food from East Nusa Tenggara, Indonesia. J. Ethn. Foods 9(1), 24 (2022). https://doi.org/10.1186/S42779-022-00140-9
5. Diseases of Corn | CALS. https://cals.cornell.edu/field-crops/corn/diseases-corn. Accessed 13 Jan 2022
6. Waheed, A., Goyal, M., Gupta, D., Khanna, A., Hassanien, A.E., Pandey, H.M.: An optimized dense convolutional neural network model for disease recognition and classification in corn leaf. Comput. Electron. Agric. 175 (2020). https://doi.org/10.1016/j.compag.2020.105456
7. Noola, D.A., Basavaraju, D.R.: Corn leaf image classification based on machine learning techniques for accurate leaf disease detection. Int. J. Electr. Comput. Eng. 12(3), 2509–2516 (2022). https://doi.org/10.11591/IJECE.V12I3.PP2509-2516
8. Hein, L., et al.: The health impacts of Indonesian peatland fires. Environ. Health 21(1), 62 (2022). https://doi.org/10.1186/S12940-022-00872-W
9. Salim, J.N., Trisnawarman, D., Imam, M.C.: Twitter users opinion classification of smart farming in Indonesia. IOP Conf. Ser. Mater. Sci. Eng. 852(1), 012165 (2020). https://doi.org/10.1088/1757-899X/852/1/012165
10. Gunawan, F.E., et al.: Design and energy assessment of a new hybrid solar drying dome - enabling low-cost, independent and smart solar dryer for Indonesia Agriculture 4.0. IOP Conf. Ser. Earth Environ. Sci. 998(1), 012052 (2022). https://doi.org/10.1088/1755-1315/998/1/012052
11. Habeahan, N.L.S., Leba, S.M.R., Wahyuniar, W., Tarigan, D.B., Asaloei, S.I., Werang, B.R.: Online teaching in an Indonesian higher education institution: student's perspective. Int. J. Eval. Res. Educ. 11(2), 580–587 (2022). https://doi.org/10.11591/IJERE.V11I2.21824
12. Internet - Our World in Data. https://ourworldindata.org/internet. Accessed 13 Jan 2022
13. Tolstikhin, I., et al.: MLP-Mixer: an all-MLP architecture for vision. In: Advances in Neural Information Processing Systems 34 (2021)


14. Javanmardi, S., Miraei Ashtiani, S.H., Verbeek, F.J., Martynenko, A.: Computer-vision classification of corn seed varieties using deep convolutional neural network. J. Stored Prod. Res. 92, 101800 (2021). https://doi.org/10.1016/J.JSPR.2021.101800
15. Misra, N.N., Dixit, Y., Al-Mallahi, A., Bhullar, M.S., Upadhyay, R., Martynenko, A.: IoT, big data and artificial intelligence in agriculture and food industry. IEEE Internet Things J. 9, 6305–6324 (2020). https://doi.org/10.1109/JIOT.2020.2998584
16. Yu, H., et al.: Corn leaf diseases diagnosis based on K-means clustering and deep learning. IEEE Access 9, 143824–143835 (2021). https://doi.org/10.1109/ACCESS.2021.3120379
17. Lakshmi, P., Mekala, K.R., Sai, V., Sree Modala, R., Devalla, V., Kompalli, A.B.: Leaf disease detection and remedy recommendation using CNN algorithm. Int. J. Online Biomed. Eng. 18(07), 85–100 (2022). https://doi.org/10.3991/IJOE.V18I07.30383
18. Prashar, N., Sangal, A.L.: Plant disease detection using deep learning (convolutional neural networks). In: Chen, J.-Z., Tavares, J.M.R.S., Iliyasu, A.M., Du, K.-L. (eds.) ICIPCN 2021. LNNS, vol. 300, pp. 635–649. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-84760-9_54
19. Cham, M.F.X., Tanone, R., Riadi, H.A.T.: Identification of rice leaf disease using convolutional neural network based on Android mobile platform. In: 2021 2nd International Conference on Innovative and Creative Information Technology (ICITech), pp. 140–144 (2021). https://doi.org/10.1109/ICITECH50181.2021.9590188
20. Mahum, R., et al.: A novel framework for potato leaf disease detection using an efficient deep learning model. Hum. Ecol. Risk Assess. (2022). https://doi.org/10.1080/10807039.2022.2064814
21. Araujo, A., Norris, W., Sim, J.: Computing receptive fields of convolutional neural networks. Distill 4(11), e21 (2019). https://doi.org/10.23915/DISTILL.00021
22. Luo, W., Li, Y., Urtasun, R., Zemel, R.: Understanding the effective receptive field in deep convolutional neural networks. In: Advances in Neural Information Processing Systems 29 (2016)
23. What is Android | Android. https://www.android.com/what-is-android/. Accessed 17 Jan 2022
24. TensorFlow Lite | ML for Mobile and Edge Devices. https://www.tensorflow.org/lite. Accessed 17 Jan 2022
25. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015). https://doi.org/10.1038/nature14539
26. What is Artificial Intelligence (AI)? | IBM. https://www.ibm.com/cloud/learn/what-is-artificial-intelligence. Accessed 21 Apr 2022
27. What is Machine Learning? | IBM. https://www.ibm.com/cloud/learn/machine-learning. Accessed 21 Apr 2022
28. What is Deep Learning? | IBM. https://www.ibm.com/cloud/learn/deep-learning. Accessed 21 Apr 2022
29. Bangladeshi Crops Disease Dataset | Kaggle. https://www.kaggle.com/datasets/nafishamoin/bangladeshi-crops-disease-dataset. Accessed 29 Mar 2022
30. Sasaki, Y.: The truth of the F-measure (2007)
31. van Rijsbergen, C.J.: Information Retrieval. Butterworth-Heinemann (1979)

A Novel Neural Network Training Method for Autonomous Driving Using Semi-Pseudo-Labels and 3D Data Augmentations

Tamás Matuszka and Dániel Kozma(B)

aiMotive, Budapest, Hungary
{tamas.matuszka,daniel.kozma}@aimotive.com
https://aimotive.com/

Abstract. Training neural networks to perform 3D object detection for autonomous driving requires a large amount of diverse annotated data. However, obtaining training data with sufficient quality and quantity is expensive and sometimes impossible due to human and sensor constraints. Therefore, a novel solution is needed for extending current training methods to overcome this limitation and enable accurate 3D object detection. Our solution for the above-mentioned problem combines semi-pseudo-labeling and novel 3D augmentations. For demonstrating the applicability of the proposed method, we have designed a convolutional neural network for 3D object detection which can significantly increase the detection range in comparison with the training data distribution.

Keywords: Semi-pseudo-labeling · 3D data augmentation · Neural network training · 3D object detection · Machine learning

1 Introduction

Object detection is a crucial part of autonomous driving software, since increasingly complex layers are built on top of the perception system, which itself relies on fast and accurate obstacle detection. Object detection is typically performed by convolutional neural networks which are trained by means of supervised learning. Supervised learning is a method where a model is fed with input data and its main objective is to learn a function that maps the input data to the corresponding output. Since convolutional neural networks, the best models for the visual domain (except in the large-data regime, where visual transformers [7] excel), are heavily overparameterized, a large amount of annotated data is needed to learn the mapping function. Therefore, a substantial amount of manual effort is required to annotate data with sufficient quality and quantity, which is an expensive and error-prone method. In addition, obtaining precise ground truth data is sometimes impossible due to human or sensor constraints. For example, the detection range of LiDARs limits the annotation of distant objects, whose presence must be known by
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 216–229, 2022. https://doi.org/10.1007/978-3-031-21967-2_18


an autonomous driving system due to fast position and location changes on a highway. Radars could overcome this kind of limitation, but they have a more limited field of view, and only a subset of the interesting object categories is detectable with them. A human limitation is, for example, the inability to accurately estimate the distance of objects in 3D from 2D images without any 3D clue, e.g., point clouds collected by a LiDAR or radar detections. Consequently, a novel solution is needed for extending current training methods to overcome this limitation and enable accurate 3D object detection. Several approaches have been developed to facilitate neural network training. One of the most popular solutions is transfer learning, where a neural network is trained on a particular dataset, such as ImageNet [5], and then fine-tuned on another dataset (e.g., a model trained to recognize cars can be trained to classify trucks using transfer learning). Self-supervised learning, which utilizes unlabeled data to train a model on proxy tasks and then fine-tunes it on a downstream task in a supervised manner, has resulted in breakthroughs in language modeling. Pseudo-labeling is a simple solution that uses the same model's predictions as true labels during the training. However, none of these solutions helps the model produce predictions that are not part of the training distribution. The main motivation of this work is to develop a training method which massively overcomes the limitations of the training dataset and so extends the prediction capabilities of a neural network. To summarize, this paper makes the following three main contributions:
– We introduced semi-pseudo-labeling (SPL) as a method where pseudo-labels are generated by a neural network trained on a simpler task and utilized during the training of another network performing a more complex task.
– We extended several conventional 2D data augmentation methods to work in 3D.
– We described a training method for allowing neural networks to predict certain characteristics outside the training distribution using semi-pseudo-labeling and 3D data augmentations.

2 Related Work

The concept of pseudo-labeling was introduced by Lee in [9] as a simple and efficient self-supervised method for deep neural networks. The main idea of pseudo-labeling is to consider the predictions of a trained model as ground truth. Unlabeled data, which is typically easy to obtain, can be annotated using the trained model's predictions. Then, the same model is retrained on the labeled and pseudo-labeled data simultaneously. Our proposed method is based on this concept, but there is a fundamental difference between the solutions. Pseudo-labeling generates labels for the same task using the same model, while our semi-pseudo-labeling method utilizes pseudo-labels generated by a different model for a more complex task. In [4], Chen generated pseudo-labels for object detection on a dynamic vision sensor using a convolutional neural network trained on an active pixel sensor. The main difference compared to our solution is that pseudo-labels


in the paper are used for the same task, namely two-dimensional bounding box detection of cars on different sensor modalities. The solution described in [19] also uses pseudo-labels for training object detection neural networks. However, the invention described in the patent uses regular pseudo-labeling to train a neural network to perform the same task, namely 2D object detection, as opposed to our solution, where the tasks are not the same. In addition, the solution requires the use of region proposal networks, which indicates a two-stage network architecture that might not fulfill the real-time criterion, while our 3D object detection network uses a single-stage architecture that utilizes semi-pseudo-labeling during training. Watson et al. used pseudo-labeling in [17] for generating and augmenting their data labeling method. Their proposed solution creates pseudo-labels for unlabeled data, while our method enables us to use annotated data created for a simpler task as pseudo-labels and does not exclusively rely on unlabeled data. Transfer learning [1] is the process where a model is trained to perform a specific task and its knowledge is utilized to solve another (related) problem. Transfer learning involves pretraining a model (typically on a large-scale dataset) and then customizing this model to a given task by either using the trained weights or fine-tuning the model by adding an additional classifier on top of frozen weights. Transfer learning can be used effectively to transfer knowledge between related problems, which has similarities with our proposed method; ours can be considered an extended version of transfer learning. However, we utilize semi-pseudo-labels to perform a more complex task (e.g., 3D object detection with 2D pseudo-labels), which might not be solvable using regular transfer learning. In addition, our solution enables simultaneous learning of different tasks, as opposed to transfer learning.
Data augmentation [18] is a standard technique for enhancing training data and preventing overfitting using geometric transformations, color space augmentations, random erasing, mixing images, etc. Most data augmentation techniques operate in image space (2D) [11], but recent work started to extend their domain to 3D [11,18]. To the best of our knowledge, none of these solutions introduced zoom augmentation in 3D using a virtual camera, as our work proposes. The closest solution which tries to solve the limited perception range is described in [14]. The method in the paper proposes to break down the entire image into multiple image patches, each containing at least one entire car and with limited depth variation. During inference, a pyramid-like tiling of images is generated which increases the running time. In addition, the perception range of the approach described in the paper did not exceed 50 m.

3 A Novel Training Method with Semi-Pseudo-Labeling and 3D Augmentations

Two methods have been developed for enhancing training dataset limitations, namely semi-pseudo-labeling (SPL) and 3D augmentations (zoom and shift). SPL is first introduced as a general, abstract description. Then, the concept and its combination with 3D augmentations is detailed using a concrete example.

3.1 Semi-Pseudo-Labeling

The main objective of supervised learning is to define a mapping from the input space to the output by learning the parameter values that minimize the error of the approximation function [6], formally

M(X; θ) = Y    (1)

and

min_θ L(Y − M(X; θ))    (2)

where L is an arbitrary loss function, Y is the prediction target, X is the input, and M is the model parameterized by θ. For training a model using supervised learning, a training set is required:

D_GT = {(x_1^L, y_1^L), ..., (x_n^L, y_n^L)} ⊆ R^d × C    (3)

where x_i^L ∈ X^L is an input sample for which annotations are available, y_i^L ∈ Y^L is the ground truth, R^d is the d-dimensional input space, and C is the label space. The pseudo-labeling method introduces another dataset, in which labels for an unlabeled dataset are generated by the trained model M(X; θ):

D_PS = {(x_1^U, y_1^PS), ..., (x_m^U, y_m^PS)} ⊆ R^d × C    (4)

where x_i^U ∈ X^U is an input sample from the unlabeled data and y_i^PS ∈ M(X^U; θ) is the pseudo-label generated by the trained model. The final model is trained on the union of the annotated and pseudo-labeled datasets. The main objective of semi-pseudo-labeling is to utilize pseudo-labels generated by a model trained on a simpler task for training another model performing a more complex task. Both the simple and the complex task have annotated training sets for their specific tasks:

D_GT^S = {(x_1^SL, y_1^SL), ..., (x_n^SL, y_n^SL)} ⊆ R^d × C^S    (5)

D_GT^C = {(x_1^CL, y_1^CL), ..., (x_m^CL, y_m^CL)} ⊆ R^d × C^C    (6)

The main differentiator between regular pseudo-labeling and the semi-pseudo-labeling method is that the simple model M^S does not generate pseudo-labels on unlabeled data (although that is a viable option and might be beneficial in some cases). Rather, pseudo-labels are generated using the input data of the complex model M^C. In this way, the label space of the complex model can be extended, as can be seen in (7):

D_SPS^C = {(x_1^CL, y_1^CL ∪ y_1^SPS), ..., (x_m^CL, y_m^CL ∪ y_m^SPS)} ⊆ R^d × (C^C ∪ C^S)    (7)

where x_i^CL ∈ X^CL is an input sample for which annotations for the complex task are available, y_i^CL ∈ Y^CL is the ground-truth label for the complex task, and y_i^SPS ∈ M^S(X^CL; θ^S) is the semi-pseudo-label generated by the simple model on the complex task's input. The final model M^C(X^CL; θ^C) is trained on the semi-pseudo-labeled dataset D_SPS^C.
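A toy sketch of how Eq. (7) merges the two label sources: `simple_model` stands in for M^S, and every name below is illustrative rather than the authors' code:

```python
def build_spl_dataset(complex_samples, simple_model):
    """Attach semi-pseudo-labels from the simple task to the complex task's
    inputs (Eq. 7): each sample keeps its complex-task ground truth and gains
    the simple model's predictions on the same input."""
    dataset = []
    for x, y_complex in complex_samples:
        y_spl = simple_model(x)                 # labels from the simpler task
        dataset.append((x, y_complex + y_spl))  # label space C^C ∪ C^S
    return dataset

# Toy example: the "simple model" tags every input with one 2D-style label.
samples = [("img0", ["3d_box_a"]), ("img1", ["3d_box_b"])]
spl = build_spl_dataset(samples, lambda x: [f"2d_box_{x}"])
print(spl[0])  # -> ('img0', ['3d_box_a', '2d_box_img0'])
```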

3.2 3D Augmentations

Issues with Vanilla Zoom Augmentation. Training a network that predicts the 3D attributes of dynamic objects requires accurate 3D position, size, and orientation data in model space. The principal problem to solve for overcoming the dataset limitations is handling image-visible, non-annotated objects (red squares in Fig. 1) within the required operational distance range (enclosed by red dashed lines in Fig. 1). The gray dashed area represents the non-annotated region and the red square a non-annotated object, while the green area is the annotated region and the blue square an annotated object. The red dashed horizontal lines represent the required operational domain, in which the developed algorithm has to detect all objects, while the blue dashed horizontal lines are the distance limits of the annotated data. The three columns represent the three options during zoom augmentation. The first case (left) is the non-augmented, original version. In the second case (center), the input image is downscaled, mimicking farther objects in image space, so the corresponding ground truth in model space has to be adjusted consistently. In the third case (right), the input image is upscaled, bringing the objects closer to the camera. The figure highlights the inconsistency between the various zoom levels, as the annotated regions of the transformed cases (second, third) overlap with the original non-annotated region.
Figure 1 presents the inconsistencies of applying vanilla zoom augmentation. The green area represents the region where all image-visible objects are annotated, while the gray dashed one is where our annotation is imperfect and contains false negatives. When applying the vanilla zoom augmentation technique to extend the operational domain of the developed algorithm, discrepancies may arise: when the zoom-augmented dataset contains original and down-scaled images (cases #1 and #2 in Fig. 1), the ground truth frames contradict each other.
In case #2 it is required to detect objects beyond the original ground truth limit (upper blue dashed line), while in case #1 this cannot be utilized in the loss function, since there is no information available even on the existence of the red object. To overcome these limitations and make zoom augmentation viable in our case, additional information is required to fill in the missing data, i.e., non-annotated objects at least in image space. The missing data could be filled in with human supervision, but this is infeasible since it is not scalable. Pseudo-labeling is a promising solution. In our case, the whole 3D information cannot be recovered, but the existence of 2D information is sufficient to overcome the above-mentioned issues. Therefore, 3D zoom augmentation becomes a viable solution for widening the limits of the dataset and extending the operational domain of the developed detection algorithm. A pretrained, state-of-the-art 2D bounding box network can be used to detect all image-visible objects. Improving Over Existing Augmentations. Most 2D data augmentations are easy to generalize to three dimensions. However, zooming is not trivial, since changes in the image scale modify the position and egocentric orientation of the annotations in 3D space too. A 3D zoom augmentation using a virtual camera


Fig. 1. Effects of zoom augmentation on image and model space data.

has been developed to resolve this issue. The method consists of two main steps. The first is to either zoom in or zoom out of the image. In this way, it can be emulated that an object moves either closer or farther to the camera. The second step is to modify the camera matrix to follow the 2D transformations and to keep 3D annotations intact. This can be performed by linear transformations and a virtual camera that adjusts its principal point and focal length considering the original camera matrix and the 2D scaling transformation. Changing camera intrinsic parameters mimics the change of the egocentric orientation of the given object, but its apparent orientation, which is the regressed parameter during the training, remains the same. The 3D zoom augmentation can be implemented as follows. As a first step, a scaling factor between an empirically chosen lower and upper bound is randomly drawn. If the lower and upper bound is smaller than one, a zoom-out operation is performed. If the lower and upper bound are greater than one, a zoom-in operation is performed. If the lower bound is less than one and the upper bound is greater than one, either a zoom-in or zoom-out is performed. The 2D part of zoom works as in the traditional case when one zooms in/out to the image using the above-mentioned scaling factor (in the case of zooming out, the image is padded with zeros for having the original image size). Then, the camera matrix corresponding to the image can be adjusted by scaling the focal length components with the randomly drawn scaling factor. If the 2D image is shifted beside the zoom operation, the camera matrix can be adjusted by shifting the principal point components. Therefore, the augmentations for a corresponding image and 3D labels are performed in a consistent manner. Applying random shift of the image enforces the decoupling of image position and object distance. 
Due to this augmentation, the detection system is prevented from overfitting to specific camera intrinsics.
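The camera-matrix update described above can be sketched as follows, assuming a standard 3 × 3 pinhole intrinsic matrix and a 2D zoom applied about the image origin; the function and variable names are illustrative, not the authors' implementation:

```python
import numpy as np

def zoom_shift_intrinsics(K, scale, shift_x=0.0, shift_y=0.0):
    """Adjust a 3x3 pinhole camera matrix to match a 2D zoom and shift:
    focal lengths are scaled by the zoom factor, and the principal point
    follows the image scaling and translation."""
    K = K.copy()
    K[0, 0] *= scale                      # fx
    K[1, 1] *= scale                      # fy
    K[0, 2] = K[0, 2] * scale + shift_x   # cx
    K[1, 2] = K[1, 2] * scale + shift_y   # cy
    return K

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
K_zoomed = zoom_shift_intrinsics(K, scale=2.0, shift_x=-100.0)
# fx and fy double with the zoom; cx becomes 640*2 - 100 = 1180, cy becomes 720.
```

With the intrinsics updated this way, the 3D annotations can be kept intact while the image is zoomed and shifted, which is the consistency the text describes.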

3.3 An Example of Training with Semi-Pseudo-Labeling and 3D Augmentations

The semi-pseudo-labeling method combined with 3D augmentations was used to train a 3D object detection neural network to make predictions that are outside the training data distribution. Figure 2 describes the steps of applying the SPL method. The requirement was to extend the detection range of an autonomous driving system to 200 m, while the distance range of the annotated data did not exceed 120 m. In addition, some detectable classes were missing from the training data. The FCOS [16] 2D bounding box detector was chosen as the simple model M^S(X^SL; θ^S), where the input space X^SL contains HD-resolution stereo image pairs and the label space C^S consists of (x, y, w, h, o, c_1, ..., c_n) tuples, where x and y are the coordinates of the bounding box center in image space, w and h are the width and height of the bounding box in image space, o is the objectness score, and c_i is the probability that the object belongs to the i-th category. The model M^S performed 2D object detection on the 3D-annotated dataset, which in our case is the same as X^SL, i.e., HD images. The resulting 2D detections cover distant objects that are not annotated with 3D bounding boxes; these are added as semi-pseudo-labels. Finally, the 3D object detector was trained on the combination of the 3D-annotated data and the semi-pseudo-labeled 2D bounding boxes. The label space of the 3D detector consists of (x, y, w, h, o, c_1, ..., c_n, P, D, O) tuples, where x, y, w, h, o, and c_i are defined as above, P is a three-dimensional vector of the center point of the 3D bounding box in model space, D is a three-dimensional vector containing the dimensions (width, height, length) of the 3D bounding box, and O is a four-dimensional vector of the orientation of the 3D bounding box represented as a quaternion. A deduplication algorithm is required to avoid double annotations that are included both in the semi-pseudo-labeled dataset and in the 3D-annotated ground truth. This post-processing step can be executed by examining the intersection over union (IoU) of the semi-pseudo-labeled annotations and the 2D projections of the 3D bounding boxes: if the IoU exceeds a threshold, the pseudo-labeled annotation is filtered out. Baseline Neural Network and Training. We have developed a simple single-stage object detector based on the YOLOv3 [12] convolutional neural network architecture, which utilized our semi-pseudo-labeling method and 3D data augmentations during its training. The simple architecture was a design choice for two reasons. First, the model has to be lightweight in order to run in real time in a computationally constrained environment (i.e., a self-driving car). Second, a simple model facilitates the benchmarking of the

A Novel Neural Network Training Method for Autonomous Driving


Fig. 2. SPL applied in 3D object detection using 2D detection as the simple task.

effects of the proposed methods. As the first step of the training, the input image is fed to an Inception-ResNet [15] backbone. Then, the resulting embedding is passed to a Feature Pyramid Network [10]. The head is adapted from the YOLOv3 paper and is extended with channels responsible for predicting the 3D characteristics mentioned above. The neural network has been trained using multitask learning [2]: 2D detection (using the previously generated semi-pseudo-labels) and 3D detection are learned in parallel. Instead of directly learning the 3D center point of the cuboid, the network was designed to predict the 2D projection of the center of the 3D cuboid. The center point of the 3D bounding box can later be reconstructed from the depth and its 2D projection. Finally, the dimension prediction part of the network uses priors (i.e., precomputed category averages), and only the differences from these statistics are predicted instead of directly regressing the dimensions. This approach was inspired by the YOLOv3 architecture, which uses a similar solution for its bounding box width and height formulation. 3D zoom augmentation was performed during the training, with the lower and upper bounds of the scaling factor hyperparameter set to 0.5 and 2.0, respectively. Loss Functions. The label space of semi-pseudo-labels is more restricted than the 3D label space, since SPLs (i.e., 2D detections) do not contain 3D characteristics. The ground truth was extended with a boolean flag that indicates whether an annotated object is a semi-pseudo-label or not. This value was used in the loss function to mask out the 3D loss terms for semi-pseudo-labels, so that the weights corresponding to 3D properties are not penalized during backpropagation when no ground truth is known. Thanks to this solution, the single-stage architecture, and the label space representation described in Sect. 3.3, we were able to simultaneously train the neural network to detect objects in 2D and 3D space.
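The masking described above can be illustrated with a small NumPy sketch; this is a simplified stand-in for the actual per-term loss computation, and the array shapes are assumptions:

```python
import numpy as np

def multitask_loss(loss_2d, loss_3d, is_spl):
    # loss_2d, loss_3d: per-object loss terms; is_spl flags objects that
    # are semi-pseudo-labels and therefore carry no 3D ground truth.
    mask = (~is_spl).astype(float)      # zero out 3D terms for SPLs
    return float(loss_2d.sum() + (mask * loss_3d).sum())
```

For a semi-pseudo-labeled object, only the 2D terms contribute, so no gradient flows into the 3D prediction channels for that object.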
As mentioned in Sect. 3.3, the training of the neural network has been framed as a multitask-learning problem. The loss function consists of two parts: 2D and 3D loss terms. The loss function for 2D properties is adapted from the YOLOv3 paper


T. Matuszka and D. Kozma

[12]. For the 3D loss term, the loss has been lifted to 3D instead of calculating separate losses for individual 3D properties (e.g., the 2D projection of the cuboid center point, or the orientation). The 3D loss is calculated by reconstructing the bounding cuboid in 3D and then computing the L2 loss between the predicted and ground truth corner points of the cuboid. In addition, the method described in [13] has been utilized to disentangle the loss terms. As mentioned before, a masking solution avoids penalizing the network for predicting 3D properties of semi-pseudo-labels that do not have 3D annotations. The final loss is the sum of the 2D and 3D losses.
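Lifting the loss to 3D corner points can be sketched as follows; the quaternion convention (w, x, y, z) and the corner ordering are assumptions, not taken from the paper:

```python
import numpy as np

def quat_to_rot(q):
    # unit quaternion (w, x, y, z) -> 3x3 rotation matrix
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def corners(P, D, q):
    # eight corners of a cuboid with center P, dimensions D = (w, h, l)
    # and orientation quaternion q, in model space
    w, h, l = D
    offsets = np.array([[sx*w/2, sy*h/2, sz*l/2]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return P + offsets @ quat_to_rot(q).T

def corner_l2_loss(pred, gt):
    # pred, gt: (P, D, q) tuples; sum of L2 distances between matched corners
    return float(np.linalg.norm(corners(*pred) - corners(*gt), axis=1).sum())
```

A single scalar over the reconstructed corners couples center, dimensions and orientation, instead of weighting separate per-property losses.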

4 Experiments

We have conducted experiments with the neural network described in Sect. 3.3 on a publicly available dataset as well as on internal data. The main goal of the experiments was not to compete with state-of-the-art solutions but rather to validate the viability of the proposed semi-pseudo-labeling and 3D augmentation methods. Therefore, the baseline is a model trained using neither semi-pseudo-labeling nor 3D augmentations.

4.1 Argoverse

Argoverse [3] is a collection of two datasets designed to facilitate machine learning tasks for autonomous vehicles. The collected dataset consists of 360-degree camera images and long-range LiDAR point clouds recorded in an urban environment. Since the perception range of the LiDAR used for ground truth generation is 200 m, the Argoverse dataset is suitable for validating our methods for enabling long-range camera-only detections. However, the LiDAR point cloud itself was not used as an input for the model; only camera frames and the corresponding 3D annotations were shown to the neural network. In order to enable semi-pseudo-labeling, the two-dimensional projections of the 3D annotations were calculated, and an FCOS [16] model was run on the Argoverse images to obtain 2D detections of unannotated objects. Finally, the deduplication algorithm described in Sect. 3.3 was executed to avoid duplicate instances of objects within the dataset. The performance of the models has been measured both in image and model space. Image-space detections indicate the projected 2D bounding boxes of 3D objects, while model space is represented as Bird's-Eye-View (BEV) and is used for measuring detection quality in 3D space. Table 1 shows the difference between the performance of the baseline model and a model trained using our novel training method for the category 'Car'. A solid improvement in both 2D and BEV metrics can be observed. Since the category 'Car' is highly over-represented in the training data, with only a small number of image-visible but unannotated ground-truth objects (these objects are located in the far region), the performance improvement is not as visible as in the case of other, less frequent object categories. The reason for the low values of the BEV metrics is that the ground truth-prediction assignment happens using the Hungarian algorithm [8] based on the intersection


Table 1. Results on Argoverse dataset for 'Car' and 'Large vehicle' objects.

                         2D AUC  2D Recall  2D Precision  BEV AUC  BEV Recall  BEV Precision
Car            Baseline  0.5584  0.5681     0.7773        0.1035   0.1602      0.4161
               Ours      0.6307  0.6061     0.7920        0.1180   0.1800      0.4247
Large vehicle  Baseline  0.1412  0.1421     0.4406        0.0338   0.0214      0.8636
               Ours      0.3921  0.4304     0.5481        0.0767   0.1566      0.3798

Table 2. Results on Argoverse dataset for 'Pedestrian' objects.

            2D AUC  2D Recall  2D Precision
Baseline    0.1397  0.1510     0.4985
Ours        0.2745  0.2884     0.5353

over union metric (IoU threshold set to 0.5). Since the longitudinal error of the predictions increases with the detection distance, the bounding box association in BEV space might fail even when the image-space detection and association were successful. Figure 4 depicts some example detections on the Argoverse dataset where distant objects are successfully detected. The image-space and BEV metrics for the category 'Large vehicle' are shown in Table 1. The effect of semi-pseudo-labeling and 3D augmentations is even more visible than in the case of the 'Car' category. The performance of the baseline model in BEV space is barely measurable due to the very strict ground truth-prediction assignment rules and the heavy class imbalance. This also explains the difference between the baseline and the proposed method on the BEV Precision metric. The baseline provides only a few detections in the far range, with high precision. Our proposed method is able to detect in the far range too (cf. the difference between the 2D and BEV recall of the baseline and the proposed method), but due to the strict association rules, the BEV precision is low. Overall, the model trained with our method has significantly better performance in BEV space as well as in image space. Table 2 shows a similar effect in the case of the 'Pedestrian' category. The BEV metrics are omitted, since the top-view IoU-based bounding box assignment violates the association rules due to the small object size.
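The ground truth-prediction matching described above can be sketched with a brute-force optimal assignment, a stand-in for the Hungarian algorithm that is only feasible for small sets; axis-aligned BEV boxes are assumed here and rotation is ignored:

```python
from itertools import permutations

def iou_bev(a, b):
    # a, b: axis-aligned BEV boxes (x1, y1, x2, y2); rotation ignored
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)
    return inter / union if union > 0 else 0.0

def assign(gts, preds, thr=0.5):
    # one-to-one assignment maximizing total IoU, keeping only pairs
    # above the threshold; requires len(preds) >= len(gts)
    best, best_pairs = -1.0, []
    for perm in permutations(range(len(preds)), len(gts)):
        pairs = [(g, p) for g, p in enumerate(perm)
                 if iou_bev(gts[g], preds[p]) >= thr]
        total = sum(iou_bev(gts[g], preds[p]) for g, p in pairs)
        if total > best:
            best, best_pairs = total, pairs
    return best_pairs
```

With the 0.5 threshold, a far-range prediction whose longitudinal error pushes its BEV IoU below 0.5 stays unmatched even when its image-space box was matched, which is exactly the effect discussed above.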

4.2 In-House Highway Dataset

Since the operational domain of the Argoverse dataset is an urban environment, and validation of our method in a highway environment is also a requirement, we performed an in-house data collection and created 3D bounding box annotations using semi-automated methods. The sensor setup used for the recordings consisted of four cameras and a LiDAR with a 120-m perception range in both the front and back directions. Figure 5 shows the projected cuboids


of a semi-automated annotation sample. It can be observed that distant objects (and occasionally objects in the near/middle-distance region) are not annotated due to the lack of LiDAR reflections. As a consequence of the limited perception range of the LiDAR, a manual annotation step was needed to create a validation set. In this way, distant objects (up to 200 m) and objects without sufficient LiDAR reflections can also be labeled, and a consistent validation set can be created. The collected dataset consists of Car, Van, Truck, and Motorcycle categories.
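Projecting an annotated cuboid into the image, as in Fig. 5, can be sketched with a pinhole model; the intrinsic matrix below is invented for illustration:

```python
import numpy as np

def project_box(corners_3d, K):
    # corners_3d: (8, 3) cuboid corners in camera coordinates (z > 0)
    # K: 3x3 pinhole intrinsic matrix
    uvw = corners_3d @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]       # perspective division
    x1, y1 = uv.min(axis=0)
    x2, y2 = uv.max(axis=0)
    return x1, y1, x2, y2               # tight 2D box around the cuboid

# hypothetical intrinsics: focal length 100 px, principal point (50, 50)
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
```

The same projection also yields the 2D boxes used when comparing semi-pseudo-labels against projected 3D ground truth.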

Fig. 3. Precision and recall metrics displayed as a heatmap around the ego car.

The model was trained on the semi-automatically annotated data and validated on the manually annotated validation set. Figure 3 shows benchmark results (namely precision and recall metrics) of the neural network trained with our method in a class-agnostic manner. The heatmaps visualize the top-view world space around the ego car, where the world space is split into 4 m by 10 m cells. The blank cell in the left heatmap indicates the ego car position and can be seen as the origin of the heatmap. The total values in the figure are averages over the heatmap. A prediction is associated with a ground truth if the distance between them is less than 10 m. The forward detection range is 200 m, while the backward range is 100 m. It can be observed that the model is able to detect up to 200 m in the forward direction even though the training data did not contain any annotated objects beyond 120 m. The low recall in the near range (−10 m, 10 m) can be explained by the fact that the model was trained only with front and back camera frames, and objects in this detection area might not be covered by the field of view of the camera sensors. The high precision in the (180 m, 200 m) range can be attributed to the fact that the model produces only a few detections in the very far range, with high confidence (i.e., the model does not produce a large number of false-positive detections, in exchange for a larger number of false-negative detections). Three-dimensional zoom augmentation without semi-pseudo-labeling would not have performed similarly, due to the issues described in Sect. 3.2.
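The per-cell evaluation behind Fig. 3 can be sketched as follows; the cell size and the 10-m association distance follow the text, while the grid extents (lateral ±20 m, longitudinal −100 m to 200 m) are assumptions:

```python
import numpy as np

def recall_heatmap(gts, preds, assoc_dist=10.0, cell=(4.0, 10.0),
                   x_range=(-20.0, 20.0), y_range=(-100.0, 200.0)):
    # gts, preds: (N, 2) top-view object positions relative to the ego car
    gts = np.asarray(gts, float); preds = np.asarray(preds, float)
    nx = int((x_range[1] - x_range[0]) / cell[0])
    ny = int((y_range[1] - y_range[0]) / cell[1])
    tp = np.zeros((nx, ny)); fn = np.zeros((nx, ny))
    for g in gts:
        i = int((g[0] - x_range[0]) // cell[0])
        j = int((g[1] - y_range[0]) // cell[1])
        if not (0 <= i < nx and 0 <= j < ny):
            continue
        d = np.hypot(preds[:, 0] - g[0], preds[:, 1] - g[1]).min() \
            if len(preds) else np.inf
        (tp if d < assoc_dist else fn)[i, j] += 1
    with np.errstate(invalid="ignore"):
        return tp / (tp + fn)   # NaN where a cell contains no ground truth
```

A per-cell precision map would be built the same way, binning predictions instead of ground truths and counting unmatched predictions as false positives.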


However, a limitation can be observed: the detection ability drops significantly beyond 150 m, as the recall heatmap in Fig. 3 shows.

Fig. 4. Qualitative results, depicting detections on Argoverse dataset. White cuboids: 3D annotation, black rectangles: 2D projection of 3D annotations, cuboids and rectangles with other colors: 2D and 3D detections of the model.

(a) Detections of the model trained using our proposed methods.

(b) Ground truth with missing distant objects.

Fig. 5. The distant objects missing from the ground truth are detected by the model.

5 Conclusion

In this paper, we have introduced a novel method for training neural networks used in the autonomous driving domain. The 3D augmentations have the advantageous effect that it becomes possible to accurately detect objects that are not part of the training distribution (i.e., to detect distant objects without ground truth labels). Semi-pseudo-labeling alone can be enough for detection; however, the 3D properties, especially the depth estimates, would be suboptimal, since neural networks cannot extrapolate properly outside of the training distribution. Since our main interest was to validate the viability of the proposed method, we used a simple model for the experiments. A future research direction could be to integrate semi-pseudo-labeling and 3D zoom augmentation into state-of-the-art models and conduct experiments in order to examine the effects of our method.


References

1. Bozinovski, S.: Reminder of the first paper on transfer learning in neural networks, 1976. Informatica 44(3), 291–302 (2020)
2. Caruana, R.: Multitask learning. Mach. Learn. 28(1), 41–75 (1997)
3. Chang, M.F., et al.: Argoverse: 3D tracking and forecasting with rich maps. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8748–8757 (2019)
4. Chen, N.F.: Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 644–653 (2018)
5. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
6. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
7. Kolesnikov, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of the Ninth International Conference on Learning Representations (2021)
8. Kuhn, H.W., Yaw, B.: The Hungarian method for the assignment problem. Naval Res. Logist. Q. 2, 83–97 (1955)
9. Lee, D.H.: Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks (2013)
10. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
11. Liu, Z., Wu, Z., Tóth, R.: SMOKE: single-stage monocular 3D object detection via keypoint estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 996–997 (2020)
12. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
13. Simonelli, A., Bulò, S.R., Porzi, L., López-Antequera, M., Kontschieder, P.: Disentangling monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1991–1999 (2019)
14. Simonelli, A., Bulò, S.R., Porzi, L., Ricci, E., Kontschieder, P.: Towards generalization across depth for monocular 3D object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12367, pp. 767–782. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58542-6_46
15. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
16. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636 (2019)
17. Watson, P., et al.: Generating and augmenting transfer learning datasets with pseudo-labeled images. US Patent 11,151,410, 19 October 2021
18. Xu, J., Li, M., Zhu, Z.: Automatic data augmentation for 3D medical image segmentation. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12261, pp. 378–387. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59710-8_37
19. Yu, Z., Ren, J., Yang, X., Liu, M.Y., Kautz, J.: Weakly-supervised object detection using one or more neural networks. US Patent App. 16/443,346, 17 December 2020

Machine Learning Methods for BIM Data

Grażyna Ślusarczyk and Barbara Strug

Institute of Applied Computer Science, Jagiellonian University, Łojasiewicza 11, 30-059 Kraków, Poland
{grazyna.slusarczyk,barbara.strug}@uj.edu.pl

Abstract. This paper presents a survey of machine learning methods used in applications dedicated to the building and construction industry. The BIM model, which acts as a database system for civil engineering data, is presented. A representative selection of methods and applications is described. The aim of this paper is to facilitate the continuation of research efforts and to encourage greater participation of database systems researchers in the field of civil engineering.

Keywords: BIM data · Machine learning · Civil engineering modelling

1 Introduction

Building Information Modeling (BIM) is nowadays widely used in the architecture, engineering and construction (AEC) industry. The building and construction industry currently employs about 7 percent of the world's working-age population and is one of the world economy's largest sectors. It is estimated that about $10 trillion is spent on construction-related goods and services every year. In the last decade, the acceptance and actual use of BIM has increased significantly within the building community. It has largely contributed to the process of eliminating faults in designs. BIM allows architects and engineers to create 3D simulations of the desired structures which contain significantly more information on the actual structures than drawings produced using traditional Computer-Aided Drafting (CAD) systems. As a result, BIM has become more and more present in the construction industry. BIM technology enables the representation of syntactic and semantic building information with respect to the entire life cycle of designed objects, from the design phase, through construction, to the facility management phase. BIM includes information about the elements and spaces within buildings, their constituent elements, their interrelations, properties and performance. A project created in BIM technology can be treated as a database that records both technical information about building elements and their purpose and history. However, although BIM is information-rich, not all knowledge is explicitly stated. Machine learning approaches seem well suited to deducing implicit knowledge from BIM models. Contrary to querying approaches used to extract knowledge from
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 230–240, 2022. https://doi.org/10.1007/978-3-031-21967-2_19


building models [5,15,25], which are tailored to specific scenarios with predefined outcomes, machine learning methods are able to detect patterns and make predictions. Using machine learning (ML) and artificial intelligence (AI) in the AEC industry is a promising research direction. It has to be noted that while both ML and AI are rapidly developing across many other industries, the construction industry is lagging behind in the rate at which improvements are introduced. The usage of BIM could be seen as a tool to reverse this trend, but software tools implementing BIM still require quite laborious routine tasks to properly execute BIM. The structure of data in BIM, where knowledge is represented in an object-oriented way, is ideal for analytical purposes and the application of machine learning techniques [17]. Machine learning is related to extrapolating object behaviours and generating logical responses from information provided by examples, enabling a computer to gradually learn. Various classification algorithms, anomaly detection, and time series analysis can be used with respect to BIM. Classification algorithms can be used, for example, to predict characteristics of flats and their sale demand or the likelihood of construction delays, or to diagnose the assets in historic buildings. Anomaly detection is useful in the assessment of architectural models and in discovering modelling errors, while time series analysis can be applied to make maintenance predictions or to plan renovations. In order to make predictions or detect patterns, metrics about buildings have to be specified. They serve as labels for building models, which allows their performance to be measured. The objective of this paper is to present highlights of references pertaining to machine learning in Building Information Modelling.
It complements previously published literature survey articles in order to provide insight into the development of artificial intelligence in BIM, underline the hotspots of current research in this domain, and facilitate continued research efforts. The paper summarizes recently developed theories and methods applied in BIM-based knowledge processing, extraction, and semantic enrichment of BIM models. They include neural networks, decision trees, logistic regression, affinity propagation clustering, term frequency, random forests, SVM, as well as Bayesian networks. The paper provides an overview of the advances of machine learning methods applied in BIM.

2 BIM Data - IFC Files

The information about a building created in any software can be exported to different formats. Each commercial application has its own file type to store building data, but all of them can also export building information to an IFC file. The IFC file format [4] has become the de facto standard for interchanging and storing BIM data. It is an interoperable BIM standard for CAD applications, which supports a full range of data exchange among different disciplines and heterogeneous applications. Information retrieved from IFC files can be used by many different applications.
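Because SPF-encoded IFC files are plain text with one entity instance per line, a rough inventory of a model can be taken with a simple scan. The following sketch is not a full STEP parser, and the sample lines are illustrative:

```python
import re
from collections import Counter

def entity_counts(spf_text):
    # Each instance line looks like: #123=IFCWALL('GUID',...);
    return Counter(m.group(1)
                   for m in re.finditer(r"#\d+\s*=\s*(IFC\w+)",
                                        spf_text.upper()))

sample = """
#1=IFCWALL('2O2Fr$t4X7Zf8NOew3FLOH',$,$,$,$,$,$,$,$);
#2=IFCWALL('1hqIFTRjfV6AWq_bMtnZwI',$,$,$,$,$,$,$,$);
#3=IFCDOOR('0jf0kYHfX5HgLLl16OoKAx',$,$,$,$,$,$,$,$,$,$);
"""
```

Such a per-entity-type count is often the first feature a learning pipeline extracts from a BIM model before any deeper analysis.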

Fig. 1. An example of IFC data visualization

IFC specifies different types of building entities and their basic properties. It defines an EXPRESS-based entity-relationship model, which consists of several hundred entities organized into an object-based inheritance hierarchy. All the entities in IFC are divided into rooted and non-rooted ones. While the former are derived from IfcRoot and have identity (a GUID) as well as attributes for name, description, and revision control, the latter (non-rooted) do not have identity, and their instances exist only if they are referenced from a rooted instance, directly or indirectly. IfcRoot is subdivided into three concepts: object definitions, relationships, and property sets:

1. IfcObjectDefinition captures tangible object occurrences and types
2. IfcRelationship captures relationships among objects
3. IfcPropertyDefinition captures dynamically extensible properties of objects

In Fig. 1 a fragment of an example visualization of an IFC file representing a multi-storey building is depicted. In the top left-hand side panel the hierarchical structure of the file is shown. The floor elements describe the storeys of the building. One of the floor elements is expanded to show its component entities of the types IfcBeam, IfcColumn and IfcWall in more detail. The component entities of one of the IfcWall elements are shown, and their visualization is depicted in a darker colour in the right-hand side panel of Fig. 1. The entities that can be used in IFC include building components like IfcWall, IfcDoor, IfcWindow, geometry such as IfcExtrudedAreaSolid, and basic constructs such as IfcCartesianPoint. The most often used building elements are IfcSpace, IfcDoor, IfcWall, IfcStair and IfcWindow. According to the IFC 2x Edition 3 Model Implementation Guide [12] and the IFC specification (IFC2x3 specification, 2013), the above-mentioned classes can be described as follows:


Fig. 2. IfcWall and IfcDoor within IFC hierarchy

– IfcSpace is the instance used to represent a space as the area or volume of a functional region. It is often associated with the class IfcBuildingStorey, representing one floor (the building itself is an "aggregation" of several storeys), or with IfcSite, which represents the construction site. A space in a building is usually associated with certain functions (e.g., kitchen, bathroom). These functions are specified by attributes of the class IfcSpace (Name, LongName, Description).
– IfcWall is the instance used to represent a vertical element, which merges or splits the space. In IFC files two representations of a wall can be distinguished. The subclass IfcWallStandardCase of IfcWall is used for all walls that do not change their thickness (the thickness of a wall is the sum of the materials used). IfcWall is used for all other walls, in particular for constructs of varying thickness and for walls with non-rectangular cross-sections.
– IfcStair represents a vertical passage allowing movement from one floor to another. It can contain an intermediate landing. Instances of IfcStair are treated as containers, whose component elements, such as IfcStairFlight, are referenced using IfcRelAggregates.
– IfcDoor represents a building element used to provide access to a specific area or room. Parameters of IfcDoor specify its dimensions, opening direction and door style (IfcDoorStyle). IfcDoor is a subclass of IfcBuildingElement and a superclass of IfcDoorStandardCase. Door instances are usually located in a space IfcOpeningElement, to which we refer by IfcRelFillsElement.
– IfcWindow represents a building element used to fill vertical or near-vertical openings in walls or roofs. It provides a view, light and fresh air. The dimensions of the window and its shape can be found in IfcWindowStyle, to which we refer by IfcRelDefinesByType. IfcWindow is a subclass of IfcBuildingElement and a superclass of IfcWindowStandardCase.
Window instances are placed in a space IfcOpeningElement, to which we refer by IfcRelFillsElement. The above-mentioned instances inherit from the IfcProduct class, which allows their positions to be determined using geometrical entities like IfcLocalPlacement and relationships like IfcPlacementRelTo. The place of IfcWall and IfcDoor within the IFC hierarchy is depicted in Fig. 2 (from [22]).
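The rooted/non-rooted distinction can be mirrored in a toy object model; the class names follow the IFC hierarchy described above, but the attributes are heavily simplified (real IFC GUIDs are 22-character base64 strings; a hex stand-in is used here):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class IfcRoot:
    # rooted entities carry identity (a GUID), a name and a description
    name: str = ""
    description: str = ""
    guid: str = field(default_factory=lambda: uuid.uuid4().hex[:22])

@dataclass
class IfcObjectDefinition(IfcRoot):
    pass

@dataclass
class IfcProduct(IfcObjectDefinition):
    pass

@dataclass
class IfcBuildingElement(IfcProduct):
    pass

@dataclass
class IfcWall(IfcBuildingElement):
    pass

@dataclass
class IfcCartesianPoint:
    # non-rooted: no GUID; exists only when referenced by a rooted instance
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
```

Walking such an inheritance chain is what makes class-level feature extraction (e.g., "all building elements") straightforward for the learning methods discussed in the next section.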

3 Machine Learning Techniques for BIM

Information important for many applications lies implicit in the interrelations between building elements. Therefore, several approaches directed at extracting implicit data from building models have been presented.

3.1 Learning Semantic Information - Space Classification

An unsupervised learning method for mining IFC-based BIM data by exploring interrelations between building spaces is presented in [9]. A method of extracting features, which are then used in the affinity propagation clustering algorithm to find spaces with similar usage functions, is proposed. The method allows for automatic learning of functional knowledge from building space structures. The physical properties of each space and their boundary relationships in the BIM model are extracted from the IFC file. Then boundary graphs with space boundary relationships, in which the properties of each space propagate along the edges, are built. Features of building spaces are extracted from the space boundary graphs. Based on these features and the graph representation of the building structure, the adapted affinity propagation algorithm performs building space clustering analysis in order to find representative samples of building spaces. Experimental results on a real-world BIM dataset containing 595 spaces from a 20-storey building show that building spaces with typical usage functions, like senior offices, open offices and circulation spaces, can be discovered by the unsupervised learning algorithm. In [17] rooms in a housing unit were named according to their use (dining room/lounge, kitchen, bedroom, etc.) based on their geometry. The different types of rooms in BIM are usually labelled entirely by hand by the expert designer. Using machine learning algorithms to automate this type of task considerably reduces the time required. Three different classification algorithms, namely decision trees, logistic regression, and neural networks, were used to solve the problem of labelling rooms according to their function. The input data from which the algorithms are to learn consist of rooms in the housing unit whose function was previously labelled by hand.
The data were obtained from two models of housing developments created with Autodesk Revit, each consisting of more than 200 housing units, from which some rooms were extracted. The algorithms were trained on one project and evaluated on the other, which ensures that the results can be extrapolated to other projects. In order to compare sensitivity to the amount of information available, two different data sets were created for each model. The first data set only includes the information obtained directly (or via formulae) from the Revit schedules. The second also includes information that can only be extracted or calculated by C# programming using the Revit API. Predictions made by logistic regression or a neural network with the complete data were about 80%–90% accurate in predicting room use. The most common errors are mistaking kitchens for bedrooms and classifying corridors or bedrooms as bathrooms.
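The room-labelling task of [17] can be illustrated with a tiny from-scratch logistic regression; the geometric features (floor area, aspect ratio) and all data values are invented for illustration and are not taken from the paper:

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=500):
    # plain batch gradient descent on the logistic loss
    X = np.hstack([X, np.ones((len(X), 1))])   # bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def predict(w, X):
    X = np.hstack([X, np.ones((len(X), 1))])
    return (X @ w > 0).astype(int)

# invented examples: [floor area (m2), aspect ratio]; 1 = bedroom, 0 = bathroom
X_train = np.array([[12.0, 1.2], [14.0, 1.1], [11.0, 1.3],
                    [4.0, 2.0], [5.0, 1.8], [3.5, 2.2]])
y_train = np.array([1, 1, 1, 0, 0, 0])
w = train_logreg(X_train, y_train)
```

Training on one project and evaluating on rooms the model has never seen mirrors the cross-project protocol described above.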


The paper [11] presents experimental approaches directed at extracting implicit data from building models. Both unsupervised and supervised machine learning on BIM models is considered. A supervised machine learning approach, based on a neural network, is able to classify floor plans according to their intended function. By looking at the spatial configurations of floor plans, a neural network was trained to differentiate between residential and institutional facilities. This approach can be used to complete missing attributes in datasets where information pertaining to intended function and use is fragmented and incomplete. Both supervised and unsupervised learning algorithms assess a building by means of a set of its characteristic features. The IFC machine learning platform presented in this paper is built on top of the DURAARK IFC metadata extractor [2]. This tool is able to extract literal values, aggregates and derived values from IFC SPF files. In [6] a dataset of IFC files has been proposed to facilitate the comparison of classification results for different IFC entities. In the same paper, an unsupervised learning approach is used to reveal anomalies in building models. The obtained results make it possible to flag uncommon situations (like unusually large overhangs or an unusual confluence of several building elements) that might need additional checks or coordination, and therefore to reduce failure costs in the construction industry. Four different machine learning methods are used to categorize images extracted from BIMs of building designs in [13]. The BIM data are separated into three categories: apartment buildings, industrial buildings and others. The first method is based on classical machine learning, where a Histogram of Oriented Gradients (HOG) is used to extract features and a Support Vector Machine (SVM) is used for classification. The other three methods are based on deep learning.
The first two use pre-trained Convolutional Neural Networks (a MobileNet [8] and a Residual Network [7]). The third one is a CNN with a randomly generated structure. A database of 240 images extracted from 60 BIM virtual representations is used to validate the classification precision of the models. The accuracy achieved by the HOG+SVM model is 57%, while for the neural networks it is above 89%. The approach shows that it is possible to automatically categorize a structure type from a BIM representation.

3.2 Semantic Enrichment of BIM Models from Point Clouds

The other group of papers considers the use of machine learning algorithms for the semantic enrichment of BIM models obtained from point cloud data [18]. In this way the time-consuming process of manually creating 3D models useful for architectural and civil engineering applications can be avoided. Semantic enrichment encompasses classification of building objects, aggregation and grouping of building elements, implementing associations to reflect connections and numbering [14], unique identification, completion of missing objects, and reconstruction of occluded objects. Then the classification of the model as a whole, or of particular assemblies or objects within the model with respect to code compliance, can be performed. BIM objects derive many of their properties from their class,

G. Ślusarczyk and B. Strug

making object classification crucial for reuse in different analysis tasks, like spatial validation of a BIM model, quantity take-off and cost estimation. Xiong et al. [24] use machine learning for classifying and labeling surfaces obtained from a laser scan, in order to semantically enrich a BIM model. They use both shape features, referred to as local features, and spatial relationships (orthogonal, parallel, adjacent and coplanar), referred to as contextual features. A context-based machine learning algorithm, called stacked learning [23], is used to label patches extracted from a voxelized version of the input point cloud. The main constructive objects of a building, i.e. walls, ceilings and floors, were classified and separated from other objects obtained from the scan, which are considered clutter. The method achieved an average accuracy of 85% over 4 classes. Then an SVM algorithm is used to encode the characteristics of opening shape and location, which allows the algorithm to estimate the shape of window and doorway openings even when they are partially occluded. The method was evaluated on a large, highly cluttered data set of a two-story schoolhouse building containing 40 rooms. The facility was scanned from 225 locations, resulting in over 3 billion 3D measurements. The SVM algorithm was able to detect window and door openings with 88% accuracy. In [3] the use of machine learning algorithms for semantic enrichment of BIM models is illustrated through an application to the problem of classifying room types in residential apartments. Classification of room types and space labeling are important for the design process, compliance checking, management operations, and many building analysis tasks. The dataset used for the supervised machine learning process contains 32 apartment models. The classification and labeling of room types in this work is based on their function, and assumes that spaces do not have a dual function.
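Stacked learning in Wolpert's sense [23] trains a meta-classifier on the cross-validated predictions of several base classifiers. A minimal sketch with scikit-learn's stacking implementation (synthetic patch features; the base learners, feature counts, and class setup are illustrative, not those of Xiong et al.):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic patch descriptors: pretend the columns mix local (shape) and
# contextual (relationship) features; 4 classes, e.g. wall/ceiling/floor/clutter
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("svm", SVC(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))  # the "stacked" level
acc = stack.fit(Xtr, ytr).score(Xte, yte)
```

The meta-learner sees out-of-fold predictions of the base models, which is what lets contextual evidence correct local misclassifications.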
A multiclass neural network was used, with a total of 150 spaces in the dataset for the training process. The dataset was split into 70% for training and 30% for validation of the trained model, which resulted in 82% of the validation set being correctly classified. The building objects were classified based on five local features (area, number of doors, number of windows, number of room boundary lines, and floor level offset) and one connecting feature, direct access. The obtained results showed that machine learning is directly applicable to the space classification problem. Machine learning methods for both semantic enrichment and automated design review of BIM models are proposed in [19]. The approach was applied to identify security rooms and their walls within the BIM model created in the design process. A two-class decision forest classification algorithm [10] was chosen for this experiment. It was implemented on a data set of models with 642 security rooms with non-regular complex geometry, arranged in 64 shafts. The dataset for training contained 448 spaces, 278 of which were security rooms compliant with the described code clause; the remaining 170 were other rooms or open spaces. The spaces were organized in 64 vertical shafts, each comprising security rooms and other spaces. A 10-fold cross-validation achieved 88% accuracy. The authors now explore the possibility of developing a

Machine Learning Methods for BIM Data


deep neural network to classify the rooms when the only input data are a wall schedule, a room schedule, and a table of the relationships between them.
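A two-class decision forest evaluated with 10-fold cross-validation can be outlined as below. The per-space features and the labeling rule are synthetic placeholders: this excerpt does not list the real features used in the security-room experiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 448                      # same training-set size as in the experiment
# Hypothetical per-space features, e.g. area, wall thickness, door count, level
X = rng.normal(size=(n, 4))
# Hypothetical rule: the first two features drive the "security room" label
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
acc = cross_val_score(forest, X, y, cv=10).mean()   # 10-fold cross-validation
```

Cross-validation over all 448 spaces mirrors the evaluation protocol described above, with each fold held out once for testing.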

3.3 Building Condition Diagnosis

Semantically enriched BIM models are often used in the heritage industry to manage, analyse and diagnose the assets at varying stages of the conservation process. Bassier et al. [1] use machine learning techniques to automatically classify heritage buildings. SVMs are used to extract the main structural components such as floors, ceilings, roofs, walls and beams. The proposed semantic labelling of the objects is based on features encoding both local (surface area, orientation, dimensions) and contextual (normal similarity, coplanarity, parallelism, proximity, topology) information, which are extracted from training data sets. The proposed automated feature extraction algorithm combined with an SVM classifier takes the preprocessed data in the form of planar triangular meshes and outputs the classified objects. The algorithm was trained and tested using real data of a variety of existing buildings, like houses, offices, industrial buildings and churches. 10 structures representing different types of buildings were selected for the evaluation. The average accuracy of the model is 81%. The experiments prove that the approach reliably labels entire point cloud data sets and can effectively support experts in documenting and processing heritage assets. Machine learning methods are also used for defect classification in masonry walls of historic buildings [21]. First, the process of Scan-to-BIM, which automatically segments point clouds of ashlar masonry walls into their constitutive elements, is presented. Then a machine learning-based approach to the classification of common types of wall defects, which considers both the geometry and colour information of the acquired point clouds, is described. The found defects are recorded in a structured manner within the BIM model, which allows for monitoring the effects of deterioration.
A supervised logistic regression algorithm has been employed to classify different types of decay, using the roughness of stones and the dispersion of colour in stones as parameters. Stones labelled as 'defective' by experts are used for training the classifier, which is subsequently employed to label new data. The proposed approach has been tested on data from the main façade of the Royal Chapel in Stirling Castle, Scotland. For the training process, samples of three classes of decay (erosion, mechanical damage and discolouration) were used. 15 samples (5 of each class) were included in the test set, obtaining a global accuracy of 93.3% in the classification.
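The decay-classification step can be sketched with a logistic regression on the two features named above (roughness, colour dispersion); the numeric class centres and sample counts below are invented for illustration, not taken from the Stirling Castle data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Hypothetical (roughness, colour dispersion) centres per decay class
centres = {"erosion": (2.0, 0.2),
           "mechanical damage": (0.5, 0.2),
           "discolouration": (0.5, 2.0)}
X, y = [], []
for label, (rough, disp) in centres.items():
    X.append(rng.normal([rough, disp], 0.15, size=(20, 2)))
    y += [label] * 20
X = np.vstack(X)

clf = LogisticRegression(max_iter=1000).fit(X, y)
pred = clf.predict([[2.0, 0.2]])[0]   # a very rough, uniformly coloured stone
```

With only two well-chosen geometric and colour features, even a linear model separates such decay classes cleanly.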

3.4 BIM Enhancement in the Facility Management Context

Semantic enrichment by integrating facility management (FM) information with a Building Information Model is presented in [16]. At first, various machine learning algorithms, which analyse the unstructured text of occupant-generated work orders (WOs) and classify it by category and subcategory with high accuracy, have been investigated. Then, three learning methods, Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), and a Random Forest classifier, were applied to perform this classification. A set of 155.00 historical WOs was used for model development and for testing textual classification. Classifier prediction accuracies ranged from 46.6% to 81.3% for classification by detailed subcategory. The accuracy increased to 68% for simple TF and to 90% for the Random Forest when the dataset included only the ten most common subcategories. FM-BIM integration provides FM teams with spatio-temporal visualization of work order categories across a series of buildings and helps prioritize maintenance tasks. The paper shows that applying machine learning to support FM activities can enhance BIM use in the FM context.
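The TF-IDF plus Random Forest step of [16] can be sketched as follows; the work-order texts and the two categories are invented stand-ins for the real WO dataset:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical occupant-generated work orders with category labels
docs = ["water leak under the sink", "pipe leak in the restroom",
        "leaking radiator valve", "burst pipe flooding the basement",
        "lamp flickering in office", "replace burnt light bulb",
        "light out in hallway", "exit sign lamp not working"]
labels = ["plumbing"] * 4 + ["lighting"] * 4

model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
model.fit(docs, labels)
pred = model.predict(["leak in the ceiling pipe"])[0]
```

The pipeline turns each free-text WO into a TF-IDF vector and lets the forest vote on the category, which is the shape of the classification described above.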

3.5 Knowledge Extraction from BIM

Methods of extracting knowledge from existing BIM models with the use of machine learning are also considered. In [5] the notion of a building fingerprint is used to capture the main characteristics of a building design. It serves as a measure of similarity which allows for finding a suitable reference for a given problem. The fingerprint is based on accessibility and adjacency relationships among spaces within a building model. The authors therefore retrieve the accessibility and adjacency relationships among spaces encoded in IFC models, and build accessibility and adjacency graphs. Building fingerprints are automatically generated based on a spatio-semantic query language for BIM and applied as indexes of the building model repository. In [20] a Bayesian network is used to gain knowledge about existing bridges from data in a bridge management system, in order to support bridge engineering design. Two variants for the generation of the Bayesian network were considered: the Tree Augmented Naive Bayesian Network algorithm and the manual creation of the network by a knowledge engineer. Both variants permit the determination of bridge design parameters on the basis of given boundary conditions. The first variant fits the generated training data sets better, while the manually created network supports the bridge design process in a more intuitive way. The extracted knowledge is used for new bridge design problems to generate BIM models of possible bridge variants.
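The building-fingerprint idea of [5] can be illustrated with a toy graph invariant: here a fingerprint is just the sorted degree sequence of the room-adjacency graph, a drastic simplification of the spatio-semantic queries actually used (room names and layouts are invented):

```python
def fingerprint(adjacency):
    # Sorted degree sequence of the space-adjacency graph
    return tuple(sorted(len(nbrs) for nbrs in adjacency.values()))

# Two flats with the same topology (a hub room serving three others) ...
flat_a = {"hall": {"kitchen", "bed", "bath"}, "kitchen": {"hall"},
          "bed": {"hall"}, "bath": {"hall"}}
flat_b = {"entry": {"living", "bed1", "wc"}, "living": {"entry"},
          "bed1": {"entry"}, "wc": {"entry"}}
# ... and an office layout with a different adjacency structure
office = {"lobby": {"open"}, "open": {"lobby", "mtg1", "mtg2"},
          "mtg1": {"open", "mtg2"}, "mtg2": {"open", "mtg1"}}

same_layout = fingerprint(flat_a) == fingerprint(flat_b)
diff_layout = fingerprint(flat_a) == fingerprint(office)
```

Matching fingerprints flag candidate reference designs; a real system would compare richer accessibility and adjacency graphs rather than degree sequences.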

4 Conclusions

Traditional methods for modeling and optimizing complex structural systems require huge amounts of computing resources. Artificial-intelligence-based methods can often provide valuable alternatives for efficiently solving problems in architectural and engineering design, construction and manufacturing. Machine learning techniques have considerable potential in the development of BIM. The application of classification algorithms would enable machines to do tasks usually done by hand. Results from machine learning on architectural datasets provide a relevant alternative view to explicit querying mechanisms and give useful insights for more informed decisions in the design and management of buildings. Machine learning might also help less experienced users query complex BIM datasets for project-specific insights.


It was shown that machine learning algorithms can learn the key features of a building belonging to a certain category, and this acquired knowledge could be used in the future when designing methods to automatically design other structures based on historical BIM data. The proposed techniques can be applied for retrieval, reference, and evaluation of designs, as well as for generative design. The presence of historical data combined with the acquired knowledge of the key features of a building type could help in developing methods for automatically designing building structures with required characteristics. The presented methods can be extended to further subdivide the main BIM categories into sub-categories that could represent different areas of interest in these structures. Many problems in architectural and engineering design, construction management, and program decision-making are influenced by uncertainty and by incomplete and imprecise knowledge. It seems that machine learning techniques should be able to fill gaps in knowledge bases, and therefore they have broad application prospects in the practice of design, construction, manufacturing and management. Successful semantic enrichment tools would infer any missing information required by the receiving application, thus alleviating the need for the domain expert to preprocess the building model. They can help inexperienced users solve complex problems, and can also help experienced users improve their work efficiency and share experience. The authors believe that the widespread use of machine learning methods in the BIM community would require a more probabilistic and less deterministic approach to the parameters in the models, and/or the implementation of correction and post-processing measures. Such measures could include automatic revision based on criteria established using traditional programming, to pinpoint the elements where ML algorithms make mistakes.
In order to obtain consistent results with sufficient predictive potential, not only the choice of the right ML algorithms is important, but also the choice, quantity, and quality of the data used to train them.

References

1. Bassier, M., Vergauwen, M., Van Genechten, B.: Automated classification of heritage buildings for as-built BIM using machine learning techniques. In: ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. IV-2/W2 (2017)
2. Beetz, J., et al.: Enrichment and preservation of architectural knowledge. In: Münster, S., Pfarr-Harfst, M., Kuroczyński, P., Ioannides, M. (eds.) 3D Research Challenges in Cultural Heritage II. LNCS, vol. 10025, pp. 231–255. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47647-6_11
3. Bloch, T., Sacks, R.: Comparing machine learning and rule-based inferencing for semantic enrichment of BIM models. Autom. Constr. 91, 256–272 (2018)
4. BuildingSMART (2013). http://www.buildingsmart-tech.org/
5. Daum, S., Bormann, A.: Automated generation of building fingerprints using a spatio-semantic query language for building information models. In: eWork and eBusiness in Architecture, Engineering and Construction, ECPPM 2014 (2014)


6. Emunds, C., Pauen, N., Richter, V., Frisch, J., van Treeck, C.: IFCNet: a benchmark dataset for IFC entity classification. In: Proceedings of the EG-ICE 2021, pp. 166–175. Universitaetsverlag der TU Berlin, Berlin (2021)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
8. Howard, A., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications, April 2017
9. Jin, C., Xu, M., Lin, L., Zhou, X.: Exploring BIM data by graph-based unsupervised learning, pp. 582–589, January 2018
10. Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques. Emerging Artif. Intell. Appl. Comput. Eng. 160, 3–24 (2007)
11. Krijnen, T., Tamke, M.: Assessing implicit knowledge in BIM models with machine learning, pp. 397–406, January 2015
12. Liebich, T.: IFC 2x edition 3 model implementation guide (2009). https://standards.buildingsmart.org
13. Lomio, F., Farinha, R., Laasonen, M., Huttunen, H.: Classification of building information model (BIM) structures with deep learning. arXiv:1808.00601 [cs.CV] (2018)
14. Ma, L., Sacks, R., Kattel, U., Bloch, T.: 3D object classification using geometric features and pairwise relationships. Comput. Aided Civil Infrastr. Eng. 33(2), 152–164 (2018)
15. Mazairac, W., Beetz, J.: BIMQL - an open query language for building information models. Adv. Eng. Inform. 27, 444–456 (2013)
16. McArthur, J.J., Shahbazi, N., Fok, R., Raghubar, C., Bortoluzzi, B., An, A.: Machine learning and BIM visualization for maintenance issue classification and enhanced data collection. Adv. Eng. Inform. 38, 101–112 (2018)
17. Núñez-Calzado, P.E., Alarcón-López, I.J., Martínez-Gómez, D.C.: Machine learning in BIM. In: EUBIM 2018 (2018)
18. Sacks, R., Ma, L., Yosef, R., Borrmann, A., Daum, S., Kattel, U.: Semantic enrichment for building information modeling: procedure for compiling inference rules and operators for complex geometry. J. Comput. Civil Eng. 31(6), 04017062 (2017)
19. Sacks, R., Bloch, T., Katz, M., Yosef, R.: Automating design review with artificial intelligence and BIM: state of the art and research framework. In: ASCE International Conference on Computing in Civil Engineering 2019 (2019)
20. Singer, D., Bügler, M., Borrmann, A.: Knowledge based bridge engineering - artificial intelligence meets building information modeling (2016)
21. Valero, E., Forster, A., Bosché, F., Renier, C., Hyslop, E., Wilson, L.: High level-of-detail BIM and machine learning for automated masonry wall defect surveying. In: 35th International Symposium on Automation and Robotics in Construction (ISARC 2018) (2018)
22. Wang, S., Wainer, G., Rajus, V.S., Woodbury, R.: Occupancy analysis using building information modeling and Cell-DEVS simulation, vol. 45, April 2013
23. Wolpert, D.: Stacked generalization. Neural Netw. 5, 241–259 (1992)
24. Xiong, X., Adan, A., Akinci, B., Huber, D.: Automatic creation of semantically rich 3D building models from laser scanner data. Autom. Constr. 31, 325–333 (2013)
25. Zhang, C., Beetz, J., Weise, M.: Interoperable validation for IFC building models using open standards. In: Special Issue ECPPM 2014 - 10th European Conference on Product and Process Modelling, vol. 20, pp. 24–39 (2014)

Self-Optimizing Neural Network in Classification of Real Valued Experimental Data

Alicja Miniak-Górecka, Krzysztof Podlaski, and Tomasz Gwizdałła
Department of Intelligent Systems, Faculty of Physics and Informatics, University of Lodz, Lodz, Poland
{alicja.miniak,krzysztof.podlaski,tomasz.gwizdalla}@uni.lodz.pl

Abstract. The classification of data is a well-known and extensively studied scientific problem. When applying a procedure to a particular dataset, we have to consider many features related to the dataset and to the technique applied. In the presented paper, we show an application of the Self-Optimizing Neural Network, which we will call SONN, with the remark that it should not be confused with a Self-Organizing NN. Our SONN can in principle be understood as a form of decision network with a reduced number of paths corresponding to every possible set of discretized values, obtained by a special procedure from the real-valued data measured by the experimental setup. In the paper, we use a dataset obtained during a meteorological study in eastern Poland, which was burdened with significant measurement errors. The analysis, performed with various methods of determining the final signal as well as various metrics defined in the discretized space of solutions, shows that the proposed method can lead to a visible improvement when compared to typical classification methods like SVM or Neural Networks.

Keywords: SONN · Experimental data · Data analysis · Data classification · SVM

1 Introduction

Almost every live experiment uses special measurement equipment. Since the scale of the equipment shows results with a certain accuracy, the data obtained are not continuous but discrete. Experimental measurements may be real-valued, but they are only approximations of the true value. In this paper, we consider data acquired in a continuous meteorological experiment and focus on the time series of CO2 collected in the wetlands of Biebrza National Park, northeastern Poland [3,4]. The measured CO2 represents the net exchange of this greenhouse gas between the surface and the atmosphere. In the paper, we present a classification method for geographical data analysis. For our research, we choose an approach that takes into account only discrete signals. We use the Self-Optimizing Neural Networks [8,9], whose main idea is to minimize the size of the neural network by retrenching connections that

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 241–254, 2022. https://doi.org/10.1007/978-3-031-21967-2_20


A. Miniak-Górecka et al.

are not needed. In the first stage, the dataset is discretized, and then the values of discrimination coefficients are calculated. We obtain a system that deterministically builds the network based on the learning set and classifies the items from the test set. The presented results show that the proposed method can give better results than the SVM method and the climate researchers' approach, which we use for comparison. The paper is organized as follows. In Sect. 2, we present the formalism of the method, defining the discriminants, weights, and output values. In Sect. 3, the results of the calculations are described. Finally, in Sect. 4, the conclusions are presented.

2 Self-Optimizing Neural Network

This paper describes the use of the Self-Optimizing Neural Network (SONN), whose structure results from an adaptation process based on the learning data. The construction process distinguishes the most general and discriminating features of the data. The network can adapt its topology and weights in a deterministic process [8,9].

2.1 SONN Formalism

The idea of Self-Optimizing Neural Networks was originally introduced by Horzyk and Tadeusiewicz in [8,9]. Although called Neural Networks, they should instead be classified as a particular form of decision tree. Following the initial nomenclature, however, we will use the original name. The crucial property of SONN is that it operates on discrete values. This assumption seems very limiting for the computational possibilities of the method, which seems to be why it has not achieved popularity similar to other techniques. In this paper, we want to deal with a complex dataset burdened with significant experimental errors, and we hope that the proposed technique can be useful for such a dataset. The form of discretization is also not obvious. When working on a specific set of patterns (i.e., elements contained in a data set) with discrete values, we represent each of them individually by a vector of values. Instead of representing each parameter by one variable of this vector containing a discrete value corresponding to the division of the parameter range into intervals, we have several variables for every parameter. The number of these variables is connected with the division of the parameter range into intervals, and each of them can take only one of the three values from the set {−1, 0, 1}. The value −1 means that a feature does not appear in a given pattern (the value of the parameter does not belong to the interval represented by this variable), 0 denotes an undefined feature, and 1 means that the pattern has a value in the given feature. The detailed description of the mapping from real values of parameters into discrete vectors will be presented in Sects. 2.4 and 3.2.
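The mapping of one real-valued parameter into its group of ternary variables can be sketched as follows (a hypothetical helper, not the authors' code): given interval edges, the variable for the interval containing the value is set to 1 and the others to −1, while 0 is reserved for an undefined measurement:

```python
def discretize(value, edges):
    """Encode one real-valued parameter as a {-1, 1} vector over intervals;
    a missing measurement instead yields a vector of 0s (undefined feature)."""
    if value is None:
        return [0] * (len(edges) - 1)
    vec = [-1] * (len(edges) - 1)
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            vec[i] = 1
    return vec

edges = [0.0, 1.0, 2.0, 3.0]        # three intervals for one parameter
v = discretize(1.4, edges)           # the value falls into the middle interval
```

Concatenating such vectors over all parameters produces the discrete pattern u^n used in the formalism below.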

SONN in Classification of Real Valued Experimental Data


Let us shortly recapitulate the basics of the approach and our modifications. Let U be a set of the form

U = {(u^n, c^n)},   (1)

where u^n = [u^n_1, u^n_2, ..., u^n_{N_F}]; u^n_f is the value of the f-th feature for the n-th pattern, with u^n_f ∈ {−1, 0, 1}; n = 1, 2, ..., N_P, where N_P is the total number of patterns; f = 1, 2, ..., N_F, where N_F is the number of features; and c^n is the class of the n-th pattern, with c^n ∈ N_C, where N_C is the set of classes. As a pattern, we understand every vector of discrete data described above, and as a feature, every single variable of this vector. Let P^c_f denote the number of patterns in the c-th class with a value greater than 0 in the f-th feature, and M^c_f the number of patterns with a value less than 0, i.e.

P^c_f = Σ_{u^i_f > 0, i = 1, 2, ..., Q_c} u^i_f,    M^c_f = − Σ_{u^i_f < 0, i = 1, 2, ..., Q_c} u^i_f,   (2)

where Q_c is the number of patterns in the c-th class. Their counterparts P̂^c_f and M̂^c_f are defined analogously, with the values x^i_f in place of u^i_f:

P̂^c_f = Σ_{u^i_f > 0, i = 1, 2, ..., Q_c} x^i_f,   (3)

M̂^c_f = − Σ_{u^i_f < 0, i = 1, 2, ..., Q_c} x^i_f.   (4)
Criterion                 k-means vs. baseline     GMM vs. baseline        GMM vs. k-means
Statistical conclusion    >>                       >>                      >>
Avg. MAPE                 12.322 vs. 20.702        9.444 vs. 20.702        9.444 vs. 12.322
Avg. p-value              0.00324                  0.00069                 0.00067
Statistical conclusion    >>                       >>                      >>
Avg. RMSE                 569.142 vs. 1143.45      330.724 vs. 1143.45     330.724 vs. 569.142
Avg. p-value              0.00087                  0.00047                 0.00242
Statistical conclusion    >>                       >>                      >>
Avg. RMSE                 0.125 vs. 0.207          0.097 vs. 0.207         0.097 vs. 0.125
Avg. p-value              0.00000                  0.00000                 0.00061
Statistical conclusion    >>                       >>                      >>

RQ3. Which clustering algorithm, GMM or k-means, gives the higher estimation accuracy? Based on Table 4, we can see that the GMM algorithm consistently achieves a lower estimation error than the k-means algorithm, both on individual clusters and on the whole dataset. Table 6 provides the statistical assertion that the GMM algorithm has higher estimation accuracy than the k-means algorithm in this context.
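The two clustering steps being compared can be reproduced in outline with scikit-learn; the two-feature project data below (size, effort) are synthetic, not the dataset of the study:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic project groups: (function points, effort) at different scales
X = np.vstack([rng.normal([100, 1000], [10, 100], size=(50, 2)),
               rng.normal([500, 9000], [50, 900], size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gm = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

def separates(lab):
    # Each synthetic group lands in its own cluster (up to label permutation)
    return (lab[:50] == lab[0]).all() and (lab[50:] == lab[50]).all() \
        and lab[0] != lab[50]

both_ok = separates(km) and separates(gm)
```

After clustering, a separate estimation model is fitted per cluster; GMM's soft, covariance-aware components are what give it an edge over k-means on less cleanly separated project data.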

5 Conclusion

In this study, we have focused on evaluating the effect of clustering on software effort estimation using the FPA method. Two clustering algorithms, k-means and GMM, have been assessed, and the results of this procedure are compared with the baseline model. The experimental results show that the estimation accuracy improves significantly when a clustering algorithm is applied. Of the two algorithms used, the GMM algorithm achieves higher estimation accuracy than the k-means algorithm. Specifically, in the comparison between the two selected algorithms and the baseline, the improvement percentages of applying the k-means algorithm according to five evaluation criteria, MAE, MAPE, RMSE, MBRE, and MIBRE, are 41.56%, 40.48%, 50.23%, 40.19%, and 28.78%, respectively. The improvement percentage while applying the

Analyzing the Effectiveness of the Gaussian Mixture Model


GMM algorithm is 65.96%, 54.38%, 71.08%, 53.59%, and 43.17%. In the comparison between GMM and k-means, the improvement percentage of GMM according to five evaluation criteria is 41.75%, 23.36%, 41.89%, 22.4%, and 20.2%, respectively. Many clustering algorithms have been proposed and applied with many positive results in data mining. However, their application in software prediction has many problems to consider. We will investigate other clustering algorithms combined with machine learning techniques to improve the software effort estimation accuracy in future work. Acknowledgment. This work was supported by the Faculty of Applied Informatics, Tomas Bata University in Zlin, under project IGA/CebiaTech/2022/001 and under project RVO/FAI/2021/002.



V. Van Hai et al.


Graph Classification via Graph Structure Learning

Tu Huynh, Tuyen Thanh Thi Ho, and Bac Le

Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
Vietnam National University, Ho Chi Minh City, Vietnam
University of Economics Ho Chi Minh City, Ho Chi Minh City, Vietnam
[email protected], [email protected]

Abstract. With the ability to represent structures and complex relationships between data, graph learning is widely applied in many fields. The problem of graph classification is important in graph analysis and learning. Many popular graph classification methods are based on substructures, such as graph kernels or methods based on frequent subgraph mining. Graph kernels use handcrafted features and hence generalize poorly. Frequent subgraph mining requires subgraph isomorphism testing, which is NP-complete, so methods based on it are inefficient. To address these limitations, in this work we propose a novel graph classification method via graph structure learning, which automatically learns hidden representations of substructures. Inspired by doc2vec, a successful and efficient model in Natural Language Processing, our graph embedding uses rooted subgraphs and topological features to learn representations of graphs. Then, we can easily build a machine learning model to classify them. We demonstrate our method on several benchmark datasets in comparison with state-of-the-art baselines and show its advantages for classification tasks.

Keywords: Graph classification · Graph mining · Graph embedding

1 Introduction

In recent years, graph data has become increasingly popular and widely applied in many fields such as biology (protein-protein interaction networks) [1], chemistry (molecular structures) [2], neuroscience (brain networks) [3, 4], social networks (networks of friends) [5], and knowledge graphs [6, 7]. The power of graphs is their capacity to represent complex entities and their relationships. Graph classification is an important problem because of its wide range of applications, including predicting whether a protein structure is mutated, recognizing unknown compounds, etc. Because traditional classification algorithms cannot be applied directly to graph data, graph classification has become an independent sub-field. There are many popular graph classification methods based on substructures, such as graph kernels or ones based on frequent subgraph mining. The core idea of the former is to extract information from

T. Huynh and T. T. T. Ho contributed equally to this work.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 269–281, 2022. https://doi.org/10.1007/978-3-031-21967-2_22


substructures (such as subgraphs, paths, and walks) and apply conventional classifiers; the difficulty with this approach lies in how information is extracted from the substructures. There are two lines of substructure-based methods. First, graph kernels, based on kernel methods, work on graph elements such as walks or paths; however, it is difficult for these methods to find a suitable kernel function that captures the semantics of the structure while remaining computationally tractable. The second group of substructure methods mines frequent subgraphs; its main drawback is that it is time-consuming, due to the high cost of the subgraph mining step. This paper addresses these limitations by using features learned automatically from data instead of the handcrafted features of graph kernels. To overcome the NP-completeness of subgraph isomorphism testing, we mine rooted subgraphs and apply the Weisfeiler-Lehman relabeling method proposed for the Weisfeiler-Lehman graph kernel [8] to find subgraphs more efficiently. In addition, inspired by the recent success of doc2vec [20] in NLP, which exploits how words compose documents to learn their representations, we adopt this idea by treating a graph as a document and its rooted subgraphs as words. Specifically, we propose a novel Graph Classification method via Graph Structure Learning (GC-GSL). First, GC-GSL extracts topological attributes and builds a subgraph "vocabulary" set over the graphs. Then, to train the graph embedding, a neural network is designed that takes a graph as input and outputs the subgraphs appearing in the graph together with the topological attributes extracted from it. Finally, a basic classifier is trained on the graph embeddings. We make the following main contributions:
• We propose a neural network graph embedding model that automatically learns an embedding for each graph.
The embedding of a graph after learning not only reflects the characteristics of the graph itself but also captures relationships between graphs.
• Through experiments on several benchmark datasets, we demonstrate that GC-GSL is highly competitive with graph kernels and with graph classification methods based on feature vector construction.
The remainder of this article is structured as follows. Related work is reviewed in Sect. 2. Section 3 introduces the proposed method, from extracting the topological attributes vector and mining rooted subgraphs to the neural network for graph embedding. Experimental results and discussion are presented in Sect. 4. Conclusions and future work are presented in Sect. 5.

2 Related Works
Graph Kernels. Graph kernels [9] are among the prominent methods for graph classification. They evaluate the similarity between a pair of graphs by recursively decomposing them into substructures (e.g., walks, paths, cycles, graphlets) and defining a similarity function over these substructures (e.g., counting the number of substructures shared by the two graphs). Kernel methods (e.g., Support Vector Machines [10]) can then be used to perform the classification task. In Random Walk Kernels [11], the similarity of two graphs is calculated by counting the number of common


walk labels of the two graphs. Shortest Path Kernels [12] first compute the shortest paths for each graph in the dataset; the kernel is then defined as a sum over all pairs of shortest-path edges of the two graphs, using any suitable positive definite kernel on the edges. Nikolentzos et al. [13] proposed a method to measure the similarity between pairs of documents based on the Shortest Path Kernel, in which each document is represented by a graph of words. Cyclic Pattern Kernels [14] are based on the number of cycles occurring in both graphs; since there is no known polynomial-time algorithm for finding all cycles in a graph, time-limited sampling and enumeration of cycles are used to measure similarity. Graphlet and Subgraph Kernels [15] assume that similar graphs have similar subgraphs; a graphlet kernel measures the similarity of two graphs as the dot product of the count vectors of all possible connected subgraphs of degree d. Subtree Kernels [16] are based on common subtree patterns in graphs: to compare two graphs, the subtree kernel compares all pairs of vertices from the two graphs by iteratively comparing their neighborhoods. However, most of these approaches have limitations. First, many of them do not provide an explicit graph embedding; they allow kernelized learning algorithms such as Support Vector Machines [10] to work directly on graphs, without extracting fixed-length, real-valued feature vectors. Second, graph kernels use handcrafted features (e.g., shortest paths, graphlets), i.e., features determined manually with specific well-defined extraction procedures, and thus generalize poorly.
Graph Classification Based on Frequent Subgraph Mining. Constructing feature vectors based on frequent subgraph mining consists of three steps: 1) mining the frequent subgraphs of the graphs (e.g., with Fast Frequent Subgraph Mining [27]), 2) filtering the subgraph features (e.g., structure-based or semi-supervised selection), and 3) vectorizing each graph based on the subgraph features. The key issue for classification efficiency is the selection of discriminative subgraph features. Fei and Huan [17] first used Fast Frequent Subgraph Mining (FFSM) [27] to mine frequent subgraphs, then proposed an embedding distance method to describe the feature consistency relationship between subgraphs, mapping all frequent subgraphs into feature consistency graphs. The order of the nodes corresponding to the frequent subgraphs in the consistency graph is used to extract the feature subgraphs, converting the original graph into a feature vector; Support Vector Machines [10] are then used for classification. Kong and Yu [18] use the gSpan algorithm [26] to mine subgraphs and propose a feature evaluation criterion, called gSemi, to evaluate subgraph features for both labeled and unlabeled graphs, deriving an upper bound for gSemi to reduce the subgraph search space; a branch-and-bound algorithm then efficiently finds an optimal set of feature subgraphs for graph classification. However, there are still challenges when classifying graphs based on frequent subgraph mining. First, a large number of subgraphs are mined: applying frequent subgraph mining to a set of graphs detects every subgraph occurring more often than a given threshold. These patterns are numerous, and their number depends on the data and the chosen threshold; the large number of patterns increases running time, makes the selection of valuable patterns more difficult, and reduces scalability. Second, the process of frequent subgraph mining is


NP-complete. To determine the frequency of the subgraphs, we must test subgraphs for isomorphism: two subgraphs are isomorphic if they are identical in terms of connectivity. Isomorphism testing is an NP-complete problem [19], so it is expensive, especially for large graphs.
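To make this cost concrete, the following illustrative sketch (not part of any of the cited systems) tests two small graphs for isomorphism by brute force over all node permutations, which scales as O(n!):

```python
from itertools import permutations

def brute_force_isomorphic(adj_a, adj_b):
    """Naive isomorphism test: try every mapping of A's nodes onto B's.

    adj_a, adj_b: adjacency given as dicts {node: set(neighbors)}.
    Runs in O(n!) time, which is why subgraph miners try to avoid it.
    """
    nodes_a, nodes_b = sorted(adj_a), sorted(adj_b)
    if len(nodes_a) != len(nodes_b):
        return False
    edges_b = {(u, v) for u in adj_b for v in adj_b[u]}
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        # Map A's edges through this candidate node mapping and compare.
        mapped = {(mapping[u], mapping[v]) for u in adj_a for v in adj_a[u]}
        if mapped == edges_b:
            return True
    return False

# A triangle and a 3-node path: same node count, different structure.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
print(brute_force_isomorphic(triangle, {0: {2, 1}, 1: {2, 0}, 2: {1, 0}}))  # True
print(brute_force_isomorphic(triangle, path3))  # False
```

Even for these toy graphs the search space is 3! = 6 mappings; for a 20-node subgraph it is already ~2.4 × 10^18, which motivates the relabeling trick used later in this paper.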

3 Proposed Method: GC-GSL
Our graph classification method, which we name GC-GSL (Graph Classification via Graph Structure Learning), is based on the following key observation: two graphs are similar if they have similar substructures. Inspired by doc2vec [20], which learns document embeddings by exploiting the way words and word sequences compose documents, we learn graph embeddings analogously: in GC-GSL, we view a graph as a document and the rooted subgraphs in the graph as words. The PV-DBOW model in doc2vec [20] ignores the local context of words; it only considers which words a document contains, not their order. This makes PV-DBOW well suited to graph embedding, where there is no sequential relationship between rooted subgraphs. In addition, in GC-GSL the topological attribute vector is added to the output layer to help the model learn global information about the graph. After training the graph embedding, a basic machine learning classifier is used for classification. In this section, we discuss the main components and the complexity of GC-GSL.
3.1 Extracting Topological Attribute Vector
The extracted topological attribute vector comprises 16 features covering several different aspects of the graph, from statistical information, such as the number of nodes and edges, to structural features such as clustering, connectivity, centrality, distance measures, and percentages of some typical node types.
f1 – Number of nodes.
f2 – Number of edges.
f3 – Average degree: the average degree over all nodes in the graph.
f4 – Average neighbor degree: first, we calculate the average neighbor degree of each node; then we average over all nodes of the graph.
f5 – Degree assortativity coefficient: assortativity measures how similar the connections in the graph are with respect to node degree. It resembles the Pearson correlation coefficient but measures the correlation between every pair of connected nodes.
f6 – Average clustering coefficient: the clustering coefficient measures the degree to which nodes in a graph tend to cluster together.
f7 – PageRank score: PageRank ranks the nodes of a graph based on the structure of incoming links.
f8 – Eigenvector centrality: a measure of the influence of a node in a graph; the centrality of a node is computed from the centrality of its neighbors.


f9 – Closeness centrality: a way of detecting nodes that can propagate information efficiently through a graph; the closeness centrality of a node is based on its average (inverse) distance to all other nodes.
f10 – Average betweenness centrality: betweenness centrality measures the degree to which a node lies on paths between other nodes.
f11 – Average effective eccentricity: the eccentricity of a node is the maximum distance from that node to any other node in the graph, i.e., the longest of all shortest paths from that node. The average effective eccentricity is the average eccentricity over all nodes.
f12 – Effective diameter: the maximum effective eccentricity over all nodes in the graph.
f13 – Effective radius: the minimum effective eccentricity over all nodes in the graph.
f14 – Percentage of central points: a central node is a node whose eccentricity equals the effective radius of the graph.
f15 – Percentage of periphery points: the percentage of nodes whose eccentricity equals the effective diameter.
f16 – Percentage of endpoints: the ratio between the number of endpoints (leaf nodes) and the total number of nodes in the graph.
When extracting the topological attributes vector, if a graph in the dataset is disconnected, we compute each feature as the mean over its connected components. Table 1 shows the topological attributes vector extracted from graph 0 in the MUTAG dataset.

Table 1. The topological attributes vector extracted from graph 0 in the MUTAG dataset, consisting of 16 features f1 to f16.

F  f1     f2     f3    f4    f5     f6    f7    f8
V  17.00  19.00  2.24  2.47  −0.21  0.00  0.06  0.22

F  f9    f10   f11   f12   f13   f14   f15   f16
V  0.29  0.17  6.82  9.00  5.00  0.24  0.18  0.12
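As an illustration, a few of the sixteen features can be computed directly from an adjacency list. The sketch below (pure Python, illustrative only, not the authors' implementation) covers f1, f2, f3 and f16:

```python
def topo_features(adj):
    """Compute a handful of the topological attributes (f1, f2, f3, f16)
    for an undirected graph given as {node: set(neighbors)}."""
    n = len(adj)                                   # f1: number of nodes
    m = sum(len(nb) for nb in adj.values()) // 2   # f2: number of edges
    avg_degree = 2 * m / n                         # f3: average degree
    endpoints = sum(1 for nb in adj.values() if len(nb) == 1)
    pct_endpoints = endpoints / n                  # f16: percentage of endpoints
    return n, m, avg_degree, pct_endpoints

# Star graph with centre 0 and three leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(topo_features(star))  # (4, 3, 1.5, 0.75)
```

The remaining features (centralities, eccentricities, clustering) follow the standard definitions and are available in common graph libraries.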

3.2 Rooted Subgraph Mining
The main objective of this section is to build a "vocabulary" of rooted subgraphs (i.e., neighborhoods around every node up to a certain depth), analogous to the vocabulary of a document collection. Rooted subgraphs can be considered fundamental components of any graph. Other substructures include nodes, paths, and walks, but this paper uses rooted subgraphs because they are non-linear, higher-order substructures that capture more information than the alternatives.
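The two operations used in this section, extracting the d-hop neighborhood around a root node and one Weisfeiler-Lehman relabeling pass, can be sketched in pure Python (illustrative only; the graph encoding and initial labels are our assumptions, not the authors' code):

```python
from collections import deque

def rooted_subgraph(adj, root, d):
    """Nodes of the rooted subgraph: BFS neighborhood of `root` up to depth d."""
    seen, queue = {root}, deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == d:
            continue
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, depth + 1))
    return seen

def wl_iteration(adj, labels):
    """One Weisfeiler-Lehman relabeling pass: each node's new label compresses
    its own label together with the sorted labels of its neighbors."""
    multiset = {v: (labels[v],) + tuple(sorted(labels[nb] for nb in adj[v]))
                for v in adj}                       # steps 1-2: multiset + sort
    compressed = {}                                 # step 3: label compression
    new_labels = {}
    for v in sorted(adj):
        key = multiset[v]
        if key not in compressed:
            compressed[key] = len(compressed)
        new_labels[v] = compressed[key]             # step 4: relabeling
    return new_labels

path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(rooted_subgraph(path4, 0, 2))   # {0, 1, 2}
labels = {v: 0 for v in path4}        # all nodes start with the same label
print(wl_iteration(path4, labels))    # endpoints get one label, inner nodes another
```

Iterating `wl_iteration` produces the canonical relabelings that stand in for explicit subgraph isomorphism tests when deduplicating the "vocabulary".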


To mine rooted subgraphs, we take each node of the graph as a root node and find its neighborhood at levels d = 0 (layer 1, the node itself) through d = 3 (layer 4). We then aggregate the rooted subgraphs of all four layers and remove duplicates to obtain the subgraph "vocabulary" set. Removing duplicates requires testing all subgraphs for isomorphism, which is NP-complete and therefore time-consuming. To avoid this, we follow the well-known Weisfeiler-Lehman relabeling method proposed in [8]. One Weisfeiler-Lehman relabeling iteration consists of 4 steps:
• Step 1: Multiset-label determination. For each node, determine the multiset-label consisting of the node's own label and the labels of its neighbors.
• Step 2: Sorting each multiset. Sort the labels of the neighboring nodes in ascending order, then prepend the root node's label and convert the result into a string representing that node's label.
• Step 3: Label compression. Map each sorted multiset-label to a new label that has not appeared before.
• Step 4: Relabeling. Use the mapping from Step 3 to relabel all nodes in the graph.
3.3 Neural Network Graph Embedding
Figure 1 shows the architecture of the graph embedding neural network of GC-GSL, which is similar to the PV-DBOW model in doc2vec [20]. The input layer takes a one-hot vector identifying the graph, whose length equals the number of graphs in the dataset. There is a single hidden layer, whose number of neurons equals the desired dimensionality of the graph embeddings.
The embedding matrix between the input layer and the hidden layer holds the graph embeddings we want to train. The output layer consists of two parts: the first part is the topological attributes vector with 16 dimensions corresponding to the 16 graph features, and the second part is taken from the subgraph "vocabulary" set. More formally, the second part of the output layer tries to maximize the following objective:

J = log σ(v_g^T v_sg) + Σ_{i=1}^{k} log σ(−v_g^T v_sgi)    (1)

where v_g is the embedding of graph g, v_sg is the embedding of a subgraph sg that occurs in graph g, and each sgi is a random sample from the subgraph "vocabulary" set that does not appear in graph g. σ(x) is the sigmoid function, i.e., σ(x) = 1/(1 + e^(−x)). Training the graph embedding is unsupervised: it uses only the information and structures extracted from the graphs, namely the topological attributes vectors, the rooted subgraphs, and the graphs themselves. It therefore does not depend on graph labels and learns embeddings purely through substructures and graph information. Moreover, this graph embedding model automatically learns the corresponding embedding for each graph, and the embedding that


we get after training not only reflects the components of the graph itself but also reflects information about relationships between graphs.

Fig. 1. Architecture of the graph embedding neural network.
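The objective in Eq. (1) can be sketched numerically: given an embedding v_g of a graph, the embedding v_sg of a subgraph that occurs in it, and k negative-sample embeddings, J rewards high dot products with observed subgraphs and low ones with negatives. The vectors below are arbitrary toy values, not trained embeddings:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(v_g, v_sg, negatives):
    """J = log sigma(v_g . v_sg) + sum_i log sigma(-v_g . v_sg_i), as in Eq. (1)."""
    j = math.log(sigmoid(dot(v_g, v_sg)))
    for v_neg in negatives:                 # negative samples: subgraphs absent from g
        j += math.log(sigmoid(-dot(v_g, v_neg)))
    return j

v_g = [0.5, -0.2, 0.1]
v_sg = [0.4, -0.3, 0.0]          # subgraph present in g: similar direction
negatives = [[-0.5, 0.2, -0.1]]  # negative sample: opposite direction
print(objective(v_g, v_sg, negatives))  # ≈ -1.13
```

Gradient ascent on J pushes v_g toward the embeddings of its own subgraphs and away from the negative samples, which is exactly the PV-DBOW negative-sampling scheme.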

3.4 Computational Complexity
GC-GSL consists of three main parts: topological attributes vector extraction, subgraph "vocabulary" set construction, and the graph embedding neural network. For the topological attributes, let n denote the number of nodes and m the number of edges of a graph. f1 and f2 are given, so their cost is O(1). Degree-based features (f3, f4, f5 and f16) can be computed in linear time O(m + n). Features that depend on the eigen-decomposition of the graph (f7 and f8) can be computed in O(n^3) time in the worst case. The clustering coefficient of a single node is computed in average time O(d^2) = O((2m/n)^2), where d = 2m/n is the average degree; hence the average clustering coefficient over all nodes (f6) can be computed in O(m^2/n) time. Features based on eccentricity (f9, f10, f11, f12, f13, f14 and f15) are calculated from the SP (Shortest Path) matrix of all pairs of vertices. The SP matrix can be computed in O(n^2 + mn) time, and from it these features can be computed in O(n^2) time.


Building the subgraph "vocabulary" set requires a Breadth-First Search (BFS) to mine the subgraph of each node, computed in O(n), after which the Weisfeiler-Lehman relabeling algorithm solves the subgraph isomorphism test problem in O(lt), where l is the number of iterations and t is the size of the multiset-label set in each iteration. The graph embedding neural network has complexity O(ES(NH + NH log(V + F))), where E is the number of training epochs, S is the number of graphs in the dataset, N is the number of negative samples, H is the number of neurons in the hidden layer, V is the size of the subgraph "vocabulary" set, and F is the dimensionality of the topological attributes vector.

4 Experiments
Datasets. We use seven popular benchmark datasets: five bioinformatics datasets (MUTAG, PROTEINS, NCI1, NCI109 and PTC_MR) and two social network datasets (IMDB-BINARY and IMDB-MULTI) [21], whose characteristics are summarized in Table 2. MUTAG is a dataset of 188 nitro compounds labeled according to whether they have a mutagenic effect on a bacterium. NCI1 and NCI109 are two balanced subsets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, respectively. PROTEINS is a dataset where nodes are Secondary Structure Elements (SSEs) and edges represent neighborhood relationships in the amino acid sequence or in 3-dimensional space. PTC_MR records the carcinogenicity of 344 chemical compounds in male rats. IMDB-BINARY and IMDB-MULTI are collections of actors' ego-networks, where an edge connects two actors who appear in the same movie, and the task is to infer the genre of an ego-network.

Table 2. Dataset statistics including the number of graphs (#graphs), the number of graph labels (#classes), the average number of nodes (#nodes), the average number of edges (#edges), and the number of positive (#pos) and negative (#neg) samples.

           #graphs  #classes  #nodes  #edges  #pos  #neg
MUTAG      188      2         17.93   19.79   125   63
NCI1       4110     2         29.87   32.30   2057  2053
NCI109     4127     2         29.68   32.13   2079  2048
PROTEINS   1113     2         39.06   72.82   663   450
PTC_MR     344      2         14.29   14.69   192   152
IMDB-B     1000     2         19.77   96.53   500   500
IMDB-M     1500     3         13.00   65.94   –     –


Experiments and Configurations. In our experiments, the dimension of the hidden layer of the neural network is 128; the best results are obtained with z-score normalization of the topological attributes vector, a learning rate of 0.003 and 3000 epochs on all seven datasets. The learning algorithm is Adam, and the mini-batch size is 512 for all datasets. The base classifier is a Support Vector Machine (SVM) [10]. Each graph classification algorithm is run 10 times, with each run using 10-fold cross-validation; the final evaluation results are the mean and standard deviation of accuracy across all runs of each algorithm on each dataset.
Baselines. Our method is compared with state-of-the-art baselines including the Weisfeiler-Lehman kernel (WL) [8], Deep WL [21], Deep Divergence Graph Kernels (DDGK) [22], Anonymous Walk Embeddings (AWE) [23], and methods based on frequent subgraph mining (FSG-Bin) [24, 25].
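The evaluation protocol (10 runs of 10-fold cross-validation, reporting mean and standard deviation of accuracy) can be sketched in pure Python. Here `train_and_score` is a hypothetical stand-in for fitting the SVM on graph embeddings and returning test accuracy:

```python
import random
import statistics

def ten_by_tenfold(n_samples, train_and_score, n_runs=10, n_folds=10, seed=0):
    """Repeat n_runs times: shuffle indices, split them into n_folds folds,
    score each train/test split, then report mean and std over all scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test = folds[f]
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            scores.append(train_and_score(train, test))
    return statistics.mean(scores), statistics.stdev(scores)

# Dummy scorer standing in for "train SVM on embeddings, return accuracy".
mean, std = ten_by_tenfold(100, lambda train, test: 0.8)
print(f"{mean:.4f} ± {std:.4f}")  # 0.8000 ± 0.0000
```

In practice the scorer would fit the classifier on the embedding vectors of the training indices and evaluate on the held-out fold; the mean ± std over the 100 scores corresponds to the entries of Table 3.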

4.1 Results
Accuracy. Table 3 shows the average classification accuracy and standard deviation of the compared graph classification methods on five bioinformatics datasets and two social network datasets. Overall, GC-GSL gives the best results compared with the traditional graph kernels and with graph classification based on frequent subgraph mining, and its accuracy is competitive with or better than FSG-Bin on all 7 datasets. Compared with other approaches combining graph kernels and neural networks, such as Deep WL, DDGK and AWE, the results of GC-GSL are also good, especially on the PROTEINS, NCI1 and NCI109 datasets. On three datasets (MUTAG, NCI1 and NCI109) its accuracy exceeds 80%. In summary, GC-GSL is highly effective on large datasets such as PROTEINS and the NCI datasets, with classification accuracy better than the other methods, and it is robust and stable, as implied by its small standard deviations. GC-GSL is effective because it automatically learns graph information and structure at both the local and the global level; moreover, in training the graph embedding, GC-GSL also learns relationships between graphs.

Table 3. Average accuracy (± std dev.) for our method GC-GSL and state-of-the-art baselines on benchmark datasets. Bold font marks the best performance in a column.

          MUTAG         PROTEINS      NCI1          NCI109        PTC_MR        IMDB-B        IMDB-M
WL        80.72 ± 3.00  72.92 ± 0.56  80.13 ± 0.50  80.22 ± 0.34  56.97 ± 2.01  –             –
Deep WL   82.94 ± 2.68  73.30 ± 0.82  80.31 ± 0.46  80.32 ± 0.33  59.17 ± 1.56  –             –
DDGK      91.58 ± 6.74  –             68.10 ± 2.30  –             63.14 ± 6.57  –             –
AWE       87.87 ± 9.76  70.01 ± 2.52  62.72 ± 1.67  63.21 ± 1.42  59.14 ± 1.83  74.45 ± 5.83  51.54 ± 3.61
FSG-Bin   81.58 ± 0.08  71.61 ± 0.03  77.01 ± 0.03  74.58 ± 0.02  60.29 ± 0.05  64.40 ± 0.05  46.53 ± 0.04
GC-GSL    83.86 ± 2.16  76.55 ± 1.02  82.04 ± 0.45  81.86 ± 0.33  60.11 ± 1.17  68.46 ± 1.12  46.39 ± 0.44


Embedding. The graph embedding neural network transforms the graph dataset into a set of embedding vectors, one per graph, each of length 128. To visualize this result, the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm is used to non-linearly reduce the 128-dimensional embedding vectors to 2 dimensions. Figure 2 shows the resulting visualization for the 7 datasets, with differently colored circles representing different graph classes. The visualizations of the MUTAG and PROTEINS datasets show a clear separation between the two classes. For NCI1 and NCI109, although the separation is not as clear as for MUTAG, each class is subdivided into several sub-clusters, and sub-clusters of the two classes are distinguishable. In the PTC_MR dataset, the two classes are complex and intertwined. The IMDB-BINARY dataset looks quite similar to the NCI datasets but is sparser.

Fig. 2. The visualization of results after the graph embedding in 2-dimensions of five bioinformatics datasets and two social network datasets.


In the IMDB-MULTI dataset, the blue circles form distinct partitions, with the green and orange circles distributed around them. In conclusion, the graph embedding is trained on substructures and feature information, so graphs with similar substructures and information lie closer together.
4.2 Discussions
We discuss each component of GC-GSL in turn. First, the proposed topological attributes vector helps the embedding network learn more general information about the graph. Although it carries information from many distinct aspects of the graph, these features only concern the graph topology. Other useful information, such as node and edge attributes, is not yet considered; for example, in social network analysis, personal information such as gender and age is extremely useful in addition to the connections between people. Second, rooted subgraph mining is more effective when the Weisfeiler-Lehman relabeling method is applied, but these subgraphs are only local. This is the main reason the topological attributes vector was proposed: it helps the network learn more global information, but it only partially addresses the problem. Finally, the neural network used to train the graph embedding is effective. As Fig. 2 shows, graphs with similar substructures lie close to each other, while graphs with different substructures lie far apart. Consequently, classification on the embedding vectors gives satisfactory results (see Table 3).
The evaluation on the experimental datasets confirms this, with several datasets reaching accuracy over 80% and outperforming the other graph classification methods. Moreover, besides classification, the graph embeddings can be used for many other graph-level tasks such as clustering and community detection.
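A visualization like Fig. 2 can be reproduced from the learned 128-dimensional embeddings with t-SNE. This sketch assumes scikit-learn and NumPy are available and substitutes random clustered data for the real trained embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for trained embeddings: two loose clusters of 128-d vectors.
embeddings = np.vstack([rng.normal(0.0, 1.0, (50, 128)),
                        rng.normal(3.0, 1.0, (50, 128))])
labels = np.array([0] * 50 + [1] * 50)

# Non-linear reduction from 128 to 2 dimensions, as used for Fig. 2.
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(points.shape)  # (100, 2)

# points[:, 0] and points[:, 1] can then be scattered, colored by `labels`.
```

With real embeddings, well-separated clusters in the 2-D plot indicate that graphs of the same class were mapped to nearby vectors, which is what Fig. 2 shows for MUTAG and PROTEINS.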

5 Conclusion
Based on the idea of doc2vec in NLP for automatically learning document embeddings, we applied it to graph data and proposed a novel graph classification method based on graph structure learning, named GC-GSL. GC-GSL avoids the subgraph isomorphism testing problem and is easily applied to real-world problems with large datasets; it is highly scalable and does not require the complex implementations of graph kernels or of graph classification based on frequent subgraph mining. Experiments on bioinformatics and social network datasets show good and effective results. Nevertheless, limitations remain, as noted in the discussion: the subgraphs are local, and other information in the graph, such as node and edge attributes, is still unexplored. Improving classification by overcoming these limitations is a probable direction for future work, as are practical improvements in speed and scalability for larger datasets.


References
1. Szklarczyk, D., et al.: STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47(D1), D607–D613 (2019)
2. Trinajstic, N.: Chemical Graph Theory. CRC Press (2018)
3. Siew, C.S., Wulff, D.U., Beckage, N.M., Kenett, Y.N.: Cognitive network science: a review of research on cognition through the lens of network representations, processes, and dynamics. Complexity 2019, 2108423 (2019)
4. Lanciano, T., Bonchi, F., Gionis, A.: Explainable classification of brain networks via contrast subgraphs. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3308–3318 (2020)
5. Tabassum, S., Pereira, F.S., Fernandes, S., Gama, J.: Social network analysis: an overview. Wiley Interdisc. Rev.: Data Min. Knowl. Discovery 8(5), e1256 (2018)
6. Chen, X., Jia, S., Xiang, Y.: A review: knowledge reasoning over knowledge graph. Expert Syst. Appl. 141, 112948 (2020)
7. Domingo-Fernández, D., et al.: COVID-19 knowledge graph: a computable, multi-modal, cause-and-effect knowledge model of COVID-19 pathophysiology. Bioinformatics 37(9), 1332–1334 (2021)
8. Shervashidze, N., Schweitzer, P., Van Leeuwen, E.J., Mehlhorn, K., Borgwardt, K.M.: Weisfeiler-Lehman graph kernels. J. Mach. Learn. Res. 12(9), 2539–2561 (2011)
9. Kriege, N.M., Johansson, F.D., Morris, C.: A survey on graph kernels. Appl. Netw. Sci. 5(1), 1–42 (2019). https://doi.org/10.1007/s41109-019-0195-3
10. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 1–27 (2011)
11. Vishwanathan, S.V.N., Schraudolph, N.N., Kondor, R., Borgwardt, K.M.: Graph kernels. J. Mach. Learn. Res. 11, 1201–1242 (2010)
12. Borgwardt, K.M., Kriegel, H.P.: Shortest-path kernels on graphs. In: Fifth IEEE International Conference on Data Mining (ICDM'05), 8 pp. IEEE (2005)
13.
Nikolentzos, G., Meladianos, P., Rousseau, F., Stavrakas, Y., Vazirgiannis, M.: Shortest-path graph kernels for document similarity. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1890–1900 (2017)
14. Horváth, T., Gärtner, T., Wrobel, S.: Cyclic pattern kernels for predictive graph mining. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 158–167 (2004)
15. Shervashidze, N., Vishwanathan, S.V.N., Petri, T., Mehlhorn, K., Borgwardt, K.: Efficient graphlet kernels for large graph comparison. In: Artificial Intelligence and Statistics, pp. 488–495. PMLR (2009)
16. Ramon, J., Gärtner, T.: Expressivity versus efficiency of graph kernels. In: Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences, pp. 65–74 (2003)
17. Fei, H., Huan, J.: Structure feature selection for graph classification. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 991–1000 (2008)
18. Kong, X., Yu, P.S.: Semi-supervised feature selection for graph classification. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 793–802 (2010)
19. Schöning, U.: Graph isomorphism is in the low hierarchy. J. Comput. Syst. Sci. 37(3), 312–323 (1988)
20. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196. PMLR (2014)


21. Yanardag, P., Vishwanathan, S.V.N.: Deep graph kernels. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374 (2015)
22. Al-Rfou, R., Perozzi, B., Zelle, D.: DDGK: learning graph representations for deep divergence graph kernels. In: The World Wide Web Conference, pp. 37–48 (2019)
23. Ivanov, S., Burnaev, E.: Anonymous walk embeddings. In: International Conference on Machine Learning, pp. 2186–2195. PMLR (2018)
24. Rousseau, F., Kiagias, E., Vazirgiannis, M.: Text categorization as a graph classification problem. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1702–1712 (2015)
25. Wang, H., et al.: Incremental subgraph feature selection for graph classification. IEEE Trans. Knowl. Data Eng. 29(1), 128–142 (2016)
26. Yan, X., Han, J.: gSpan: graph-based substructure pattern mining. In: Proceedings of the 2002 IEEE International Conference on Data Mining, pp. 721–724. IEEE (2002)
27. Huan, J., Wang, W., Prins, J.: Efficient mining of frequent subgraphs in the presence of isomorphism. In: Third IEEE International Conference on Data Mining, pp. 549–552. IEEE (2003)

Relearning Ensemble Selection Based on New Generated Features

Robert Burduk

Department of Systems and Computer Networks, Wroclaw University of Science and Technology, Wroclaw, Poland
[email protected]

Abstract. Ensemble methods are meta-algorithms that combine several base machine learning techniques to increase the effectiveness of classification. Many existing committees of classifiers use a classifier selection process to determine the optimal set of base classifiers. In this article, we propose a classifier selection framework with relearning of the base classifiers. Additionally, the proposed framework uses newly generated features, which are obtained after the relearning process. The proposed technique was compared with state-of-the-art ensemble methods on three benchmark datasets and one synthetic dataset. Four classification performance measures are used to evaluate the proposed method.

Keywords: Combining classifiers · Ensemble of classifiers · Classifier selection · Feature generation

1 Introduction

The purpose of supervised classification is to assign a predefined class label to a recognized object using the known features of this object. The goal of the classification system is therefore to map the feature space of the object into the space of class labels. This goal can be fulfilled using one classification model (a base classifier) or a set of base models, called an ensemble, a committee of classifiers, or a multiple classifier system. The multiple classifier system (MCS) is essentially composed of three stages: 1) generation, 2) selection, and 3) aggregation or integration. The aim of the generation phase is to create base classification models, which are assumed to be diverse. In the selection phase, one classifier (classifier selection) or a certain subset of the classifiers learned at the earlier stage is selected (ensemble selection (ES), also called ensemble pruning). The final effect of the integration stage is the class label, which is the final decision of the ensemble of classifiers.

There are two approaches to ensemble selection, static and dynamic [5]. An approach that combines the features of static and dynamic ensemble selection has also been proposed [8]. Regardless of the division criteria for ensemble selection, none of the known algorithms uses relearning of the classification models in the selection process. The observation that ensemble selection methods with relearning are lacking has prompted us to undertake this research problem, whose aim is to develop and experimentally verify a relearning ensemble selection method. Given the above, the main objectives of this work can be summarized as follows:

– A proposal of a new relearning ensemble selection framework.
– A proposal of the feature generation that is used to learn the second-level base classifiers used in the ensemble selection process.
– An experimental setup to compare the proposed method with other MCS approaches using different classification performance measures.

This paper is organized as follows: Sect. 2 presents works related to classifier selection. Section 3 presents the proposed approach to relearning ensemble selection based on new generated features. Section 4 presents the experiments that were carried out and a discussion of the obtained results. Finally, we conclude the paper in Sect. 5.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 282–291, 2022. https://doi.org/10.1007/978-3-031-21967-2_23

2 Related Works

For about twenty years, the literature on classification systems has considered the problem of using more than one base classifier at the same time to decide whether an object belongs to a class label [10,12,14]. During this period, multiple classifier systems have been used in many practical applications [15], and ensemble pruning has proved to have a significant impact on the performance of recognition systems that use an ensemble of classifiers. The taxonomy of selection methods distinguishes static and dynamic selection. The static pruning process selects one base classifier or a certain subset of base classifiers that is invariable throughout the entire feature space or defined feature subspaces. In the case of dynamic selection, knowledge about the neighborhood of the newly classified object (most often defined by a fixed number of nearest neighbors) is used to determine one base classifier or a certain subset of base classifiers for the classification of the new object. The discussed topic is still up-to-date, as evidenced by propositions of new ensemble selection methods. A dynamic programming-based ensemble pruning algorithm is proposed in [1], where cooperative game theory is used in the first phase of the selection and a dynamic programming approach in the second phase of the selection procedure. The Frienemy Indecision Region Dynamic Ensemble Selection framework [9] pre-selects base classifiers before applying dynamic ensemble selection techniques. This method analyzes the new object and its region of competence to decide whether or not it is located in an indecision region (a region of competence with samples from different classes). The method proposed in [4] enhances the previous algorithm and reduces the overlap of classes in the validation set. It also defines the region of competence using an equal number of samples from each class. The combination


of the static and dynamic ensemble selection was proposed in [8]. In this ensemble selection method, a base classifier is selected to predict a test sample if the confidence in its prediction is higher than its credibility threshold. The credibility thresholds of the base classifiers are found by minimizing the empirical 0–1 loss on the entire set of training observations. An optimization-based approach to ensemble pruning is proposed in [2], where an objective function used in the selection process is derived from an information entropy perspective. This function takes diversity and accuracy into consideration both implicitly and simultaneously. The problem of ensemble selection is also considered from the point of view of the difficulty of the data sets being classified [3]. An example of difficult data are imbalanced datasets, in which there is a large disproportion in the number of objects from particular class labels. As shown in [11], dynamic ensemble selection, due to the nearest-neighbor analysis of the newly classified object, leads to better classification performance than static ensemble pruning. The subject of classifier selection for imbalanced data is up-to-date, and existing methods are being modified to obtain higher values of classification performance measures [7].

3 The Proposed Framework

We propose an ES framework consisting of the following steps: (1) learning base classifiers, (2) relearning base classifiers, (3) feature generation based on the learned and relearned base classifiers, (4) learning a second-level base classifier based on the new feature vector, and (5) selection of base classifiers based on the second-level classification result.

3.1 Generation of Diverse Base Classifiers

Generating a diverse set of base classifiers is one of the essential points in creating an MCS. We propose that the set of homogeneous base classifiers be generated using the bagging method. We denote by D_k the training dataset of the base classifier Ψ_k^B, where k ∈ {1, ..., K} and K is the number of base classifiers.

3.2 Relearning Base Classifiers

Suppose we consider the problem of binary classification. Then each object can belong to one of the two class labels {−1, 1}. We define two new training sets, DS^(−1) and DS^(1). These datasets contain the objects from D_k and a new object (or set of objects) with an arbitrarily assigned class label. For example, in the dataset DS^(−1), the new object or set of objects has the arbitrarily assigned class label −1. These datasets are used to relearn the base classifiers separately, producing the base classifiers labeled Ψ_k^{R(−1)} and Ψ_k^{R(1)}.

3.3 Feature Generation Based on Learned and Relearned Base Classifiers

The object x_i is represented in the d-dimensional feature space as a vector x_i = (x_i^1, ..., x_i^d). Based on the learned base classifiers Ψ_k^B and the relearned base classifiers Ψ_k^{R(−1)} and Ψ_k^{R(1)}, we can generate new features for the new object or set of objects. We propose to add the following new features to the object vector:

– the score returned by the base classifier, Ψ_k^B(x_i),
– the difference between the score functions returned by the base and relearned classifiers for the added class label −1: σ_{i,k}^{(−1)} = |Ψ_k^B(x_i) − Ψ_k^{R(−1)}(x_i)|,
– the difference between the score functions returned by the base and relearned classifiers for the added class label 1: σ_{i,k}^{(1)} = |Ψ_k^B(x_i) − Ψ_k^{R(1)}(x_i)|,
– the information whether the object x_i is correctly classified by the base classifier, I(Ψ_k^B(x_i), ω_{x_i}),

where I(Ψ_k^B(x_i), ω_{x_i}) returns 0 when the object x_i is incorrectly classified by the base classifier Ψ_k^B and 1 in the case of correct classification.
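The relearning-based feature generation described above can be sketched as follows. This is a minimal illustration on hypothetical toy data, not the paper's implementation (which uses SAS): scikit-learn's SVC stands in for the SVM base classifier, and its decision_function for the score function.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy training fold D_k: two Gaussian blobs with labels -1 / +1.
X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(1, 1, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

base = SVC(kernel="rbf", C=1.0).fit(X, y)          # Psi_k^B

def relearn_with(x_new, label):
    """Retrain the base classifier with x_new injected under an arbitrary label."""
    X_aug = np.vstack([X, x_new])
    y_aug = np.append(y, label)
    return SVC(kernel="rbf", C=1.0).fit(X_aug, y_aug)

x0 = np.array([[0.2, -0.1]])                        # newly classified object
rel_neg = relearn_with(x0, -1)                      # Psi_k^{R(-1)}
rel_pos = relearn_with(x0, +1)                      # Psi_k^{R(1)}

s_base = base.decision_function(x0)[0]
sigma_neg = abs(s_base - rel_neg.decision_function(x0)[0])   # sigma^{(-1)}
sigma_pos = abs(s_base - rel_pos.decision_function(x0)[0])   # sigma^{(1)}

# Extended feature vector (x_0, Psi_k^B(x_0), sigma^{(-1)}, sigma^{(1)})
x0_ext = np.concatenate([x0[0], [s_base, sigma_neg, sigma_pos]])
print(x0_ext.shape)  # → (5,)
```

Intuitively, an object whose injected label barely moves the decision boundary yields small σ values; objects in uncertain regions move it more.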

Fig. 1. Correctly (blue points) and incorrectly (red points) classified objects in the (σ_{i,k}^{(−1)}, σ_{i,k}^{(1)}) feature space. (Color figure online)

The visualization of the two new features σ_{i,k}^{(−1)}, σ_{i,k}^{(1)} together with I(Ψ_k^B(x_i), ω_{x_i}), for one base classifier Ψ_k^B and one set of objects x_i from the validation dataset (the synthetic Highleyman dataset used in the experiments), is shown in Fig. 1. The red points represent objects incorrectly classified by the base classifier Ψ_k^B, while the blue points represent correctly classified objects. As can easily be seen, objects incorrectly classified in the space defined by the features σ_{i,k}^{(−1)}, σ_{i,k}^{(1)} are located close to each other. This observation is the basis for the next step, in which the generated features are used in the ES process.

3.4 Learning a Second-Level Base Classifier Based on the New Feature Vector

In this step, we propose to learn base classifiers using the dataset created in the previous step. This means that the second-level base classifiers also use the features generated during the previous phase. The learning process considers only objects for which the new features have been generated. The aim of learning at the second level is to build a model that determines, for the newly classified object x_0 and its additional features, whether a base classifier is removed from the pool.

3.5 Selection of Base Classifiers Based on the Second-Level Classification Result

The results from the previous step are used in the ES process. The object x_0, with an arbitrarily assigned class label, is used to retrain the base classifiers. Next, the new features of the object x_0 are generated according to the previously described procedure. The second-level base classifiers are trained on the new feature vectors. If the result of the classification at the second level is uncertain, i.e., the value of the score function is close to zero, then the given base classifier is not selected for the pool of classifiers. In the adopted binary classification model, the value of the base classifier's score function belongs to the range (−1, 1); therefore, the most uncertain results of the classifier's evaluation are close to zero. The entire proposed ES framework for the binary problem is presented in Algorithm 1.

4 Experiments

4.1 Experimental Setup

The implementation of SVM from SAS 9.4 Software was used to conduct the experiments. We used the radial basis function as the kernel function. The regularization parameter C was searched in the set C ∈ {0.001, 0.01, 0.1, 1, 10, 100} using a grid search procedure [13]. This parameter was searched separately for training the first-level and second-level base classifiers. In the experiments, the 5 × 2 cross-validation method was used. As a validation dataset we use the learning dataset, which means that we use the resubstitution method for new feature generation. The hyperparameter τ for ensemble selection was set to 0.05.
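The tuning procedure above can be sketched as follows — a scikit-learn stand-in (not the SAS 9.4 implementation used in the paper) on hypothetical synthetic data, with 5 × 2 cross-validation expressed as five repeats of 2-fold splitting:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

# Hypothetical stand-in data; the paper uses UCI and synthetic datasets.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 5x2 cross-validation: 5 repeats of a 2-fold split, as in the paper.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}

# Grid search over C for an RBF-kernel SVM, scored by AUC.
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=cv, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_["C"])
```

The same search would be run twice: once for the first-level base classifiers and once for the second-level ones.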


Algorithm 1: Relearning ensemble selection algorithm based on new generated features – for the binary problem

Input: Dataset D, new object x_0, number of base classifiers K, hyperparameter τ for ensemble selection
Output: The ensemble decision after the relearning ensemble selection based on new generated features

1. Split D into the training dataset D_tr and the validation dataset D_va
2. Split D_tr into K folds D_1, ..., D_K by the bagging procedure
3. ∀D_k ∀x_i ∈ D_va: add x_i with class label −1 to D_k – new fold DS_k^{(−1)}
4. ∀D_k ∀x_i ∈ D_va: add x_i with class label 1 to D_k – new fold DS_k^{(1)}
5. Train the base classifier Ψ_k^B using D_k
6. Train the relearned base classifier for class label −1, Ψ_k^{R(−1)}, using DS_k^{(−1)}
7. Train the relearned base classifier for class label 1, Ψ_k^{R(1)}, using DS_k^{(1)}
8. ∀D_k ∀x_i ∈ D_k: calculate σ_{i,k}^{(−1)} = |Ψ_k^B(x_i) − Ψ_k^{R(−1)}(x_i)| and σ_{i,k}^{(1)} = |Ψ_k^B(x_i) − Ψ_k^{R(1)}(x_i)|
9. Create K new learning datasets D_k^R: ∀x_i ∈ D_k^R: x_{i,k}^R = (x_i^1, ..., x_i^d, Ψ_k^B(x_i), σ_{i,k}^{(−1)}, σ_{i,k}^{(1)})
10. Add x_0 with class label −1 to D_k – new fold DS_{0,k}^{(−1)}
11. Add x_0 with class label 1 to D_k – new fold DS_{0,k}^{(1)}
12. Create the new features for x_0 using the trained base classifiers Ψ_k^B and the relearned base classifiers trained on the datasets DS_{0,k}^{(−1)} and DS_{0,k}^{(1)}: x_{0,k} = (x_0^1, ..., x_0^d, Ψ_k^B(x_0), σ_{0,k}^{(−1)}, σ_{0,k}^{(1)})
13. Train the second-level base classifiers Ψ_k^{BSL} using D_k^R
14. If |Ψ_k^{BSL}(x_{0,k})| ≤ τ, then Ψ_k^{RES}(x_{0,k}) = 0, else Ψ_k^{RES}(x_{0,k}) = Ψ_k^B(x_{0,k})
15. The ensemble decision after the relearned ensemble selection:

    Ψ^{RES}(x_0) = sign( Σ_{k=1}^{K} Ψ_k^{RES}(x_{0,k}) ),

where Ψ_k^{RES}(x_0) returns a value in the range (−1, 0) for the predicted class label −1 and (0, 1) for the predicted class label 1.
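Algorithm 1 can be sketched end-to-end in a few dozen lines. This is an illustrative implementation on hypothetical toy data, not the paper's SAS code: scikit-learn's SVC stands in for the SVM, decision_function for the score, and the single-class guard for the second-level targets is my addition.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def make_data(n):
    # Two overlapping Gaussian blobs labeled -1 / +1 (toy stand-in data).
    X = np.vstack([rng.normal(-1, 1, (n, 2)), rng.normal(1, 1, (n, 2))])
    return X, np.array([-1] * n + [1] * n)

X_tr, y_tr = make_data(60)          # D_tr
X_va, y_va = make_data(20)          # D_va
K, tau = 5, 0.05

fit_svc = lambda X, y: SVC(kernel="rbf", C=1.0).fit(X, y)

def sigmas(base, Xk, yk, x):
    """sigma^(l) = |Psi^B(x) - Psi^{R(l)}(x)| for the injected labels l = -1, +1."""
    s = base.decision_function([x])[0]
    return [abs(s - fit_svc(np.vstack([Xk, x]), np.append(yk, l)).decision_function([x])[0])
            for l in (-1, 1)]

# Steps 1-9 and 13: bagging folds, base classifiers, second-level classifiers.
ensemble = []
for k in range(K):
    idx = rng.integers(0, len(X_tr), len(X_tr))              # bagging fold D_k
    Xk, yk = X_tr[idx], y_tr[idx]
    base = fit_svc(Xk, yk)                                   # Psi_k^B
    F, t = [], []
    for x, y in zip(X_va, y_va):
        F.append(np.r_[x, base.decision_function([x])[0], sigmas(base, Xk, yk, x)])
        t.append(1 if base.predict([x])[0] == y else -1)     # correctness as target
    # Guard (my addition): the second-level SVC needs both classes present.
    second = fit_svc(np.array(F), np.array(t)) if len(set(t)) == 2 else None
    ensemble.append((base, Xk, yk, second))

# Steps 10-15: relearn with x0 injected, build x_0k, select, and vote.
def predict(x0):
    total = 0.0
    for base, Xk, yk, second in ensemble:
        x0_ext = np.r_[x0, base.decision_function([x0])[0], sigmas(base, Xk, yk, x0)]
        selected = second is None or abs(second.decision_function([x0_ext])[0]) > tau
        if selected:
            total += base.decision_function([x0])[0]
    return 1 if total >= 0 else -1

print(predict(np.array([1.5, 1.2])), predict(np.array([-1.5, -1.2])))
```

Note that, unlike the paper's resubstitution setting, this sketch uses a held-out validation set for the second-level targets; swapping D_va for the learning data reproduces the paper's setup.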

Classification performance metrics such as the area under the curve (AUC), the G-mean (G), the F-1 score (F-1), and the Matthews correlation coefficient (MCC) have been used. As reference ensembles of classifiers, we use majority voting (Ψ^MV) and the sum rule without selection (Ψ^SUM). Three real datasets from the UCI repository [6] and one synthetic dataset were used in the experiments. The synthetic dataset used in the experiment is presented in Fig. 2. The datasets that were used to validate the algorithms are


Table 1. Descriptions of the datasets used in the experiments (name with abbreviation, number of instances, number of features, imbalance ratio).

Dataset                           | #inst | #f | Imb
Breast Cancer – original (Cancer) | 699   | 9  | 1.9
Liver Disorders (Bupa)            | 345   | 6  | 1.4
Pima (Pima)                       | 768   | 8  | 1.9
Synthetic dataset (Syn)           | 400   | 2  | 1

Fig. 2. Synthetic dataset used in the experiment.

presented in Table 1. The number of instances, features, and the imbalance ratio are included in the description.

4.2 Results

The experiments were conducted in order to compare the classification performance metrics of the proposed relearning ensemble selection based on new generated features algorithm, Ψ^RES, with the referential ensemble techniques: majority voting (Ψ^MV) and the sum rule without selection (Ψ^SUM). The results of the experiment for the four classification performance metrics are presented in Table 2. Bold type indicates the best result for each performance metric and each dataset separately.
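The two reference combiners and the four reported metrics can be sketched on toy scores as follows (my stand-ins, not the paper's SAS implementation; AUC, F-1, and MCC via scikit-learn, G-mean computed from sensitivity and specificity):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, matthews_corrcoef, recall_score

# Toy scores of K=3 base classifiers for 6 objects, each score in (-1, 1).
scores = np.array([[ 0.9, -0.2,  0.1],
                   [-0.8, -0.6,  0.7],
                   [ 0.3,  0.4, -0.9],
                   [-0.1, -0.2, -0.3],
                   [ 0.6,  0.5,  0.2],
                   [-0.4,  0.1, -0.7]])
y_true = np.array([1, -1, 1, -1, 1, -1])

y_sum = np.sign(scores.sum(axis=1)).astype(int)           # Psi^SUM: sum rule
y_mv  = np.sign(np.sign(scores).sum(axis=1)).astype(int)  # Psi^MV: majority voting

def report(y_pred, y_score):
    auc  = roc_auc_score(y_true, y_score)
    sens = recall_score(y_true, y_pred, pos_label=1)
    spec = recall_score(y_true, y_pred, pos_label=-1)
    # G-mean is the geometric mean of sensitivity and specificity.
    return auc, np.sqrt(sens * spec), f1_score(y_true, y_pred), matthews_corrcoef(y_true, y_pred)

print([round(v, 3) for v in report(y_sum, scores.sum(axis=1))])
# → [1.0, 0.816, 0.8, 0.707]
```

Note that the two combiners can disagree: the third object has a negative score sum but two of three positive votes.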

5 Discussion

Table 2. Results of classification.

Dataset | Method | AUC   | G     | F-1   | MCC
Cancer  | Ψ^SUM  | 0.942 | 0.940 | 0.941 | 0.806
        | Ψ^RES  | 0.944 | 0.943 | 0.944 | 0.812
        | Ψ^MV   | 0.942 | 0.940 | 0.941 | 0.806
Bupa    | Ψ^SUM  | 0.537 | 0.487 | 0.674 | 0.083
        | Ψ^RES  | 0.543 | 0.494 | 0.678 | 0.096
        | Ψ^MV   | 0.540 | 0.495 | 0.673 | 0.087
Pima    | Ψ^SUM  | 0.715 | 0.693 | 0.832 | 0.465
        | Ψ^RES  | 0.721 | 0.703 | 0.833 | 0.473
        | Ψ^MV   | 0.722 | 0.703 | 0.835 | 0.478
Syn     | Ψ^SUM  | 0.845 | 0.844 | 0.840 | 0.691
        | Ψ^RES  | 0.875 | 0.875 | 0.872 | 0.751
        | Ψ^MV   | 0.840 | 0.839 | 0.834 | 0.682

It should be noted that the proposed algorithm can improve the quality of classification compared to the reference methods. In particular, for the synthetic set, the improvement is visible for all analyzed metrics and exceeds 3%. In the case of the real datasets, the improvement in the values of the metrics is not as significant, but the method without selection, Ψ^SUM, is always worse than the proposed algorithm Ψ^RES. The conducted experiments concern one test scenario, in which the SVM base classifiers were learned using the bagging procedure. We treat our research as a preliminary study. The directions of further research include:

– evaluation on more datasets, with statistical analysis,
– evaluation of the computational complexity of the proposed algorithm,
– evaluation of larger groups of base classifiers,
– validating the method against other classifier selection methods,
– development of new features dedicated to the semi-supervised problem,
– development of new features dedicated to the problem of decomposition of a multi-class task that eliminates the problem of incompetent binary classifiers,
– development of new features dedicated to the imbalanced dataset problem.

6 Conclusions

This paper presents a new approach to the ES process. In the proposed approach, the ES process is based on second-level base classifiers. These classifiers are learned


based on the extended feature vector. The newly generated feature vector is obtained by relearning the base classifiers from the first level. The experimental results show that the proposed method can obtain better classification results than the reference methods. Such results were obtained for four different classification performance measures. The paper presents research on relearning ensemble selection as a preliminary study. Based on the promising results, future research will focus, among other things, on developing new features dedicated to semi-supervised, multi-class, and imbalanced dataset problems.

References

1. Alzubi, O.A., Alzubi, J.A., Alweshah, M., Qiqieh, I., Al-Shami, S., Ramachandran, M.: An optimal pruning algorithm of classifier ensembles: dynamic programming approach. Neural Comput. Appl. 32(20), 16091–16107 (2020). https://doi.org/10.1007/s00521-020-04761-6
2. Bian, Y., Wang, Y., Yao, Y., Chen, H.: Ensemble pruning based on objection maximization with a general distributed framework. IEEE Trans. Neural Netw. Learn. Syst. 31(9), 3766–3774 (2019)
3. Brun, A.L., Britto Jr., A.S., Oliveira, L.S., Enembreck, F., Sabourin, R.: A framework for dynamic classifier selection oriented by the classification problem difficulty. Pattern Recogn. 76, 175–190 (2018)
4. Cruz, R.M., Oliveira, D.V., Cavalcanti, G.D., Sabourin, R.: FIRE-DES++: enhanced online pruning of base classifiers for dynamic ensemble selection. Pattern Recogn. 85, 149–160 (2019)
5. Cruz, R.M., Sabourin, R., Cavalcanti, G.D.: Dynamic classifier selection: recent advances and perspectives. Inf. Fusion 41, 195–216 (2018)
6. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml
7. Junior, L.M., Nardini, F.M., Renso, C., Trani, R., Macedo, J.A.: A novel approach to define the local region of dynamic selection techniques in imbalanced credit scoring problems. Expert Syst. Appl. 152, 113351 (2020)
8. Nguyen, T.T., Luong, A.V., Dang, M.T., Liew, A.W.C., McCall, J.: Ensemble selection based on classifier prediction confidence. Pattern Recogn. 100, 107104 (2020)
9. Oliveira, D.V., Cavalcanti, G.D., Sabourin, R.: Online pruning of base classifiers for dynamic ensemble selection. Pattern Recogn. 72, 44–58 (2017)
10. Piwowarczyk, M., Muke, P.Z., Telec, Z., Tworek, M., Trawiński, B.: Comparative analysis of ensembles created using diversity measures of regressors. In: 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 2207–2214. IEEE (2020)
11. Roy, A., Cruz, R.M., Sabourin, R., Cavalcanti, G.D.: A study on combining dynamic selection and data preprocessing for imbalance learning. Neurocomputing 286, 179–192 (2018)
12. Sagi, O., Rokach, L.: Ensemble learning: a survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8(4), e1249 (2018)
13. Scholkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning Series (2018)
14. Woźniak, M., Graña, M., Corchado, E.: A survey of multiple classifier systems as hybrid systems. Inf. Fusion 16, 3–17 (2014)
15. Zhang, C., Ma, Y.: Ensemble Machine Learning: Methods and Applications. Springer, NY (2012). https://doi.org/10.1007/978-1-4419-9326-7

Random Forest in Whitelist-Based ATM Security

Michal Maliszewski¹ and Urszula Boryczka²

¹ Software Security Group, Diebold Nixdorf, Katowice, Poland
[email protected]
² Institute of Computer Science, University of Silesia, Sosnowiec, Poland
[email protected]

Abstract. Accelerated by the COVID-19 pandemic, the trend of highly sophisticated logical attacks on Automated Teller Machines (ATMs) is ever-increasing. Due to the nature of these attacks, it is common to use zero-day protection for the devices. The most secure solutions available use whitelist-based policies, which are extremely hard to configure. This article presents the concept of a semi-supervised decision support system based on the Random forest algorithm for generating a whitelist-based security policy from ATM usage data. The obtained results confirm that the Random forest algorithm is effective in such scenarios and can be used to increase the security of ATMs.

Keywords: Whitelisting · Random forest · Software security · Semi-supervised learning · Decision support system

1 Introduction

In 2020, the number of automated teller machines exceeded 41 per 100,000 adults, which meant over 3 million ATMs around the globe [8]. Millions of people use ATMs every day, which is why the security of cash systems is of utmost importance. ATMs are exposed to two types of attacks: physical and logical, with the latter becoming more frequent in recent years. Europe currently has one of the highest ATM security standards, making it one of the safest areas due to constant ATM surveillance. However, that did not stop the threat of logical attacks from becoming more and more noticeable (see Fig. 1). It is assumed that the losses generated by logical attacks are hundreds of times more significant on other continents, particularly in Asia and South America. However, there is no reliable data about the scale of losses caused by logical attacks, as it is common not to report detected incidents for fear of losing customers' trust. A large part of logical attacks is prepared for a specific type of device, considering the installed software stack and its potential security issues. We refer here to zero-day vulnerability, a recently popular term due to a security flaw

Supported by Diebold Nixdorf.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 292–301, 2022. https://doi.org/10.1007/978-3-031-21967-2_24


Fig. 1. ATM malware and logical attacks in Europe [3]

in the Apache logging library, Log4j, widely used across Java applications [13]. The term zero-day refers to a security flaw that is unknown to the vendor, or known but without a valid fix yet. This, in turn, can serve as a backdoor for cybercriminals, who can exploit the flaw for their benefit [2]. In the case of ATMs, an undetected exploit may lead to the theft of money or customer data, or infect other parts of the ATM network. As the potential threat is unknown, standard protection mechanisms such as intrusion detection systems and anti-virus software are not the preferred form of ATM protection. To ensure the security of the devices against yet unknown threats, the whitelist approach is used. Whitelisting is a cybersecurity strategy that allows programs to run only if the defined security policy explicitly permits them to do so. The goal of whitelisting is, most importantly, to protect computers and networks from harmful applications and, to a lesser extent, to prevent unnecessary demand for resources [17]. Thanks to a precise security policy that defines what can and cannot run in the device's operating system, whitelists can block all zero-day attacks with great efficiency. Unfortunately, this kind of protection is difficult to configure, as it requires in-depth knowledge about the software installed on the device and how it works. For the purposes of this study, we used Diebold Nixdorf's Vynamic® Security Intrusion Protection software, which uses the whitelist mechanism to create a security policy for the protected devices. For the sake of simplicity, we will refer to this software in the later parts of this article as Intrusion Protection. Each program supposed to work in the operating system (OS) protected by a whitelisting solution needs to be defined in the proper security policy rules. Those rules define the allowed behavior of the program, for example, whether the program a should be able to write to the resource b.
For this research, by creating a security policy we mean the classification of the programs installed on the ATM into the security rules defined by Intrusion Protection.
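The default-deny semantics of such a rule set can be sketched as follows. The rules below are entirely hypothetical illustrations, not Intrusion Protection's actual policy format:

```python
from fnmatch import fnmatch

# Hypothetical whitelist rules: (program pattern, resource pattern, permission).
RULES = [
    ("C:/Protopas/bin/*.exe", "HKLM/Software/*", "read"),
    ("C:/Program Files/MyApp/MyApp.exe", "C:/Program Files/MyApp/logs/*", "write"),
]

def is_allowed(program, resource, permission):
    """Default-deny: a request passes only if some rule explicitly permits it."""
    return any(fnmatch(program, p) and fnmatch(resource, r) and permission == perm
               for p, r, perm in RULES)

print(is_allowed("C:/Protopas/bin/pslist.exe", "HKLM/Software/MS/Perflib", "read"))   # → True
print(is_allowed("C:/Temp/dropper.exe", "C:/Windows/System32/cmd.exe", "write"))      # → False
```

Anything not matched by a rule — including a zero-day payload — is simply denied, which is why whitelisting blocks unknown threats but also why an incomplete policy breaks legitimate software.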

2 Related Work

The first step in introducing whitelist-based security to the ATM is to collect data about the software installed on the target device. This can be done by using discrete event systems, as proposed by T. Klerx, M. Anderka, H. K. Büning, and S. Priesterjahn in [9]. In the mentioned research, the data was used to detect anomalies in the ATM's OS. Intrusion Protection's capabilities allowed us to collect the data for our research. R. Barbosa, R. Sadre, and A. Pras presented in 2013 an idea of a whitelisting solution for SCADA networks, which incorporated a learning phase in which the flow whitelist was learned by capturing network traffic over time and aggregating it [1]. We use a similar strategy in the ATM environment, where the data about the installed applications and the needed resources is collected during the supervising phase. Unfortunately, Intrusion Protection collects information about all processes running in the operating system, including operations done by the system's tools. It is estimated that the amount of noise in the collected dataset might exceed 90%. Any machine learning approach might fail in such a scenario. In 2017, we proposed a set of filters, which are also used to solve this problem [10]. As for ATM security itself, there are not many publications on the subject. Most of them are based on an analysis of the threats or the current situation. However, as part of our work on ATM security, we published articles on the use of grouping algorithms to protect these systems from logical attacks using the sandbox mechanism [11,12]. The solutions we proposed are extremely effective in limiting the spread of viruses; yet, they do not guarantee such comprehensive protection as the whitelisting approach discussed in this study.

3 Test Procedure

The test procedure was performed on three ATMs running Windows 10 2019 LTSC in Diebold Nixdorf's laboratory in Katowice, Poland. In the first step, we ensured that the necessary software was installed and that the hardware components worked together with the software as expected, by simulating standard actions done with the ATM, such as checking the account balance or withdrawing cash. The software stacks on all machines were similar but not identical. Intrusion Protection can be used in the so-called monitoring mode, in which every process is under control but not yet blocked. As a result, the controlling program provides detailed log files, including an exact list of all programs and the resources, such as files or registries, they try to use. An example of a single data record, transformed for the purpose of this research, is presented in Table 1. After the ATMs were fully operational, the test procedure was performed. The tests simulated standard ATM operations, such as checking the account balance and withdrawing cash, as well as less common activities like starting the service mode. We used the test automation framework Ranorex to record the test sequence [7]. Every action was repeated several times during the tests.


Table 1. Sample Intrusion Protection event log entry

Attribute   | Value
Event type  | Registry
Permissions | Read (0002001f)
Operation   | Open Registry Key
Disposition | Allow
Program     | C:/Protopas/bin/pslist.exe
Resource    | HKLM/Software/MS/Windows NT/CurrentVersion/Perflib
Agent state | XB
Msg type    | Warning

4

Data Pre-processing

The preparation of appropriate pre-processing mechanisms is a time-consuming procedure. However, it is necessary to use classification methods to secure ATMs based on whitelists. Due to the complexity and high detail of the solutions used, we have only decided to list them below: – Noise Removal - It is estimated that 90–95% of the data incoming from the monitoring system is noise. To solve this problem, we used the filtering mechanisms mentioned in our previous paper [10]. – Unnecessary Events Removal - we omitted events other than requests for access to files, registries, memory, or network resources. – To ensure the duplicates are correctly removed, we convert all path-related attributes to lowercase. – Duplicates Removal - there is no need to classify the same program and the resource requested by it more than once.

296

M. Maliszewski and U. Boryczka

– Program arguments are not taken into account, so two calls of program p for resource r with or without parameters are treated in the same way. – Data attributes removal - we selected only a subset of 8 out of 33 available attributes with the highest information ratio, that includes paths of the programs and their resources, mentioned in the next point. – The attributes containing paths to programs and their resources are split into multiple arguments for further use, e.g., the C:/Program Files/MyApp.exe program is formatted to the following three attributes: C:, Program Files and MyApp.exe. The division allows for more precise classification using Random forest as both attributes have the highest information ratio out of 8 selected (see previous point), which provides to correlation of the trees. The last of these points is particularly important. In the classification process, we use a Random forest algorithm, which is based on decision trees. The decision trees tend to correlate when some of the arguments of the test set are exceptionally strong predictors [5]. Both, programs and their resources are the strongest predictors in our datasets, which is presented in Table 2. Table 2. Information Gain Ratio for the tested datasets before splitting the attributes Attribute

IGR(Ex, a)

Resource

0,5183

Program

0,1880

Permissions 0,1616 Agent state 0,0627 Event user

0,0306

Disposition

0,0154

Event type

0,0148

Operation

0,0071

Msg type

0,0020
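The pre-processing steps listed above (lowercasing, duplicate removal, and path splitting) can be sketched as follows. The helper names and the fixed widths of 10 program-path and 11 resource-path attributes follow the description in the text, but the code is our illustration, not the authors' implementation.

```python
def split_path(path: str, levels: int) -> list:
    """Lowercase a path and split it into a fixed number of per-level
    attributes, padding with empty strings so every record has the same width."""
    parts = [p for p in path.lower().replace("\\", "/").split("/") if p]
    return (parts + [""] * levels)[:levels]

def preprocess(events):
    """Drop duplicate (program, resource) pairs and expand both paths."""
    seen = set()
    records = []
    for program, resource in events:
        key = (program.lower(), resource.lower())
        if key in seen:          # Duplicates Removal
            continue
        seen.add(key)
        # Program paths expand to 10 attributes and resource paths to 11,
        # matching the maximum path depth reported in the paper (21 + 6
        # remaining attributes = 27 in total).
        records.append(split_path(program, 10) + split_path(resource, 11))
    return records

rows = preprocess([
    ("C:/Program Files/MyApp.exe", "C:/Users/atm/logs/app.log"),
    ("c:/program files/myapp.exe", "c:/users/atm/logs/app.log"),  # duplicate
])
print(len(rows), rows[0][:3])
```

Lowercasing before deduplication ensures that the two example events above collapse into a single record.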

Random Forest in Whitelist-Based ATM Security

The authors used the Information Gain Ratio (IGR(Ex, a)) coefficient proposed by Quinlan to calculate the importance of individual attributes [15]. Quinlan's coefficient measures the relationship between two random variables: intuitively, given two variables a and b, it is possible to measure how much information about one variable can be obtained by knowing the other. The gain ratio is therefore the ratio between the information gain and the intrinsic value of the information. Table 2 presents the Information Gain Ratio of the attributes in relation to their category. The Program attribute has been split into 10 attributes, while the Resource attribute after pre-processing comprised 11 attributes. This is equal to the maximum length (number of levels) of the access paths across the datasets. It is worth noting that such a modification of the data means an increase in the number of analyzed attributes from 8 to 27. Please note that there is another strong predictor in the dataset, the permissions, which, as a single hexadecimal number, defines the type of permissions a program requires to access a resource, as explained in Sect. 3. Unfortunately, due to its nature, we could not modify it without a significant drop in classification accuracy.
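For illustration, Quinlan's gain ratio for a single nominal attribute can be sketched in a few lines (our own sketch; the toy attribute values and category labels below are invented, not taken from the ATM datasets):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def igr(values, classes):
    """Information Gain Ratio of one nominal attribute.

    values:  the attribute value of each record
    classes: the category of each record
    """
    n = len(values)
    gain = entropy(classes)          # start from the class entropy...
    split_info = 0.0
    for value, count in Counter(values).items():
        subset = [c for v, c in zip(values, classes) if v == value]
        gain -= (count / n) * entropy(subset)       # ...subtract conditional entropy
        split_info -= (count / n) * log2(count / n)  # intrinsic value of the split
    return gain / split_info if split_info else 0.0

# An attribute that perfectly separates the two categories has IGR = 1.0:
val = igr(["exe", "exe", "dll", "dll"], ["allow", "allow", "deny", "deny"])
print(val)
```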

5 Data Classification

Setting up a security policy that uses whitelists is a manual process. The person responsible for device security must determine how the software installed on the ATM operates and classify it against the predefined security rules to ensure proper operation. For example, program p must get permission to write to directory r, where it can store its logs. Intrusion Protection provides a list of rules by which the running software and its resources can be classified to keep it running properly.

In this research we use Random forest to classify the data. Random forest is an ensemble machine learning method for classification and regression problems that constructs multiple decision trees during training and outputs the class that is the mode of the classes of the individual trees (classification) or their mean prediction (regression) [6,18]. Random forests reduce the tendency of decision trees to overfit the training set [4]. The results of the classification are added to the training set, so it is possible to use the newly aggregated data in subsequent classifications. The user should be able to supervise the decisions made by the algorithm before storing the data and, if necessary, adjust its choices to enhance the quality of the solution.

The decision trees included in the Random forest are generated using the J48 algorithm, an open source implementation of the C4.5 algorithm written in Java [16]. The algorithm creates a decision tree using the concept of information entropy. The training data is a set S = {s1, ..., sn}, where each element si = (pi, ri) is presented as a pair of program and resource paths, additional attributes, and the category to which the record belongs (see Table 1 for reference).

In the case of whitelist-based protection, the classification is not binary. Intrusion Protection provides many categories to which a given program can be assigned, but in the case of our datasets, only some apply. Each record can be classified into exactly one category. Table 3 presents the categories available for the classification of the collected datasets.

Table 3. The list of categories available for classification

Category           Description
Registry-readonly  Allows users to open the system registry and read through its content
Registry-writable  Allows users to open the system registry and edit its content
Files-readonly     Allows users to open the system file and browse through its content
Files-writable     Allows users to open the system file and edit its content
Pac-readonly       This rule guarantees read access to the process memory
Pac-writable       This rule guarantees write access to the process memory
Program            Allows users to start a given program
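The supervised feedback loop described above (classify, let the operator review, merge into the training set) can be sketched as follows. The `MajorityClassifier` stand-in, the category labels, and the `review` callback are purely illustrative; the paper itself uses a J48-based Random forest inside this loop.

```python
class MajorityClassifier:
    """Trivial stand-in for the Random forest, so the loop is runnable."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self
    def predict(self, X):
        return [self.label for _ in X]

def classification_round(model, train_X, train_y, new_X, review):
    """Classify new records, let the operator review each decision,
    then merge the confirmed records into the training set and retrain."""
    predictions = model.predict(new_X)
    confirmed = [review(x, p) for x, p in zip(new_X, predictions)]
    train_X += new_X                 # aggregate the newly labelled data
    train_y += confirmed
    model.fit(train_X, train_y)      # ready for the next round
    return confirmed

train_X = [["myapp.exe", "app.log"]]
train_y = ["files-writable"]
model = MajorityClassifier().fit(train_X, train_y)

labels = classification_round(
    model, train_X, train_y,
    [["myapp.exe", "config.ini"]],
    # the operator downgrades config files to read-only access
    review=lambda x, p: "files-readonly" if x[1].endswith(".ini") else p,
)
print(labels)
```

The `review` hook is where the human supervision described in the text enters the loop; replacing it with an identity function yields the unsupervised variant evaluated later as Dataset 3A.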


According to the research by Probst and Boulesteix, the size of the Random forest should be as large as possible, considering the computational cost [14]. Taking into consideration the current trends in the growth of Random Access Memory (RAM), the number of cores of an average processor, and their overall computing power, the size T of the Random forest for this research was defined according to Eq. 1.

T = 2^8 · CPU_core_number (1)

In this study, we used quad- and eight-core processors, so the number of decision trees was 1024 or 2048. Both values gave identical classification results. Training a single classifier (tree) uses the random subspace method (bagging), defined by Ho [18]. Typically, classification problems use a random subset of ⌊√n⌋ attributes, where n is the number of features available, as proposed by Hastie, Tibshirani and Friedman [4]. Due to the number of attributes available after the path transformations, a less radical approach was used. The coefficient determining the size m of the subset drawn from the n available features is given by Eq. 2.

m = ⌈n · (log(n) + 1) / 10⌉ (2)

For the datasets in this research, this means that 7 features, instead of 5, will be used. This seems to be the right approach considering that some parts of the program paths have a very low information gain ratio; e.g., the Program Files path fragment gives almost no information about the record, as most programs in the Windows operating system are installed in this directory.
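Eqs. 1 and 2 can be reproduced in a few lines. This sketch assumes the logarithm in Eq. 2 is base 10 and the result is rounded up, which reproduces the values reported in the text (1024/2048 trees, 7 out of 27 features instead of the usual 5):

```python
from math import ceil, log10, sqrt

def forest_size(cpu_cores: int) -> int:
    """Eq. 1: number of trees T scales with the number of CPU cores."""
    return 2 ** 8 * cpu_cores

def subset_size(n_features: int) -> int:
    """Eq. 2: features drawn per split (assumed base-10 log, rounded up)."""
    return ceil(n_features * (log10(n_features) + 1) / 10)

print(forest_size(4), forest_size(8))   # quad- and eight-core machines
print(subset_size(27), int(sqrt(27)))   # Eq. 2 vs. the usual floor of sqrt(n)
```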

6 Results

As mentioned in Sect. 3, we manually classified the first dataset. To measure the quality of the classification made on datasets 2 and 3, we used popular metrics: Accuracy, Precision, Recall and the F-Score. Accuracy is the most intuitive measure and requires no explanation: it is simply the fraction of correct classifications over all classifications. Precision defines to what extent the predictions for a given class are correct. The average precision over all classes can be represented as shown in Eq. 3, where |tp_i| is the number of observations correctly assigned to the analyzed class, |fp_i| is the number of observations incorrectly assigned to it, and k is the set of available classes k = {k1, ..., kn}. It is worth noting that weights for the categories were not used, as it is hard to determine the importance of the path attributes in the classification process.

pr = (1 / |k|) · Σ_{i=1}^{|k|} |tp_i| / (|tp_i| + |fp_i|) (3)


Recall is a metric that determines the model's ability to find all relevant cases in the data set. As in the case of precision, |tp_i| is the number of observations correctly assigned to the analyzed class, while |fn_i| is the number of elements of that class incorrectly assigned to other categories, and k is the set of available classes k = {k1, ..., kn}. Again, no category weights were used. The average recall is presented in Eq. 4.

rc = (1 / |k|) · Σ_{i=1}^{|k|} |tp_i| / (|tp_i| + |fn_i|) (4)

The last criterion for assessing the quality of the classification is the F-Score (f1). Both precision (pr) and recall (rc) are taken into account in its calculation: f1 is the harmonic mean of the two values, determined by Eq. 5.

f1 = 2 · (pr · rc) / (pr + rc) (5)
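As a cross-check, the three averaged metrics of Eqs. 3–5 can be sketched in a few lines of Python (the sample labels are illustrative, not the paper's datasets; the function name is our own):

```python
def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall and F-score over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    pr, rc = 0.0, 0.0
    for k in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        pr += tp / (tp + fp) if tp + fp else 0.0   # per-class precision
        rc += tp / (tp + fn) if tp + fn else 0.0   # per-class recall
    pr, rc = pr / len(classes), rc / len(classes)  # unweighted class average
    f1 = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f1

pr, rc, f1 = macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(pr, rc, f1)
```

Note that, as in the paper, the classes are averaged without weights, so rare and frequent categories contribute equally.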

The authors classified Dataset 1 manually and based the classification of Dataset 2 on Dataset 1, using it as a training set. Table 4 contains the classification results for datasets 2 and 3. The values in column Dataset 3A originate from merging the results achieved on Dataset 2 into the training set without any supervision, whereas column Dataset 3B shows what the classification would look like if the authors introduced corrections to the Dataset 2 classification.

Table 4. The quality of the Random forest classification of the ATM software

Metric     Dataset 2  Dataset 3A  Dataset 3B
Accuracy   0.8604     0.8723      0.9133
Precision  0.9235     0.9272      0.9406
Recall     0.8548     0.8676      0.9061
F-score    0.8802     0.8953      0.9223

It can be observed that the Random forest obtains very good results when classifying Dataset 2. The unsupervised classification of the third dataset (Dataset 3A) also shows a slight increase, but it is difficult to say whether this is the result of the extended training set or a matter of the data in the third dataset. With supervision, the classification quality of the third set (Dataset 3B) increased further, exceeding 90% for each metric used. It is worth noting that, in relation to the classification of the second set (Dataset 2), a significant increase in classification quality was noted: depending on the metric, the improvement ranges from 1.71% to 5.29%.

300

7

M. Maliszewski and U. Boryczka

Conclusions

Over the years, the number of highly sophisticated software-based attacks on ATMs has been growing, causing more and more damage and forcing banks to invest millions in cybersecurity measures that can still prove ineffective (see Sect. 1 for reference). Even though it is challenging to configure whitelist solutions, it is safe to say that this approach is the future of cybersecurity for single-purpose systems and for the Internet of Things (IoT). Fully automatic creation of security configurations is hard to achieve, as there are still security goals that require expert knowledge and human supervision; however, it is much easier for specialized systems to derive complex program classifications from observations than to have someone produce them manually. The results achieved in these studies show that machine learning techniques, especially the Random forest algorithm, can be successfully used to protect ATMs. Analyzing the causes of incorrect classifications within the data obtained from ATMs will be the next subject of the authors' research. This knowledge will make it possible to improve the classification methods and possibly lead to a shift from supervised (or semi-supervised) machine learning to a fully autonomous classification system, which in turn will allow whitelist-based protection against logical attacks to be created automatically.

References

1. Barbosa, R., Sadre, R., Pras, A.: Flow whitelisting in SCADA networks. Int. J. Crit. Infrastruct. Prot. 6, 150–158 (2013). https://doi.org/10.1016/j.ijcip.2013.08.003
2. Stouffer, C., NortonLifeLock: What is a zero-day exploit? https://us.norton.com/internetsecurity-emerging-threats-how-do-zero-day-vulnerabilities-work.html
3. European Association for Secure Transactions: Terminal fraud attacks in Europe drop during the Covid-19 pandemic. https://www.association-secure-transactions.eu/terminal-fraud-attacks-in-europe-drop-during-the-covid-19-pandemic/
4. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, 2nd edn. Springer, NY (2009). https://doi.org/10.1007/978-0-387-21606-5
5. Ho, T.: A data complexity analysis of comparative advantages of decision forest constructors. Pattern Anal. Appl. 5, 102–112 (2002). https://doi.org/10.1007/s100440200009
6. Ho, T.K.: Random decision forests. In: Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR 1995), vol. 1, p. 278. IEEE Computer Society, USA (1995)
7. Idera, Inc.: Test Automation for All. https://www.ranorex.com
8. International Monetary Fund, Financial Access Survey: ATMs per 100,000 adults. https://data.worldbank.org/indicator/fb.atm.totl.p5
9. Klerx, T., Anderka, M., Büning, H.K., Priesterjahn, S.: Model-based anomaly detection for discrete event systems. In: 2014 IEEE 26th International Conference on Tools with Artificial Intelligence, pp. 665–672 (2014). https://doi.org/10.1109/ICTAI.2014.105


10. Maliszewski, M., Boryczka, U.: Basic clustering algorithms used for monitoring the processes of the ATM's OS. In: 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 34–39 (2017). https://doi.org/10.1109/INISTA.2017.8001128
11. Maliszewski, M., Pristerjahn, S., Boryczka, U.: DBSCAN algorithm as a means to protect the ATM systems. In: 2018 Innovations in Intelligent Systems and Applications (INISTA), pp. 1–6 (2018). https://doi.org/10.1109/INISTA.2018.8466322
12. Maliszewski, M., Boryczka, U.: Using MajorClust algorithm for sandbox-based ATM security. In: 2021 IEEE Congress on Evolutionary Computation (CEC), pp. 1054–1061 (2021). https://doi.org/10.1109/CEC45853.2021.9504862
13. NIST, National Vulnerability Database: CVE-2021-44228. https://nvd.nist.gov/vuln/detail/CVE-2021-44228
14. Probst, P., Boulesteix, A.L.: To tune or not to tune the number of trees in random forest. J. Mach. Learn. Res. 18(1), 6673–6690 (2017)
15. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986). https://doi.org/10.1023/A:1022643204877
16. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA (1993)
17. Sedgewick, A., Souppaya, M., Scarfone, K.: Guide to application whitelisting. NIST Spec. Publ. 800(167), 2–3 (2015). https://doi.org/10.6028/NIST.SP.800-167
18. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 832–844 (1998)

Layer-Wise Optimization of Contextual Neural Networks with Dynamic Field of Aggregation

Marcin Jodłowiec1(B), Adriana Albu2, Krzysztof Wołk3, Nguyen Thai-Nghe4, and Adrian Karasiński5

1 Wroclaw University of Science and Technology, Wroclaw, Poland
[email protected]
2 Politechnica University Timisoara, Timisoara, Romania
[email protected]
3 Esently, Rzeszów, Poland
4 Can Tho University, Can Tho, Vietnam
[email protected]
5 Information Technologies Department, CERN, Geneva, Switzerland
[email protected]

Abstract. This paper presents experiments performed on Contextual Neural Networks with a dynamic field of view. It is checked how their properties can be affected by the usage of not-uniform numbers of connection groups in different layers of contextual neurons. Basic classification properties and the activity of connections are reported based on simulations with the H2O machine learning server and the Generalized Backpropagation algorithm. Results are obtained for data sets with a high number of attributes (gene expression of bone marrow cancer and myeloid leukemia) as well as for standard problems from the UCI Machine Learning Repository. The results indicate that layer-wise selection of the numbers of connection groups can have a positive influence on the behavior of Contextual Neural Networks.

Keywords: Neural networks · Connections grouping · Selective attention

1 Introduction

Artificial neural networks are one of the groups of machine learning tools that are successfully applied to solve numerous problems. Neural models are used to forecast changes and find anomalies, e.g., in medicine [1–3], financial markets [4], biology [5], and astronomy [6, 7]. They can be found in autonomous vehicles, including cars and space exploration rockets [8–10], as well as in telecommunication systems [11]. Neural networks are also very useful in science [12] and in multimedia hardware, as accelerators of scene generation and upscaling [13, 14]. Such wide adoption of artificial neural models is related to very intensive research and development of their various architectures. This includes Generative Adversarial Networks (GAN), Variational Autoencoders (VAE) [15, 16], recurrent models using Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) [17, 18], as well as the frequently used Convolutional Neural Networks (CNN) [19, 20]. In some applications the memory and time cost of the mentioned models can be too high in comparison to the needed accuracy, thus simpler solutions are developed and used when feasible – e.g., Self Organizing Maps [21] or Multilayer Perceptrons (MLP) [22].

It is not hard to spot that all the mentioned types of models share a common feature: for every data vector, they analyze signals from all inputs of their neurons. Such a solution forces the model to keep all internal connections active whenever the model is in use. This makes neural architectures and the related training algorithms simpler, but at the same time limits their capabilities of analyzing contextual relations within the data. To bypass this problem, contextual neurons [23–25] and Contextual Neural Networks (CxNN) were proposed, giving access to solutions with unique properties [26–31].

Contextual Neural Networks represent a generalization of MLP models and can be applied as a direct replacement for feedforward neural networks [23]. As a result, they have been used to classify microarray data of cancer gene expression [27, 28], to detect fingerprints in criminal investigations through contextual analysis of minutia groups [32, 33], and to improve communication in cognitive radio systems [34]. Contextual data analysis and CxNNs can also be considered useful parts of telemedicine and context-sensitive text mining systems [35–37].

The schema of input signal processing by contextual neurons in CxNNs reflects the idea of the scan-path theory, which was developed by Stark to mimic the human sensory system [38]. This is achieved by the usage of neuronal aggregation functions which perform conditional, multi-step accumulation of input signals. It requires the additional identification of an ordered list of groups of neuron input connections to form a scanning path that controls which input connections are considered active for a given input vector. This can be achieved in multiple ways – with the usage of genetic algorithms as well as with the gradient-based Generalized Error Backpropagation method (GBP). Therefore every single contextual neuron can calculate its output based on different subsets of inputs in relation to different data vectors. As a result, data sets can be processed more accurately and with lower time and energy costs [25].

Within this paper we analyze Contextual Neural Networks with different numbers of connection groups in different layers of neurons. In particular, three patterns of numbers of groups in hidden layers are considered (assuming the direction from inputs to outputs): increasing, decreasing and constant (uniform). The main goal of this work was to verify if the properties of Contextual Neural Networks with a dynamic field of aggregation (implemented by Sigma-if neurons) can be modified by layer-wise optimization based on the mentioned patterns of numbers of groups. The reported results were obtained with a dedicated H2O machine learning server for standard problems from the UCI Machine Learning Repository [39], as well as for real-life data sets with a high number of attributes (gene expression of bone marrow cancer and myeloid leukemia) [40].

The remaining parts of this paper are organized as follows. Section 2 includes the presentation of Contextual Neural Networks and the GBP training algorithm. In Sect. 3, the idea of patterns of numbers of connection groups is considered as a method of layer-wise optimization of CxNNs. Section 4 presents the experimental setup and the obtained results of measurements. Finally, the paper ends in Sect. 5 with a summary of conclusions and prospects for further research.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 302–312, 2022.
https://doi.org/10.1007/978-3-031-21967-2_25


2 Contextual Neural Networks

The main elements of Contextual Neural Networks are neurons with multi-step aggregation functions [24]. Such neurons can aggregate signals from all or only selected inputs, depending on the data vector being processed. This possibility makes CxNNs a generalization of the Multilayer Perceptron network, whose neurons analyze signals from all their inputs without any relation to the data. It was shown that conditional, multi-step aggregation functions – in comparison to MLP – can notably increase classification accuracy and decrease its computational cost [23, 27].

A contextual neuron performs multi-step aggregation in the following way. The dendrites of the neuron are assigned during training to a preselected number of groups with different priorities. The list of groups of connections is a form of the scan-path considered by Stark [38]. Input signals are aggregated until a given condition is met; when aggregation is finished, the signals not yet read are ignored. Such behavior allowed the creation of a family of aggregation functions such as CFA, PRDFA, and Sigma-if (Fig. 1) [23, 24, 26, 28]. Each of those aggregation methods represents a different scheme of behavior of the neuron's attention field size – constant, random, and dynamic, respectively. In the latter case the field of attention (the number of inputs of the neuron considered during the calculation of neural activation in a given step of aggregation) changes with each step of aggregation for a given data vector (e.g., grows, as in the Sigma-if neuron).

Fig. 1. Diagram of the Sigma-if aggregation function. An example given for a contextual neuron with the number of groups K = 3, aggregation threshold ϕ*, activation function F, and set of inputs X being the sum of the subsets X1, X2 and X3.

Figure 1 includes an example diagram of the Sigma-if aggregation function in which the set of input connections of the neuron is divided into K = 3 groups, and the condition identifying the end of the aggregation process is based on a comparison of the value aggregated in the given i-th step with a single, constant aggregation threshold ϕ*. It can also be noticed that in the Sigma-if method the actual aggregation variable ϕi serves as a memory of the signals aggregated in earlier steps for a given data vector.
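A minimal sketch of the multi-step Sigma-if aggregation from Fig. 1, assuming a simple weighted-sum accumulation per group (our reconstruction; the weights, grouping, and threshold values below are illustrative):

```python
def sigma_if_aggregate(weights, inputs, groups, K, phi_star):
    """Aggregate inputs group by group (group 1 first) until the running
    sum phi exceeds the threshold phi*; remaining groups are skipped."""
    phi = 0.0
    active = 0
    for k in range(1, K + 1):
        phi += sum(w * x for w, x, g in zip(weights, inputs, groups) if g == k)
        active += sum(1 for g in groups if g == k)
        if phi > phi_star:       # dynamic field of aggregation: stop early
            break
    return phi, active           # the activation would then be F(phi)

phi, used = sigma_if_aggregate(
    weights=[0.9, 0.8, 0.1, 0.2], inputs=[1.0, 1.0, 1.0, 1.0],
    groups=[1, 1, 2, 2], K=2, phi_star=0.6)
print(phi, used)  # group 1 alone exceeds 0.6, so group 2 is never read
```

In this toy run only two of the four connections are active, which is exactly the per-vector "connection activity" quantity measured later in Sect. 4.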


It might seem that the implementation of the Sigma-if aggregation requires describing each input connection of a contextual neuron not only with a connection weight but also with an additional integer value representing the identifier of the connections group to which the given dendrite is assigned (values 1..K). In practice, however, this information can be encoded within the value of the connection weight. This can be done by assuming that the connections with the highest weights belong to the group processed in the first step and those with the lowest weights are assigned to group K. This allows the creation of a grouping function Ω′, which can be used to generate the grouping vector of the neuron from its weights vector and the number of groups K. It was also proven that models built of such contextual neurons can be trained with a gradient-based algorithm [23, 25, 30]. An overview of such a procedure is depicted in Fig. 2.

Fig. 2. Generalized error backpropagation algorithm for training Contextual Neural Networks, for error function E, grouping function Ω′ and interval of groups update ω.

The non-continuous nature of the grouping vector of a contextual neuron, even though it is embedded within the weights vector, creates a situation in which training a CxNN also requires the optimization of non-differentiable parameters. This can be achieved with a gradient-based algorithm by applying the self-consistency paradigm known from physics [41]. An example of such a method is the Generalized Error Backpropagation algorithm (GBP) presented in Fig. 2 [23].
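One possible grouping function Ω′ matching the description above can be sketched as follows: connections with the highest weights land in group 1 (read first) and the lowest in group K. The ceiling-based equal-size split is our own assumption, not necessarily the authors' exact definition.

```python
def grouping(weights, K):
    """Derive the grouping vector (ids 1..K) from the weights vector:
    highest-weight connections go to group 1, lowest to group K."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    groups = [0] * len(weights)
    per_group = -(-len(weights) // K)          # ceiling division
    for rank, i in enumerate(order):
        groups[i] = rank // per_group + 1      # group ids 1..K
    return groups

g = grouping([0.05, 0.9, -0.7, 0.2, 0.4, -0.1], K=3)
print(g)
```

Because the grouping is a pure function of the weights, GBP can periodically recompute it (every ω epochs in Fig. 2) while the gradient step updates only the continuous weights, which is the self-consistency idea mentioned above.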


3 Layer-Wise Patterns of Connections Grouping in CxNNs

Most of the literature about Contextual Neural Networks discusses models in which every hidden neuron of a CxNN network uses the same count of groups K [23–34, 42]. To the best of our knowledge, only in [43] were CxNN architectures with layer-wise assignments of groups analyzed (by our team, for neurons with the CFA aggregation function). In this work we extend our previous analysis to not-uniform patterns of numbers of groups in CxNNs with the Sigma-if aggregation function.

Fig. 3. A case of CxNN models having two hidden layers with numbers of neurons N = [6, 4, 3, 1]. Counts of input groups in the Sigma-if units of subsequent layers: a): K = [1, 4, 4, 1] (uniform pattern in hidden layers); b): K = [1, 4, 2, 1] (not-uniform, decreasing pattern). Example assignments of connections to groups are presented in the last two layers.

The difference between uniform and not-uniform patterns of group numbers in the hidden layers of CxNN networks can be observed in Fig. 3. Both depicted architectures are equal in terms of the numbers of neurons in subsequent layers. However, the Sigma-if neurons in all hidden layers of the case shown in Fig. 3a have an equal number of groups (K = 4), while the units in the hidden layers of the network in Fig. 3b have different counts of groups. Such uniform and not-uniform patterns of numbers of groups will be further denoted as G1441 and G1421, respectively.

It seems worth checking how the usage of not-uniform grouping patterns in hidden layers can influence the behavior of CxNNs with neurons having a dynamic size of the field of aggregation. This is because such models, even with equal numbers of neurons in each layer, can include structures analogous to encoder-decoder networks [15, 16], but realized in the space of connection groups. It is still an open question how this could impact the training and properties of CxNNs built of Sigma-if neurons. However, as the number of all possible combinations of layer-wise patterns of connection groupings is high even in small models, in this work three basic grouping patterns are considered: uniform, decreasing, and increasing.

Layer-Wise Optimization of Contextual Neural Networks

307

4 Experiments and Results

In this study, we simulated four-layer Contextual Neural Networks with Sigma-if neurons to observe how their properties change for different layer-wise patterns of connection groups. All networks had two hidden layers built of ten Sigma-if units each. One-hot and one-neuron-per-class unipolar encodings were used to construct the input and output layers. The following patterns of numbers of groups were considered: increasing (G1471), decreasing (G1721, G1741) and constant (G1771). The numbers of groups in the input and output layers were by default set to one. Overall classification accuracy and hidden connections activity were measured with stratified, ten times repeated 10-fold cross-validation. The CxNN models were applied to classify real-life gene expression microarray data of bone marrow cancer and ALL-AML myeloid leukemia as well as selected benchmark tasks from the UCI Machine Learning Repository [39, 40]. The most important properties of the considered problems are presented in Table 1.

Table 1. Main parameters of problems solved in this work.

Data set       Number of samples  Number of attributes  Number of classes
Breast_cancer  699                10                    2
Crx            690                15                    2
Golub          72                 7130                  2
Heart_disease  303                75                    5
Sonar          208                60                    2
Soybean        307                35                    19
The reported measurements were obtained for CxNN models trained with the H2O 3.24.0.3 machine learning server extended to include the GBP algorithm [26, 44]. The most important settings applied during the training process were as follows: number of hidden layers = 2, number of neurons in each hidden layer = 10, neuron activation function = tanh, neuron aggregation function = Sigma-if, aggregation threshold ϕ* = 0.6, groupings update interval ω = 5, loss function = logloss. Initialization of connection weights was done with the Xavier algorithm. Standardization of input data was performed with a unipolar Min-Max scaler.

Measurements of classification accuracy obtained for the test data are presented in Table 2. It is visible that not-uniform patterns of numbers of connection groups can significantly modify the results achieved by CxNN models. In particular, a decreasing pattern of the number of connection groups seems to increase the chances of creating CxNNs with higher average classification accuracy. This can be observed both for the G1721 and G1741 patterns, but the former leads to a stronger effect. Additionally, for the G1721 pattern, four out of six data sets generated outcomes with a decreased standard deviation of classification accuracy.

Table 2. Average classification error and its standard deviation of considered CxNNs with Sigma-if neurons and various patterns of numbers of groups in the hidden layers. Bold font represents statistically best results and lowest standard deviations.

Dataset        G1441 [%]   G1771 [%]   G1471 [%]   G1741 [%]   G1721 [%]
Breast cancer  3.2 ± 1.7   4.8 ± 1.8   4.6 ± 2.9   3.8 ± 3.1   3.7 ± 2.8
Crx            12.7 ± 4.6  12.5 ± 4.7  12.4 ± 4.3  11.6 ± 4.4  10.2 ± 1.6
Golub          7.2 ± 0.4   7.6 ± 0.3   7.3 ± 0.4   6.7 ± 0.2   6.3 ± 0.3
Heart disease  11.5 ± 4.8  12.8 ± 4.7  11.2 ± 3.9  11.6 ± 4.3  11.2 ± 4.3
Sonar          7.4 ± 2.0   6.7 ± 1.4   6.8 ± 1.8   7.8 ± 1.6   5.4 ± 1.2
Soybean        4.6 ± 1.7   4.2 ± 1.5   4.5 ± 1.3   4.7 ± 1.6   3.1 ± 1.1

It can also be noticed that the usage of layer-wise decreasing patterns of numbers of groups (G1741 and G1721) decreases the activity of the hidden connections of CxNNs. Again, the effect is stronger for G1721 – in this case lowered activity of connections can be observed for four out of six analyzed data sets. Similar behavior was earlier observed for CxNNs with the CFA aggregation function [43], and in both cases it was unexpected. This is because single contextual neurons with a lower number of groups typically present higher activity of input connections. The most evident case of such a relation is a contextual neuron with only one group, which by default has maximal, 100% activity of connections (Table 3).

Table 3. Average activity and its standard deviation of hidden connections of considered CxNNs with Sigma-if neurons and various patterns of numbers of groups in the hidden layers. Bold font represents statistically best results.

Data set       G1441 [%]   G1771 [%]   G1471 [%]   G1741 [%]   G1721 [%]
Breast_cancer  82.9 ± 4.7  82.4 ± 5.1  92.5 ± 8.2  89.1 ± 6.4  88.8 ± 4.6
Crx            80.7 ± 3.4  78.5 ± 2.2  78.6 ± 3.8  74.6 ± 4.0  75.1 ± 2.9
Golub          21.8 ± 2.7  24.1 ± 2.9  23.0 ± 3.1  17.2 ± 3.5  16.4 ± 3.8
Heart_disease  80.2 ± 0.9  82.6 ± 1.8  85.2 ± 1.9  79.4 ± 1.2  78.7 ± 1.7
Sonar          49.3 ± 0.9  48.0 ± 0.5  48.9 ± 0.8  49.6 ± 0.9  51.7 ± 0.7
Soybean        53.6 ± 1.5  52.8 ± 1.7  54.4 ± 1.7  52.9 ± 1.6  48.5 ± 1.5

Moreover, we observe that the change of the number of groups in the second hidden layer from four to two (from G1741 to G1721) decreases the activity of the hidden connections. What is observed can thus be an effect of synergy between contextual neurons, which makes it an interesting subject of further analysis. Finally, to the best of our knowledge, this study is the first to show in practice that the GBP algorithm can successfully build CxNN models constructed of Sigma-if neurons (i.e., with a dynamic size of the field of attention) when a layer-wise setting of the numbers of connection groups is used. This suggests that the GBP method can be further extended with automatic optimization of the numbers of connection groups – both in 'per-neuron' and layer-wise modes.

5 Conclusions

In this work, we report the results of experiments with Contextual Neural Networks built of Sigma-if neurons with a dynamic size of the field of attention. The main point of the analysis was to check how a not-uniform, layer-wise assignment of the numbers of groups in hidden neurons can change the classification accuracy and the activity of hidden connections of trained CxNNs. The neural models were built with a dedicated version of the H2O machine learning server. Measurements were obtained with 10 times repeated 10-fold cross-validation for data sets with a high number of attributes (gene expression of bone marrow cancer and myeloid leukemia) as well as for standard problems from the UCI machine learning repository (e.g., Sonar, Wisconsin Breast Cancer, Crx).

The results indicate that layer-wise selection of groups of connections can have a positive influence on the behavior of Contextual Neural Networks. When the numbers of groups decrease in consecutive hidden layers, the measured level of classification error of the model is lower for many of the considered problems. Moreover, this also applies to the activity of hidden connections. Such an observation is intriguing, since single contextual neurons with a lower number of groups typically present higher activity of input connections. This can be an effect of synergy between contextual neurons, and its analysis could be a valuable direction for further studies.

Finally, the presented work confirms that the H2O server extended with the GBP method can train CxNNs with not-uniform settings of the numbers of connection groups in hidden layers, also when their neurons include an aggregation function with a dynamic size of the field of attention. This also suggests that the GBP algorithm could be additionally extended with automatic optimization of the numbers of connection groups.


M. Jodłowiec et al.


Computer Vision Techniques

Automatic Counting of People Entering and Leaving Based on Dominant Colors and People Silhouettes

Kazimierz Choroś(B) and Maciej Uran

Faculty of Information and Communication Technology, Department of Applied Informatics, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, 50-370 Wrocław, Poland
[email protected]

Abstract. Counting people in a crowd is an important process for video surveillance, anomaly warning, public security control, as well as the ongoing protection of people and public facilities. People counting methods are also applied to control the numbers of people entering and leaving places such as a tourist bus, an office or a university building, a public building, a supermarket or shopping mall, a culture or sports center, etc. A problem arises when the number of exiting people is not equal to the number of entering people: how many people are missing, and who is missing? The paper presents an approach for people counting and the detection of missing persons. This approach includes two procedures. First, the dominant color of people detected in the video is analyzed. Next, the silhouette sizes are used. Both procedures finally allow us to define specific features distinctive for missing people. The results of tests performed with video recordings of people entering and exiting through a door are promising. In this approach individuals' identities are not registered; therefore, privacy violation is avoided.

Keywords: Content-based video analysis · Surveillance video · People counting · Security control · Missing persons · Dominant color hue · Silhouette size

1 Introduction

Counting people in a crowd from a single still image or from video sequence frames has become an essential part of computer vision systems. It is an important process for video surveillance, anomaly warning, security control, as well as the ongoing protection of people and public facilities. Automatic crowd counting is frequently used to estimate the number of people participating in outdoor events such as processions, political meetings or manifestations, protest events, subway stations, tourist attractions, open-air concerts, sports competitions, and many other mass activities. In the past, many methods of crowd counting were proposed based on the detection and counting of objects (faces, heads, human bodies) in images or videos [1–3]. These methods tried to estimate crowd density and, on the basis of the crowd density, estimate the number of people. In recent years methods based on convolutional neural network (CNN) models [4, 5] have become very popular.

People counting is also very useful in monitoring people entering or leaving some places. In such a case the direction of a person's movement is important, not only their appearance in a camera. In many applications these entering and leaving people are observed at a door or gate. Such a situation occurs when counting people entering or leaving a tourist bus, an office or a university building, a public building, a supermarket or shopping mall, a culture or sports center, etc. It is important to know whether the number of leaving people is the same as the number of people who have entered, and then which person or persons are missing. In such surveillance applications people counting is performed in a relatively narrow passage, like a tourist bus door, so in most cases entering or leaving people are detected as individual objects and object occlusions are not frequent. The difficulty may arise from the fact that passing people are recorded in only a few frames of a monitoring video. Moreover, the faces are not clearly visible.

The paper is structured as follows. The next section describes related work on crowd counting, including counting of people entering and exiting through a door. The third section presents the proposed approach. The fourth section describes the data used in the experiments and presents the results of the tests showing how missing people can be identified in the monitoring videos. Further experimental research to improve the proposed method as well as final conclusions are presented in the last section.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 315–328, 2022. https://doi.org/10.1007/978-3-031-21967-2_26

2 Related Work

Crowd counting is applied to images presenting an almost static crowd during mass events such as music concerts, sports events, or meetings. People counting is also used for walking people (pedestrian detection), for example during political demonstrations, or simply on streets or in the passages of a supermarket or shopping mall. Counting of people is likewise useful for people entering and exiting buses or buildings, crossing a gate observed by a digital video camera, etc.

Static still images seem to be the easiest to process in crowd counting. There are many methods based on the detection of faces, heads, or shoulders [6]. However, these methods, relatively efficient for sparse populations, lead to worse results in the case of high crowd density because of the people occlusions observed in a dense crowd. The solution is to estimate the crowd based on a crowd density map. Crowd counting approaches for such complex scenes are mainly based on regression models [7, 8], which learn a mapping between low-level features and class labels. Crowd counting using regression differs depending on the goal of regression: object count or object density. Usually heat maps are used to depict density maps. Heat maps were earlier applied, for example, to the usability evaluation of websites [9]. Places with a dense crowd are represented by red points and a smaller crowd by blue ones. Heat maps are very useful to show crowd density variation and to study crowd movements [7, 10, 11].

The method presented in [12] is a patch-based approach for crowd density estimation, also applied to the problem of simultaneous density estimation of pedestrians and cars in traffic scenes. The main problem encountered is a specific noise in person density estimation caused by the misclassification of cars, because the examined features of persons can be very similar to patch features of some car parts. Such misclassification has been reduced by applying special preprocessing based on resizing the car parts and filtering out person appearances which become very small. In [13] the authors use multiple local features to count the number of people in each foreground segment; the total crowd is then equal to the sum of the group sizes. This also permits estimating those parts of the analyzed crowd which are not seen in the image. The proposed method can be used to estimate crowd density throughout different regions of the image and, moreover, can be used in a multi-camera environment.

The problem of people counting is more difficult in the case of extremely dense crowds [14]. The difficulty arises from the lack of training samples, severe occlusions, cluttered scenes, and variation of perspective. Deep convolutional neural network (CNN) regression models are currently proposed as an efficient solution for counting people in images of extremely dense crowds [4, 5, 14, 15]. Deep CNN models automatically choose features for counting, and furthermore negative samples weaken the negative influence of background elements such as buildings and trees on the final results. Another method [16] combines features obtained using multiple receptive field sizes and examines the importance of each such feature at individual image locations. In consequence it encodes the scale of the contextual information required to accurately predict crowd density. The experiments have shown that this method outperforms state-of-the-art crowd counting methods, especially when perspective effects are strong.

Automatic counting is also applied to people moving on streets or in the passages of a supermarket or shopping mall. In [17] a real-time system was presented that is able to detect moving people and calculate their number. Experiments were performed on indoor videos with different illumination conditions and different numbers of moving people. The results ranged from 90% to 98% depending on illumination conditions. The processed color images were acquired by several cameras placed at different entrances of public buildings. This permitted not only counting people but also estimating the preferred routes, including for example the number of people choosing stairs or preferring lifts. However, the movement of people poses some problems, mainly because of the appearing occlusions. In [18] a trajectory set clustering method was proposed to estimate the number of moving persons in a scene as a function of time. First, a well-known feature tracking algorithm was applied with some enhancements in order to extract a large set of features from the analyzed video. An algorithm for pedestrian detection using learned shapelet features was described in [19]. In this approach, discriminative shapelet features learned over separate regions are used to capture informative cues from all over the image and are then integrated into a final classifier to detect pedestrians. To improve the accuracy of people counting, a method was proposed using flow analysis with the movement speed of a person [20]. First the estimation of foreground movement speed is performed, and then multiple people detection is carried out based on the flow analysis on a line of interest while people enter and exit a given region. The speed of the flow is estimated on the basis of the number of frames containing each entry and exit of the foreground objects in the line of interest.

A special case of people counting is counting persons entering and exiting buses or buildings through a door. To detect people passing through the door in the experiments described in [21], a camera was positioned vertically downwards at the entrance of a monitored area. Not only the proper placement of the camera was examined but also the size of the observed area and the optimal perspective. Automatic counting of persons entering and exiting is useful in public transportation. The authors of [22] described an approach to match people entering and then, at some later time, exiting from a doorway. To associate entry and exit objects, a trellis optimization algorithm was used for sequence estimation, based on multiple texel camera measurements. Since the number of states in the trellis grows exponentially with the number of persons currently on the bus, a beam search pruning technique was additionally used to reduce the computational complexity. This approach provided information on how long a person is on a bus and what stops the person uses. It also provides valuable information about the intensity and variation of traffic flow in the whole transportation system. Such an approach may, however, generate the undesirable problem of collecting personal information about people using public means of transport.

An automatic passenger counting system for the bus fleet of public transport, based on a skin color detection approach, was proposed in [23]. Passenger counting in the bus was performed by processing images in a few steps: conversion of RGB color to HSV, segmentation of images using thresholds, reduction of noise and removal of needless objects, smoothing of images, and finally passenger counting. Then in [24] the authors proposed a real-time counting method estimating the number of people entering and exiting a store. The proposed method first extracts the foreground based on the average picture level, searches for motion based on the maximum a posteriori probability, and next analyzes flow on the basis of multiple touching sections. The number of people entering or leaving a store is finally calculated using multivariate linear regression with the flow volume, flow velocity, and flow direction. Another automatic counting method [25] for bus passengers using surveillance videos also consists of three main steps: door state estimation, passenger detection, and passenger tracking and counting. The experiments were carried out on surveillance videos of three types: videos captured during the day, videos captured at night, and videos captured on a rainy day. Moreover, a few special real situations were included, such as different objects worn by the passengers, people occlusions, and highly dense crowds. In [26] the authors collected head target image samples, established a head target detection and tracking model based on deep learning, mainly on the YOLOv3 algorithm, and analyzed the trajectories of passengers in the bus boarding and disembarking area. Then, based on the passenger head trajectories, a statistical algorithm for passenger detection was proposed and verified in a variety of bus scenarios.

There are many other methods for counting the number of people entering and leaving a bus or building. In surveillance systems we are often interested not only in the total number of people but also in verifying whether the number of people entering is the same as the number of people exiting. Such a case appears, for example, during tourist excursions when tourists leave the bus to visit some places and then come back, and the organizer or guide checks whether all people have returned. The security guards in public administrative buildings, museums, galleries, university buildings, and other such objects want to know when closing whether all people who entered the building during the day have already left. So, the exact number of people is not crucial; the most important is the information on missing persons. In this paper we propose a solution to this problem, taking into account that we cannot register individual people's identities and cannot violate people's privacy by collecting personal information.

3 New Approach for People Detection, Counting, and People Annotation

The proposed approach is based on three steps: detection of people, people counting, and then people annotation using the dominant color and the silhouette size. The people detection can be performed by any method already proposed in the computer vision domain. The method applied here is based on tracking a point that is the centroid of the rectangular shape of an object detected by a convolutional neural network. People detection is achieved by the already trained MobileNet-SSD model. The software used in the experiments is based on computer program code available on the Internet [27]; however, it was modified to ensure not only counting people but also recognizing the direction of movement (entering or exiting).

Moreover, to enable the detection of the dominant color and the silhouette size of a passing person, later used in the people annotation process, the camera could not be placed vertically above the door as in the majority of people counting algorithms. If a camera is placed vertically downwards, people's heads are clearly visible and people counting is much easier; however, the detection of the dominant color of their clothes is practically impossible. Also, the dimensions of the human body, including human height, cannot be observed and measured.

The dominant color is found by analyzing the frequencies of colors of a detected person. The HSL (Hue Saturation Lightness) color model is used, and the three most frequently observed H values are registered for each detected person in an entering people registration list. The size of the detected person is also saved in this list. The dimensions of the rectangle determined during the people detection process reflect the dimensions of the human body; they are taken as the size of the person's silhouette and recognized as a second characteristic of the person. When detecting the dominant colors of people entering and leaving, the dominant colors of the background are first recognized and then excluded from the set of colors characterizing a given person.

To achieve good color recognition, as well as to ensure that the rectangles determined in the people detection process sufficiently discriminate entering persons, the camera should be placed not above the entry but in front of the entering people.
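The hue-histogram step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `dominant_hues`, the pixel format, and the background-hue filtering interface are our assumptions. Python's standard `colorsys` module works in HLS component order, which is equivalent to the HSL model used here:

```python
import colorsys
from collections import Counter

def dominant_hues(pixels, background_hues, n=3):
    """Return the n most frequent hue values (0-359) among a person's
    bounding-box pixels, skipping hues already attributed to the background.
    `pixels` is an iterable of (r, g, b) tuples with components in 0-255."""
    counts = Counter()
    for r, g, b in pixels:
        h, _, _ = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
        hue = int(round(h * 360)) % 360
        if hue not in background_hues:
            counts[hue] += 1
    return [hue for hue, _ in counts.most_common(n)]

# Toy example: a mostly red shirt, some blue, and green background pixels
# (hue 120) that are excluded as in the paper's background filtering step.
pixels = [(200, 30, 30)] * 50 + [(30, 200, 30)] * 30 + [(30, 30, 200)] * 20
print(dominant_hues(pixels, background_hues={120}))  # → [0, 240]
```

In a real setting the pixels would come from the detected rectangle of a single video frame, and the background hues would be estimated once from frames without people.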


The similarity of two persons on the basis of the dominant colors is calculated using the Euclidean measure:

c(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}    (1)

where c is the Euclidean distance, p_i is the value of the H component of the dominant color i of person p, q_i is the value of the H component of the dominant color i of person q, and n is the number of dominant colors characterizing a person (n = 3 in our experiments). The dominant colors are ranked for each person; index 1 denotes the most dominant one.

The similarity of two persons on the basis of the silhouette size is also calculated using the Euclidean distance measure:

s(p, q) = \sqrt{(w_p - w_q)^2 + (h_p - h_q)^2}    (2)

where s is the Euclidean distance, w_p (w_q) is the width of the rectangle determined for person p (q), and h_p (h_q) is the height of the rectangle determined for person p (q).
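Equations (1) and (2) translate directly into code. This is a small sketch; the function names are ours, not the paper's:

```python
import math

def color_distance(p_hues, q_hues):
    """Eq. (1): Euclidean distance between the ranked dominant-hue
    vectors of two detected persons (n = 3 in the experiments)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(p_hues, q_hues)))

def silhouette_distance(p_size, q_size):
    """Eq. (2): Euclidean distance between silhouette rectangles,
    each given as a (width, height) pair."""
    (wp, hp), (wq, hq) = p_size, q_size
    return math.sqrt((wp - wq) ** 2 + (hp - hq) ** 2)

# Entering person E1, hues (202, 200, 198), vs leaving person L3,
# hues (200, 198, 202): reproduces the 4.90 reported in Table 2.
print(round(color_distance((202, 200, 198), (200, 198, 202)), 2))  # → 4.9
```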

4 Test Results

In our experiments we aimed to show to what extent the dominant color and the silhouette size are useful for finding missing persons, i.e. persons who have entered but not exited from a monitored space. Two videos were used in the experiments (Table 1) (Fig. 1).

Table 1. Video characteristics used in the experiments.

                                      Video 1      Video 2
Duration                              180 s        213 s
Frame rate                            30           30
Format                                MP4          MP4
Original resolution                   720 × 1280   720 × 1280
Resolution after conversion           250 × 444    250 × 444
Number of detected people entering    10           12
Number of detected people exiting     7            8


Fig. 1. Examples of frames with persons entering and exiting recorded in the test videos.

Unfortunately, the most dominant colors are not always exactly the same, or even evidently similar, even though they were registered for the same person entering and then leaving the room, because the lighting conditions are not the same for both events. It happens that they are significantly different (Fig. 2).

Fig. 2. The differences of the hues of three most dominant colors of the same person entering (left): 22, 52, 20 and leaving (right): 100, 180, 52.

Fig. 3. Example of the hue measurement results for three dominant colors of an entering person.


Another issue is that the best situation is when clothes have expressive colors (Fig. 3). In practice many people wear clothes in rather muted colors, so color alone may not be sufficiently discriminative.

First, the distances of the three hues of dominant colors were calculated between the people entering and leaving detected in Video 1 (Table 2) and in Video 2 (Table 3). The most probable matches are the entering persons with the lowest distance measure: for every leaving person we look for the most similar entering person (Euclidean distance of the hue values of the three dominant colors). The most similar entering persons for the seven leaving persons are: E1, E6, E8, E3, E9, E6, E10. Therefore, we can conclude that the missing people are: E2, E4, E5, E7.

Table 2. Euclidean distance measures of three dominant colors for people entering and leaving detected in the Video 1.

Dominant hues of leaving persons: L1 (260, 252, 22), L2 (228, 240, 226), L3 (200, 198, 202), L4 (110, 106, 108), L5 (12, 14, 16), L6 (214, 212, 216), L7 (100, 180, 52).

Entering person (dominant hues)      L1      L2      L3      L4      L5      L6      L7
E1  (202, 200, 198)               192.47   55.32    4.90  159.37  322.21   24.74  179.22
E2  (52, 22, 20)                  310.11  347.76  293.26  134.77   40.99  317.43  168.20
E3  (110, 106, 108)               226.30  214.02  159.37    0.00  162.89  183.62   93.34
E4  (106, 96, 100)                232.67  226.93  172.17   13.42  150.39  196.41   96.93
E5  (22, 52, 20)                  310.88  346.72  293.47  135.66   39.50  317.62  153.27
E6  (214, 212, 216)               203.35   32.86   24.25  183.62  346.42    0.00  202.28
E7  (22, 26, 20)                  328.21  361.48  307.23  147.95   16.12  331.48  175.57
E8  (200, 198, 202)               197.27   55.89    0.00  159.37  322.17   24.25  181.17
E9  (12, 14, 10)                  343.94  379.98  325.67  166.35    6.00  349.92  192.52
E10 (68, 260, 52)                 194.49  237.23  209.21  169.16  254.85  224.76   86.16
This is the first variant of the people matching procedure. In this variant it can happen that one entering person is similar to more than one leaving person: in Video 1 the leaving persons L2 and L6 are both matched to the entering person E6. In the second matching procedure we look for the most similar entering person, but only among those still inside. If leaving persons are processed in the order of their indices, the probable entering people who exited are: E1, E6, E8, E3, E9, E4, E10. In this second variant the missing people are: E2, E5, E7.
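The two matching variants can be sketched in code. The following is a minimal illustration using the Video 1 hue distances of Table 2; the function and variable names are ours, not the paper's:

```python
# Hue-distance matrix from Table 2: keys are entering persons E1..E10,
# list positions are the leaving persons L1..L7.
D = {
    "E1":  [192.47,  55.32,   4.90, 159.37, 322.21,  24.74, 179.22],
    "E2":  [310.11, 347.76, 293.26, 134.77,  40.99, 317.43, 168.20],
    "E3":  [226.30, 214.02, 159.37,   0.00, 162.89, 183.62,  93.34],
    "E4":  [232.67, 226.93, 172.17,  13.42, 150.39, 196.41,  96.93],
    "E5":  [310.88, 346.72, 293.47, 135.66,  39.50, 317.62, 153.27],
    "E6":  [203.35,  32.86,  24.25, 183.62, 346.42,   0.00, 202.28],
    "E7":  [328.21, 361.48, 307.23, 147.95,  16.12, 331.48, 175.57],
    "E8":  [197.27,  55.89,   0.00, 159.37, 322.17,  24.25, 181.17],
    "E9":  [343.94, 379.98, 325.67, 166.35,   6.00, 349.92, 192.52],
    "E10": [194.49, 237.23, 209.21, 169.16, 254.85, 224.76,  86.16],
}

def match(distances, without_replacement):
    """Variant 1: every leaving person gets the globally nearest enterer.
    Variant 2 (without_replacement=True): an enterer already matched to an
    earlier leaving person is removed from the pool of candidates."""
    entered = list(distances)
    available = list(entered)
    matched = []
    n_leaving = len(next(iter(distances.values())))
    for j in range(n_leaving):
        pool = available if without_replacement else entered
        best = min(pool, key=lambda e: distances[e][j])
        matched.append(best)
        if without_replacement:
            available.remove(best)
    missing = [e for e in entered if e not in matched]
    return matched, missing

print(match(D, False))  # variant 1: E6 matched twice, missing E2, E4, E5, E7
print(match(D, True))   # variant 2: missing E2, E5, E7
```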


Table 3. Euclidean distance measures of three dominant colors for people entering and leaving detected in the Video 2.

Dominant hues of leaving persons: L1 (226, 216, 222), L2 (204, 202, 200), L3 (334, 336, 332), L4 (240, 252, 260), L5 (352, 354, 350), L6 (10, 12, 14), L7 (50, 232, 228), L8 (46, 44, 102).

Entering person (dominant hues)      L1      L2      L3      L4      L5      L6      L7      L8
E1  (202, 206, 204)                31.62    6.00  225.18   81.83  256.36  332.57  156.06  246.95
E2  (194, 192, 196)                47.71   14.70  242.55   99.06  273.72  315.25  152.84  229.44
E3  (228, 226, 240)                20.69   52.46  178.33   34.93  209.43  379.99  178.50  292.05
E4  (226, 222, 240)                18.97   49.84  182.00   38.68  213.07  376.61  176.69  288.32
E5  (330, 260, 180)               120.48  140.14  169.99  120.68  195.50  437.56  285.46  365.23
E6  (122, 118, 124)               173.27  139.84  368.42  224.45  399.59  189.42  170.28  108.33
E7  (336, 344, 192)               171.42  194.04  140.24  149.35  159.12  498.18  309.25  426.85
E8  (240, 252, 260)                54.18   86.00  145.18    0.00  176.20  413.54  193.71  325.37
E9  (354, 352, 356)               229.86  263.32   35.10  179.48    6.63  592.37  351.00  504.23
E10 (30, 330, 260)                229.90  224.19  312.47  224.02  335.20  402.54  105.01  327.13
E11 (14, 12, 10)                  362.63  329.09  557.73  413.73  588.90    5.66  311.80  102.53
E12 (240, 38, 28)                 263.66  240.37  435.95  315.63  464.85  231.89  337.25  207.72

Similarly, for the Video 2 we have the following results. In the first variant, the most similar entering persons for the eight leaving persons are: E4, E1, E9, E8, E9, E11, E10, E11; so, the missing people are: E2, E3, E5, E6, E7, E12. For the second variant of the matching procedure, the most similar among those still inside are the entering people: E4, E1, E9, E8, E7, E11, E10, E6, and the missing persons are: E2, E3, E5, E12. The real missing people in the Video 1 were: E2, E4, and E7, whereas the real missing people in the Video 2 were: E2, E3, E4, and E5. Table 4 presents the values of the standard measures of recall, precision, and F-measure for the experimental results obtained on Video 1 and Video 2.

In the second part of the tests, similar analyses were performed using people silhouettes instead of dominant colors. The size (width and height) of the rectangle determined by the people detection and counting method applied in the experiments was used as the estimated size of a detected person. The rectangle size was measured at the moment a person was crossing the virtual line in the middle of the scene (Figs. 2 and 3).

K. Choroś and M. Uran

Table 4. Efficiency measures of detecting missing people based on dominant colors.

| | Recall | Precision | F-measure |
| Video 1 – Variant 1 | 1.00 | 0.75 | 0.86 |
| Video 1 – Variant 2 | 0.66 | 0.66 | 0.66 |
| Video 2 – Variant 1 | 0.75 | 0.50 | 0.60 |
| Video 2 – Variant 2 | 0.75 | 0.75 | 0.75 |
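The efficiency measures are the usual set-based recall, precision, and F-measure over the predicted and real sets of missing people. As a sketch, the Video 1 / Variant 2 row can be reproduced from the sets stated in the text (predicted {E2, E5, E7} vs. real {E2, E4, E7}); the value 0.66 in the table is 2/3 truncated rather than rounded.

```python
# Recall, precision and F-measure for predicted vs. real missing-person
# sets, reproducing the Video 1 / Variant 2 row of Table 4.

def prf(predicted, real):
    tp = len(predicted & real)                     # correctly flagged
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(real) if real else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return recall, precision, f

predicted = {"E2", "E5", "E7"}   # variant 2 prediction for Video 1
real = {"E2", "E4", "E7"}        # real missing people in Video 1
r, p, f = prf(predicted, real)
print(round(r, 2), round(p, 2), round(f, 2))  # 0.67 0.67 0.67 (= 2/3)
```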

The distances of the estimated sizes of entering and leaving people detected in Videos 1 and 2 are presented in Tables 5 and 6. The most probable entering persons who then exited are those with the lowest distance measure. For every leaving person, we look for the most similar entering person (Euclidean distance of the sizes of the detected people silhouettes). So, the most similar entering persons for the seven leaving persons are: E1, E7, E1, E3, E7, E8, E7. Therefore, we can conclude that the missing people are: E2, E4, E5, E6, E9, E10. Also in the case of the analysis of people silhouettes, it happens that one entering person is similar to more than one leaving person: in Video 1, the leaving persons L2, L5, and L7 are all matched to the entering person E7.

Table 5. Euclidean distance measures of silhouette sizes for people entering and leaving detected in the Video 1.

Estimated sizes (width × height) of leaving persons: L1 = 129 × 206, L2 = 108 × 123, L3 = 136 × 215, L4 = 128 × 185, L5 = 147 × 100, L6 = 124 × 180, L7 = 98 × 137.

| Entering (w × h) | L1 | L2 | L3 | L4 | L5 | L6 | L7 |
| E1 (118 × 210) | 11.70 | 87.57 | 18.68 | 26.93 | 113.76 | 30.59 | 75.69 |
| E2 (121 × 189) | 18.79 | 67.27 | 30.02 | 8.06 | 92.72 | 9.49 | 56.86 |
| E3 (121 × 187) | 20.62 | 65.31 | 31.76 | 7.28 | 90.80 | 7.62 | 55.04 |
| E4 (143 × 197) | 16.64 | 81.86 | 19.31 | 19.21 | 97.08 | 25.50 | 75.00 |
| E5 (115 × 204) | 14.14 | 81.30 | 23.71 | 23.02 | 108.81 | 25.63 | 69.12 |
| E6 (128 × 170) | 36.01 | 51.08 | 45.71 | 15.00 | 72.53 | 10.77 | 44.60 |
| E7 (98 × 127) | 84.86 | 10.77 | 95.85 | 65.30 | 55.95 | 59.03 | 10.00 |
| E8 (126 × 175) | 31.14 | 55.03 | 41.23 | 10.20 | 77.88 | 5.39 | 47.20 |
| E9 (87 × 125) | 91.24 | 21.10 | 102.47 | 72.67 | 65.00 | 66.29 | 16.28 |
| E10 (79 × 126) | 94.34 | 29.15 | 105.69 | 76.69 | 72.80 | 70.29 | 21.95 |
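As a sanity check, the silhouette-size distances in Table 5 are plain Euclidean distances between (width, height) pairs; for example, the first entry compares E1 = (118, 210) with L1 = (129, 206).

```python
from math import hypot

# Euclidean distance between two silhouette sizes (width, height),
# checked against the first entry of Table 5: d(E1, L1) = 11.70.
def silhouette_distance(a, b):
    return hypot(a[0] - b[0], a[1] - b[1])

d = silhouette_distance((118, 210), (129, 206))
print(round(d, 2))  # 11.7
```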


Furthermore, the second matching procedure was applied, in which we look for the most similar entering person but only among those still inside. If the order of the leaving persons in the tests corresponds to their index, then, based on the similarities between people silhouettes, the probable entering people who exited are: E1, E6, E8, E3, E9, E4, E10. In this second variant, the missing people are: E2, E5, E6.

Table 6. Euclidean distance measures of silhouette sizes for people entering and leaving detected in the Video 2.

Estimated sizes (width × height) of leaving persons: L1 = 121 × 185, L2 = 134 × 233, L3 = 121 × 169, L4 = 96 × 153, L5 = 135 × 197, L6 = 132 × 158, L7 = 98 × 137, L8 = 188 × 214.

| Entering (w × h) | L1 | L2 | L3 | L4 | L5 | L6 | L7 | L8 |
| E1 (131 × 180) | 11.18 | 53.08 | 14.87 | 44.20 | 17.46 | 22.02 | 54.20 | 66.37 |
| E2 (73 × 163) | 52.80 | 92.85 | 48.37 | 25.08 | 70.71 | 59.21 | 36.07 | 125.80 |
| E3 (128 × 193) | 10.63 | 40.45 | 25.00 | 51.22 | 8.06 | 35.23 | 63.53 | 63.57 |
| E4 (139 × 212) | 32.45 | 21.59 | 46.62 | 73.01 | 15.52 | 54.45 | 85.48 | 49.04 |
| E5 (58 × 149) | 72.56 | 113.28 | 66.10 | 38.21 | 90.74 | 74.55 | 41.76 | 145.34 |
| E6 (122 × 185) | 1.00 | 49.48 | 16.03 | 41.23 | 17.69 | 28.79 | 53.67 | 72.09 |
| E7 (82 × 156) | 48.60 | 92.91 | 41.11 | 14.32 | 67.01 | 50.04 | 24.84 | 120.83 |
| E8 (91 × 180) | 30.41 | 68.25 | 31.95 | 27.46 | 47.17 | 46.53 | 43.57 | 102.79 |
| E9 (143 × 217) | 38.83 | 18.36 | 52.80 | 79.40 | 21.54 | 60.02 | 91.79 | 45.10 |
| E10 (88 × 173) | 35.11 | 75.60 | 33.24 | 21.54 | 52.77 | 46.49 | 37.36 | 108.08 |
| E11 (117 × 180) | 6.40 | 55.66 | 11.70 | 34.21 | 24.76 | 26.63 | 47.01 | 78.72 |
| E12 (137 × 215) | 34.00 | 18.25 | 48.70 | 74.33 | 18.11 | 57.22 | 87.21 | 51.01 |

Similarly, for the Video 2 we have the following results for the analysis of the silhouette sizes of the detected people. In the first variant, the most similar entering persons for the eight leaving persons are: E6, E12, E11, E7, E3, E1, E7, E9; so, the missing people are: E2, E3, E5, E8, E10. For the second variant of the matching procedure, the most similar among those still inside are the entering people: E6, E12, E11, E7, E3, E1, E2, E9.


The missing persons are then: E4, E5, E8, E10. Table 7 presents the values of the standard measures of recall, precision, and F-measure for the experimental results based on the matching of people silhouette sizes obtained on Video 1 and Video 2.

Table 7. Efficiency measures of detecting missing people based on people silhouettes.

| | Recall | Precision | F-measure |
| Video 1 – Variant 1 | 0.66 | 0.33 | 0.44 |
| Video 1 – Variant 2 | 0.33 | 0.33 | 0.33 |
| Video 2 – Variant 1 | 0.75 | 0.60 | 0.67 |
| Video 2 – Variant 2 | 0.50 | 0.50 | 0.50 |

5 Conclusions

People counting methods can be applied to control the numbers of people entering and leaving monitored spaces such as a tourist bus, an office or a university building, a public building, a supermarket or shopping mall, a culture or sport center, etc. The problem arises when the number of entering people is not equal to the number of exiting people: how many people are missing, and who is missing? The paper presents an approach for people counting and then reducing the set of potential missing persons chosen from among the people entering. This approach is based on the analysis and comparison of the dominant colors of entering and leaving persons, as well as of their silhouette sizes. Both procedures finally allow us to define specific features distinctive for missing people. The results of tests performed with video recordings of people entering and exiting through a door are promising.

What is very important in this approach is the fact that individuals' identities are not registered; therefore, privacy violation is avoided. This is a critical factor, because the protection of personal data and of privacy is a priority in the present day.

Further experimental research will be undertaken on real surveillance videos to better verify the efficiency of the proposed approach. The first results and observations have shown that mainly the measuring of silhouette sizes should be improved, and that the position and placement of the camera have a great influence on the measured values. Despite this, the technique presented in the paper promises an effective solution for identifying and reducing the set of potential missing people.


References

1. Benenson, R., Omran, M., Hosang, J., Schiele, B.: Ten years of pedestrian detection, what have we learned? In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8926, pp. 613–627. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16181-5_47
2. Jingying, W.: A survey on crowd counting methods and datasets. In: Bhatia, S.K., Tiwari, S., Ruidan, S., Trivedi, M.C., Mishra, K.K. (eds.) Advances in Computer, Communication and Computational Sciences. AISC, vol. 1158, pp. 851–863. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-4409-5_76
3. Li, B., Huang, H., Zhang, A., Liu, P., Liu, C.: Approaches on crowd counting and density estimation: a review. Pattern Anal. Appl. 24(3), 853–874 (2021)
4. Sindagi, V.A., Patel, V.M.: A survey of recent advances in CNN-based single image crowd counting and density estimation. Pattern Recogn. Lett. 107, 3–16 (2018)
5. Fan, Z., Zhang, H., Zhang, Z., Lu, G., Zhang, Y., Wang, Y.: A survey of crowd counting and density estimation based on convolutional neural network. Neurocomputing 472, 224–251 (2022)
6. Lin, S.F., Chen, J.Y., Chao, H.X.: Estimation of number of people in crowded scenes using perspective transformation. IEEE Trans. Syst., Man, Cybern. - Part A: Syst. Humans 31(6), 645–654 (2001)
7. Loy, C.C., Chen, K., Gong, S., Xiang, T.: Crowd counting and profiling: methodology and evaluation. In: Ali, S., Nishino, K., Manocha, D., Shah, M. (eds.) Modeling, Simulation and Visual Analysis of Crowds. TISVC, vol. 11, pp. 347–382. Springer, New York (2013). https://doi.org/10.1007/978-1-4614-8483-7_14
8. Zhang, Z., Wang, M., Geng, X.: Crowd counting in public video surveillance by label distribution learning. Neurocomputing 166, 151–163 (2015)
9. Choroś, K.: Further tests with click, block, and heat maps applied to website evaluations. In: Jędrzejowicz, P., Nguyen, N.T., Hoang, K. (eds.) ICCCI 2011. LNCS (LNAI), vol. 6923, pp. 415–424. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23938-0_42
10. Fan, Z., Zhu, Y., Song, Y., Liu, Z.: Generating high quality crowd density map based on perceptual loss. Appl. Intell. 50(4), 1073–1085 (2019). https://doi.org/10.1007/s10489-019-01573-7
11. Kang, D., Ma, Z., Chan, A.B.: Beyond counting: comparisons of density maps for crowd analysis tasks – counting, detection, and tracking. IEEE Trans. Circuits Syst. Video Technol. 29(5), 1408–1422 (2018)
12. Pham, V.Q., Kozakaya, T., Yamaguchi, O., Okada, R.: COUNT forest: CO-voting uncertain number of targets using random forest for crowd density estimation. In: Proceedings of the IEEE International Conference on Computer Vision ICCV, pp. 3253–3261 (2015)
13. Ryan, D., Denman, S., Fookes, C., Sridharan, S.: Crowd counting using multiple local features. In: Proceedings of the Digital Image Computing: Techniques and Applications DICTA, pp. 81–88. IEEE (2009)
14. Wang, C., Zhang, H., Yang, L., Liu, S., Cao, X.: Deep people counting in extremely dense crowds. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1299–1302 (2015)
15. Ranjan, V., Le, H., Hoai, M.: Iterative crowd counting. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 278–293. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_17
16. Liu, W., Salzmann, M., Fua, P.: Context-aware crowd counting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5099–5108 (2019)
17. Snidaro, L., Micheloni, C., Chiavedale, C.: Video security for ambient intelligence. IEEE Trans. Syst., Man, Cybern. - Part A: Syst. Humans 35(1), 133–144 (2005). https://doi.org/10.1109/TSMCA.2004.838478
18. Rabaud, V., Belongie, S.: Counting crowded moving objects. In: Proceedings of the Computer Society Conference on Computer Vision and Pattern Recognition CVPR'06, vol. 1, pp. 705–711. IEEE (2006)
19. Sabzmeydani, P., Mori, G.: Detecting pedestrians by learning shapelet features. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2007)
20. Park, J.H., Cho, S.I.: Flow analysis-based fast-moving flow calibration for a people-counting system. Multimedia Tools Appl. 80(21–23), 31671–31685 (2021). https://doi.org/10.1007/s11042-021-11231-1
21. Marciniak, T., Dąbrowski, A., Chmielewska, A., Nowakowski, M.: Real-time bi-directional people counting using video blob analysis. In: Proceedings of the Joint Conference New Trends in Audio & Video and Signal Processing: Algorithms, Architectures, Arrangements and Applications NTAV/SPA, pp. 161–166. IEEE (2012)
22. Budge, S.E., Sallay, J.A., Wang, Z., Gunther, J.H.: People matching for transportation planning using texel camera data for sequential estimation. IEEE Trans. Syst., Man, Cybernet.: Syst. 43(3), 619–629 (2012)
23. Nasir, A.S.A., Gharib, N.K.A., Jaafar, H.: Automatic passenger counting system using image processing based on skin colour detection approach. In: Proceedings of the International Conference on Computational Approach in Smart Systems Design and Applications ICASSDA, pp. 1–8. IEEE (2018)
24. Cho, S.I., Kang, S.J.: Real-time people counting system for customer movement analysis. IEEE Access 6, 55264–55272 (2018)
25. Hsu, Y.W., Wang, T.Y., Perng, J.W.: Passenger flow counting in buses based on deep learning using surveillance video. Optik – Int. J. Light Electron Optics 202, 163675 (2020)
26. Zhao, J., Li, C., Xu, Z., Jiao, L., Zhao, Z., Wang, Z.: Detection of passenger flow on and off buses based on video images and YOLO algorithm. Multimed. Tools Appl. 81(4), 4669–4692 (2022). https://doi.org/10.1007/s11042-021-10747-w
27. Rosebrock, A.: OpenCV people counter with Python, Object Tracking Tutorials. https://pyimagesearch.com/2018/08/13/opencv-people-counter/ (2021)

Toward Understanding the Impact of Input Data for Multi-Image Super-Resolution

Jakub Adler, Jolanta Kawulok, and Michal Kawulok

Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
[email protected]

Abstract. Super-resolution reconstruction is a common term for a variety of techniques aimed at enhancing spatial resolution either from a single image or from multiple images presenting the same scene. While single-image super-resolution has been intensively explored, with many advancements attributed to the use of deep learning, multi-image reconstruction remains a much less explored field. The first solutions based on convolutional neural networks were proposed recently for super-resolving multiple Proba-V satellite images, but they have not been validated for enhancing natural images so far. Also, their sensitivity to the characteristics of the input data, including their mutual similarity and image acquisition conditions, has not been explored in depth. In this paper, we address this research gap to better understand how to select and prepare the input data for reconstruction. We expect that the reported conclusions will help in elaborating more efficient super-resolution frameworks that could be deployed in practical applications.

Keywords: Super-resolution · Multi-image super-resolution · Deep learning · Convolutional neural networks · Data selection

1 Introduction

Super-resolution (SR) reconstruction consists in generating a high-resolution (HR) image from a low-resolution (LR) observation, being either a single image or multiple images presenting the same scene [26,39]. In the latter case, we benefit from information fusion, as each LR image carries a different portion of HR information. In contrast to single-image SR (SISR), the multi-image SR (MISR) techniques demonstrate higher capabilities of reconstructing the true underlying HR information, rather than hallucinating it based on a single LR image.

(This research was supported by the National Science Centre, Poland, under Research Grant 2019/35/B/ST6/03006 (MK) and co-financed by the Silesian University of Technology grant for maintaining and developing research potential (JK).)

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 329–342, 2022. https://doi.org/10.1007/978-3-031-21967-2_27

Single-image SR has received considerable attention in recent years, and a number of new techniques have appeared that significantly improved the state of the art in this field [34]. These advancements were attributed to deep learning, in particular to the use of fully convolutional neural networks (CNNs), which demonstrate high capabilities in learning the relation between LR and HR. From a large number of corresponding LR–HR patches, a network learns how to transform a single LR image into the resulting HR one. Importantly, this process can be relatively easily modeled by exploiting the architectural choices originally developed for feature representation and nonlinear mapping [11]. However, for multi-image SR, the reconstruction process is much more complicated due to the variations among the input LR images—commonly, there are sub-pixel translations between them, and they vary with regard to the lighting conditions. This makes it much more challenging to implement the MISR process using a deep network, and the first end-to-end solutions were proposed as late as 2019 [6,8,29]. They were developed to address the Proba-V SR Challenge [25] organized by the European Space Agency (ESA)—at that time, deep learning-based solutions were already considered the state of the art in SISR [34].

Another important reason why the problem of multi-image SR is much less explored than SISR lies in the availability of real-world data for training and evaluation. The Proba-V mission allows for acquiring satellite images at two different scales, namely at 100 m and 300 m ground sampling distance (GSD), which corresponds to the size of a pixel in these images. In this way, the process of data collection was automated and allowed for assembling a large dataset with over 1000 scenes (each scene is composed of a single HR image coupled with at least nine LR images captured at subsequent revisits of the imaging sensor).
Actually, the data prepared for the Proba-V challenge were the first large-scale real-world dataset for MISR—a few other datasets have been published since then, but overall creating a dataset for MISR is difficult and costly.

Although the MISR networks demonstrated high performance for the Proba-V dataset, it is not clear yet whether they can be applied in other domains, and whether they require retraining before they are applied to images of a different kind. Also, the sensitivity of these models to the input data characteristics (including the number of LR images and their diversity) has not been explored in depth. With the research reported in this paper, we intend to address the outlined problems, and we investigate the performance of the residual attention model (RAMS) [29] trained with the Proba-V images when applied to super-resolve natural images, as well as images acquired by the Sentinel-2 satellite of much higher spatial resolution (GSD of 10 m). Overall, our contribution is focused on the following points:

– We demonstrate that the RAMS model trained with the Proba-V data can be successfully applied to super-resolve real-world natural images.
– We investigate the impact of the number of LR images presented for reconstruction on the final reconstruction accuracy.
– We report our analysis regarding the influence of imaging conditions on the final reconstruction outcome.
– We explore how to select a subset of LR images out of all the images presented for reconstruction.


In addition to that, we present a thorough literature review embracing both single-image and multi-image SR techniques. While several excellent reviews on SISR have been published recently [7,37], the advancements in MISR attributed to deep learning have not been summarized so far.

2 Literature Review

Super-resolution reconstruction is commonly categorized into single-image and multi-image SR, depending on the input data presented for reconstruction. Importantly, there may be different goals of SR—they range from generating a visually plausible result expected to appear as if it were real [23], to discovering the ground-truth HR information [16]. The SR goal is related to the data used for training and evaluation—this issue is also tackled in this section.

2.1 Single-Image Super-Resolution

The SISR techniques are mostly based on learning the mapping from low to high resolution. Initially, this process was realized in the form of creating a dictionary between LR and HR patches, i.e., small image fragments of a constant size. The use of dictionary learning for single-image SR was intensively explored relying on sparse-coding techniques [1]. In 2014, Dong et al. demonstrated that their CNN for single-image SR (SRCNN) [9] can outperform the methods based on sparse coding despite a relatively simple architecture composed of three convolutional layers. The limited capabilities of SRCNN were addressed with a very deep SR network [19] which can be efficiently trained relying on fast residual learning. Liu et al. demonstrated how to exploit domain expertise in the form of sparse priors while learning the deep model [24]. Their sparse coding network was reported to achieve high training speed and model compactness. In addition to that, a deep Laplacian pyramid network (LapSRN) with progressive upsampling [22] was shown to deliver competitive results at a high processing speed. An important research direction in SISR is concerned with the use of generative adversarial networks (GANs), which are composed of a generator, trained to perform SR, and a discriminator, trained to distinguish between the images produced by the generator and real HR images. The originally proposed SRGAN [23] was further improved in many ways, including enhancement with dense blocks to improve the perceptual quality of the reconstructed images [33], especially in highly textured areas. Although GANs are primarily focused on obtaining a plausible visual effect rather than reconstructing the underlying HR information, there were also some attempts to exploit them for remote sensing applications where the latter is pivotal [30].
The current trends in SISR are focused on enhancing real-world images, proposing faster and lightweight models [13], including knowledge distillation [12], as well as trying to understand, explain and interpret the super-resolution processes [2]. Most of the SR methods have been developed and validated relying on artificial scenarios—some original images are treated as an HR reference and they are degraded and downscaled to create simulated LR inputs. Unfortunately, the models elaborated using simulated data commonly do not perform well when they are fed with real data (i.e., not with simulated LR images), resulting in artefacts and poor reconstruction of image details. This is caused by the fact that the simulation process does not reflect well the actual imaging process, which we studied in our earlier works [3,16]. Recently, some real-world datasets have been published (e.g., the DRealSR dataset [36]) and more attention is paid to this important problem [7].

2.2 Multi-Image Super-Resolution

SR from multiple images consists in fusing the complementary information from a number of available LR observations presenting the same scene with sub-pixel displacements. As multiple observations provide more data extracted from the analyzed scene, the reconstruction can be more accurate than for single-image methods. Existing MISR techniques [39] are based on the premise that each LR observation has been derived from an original HR image, degraded using an assumed imaging model (IM) composed of image warping, blurring, decimation and contamination with noise. The reconstruction consists in reversing that degradation process, which requires solving an ill-posed optimization problem. Therefore, a number of optimization-based techniques have been proposed for MISR, coupled with some regularisation to provide a proper balance between spatial smoothness and sharp edges in the reconstructed image [10]. These techniques are highly parameterized and implement complex image processing pipelines. Most of them do not require any training, but this also means that the relation between low and high resolution, as well as the constraints imposed on the solution space, must be known and defined a priori, rather than learned from the available data.

Deep learning was not applied to MISR until 2019, when we introduced our EvoNet framework [17], which couples the advantages of deep learning-empowered SISR with data fusion performed using evolutionarily-tuned imaging models [15]. This was followed by the DeepSUM network [6], which learns the whole multi-image reconstruction process in an end-to-end manner, including single-image enhancement of individual LR input images, their sub-pixel co-registration, and data fusion.
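The assumed imaging model (warp, blur, decimate, add noise) can be sketched numerically. The following is an illustrative numpy simulation only, not any cited method's exact model: a whole-pixel np.roll stands in for sub-pixel warping, a 3 × 3 box filter for the optical PSF, and the parameter values are arbitrary.

```python
import numpy as np

# Illustrative sketch of the MISR imaging model: each LR observation is
# the HR scene warped, blurred, decimated, and contaminated with noise.

def box_blur3(img):
    # 3x3 box blur as a simple stand-in for the optical PSF
    p = np.pad(img, 1, mode="edge")
    return sum(p[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def simulate_lr(hr, shift, scale=3, noise_std=0.01, seed=0):
    warped = np.roll(hr, shift, axis=(0, 1))   # image warping (whole-pixel)
    blurred = box_blur3(warped)                # blurring
    decimated = blurred[::scale, ::scale]      # decimation (downsampling)
    rng = np.random.default_rng(seed)
    return decimated + rng.normal(0.0, noise_std, decimated.shape)  # noise

hr = np.random.default_rng(42).random((96, 96))
# A stack of LR observations of the same scene with different shifts:
lr_stack = [simulate_lr(hr, s, seed=i)
            for i, s in enumerate([(0, 0), (1, 2), (2, 1)])]
print(lr_stack[0].shape)  # (32, 32)
```

Reconstruction then amounts to inverting this degradation pipeline for the whole stack at once, which is the ill-posed problem the optimization-based methods regularize.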
Although DeepSUM won the Proba-V SR challenge organized by ESA in 2019, its training process is quite lengthy and it was designed to address the specific Proba-V dataset—the number of input LR images is fixed and set to N = 9, and the magnification factor is 3×. The challenge, along with the publication of the first large-scale real-world multi-image dataset, intensified the research on using deep learning for MISR. The RAMS model [29] utilizes the spatial and temporal correlations within a stack of LR images by selecting the extracted convolutional features using the attention mechanism. In HighRes-net [8], the learned latent LR representations are combined in a recursive manner to obtain the global representation that is upsampled to obtain the final super-resolved image. Also, a recurrent network (MISR-GRU) [28] has been proposed to deal with the Proba-V data, which treats the input LR images as a sequence. Recently, we have introduced Magnet—a graph neural network for multi-image SR [31]. In Magnet, a stack of co-registered input images is represented as a graph which is processed with spline-based convolutions in subsequent network layers. Finally, the resulting graph is converted into feature maps that are subject to the pixel shuffle operation to generate the super-resolved image. Importantly, such data representation allows for fusing any number of LR images, and their spatial resolution does not have to be identical within a stack.

Some of the advancements concerning SISR can also be exploited for MISR. This includes the recently proposed OpTiGAN [30], which combines the benefits of multi-image fusion with a GAN that improves the perceptual image quality. A specific category of MISR that has also received considerable attention from researchers is concerned with super-resolving video sequences. Kappeler et al. proposed a CNN [14] which is fed with three consecutive motion-compensated video frames, thus operating in both the spatial and temporal domain. Yang et al. proposed a new recurrent neural network for processing video sequences [38], which generates the super-resolved image based on five adjacent frames. However, the CNN-based video SR techniques are based on explicit or implicit assumptions on the input stream, concerned with a fixed and rather high sampling frequency or the presence of moving objects, whose resolution can be increased by estimating the motion fields. Hence, they cannot be applied to a general MISR case. Another sort of MISR techniques in which deep learning is employed is concerned with super-resolving burst images [5]—series of images captured one after another within a short time interval. Several new architectures were recently proposed to address one of the NTIRE challenge tracks that was focused on this interesting problem [4]. Also, a network for super-resolving bursts of satellite images was proposed in [27].
Overall, the field of MISR is much less explored than SISR, but it is clear that this task can be effectively performed using CNNs in an end-to-end manner. The main obstacle is the low availability of large-scale datasets with real LR and HR images. Apart from the SupER dataset [20], created with hardware binning (thus not fully reflecting the operational conditions), the Proba-V dataset remains the only real-world benchmark for MISR. It was recently reported in [32] that by making the DeepSUM network invariant to the permutation of the LR inputs, it is possible to improve the network's performance and decrease the amount of training data required for super-resolving Proba-V images. We have also demonstrated that the RAMS and HighRes-net models trained with the Proba-V images are suitable for enhancing multispectral images acquired within the Sentinel-2 mission [18], even though the spatial resolution of the latter is 30× higher than that of Proba-V (GSD of 10 m compared with 300 m). However, the possibility of using Proba-V data for preparing deep models suitable for super-resolving other types of images, including natural ones, has not been explored so far. Also, it has not been investigated in depth how the number of LR images, along with their quality and mutual similarity, influences the reconstruction quality.

3 Proposed Experimental Setup

In this section, we outline the RAMS network [29] selected for our experimental study and justify that choice (Sect. 3.1). Furthermore, we present the test set (Sect. 3.2) and specify the testing procedure along with the metrics used for evaluation (Sect. 3.3).

3.1 Models Exploited for Multi-Image Super-Resolution

Architecture of the residual attention multi-image super-resolution network is outlined in Fig. 1. The network is composed of two paths—the main branch and the global residual branch with a residual temporal attention block (RTAB). RTAB performs a simple fusion of the LR images presented for reconstruction, hence generating a baseline coarse solution that is added to the outcome of the main branch. The attention mechanism in RTAB applies relevance weights to individual LR input images, thus the network selects the most valuable ones for reconstruction. The main branch extracts temporal-spatial features using 3D convolutions applied to a stack of LR images that are treated as a 3D cube. In this branch, the attention mechanism is applied in the feature domain to select the most valuable features in residual feature attention blocks (RFABs). Later in the main branch, RFABs are topped with the layers that fuse the feature maps in the temporal dimension—each temporal reduction block (TRB) decreases the temporal depth by one, until the temporal dimension is eliminated (reduced to a single feature map).
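The temporal attention idea behind RTAB can be illustrated with a toy sketch: each LR image in the stack receives a relevance weight, and the weighted stack is fused into a coarse baseline. This is a schematic numpy illustration of the mechanism only; in RAMS, the weights are learned with convolutions rather than given.

```python
import numpy as np

# Toy illustration of temporal attention fusion: relevance weights per LR
# image (here supplied, normally learned) select the most valuable inputs.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention_fusion(lr_stack, scores):
    """lr_stack: (N, H, W) stack of co-registered LR images;
    scores: one relevance score per image."""
    w = softmax(np.asarray(scores))           # attention weights, sum to 1
    return np.tensordot(w, lr_stack, axes=1)  # weighted fusion over time

stack = np.stack([np.full((4, 4), v) for v in (1.0, 2.0, 3.0)])
fused = temporal_attention_fusion(stack, scores=[0.0, 0.0, 10.0])
print(round(fused[0, 0], 3))  # 3.0 -- the highest-scored image dominates
```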

Fig. 1. Outline of the RAMS architecture (block diagram: the main branch with 3D Conv, RFAB and TRB blocks followed by upsampling, and a global residual branch with a residual temporal attention block, 2D Conv and upsampling, joined by a global residual connection).

We have decided to exploit the RAMS network in our study, as it is a state-ofthe-art technique for MISR (it performs better than DeepSUM and similarly to HighRes-net). Moreover, the authors have published the implementation1 alongside the ready-to-use trained models. Importantly, it may be expected that the attention mechanism increases the network’s robustness against LR data selection and preprocessing, so it should be less sensitive to the imaging conditions 1

The RAMS implementation is available at https://github.com/EscVM/RAMS.

Impact of Input Data for Multi-Image Super-Resolution

335

and diversity among the input data. In our study, we exploit two RAMS models published by the authors that were trained using (i ) 415 scenes with RED band images and with (ii ) 396 NIR band images from the Proba-V dataset. Each scene is composed of a single HR image with a size of 384 × 384 pixels (100 m GSD) and 9 to 35 LR images (128 × 128 pixels, 300 m GSD). Both models increase the spatial resolution 3× which results from the characteristics of the Proba-V dataset. 3.2

3.2 Test Data

For our study, we exploited data from our B4MultiSR dataset [21], which contains several examples of natural and satellite images with both simulated and real LR counterparts. The real part is composed of multiple LR and HR images captured independently with various sensors of different native resolutions. We used six scenes, including one synthetic image (text rendered over a white background) with simulated LR images, two natural images with simulated LR counterparts, and three real-world scenes, each of which contains one HR image coupled with 26 to 40 independently captured LR images. In Fig. 2, we present the real-world images used in our study along with four examples of LR images. The size of the LR images ranges from 271 × 300 up to 451 × 641 pixels. It is worth noting that these LR images are slightly shifted, hence they carry different portions of HR information. For each setting, a full series of LR images has been collected. In addition, the boat scene (middle row in Fig. 2) includes LR images captured in different lighting conditions (regular, sidelight, and flash), as shown in Fig. 3. Unfortunately, although our dataset is appropriate for testing the behavior of trained models, the amount of data is by far insufficient to train models for MISR.

3.3 Testing Procedure

We assess the SR quality based on the similarity between the super-resolved image and the HR reference. The similarity is measured using the peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM), and the universal image quality index (UIQI) [35]. As proposed in [25] for the Proba-V challenge, we compensate the outcome in terms of whole-pixel translation and mean brightness. Basically, a super-resolved image may be slightly translated with regard to the reference image, and it may be darker or brighter without losing any details, which would unnecessarily affect the similarity score. Therefore, the image is translated by up to three pixels in the vertical and horizontal directions to maximize the PSNR score. Also, the mean brightness is subtracted from the pixels in the HR reference image and from the super-resolved outcome. The PSNR metric with such correction was termed cPSNR, and we adjust the brightness and apply the determined translations prior to computing the SSIM and UIQI metrics. Furthermore, as the ratio between LR and HR sizes differs for each scene, the super-resolved image (3× larger than LR) is scaled bicubically to the size of the HR reference before computing the metrics.
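The translation- and brightness-compensated score can be sketched as follows. This is a minimal illustration of the compensation idea (exhaustive whole-pixel shifts plus a mean-brightness offset), not the reference scoring code from the Proba-V challenge:

```python
import numpy as np

def psnr(hr, sr, peak=1.0):
    mse = np.mean((hr - sr) ** 2)
    return np.inf if mse == 0 else 10 * np.log10(peak ** 2 / mse)

def cpsnr(hr, sr, max_shift=3, peak=1.0):
    # Try every whole-pixel translation of the SR image up to +/- max_shift;
    # for each candidate, compensate the mean-brightness difference on the
    # cropped overlap and keep the best PSNR.
    best = -np.inf
    m = max_shift
    for du in range(-m, m + 1):
        for dv in range(-m, m + 1):
            shifted = np.roll(sr, (du, dv), axis=(0, 1))
            h, s = hr[m:-m, m:-m], shifted[m:-m, m:-m]
            s = s + (h.mean() - s.mean())   # brightness compensation
            best = max(best, psnr(h, s, peak))
    return best
```

The border crop (`m` pixels on each side) discards the rows and columns invalidated by the candidate shifts, so all candidates are compared on the same overlap region.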


J. Adler et al.


Fig. 2. Real-world scenes used in our study. The images in the (a) text and (b) boat scenes were captured using different cameras, and the (c) bushehr scene presents images acquired with WorldView-2 (HR) and Sentinel-2 (LR) satellites. Each selected HR region is magnified and corresponding regions in four LR images are shown on the right. For better visualization, the LR regions are magnified bicubically.


Fig. 3. LR images captured in different lighting conditions for the boat scene.

The goals of our study were multifarious. First, our intention was to verify whether RAMS models trained using Proba-V data can be applied to super-resolving natural images. This was performed by presenting nine LR images from each scene for reconstruction using RAMS, and we actively select which LR images are presented. For each scene, we select nine LR images in a greedy approach: first, we determine the mutual cPSNR scores among the LR images to select the pair of images with the highest (or lowest) similarity, and then we iteratively adjoin the LR image with the highest (or lowest) cPSNR score to any of the already selected images. In this way, for each scene we select two sets of nine LR images, one with high and one with low mutual similarity.
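The greedy selection described above can be sketched as follows; `pair_score` is a hypothetical stand-in for the mutual cPSNR used by the authors (plain PSNR here, for illustration only):

```python
import numpy as np

def pair_score(a, b):
    # Hypothetical mutual-similarity score between two LR images; plain PSNR
    # stands in for the mutual cPSNR described in the text.
    return -10 * np.log10(np.mean((a - b) ** 2) + 1e-12)

def greedy_select(lr_images, n=9, maximize=True):
    # Start from the pair with the highest (or lowest) mutual score, then
    # iteratively adjoin the image whose best link to the already-selected
    # set is the highest (or lowest).
    k = len(lr_images)
    score = {(i, j): pair_score(lr_images[i], lr_images[j])
             for i in range(k) for j in range(i + 1, k)}
    pick = max if maximize else min
    i, j = pick(score, key=score.get)
    selected = [i, j]
    while len(selected) < n:
        rest = [c for c in range(k) if c not in selected]
        link = lambda c: pick(score[tuple(sorted((c, s)))] for s in selected)
        selected.append(pick(rest, key=link))
    return selected
```

Running the routine twice, with `maximize=True` and `maximize=False`, yields the two nine-image sets of high and low mutual similarity.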


We applied this procedure to all of the considered scenes. Subsequently, we investigated the performance for different numbers of input LR images presented for reconstruction, and we analyzed how the lighting conditions in which the LR images are captured affect the final reconstruction quality. This part of our study was performed for the boat scene, as it encompasses images captured in various lighting conditions.

4 Experimental Results

The quantitative results of the performed reconstruction are reported in Table 1. It can be seen that the RAMS models trained with RED-channel and NIR-channel images behave in a similar way, and both allow for super-resolving the scenes composed of simulated and original LR images. The scores for two scenes (synthetic text and bushehr) are rather low, but it can be seen in Fig. 4 that qualitatively the reconstruction outcomes are comparable for all the scenes, and the detail level has been improved compared with the LR images. For the synthetic text scene, the low scores result from the fact that the white background became grayish after SR and some noise can be seen in the plain regions. As the HR image contains extreme pixel values (minimum for the text and maximum for the background), this cannot be fully compensated with the cPSNR correction. For bushehr, the HR image is captured using a different sensor of much higher spatial resolution, so even though SR offers considerable gain, the outcome is still quite different from the reference image. It can be concluded from Table 1 that it is slightly better to select the most similar images out of those presented for reconstruction (thus to maximize the mutual similarity). However, this cannot be confirmed for the scenes presented in Fig. 4(a–c), for which the better outcome is obtained from the least similar

Table 1. Reconstruction accuracy measured with cPSNR (in dB), SSIM and UIQI, obtained using the RAMS network trained with Proba-V RED and NIR images, with LR images selected so as to minimize or maximize their mutual similarity.

Training set:           Proba-V RED channel                     Proba-V NIR channel
Mutual similarity:      Minimized           Maximized           Minimized           Maximized
                        cPSNR  SSIM  UIQI   cPSNR  SSIM  UIQI   cPSNR  SSIM  UIQI   cPSNR  SSIM  UIQI
Simulated scenes
  (synthetic text)      16.58  0.768 0.980  17.83  0.830 0.985  18.07  0.837 0.986  17.74  0.832 0.985
  (no noise)            33.20  0.964 0.999  32.74  0.947 0.998  32.47  0.961 0.999  30.46  0.927 0.998
  (with noise)          24.98  0.802 0.985  25.87  0.833 0.986  26.48  0.836 0.987  26.64  0.905 0.989
  Average               24.92  0.845 0.988  25.48  0.870 0.990  25.67  0.878 0.991  24.95  0.888 0.991
Real scenes
  (text)                21.71  0.768 0.993  24.05  0.829 0.996  24.40  0.855 0.996  23.98  0.840 0.996
  (boat)                26.58  0.706 0.990  26.33  0.704 0.989  26.79  0.698 0.990  26.42  0.704 0.990
  (bushehr)             18.60  0.532 0.785  19.50  0.537 0.818  16.78  0.503 0.714  19.41  0.532 0.795
  Average               22.30  0.669 0.923  23.29  0.690 0.934  22.66  0.685 0.900  23.27  0.692 0.927
Average (all scenes)    23.61  0.757 0.955  24.39  0.780 0.962  24.17  0.782 0.945  24.11  0.790 0.959


Fig. 4. Outcome of RAMS trained with NIR images for the (a) synthetic text scene with simulated LR, and for the (b) text, (c) boat, and (d) bushehr scenes with real LR. The input images were selected based on maximized and minimized mutual similarity.

Table 2. Reconstruction accuracy measured with cPSNR (in dB), SSIM and UIQI for the boat scene performed from LR images acquired in different lighting conditions: regular (R), flash lamp (F), sidelight (S), and two mixtures of LR images.

                 Proba-V RED channel       Proba-V NIR channel
                 cPSNR   SSIM    UIQI      cPSNR   SSIM    UIQI
Regular          26.58   0.706   0.990     26.79   0.698   0.990
Flash lamp       28.16   0.785   0.998     28.00   0.777   0.998
Sidelight        17.40   0.533   0.922     17.50   0.534   0.923
R+F+S            23.11   0.659   0.979     23.45   0.648   0.981
R+F              27.12   0.730   0.991     27.20   0.732   0.992

images (minimized mutual similarity). For (a) and (b), the text is more legible, and for the boat scene (c), the fabric texture as well as the details in the hat are better visible and resemble the HR reference. For bushehr (d), the use of different LR images resulted in some grid-like artefacts, which also decrease the measured reconstruction accuracy. To better understand the problem of LR data selection, in Fig. 5 we show the outcome of reconstruction performed from images captured in different lighting conditions (regular, with flash lamp, and with sidelight), as well as from two sets with diverse LR images. It can be seen that both regular light and flash lamp lead to high-quality reconstruction, which is reflected in the quantitative scores (Table 2). However, making the set of LR images diverse in terms of lighting conditions deteriorates the reconstruction quality, which can be observed for both the R+F and R+F+S cases. This suggests that it is beneficial to exploit images of low mutual similarity, provided that the dissimilarities result from sub-pixel shifts among them rather than from variations in the lighting conditions. It is important to note that the lighting variations within the bushehr scene are substantial (see Fig. 2), which may be the reason why maximizing the similarity leads to a better outcome in that case.


Fig. 5. Reconstruction outcome from input LR images acquired in different lighting conditions: regular, flash lamp, sidelight, as well as from combined input sets—all of them (R+F+S) and from regular and flash (R+F).

In Fig. 6, we show how the reconstruction outcome depends on the number of LR images presented for reconstruction. There is a clear qualitative and quantitative difference between the single-image and multi-image SR outcomes. It can be noticed that beginning from N = 5 LR images, both the fabric texture of the painting and the stripes on the hat can be clearly distinguished.

Fig. 6. Reconstruction outcome for different numbers of input LR images: N = 1 (25.69 dB), N = 3 (26.40 dB), N = 5 (26.54 dB), N = 7 (26.63 dB), and N = 9 (26.79 dB).

5 Conclusions and Future Work

In this paper, we have reported our study on exploiting a deep model trained with Proba-V images to super-resolve images acquired from different sources. Despite the differences resulting from the sensor characteristics, the trained models allowed for successful enhancement of synthetic, simulated, and real natural images. An important observation is that the models are sensitive to the level and nature of the mutual differences among the input images. While this requires further investigation, possibly relying on more scenes, the reported results suggest that variations among the LR images are beneficial, provided that they do not result from noise or from differences in the lighting conditions. This observation may be helpful in cases where LR images can be acquired in controlled conditions, making it possible to collect highly valuable data that maximize the chances for accurate reconstruction. Our ongoing work is focused on active selection of LR images coupled with their preprocessing, as well as on test-time data augmentation that may help increase the quality of the data presented for reconstruction. Future activities also encompass reconstructing color images, either in a channel-wise manner or by processing all channels at once.

References

1. Alvarez-Ramos, V., Ponomaryov, V., Reyes-Reyes, R.: Image super-resolution via two coupled dictionaries and sparse representation. Multimed. Tools Appl. 77(11), 13487–13511 (2018). https://doi.org/10.1007/s11042-017-4968-3
2. Balestriero, R., Glotin, H., Baraniuk, R.G.: Interpretable super-resolution via a learned time-series representation (2020). arxiv.org/abs/2006.07713
3. Benecki, P., Kawulok, M., Kostrzewa, D., Skonieczny, L.: Evaluating super-resolution reconstruction of satellite images. Acta Astronaut. 153, 15–25 (2018)
4. Bhat, G., Danelljan, M., Timofte, R.: NTIRE 2021 challenge on burst super-resolution: methods and results. In: Proceedings IEEE/CVF CVPR Workshops, pp. 613–626 (2021)
5. Bhat, G., Danelljan, M., Van Gool, L., Timofte, R.: Deep burst super-resolution. In: Proceedings IEEE/CVF CVPR, pp. 9209–9218 (2021)
6. Bordone Molini, A., Valsesia, D., Fracastoro, G., Magli, E.: DeepSUM: deep neural network for super-resolution of unregistered multitemporal images. IEEE TGRS 58(5), 3644–3656 (2020)
7. Chen, H., et al.: Real-world single image super-resolution: a brief review. Inf. Fusion 79, 124–145 (2021)
8. Deudon, M., et al.: HighRes-net: recursive fusion for multi-frame super-resolution of satellite imagery (2020). arxiv.org/abs/2002.06460
9. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 184–199. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10593-2_13
10. Farsiu, S., Robinson, M.D., Elad, M., Milanfar, P.: Fast and robust multiframe super resolution. IEEE TIP 13(10), 1327–1344 (2004)
11. Huang, Y., Li, J., Gao, X., Hu, Y., Lu, W.: Interpretable detail-fidelity attention network for single image super-resolution. IEEE TIP 30, 2325–2339 (2021)
12. Hui, Z., Wang, X., Gao, X.: Fast and accurate single image super-resolution via information distillation network. In: Proceedings IEEE/CVF CVPR, pp. 723–731 (2018)
13. Jo, Y., Kim, S.J.: Practical single-image super-resolution using look-up table. In: Proceedings IEEE/CVF CVPR, pp. 691–700 (2021)
14. Kappeler, A., Yoo, S., Dai, Q., Katsaggelos, A.K.: Video super-resolution with convolutional neural networks. IEEE TCI 2(2), 109–122 (2016)
15. Kawulok, M., Benecki, P., Kostrzewa, D., Skonieczny, L.: Evolving imaging model for super-resolution reconstruction. In: Proceedings GECCO, pp. 284–285 (2018)
16. Kawulok, M., Benecki, P., Nalepa, J., Kostrzewa, D., Skonieczny, Ł.: Towards robust evaluation of super-resolution satellite image reconstruction. In: Nguyen, N.T., Hoang, D.H., Hong, T.-P., Pham, H., Trawiński, B. (eds.) ACIIDS 2018. LNCS (LNAI), vol. 10751, pp. 476–486. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75417-8_45
17. Kawulok, M., Benecki, P., Piechaczek, S., Hrynczenko, K., Kostrzewa, D., Nalepa, J.: Deep learning for multiple-image super-resolution. IEEE GRSL 17(6), 1062–1066 (2020)
18. Kawulok, M., Tarasiewicz, T., Nalepa, J., Tyrna, D., Kostrzewa, D.: Deep learning for multiple-image super-resolution of Sentinel-2 data. In: Proceedings IEEE IGARSS, pp. 3885–3888 (2021)
19. Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings IEEE CVPR, pp. 1646–1654 (2016)
20. Köhler, T., Bätz, M., Naderi, F., Kaup, A., Maier, A., Riess, C.: Toward bridging the simulated-to-real gap: benchmarking super-resolution on real data. IEEE Trans. Pattern Anal. Mach. Intell. 42(11), 2944–2959 (2020)
21. Kostrzewa, D., Skonieczny, Ł., Benecki, P., Kawulok, M.: B4MultiSR: a benchmark for multiple-image super-resolution reconstruction. In: Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D. (eds.) BDAS 2018. CCIS, vol. 928, pp. 361–375. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99987-6_28
22. Lai, W., Huang, J., Ahuja, N., Yang, M.: Fast and accurate image super-resolution with deep Laplacian pyramid networks. IEEE TPAMI 41(11), 2599–2613 (2019)
23. Ledig, C., Theis, L., Huszár, F., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings IEEE CVPR, pp. 105–114 (2017)
24. Liu, D., Wang, Z., Wen, B., et al.: Robust single image super-resolution via deep networks with sparse prior. IEEE TIP 25(7), 3194–3207 (2016)
25. Märtens, M., Izzo, D., Krzic, A., Cox, D.: Super-resolution of PROBA-V images using convolutional neural networks. Astrodynamics 3(4), 387–402 (2019)
26. Nasrollahi, K., Moeslund, T.B.: Super-resolution: a comprehensive survey. Mach. Vision Appl. 25(6), 1423–1468 (2014). https://doi.org/10.1007/s00138-014-0623-4
27. Nguyen, N.L., Anger, J., Davy, A., Arias, P., Facciolo, G.: Self-supervised multi-image super-resolution for push-frame satellite images. In: Proceedings IEEE/CVF CVPR, pp. 1121–1131 (2021)
28. Rifat Arefin, M., et al.: Multi-image super-resolution for remote sensing using deep recurrent networks. In: Proceedings IEEE CVPR Workshops, pp. 206–207 (2020)
29. Salvetti, F., Mazzia, V., Khaliq, A., Chiaberge, M.: Multi-image super resolution of remotely sensed images using residual attention deep neural networks. Remote Sens. 12(14), 2207 (2020)
30. Tao, Y., Muller, J.P.: Super-resolution restoration of spaceborne ultra-high-resolution images using the UCL OpTiGAN system. Remote Sens. 13(12), 2269 (2021)
31. Tarasiewicz, T., Nalepa, J., Kawulok, M.: A graph neural network for multiple-image super-resolution. In: Proceedings IEEE ICIP, pp. 1824–1828 (2021)
32. Valsesia, D., Magli, E.: Permutation invariance and uncertainty in multitemporal image super-resolution. IEEE TGRS 60, 1–12 (2021)
33. Wang, X., et al.: ESRGAN: enhanced super-resolution generative adversarial networks. In: Proceedings ECCV Workshops (2018)
34. Wang, Z., Chen, J., Hoi, S.C.H.: Deep learning for image super-resolution: a survey. IEEE TPAMI 43(10), 3365–3387 (2021)
35. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP 13(4), 600–612 (2004)
36. Wei, P., et al.: Component divide-and-conquer for real-world image super-resolution. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 101–117. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58598-3_7
37. Yang, W., Zhang, X., Tian, Y., Wang, W., Xue, J., Liao, Q.: Deep learning for single image super-resolution: a brief review. IEEE Trans. Multimed. 21(12), 3106–3121 (2019)
38. Yang, W., Feng, J., Xie, G., Liu, J., Guo, Z., Yan, S.: Video super-resolution based on spatial-temporal recurrent residual networks. Comput. Vis. Image Underst. 168, 79–92 (2018)
39. Yue, L., Shen, H., Li, J., Yuan, Q., Zhang, H., Zhang, L.: Image super-resolution: the techniques, applications, and future. Signal Process. 128, 389–408 (2016)

Single-Stage Real-Time Face Mask Detection

Linh Phung-Khanh¹, Bogdan Trawiński², Vi Le-Thi-Tuong¹, Anh Pham-Hoang-Nam¹, and Nga Ly-Tu¹

¹ School of Computer Science and Engineering, International University, Ho Chi Minh City, Vietnam
[email protected], [email protected]
² Department of Applied Informatics, Wroclaw University of Science and Technology, Wroclaw, Poland
[email protected]

Abstract. With the battle against COVID-19 entering a more intense stage against the new Omicron variant, the study of face mask detection technologies has become highly regarded in the research community. While many works have been published on this matter, we noticed three research gaps that our contributions could help fill. Firstly, despite the introduction of various mask detectors over the last two years, most of them were constructed following the two-stage approach and are inappropriate for usage in real-time applications. The second gap is that the currently available datasets cannot efficiently support the detectors in identifying correct, incorrect, and no mask-wearing without the need for data pre-processing. The third and final gap concerns the costly expenses required when detector models are embedded into microcomputers such as Arduino and Raspberry Pi. In this paper, we first propose a modified YOLO-based model explicitly designed to solve the real-time face mask detection problem; during the process, we have updated the collected datasets and will also make them publicly available so that other similar experiments can benefit from them; lastly, the proposed model is implemented in our custom web application for real-time face mask detection. The resulting model was shown to exceed its baseline on the revised dataset, and its performance when applied to the application was satisfactory, with insignificant inference time. Code available at: https://bitbucket.org/indigoYoshimaru/facemask-web

Keywords: Face mask detection · Covid-19 · Single-stage · Real-time · Face mask dataset · Deep learning · YOLO · Web application

1 Introduction and Related Works

before the current Omicron variant [1] continued enkindling more fears and suffering as fuel to the raging flames of COVID-19 [2]. As it is high time we followed the suggested method of wearing masks correctly, which requires the nose, mouth, and chin to be entirely covered, as recommended by the World Health Organization (WHO) [3] and proven valid [4,5], face mask detection is gradually gaining more interest as the deep learning research community hurries to create effective face mask detection datasets, models, and other implementations for public usage.

This research is funded by International University, VNU-HCM under grant number SV2020-IT-03. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 343–355, 2022. https://doi.org/10.1007/978-3-031-21967-2_28

Two-Stage Detectors. Most face mask detectors adopt the common pattern of two-stage detection, where a Region of Interest (RoI), particularly a face, is first localized by an available face detection model and then classified into pre-defined categories [6–8]. For instance, Nagrath et al. introduced the usage of OpenCV's single shot multibox detector (SSD) [9] as the face detection model and MobileNetV2 for the classification task [6]. Another approach, by Chowdary et al., emphasized training for the classification of masked and unmasked faces by applying transfer learning to the InceptionV3 network, which resulted in significant accuracy during the testing phase [7]. Sethi et al. proposed a combined deep learning architecture that could asynchronously perform two-stage face mask detection and person identification [8]. Although multiple-stage detectors are often argued to be more accurate than single-stage ones [10,11], we have noted that this practice can obscure the results of the entire model because of the focus on accuracy in the classification stage. In other words, while the quality of classification for the proposed regions was thoroughly discussed, little was elaborated upon how accurately the models performed during the detection stage, specifically for faces with masks. Hence, the usage of pre-trained detectors for real-time mask detection might provide intractable results.
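The two-stage detect-then-classify pattern can be summarized with a short sketch; `face_detector` and `mask_classifier` are hypothetical callables standing in for, e.g., an SSD face detector and a MobileNetV2 classifier, not any specific library's API:

```python
def two_stage_mask_detection(image, face_detector, mask_classifier):
    # Stage 1: localize candidate faces (RoIs) with a face detection model;
    # Stage 2: classify each cropped RoI (e.g., mask / no-mask / incorrect).
    detections = []
    for (x, y, w, h) in face_detector(image):
        crop = image[y:y + h, x:x + w]
        detections.append(((x, y, w, h), mask_classifier(crop)))
    return detections
```

The sketch makes the paper's concern concrete: end-to-end accuracy depends on both stages, so reporting only the classifier's accuracy over already-localized crops hides detection failures on masked faces.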
Furthermore, given the trade-off in inference time of two-stage detectors and the increase in accuracy of single-stage detectors, especially of the "You Only Look Once" (YOLO) family from YOLOv3 to YOLOX [12–15], it is reasonable for us to follow the single-stage approach.

Single-Stage Detectors. To begin with, the YOLOv3 model introduced by Redmon et al. in 2018 updated the previous backbone with 53 convolution layers in the Darknet backbone (hence the name Darknet-53); it was the first work to adapt the residual architecture to YOLO and to use 3 scaled outputs [12]. Later, the YOLOv4 architecture, built on the Darknet framework, presented the novel CSPDarknet-53 and significant improvements through the "bag of specials" and "bag of freebies" [13]. For YOLOv5, its implementation using PyTorch and TorchHub enabled much simpler integration into other applications, especially for the web [14]. Along with this came the introduction of an algorithm called AutoAnchor that can automatically guess the anchors to be used in the models. Ge et al. proposed a model of equal importance called YOLOX in 2021, which demonstrated the ability of a YOLO model to detect without anchor boxes [15]. However, to the best of our knowledge, the number of currently obtainable single-stage face mask detectors is still surprisingly limited. One example of this type of detector comes from Jiang et al., who proposed a contextual attention detection head and a cross-class object removal algorithm in a model named RetinaFaceMask [16]. Its performance was measured in pure precision and recall, which could reach as high as 91.9% and 96.3% for face and 93.4% and 94.5% for mask, respectively. Research by Jiang et al. proposed a model named SE-YOLOv3 that incorporated a Squeeze-and-Excitation module into Darknet53 and adopted the Generalized Intersection over Union loss to achieve state-of-the-art results of 96.2% and 99.5% AP@50 for their smallest and largest scales, respectively [17].

Dataset on Face Mask Detection Problem. In recent years, data-centric approaches have arisen as a trend to overcome the saturated circumstances brought about by the abundant introduction of deep learning models. Despite this, several challenges are still present in the processing of real-life datasets, such as a possible lack of data sources, an imbalanced dataset, or an abnormal distribution among classes. Particularly for face mask detection, most of the mentioned face mask detectors only addressed the case in which a person is wearing a mask or not. Nonetheless, incorrectly worn masks should also be considered for two reasons: (1) it is crucial to remind the populace to wear facial masks correctly, and (2) real-time detection might have difficulties with situations where the masks do not cover the wearers' essential parts or the detection is unstable in particular setups. Certain datasets publicized over the last two years, like MaskedFace-Net proposed by Cabani et al., consist of both correctly and incorrectly worn mask categories [18]. Unfortunately, that dataset is more suitable for classification tasks than for object detection. Meanwhile, Jiang et al. published a dataset called Properly Wearing Masked Face Detection (PWMFD) that supports this aspect well with the same three classes [17]. Nevertheless, it turned out to be rather imbalanced.

Face Mask Model Real-Life Usage.
Granting that many face mask detection models have been published, their purposes are rather straightforward despite the varying circumstances. An adaptation of the original YOLOv4 was performed by Susanto et al. to build a face mask detection device installed on a school campus [19]. The set of equipment at the station includes one digital camera that produces input images of size 1920 × 1080, one computer with an Nvidia GTX 1060 GPU for model inference, and a speaker to announce whether wearers on campus are compliant. With the trade-off of a relatively large input size, the model achieved only 12 FPS on average. Jiang et al. implemented their model as part of a control system that opens the entry when a visitor is wearing a mask correctly and is within the normal temperature range [17]. Finally, in an interesting use case, a person-based face mask detector was embedded into a Raspberry Pi that informs local enforcers whenever a person is detected without a mask or violates social distancing rules [20]. All of these cases have embedded the model into stations or electronic devices, which can amount to huge costs when a large number of stations or set-up places is required.

Proposals and Paper Structure. Given the literature gaps, we attempt to address these problems with a model based on the single-stage architecture, trained on a three-class dataset, and implemented to detect over the network. The specific proposals of this paper are as follows:

1. Primarily, our research introduces a modification to the YOLO architecture to create a YOLO model with 4 scaled outputs and Ghost convolution modules that achieves a 1.3% higher AP@50 score than the YOLOv5 baseline. To clarify this aspect, we introduce a modified model that particularly solves our face mask detection problem, rather than a generic version of the YOLO model for object detection problems.
2. A data-centric approach is also pursued in our work by updating a current dataset and creating a new one.
3. Finally, a face mask detection web application is implemented to overcome the aforementioned hardware issues.

The rest of this paper is divided into 4 sections. To start off, our model architecture is depicted in Sect. 2. Then, Sect. 3 covers the step-by-step implementation: how the datasets were handled; how the training, validation, and testing processes were operated; and how the web application was configured. The achieved results and related discussions are detailed in Sect. 4, before we conclude the paper with an outlook on future works in Sect. 5.

2 Proposed Model

2.1 Choosing Anchor Boxes for Model

Unlike many other objects, human faces do not diverge much in aspect ratio, i.e., most of them could be represented by squares. However, analysis of the dataset indicated that various ratios of width to height can be found in human face data. The introduced YOLO architectures normally use 9 anchor boxes as the standard to successfully localize candidate objects, usually predefined with certain widths and heights. Because the choice of the number of anchor boxes and their sizes may affect the prediction results, we applied the K-means clustering algorithm to divide the widths and heights of faces from the dataset into 12 clusters, representing the 12 anchor boxes used by our model instead of the default above. The coordinates of the clusters' centroids are then used as the anchor boxes' widths and heights. Figure 1 visualizes the 12 clusters and their centroids obtained with this method.
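The anchor-selection step can be sketched as follows. This is a plain Lloyd's K-means on (width, height) pairs, as described above; the synthetic data and the Euclidean distance are assumptions (anchor clustering is also often done with an IoU-based distance), not the authors' exact procedure:

```python
import numpy as np

def kmeans_anchors(wh, k=12, iters=50, seed=0):
    # wh: (N, 2) array of face-box (width, height) pairs, e.g. after scaling
    # annotations to a 640x640 input. The final centroids serve as the k
    # anchor-box sizes.
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every box to its nearest centroid, then recompute means
        d = np.linalg.norm(wh[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = wh[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids[np.argsort(centroids.prod(axis=1))]   # small -> large
```

Sorting the centroids by area makes it easy to assign the smallest anchors to the finest detection scale and the largest anchors to the coarsest one.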

2.2 Architecture

Our model’s backbone architecture consists of a sequence of Ghost Convolution modules [21] and Cross Stage Partial (CSP) [22] blocks of Ghost Bottleneck module. Ghost Convolution is applied using linear, cheap operations to create a

Single-Stage Real-Time Face Mask Detection

347

Fig. 1. The 12 centroids and clusters from K-Means clustering the faces in dataset after scaling to image size 640 × 640

more enriched result with less computation complexity. A Ghost block consists of 2 convolution blocks, each of them a chain formed by 2D convolutional layers, 2D batch normalization layer, and a Sigmoid-weighted Linear units (SiLU) activation layer. While common convolutional layers receive the entire feature maps then perform convolution on all of the received map, Ghost module performs convolution to the feature map X to produce a feature map y1 that is 2 times smaller in terms of channel size than the wanted channel. Afterwards, y1 is fed to a series of linear, cheap operator by convolving y1 with its own set of filters to deliver y2 . By concatenating y1 and y2 , the result of the Ghost module owns the same output channel as wanted. The feature maps created by Ghost module will then serve as the input to the CSP Ghost module. The output of the Ghost Convolution can be summarized as the concatenation of 2 convolution block by filters: (1) Y = (y1 , y2 ) = (conv(X), Φ(y1 )) in which X is the input block, Y is the output block, conv() represent normal convolutional layer while Φ represents the cheap operation function, and y1 , y2 are the output of each function, respectively. The CSP Network, introduced by Wang et al. [22], promises to enhance the learning capability and to reduce the computation of residual blocks. The network divides the first block of the module into two smaller blocks, then creates skip connection to one block and normal connection to the other one. CSP is applied to the entire Ghost Bottleneck module as recommended by Wang et al. to reduce the computational cost of the bottlenecks. Following the stride 2 Ghost Bottleneck of the original paper, a down-sampling layer and a depth-wise convolution is incorporated between two Ghost modules in


L. Phung-Khanh et al.

the module’s shortcut path. At the end of the backbone, a spatial pyramid pooling (SPP) layer is implemented. Adhering to YOLOv4, the model’s head maintains a similar head architecture but replaces the original convolution layers with a set of CSPGhost and Ghost convolution blocks concatenated with corresponding blocks from either the backbone or the head. As stated earlier, we added an extra detection layer to the YOLO head to better capture the variety of anchor-box sizes. The visualization of the model is shown in Fig. 2. The implementation of the proposed model follows that of YOLOv5 by Ultralytics [14].
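The Ghost-module concatenation of Eq. (1) and the CSP split described above boil down to simple channel bookkeeping. The following pure-Python sketch illustrates both; note that `primary_conv` and `cheap_op` are placeholder per-channel transforms standing in for real 2D convolutions, not the model's actual layers:

```python
def primary_conv(x, out_channels):
    """Stand-in for conv() in Eq. (1): mix the input channels into
    out_channels maps (here: each output is the element-wise mean)."""
    mixed = [sum(vals) / len(vals) for vals in zip(*x)]
    return [list(mixed) for _ in range(out_channels)]

def cheap_op(y1):
    """Stand-in for the cheap operation Phi: one filter per channel
    (here: scaling by 0.5, mimicking a depth-wise transform)."""
    return [[v * 0.5 for v in ch] for ch in y1]

def ghost_module(x, desired_channels):
    """Y = concat(y1, y2), with y1 = conv(X) at half the desired width."""
    y1 = primary_conv(x, desired_channels // 2)
    y2 = cheap_op(y1)
    return y1 + y2  # channel-wise concatenation

def csp_split(x, residual_fn):
    """CSP idea: route half the channels through the residual computation
    and the other half through a skip connection, then concatenate."""
    half = len(x) // 2
    return residual_fn(x[:half]) + x[half:]

x = [[1.0, 2.0], [3.0, 4.0]]             # 2 input "channels"
y = ghost_module(x, desired_channels=8)
assert len(y) == 8                        # output owns the desired channel count
assert len(csp_split(y, cheap_op)) == 8   # CSP preserves the channel count
```

The sketch makes the cost argument visible: only half of the output channels come from the (expensive) primary convolution, while the other half are derived by the cheap per-channel operation.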


Fig. 2. Proposed model architecture. Purple, yellow, blue, and red blocks represent the output shapes of the Ghost, CSPGhost, Upsampling, and SPP modules, respectively. P1-P4 indicate our detection layers. (Color figure online)

3 Implementation

3.1 Datasets

Update of PWMFD Dataset. The Properly Wearing Masked Face Detection Dataset (PWMFD for short) was proposed by Jiang et al. [17] in early 2021. It contains 9205 images of proper quality in total, of which 80.2% are used as the training set and the rest for validation. Nevertheless, PWMFD suffers from imbalance between its classes. In particular, the training set contains only 320 incorrectly-worn masked faces out of a total of 16702 faces, accounting for only 1.9% of the entire training set. We hence propose a less imbalanced PWMFD, built as follows. Initially, we crawled 240 images of incorrectly-worn masked faces from multiple sources, checked to ensure that they do not overlap with the original PWMFD and cover more diverse face sizes. We then performed data augmentation on them, applying only four techniques: Gaussian noise, rotation, channel enhancement, and CutOut, to give a total of 1200 images. Other methods such


as blurring, scaling, or shearing are unsuitable for human faces and give unrecognizable results. Since most images already include faces of the required classes, auto-labelling was carried out using the RetinaFace model to detect all available faces. Afterwards, we manually reviewed the results to guarantee that the annotations were correct. These newly crawled and processed images were then added to the training set. The final distribution of the new PWMFD, which includes 8738 images in the training set and 1818 in the validation set, is shown in Fig. 3, depicting a less imbalanced dataset with an increased ratio of 9.4% of incorrectly-worn masked faces in the training set. The distribution of the validation set, however, is preserved so as to evaluate the influence of a dataset’s balance on the YOLO-family models.

FMITA. After experiencing the laborious process of crawling and labeling pictures, we hypothesized that existing datasets could be leveraged to achieve the same outcome with less manual effort. This led to a synthetic dataset in which faces retrieved from two resources, the Flickr-Faces-HQ Dataset [23] and the MaskedFace-Net Dataset [18], are randomly pasted onto backgrounds with empty spaces selected from the Internet, such as empty parks, roads, and fields. Each image generated this way contains about 4 to 6 faces with arbitrary sizes and positions within the boundaries of the frame. We accumulated a total of 8013 images for the dataset, with 36005 different faces almost evenly distributed over our 3 desired classes (12003, 12001, and 12001 faces, in the class order of Fig. 3). Due to this peculiar background story and the playful nature of the resulting images, we have named this dataset “Face Masks in the Air”, or FMITA for short.

3.2 Training and Validation Processes

Training Details. The models are all trained under equal conditions on Colab machines equipped with an NVIDIA Tesla P100-PCIE (16 GB) GPU. The configurations

Fig. 3. The new Properly Wearing Masked Face Detection Dataset includes 7574 with-mask, 9993 no-mask, and 1828 incorrectly-masked faces, respectively, in the training set. The validation set contains 991, 791, and 46 faces in the same class order.


are mostly similar when training the models on both the original and the updated PWMFD dataset; they differ only in minor details depending on the version. For YOLOv3 and YOLOv4, following the instructions of the creator [13], we use a maximum of 6000 batches for all 3 classes, resulting in a specification with 95 epochs, a batch size of 64, and an image size of 416 × 416. YOLOv5 and our proposed model were trained with an image size of 640 × 640 and a batch size of 16 over 100 and 200 epochs, respectively.

Validation and Testing Details. The validation phase is performed on Google Colab, using a Tesla K80 GPU with a batch size of 1 and, for all considered models, the same input image size as used in training. In addition, all models trained on either the original or the updated PWMFD are validated on the same validation set, while models trained on the FMITA dataset are validated on a different validation set that has the same distribution as their training set. After the validation phase, the models trained with FMITA are also tested on PWMFD’s validation set. Finally, our model is tested on a laptop with an Intel Core i7-8750H CPU and a GTX 1050 Ti GPU.

3.3 Web Application Implementation

To show the capability of our model in real-life situations, we implemented a web application built with Flask and Socket.IO that uses the weights of the proposed model to perform detection on streamed video. Our application workflow is depicted in the activity diagram in Fig. 4.

Fig. 4. Activity diagram of face mask detection web application

The simplest version of our application is assumed to have one server and multiple clients. In Fig. 4, clients and servers are seen as objects. Using the trained model mentioned in the previous sections with YOLOv5 support in TorchHub


configuration, our server initializes and validates prerequisites, such as checking whether the video source is linked and the model weights are loaded. While the server receives the video stream from cameras and calls the face mask detection model for inference, the clients visualize the results from the server and issue warnings when the social distancing rules are not followed, that is, when more than one person is detected and at least one of them wears a mask incorrectly or no mask at all. The server receives a direct video stream from the machine’s webcam or from an IP camera connected to the same network; in our case, we run the application on a laptop, and the video stream comes either from its own webcam or from an IP camera. The video frames are processed by OpenCV and passed to the model for inference. This process returns result frames that highlight all detected faces and show their mask-wearing statuses, together with information such as the total number of faces and the numbers of faces without masks, with correctly-worn masks, and with incorrectly-worn masks. The result frame is wrapped inside a response object, which is a Flask representation of an outgoing WSGI HTTP response with body, status, and headers. The frame information, on the other hand, is converted into a JSON string and sent to the frontend via SocketIO.

On the client side, the results and inferred frames are shown on a simple web page. Whenever a video frame is inferred in the backend, the frontend receives two different types of messages: a JSON message emitted by SocketIO, and the inferred video frame sent with a Flask built-in function. Both the inferred video frame and its information are displayed on-screen. As discussed, the validity of mask wearing for all persons detected in the frame is shown, and the warning message is issued accordingly.
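The per-frame payload and the warning rule described above can be sketched as follows; the field and label names here are illustrative, not the application's actual schema:

```python
def should_warn(labels):
    """Warn when more than one person is detected and at least one face
    is unmasked or incorrectly masked (the rule described above)."""
    bad = {"no_mask", "incorrect_mask"}
    return len(labels) > 1 and any(lbl in bad for lbl in labels)

def frame_summary(labels):
    """JSON-ready summary of one inferred frame, like the string the
    server sends to the frontend via SocketIO."""
    return {
        "total_faces": len(labels),
        "mask": labels.count("mask"),
        "no_mask": labels.count("no_mask"),
        "incorrect_mask": labels.count("incorrect_mask"),
        "warning": should_warn(labels),
    }

info = frame_summary(["mask", "no_mask", "mask"])
assert info["warning"] is True                          # two+ people, one unmasked
assert frame_summary(["no_mask"])["warning"] is False   # a single person never warns
```

In the real application this dictionary would be serialized with `json.dumps` and emitted to the clients alongside the annotated frame.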

4 Results

4.1 Evaluation Metrics

Pursuing the standard evaluation of object detection models, average precision at different Intersection over Union (IoU) thresholds was used to evaluate the performance of the face mask detectors on the validation set. Precision is a well-known metric quantifying the proportion of accurate positive predictions among all positive predictions, while recall represents the proportion of accurate positive predictions among the positive ground truths. Precision and recall are formulated in Eq. 2 and Eq. 3, respectively:

P = TP / (TP + FP)   (2);   R = TP / (TP + FN)   (3)

where TP, TN, FP, FN are the numbers of true-positive, true-negative, false-positive, and false-negative classified cases, respectively. In computer vision, particularly object detection, IoU, i.e., the Jaccard index between predicted and ground-truth bounding boxes, is applied by calculating the area of intersection of the


two boxes over the area of their union. The IoU formula is given in Eq. 4, in which A denotes box area:

IoU(box1, box2) = A_intersect(box1, box2) / (A_box1 + A_box2 − A_intersect)   (4)
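Equations (2)-(4) translate directly into code; a minimal sketch using axis-aligned boxes given as (x1, y1, x2, y2) corners:

```python
def precision(tp, fp):
    """Eq. (2): accurate positive predictions over all positive predictions."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (3): accurate positive predictions over positive ground truths."""
    return tp / (tp + fn)

def iou(box1, box2):
    """Eq. (4): intersection area over union area of two boxes."""
    ix = max(0, min(box1[2], box2[2]) - max(box1[0], box2[0]))
    iy = max(0, min(box1[3], box2[3]) - max(box1[1], box2[1]))
    inter = ix * iy
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

assert precision(8, 2) == 0.8
assert recall(8, 4) == 8 / 12
assert abs(iou((0, 0, 2, 2), (1, 1, 3, 3)) - 1 / 7) < 1e-12
```

The `max(0, ...)` clamps ensure a zero intersection for disjoint boxes, so the IoU of non-overlapping predictions is 0.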

With a specific IoU threshold, a P-R curve can be obtained; the average precision at the IoU thresholds of 50% and 75% used in Table 1 and Table 2 is the area under the precision-recall curve, with recall on the horizontal axis. Hence, AP50 and AP75 are both calculated using Eq. 5, and the threshold-averaged AP using Eq. 6:

AP_t = (1 / (|C| · N)) Σ_{c∈C} Σ_{r∈[0,1]} p(r)   (5)

AP = (1 / |T|) Σ_{t∈T} AP_t   (6)

in which C is the set of the given classes, N is the number of interpolation points (usually 101), and p(r) is the precision at a particular recall r. Following the same definition, AP in the tables below represents the mean average precision over the IoU threshold range (50, 95) with a step size of 5; it is calculated by Eq. 6, where T is the set of all thresholds.

4.2 Comparisons and Discussions

The results of the compared models on the two datasets, PWMFD and the updated PWMFD, are shown in Table 1 and Table 2, respectively. For the original PWMFD, YOLOv3 achieves the shortest inference time, an indicator of its speed in real-time applications; however, its mAPs are the lowest of all. As shown in Table 1, our model achieved a relatively high result of 97.5% AP50 and 86.9% AP75 after 100 training epochs. After 200 epochs, it attained the highest results of 97.6% AP50 and 88.4% AP75 while still maintaining a reasonable inference time. Additionally, our model is relatively lightweight compared to YOLOv3 and YOLOv4.

Table 1. Face mask detectors comparison on PWMFD

| Model                  | Size (MB) | Avg. inference time per image (seconds) | AP50  | AP75  | AP    |
| YOLOv3                 | 234.9     | 0.0946                                  | 0.860 | 0.405 | –     |
| YOLOv4                 | 244.2     | 0.1040                                  | 0.967 | 0.732 | –     |
| YOLOv5s - 100 epochs   | 13.7      | 0.0281                                  | 0.962 | 0.853 | 0.717 |
| YOLOv5s - 200 epochs   | 13.7      | 0.0273                                  | 0.969 | 0.875 | 0.739 |
| Our model - 100 epochs | 13.9      | 0.0318                                  | 0.975 | 0.869 | 0.735 |
| Our model - 200 epochs | 13.9      | 0.0321                                  | 0.976 | 0.884 | 0.749 |

For the updated PWMFD, the results in Table 2 indicate that all models showed a reduction in mAPs. As mentioned in Sect. 3, we added faces of diverse sizes to the training set; as a result, all models developed a tendency to localize smaller faces that were not labeled in the validation set. A further improvement could be updating the validation set to match the distribution of the training set. More experiments were originally carried out on the FMITA dataset; however, the results were undesirable: all considered models achieved 0.999 mAP in evaluation, yet only roughly 0.2 when tested on real images. Although this is explainable, since the two sets' distributions (lighting, the masks used, the borders around the faces, etc.) are too distinct, we decided to omit these results and consider them a stepping stone for later research on more advanced techniques for overcoming this issue.

Table 2. Face mask detectors comparison on new PWMFD

| Model                  | Size (MB) | Avg. inference time per image (seconds) | AP50  | AP75  | AP    |
| YOLOv3                 | 234.9     | 0.0704                                  | 0.878 | 0.338 | –     |
| YOLOv4                 | 244.2     | 0.0753                                  | 0.966 | 0.773 | –     |
| YOLOv5s - 100 epochs   | 13.7      | 0.0277                                  | 0.964 | 0.854 | 0.717 |
| YOLOv5s - 200 epochs   | 13.7      | 0.0278                                  | 0.964 | 0.858 | 0.737 |
| Our model - 100 epochs | 13.7      | 0.0291                                  | 0.958 | 0.845 | 0.713 |
| Our model - 200 epochs | 13.7      | 0.0279                                  | 0.962 | 0.852 | 0.729 |

Fig. 5. Result of model and web app from webcam

We also experimented with YOLOX, but the model crashed after 62 epochs of training over more than 20 h on Google Colab, which is probably why it scored only 0.919 AP50 and 0.681 AP. Since nothing about the setup was unusual and the training phase was too long, we decided not to repeat the process. Ultimately, while performing real-time detection in the application, our model achieved an average of 50 FPS on the testing laptop, with model inference time as low as 0.0027 s. Final results are shown in Fig. 5 and Fig. 6, where the video was randomly picked from [24].
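Throughput and per-frame latency are reciprocals, so the 50 FPS figure corresponds to an end-to-end frame budget of 0.02 s; the 0.0027 s quoted above presumably covers the model inference step alone:

```python
def fps(frame_time_s):
    """Frames per second implied by a per-frame processing time."""
    return 1.0 / frame_time_s

assert fps(0.02) == 50.0           # 50 FPS end-to-end, as measured on the laptop
assert round(fps(0.0027)) == 370   # inference alone would allow roughly 370 FPS
```

The gap between the two numbers is the overhead of frame capture, OpenCV pre-processing, and result rendering in the pipeline.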


Fig. 6. Result of model and web app from video source

5 Conclusion

In response to the new challenges that the COVID-19 pandemic is leaving in its wake, we first proposed a modified YOLO-based face mask detection model using CSP and Ghost modules, able to detect the wearing of masks among many people and determine whether each mask's status is valid with respect to the WHO regulations. Along with this model, we also presented an updated version of the PWMFD dataset and experimented with a synthetic dataset called FMITA. A web application was also built to allow the proposed model to demonstrate these abilities in real time with very low inference time per image. In future work, the foremost goal is to improve the stability of our proposed model; possible extensions such as body-heat tracking or interpersonal distance measuring can then be applied.

References

1. WHO: Omicron. https://www.who.int/news/item/28-11-2021-update-on-omicron
2. WHO: WHO coronavirus (COVID-19) dashboard. https://covid19.who.int/
3. WHO: Coronavirus disease (COVID-19) advice for the public: when and how to use masks. https://www.who.int/emergencies/diseases/novel-coronavirus-2019/advice-for-public/when-and-how-to-use-masks
4. Howard, J., et al.: An evidence review of face masks against COVID-19. Proc. Natl. Acad. Sci. 118(4), e2014564118 (2021)
5. Ueki, H., et al.: Effectiveness of face masks in preventing airborne transmission of SARS-CoV-2. mSphere 5(5), e00637-20 (2020)
6. Nagrath, P., Jain, R., Madan, A., Arora, R., Kataria, P., Hemanth, J.: SSDMNV2: a real-time DNN-based face mask detection system using single shot multibox detector and MobileNetV2. Sustain. Urban Areas 66, 102692 (2021)


7. Jignesh Chowdary, G., Punn, N.S., Sonbhadra, S.K., Agarwal, S.: Face mask detection using transfer learning of InceptionV3. In: Bellatreche, L., Goyal, V., Fujita, H., Mondal, A., Reddy, P.K. (eds.) BDA 2020. LNCS, vol. 12581, pp. 81–90. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66665-1_6
8. Sethi, S., Kathuria, M., Kaushik, T.: Face mask detection using deep learning: an approach to reduce risk of coronavirus spread. J. Biomed. Inform. 120, 103848 (2021)
9. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
10. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
11. Carranza-García, M., Torres-Mateo, J., Lara-Benítez, P., García-Gutiérrez, J.: On the performance of one-stage and two-stage object detectors in autonomous vehicles using camera data. Remote Sens. 13(1), 89 (2021)
12. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
13. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
14. Jocher, G., et al.: ultralytics/yolov5: v6.0 - YOLOv5n 'Nano' models, Roboflow integration, TensorFlow export, OpenCV DNN support, October 2021
15. Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: YOLOX: exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430 (2021)
16. Jiang, M., Fan, X., Yan, H.: RetinaMask: a face mask detector. arXiv preprint arXiv:2005.03950 (2020)
17. Jiang, X., Gao, T., Zhu, Z., Zhao, Y.: Real-time face mask detection method based on YOLOv3. Electronics 10(7), 837 (2021)
18. Cabani, A., Hammoudi, K., Benhabiles, H., Melkemi, M.: MaskedFace-Net - a dataset of correctly/incorrectly masked face images in the context of COVID-19. Smart Health 19, 100144 (2021)
19. Susanto, S., Putra, F.A., Analia, R., Suciningtyas, I.K.L.N.: The face mask detection for preventing the spread of COVID-19 at Politeknik Negeri Batam. In: 2020 3rd International Conference on Applied Engineering (ICAE), pp. 1–5 (2020)
20. Yadav, S.: Deep learning based safe social distancing and face mask detection in public areas for COVID-19 safety guidelines adherence. Int. J. Res. Appl. Sci. Eng. Technol. 8(7), 1368–1375 (2020)
21. Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., Xu, C.: GhostNet: more features from cheap operations. CoRR abs/1911.11907 (2019)
22. Wang, C.Y., Liao, H.Y.M., Wu, Y.H., Chen, P.Y., Hsieh, J.W., Yeh, I.H.: CSPNet: a new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 390–391 (2020)
23. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. CoRR abs/1812.04948 (2018)
24. ThanhNienNews: Thanks to the coronavirus, 400 packages of medical masks are sold out after just 30 seconds (2020). https://www.youtube.com/watch?v=MmkTUFy_QTM

User-Generated Content (UGC)/In-The-Wild Video Content Recognition

Mikołaj Leszczuk(B), Lucjan Janowski, Jakub Nawała, and Michał Grega

AGH University of Science and Technology, 30059 Kraków, Poland
[email protected]
http://qoe.agh.edu.pl

Abstract. According to Cisco, we are facing a three-fold increase in IP traffic in five years, from 2017 to 2022. IP video traffic generated by users is largely related to user-generated content (UGC). Although at the beginning of UGC creation this content was often characterised by amateur acquisition conditions and unprofessional processing, the development of widely available knowledge and affordable equipment now allows one to create UGC of a quality practically indistinguishable from professional content. Since some UGC content is indistinguishable from professional content, we are not interested in all UGC content, but only in content whose quality clearly differs from the professional. For this content, we use the term “in the wild” as a concept closely related to the concept of UGC, being its special case. In this paper, we show that it is possible to deliver the new concept of an objective “in-the-wild” video content recognition model. The value of the F-measure in our model is 0.988. The resulting model is trained and tested with the use of video sequence databases containing professional and “in-the-wild” content. These modelling results are obtained when the random forest learning method is used. However, it should be noted that the use of the more explainable decision tree learning method does not cause a significant decrease in the value of the F-measure (0.973).

Keywords: Quality of Experience (QoE) · Quality of Service (QoS) · Metrics · Evaluation · Performance · Computer Vision (CV) · Video Quality Indicators (VQI) · Key Performance Indicators (KPI) · User-Generated Content (UGC) · In-the-wild content

1 Introduction

Cisco claims a three-fold increase in IP traffic over five years, from 2017 to 2022. In 2022, 82% of all IP traffic is IP video traffic, with monthly IP video traffic increasing from 91.3 exabytes in 2017 to 325.4 exabytes in 2022.

Supported by the Polish National Centre for Research and Development (TANGO-IVA/0038/2019-00).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 356–368, 2022. https://doi.org/10.1007/978-3-031-21967-2_29


Furthermore, according to Cisco, the average Internet user generates 84.6 GB of traffic per month in 2022; fixed Internet traffic per user per month thus increases by 194% from 28.8 GB in 2017. Finally, video definition keeps increasing: by 2023, 66% of connected flat-panel TV sets will be 4K [2].

IP video traffic generated by users is largely related to User-Generated Content (UGC). First, we need to better understand what UGC stands for. The literature defines UGC, alternatively known as User-Created Content (UCC), as any form of content, such as images, video, text, and audio, that has been posted by users on online platforms (such as social media platforms, forums, and wikis). However, from our point of view of content quality, we treat UGC as just what users create and upload themselves. There are also other definitions, completely out of step with the context of video quality assessment; according to [1] and [10], UGC is a product that consumers create to distribute information about online products or the companies that market them. UGC is contrasted with Professionally Generated Content (PGC) and Marketer-Generated Content (MGC) [27].

UGC, like any other type of content, in some cases requires a Quality of Experience (QoE) assessment. UGC quality evaluation is a very popular research topic nowadays: the Google Scholar query “user-generated content video quality” returns more than 76,000 results. Consequently, one can say that quite a lot of research is available on the quality of UGC. This is not surprising: due to the specificity of the content, algorithms for UGC quality evaluation are clearly distinct from general quality evaluation. However, before we even start assessing the quality of UGC, a question arises: how do we decide in advance whether the content really is UGC? Relatively little research is available on recognising UGC.
To the best of our knowledge, the literature gives practically no indication of actually working algorithms that distinguish UGC content from general content. The dilemma is made worse by the fact that, in today's world, the fact that a piece of content is UGC no longer determines the conditions of its acquisition or the techniques by which it is processed. Although at the beginning of UGC creation this content was often characterised by amateur acquisition conditions and unprofessional processing, the development of widely available knowledge and affordable equipment now allows one to create UGC of a quality practically indistinguishable from professional content [26]. Since some UGC content is indistinguishable from professional content, we are not interested so much in all UGC content, but only in content whose quality clearly differs from professional quality. For this content, we use the term “in the wild” as a concept closely related to the concept of UGC, being its special case.

Recognition of “in-the-wild” content in relation to professional content is particularly important in the case of content containing news. Although news is generally professional content, it can contain “in-the-wild” content with a varying degree of “saturation”. This is illustrated in Fig. 1, which contains six examples of combining professional content with “in-the-wild” content in the news1. As one can see, it is possible to have shots: only with

1 Video source: https://youtu.be/8GKKdnjoeH0, https://youtu.be/psKb_bSFUsU, and https://youtu.be/lVuk2KXBlL8.


M. Leszczuk et al.

professional content (Subfig. 1a), with professional content but a relatively small display presenting the “in-the-wild” content (Subfig. 1b), with professional content but a relatively large display presenting the “in-the-wild” content (Subfig. 1c), with “in-the-wild” content digitally mixed with a large area of professional content (Subfig. 1d), with “in-the-wild” content digitally mixed with a small area of professional content (Subfig. 1e), and finally exclusively with the “in-the-wild” content (Subfig. 1f).

Fig. 1. Professional content vs. “in-the-wild” content.

In this paper, we show that it is possible to deliver the new concept of an objective “in-the-wild” video content recognition model. The achieved model accuracy, measured with the F-measure, is 0.988. The resulting model is trained and tested with the use of video sequence


databases containing “in-the-wild” and professional content. These modelling results are obtained with the random forest learning method. However, the use of the more explainable decision tree learning method does not cause a significant decrease in prediction accuracy (an F-measure of 0.973).

The remainder of this paper is structured as follows. To unravel the content of the term “in-the-wild content”, we review related work in Sect. 2. Section 3 describes the preparation of a corpus of databases: original video sequences assembled from footage already available. Section 4 describes the construction of the model. The results are reported in Sect. 5. Section 6 concludes this paper and describes plans for further work.
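The F-measure quoted above is the harmonic mean of precision and recall; for reference:

```python
def f_measure(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

assert f_measure(1.0, 1.0) == 1.0
assert abs(f_measure(0.5, 1.0) - 2 / 3) < 1e-12
```

Being a harmonic mean, the score is dominated by the weaker of the two components, so an F-measure of 0.988 implies that both precision and recall of the classifier are high.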

2 Other Work

We start by defining what we mean by the term in-the-wild content. Its most important feature is that it is not professionally curated; in other words, it is captured and processed by amateurs. In-the-wild content is also typically characterised by unstable camera motion, which often results from capturing the content on a handheld device (e.g., a smartphone).

To complement our definition, we performed a literature search. This brought about a series of refinements and made us narrow the scope of the definition to the context of Multimedia Quality Assessment (MQA). We start with the definition of the phrase “in the wild” from the Cambridge Advanced Learner's Dictionary & Thesaurus: “under natural conditions, independent of humans.” Although the second part of this definition does not fit the context discussed here, the reference to natural conditions reflects how the term “in the wild” is used in the MQA literature.

In [13], Li et al. note that in-the-wild content is the opposite of synthetically distorted content. As the name suggests, synthetically distorted content is first captured without distortion; selected distortions, usually chosen manually by the experimenters with the intention of simulating real-world impairments, are then inserted artificially. In-the-wild content comes with real-world impairments already present. Since this naturally increases the ecological validity of experiments, there is an ongoing trend in MQA research to use in-the-wild content instead of synthetically distorted content. The authors of [13] also note that since in-the-wild content is already impaired, there is no access to a pristine version of it. Thus, popular and well-established Full-Reference (FR) approaches to MQA cannot be used; only No-Reference (NR) approaches are appropriate.

The authors of [25] also state that in-the-wild content represents real-world content (in contrast to synthetically distorted content). Importantly, they use the terms in-the-wild video and UGC video interchangeably. Furthermore, Ying et al. refer to the class of video data that their data set represents as “real-world in-the-wild UGC video data.”


Taking all this into account, we may conclude that the authors of [25] do not make a distinction between in-the-wild and UGC videos. On the contrary, their paper suggests that UGC videos are representatives of the in-the-wild content category. Among other things, this is supported by the fact that UGC videos are affected by a mixture of complex distortions (e.g., unsteady hands of content creators, processing and editing artefacts, or imperfect camera devices).

Tu et al. [21] do not provide a direct definition of in-the-wild content. Instead, similarly to [25], they pair the concept with UGC videos; in particular, they mention “in-the-wild UGC databases.” However, the authors of [21] juxtapose in-the-wild UGC databases with single-distortion synthetic ones (the latter representing a legacy approach and being less representative of real-world conditions). Tu et al. also point to an important insight: in in-the-wild UGC databases, compression artefacts are not necessarily the dominant factors affecting video quality. This observation further separates in-the-wild UGC databases from legacy ones, which typically focus on compression artefacts (sometimes complementing them with transmission artefacts).

Yi et al. [24] follow the examples of [21] and [25] and mix the concepts of UGC and in-the-wild videos. Similarly to [21] and [25], they point out that UGC videos contain complex real-world distortions. One new insight they bring to light is that in-the-wild UGC videos are affected by non-uniform spatial and temporal distortions. This contrasts with legacy synthetically distorted video databases, where, even if more than one distortion is applied, the distortions are applied uniformly across an image or a video frame.

In our work, in addition to UGC videos, we also take into account so-called eyewitness videos. Since eyewitness videos may not always take the form of UGC, we choose not to classify eyewitness videos as UGC videos. However, eyewitness videos and UGC videos have at least one thing in common: they are both often captured under natural conditions. In other words, they are captured under wild conditions. Thus, we refer to the videos that we take into account in this work as in-the-wild videos. Significantly, this contrasts with other MQA works that usually combine the terms in-the-wild and UGC and call the data they operate on “in-the-wild UGC videos.”

Now, we offer a definition of the in-the-wild video content category. The definition serves to highlight the content that we consider relevant in the context of this work. In-the-wild videos can be best described by saying that they are captured and processed under natural conditions. Specifically, in-the-wild videos satisfy the following requirements.

1. They are captured without using professional equipment, in particular without a professional film production camera.
2. They are not recorded in a studio; in other words, they are captured in an environment without an intentional lighting setup.
3. They are captured without a gimbal or similar image-stabilising equipment. In short, they represent handheld videos.


4. They do not undergo significant post-processing aimed at intentionally improving their quality. Put differently, they do not undergo film-production-grade post-processing.

It should be noted that a review of the literature indicates that the problem of recognising UGC content has already been considered, but not with respect to “in-the-wild” content. In addition, some references are related to UGC content, but not necessarily to multimedia content; for example, [14] is about classifying text. Even the methods of classifying multimedia content found during the literature review are not necessarily based on video data: the model described in [4] uses audio features, while the model presented in [5] uses video but is supported during classification by tags and metadata such as shot length. Another example is the related article (citing the previous articles) [8]. The accuracy results obtained (depending on the modality of the data, the factors taken into account, and the type of modelling) reach values on the order of 0.8 (usually in the range of about 0.7 to about 0.9). However, it is difficult to relate our results to these, as all of the listed articles consider the recognition of UGC content, not necessarily “in-the-wild” content. For example, the cited studies assume by definition that news is professional content; meanwhile, the previous section shows that this is not entirely clear.
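The four requirements above can be expressed as a simple checklist predicate; the boolean field names here are our own illustration, not part of any dataset's metadata:

```python
def is_in_the_wild(meta):
    """True when a video satisfies all four in-the-wild requirements above."""
    return (not meta["professional_camera"]   # req. 1: no professional equipment
            and not meta["studio_lighting"]   # req. 2: no intentional lighting setup
            and not meta["stabilized"]        # req. 3: handheld, no gimbal
            and not meta["film_grade_post"])  # req. 4: no heavy post-processing

smartphone_clip = {"professional_camera": False, "studio_lighting": False,
                   "stabilized": False, "film_grade_post": False}
news_studio_clip = dict(smartphone_clip, professional_camera=True,
                        studio_lighting=True, stabilized=True)

assert is_in_the_wild(smartphone_clip) is True
assert is_in_the_wild(news_studio_clip) is False
```

Of course, such metadata is rarely available in practice, which is exactly why the paper instead learns to recognise in-the-wild content from the video signal itself.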

3 Databases

This section introduces the custom datasets for this study.

3.1 The Full Sets

To make our considerations about in-the-wild content more concrete, we list the databases that the works mentioned above [13,21,24,25] call in-the-wild databases (in brackets, we provide the year each database was created in):

1. CVD-2014 (2014) [17],
2. LIVE-Qualcomm (2016) [3],
3. KoNViD-1k (2017) [6],
4. LIVE-VQC (2018) [20],
5. YouTube-UGC (2019) [22] and
6. LSVQ (2021) [25].

Significantly, CVD-2014 and LIVE-Qualcomm are sometimes referred to as early in-the-wild databases. This is because they take into account only in-capture distortions; in other words, they investigate how the use of different capture devices under varying conditions influences video quality. Although this addresses problems inherent to in-the-wild videos, it does not cover the complete spectrum of possible distortions. For example, this approach does not take into account post-processing or transcoding distortions.


M. Leszczuk et al.

Our further research used three publicly available in-the-wild video databases: (i) CVD-2014, (ii) LIVE-Qualcomm, and (iii) KoNViD-1k. The (v) YouTube-UGC database was rejected from further research: unfortunately, we found that it is "contaminated" with a large amount of UGC content of essentially professional quality. This means that we cannot train models on such content, because no image indicator will notice the difference between professional-looking UGC and professional studio quality. Therefore, we use only databases that contain exclusively in-the-wild recordings. The (vi) LSVQ database was not used because it was published after we started compiling the recordings. The (iv) LIVE-VQC database was not used because the other three databases allowed us to accumulate enough in-the-wild content. The database of video sequences was supplemented with a "counterweight" in the form of professional-quality video sequences; for this purpose, the "NTIA simulated news" database was used [18].

3.2 The Sub Sets

The number of video frames across all databases used is in the hundreds of thousands, but in reality the frames belonging to one shot are quite similar to each other. Therefore, in further analysis, we operate at the level of recognising entire shots. For databases that are not delivered per shot, shots are detected using PySceneDetect (a command-line application and a Python library for detecting shot changes in video sequences and automatically splitting the video into separate clips) with its default parameters. As a result, we obtain 68 shots with professional content and 2169 shots with in-the-wild content.
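The core idea behind content-based shot detection, as used by PySceneDetect's default detector, is to start a new shot whenever the frame-to-frame content change exceeds a threshold. The toy sketch below illustrates that idea on synthetic per-frame features; the function name and threshold are our own illustration, not PySceneDetect's API.

```python
import numpy as np

def split_into_shots(frame_features, threshold=30.0):
    """Group consecutive frames into shots: a new shot starts whenever the
    mean absolute difference between consecutive frame features exceeds
    the threshold (the idea behind content-based shot-change detection)."""
    cuts = [0]
    for i in range(1, len(frame_features)):
        if np.mean(np.abs(frame_features[i] - frame_features[i - 1])) > threshold:
            cuts.append(i)
    cuts.append(len(frame_features))
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]

# Two synthetic "shots": 10 dark frames followed by 10 bright frames.
frames = np.concatenate([np.full((10, 3), 20.0), np.full((10, 3), 200.0)])
print(split_into_shots(frames))  # [(0, 10), (10, 20)]
```

In practice, PySceneDetect operates directly on decoded video frames and can also split the file into per-shot clips.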

3.3 Video Indicators

This section describes the setup of the video indicators: it provides a general overview and details which video indicators are used. As the frames belonging to one shot are quite similar to each other, they have similar indicator values. Consequently, the experiment operates on averaged video shots: it applies a set of video indicators to each shot and outputs a vector of results (one per video indicator). The results are later combined with the ground truth (in-the-wild content versus professional content); together, they constitute the input data for modelling. In total, we use 10 video indicators, all from our AGH Video Quality (VQ) team. Table 1 lists all video indicators, with references to the related publications. The combined execution time of all video indicators is 0.12 s; it is the sum of the execution times of the individual VQIs and comes from [16]. Importantly, execution times were evaluated on a laptop with an Intel Core i7-3537U CPU.
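The per-shot averaging described above can be sketched as follows. The two indicator functions here are placeholders standing in for the AGH VQIs (which are separate published algorithms), introduced only to show how one feature vector per shot is formed.

```python
import numpy as np

# Hypothetical per-frame indicator functions standing in for the ten AGH VQIs
# (Blockiness, Spatial Activity, Blur, ...); each maps a frame to a scalar.
def blockiness(frame):
    return float(np.std(frame[::8]))      # placeholder, not the real VQI

def spatial_activity(frame):
    return float(np.std(frame))           # placeholder, not the real VQI

INDICATORS = [blockiness, spatial_activity]

def shot_feature_vector(shot_frames):
    """Average each indicator over all frames of a shot, yielding one
    feature vector per shot (the level at which the experiment operates)."""
    per_frame = np.array([[vqi(f) for vqi in INDICATORS] for f in shot_frames])
    return per_frame.mean(axis=0)

shot = [np.random.rand(32, 32) for _ in range(5)]   # one 5-frame toy shot
vec = shot_feature_vector(shot)
assert vec.shape == (len(INDICATORS),)
```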


Table 1. The list of video indicators which are used in the experiment.

#   Name
1   Blockiness [19]
2   Spatial Activity (SA) [19]
3   Block Loss [12]
4   Blur [15,19]
5   Temporal Activity (TA) [19]
6   Exposure [11]
7   Contrast
8   Noise [7]
9   Slicing [12]
10  Flickering [19]

4 Modelling

Having both data sets (in-the-wild content and professional content) allows us to construct a model that detects in-the-wild video content. To build the model, we start from the indicator transformation, a typical step in data analysis (see Fig. 2).

Fig. 2. Data preparation and analysis schema.

Our goal is to transform the indicators' distributions to be closer to normal and within the 0–1 range. To achieve the desired distribution we use, where necessary, (1) truncation, changing all values outside a certain range to the margin of the range, (2) a logarithmic scale, applying the function log10(x + 1), and (3) normalisation to 0–1 by the simple formula (x − min(x)) / (max(x) − min(x)). Truncation is used


if some large values make the distribution asymmetric. The logarithmic scale is used if the indicator distribution remains strongly asymmetric even after truncation. To make it easier to repeat our results, Table 2 presents the transformation parameters for each indicator.

Table 2. Normalisation procedure for each indicator.

#   Name                     Truncation   Log   min    max
1   Blockiness               [0.5, 1.25]  No    0.5    1.25
2   Spatial Activity (SA)    [0, 200]     No    0      200
3   Block Loss               None         Yes   0      3.33
4   Blur                     [0, 22]      Yes   0      1.36
5   Temporal Activity (TA)   [0, 75]      No    0      75
6   Exposure                 None         No    9      222
7   Contrast                 None         No    0      104
8   Noise                    None         Yes   0      1.79
9   Slicing                  None         Yes   0.16   4.22
10  Flickering               None         No    0      1
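The three-step transformation (truncation, log scaling, min-max normalisation) can be sketched as a single helper; the function name and keyword arguments are our own, and the example values follow the Blur row of Table 2.

```python
import numpy as np

def normalise_indicator(x, trunc=None, log=False, lo=0.0, hi=1.0):
    """Apply the paper's transformation: optional truncation to a range,
    optional log10(x + 1) scaling, then min-max normalisation to 0-1,
    with parameter values per indicator taken from Table 2."""
    x = np.asarray(x, dtype=float)
    if trunc is not None:          # clamp outliers to the margins of the range
        x = np.clip(x, trunc[0], trunc[1])
    if log:                        # compress strongly asymmetric tails
        x = np.log10(x + 1.0)
    return (x - lo) / (hi - lo)    # min-max scaling towards [0, 1]

# Example: Blur is truncated to [0, 22], log-scaled, then scaled with
# min = 0 and max = 1.36 (log10(22 + 1) is approximately 1.36).
blur = normalise_indicator([0.0, 5.0, 50.0], trunc=(0, 22), log=True, lo=0.0, hi=1.36)
```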

We perform in-the-wild video content recognition by classification into two classes (the video shot is in-the-wild content, or the video shot is professional content). Due to the high class imbalance of professional to in-the-wild content, about 1:32, the in-the-wild class is subsampled to obtain a ratio of 1:5. Consequently, the modelling is carried out on a set of 408 samples, split 68:340 between the classes. The samples are then divided into a training set (80%) and a test set (20%). Further modelling is done with Scikit-learn, a free machine learning library for the Python programming language. Decision tree and random forest algorithms are tested for the detection of in-the-wild video content. Modelling with a random forest turns out to be more effective, while a decision tree gives a more explainable model. The detailed results are reported in Sect. 5.
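The subsampling, split, and model fitting described above can be sketched with Scikit-learn on synthetic stand-in data (the feature distributions below are invented for illustration; only the class sizes and ratios follow the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 68 professional shots vs. 2169 in-the-wild shots,
# each described by 10 normalised video-indicator values.
X_pro = rng.normal(0.3, 0.1, size=(68, 10))
X_wild = rng.normal(0.6, 0.1, size=(2169, 10))

# Subsample the majority class to a ratio of 1:5 (68:340).
idx = rng.choice(len(X_wild), size=68 * 5, replace=False)
X = np.vstack([X_pro, X_wild[idx]])
y = np.array([0] * 68 + [1] * 68 * 5)   # 0 = professional, 1 = in-the-wild

# 80/20 train/test split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(tree.score(X_te, y_te), forest.score(X_te, y_te))
```

In the paper, this split is repeated many times with different random divisions and the results are averaged.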

5 Results

This section presents the results of the development of a model that detects “in-the-wild” video content. We provide the results using standard values used


in pattern recognition, information retrieval, and machine-learning-based classification: mean accuracy on the given test data and labels, precision, recall, and the F-measure (F-score). Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of the total number of relevant instances that were retrieved. Both precision and recall are therefore based on an understanding and measurement of relevance. A measure that combines precision and recall is their harmonic mean, the traditional F-measure or balanced F-score [23]. For both the decision tree and the random forest, we make 5000 attempts to train and test the model, each time randomly splitting the data into training and test sets and reporting the results obtained on the test set. In addition, we also report the results obtained on the entire in-the-wild set (the advantage of this approach is the ability to test on a very large set of data; a disadvantage is that only one type of error can occur). The reported results are always averages. For the decision tree, we obtain the results in Table 3.

Table 3. Decision tree results received for "in-the-wild" content recognition.

                     Accuracy   Precision   Recall   F-measure
Test set             0.956      0.976       0.971    0.973
"In-the-wild" set    0.975      1.000       0.975    0.987
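The four reported metrics can be computed with Scikit-learn as below; the label vectors are toy values for illustration, not the paper's data.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy predictions for the positive ("in-the-wild") class.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # fraction of correct labels
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
print(acc, prec, rec, f1)               # 0.75 0.8 0.8 0.8
```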

We also visualise the decision tree used in the classification task. Naturally, each modelling run, due to a different training set, results in slightly different decision tree parameters, but most of the decision trees look like the one in Fig. 3. As one can see, flickering turns out to be by far the most discriminating indicator (for some trees, it is sufficient on its own); in most cases, however, it must be supported by other indicators. For the random forest, we obtain the results in Table 4.

Table 4. Random forest results received for "in-the-wild" content recognition.

                     Accuracy   Precision   Recall   F-measure
Test set             0.980      0.983       0.994    0.988
"In-the-wild" set    0.994      1.000       0.994    0.997


Fig. 3. Decision tree visualisation.

6 Conclusions and Further Work

In this paper, we show that it is possible to deliver a new objective recognition model for in-the-wild video content. The model achieves an F-measure of 0.988, obtained with the random forest learning method. Notably, using the more explainable decision tree learning method does not cause a significant decrease in prediction accuracy (F-measure of 0.973). The results presented are work in progress. While the current results are highly promising, they still require additional validation, since the training and test datasets are relatively limited (especially for professional content). Therefore, additional selected video sequences from the database of 6000+ professional YouTube news clips, collected under [9], should be used. These video sequences are currently being reviewed, and any in-the-wild shots are being removed from them (and added to the in-the-wild part of the set being prepared). This will allow a more precise validation of the model in the future, as well as its correction. Finally, since we only compare random forest and decision tree implementations, applying other machine learning techniques will be useful for future research.


References

1. Berthon, P., Pitt, L., Kietzmann, J., McCarthy, I.P.: CGIP: managing consumer-generated intellectual property. Calif. Manage. Rev. 57(4), 43–62 (2015)
2. U. Cisco: Cisco annual internet report (2018–2023) white paper. Cisco, San Jose (2020)
3. Ghadiyaram, D., Pan, J., Bovik, A.C., Moorthy, A.K., Panda, P., Yang, K.C.: In-capture mobile video distortions: a study of subjective behavior and objective algorithms. IEEE Trans. Circuits Syst. Video Technol. 28, 2061–2077 (2018). https://doi.org/10.1109/TCSVT.2017.2707479
4. Guo, J., Gurrin, C.: Short user-generated videos classification using accompanied audio categories. In: Proceedings of the 2012 ACM International Workshop on Audio and Multimedia Methods for Large-Scale Video Analysis, pp. 15–20 (2012)
5. Guo, J., Gurrin, C., Lao, S.: Who produced this video, amateur or professional? In: Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, pp. 271–278 (2013)
6. Hosu, V., et al.: The Konstanz natural video database (KoNViD-1k). In: 2017 Ninth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6 (2017)
7. Janowski, L., Papir, Z.: Modeling subjective tests of quality of experience with a generalized linear model. In: 2009 International Workshop on Quality of Multimedia Experience, pp. 35–40, July 2009. https://doi.org/10.1109/QOMEX.2009.5246979
8. Kim, J.H., Seo, Y.S., Yoo, W.Y.: Professional and amateur-produced video classification for copyright protection. In: 2014 International Conference on Information and Communication Technology Convergence (ICTC), pp. 95–96. IEEE (2014)
9. Koźbiał, A., Leszczuk, M.: Collection, analysis and summarization of video content. In: Choroś, K., Kopel, M., Kukla, E., Siemiński, A. (eds.) MISSI 2018. AISC, vol. 833, pp. 405–414. Springer, Cham (2019). https://doi.org/10.1007/978-3-319-98678-4_41
10. Krumm, J., Davies, N., Narayanaswami, C.: User-generated content. IEEE Pervasive Comput. 7(4), 10–11 (2008)
11. Leszczuk, M.: Assessing task-based video quality — a journey from subjective psycho-physical experiments to objective quality models. In: Dziech, A., Czyżewski, A. (eds.) MCSS 2011. CCIS, vol. 149, pp. 91–99. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21512-4_11
12. Leszczuk, M., Hanusiak, M., Farias, M.C.Q., Wyckens, E., Heston, G.: Recent developments in visual quality monitoring by key performance indicators. Multimedia Tools Appl. 75(17), 10745–10767 (2014). https://doi.org/10.1007/s11042-014-2229-2
13. Li, D., Jiang, T., Jiang, M.: Quality assessment of in-the-wild videos. In: Proceedings of the 27th ACM International Conference on Multimedia (MM 2019), pp. 2351–2359 (2019)
14. Marc Egger, A., Schoder, D.: Who are we listening to? Detecting user-generated content (UGC) on the web. ECIS 2015 Completed Research Papers (2015)
15. Mu, M., Romaniak, P., Mauthe, A., Leszczuk, M., Janowski, L., Cerqueira, E.: Framework for the integrated video quality assessment. Multimedia Tools Appl. 61(3), 787–817 (2012). https://doi.org/10.1007/s11042-011-0946-3
16. Nawała, J., Leszczuk, M., Zajdel, M., Baran, R.: Software package for measurement of quality indicators working in no-reference model. Multimedia Tools Appl., December 2016. https://doi.org/10.1007/s11042-016-4195-3


17. Nuutinen, M., Virtanen, T., Vaahteranoksa, M., Vuori, T., Oittinen, P., Häkkinen, J.: CVD 2014 – a database for evaluating no-reference video quality assessment algorithms. IEEE Trans. Image Process. 25, 3073–3086 (2016). https://doi.org/10.1109/TIP.2016.2562513
18. Pinson, M.H., Boyd, K.S., Hooker, J., Muntean, K.: How to choose video sequences for video quality assessment. In: Proceedings of the Seventh International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM-2013), pp. 79–85 (2013)
19. Romaniak, P., Janowski, L., Leszczuk, M., Papir, Z.: Perceptual quality assessment for H.264/AVC compression. In: 2012 IEEE Consumer Communications and Networking Conference (CCNC), pp. 597–602, January 2012. https://doi.org/10.1109/CCNC.2012.6181021
20. Sinno, Z., Bovik, A.C.: Large-scale study of perceptual video quality. IEEE Trans. Image Process. 28, 612–627 (2019). https://doi.org/10.1109/TIP.2018.2869673
21. Tu, Z., Chen, C.J., Wang, Y., Birkbeck, N., Adsumilli, B., Bovik, A.C.: Video quality assessment of user generated content: a benchmark study and a new model. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 1409–1413. IEEE, September 2021. https://doi.org/10.1109/ICIP42928.2021.9506189
22. Wang, Y., Inguva, S., Adsumilli, B.: YouTube UGC dataset for video compression research. In: 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), pp. 1–5. IEEE, September 2019. https://doi.org/10.1109/MMSP.2019.8901772
23. Wikipedia Contributors: Precision and recall – Wikipedia, the free encyclopedia (2020). https://en.wikipedia.org/w/index.php?title=Precision_and_recall&oldid=965503278. Accessed 6 July 2020
24. Yi, F., Chen, M., Sun, W., Min, X., Tian, Y., Zhai, G.: Attention based network for no-reference UGC video quality assessment. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 1414–1418. IEEE, September 2021. https://doi.org/10.1109/ICIP42928.2021.9506420
25. Ying, Z., Mandal, M., Ghadiyaram, D., Bovik, A.: Patch-VQ: 'patching up' the video quality problem. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14019–14029, June 2021. http://arxiv.org/abs/2011.13544
26. Zhang, M.: Swiss TV station replaces cameras with iPhones and selfie sticks. Downloaded on 1 October 2015 (2015)
27. Zhao, K., Zhang, P., Lee, H.M.: Understanding the impacts of user- and marketer-generated content on free digital content consumption. Decis. Support Syst. 154, 113684 (2022)

A Research for Segmentation of Brain Tumors Based on GAN Model

Linh Khanh Phung(1,2), Sinh Van Nguyen(1,2)(B), Tan Duy Le(1,2), and Marcin Maleszka(3)

(1) School of Computer Science and Engineering, International University, Ho Chi Minh City, Vietnam
[email protected]
(2) Vietnam National University, Ho Chi Minh City, Vietnam
(3) Department of Applied Informatics, Wroclaw University of Science and Technology, Wroclaw, Poland

Abstract. Analysis of medical images is a useful method for supporting doctors in medical diagnosis. The development of deep learning models is essential, and such models are widely applied in image processing and computer vision. Applying machine learning and artificial intelligence to brain tumor diagnosis brings accuracy and efficiency to the medical treatment field. In this research paper, we present a method for determining and segmenting the brain tumor region in a medical image dataset based on a 3D Generative Adversarial Network (3D-GAN) model. We first explore the state-of-the-art methods and recent approaches in this field. Our proposed 3D-GAN model consists of three steps: (i) pre-processing the data, (ii) building an architecture of a multi-scaled GAN model, and (iii) modifying the loss function. Our final contribution is an application that visualises 3D models representing magnetic resonance brain images, incorporating the chosen models to determine exactly the region containing brain tumors. Compared to existing methods, our proposed model obtains better performance and accuracy.

Keywords: Segmentation · Medical image processing · Generative adversarial network · Brain tumor

1 Introduction

Methods for clinical diagnosis and treatment based on medical image processing are increasingly efficient. In particular, in the context of the rapid development of information technology (IT) in recent years, new techniques for analysing images and video from datasets such as MRI, CT scans, endoscopic surgery, and ultrasonography have proved advantageous in both diagnosis and treatment. They help decrease treatment time and risk, improve recovery after surgery, and reduce treatment costs. A brain tumor is a very dangerous disease for anyone. Determining the exact tumor region in the brain is an important step towards proposing a suitable method for

c The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 369–381, 2022. https://doi.org/10.1007/978-3-031-21967-2_30


L. Khanh Phung et al.

processing and treatment. Techniques from image processing can be applied to analyse, recognise, and classify tumor regions. Deep learning approaches have been used to analyse medical image datasets in state-of-the-art methods; however, accuracy and processing time are factors that still need improvement. Depending on the quality of the images obtained with magnetic resonance imaging (MRI) scanning techniques, doctors can recognise the exact tumor region in a brain image based on its boundary. In cases of low resolution, missing colour information, or tumor characteristics that are not clearly determined, digital image processing algorithms can be used. However, the huge amount of image data obtained by MRI scanning may prevent manual segmentation in a reasonable time. Although convolutional neural network (CNN) models can themselves achieve satisfactory results, they still have restrictions in understanding semantic information and spatial contiguity in image segmentation. Among deep learning approaches, GAN is widely used in image classification. It provides the ability to convince people that its synthetic image is real by using a minimax mechanism [1]. Methods such as GAN can also provide constraints and refinements for a segmentation model's outputs [2,3]. In medical image segmentation, most GAN-based segmentation models are implemented using two components: a generator and a discriminator. While the generator caters for segmentation, the discriminator puts effort into differentiating between the given ground truth and the predicted mask. As a result, the generator is theoretically successful when the discriminator is no longer able to distinguish the real from the predicted segmentation mask. An automatic solution is therefore required for faster and more effective segmentation of brain tumors.
This research paper presents a proposed method for segmentation of brain tumors based on a GAN model. Our proposed method is trained and validated on the BraTS 2021 dataset [4], so the first step is pre-processing the data. In the next step, we build an architecture of a multi-scaled GAN model (for training and testing the processed data). After that, we modify the loss function to improve the accuracy of the proposed method. Our last contribution is an application to visualise 3D models. Compared to several existing methods, our proposed method obtains better performance and accuracy. The remainder of the paper is structured as follows. Section 2 presents related work. Section 3 describes our proposed method for segmentation and simulations in detail. We present the implementation and obtained results in Sect. 4. Section 5 includes discussion and evaluation. The last section is our conclusion and future work.

2 Related Works

In this section, we review state-of-the-art methods and applications for medical image processing. Methods based on graphical techniques and geometric modelling for processing 3D objects were proposed in [5], where the authors presented methods for processing 3D objects from a variety of datasets. Methods for medical image processing have been proposed to reconstruct data objects from a DICOM dataset [6,7]. This method can help


doctors and medical staff reach a better diagnosis in their treatments. The advent of CNNs has contributed to the high-quality outcomes of automatic medical image segmentation. Among the most successful models, the U-Net architecture has been widely used to segment tumors and cells [8]. The structure is formed in a U-shape by the presence of two paths: a contracting path (the encoding path) for obtaining feature information and an expansive path (the decoding path) for spatial information reconstruction. Due to the extensive usage of 3D medical images, early updates of the original 2D U-Net to a 3D version were proposed in studies such as V-Net, with the contribution of the Dice loss [11], and 3D U-Net, which substitutes all of U-Net's 2D operations with their 3D counterparts [12]. U-Net can localise the region of interest in an image to produce a more accurate segmentation while requiring a small number of training images [8,9]. Other variations have been rigorously studied, either to compromise on U-Net's depth [13,14] or to overcome the model's constraints on the diversity of shapes and sizes of the target structure [9,15]. However, studies have shown that U-Net-based models still face limitations with variations of object scales in medical images [9] and demand inept weight ensembles [10]. Therefore, improving the current state of 3D neuro-imaging networks is crucial and can be put into practice under the refinements of frameworks such as GAN [16]. In 2012, a group of scientists established a challenge named Brain Tumor Segmentation (BraTS); the number of organizers, participants, and data has increased since. The recent BraTS 2021 dataset has grown significantly, from 660 to 2000 cases since 2020 [4]. The first-place winner of the 2020 BraTS challenge was Isensee et al. [19], who introduced the framework nnU-Net (no-new U-Net), which allows adaptation to general biomedical segmentation tasks.
nnU-Net uses a batch size of 2, a volume shape of 4 × 128 × 128 × 128, and intensive modifications including region-based training, post-processing, data augmentation, and batch normalization. Thereafter, both winners of the BraTS 2021 challenge, Futrega et al. [20] (first place on the validation set) and Luu et al. [21] (first place on the final testing set), utilised the nnU-Net structure as their baseline model. While the authors from NVIDIA [20] proposed an extensive ablation study to choose the most prominent architecture, the authors from KAIST [21] introduced a more comprehensive network with an axial attention decoder alongside other minor modifications. Nevertheless, most models without GAN suffer from the need for an intensive ablation study to search for the most notable model possible [20]. Such approaches are tedious and expensive, since the models need retraining for any modification. Furthermore, they usually need an asymmetric architecture [17] or two similar models for refinement [18], which can require intensive training resources. The utilisation of GAN can help resolve this cost with fewer efforts. An early implementation of GAN for medical image segmentation was proposed by Xue et al. [22]. SegAN used a basic 2D U-Net architecture as the generator, which receives axial slices of the original 3D volume as inputs. After segmentation, the 2D masks are restacked into volumetric images. The discriminator reuses the generator's encoder as its architecture and takes as input a pair of the predicted map and the ground truth, each multiplied with the brain images. The entire network


aims to optimise the defined multi-scale object L1 loss function. Finally, inspired by Pix2Pix, Vox2Vox follows the GAN architecture in which the generator is a typical 3D U-Net and a simple CNN architecture serves as the discriminator [23]. The generator accepts a cropped 3D image rather than 2D slices (as in SegAN) as input and uses a batch size of 4. The segmentation result and the ground truth are also used as discriminator input for realness recognition. While Vox2Vox also encloses the brain image in the discriminator's input, it applies concatenation instead of multiplication.
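The two ways of fusing the mask with the brain image for the discriminator's input can be contrasted in a few lines; the tensor shapes below are illustrative, not taken from either paper.

```python
import numpy as np

# Toy volumes: a predicted mask (4 label channels) and a brain image
# (4 MRI modalities), both cropped to 16^3 voxels for illustration.
mask = np.random.rand(4, 16, 16, 16)
image = np.random.rand(4, 16, 16, 16)

# SegAN-style fusion: element-wise product of mask and image (channels kept).
fused_mult = mask * image                           # shape (4, 16, 16, 16)

# Vox2Vox-style fusion: concatenation along the channel axis.
fused_cat = np.concatenate([mask, image], axis=0)   # shape (8, 16, 16, 16)

assert fused_mult.shape == (4, 16, 16, 16)
assert fused_cat.shape == (8, 16, 16, 16)
```

Concatenation doubles the discriminator's input channels, whereas multiplication keeps them unchanged; this shape difference is what the discriminator's first layer must accommodate.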

3 Our Proposed Method

3.1 Overview

The main idea of our proposed method is presented as follows (see Fig. 1). We use a GAN model with improvements in the architecture design and the loss function. The method diagram includes four modules: input data, the generator architecture, the bridge (an intermediate processing step between generator and discriminator), and the discriminator.

Fig. 1. The architecture diagram of our proposed method

3.2 Data Structure and Preprocessing

The input data is first processed before being fed to the generator. The bridge is the key improvement in the architectural design of our model. Both real images and predicted images are fed to the discriminator, which determines the real and fake


ones. Modifying the loss function is an additional important step for computing and segmenting the brain tumor. In this research, the input data is downloaded from an open source [4], the Benchmark on Brain Tumor Segmentation (BraTS). The files follow the NIfTI medical image file structure [4]: the images are structured and concatenated into a volume that is saved as a single file (see Fig. 2).

Fig. 2. The structure of NifTi image [4]

Like DICOM images created by MRI/CT scanners [6], NIfTI files are commonly used in neuroscience and even neuroradiology research, and they can easily be converted into DICOM files with existing tools. They are described as follows.

– Data structure: one case (a set of patient images) consists of 5 NIfTI files (4 modes and 1 label). Each NIfTI file is a 3D image of size 240 × 240 × 155, structured as a 3D volume. For annotations, the predicted mask is labelled 0, 1, 2, and 4, where 0 is background, 1 is necrosis, 2 is invasion, and 4 is enhancing tumor (ET). The regions are: enhancing tumor (the regions with label 4); tumor core (TC, the regions with labels 1 and 4); and whole tumor (WT, the regions with labels 1, 2, and 4).
– Pre-processing data: for each case, all modes are combined into a 3D array for processing. For each 3D object, the brain region is cropped based on its boundary and background subtraction. The object data is then normalised for faster training, and data augmentation completes the training preparation.
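The cropping and normalisation step above can be sketched as follows. In practice the NIfTI files would be loaded with a library such as nibabel; this self-contained illustration uses a synthetic NumPy volume, and the exact normalisation (z-score over brain voxels) is our assumption.

```python
import numpy as np

def crop_and_normalise(volume):
    """Crop a 3D MRI volume to the bounding box of its non-zero (brain)
    voxels, then z-score normalise using the brain region's statistics."""
    nz = np.nonzero(volume)
    lo = [int(a.min()) for a in nz]
    hi = [int(a.max()) + 1 for a in nz]
    cropped = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    brain = cropped[cropped > 0]
    return (cropped - brain.mean()) / (brain.std() + 1e-8)

# A 240 x 240 x 155 volume with a smaller "brain" block of non-zero voxels.
vol = np.zeros((240, 240, 155))
vol[60:180, 60:180, 30:130] = np.random.rand(120, 120, 100) + 0.1
out = crop_and_normalise(vol)
assert out.shape == (120, 120, 100)
```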

3.3 Building Architecture

As mentioned, our multi-scale GAN architecture consists of three parts: the generator, the discriminator, and the "bridge". The generator employs two symmetric pathways: an encoding pathway that extracts spatial information from the set of 4 modalities of brain MRI images, and a decoding one that enables precise localisation of tumor regions from the encoded feature map. The detail of the block


is described as follows. The entire encoding pathway of the generator contains 4 blocks at the corresponding levels; the feature map at each lower level is doubled in channels and decreased in width and height. The pattern of 3D instance normalization, Leaky Rectified Linear Unit (LeakyReLU), and 3D convolution is adopted throughout each block. It is also crucial to note that the deeper the model goes, the higher the chance that deterioration occurs during the model's learning process [24]. Hence, to remedy this problem, we use a structure that bears some resemblance to Deep ResUnet [13]: a skip connection, which avoids vanishing gradients, from the first 3D convolution layer of the block to its third one (see Fig. 1). Moreover, a dropout layer is used in the middle of the block to improve generalization [8] (Fig. 3).

Fig. 3. Block architecture
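A minimal sketch of one encoding block following this description is given below; the channel counts, dropout rate, and strided-convolution downsampling are our assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoding block in the spirit of the description above:
    InstanceNorm -> LeakyReLU -> Conv3d repeated three times, dropout in the
    middle, and a skip connection from the first convolution's output to the
    third one to mitigate vanishing gradients."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Sequential(nn.InstanceNorm3d(in_ch), nn.LeakyReLU(0.01),
                                   nn.Conv3d(in_ch, out_ch, 3, padding=1))
        self.drop = nn.Dropout3d(0.2)
        self.conv2 = nn.Sequential(nn.InstanceNorm3d(out_ch), nn.LeakyReLU(0.01),
                                   nn.Conv3d(out_ch, out_ch, 3, padding=1))
        self.conv3 = nn.Sequential(nn.InstanceNorm3d(out_ch), nn.LeakyReLU(0.01),
                                   nn.Conv3d(out_ch, out_ch, 3, padding=1))
        self.down = nn.Conv3d(out_ch, out_ch, 2, stride=2)  # halve spatial size

    def forward(self, x):
        skip = self.conv1(x)
        x = self.conv2(self.drop(skip))
        x = self.conv3(x) + skip          # residual (skip) connection
        return self.down(x)

block = EncoderBlock(4, 32)               # 4 MRI modalities in, 32 channels out
out = block(torch.randn(1, 4, 32, 32, 32))
assert out.shape == (1, 32, 16, 16, 16)   # channels up, spatial size halved
```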

On the other hand, the decoder block accounts for the localisation ability of the model. The input at each level of the block is concatenated with the output feature map from the corresponding encoder, presumably to retrieve information lost during downsampling [25]. Similarly, each decoding block triplicates the usage of 3D convolution, 3D instance normalization, and LeakyReLU. After the first end of the repetition, a copy of the current output is saved for the multi-scale handler. The final output of the block, in contrast, is doubled in width, height, and depth, and then used as the input for the next level. At the other end of the network, the discriminator is responsible for classifying the model's input as either the real label or a generated one. While the presence of the discriminator is necessary in the training phase to guide the generator to its best weight set, it is no longer needed in the inference stage. For the discriminator, we use the same structure as the generator's encoder, with a 1 × 1 × 1 convolution to produce binary outputs. In segmentation, besides the retrieval of semantic structure, the utilisation of low-level features is also essential. Therefore, we employ a multi-scale output processing layer as a means of gathering low-level information. As shown in Fig. 1, we take convolutions of kernel size 1 × 1 × 1


to match them all to 4 channels at each output scale from the generator. Then, element-wise summation is applied to the upscaled feature maps, and the results are convolved at their upper level. Notably, we do not adopt a fusion of masks and images such as concatenation, as in [23], or the element-wise product, as in [22]. Although these methods aim to make the masks harder for the discriminator to distinguish, the fusions did not significantly increase the discriminator's classification loss, yet they require more computation than using masks alone. However, as the idea of using GAN in segmentation is to take advantage of optimising both the adversarial loss and the segmentation losses, intervening to bring difficulty to the discriminator is still a crucial task.
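The multi-scale handler described above can be sketched as follows; the number of scales, channel counts, and interpolation mode are our assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Decoder feature maps at three scales (sizes are illustrative).
feats = [torch.randn(1, 64, 8, 8, 8),
         torch.randn(1, 32, 16, 16, 16),
         torch.randn(1, 16, 32, 32, 32)]

# 1x1x1 convolutions match every scale to 4 channels (the 4 mask labels).
to4 = nn.ModuleList([nn.Conv3d(c, 4, kernel_size=1) for c in (64, 32, 16)])

# From coarsest to finest: upscale, then element-wise sum with the next level.
out = to4[0](feats[0])
for conv, feat in zip(to4[1:], feats[1:]):
    out = F.interpolate(out, scale_factor=2, mode="trilinear") + conv(feat)

assert out.shape == (1, 4, 32, 32, 32)   # full-resolution 4-channel mask logits
```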

3.4 Modifying Loss Function

As mentioned in the previous section, constraints on the discriminator should be imposed to enhance the optimization of the losses. In practice, we want to restrict the numbers of training epochs of the generator and the discriminator [1]. However, it is sometimes difficult to determine how large each epoch count should be for the models' losses to behave as desired. Therefore, we developed a simple decision algorithm that tweaks the epoch counts dynamically based on the adversarial loss trend. In general, the idea is to use linear regression to fit a line Y = α ∗ X + β to the set of iterations X and the corresponding adversarial losses Y. An increase of the loss values when the slope α is positive, or a large y-intercept β when the loss is stabilizing (i.e. α ≈ 0), indicates the discriminator's excellence in detecting the genuineness of the masks. Hence, temporarily ignoring the discriminator while reinforcing the segmented masks helps maintain the aforementioned difficulty. On the other hand, a decrease of the adversarial loss, or its settling at low values, signals that the discriminator is being fooled by the generator. Training the discriminator in these circumstances is therefore crucial and usually contributes to the segmentation loss. The slope is computed as in Eq. 1:

slope = Σ_{i=1}^{|X|} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{|X|} (x_i − x̄)²   (1)

where |X| = |Y| is the number of elements in the set of iterations X or in the list of loss values Y, and x̄ and ȳ are the averages of X and Y, respectively. The decision procedure is given in Algorithm 1, and its usage together with the loss functions is illustrated in Algorithm 2. After every l epochs, the stored adversarial-loss values and iterations are cleared to capture the new loss trend. After the erasure, the user-defined epoch count k is assigned to the discriminator epoch count Nd to ensure that the discriminator is trained for at least k epochs. Otherwise, the whole process is similar to training a residual U-Net.
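The slope estimate in Eq. 1 is ordinary least-squares regression over the recorded iterations and loss values; a minimal pure-Python sketch (function and variable names are ours):

```python
def regression_slope(X, Y):
    """Least-squares slope of the loss values Y over the iterations X (Eq. 1)."""
    assert len(X) == len(Y) and len(X) > 1
    x_bar = sum(X) / len(X)
    y_bar = sum(Y) / len(Y)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
    den = sum((x - x_bar) ** 2 for x in X)
    return num / den

# A rising adversarial-loss trend gives a positive slope (discriminator winning):
print(round(regression_slope([1, 2, 3, 4], [0.1, 0.2, 0.3, 0.4]), 6))  # 0.1
```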
We use a mean-square loss for the adversarial loss and the Dice loss [11] for the region loss. Let M be the length of the dataset; the adversarial loss is given in Eq. 2:

L_adv = (1/M) Σ_{i=1}^{M} (D(mask_i) − target)²   (2)


L. Khanh Phung et al.

where D is the discriminator and target is either 0 or 1. The region loss is given in Eq. 3:

L_region = 1 − DSC = 1 − (2 Σ_{i=1}^{M} p_i ∗ g_i) / (Σ_{i=1}^{M} p_i² + Σ_{i=1}^{M} g_i² + ε)   (3)

where DSC is computed as in Eq. 4, ε is the smoothing value, and p and g are the segmented and ground-truth tumors, respectively. Here N is the total number of epochs; Nd the number of epochs the discriminator is trained in the next epoch i of N; D the discriminator; G the generator; and λ the region loss weight.

Algorithm 1. Algorithm for discriminator epochs estimation
Input: X, Y
Output: Nd
Require: k ∈ Z+
1: α ← slope
2: β ← ȳ − α ∗ x̄
3: if α > 0 then
4:   Nd ← 0          ▷ Increasing loss, train generator
5: else if α < 0 then
6:   Nd ← k          ▷ Decreasing loss, train discriminator
7: else              ▷ Stabilizing loss
8:   if β ≥ threshold then          ▷ threshold value of loss function
9:     Nd ← 0
10:  else
11:    Nd ← k
12:  end if
13: end if
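Algorithm 1 can be transcribed directly into Python. The sketch below assumes the slope α and intercept β have already been fitted to the loss trend, and adds a small tolerance eps for the α ≈ 0 case (our own choice, not specified in the paper):

```python
def discriminator_epochs(alpha, beta, k, threshold, eps=1e-8):
    """Return Nd, the number of discriminator epochs for the next round,
    following Algorithm 1."""
    if alpha > eps:        # increasing loss: freeze discriminator, train generator
        return 0
    if alpha < -eps:       # decreasing loss: train discriminator
        return k
    # stabilizing loss: decide by the y-intercept beta
    return 0 if beta >= threshold else k

print(discriminator_epochs(alpha=0.02, beta=0.4, k=3, threshold=0.5))   # 0
print(discriminator_epochs(alpha=-0.01, beta=0.4, k=3, threshold=0.5))  # 3
print(discriminator_epochs(alpha=0.0, beta=0.7, k=3, threshold=0.5))    # 0
```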

4 Implementation and Results

Our experiments are executed on a machine with an Nvidia RTX 3060 12 GB GPU, an Intel Core i5 2.9–4.3 GHz CPU, and 12 GB of RAM. All models are implemented in the PyTorch framework. The input data are processed with NiBabel, a ubiquitous library for handling NIfTI data, and NumPy for mathematical manipulation. For the aforementioned data augmentation, images are randomly cropped to a patch size of 112 × 112 × 96 and then randomly rotated and flipped. For training optimization, we opt for the Adam optimizer [26] with a learning rate of 0.001, a weight decay of 1e−4, and decay rates ranging from 0.5 to 0.999. Based on our observation of the loss functions, training the models for 100 epochs with a batch size of 3 was enough to guarantee their convergence. With regard to the inference stage, a window size similar to that used in training is essential for the model to run properly. However, the patches should be neither randomly cropped nor centered, since the boundary of the tumor could be lost. Therefore, for the model's consistency, the segmentation output of one case is calculated sequentially from the patches. The results of our models and others are shown in Fig. 4.
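Sequential patch-based inference can be illustrated per axis. The sketch below (our own tiling policy, since the paper does not give code) lists patch start offsets so that fixed-size windows cover a full axis, with the last window shifted back to stay inside the volume:

```python
def patch_starts(dim_size, patch_size):
    """Start offsets so that patches of `patch_size` cover `dim_size` voxels."""
    if dim_size <= patch_size:
        return [0]
    starts = list(range(0, dim_size - patch_size, patch_size))
    starts.append(dim_size - patch_size)  # final patch flush with the border
    return starts

# For a 240-voxel axis and 112-voxel patches:
print(patch_starts(240, 112))  # [0, 112, 128]
```

The last two windows overlap; the overlapping region can simply be overwritten or averaged when the per-patch outputs are stitched back together.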


Algorithm 2. Training algorithm
Require: k ∈ Z+
1: for i in {0..N} do
2:   if i mod l = 0 then
3:     refresh X and Y
4:     Nd ← k
5:   end if
6:   for j in {0..Nd} do                ▷ train discriminator
7:     for sample in dataset do
8:       (img, mask_real) ← sample
9:       L_dis ← L_adv(D(mask_real), 1) + L_adv(D(G(img)), 0)
10:      backprop(L_dis)
11:    end for
12:  end for
13:  for sample in dataset do           ▷ train generator
14:    (img, mask_real) ← sample
15:    L_gen ← λ ∗ L_region(G(img), mask_real) + L_adv(D(G(img)), 1)
16:    backprop(L_gen)
17:    insert iteration_sample into X
18:    insert L_adv into Y
19:  end for
20:  Nd ← Algorithm 1
21: end for
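The losses invoked in Algorithm 2 (Eqs. 2 and 3) can be sketched in plain Python as follows. The function names are ours, and real training would compute these over tensors with automatic differentiation rather than over lists:

```python
def adv_loss(disc_outputs, target):
    """Eq. 2: mean squared error between discriminator outputs and target."""
    return sum((d - target) ** 2 for d in disc_outputs) / len(disc_outputs)

def dice_loss(p, g, smooth=1e-5):
    """Eq. 3: 1 - DSC with a smoothing term, for soft masks p and g."""
    inter = sum(pi * gi for pi, gi in zip(p, g))
    denom = sum(pi * pi for pi in p) + sum(gi * gi for gi in g) + smooth
    return 1.0 - 2.0 * inter / denom

def generator_loss(region, adversarial, lam=1.0):
    """Line 15 of Algorithm 2: L_gen = lambda * L_region + L_adv."""
    return lam * region + adversarial

perfect = dice_loss([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
print(perfect)  # a value close to 0: identical masks give near-zero Dice loss
```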

Fig. 4. Comparison of the methods; the red marks show the major false segmentations (Color figure online)

5 Discussion and Comparison

5.1 Evaluation Metrics

We evaluate our model using four metrics, namely the Dice score, the Jaccard score, the Hausdorff distance, and the average symmetric surface distance (ASSD). The scores are assessed to measure the similarity between the ground-truth and predicted masks; hence, the higher they are, the better the assessed model. Conversely, large distances indicate substantial differences between the two sets. The Dice score is used as a standardized metric to evaluate the similarity of the ground truth and the segmented tumor. Given the set of voxels of the ground truth X and that of the segmented mask Y, the Dice score of one instance can be calculated as follows:

DSC = 2|X ∩ Y| / (|X| + |Y|)   (4)

While the Jaccard score is also used for evaluating similarities, the Jaccard coefficient considers the intersection over the union rather than over the sum of the two sets. Its formula is JSC = |X ∩ Y| / |X ∪ Y| = |X ∩ Y| / (|X| + |Y| − |X ∩ Y|).

The third metric is the Hausdorff distance, which evaluates the difference between the sets of voxels of the ground truths and the segmented tumors. Intuitively, this distance can be understood as the greatest distance from one point in the first set to its closest point in the other set. We compute the 95th percentile of this metric, which is commonly used in evaluations on the BraTS dataset. The distance is defined as follows:

d_H(X, Y) = max{ max_{s_X ∈ S(X)} d(s_X, S(Y)), max_{s_Y ∈ S(Y)} d(s_Y, S(X)) }

where d(x, y) is the Euclidean distance between two voxels x and y, and S(X), S(Y) are the sets of surface voxels of the ground truths X and segmented masks Y, respectively. Finally, ASSD evaluates the average of the symmetric distances between the voxels on the boundaries of the ground-truth and segmented tumors. The metric is described as follows:

d_ASSD(X, Y) = ( Σ_{s_X ∈ S(X)} d(s_X, S(Y)) + Σ_{s_Y ∈ S(Y)} d(s_Y, S(X)) ) / ( |S(X)| + |S(Y)| )
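The two overlap scores of this section can be sketched on voxel-coordinate sets as follows (a toy illustration, not the evaluation code used in the paper):

```python
def dice_score(X, Y):
    """Eq. 4: DSC = 2|X ∩ Y| / (|X| + |Y|) on sets of voxel coordinates."""
    return 2 * len(X & Y) / (len(X) + len(Y))

def jaccard_score(X, Y):
    """JSC = |X ∩ Y| / |X ∪ Y|."""
    return len(X & Y) / len(X | Y)

gt = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1)}
print(dice_score(gt, pred))     # 0.75 (3 shared voxels, 4 + 4 in total)
print(jaccard_score(gt, pred))  # 0.6  (3 shared voxels, 5 in the union)
```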

5.2 Evaluation and Discussion

As shown in Fig. 4, our model achieved competitive results. As marked in pictures 1, 2, and 9 of the figure, most evaluated models are sensitive to noise, producing segmented regions where no tumor exists in the ground truth. Our model, however, did not show such a tendency. It can be argued that insensitive models may lose information in some cases. Nevertheless, as shown in pictures 2, 8, 9, and 10, our model is adept at segmenting normal round-shaped tumors at distinct scales. Furthermore, our model showed competence in detecting abnormal tumors with a variety of shapes, as in pictures 3, 4, and 5. Finally, for cases that are considered difficult for all models, such as the thin boundaries or small tumors in pictures 6 and 7, our model, while it did not capture the tumors to the fullest, yielded results that are not as distorted as


other models'. Regarding the quantitative results shown in Table 1, our models were compared against two CNN-based models, i.e. U-Net and V-Net, and two GAN-based models, i.e. VoxelGAN and Vox2Vox. Although our residual U-Net alone achieves adequate results, its scores were relatively smaller than those of V-Net and Vox2Vox. With the proposed methodology, our multi-scale GAN attained the highest scores and smallest distances in the whole tumor assessment and increased the baseline model's scores on the enhancing tumor and tumor core classes significantly (roughly 3 to 6%). Furthermore, while there was a minor average increase of approximately 2.7 mm in both the Hausdorff distance and the ASSD of the enhancing tumor and tumor core in comparison with the original ResU-Net, these distances are still substantially smaller than those of the other models.

Table 1. Comparison of models' accuracy. The best results of ET, WT, and TC are highlighted in underlined-bold, bold, and italics-bold, respectively.

Model                       Label  Dice (%)  Jaccard (%)  Hausdorff (mm)  ASSD (mm)
U-Net                       ET     77.975    72.112       19.866          7.958
                            WT     81.130    71.690       33.538          13.473
                            TC     73.632    65.715       25.055          9.783
VoxelGAN                    ET     77.683    71.795       26.292          9.745
                            WT     82.586    73.738       31.702          12.798
                            TC     77.192    71.079       33.768          12.513
Vox2Vox                     ET     79.745    74.823       18.892          6.855
                            WT     83.238    75.199       30.996          12.370
                            TC     78.210    72.440       26.013          9.6722
V-Net                       ET     78.332    73.581       14.918          5.527
                            WT     86.411    79.153       18.109          6.351
                            TC     75.478    68.846       18.491          7.066
Our ResUnet (without GAN)   ET     77.964    72.190       6.382           2.349
                            WT     88.103    80.930       16.505          6.0193
                            TC     80.366    75.569       8.431           3.523
Our multi-scale GAN         ET     82.318    78.487       9.616           3.367
                            WT     88.599    81.466       13.323          5.123
                            TC     83.108    79.705       13.180          5.416

6 Conclusion

In this research, we proposed a segmentation model based on GANs. The novelty of our method lies in the construction of the multi-scale GAN and the modification of the training process and loss functions. Furthermore, we performed data pre-processing, developed a visualization application for 3D brain MRI and tumors, and evaluated our model quantitatively and qualitatively against four other models. Our model achieved the most prominent results of all graded models. In


the future, we intend to improve the current model and enhance its ability to detect small, thin tumor regions. Additionally, more models, especially GAN-based ones, should be evaluated. As we have only assessed our model and the other mentioned models on a validation set separated from the BraTS2021 training set, we intend to test our model on the challenge's platform. Finally, the development of our model could be more compelling if we took greater advantage of GANs in aspects beyond segmentation.

Acknowledgment. This research is funded by International University, VNU-HCM under grant number SV2021-IT-02.

References

1. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27 (2014)
2. Maji, D., Sigedar, P., Singh, M.: Attention Res-UNet with guided decoder for semantic segmentation of brain tumors. Biomed. Signal Process. Control 71, 103077 (2022). https://doi.org/10.1016/j.bspc.2021.103077
3. Zhaoa, Z., Wang, Y., Liu, K., Yang, H., Sun, Q., Qiao, H.: Semantic segmentation by improved generative adversarial networks. arXiv (2021). https://doi.org/10.48550/ARXIV.2104.09917
4. Baid, U., et al.: The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv (2021). https://doi.org/10.48550/ARXIV.2107.02314
5. Van Nguyen, S., Tran, H.M., Maleszka, M.: Geometric modeling: background for processing the 3D objects. Appl. Intell. 51(8), 6182–6201 (2021). https://doi.org/10.1007/s10489-020-02022-6
6. Nguyen, V.S., Tran, M.H., Vu, H.M.Q.: An improved method for building a 3D model from 2D DICOM. In: Proceedings of the International Conference on Advanced Computing and Applications (ACOMP), pp. 125–131. IEEE (2018). ISBN 978-1-5386-9186-1
7. Sinh, N.V., Ha, T.M., Truong, L.S.: Application of geometric modeling in visualizing the medical image dataset. SN Comput. Sci. 1(5), 254 (2020)
8. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
9. Ibtehaz, N., Rahman, M.S.: MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 121, 74–87 (2020). https://doi.org/10.1016/j.neunet.2019.08.025
10. Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.: UNet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging (2019). https://doi.org/10.48550/arXiv.1912.05074
11. Milletari, F., Navab, N., Ahmadi, S.-A.: V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: Fourth International Conference on 3D Vision (3DV), pp. 565–571. IEEE (2016)
12. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. arXiv (2016). https://arxiv.org/abs/1606.06650
13. Zhang, Z., Liu, Q., Wang, Y.: Road extraction by deep residual U-Net. IEEE Geosci. Remote Sens. Lett. 15(5) (2018). https://doi.org/10.1109/lgrs.2018.2802944. ISSN 1558-0571
14. Li, R., et al.: DeepUNet: a deep fully convolutional network for pixel-level sea-land segmentation. arXiv (2017). https://doi.org/10.48550/ARXIV.1709.00201
15. Oktay, O., et al.: Attention U-Net: learning where to look for the pancreas. arXiv (2018). https://doi.org/10.48550/ARXIV.1804.03999
16. Chen, H., Qin, Z., Ding, Y., Lan, T.: Brain tumor segmentation with generative adversarial nets. In: 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), pp. 301–305 (2019). https://doi.org/10.1109/ICAIBD.2019.8836968
17. Myronenko, A.: 3D MRI brain tumor segmentation using autoencoder regularization. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018. LNCS, vol. 11384, pp. 311–320. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11726-9_28
18. Jiang, Z., Ding, C., Liu, M., Tao, D.: Two-stage cascaded U-Net: 1st place solution to BraTS challenge 2019 segmentation task. In: Crimi, A., Bakas, S. (eds.) BrainLes 2019. LNCS, vol. 11992, pp. 231–241. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-46640-4_22
19. Isensee, F., Jäger, P.F., Full, P.M., Vollmuth, P., Maier-Hein, K.H.: nnU-Net for brain tumor segmentation. In: Crimi, A., Bakas, S. (eds.) BrainLes 2020. LNCS, vol. 12659, pp. 118–132. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72087-2_11
20. Futrega, M., Milesi, A., Marcinkiewicz, M., Ribalta, P.: Optimized U-Net for brain tumor segmentation. arXiv preprint arXiv:2110.03352 (2021)
21. Luu, H.M., Park, S.-H.: Extending nn-UNet for brain tumor segmentation. arXiv (2021). https://doi.org/10.48550/arXiv.2112.04653
22. Xue, Y., Xu, T., Zhang, H., Long, L.R., Huang, X.: SegAN: adversarial network with multi-scale L1 loss for medical image segmentation. Neuroinformatics 16(3–4), 383–392 (2018)
23. Cirillo, M.D., Abramian, D., Eklund, A.: Vox2Vox: 3D-GAN for brain tumour segmentation. In: Crimi, A., Bakas, S. (eds.) BrainLes 2020. LNCS, vol. 12658, pp. 274–284. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72084-1_25
24. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv (2015). https://doi.org/10.48550/ARXIV.1512.03385
25. Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C.: The importance of skip connections in biomedical image segmentation. arXiv (2016). https://doi.org/10.48550/ARXIV.1608.04117
26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv (2014). https://doi.org/10.48550/ARXIV.1412.6980

Tracking Student Attendance in Virtual Classes Based on MTCNN and FaceNet

Trong-Nghia Pham(B), Nam-Phong Nguyen, Nguyen-Minh-Quan Dinh, and Thanh Le

Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
{ptnghia,lnthanh}@fit.hcmus.edu.vn
Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. All classes are held online in order to ensure safety during the COVID pandemic. Unlike onsite classes, it is difficult to determine the full participation of students in the class, as well as to detect strangers entering the classroom. Therefore, we propose a student monitoring system based on facial recognition approaches. Classical face recognition models are reviewed and tested to select the appropriate ones. Specifically, we design the system with models such as MTCNN and FaceNet, and propose measures to identify people in the database. The results show that the system takes an average of 30 s to learn a new face and 2 s to identify one. Experiments also indicate that face recognition achieves high accuracy under normal lighting conditions; unrecognized cases mostly occur in very dark lighting conditions. Importantly, the system was unlikely to misrecognize subjects in most of our tests.

Keywords: Student monitoring system · Face recognition · MTCNN · FaceNet

1 Introduction

The COVID pandemic is spreading globally and significantly affecting all areas of life, including education. Many universities have switched from onsite to online instruction so as not to disrupt learning. However, many problems have occurred in these classrooms, such as strangers appearing and vandalizing the classroom, or students logging in but not following lectures. In addition, because it is difficult to observe the status of students in class, many teachers lack the information needed to adjust their teaching method to the level of students' acquisition. From that problem, we propose a system based on facial recognition to identify and track students in the classroom. The system can be practically implemented in onsite classes and works on several online platforms. This system also facilitates monitoring of students at the university, restricts strangers, and assesses student participation in courses. Some final-year students have participated and learned
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 382–394, 2022. https://doi.org/10.1007/978-3-031-21967-2_31


how to develop the system. This activity creates motivation for students to do research. Face recognition is a branch of study in the field of computer vision and also a branch of biometrics (alongside fingerprint and iris recognition). It is widely applied in practice to reduce human effort and time in surveillance-related tasks. Fingerprint and iris recognition technologies have been developed to high accuracy and are commonly used today. However, compared with them, face recognition has a richer data source and requires less interaction to perform; moreover, the ability to recognize multiple subjects simultaneously increases its practical applicability. The face-related problem includes many different issues, such as face detection, face recognition, feature extraction, labeling, and classification. However, these problems are not always independent of each other. For example, the first step in face recognition is to detect the face in photos or videos, followed by feature extraction and then the recognition steps. Therefore, it requires a lot of research and development work. Besides, face recognition faces many challenges, including fake faces, lighting conditions, face occlusion, and resolution. These pose many difficulties, but tackling them is also motivating. Face recognition is based on the face's elements, including shape and reflectivity. Many external factors, such as illumination, posture, expression, glasses, and hairstyle, can affect biometric recognition systems. Of all these factors, variation in lighting is a significant challenge and needs to be addressed first. Conventional image-based facial recognition systems are affected by changes in ambient lighting, even in environments where we can control the light, such as homes or studios. The in-depth study of the effect of lighting changes on facial recognition by Adini et al. [1] came to some conclusions:

– Lighting conditions, especially lighting angles, drastically change the appearance of a face.
– When comparing unprocessed images, the variation between shots of one person under different lighting conditions was greater than the variation between images of two people under the same lighting.
– No local filter is capable of correcting variations due to changes in illumination direction.

To tackle the above challenges, we survey the state-of-the-art approaches, then choose the most suitable models for the proposed system. We also make adjustments to match the reality of the context. Training should be quick and straightforward to carry out for new faces. Moreover, the ability to identify quickly and to export the results are further objectives of the system. We organize the article into six sections. The first section introduces an overview of the purpose of the study and the requirements of the problem to be solved. The next section describes related work on face recognition and the achievements obtained so far. Core models applied to our system are described in Sect. 3. In Sect. 4, we introduce our complete face recognition system to monitor students in the online classroom. Section 5 presents the evaluation


T.-N. Pham et al.

dataset, experimental methods, and results obtained. Finally, the last section presents the conclusion and the components to be improved further.

2 Related Work

Many computer vision approaches have been proposed to solve face detection and recognition tasks with high accuracy, such as local, subspace, and composite approaches. However, many problems remain due to various challenges, such as face tilt, lighting conditions, and facial expression. Recently developed methods have focused on solving all these challenges and thus on building reliable facial recognition systems. However, they require high processing time, consume a lot of memory, and are relatively complex. Therefore, we research and evaluate different methods to check their suitability for the system we intend to build. Three basic steps are used to develop a face recognition system: face detection, feature extraction, and face recognition. The face detection step detects and locates the image of a human face in the frame. Next, we extract features from the face detected in the previous step. After that, we compare the feature vectors with the features of other faces that have been processed similarly and stored in the database. If they match above a certain threshold, we confirm the identity; otherwise, we reject it. However, feature vector comparison alone may not be enough for an accurate recognition system. Therefore, many advanced face recognition methods have been proposed to learn the features automatically and provide advanced face detection techniques. As a result, objects are identified more quickly and accurately. In the following paragraphs, we describe the features and salient works in each step of a complete identification system. Face detection is a prerequisite for almost all problems related to human faces, such as face recognition and emotion recognition. The input of this stage is an RGB/grayscale image or video, and the output is the same image or video with the human faces located by bounding boxes.
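The compare-against-database step with a match threshold can be sketched as follows (all names, vectors, and the distance/threshold choices are illustrative, not from any cited system):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def identify(query, database, threshold):
    """Return the closest enrolled identity, or None if no match is close enough."""
    best_name, best_dist = None, float("inf")
    for name, feat in database.items():
        d = euclidean(query, feat)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= threshold else None

db = {"alice": [0.1, 0.9], "bob": [0.8, 0.2]}
print(identify([0.12, 0.88], db, threshold=0.3))  # alice
print(identify([0.5, 0.5], db, threshold=0.1))    # None (rejected as a stranger)
```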
The classic models in this group include the Single-Shot Detector (SSD) [8] with ResNet 10 as the foundation, the Viola-Jones detector [15], and the histogram of oriented gradients (HOG) [9]. In addition, the face detection step can be used for video and image classification, object detection, and region-of-interest extraction. In the feature extraction stage, the features and distribution of the mouth, nose, and eyes in the face image are extracted and formed into feature vectors. These parts are chosen because they carry information or characteristics that allow us to distinguish between different individuals. Prominent techniques for this work include linear discriminant analysis (LDA) [2], principal component analysis (PCA) [13], the scale-invariant feature transform (SIFT) [14], and local binary patterns (LBP) [5]. The final stage is facial recognition. Correlation filters (CF), convolutional neural networks (CNN), and k-nearest neighbors (k-NN) are considered classic methods for this problem. However, the recognition must be made based


on the internal attributes to achieve high accuracy. Furthermore, the algorithms must be able to handle the adverse effects of external factors and of errors during alignment. Many solutions have been proposed, such as applying algorithms to reduce noise and improve resolution; however, this approach did not yield significant results. Researchers have also used 3D techniques based on laser scanners or 3D vision methods, but this approach requires intensive computation and high equipment costs for real-world applications. A more advanced technique is infrared imaging. Its advantage is that it can detect camouflaged faces and recognize faces in poor lighting conditions; however, infrared also suffers from instability at different temperatures. An innovative technique, Near-Infrared Images (NIR) [3], has attracted more and more attention due to its low cost and stability. In contrast to visible-light-based methods, it overcomes the impact of illumination changes on face recognition.

3 Background

3.1 Face Detection

Face detection is considered a specific case of object detection, where the task is to find the positions and sizes of all objects in an image that belong to a particular class. Besides face recognition, face detection can also be applied in other applications such as face motion capture, the autofocus function in photography, emotion prediction, marketing, etc. After surveying and analyzing many face detection models, we selected two methods: MobileNetV2-SSD [11] and MTCNN [16]. These two methods were chosen for their fast face detection and high accuracy. MobileNetV2-SSD combines two neural networks, MobileNetV2 (an improved MobileNet) and SSD (Single Shot Detector) [8]. MobileNetV2 is a base network that uses convolution to produce high-level features for classification or detection. SSD is a detection network that detects objects (faces) from the provided features. Stitched together, they become a complete face detection model. MTCNN (Multi-task Cascaded Convolutional Neural Networks) is a three-stage algorithm used to detect the bounding box along with five landmark points on the face. MTCNN is composed of three stages corresponding to three convolutional neural networks: P-Net, R-Net, and O-Net. Before being fed into P-Net, the input image is scaled down to several sizes, and each scaled image is an input to this network. This operation aims to find faces of different sizes in the input image. For each scaled image, a 12 × 12 kernel slides over its surface with a stride of 2 pixels to find the faces present on it. After each convolution layer, the PReLU activation function is applied. The output comprises the coordinates of each frame and the probabilities that a face does and does not exist in it. After collecting all outputs, the model discards all frames with low confidence and merges highly overlapping frames into a unique frame using NMS (non-maximum suppression).
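Greedy IoU-based NMS as mentioned above can be sketched as follows (a generic textbook variant; MTCNN's exact implementation may differ):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 and is suppressed
```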
Because some frames may grow beyond the image boundary when converted to squares, it is necessary to pad them to obtain enough input values. Then, all frames


are converted to 24 × 24 and fed into R-Net. After each convolution layer, the PReLU activation function is applied. The output comprises the coordinates of more precise frames and the confidence of each frame. O-Net and R-Net are structurally similar, differing only in depth. The results of R-Net, after applying NMS, are resized to 48 × 48 and fed into O-Net as its input. O-Net outputs not only the coordinates of the bounding boxes but also the coordinates of the five landmarks on the face. The three value types of the O-Net outputs are:

– The probability that each bounding box contains a face (the confidence)
– The coordinates of the bounding box
– The coordinates of the five landmarks (two points for the eyes, one for the nose, and two points for the lips)

MobileNet is more suitable for mobile devices and embedded vision applications, i.e., devices or computers with limited resources. Google proposed this architecture. It uses depthwise separable convolutions, which dramatically reduce the number of parameters compared to a network with regular convolutions of the same depth. This yields a deep neural network without heavy computation. In the architecture table, each line is a sequence that repeats the corresponding operator n times; all layers in the same row have the same number of output channels c. The first layer of each sequence has stride s; all other layers have stride 1 (the stride is the number of cells shifted when performing the convolution between two matrices). All convolutions use a 3 × 3 kernel except as specified in the table; t is the expansion factor, and k is the number of color channels of the original image. Models with regular linear bottlenecks do not perform as well as nonlinear models. However, the paper's authors have shown that the linear bottleneck improves model performance, arguing that non-linearity destroys information in the low-dimensional space.
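The parameter saving from depthwise separable convolutions mentioned above can be checked with simple arithmetic (layer sizes chosen arbitrarily for illustration):

```python
def regular_conv_params(k, c_in, c_out):
    """Weights in a regular k x k convolution layer (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depthwise k x k per input channel, then a 1 x 1 pointwise mixing layer."""
    return k * k * c_in + c_in * c_out

reg = regular_conv_params(3, 32, 64)    # 9 * 32 * 64 = 18432
sep = separable_conv_params(3, 32, 64)  # 288 + 2048 = 2336
print(reg, sep, round(reg / sep, 1))    # 18432 2336 7.9 (roughly 8x fewer)
```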
The results in the paper also show that a shortcut connecting the narrow layers works better than one connecting the expanded layers. SSD is a single-shot detection model for multi-class classification problems, faster than previous models of the same type (e.g., YOLO [10]) and significantly more accurate. The core of SSD is predicting scores and offsets for a fixed default set of frames using small convolutional filters applied to the feature maps. Predictions at different scales are generated from feature maps of different scales and cleanly separated by aspect ratio to achieve high detection accuracy. The structure of MobileNetV2-SSD is similar, with only the VGG-16 part replaced by MobileNetV2. We evaluate and compare the two models, MTCNN and MobileNetV2-SSD, both trained on the WIDER FACE dataset. Using the data and evaluation method from the home page of FDDB (Face Detection Dataset and Benchmark) to compare face detection models, the results are shown in Fig. 1: the higher the curve, the more accurate the model. From this, we can see that MTCNN gives more precise results than MobileNetV2-SSD. As for speed, MobileNetV2-SSD performs fewer calculations and therefore runs faster than MTCNN. Table 1 shows this comparison.


Fig. 1. The comparison of MTCNN and MobileNetV2-SSD

Table 1. The speed performance of MTCNN and MobileNetV2-SSD (s)

                   GPU    CPU
MTCNN              0.049  0.27
MobileNetV2-SSD    0.023  0.22

The speed measurement is performed on a self-portrait video at 360p quality, considering only the runtime of the model (excluding video reading and model loading) across all video frames, and then averaging the results. MobileNetV2-SSD's execution time is generally better than that of MTCNN. However, MobileNetV2-SSD is not as accurate as MTCNN: it cannot detect a face that is tilted too far or more than one-third covered, whereas MTCNN still detects a face that is half covered or rotated 90° horizontally. Therefore, we decide to apply MTCNN in our system.

3.2 Face Recognition

In the face recognition phase, we analyzed many models, such as ACNN [7], LFR [4], and LGS [6]. However, we focus on FaceNet [12] because of its strong capabilities. FaceNet is an algorithm introduced in 2015 by Google that uses deep learning to extract features from human faces. FaceNet takes an image of a person's face and returns a vector of 128 important features. An SVM then groups these vectors into clusters representing the faces to which they belong. Figure 2 illustrates FaceNet's architecture. FaceNet is a variant of the Siamese network that represents an image as a point in a multidimensional Euclidean space (usually 128 dimensions), such that the smaller the distance between the embedding vectors, the greater their similarity. Face recognition algorithms other than FaceNet often rely on an extra bottleneck layer to reduce the dimension of the embedding vector. The limitation of

388

T.-N. Pham et al.

Fig. 2. The architecture of FaceNet

this is that the number of dimensions of the embedding vector is relatively large (more than 1000), affecting the algorithm's speed. Therefore, such algorithms are often used together with PCA to reduce the data dimension and increase the calculation speed. In addition, their error functions only measure the difference between two images, so each training step covers only one of two cases (same or different); it is not possible to learn similarity and difference at the same time in one training step. FaceNet solves both of these problems with a small adjustment. Because FaceNet is a variant of the Siamese network (using convolutional networks) and reduces the number of data dimensions to 128, the calculation is faster while accuracy is preserved. Furthermore, the error function that FaceNet uses is the triplet loss, which learns at the same time the similarity between two pictures in the same group and the difference between two pictures from different groups, much more efficiently than previous methods. In FaceNet, the CNN encodes the input image into a 128-dimensional vector, which is then passed to the triplet error function to evaluate the distances. The triplet loss function requires three images, one of which is selected as the anchor. The anchor image (A) is fixed first among the three. The remaining two images are one labeled Negative (N), showing a subject different from the anchor, and one labeled Positive (P), showing the same subject as the anchor. The objective of the error function is to minimize the distance between the anchor and the Positive image while maximizing the distance between the anchor and the Negative image. The loss function is as follows:

L(A, P, N) = \sum_{i=0}^{n} \max\big( \lVert f(A_i) - f(P_i) \rVert_2^2 - \lVert f(A_i) - f(N_i) \rVert_2^2 + \alpha,\ 0 \big) \qquad (1)

The selection of the three images dramatically affects the quality of the FaceNet model. If good triplets are selected, the model converges quickly, and the prediction results are more accurate. Furthermore, hard triplets (Negatives that lie close to Positives in the embedding space) make the trained model stronger, because the resulting vector representing each image must still distinguish such Negatives from Positives. As a result, images with the same label lie closer together in Euclidean space.
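Equation (1) can be written down directly once the embeddings f(A_i), f(P_i), f(N_i) are available. A minimal NumPy sketch, with the margin α set to an arbitrary illustrative value:

```python
import numpy as np

def triplet_loss(anchors, positives, negatives, alpha=0.2):
    """Triplet loss over batches of embedding vectors, as in Eq. (1)."""
    pos_dist = np.sum((anchors - positives) ** 2, axis=1)  # ||f(A)-f(P)||^2
    neg_dist = np.sum((anchors - negatives) ** 2, axis=1)  # ||f(A)-f(N)||^2
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))

# A positive close to its anchor and a negative far away give zero loss.
a = np.array([[1.0, 0.0]])
p = np.array([[1.0, 0.1]])
n = np.array([[-1.0, 0.0]])
print(triplet_loss(a, p, n))  # 0.0
```

Swapping the positive and negative produces a large loss, which is exactly the signal that pushes same-subject embeddings together during training.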

4 A Proposed Student Monitoring System

Attendance and student monitoring systems using facial recognition technology need to be implemented and optimized to identify students as accurately as possible. The system is designed with two main modules: training and student identification. Figure 3 depicts our proposed system.


Fig. 3. A proposed student monitoring system

In the training module, we reconfigure but still build on FaceNet's architecture. Hiroki Taniai's pre-trained Keras FaceNet model was trained on the MS-Celeb-1M dataset, which contains more than 10 million photos of nearly 100,000 world celebrities collected online since 2016. Each image is a color image normalized on all three color channels to 160 × 160 pixels. Since this model gives good results on our system through evaluation, we do not re-train it from scratch but use it as a base for further training on our dataset. As a result, the training cost is reduced, and the model is easy to integrate into many other modules. The dataset that we continue to train on is a dataset of student faces from classes. Each face is passed through the FaceNet model to extract a 128-dimensional feature vector. The feature vector is labeled by the system and saved to the model used in face recognition. The teacher can perform this process at the beginning of the lesson without elaborate preparation in advance. Training time for one student is about 30 s, which we consider appropriate for the context of the problem. After creating the student face database, the face recognition module is executed. First, the system detects faces using the MTCNN model. Then, these face images are fed into the FaceNet model to extract feature vectors, which are compared with the face vectors in the data warehouse created in the previous step to identify students. If a face matches the saved data, the system takes the student's attendance and displays the student's information on the screen. During this process, we additionally apply face alignment techniques. Because a face obtained from the detection step can be in many states and angles, and faces may be overlapped or skewed due to poor video framing, face alignment improves recognition accuracy. This process can be understood as normalizing the data before it is fed into the predictive model.
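A minimal illustration of landmark-based 2D alignment: compute the angle between the two eye landmarks and rotate by its negative so the eyes become horizontal. The coordinates are hypothetical, and the rotation here is applied to landmark points rather than image pixels, purely to show the geometry:

```python
import math

def alignment_angle(left_eye, right_eye):
    """Angle (degrees) of the line through the eyes relative to horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def rotate_point(p, center, angle_deg):
    """Rotate a landmark around the face center by -angle to level the face."""
    t = math.radians(-angle_deg)
    x, y = p[0] - center[0], p[1] - center[1]
    return (center[0] + x * math.cos(t) - y * math.sin(t),
            center[1] + x * math.sin(t) + y * math.cos(t))

angle = alignment_angle((30, 40), (70, 60))   # tilted face
print(round(angle, 1))  # 26.6
```

Applying `rotate_point` with this angle brings both eye landmarks onto the same horizontal line, which is the normalization described above.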
There are many ways to align a face. Some works use 3D models and apply transformations so that the facial landmarks of the input image match the corresponding markers on the 3D model. Other works are simpler, relying on landmarks on the face to rotate, shift, or scale it. To avoid affecting execution time in our system, we use the 2D alignment method. Regarding the process of matching and identifying objects in the database, we considered two directions and evaluated them to select the appropriate method for the deployed application. The first direction is based on one-shot learning, a supervised learning model that requires very few images of an object for training. This model identifies objects through a simple CNN model, but it has


a disadvantage: when new objects are added, we need to retrain the model. The second direction is similarity learning, based on the distance between two photos. The distance is large if the two pictures are not of the same person, and small if they are of the same subject. The advantage of this method is that it is not limited to a fixed set of labels, so there is no need to retrain when a new object appears. Therefore, we decide to implement a similarity-based approach for the system. In our system, we apply the cosine similarity measure. Cosine similarity is a method of calculating the similarity between two non-zero vectors in a dot product space. The cosine of the angle between the two vectors determines their similarity; it equals the dot product of the corresponding unit vectors, both of length 1:

\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \qquad (2)

Based on the similarity formula, we can infer the distance between two images as follows:

\text{cosine distance} = 1 - \text{cosine similarity} \qquad (3)
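Equations (2) and (3) translate directly into code. The sketch below also shows the similarity-threshold matching step; the database contents and the `identify` helper are illustrative stand-ins (the threshold of 0.8 is an assumed parameter here):

```python
import numpy as np

def cosine_similarity(a, b):
    # Eq. (2): dot product of the two vectors over the product of their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    # Eq. (3): distance as the complement of similarity.
    return 1.0 - cosine_similarity(a, b)

def identify(query, database, threshold=0.8):
    """Return the label of the most similar stored embedding,
    or 'Unidentified' if no similarity exceeds the threshold."""
    best_label, best_sim = "Unidentified", threshold
    for label, emb in database.items():
        sim = cosine_similarity(query, emb)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

db = {"student_a": np.array([1.0, 0.0]), "student_b": np.array([0.0, 1.0])}
print(identify(np.array([0.9, 0.1]), db))  # student_a
```

A query embedding that is not close enough to any stored vector falls through to "Unidentified", matching the behavior described in the experiments.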

5 Experiments and Result Analysis

5.1 Datasets

The data set that we use to evaluate the system includes 750 images, taken from Kaggle, of 5 famous people1: Donald Trump, Jack Ma, Bill Gates, Narendra Modi, and Elon Musk. Figure 4 shows some of their pictures. Moreover, we collected about 1,750 photos of 15 volunteers (including the authors) with various characteristics, such as age and gender. Table 2 details this dataset. We further evaluate the model in an online class of about 20 students.

Fig. 4. The famous people’s images on Kaggle

5.2 Parameter Settings

Our experimental configuration includes a computer equipped with an Intel Core i7 CPU and an Nvidia GTX 1050 GPU. The programming language is Python 3.0. The integrated libraries include OpenCV, to support training and object recognition models, and Tkinter, to generate the user interface. The threshold set for recognition is 0.8.

1 https://www.kaggle.com/anku5hk/5-faces-dataset.


Table 2. Attributes of the collected data set

Types     Attributes (Number of images)
Age       10–15 (430), 18–25 (670), 30–50 (650)
Gender    Male (980), Female (770)
Hair      Short (980), Long (770)
Glasses   Wearing (465), Not Wearing (1100)
Place     Indoor (860), Outdoor (890)
Emotion   Neutral (1300), Smile (450)

5.3 Results

From the mentioned data set, we evaluate the proposed system using the k-fold cross-validation method with k = 10. Due to the limitation of the paper length, we do not list all of the volunteers; instead, we choose the pictures of the authors of this paper. Table 3 details this result.

Table 3. Test results of some people in the data set (* indicates all people who exist in the database and not in the database)

Person          Average detection precision   Average time (in seconds)
Bill Gates      0.95                          2.0
Jack Ma         0.92                          2.3
Narendra Modi   0.93                          2.1
Elon Musk       0.92                          2.1
Donald Trump    0.94                          2.2
Author 1        0.84                          2.4
Author 2        0.83                          2.5
Author 3        0.87                          2.3
Unidentified*   0.81                          3.3

Through the experimental process, we find that the images in the data set from Kaggle achieve high results, and the recognition time is faster than for the data set we created ourselves. The reason is that the data from Kaggle has good lighting and high resolution, so the system recognizes it easily. The data set we collected ourselves comes from many contexts; therefore, the image quality is not guaranteed, and the results are not as high. However, the experiments on our data set are objective and consistent with the actual implementation. With the defined threshold, we identify the objects that are stored in the database. In terms of execution time, the images in our dataset take more time, and objects that are not in the data set also make the system take longer. However, the time is within an acceptable range.


Fig. 5. Heatmap visualization representing the correlation precision of face recognition

Figure 5 illustrates the correlation precision of face recognition for people in the dataset. It shows that some samples exist in the system but are labeled Unidentified. We checked these samples against the training samples: the unrecognizable ones have relatively small face regions and poor environmental conditions, so the extracted features may not be strong enough to exceed the decision threshold. On the other hand, some objects are not in the database but are mistakenly assumed to exist. Although the percentage is very small, this can cause serious problems. We find that the wrongly detected samples were all in the data set that we collected ourselves, where the quality of the samples is not good. To solve this problem, we propose two solutions. The first is to closely examine the training data, increase the threshold when evaluating the training set, and remove unqualified images. The second is to identify across different frames or pictures of the same object. Both solutions are easily executable, so we have added these steps to the recognition system. We also test the system in an online class of 20 students on the Zoom platform. At the start of the execution, the system takes pictures of all the screens and then feeds them to the training and identification system. At the beginning of the lesson, each student spends about 30 s taking pictures that are put into the training model. We enter student information and then run the identification system. Students who are not identified by the system are labeled Unidentified. In general, most students were recognized during testing, except for a few students whose webcams were too dim. In addition, not all students are recognized in the same frame, due to students being hidden or moving; we overcome this problem by capturing frames at different times to identify objects. Figure 6 illustrates the execution of the system.


Fig. 6. The implementation of the proposed system in an online classroom via the Zoom platform

Through the implementation process, we find that the model meets the goals. Although much depends on the quality of the devices and the speed of the network connection, the system still gives acceptable results. Unfortunately, due to the pandemic, we were not able to test in onsite classes. However, given the results achieved on the online system, the system could work even better in a real environment, because there we can control the quality of the devices and the execution environment.

6 Conclusion and Future Research Directions

Object identification is a highly applicable problem. We focus on learning methods to solve this problem and apply them to a student monitoring system for the classroom. The system helps teachers track students as well as detect intruders, especially in online classes. We selected the algorithms to integrate into the system through surveys and analysis of existing methods. In particular, two methods form the core that makes the system work well: the MTCNN algorithm to locate the face area and FaceNet to recognize the face. We have also adjusted and added measures, such as cosine similarity, to increase the ability to identify objects in the database. Experimental results show that the accuracy and runtime of the system meet the requirements of an actual implementation. Some issues still need further investment and development, including increasing the ability to recognize faces that are obscured and recognizing learners' emotions. These will be interesting issues for us to continue working on in the future.

Acknowledgements. This research is funded by the University of Science, VNU-HCM, Vietnam under grant number CNTT 2021-13 and the Advanced Program in Computer Science.


References

1. Adini, Y., Moses, Y., Ullman, S.: Face recognition: the problem of compensating for changes in illumination direction. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 721–732 (1997)
2. Annalakshmi, M., Roomi, S.M.M., Naveedh, A.S.: A hybrid technique for gender classification with SLBP and HOG features. Cluster Comput. 22(1), 11–20 (2019)
3. Du, H., Shi, H., Liu, Y., Zeng, D., Mei, T.: Towards NIR-VIS masked face recognition. IEEE Signal Process. Lett. 28, 768–772 (2021)
4. Elharrouss, O., Almaadeed, N., Al-Maadeed, S.: LFR face dataset: left-front-right dataset for pose-invariant face recognition in the wild. In: 2020 IEEE International Conference on Informatics, IoT, and Enabling Technologies (ICIoT), pp. 124–130. IEEE (2020)
5. HajiRassouliha, A., Gamage, T.P.B., Parker, M.D., Nash, M.P., Taberner, A.J., Nielsen, P.M.: FPGA implementation of 2D cross-correlation for real-time 3D tracking of deformable surfaces. In: 2013 28th International Conference on Image and Vision Computing New Zealand (IVCNZ 2013), pp. 352–357. IEEE (2013)
6. Kumar, D., Garain, J., Kisku, D.R., Sing, J.K., Gupta, P.: Unconstrained and constrained face recognition using dense local descriptor with ensemble framework. Neurocomputing 408, 273–284 (2020)
7. Ling, H., Wu, J., Huang, J., Chen, J., Li, P.: Attention-based convolutional neural network for deep face recognition. Multimedia Tools Appl. 79(9), 5595–5616 (2020)
8. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
9. Ouerhani, Y., Alfalou, A., Brosseau, C.: Road mark recognition using HOG-SVM and correlation. In: Optics and Photonics for Information Processing XI, vol. 10395, p. 103950Q. International Society for Optics and Photonics (2017)
10. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
11. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
12. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
13. Seo, H.J., Milanfar, P.: Face verification using the LARK representation. IEEE Trans. Inf. Forensics Secur. 6(4), 1275–1286 (2011)
14. Vinay, A., Hebbar, D., Shekhar, V.S., Murthy, K.B., Natarajan, S.: Two novel detector-descriptor based approaches for face recognition using SIFT and SURF. Procedia Comput. Sci. 70, 185–197 (2015)
15. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, vol. 1, p. 1. IEEE (2001)
16. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016)

A Correct Face Mask Usage Detection Framework by AIoT

Minh Hoang Pham1,3, Sinh Van Nguyen1,3, Tung Le2,3, Huy Tien Nguyen2,3, Tan Duy Le1,3(B), and Bogdan Trawinski4

1 School of Computer Science and Engineering, International University, Ho Chi Minh City, Vietnam
[email protected]
2 Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
3 Vietnam National University, Ho Chi Minh City, Vietnam
4 Department of Applied Informatics, Wroclaw University of Science and Technology, Wroclaw, Poland

Abstract. The COVID-19 pandemic, which has affected over 400 million people worldwide and caused nearly 6 million deaths, has become a nightmare. Along with vaccination, self-testing, and physical distancing, wearing a well-fitted mask can help protect people by reducing the chance of spreading the virus. Unfortunately, researchers indicate that most people do not wear masks correctly, with their nose, mouth, or chin uncovered. This issue makes masks a useless tool against the virus. Recent studies have attempted to use deep learning technology to recognize wrong mask usage behavior. However, current solutions either tackle only the mask/non-mask classification problem or require heavy computational resources that are infeasible for a computationally limited system. We focus on constructing a deep learning model that achieves high-performance results with low processing time to fill the gap in recent research. As a result, we propose a framework to identify mask behaviors in real time, benchmarked on a low-cost, credit-card-sized embedded system, the Raspberry Pi 4. By leveraging transfer learning, with only 4–6 h of training on approximately 5,000 images, we achieve a model with accuracy ranging from 98 to 99% and a minimum of 0.1 s needed to process an image frame. Our proposed framework enables organizations and schools to implement cost-effective correct face mask usage detection on constrained devices.

Keywords: Face mask detection · Convolutional neural network · AIoT

1 Introduction

One of the most infamous milestones of the 21st century is the outbreak of the COVID-19 pandemic, causing the death of approximately 6 million people around the globe in the year 2021 [1]. Even though there is a tremendous amount

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 395–407, 2022. https://doi.org/10.1007/978-3-031-21967-2_32

396

M. Hoang Pham et al.

of effort put into deriving and improving the quality of vaccines, only 64.37% of the world population had received at least one dose of vaccine, as reported in March 2021 [2]. Furthermore, due to the limitations of current biotechnology, people with sufficient doses of vaccine still have a chance of being re-infected as new virus variants emerge. As a result, masks are recommended in public areas as a preventive measure to limit the spread of the virus. However, according to the study in [3], convenience is one of the main factors affecting mask usage behavior, resulting in wrong mask usage and turning masks into ineffective equipment against the pandemic. For example, wearing a mask with the nose uncovered feels more convenient because the respiratory system can function more freely, but it also leaves the wearer susceptible to the COVID-19 virus, as the nose is the main gate for the virus to enter the human body. This effect was observed in a study in Poland [4], where the number of people correctly wearing a mask went down gradually, and almost half of the population wore the mask with either the mouth or the nose uncovered. Given the current situation, regulations and supervision with little human interference are necessary to control proper mask equipment in public areas with minimal human-to-human interactions. Thus, a system that can automatically recognize wrong mask usage should be developed to prevent the spread of the virus most effectively. A lot of research has been conducted in this area; however, most of it follows a two-step process: detect the face first, then recognize mask usage based on the detected face. The first step is considered the backbone of this area, as object detection has witnessed various success stories, with many models providing state-of-the-art results with low latency in execution. For example, the series of YOLO-family models has been developed recently [5–9].
The latter step is the goal of this area: its output is a model with high accuracy and low latency so that the entire process can be executed in real time. Recent studies often ignore the latter criterion and provide a heavy network with almost no practical value, as in the work of [10]. Our work attempts to construct a deep learning model covering the problem of recognizing the wrong usage of masks with high accuracy and low processing time. Experimental results show that our model can achieve accuracy between 98 and 99% with only 0.1 s to process an image frame. Recently, a new trend combines artificial intelligence with IoT to provide the powerful technology of the artificial intelligence of things (AIoT). With the development of sensors and board circuits, some IoT devices, such as the Raspberry Pi series, can execute lightweight deep learning models. Thus, by enhancing the performance of such IoT devices with deep learning technology, it is possible to create an internet of intelligent things and, therefore, allow communication among devices to carry out more sophisticated tasks with high precision. For example, to operate a self-driving car, one needs to combine records from multiple sensors equipped at various positions of the vehicle to construct a representation of its surroundings and make appropriate decisions with low latency. In addition, the power of AIoT is not limited to high-performance devices that can handle medium-sized deep learning models. Cheap sensors with wireless connections can also be used to gather real-time data from various geographical


locations, empowering an AI model, as data is the battery for any AI application. Following this trend, we propose general guidance on deploying a medium-sized deep learning model on an AIoT system. Hence, our contribution is two-fold:

– First, we provide a lightweight model, trained on a small sample yet with generalization capacity, by using transfer learning. We also evaluate each transferred model in terms of processing time to prove its capability to be executed in real time.
– Second, we develop an AIoT (artificial intelligence of things) framework that combines face detection and mask behavior classification models. This framework can be used in various circumstances, such as acting as a gate to control mask usage before accessing highly sensitive areas and collecting data for model retraining and further study.

The organization of this paper is as follows. Section 2 provides an overview of all studies related to ours. Section 3 illustrates the sketch and deployment of our system alongside our model training procedure, followed by Sect. 4, which describes the results of the system implementation in terms of accuracy and processing time. Last but not least, Sect. 5 includes our overall conclusions and future improvements.

2 Background and Related Works

As mentioned in the previous part, most studies tackle the problem of recognizing correct mask behavior by extracting the face from an image captured by a media device and determining the behavior for each detected face afterward. For example, the work of Ejaz et al. [11] uses the Viola-Jones algorithm [12] to extract faces from a camera and principal component analysis (PCA) to group mask behaviors into clusters, based on which the classification is performed. Despite good inference properties, this approach has two main drawbacks. In the face extraction step, despite its low latency, the algorithm is sensitive to image quality (see the work of Enriquez et al. for more details [13]). In addition, using PCA to reduce the data dimension may cause information loss, as information on mask usage lies within the combination of raw pixels in the image. The PCA algorithm also does not scale with the growth of the dataset, so improving the algorithm as the dataset grows may come as a challenge. The study of Fan et al. [10] leverages deep learning for both the face recognition and mask usage recognition tasks. For the former, they use a model pre-trained on the Wider Face dataset [14]. For the latter, they propose a new network architecture called the Context Attention Module (CAM), which utilizes the concatenation of multiple filters and the advantage of the attention mechanism (Woo et al. [15]). Even though the results of this study look promising, the model architecture resembles GoogLeNet [16], which is known to have a considerable network size and is not suitable for implementation in an embedded system. The study also proposed a light version of the network at the cost of losing 4% in performance.


Joshi et al. [17] also used deep learning technology for mask usage recognition. They use three different network architectures to detect faces and extract facial features: the proposal net (P-Net) to propose possible bounding boxes around faces in a media source, the refine net (R-Net) to filter out low-confidence bounding boxes, and the O-Net to detect facial landmarks (including eyes, noses, and mouths) on the face within each bounding box from the previous network. The landmarks are then fed into another deep learning model for mask usage recognition. The results look promising, and the architecture is usable on embedded devices; however, this work focuses more on monitoring whether the face is wearing a mask, not on recognizing the usage of the mask on a specific face. Hence, the results only cover the binary classification task of mask and non-mask faces. Our work is slightly broader, since we have to both detect whether a person is wearing a mask and determine whether they are wearing it correctly. Overall, recent works follow the two steps mentioned earlier; however, they either have low accuracy, only cover a mask/non-mask classification problem, or provide a model that is too heavy for deployment. Our approach aims to solve the problem of incorrect mask usage behavior with high accuracy while providing a model with a low processing time that can be deployed to an AIoT system with external computing.

3 System Design and Implementation

3.1 System Design

In this section, we propose a general framework for mask behavior recognition that is extensible to various requirements, illustrated in Fig. 1. The core of this framework is the combination of a receiver and external hardware. The receiver acts as an intermediary between the system's users and the recognition algorithm, capturing facial images and displaying inferential results, while the external system is responsible for handling all predictive steps. The role of the external server is vital, as it allows the system to use a medium-sized deep learning architecture with minimal latency. The external system should perform image processing and execute two deep learning modules per request with low processing time: a face detection module that extracts faces from raw images and a face classifier that recognizes their mask usage behavior. Without the support of the external server, the recognition task would not be possible, as embedded systems are known for their limited resources. The choice of the external system varies from fog computing, placing a high-performance server near the recognition system, to cloud computing, relying on a cloud service provider to handle the hardware infrastructure. In the former, the scalability and availability of the system are of utmost importance, as the local server needs to be maintained manually. In the latter, latency is the main concern, as most cloud providers are not in the same region as the receiver. To further illustrate the use case of the proposed framework, imagine the scenario where a large corporation wants to launch a COVID preventive campaign by requiring every employee to wear a mask correctly. To enforce the campaign,


Fig. 1. Overview of the correct face mask usage detection framework.

the corporation wants its workers to wear masks appropriately before entering the workplace. Thus, the proposed framework can be deployed as a sensor device positioned at the front door of the corporate building. Whenever a person wants to access the building, they must use the framework to verify correct usage. The camera sensor of the framework captures raw frames and passes each frame to the embedded system, which transmits it to the external server to obtain the inferential results, as indicated in Fig. 1. The external system then replies with the inferential results, which include each detected face's location and mask status within the frame. Corporations can also integrate further actions specific to their requirements. For example, a corporation may want to implement a lock mechanism to prevent non-compliant cases from entering its building.
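The request/response exchange between the receiver and the external server can be sketched as a simple JSON protocol. All field names here (`frame_id`, `image`, `faces`, `box`, `status`) are hypothetical choices for illustration, not taken from the paper's implementation:

```python
import base64
import json

# Receiver side: wrap a captured frame (raw JPEG bytes) into a request.
def build_request(frame_id, jpeg_bytes):
    return json.dumps({
        "frame_id": frame_id,
        "image": base64.b64encode(jpeg_bytes).decode("ascii"),
    })

# External-server side: reply with one entry per detected face.
def build_response(frame_id, detections):
    # detections: list of ((x, y, w, h), status) tuples,
    # status in {"correct", "incorrect", "no_mask"}.
    return json.dumps({
        "frame_id": frame_id,
        "faces": [{"box": list(box), "status": status}
                  for box, status in detections],
    })

reply = build_response(7, [((10, 20, 64, 64), "incorrect")])
print(json.loads(reply)["faces"][0]["status"])  # incorrect
```

The receiver only needs to parse this reply to draw boxes and trigger actions such as the lock mechanism mentioned above; the transport (HTTP, MQTT, raw sockets) is an orthogonal deployment choice.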

3.2 Implementation

We choose the Raspberry Pi 4 Model B as the embedded system for implementation. Equipped with a Broadcom BCM2711 quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5 GHz, this embedded system can handle the basic tasks of a modern personal computer, such as editing documents and streaming videos. Its board also has ports for camera sensors and LCDs, making it suitable for the description of the embedded system in Subsect. 3.1. We use fog computing for the external server, with a MacBook Pro with a 2.5 GHz quad-core Intel Core i7 acting as the server. Using a laptop also recreates a scenario of limited computational resources and provides valuable information for model selection in Sect. 4.

3.3 Training Procedure

The system's core is the face mask classifier module, which recognizes mask usage behaviors from extracted faces; as a result, this module should achieve high performance with low processing time. Recent studies have shown that deep learning, specifically the convolutional neural network (CNN), provides state-of-the-art results for computer vision tasks: it can learn complex features from images thanks to its shared-parameter characteristics. An illustration of a CNN architecture is shown in Fig. 2. The input is successively transformed into higher-dimensional feature spaces by sliding filters over the image, and the output is fed to a fully-connected network to produce the final results.

Fig. 2. An example of a CNN architecture
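As a concrete illustration of the sliding-filter operation described above, the following minimal NumPy sketch applies a single 3 × 3 filter to a small image (valid mode, stride 1). It is a didactic toy, not the paper's implementation; the image values and filter are made up.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution (strictly cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Slide the filter and take the weighted sum of the patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # simple vertical-edge filter
feature_map = conv2d(image, edge_filter)        # shape (2, 2)
```

The key point is parameter sharing: the same nine filter weights are reused at every spatial position, which is what lets CNNs learn local features with few parameters.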

The power of CNNs does not stop there. According to [18], the learned parameters of a CNN are transferable: we can transfer parameters learned among various tasks that are highly correlated. Since object recognition is highly correlated with our mask behavior recognition task, we can copy the parameters from a complex object classification problem to our downstream mask recognition problem. We utilize the feature extractor learned on a more



Fig. 3. Typical examples of the MaskedFace-net dataset

complex dataset and freeze it to retain the learned knowledge. Thus, during the training procedure, the parameters of the feature extractor are fixed and not updated, while a custom classifier is actively updated, allowing domain-specific knowledge to be integrated. Later in Sect. 4, we show that despite training on a relatively small number of examples, the learned model achieves high accuracy on a large test set thanks to the high generalization capacity of the feature extractor.
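The freeze-and-fine-tune procedure above can be mimicked in a few lines of NumPy: a fixed (pretend pre-trained) feature extractor stays untouched while only a small logistic-regression head is trained. The shapes, learning rate, and toy data are all our own assumptions, intended only to show that the backbone parameters never change during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pre-trained" feature extractor: a fixed projection that we freeze.
W_backbone = rng.normal(size=(8, 4))

def extract_features(x):
    return np.tanh(x @ W_backbone)  # frozen: never updated below

# Toy binary data (e.g. correct vs incorrect mask usage).
X = rng.normal(size=(64, 8))
y = (X[:, 0] > 0).astype(float)

# Trainable classification head (logistic regression).
w_head = np.zeros(4)
b_head = 0.0
backbone_before = W_backbone.copy()

for _ in range(200):  # plain gradient descent on the head only
    F = extract_features(X)
    p = 1.0 / (1.0 + np.exp(-(F @ w_head + b_head)))
    grad = p - y
    w_head -= 0.1 * F.T @ grad / len(X)
    b_head -= 0.1 * grad.mean()

assert np.array_equal(W_backbone, backbone_before)  # backbone stayed frozen
```

In a deep learning framework the same idea is usually expressed by marking the backbone layers as non-trainable before compiling the model.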

4 Experiments

4.1 Dataset

The dataset we use is MaskedFace-Net, first introduced in [19]. It contains approximately 130 thousand synthetic images of correct and incorrect mask usage; the incorrect scenarios include wearing masks with the nose, the mouth, or the chin uncovered. The data is synthesized from the FFHQ dataset [20], a benchmark dataset for style-based generative adversarial networks. The synthesis process is as follows. First, it calculates 12 facial landmarks covering the area below the nose. Then, it fills the area



enclosed by the landmarks with fake medical masks. These landmarks can also be shifted to match each case of wearing masks; for example, to generate an image of a mask worn with the nose uncovered, the nose landmarks are moved downwards while the other landmarks stay in place. Figure 3 shows some examples of the MaskedFace-Net dataset. We divide the dataset into training, validation, and test sets (see Table 1 for details). Even though the training and validation sample sizes are relatively small compared to the test size, we can still achieve high accuracy on the test set: since we use transfer learning with a fixed backbone architecture, the pre-trained model has a high level of generalization, so a well-performing model can be produced from a small portion of the images. For evaluation, we use the precision, recall, and F1-score metrics to assess each backbone model, since the random split produces an imbalanced dataset.

Table 1. Dataset distribution

Datasets                                         Label                                      Number of samples
Training set (10072 correct + 9995 incorrect)    Correct mask usage                         10072
                                                 Incorrect mask usage with uncovered nose   9054
                                                 Incorrect mask usage with uncovered mouth  732
                                                 Incorrect mask usage with uncovered chin   941
Validation set (10088 correct + 9979 incorrect)  Correct mask usage                         10088
                                                 Incorrect mask usage with uncovered nose   9091
                                                 Incorrect mask usage with uncovered mouth  779
                                                 Incorrect mask usage with uncovered chin   888
Test set (46,888 correct + 46,760 incorrect)     Correct mask usage                         46,888
                                                 Incorrect mask usage with uncovered nose   42,344
                                                 Incorrect mask usage with uncovered mouth  3334
                                                 Incorrect mask usage with uncovered chin   4416
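Because a random split can leave the classes imbalanced, plain accuracy is misleading, which is why precision, recall, and F1-score are used. The following generic sketch (not the paper's evaluation code) shows how these metrics follow from true/false positive counts; the toy labels are made up to illustrate a minority class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall and F1-score for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Heavily imbalanced toy labels: 1 = incorrect mask usage (minority class).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [1] * 5 + [0] * 5  # model finds only half of the minority
p, r, f = precision_recall_f1(y_true, y_pred)
```

Here accuracy would be 95% even though recall on the minority class is only 50%, which is exactly the situation these metrics guard against.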

4.2 Results

We evaluate our implementation with three different backbone networks, namely MobileNet [21], VGG16 [22], and ResNet50 [23], with the main network being a fully-connected network with two hidden layers of 512 and 256 units, respectively. The output layer has two units corresponding to the predicted probability distribution over two labels: correct mask usage and incorrect mask usage. We trained all networks with the RMSProp optimizer and a learning rate of 0.0001. The pre-trained weights of each backbone network are taken from its corresponding parameters trained on the ImageNet dataset [24], a large dataset for object recognition over 1000 classes of high-resolution images. We froze the pre-trained weights and did not update them during the training process, in order to retain the knowledge learned from the upstream problem. Table 2 shows the performance of the transferred models after training for 50 epochs with a batch size of 32. As shown, MobileNet yields the highest F1-score, with VGG16 coming close at 98.73%. Moreover, the evaluation time of VGG16 is nearly identical to that of MobileNet. Hence, we consider these two models further in Subsect. 4.3, since we can sacrifice some accuracy to gain real-time deployment.

Table 2. Model performance on MaskedFace-net dataset

Model name  Precision  Recall  F1-score  Training time (hours)  Testing time (hours)
ResNet50    50.07%     50.07%  50.07%    6.1814                 0.7505
MobileNet   99.39%     99.39%  99.39%    4.7844                 0.5503
VGG16       98.73%     98.73%  98.73%    4.8856                 0.5353

It is worth noting that the performance metrics are recorded on approximately 90,000 images while the model is trained on only 20,000 images. These numbers demonstrate the power of transfer learning in terms of generalization capacity, illustrating the transfer of knowledge from a model trained on a more complex task (in this case, ImageNet) to a model trained on a more straightforward task (in this case, mask behavior recognition). As an illustration of the performance results, Fig. 4 demonstrates our model performance on both testing and real-time data.

4.3 Discussion

Despite its high performance metrics, our model is not perfect: it still makes minor errors. Figure 5 illustrates some common errors made by the system using either VGG16 or MobileNet. As can be seen, most errors fall into the category of incorrect mask usage with an uncovered chin, which is a relatively safe way of wearing a mask since the respiratory system is still fully covered. These errors result from the low number of examples in this category (see Table 1). There are, however, a few ways to address these issues. The first is to gather more data for this category so the model can recognize such cases. Another is to assign a higher weight to this case so that the model is penalized more when it is misclassified. However, these cases are minor for a basic implementation. When deployed in real time, performance metrics are not the only criteria: processing time is also crucial, as the system needs fast processing between frames. Table 3 reports the size of each model and its processing time, measured as the average processing time between two consecutive frames when deployed in the proposed AIoT system described in Sect. 3.
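The re-weighting idea mentioned above can be sketched as a class-weighted cross-entropy, where the under-represented "uncovered chin" class receives a larger weight so that its misclassifications cost more. The class indices, weights, and predicted probabilities below are made-up examples, not values from the paper.

```python
import math

def weighted_cross_entropy(probs, label, class_weights):
    """Cross-entropy where mistakes on rare classes cost more."""
    return -class_weights[label] * math.log(probs[label])

# Hypothetical 4-class setup: 0 = correct, 1/2/3 = nose/mouth/chin uncovered.
# The rare "chin" class (index 3) is up-weighted (the weights are assumptions).
weights = {0: 1.0, 1: 1.0, 2: 1.0, 3: 5.0}
probs = {0: 0.7, 1: 0.1, 2: 0.1, 3: 0.1}  # model's predicted distribution

loss_chin = weighted_cross_entropy(probs, 3, weights)  # true label: chin
loss_nose = weighted_cross_entropy(probs, 1, weights)  # true label: nose
```

With equal predicted probabilities, the loss for a missed chin case is five times that of a missed nose case, pushing the optimizer to pay more attention to the rare class.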



Fig. 4. Some examples of the system classifying mask images on both the testing set and real-time images

Table 3. Model size and processing time

Model name  #parameters  Processing time (seconds)
ResNet50    23,587,712   0.2675 ± 0.2886
MobileNet   3,228,864    0.1261 ± 0.0351
VGG16       14,714,688   0.3436 ± 0.0722

Since most of the processing computation is performed on the external hardware, we use the built-in camera of the macOS laptop mentioned in Subsect. 3.2. The results show that MobileNet beats VGG16 on both performance and processing time: MobileNet takes only about 0.1 s to process an image frame, while VGG16 takes roughly 2.7 times longer. The large difference can be explained by the number of parameters of each model. Since MobileNet is designed for mobile devices, it contains the fewest parameters, while VGG16 uses a vanilla convolutional architecture, which makes it much larger than MobileNet. As a result, MobileNet is empirically the most efficient model to deploy in our AIoT system.
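The processing-time figures in Table 3 can be reproduced in spirit with a small timing harness that averages the latency between consecutive frames. The stub model below merely sleeps to simulate inference and is our own placeholder for the real network.

```python
import statistics
import time

def stub_model(frame):
    """Placeholder for the mask classifier; sleeps to simulate inference."""
    time.sleep(0.001)
    return "correct"

def measure_processing_time(frames):
    """Mean and standard deviation of the per-frame processing time."""
    durations = []
    for frame in frames:
        start = time.perf_counter()
        stub_model(frame)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)

mean_t, std_t = measure_processing_time(range(20))
```

Reporting the standard deviation alongside the mean, as Table 3 does, matters in practice: a model with occasional latency spikes (note ResNet50's large deviation) can miss frames even if its average time looks acceptable.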



Fig. 5. Some examples of system errors

5 Conclusions

In this study, we investigated the performance of three different models for the problem of recognizing mask usage behavior. We utilized transfer learning to accomplish the task more efficiently by retaining the weights trained on the ImageNet dataset and adding custom fully-connected layers with parameters suited to our problem. The results show that MobileNet offers the best trade-off between accuracy and processing time. We also proposed a general and extensible framework for deploying the model to an AIoT system with external computation, the crucial part that allows the system to execute deep learning models. In addition, we demonstrated examples of system successes and failures, where most failures fall into the safe case of mask wearing. Such failures can be alleviated by collecting more data on these cases and using weighting techniques to penalize the model's wrong predictions for them. Even though our system achieves high accuracy, there is still room for improvement. With the development of face recognition, facial landmarks could be further utilized to identify each case of incorrect mask usage. Model optimization is another direction, minimizing the architecture to lower the inference time. Lastly, adding a GPU to the embedded system is also attractive, as it would allow the deep learning model to operate smoothly on the edge without external computation.

References

1. Dong, E., Du, H., Gardner, L.: An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect. Dis. 20(5), 533–534 (2020)
2. Mathieu, E., et al.: A global database of COVID-19 vaccinations. Nat. Human Behav. 5(7), 947–953 (2021)
3. Esmaeilzadeh, P.: Public concerns and burdens associated with face mask-wearing: lessons learned from the COVID-19 pandemic. Prog. Disast. Sci. 13, 100215 (2022). https://www.sciencedirect.com/science/article/pii/S2590061722000023



4. Gańczak, M., Pasek, O., Duda-Duma, Ł., Świstara, D., Korzeń, M.: Use of masks in public places in Poland during SARS-CoV-2 epidemic: a covert observational study. BMC Public Health 21(1), 1–10 (2021)
5. Liu, W., et al.: SSD: single shot multibox detector. CoRR abs/1512.02325 (2015). http://arxiv.org/abs/1512.02325
6. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. CoRR abs/1804.02767 (2018). http://arxiv.org/abs/1804.02767
7. Bochkovskiy, A., Wang, C., Liao, H.M.: YOLOv4: optimal speed and accuracy of object detection. CoRR abs/2004.10934 (2020). https://arxiv.org/abs/2004.10934
8. Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: unified, real-time object detection. CoRR abs/1506.02640 (2015). http://arxiv.org/abs/1506.02640
9. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. CoRR abs/1612.08242 (2016). http://arxiv.org/abs/1612.08242
10. Fan, X., Jiang, M.: RetinaFaceMask: a single stage face mask detector for assisting control of the COVID-19 pandemic. In: 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 832–837. IEEE (2021)
11. Ejaz, M.S., Islam, M.R., Sifatullah, M., Sarker, A.: Implementation of principal component analysis on masked and non-masked face recognition. In: 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), pp. 1–5 (2019)
12. Viola, P., Jones, M.: Robust real-time object detection. Int. J. Comput. Vision 57, 137–154 (2001)
13. Enriquez, K.: Faster face detection using convolutional neural networks & the Viola-Jones algorithm (2018)
14. Yang, S., Luo, P., Loy, C.C., Tang, X.: WIDER FACE: a face detection benchmark. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
15. Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
16. Szegedy, C., et al.: Going deeper with convolutions. CoRR abs/1409.4842 (2014). http://arxiv.org/abs/1409.4842
17. Joshi, A.S., Joshi, S.S., Kanahasabai, G., Kapil, R., Gupta, S.: Deep learning framework to detect face masks from video footage. In: 2020 12th International Conference on Computational Intelligence and Communication Networks (CICN), pp. 435–440 (2020)
18. Zamir, A.R., Sax, A., Shen, W., Guibas, L.J., Malik, J., Savarese, S.: Taskonomy: disentangling task transfer learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3712–3722 (2018)
19. Cabani, A., Hammoudi, K., Benhabiles, H., Melkemi, M.: MaskedFace-Net, a dataset of correctly/incorrectly masked face images in the context of COVID-19. Smart Health 19, 100144 (2021)
20. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. CoRR abs/1812.04948 (2018). http://arxiv.org/abs/1812.04948
21. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017). http://arxiv.org/abs/1704.04861
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)



23. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
24. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)

A New 3D Face Model for Vietnamese Based on Basel Face Model

Dang-Ha Nguyen¹, Khanh-An Han Tien¹, Thi-Chau Ma¹(B), and Hoang-Anh Nguyen The¹,²

¹ University of Engineering and Technology, VNUH, Hanoi, Vietnam
[email protected]
² Vietnam - Korea Institute of Science and Technology, Hanoi, Vietnam

Abstract. In recent years, many methods for 3D face reconstruction from images have been introduced, and most of them have shown impressive results. However, methods such as photogrammetry require images from multiple views and can be very time-consuming. Deep learning based methods, on the other hand, are faster and more efficient but rely heavily on the base face models and training datasets. Meanwhile, most base face models lack Asian facial features, and high-quality Vietnamese facial image databases are not yet available. In this paper, we propose an approach that increases the accuracy of Vietnamese 3D faces generated from a single image by creating a new mean face shape and training a convolutional neural network with our dataset. This method is compact and can improve the quality of 3D face reconstruction using facial image data with specific geographic and ethnic characteristics.

Keywords: 3D face · Face reconstruction · Morphable model

1 Introduction

Reconstructing a 3D face from a single image refers to retrieving a 3D face surface model of a person given only one input face image. This is a classical and challenging task in computer vision with a wide range of applications, such as improving face recognition systems [9,20] or face animation [17,19]. Two decades ago, Vetter and Blanz introduced a solution for this task called the 3D Morphable Model (3DMM) [3]. The basic idea is to learn the features of a human face from the input image and then use them to change the parameters of a base generative 3D face model. Together with other attributes like texture, illumination and projection, we can generate the 3D shape of the given face image. Since then, many researchers have created various base face models: the Basel Face Model (BFM) [11,14], the Faces Learned with an Articulated Model and Expressions (FLAME) model [13], and the state-of-the-art Universal Head Model (UHM) [15,16] combined from the LSFM [4] and LYHM [6,7] models. However, a good morphable model alone is not enough for the "model fitting" process, as obtaining the features for the corresponding coefficients also plays an important role. Until now, high-quality ground truth 3D models of human faces have remained scarce. Weakly supervised deep learning methods [8,10,18,21,23] have shown the most promising results, but even these highly successful techniques could not reconstruct a realistic Vietnamese 3D face. An important reason is that base generative face models [4,6,7,14] are usually created by scanning the faces of Europeans and Americans. To address the problem, we take pictures of almost 200 Vietnamese people from different angles with distinctive expressions. Then, with the new dataset, we generate the 3D face of every person using their frontal face image under a normal condition to develop a mean face shape. We observed that our mean face shape has more Asian features and fits Vietnamese face images better. The dataset is also used in the training session to increase the accuracy of the process. Evaluating the truthfulness of 3D faces remains a problem because ground truth face models are lacking; thus, we use two measurements, the photometric error and the cosine distance, to test the effectiveness of our method on Vietnamese faces. Besides, we also use the mean square error to verify our results on a few 3D scanned face models from the NoW dataset [18]. In summary, our major contribution in this paper is creating a new mean face shape for Vietnamese by utilizing the Basel Face Model [14] and a 3D face reconstruction with a weakly supervised learning method [8].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 408–420, 2022. https://doi.org/10.1007/978-3-031-21967-2_33

2 Related Works

In this paper, we develop our method based mainly on the Basel Face Model [11,14] and Deng et al.'s framework [8]. Although there are newer generative face models such as the FLAME [13] and Universal Head [15,16] models, we specifically choose the BFM-2009 [14] because it mainly focuses on the shape of the face region. Deng et al. [8] also rely on this base generative face model to reconstruct 3D faces, and their work has shown very promising results.

2.1 Basel Face Model

The Basel Face Model is a Morphable Model first introduced in 2009 by Paysan et al. [14]. To generate this face model, they scanned the faces of 100 men and 100 women, most of them European; the participants' ages range from 8 to 62, with an average of 25. Their face scanning system was one of the most advanced and effective at the time. Instead of using conventional 3D scanners as in older methods, Paysan et al. [14] used a coded light system built by ABW-3D. Together with three cameras and two projectors, they were able to retrieve four depth images for face shape reconstruction. Three more SLR cameras were used to collect RGB images for the face texture. After that, the 3D face models are unified by applying a modified version of the Optimal Step Nonrigid ICP Algorithm [2]. The face model of



each individual is then parameterized as a triangular mesh with shared topology, represented by two independent 3m-dimensional (m = 53490) vectors

S = (x_1, y_1, z_1, ..., x_m, y_m, z_m)
T = (r_1, g_1, b_1, ..., r_m, g_m, b_m)

where S and T stand for shape and texture. Finally, a parametric face model consisting of M_s(μ_s, σ_s, U_s) and M_t(μ_t, σ_t, U_t) is generated using Principal Component Analysis (PCA), where μ is the mean, σ the standard deviations, and U the basis of principal components (Fig. 1).

Fig. 1. 2009 Basel Face Model [1]

From here, we can reconstruct a new 3D face by combining the parametric face model with features (α, β) from an input image:

S(α) = μ_s + U_s diag(σ_s) α
T(β) = μ_t + U_t diag(σ_t) β    (1)

2.2 Weakly-Supervised Learning Method

In 2019, Deng et al. proposed an impressive CNN-based single-image face reconstruction method [8]. By implementing an image-level loss and a perceptual loss,



they proved that there are effective ways to deal with the lack of ground truth 3D face models. Instead of directly comparing two 3D models, the network computes the loss between the rasterized image of the face shape and the input image. The robustified photometric loss is the part of the image-level loss with the biggest impact on the quality of the final model, so we use it as a measurement to evaluate the accuracy of our later model. To reconstruct the face of an individual, the image is fed into a convolutional neural network called R-Net, which is based on ResNet-50 [12] with the final fully-connected layer modified to 239 neurons. The final output consists of coefficients that modify the BFM-2009 [14] to create the 3D face shapes of the corresponding people. The output coefficients (Fig. 2) include:
– Identity (α) and expression (β), which transform the base face shape.
– Texture (δ) and lighting (γ), which change the face color to recover the skin surface.
– Pose (ρ), which rotates and translates the face shape to ensure the position of the face after rasterizing.

Fig. 2. Deng et al.’s framework

After acquiring the final coefficients, we can represent the new 3D face model by

S = S(α, β) = S̄ + B_id α + B_exp β
T = T(δ) = T̄ + B_t δ    (2)

where S̄, T̄, B_id, B_t are the mean shape, mean texture, and the principal components of identity and texture respectively, all belonging to the BFM-2009 [14]. B_exp denotes the group of principal components of expression, provided by FaceWarehouse [5] (Fig. 3).

3 Proposal

Fig. 3. Our main framework

The 2009 Basel Face Model [14] used in Deng et al.'s framework [8] has shown great results in many cases. However, most participants in its data collection sessions were European, so the mean shape model also has European facial features and characteristics, such as almond-shaped eyes and long, narrow noses [22]. Therefore, new 3D models reconstructed from Vietnamese face images are still not convincing enough. To tackle this problem, we collected new face images of 195 people, consisting of 100 men and 95 women, all of whom are Vietnamese. Then, we recreated new 3D faces of every individual to obtain a new mean shape for the later training and reconstruction sessions. To sum up, we propose a method to increase the accuracy of single-image 3D face reconstruction for Vietnamese by generating a new mean shape with our high-quality face image dataset.

3.1 Face Image Collecting

We create our dataset by taking pictures of 195 Vietnamese, most of them Viet people. The gender ratio is nearly balanced, and the ages range between 20 and 50. To capture photos of each volunteer, we use a capturing system containing 27 FHD DSLR cameras and 10 lighting devices. The cameras are arranged in 3 separate layers, and on each layer they are installed at equal intervals. We also make sure all capturing devices are synchronized so that all images are taken at once, which makes the process more efficient. Each volunteer takes part in an approximately one-hour process of five small sessions. In the first three sessions, we capture faces with no accessories, as shown in Fig. 4. In the fourth and fifth sessions, we ask the participants



to wear a face mask and sunglasses, respectively. Data acquired from the last two sessions can help improve face reconstruction under occlusion. In each session, each individual is required to perform seven different expressions: neutral, happy, sad, afraid, angry, surprised and disgusted, under 11 lighting conditions. Thus, the total number of images collected from each person is 10,395: 5 (capturing sessions) × 11 (lighting conditions) × 7 (facial expressions) × 27 (poses) (Fig. 5).

Fig. 4. Captured photos from our camera system

3.2 Generating New Mean Face Shape

Fig. 5. Detailed 3D face reconstruction process

Generating a quality 3D face shape is a challenging task, so in this paper, by utilizing the 2009 Basel Face Model [14] and Deng et al.'s method [8], we can recover the face model of a person at good quality without complicated 3D scanning systems. However, the base mean shape from the BFM has a total of 53490 vertices, and since we focus on improving the detail of the face region, we excluded the ears from the base model. Therefore, the new face shape has only 35709 vertices. Then, we use the frontal image of each individual from the



dataset in the normal lighting condition with a neutral expression to generate 195 different 3D faces. After obtaining the face models of all volunteers, we create a new mean shape by averaging every vertex across subjects. Our new mean face model, named V-Mean (Fig. 6), has noticeably more Asian facial features, such as a wider nose and shallower eye regions. Before using V-Mean for testing, we also use our dataset to train the R-Net so that the output coefficients fit our new face model.

Fig. 6. Comparing mean shape from BFM (left) with V-Mean (right)
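Because all reconstructed meshes share the BFM topology, corresponding vertices line up across subjects, so the vertex-by-vertex average above reduces to a single NumPy mean. The array sizes here are toy assumptions standing in for 195 meshes of 35709 vertices.

```python
import numpy as np

rng = np.random.default_rng(7)

n_subjects, n_vertices = 195, 4  # toy vertex count; V-Mean uses 35709
# One (n_vertices, 3) array of x/y/z coordinates per reconstructed subject.
faces = rng.normal(size=(n_subjects, n_vertices, 3))

# New mean face shape: average every vertex coordinate across subjects.
v_mean = faces.mean(axis=0)      # shape (n_vertices, 3)
```

This only works because of the shared topology; meshes with different connectivity would first need dense correspondence (e.g. via non-rigid ICP, as in the BFM construction) before averaging.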

3.3 Training Session

We used the pre-trained model of Deng et al. [8] and continued training on a dataset of 37k images of Vietnamese people. Training a model with a batch size of 32 and 20 epochs takes almost 3 h on a single NVIDIA RTX 3060 GPU. Our new dataset used for training contains:

– Number of images: 37737.
– Number of subjects: 195.
– Number of angles/person: 28.
– Number of lighting conditions/person: 1.
– Number of expressions/person: 7.
– Number of images/person: 196.



– Input image resolution: 224 × 224.
– Number of training images: 29647.
– Number of validation images: 7449.

In our experiment, we trained two models, one using the original mean face shape from BFM-2009 [14] and one using our Vietnamese mean face.

4 Experiments

4.1 Measurements

Formally, to measure the accuracy of reconstructed 3D faces, we would need to compute the distance from each vertex of the reconstructed 3D model to the closest vertex on the ground truth 3D model and then use the RMSE (Root Mean Square Error) as the final error. However, in our experiment we do not have the resources to construct the 3D ground truth ourselves, so in this section we propose two measurements that evaluate the performance of our model on Vietnamese people using only 2D reconstructed images and 2D ground truth images. The two main metrics we use are the cosine distance and the photometric error. Moreover, we also use the mean square error to ensure the new face model does not perform worse on pictures of foreigners. The first metric calculates the cosine distance between identity features. With this function, we can measure the similarity between two vectors, in this case the features detected from the input image and from the reconstructed image; this metric compares the accuracy of the reconstructed shapes.

Cosine Distance = 1 − (f(I_input) · f(I_reconstructed)) / (‖f(I_input)‖ ‖f(I_reconstructed)‖)    (3)

In this formula, f denotes the features detected from the images. The cosine distance ranges from 0 to 1, where 1 indicates no similarity between the two images and 0 indicates that they are identical. To cross-check our first measurement, we use a second metric, the photometric error, which subtracts the reconstructed image from the input image and accumulates the per-pixel error:

Photometric Error = V_I ⊙ (I_input − I_reconstructed)    (4)

where V_I is a skin mask whose value is 1 for pixels in the skin area and 0 elsewhere, and ⊙ denotes the Hadamard product. To construct the skin mask, we use the skin segmentation method of Deng et al. [8], which allows us to exclude pixels in special regions such as hair, beard, and glasses. The photometric error ranges



from 0 to 255, with a lower value meaning the input and reconstructed images are more alike. The final measurement is the mean square error, which calculates the distance between the vertices of the 3D reconstructed models and the ground truth models:

Mean Square Error = (1/N) Σ_{i=1}^{N} (P_i − Q_i)²    (5)

where P_i is a vertex of the 3D ground truth face model and Q_i is the nearest vertex on the reconstructed model. The mean square error is 0 when the two models are identical and is unbounded above. We only use this metric for comparing the 3D generated faces with the scanned faces from the NoW challenge [18].
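The three measurements (Eqs. 3 to 5) translate almost directly into NumPy. The sketch below is simplified: the feature extractor f, the skin mask, the per-pixel averaging of the photometric error, and the brute-force nearest-vertex search are our own stand-ins for the real pipeline.

```python
import numpy as np

def cosine_distance(f_input, f_reconstructed):
    """Eq. (3): 1 minus the cosine similarity of two identity-feature vectors."""
    num = np.dot(f_input, f_reconstructed)
    den = np.linalg.norm(f_input) * np.linalg.norm(f_reconstructed)
    return 1.0 - num / den

def photometric_error(img_in, img_rec, skin_mask):
    """Eq. (4), averaged: mean absolute per-pixel error inside the skin region."""
    diff = skin_mask * (img_in - img_rec)  # Hadamard product with the mask
    return np.abs(diff).sum() / skin_mask.sum()

def mean_square_error(gt_vertices, rec_vertices):
    """Eq. (5): MSE between ground-truth vertices and their nearest
    reconstructed vertices (brute-force nearest-neighbour search)."""
    dists = np.linalg.norm(
        gt_vertices[:, None, :] - rec_vertices[None, :, :], axis=2)
    nearest = rec_vertices[dists.argmin(axis=1)]
    return np.mean(np.sum((gt_vertices - nearest) ** 2, axis=1))

v = np.array([1.0, 2.0, 3.0])
d_same = cosine_distance(v, v)  # identical features give distance 0
```

For large meshes the pairwise-distance matrix in `mean_square_error` becomes expensive, and a k-d tree would normally replace the brute-force search.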

4.2 Discussion

After training our R-Net models, we applied the first two measurements from the previous section to our test sets and then calculated the mean result to measure the performance of our models (Fig. 7).

Fig. 7. Input and reconstructed images of the original model and our models. The first row represents our first test set and the second row our second test set.

For our first test set, we measure the accuracy on 23 images of Vietnamese people in front view under a normal lighting condition; we name this the "normal testset". For our second test set, we use 30 images of Vietnamese people with different blur magnitudes, including cases where people are wearing glasses; we name this the "blur testset". In our first experiment, we used the first measurement, the cosine distance. According to Table 1,



both of our models result in a lower cosine distance than the original model on the first test set. This shows that using only 37K images, we have already improved the accuracy of the 3D shapes of Vietnamese faces. Additionally, the result of the model trained with V-Mean supports our speculation that, with more Viet features in the mean face shape, the reconstructed faces of Vietnamese become more realistic.

Table 1. Comparing cosine distance between models

                Deng et al.  Proposed method:        Proposed method:
                             Trained model with BFM  Trained model with V-Mean
Normal Testset  0.357        0.313                   0.309
Blur Testset    0.516        0.518                   0.513

However, the cosine distances on the second test set indicate that in special cases, where the faces are blurry or partially covered by glasses or masks, our models perform worse. This result is expected, since the face images used for training are sharper and can be observed clearly. Given these results, we predict that the models would perform much better with larger training data sets containing more in-the-wild pictures.

Table 2. Comparing photometric error between models

                 Deng et al.   Proposed (BFM)   Proposed (V-Mean)
Normal Testset   4.167         3.79             3.86
Blur Testset     6.03          6.26             6.301
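The exact photometric-error formula is not given in the text; a common choice, sketched below as an assumption rather than the authors' implementation, is the mean per-pixel colour difference between the input photo and the rendered face, optionally restricted to a face-region mask:

```python
import numpy as np

def photometric_error(input_img, rendered_img, mask=None):
    """Mean per-pixel RGB difference between input and reconstruction.

    Both images are H x W x 3 float arrays; `mask` (H x W bool) optionally
    restricts the error to the rendered face region.
    """
    diff = np.linalg.norm(input_img.astype(float) - rendered_img.astype(float),
                          axis=-1)  # per-pixel colour distance
    if mask is not None:
        diff = diff[mask]
    return float(diff.mean())

a = np.zeros((4, 4, 3))
b = np.full((4, 4, 3), 3.0)
print(photometric_error(a, b))  # per-pixel |(3,3,3)| = 3*sqrt(3) ≈ 5.196
```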

Our second measurement, which calculates the photometric error between the input and reconstructed images, focuses more on the small details and the texture quality. From Table 2, we can see that the trained model using the new mean shape has shown better results in both cases than the original model. However, Table 2 also implies that by using the Vietnamese mean face, the performance gets worse compared to using the mean face from the BFM-2009 [14]. This result is predictable since we have much less data to train our model to interpolate coefficients like texture or poses on the new Vietnamese mean shape than to train our model to interpolate those coefficients on the original generic face. Specifically, the pre-trained model is trained with over 300K images using the original mean face. Therefore, as shown in Fig. 8, when we started training the first model (which uses the original mean face from the BFM-2009 [14]), it was

418

D.-H. Nguyen et al.

already good at interpolating the coefficients for the base face. Meanwhile, by using the V-Mean, the second model will need to learn how to change the new mean face from scratch and with only 37K images of 195 subjects compared to 300K images of thousands of subjects. Thus, we expect the performance of the second model to be worse than the first one. Nonetheless, our second model trained with Vietnamese mean shape still yields competitive results compared to the first one where the photometric losses are 3.79 and 3.86 respectively.

Fig. 8. Comparing model trained with BFM (left) and model trained with V-Mean (right) in training process at epoch 1

In the final experiment, we test the performance of our mean face shape on 12 images of people who do not come from Vietnam in the NoW dataset [18], using the mean square error.

Table 3. Comparing mean square error between models

              Deng et al.   Proposed (BFM)   Proposed (V-Mean)
NoW Dataset   2.082         2.065            2.05

The results in Table 3 show that our method does not degrade the quality of the reconstructed face models of people who are not from Vietnam. Although the final 3D models still lack details such as a full head or wrinkles when compared with the results of state-of-the-art methods such as those of Feng et al. [10] and Zielonka et al. [23], we expect that applying our framework to those methods can help increase the accuracy of 3D face reconstruction for specific regions.

5 Conclusion

We have introduced a new mean shape model for Vietnamese faces which helps improve the accuracy of the 3D face reconstruction process. Our new model has


shown better results than the original when tested on both of our custom test sets. However, the faithfulness of the final 3D face model can still be improved by training the network with more Vietnamese face images. Other researchers are invited to use our 3D base face model at our website¹.

Acknowledgements. This work is funded by the Vietnam - Korea Institute of Science and Technology, Ministry of Science and Technology. We would like to express our deepest appreciation to the Vietnam Ministry of Science and Technology for supporting us through the 02.M02.2022 project.

References

1. Basel face model - details. https://faces.dmi.unibas.ch/bfm/main.php?nav=1-10&id=details. Accessed 22 Jul 2022
2. Amberg, B., Romdhani, S., Vetter, T.: Optimal step nonrigid ICP algorithms for surface registration. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007). https://doi.org/10.1109/CVPR.2007.383165
3. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 187–194. SIGGRAPH 1999, ACM Press/Addison-Wesley Publishing Co., USA (1999). https://doi.org/10.1145/311535.311556
4. Booth, J., Roussos, A., Zafeiriou, S., Ponniah, A., Dunaway, D.: A 3D morphable model learnt from 10,000 faces. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5543–5552 (2016). https://doi.org/10.1109/CVPR.2016.598
5. Cao, C., Weng, Y., Zhou, S., Tong, Y., Zhou, K.: FaceWarehouse: a 3D facial expression database for visual computing. IEEE Trans. Visual Comput. Graphics 20(3), 413–425 (2014). https://doi.org/10.1109/TVCG.2013.249
6. Dai, H., Pears, N., Smith, W., Duncan, C.: A 3D morphable model of craniofacial shape and texture variation. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3104–3112 (2017). https://doi.org/10.1109/ICCV.2017.335
7. Dai, H., Pears, N., Smith, W., Duncan, C.: Statistical modeling of craniofacial shape and texture. Int. J. Comput. Vision 128(2), 547–571 (2019). https://doi.org/10.1007/s11263-019-01260-7
8. Deng, Y., Yang, J., Xu, S., Chen, D., Jia, Y., Tong, X.: Accurate 3D face reconstruction with weakly-supervised learning: from single image to image set. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 285–295 (2019). https://doi.org/10.1109/CVPRW.2019.00038
9. Dutta, K., Bhattacharjee, D., Nasipuri, M.: SpPCANet: a simple deep learning-based feature extraction approach for 3D face recognition. Multimedia Tools Appl. 79(41–42), 31329–31352 (2020). https://doi.org/10.1007/s11042-020-09554-6
10. Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3D face model from in-the-wild images. ACM Trans. Graph. 40(4), 1–13 (2021). https://doi.org/10.1145/3450626.3459936
11. Gerig, T., et al.: Morphable face models - an open framework. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 75–82 (2018). https://doi.org/10.1109/FG.2018.00021

¹ https://github.com/ngdangha/Vietnamese-Mean.git


12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
13. Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 36(6), 1–17 (2017). https://doi.org/10.1145/3130800.3130813
14. Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3D face model for pose and illumination invariant face recognition. In: 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 296–301 (2009). https://doi.org/10.1109/AVSS.2009.58
15. Ploumpis, S., et al.: Towards a complete 3D morphable model of the human head. IEEE Trans. Pattern Anal. Mach. Intell. 43(11), 4142–4160 (2021). https://doi.org/10.1109/TPAMI.2020.2991150
16. Ploumpis, S., Wang, H., Pears, N., Smith, W.A.P., Zafeiriou, S.: Combining 3D morphable models: a large scale face-and-head model. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10926–10935 (2019). https://doi.org/10.1109/CVPR.2019.01119
17. Richard, A., Zollhöfer, M., Wen, Y., de la Torre, F., Sheikh, Y.: MeshTalk: 3D face animation from speech using cross-modality disentanglement. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1153–1162 (2021). https://doi.org/10.1109/ICCV48922.2021.00121
18. Sanyal, S., Bolkart, T., Feng, H., Black, M.J.: Learning to regress 3D face shape and expression from an image without 3D supervision. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7755–7764 (2019). https://doi.org/10.1109/CVPR.2019.00795
19. Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2Face: real-time face capture and reenactment of RGB videos. Commun. ACM 62(1), 96–104 (2018). https://doi.org/10.1145/3292039
20. Xu, K., Wang, X., Hu, Z., Zhang, Z.: 3D face recognition based on twin neural network combining deep map and texture. In: 2019 IEEE 19th International Conference on Communication Technology (ICCT), pp. 1665–1668 (2019). https://doi.org/10.1109/ICCT46805.2019.8947113
21. Xu, S., et al.: Deep 3D portrait from a single image. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7707–7717 (2020). https://doi.org/10.1109/CVPR42600.2020.00773
22. Zaidi, A.A., et al.: Investigating the case of human nose shape and climate adaptation. PLoS Genet. 13(3), 1–31 (2017). https://doi.org/10.1371/journal.pgen.1006616
23. Zielonka, W., Bolkart, T., Thies, J.: Towards metrical reconstruction of human faces. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. LNCS, vol. 13673. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19778-9_15

Domain Generalisation for Glaucoma Detection in Retinal Images from Unseen Fundus Cameras

Hansi Gunasinghe¹(B), James McKelvie², Abigail Koay³, and Michael Mayo¹

¹ Department of Computer Science, University of Waikato, Hamilton, New Zealand
[email protected]
² Department of Ophthalmology, University of Auckland, Auckland, New Zealand
³ School of Information Technology and Electrical Engineering, University of Queensland, Brisbane, Australia

Abstract. Out-of-distribution data produced by unseen devices can heavily impact the accuracy of machine learning models for glaucoma detection using retinal fundus cameras. To address this issue, we study multiple domain generalisation methods together with multiple data normalisation methods for glaucoma detection using retinal fundus images. RIMONEv2 and REFUGE, both public labelled glaucoma detection datasets that capture fundus camera device information, were included for analysis. Features were extracted from images using the ResNet101V2 ImageNet-pretrained neural network and classified using a random forest classifier to detect glaucoma. The experiment was conducted using all possible combinations of training and testing camera devices. Images were preprocessed in five different ways, using either a single method or a combination of three different preprocessing methods, to see their effect on generalisation. In each combination, images were preprocessed using median filtering, input standardisation and multi-image histogram matching. Standardisation of images led to greater accuracy than the other two methods in most scenarios, with an average of 0.85 area under the receiver operator characteristic curve. However, in certain situations, specific combinations of preprocessing techniques lead to significant improvements in accuracy compared to standardisation. The experimental results indicate that our proposed combination of preprocessing methods can aid domain generalisation and improve glaucoma detection in the context of different and unseen retinal fundus camera devices.

Keywords: Glaucoma detection · Domain generalisation · Transfer learning · Fundus cameras · Retinal fundus images

1 Introduction

General machine learning systems assume that the training and testing data come from the same data distribution [20]. These systems disregard out-of-distribution scenarios that may be encountered in routine clinical practice where

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 421–433, 2022. https://doi.org/10.1007/978-3-031-21967-2_34

422

H. Gunasinghe et al.

new imaging devices are introduced into the clinical workflow [12,23]. This weakness limits machine learning in practice, implying that model retraining would be required to maintain accuracy in out-of-distribution scenarios. Generalisation between imaging devices is a specific instance of domain generalisation, which addresses the problem of domain shift introduced into machine learning models by device-dependent properties of the data, such as colour or illumination variations; domain shift is a cause of out-of-distribution data. Retinal fundus imaging is one of the least invasive methods used in glaucoma screening. Nearly thirty retinal fundus cameras from nine main manufacturers are currently in widespread clinical use, and some of them are reviewed in [5,15]. Differences such as colour and image clarity caused by the fundus cameras have important implications for the diagnosis of eye diseases when used with deep learning, as shown in two recent studies [12,21]. Our study evaluates the effect of domain generalisation using three image preprocessing methods on the accuracy of glaucoma detection algorithms across different fundus camera devices. Images from three distinct fundus cameras from different manufacturers, namely Canon CR-2, NIDEK AFC-210 and ZEISS VISUCAM-500, are used in our work. Sample images from the three cameras are shown in Fig. 1.

Fig. 1. Images captured through: Zeiss Visucam 500 (Left), Nidek AFC-210 fundus camera with a body of Canon EOS 5D Mark II (centre), Canon CR-2 (Right).
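Each of the three preprocessing methods studied later (median filtering, input standardisation, histogram matching) can be expressed in a few lines. The sketch below is a simplified stand-in, not the authors' exact pipeline; in particular the paper's histogram matching is randomised and multi-level, while this version is plain per-channel matching:

```python
import numpy as np
from scipy.ndimage import median_filter

def standardise(img):
    """Zero-mean, unit-variance standardisation of an image array."""
    img = img.astype(float)
    return (img - img.mean()) / (img.std() + 1e-8)

def match_histogram_channel(src, ref):
    """Map the tonal distribution of one channel of `src` onto that of `ref`."""
    s_vals, s_idx, s_cnt = np.unique(src.ravel(),
                                     return_inverse=True, return_counts=True)
    r_vals, r_cnt = np.unique(ref.ravel(), return_counts=True)
    s_cdf = np.cumsum(s_cnt) / src.size   # source cumulative distribution
    r_cdf = np.cumsum(r_cnt) / ref.size   # reference cumulative distribution
    return np.interp(s_cdf, r_cdf, r_vals)[s_idx].reshape(src.shape)

def preprocess(img, ref=None, size=3):
    """Median-filter, standardise, then optionally histogram-match to `ref`."""
    out = median_filter(img, size=(size, size, 1))  # per-channel median filter
    out = standardise(out)
    if ref is not None:
        out = np.stack([match_histogram_channel(out[..., c], ref[..., c])
                        for c in range(out.shape[-1])], axis=-1)
    return out
```

Here `ref` would be an image (or aggregate) from the training-device distribution, so that test images are mapped towards it.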

We utilise feature extraction using ImageNet-trained ResNet101V2 as the feature extractor. Apart from the out-of-distribution problem caused by different devices, there are potential disadvantages of ImageNet trained classifiers. A study conducted by Geirhos et al. [8] shows that ImageNet-trained CNNs are biased towards texture. To overcome the issue, they suggest style transferring where as we tested histogram matching to match the image histogram of test images to that of training images. Particularly, we used randomised multi-level histogram matching to change the tonal distribution of the image. The objective of our study is to compare the combinations of multiple existing preprocessing methods, in order to examine how well they improve device domain adaptation for the task of glaucoma detection in retinal fundus images. Random

Domain Generalisation for Glaucoma Detection

423

forest classifiers achieved the best accuracy among the base classifiers tested on these types of images and image features in a previous study [11]. Hence, we use the random forest as the base classifier in all experiments to obtain accurate estimates of predictive performance.
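The feature-extraction-plus-classification stage can be sketched as follows. The random vectors below stand in for the 2048-dimensional pooled ResNet101V2 features (in the real pipeline these would come from the ImageNet-pretrained network applied to fundus images), and the labels are toy ones tied to a single feature, not glaucoma data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Low-dimensional stand-ins for pooled deep features; real ResNet101V2
# features would be 2048-dimensional.
rng = np.random.RandomState(0)
X_train, X_test = rng.rand(200, 32), rng.rand(100, 32)
y_train = (X_train[:, 0] > 0.5).astype(int)  # toy labels tied to feature 0
y_test = (X_test[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(round(auc, 2))  # the forest recovers the informative feature
```

AUROC on the held-out set is the same evaluation measure reported in the paper.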

2 Related Work

Machine learning based glaucoma classification has been studied by various medical and machine learning research communities for segmentation, detection and localisation [17]. However, most of these studies are limited to a single clinical setup or individual fundus camera, which brings into question the external validity of a model. Examples of such studies using limited-variability datasets to train models can be observed in recently published works such as [22] and [6]. An exception to this trend is Asaoka et al. [2], who conducted an extensive study on validation of a pretrained generic ResNet model to screen for glaucoma using images from different fundus cameras with data augmentation. They trained their model using images from one device (NONMYD WX, Kowa Company, Ltd., Aichi, Japan) and tested on images from two other devices (NONMYD 7 camera, Kowa Company, Ltd., and RC-50DX, Topcon Co. Ltd.) separately. They reported over 99% testing area under the receiver operator characteristic curve (AUROC). However, the data in their study was sourced from three different clinical settings and has not been made available for independent validation. Additionally, they used computationally expensive preprocessing methods. Furthermore, their images are free from other pathologies and only include glaucoma-positive and normal (control) retinal images, whereas non-glaucoma patients may have other eye diseases in a clinical setting. In contrast, our study images contain other ocular diseases and are highly imbalanced as well [14]. To the authors' knowledge, this is the only prior work considering variability introduced by fundus camera models conducted on publicly available data. To improve domain generalisation, Xiong et al. [21] have proposed a method named enhanced domain transformation, based on the idea that if a transformed colour space presents the test data with a distribution identical to that of the training data, the trained model should generalise better.
We adopt this idea in our experiments by applying histogram matching to the test data using colour histograms calculated from the training data. Moreover, colour normalisation is a technique that can be used to process retinal fundus images. Grey-world normalisation, histogram equalisation and histogram matching were studied by Goatman et al. [9], who concluded that the third method is the most effective for identification of four lesion types in diabetic retinopathy screening. In their study of automatic detection of rare pathologies in fundus photographs, Quellec et al. used an image preprocessing technique to reduce device dependency: the images are first converted into the YCbCr colour space, and then a Gaussian kernel is used for background removal in the Y channel [16]. A blurred image is subtracted from the Y channel, which is roughly equal to mean


removal. Their method achieved results comparable with transfer learning for glaucoma identification. We adapt the method in our research by standardising the Y channel of each image. The median filter is a non-linear digital preprocessing technique, often used to remove noise from an image [10]. Shoukat et al. used a median filter together with a Gabor filter and adaptive histogram equalisation in designing an EfficientNet-based automatic method for early detection of glaucoma using fundus images [18]. Trained independently, their system achieved above 90% accuracy on two public datasets. There was no separation or consideration of the image source device in these experiments, which is a limitation of the study. Instead, we use median filtering together with image standardisation as a preprocessing technique in our study, testing against each device. Random forest is an ensemble learning method for classification that operates by constructing an ensemble of decision trees at training time and predicting the class by combining the individual tree predictions [4]. Random forest ensembles often perform at near state-of-the-art levels produced by deep neural networks for many problems while requiring very little hyperparameter optimisation. Therefore, random forest is often the chosen classifier for detection tasks. Random forest classifier-based techniques for glaucoma detection have been used by Acharya et al. [1] and Hoover and Goldbaum [13], who report accuracies of 91% and 89% respectively. However, this work was conducted in 2011 and 2003 using very small datasets.

BRDF(λ, θi, ϕi, θv, ϕv) ≥ 0,   (1)
BRDF(λ, θi, ϕi, θv, ϕv) = BRDF(λ, θv, ϕv, θi, ϕi),   (2)
∫ BRDF(λ, θi, ϕi, θv, ϕv) cos θi dωi ≤ 1,   (3)

where θi, θv are illumination and viewing elevation angles, ϕi, ϕv are illumination and viewing azimuthal angles, ωi = [θi, ϕi], and λ is the spectral index. A BRDF can be isotropic or anisotropic. The anisotropic BRDF model depends on five variables,

BRDF = BRDF(λ, θi, ϕi, θv, ϕv),   (4)

while the isotropic model, i.e., when the reflected light does not depend on the surface azimuthal orientation, depends on only four:

BRDF = BRDF(λ, θi, |ϕi − ϕv|, θv).   (5)
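As a quick numerical illustration of properties (1)-(3), an ideal diffuse (Lambertian) BRDF with albedo ρ is the constant ρ/π: it is non-negative, trivially reciprocal, and integrating it against cos θi over the incoming hemisphere recovers ρ ≤ 1. A sketch using a simple midpoint-rule quadrature:

```python
import numpy as np

ALBEDO = 0.8

def lambertian_brdf(theta_i, phi_i, theta_v, phi_v):
    """Ideal diffuse BRDF: constant albedo/pi, independent of all angles."""
    return ALBEDO / np.pi

# Energy conservation (3): integrate f * cos(theta_i) over the hemisphere
# with d_omega = sin(theta_i) dtheta_i dphi_i (midpoint rule, 200 x 200 grid).
n = 200
dt, dp = (np.pi / 2) / n, (2 * np.pi) / n
thetas = (np.arange(n) + 0.5) * dt
phis = (np.arange(n) + 0.5) * dp
t, p = np.meshgrid(thetas, phis, indexing="ij")
integral = np.sum(lambertian_brdf(t, p, 0.0, 0.0)
                  * np.cos(t) * np.sin(t)) * dt * dp
print(round(integral, 3))  # 0.8, i.e. the albedo
```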

436

M. Haindl and V. Havlíček

The BRDF models are mostly divided into two components, diffuse and specular. The diffuse component models equal light distribution into all angles, while the specular component assumes highly reflective blobs randomly distributed on the surface and influenced by the surface shape. Numerous non-linear BRDF models have been published, such as the Blinn model [3], Cook-Torrance model [4], Edwards model [5], Hapke - Lommel - Seeliger model [11], Lafortune model [12], Lewis model [13], Minnaert model [14], Oren-Nayar model [15,17], Phong model [18], Schlick model [20,21], and stretched Phong model [16]. Other BRDF models are based on the microfacet theory: Ashikhmin-Shirley [1], Torrance-Sparrow [22], Trowbridge-Reitz [23], and several others. Most BRDF models are restricted to isotropic materials, and only a few (e.g., [5,20,21,25]) are capable of modelling anisotropic materials.

Fig. 1. Two tested isotropic materials and their corresponding BRDF.

3 Anisotropy Criterion

The suggested anisotropy criterion ε (11) depends on the selected range of BRDF measurements and can be applied to any number of spectral bands with a straightforward modification of Eq. (11).

BRDF Anisotropy Criterion

ε(k) = (1 / n(k)) Σ_{∀θi} Σ_{∀θv} α(θi, θv, k),   (6)

ε = (1 / nk) Σ_{∀k} ε(k) = (1 / nk) Σ_{∀k} (1 / n(k)) Σ_{∀θi} Σ_{∀θv} α(θi, θv, k),   (7)

α(θi, θv, k) = | f_BRDF(θi, θv, ϕi, ϕv) − μ_BRDF(θi, θv, k) |,   (8)

μ_BRDF(θi, θv, k) = (1 / n_{θi,θv}(k)) Σ_{∀ϕ: |ϕi−ϕv|=k} f_BRDF(θi, θv, k),   (9)

k = |ϕi − ϕv|,   (10)

ε = |ε| = √( Σ_{∀λ} ελ² ),   (11)

ε = √( εR² + εG² + εB² ),   (12)

where n(k) is the number of all angular combinations for a specific k, nk is the number of all possible differences k (i.e., nk = 226 for the 81 × 81 angular format), and μ_BRDF(θi, θv, k) is the mean BRDF value for azimuthal difference k defined in (9). The anisotropy criterion for the usual RGB color representation is (12). The spectral curves f(α(λ, θi, θv, k)) denote the anisotropy directions.
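A direct (unoptimised) reading of Eqs. (6)-(9) for one spectral band of a tabulated BRDF might look as follows; the azimuthal difference k is taken on the discrete azimuth grid, and the toy-sized angular grids stand in for the real 81 × 81 measurement layout:

```python
import numpy as np

def anisotropy_criterion(brdf):
    """Anisotropy criterion for one spectral band of a tabulated BRDF.

    `brdf` has shape (n_ti, n_pi, n_tv, n_pv): elevation and azimuth grids
    for illumination and viewing.  For each azimuthal difference k, alpha is
    the absolute deviation of the BRDF from its mean over all azimuth pairs
    sharing that k (Eqs. 8-9); averaging alpha over the elevations gives
    eps(k) (Eq. 6), and averaging over k gives eps (Eq. 7).
    """
    n_ti, n_pi, n_tv, n_pv = brdf.shape
    k = np.abs(np.arange(n_pi)[:, None] - np.arange(n_pv)[None, :])
    by_k = brdf.transpose(0, 2, 1, 3)           # -> (n_ti, n_tv, n_pi, n_pv)
    eps_k = []
    for kv in np.unique(k):
        vals = by_k[:, :, k == kv]              # pairs sharing difference kv
        mu = vals.mean(axis=-1, keepdims=True)  # Eq. (9)
        eps_k.append(np.abs(vals - mu).mean())  # Eqs. (8), (6)
    return float(np.mean(eps_k))                # Eq. (7)

# An isotropic table (depends only on the azimuthal difference) scores ~0;
# an anisotropic one (depends on the azimuth itself) scores clearly above 0.
n_t, n_p = 4, 8
phi = 2 * np.pi * np.arange(n_p) / n_p
iso = np.ones((n_t, n_p, n_t, n_p)) \
    * (1 + 0.5 * np.cos(phi[:, None] - phi[None, :]))[None, :, None, :]
aniso = np.ones((n_t, n_p, n_t, n_p)) \
    * (1 + 0.5 * np.cos(phi))[None, :, None, None]
print(anisotropy_criterion(iso) < 1e-9, anisotropy_criterion(aniso) > 0.1)
```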

4 Experimental Textures

We tested the anisotropy criterion on our extensive UTIA BTF database [9] (Fig. 1), composed of material images under varying illumination and viewing directions. The anisotropic wood materials (Figs. 2, 3) were tested on the Wood UTIA BTF Database. All BRDF tables (Figs. 1, 2, 3 - bottom) were computed from the BTF measurements.

4.1 Wood UTIA BTF Database

The Wood UTIA BTF database contains veneers from sixty-five varied European, African, and American wood species. Among the European wood species are elm, fir, pear, pine, plum, birches, ash trees, cherry trees, larch, limba, linden, olive tree, spruces, beeches, oaks, walnuts, and maple trees. The others are various African and American wood species. The UTIA BTF database¹ was measured using the high-precision robotic gonioreflectometer [8], which consists of independently controlled arms with a camera and light. Its parameters, such as an angular precision of 0.03°, a spatial resolution of 1000 DPI, and selective spatial measurement, rank this gonioreflectometer as a state-of-the-art device. The typical resolution of the area of

¹ http://btf.utia.cas.cz/.


Fig. 2. Tested wood anisotropic materials and their corresponding BRDF.

Fig. 3. Tested wood anisotropic materials and their corresponding BRDF.


Fig. 4. Anisotropy criterion dependence on illumination and viewing elevation angles for limba and spruce anisotropic BRDF. The horizontal axis shows combinations of illumination (θi) and viewing (θv) elevation angles.

interest is around 2000 × 2000 pixels, sample size 7 × 7 [cm]. We measured each material sample in 81 viewing positions times 81 illumination positions resulting in 6561 images per sample, 4 TB of data. The images uniformly represent the space of possible illumination and viewing directions.
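The BRDF tables in Figs. 1-3 are stated to be computed from the BTF measurements; a common way to do this (sketched here as an assumption about the procedure, not the authors' exact code) is to average each illumination/viewing image over its spatial extent, collapsing the texture into one reflectance value per angle pair:

```python
import numpy as np

def btf_to_brdf(btf):
    """Collapse a BTF array (n_illum, n_view, H, W, channels) into a BRDF
    table (n_illum, n_view, channels) by averaging out the spatial texture."""
    return btf.mean(axis=(2, 3))

# Toy stand-in for the 81 x 81 illumination/viewing grid (tiny spatial size).
btf = np.random.RandomState(0).rand(81, 81, 16, 16, 3)
brdf = btf_to_brdf(btf)
print(brdf.shape)  # (81, 81, 3)
```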


Fig. 5. Anisotropy criterion dependence on illumination and viewing elevation angles for wenge and alder anisotropic BRDF.

5 Results

Figures 1, 2 and 3 (upper rows) show the presented materials for viewing and illumination angles approximately collinear with the surface normal. Visual evaluation suggests azimuthal independence for the isotropic glass and stone materials in Fig. 1, and a strong dependence on the azimuthal angles for the anisotropic limba and spruce wood in Fig. 2. The remaining three anisotropic wood materials (alder, ayouz, and wenge) also depend on the azimuthal angles, but not so noticeably.


Fig. 6. α(θi , θv , k) curve for the B spectral channel and anisotropy criterion dependence on illumination and viewing elevation angles for ayouz anisotropic BRDF.

Table 1, summarizing the anisotropy criterion values ε (7) and ε (11) for all presented isotropic and anisotropic materials, confirms the above visual observation. The most anisotropic material, spruce wood, has the largest criterion value, 30.18, while the isotropic stone and green glass values are only 1.76 and 5.48, respectively. The larger criterion value for glass is due to its specular reflections, which do not exist in the diffuse stone material (Fig. 1). The criterion values are smaller for the less pronounced anisotropy of wenge, ayouz (Fig. 3), and alder (Fig. 2 - middle) than for the accentuated anisotropy of limba and spruce wood. The single spectral components ελ are very similar for a given material. The ordering of our seven materials based on the criterion (stone, glass, wenge, ayouz, alder, limba, spruce) is identical using any spectral band λ in ελ or ε. Their standard deviation over the criterion spectral components is in the range [0.05, 1.85]. The smaller the ε value, the smaller the modeling error that can be expected from an isotropic BRDF model.


Fig. 7. Anisotropy criterion dependence on illumination and viewing elevation angles for glass01 and stone01 isotropic BRDF.

The graphical dependence of the anisotropy criterion on the illumination and viewing elevation angles (horizontal axis) in Figs. 4, 5, 6 and 7 illustrates the above observation. The isotropic materials (Fig. 7) have small values, in the range of tens, whereas the anisotropic wood materials (Figs. 4, 5 and 6) have values in the range of hundreds.


Table 1. Anisotropy criterion

       glass01  stone01  wood05   wood35   wood45   wood57    wood65
                         (ayouz)  (limba)  (alder)  (spruce)  (wenge)
εR     2.96     1.08     7.76     12.71    8.25     15.06     5.62
εG     3.19     0.99     7.83     13.99    9.73     17.31     5.39
εB     3.60     0.97     8.36     15.30    10.02    19.06     5.33
ε      5.48     1.76     13.84    24.32    16.22    30.18     9.44
std    0.32     0.05     0.27     1.06     0.78     1.85      0.13

6 Conclusion

The anisotropy criterion of the bidirectional reflectance distribution function allows deciding whether a simpler isotropic BRDF model will provide sufficient modeling quality or whether it is necessary to use a more complex anisotropic BRDF model. The criterion simultaneously shows the dominant angular orientations for anisotropic materials. The presented results indicate that the anisotropy criterion can reliably differentiate between isotropic and anisotropic materials and thus can be used to select the appropriate class of non-linear BRDF models. The criterion can easily be applied to high-dynamic-range or hyperspectral measurements with a straightforward modification to any number of spectral bands.

References

1. Ashikhmin, M., Shirley, P.: An anisotropic Phong BRDF model. J. Graph. Tools 5(2), 25–32 (2000)
2. Bell, S., Upchurch, P., Snavely, N., Bala, K.: Material recognition in the wild with the materials in context database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3479–3487 (2015)
3. Blinn, J.F.: Models of light reflection for computer synthesized pictures. In: Proceedings of the 4th Annual Conference on Computer Graphics and Interactive Techniques, pp. 192–198. ACM Press (1977)
4. Cook, R.L., Torrance, K.E.: A reflectance model for computer graphics. ACM Trans. Graph. (TOG) 1(1), 7–24 (1982)
5. Edwards, D., et al.: The halfway vector disk for BRDF modeling. ACM Trans. Graph. (TOG) 25(1), 1–18 (2006)
6. Gibert, X., Patel, V.M., Chellappa, R.: Material classification and semantic segmentation of railway track images with deep convolutional neural networks. In: 2015 IEEE International Conference on Image Processing (ICIP), pp. 621–625. IEEE (2015)
7. Haindl, M., Filip, J.: Visual Texture. Advances in Computer Vision and Pattern Recognition. Springer, London (2013). https://doi.org/10.1007/978-1-4471-4902-6, https://link.springer.com/book/10.1007/978-1-4471-4902-6


8. Haindl, M., Filip, J., Vávra, R.: Digital material appearance: the curse of terabytes. ERCIM News (90), 49–50 (2012). https://ercim-news.ercim.eu/en90/ri/digital-material-appearance-the-curse-of-tera-bytes
9. Haindl, M., Mikeš, S., Kudo, M.: Unsupervised surface reflectance field multi-segmenter. In: Azzopardi, G., Petkov, N. (eds.) CAIP 2015. LNCS, vol. 9256, pp. 261–273. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23192-1_22
10. Haindl, M., Vácha, P.: Wood veneer species recognition using Markovian textural features. In: Azzopardi, G., Petkov, N. (eds.) CAIP 2015. LNCS, vol. 9256, pp. 300–311. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23192-1_25
11. Hapke, B.W.: A theoretical photometric function for the lunar surface. J. Geophys. Res. 68(15), 4571–4586 (1963)
12. Lafortune, E., Foo, S., Torrance, K., Greenberg, D.: Non-linear approximation of reflectance functions. In: ACM SIGGRAPH 1997, pp. 117–126. ACM Press (1997)
13. Lewis, R.R.: Making shaders more physically plausible. In: Computer Graphics Forum, vol. 13, pp. 109–120. Wiley Online Library (1994)
14. Minnaert, M.: The reciprocity principle in lunar photometry. Astrophys. J. 93, 403–410 (1941)
15. Nayar, S.K., Oren, M.: Visual appearance of matte surfaces. Science 267(5201), 1153–1156 (1995)
16. Neumann, L., Neumann, A., Szirmay-Kalos, L.: Compact metallic reflectance models. In: Computer Graphics Forum, vol. 18, pp. 161–172. Wiley Online Library (1999)
17. Oren, M., Nayar, S.K.: Generalization of Lambert's reflectance model. In: Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, pp. 239–246 (1994)
18. Phong, B.T.: Illumination for computer generated pictures. Commun. ACM 18(6), 311–317 (1975)
19. Remeš, V., Haindl, M.: Bark recognition using novel rotationally invariant multispectral textural features. Pattern Recogn. Lett. 125, 612–617 (2019)
20. Schlick, C.: A customizable reflectance model for everyday rendering. In: Fourth Eurographics Workshop on Rendering, Paris, France, pp. 73–83 (1993)
21. Schlick, C.: An inexpensive BRDF model for physically-based rendering. In: Computer Graphics Forum, vol. 13, pp. 233–246. Wiley Online Library (1994)
22. Torrance, K.E., Sparrow, E.M.: Off-specular peaks in the directional distribution of reflected thermal radiation. J. Heat Transf. 6(7), 223–230 (1966)
23. Trowbridge, T., Reitz, K.P.: Average irregularity representation of a rough surface for ray reflection. JOSA 65(5), 531–536 (1975)
24. Varma, M., Zisserman, A.: A statistical approach to material classification using image patch exemplars. IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 2032–2047 (2009). http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.182
25. Ward, G.: Measuring and modeling anisotropic reflection. Comput. Graph. 26(2), 265–272 (1992)

Clustering Analysis Applied to NDVI Maps to Delimit Management Zones for Grain Crops

Aliya Nugumanova, Almasbek Maulit(B), and Maxim Sutula

Sarsen Amanzholov East Kazakhstan University, Ust-Kamenogorsk, Kazakhstan
[email protected]

Abstract. This research studies the possibility of applying data mining methods to determine homogeneous management zones in fields sown with cereals. For the study, satellite images of two fields in the East Kazakhstan region, acquired by the Sentinel-2 satellite at different times, were used (images of the first field were obtained from May to September 2020, images of the second field from May to August 2021). Based on these images, a dataset of seasonal NDVI values was formed for each field. Four different clustering algorithms were applied to each of the datasets; the clustering results were visualized and rasterized as color maps, which were then offered for comparison and verification by an expert agronomist. Based on the expert review, recommendations were formulated for determining zones of homogeneous management.

Keywords: NDVI · Clustering · Management zones · Precision agriculture · Sentinel-2 · Geospatial data visualization

1 Introduction

The use of satellite images to define homogeneous management zones in the field is becoming an increasingly accessible and significant component of precision farming. One widely used approach is cluster analysis of vegetation index values calculated from processed satellite images. Unfortunately, this approach does not always demonstrate good accuracy and robustness, which makes research aimed at improving it relevant. In this research, we apply cluster analysis to normalized difference vegetation index (NDVI) data obtained by processing a series of satellite images of fields sown with grain crops in the East Kazakhstan region in the 2020–2021 seasons. Our motivation is to provide farmers with an accessible, yet accurate and robust method for identifying homogeneous management areas in the field based on this approach. To this end, we study four clustering algorithms, K-means, K-medians, Hclust and Dbscan, and verify the results obtained. Thus, for each field under study, we perform the following sequence of operations. First, we obtain satellite images of the field on the available dates of the season. Second, for each image, we calculate the NDVI values based on the comparison of the intensities of the near-infrared and red channels. Third, we consolidate the calculated NDVI values into a single dataset. Fourth, we apply various clustering algorithms to the

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 445–457, 2022. https://doi.org/10.1007/978-3-031-21967-2_36


A. Nugumanova et al.

obtained dataset, the results of which are verified by an expert agronomist. Fifth, based on the verified results, we create a colored map for determining zones of homogeneous management in the fields. The objects of study are two arable fields of the "East Kazakhstan Agricultural Experimental Station" agricultural enterprise. This agricultural enterprise is located on the right bank of the Irtysh River, 3 km north of the city of Ust-Kamenogorsk, in the Republic of Kazakhstan. The enterprise specializes in seed production of agricultural crops and the production of elite seeds; it cultivates spring wheat, barley, soybeans, sunflower, and potatoes. The total area of the enterprise is 4,229.81 ha, of which arable land occupies 3,769.7 ha. Crop production is carried out in 36 fields. Brief characteristics of the studied fields are presented in Table 1, and their localization in Fig. 1.

Table 1. Brief characteristics of the studied fields

| Characteristic | Field 1 | Field 2 |
| Field ID | 7.2 | 8.2 |
| Area | 114 ha | 130 ha |
| Soil type | Black soil | Black soil |
| Humus content, % | 3.85 ± 0.6 | 3.5 ± 0.4 |
| Crop cultivated in 2019 | Soy "Niva" | Oats "Skakun" |
| Crop cultivated in 2020 | Spring soft wheat "Ulbinka 25" | Sunflower "Zarya" |
| Crop cultivated in 2021 | Sunflower "Dostyk" | Spring wheat "Altai" |

Fig. 1. a) Localization of fields; b) localization of the observation area.

2 Materials and Methods

The overall methodology of our study includes five steps (Fig. 2):

Clustering Analysis Applied to NDVI Maps to Delimit Management Zones


1. Obtaining satellite images of the field in .tiff format with dimensions of x × y pixels on the specified dates of the season (m dates – m images).
2. Calculating the array NDVI_{x×y} of per-pixel NDVI values for each image using the NDVI calculation formula.
3. Consolidating the calculated values into a single dataset: each NDVI_{x×y} array is transformed into a list of length n = x · y, and then the m lists of length n form an n × m dataset.
4. Applying the k-means, hclust, dbscan, and k-medians clustering algorithms to the consolidated dataset. The number of clusters is equal to 3, in accordance with the assumed zones of homogeneous management (zones of high, medium and low fertility).
5. Visualizing the results in the form of a color map (green – high fertility zone, yellow – medium fertility zone, pink – low fertility zone) using the nearest neighbor rasterization algorithm.

Fig. 2. Research methodology

2.1 Acquisition of Satellite Images

Satellite images of the fields were obtained from the Sentinel-2 satellite. This data source was chosen because Sentinel-2 offers the best combination of high spectral, spatial and temporal resolution (Table 2). In addition, the images taken by Sentinel-2 have red edge spectral bands in the longer wavelength range, which are necessary for vegetation analysis [1–3]. The Sentinel-2 data was downloaded from the official website of the US Geological Survey EarthExplorer – USGS (https://earthexplorer.usgs.gov).

Table 2. Image characteristics

| Channel | λ, nm | Resolution, m |
| B4-red | 665 | 10 |
| B8-near infrared | 842 | 10 |

Available shooting dates, 2020: May 7, 9, 14, 29; June 23, 26; July 16; September 11. Available shooting dates, 2021: May 2, 27; June 6, 16, 18; July 13, 18, 28, 31; August 2, 29.

The processing of multispectral images and the creation of vector layers of the fields were carried out in the QGis application (https://qgis.org/en/site/), which supports many vector and raster formats. To correct and improve the satellite images, the following types of pre-processing were carried out: geometric correction; radiometric calibration; radiometric correction of atmospheric influence; recovery of missing pixels; contrast enhancement; filtering. Figure 3 shows satellite images of the fields after processing in the QGis environment.

Fig. 3. Field images after processing in QGis

2.2 Calculation of NDVI Matrices

To calculate NDVI, the obtained multichannel tiff files were loaded into the R environment (https://cran.r-project.org/). The fourth (near infrared) and third (red) image channels were used to calculate the NDVI scores. That is, for each pixel of the satellite image, the NDVI value was calculated using the formula:

NDVI = (NIR − RED) / (NIR + RED)


where NIR is the intensity of the pixel in the near infrared channel and RED is its intensity in the red channel [4]. The obtained values were stored in the matrix NDVI_{x×y}, whose dimensions were 107 × 173 for field 7_2 and 130 × 148 for field 8_2. In total, 10 such matrices were obtained for field 7_2 and 11 for field 8_2, for different dates of the season. Since the field boundaries do not exactly coincide with the rectangular shape of the matrices, the NDVI values for pixels outside the field boundaries were artificially set to 0. Figures 4 and 5 show heat maps of these matrices for fields 7_2 and 8_2, respectively.
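The per-pixel NDVI calculation with zeroing of out-of-field pixels can be sketched as follows (a minimal NumPy version, not the authors' R code; the array names and the toy 2 × 2 bands are hypothetical):

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, field_mask: np.ndarray) -> np.ndarray:
    """Per-pixel NDVI = (NIR - RED) / (NIR + RED); pixels outside the
    field (mask == False) are artificially set to 0, as in the paper."""
    nir = nir.astype(float)
    red = red.astype(float)
    denom = nir + red
    # guard against division by zero on empty pixels
    out = np.where(denom != 0, (nir - red) / np.where(denom == 0, 1, denom), 0.0)
    return np.where(field_mask, out, 0.0)

# Toy 2x2 example: one vegetated pixel, one bare pixel, two outside the field
nir = np.array([[0.8, 0.3], [0.5, 0.5]])
red = np.array([[0.2, 0.3], [0.5, 0.5]])
mask = np.array([[True, True], [False, False]])
print(ndvi(nir, red, mask))
```

The vegetated pixel yields NDVI = (0.8 − 0.2)/(0.8 + 0.2) = 0.6, while the masked pixels are forced to 0.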

Fig. 4. Heat maps of NDVI matrices for field 7_2 in season 2020

Fig. 5. Heat maps of NDVI matrices for field 8_2 in season 2021

2.3 Consolidation of NDVI Data

To consolidate the data into a single NDVI dataset, each of the m NDVI matrices of dimensions x × y was first unwrapped, i.e., transformed into a vector of length n = x · y (Fig. 6). All the vectors obtained in this way were then combined into a single consolidated dataset, an n × m matrix, a fragment of which is shown in Fig. 7. Thus, the rows of this dataset are the pixels (points) of the field, and the columns are the dates on which the measurements were made. The resulting dataset is convenient for clustering, since points (rows) with a similar distribution of NDVI values throughout the season will be grouped into one cluster, which makes it possible to single out zones of homogeneous management in the field.
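The unwrap-and-stack consolidation can be sketched as follows (hypothetical random matrices stand in for the m per-date NDVI matrices):

```python
import numpy as np

# Hypothetical stand-ins for the m per-date NDVI matrices (x by y each)
m_dates = 3
x, y = 4, 5
rng = np.random.default_rng(0)
ndvi_matrices = [rng.random((x, y)) for _ in range(m_dates)]

# Unwrap each x-by-y matrix into a vector of length n = x*y, then stack
# the m vectors column-wise into the n-by-m consolidated dataset:
# rows = field pixels, columns = acquisition dates.
dataset = np.column_stack([mat.ravel() for mat in ndvi_matrices])
print(dataset.shape)  # (20, 3)
```

Each row of `dataset` is then one clustering object: the seasonal NDVI profile of a single pixel.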


Fig. 6. Converting the NDVI matrix to a vector

Fig. 7. Fragment of the consolidated dataset for clustering

2.4 Clustering the Consolidated Data

As noted above, in this study the pixels (field points) acted as clustering objects, and the NDVI values measured on the season dates were used as features. Four well-established algorithms were used for clustering: k-means, k-medians, hclust and dbscan. The number of clusters k for all algorithms was chosen to be 4, of which 3 clusters define the zones of high, medium and low fertility, and one cluster is technical, for points outside the field. The k-means algorithm is one of the most popular iterative data clustering methods; it is fast and efficient. The literature describes many examples of applying this algorithm to NDVI data [5–7]. The algorithm consists of the following steps. In the first step, k points are randomly selected from the given dataset (image); these points are considered the cluster centroids. In the second step, all remaining points


of the image are distributed into clusters. For this, the distance from each point to each centroid is calculated, and the point is assigned to the cluster whose centroid is closest. In the third step, i.e., when all image points have been distributed over the clusters, the centroids are recalculated: the arithmetic mean of the points belonging to a cluster is taken as its new centroid. Thus, the goal of this algorithm is to minimize the total within-cluster variance, i.e., the sum of squared Euclidean distances to the centroids. The k-medians algorithm is a modification of k-means: the median rather than the arithmetic mean is used to calculate the cluster centroids, which makes the algorithm more resistant to outliers [8]. The arithmetic mean is very vulnerable to outliers, as even a single outlier can seriously distort its value, while the median is remarkably robust, since at least 50% of the data must be corrupted before it is affected [9]. To distribute points among clusters, the k-medians algorithm minimizes the absolute deviations of points from the centroids, i.e., the Manhattan distance: it uses the sum of distances rather than the sum of squared distances. Thus, the k-medians algorithm is recommended when the data may be noisy or when the Manhattan distance is appropriate. The hclust algorithm is an implementation of hierarchical clustering in the R environment [10]. Hierarchical clustering first assumes that each point in the dataset is a separate cluster. Then clusters are iteratively merged into new clusters based on the analysis of the distances between them. Eventually all clusters converge to a single cluster, but this process can be stopped by specifying the required number of final clusters (in our case, as already noted, 4).
The choice of a metric for measuring the distance between points must precede the run of the algorithm; usually the Euclidean or Chebyshev distance is used [11]. The dbscan algorithm is a relatively young method compared to the clustering algorithms listed above: it was first published in 1996 [12], while, for example, k-means was developed in the 1950s [13]. The idea of dbscan is to search for high-density zones, i.e., to define clusters as zones of closely arranged points. The algorithm has two parameters: 1) minPts, the minimum number of points grouped together for a region to be considered dense; 2) ε, the distance within which points are considered neighbors of a given point. The algorithm starts from a randomly chosen point of the dataset and continues until all points have been visited. If there are at least minPts points within the radius ε of a given point (such points are called reachable), then all these points are considered part of the same cluster. The clusters are then expanded by recursively repeating the reachability calculation for each neighboring point of the cluster. dbscan can be used with any distance function (as well as a similarity function or a Boolean condition), so the distance function can be considered an additional parameter. Thus, unlike the previously considered algorithms, dbscan does not require the number of clusters to be specified in advance: it determines the number of clusters itself, based on the minPts and ε parameters.
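The k-means procedure described above can be sketched in a few lines of NumPy (a minimal Lloyd's-algorithm sketch on hypothetical synthetic NDVI profiles, not the authors' R implementation; a k-medians analog would simply replace the mean with the median and squared distances with absolute ones):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means as described in the text: random initial centroids,
    assign points to the nearest centroid (squared Euclidean distance),
    recompute centroids as cluster means, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # squared distance of every point to every centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep old centroid if cluster is empty
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

# Hypothetical consolidated dataset: three fertility levels plus an
# all-zero "outside the field" group, 4 dates per pixel.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(mu, 0.02, (50, 4)) for mu in (0.2, 0.5, 0.8)]
              + [np.zeros((20, 4))])
labels, centroids = kmeans(X, k=4)
```

With k = 4, the technical (all-zero) group and the three NDVI levels correspond to the four clusters used in this study.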


2.5 Visualization of Clusters (Zones of Homogeneous Management) on the Map

The clustering assigned each point of the field to a certain cluster. Now, if we assign a color to each cluster and display the points on the map in the colors of their clusters, we obtain a color map of homogeneous management zones. Since the numbering of clusters is purely nominal and only serves to distinguish the points of one cluster from those of another, an approach for mapping clusters to fertility zones has to be provided. To do this, we additionally analyzed the centroid values of the obtained clusters. The points of the technical cluster, with a zero centroid, were assigned gray; these are the points outside the field. The points of the cluster with the smallest non-zero centroid were assigned pink; according to our hypothesis, these are the points of the low fertility zone. The points of the cluster with the largest centroid were assigned green; these are the points of the high fertility zone. The points of the remaining cluster were assigned yellow; these are the points of the medium fertility zone. The resulting color map looks somewhat noisy: some points inside almost uniform color zones are highlighted in a different color (Fig. 8). Such noise does not necessarily represent clustering errors; it may be due to unfavorable satellite shooting conditions or terrain features. In any case, these noise points within large homogeneous zones cannot be treated in a special way by the farmer, as each point represents a small 10 × 10 m square. There are minimum sizes of management zones, limited by the farmer's ability to differentiate management within the field, usually depending on the equipment used for planting or fertilizing [14, 15].
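The centroid-based mapping of nominal cluster labels to zone colors can be sketched as follows (a sketch assuming, as in the paper, that the technical cluster has the smallest, zero, centroid; function and variable names are hypothetical):

```python
import numpy as np

def clusters_to_zone_colors(X, labels):
    """Order clusters by the mean NDVI of their members and assign colors:
    gray = technical (zero centroid), pink = low, yellow = medium,
    green = high fertility zone."""
    ids = np.unique(labels)
    centroid_means = {c: X[labels == c].mean() for c in ids}
    order = sorted(ids, key=lambda c: centroid_means[c])  # ascending mean NDVI
    colors = ["gray", "pink", "yellow", "green"]          # technical, low, mid, high
    return {c: colors[i] for i, c in enumerate(order)}

# Toy example: 4 clusters with increasing mean NDVI
X = np.array([[0.0], [0.0], [0.2], [0.2], [0.5], [0.5], [0.8], [0.8]])
labels = np.array([3, 3, 0, 0, 2, 2, 1, 1])
print(clusters_to_zone_colors(X, labels))
```

Here the cluster with zero mean becomes gray, and the remaining clusters are colored pink, yellow, and green in order of increasing mean NDVI.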

Fig. 8. Color map of fertility zones with a resolution of 10 × 10 m, inside almost homogeneous zones there are points of other zones (noise or errors)

Accordingly, for the farmer to use the resulting map as a tool for variable-rate management, the noise has to be removed by increasing the size of the points, i.e., by moving from a resolution of 10 × 10 m to a lower resolution (for example, 40 × 40 m) that is more suitable for differentiated operations in the field. For this purpose, we used the simplest scaling algorithm, the nearest neighbor method [16]; more complex alternatives are bilinear or spline scaling algorithms. Figure 9 shows the result of scaling the color map using the nearest neighbor method.
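A minimal nearest-neighbor rescaling of the zone map can be sketched as follows (a simplified top-left-sample variant for illustration; the paper uses the nearest neighbor method [16], and the map dimensions are assumed divisible by the factor):

```python
import numpy as np

def downscale_nearest(zone_map: np.ndarray, factor: int = 4) -> np.ndarray:
    """Reduce resolution by `factor` (e.g. 10 m -> 40 m pixels) by taking
    one source pixel per factor x factor block (nearest-neighbor sampling),
    then expanding it back so every block is uniform."""
    coarse = zone_map[::factor, ::factor]
    return np.kron(coarse, np.ones((factor, factor), dtype=zone_map.dtype))

zones = np.arange(16).reshape(4, 4)
print(downscale_nearest(zones, factor=2))
```

Each 2 × 2 block of the output is filled with a single sampled zone label, which removes isolated noisy points inside homogeneous zones.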


Fig. 9. Color map of fertility zones with a resolution of 40 × 40 m (the remaining noises are available for differentiated management)

3 Results

Figures 10 and 11 visualize the results of field clustering in the 2020 and 2021 seasons, respectively. As can be seen from Fig. 10, three of the four clustering algorithms (k-means, hclust and k-medians) give approximately the same result. The dbscan algorithm tries not only to group points by the proximity of their NDVI values, but also to connect them to each other in space as much as possible, due to which it produces fairly homogeneous zones compared to the other algorithms. In Fig. 11, on the contrary, the k-means, hclust and dbscan algorithms show approximately the same result, while the k-medians algorithm is quite pessimistic in choosing the green zone. Experimental runs of the algorithms have shown that k-means and hclust are the most stable, while the behavior of dbscan and k-medians is significantly affected by their parameters.

Fig. 10. Visualization of fertility zones with noise in the field 7_2 (2020)

Fig. 11. Visualization of fertility zones with noise in the field 8_2 (2021)

In particular, for the k-medians algorithm, we used a fast implementation in R [17], which depends on two parameters: 1) α is the rate of decline of the descent steps, 2) γ


is a positive constant that controls the descent steps. The results obtained (Figs. 12 and 13) indicate the high instability of this algorithm and the need for a thorough study of its applicability to identifying homogeneous management zones in the field.

Fig. 12. Results of K-medians clustering for the field 8_2 with a fixed parameter gamma = 0.2; the parameter alpha varies from 0 to 1

Fig. 13. Results of K-medians clustering for the field 8_2 with a fixed parameter gamma = 0.4, the parameter alpha varies from 0 to 1

The dbscan algorithm, as noted above, depends on two parameters, minPts and ε. We used a minPts value of 100, which means that the smallest possible cluster size is 100 points. Figures 14 and 15 show the results of dbscan clustering at various ε values from 0 to 0.1. For ε values above 0.1, the algorithm divided the points into only 2 groups: the main one and the noise. Experimental runs of the algorithm have shown that it is more stable than k-medians, but requires a very precise determination of the ε value. In this study, we selected it empirically, close to the smallest non-zero value in the consolidated dataset. Finally, the obtained clustering results were visualized and rasterized with a resolution of 40 × 40 m (Figs. 16 and 17). Verification of the maps was carried out by an expert, an agronomist of the "East Kazakhstan Agricultural Experimental Station" enterprise. The maps obtained as a result of clustering by the hclust algorithm were recognized as the most relevant to reality.


Fig. 14. Results of dbscan clustering for field 7_2 when changing ε from 0.06 to 0.1

Fig. 15. Results of dbscan clustering for field 7_2 when changing ε from 0.06 to 0.1

Fig. 16. Color map of fertility zones for field 7_2 based on k-means, hclust, dbscan, k-medians algorithms

Fig. 17. Color map of fertility zones for field 8_2 based on k-means, hclust, dbscan, k-medians algorithms


4 Conclusion

In this article, we assessed the possibility of using clustering methods to identify fertility zones in the cultivation of grain crops, using wheat as an example. In our experiments, the most stable behavior was demonstrated by the k-means and hclust algorithms, and of these two, the expert agronomist preferred the fertility maps based on hclust. The k-medians algorithm showed very high instability with respect to parameters such as the descent rate and the descent step. The dbscan algorithm requires further, deeper research regarding the choice of the ε parameter, which is evidently closely related to the statistics of the original data. In addition, the proposed methodology requires numerical verification, and in future work we plan to evaluate the correlation between the fertility assessment and the yield in each of the field zones.

Acknowledgement. This study has been supported by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP09259379).

References

1. Misra, G., Cawkwell, F., Wingler, A.: Status of phenological research using Sentinel-2 data: a review. Remote Sens. 12(17), 2760 (2020)
2. Zhang, T., et al.: Band selection in Sentinel-2 satellite for agriculture applications. In: 23rd International Conference on Automation and Computing (ICAC), pp. 1–6. IEEE (2017)
3. Ghosh, P., et al.: Assessing crop monitoring potential of Sentinel-2 in a spatio-temporal scale. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. 42, 227–231 (2018)
4. Carlson, T.N., Ripley, D.A.: On the relation between NDVI, fractional vegetation cover, and leaf area index. Remote Sens. Environ. 62(3), 241–252 (1997)
5. Naser, M.A., et al.: Using NDVI to differentiate wheat genotypes productivity under dryland and irrigated conditions. Remote Sens. 12(5), 824 (2020)
6. Romani, L.A.S., et al.: Clustering analysis applied to NDVI/NOAA multitemporal images to improve the monitoring process of sugarcane crops. In: 6th International Workshop on the Analysis of Multi-temporal Remote Sensing Images (Multi-Temp). IEEE (2011)
7. Marino, S., Alvino, A.: Detection of homogeneous wheat areas using multi-temporal UAS images and ground truth data analyzed by cluster analysis. Eur. J. Remote Sens. 51(1), 266–275 (2018)
8. Whelan, C., Harrell, G., Wang, J.: Understanding the k-medians problem. In: Proceedings of the International Conference on Scientific Computing (CSC), p. 219. The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp) (2015)
9. Feldman, D., Schulman, L.J.: Data reduction for weighted and outlier-resistant clustering. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1342–1354 (2012)
10. Giordani, P., Ferraro, M.B., Martella, F.: Hierarchical clustering. In: An Introduction to Clustering with R, pp. 9–73. Springer, Singapore (2020). https://doi.org/10.1007/978-981-13-0553-5_2
11. Irani, J., Pise, N., Phatak, M.: Clustering techniques and the similarity measures used in clustering: a survey. Int. J. Comput. Appl. 134(7), 9–14 (2016)


12. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U.M. (eds.) Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226–231. AAAI Press (1996)
13. Steinhaus, H.: Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. IV, 801–804 (1956)
14. Zhang, N., Wang, M., Wang, N.: Precision agriculture – a worldwide overview. Comput. Electron. Agric. 36(2–3), 113–132 (2002). https://doi.org/10.1016/s0168-1699(02)00096-0
15. Ali, J.: Role of precision farming in sustainable development of hill agriculture. In: National Seminar on Technological Interventions for Sustainable Hill Development, GB Pant University of Agriculture & Technology, Pantnagar, Uttarakhand, India (2013)
16. Jiang, N., Wang, L.: Quantum image scaling using nearest neighbor interpolation. Quantum Inf. Process. 14(5), 1559–1571 (2015)
17. Cardot, H., Cénac, P., Monnez, J.-M.: A fast and recursive algorithm for clustering large datasets with k-medians. Comput. Stat. Data Anal. 56, 1434–1449 (2012)

Features of Hand-Drawn Spirals for Recognition of Parkinson's Disease

Krzysztof Wrobel, Rafal Doroz, Piotr Porwik, Tomasz Orczyk (Faculty of Science and Technology, Institute of Computer Science, University of Silesia, Bedzinska 39, 41-200 Sosnowiec, Poland; [email protected]), Agnieszka Betkowska Cavalcante, Monika Grajzer (Gido Labs sp. z o.o., Wagrowska 2, 61-369 Poznan, Poland)

Abstract. In this paper, a method for diagnosing Parkinson's disease based on features derived from hand-drawn spirals is presented. During the drawing of these spirals on a tablet, the coordinates of the spiral's points, the pressure and angle of the pen at each point, and a timestamp were registered. A set of features derived from the registered data, by means of which the classification was performed, has been proposed. For testing purposes, classification was done by means of several of the most popular machine learning methods, for which the accuracy of Parkinson's disease recognition was determined. The study has shown that the proposed set of features enables effective diagnosis of Parkinson's disease, and the proposed method can be used in screening tests for the disease. The experiments were conducted on the publicly available "Parkinson Disease Spiral Drawings Using Digitized Graphics Tablet Data Set" database from the UCI archives. This database contains drawings of spirals made by people with Parkinson's disease as well as by healthy people.

Keywords: Parkinson's disease · Medical diagnosis · Feature extraction · Hand-drawn spirals

1 Introduction

One of the symptoms of Parkinson's disease is trembling of the whole body, especially noticeable in the hands [1–3]. Affected people may have difficulty drawing even the simplest figures. Because of that, it is possible to diagnose Parkinson's disease using freehand drawing analysis. The advantage of this diagnostic method is that it is non-invasive. A person can draw a picture on paper, scan or photograph it, and send it for medical analysis. An alternative is to draw the image on a tablet, which makes it possible to directly obtain a digital image. Such an image can then be analyzed using various numerical methods [4–6].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 458–469, 2022. https://doi.org/10.1007/978-3-031-21967-2_37


The presented procedure can be performed remotely, which means that patients do not need to leave their homes; this enables wide access to this diagnostic method for older people or people with motor disabilities, and also makes it suitable for large-scale screening tests. The methodology described in this paper is a follow-up to the authors' previous research on diagnosing Parkinson's disease through the analysis of speech samples [6]. Both techniques can be combined in order to further increase the accuracy of the diagnosis. Hence, in this paper we also present the planned scope of the entire multimodal diagnostic system, together with its suggested practical implementation. This paper presents a method for diagnosing Parkinson's disease based on features derived from hand-drawn spirals. The paper consists of 7 sections and is organized as follows: Sect. 2 is a general description of the proposed method, Sect. 3 concerns the acquisition of the data describing the spirals, Sect. 4 contains a proposition and discussion of the set of features extracted from the obtained data, Sect. 5 deals with the experiments performed and the analysis of the obtained results, Sect. 6 provides a discussion of the practical implementation of the entire multimodal system for diagnosing Parkinson's disease, and Sect. 7 contains conclusions, a summary of the work, and further possible research directions.

2 Method Description

The proposed method is designed as an aid in the diagnosis of Parkinson's disease based on the analysis of a spiral drawn by the patient. The block diagram of the proposed method is shown in Fig. 1.

Fig. 1. The block diagram of the proposed method.


We can distinguish three main stages of the method. In the first stage, a person draws a spiral using a tablet, trying to make it as similar as possible to the reference drawing. In the second stage, a set of features describing the analyzed spiral is derived from the raw data registered by the tablet. The derived features take into account both the shape of the spiral and its smoothness. As the disease progresses, the person has increasing difficulty in faithfully reproducing the spiral pattern: the drawing becomes more and more deformed, and the lines become more and more rugged. The raw features describing the spiral and the derived set of features used in further analysis are described in the following sections. The last stage of the method is to classify the person as affected or unaffected by Parkinson's disease. The proposed method has been tested using a variety of popular classifiers, which are described in the research section.

3 Data Acquisition

The entry point for the data acquisition process is a patient drawing a spiral on the tablet, trying to make it as similar as possible to the reference spiral. During the drawing of the spiral, the coordinates of all its points (x, y) are registered. Additionally, for each point, the pen pressure p, pen angle l, and recording time t are recorded. Using the recorded values, the drawing velocities vx, vy, vp, vl between successive points are calculated according to the following formulas:

vx_i = (x_{i+1} − x_i)/(t_{i+1} − t_i), vy_i = (y_{i+1} − y_i)/(t_{i+1} − t_i), vp_i = (p_{i+1} − p_i)/(t_{i+1} − t_i), vl_i = (l_{i+1} − l_i)/(t_{i+1} − t_i), for i = 1, ..., n − 1;
vx_n = vx_{n−1}, vy_n = vy_{n−1}, vp_n = vp_{n−1}, vl_n = vl_{n−1}, for i = n.   (1)

Having the velocities vx, vy, vp, vl, we can compute the pen accelerations ax, ay, ap, al between points using the following formulas:

ax_i = (vx_{i+1} − vx_i)/(t_{i+1} − t_i), ay_i = (vy_{i+1} − vy_i)/(t_{i+1} − t_i), ap_i = (vp_{i+1} − vp_i)/(t_{i+1} − t_i), al_i = (vl_{i+1} − vl_i)/(t_{i+1} − t_i), for i = 1, ..., n − 1;
ax_n = ax_{n−1}, ay_n = ay_{n−1}, ap_n = ap_{n−1}, al_n = al_{n−1}, for i = n.   (2)

Finally, the result of the acquisition is each spiral being described by a set of vectors S_i:

S_i = [x_i, y_i, p_i, l_i, t_i, vx_i, vy_i, vp_i, vl_i, ax_i, ay_i, ap_i, al_i], for i = 1, ..., n.   (3)
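Formulas (1) and (2) amount to forward differences with the last value repeated so that all signals keep length n. A minimal NumPy sketch (the toy pen trace below is hypothetical, not taken from the UCI dataset):

```python
import numpy as np

def velocities(values: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Forward differences per formula (1); the last element repeats the
    previous one so the result keeps the original length n."""
    v = np.diff(values) / np.diff(t)
    return np.append(v, v[-1])

# Toy pen trace: x positions sampled at uniform 0.01 s steps
t = np.array([0.00, 0.01, 0.02, 0.03])
x = np.array([0.0, 1.0, 3.0, 6.0])
vx = velocities(x, t)    # ≈ [100, 200, 300, 300]
ax = velocities(vx, t)   # accelerations per formula (2)
```

The same helper applies to y, p, and l, yielding all components of the vectors S_i in (3).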


Spirals can be captured in two modes. In the first mode, the Static Spiral Test (SST), the patient draws a spiral on the given spiral pattern. In the second mode, the Dynamic Spiral Test (DST), the reference spiral appears and disappears at a certain frequency. In the experimental part, tests for both modes are presented.

4 Feature Extraction

In this stage, for each spiral drawn by the patient, a set of features is determined that allows classifying the patient as healthy or suffering from Parkinson's disease. The hand of a person with Parkinson's disease shakes, so the spiral may be deformed and the lines may be rugged. The difference between a spiral drawn by a healthy person and one drawn by a person suffering from Parkinson's disease is shown in Fig. 2. As can be seen, the spirals differ in both the shape and the smoothness of the lines. Considering these differences, the developed method utilizes a set of features that take into account both the smoothness and the shape of the drawn spirals. In order to calculate the smoothness value, the drawn spiral is compared with a reference spiral (Fig. 3). The reference spiral is the analyzed spiral subjected to the Gaussian smoothing process [7] described by formula (4). For the smoothing process, the data is processed point by point. For each point, the kernel function k is calculated:

k(x, x′) = exp(−|x − x′| / γ),   (4)

where x, x′ are spiral points, and γ defines the width of the kernel.
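The kernel smoothing of formula (4) can be sketched as a weighted average over a 1-D coordinate sequence (a sketch only: the paper does not specify whether the kernel distance is taken over point indices or coordinate values, so index-based weighting is an assumption here):

```python
import numpy as np

def kernel_smooth(coords: np.ndarray, gamma: float = 5.0) -> np.ndarray:
    """Smooth a 1-D coordinate sequence with the kernel of formula (4),
    k(x, x') = exp(-|x - x'| / gamma): each output point is a
    kernel-weighted average of all input points."""
    idx = np.arange(len(coords))
    w = np.exp(-np.abs(idx[:, None] - idx[None, :]) / gamma)
    return (w @ coords) / w.sum(axis=1)

# A noisy ramp: smoothing should make successive increments more regular
rng = np.random.default_rng(0)
clean = np.linspace(0, 1, 50)
noisy = clean + rng.normal(0, 0.05, 50)
smooth = kernel_smooth(noisy, gamma=3.0)
```

The smoothness features sx and sy are then obtained by comparing the drawn coordinates with their smoothed version.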

Fig. 2. Example of spirals drawn by a healthy person (a), and the person suffering from Parkinson’s disease (b).


Fig. 3. The reference spiral.

The influence of the parameter γ on the obtained results is shown in the research section. The comparison of spirals is done independently for their X and Y coordinates, using the Euclidean measure. The result of comparing the X coordinates is described by the feature sx, while the result for the Y coordinates forms the feature sy. The next two features describe the matching of the shape of the analyzed spiral to the reference spiral used in the acquisition process. The (x, y) coordinates of the reference spiral can be calculated using the following formula:

x(t) = a · t · cos(t),  y(t) = a · t · sin(t),   (5)

where a is the spiral parameter, and the variable t takes successive values in the interval [1, n], where n is a fixed number of reference spiral points. The shape matching, similarly to the smoothness, is determined independently from the X and Y coordinates. A plot of the X and Y coordinates for sample spirals is shown in Fig. 4. The number of points of the analyzed spiral and the reference spiral may differ, so the DTW method [8, 9] was used for their comparison. This method is useful when comparing data series with similar structures but different lengths. If the compared series are identical, their matching cost is zero. The results of comparing the X and Y coordinates using the DTW method are the mx and my features.
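The reference spiral of formula (5) and a textbook DTW matching cost can be sketched as follows (the absolute-difference local cost is an assumption; the paper only states that identical series yield zero cost):

```python
import numpy as np

def reference_spiral(a: float, n: int):
    """Archimedean reference spiral per formula (5)."""
    t = np.linspace(1, n, n)
    return a * t * np.cos(t), a * t * np.sin(t)

def dtw(s: np.ndarray, r: np.ndarray) -> float:
    """Classic dynamic time warping cost between two 1-D series of
    possibly different lengths; identical series have cost 0."""
    n, m = len(s), len(r)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - r[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

x_ref, y_ref = reference_spiral(a=1.0, n=50)
print(dtw(x_ref, x_ref))  # identical series -> 0.0
```

Applying `dtw` to the drawn versus reference X and Y coordinate series yields the mx and my features.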


Fig. 4. X and Y coordinates of the spiral drawn by a), b) healthy person, and c), d) person with Parkinson’s disease.

The method assumes that successive points of the reference spiral were generated with constant pressure, velocity, acceleration, and pen angle increments (Δp, Δv, Δa, Δl). This makes the standard deviation values calculated for these data equal to zero. Affected people, due to hand shakiness, unconsciously change the pressure, the angle, or the speed of the pen. As a result, the standard deviations calculated for the data describing their spirals are greater than zero, and these values increase with the progress of the disease. These features form the third group of features: fp, fl, fvx, fvy, fvp, fvl, fax, fay, fap, fal. In total, 14 features were derived from the raw data; the full feature set is shown in Table 1.
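The third feature group is simply the standard deviations of the recorded dynamics, which can be sketched as (function name hypothetical; constant signals model the ideal reference spiral):

```python
import numpy as np

def std_features(p, l, vx, vy, vp, vl, ax, ay, ap, al):
    """Standard deviations of pressure, angle, and per-signal
    velocities/accelerations (fp, fl, fvx, ..., fal). For a reference
    spiral drawn with constant pressure/velocity/angle these are all
    zero; tremor makes them positive."""
    return np.array([np.std(s) for s in (p, l, vx, vy, vp, vl, ax, ay, ap, al)])

# Constant signals (ideal reference) give all-zero features
const = np.full(10, 0.5)
feats = std_features(*[const] * 10)
print(feats.max())  # 0.0
```

Together with sx, sy, mx, and my, this yields the 14-element feature vector of Table 1.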


Table 1. The analyzed feature set.

| Feature | Description |
| sx | The smoothness determined from X coordinates |
| sy | The smoothness determined from Y coordinates |
| mx | The matching of the analyzed spiral to the reference spiral, determined from the X coordinates |
| my | The matching of the analyzed spiral to the reference spiral, determined from the Y coordinates |
| fp | The standard deviation for pen pressure (p) |
| fl | The standard deviation for pen angle (l) |
| fvx | The standard deviation for the velocity calculated on the X axis (vx) |
| fvy | The standard deviation for the velocity calculated on the Y axis (vy) |
| fvp | The standard deviation for velocity calculated relative to pen pressure (vp) |
| fvl | The standard deviation for velocity calculated relative to pen angle (vl) |
| fax | The standard deviation for acceleration calculated on the X axis (ax) |
| fay | The standard deviation for acceleration calculated on the Y axis (ay) |
| fap | The standard deviation for acceleration calculated relative to pen pressure (ap) |
| fal | The standard deviation for acceleration calculated relative to pen angle (al) |

5 Experiments and Results

Experiments were done using a Parkinson's disease patients database from the UCI archives. This database contains spirals drawn by 77 people: 62 people were affected by Parkinson's disease, and 15 people were healthy [4]. Data were recorded using a Wacom Cintiq 12WX graphics tablet. The database is publicly available at the following URL: https://archive.ics.uci.edu/ml/machine-learning-databases/00395. Popular classifiers from the WEKA library under the Matlab environment were used during the study. All classifiers were applied using their default settings. The list of tested classifiers is as follows:

– Bayes NET [10],
– k-Nearest Neighbors Classifier [11],
– J48 - C4.5 decision trees [12],
– Random Forests [13],
– Naive Bayes [10,14],
– AdaBoost [15].
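The evaluation protocol can be sketched with a plain leave-one-out loop; the toy 1-NN classifier and data below are illustrative stand-ins for the WEKA classifiers actually used:

```python
def predict_1nn(train, sample):
    """Nearest-neighbour vote: a stand-in for the WEKA classifiers."""
    nearest = min(train, key=lambda fv: sum((a - b) ** 2 for a, b in zip(fv[0], sample)))
    return nearest[1]

def leave_one_out_accuracy(data):
    """ACC = correctly recognized / tested * 100% under leave-one-out CV."""
    correct = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]          # hold out exactly one patient
        if predict_1nn(train, features) == label:
            correct += 1
    return 100.0 * correct / len(data)

# Toy feature vectors (e.g. sx, sy) with class labels: 1 = PD, 0 = healthy.
toy = [((0.1, 0.2), 0), ((0.15, 0.22), 0), ((0.9, 0.8), 1), ((0.95, 0.85), 1)]
assert leave_one_out_accuracy(toy) == 100.0
```

Leave-one-out is a natural choice here because each of the 77 subjects serves once as the test case, which makes the most of a small sample.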


The evaluation of the proposed method was done using the Overall Accuracy (ACC) measure, which can be calculated from the following formula:

ACC = (number of patients correctly recognized / number of patients tested) · 100%.  (6)

Due to our data set's relatively low sample count, we decided to use a "leave one out" cross-validation approach. During the first experiment, the accuracy of the method for the static spiral test (SST) was determined. The experimental results are presented in Table 2.

Table 2. Classification accuracy for static spiral tests [%].

Classifier             Value of γ
                       18     20     22     24     26
Bayes NET              81.34  87.15  90.84  89.73  86.88
k-Nearest Neighbors    82.53  89.87  92.26  89.36  85.62
J48 - C4.5             78.35  86.33  88.42  87.76  86.04
Random Forests         73.74  88.54  92.15  91.17  88.45
Naive Bayes            79.21  83.17  84.32  79.05  78.57
AdaBoost               77.25  84.37  86.04  88.42  81.75

The highest classification accuracy was achieved for the k-NN and Random Forests classifiers, with results of 92.26% and 92.15%, respectively, both obtained for the parameter γ = 22. The other classifiers showed accuracy ranging from 85% to 91%, and these values were also achieved for γ = 22. It can be safely stated that γ = 22 is optimal, since the classification accuracy was lower for the other tested values of γ. In the second experiment, the accuracy of the method for the dynamic spiral test (DST) was determined. The experimental results are presented in Table 3.

Table 3. Classification accuracy for dynamic spiral tests [%].

Classifier             Value of γ
                       18     20     22     24     26
Bayes NET              79.59  85.54  88.98  86.87  85.34
k-Nearest Neighbors    78.16  86.63  90.83  87.00  85.65
J48 - C4.5             76.85  82.53  85.32  87.53  83.32
Random Forests         69.66  85.43  87.17  86.36  84.76
Naive Bayes            76.54  81.22  83.43  77.76  74.86
AdaBoost               72.36  80.59  84.27  82.32  78.75


In Table 3, it can be seen that the classification accuracy is several percentage points lower than for the static tests. This is due to the fact that the reference spiral appeared and disappeared at a certain frequency, making it difficult for the test subjects to draw it correctly. For the dynamic mode, the best result was again achieved for the k-NN classifier with the parameter γ = 22, reaching 90.83%.

6 Multimodal Diagnostic System

The methodology presented in the paper aims to diagnose Parkinson's disease on the basis of a drawing prepared by a patient. However, our previous research has shown that the disease can also be diagnosed by analyzing a patient's voice, with about 91% accuracy using a decision tree classifier. This is because the symptoms of the disease also affect the patient's ability to produce audible, undistorted speech. We anticipate that a multi-modal diagnostic system will further improve the diagnosis of Parkinson's disease. Such a system could also be used to monitor disease progression over time and be an element of decision support for healthcare professionals. The patient's handwriting and speech analysis data may be analyzed by machine-learning systems. In the future, it will be possible to study factors that may influence disease assessment in order to improve system performance. For example, patients of a certain age could be diagnosed using one method and the remaining patients using a different method.

In order to implement a multi-modal diagnostic system, we plan to increase the comfort of use for medical personnel. With such assumptions, the system should be available via a website, preferably on mobile devices. While the aforementioned mobile application was used to analyze handwritten drawings, the speech-based system has so far been tested in a laboratory setting [6]. In the future, data can be obtained using the mPASS [19,20] platform, an internet tool that allows speech samples to be collected in a very convenient way. The mPASS platform was originally designed to help people with severe speech disorders and offers a convenient operation that allows for disability-sensitive recording of speech samples: the process can be stopped or paused at any time. The material for recording is divided into smaller fragments, and each recording may be accompanied by additional audiovisual information (a drawing and a sound command played in the headphones).

This approach is designed to help people with various visual, hearing, speech and other impairments. In addition, the tool allows audio samples to be verified "on the fly" and offers a simple process for making the necessary corrections. A sample screen from the mPASS platform showing a recording session is shown in Fig. 5. After successful acquisition, audio samples can be uploaded to an external server for processing. The results will then be sent to an application on a mobile device for further analysis by healthcare professionals. In this way, a multi-modal diagnostic system can be easily implemented in practice and offered to end users, without the need for expensive specialized equipment.


Fig. 5. The screen from the mPASS web-based application - audio samples acquisition.

We are currently cooperating with the authors of the mPASS platform [20] on automatic data acquisition, so speech data for the multi-modal system can be collected directly through it.

7 Conclusions

This paper proposes a method for recognizing Parkinson's disease by means of features derived from hand-drawn spirals. The described method appears to be a fully functional solution. The best accuracy of the method was 92.26% for the k-NN classifier, which is comparable with results reported in articles from 2020–2022 (see Table 4).


Table 4. A comparison of our results with the newest achievements from the literature.

Authors and works                   Year  Methods           Best accuracy [%]  Database
Our method                          2022  Machine learning  92.26              UCI
Megha Kamble et al. [21]            2021  Machine learning  91.6               UCI
Luca Parisi et al. [22]             2022  Deep learning     92.0               Various datasets
Sven Nõmm et al. [23]               2020  Deep learning     88.0               Own (34 subjects)
Chakraborty Sabyasachi et al. [24]  2020  Deep learning     93.0               Kaggle's data repository (55 subjects)

Further research directions have already been presented in the previous section. It is planned to design a multimodal system that will be able to diagnose Parkinson's disease with even greater accuracy than presented here. The system will rely on the parallel analysis of both the drawings of the spirals and voice samples. Additional analysis of the voice samples will result in more features being used for classification. Further work will also focus on feature reduction and the selection of only those features for which the classification effectiveness is best [16–18]. Ultimately, it is planned to build a multimodal web application for diagnosing patients online.

References

1. Golbe, L.I., Mark, M.H., Sage, J.I.: Parkinson's Disease Handbook. The American Parkinson Disease Association Inc. (2010)
2. Parkinson, J.: An Essay on the Shaking Palsy. London (1817)
3. Grosset, D., Fernandez, H., Grosset, K., Okun, M.: Parkinson's Disease Clinician's Desk Reference. CRC Press, Boca Raton (2009)
4. Isenkul, M.E., Sakar, B.E., Kursun, O.: Improved spiral test using digitized graphics tablet for monitoring Parkinson's disease. In: The 2nd International Conference on e-Health and Telemedicine (ICEHTM-2014), pp. 171–175 (2014)
5. Erdogdu Sakar, B., et al.: Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings. IEEE J. Biomed. Health Inform. 17(4), 828–834 (2013)
6. Froelich, W., Wrobel, K., Porwik, P.: Diagnosis of Parkinson's disease using speech samples and threshold-based classification. J. Med. Imaging Health Inform. 5(6), 1358–1363 (2015)
7. Deisenroth, M.P., Turner, R.D., Huber, M.F., Hanebeck, U.D., Rasmussen, C.E.: Robust filtering and smoothing with Gaussian processes. IEEE Trans. Autom. Control 57(7), 1865–1871 (2012)


8. Ibrahim, M.Z., Mulvaney, D.J.: Geometrical-based lip-reading using template probabilistic multi-dimension dynamic time warping. J. Vis. Commun. Image Representation 30, 219–233 (2015)
9. Salvador, S., Chan, P.: Toward accurate dynamic time warping in linear time and space. Intell. Data Anal. 11(5), 561–580 (2007)
10. Muramatsu, D., Kondo, M., Sasaki, M., Tachibana, S., Matsumoto, T.: A Markov chain Monte Carlo algorithm for Bayesian dynamic signature verification. IEEE Trans. Inf. Forensics Secur. 1, 22–34 (2006)
11. Aha, D.W.: Incremental constructive induction: an instance-based approach. In: Machine Learning Proceedings, pp. 117–121 (1991)
12. Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, Amsterdam (2014)
13. Breiman, L.: Random forests. Mach. Learn. 1, 5–32 (2001)
14. Johnson, R.A., Bhattacharyya, G.K.: Statistics: Principles and Methods. Wiley, Hoboken (2019)
15. Rojas, R.: AdaBoost and the super bowl of classifiers: a tutorial introduction to adaptive boosting (2009)
16. Wrobel, K.: Diagnosing Parkinson's disease with the use of a reduced set of patients' voice features samples. In: Saeed, K., Chaki, R., Janev, V. (eds.) CISIM 2019. LNCS, vol. 11703, pp. 84–95. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28957-7_8
17. Pudil, P., Novovicova, J., Kittler, J.: Floating search methods in feature selection. Pattern Recogn. Lett. 15(11), 1119–1125 (1994)
18. Porwik, P., Doroz, R.: Self-adaptive biometric classifier working on the reduced dataset. In: Polycarpou, M., de Carvalho, A.C.P.L.F., Pan, J.-S., Woźniak, M., Quintian, H., Corchado, E. (eds.) HAIS 2014. LNCS (LNAI), vol. 8480, pp. 377–388. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07617-1_34
19. Betkowska Cavalcante, A., Grajzer, M.: Proof-of-concept evaluation of the mobile and personal speech assistant for the recognition of disordered speech. Int. J. Adv. Intell. Syst. 9, 589–599 (2016)
20. http://mpass.gidolabs.eu. Accessed 20 May 2022
21. Kamble, M., Shrivastava, P., Jain, M.: Digitized spiral drawing classification for Parkinson's disease diagnosis. Meas. Sens. 16, 100047 (2021)
22. Parisi, L., Neagu, D., Ma, R., Campean, F.: Quantum ReLU activation for convolutional neural networks to improve diagnosis of Parkinson's disease and COVID-19. Expert Syst. Appl. 187, 115892 (2022)
23. Nõmm, S., Zarembo, S., Medijainen, K., Taba, P., Toomela, A.: Deep CNN based classification of the archimedes spiral drawing tests to support diagnostics of the Parkinson's disease. IFAC-PapersOnLine 53(5), 260–264 (2020)
24. Chakraborty, S., Aich, S., Sim, J.-S., Han, E., Park, J., Kim, H.C.: Parkinson's disease detection from spiral and wave drawings using convolutional neural networks: a multistage classifier approach. In: 22nd International Conference on Advanced Communication Technology (ICACT), pp. 298–303 (2020)

FASENet: A Two-Stream Fall Detection and Activity Monitoring Model Using Pose Keypoints and Squeeze-and-Excitation Networks

Jessie James P. Suarez1(B), Nathaniel S. Orillaza Jr.2, and Prospero C. Naval Jr.1

1 Department of Computer Science, University of the Philippines Diliman, Quezon City, Philippines
{jpsuarez,pcnaval}@up.edu.ph
2 College of Medicine, University of the Philippines Manila, Manila, Philippines
[email protected]

Abstract. Numerous frameworks have already been proposed for vision-based fall detection and activity monitoring. These works have leveraged state-of-the-art algorithms such as 2D and 3D convolutional neural networks in order to analyze and process video data. However, these models are computationally expensive, which prevents their use at scale on low-resource devices. Moreover, previous works in literature have not considered modelling features for simple and complex actions given a video segment. This information is crucial when trying to identify actions for a given task. Hence, this work proposes an architecture called FASENet, a 1D convolutional neural network-based two-stream fall detection and activity monitoring model using squeeze-and-excitation networks. By using pose keypoints as inputs instead of raw video frames, the model is able to use 1D convolutions, which are computationally cheaper than 2D or 3D convolutions, thereby making the architecture more efficient. Furthermore, FASENet primarily has two streams to process pose segments, a compact and a dilated stream, which aim to extract features for simple and complex actions, respectively. In addition, squeeze-and-excitation networks are used in between these streams to recalibrate features after their combination based on their importance. The network was evaluated on three publicly available datasets: the Adhikari Dataset, the UP-Fall Dataset, and the UR-Fall Dataset. Through the experiments, FASENet was able to outperform prior state-of-the-art work on the Adhikari Dataset for accuracy, precision, and F1. The model was shown to have the best precision on the UP-Fall and UR-Fall Datasets. Finally, it was also observed that FASENet was able to reduce false positive rates compared to a previously related study.

Keywords: Computer vision · Activity monitoring · Deep learning · Video processing

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 470–483, 2022. https://doi.org/10.1007/978-3-031-21967-2_38

1 Introduction

According to the World Health Organization, an estimated 684,000 individuals die from falls globally, of which over 80% are from lower- to middle-income countries. In addition, adults older than 60 years of age suffer the greatest number of fatal falls [1]. For the elderly, falls remain the top cause of morbidity and mortality. Those who have suffered falls end up with post-fall syndromes such as immobilization, increased dependency, depression, and even an overall decrease in quality of life [22]. Moreover, unobserved falls are becoming more of a concern due to situations that may require isolation, such as living alone or restrictions from contagions [31]. In the elderly, since numerous diseases may be associated with increasing age and lack of physical activity, detecting and disrupting prolonged sitting and lying positions may help mitigate these risks [25]. Because of this, numerous works have been proposed in the past to automatically identify falls and activity as part of patient monitoring systems [23].

Current works on vision-based general activity recognition revolve around the use of deep learning algorithms, more specifically convolutional neural networks, due to their effectiveness in various applications [4]. These algorithms make use of computationally heavy models to process videos in order to identify a large set of activities trained from benchmark datasets [7,27]. Although these approaches have proven effective, they are hard to scale and deploy due to their computational requirements. It is therefore important for studies to take these limitations into account when proposing solutions for activity recognition and fall detection. Recently, a framework was proposed by [26] that uses a lightweight pose estimation algorithm along with 1D convolutional neural networks to identify possible falls and monitor patient activity effectively yet efficiently.

Based on their findings, there is merit in incorporating features from both simple (patient activity) and complex (falls) actions. Although prior work has focused on capturing the relationship of both spatial and temporal aspects of a video, it has not explicitly tried to model and incorporate features from simple and complex actions [3,9,21]. Therefore, this study aims to extend the work of [26] and build on their findings by proposing FASENet, a two-stream fall detection and activity monitoring model using squeeze-and-excitation networks. In comparison to previous works, FASENet makes use of pose keypoints as inputs to the model in order to utilize 1D convolutions instead of 2D or 3D convolutions, making FASENet more efficient. Furthermore, FASENet extracts features for both simple and complex actions and takes advantage of the combination of both to improve performance.

2 Related Work

For both activity recognition and fall detection, many studies have relied on deep learning algorithms, more specifically convolutional neural networks,


to accomplish the task. Some have proposed simple convolutional neural networks, such as Adhikari et al., who used convolutional neural networks for feature extraction on both RGB and depth frames for activity recognition [3]. Similarly, Espinosa et al. also proposed their own convolutional neural network; however, instead of RGB frames, they used optical flows as inputs to capture motion [9].

On the other hand, most works have relied on architectures pretrained for general features learned from a larger training set. Examples include the work of Luo et al., which used a pretrained ResNet-34 fine-tuned on thermal and depth frames for the task [19]. Moreover, Wang et al. used PCANet as a feature extractor, specifically for the foreground of frames, for fall detection [29]. Additionally, Nunez-Marcos et al. used VGG-16 pretrained on ImageNet and fine-tuned it on UCF101 for activity and fall-related use-cases [21].

Meanwhile, other works have designed their own architectures. Kong et al. proposed a three-stream architecture that extracts features from multiple representations of videos and combines them for richer features [15]. Han et al. proposed a two-stream mobile VGG wherein one stream focuses on human body motion and another stream employs a lightweight model for feature extraction [11]. Elshwemy et al. proposed a spatiotemporal residual autoencoder for identifying falls in thermal images using reconstruction error [8]. Similarly, Cai et al. employed an autoencoder approach using hourglass residual units; however, aside from reconstruction, they simultaneously perform classification using the bottleneck features [6].

From what is seen in literature, common approaches for fall detection and activity monitoring use 2D convolutional neural networks. Although they are effective, 2D convolutions have relatively high computational costs, especially when processing video segments. In addition, it was observed that those who have proposed their own architectures have utilized multiple streams of inputs to extract various features from different representations. However, most of these works focus on combining motion and spatial properties; they do not try to model intrinsic properties of the action, such as different features for simple and complex actions. For these reasons, this work proposes an architecture that attempts to extract features for both simple and complex actions to improve prediction performance.

3 Methodology

3.1 Preprocessing

Pose Extraction. This work uses pose skeletons to create an efficient model. Similar to [26], this study makes use of MediaPipe [18], an efficient and low-cost open-source pose estimator, to extract pose skeletons from video segments. For pose estimation, it utilizes a two-step detector-tracker machine learning pipeline. The final output of the MediaPipe pose estimator is a pose skeleton having 33 joints with 3 dimensions.

FASENet: A Two-Stream Fall Detection and Activity Monitoring Model

473

Pose Normalization. Since the poses appear at various parts of the frame, the coordinate values of the pose skeletons might affect the features learned by the model. To address this, pose normalization similar to the approach of [17] is done to ensure that the pose skeletons are standardized across all frames.

Feature Extraction. After the pose skeletons are obtained, simple features like the distance between selected pose keypoints and the velocity of keypoints between frames are computed, similar to the approach in [26].
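The preprocessing just described can be sketched as follows; centring on one joint and scaling by the distance to another is an assumed normalization scheme (the paper defers to [17]), and the keypoint indices are illustrative only:

```python
import math

def normalize_pose(pose, center_idx=0, scale_idx=1):
    """Translate the pose so a chosen 'center' joint sits at the origin and
    scale by the distance to a second joint (assumed normalization scheme)."""
    cx, cy = pose[center_idx]
    scale = math.dist(pose[center_idx], pose[scale_idx]) or 1.0
    return [((x - cx) / scale, (y - cy) / scale) for x, y in pose]

def keypoint_distance(pose, i, j):
    """Distance feature between two selected keypoints."""
    return math.dist(pose[i], pose[j])

def keypoint_velocity(prev_pose, pose, i):
    """Per-frame displacement of one keypoint (velocity feature)."""
    return math.dist(prev_pose[i], pose[i])

frame1 = [(100.0, 50.0), (100.0, 150.0), (130.0, 90.0)]
frame2 = [(102.0, 50.0), (102.0, 150.0), (133.0, 90.0)]
n1 = normalize_pose(frame1)
assert n1[0] == (0.0, 0.0)                             # centred at origin
assert abs(keypoint_distance(n1, 0, 1) - 1.0) < 1e-9   # unit reference length
assert keypoint_velocity(frame1, frame2, 2) == 3.0
```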

3.2 FASENet: The Proposed Architecture

Fig. 1. FASENet architecture

Given the gaps in literature as well as the insights and recommendations taken from [26], this study proposes an architecture called FASENet, which makes use of two different streams and leverages squeeze-and-excitation networks to model simple and complex actions for improved prediction performance. An overview of the architecture can be seen in Fig. 1. The rest of this subsection discusses each component of FASENet in detail.

Feature Projection. Given the set of handcrafted features, namely pose keypoints, distances, and velocity, each of them is projected into a higher dimension. By mapping the different feature sets into a higher dimension, properties in the data may emerge that are useful to the succeeding components of the architecture. Each feature set is fed to a fully connected layer with 128 output units, followed by a ReLU activation. After mapping each set of features into higher dimensions, they are combined through concatenation. The resulting projection has 384 features.

Dilated Stream. The dilated stream represents the upper set of layers in Fig. 1. The main objective of this stream is to obtain features for complex actions. To do this, the stream makes use of dilated 1D convolutions in order to model actions over equally spaced non-consecutive frames. By making use of dilations, sudden


changes in the video segment can be seen, which is important for complex actions because, by definition, complex actions are not repetitive [24,28]. This stream is composed of three sets of layers. Each set has a dilated 1D convolution, a batch normalization layer, and a dropout layer. Batch normalization is added to standardize the outputs of each layer; dropouts are added to control overfitting. For the 1D convolutions in the three sets of layers, the kernel sizes are set to 5, 2, and 1, respectively, each having 512 output channels. In addition, dilations are set to 2, 4, and 1, respectively. To ensure that the sequence length does not change despite the dilations, which might significantly reduce it, padding of 3, 2, and 0 is used, respectively. All dropouts are set to 30%.

Compact Stream. The compact stream represents the lower set of layers in Fig. 1. In contrast to the dilated stream, this stream aims to extract features for simple actions. Since it deals with consecutive frames as inputs, it can easily spot repetitions, making it suitable for modelling simpler actions. Furthermore, this part of the architecture is lifted from the model architecture of [26], which was shown to be effective in modelling simple actions for a smaller set of frames. Similar to the dilated stream, this stream is composed of three sets of layers. Each set has a 1D convolution that attempts to model the temporal information across the different sets of features, a batch normalization layer for standardized outputs and faster optimization, and a dropout to control possible overfitting. For the 1D convolutions, the kernel sizes are set to 7, 5, and 3, respectively, all with 512 output channels. To retain the sequence length, padding of size 3, 2, and 1 is added to the inputs during the convolutions. Again, all dropouts in this stream are set to 30%.

Feature Combination. The goal of FASENet is to combine features that describe simple and complex actions. In this case, the outputs of the dilated stream and the compact stream need to be combined; a simple element-wise addition is used to incorporate both sets of features.

Squeeze-and-Excitation. The combined set of features is fed to a 1D squeeze-and-excitation network. These networks have an architectural unit that explicitly models interdependencies between channels in a feature map [14], giving more weight to the more important features in the feature map before feeding them into the succeeding sets of layers. Once the feature maps are recalibrated, the features are separately fed to the next set of dilated and compact layers. The reduction ratio parameter is set to 4 for all layers.

Skip Connections. Recall that for each squeeze-and-excitation layer, the outputs are fed separately to another set of dilated and compact layers. However, some features from the initial set of layers might provide important information. Thus, for each squeeze-and-excitation layer, skip connections [13] are added to each succeeding layer, as shown in Fig. 1. This ensures that the stream of information and relevant features continues into the latter portions of the architecture, while also considering features that have passed through multiple layers.
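A pure-Python sketch of one 1D squeeze-and-excitation unit with reduction ratio 4 follows; the weight shapes and the omission of biases are simplifying assumptions for illustration, not FASENet's exact implementation:

```python
import math
import random

def relu(v): return [max(0.0, x) for x in v]
def sigmoid(v): return [1.0 / (1.0 + math.exp(-x)) for x in v]
def matvec(m, v): return [sum(w * x for w, x in zip(row, v)) for row in m]

def se_block_1d(x, w1, w2):
    """1D squeeze-and-excitation on a (channels x time) feature map.
    Squeeze: global average pool over time -> one descriptor per channel.
    Excitation: bottleneck MLP (reduction ratio r) -> sigmoid channel gates."""
    z = [sum(row) / len(row) for row in x]          # squeeze: (C,)
    s = relu(matvec(w1, z))                         # bottleneck: (C/r,)
    g = sigmoid(matvec(w2, s))                      # gates in (0, 1): (C,)
    return [[v * gate for v in row] for row, gate in zip(x, g)]

random.seed(0)
C, r, T = 8, 4, 16                                  # ratio r = 4, as in the text
x = [[random.gauss(0, 1) for _ in range(T)] for _ in range(C)]
w1 = [[random.gauss(0, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.gauss(0, 1) for _ in range(C // r)] for _ in range(C)]
y = se_block_1d(x, w1, w2)
assert len(y) == C and all(len(row) == T for row in y)  # shape preserved
```

Because the gates only rescale channels, the block leaves the sequence length untouched, which is why it can sit between the two streams without affecting their padding arithmetic.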


Pooling and Fully Connected Layers. After the main set of feature-extraction layers, i.e., after the last squeeze-and-excitation layer shown in Fig. 1, global average pooling compresses the features into a single dimension with 512 channels. The final set of features is then fed to two fully connected layers, both with 512 hidden units, with a dropout of 30% between them to prevent overfitting. The activation of the first layer is a ReLU, while the second layer's activation depends on whether the task has binary or multiple classes: a sigmoid activation is used for binary cases and a softmax activation for multiclass cases.

Hyperparameters and Training. The network was trained for 100 epochs using the Adam optimizer with a batch size of 64. Since the activity recognition and fall detection datasets have multiple classes, the loss function used is the cross-entropy loss; for datasets with only binary classes, binary cross-entropy loss is used. The learning rate and weight decay were empirically determined for each dataset. For the Adhikari Dataset, the learning rate was set to 5e−5 and the weight decay to 0.01. The learning rate on the UP-Fall Dataset was set to 2e−6 with a weight decay of 0. Lastly, the UR-Fall Dataset had a learning rate of 7e−6 and a weight decay of 0.055.

4 Datasets

Adhikari Dataset1. The Adhikari Dataset [3] contains six main classes related to indoor patient activity: standing, sitting, lying, bending, crawling, and others (subject is not in frame). The videos were taken from five different rooms with eight different viewing angles using a Kinect sensor. There were five participants, two female and three male. Overall, there are 16,794 frames for training and 2,543 frames for testing. Only the RGB frames are used in this study, and the training and test sets are based on those provided by the authors (Table 1).

Table 1. Adhikari dataset class distribution

Class      Train   Train %  Test   Test %
Others        585    3.48%    205   8.06%
Standing    7,002   41.69%    694  27.25%
Sitting     3,308   19.69%    793  31.18%
Lying       4,402   26.21%    731  28.74%
Bending       805    4.79%     83   3.26%
Crawling      691    4.11%     37   1.45%
Total      16,793     100%  2,543    100%

1 http://falldataset.com.


Universidad Panamericana (UP) Fall Dataset2. The Universidad Panamericana Fall Dataset [20] has 11 different classes. Five of the eleven classes are fall types depending on the direction of the fall, while the remaining six are activities such as standing, walking, lying, sitting, jumping, and picking up an object. Three trials were taken from 17 adults using two Microsoft LifeCam cameras. Two views were collected, frontal and side; in this study, only the frontal view was used. For the train and test split, similar to [9], trials 1 and 2 are used for training while trial 3 is used for testing. In total, 196,931 frames were used for training and 97,747 for testing (Table 2).

Table 2. UP-Fall dataset data distribution

Class                   Train    Train %  Test    Test %
Falling front on hands    6,144    3.11%   3,060    3.13%
Falling front on knees    6,103    3.10%   3,085    3.16%
Falling backwards         6,143    3.11%   3,094    3.17%
Falling sideward          6,169    3.13%   3,079    3.15%
Falling sitting           6,060    3.08%   3,014    3.08%
Walking                  36,459   18.51%  18,109   18.52%
Standing                 36,071   18.31%  17,961   18.36%
Sitting                  35,795   18.17%  17,894   18.30%
Picking up object         5,921    3.01%   2,951    3.01%
Jumping                  17,744    9.01%   8,950    9.15%
Laying                   34,322   17.42%  16,550   16.93%
Total                   196,931     100%  97,747     100%

University of Rzeszow (UR) Fall Dataset3. The University of Rzeszow Fall Dataset [16] only has two classes, fall and not fall. The videos were taken in a frontal view of the room using a Microsoft Kinect camera. There were 70 videos collected, where 30 of them contained falls while the rest only had simple day-to-day activities. For the training and test split, 80% was used for training and 20% for testing, similar to that of [6]. In total, there are 8,540 frames for training and 2,135 for testing (Table 3).

Table 3. UR-Fall dataset data distribution

Class     Train  Train %  Test   Test %
Not fall  7,147   83.68%  1,758  82.34%
Fall      1,393   16.31%    377  17.65%
Total     8,540     100%  2,135    100%

2 http://sites.google.com/up.edu.mx/har-up/.
3 http://fenix.univ.rzeszow.pl/mkepski/ds/uf.html.

5 Results and Discussion

Table 4. Effects of number of frames and comparisons

Frames Model

Adhikari Dataset [3]

4

FASENet 8

UP-Fall Dataset [20]

4 8 16 4

16

0.8071 0.8317

0.8911

0.6576 0.8048

0.7313 0.6366 0.7998

0.8206 0.8226

0.9326

0.9629

0.9675 0.929

Conv1D [26] 0.992

0.9682

FASENet

0.9732 0.9706

0.9901

0.8302

0.9057 0.9189

0.9768 0.9686 0.8921 0.9835

0.8213

0.8327 0.8355

0.8624 0.8127

Conv1D [26] 0.9862 0.9521 FASENet

0.7409 0.7165

0.8808 0.8499 0.8067

Conv1D [26] 0.9728 FASENet

8

0.634

Conv1D [26] 0.8924 0.844 FASENet

UR-Fall Dataset [16]

0.706

0.8737 0.8391 0.8122 0.8191

Conv1D [26] 0.8798 FASENet

F1 0.7042

0.8545 0.7333 0.6595 0.6828

Conv1D [26] 0.8609 FASENet

Rec. 0.7488

0.8742 0.7543 0.7185

Conv1D [26] 0.8024 FASENet

Prec. 0.6865

0.8696 0.7393 0.7595 0.7475

Conv1D [26] 0.8424 FASENet

16

Acc.

Conv1D [26] 0.8478

0.9288 0.9575 0.9478

0.9865 0.9773 0.9719
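The 4-, 8-, and 16-frame settings correspond to sliding windows over the per-frame inputs. A minimal windowing sketch (a stride of 1 and the list-based representation are assumptions for illustration):

```python
def sliding_windows(sequence, window_size, stride=1):
    """Split a sequence of per-frame features into overlapping windows."""
    return [
        sequence[start:start + window_size]
        for start in range(0, len(sequence) - window_size + 1, stride)
    ]

frames = list(range(100))  # stand-in for 100 frames of pose features
windows_by_size = {w: sliding_windows(frames, w) for w in (4, 8, 16)}
# A stride-1 window of size w over n frames yields n - w + 1 clips.
```

Larger windows give the model more temporal context per clip but yield fewer clips, which is the trade-off the frame-count experiments probe.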

Number of Frames. The results for different numbers of frames are shown in Table 4. Besides identifying the effect of the number of frames, FASENet's results are benchmarked against the work of [26]. Overall, FASENet performs better at 4 and 8 frames than the Conv1D model, and it generally achieves higher precision across all of the results. Due to the precision-recall trade-off, for longer windows such as 16 frames, FASENet attains higher precision at the cost of lower recall.

Table 5. False positive rates for FASENet

Dataset          Model                          False Positive Rate
UP-Fall Dataset  Binary Training (Conv1D [26])  2.94%
UP-Fall Dataset  Binary Training (FASENet)      2.41%
UP-Fall Dataset  Best Model (Conv1D [26])       2.59%
UP-Fall Dataset  Best Model (FASENet)           2.15%
UR-Fall Dataset  Best Model (Conv1D [26])       0.68%
UR-Fall Dataset  Best Model (FASENet)           0.58%
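The false positive rate reported in Table 5 is the share of non-fall samples flagged as falls. A small helper, with hypothetical confusion counts chosen only for illustration:

```python
def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN): the share of true negatives flagged positive."""
    return fp / (fp + tn)

# Hypothetical counts: 43 non-fall windows misclassified as falls
# out of 2,000 non-fall windows in total.
fpr = false_positive_rate(fp=43, tn=1957)  # 0.0215, i.e. 2.15%
```

Keeping this rate low matters for deployment, since every false positive is a false alarm raised to caregivers.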


J. J. P. Suarez et al.

False Positive Rate on Falls. Out of the three datasets, two have fall classes, namely the UP-Fall and UR-Fall datasets. UR-Fall contains only two classes, while UP-Fall contains 11 classes, 5 of which are falls and the remainder day-to-day activities. Since UP-Fall contains multiple classes, one model was trained specifically on the UP-Fall dataset with the labels converted to fall and not fall; the other model is the best model from the FASENet experiments. The false positive rates using FASENet are shown in Table 5, together with the false positive rates from [26] for comparison. As the results show, FASENet reduced the false positive rate on both the UR-Fall and UP-Fall datasets. In addition, for Conv1D, the best model using all classes performed better than binary training; similarly for FASENet, the model trained with multiple classes performed better than the one trained with only fall and not-fall classes.

Table 6. Ablations of FASENet on Adhikari dataset using 8 frames

Author                        Acc.    Prec.   Rec.    F1
Ours (compact stream)         0.8559  0.6858  0.6842  0.679
Ours (compact + SqE.)         0.8618  0.7038  0.7073  0.6995
Ours (compact + SqE. + skip)  0.8642  0.7165  0.7262  0.7154
Ours (dilated stream)         0.8499  0.7258  0.7079  0.7007
Ours (dilated + SqE.)         0.8476  0.7209  0.7058  0.7005
Ours (dilated + SqE. + skip)  0.8527  0.7126  0.6939  0.6967
Ours (both streams)           0.8598  0.7283  0.6877  0.7001
Ours (both + SqE.)            0.8674  0.729   0.6933  0.7028
Ours (both + SqE. + skip)     0.8742  0.7543  0.7185  0.7313
Adhikari et al. [3]           0.7476  0.6452  0.5502  0.4987
Abedi et al. (H 3.6M) [2]     0.7347  -       -       -
Abedi et al. (FDD) [2]        0.8132  -       -       -
Suarez et al. (4PD) [26]      0.8415  0.6854  0.7733  0.7047
Suarez et al. (4PDV) [26]     0.8478  0.6865  0.7488  0.7042
Suarez et al. (8PDV) [26]     0.8424  0.706   0.7409  0.7165
Zhu et al. [32]               0.86    -       -       -
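Squeeze-and-excitation recalibrates channel responses by pooling each channel to a scalar, passing the pooled vector through a small bottleneck, and rescaling the channels with sigmoid gates [14]. A NumPy sketch for a 1D feature map; the random weights and the reduction ratio of 4 are assumptions, not FASENet's actual parameters:

```python
import numpy as np

def squeeze_excite_1d(x, w1, w2):
    """x: (channels, time) feature map; w1: (c//r, c); w2: (c, c//r)."""
    z = x.mean(axis=1)                   # squeeze: global average pool -> (c,)
    h = np.maximum(0.0, w1 @ z)          # bottleneck + ReLU -> (c//r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))  # sigmoid gates in (0, 1) -> (c,)
    return x * s[:, None]                # excite: rescale each channel

rng = np.random.default_rng(0)
c, r, t = 16, 4, 32
x = rng.standard_normal((c, t))
w1 = rng.standard_normal((c // r, c))
w2 = rng.standard_normal((c, c // r))
y = squeeze_excite_1d(x, w1, w2)
```

Because each gate lies in (0, 1), the block can only attenuate channels, which is the feature-recalibration effect discussed in the ablations.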

Ablations on Architecture. Tables 6, 7, and 8 show the results of the ablations done on FASENet for the various datasets. The number of frames used is based on the best F1 scores in Table 4. The results of FASENet on the Adhikari dataset are shown in Table 6. As observed in the table, experiments using only the dilated stream mostly achieve better precision than those using only the compact stream. In addition, the effect of squeeze-and-excitation can be seen in the increase in accuracy for most experiments. Lastly, combining the strengths of both the compact and dilated streams with squeeze-and-excitation networks and skip connections for feature map recalibration helped the model attain the best overall performance. The best-performing model, which makes use of all components of FASENet, outperformed previous works on accuracy, precision, and F1 score; however, previous works have better recall than the proposed architecture. This might be due to the precision-recall trade-off, as FASENet's precision is higher than that of previous studies. Despite this observation, judging by the F1 score, FASENet has the better overall performance compared to prior work.

Table 7. Ablations of FASENet on UP-Fall dataset using 16 frames

Author                        Acc.    Prec.   Rec.    F1
Ours (compact stream)         0.8875  0.8539  0.8036  0.8219
Ours (compact + SqE.)         0.8876  0.8582  0.8173  0.8173
Ours (compact + SqE. + skip)  0.8919  0.8552  0.819   0.8259
Ours (dilated stream)         0.8787  0.8566  0.7764  0.8019
Ours (dilated + SqE.)         0.8792  0.8563  0.7764  0.8023
Ours (dilated + SqE. + skip)  0.8887  0.8614  0.8098  0.8279
Ours (both streams)           0.883   0.8599  0.784   0.8072
Ours (both + SqE.)            0.8875  0.8574  0.796   0.8176
Ours (both + SqE. + skip)     0.8911  0.8624  0.8127  0.8302
Espinosa et al. [9]           0.8226  0.7425  0.7167  0.7294
Martinez-Villasenor [20]      0.951   0.718   0.713   0.712
Suarez et al. (16PDV) [26]    0.893   0.8485  0.8341  0.8367
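The dilated stream widens the temporal receptive field without extra parameters: for a stack of stride-1 1D convolutions, the receptive field is 1 + the sum of (kernel_size - 1) * dilation over the layers. A quick helper (the specific layer settings below are hypothetical, not FASENet's actual configuration):

```python
def receptive_field(layers):
    """layers: iterable of (kernel_size, dilation) for stacked stride-1 convs."""
    return 1 + sum((k - 1) * d for k, d in layers)

compact = [(3, 1), (3, 1), (3, 1)]  # plain convolutions
dilated = [(3, 1), (3, 2), (3, 4)]  # exponentially dilated convolutions
```

With these settings, three plain 3-wide convolutions see 7 frames, while the dilated stack sees 15 frames with the same parameter count, which is why a dilated stream can capture "complex" longer-range temporal features.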

The results on the UP-Fall dataset can be seen in Table 7. Generally, there are small gains in accuracy when squeeze-and-excitation networks are present, but the best accuracies are obtained when the squeeze-and-excitation networks are combined with skip connections, regardless of whether the compact, dilated, or both streams are used. In addition, the dilated stream's recall is lower than that of the other sets of experiments, and the configurations with skip connections have better overall performance than those without. With respect to F1 score, the best performance was obtained by making use of all components of FASENet. In comparison to previous work, the best accuracy belongs to [20], while the best recall and F1 scores remain with the previous experiments using pose keypoints and velocity with 16 frames as input. However, FASENet using all of its components has the best precision.

The results on the UR-Fall dataset can be seen in Table 8. Similar to the Adhikari dataset, using the dilated stream gives better precision than using only the compact stream. Furthermore, improvements in F1 score can be seen in the presence of squeeze-and-excitation networks, but the best overall performance with respect to F1 is obtained when using all of the components of FASENet. In comparison to previous work on the UR-Fall dataset, prior methods have better recall; on the other hand, the majority of our results outperform prior work with respect to precision.

[Table 8. Ablations of FASENet on UR-Fall dataset using 16 frames: Acc., Prec., Rec., and F1 for the FASENet component ablations (compact/dilated streams, SqE., skip) compared with Nunez-Marcos et al. [21], Harrou et al. [12], Zerrouki et al. [30], Bhandari et al. [5], Cai et al. [6], Feng et al. [10], and Suarez et al. (16PDV) [26]]

6 Conclusion and Future Work

This work has proposed FASENet, a new architecture for fall detection and activity monitoring. Through the use of compact and dilated streams that extract simple and complex features, squeeze-and-excitation networks for feature recalibration, and skip connections for forward passing, FASENet attained state-of-the-art accuracy, precision, and F1 on the Adhikari dataset. It also achieved state-of-the-art precision on the UP-Fall and UR-Fall datasets, with competitive performance on the other metrics. Furthermore, compared to previous work, the proposed architecture reduced the false positive rate for falls, which is important for reducing the rate of false alarms. Lastly, when comparing FASENet to previous work with respect to the number of frames, FASENet performed well despite using fewer frames because of its architectural design. For future work, data augmentation can be leveraged to handle imbalanced classes. In addition, self-supervised methods might improve the performance of the model when using different augmentations on the skeleton data. Lastly, the model can be tested and improved on more rigorous activities.

References

1. Falls (2021). https://www.who.int/news-room/fact-sheets/detail/falls
2. Abedi, W.M.S., Ibraheem Nadher, D., Sadiq, A.T.: Modified deep learning method for body postures recognition. Int. J. Adv. Sci. Technol. 29, 3830–3841 (2020)
3. Adhikari, K., Bouchachia, H., Nait-Charif, H.: Activity recognition for indoor fall detection using convolutional neural network. In: 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA). IEEE (2017). https://dx.doi.org/10.23919/MVA.2017.7986795
4. Albawi, S., Mohammed, T.A., Al-Zawi, S.: Understanding of a convolutional neural network. In: 2017 International Conference on Engineering and Technology (ICET), pp. 1–6. IEEE (2017)
5. Bhandari, S., Babar, N., Gupta, P., Shah, N., Pujari, S.: A novel approach for fall detection in home environment. In: 2017 IEEE 6th Global Conference on Consumer Electronics (GCCE) (2017)
6. Cai, X., Li, S., Liu, X., Han, G.: Vision-based fall detection with multi-task hourglass convolutional auto-encoder. IEEE Access 8, 44493–44502 (2020)
7. Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
8. Elshwemy, F., Elbasiony, R., Saidahmed, M.: A new approach for thermal vision based fall detection using residual autoencoder. Int. J. Intell. Eng. Syst. 13(2), 250–258 (2020)
9. Espinosa, R., Ponce, H., Gutierrez, S., Martínez-Villaseñor, L., Brieva, J., Moya-Albor, E.: A vision-based approach for fall detection using multiple cameras and convolutional neural networks: a case study using the UP-fall detection dataset. Comput. Biol. Med. 115, 103520 (2019)
10. Feng, Q., Gao, C., Wang, L., Zhao, Y., Song, T., Li, Q.: Spatio-temporal fall event detection in complex scenes using attention guided LSTM. Pattern Recogn. Lett. 130, 242–249 (2020)
11. Han, Q., et al.: A two-stream approach to fall detection with MobileVGG. IEEE Access 8, 17556–17566 (2020)
12. Harrou, F., Zerrouki, N., Sun, Y., Houacine, A.: Vision-based fall detection system for improving safety of elderly people. IEEE Instrum. Measur. Mag. 20(6), 49–55 (2017)
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
14. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
15. Kong, Y., Huang, J., Huang, S., Wei, Z., Wang, S.: Learning spatiotemporal representations for human fall detection in surveillance video. J. Vis. Commun. Image Represent. 59, 215–230 (2019)
16. Kwolek, B., Kepski, M.: Human fall detection on embedded platform using depth maps and wireless accelerometer. Comput. Methods Programs Biomed. 117(3), 489–501 (2014)
17. Lin, C.B., Dong, Z., Kuan, W.K., Huang, Y.F.: A framework for fall detection based on OpenPose skeleton and LSTM/GRU models. Appl. Sci. 11(1), 329 (2020)
18. Lugaresi, C., et al.: MediaPipe: a framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019)
19. Luo, Z., et al.: Computer vision-based descriptive analytics of seniors' daily activities for long-term health monitoring. Mach. Learn. Healthcare (MLHC) 2, 1 (2018)
20. Martínez-Villaseñor, L., Ponce, H., Brieva, J., Moya-Albor, E., Núñez-Martínez, J., Peñafort-Asturiano, C.: UP-fall detection dataset: a multimodal approach. Sensors 19(9), 1988 (2019)
21. Nunez-Marcos, A., Azkune, G., Arganda-Carreras, I.: Vision-based fall detection with convolutional neural networks. Wirel. Commun. Mob. Comput. 2017, 1–16 (2017)
22. Pérez-Ros, P., Sanchis-Aguado, M.A., Durá-Gil, J.V., Martínez-Arnau, F.M., Belda-Lois, J.M.: FallSkip device is a useful tool for fall risk assessment in sarcopenic older community people. Int. J. Older People Nurs. (2021)
23. Sathyanarayana, S., Satzoda, R.K., Sathyanarayana, S., Thambipillai, S.: Vision-based patient monitoring: a comprehensive review of algorithms and technologies. J. Ambient. Intell. Humaniz. Comput. 9(2), 225–251 (2015)
24. Shoaib, M., Bosch, S., Incel, O., Scholten, H., Havinga, P.: Complex human activity recognition using smartphone and wrist-worn motion sensors. Sensors 16(4), 426 (2016)
25. Silva, F.M., et al.: The sedentary time and physical activity levels on physical fitness in the elderly: a comparative cross sectional study. Int. J. Environ. Res. Public Health 16(19), 3697 (2019)
26. Suarez, J.J.P., Orillaza, N.S., Naval, P.C.: AFAR: a real-time vision-based activity monitoring and fall detection framework using 1D convolutional neural networks. In: 14th International Conference on Machine Learning and Computing (2022)
27. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)
28. Vrigkas, M., Nikou, C., Kakadiaris, I.A.: A review of human activity recognition methods. Front. Robot. AI 2, 28 (2015)
29. Wang, S., Chen, L., Zhou, Z., Sun, X., Dong, J.: Human fall detection in surveillance video based on PCANet. Multimed. Tools Appl. 75(19), 11603–11613 (2015)
30. Zerrouki, N., Harrou, F., Houacine, A., Sun, Y.: Fall detection using supervised machine learning algorithms: a comparative study. In: 2016 8th International Conference on Modelling, Identification and Control (ICMIC) (2016)
31. Zeytinoglu, M., Wroblewski, K.E., Vokes, T.J., Huisingh-Scheetz, M., Hawkley, L.C., Huang, E.S.: Association of loneliness with falls: a study of older US adults using the National Social Life, Health, and Aging Project. Gerontol. Geriatr. Med. 7, 233372142198921 (2021)
32. Zhu, N., Zhao, G., Zhang, X., Jin, Z.: Falling motion detection algorithm based on deep learning. IET Image Process. 16, 2845–2853 (2021)

Pre-processing of CT Images of the Lungs

Talshyn Sarsembayeva, Madina Mansurova, Adai Shomanov, Magzhan Sarsembayev, Symbat Sagyzbayeva, and Gassyrbek Rakhimzhanov

Al-Farabi Kazakh National University, Almaty, Kazakhstan
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract. Respiratory diseases are among the primary causes of death in today's population, and early detection of lung disorders has always been, and continues to be, critical. It is therefore important to evaluate the condition of the lungs on a regular basis in order to prevent disease or detect it before it does substantial harm to human health. Radiography is of critical clinical importance as the most common and readily available diagnostic research tool. Despite all of the benefits of this technology, diagnosing disease symptoms from images is a challenging task that necessitates the involvement of highly experienced specialists as well as significant time investment. The difficulty arises from the incompleteness and inaccuracy of the initial data, particularly the presence of numerous image distortions such as excessive exposure, the presence of foreign objects, and so on. In this study, the U-Net architecture was used for the initial processing of CT images of the lungs with a neural network. A review of the current state of research on X-ray and CT image recognition using deep learning methods showed that recognition of pathological processes is one of the most significant image-processing tasks today. Keywords: U-net · CT image · Medical image processing · Artificial intelligence · Deep learning

1 Introduction Today, respiratory diseases are one of the leading causes of death in the population and early diagnosis of lung diseases has been and remains very important. In this regard, it is necessary to regularly monitor the condition of the lungs to prevent the disease or to detect it before causing significant harm to human health. Radiography is of great clinical importance as the most common and available research tool in diagnosis. Despite all of the benefits of this technology, diagnosing sickness symptoms from photos is a challenging task that necessitates the involvement of highly experienced specialists as well as significant time investment. The difficulty arises from the incompleteness and inaccuracy of the initial data, particularly the presence of numerous image distortions such as excessive exposure, the presence of foreign objects, and so on. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 484–491, 2022. https://doi.org/10.1007/978-3-031-21967-2_39


In this context, the introduction of deep machine learning methods, linked with the creation of intelligent systems in medicine, has become the most popular and promising approach in recent years, and the discipline of visual image processing and analysis is evolving rapidly. The use of deep machine learning methods to analyze medical images such as computed tomography (CT), magnetic resonance imaging (MRI), and chest radiography allows for a more tailored and effective approach to the diagnosis and treatment of lung ailments. A large reduction in time costs and a reduced workload for healthcare personnel are further advantages of adopting machine learning technologies for image recognition. Although a great amount of biomedical data has been collected so far, which is meaningless in and of itself, the processing and analysis of this data using machine learning algorithms has the potential to drastically improve the therapeutic experience.

2 Related Work

CT scans are widely used to make diagnoses and assess thoracic illnesses. The increased resolution of CT imaging has necessitated a thorough exploration of statistics for analysis. As a result, computerizing the examination of such data has become a rapidly emerging research area in medical imaging. The detection of thoracic disorders via image processing relies on a pre-processing stage known as lung segmentation, which encompasses a wide range of techniques, starting with simple thresholding and incorporating numerous image-processing features to improve segmentation, precision, and robustness. Techniques including image pre-processing, segmentation, and feature extraction have been thoroughly discussed in the image-processing literature. The work [1] proposes a review of the literature on CT scans of the lungs, covering pre-processing ideas, segmentation of a variety of pulmonary structures, and feature extraction, with the goal of identifying and categorizing chest anomalies; research developments and open questions are discussed, along with directions for future investigation. The work [2] examines segmentation of various pulmonary structures, registration of chest images, and applications for detection, classification, and quantification of chest anomalies in CT scans; research trends and obstacles are identified, together with prospective research directions.

The challenge of assessing the effectiveness of deep convolutional neural networks (DCNNs) for identifying tuberculosis (TB) on chest radiographs is addressed in [3]. The authors employed 1,007 posterior-anterior chest radiographs from four de-identified, HIPAA-compliant datasets exempted from institutional review board review, split into training (68.0%), validation (17.1%), and test (14.9%) sets. Two different DCNNs, AlexNet and GoogLeNet, were utilized to classify images as having pulmonary tuberculosis symptoms or as healthy; both untrained networks and networks pre-trained on ImageNet were evaluated, along with numerous pre-processing approaches. In cases where the classifiers disagreed, the images were blindly interpreted by an independent board-certified cardiothoracic radiologist to evaluate the prospective workflow, which demonstrates the study's thoroughness. According to the findings of [3], the most effective classifier, an ensemble of the AlexNet and GoogLeNet DCNNs, had an AUC of 0.99. The AUCs of the pre-trained models were greater than those of the untrained models (P < 0.001), and the accuracy of AlexNet and GoogLeNet was significantly enhanced by expanding the data set (P values for AlexNet and GoogLeNet were 0.03 and 0.02, respectively). The DCNNs were inconclusive in 13 of 150 test cases evaluated blinded by the cardiothoracic radiologist, who properly interpreted all 13 of them (100%); this workflow resulted in a sensitivity of 97.3% and a specificity of 100%. In conclusion, deep learning with DCNNs can accurately classify TB on chest X-rays with an AUC of 0.99, and the strategy of involving a radiologist in cases where the classifiers disagreed raised the accuracy even further.

The authors of [4] suggested a new computer-aided detection (CAD) approach based on contextual clustering and region growing to help radiologists detect lung cancer early using computed tomography. Rather than employing the traditional thresholding method, this study uses contextual clustering, which allows for more precise lung segmentation from the chest volume. Following segmentation, GLCM and LBP features are extracted, and three different classifiers (random forest, SVM, and k-NN) are used for classification. The authors of [5] present a deep learning algorithm for detecting abnormal densities on chest radiographs. Multi-CNN is a model that uses multiple convolutional neural networks, built with the ConvnetJS package, to classify a digitized chest X-ray input as normal or abnormal. The article also introduces fusion rules, a way of synthesizing the outcomes of the model components. The proposed Multi-CNN model achieved 96% accuracy on the X-ray image dataset.
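The AUC values quoted from [3] can be computed without fixing a threshold via the Mann-Whitney statistic: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. A pure-Python sketch, with a toy ensemble in the spirit of combining two networks' probabilities (the averaging rule is an assumption for illustration, not the method of [3]):

```python
def auc(scores_pos, scores_neg):
    """AUC as P(score_pos > score_neg), counting ties as 1/2."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def ensemble(probs_a, probs_b):
    """Average two classifiers' probabilities before thresholding."""
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]
```

A perfectly separating classifier scores AUC 1.0, and a classifier that assigns identical scores to both classes scores 0.5.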

3 Methods

To facilitate the extraction of stable features, the pre-processing phase normally involves some image manipulation, such as filtering, brightness equalization, and geometric corrections. The image of an object is described by a set of features that approximately characterize the object of interest. Features come in two types: local and integral. Local features have the advantage of being flexible and invariant to variations in brightness and illumination, but they are not unique. Integral features, which characterize the image of the object as a whole, are unaffected by changes in the object's structure and difficult lighting conditions. There is also a combined approach, in which local features serve as pieces of an integrated description: the sought object is modeled as a number of regions, each characterized by its own set of features (a local texture descriptor), and the object as a whole is described by a set of such descriptors.

Classification is defined as identifying whether an object belongs to a certain class by studying the feature vector obtained in the preceding stage and splitting the feature space into subdomains representing the appropriate classes. There are numerous classification methods: neural networks, statistical methods (Bayesian, regression, Fisher, etc.), decision trees and forests, metric methods (nearest neighbors, Parzen windows, etc.), kernel methods (SVM, RBF, the potential function technique), and boosting (AdaBoost). To detect an object in an image, membership in one of two classes is evaluated: the class of images that contain the object and the class of images that do not contain it (background images). The set of selected features, and their capacity to discriminate between images of objects of different classes, is a crucial aspect influencing the quality and stability of the classification. The more separable the features are (possibly at the cost of a more complex feature structure), the simpler the feature space, and the simpler the classifier can be. Conversely, the more complex the feature space, the more complicated the classifier required to partition it correctly, even if the individual features are simple. Many "simple" features can be combined to approximate a few "complex" ones.

Using these concepts, a modern and effective method has been developed that combines all stages of image analysis: pre-processing, simultaneous extraction of a set of "simple" features, and classification, optimized jointly by training a multi-layer deep convolutional neural network. The feature-extraction mechanism occupies the first layers of the convolutional neural network (CNN), which is itself part of the classifier; the structure of the features is produced automatically during the learning process and is governed by the network's model and architecture. The more "simple" attributes are required to describe (approximate) the target objects, the more parameters are required in the network model, and the more expensive the computation becomes. CNNs can be seen as a broad approach to modeling the feature space, but they require a training image database that represents all of the relevant objects, along with commensurate computational resources. This is because, by the rules of network design, the specific model of the object in a CNN is built solely on the basis of the information included in the training database.
In the classical approach, by contrast, the model of distinction is developed by the researcher using a priori information about the nature of the object's images. As a result, the typical sub-problem of constructing an efficient set of features becomes the challenge of developing an optimal CNN architecture, which in turn generates the required features from the images in the classification problem [3]. Patients' data were considered during the preliminary processing of the CT images of the lungs, and individual results were obtained for each patient. The segmentation for patient 1 is shown in Fig. 1: the original image, the manually annotated image, and the blended image.

Fig. 1. Segmentation


The data learning phase is shown in Fig. 2.

Fig. 2. Learning phase

The graph of the epochs is shown in Fig. 3.

Fig. 3. Epochs


Figure 4 shows the testing process during the preliminary processing of CT images of the lungs. In the testing process, the red area indicates the forecast, the green area indicates the truth, and the yellow area indicates the intersection.

Fig. 4. Test results

The key future direction of this work is to include pulmonary pathology to the training model and improve the medical database, taking into consideration the potential for new problems in the development of new machine learning algorithms. For many years, the challenge of differential identification of localized forms in the lungs has been a concern. Although contemporary diagnostic equipment makes it harder to detect localized forms in lung tissue, understanding these alterations remains a challenge. In the detection of abnormal volumetric formations in the lungs, the created automated method has a high degree of information. The technology has the advantage of being independent of any sort of computed tomography. Artificial systems are an important step in the development of artificial intelligence in lung cancer diagnostics. The simultaneous detection of nodules, bands, atherosclerosis, and blood vessels in the lung parenchyma is required for the examination of numerous lung disorders. This makes it more difficult to use existing systems in clinical practice. The database can be replenished in collaboration with professional bodies in various fields.


Evaluation. The model's performance is estimated using the Jaccard and Dice indices, which are standard for this type of computer vision problem. The Jaccard index is the intersection over union and is closely related to the F1 score. The key limitation of these metrics is that they only assess the numbers of true positives, false positives, and false negatives, ignoring their locations; as a result, the average contour distance and the average surface distance can be more appropriate. The evaluation was based on test data that had not been used during the training phase. The Jaccard score was 0.926, and the Dice score was 0.961. The core idea of the architecture is that the high-resolution features of the contracting path are combined with the upsampled output to predict a more accurate result. A softmax function was applied to the model output, and the network was trained with a negative log-likelihood loss using the Adam optimizer with a learning rate of 0.0005. Horizontal and vertical shifts, modest scaling, and padding were employed to augment the data. Before being fed to the network, all images and masks were resized to 512 × 512 pixels. To boost performance, an encoder from VGG11 pre-trained on ImageNet was used; this improves performance marginally while drastically speeding up network convergence. The vanilla U-Net configuration has no batch normalization; it was added here to promote network convergence. Because this network configuration outperformed the other U-Net variants without pre-trained weights on the validation data set, it was picked for the final evaluation.
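Jaccard (IoU) and Dice are related by Dice = 2J / (1 + J); with the reported Jaccard of 0.926 this gives approximately 0.961, matching the reported Dice score. A pure-Python sketch over flattened binary masks (the tiny example masks are illustrative only):

```python
def jaccard(pred, truth):
    """Intersection over union of two binary masks (flat 0/1 lists)."""
    inter = sum(p and t for p, t in zip(pred, truth))
    union = sum(p or t for p, t in zip(pred, truth))
    return inter / union

def dice(pred, truth):
    """Dice coefficient, equal to the F1 score of the foreground class."""
    inter = sum(p and t for p, t in zip(pred, truth))
    return 2 * inter / (sum(pred) + sum(truth))

pred  = [1, 1, 0, 1, 0]
truth = [1, 0, 0, 1, 1]
j, d = jaccard(pred, truth), dice(pred, truth)  # j = 2/4, d = 2/3
```

The identity Dice = 2J / (1 + J) holds for any pair of masks, so either metric can be recovered from the other.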

4 Conclusion The following tasks were completed during the work: 1) research into the principles of developing machine learning algorithms; 2) creation of a database for training and testing an automated diagnostic system (ACD) for CT images of the lungs; and 3) evaluation of indicators of the developed system’s information content [6]. Acknowledgment. This work was funded by Committee of Science of Republic of Kazakhstan AP09260767 “Development of an intellectual information and analytical system for assessing the health status of students in Kazakhstan” (2021–2023).

References
1. Vijayaraj, J., et al.: Various segmentation techniques for lung cancer detection using CT images: a review. Turk. J. Comput. Math. Educ. 12(2), 918–928 (2021)
2. Sluimer, I.C., et al.: Computer analysis of computed tomography scans of the lung: a survey. IEEE Trans. Med. Imaging 25, 385–405 (2006)
3. Lakhani, P., Sundaram, B.: Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology 284(2), 574–582 (2017)
4. Baboo, S.S., Iyyapparaj, E.: A classification and analysis of pulmonary nodules in CT images using random forest. In: 2nd International Conference on Inventive Systems and Control (ICISC), pp. 1226–1232 (2018)


5. Kieu, P.N., et al.: Applying multi-CNNs model for detecting abnormal problem on chest x-ray images. In: 10th International Conference on Knowledge and Systems Engineering (KSE), pp. 300–305 (2018)
6. Mansurova, M., Sarsenova, L., Kadyrbek, N., Sarsembayeva, T., Tyulepberdinova, G., Sailau, B.: Design and development of student digital health profile, pp. 1–5 (2021). https://doi.org/10.1109/AICT52784.2021.9620459

Innovations in Intelligent Systems

Application of Hyperledger Blockchain to Reduce Information Asymmetries in the Used Car Market

Chien-Wen Shen, Agnieszka Maria Koziel(B), and Chieh Wen

Department of Business Administration, National Central University, Taoyuan 32001, Taiwan
[email protected]

Abstract. The used car market has long been an example of a market rife with information asymmetry between sellers and buyers. Since most consumers have little experience and knowledge in buying cars, they rely on the historical vehicle documents provided only by car dealers, which might be insufficient to make pre-purchase judgments. To receive more information about events that occurred in the vehicle’s past, buyers need to spend time collecting other related documents from different sources. The whole process is time-consuming and leads to quality uncertainties causing market inefficiency. Such a problem can be alleviated by blockchain technology, which uses nodes of a computer network to record the historical information of a car, where the chain of data cannot be falsified, creating transparent, verified, and easy access to all documents. Accordingly, we propose a Hyperledger-based approach and simulate the acquisition time of historical vehicle data to illustrate the blockchain application to reduce information asymmetries in the used car market. In Hyperledger Fabric, all business network transactions are recorded on the smart contracts, allowing the records to coexist among the participants, including dealers, maintenance plants, motor vehicle offices, police offices, and buyers. This blockchain technology application mitigates information asymmetries between buyers and sellers, guarantees the integrity and transparency of data, and shortens the time needed to obtain historical car information.

Keywords: Asymmetric information · Blockchain · Used car · Hyperledger · Smart contract

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 495–508, 2022. https://doi.org/10.1007/978-3-031-21967-2_40

1 Introduction
Information asymmetries and quality uncertainties are problems faced by used car market buyers at any time before making a purchase decision. Buyers suffer from issues such as the car being in worse condition than initially indicated, accident damage not being disclosed, fraud, overvalued products, etc. To cope with those problems, buyers spend enormous time seeking information. Asymmetry of information increases used car costs and buyer dissatisfaction [1]. It causes damage to customers’ rights and interests, resulting in disputes between buyers and dealers [2, 3]. Although the used car trade in many countries has already surpassed the size of the new car market [4], according to the


Motor Vehicle Station statistics, in 2019 the number of used cars in Taiwan was double the size of the new car market. Still, there is a problem of quality uncertainty and a lack of trust in the second-hand car market. Buyers have to put much effort into seeking information that might be of poor quality, falsified, incorrect, or missing [3]. Most buyer-seller relationships in the used car market are characterized by information asymmetry [5]. Chen [6] analyzed the environmental situation of Taiwan’s second-hand car market and showed that it is a typical instance of Gresham’s law: “bad money drives out good.” Either the buyer or the seller has incomplete information during the transaction, thus forming a market inefficiency. Emons and Sheldon [7] investigate the behavior of both buyers and sellers, testing for adverse selection by sellers and quality uncertainty among buyers. They show that the estimated hazard function of car ownership does indeed decline with the duration of ownership, as quality uncertainty implies. In the used car market, consumers can only rely on their own car-buying experience or the vehicle information provided by the car dealers to judge the vehicle’s condition. Because the used car market has a severe information asymmetry problem, many disputes result, and consumers feel distrust and insecurity toward used cars [8]. Several methods address the information asymmetry problem. One intuitive solution is for consumers and competitors to act as monitors for each other: Consumer Reports, Underwriters Laboratory, notaries public, and online review services help bridge gaps in the information [9]. Another solution is to have manufacturers provide warranties, guarantees, and refunds. In addition to seller-granted warranties, third-party companies can offer insurance at some cost to the consumer. The government can also regulate the quality of goods sold on the market [10].
Nevertheless, asymmetric information still exists because buyers and sellers do not have equal amounts of the data required to make an informed decision regarding a transaction [11]. Buyers’ issues during the information-seeking process might be alleviated by blockchain-based solutions. Blockchain technology can mitigate information asymmetries and reduce quality uncertainty more efficiently, changing relationships between buyers and sellers into more reliable ones [3]. This technology is making progress in the automotive industry, with a significant impact on keeping all car records transparent and secure during the whole car life cycle. Therefore, it might be highly beneficial for the used car market to provide a complete vehicle history. Several studies have already proposed blockchain technology frameworks to reduce the information asymmetry issue based on vehicle history [12]. However, they focus on resolving only specific problems, such as odometer fraud [13], falsified car mileage data [14], insurance information [15], or vehicle accident documentation, which cannot ultimately reduce the information asymmetry [16]. Blockchain technology promises to automate the tracking of cars through their historical and current life and to provide reliable information whenever it is needed [17]. Recently, the application of blockchain technology in the used car market has received increasing attention. It can effectively reduce the information asymmetry of the used car market by storing the past usage status of vehicles [3]. Blockchain technology is a decentralized distributed database technology that uses blocks to record the car’s information, where the chain of data cannot be falsified. All the data is transparent and visible, and there is no method to modify a record; hence it can increase


the transparency and fairness of the transaction to protect the rights of the consumer [18]. Therefore, this research uses blockchain as a solution, putting the historical vehicle information records of used cars into the blockchain. To tackle the problem of information asymmetry in the used car market, we apply the Hyperledger approach to build a blockchain framework and illustrate the development of Hyperledger Fabric-based applications. In Hyperledger Fabric, all business network transactions are recorded on the smart contracts, allowing the records to coexist among the participants. Smart contracts can reduce administration costs and risk, increase trust, and improve the efficiency of used car market business processes [19]. Therefore, we suggest using smart contracts to solve the information asymmetry problem and provide a reliable transaction environment for the used car market. This paper records on the blockchain the interactions between the car owner and the dealer, the vehicle maintenance plant, and the supervising agencies. These records follow the unique VIN code of the vehicle, enabling the used car dealer to maintain a complete vehicle condition record through permission control so that the customer can see the real vehicle condition [20]. Moreover, customers can choose a car that reduces environmental impact to a minimum by having access to all car data, for instance, transparent information about carbon dioxide emissions when looking for a sustainable car purchase. Customers can make eco-conscious decisions based on this kind of data while purchasing a used car. We conduct a simulation to demonstrate the efficiency of the blockchain application in reducing information asymmetry from a time-related perspective, because evaluating and obtaining the right, precise information on the used car market is time-consuming and effortful [21].
Through the simulation, we review the blockchain application and compare the time to obtain the historical information of the vehicle in the process without and with the blockchain application. Our study shows that improving information flow and making information available to each party decreases information asymmetry problems in the used car market. Blockchain technology shortens the time to obtain historical data and increases trust among buyers.

2 Hyperledger Framework
Hyperledger is an open-source project initiated by the Linux Foundation in 2015 to advance blockchain digital technology and transaction verification. It consists of 30 founding company members and technical and organizational governance institutions. Hyperledger focuses on permissioned blockchain frameworks to provide maximum support for business consortia that want to advance cross-industry collaboration by developing blockchain technology [22]. The project is built to improve transparency and enhance accountability and trust between partners. Hyperledger blockchain technologies are increasingly called third-generation blockchain systems, where the first generation is considered to be Bitcoin and the second Ethereum [23]. However, Hyperledger is not a single blockchain technology but a set of different technologies contributed by member companies. Hyperledger supports, aids the development of, and promotes a range of business blockchain technologies, including distributed ledger frameworks, smart contracts, graphical interfaces, library clients, service libraries, and simple applications [24]. It contains various tools and frameworks. Hyperledger frameworks, also called


distributed ledger technology, include Fabric, Besu, Burrow, Indy, Iroha, and Sawtooth. At the same time, tools are being developed, including Avalon, Cactus, Caliper, Cello, Composer, Explorer, Grid, and Firefly. Each framework has its specific advantages in certain applications. Some of these technologies are mature, well developed, and actively applied (Fabric, Besu, Indy, Iroha, Sawtooth), while others are still in the incubation phase (Burrow). Currently, the most developed Hyperledger framework is Hyperledger Fabric [25]. Hyperledger Fabric is a permissioned open-source blockchain framework developed and funded by IBM and the Linux Foundation. As in other blockchain technologies, a ledger uses smart contracts and is used by participants to manage their transactions. However, access to the network is limited to participants who have verified identities and have joined the blockchain through a Membership Service Provider [26]. A single blockchain network can contain several channels, allowing a group of participants to create a separate ledger of transactions that is not visible to other members of the network [27]. In Fabric, all transactions are managed by smart contracts written in chaincode, which are invoked when an external application needs to interact with the blockchain ledger [28]. The ledger is the key feature that makes Hyperledger Fabric a comprehensive, practical, and essential blockchain solution for users. It is a set of transaction blocks that are produced using chaincode. The ledger includes a world state database with data records and a transaction log that records all changes to the database and cannot be altered afterwards [29]. Hyperledger Fabric is composed of applications, channels, and peers (Fig. 1). A channel is a mechanism used to achieve private communications and transactions between two or more network members.
Each transaction is executed on a channel, where each party must be verified to perform a transaction on that channel. An application is the initiator of a transaction request and must be connected to a peer to operate. Different types of peers (nodes) play different roles in blockchain networks. The endorser peer confirms the transaction after receiving a request from the client application, and the anchor peer receives updates and transfers them to other nodes in the organization [26]. Every peer that joins a channel needs to maintain the same ledger, and only peers that have joined the channel can access the ledger, which protects privacy. The orderer peer is a special node that ensures the consistency of the peers’ ledgers; it is mainly responsible for sorting transactions and packaging them into blocks, which are passed back to the peers to join the blockchain [29]. Our used car trade Hyperledger blockchain framework is illustrated in Fig. 1. The dealer creates the transaction ‘VehicleTransfer’ (participant application) for car1, which sends a request to peer1 (node) and calls smart contract1, which uses chaincode to create the query and update ledger1, which stores all transactions. Peer1, after receiving updates from other participants’ applications, updates the ledger. The smart contract generates a response to the request and sends it to the participant application, where the query process ends. Next, the client application builds a transaction based on the collected responses and sends it to orderer peer1 (orderer). The orderer collects transactions over the whole network and distributes them to all peers. Peer1 confirms the transaction and sends it to ledger1, which stores all transactions after verification. After updating ledger1, peer1 generates an event to notify the dealer that the transaction is completed [30, 31].


Fig. 1. Blockchain network in Hyperledger Fabric architecture for used car trade.

Each block for used car trade in the Hyperledger Fabric blockchain application is composed of a header and a body. The header contains the block version number of the current block, the hash value of the previous block, the hash value of the current block (the value of the Merkle root node generated from the transaction list), and other essential information such as the timestamp generated with the block. The block body contains the transactions created in smart contracts, arranged in order, together with a hash value specific to each transaction; these hashes are connected through a Merkle tree data structure [32] (Fig. 2). Based on the proposed framework, all historical transactions in the blocks are provided to the car buyer. Through this process, customers who make a request can obtain complete vehicle historical data without missing documents, decreasing information asymmetry and shortening the time spent searching for documents through traditional means.

Fig. 2. Maintenance plant block structure in used car trade blockchain network.

In Fabric, networks are managed by organizations where each node uses the Blockchain network on behalf of the participating organization. This differs from other


networks such as Bitcoin or Ethereum, where computers independently contribute resources to creating a network regardless of who the owner is. In most blockchain networks the computers coordinate, but Hyperledger Fabric enables organizations to coordinate. It tracks its execution history in the ledger profile, and its sequential verification of the blockchain architecture makes it secure. Moreover, Hyperledger Fabric has an authentication policy that even trusted developers cannot modify; this policy governs transaction verification and can only be parameterized by the smart contract (chaincode) [33]. Therefore, Hyperledger Fabric blockchain technology is suitable for the used car trade: it can securely store all vehicle data without any possibility for dealers to modify records on the blockchain, increasing the transparency and fairness of the transaction to protect the rights of the customer [18]. In the current process, it is necessary to apply to different organizations for access to the vehicle’s history data, such as the inspection information provided by the motor vehicle office, the repair or maintenance records held by the maintenance plant, and the accident records held by the police office. Access to each agency has its own sub-process, with its own file application procedure, processing time, and waiting time. After applying a blockchain, however, querying historical vehicle data no longer requires people to submit paper applications, travel to the agency, fill in application documents, or wait in line. The car owner can directly query historical data on the used car blockchain network. According to the agreement in the smart contract, as long as the applicant is the owner of the vehicle, the applicant can access the historical information block (Fig. 3).
After the blockchain is introduced, when consumers come to the store of a used car dealer, they can ask to see the historical information of the vehicle. Based on the definition in the smart contract, the dealer is the owner of this vehicle, so the dealer has the right to see the vehicle’s historical data. In the used car trade historical information workflow, the buyer first connects with the dealer to make a car request. Secondly, the buyer receives the car history information and checks the detailed car information at a specific phase, such as manufacturing or distribution. Finally, by selecting the transaction, the buyer can access all detailed information related to car ownership, maintenance, accidents, or violations.

Fig. 3. Workflow with a blockchain application for obtaining historical vehicle information.
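The owner-only access rule agreed in the smart contract can be sketched as a simple guard (illustrative JavaScript, not the paper's chaincode; the ledger layout and IDs are assumptions):

```javascript
// Sketch of the smart-contract rule "only the current owner may query the
// vehicle's history". Illustrative names; not actual Fabric chaincode.
function queryVehicleHistory(ledger, vin, applicantId) {
  const record = ledger[vin];
  if (!record) throw new Error(`unknown VIN ${vin}`);
  if (record.owner !== applicantId) {
    throw new Error('access denied: applicant is not the vehicle owner');
  }
  return record.history; // maintenance, inspection, accident records
}

// The dealer (ID D0) currently owns the example car.
const ledger = {
  TJCPCBLCX11000237: {
    owner: 'D0',
    history: [{ type: 'VehicleInspection', date: '2021-03' }],
  },
};
console.log(queryVehicleHistory(ledger, 'TJCPCBLCX11000237', 'D0').length); // 1
```

A query by any participant other than the recorded owner fails, which is the permission-control behavior the workflow in Fig. 3 relies on.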


3 Implementation and Simulation
The used car trade blockchain network aims to enable used car sales without asymmetric information. The target users of the network include the main stakeholders in the used car trade: the dealer, maintenance plant, motor vehicle office, police office, and customers. Therefore, we designed four types of smart contracts to facilitate car information circulation and simulated the acquisition time of historical vehicle data. We implement a blockchain application simulation process to understand the benefits of using smart contracts in reducing time-related information asymmetry problems.
3.1 Hyperledger Fabric Network Implementation
We build a sample network in the Hyperledger Fabric environment to demonstrate car transactions on the blockchain. The used car network runs according to the transaction rules, and the transaction query function is implemented directly through the Hyperledger API. Fabric is used to run the second-hand car chain network (car-trade), where smart contracts define the internal transaction rules. To record a transaction on the smart contract, the asset must first be defined in the blockchain network. An asset in the used car trade blockchain network represents an item that can be traded. In this research, the assets are cars, each with a unique 17-character vehicle identification number (VIN) (Fig. 4 (1)). The smart contracts, called chaincodes, can be shared by all entities within the used car trade network. Private chaincodes run only on peers with whom the chaincode is shared and are inaccessible to others. An authorized network member installs the smart contract on the node (peer) service. Each node and user interacting with Fabric is a member of a network called a consortium.
Hence, in this research we identify five consortium members (participants): Dealer, Maintenance Plant, Motor Vehicle Office, Police Office, and Customer. An example of the code defining the dealer participant is shown in Fig. 4 (2). Each participant is assigned a code, first and middle name, address, contact, gender, and date of birth, and inherits the defined user attribute. In this research, the VIN code TJCPCBLCX11000237 is taken as an example. The vehicle type is Model 3, the color is white, the transmission system is electric, the manufacturer is Tesla, the year of manufacture is 2019-01, and the owner is a dealer with ID D0.

Fig. 4. Code defining asset: vehicle (1) and user: dealer (2) in the blockchain network.
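The asset and participant definitions of Fig. 4 can be mirrored as plain objects (an illustrative sketch; the `$class` names and field set are assumptions, not the paper's exact model file):

```javascript
// Sketch mirroring Fig. 4: the vehicle asset and the dealer participant.
// Field names and $class identifiers are illustrative assumptions.
const asset = {
  $class: 'org.cartrade.Vehicle',
  vin: 'TJCPCBLCX11000237', // 17-character vehicle identification number
  model: 'Model 3',
  color: 'white',
  transmission: 'electric',
  manufacturer: 'Tesla',
  yearOfManufacture: '2019-01',
  owner: 'D0', // current owner: the dealer with ID D0
};

const participant = {
  $class: 'org.cartrade.Dealer',
  participantId: 'D0',
  address: 'Taoyuan, Taiwan',
};

// Basic sanity check: a VIN must be exactly 17 characters.
console.log(asset.vin.length === 17); // true
```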


By defining a transaction, it is possible to specify which interactions between participants may result in blocks. Before the dealer and buyer can transact with each other, the set of contracts defining the rules must be defined. Here the smart contract defines rules between participants in executable code, and after the smart contract is invoked, the transaction is generated. Smart contracts can ensure that the used car trade information asymmetry problems related to time and missing or falsified documents are solved or at least reduced. Four types of transactions are defined in this study: vehicle transactions submitted by dealers, vehicle maintenance transactions submitted by the vehicle maintenance plant, vehicle inspection transactions submitted by the motor vehicle office, and vehicle violation and accident transactions submitted by the police office. Each transaction needs to define explicit content; for vehicle maintenance information, this includes the VIN code, vehicle name, replacement parts, maintenance or repair description, and maintenance time. After submission, the vehicle is recorded in the blockchain maintenance block, and after the transaction in the network is completed, the block is generated in the blockchain. On the used car market, a smart contract between a used car buyer and a dealer allows the buyer to obtain the car’s complete information and verify the dealer simultaneously. The smart contract recorded on the blockchain contains all historical car information. There are contracts between the owner and the Motor Vehicle Office, the owner and the Maintenance Plant, and the owner and the Police Office. When creating a smart contract between buyer and dealer and making a transaction, all historical information becomes transparent to the buyer. The transaction definition in the proposed blockchain network requires composing a model in a (.js) file to define the transaction content, such as transfer of vehicle ownership, vehicle maintenance, vehicle inspection, or accident.
In the vehicle transfer transaction, the dealer proposes the transaction event, for which the dealer and buyer have defined the ‘VehicleTransfer’ smart contract. The VIN, name, model number, manufacturer, year of manufacture, transmission system, color, speed, miles, vehicle-dealer address, contact information, price, and the new vehicle owner are defined after the vehicle is transferred. In the vehicle maintenance transaction, the transaction is raised by the vehicle maintenance plant, and the smart contract ‘VehicleMaintenance’ is defined (Fig. 5 (1)) and submitted (Fig. 5 (2)). This happens when the owner drives the vehicle to the maintenance plant for repair or maintenance. The maintenance personnel log in with their ID to submit the vehicle in the Transaction Type and record the vehicle’s VIN code, mileage, vehicle parts, detailed description of maintenance, time and price, and owner. The ID can be traced back to the participant who uploaded this record. The vehicle inspection event is proposed by the motor vehicle office, and the ‘VehicleInspection’ smart contract is defined. The owner goes to the motor vehicle office for inspection within the specified time. The Vehicle Inspection in Transaction Type is submitted by the motor vehicle office to record the vehicle’s VIN code, name, model type, manufacturer, year of manufacture, transmission system, color, speed, miles, and contact information. In the violation or accident transaction, the vehicle owner is involved in a vehicle accident or violation. Police office personnel log in and submit ‘VehicleViolationOrAccident’ in Transaction Type, submitting data to the blockchain network based on the transcript’s content. The submitted content includes the vehicle’s VIN code, name, vehicle violation


reason, the time point of the violation, and the vehicle owner. The ID can be traced back to the participant who uploaded this record.

Fig. 5. Example of code for defining (1), and submitting (2) vehicle maintenance.
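The shape of the 'VehicleMaintenance' transaction can be sketched as follows (illustrative JavaScript standing in for the chaincode in Fig. 5; the function and field names are assumptions):

```javascript
// Sketch of the 'VehicleMaintenance' transaction: the maintenance plant
// appends a record to the car's on-chain history. Illustrative only;
// not the paper's actual chaincode.
function vehicleMaintenance(ledger, submitterId, record) {
  const car = ledger[record.vin];
  if (!car) throw new Error(`unknown VIN ${record.vin}`);
  car.history.push({
    type: 'VehicleMaintenance',
    submitter: submitterId, // traceable back to the uploading participant
    mileage: record.mileage,
    parts: record.parts,
    description: record.description,
    time: record.time,
    price: record.price,
  });
  return car.history.length;
}

// The maintenance plant (ID MP1) submits a brake service for the example car.
const ledger = { TJCPCBLCX11000237: { owner: 'D0', history: [] } };
const n = vehicleMaintenance(ledger, 'MP1', {
  vin: 'TJCPCBLCX11000237',
  mileage: 12000,
  parts: ['brake pads'],
  description: 'front brake service',
  time: '2021-06-01',
  price: 180,
});
console.log(n); // 1
```

The same append-with-submitter-ID pattern applies to the 'VehicleInspection' and 'VehicleViolationOrAccident' transactions, each with its own required fields.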

Access control and authorization are an essential part of the business network security architecture shared by member organizations on the blockchain. It distinguishes between access control for resources in a business network and access control for network management changes. The permissions of each participant in the used car blockchain network are defined in an ACL file. By defining ACL rules, we determine which users or roles are allowed to create, read, update, or delete elements in the network. A REST API is used to make query requests for the resources and return the results to the web page in (.json) format. For the completed transaction initiated in the previous section, the current block with its hash values in Fabric is shown in Fig. 6. The header indicates the block number, the current hash of this block, and the previous hash. The envelope info section contains the transaction ID and timestamps of this block. Finally, the metadata section includes the certificate and signature of the block creator (Fig. 6).

Fig. 6. Algorithm for the content of block in the (.json) format.
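The per-participant permissions the ACL file encodes can be sketched as a simple rule table (illustrative; the actual rule set and .acl syntax in the paper's network may differ):

```javascript
// Illustrative ACL-style rule table for the used car network.
// The operations granted to each participant are assumptions, not the
// paper's actual .acl file.
const aclRules = [
  { participant: 'Dealer',             resource: 'Vehicle', ops: ['CREATE', 'READ', 'UPDATE'] },
  { participant: 'MaintenancePlant',   resource: 'Vehicle', ops: ['READ', 'UPDATE'] },
  { participant: 'MotorVehicleOffice', resource: 'Vehicle', ops: ['READ', 'UPDATE'] },
  { participant: 'PoliceOffice',       resource: 'Vehicle', ops: ['READ', 'UPDATE'] },
  { participant: 'Customer',           resource: 'Vehicle', ops: ['READ'] },
];

// An operation is allowed only if some rule grants it to that participant.
function isAllowed(participant, resource, op) {
  return aclRules.some(
    (r) => r.participant === participant && r.resource === resource && r.ops.includes(op)
  );
}

console.log(isAllowed('Customer', 'Vehicle', 'READ'));   // true
console.log(isAllowed('Customer', 'Vehicle', 'UPDATE')); // false
```

Customers can read but never rewrite the vehicle's history, which is the property that keeps the on-chain record trustworthy for pre-purchase evaluation.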

3.2 Simulation Process Experiments with Accessing Historical Data
In this study, Arena Simulation Software simulates the acquisition time of historical vehicle data, such as vehicle maintenance data, vehicle inspection data, and vehicle violation or accident data. It compares the time difference between the acquisition of historical vehicle data before and after the introduction of the blockchain. Business process simulation analyzes and experiments with business processes in a virtual environment and tests risks and problems that may arise in the real environment. It is ideal when there is a need to change a process significantly without knowing the results in advance, or to see how a process might perform under particular conditions [21]. In the traditional used car trade process, it is necessary to apply to different organizations or companies to access the vehicle’s history data in order to provide the vehicle information to consumers for reference before purchase. The whole process is time-consuming and leads to asymmetric information issues. We assume that each process time follows a triangular distribution. After checking the form and confirming the identity, the motor vehicle office’s staff executes the procedure of querying the vehicle’s historical data; the triangular distribution defines the three times (minimum, most likely, and maximum), and an approval rate is set for each check station. The blockchain application reduces the time for people to submit paper applications, to travel to the agency, to fill in application documents, and to wait in line. According to the agreement in the smart contract, as long as the applicant is the owner of the vehicle, the applicant can access the historical information block. Therefore, after applying the blockchain, the triangular probability distribution is also used to define dealer visits to the three blocks, giving the minimum possible, most likely, and maximum values for acquiring the vehicle’s historical data (Table 1).

Table 1. Total time per entity with and without blockchain. Values are given as average (minimum–maximum) per process step.

Motor Vehicle Office
- The car owner goes to the Motor Vehicle Office: without blockchain 21.40 (11.80–37.10) minutes; with blockchain 14.95 (5.71–28.30) minutes
- The car owner waits in line at the Motor Vehicle Office: without blockchain 36.20 (14.90–63.40) minutes
- The car owner fills in the application form: without blockchain 7.32 (5.25–9.63) minutes
- Motor Vehicle Office staff check the application form: without blockchain 7.14 (5.36–9.29) minutes
- Motor Vehicle Office staff query history information: without blockchain 4.90 (3.10–6.80) days

Maintenance Plant
- The car owner goes to the Maintenance Plant: without blockchain 29.90 (11.40–56.70) minutes; with blockchain 15.35 (6.45–28.71) minutes
- The car owner waits in line at the Maintenance Plant: without blockchain 28.70 (12.90–58.80) minutes
- The car owner fills in the application form: without blockchain 7.58 (5.36–9.69) minutes
- Maintenance Plant staff check the application form: without blockchain 7.40 (5.20–9.58) minutes
- Maintenance Plant staff query history information: without blockchain 2.90 (0.56–4.80) days

Police Office
- The car owner goes to the Police Office: without blockchain 18.60 (10.30–28.40) minutes; with blockchain 14.86 (6.71–27.69) minutes
- The car owner waits in line at the Police Office: without blockchain 29.50 (10.80–55.10) minutes
- The car owner fills in the application form: without blockchain 7.33 (5.26–9.39) minutes
- Police Office staff check the application form: without blockchain 7.04 (5.26–9.58) minutes
- Police Office staff query vehicle history information: without blockchain 3.76 (1.39–6.87) days

SUM: without blockchain 4.33 (2.17–6.97) days; with blockchain 45.17 (24.05–76.48) minutes
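The per-step sampling used in the Arena experiments can be sketched with the standard inverse-CDF method for a triangular distribution. This is an illustrative JavaScript sketch, not the Arena model itself; the example uses the first Motor Vehicle Office row as parameters and treats the reported average as a stand-in for the mode:

```javascript
// Sketch: inverse-CDF sampling from a triangular distribution, as in the
// Arena simulation. Parameters (min, mode, max) per process step; the
// values below from Table 1 are illustrative, with the reported average
// used as an approximation of the mode.
function sampleTriangular(min, mode, max, u = Math.random()) {
  const f = (mode - min) / (max - min); // CDF value at the mode
  return u < f
    ? min + Math.sqrt(u * (max - min) * (mode - min))
    : max - Math.sqrt((1 - u) * (max - min) * (max - mode));
}

// One replication of "the car owner goes to the Motor Vehicle Office"
// without blockchain: min 11.80, mode ~21.40, max 37.10 minutes.
const t = sampleTriangular(11.8, 21.4, 37.1);
console.log(t >= 11.8 && t <= 37.1); // true
```

Summing such samples over every step of each entity's sub-process, with and without the blockchain branch, reproduces the kind of per-replication totals that Table 1 aggregates.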

4 Conclusions and Limitations
This study records the historical information of the car by building a business network through the Hyperledger Fabric framework to define business network transaction rules. Through the blockchain network, all past transaction records of vehicles are kept in smart contracts and enforced automatically when transaction conditions are satisfied. Smart contracts guarantee appropriate access control, where all records are secure and reliable, and the history is tracked through the past ledger, making data more transparent. Customers who want to buy a car on the used car market can make a pre-purchase evaluation according to the car condition information existing on the blockchain. Blockchain networks increase the transparency and fairness of transactions and protect the rights and interests of consumers. They also enable used car consumers to better understand vehicle conditions through the vehicle’s historical information, thus avoiding disputes caused by asymmetric information. The simulation results show that the blockchain network can reduce the acquisition time of the vehicle’s historical data and decrease information-seeking effort, yielding a timeline that reflects the current state of the vehicle and the course of events in its history. If consumers want to know the historical information of the vehicle, the dealer can obtain it through a blockchain query, reducing the time spent on personnel exchanges, waiting, written applications, etc. Moreover, according to the experiment results, if the vehicle industry wants to introduce blockchain applications in the future, they should be implemented by car dealers first.
When a vehicle is sold, the information is recorded on the blockchain; every time the car enters the original maintenance plant, the maintenance record can also be uploaded to the blockchain; and, in cooperation with government agencies, when the vehicle undergoes its first inspection, the inspection result is recorded on the blockchain as well. In addition, it is recommended that government agencies initiate a plan to establish relevant regulations and invite other automobile-related companies to join, in order to create a better vehicle industry environment. Blockchain implemented in the used car market can encourage potential car buyers to purchase a used car instead of a new one. This might positively influence the automobile industry, as it is commonly known that car production is a significant contributor to air pollution [34]. By choosing to purchase a car on the used car market, car buyers may contribute to decreasing air pollution; their transaction may therefore be considered sustainable. The limitations and future recommendations of this research are as follows. This study uses the application scenarios and business processes of Hyperledger Fabric to build a blockchain; however, there are many other Hyperledger projects that can be developed, explored, and applied in the used car market. In the future, it is recommended to build a web-based interface to provide users with a better UI. It is also suggested that the Explorer or Cello tools in the Hyperledger project could be used to monitor node hardware resources, the usage status of the blockchain, log records, and the operation of smart contracts. Since a key feature of blockchain is solving the problem of information asymmetry, it is recommended to verify the authenticity of the vehicle data after the blockchain is introduced into the used car market.

References

1. Sureshchandar, G.S., Rajendran, C., Anantharaman, R.N.: The relationship between service quality and customer satisfaction – a factor specific approach. J. Serv. Mark. 16(4), 363–379 (2002)
2. Bauer, I., Zavolokina, L., Schwabe, G.: Is there a market for trusted car data? Electron. Mark. 30(2), 211–225 (2019). https://doi.org/10.1007/s12525-019-00368-5
3. Zavolokina, L., Schlegel, M., Schwabe, G.: How can we reduce information asymmetries and enhance trust in 'The Market for Lemons'? Inf. Syst. e-Bus. Manage. 19(3), 883–908 (2020). https://doi.org/10.1007/s10257-020-00466-4
4. Ellencweig, B., et al.: Used Cars, New Platforms: Accelerating Sales in a Digitally Disrupted Market. McKinsey & Company (2019)
5. Ba, S., Pavlou, P.: Evidence of the effect of trust building technology in electronic markets: price premiums and buyer behavior. MIS Q. 26, 243–268 (2002)
6. Chen, C.-K.: The Study of Expediency & Benefit for Consumers Who Procure Vehicles from Second-hand Vehicles Vendors in A Propensity of Asymmetrical Information of Supply/Demand Market—By Way of Examples of the Market in Taiwan and Japan (2005). https://hdl.handle.net/11296/yq8459
7. Emons, W., Sheldon, G.: The market for used cars: new evidence of the lemons phenomenon. Appl. Econ. 41(22), 2867–2885 (2009)
8. Chang, C.-L.: The Study on Business Model and Competitive Strategy of Used Car (2011)
9. Kagan, J.: Underwriters Laboratories (UL) (2015). https://www.investopedia.com/terms/u/underwriters-laboratories-ul.asp
10. Ross, S.: How to Fix the Problem of Asymmetric Information (2019). https://www.investopedia.com/ask/answers/050415/how-can-problem-asymmetric-information-be-overcome.asp
11. Mehlhart, G., et al.: European second-hand car market analysis. Final Report. Öko-Institut, Darmstadt, Germany (2011)
12. Kooreman, P., Haan, M.A.: Price anomalies in the used car market. De Economist 154(1), 41–62 (2006)
13. Chanson, M., et al.: Blockchain as a privacy enabler: an odometer fraud prevention system. In: Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers (2017)

508

C.-W. Shen et al.

14. Li, C.: Blockchain Technology for Smart Vehicle: The Mileage and Condition Recording (2018)
15. Lamberti, F., et al.: Blockchains can work for car insurance: using smart contracts and sensors to provide on-demand coverage. IEEE Consum. Electron. Mag. 7(4), 72–81 (2018)
16. Cebe, M., et al.: Block4forensic: an integrated lightweight blockchain framework for forensics applications of connected vehicles. IEEE Commun. Mag. 56(10), 50–57 (2018)
17. Notheisen, B., Cholewa, J.B., Shanmugam, A.P.: Trading real-world assets on blockchain. Bus. Inf. Syst. Eng. 59(6), 425–440 (2017). https://doi.org/10.1007/s12599-017-0499-8
18. Chen, X.: EU: blockchain can help reduce odometer tampering in used cars (in Chinese) (2017). https://www.ithome.com.tw/news/118566
19. Zheng, Z., et al.: An overview on smart contracts: challenges, advances and platforms. Futur. Gener. Comput. Syst. 105, 475–491 (2020)
20. Demir, M., Turetken, O., Ferworn, A.: Blockchain based transparent vehicle insurance management. In: 2019 Sixth International Conference on Software Defined Systems (SDS) (2019)
21. Van Der Aalst, W.M.: Business process simulation survival guide. In: Handbook on Business Process Management 1, pp. 337–370. Springer (2015)
22. Linux Foundation: Advancing Business Blockchain Adoption Through Global Open Source Collaboration | What is Hyperledger? (2020). https://www.hyperledger.org/
23. Krstić, M., Krstić, L.: Hyperledger frameworks with a special focus on Hyperledger Fabric. Vojnotehnicki glasnik 68, 639–663 (2020)
24. Blockstuffs: Introduction of Hyperledger, Its Projects and Tools (2018). https://www.blockstuffs.com/blog/introduction-of-hyperledger-its-projects-and-tools. Accessed 29 Oct 2021
25. Muscara, B.: Frameworks and Tools (2021). https://wiki.hyperledger.org/display/LMDWG/Frameworks+and+Tools. Accessed 29 Oct 2021
26. Mamun, M.: How Does Hyperledger Fabric Work? (2018). https://medium.com/coinmonks/how-does-hyperledger-fabric-works-cdb68e6066f5
27. Thummavet, P.: Demystifying Hyperledger Fabric: Fabric Architecture (2019). https://medium.com/coinmonks/demystifying-hyperledger-fabric-1-3-fabric-architecture-a2fdb587f6cb
28. Chen, K.: Introduction to Hyperledger Fabric Transaction Process (in Chinese) (2020). https://medium.com/@kenchen_57904/hyperledger-fabric-%E4%BA%A4%E6%98%93%E6%B5%81%E7%A8%8B%E7%B0%A1%E4%BB%8B-f23bb54d5c8d
29. Debruyne, C., Panetto, H., Guédria, W., Bollen, P., Ciuciu, I., Meersman, R. (eds.): OTM 2018. LNCS, vol. 11231. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11683-5
30. Hyperledger: Hyperledger Fabric (2020)
31. Hyperledger: Peers (2020). https://hyperledger-fabric.readthedocs.io/en/release-1.4/peers/peers.html. Accessed 29 Oct 2021
32. Liang, Y.-C.: Blockchain for Dynamic Spectrum Management, pp. 121–146 (2020)
33. Androulaki, E., et al.: Hyperledger fabric: a distributed operating system for permissioned blockchains. In: Proceedings of the Thirteenth EuroSys Conference (2018)
34. Xie, Y., Wu, D., Zhu, S.: Can new energy vehicles subsidy curb the urban air pollution? Empirical evidence from pilot cities in China. Sci. Total Environ. 754, 142232 (2021)

Excess-Mass and Mass-Volume Quality Measures Susceptibility to Intrusion Detection System's Data Dimensionality

Arkadiusz Warzyński, Łukasz Falas, and Patryk Schauer

Faculty of Information and Communication Technology, Department of Computer Science and Systems Engineering, Wrocław University of Science and Technology, Wrocław, Poland
[email protected]

Abstract. Given the ever-increasing volume of network traffic, unsupervised intrusion detection methods are among the most widely researched solutions in the field of network security. One of the key challenges in developing such solutions is the proper assessment of the methods used for anomaly detection. Real-life cases show that in many situations labeled network data is not available, which effectively rules out standard criteria for evaluating anomaly detection algorithms, such as Receiver Operating Characteristic or Precision-Recall curves. In this paper, alternative criteria based on Excess-Mass and Mass-Volume curves are analyzed, which can enable quality assessment of anomaly detection algorithms without the need for labeled datasets. The paper focuses on assessing the effectiveness of Excess-Mass and Mass-Volume curve-based criteria in relation to the dimensionality of intrusion detection system data. The article discusses these criteria and presents the intrusion detection algorithms and datasets used in the analysis of the influence of data dimensionality on their effectiveness. This discussion is followed by experimental verification of these criteria on real-life datasets differing in dimensionality, and by a statistical analysis of the results indicating the relation between the effectiveness of the analyzed criteria and the dimensionality of the data processed in intrusion detection systems.

Keywords: Anomaly detection · Unsupervised machine learning · Algorithm quality assessment · Data dimensionality

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 509–519, 2022. https://doi.org/10.1007/978-3-031-21967-2_41

1 Introduction

One of the common problems encountered in network cybersecurity is the sheer volume of traffic present in distributed systems. Due to the vast development of network-based functionality in modern applications, future networks and their security mechanisms must provide solutions that enable real-time security assessment, potentially even for network attacks that were previously unrecognized or that use new attack vectors. Given these requirements, among the most promising solutions to these challenges are automated intrusion detection systems utilizing machine learning


methods in the process of network traffic anomaly detection. Numerous anomaly detection methods proposed in the literature rely on labeled datasets and standard evaluation criteria, e.g. criteria based on Receiver Operating Characteristic (ROC) or Precision-Recall (PR) curves. However, real-life cases show that in the majority of distributed network systems labeled datasets are not available, due to lack of funds or the simple inability to label even a small part of a large network traffic dataset. This paper describes ongoing research, started in a previous article [1], aimed at verifying the applicability of anomaly detection assessment criteria based on Excess-Mass and Mass-Volume curves in systems where labeled data is not available. The results of the experimentation described in the previous article indicated that while EM- and MV-based measures can be considered an approximate indication of anomaly detection accuracy, they cannot be effectively used to determine which unsupervised intrusion detection method performs better on a given unlabeled dataset, especially for algorithms with relatively similar ROC and PR values. However, while that research showed that these criteria cannot clearly indicate a ranking of unsupervised learning algorithms, the general conclusion was that the experimental results may be influenced by the characteristics of the datasets and their preparation. The main hypothesis following the previous experimentation is that EM- and MV-based measures may be susceptible to dataset dimensionality and may be better suited to low-dimensional rather than high-dimensional datasets. This article focuses on experimental verification of whether this hypothesis regarding the susceptibility of EM- and MV-based measures to data dimensionality is correct. The experimentation discussed in this article consisted of the selection of real-life datasets differing in dimensionality, the selection of the anomaly detection algorithms to be verified, and experiments aimed at identifying the influence of dataset characteristics on EM- and MV-based measures, followed by a discussion of the results and an indication of potential future research in this field.

2 Related Works

One of the basic problems in conducting research on detecting anomalies in network traffic is obtaining high-quality datasets. One of the most popular approaches, which gives satisfactory results, is the use of well-known, synthetically generated datasets. Their preparation consists of executing a series of planned actions that generate the corresponding network traffic. These actions include both desirable events, corresponding to normal traffic, and attacks that simulate anomalies. Such simulated traffic generation usually takes many days so that the generated traffic matches reality as closely as possible. The described approach enables easy and accurate verification of the results obtained by an anomaly detection method. Unfortunately, traffic generated in this way never reaches the complexity of real traffic. Moreover, an infrastructure secured only by a system built on anomaly detection mechanisms will never be completely secure, as it is impossible to avoid the occasional misclassification of individual packets as safe or not.


The field of anomaly detection mainly faces the problem of properly selecting parameter values for the applied anomaly detection method and ensuring high detection efficiency. When using reference datasets, it is easy to verify the obtained results with a confusion matrix as well as ROC curves and AUC values. Unfortunately, it is not possible to verify the operation of anomaly detection methods in this way in a real environment, due to the lack of labels. Despite these problems, the field of anomaly detection is growing, and the proposed methods give better and better results. The authors of [2], motivated by a wide range of applications, considered how to rank observations in the same order as that induced by the density function, which they called anomaly scoring. For this problem, they proposed a performance criterion in the form of the MV curve. [3] shows that the Mass-Volume curve is a natural criterion for evaluating the accuracy of decision rules in anomaly scoring. In response to this work, the article [4] presented an alternative approach to the anomaly scoring problem based on Excess-Mass curves. Another work proposing criteria based on the Excess-Mass (EM) and Mass-Volume (MV) curves is [5]; in addition, the method proposed there is based on feature sub-sampling and aggregation. The authors compared the results to ROC and PR criteria, which use data labels hidden from the MV and EM curves. This article is a direct continuation of [1]. Based on the experience gained so far and the reviews collected on that publication, it was decided to continue the research in order to deepen the topic and confirm or reject the hypotheses described above.
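To make the MV criterion concrete, the following is a minimal, illustrative numpy sketch (not the authors' implementation) of estimating an empirical Mass-Volume curve for a toy scoring function: for each mass level α, the threshold is the (1−α)-quantile of the scores on the data, and the Lebesgue volume of the corresponding score level set is estimated by Monte Carlo sampling over a bounding box.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "normal" data: a 2-D Gaussian cloud. The scoring function below is
# a naive kernel-density score (higher = more normal), standing in for
# the output of an unsupervised anomaly detector.
X = rng.normal(0.0, 1.0, size=(400, 2))

def score(points, data=X, bandwidth=0.5):
    d2 = ((points[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2)).mean(axis=1)

def mv_curve(alphas, n_mc=2000):
    """For each mass level alpha, estimate the volume of the score level
    set holding that much probability mass."""
    s_data = score(X)
    lo, hi = X.min(axis=0) - 1.0, X.max(axis=0) + 1.0
    box_volume = np.prod(hi - lo)
    U = rng.uniform(lo, hi, size=(n_mc, X.shape[1]))  # Monte Carlo points
    s_unif = score(U)
    volumes = []
    for alpha in alphas:
        t = np.quantile(s_data, 1.0 - alpha)  # score threshold at mass alpha
        volumes.append(box_volume * (s_unif >= t).mean())
    return np.array(volumes)

alphas = np.linspace(0.90, 0.99, 10)
mv = mv_curve(alphas)
# The level-set volume grows as the required mass grows; a sharper
# scoring function yields a lower MV curve.
print(mv)
```

The area under such a curve over a range of high mass levels then serves as a scalar quality measure for ranking scoring functions (lower is better); the EM curve is built analogously, roughly by scanning excess-mass levels t and evaluating P(s ≥ u) − t·Leb({s ≥ u}) over score thresholds u.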

3 Algorithms and Datasets

To examine the effect of feature selection algorithms on the applicability of EM and MV measures compared to ROC and PR measures, we decided to repeat the experiments on the same datasets using identical anomaly detection algorithms, to determine whether, with a limited number of dimensions, all validation measures point to the same best anomaly detection algorithm.

3.1 Algorithms

The set of algorithms tested did not change from our previous experiment. Again, the quality of classification was tested with 5 algorithms; compared to the original study [1], the ABOD algorithm and an example implementation of an autoencoder are additions to the list. The LOF algorithm [6] is based on estimating the average neighborhood density, and its performance depends on the adopted k, which defines the local neighborhood scale. LOF uses local kernel density estimation, a way of determining the density of a random variable's distribution from a finite data sample.


In the work presented in [7], the authors used the one-class support vector machine (SVM) algorithm for anomaly detection. This approach represents semi-supervised classification: it assumes that the training data belongs to only one class and distinguishes points belonging to the selected class by determining a set boundary in multidimensional space based on the training data. Isolation Forest [8] is an example of an algorithm that uses decision trees. It labels observations as anomalies when they lie on short branches, which reflects the small number of cuts needed to separate them from the rest of the dataset. Its operation relies on the general rule that anomalies tend to be rare and few. ABOD [9] is an algorithm for angle-based detection of outlier observations. It was developed in response to the problems of anomaly detection in high-dimensional datasets, for which measuring the distance between points becomes a computationally expensive and time-consuming operation. By measuring and comparing the angles formed with the given observation at the apex and other observations at the arms, it is quite easy to determine whether many observations lie in the vicinity of a given observation. If a point diverges from a group of other points (i.e. lies outside the cluster), the angle determined for it is small. The autoencoder is an anomaly detection method based on neural networks. The network learns to reconstruct the data, assuming the training data represents normal system behavior, and aims to minimize the reconstruction error, i.e. the arithmetically calculated difference between the model's output and input. In a running system, the learned model processes incoming data, and exceeding a predetermined reconstruction error threshold means that an anomaly has been detected.

3.2 Datasets

Previous research was conducted on 3 datasets containing network traffic data obtained from several different network protocols.
NSL-KDD [10] – a dataset containing 41 features of various types (continuous, discrete, binary, and nominal) representing information contained in TCP/IP headers. The dataset contains 22 types of attacks, and all traffic was generated in a laboratory network. In our study, all non-numeric features were removed from the dataset. The tests performed on this collection are particularly relevant because it is an extension of the KDD'99 dataset [11], from which, for example, the smtp subset was derived; that subset was originally considered during the development of the evaluation of classification algorithms using EM and MV values.

UGR'16 [12] – the data is described by 12 features of different types derived from the NetFlow protocol and is therefore limited in its feature selection options, especially given the need to exclude non-numeric features from the analysis. This collection also contains real network traffic.

UNSW-NB15 [13] – a dataset containing 48 features of various types derived from TCP/IP headers. The dataset is available in pcap format and was prepared by collecting over 100 GB of real network traffic and 9 types of attacks. Again, it is necessary to limit the number of analyzed features by excluding columns with non-numeric types.
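Three of the five detectors described in Sect. 3.1 are available directly in scikit-learn (ABOD and the autoencoder come from other libraries, e.g. PyOD and Keras/PyTorch, and are omitted here). The following is a minimal sketch of scoring with them on synthetic data, not on the datasets above:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# Toy stand-in for numeric network-traffic features: a dense "normal"
# cluster plus a few injected outliers (positions known only for this
# sanity check; the study itself is unsupervised).
X_normal = rng.normal(0.0, 1.0, size=(300, 5))
X_outlier = rng.uniform(6.0, 8.0, size=(10, 5))
X = np.vstack([X_normal, X_outlier])

detectors = {
    "lof": LocalOutlierFactor(n_neighbors=20),    # density-based
    "ocsvm": OneClassSVM(nu=0.1, gamma="scale"),  # boundary-based
    "iforest": IsolationForest(random_state=0),   # isolation-based
}

scores = {}
for name, det in detectors.items():
    if name == "lof":
        det.fit(X)
        scores[name] = det.negative_outlier_factor_  # higher = more normal
    else:
        scores[name] = det.fit(X).score_samples(X)   # higher = more normal

# The injected outliers should receive the lowest scores from each detector.
for name, s in scores.items():
    print(name, s[-10:].mean() < s[:300].mean())
```

Continuous anomaly scores of this kind are exactly what the ROC/PR and EM/MV criteria consume, which is why the same five scoring functions can be ranked by all four measures.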


4 Empirical Research

The empirical research focused on practical verification of whether EM- and MV-based metrics are viable for assessing the quality of unsupervised intrusion detection methods when dataset dimensionality is reduced with typical data preprocessing methods. In this research, three real-life network traffic datasets and three dimensionality reduction methods were used in experiments comparing five different unsupervised network traffic anomaly detection methods. A detailed description of the conducted research is given in the following sections.

4.1 Research Agenda

At this point, we decided to test the effect of dimensionality on the results obtained by the anomaly detection algorithms, while checking whether different result validation methods indicate the same methods as the best ones. To achieve this, it was necessary to reduce the number of features for each of the datasets previously used in the study. The problem of minimizing the number of data dimensions for machine learning algorithms has been widely described in the literature; selecting features for further analysis is a difficult task in general, and the optimal subset of features must be determined for each dataset separately, even within the same domain. This is primarily due to the considerable diversity of the data; for network traffic datasets it is particularly crucial, as traffic may be observed at the level of different network protocols, which may completely change the nature of the data. Research on feature selection has produced many algorithms and methods for dimensionality reduction. Each of them can indicate a different subset of features, so we decided to use 3 different algorithms available in the scikit-learn library to avoid the influence of a specific feature selection method on the obtained results. The comments and suggestions received so far indicated that the number of dimensions analyzed should be reduced as much as possible. Reviewing the datasets used in the original research [1], we found that they contain only a few features (the smtp dataset has only 3 of them). In the case of network traffic analysis, with its variety of possible attacks, this can be quite a limitation in threat detection, so we decided to re-examine what values the traditional validation measures, i.e. ROC and PR, would achieve. This allows us to determine whether reducing the number of dimensions resulted in a general deterioration of classification results, which would be undesirable. At the same time, we decided that with each feature selection method we would select subsets of 3, 5 and 7 features for each dataset. The exception is the UGR dataset, which has only 6 numeric features, so its tests include only the 3- and 5-element subsets.

4.2 Research Results

The results of the conducted research were grouped into three categories based on the dataset feature reduction method. Presenting the data in this form shows how dimensionality reduction influenced the quality-assessment metrics for each dataset and each feature reduction method. The experimental results for each feature reduction method consist of a table providing an overview of the ROC, PR, EM and MV metric values for each dataset with different numbers of features, i.e. 3, 5 and 7 (except for the UGR dataset), together with related charts showing the variation of the metrics for selected datasets.

4.2.1 KBest Feature Selection Results

The first experiment verified the influence of the KBest feature reduction method on the best and worst algorithm indications given by ROC-, PR-, EM- and MV-based measures. In this scenario, as in the following ones, ROC- and PR-based measures are treated as reference indicators of the algorithm quality rankings for each dataset and feature set, while the indications of the EM- and MV-based measures are compared against them to assess their validity. The best value of each metric for each dataset and feature count is distinguished in bold, while the worst is underlined (Table 1).

4.2.2 KNN Feature Selection Results

The second experiment verified the influence of the KNN-based feature reduction method on the quality of the indications provided by the EM- and MV-based metrics in comparison with the reference metrics. The results are presented in the same manner as for the first experiment (Table 2).

4.2.3 VAR Feature Selection Results

The last experiment verified the influence of the VAR feature reduction method on the quality of indication of the verified metrics. The results are shown in the same way as the previously presented results (Table 3). Analysis of the results confirms that, again, in the majority of cases, excluding the LOF algorithm on the KDD dataset, reducing the number of features does not degrade the anomaly detection effectiveness of the algorithms taking part in the experiment. Similarly to the previous results, reducing the number of features in a dataset did not result in the expected increase in the quality of the best and worst algorithm indications given by the EM- and MV-based metrics. In the majority of cases these metrics did not indicate the same algorithms as the reference ROC and PR metrics. The only standout dataset is UNSW_VAR7, for which all of the metrics indicated the same algorithm as the best one; however, there was no such unanimous indication of the worst algorithm. Apart from that dataset, only for UGR_VAR5 can a partial match between the metrics be observed, as the EM metric indicated the same best algorithm as the reference metrics. Nonetheless, as in the former experiments, the obtained results do not allow any confirmation that EM- and MV-based metrics can effectively replace PR and ROC metrics, nor that reducing data dimensionality improves the effectiveness of the EM and MV metrics.
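The three feature reduction variants used in the experiments are only named KBest, KNN and VAR; the sketch below shows one plausible scikit-learn realization of each (a univariate KBest filter, a KNN-driven wrapper, and a variance-based filter) on synthetic data. The exact selector configurations are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.neighbors import KNeighborsClassifier

# Toy labeled data standing in for a numeric network-traffic dataset;
# k mirrors the 3/5/7-element subsets used in the experiments.
X, y = make_classification(n_samples=400, n_features=12, n_informative=5,
                           random_state=0)
k = 3

# "KBest": univariate filter ranking features by the ANOVA F-test.
kbest = SelectKBest(f_classif, k=k).fit(X, y)

# "KNN": wrapper selection driven by a k-nearest-neighbours model.
knn_sel = SequentialFeatureSelector(
    KNeighborsClassifier(), n_features_to_select=k).fit(X, y)

# "VAR": keep the k highest-variance features (variance-based filter).
var_idx = np.sort(np.argsort(X.var(axis=0))[-k:])

for name, idx in [("KBest", np.flatnonzero(kbest.get_support())),
                  ("KNN", np.flatnonzero(knn_sel.get_support())),
                  ("VAR", var_idx)]:
    print(name, idx)  # the three methods can pick different subsets
```

Running several selectors in parallel, as done in the study, hedges against exactly this effect: each method can nominate a different feature subset, so conclusions about dimensionality do not hinge on one selector's choices.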


Table 1. Results for KBest feature selection method

| Algorithm | Metric | KDD_KBest3 | KDD_KBest5 | KDD_KBest7 | UNSW_KBest3 | UNSW_KBest5 | UNSW_KBest7 | UGR_KBest3 | UGR_KBest5 |
|---|---|---|---|---|---|---|---|---|---|
| lof | ROC | 0.844 | 0.942 | 0.914 | 0.509 | 0.541 | 0.548 | 0.657 | 0.682 |
| lof | PR | 0.799 | 0.932 | 0.869 | 0.037 | 0.071 | 0.062 | 0.936 | 0.939 |
| lof | EM | 8.008e-05 | 3.872e-06 | 6.735e-07 | 3.297e-03 | 8.018e-05 | 1.232e-05 | 6.888e-05 | 7.108e-07 |
| lof | MV | 1.120e+02 | 2.201e+03 | 1.107e+04 | 1.554e+01 | 1.422e+02 | 1.356e+03 | 1.222e+02 | 1.058e+04 |
| ocsvm | ROC | 0.955 | 0.955 | 0.954 | 0.977 | 0.979 | 0.979 | 0.458 | 0.438 |
| ocsvm | PR | 0.924 | 0.926 | 0.940 | 0.426 | 0.427 | 0.428 | 0.889 | 0.886 |
| ocsvm | EM | 1.669e-04 | 8.490e-06 | 1.210e-06 | 2.209e-04 | 1.851e-05 | 4.141e-06 | 1.497e-04 | 3.439e-06 |
| ocsvm | MV | 5.333e+01 | 1.072e+03 | 7.005e+03 | 3.451e+01 | 9.844e+02 | 8.583e+03 | 4.732e+01 | 2.086e+03 |
| iForest | ROC | 0.956 | 0.956 | 0.950 | 0.991 | 0.992 | 0.992 | 0.561 | 0.562 |
| iForest | PR | 0.949 | 0.956 | 0.952 | 0.517 | 0.580 | 0.633 | 0.908 | 0.909 |
| iForest | EM | 1.453e-04 | 5.024e-06 | 7.371e-07 | 2.542e-02 | 4.059e-03 | 4.563e-04 | 2.300e-04 | 7.536e-06 |
| iForest | MV | 6.008e+01 | 1.859e+03 | 1.353e+04 | 2.791e+01 | 4.686e+02 | 9.048e+02 | 2.611e+01 | 1.228e+03 |
| abod | ROC | 0.698 | 0.831 | 0.954 | 0.903 | 0.926 | 0.931 | 0.658 | 0.712 |
| abod | PR | 0.652 | 0.683 | 0.686 | 0.095 | 0.112 | 0.121 | 0.949 | 0.956 |
| abod | EM | 6.350e-06 | 3.852e-07 | 1.316e-07 | 1.192e-05 | 3.637e-07 | 3.959e-08 | 5.948e-06 | 6.822e-08 |
| abod | MV | 1.287e+02 | 1.779e+03 | 8.292e+03 | 3.657e+01 | 1.006e+03 | 9.184e+03 | 6.439e+01 | 3.984e+03 |
| auto_encoder | ROC | 0.952 | 0.946 | 0.946 | 0.992 | 0.991 | 0.992 | 0.475 | 0.471 |
| auto_encoder | PR | 0.954 | 0.947 | 0.949 | 0.532 | 0.568 | 0.610 | 0.893 | 0.892 |
| auto_encoder | EM | 6.452e-05 | 3.434e-06 | 6.711e-07 | 8.685e-03 | 4.069e-03 | 4.485e-04 | 1.247e-04 | 3.225e-06 |
| auto_encoder | MV | 1.189e+02 | 2.265e+03 | 1.394e+04 | 2.041e+01 | 2.004e+02 | 1.025e+02 | 5.030e+01 | 2.010e+03 |

Table 2. Results for KNN feature selection method

| Algorithm | Metric | KDD_KNN3 | KDD_KNN5 | KDD_KNN7 | UNSW_KNN3 | UNSW_KNN5 | UNSW_KNN7 | UGR_KNN3 | UGR_KNN5 |
|---|---|---|---|---|---|---|---|---|---|
| lof | ROC | 0.483 | 0.481 | 0.792 | 0.599 | 0.602 | 0.613 | 0.612 | 0.673 |
| lof | PR | 0.524 | 0.509 | 0.701 | 0.156 | 0.190 | 0.177 | 0.928 | 0.939 |
| lof | EM | 1.153e-04 | 3.751e-06 | 8.178e-05 | 8.394e-02 | 6.676e-04 | 7.919e-05 | 6.888e-05 | 7.108e-07 |
| lof | MV | 6.116e+01 | 2.518e+03 | 1.159e+03 | 8.800e+00 | 9.968e+02 | 5.846e+03 | 1.110e+02 | 9.333e+03 |
| ocsvm | ROC | 0.956 | 0.924 | 0.954 | 0.935 | 0.928 | 0.905 | 0.420 | 0.428 |
| ocsvm | PR | 0.918 | 0.850 | 0.932 | 0.313 | 0.295 | 0.289 | 0.882 | 0.884 |
| ocsvm | EM | 1.668e-04 | 4.421e-06 | 6.237e-06 | 1.456e-04 | 1.972e-06 | 2.120e-07 | 1.439e-04 | 3.378e-06 |
| ocsvm | MV | 5.369e+01 | 1.802e+03 | 1.387e+03 | 5.578e+01 | 4.591e+03 | 5.213e+04 | 4.703e+01 | 1.971e+03 |
| iForest | ROC | 0.971 | 0.925 | 0.948 | 0.984 | 0.970 | 0.966 | 0.533 | 0.538 |
| iForest | PR | 0.926 | 0.858 | 0.941 | 0.446 | 0.368 | 0.374 | 0.905 | 0.906 |
| iForest | EM | 1.018e-04 | 3.968e-06 | 1.272e-05 | 3.356e-04 | 1.203e-06 | 6.319e-07 | 1.932e-04 | 7.056e-06 |
| iForest | MV | 8.826e+01 | 1.469e+03 | 6.551e+02 | 6.051e+01 | 7.678e+03 | 4.162e+04 | 2.977e+01 | 1.209e+03 |
| abod | ROC | 0.839 | 0.477 | 0.519 | 0.737 | 0.725 | 0.733 | 0.699 | 0.749 |
| abod | PR | 0.841 | 0.495 | 0.665 | 0.410 | 0.415 | 0.412 | 0.956 | 0.957 |
| abod | EM | 5.806e-06 | 1.270e-07 | 1.968e-07 | 4.407e-06 | 3.495e-08 | 4.146e-09 | 5.840e-06 | 6.842e-08 |
| abod | MV | 1.007e+02 | 2.471e+03 | 3.901e+03 | 5.090e+00 | 5.558e+02 | 5.910e+03 | 1.089e+02 | 3.396e+03 |
| auto_encoder | ROC | 0.951 | 0.924 | 0.943 | 0.941 | 0.923 | 0.931 | 0.436 | 0.458 |
| auto_encoder | PR | 0.928 | 0.859 | 0.925 | 0.343 | 0.337 | 0.309 | 0.884 | 0.889 |
| auto_encoder | EM | 1.019e-04 | 2.153e-05 | 1.555e-04 | 2.889e-04 | 1.339e-06 | 1.143e-06 | 1.373e-04 | 3.435e-06 |
| auto_encoder | MV | 9.030e+01 | 5.923e+02 | 5.630e+01 | 4.133e+01 | 5.439e+03 | 3.136e+04 | 5.004e+01 | 1.929e+03 |

Table 3. Results for VAR feature selection method

| Algorithm | Metric | KDD_VAR3 | KDD_VAR5 | KDD_VAR7 | UNSW_VAR3 | UNSW_VAR5 | UNSW_VAR7 | UGR_VAR3 | UGR_VAR5 |
|---|---|---|---|---|---|---|---|---|---|
| lof | ROC | 0.877 | 0.952 | 0.958 | 0.570 | 0.952 | 0.704 | 0.631 | 0.688 |
| lof | PR | 0.859 | 0.941 | 0.946 | 0.043 | 0.941 | 0.124 | 0.935 | 0.945 |
| lof | EM | 2.248e-05 | 2.152e-07 | 2.934e-09 | 5.572e-14 | 2.152e-07 | 2.608e-28 | 6.421e-05 | 6.314e-07 |
| lof | MV | 2.897e+02 | 3.474e+04 | 2.188e+06 | 5.331e+13 | 3.474e+04 | 2.818e+25 | 9.872e+01 | 8.846e+03 |
| ocsvm | ROC | 0.963 | 0.945 | 0.946 | 0.758 | 0.945 | 0.754 | 0.577 | 0.442 |
| ocsvm | PR | 0.942 | 0.904 | 0.904 | 0.059 | 0.904 | 0.058 | 0.925 | 0.887 |
| ocsvm | EM | 4.284e-05 | 5.254e-07 | 9.881e-09 | 3.687e-15 | 5.254e-07 | 2.341e-26 | 1.075e-04 | 2.114e-06 |
| ocsvm | MV | 2.041e+02 | 1.451e+04 | 7.567e+05 | 5.187e+12 | 1.451e+04 | 7.253e+24 | 6.037e+01 | 3.201e+03 |
| iForest | ROC | 0.970 | 0.967 | 0.971 | 0.713 | 0.967 | 0.987 | 0.612 | 0.579 |
| iForest | PR | 0.963 | 0.965 | 0.968 | 0.049 | 0.965 | 0.582 | 0.936 | 0.921 |
| iForest | EM | 2.444e-05 | 2.356e-07 | 7.601e-09 | 5.473e-14 | 2.356e-07 | 2.776e-26 | 1.502e-04 | 4.597e-06 |
| iForest | MV | 3.634e+02 | 3.155e+04 | 8.926e+05 | 5.063e+12 | 3.155e+04 | 7.051e+23 | 4.344e+01 | 1.513e+03 |
| abod | ROC | 0.749 | 0.835 | 0.961 | 0.555 | 0.492 | 0.852 | 0.761 | 0.745 |
| abod | PR | 0.662 | 0.777 | 0.916 | 0.238 | 0.059 | 0.124 | 0.961 | 0.959 |
| abod | EM | 1.934e-06 | 3.801e-07 | 2.617e-09 | 3.417e-18 | 3.801e-07 | 2.565e-26 | 1.312e-04 | 5.492e-06 |
| abod | MV | 2.465e+02 | 1.221e+04 | 2.248e+06 | 1.100e+14 | 1.221e+04 | 1.981e+26 | 5.359e+01 | 2.998e+03 |
| auto_encoder | ROC | 0.960 | 0.939 | 0.940 | 0.752 | 0.939 | 0.987 | 0.456 | 0.468 |
| auto_encoder | PR | 0.950 | 0.892 | 0.911 | 0.057 | 0.892 | 0.509 | 0.892 | 0.889 |
| auto_encoder | EM | 2.861e-05 | 3.707e-07 | 6.426e-09 | 9.741e-16 | 3.707e-07 | 2.671e-26 | 7.875e-05 | 1.710e-06 |
| auto_encoder | MV | 3.003e+02 | 1.974e+04 | 1.068e+06 | 1.035e+13 | 1.974e+04 | 2.476e+24 | 7.453e+01 | 3.823e+03 |


4.3 Results Discussion

Despite testing several feature selection methods and limiting the data to a small number of dimensions, the results obtained do not allow us to conclude that, for the selected algorithms, effective validation is possible using methods that do not require labels. Only in one case did all validation methods point to the same anomaly detection algorithm as the best one. At the same time, it turned out that in most cases reducing the number of dimensions did not cause the ROC and PR values to deviate significantly from those obtained using all the features available in the datasets. It is therefore possible to maintain comparable classification quality even when reducing the number of dimensions, but this does not increase the reliability of using EM and MV curves to assess classification performance. The research carried out so far does not explain the significantly worse results in tests on benchmark datasets for the problem of detecting threats in ICT traffic. The results presented here show that the number of dimensions is not crucial in this case, as reducing it did not improve the agreement with the label-based validation methods. The problem requires additional research to determine whether the data normalization method used in the preprocessing step plays a key role in the results obtained by the EM and MV algorithms. It may be that, with the normalization methods presented in the original research, it is impossible to process every dataset related to the anomaly detection problem; this would lead to the need to adjust the preprocessing methods and would be a considerable limitation for practical application, especially considering the results obtained by the ROC and PR methods. It should also be noted that this would require appropriate measurements for each deployment in any network, and adequate comparisons would only be possible with labeled validation data, which would make practical use impossible.

5 Summary

This paper discussed research aimed at verifying whether alternative evaluation criteria for anomaly detection methods, based on Excess-Mass (EM) and Mass-Volume (MV) curves, can be effectively utilized for low-dimensional data and whether they are susceptible to data dimensionality. The research compared standard ROC and PR curves with EM- and MV-based curves in order to assess whether the EM- and MV-based metrics indicate the same best and worst algorithms for network traffic anomaly detection as the typically used ROC- and PR-based metrics. Five different algorithms were assessed with these metrics. The research, a continuation of previous work which showed that EM- and MV-based metrics are not suitable for high-dimensional datasets, focused on verifying whether these metrics can be effectively utilized for lower-dimensional datasets, as stated in the original article [1]. The experiments reported in this paper used three different feature reduction methods and three different datasets to verify whether preprocessing based on dimensionality reduction can increase the effectiveness of EM- and MV-based metrics and enable their applicability in real-life scenarios.


The results of the experiments show that reducing dataset dimensionality did not yield the expected improvement in how well the EM- and MV-based metrics indicate the best algorithms. This leads to the conclusion that while these metrics can be susceptible to data dimensionality, additional data preprocessing, not specified in the original article, is probably required to enable effective use of such quality measures in the development of unsupervised anomaly detection methods for network intrusion detection systems. Future research will examine commonly used data preprocessing methods to verify whether any of them can improve the quality of indication provided by the EM- and MV-based metrics enough to enable their use in the practical development of unsupervised network intrusion detection methods.

References

1. Warzyński, A., Falas, Ł., Schauer, P.: Excess-mass and mass-volume anomaly detection algorithms applicability in unsupervised intrusion detection systems. In: 2021 IEEE 30th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), pp. 131–136 (2021)
2. Clémençon, S., Jakubowicz, J.: Scoring anomalies: an M-estimation formulation. In: AISTATS 2013: 16th International Conference on Artificial Intelligence and Statistics, Scottsdale, AZ, USA, pp. 659–667. hal-00839254 (2013)
3. Clémençon, S., Thomas, A.: Mass volume curves and anomaly ranking. Electron. J. Statist. 12(2), 2806–2872 (2018). https://doi.org/10.1214/18-EJS1474
4. Goix, N., Sabourin, A., Clémençon, S.: On anomaly ranking and excess-mass curves. In: AISTATS (2015)
5. Goix, N.: How to evaluate the quality of unsupervised anomaly detection algorithms? arXiv:1607.01152 (2016)
6. Breunig, M., Kriegel, H.-P., Ng, R., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 93–104 (2000)
7. Mukkamala, S., Janoski, G., Sung, A.: Intrusion detection using neural networks and support vector machines. In: Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), pp. 1702–1707 (2002)
8. Liu, F.T., Ting, K.M., Zhou, Z.: Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422 (2008). https://doi.org/10.1109/ICDM.2008.17
9. Kriegel, H.-P., et al.: Angle-based outlier detection in high-dimensional data. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 444–452. ACM (2008)
10. NSL-KDD: NSL-KDD data set for network-based intrusion detection systems (2009). http://iscx.cs.unb.ca/NSL-KDD/
11. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP 99 data set. In: Proceedings of the 2nd IEEE International Conference on Computational Intelligence for Security and Defense Applications, pp. 53–58. IEEE Press, USA (2009)
12. Maciá-Fernández, G., Camacho, J., Magán-Carrión, R., García-Teodoro, P., Therón, R.: UGR'16: a new dataset for the evaluation of cyclostationarity-based network IDSs. Comput. Secur. 73, 411–424 (2018)
13. Nour, M., Slay, J.: UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In: Military Communications and Information Systems Conference (MilCIS). IEEE (2015)

A Homomorphic Encryption Approach for Privacy-Preserving Deep Learning in Digital Health Care Service

Tuong Nguyen-Van, Thanh Nguyen-Van, Tien-Thinh Nguyen, Dong Bui-Huu, Quang Le-Nhat, Tran Vu Pham, and Khuong Nguyen-An

1 Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam
{1614028,1513056,ntthinh,ptvu,nakhuong}@hcmut.edu.vn
2 Vietnam National University, Ho Chi Minh City, Ho Chi Minh City, Vietnam
3 FPT University HCMC, Ho Chi Minh City, Vietnam
[email protected]
4 Dinovative, Ho Chi Minh City, Vietnam

Abstract. Applying deep learning technology in digital health care services is a potential way to tackle many issues that hospitals face, such as excessive health care requests, a lack of doctors, and patient overload. But a conventional deep learning model needs to compute on raw medical data to evaluate health information, which raises considerable concern about data privacy. This paper proposes an approach using homomorphic encryption to encrypt the raw data to protect privacy, while deep learning models can still perform computations over the encrypted data. This approach can be applied to almost any digital health care service in which data providers want to ensure that no one can use their data without permission. We focus on a particular use case (predicting mental health based on phone usage routine) to demonstrate the approach's applicability. Our encrypted model's accuracy is similar to the non-encrypted model's (only 0.01% difference), and it has practical performance.

Keywords: Data privacy · Privacy-preserving · Homomorphic encryption · E-health · Intelligent information · Intelligent system · Neural networks

1 Introduction and Cryptographic Backgrounds

Nowadays, smartphones are an indispensable part of our lives and make them more convenient. Besides their benefits, smartphones also have some negative impacts on our health. Both adults and children are affected; children are considered more susceptible to those impacts if their parents do not oversee them [3,6]. This dilemma is a huge concern for parents and social scientists.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 520–533, 2022. https://doi.org/10.1007/978-3-031-21967-2_42


To the best of our knowledge, several commercial applications have been developed to help parents tackle this problem (e.g., KidsGuard Pro and FamilyTime; see https://www.clevguard.com/parental-control/parental-control-applications/). They allow parents to monitor how their children use phones and to gain various insights from the phone usage. Those insights can be disease predictions, mental health scores, or a negative-behavior ratio, produced by taking advantage of deep learning technology and medical expertise. But these applications still have some issues. Several studies [7] indicated that parental control applications annoy children, especially teenagers, because they feel the applications are overly invasive and observant of their privacy and their lives. This monitoring problem can lead to negative relationships in the family. Furthermore, there is always the risk that someone will exploit the collected data for financial purposes. For example, any third party who somehow obtains the data can sell it to insurance or pharmaceutical companies. Because the data is always identifiable, encrypting it is necessary. But we cannot apply deep learning models to conventionally encrypted data, so we adopt a particular class of encryption schemes, homomorphic encryption, which allows us to run deep learning on encrypted data.

Our Contributions: Using a suitable homomorphic encryption scheme, we propose a general architecture in which the system first encrypts the raw data. Suitable deep learning models can then compute on and analyze the encrypted data to monitor health status while preserving individual privacy. In realizing our proposal, we construct, implement, and evaluate a non-invasive application that allows parents to monitor their children's well-being scores effectively and ensures that no one can exploit the children's private data. The application automatically collects the children's data, encrypts it with the CKKS homomorphic encryption scheme [2], and sends the ciphertext to the server. The server computes the child's health score with a deep learning model built upon the homomorphic encryption scheme and trained with the help of an experienced team of child psychologists and social scientists. The encrypted output is then sent to the parents, who decrypt it locally to obtain their child's health score. The application should satisfy the following properties:

– Accuracy: with homomorphic encryption applied, the application outputs predictions with only a slight difference compared to the unencrypted version.
– Privacy: no third party can reveal the data fed into the application without the secret key.
– Performance: the app's performance is similar to that of the regular (non-homomorphic) app.

Structure of the Paper: Section 1 provides a brief motivation, reviews some proposed solutions for protecting privacy in e-health applications, and presents background knowledge on homomorphic encryption and the CKKS scheme that


we use in this application. Section 2 describes the application in detail. The implementation and evaluation of the solution are presented in Sect. 3 and Sect. 4, respectively. Finally, Sect. 5 concludes this paper. The rest of this section presents the underlying homomorphic encryption and the CKKS scheme, which are the building blocks of our service.

1.1 Homomorphic Encryption

The term homomorphic encryption (HE) describes a class of encryption algorithms that enable one to perform operations (i.e., additions or multiplications) on encrypted data. Let (M, C, K, E, D) be an encryption scheme, where M and C are the message space and the ciphertext space, K is the key space, and E, D are the encryption and decryption algorithms. Suppose the message space forms a group (M, ⋆) and the ciphertext space forms a group (C, ◦). Then, for a key pair (pk, sk) ∈ K², pt ∈ M, ct ∈ C, the encryption algorithm E : M → C is a mapping from the group M to the group C under the public key pk ∈ K, and the reverse map D : C → M is a mapping from the group C to the group M under the secret key sk ∈ K. Such an encryption scheme (M, C, K, E, D) is called homomorphic if it satisfies

  E(m₁) ◦ E(m₂) = E(m₁ ⋆ m₂), ∀ m₁, m₂ ∈ M,

where E is the encryption algorithm and M is the set of all possible messages. To build an encryption scheme that allows arbitrary functions on the ciphertext, only addition and multiplication operations are necessary, since addition and multiplication form a complete set of operations over finite sets. An HE scheme supporting both addition and multiplication is called fully homomorphic encryption (FHE). The concept of an FHE scheme was first suggested in 1978 [12] but remained an open question because of performance problems. Three major branches of homomorphic encryption schemes have formed [11], based on lattices, integers, and learning-with-errors (LWE) or ring-learning-with-errors (RLWE). The CKKS scheme is based on RLWE and has practical performance, so we use CKKS in this approach.
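As a minimal, self-contained illustration of this homomorphic property — a toy unrelated to the CKKS scheme used later, with insecure demonstration-sized parameters of our own choosing — unpadded ("textbook") RSA is multiplicatively homomorphic:

```python
# Toy demonstration of E(m1) ∘ E(m2) = E(m1 ⋆ m2) using unpadded ("textbook")
# RSA, which is multiplicatively homomorphic. The tiny parameters below are
# for illustration only and are not secure.
n, e = 3233, 17        # n = 61 * 53, public exponent
d = 2753               # private exponent, e * d ≡ 1 (mod φ(n))

def E(m: int) -> int:  # encryption: M -> C
    return pow(m, e, n)

def D(c: int) -> int:  # decryption: C -> M
    return pow(c, d, n)

m1, m2 = 7, 11
# multiplying ciphertexts corresponds to multiplying plaintexts mod n
assert D(E(m1) * E(m2) % n) == (m1 * m2) % n
```

Here both the message group and the ciphertext group use multiplication modulo n, so the "◦" and "⋆" of the definition happen to coincide; in CKKS they are the ring operations on polynomials.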

1.2 CKKS

CKKS is an approximate HE scheme [2] proposed by Cheon, Kim, Kim and Song (CKKS) in 2017. Approximate HE means that the plaintext obtained by the decryption process is not necessarily the original plaintext, but the original with some small error added. In practice, data always contains some noise relative to its actual values for many reasons, such as environmental noise or the limited number of bits with which a computer system can represent a floating-point number. So CKKS' key idea is to view the encryption noise as part of the error that occurs during approximate computations, which improves the HE scheme's performance. In CKKS, the message


space M is a subset of the complex numbers, and the ciphertext space C and key space K are sets of pairs of polynomials. The encryption ct of a plaintext m under the secret key sk forms a decryption structure

  D(ct) = ⟨ct, sk⟩ = m + e (mod q),

where e is a small error added to guarantee the security of the RLWE hardness assumption (Fig. 1a) and q is the ciphertext modulus. Note that the plaintext is small compared to the ciphertext size. In the encryption process, the plaintext is multiplied by a scale factor to reduce the precision loss; since e is small enough, the value m′ = m + e approximates the original number. However, the error grows exponentially with the number of homomorphic multiplication operations. To tackle this problem, CKKS applies a rescaling technique that reduces the size of the ciphertext at each level of multiplication (see Fig. 1b).

(a) Decryption structure

(b) Homomorphic multiplication and rescaling for approximate arithmetic

Fig. 1. Homomorphic Encryption [2]
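The scale-factor and rescaling idea can be sketched with plain fixed-point arithmetic — a simplification of ours that ignores encryption entirely; the toy scale Δ = 2¹⁰ and the function names are assumptions for illustration:

```python
DELTA = 2 ** 10  # toy scale factor, standing in for CKKS' Δ

def encode(x: float) -> int:
    # scale and round, mimicking how CKKS encodes approximate numbers
    return round(x * DELTA)

def decode(c: int) -> float:
    return c / DELTA

a, b = encode(1.5), encode(2.25)
# a product of two encodings carries scale DELTA**2; "rescaling"
# divides by DELTA once to bring it back to a single scale
prod = (a * b) // DELTA
assert abs(decode(prod) - 1.5 * 2.25) < 1e-2
```

Just as in CKKS, the rounding steps introduce a small error in the least significant bits, and without the division by Δ after each product the magnitude of the representation would grow exponentially with the multiplicative depth.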

Encode - Decode. For 2N a power of two, let R[X]/(Φ_{2N}(X)) be the quotient ring of polynomials modulo the (2N)-th cyclotomic polynomial. The encoding/decoding techniques are based on the isomorphism

  σ : R[X]/(Φ_{2N}(X)) → C^N,  [g]_{Φ_{2N}} ↦ (g(ζ), g(ζ^3), ..., g(ζ^{2N−1})),

where ζ = e^{−πi/N} is the principal 2N-th root of unity, and ζ, ζ^3, ..., ζ^{2N−1} are the roots of Φ_{2N}. We note that |Z*_{2N}| = N; therefore we denote

  H = {(z_j)_{j ∈ Z*_{2N}} : z_{−j} = conj(z_j), ∀ j ∈ Z*_{2N}} ⊆ C^N,

and let T be a subgroup of the multiplicative group Z*_{2N} satisfying Z*_{2N}/T = {−1, 1}. The decoding technique first turns a plaintext polynomial m(X) ∈ R =


Z[X]/(X^N + 1), which satisfies the norm bound ‖m‖_∞ < q (meaning all coefficients of the polynomial m(X) are smaller than q), into a complex vector (z_j)_{j ∈ Z*_{2N}} ∈ H using the canonical embedding map σ, and then projects it to a message vector (z_j)_{j ∈ T} ∈ C^{N/2} by the natural projection π : H → C^{N/2}. In the encoding process we have to round off the plaintext polynomial coefficients, but otherwise encoding is the inverse of the decoding procedure. Writing ⌊·⌉_R for rounding to the closest element of R, the encoding is

  C^{N/2} --π^{−1}--> H --σ^{−1}--> R[X]/(Φ_{2N}(X)) --⌊·⌉_R--> R,
  z = (z_j)_{j ∈ T} ↦ π^{−1}(z) ↦ σ^{−1}(π^{−1}(z)) ↦ ⌊σ^{−1}(π^{−1}(z))⌉_R.

Leveled Homomorphic Encryption Scheme. Because of the error-size issue mentioned above, the ciphertext modulus should be reduced by rescaling after each multiplication; hence CKKS controls the level of homomorphic operations to bound the precision loss. Specifically, if we evaluate a function of multiplicative depth L without rescaling the output, the bit size of the output grows exponentially with L. CKKS divides the intermediate values by a base to tackle this issue, so some inaccurate least significant bits (LSBs) can be discarded while keeping only a minor encryption error. CKKS maintains almost the same message size and makes the required ciphertext modulus linear in L. We fix a base q₀ > 0 and let q_ℓ = q₀ · Δ^ℓ for 1 ≤ ℓ ≤ L. We denote by R_q = R/qR the residue ring of R modulo an integer q. Let χ_s, χ_r and χ_e be Gaussian distributions on R for secret, encryption, and error polynomials with small coefficients, respectively. Choose an integer P.

– Key generation: KeyGen()
  • Sample a secret polynomial s ← χ_s.
  • Sample a (resp. a′) uniformly at random from R_{q_L} (resp. R_{P·q_L}), and e, e′ ← χ_e.
  • Set the secret key sk ← (1, s), the public key pk ← (b = −as + e, a) ∈ R²_{q_L}, and the evaluation key evk ← (b′ = −a′s + e′ + P·s², a′) ∈ R²_{P·q_L}.

– Encryption: E_pk(m)
  • Sample an ephemeral secret polynomial (used once, for masking) r ← χ_r and e₀, e₁ ← χ_e.
  • For a plaintext polynomial m ∈ R, output the ciphertext ct ← (c₀ = rb + e₀ + m, c₁ = ra + e₁) ∈ R²_{q_L}.
– Decryption: D(sk, ct)
  Output the plaintext m′ ← ⟨ct, sk⟩ (mod q_ℓ). The decryption output is an approximate value of the plaintext, i.e., D(sk, E_pk(m)) ≈ m.


– Homomorphic Addition: Add(ct, ct′)
  Given two ciphertexts ct, ct′ ∈ R²_{q_ℓ}, output ct_add ← ct + ct′ ∈ R²_{q_ℓ}. Correctness holds as D(sk, ct_add) ≈ D(sk, ct) + D(sk, ct′).
– Constant Multiplication: CMult(c, ct)
  Given a ciphertext ct = (c₀, c₁) and a constant c ∈ R, output ct_mult ← c · ct ∈ R²_{q_ℓ}.
– Homomorphic Multiplication: Mult(ct, ct′)
  • Given two ciphertexts ct = (c₀, c₁) and ct′ = (c₀′, c₁′), let (d₀, d₁, d₂) = (c₀c₀′, c₀c₁′ + c₁c₀′, c₁c₁′) (mod q_ℓ).
  • Output

    ct_mult ← (d₀, d₁) + ⌊P^{−1} · d₂ · evk⌉ ∈ R²_{q_ℓ}.

  Correctness holds as D(sk, ct_mult) ≈ D(sk, ct) · D(sk, ct′).
– Rescaling: Rescale(ct)
  Given a ciphertext ct ∈ R²_{q_ℓ} and a new modulus q_{ℓ−1} < q_ℓ, output the re-scaled ciphertext ct_rs = ⌊(q_{ℓ−1}/q_ℓ) · ct⌉ ∈ R²_{q_{ℓ−1}}.
– Rotation: Rotate(ct, k)
  • The encrypted plaintext vector is cyclically shifted by k slots.
  • The Galois group Gal = Gal(Q(ζ)/Q) consists of the automorphisms κ_k : m(X) ↦ m(X^k) (mod Φ_{2N}(X)), for a polynomial m(X) ∈ R and k co-prime with 2N. The automorphisms κ_k are very useful for permuting a vector of plaintext values.

Optimization. In the leveled HE scheme described above, we have to compute arithmetic operations modulo a large integer (log(q) > L · log(Δ), where L is the depth of the circuit), which is very expensive. To increase the performance of the scheme, we set Q_ℓ = q₀q₁q₂···q_ℓ, 1 ≤ ℓ ≤ L, for distinct coprime q₀, q₁, ..., q_ℓ and use the CRT (Chinese Remainder Theorem) form for efficient computation, where q_j ≈ Δ, so log(q_j) ≈ log(Δ). This representation allows us to compute arithmetic operations in parallel, hence increasing overall performance. In the rescaling procedure, we reduce the ciphertext modulus from Q_ℓ down to Q_{ℓ−1} = Q_ℓ/q_ℓ, which maintains the structure of Q_{ℓ−1} and the correctness of the rescaling procedure (q_ℓ ≈ Δ).
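The benefit of the CRT form can be sketched in a few lines of plain Python; the toy moduli below are our own choice (real CKKS primes are tens of bits wide):

```python
# Toy sketch of CRT/RNS arithmetic: instead of multiplying large integers
# modulo the big modulus Q = q0*q1*q2, we work on the small residues
# independently -- this is the parallelism the optimization exploits.
qs = [97, 101, 103]          # pairwise-coprime toy moduli
Q = qs[0] * qs[1] * qs[2]

x, y = 123456, 654321

# multiply residue-wise (each entry is an independent small computation)
res = [(x % q) * (y % q) % q for q in qs]

# the residues agree with the product taken modulo the big Q
assert all(r == (x * y % Q) % q for r, q in zip(res, qs))
```

Dropping one modulus from the list is exactly the rescaling step Q_ℓ → Q_{ℓ−1} = Q_ℓ/q_ℓ: the remaining residues keep their CRT structure untouched.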

2 The Proposed Solution

2.1 System Architecture

The whole system consists of three parties: the parents' app, the child's app, and the server. The system flow is presented in Fig. 2. First, the parents' app performs some setup steps to generate the HE secret key, public key, and evaluation key. The public key is sent to the child's phone locally, and the evaluation key is sent to the server (1). The child's app collects data and encrypts it with the public key, then sends the encrypted data to the server (2). The server cannot learn the child's data without the secret key. Using the evaluation key on a pre-trained HE neural network, the server predicts the well-being score; the score is therefore encrypted, and it is sent to the parents' phone (3). Only the parents can decrypt the encrypted well-being score, using the secret key (4).

On the server side, the neural network model is composed of two fully connected layers. The model takes the input data and outputs a well-being score between 0 and 10, where a higher score indicates better health. For an input feature vector x of length n, the model is

  f(x) = b₂ + W₂ · s(b₁ + W₁x),

where b₁ ∈ Rⁿ, b₂ ∈ R are bias vectors, W₁ ∈ R^{n×n}, W₂ ∈ R^{1×n} are the weight matrices of each layer, and s : Rⁿ → Rⁿ is the element-wise square activation function, s(a) = a². Because the HE scheme supports only addition and multiplication, non-linear activation functions (like the sigmoid or rectified linear unit) must be approximated with low-degree polynomials [16] to make homomorphic evaluation feasible. The square activation function provides high inference accuracy when it replaces non-linear activations in a low-depth machine learning model; the authors of [10] presented a theoretical analysis of polynomial activation functions in neural networks and showed that the square activation works quite well.
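In the clear (i.e., without encryption), the forward pass above amounts to a few lines of NumPy; the shapes and weight values below are made up purely for illustration:

```python
import numpy as np

def wellbeing_model(x, W1, b1, W2, b2):
    # first dense layer followed by the element-wise square activation s(a) = a^2
    h = (W1 @ x + b1) ** 2
    # second dense layer reduces to a single (scalar) well-being score
    return float(W2 @ h + b2)

# tiny example with n = 2 features and hand-picked weights
x = np.array([1.0, 2.0])
W1, b1 = np.eye(2), np.zeros(2)
W2, b2 = np.array([[1.0, 1.0]]), np.zeros(1)
print(wellbeing_model(x, W1, b1, W2, b2))  # 1^2 + 2^2 = 5.0
```

The same two matrix products, one element-wise squaring, and two additions are what the server later evaluates homomorphically over CKKS ciphertexts.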

Fig. 2. System architecture

2.2 Data Selection and Features

We collect data about the total time spent on phone apps and about sleep to predict the well-being score, since different studies suggest an association between the use of social media and depressive symptoms [9,15]. We define three categories of apps (social media, education, entertainment) and divide each day into three periods:

– school hours: 7:00–16:00,
– home hours: 16:00–22:00,
– sleep hours: 22:00–7:00.

The collected total time spent is thus split into nine features. These data help us derive valuable insights relating to children's health: children may use social media or entertainment apps (games, music, etc.) in their evening free time, but not during school or sleeping time.

The sleep cycle also plays a primary role in adolescent well-being; several studies show that good sleep affects the youth's brain and behavior (see [4]). We collect the time the child sleeps and the delay, i.e., how late the child stays up (a negative number if the child falls asleep earlier than usual). This pair is stored locally for three days, and each day we send the past three nights of sleep data to the model. Many factors can influence the health insight drawn from sleep data (e.g., a child may stay up late one night to finish homework, which is fine, but going to bed late every night has a negative effect), so we need longer-term patterns.

Each day, these data provide 15 features: nine from application usage and six from the sleeping data of the three previous days. We use the sleeping data for the labeling procedure and the nine phone usage features for training and predicting health scores.
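A minimal sketch of bucketing a usage timestamp into the three daily periods (the helper name `period_of` is our own):

```python
from datetime import datetime

def period_of(ts: datetime) -> str:
    # map an app-usage timestamp to one of the three daily periods
    h = ts.hour
    if 7 <= h < 16:
        return "school"   # school hours: 7:00-16:00
    if 16 <= h < 22:
        return "home"     # home hours: 16:00-22:00
    return "sleep"        # sleep hours: 22:00-7:00

print(period_of(datetime(2022, 11, 28, 8, 30)))  # school
```

Summing usage durations per (app class, period) pair then yields the nine phone-usage features described above.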

3 Implementation

3.1 Data Preparation

To implement the application, we use an open real-life dataset that is public for research purposes: the "Tsinghua App Usage Dataset" (http://fi.ee.tsinghua.edu.cn/appusage/), which has been used for predicting smartphone application usage by points of interest [17]. The dataset contains fields such as user identification (anonymized), timestamps, packet length, location, and application categories (crawled from the Android Market and Google Play). Some dataset statistics are presented in Table 1; the files and fields that we use in this implementation are listed in Table 2.


Table 1. Dataset statistics

  Duration:                     7 days
  Location:                     one of the biggest cities in China
  Number of identified apps:    2,000
  Number of users:              1,000
  Number of app categories:     20
  Number of app usage records:  4,171,950

Table 2. Files and descriptions

  App Usage Trace.txt:  User ID | Timestamp | Location | Used App ID | Traffic (Byte)
  App2Category.txt:     App ID | Cat ID
  Categories.txt:       Cat ID | Name

In Subsect. 2.2 we defined the features used by our app; we now extract them from the dataset. First, we map the 20 application categories of the public dataset onto our three classes: social media, education, and entertainment (Table 3). After that, we add the class field to the main data frame (App Usage Trace.txt) based on App ID (App2Category.txt), Cat ID (Categories.txt), and our classification.

Table 3. App categories classification

  Cat ID | Name                                               | Class
  0      | 'Utilities'                                        | Entertainment
  1      | 'Games'                                            | Entertainment
  2      | 'Entertainment'                                    | Entertainment
  3      | 'News'                                             | Social
  4      | 'Social Networking', 'wechat', 'linkedin', 'weibo' | Social
  5      | 'Shopping'                                         | Entertainment
  6      | 'Finance'                                          | Education
  7      | 'Business'                                         | Education
  8      | 'Travel'                                           | Education
  9      | 'Lifestyle', 'meituan', '?'                        | Education
  10     | 'Education'                                        | Education
  11     | 'Health&Fitness'                                   | Education
  12     | 'infant&mom'                                       | Education
  13     | 'Navigation'                                       | Education
  14     | 'Weather'                                          | Education
  15     | 'Music'                                            | Entertainment
  16     | 'References'                                       | Education
  17     | 'Books'                                            | Education
  18     | 'Photo&Video'                                      | Entertainment
  19     | 'Sports'                                           | Education

Then, to extract our features from the data frame, we use


the timestamp of each packet trace to figure out in which period the application is used (school, evening, or sleep time) and the duration of usage. The 3-day sleep pattern can be inferred from application usage during sleep time.

We now have a feature dataset containing 15 data columns, but we cannot train a supervised learning model on unlabeled data, so we must create a labeling procedure that assigns an accurate health score to each feature set. As discussed in Subsect. 2.2, quality sleep helps to protect both physical and mental health. Adolescents in particular have a physiological need for more sleep [14], and insufficient sleep quantity and quality are correlated with severe problems in many health aspects [5]. Besides, multiple problems cause inadequate sleep, such as school stress, family relationships, and digital media, which is our concern here. So we use the six sleep-pattern attributes in the feature set to calculate the well-being score. A systematic review [13] of the functional consequences of sleep problems in adolescents showed that sleep disturbance and later bedtimes negatively affect psycho-social health, school performance, and mental health status, and are associated with loneliness, anxiety, and depression. Table 4 reflects recent American Academy of Sleep Medicine (AASM) recommendations³ that the American Academy of Pediatrics (AAP) has endorsed. We classify our sleep data into score classes based on the AASM recommendations for teenagers (Table 5).

Table 4. American Academy of Sleep Medicine (AASM) recommendations

  Age                           | Recommended amount of sleep
  Infants aged 4–12 months      | 12–16 h a day (including naps)
  Children aged 1–2 years       | 11–14 h a day (including naps)
  Children aged 3–5 years       | 10–13 h a day (including naps)
  Children aged 6–12 years      | 9–12 h a day
  Teens aged 13–18 years        | 8–10 h a day
  Adults aged 18 years or older | 7–8 h a day

Table 5. Sleep data score — (a) duration of sleep → score; (b) bedtime delay → score.

³ https://www.nhlbi.nih.gov/health-topics/sleep-deprivation-and-deficiency.


We add up the three days of sleep data to get total_score, then compute the well-being score from it by the formula

  wellbeing_score = (total_score − min_score) / (max_score − min_score) × 10,

where max_score is the maximum possible sum of scores and min_score is the minimum possible sum of scores.
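The min–max normalization above reads directly as code (the function name and the example score bounds are ours):

```python
def wellbeing_score(total_score: float, min_score: float, max_score: float) -> float:
    # min-max normalise the summed 3-day sleep score onto a 0-10 scale
    return (total_score - min_score) / (max_score - min_score) * 10

# example: a summed score of 3 on a hypothetical [-6, 6] range
print(wellbeing_score(3, -6, 6))  # 7.5
```

The bounds min_score and max_score follow from the per-night score classes of Table 5 summed over three nights.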

3.2 Training Model

We implement the training of the neural network using Keras. We split the labeled data into a training dataset (4,286 records) and a test dataset (1,100 records). After training with cross-validation (holding out 20% of the training dataset), we obtain a reasonably good model (RMSE = 2.11, MAE = 1.737 on the test dataset).

3.3 Predicting Model

For the predicting model, we create a new neural network using CKKS homomorphic encryption and the model weights obtained from the training process. Several cryptographic libraries implement CKKS, such as HEAAN (C++), Microsoft SEAL (C++), and Lattigo (a lattice-based cryptographic library in Golang). In this paper, we use the Python wrapper for Microsoft SEAL to take advantage of the rich support for machine learning libraries.

The first step of the implementation is initializing the CKKS context, the key set, and the other necessary objects, and sending each to the party entitled to use it. In this application, we set the polynomial modulus degree to N = 2¹³. That is the minimum polynomial modulus degree allowed by the SEAL library for our security parameters, and it is more than enough for the 9 features that need to be encrypted. Since we evaluate a circuit of depth three, we set the coefficient modulus Q to [60, 30, 30, 30, 60] bits (210 bits in total): a 60-bit base prime, three 30-bit primes for rescaling, and a 60-bit prime for the modulus-switching technique. This setting step is crucial for the precision and the performance of the HE implementation. With this setting, the security level is 128 bits [1] and the precision is 30 bits.

Next, we encode the input data vector and the model weights from the training process. Then we implement the HE neural network circuit with a matrix-vector product, element-wise squaring of a vector, a vector-vector product, and a vector-vector sum. All of the circuit can be implemented directly with CKKS operations except the matrix-vector and vector-vector products: because CKKS operations act element-wise on plaintext and ciphertext vectors, the usual matrix-vector product cannot be used. We implement matrix-vector multiplication using the diagonal method [8], a parallel "systolic" multiplication algorithm that supports single-instruction, multiple-data (SIMD) computation.
The idea is to decompose the matrix into diagonal vectors and multiply each by a cyclically rotated copy of the vector. In detail, denote the matrix by A and the vector by v, with


product w = Av. We represent the matrix A by n vectors d₀, d₁, ..., d_{n−1}, where d_i = (A_{0,i}, A_{1,i+1}, ..., A_{n−1,i−1}), so d_i[j] = A_{j,j+i} (column indices taken mod n). We can then compute

  w ← Σ_{i=0}^{n−1} d_i × (v ≪ i),

where (v ≪ i) denotes v rotated left by i slots. To see that this gives the right answer, note that the j-th entry of the result is

  w[j] = Σ_{i=0}^{n−1} d_i[j] · (v ≪ i)[j] = Σ_{i=0}^{n−1} A_{j,j+i} · v[j+i] = Σ_{k=0}^{n−1} A_{j,k} · v[k].

Figure 3a shows an example of the matrix-vector diagonal method with a 3 × 3 matrix and a 3 × 1 vector.

(a) Matrix-vector diagonal method (3 × 3 and 3 × 1)

(b) Vector-vector product, sum step (N = 8)

Fig. 3. Training operations
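Outside the encrypted domain, the diagonal method can be checked with NumPy, with `np.roll` standing in for the homomorphic Rotate operation — a plaintext sketch of ours, not SEAL code:

```python
import numpy as np

def rotate(v, i):
    # cyclic left-rotation by i slots, mimicking HE slot rotation
    return np.roll(v, -i)

def diag_matvec(A, v):
    # diagonal ("systolic") method: w = sum_i d_i * (v << i),
    # with d_i[j] = A[j, (j + i) mod n]
    n = A.shape[0]
    w = np.zeros(n)
    for i in range(n):
        d_i = np.array([A[j, (j + i) % n] for j in range(n)])
        w += d_i * rotate(v, i)
    return w

A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, 1.0])
print(diag_matvec(A, v))  # [3. 7.], matching A @ v
```

In the HE version, each `d_i` is an encoded plaintext vector, the rotation is the CKKS Rotate operation, and each product is followed by a rescale.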

After each multiplication, we must rescale the result so that the noise does not blow up and the ciphertext stays in a decryptable form, and we have to normalize the scales to a common one before combining operands. The vector-vector product has the same problem as the matrix-vector product, because we can only compute element-wise operations on CKKS-encrypted vectors. Similar to the matrix-vector solution, we rotate the ciphertext by 2^i slots, for 0 ≤ i < log(N/2), to create a new ciphertext and add the two ciphertexts together. Figure 3b shows an example.
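The rotate-and-add sum step can likewise be sketched in plain NumPy (again, `np.roll` stands in for the homomorphic rotation; the vector length is assumed to be a power of two):

```python
import numpy as np

def slot_sum(v):
    # fold the vector onto itself log2(n) times by rotating 2^i slots
    # and adding; afterwards every slot holds the sum of all entries
    acc = np.array(v, dtype=float)
    step = 1
    while step < len(acc):
        acc = acc + np.roll(acc, -step)
        step *= 2
    return acc

print(slot_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # every slot equals 36.0
```

Combined with one element-wise product, this gives the inner product of two encrypted vectors in log-many rotations instead of n − 1.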

4 Test Results and Evaluations

After running on the 1,100 records of the test dataset, we reviewed the model's accuracy and performance when using the HE technique to preserve privacy. Overall, the HE implementation is slightly slower than the non-HE one; the difference comes entirely from predicting on encrypted data in the neural network, as the running time of the other tasks is negligible. The error between the non-HE and HE implementations is tiny (0.01%), so the model's accuracy is preserved. The performance overhead is


about 87 ms of added delay compared to the non-HE version (on an Intel i7-6700HQ, 2.6 GHz, 16 GB RAM). We can therefore claim that the application satisfies the accuracy and performance properties.

5 Conclusions and Discussion

We have proposed a system that solves a real-life privacy-preserving machine learning use case with CKKS. The system collects children's phone usage data and then predicts their health scores while protecting their data privacy. We analyzed the use case and defined the structure of the app, the neural network, and the data to collect. During the implementation, we faced some difficulties and proposed approaches to tackle them. The resulting application solves the use-case problem with high accuracy, data privacy, and practical performance.

Acknowledgement. This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number NCM2021-20-02. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, for supporting this study. The authors would also like to thank Mr. Nguyen Ngoc Ky for his comments, which helped to improve the manuscript significantly.


Semantic Pivoting Model for Effective Event Detection

Hao Anran1,2(B), Hui Siu Cheung1, and Su Jian2

1 Nanyang Technological University, Singapore, Singapore
{S190003,asschui}@ntu.edu.sg
2 Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore, Singapore
[email protected]

Abstract. Event Detection, which aims to identify and classify mentions of event instances from unstructured articles, is an important task in Natural Language Processing (NLP). Existing techniques for event detection only use homogeneous one-hot vectors to represent the event type classes, ignoring the fact that the semantic meaning of the types is important to the task. Such an approach is inefficient and prone to overfitting. In this paper, we propose the Semantic Pivoting Model for Effective Event Detection (SPEED), which explicitly incorporates prior information during training and captures more semantically meaningful correlations between the input and events. Experimental results show that our proposed model achieves state-of-the-art performance and outperforms the baselines in multiple settings without using any external resources.

Keywords: Event detection · Information extraction · Natural Language Processing · Deep learning

1 Introduction

Event Detection (ED), which is a primary task in Information Extraction, aims to detect event mentions of interest from text. ED has wide applications in various domains such as news, business and healthcare. It also provides important information for other NLP tasks, including Knowledge Base Population and Question Answering. The state-of-the-art ED models are predominantly deep learning methods, which represent words using high-dimensional vectors and automatically learn latent features from training data [2,4]. However, the limited size and data imbalance of ED benchmarks pose challenges to the performance and robustness of current deep neural models [1]. For instance, over 60% of the types in the ACE 2005 benchmark dataset have fewer than 100 data instances each.

Recent works on ED can be categorized into three major approaches: (i) proposing architectures with more sophisticated inductive bias [2,22]; (ii) leveraging linguistic tools and knowledge bases [12]; (iii) using external or automatically augmented training data [21]. These approaches can be seen as indirectly alleviating the lack of a type semantic prior in the model, but they ignore the important fact that the types are semantically meaningful. The models treat each event type class homogeneously as a one-hot vector and are therefore agnostic to the semantic differences or associations between the types.

In this paper, we propose to directly incorporate the type semantic information by utilizing the class label words of the event types (e.g., "attack" and "injure") to guide ED. To this end, we leverage the state-of-the-art Language Model (LM) structure, the Transformer [18], and propose the Semantic Pivoting Model for Effective Event Detection (SPEED), which uses the event type label words as auxiliary context to enhance trigger classification through a two-stage network. We highlight the fact that the label words are a natural language representation of the meanings of the target types, which allows us to: (1) use them as initial semantic pivots for ED, and (2) encode them in the same manner as the input sentence words and enhance the representations of both via the attention mechanism. To the best of our knowledge, this is the first work to exploit the event class label set and incorporate the type semantic prior information for the task. We evaluate our SPEED model on the ACE 2005 benchmark and achieve state-of-the-art performance.

The rest of the paper is organized as follows: Sect. 2 reviews the related work and Sect. 3 specifies the task formulation. In Sect. 4, we introduce our proposed SPEED model. Section 5 discusses the experimental details, including the dataset, compared baseline models, hyperparameter settings, performance results and analysis. Section 6 concludes the paper.

(Supported by the Agency for Science, Technology and Research, Singapore. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 534–546, 2022. https://doi.org/10.1007/978-3-031-21967-2_43)

2 Related Work

Deep learning models [2,16], which are based on distributed vector representations of text and neural networks, have been widely used for modern ED. Such approaches automatically extract latent features from data and are thus more flexible and accurate than early feature-based approaches (e.g., [9]). Over the years, beyond the convolutional neural network (CNN) [2] and recurrent neural network (RNN) [16], more sophisticated architectures and mechanisms, including attention [10], the Transformer and graph neural networks [11,21,22], have been introduced to improve performance. However, the data scarcity and imbalance problems remain the bottleneck for substantial improvement. To alleviate the data scarcity problem, many works leverage external linguistic resources such as Freebase and Wikipedia [1,21] to generate auto-augmented data via distant supervision. Utilization of pre-trained language models, joint extraction of triggers and arguments, and the incorporation of document-level or cross-lingual information [3,17,19] have also been found to enhance ED.

Class label representation has been used for image classification but rarely explored for natural language processing (NLP) tasks. Recent works [20,24] encoded label information as system input for text classification. Nguyen et al. [14] demonstrated the effectiveness of explicitly encoding relation and connective labels for Discourse Relation Recognition. However, these methods learn separate encoders for the labels and the input sentence words.


Fig. 1. Architecture of the proposed SPEED model.

This is redundant because the words used in both the labels and the sentences are from the English vocabulary and they can share the same embedding. Furthermore, these methods do not effectively model the rich interactions between sentence words and labels as well as between two event labels. In our work, the label input shares the distributed representation with the input text, while the deep attention-based structure captures higher-order interactions among word and label tokens. By harnessing the power of pre-trained language models, we avoid the hassle of data augmentation methods such as distant supervision [13], in which much noise is introduced.

3 Event Detection

ED is formulated as identifying event triggers, the words that best indicate mentions of events, and classifying these triggers into a pre-defined set of event types. For example, in the sentence S1, the underlined words are the triggers of an Attack event and an Injure event, respectively:

S1: A bomb went off near the city hall on Friday, injuring 6.

We formulate the task as a word-level sequence tagging problem, with the input being the document sentences and the output being the predicted trigger type label of each word span. We follow the criteria used in previous ED works [2,8,15] and consider an event trigger as correct if and only if both the boundary and the classified type of the trigger match the corresponding ground truth.
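The sequence-tagging view of S1 can be sketched as follows (a toy illustration; trigger spans are simplified to single tokens, and "O" is our placeholder for non-trigger tokens):

```python
# Word-level tagging of sentence S1; "O" marks non-trigger tokens.
sentence = "A bomb went off near the city hall on Friday , injuring 6".split()
gold = {"went": "Attack", "injuring": "Injure"}  # trigger word -> event type

tags = [gold.get(tok, "O") for tok in sentence]
print(list(zip(sentence, tags)))
```

A prediction is scored as correct only when both the tagged span and the event type agree with this gold tagging.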

4 Proposed Model

Figure 1 shows the overall architecture of our proposed SPEED model, a two-stage Transformer-based model consisting of a Label Semantic Learner and a Trigger Classifier.


Fig. 2. Architecture of Label Semantic Learner.

4.1 Label Semantic Learner

Figure 2 shows the proposed Label Semantic Learner, which employs a sequence-to-sequence (Seq2seq) Transformer and Gumbel Sampling.

Sequence-to-Sequence Transformer. To learn a semantic representation of the event types based on the label words, we first concatenate the words in the original event type labels, forming a sequence L = ⟨l1, ..., ln⟩. Then, we randomly shuffle the labels to reduce the influence of the positional embedding. The Label Semantic Learner takes the label words as input and passes the representation through the Transformer architecture [18], which consists of M encoder layers followed by M decoder layers. The attention mechanism within each layer allows the type label words to interact with each other based on lexical semantic similarity and difference. Note that the encoder and decoder attention masks are set to allow each token li ∈ L to interact with all other tokens either before or after it. The final decoder layer is connected to a feed-forward neural layer (FFNN), which predicts a new sequence L′ = ⟨w1, ..., wn⟩ that encodes type semantic information. To be consistent with the original label sequence, we restrict the number of tokens in the output sequence L′ to be the same as that of the input L.

Gumbel Sampling. The next step of the Label Semantic Learner infers the best-suited label words from the vocabulary based on the predicted distribution. This involves "discrete" steps of taking the most probable next tokens, which blocks backpropagation. Also, there are conceivably multiple ways to describe the meanings of the event types. Instead of deterministically choosing the word with the highest probability, the model may benefit from a "softer", probabilistic approach that also allows other, lower-probability words to be chosen. Thus, we employ the Gumbel-Softmax method [6], which closely approximates the argmax operation via Gumbel Sampling, for the Label Semantic Learner.


Fig. 3. Architecture of Trigger Classifier.

More specifically, we replace the usual non-differentiable token prediction operation, which selects the word wi from the vocabulary with the highest probability:

wi = argmax(softmax(pi))    (1)

with:

wi = argmax(softmax((pi + Gi) / τ))    (2)

where pi is the computed probability logit of wi, τ is the temperature parameter controlling the degree of approximation, and Gi is a random noise value sampled from the Gumbel distribution G:

Gi = −log(−log(Ui)),  Ui ∼ Uniform(0, 1)    (3)

This reparameterizes pi and replaces the sample wi in Equation (1), drawn from the one-hot-encoded categorical distribution over the vocabulary, with an approximation drawn from a continuous distribution (G) as in Equation (2), thereby allowing backpropagation to compute the respective gradients.
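Equations (1)–(3) can be sketched in plain Python; this shows the hard (argmax) sampling variant, while the model itself uses the differentiable softmax relaxation on tensors:

```python
import math
import random

def gumbel_softmax_sample(logits, tau=0.1, rng=random):
    """Sample a token index per Eqs. (2)-(3): perturb the logits with
    Gumbel noise, divide by the temperature tau, and take the argmax.
    (softmax is monotonic, so argmax(softmax(x)) == argmax(x).)"""
    noisy = []
    for p_i in logits:
        u_i = rng.uniform(1e-9, 1.0)        # U_i ~ Uniform(0, 1)
        g_i = -math.log(-math.log(u_i))     # Gumbel noise, Eq. (3)
        noisy.append((p_i + g_i) / tau)     # Eq. (2)
    return max(range(len(noisy)), key=noisy.__getitem__)

# The dominant logit wins most of the time, but not always:
random.seed(0)
samples = [gumbel_softmax_sample([5.0, 0.1, 0.1]) for _ in range(100)]
print(samples.count(0))  # close to 100
```

The noise makes the selection stochastic, so lower-probability words are occasionally chosen, matching the "softer" behaviour described above.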

4.2 Trigger Classifier

Figure 3 shows the architecture of the Trigger Classifier, which consists of Input Encoding, Feature Extraction and Trigger Prediction.

Input Encoding. We leverage the state-of-the-art pre-trained language model BERT [18] for encoding. As shown in Fig. 3, we construct the input by concatenating each sentence with the type semantic sequence. After adding the special BERT tokens [CLS] and [SEP], for each input sentence s (of length Ns) with all the label words L′ (of length NL), the input sequence is:

Xs = ⟨[CLS], L′, [SEP1], s, [SEP2]⟩    (4)
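The input assembly of Eq. (4), with the segment-index rule of Eq. (5), can be sketched as follows (token and separator names are illustrative; real BERT operates on word-piece ids and uses a single [SEP] symbol for both separators):

```python
def build_input(sentence_tokens, label_tokens):
    """Assemble X_s = [CLS], L', [SEP1], s, [SEP2] with segment
    indices: 1 for the label tokens and SEP2, 0 otherwise."""
    seq = (["[CLS]"] + label_tokens + ["[SEP1]"]
           + sentence_tokens + ["[SEP2]"])
    seg = ([0] + [1] * len(label_tokens) + [0]
           + [0] * len(sentence_tokens) + [1])
    # Total length N_valid = N_s + N_L + 3
    assert len(seq) == len(sentence_tokens) + len(label_tokens) + 3
    return seq, seg

seq, seg = build_input(["a", "bomb", "went", "off"], ["attack", "injure"])
print(seq)
print(seg)  # [0, 1, 1, 0, 0, 0, 0, 0, 1]
```

The segment pattern separates the type semantic words from the sentence words while keeping both in one sequence, so the shared Transformer layers can attend across them.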


of length Nvalid = Ns + NL + 3. Following BERT, three types of embeddings are used, namely the word-piece embedding Ew, the position embedding Ep and the segment index embedding Es. Each word-piece token xi ∈ Xs is embedded as:

ei = Ew(xi) ⊕ Ep(xi) ⊕ Es(xi)
Es(xi) = fs(1) if xi ∈ L′ or xi = SEP2, and fs(0) otherwise    (5)

where fs is the pre-trained segment index embedding from BERT.

Feature Extraction. At each Transformer layer, the contextualized representation of the tokens is obtained via aggregation of multi-head attention (denoted by 'Tm' in Fig. 3):

MultiHead(Q, K, V) = ⟨head1, ..., headH⟩ WO
headi = fAttn(Q WiQ, K WiK, V WiV),  i ∈ [1, H]    (6)

where H denotes the number of heads and Q, K, V are the query, key and value matrices, respectively. In our proposed model, candidate words from the input sentence and words from the type semantic sequence are jointly modeled in the same vector space. The Attn of each head can be decomposed as follows:

Attn(Q, K, V) ⇒ { Attn(Qs, Ks, Vs), Attn(QL′, KL′, VL′), Attn(QL′, Ks, Vs), Attn(Qs, KL′, VL′) }

These correspond to four types of token interactions. The first is the interaction between each pair of input sentence words, capturing solely sentence-level contextual information. The second is the interaction between each pair of type label words, which models the correlation between event type labels. The remaining two allow s to be understood in regard to L′ and vice versa. For example, in Fig. 3, each word ti in the example sentence S1 (s) is a candidate trigger word, and the two event types, among others, are part of the semantic type representation L′. Suppose the true label of the token t2 ("injuring") in the input sentence s is semantically represented by w1 (Injure). In our model, the representation of the type semantic word w1 at each layer is enriched by that of a similar or contrary type semantic word, such as w2 (Attack). Moreover, it is substantiated by the input sentence words ti ∈ s, especially the candidate trigger token t2, as it happens to be an instance of w1. For the input sentence token t2, its representation is contextualized by the other tokens in the same sentence (ti ∈ s), whereas its attention with w1 and the other label semantic tokens (wi ∈ L′) provides semantic clues for its ED classification.

Trigger Prediction. Finally, a feed-forward neural layer predicts an event type ŷi for each input sentence token ti: ŷi = FFNN(ti).
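The four-way attention decomposition can be visualised by slicing one attention-score matrix computed over the concatenated sequence (a minimal single-head sketch with made-up two-dimensional embeddings; no value mixing is shown):

```python
import math

def attention_scores(queries, keys):
    """Row-softmax of scaled dot products (single head, scores only)."""
    d = len(keys[0])
    rows = []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(logits)
        exp = [math.exp(x - m) for x in logits]
        z = sum(exp)
        rows.append([e / z for e in exp])
    return rows

# Concatenate label embeddings L' and sentence embeddings s:
labels = [[1.0, 0.0], [0.0, 1.0]]     # e.g. Attack, Injure
sentence = [[0.9, 0.1], [0.1, 0.8]]   # two candidate trigger words
seq = labels + sentence
A = attention_scores(seq, seq)
nL = len(labels)

# The four sub-blocks of the decomposition:
A_ss = [row[nL:] for row in A[nL:]]   # sentence -> sentence
A_LL = [row[:nL] for row in A[:nL]]   # label    -> label
A_Ls = [row[nL:] for row in A[:nL]]   # label    -> sentence
A_sL = [row[:nL] for row in A[nL:]]   # sentence -> label
```

Because the labels and the sentence share one attention matrix, all four interaction types emerge from a single attention computation rather than from separate encoders.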

Table 1. Data split and statistics.

        # Docs   # Sents   # Eventful   # Triggers
Train   529      14347     3352         4420
Dev     30       634       293          505
Test    40       840       347          424

5 Experiments

This section discusses the dataset, evaluation metrics, baseline models for comparison and experimental results.

5.1 Dataset and Evaluation Metrics

We conduct the experiments on ACE 2005, the most widely-used ED benchmark to date. The documents are gathered from six types of media sources: newswire, broadcast news, broadcast conversation, weblog, online forum and conversational telephone speech. The annotation includes 33 fine-grained event types. We evaluate our models on its English subset and use the same split as in previous ED work [2,21,22]. The details of the dataset and split are summarized in Table 1; eventful sentences are those with at least one event mention. For evaluation, we report precision (P), recall (R) and micro-average F1 scores.

5.2 Baselines
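Under this criterion, a trigger counts as correct only if both its span and its type match the gold annotation; the micro-averaged scores can be computed as follows (a sketch with hypothetical span/type tuples):

```python
def micro_prf(pred, gold):
    """pred, gold: sets of (sentence_id, span, event_type) tuples.
    A prediction is a true positive iff span AND type match the gold."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {(0, (2, 3), "Attack"), (0, (11, 11), "Injure")}
pred = {(0, (2, 3), "Attack"), (0, (11, 11), "Attack")}  # wrong type on 2nd
print(micro_prf(pred, gold))  # (0.5, 0.5, 0.5)
```

Micro-averaging pools true positives over all sentences before computing P, R and F1, so frequent event types dominate the score, which matters for an imbalanced benchmark like ACE 2005.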

We compare the performance of our model with three kinds of baselines: (1) models that do not use linguistic tools or extra training data; (2) models that use linguistic tools to obtain auxiliary features such as POS tags, dependency trees and disambiguated word senses; and (3) models that are trained with extra data. The baseline models are as follows:

– DMCNN [2] is a CNN-based model that uses dynamic multi-pooling.
– DMBERT and DMBERT+Boot [21] have a pipelined BERT-based architecture for ED. DMBERT+Boot is DMBERT trained on a dataset (Boot) augmented from an external corpus through adversarial training.
– BERT QA [3] performs ED in a QA-like fashion by constructing generic questions to query BERT.
– JRNN [15] is an RNN-based model for joint ED and argument extraction.
– JMEE [11] jointly extracts event triggers and event arguments with a Graph Convolutional Network (GCN) based on parsed dependency arcs.
– MOGANED [22] uses a Multi-Order Graph Attention Network (GAT) to aggregate multi-order syntactic relations in the sentences, based on Stanford CoreNLP POS tags and syntactic dependencies.


Table 2. Performance results for ACE 2005 Event Trigger Classification.

Model                Core mechanism                P     R     F1
DMCNN [2]            CNN                           75.6  63.6  69.1
DMBERT [21]          Transformer                   77.6  71.8  74.6
BERT QA [3]          Transformer                   71.1  73.7  72.4
JRNN‡ [15]           features+RNN                  66.0  73.0  69.3
JMEE‡ [11]           features+RNN+GCN              76.3  71.3  73.7
MOGANED‡ [22]        features+GAT                  79.5  72.3  75.7
SS-VQ-VAE‡ [5]       WSD+Transformer               75.7  77.8  76.7
DYGIE++∗ [19]        Transformer+Multi-task data   –     –     73.6
DMBERT+Boot∗ [21]    Transformer+Augmented data    77.9  72.5  75.1
SPEED (ours)         Transformer                   76.8  77.4  77.1
PLMEE† [23]          Transformer                   81.0  80.4  80.7
SPEED2† (ours)       Transformer                   79.8  86.0  81.4

Note: The baseline models are grouped by core mechanism. ‡ indicates models that use linguistic tools. ∗ indicates those using external resources. † indicates models that are trained and evaluated only on eventful data.

– SS-VQ-VAE [5] filters candidate trigger words using an OntoNotes-based Word Sense Disambiguation (WSD) tool and uses BERT for ED.
– DYGIE++ [19] is a multi-task information extraction model which uses BERT and graph-based span population. Gold annotations for ACE 2005 events, entities and relations are all used in training.

Additionally, the highest reported ED scores on the ACE 2005 data in the literature to date are obtained by:

– PLMEE [23], a BERT-based model that is fine-tuned for ED and argument extraction in a pipelined manner. We found that PLMEE is trained and evaluated with only eventful sentences from the ACE 2005 dataset. For a fair comparison, we implement SPEED2, which uses the same training and evaluation data as PLMEE.

5.3 Implementation Details

We implement the proposed model in PyTorch and use BERT-large-uncased with whole-word masking. The maximum sequence length is set to 256. We use the Adam [7] optimizer with the learning rate tuned around 3e−5. The batch size is set between 4 and 8 to fit single-GPU training. We apply early stopping (patience = 5), limit training to 50 epochs, and apply a dropout rate of 0.9. For the Label Semantic Learner, we set the number of Transformer encoder/decoder layers N = 3, the number of attention heads H = 4, and the Gumbel Sampling temperature τ = 0.1.
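The early-stopping rule (patience = 5) can be sketched as follows (a hypothetical helper; the paper gives no implementation details):

```python
class EarlyStopping:
    """Stop when the dev F1 has not improved for `patience` epochs."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, dev_f1):
        """Record one epoch's dev F1; return True when training should stop."""
        if dev_f1 > self.best:
            self.best = dev_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=5)
scores = [70.1, 71.3, 71.0, 71.2, 70.9, 71.1, 70.8, 71.0]
stopped_at = next(i for i, s in enumerate(scores) if stopper.step(s))
print(stopped_at)  # 6: five consecutive epochs without beating 71.3
```

With the 50-epoch cap mentioned above, training ends at whichever comes first: the epoch limit or five epochs without a new best dev F1.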

5.4 Experimental Results

Table 2 shows the performance results of our proposed SPEED model on the ACE 2005 benchmark compared with the state-of-the-art models, grouped roughly by approach. Without using linguistic tools or external resources, our proposed SPEED model achieves 77.1% in F1, outperforming the baseline models by 0.4%–8.0% in F1. Among all the models, SPEED achieves the highest recall with good precision. Although MOGANED achieves a particularly high precision (79.5%), its recall is lower than that of our SPEED model by 5.1%. One possible reason is that, since it utilizes gold entities and syntactic features based on linguistic tools, its inductive bias enables it to perform better on the more regular instances. In contrast, SPEED is based not on a syntactic prior but on a semantic prior of the event types; it can cover irregular instances, though with less precision.

Table 3. Ablation study on SPEED.

Model                                F1    Δ F1
(1) SPEED model                      77.1  –
(2) TC (large) w/o LSL               75.2  −1.9%
(3) TC (large) w/o labels as input   73.4  −3.7%
(4) TC (base)                        72.8  −4.3%
(5) TC (base) w/o labels as input    71.0  −1.8%

When trained and evaluated on only eventful data, our SPEED2 outperforms PLMEE in terms of recall (+5.6%) and F1 (+0.7%), giving a more balanced performance. The results show that incorporating label information is an effective approach for event detection.

5.5 Ablation Studies

We conduct ablation experiments to show the effectiveness of the individual components of our model. Table 3 reports the results in F1: (1) the original SPEED model, whose Trigger Classifier (TC) is based on BERT-large; (2) we remove the Label Semantic Learner (LSL), i.e., the label word representation L′ is the same as the original label words L; (3) we do not use labels as input, i.e., the LSL is removed and the label word representation L′ is not included as part of the input to the TC; (4) we replace BERT-large with BERT-base in the TC; (5) on top of (4), we do not use labels as input for the model, similar to (3).

The results show that all the key components of our proposed SPEED model are necessary and effective for ED. Firstly, we observe that removing the LSL significantly reduces performance, by 1.9%. Secondly, replacing TC (large) with TC (base) leads to a drop in performance of 4.3%. This is possibly because BERT-large provides better contextual word representations and more space for


interaction between a sentence and the type semantic pivot words than its base counterpart. Finally, regardless of the BERT version used in the Trigger Classifier, performance degradation is significant if the event type labels are not used as input to provide the semantic prior information.

5.6 Analysis and Discussion

Analysis of Performance with Scarce Training Data. To show the data efficiency of our proposed SPEED model, we also evaluate it on scarce training data in comparison with the baseline models DMCNN and DMBERT. More specifically, we evaluate the models after training them with 20%, 40%, 60% and 80% of the training data. As shown in Fig. 4, our model performs significantly better than the baselines under these settings. With less training data, the performance of DMCNN and DMBERT drops significantly, by 3.0%–28.5% in F1, while the performance of our proposed SPEED model drops by only 2.9%–7.4%. With an extremely limited amount of data (20%), SPEED can still achieve a reasonable F1 of 69.7%; in the same setting, DMCNN and DMBERT only achieve around 46% in F1. Similar to SPEED, SPEED2, which is evaluated with only eventful sentences, shows reasonable performance degradation with a significantly reduced amount of training data. This shows the effectiveness of the proposed model in learning from scarce data for ED.

Fig. 4. Performance on scarce training data.

Table 4. Performance comparison on single (1/1) and multiple (1/N) event sentences.

Model          1/1   1/N   All
DMCNN          74.3  50.9  69.1
JRNN           75.6  64.8  69.3
JMEE           75.2  72.7  73.7
SPEED (ours)   77.5  76.8  77.1


Analysis of Performance on Single/Multiple Event Sentences. Among all the baselines, JMEE focuses on addressing multiple-event sentences, i.e., sentences each containing more than one event trigger. In Table 4, we report our F1 performance on single-event and multiple-event sentences, in comparison with JMEE and the strong baselines it used for this scenario (i.e., DMCNN and JRNN). Without using linguistic features such as POS tags and dependencies, our SPEED model achieves a high F1 (76.8%) on multiple-event sentences, outperforming the baselines by 4.1%–25.9%. This shows that our proposed architecture can effectively model cross-event interaction, benefiting ED on multiple-event sentences.

6 Conclusion

In this paper, we propose a novel semantic-pivoted Event Detection model that utilizes the pre-defined set of event type labels for event detection. It features event type semantics learning via a Transformer-based mechanism. The experimental results show that our model outperforms the state-of-the-art event detection methods. In addition, the proposed model demonstrates several other advantages, such as working well in scenarios with scarce training data and multiple-event sentences.

References 1. Chen, Y., Liu, S., Zhang, X., Liu, K., Zhao, J.: Automatically labeled data generation for large scale event extraction. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 409–419. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1038 2. Chen, Y., Xu, L., Liu, K., Zeng, D., Zhao, J.: Event extraction via dynamic multipooling convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 167–176. Association for Computational Linguistics (2015). https:// doi.org/10.3115/v1/P15-1017 3. Du, X., Cardie, C.: Event extraction by answering (almost) natural questions. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 671–683. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.emnlp-main.49 4. Grishman, R.: Twenty-five years of information extraction. Nat. Lang. Eng. 25(6), 677–692 (2019). https://doi.org/10.1017/S1351324919000512 5. Huang, L., Ji, H.: Semi-supervised new event type induction and event detection. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 718–724. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.emnlp-main.53


6. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with gumbel-softmax (2017) 7. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015, Conference Track Proceedings (2015). https://arxiv.org/abs/1412.6980 8. Li, Q., Ji, H., Huang, L.: Joint event extraction via structured prediction with global features. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, pp. 73–82. Association for Computational Linguistics (2013) 9. Liao, S., Grishman, R.: Using document level cross-event inference to improve event extraction. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 789–797. Association for Computational Linguistics (2010) 10. Liu, S., Chen, Y., Liu, K., Zhao, J.: Exploiting argument information to improve event detection via supervised attention mechanisms. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1789–1798. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1164 11. Liu, X., Luo, Z., Huang, H.: Jointly multiple events extraction via attention-based graph information aggregation. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1247–1256. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/ D18-1156 12. Lu, W., Nguyen, T.H.: Similar but not the same: word sense disambiguation improves event detection via neural representation matching. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4822–4828. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/D18-1517 13. 
Muis, A.O., et al.: Low-resource cross-lingual event type detection via distant supervision with minimal effort. In: Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 70–82. Association for Computational Linguistics (2018) 14. Nguyen, L.T., Van Ngo, L., Than, K., Nguyen, T.H.: Employing the correspondence of relations and connectives to identify implicit discourse relations via label embeddings. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4201–4207. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/P19-1411 15. Nguyen, T.H., Cho, K., Grishman, R.: Joint event extraction via recurrent neural networks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 300–309. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/N16-1034 16. Nguyen, T.H., Grishman, R.: Event detection and domain adaptation with convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, pp. 365–371. Association for Computational Linguistics (2015). https://doi.org/ 10.3115/v1/P15-2060


17. Subburathinam, A., et al.: Cross-lingual structure transfer for relation and event extraction. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 313–325. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D191030 18. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017) 19. Wadden, D., Wennberg, U., Luan, Y., Hajishirzi, H.: Entity, relation, and event extraction with contextualized span representations. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), Hong Kong, China, pp. 5784–5789. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1585 20. Wang, G., et al.: Joint embedding of words and labels for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2321–2331. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/P181216 21. Wang, X., Han, X., Liu, Z., Sun, M., Li, P.: Adversarial training for weakly supervised event detection. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 998–1008. Association for Computational Linguistics (2019). https://doi.org/10. 18653/v1/N19-1105 22. Yan, H., Jin, X., Meng, X., Guo, J., Cheng, X.: Event detection with multi-order graph convolution and aggregated attention. 
In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5766–5770. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1582 23. Yang, S., Feng, D., Qiao, L., Kan, Z., Li, D.: Exploring pre-trained language models for event extraction and generation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5284–5294. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/ P19-1522 24. Zhang, H., Xiao, L., Chen, W., Wang, Y., Jin, Y.: Multi-task label embedding for text classification. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4545–4553. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/D18-1484

Meet Your Email Sender - Hybrid Approach to Email Signature Extraction

Jelena Graovac(B), Ivana Tomašević, and Gordana Pavlović-Lažetić
Faculty of Mathematics, Department of Computer Science, University of Belgrade, Belgrade, Serbia
{jgraovac,ivana,gordana}@matf.bg.ac.rs

Abstract. An email signature is considered imperative for effective business email communication. Despite the growth of social media, it is still a powerful tool that serves as a business card in the online world, presenting business information such as name, contact number and address to recipients. Signatures vary greatly in structure and content, so extracting them automatically is a great challenge. In this paper we present a hybrid approach to automatic signature extraction. The first step is to obtain the original, most recently sent message from the entire email thread, cleaned of all disclaimers and superfluous lines, so that the signature, if present, sits at the bottom of the email. We then apply the Support Vector Machine (SVM) Machine Learning (ML) technique to classify emails according to whether they contain a signature. To improve the obtained results we apply a set of sophisticated Information Extraction (IE) rules. Finally, we extract the signatures. We trained and tested our technique on a wide range of data: the Forge dataset, the Enron dataset together with our own collection of emails, and a large set of emails provided by our native English-speaking friends. We extracted signatures with a precision of 99.62% and a recall of 93.20%.

Keywords: Email signature extraction · SVM · Information extraction · Hybrid approach

1 Introduction

Email is one of the most used communication services. Despite the rise of social media and instant messaging, email usage is steadily growing, with more than 4 billion users worldwide in 2021 and about 6.8 billion email accounts, and it continues to grow [17]. In the business world, email signatures are considered imperative for effective communication. They serve the purpose of business cards or letterheads in this electronic world. They contain the sender's name, organization, location, phone number, company or personal web page URL, social network addresses and so on. Still, signatures may vary greatly depending on the individual or on company conventions.

Among the many possible applications of automatic extraction of signatures from emails, such as preprocessing emails for text-to-speech systems, automatic formation of personal address lists, email threading, etc., an application that may have a wide social and economic impact is the development of a network of contacts as the core of a successful customer relationship management system. Our goal in this paper is to develop a hybrid machine-learning/rule-based approach to automatic email signature extraction. It is well known that data quality is a crucial resource for the successful accomplishment of such projects. We were therefore supported by an engineering team that developed the labelling software, and a highly competent team worked on data labelling (each email labeled by more than one person). It may be worth saying that, although some commercial solutions for signature extraction from emails do exist on the market, e.g., Outlook Text Minner [5], more sophisticated analysis and use, larger coverage and range of data, specific collections and clients, and the need for as good a methodology as possible, combining Machine Learning (ML) and Information Extraction (IE) methods, were the challenges of the project that inspired this work and the paper.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 547–558, 2022. https://doi.org/10.1007/978-3-031-21967-2_44

2 Related Work

In contrast to the extensive Natural Language Processing (NLP) and Named Entity Recognition (NER) research in general and within the email context ([15,16,19]), not much has been done on extracting signatures from emails. Signature extraction is envisaged more as an image-detection than an NLP problem, and still, no significant effort in deep-learning-based image detection has been put into solving it.

In [7] a method is presented for component-level analysis of plain-text email messages and its use for automatic identification of signature blocks, signature lines and reply lines. The method applies machine learning to a sequential representation of an email message, in which each email is represented as a sequence of lines, and each line is represented as a set of features. Several state-of-the-art sequential and non-sequential machine learning algorithms are compared. Experimental results show that the presence of a signature block in a message can be detected with accuracy higher than 97% and signature lines can be identified with accuracy higher than 99%. The algorithms were trained on the 20 Newsgroups data set [14] and the authors' own inboxes. This resulted in a final set of 617 messages (all from different senders) containing a signature block, and a set of 586 messages without one.

A method to parse signature block fields in email messages described in [8] uses a combination of geometrical and linguistic analysis to convert the two-dimensional signature block into one-dimensional reading blocks and to identify functional classes of text in a one-dimensional reading block, integrated with weighted finite-state transducers. Signature block identification was tested on 351 complete email messages from various sources. The overall recall rate is 93% and the precision rate is 90%. The authors stated that most errors came from the geometrical analysis, but improvements could also be obtained with an N-gram functional block model, which requires more training data.


Signature detection (along with header and program-code detection), as an aspect of email non-text filtering, is considered in [20]. Two Support Vector Machine (SVM) models are constructed for identifying the start line and the end line of a signature. A number of features are identified, including features for the previous and next line. The data set used includes emails from different newsgroups at Google, Yahoo, Microsoft and several universities. The overall recall and precision for signature detection were 91.35% and 88.47%, respectively. It was observed that about 38% of the errors came from signature start-line identification and 62% from end-line identification. Open-source libraries, industrial tools and various platforms use, apply or offer signature extraction for different purposes. One such library is Talon [4], developed by Mailgun [2]. Its authors describe that machine learning techniques are used to classify signature lines, while other tasks rely on various heuristics drawn from extensive experience with email signature extraction in real-world problems.

3 Methodology

Determining whether an email contains a signature block is a classification problem, so we used classification methods to solve it [9–11,18]. The specific classification procedure we applied consists of the following steps:

– Data Preprocessing. First, we extracted the most recent emails from the threads and applied different preprocessing techniques to ensure that the signatures, if they exist, are at the bottom of the email (we deleted all reply lines, disclaimer lines, notification lines, long lines, etc.).
– SVM Email Classification. Then we applied the SVM supervised learning method to classify all emails into the P category (signature-containing emails) or the N category.
– IE Refinement. The obtained results are refined using additional IE rules.
– Signature Extraction. Finally, for all emails that contain a signature (classified into the P category) we extract the signature blocks.

3.1 Early Stage - Phase 1

We started with the Talon project [4] to extract reply lines and signatures, if detected, as described in [7]. Talon provides brute-force signature extraction (based on the sender name or special lines like "–"), which works fine most of the time, but also a machine learning approach based on the scikit-learn Python library to build SVM classifiers [3]. The reasons for applying SVM to text-classification problems are well argued in [13] and may be shortly summarized as follows: using an abstract model of text-classification tasks, based on statistical properties of text-classification problems, it is possible to prove which types of text-classification problems are efficiently learnable with SVMs. Signature detection and extraction belong to such problems.


In Talon, the data used for training is the Forge dataset [1], taken from personal email conversations and from the Enron dataset, divided into P and N folders (emails containing and not containing signatures) and structured and annotated in a special way: each email in two files, _body and _sender, with signature lines prefixed by #sig#. An example of a _body file with a labeled signature is presented in Fig. 1.

Hi Mike,
Can we have a meeting tomorrow at 10 AM CST?
#sig#--
#sig#John Smith
#sig#555-555-5555

Fig. 1. Example of a signature.
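The Forge annotation is simple enough to consume with a few lines of code. The sketch below (our own illustrative helper, not part of Talon) splits a `_body` file into message lines and signature lines by the `#sig#` prefix:

```python
def parse_forge_body(text):
    """Split a Forge-annotated email body into (message_lines, signature_lines)."""
    message, signature = [], []
    for line in text.splitlines():
        if line.startswith("#sig#"):
            signature.append(line[len("#sig#"):])  # strip the annotation prefix
        else:
            message.append(line)
    return message, signature


body = ("Hi Mike,\n"
        "Can we have a meeting tomorrow at 10 AM CST?\n"
        "#sig#--\n"
        "#sig#John Smith\n"
        "#sig#555-555-5555\n")
msg, sig = parse_forge_body(body)
# sig == ['--', 'John Smith', '555-555-5555']
```

Such a split gives the per-line ground-truth labels needed when the classifier is retrained on a new, Forge-formatted collection.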

The classifier may be retrained on a new dataset, which has to be structured and annotated in the same way.

Data Preprocessing. Since the original Talon library did not work well enough for our email corpus, we redesigned the method for extracting the last message sent and added preprocessing to eliminate unnecessary lines. The new preprocessing method takes as input a full email message from a sender, extracts the original message (the most recent message sent), deletes notifications and disclaimers as well as long and unwanted lines (such as, for example, 'UNSUBSCRIBE from this page', 'Corpora mailing list', etc.), and outputs the original message with the signature (if one exists) somewhere at the bottom.

SVM Mail Classification. This step classifies an email as having or not having a signature (belonging to the positive (P) or negative (N) class), based on a well-tailored and customised set of features. Following the strategy described in [7], the message is considered as a sequence of lines, each line represented as a sequence of features. Each feature is implemented as a method that is applied to a line (or a line with its neighbouring lines) and assigns 1 or 0 to the line, meaning that the line does or does not satisfy the feature. There may be several dozen features, some of them being, e.g., "line contains job title", "line contains phone number/URL/email address", etc. Features are applied to the last 15 lines of an email (a form of feature selection), since, if a signature exists, it is somewhere at the bottom of the preprocessed mail, and signatures are usually not longer than 15 lines. Also, we assigned weights to some important features, so those features are repeated several times.


An example of the last 15 lines of a message, a set of features, and a feature matrix are given in Figs. 2, 3 and 4, respectively.

line 14:
line 13: If you cannot access the site, please contact the organizer (someone@gmail=
line 12: .com)
line 11: + Notice of acceptance: By the end of May 2021
line 10: + Full paper due: By the end of August 2021
line 9: + Fee: Free
line 8: =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
line 7: =3D=3D=3D=3D=3D=3D=3D=3D
line 6:
line 5:
line 4: Dr. John Smith
line 3: Kobe University
line 2: [email protected]
line 1:
line 0: _______________________________________________

Fig. 2. Bottom of email (the example of the email is fictional).

FEATURES: ['many capitalized words', 'line too long', 'email', 'URL', 'relax_phone', 'separator', 'special_chars', 'signature_words', 'name', 'punct. percent > 50', 'punct. percent > 90', 'contains sender', 'twitter', 'facebook', 'linkedin', 'phone', 'line', 'spec_char', 'words', 'name', 'quote', 'tab1', 'tab2', 'tab3']

Fig. 3. Set of features.

Again, a pre-trained classifier may be applied to assign a True or False class (classify the mail into the P/Positive or N/Negative class). The classifier may also be retrained on a new data collection, which should be structured and annotated in the same way as the Forge dataset.

IE Refinement of SVM Results and Signature Extraction. It turned out that the SVM classifier made substantial errors. It classified some signature-containing emails into the N class and some emails not containing a signature into the P class.

line 13: 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
line 12: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
line 11: 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
line 10: 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
line 9:  1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
line 8:  0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
line 7:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
line 4:  1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
line 3:  1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
line 2:  0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
line 0:  0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0

Fig. 4. Feature matrix.

For example, the email in Fig. 5 was classified into the N class due to its "modest" signature, while the email in Fig. 6 was erroneously classified into the P class.

Not sure if Mike was able to catch you with a note earlier today, but unfortunately, his international flight was delayed. Can we possibly reschedule to tomorrow when Mike is available, since his input is necessary for a productive discussion? Thanks and sorry for the unforeseen delay. Please let us know what works for the reschedule.
John
555-555-5555

Fig. 5. Email wrongly classified into the N class by the SVM classifier (the email is fictional).

I will be out of the office until Monday, July 28th. For any matter requiring a timely response, please contact my colleagues.
James Jones (212)111-1111 or [email protected]
Robert Brown (212)222-2222 or [email protected]

Fig. 6. Email wrongly classified into the P class by the SVM classifier (the email is fictional).

To improve on the classification based on the SVM algorithm alone, we added an additional, IE-technique-based verification of the decision made by the SVM. For each email we considered the email address and the metadata associated with it (first and last name, if present). The double-check class assignment takes, as an input, an original preprocessed message with the SVM decision and produces,


as an output, the final decision on the email class. If the final decision is True, the email signature is extracted. The underlying procedure is described by the pseudocode given as Algorithm 1.

Algorithm 1. Procedure for the hybrid method of signature extraction
Input: E_body, E_first_name, E_last_name, E_email_address_username, E_email_address_domain, SVM(E)
Output: P – if E has a signature, N – otherwise
if SVM(E) = Yes then
    return WeakerConditions(E_body, E_first_name, E_last_name, E_email_address_username, E_email_address_domain)
else
    return StrongerConditions(E_body, E_first_name, E_last_name, E_email_address_username, E_email_address_domain)
end if

Algorithm 2. WeakerConditions
Input: E_body, E_first_name, E_last_name, E_email_address_username, E_email_address_domain
Output: P – if E_body has a signature, N – otherwise
if (E_first_name OR E_last_name OR E_email_address_username OR E_email_address_domain is found in E_body) then
    return P
else
    return N
end if

The weaker conditions in Algorithm 2 require only that the first or last name of the sender, or something similar to the username or email domain, is found in the body. If nothing is found, the mail is classified as an N-class mail, although the SVM predicted the P class. The stronger conditions in Algorithm 3 require that more information about the sender be found in order to revert the SVM decision. Applying the stronger conditions thus consists of the following: check whether at least one of the following pieces of data about the sender is contained in the last 15 lines: the first name, the last name, the username, the email domain. If yes, check whether some more signature-pertinent data (phone number, web address, job title, Twitter handle, etc.) appears after or below it. The reason for this check is that only signatures containing more data than just the first and last names are of interest. If such additional data exists, the email is classified into the P class although the SVM predicted the N class. This procedure significantly increased the accuracy over the SVM method by itself. For example, using these rules, the examples from Figs. 5 and 6 are classified into the correct categories.


Algorithm 3. StrongerConditions
Input: E_body, E_first_name, E_last_name, E_email_address_username, E_email_address_domain
Output: P – if E has a signature, N – otherwise
if (E_first_name OR E_last_name is found) AND (phone number OR web address OR job title OR company name OR LinkedIn OR Twitter is found in E_body) then
    return P
else
    return N
end if
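The double-check logic of Algorithms 1–3 can be sketched in Python as follows; the field handling and the `signature_cues` regular expression are our own simplifications of the actual rule set, not the project's implementation:

```python
import re


def weaker_conditions(body, first, last, username, domain):
    """P if any sender-identifying token appears anywhere in the body."""
    tokens = [t for t in (first, last, username, domain) if t]
    return any(t.lower() in body.lower() for t in tokens)


def signature_cues(tail):
    """Illustrative check for signature-pertinent data (phone, URL, socials)."""
    return bool(re.search(
        r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|https?://|www\.|linkedin|twitter",
        tail, re.IGNORECASE))


def stronger_conditions(body, first, last, username, domain):
    """P only if a sender token appears in the last 15 lines AND extra
    signature data (phone number, web address, ...) is present there."""
    tail = "\n".join(body.splitlines()[-15:])
    tokens = [t for t in (first, last, username, domain) if t]
    if not any(t.lower() in tail.lower() for t in tokens):
        return False
    return signature_cues(tail)


def classify(body, first, last, username, domain, svm_says_positive):
    """Hybrid decision: the SVM output is double-checked by the IE rules."""
    if svm_says_positive:
        return weaker_conditions(body, first, last, username, domain)
    return stronger_conditions(body, first, last, username, domain)


# The Fig. 5 situation: SVM said N, but name + phone number revert it to P.
body_fig5 = ("Can we possibly reschedule to tomorrow?\n"
             "Thanks and sorry for the unforeseen delay.\n"
             "John\n"
             "555-555-5555")
decision = classify(body_fig5, "John", "Smith", "jsmith", "example.com",
                    svm_says_positive=False)
```

Conversely, an out-of-office reply that names only third parties contains no sender token, so the weaker conditions revert a positive SVM decision to N, as with the Fig. 6 example.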

3.2 Development of Signature Extractor - Phase 2

To increase the quality of email classification and signature extraction for a massive collection of diverse emails with contemporary signature components, we built a robust signature extraction tool. All the emails are annotated, and the last-message lines are extracted from emails both with and without signatures. A Python library has been defined for feature rewriting, feature pattern design, feature-based file representation, SVM model training, SVM model testing, elimination of unwanted lines, reply-line extraction, etc. We also applied a set of rules for signature extraction. First we look for a line containing a salutation, e.g., "Best regards", "Thanks", etc., followed by the first name/last name/nickname of the sender. If the mail does not contain a salutation line, we look for a line containing the first name/last name/nickname of the sender. In either case we extract, as the signature, all the lines from the matching one to the end of the message.
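Those two extraction rules can be sketched as a small function; the salutation list and the substring-based name matching are illustrative choices, not the project's exact heuristics:

```python
SALUTATIONS = ("best regards", "kind regards", "regards",
               "thanks", "cheers", "sincerely")


def extract_signature(body, sender_names):
    """Return the signature block as a list of lines.

    Rule 1: from a salutation line followed (on it or below it) by a sender
    name, to the end of the message.
    Rule 2: otherwise, from the first line containing a sender name to the end.
    Returns [] if neither rule matches.
    """
    lines = body.splitlines()
    names = [n.lower() for n in sender_names if n]

    def has_name(line):
        return any(n in line.lower() for n in names)

    # Rule 1: salutation line with a sender name on or after it.
    for i, line in enumerate(lines):
        if any(line.lower().strip().startswith(s) for s in SALUTATIONS):
            if any(has_name(l) for l in lines[i:]):
                return lines[i:]
    # Rule 2: first line containing a sender name.
    for i, line in enumerate(lines):
        if has_name(line):
            return lines[i:]
    return []


body = "Hi Mike,\nSee you tomorrow.\nBest regards,\nJohn Smith\n555-555-5555"
sig = extract_signature(body, ["John", "Smith"])
# sig == ['Best regards,', 'John Smith', '555-555-5555']
```

Without a salutation, Rule 2 still anchors the extraction on the sender's name, so a bare "John / 555-555-5555" ending is captured as well.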

4 Data Sets and Results

In Phase 1, the email classifier (V1) was applied to the Forge dataset and to our own dataset complemented with the Enron dataset (Dataset-1 and Dataset-2, respectively); in Phase 2, the signature extractor (V2.1 and V2.2) was applied to a large collection of emails obtained from our native English-speaking friends (Dataset-3). To evaluate the performance of the techniques, we use the typical evaluation metrics from information retrieval: Precision (P), Recall (R), F1 measure and Accuracy ([6]):

P = TP / (TP + FP),  R = TP / (TP + FN),  F1 = 2PR / (P + R),  Acc = (TP + TN) / (TP + TN + FP + FN)

where TP (True Positives) is the number of documents correctly assigned to the positive category, TN (True Negatives) is the number of assessments where the system and a human expert agree on a negative label, FP (False Positives) is the number of documents incorrectly assigned to the considered category, and FN (False Negatives) is the number of negative labels that the system assigned to documents assessed as positive by the human expert ([12]). All the presented measures can be aggregated over all categories in two ways: micro-averaging – the global calculation of the measure, considering all the documents as a single dataset regardless of categories – and macro-averaging – the average of the measure's scores over all categories. In this paper we used macro-averaged P, R, F1 and Acc.
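The formulas above translate directly into code. A minimal sketch of the per-class metrics and their macro-average follows; the confusion-matrix counts in the example are invented purely for illustration:

```python
def prf(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return p, r, f1, acc


def macro_average(per_class_counts):
    """Macro-average: compute P, R, F1, Acc per class, then average them."""
    scores = [prf(*counts) for counts in per_class_counts]
    k = len(scores)
    return tuple(sum(s[i] for s in scores) / k for i in range(4))


# Illustrative (tp, fp, fn, tn) counts for the P and the N class.
p, r, f1, acc = macro_average([(90, 5, 10, 95), (95, 10, 5, 90)])
```

Micro-averaging would instead sum the TP, FP, FN, TN counts over all classes first and apply the formulas once to the pooled counts.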

4.1 Dataset-1

As the first dataset we experimented with, we used a set of Forge-formatted emails: P – 1243 emails, N – 1474 emails. We then filtered the dataset to emails from unique senders and obtained P – 469 and N – 344 emails ("ForgeUnique"). After additionally filtering out messages from news and newsletters, we ended up with P – 469 and N – 326 emails ("ForgeUniqueNotBots"). Applying the Phase-1 method, involving email preprocessing and the SVM classifier, to mail classification (P/N) with 10-fold cross-validation, we obtained the following results (Table 1):

Table 1. Results of email classification with the SVM method, 10-fold cross-validation, on our email datasets with unique senders, and with unique senders that are not bots.

Dataset              P         R         ACC       F1
ForgeUnique          0.924152  0.987207  0.945879  0.954639
ForgeUniqueNotBots   0.948665  0.989293  0.962169  0.968553

4.2 Dataset-2

A set of mail-classification experiments was then carried out on the Enron email dataset complemented with our own mail collections and mailing lists (P – 400 emails and N – 400 emails). We annotated (in a semiautomatic way) all the emails with Forge-format annotation, divided into several categories:

– Positive with the signature at the end,
– Positive with the signature followed by a notification,
– Positive with the signature followed by something else,
– Positive with a signature containing odd characters,
– Negative – no signature,
– Negative – incomplete signature.

When SVM was applied to mail classification (P/N), we calculated quality measures for different combinations of these categories (trained on our mails, tested on Enron, cross-validation on the joined dataset, by categories, etc.), and all the results were above 90%. Some of these (overall) results are given in Table 2.

Table 2. Results of SVM email classification with 10-fold cross-validation, on Enron and on Enron complemented with our emails from unique senders that are not bots.

Dataset            P         R         ACC       F1
Enron              0.857342  0.943707  0.867096  0.897685
OurEmails+Enron    0.919156  0.959001  0.936205  0.938490

4.3 Dataset-3

The final dataset we experimented with was quite a large email dataset in its original format (not Forge format). It consisted of 15264 emails without signatures and 5105 emails with signatures. The dataset was first thoroughly annotated (labeled) by competent labellers (more than one per email) and divided into the same categories as before (P, N, P with notifications following the signature, incomplete signature, etc.). Regarding the Phase-2 processing, when the SVM algorithm was applied to signature extraction from the emails of the P class, the results obtained on this dataset were as presented in Table 3.

Table 3. Results of the signature extractor on Dataset-3 with the SVM method only (V2.1) and the hybrid method (V2.2).

Dataset-3   P         R         ACC       F1
V2.1        0.964784  0.759530  0.932816  0.849941
V2.2        0.996231  0.932027  0.982081  0.963060

Thus the applied rules significantly improved the quality of the signature extractor. The accuracy of the methods proposed in this paper for email signature detection and extraction cannot be directly compared with other approaches presented elsewhere, due to differences in datasets (size, coverage), datasets not being shared publicly for privacy reasons, the fact that email signature structure evolves over time, etc. Still, the quality measures obtained on a very large corpus of contemporary emails of different categories (personal, business) and presented in this paper are quite persuasive and better than most of those reported elsewhere (the signature-extraction precision of 99.6% being the very best among all).

5 Conclusion and Future Work

Email has the advantage of being sent and received instantly, whether the recipient is next door or thousands of miles away. Uninterrupted communication also allows companies and clients to establish a meaningful and productive conversation. A very important part of each email is its signature block, which serves as a virtual business card. In this paper we presented a new hybrid approach to automatic signature extraction. The approach consists of three tasks: preprocessing the email thread to obtain the most recently sent original email from the sender, classifying emails according to whether they contain a signature using a hybrid approach (SVM complemented with IE rules), and extracting the signature from the positive category (emails that contain a signature). We used three different datasets: Forge, Enron complemented with our emails, and a dataset obtained from our native English-speaking friends. On the third, largest and most challenging email dataset, we extracted signatures with a precision of 99.62% and a recall of 93.20%. One possible improvement of the presented approach would be to add the results of the IE verification technique as additional features for the SVM classifier. All the conditions could be used as features, and some improvement from such an SVM learner may be expected. The next task we are working on is the extraction of entities from the signature. We also plan to work on the classification of emails into business and personal classes, based on their content.

Acknowledgements. The work presented has been supported by the Ministry of Science and Technological Development, Republic of Serbia, through Projects No. 174021 and No. III47003.

References
1. Forge dataset. http://github.com/materials-data-facility/forge
2. Mailgun: Open sourcing our email signature parsing library. http://www.mailgun.com/blog/open-sourcing-our-email-signature-parsing-library/
3. SVM, scikit-learn library. http://scikit-learn.org/stable/modules/svm.html
4. Talon, Mailgun's Python library. http://github.com/mailgun/talon
5. Text Minner, Email Signature Extractor. http://appsource.microsoft.com/en-us/product/office/wa104380692
6. Baeza-Yates, R., Ribeiro-Neto, B., et al.: Modern Information Retrieval, vol. 463. ACM Press, New York (1999)
7. Carvalho, V.R., Cohen, W.W.: Learning to extract signature and reply lines from email. In: Proceedings of the Conference on Email and Anti-Spam (2004)
8. Chen, H., Hu, J., Sproat, R.W.: Integrating geometrical and linguistic analysis for email signature block parsing. ACM Trans. Inf. Syst. (TOIS) 17(4), 343–366 (1999)
9. Graovac, J.: A variant of n-gram based language-independent text categorization. Intell. Data Anal. 18(4), 677–695 (2014)
10. Graovac, J., Kovačević, J., Pavlović-Lažetić, G.: Hierarchical vs. flat n-gram-based text categorization: can we do better? Comput. Sci. Inf. Syst. 14(1), 103–121 (2017)


11. Graovac, J., Mladenović, M., Tanasijević, I.: NgramSPD: exploring optimal n-gram model for sentiment polarity detection in different languages. Intell. Data Anal. 23(2), 279–296 (2019)
12. Joachims, T.: Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers (2002)
13. Joachims, T.: A statistical learning model of text classification for SVMs. In: Learning to Classify Text Using Support Vector Machines, pp. 45–74. Springer (2002). https://doi.org/10.1007/978-1-4615-0907-3_4
14. Lang, K.: The 20 Newsgroups data set, version 20news-18828 (1995)
15. Lawson, N., Eustice, K., Perkowitz, M., Yetisgen-Yildiz, M.: Annotating large email datasets for named entity recognition with Mechanical Turk. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 71–79 (2010)
16. Minkov, E., Wang, R.C., Cohen, W.: Extracting personal names from email: applying named entity recognition to informal text. In: Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 443–450 (2005)
17. Radicati, S.: Email Market, 2021–2025. The Radicati Group Inc., Palo Alto, CA (2021)
18. Tanasijević, I.: Multimedia databases in managing the intangible cultural heritage. University of Belgrade (2021)
19. Tanasijević, I., Pavlović-Lažetić, G.: HerCulB: content-based information extraction and retrieval for the cultural heritage of the Balkans. The Electronic Library (2020)
20. Tang, J., Li, H., Cao, Y., Tang, Z.: Email data cleaning. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 489–498 (2005)

Performance of Packet Delivery Ratio for Varying Vehicles Speeds on Highway Scenario in C-V2X Mode 4

Teguh Indra Bayu1,2, Yung-Fa Huang3(B), and Jeang-Kuo Chen1

1 Department of Information Management, Chaoyang University of Technology, Taichung, Taiwan
[email protected], [email protected], [email protected]
2 College of Information Technology, Satya Wacana Christian University, Salatiga, Indonesia
3 Department of Information and Communication Engineering, Chaoyang University of Technology, Taichung, Taiwan
[email protected]

Abstract. The 3GPP's Cellular Vehicle-to-Everything (C-V2X) specifications cover both short-range Vehicle-to-Vehicle (V2V) communications, which utilize an air interface called sidelink/PC5, and wide-area Vehicle-to-Network (V2N) communications, which enable vehicles to communicate with base stations (referred to as eNodeB in 3GPP). The primary contribution of this work is to investigate the performance of C-V2X Mode 4 in realistic highway scenarios by altering the vehicle speed, vehicle density, resource keep probability (Prk), and Modulation and Coding Scheme (MCS) parameters. The simulation scenarios in this work use three vehicle-speed types: Type 1 is 40 km/h, Type 2 is 80 km/h, and Type 3 is 120 km/h. The simulation results show that the MCS = 9, Prk = 0.8 configuration achieved the best 95% Packet Delivery Ratio (PDR) distance breaking points, at 380, 300, and 280 m for vehicle densities of 0.1, 0.2, and 0.3 vehicles/meter, respectively.

Keywords: C-V2X · Performance · Vehicles speed · PDR

1 Introduction

Nowadays, vehicles are equipped with a range of onboard sensors (cameras, RADARs, and LIDARs) that enable them to be partially aware of their surroundings. Wireless vehicular communication technologies can considerably increase the awareness range of vehicles and enable bi-directional communication between vehicles and any other entity equipped with an appropriate communications module. The term Vehicle-to-Everything (V2X) communications is used broadly to describe methods for transmitting information flows in a variety of Intelligent Transportation System (ITS) applications related to traffic safety and efficiency, automated driving, and infotainment. Vehicle-to-Vehicle (V2V), Vehicle-to-Network (V2N), Vehicle-to-Pedestrian (V2P), and Vehicle-to-Infrastructure (V2I) communications are all included in V2X. Numerous standardization bodies have worked to specify V2X wireless technologies, particularly after the US and Europe allocated reserved 5.9 GHz spectrum for ITS in 1999 and 2008, respectively. As a result, several standards families have been completed: the IEEE/SAE DSRC in 2010 in the United States, the ETSI/CEN Cooperative-ITS (C-ITS) Release 1 in 2013 in Europe, and the 3GPP Cellular-V2X (C-V2X) in early 2017 as a feature of LTE Release 14. The physical and data link layers of DSRC and C-ITS follow the IEEE 802.11p standard, a short-range technology for V2V communications. The 3GPP's C-V2X specifications cover both short-range V2V communications, which utilize an air interface called sidelink/PC5, and wide-area V2N communications, which enable vehicles to communicate with base stations (referred to as eNodeB in 3GPP). C-V2X will continue to evolve in future 3rd Generation Partnership Project (3GPP) standard releases to enable new use cases such as automated driving [1].

In short-range C-V2X, two resource-allocation techniques are defined: one that is network-controlled, named Mode 3, and another that is entirely distributed among nodes, named Mode 4. Although Mode 3 is predicted to outperform Mode 4 due to the additional information available to the scheduler, issues at the cellular boundaries could still arise (especially with different operators). When coverage is inconsistent or unavailable, the latter approach remains the only viable alternative. 3GPP defines a Mode 4 algorithm because of its critical necessity and to enable interoperability between products from different suppliers. Outside of 3GPP, few studies have been conducted to determine the performance of the standard and the effect of its many parameters [2]. To better handle radio resource management, the work in [3] introduces a possible adaptation of the Modulation and Coding Scheme (MCS) configuration. The adaptation in [3] is based on the vehicle distance, with a corresponding transmit power to operate on a particular MCS configuration.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 559–568, 2022. https://doi.org/10.1007/978-3-031-21967-2_45
Although several recent articles have focused on Mode 4, each has emphasized a particular characteristic or a few parameters. The work in [4] discussed V2X performance for vehicle and pedestrian safety by comparing cellular and WiFi configurations, while the work in [5] pointed out the actual C-V2X performance considering errors occurring in the PHY and MAC layers. The performance indicator introduced in [5] is called the Packet Delivery Ratio (PDR), where the PDR is calculated by relating the total packets received to the total errors occurring in C-V2X Mode 4 communication. A specific radio communication interface called PC5 sidelink has also been investigated in [6] to foresee its usefulness in terms of driver assistance. Another safety analysis taking account of driver assistance has been done in [7]. To address this constraint and present an in-depth examination of Mode 4, we concentrate on the key factors mentioned, such as vehicle speed, vehicle density, resource keep probability (Prk), and MCS. Previous work in [8] stated that the C-V2X performance for very slow vehicle speeds depends on the number of vehicles and the value of Prk, where a higher Prk showed better PDR performance at higher vehicle densities. With similar vehicle speeds, vehicles tend to group, causing a significant loss in overall PDR performance. To cope with this performance loss, the primary contribution of this work is to investigate the performance of C-V2X Mode 4 in realistic highway scenarios by altering the vehicle speed, vehicle density, resource keep probability, and MCS parameters. The simulation scenarios in [8] still use a uniform vehicle speed of 20 km/h. In this work, the simulation scenarios have multiple types of vehicles with different speeds in one scenario. This mixed vehicle speed configuration comprises

Performance of Packet Delivery Ratio for Varying Vehicles Speeds

561

what we call varying vehicle speed. We believe that the mixed vehicle speed configuration depicts real-world scenarios more closely. Moreover, applying mixed-speed scenarios reduces the probability of a deadlock congestion situation on a highway. A deadlock congestion scenario means that slow-moving vehicles block the fast-moving vehicles; hence all vehicles get stuck behind the slow-moving ones and form a huge group. The remainder of the paper is organized as follows: Sect. 2 examines the primary characteristics of short-range C-V2X Mode 4; Sect. 3 details the models and settings used to obtain numerical results; Sect. 4 analyzes the effect of the parameters on C-V2X performance; and Sect. 5 presents the conclusions of this work.

2 C-V2X

C-V2X Mode 4 adjusts the physical (PHY) and medium access control (MAC) layers of the LTE sidelink in order to accommodate V2X communications. The physical layer is intended to improve the performance of LTE when used in high-mobility environments. An entirely new scheduling mechanism (SB-SPS) has been implemented at the MAC layer, allowing vehicles to make resource selections independently. The PHY and MAC layers of C-V2X Mode 4 are discussed in detail in the following sections, which cover the most significant characteristics of each layer.

2.1 C-V2X Physical Layer

C-V2X's physical layer executes Single-Carrier Frequency Division Multiple Access (SC-FDMA) communication. In the time domain, resources are divided into subframes of 1 ms, which are grouped into frames of 10 ms. A total of 14 SC-FDMA symbols are utilized in each subframe: four are used for demodulation reference signals (DMRS), one for Tx-Rx switching, and the remaining nine symbols for data transmissions [9]. In the frequency domain, the channel is separated into 15 kHz subcarriers, which are grouped into Resource Blocks (RBs). Each RB contains 12 subcarriers and stretches over one subframe. C-V2X, in contrast to the usual resource structure of LTE, divides RBs into subchannels to improve performance and efficiency. The number of RBs per subchannel is configurable, but the total number of subchannels is restricted by the assigned bandwidth, either 10 MHz or 20 MHz. The Physical Sidelink Shared Channel (PSSCH) and the Physical Sidelink Control Channel (PSCCH) are the two types of physical channels. The PSSCH is responsible for transmitting the RBs carrying data, known as Transport Blocks (TBs), to their destinations. The PSCCH is responsible for transporting the Sidelink Control Information (SCI), which is essential for scheduling and decoding operations.
SCI provides information such as the Modulation and Coding Scheme (MCS) used to transmit the packet, the frequency resource location where the transmission occurs, and other scheduling information.
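As a rough illustration of the resource grid just described, the number of subchannels available per subframe follows directly from the bandwidth and the configured RBs per subchannel. A minimal sketch, assuming the standard LTE numerology of 50 RBs for 10 MHz and 100 RBs for 20 MHz (the function name is ours):

```python
# Sketch of the C-V2X sidelink resource grid described above.
# Assumes standard LTE numerology: 50 RBs for 10 MHz, 100 RBs for 20 MHz.

RBS_PER_BANDWIDTH = {10: 50, 20: 100}  # MHz -> resource blocks
SUBCARRIERS_PER_RB = 12
SUBCARRIER_SPACING_KHZ = 15

def subchannels(bandwidth_mhz: int, rbs_per_subchannel: int) -> int:
    """Number of whole subchannels that fit in the given bandwidth."""
    total_rbs = RBS_PER_BANDWIDTH[bandwidth_mhz]
    return total_rbs // rbs_per_subchannel

print(subchannels(10, 17))  # -> 2 subchannels in 10 MHz
print(subchannels(10, 12))  # -> 4 subchannels in 10 MHz
```

With 17 RBs per subchannel a 10 MHz channel yields two subchannels, and with 12 RBs it yields four, matching the two MCS configurations simulated later in the paper.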

562

T. I. Bayu et al.

Transmission of the PSCCH and PSSCH can be accomplished using adjacent or non-adjacent schemes. In the adjacent scheme, the PSCCH and PSSCH are broadcast in contiguous RBs. In the non-adjacent scheme, the PSCCH and PSSCH are broadcast in separate RB pools. Specifically, the PSCCH requires only two RBs, while the number of RBs required by the PSSCH is variable and depends on the Transport Block (TB) size. The PSSCH and PSCCH are always transmitted in the same subframe, regardless of the transmission scheme used.

2.2 C-V2X MAC Layer

At the MAC layer, C-V2X supports SB-SPS, also known as Mode 4, allowing vehicles to choose resources independently. The process begins with the reception of a packet from one of the upper layers. Upon receipt, the MAC layer generates a scheduling grant that specifies the number of subchannels to be reserved, the number of recurrent transmissions for which the subchannels will be reserved, and the interval between transmissions. If a grant has already been generated when a packet from the upper layers is received, the transmission is scheduled based on the grant currently in effect. The number of subchannels is pre-configured and determined by the application requirements. The Resource Reselection Counter (RRC) determines the number of recurrent transmissions; it is set by selecting an integer value between 5 and 15 at random. Finally, the Resource Reservation Interval (RRI) establishes the periodicity between transmissions, and the upper layers determine its value [10]. The grant is then transmitted to the PHY layer, which generates a list of all the resources that fulfill the grant parameters. Candidate Single-Subframe Resources (CSRs) are resources that are part of a single subframe and can consist of one or more subchannels.
All of the CSRs within a selection window, spanning the time between the moment a packet is received from the upper layers and the maximum permissible delay determined by the RRI, are included in the list. The information contained in the SCIs received during a sensing window, comprising the latest 1000 subframes, is then used to filter the list. CSRs are rejected based on this SCI information: a CSR is excluded if an SCI indicates that it will be reserved during the following selection window and if the CSR's average PSSCH Reference Signal Received Power (RSRP) exceeds a predetermined threshold. After removing all CSRs that fulfill these two criteria, at least 20% of the total number of CSRs should remain. If this is not the case, the process is repeated, with the RSRP threshold raised by 3 dB each time. Finally, the PHY selects the 20% of CSRs with the lowest Sidelink Received Signal Strength Indicator (RSSI) averaged over the sensing window. Thanks to this 20% threshold, the CSRs with the lowest interference levels are prioritized over the others. The MAC layer receives the remaining CSRs, and a single CSR is chosen at random in order to limit the likelihood of many vehicles selecting the same CSR. The CSR is used for a series of recurrent transmissions set by the RRC, whose value decreases with each transmission. Upon reaching zero, each vehicle can either maintain the same reservation with probability Prk or generate a new grant and restart the selection procedure.
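The SB-SPS filtering steps above can be sketched as follows; this is an illustrative simplification (single-subchannel CSRs, synthetic RSRP/RSSI inputs), not the OpenCV2X implementation:

```python
import random

def sbsps_select(csrs, reserved, rsrp, rssi, keep_fraction=0.2,
                 rsrp_threshold_dbm=-110.0):
    """Sketch of the SB-SPS candidate filtering described above.
    csrs: list of candidate IDs; reserved: set of CSRs announced in SCIs;
    rsrp/rssi: dBm measurements per CSR (hypothetical inputs)."""
    # Step 1: drop CSRs reserved by others whose PSSCH-RSRP exceeds the
    # threshold; relax the threshold by 3 dB until >= 20% of CSRs survive.
    threshold = rsrp_threshold_dbm
    while True:
        kept = [c for c in csrs
                if not (c in reserved and rsrp[c] > threshold)]
        if len(kept) >= keep_fraction * len(csrs):
            break
        threshold += 3.0
    # Step 2: keep the 20% of survivors with the lowest average S-RSSI.
    kept.sort(key=lambda c: rssi[c])
    best = kept[:max(1, int(keep_fraction * len(csrs)))]
    # Step 3: the MAC picks one CSR at random and draws the
    # Resource Reselection Counter uniformly from [5, 15].
    return random.choice(best), random.randint(5, 15)

# Example: all 10 CSRs are announced as reserved with strong RSRP, so the
# threshold is relaxed until enough candidates survive.
csr, counter = sbsps_select(list(range(10)), set(range(10)),
                            rsrp={c: -90.0 for c in range(10)},
                            rssi={c: float(c) for c in range(10)})
print(csr, counter)  # a CSR from the two with lowest RSSI; counter in [5, 15]
```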


3 System Models

The Nakagami fading model has been widely used in numerous vehicular contexts because of its realistic characteristics. In our system model, we assume that all V2X channels experience Nakagami fading and that the channel gain remains constant during the whole packet time. The probability density function of the reception power y is stated as follows:

f_Y(y) = (m^m y^{m-1} / (Γ(m) ω^m)) e^{−my/ω},   (1)

where ω and m signify the mean received power and the fading exponent, respectively, and Γ(m) is the gamma function; the parameter m of the Nakagami distribution can be estimated by maximum likelihood (using the log-likelihood function). We set the value of m to 1.5 if the distance between a sender and a receiver is less than 80 m, and to 0.75 in all other circumstances, where the distance is greater [11]. In this study, the channel model takes into account pathloss as well as the shadow-fading effect. The distance between the transmitting and the receiving vehicle determines the pathloss computation, which is dictated by three conditions. The first condition is that the lowest permissible distance is three meters; as a result, if the distance is less than three meters, the value is set to three. The second condition applies if the distance is more than three meters and less than the breakpoint distance (d_BP). The final condition applies when the distance exceeds d_BP. The breakpoint distance d_BP can be stated as follows [12]:

d_BP = 4 (h_Ue)^2 f_c / c,   (2)

where h_Ue is the user equipment height, f_c is the carrier frequency, and c = 3 × 10^8 m/s is the speed of light. To acquire the final pathloss value, the pathloss (PL) and the free-space pathloss (PL_free) are compared and the higher value is chosen. The pathloss can be expressed as [12]

PL = 27 + 22.7 log10(d) + 20 log10(f_c),                               3 < d < d_BP,
PL = 7.56 + 40 log10(d) − 34.6 log10(h_Ue) + 2.7 log10(f_c),           d ≥ d_BP,   (3)

where d is the distance between the transmitting and receiving vehicles and f_c is in gigahertz. In free space, the pathloss can be expressed as [12]

PL_free = 46.4 + 20 log10(d) + 20 log10(f_c / 5).   (4)
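A small sketch of the channel model of Eqs. (1)–(4); note that the height-dependent term of Eq. (3) is reconstructed here from the WINNER-style model of [12], so the constant −34.6·log10(h_Ue) is our assumption:

```python
import math

C = 3e8  # speed of light, m/s

def nakagami_m(d: float) -> float:
    """Fading exponent selection used in the system model [11]."""
    return 1.5 if d < 80.0 else 0.75

def breakpoint_distance(h_ue: float, fc_hz: float) -> float:
    """Eq. (2): d_BP = 4 * h_Ue^2 * f_c / c."""
    return 4.0 * h_ue**2 * fc_hz / C

def pathloss_db(d: float, h_ue: float = 1.5, fc_ghz: float = 5.9) -> float:
    """Sketch of Eqs. (3)-(4); the -34.6*log10(h_Ue) term is reconstructed
    and may differ slightly from the authors' exact formula."""
    d = max(d, 3.0)                      # minimum permissible distance
    d_bp = breakpoint_distance(h_ue, fc_ghz * 1e9)
    if d < d_bp:
        pl = 27.0 + 22.7 * math.log10(d) + 20.0 * math.log10(fc_ghz)
    else:
        pl = 7.56 + 40.0 * math.log10(d) \
             - 34.6 * math.log10(h_ue) + 2.7 * math.log10(fc_ghz)
    pl_free = 46.4 + 20.0 * math.log10(d) + 20.0 * math.log10(fc_ghz / 5.0)
    return max(pl, pl_free)              # final value: the larger of the two

print(round(breakpoint_distance(1.5, 5.9e9), 1))  # -> 177.0 m (Table 1: 176 m)
```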

4 Simulation Parameters and Results

This work uses an open-source implementation of the 3GPP standard C-V2X (Rel. 14) Mode 4, named OpenCV2X, to study the performance of C-V2X Mode 4. OpenCV2X is a modified version of the SimuLTE OMNeT++ simulator that enables the simulation of LTE networks [9]. OpenCV2X is used in conjunction with SUMO, a road network simulator that supports various vehicle speed settings. The configuration of the simulation environment is given in Table 1.

Table 1. Simulation environment configurations.

  Road length: 2 km
  Vehicle density (β): 0.1 (200 cars), 0.2 (400 cars), 0.3 (600 cars)
  Number of vehicle types: 3
  User equipment height (h_Ue): 1.5 m
  Carrier frequency (f_c): 5.9 GHz
  Breakpoint distance (d_BP): 176 m
  Vehicle type 1 speed: 40 km/h
  Vehicle type 2 speed: 80 km/h
  Vehicle type 3 speed: 120 km/h
  Number of highway lanes: 6 (3 in each direction)
  Modulation and coding scheme (MCS): 7 (2 sub-channels), 9 (4 sub-channels)
  Sub-channels per sub-frame: 2, 4
  RBs per sub-channel: 17 (2 sub-channels), 12 (4 sub-channels)
  Resource keep probability (Prk): 0, 0.8, 1

Three variants of vehicle speed are applied within the simulation scenarios: Type 1 vehicles travel at 40 km/h, Type 2 at 80 km/h, and Type 3 at 120 km/h. We impose the constraint that Type 1 vehicles remain in the outer lane, replicating the movement of slow-moving and heavy vehicles on the road. This condition is created in order to maintain traffic flow and prevent vehicles from becoming stranded behind slower-moving vehicles. A similar distribution of vehicles is observed in each lane, with an equal number of each type. Figure 1 shows the traffic flow simulation in each scenario. It can be seen that the vehicle flow condition is no longer constant compared to the vehicle flow in [8]. This condition allows high-speed vehicles to get past slower vehicles, because the slow-moving vehicles always stay in the same lane. We believe that by simulating mixed vehicle speeds, the performance should differ when the vehicles are grouped. The PDR performance is investigated with the vehicle distance as the horizontal axis. Since the C-V2X communication transmission range can reach up to 450 m, we use 3 m as the minimum and 450 m as the maximum distance for the statistical boundary. We also simulate a non-standard resource keep probability value, Prk = 1. With Prk = 1, the vehicles will always choose the resource slot they already used


before. Different modulation and coding schemes have also been investigated with the corresponding allocation of radio resources, such as the number of available sub-channels per sub-frame.
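The PDR-versus-distance statistic described above can be sketched as follows (the packet log and the 50 m bin width are our own illustrative choices):

```python
# Sketch of the PDR-vs-distance statistic: transmissions are binned by
# transmitter-receiver distance (3 m to 450 m) and PDR is the fraction of
# packets received per bin. Inputs are hypothetical logs.

def pdr_by_distance(samples, bin_m=50, d_min=3.0, d_max=450.0):
    """samples: iterable of (distance_m, received: bool)."""
    bins = {}
    for d, ok in samples:
        if not (d_min <= d <= d_max):
            continue  # statistical boundary used in the paper
        b = int(d // bin_m) * bin_m
        sent, rcvd = bins.get(b, (0, 0))
        bins[b] = (sent + 1, rcvd + (1 if ok else 0))
    return {b: rcvd / sent for b, (sent, rcvd) in sorted(bins.items())}

print(pdr_by_distance([(20, True), (40, True), (60, False), (500, True)]))
# bins: 0-50 m -> 1.0, 50-100 m -> 0.0; the 500 m sample is discarded
```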

Fig. 1. Sumo’s vehicle flow scenario configuration.

Figures 2, 3 and 4 show the PDR comparison curves for β = 0.1, β = 0.2, and β = 0.3, respectively. When the road is uncongested, as for β = 0.1, PDR performance gradually worsens below 90% at a distance of 200 m for Prk = 0, while the non-standard value Prk = 1 starts to decrease at a distance of 250 m. This situation occurs because the number of nodes in the system is small, so resource slot options are plentiful. Once nodes have successfully created a pair for data transmission, the static pair barely affects the other nodes. The only issue in the low-density situation is the radio signal power, which causes the performance to plummet starting at a distance of 400 m. Exceptional results appear in the high-density situation: Prk = 1 starts to fail at a distance of 50 m and continues to decline as the distance increases. However, an interesting result emerges in the β = 0.3 scenario, where the steady Prk = 1 overtakes both Prk = 0 and Prk = 0.8 beyond a distance of 300 m. This again indicates that some node pairs keep using the same resource slot over time. The number of sub-channels for MCS = 9 is larger than for MCS = 7. With additional radio resources, vehicles have more room to select appropriate resources for communication. As can be seen in Figs. 2, 3 and 4, the PDR performance of MCS = 9 surpasses the MCS = 7 results. We use four sub-channels for the MCS = 9 radio configuration; with this amount of radio resources, at least two pairs of CSRs can be served simultaneously. The value of the resource keep probability still plays an essential role in radio resource scheduling and management. For example, in Fig. 2, the performance of MCS = 7 with Prk = 0.8 slightly exceeds that of MCS = 9 with Prk = 0. A high Prk value is demonstrated to achieve the best performance for the vehicle densities β = 0.1 and β = 0.2.
For the high vehicle’s density, such as β = 0.3, Prk = 1 presented to affluent the performance at certain distances, 300 m for MCS = 7 and 350 m for MCS = 9.


Fig. 2. PDR comparison for β = 0.1 vehicles/m.

Fig. 3. PDR comparison for β = 0.2 vehicles/m.

We conclude the performance measurement by considering a PDR threshold of 95% with the corresponding distance breaking point; a larger breaking-point distance indicates better performance. Figure 5 shows the histogram of the 95% PDR performance distance breaking point. From Fig. 5, it is observed that Prk = 0.8 achieved the best 95% PDR performance distance breaking points, with 380, 300, and 280 m for β = 0.1, β = 0.2, and β = 0.3 vehicles/meter, respectively.
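The breaking-point metric can be sketched as follows (the function and the sample curve are illustrative, not the measured data):

```python
def breaking_point(distances, pdr, threshold=0.95):
    """Largest distance before the PDR curve first drops below the
    threshold (a sketch of the 'distance breaking point' metric)."""
    last = None
    for d, p in zip(distances, pdr):
        if p < threshold:
            return last
        last = d
    return last

# Hypothetical curve: PDR stays at or above 95% up to 300 m.
ds = [50, 100, 150, 200, 250, 300, 350]
ps = [0.99, 0.99, 0.98, 0.97, 0.96, 0.95, 0.93]
print(breaking_point(ds, ps))  # -> 300
```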


Fig. 4. PDR comparison for β = 0.3 vehicles/m.

Fig. 5. 95% PDR performance distance breaking point.

5 Conclusions

This work emphasized mixed vehicle speeds in the road environment. The mixed vehicle speeds comprise varying vehicle speeds to create a realistic highway situation. Three types of vehicles are defined by vehicle speed: 40 km/h, 80 km/h, and 120 km/h for vehicle Type 1, Type 2, and Type 3, respectively. The simulations were conducted for three vehicle densities, β = 0.1, 0.2, and 0.3 vehicles/meter. Other cellular radio parameters were


also tested for the modulation and coding scheme, MCS = 7 and MCS = 9. Furthermore, an important MAC layer parameter, Prk, with values 0, 0.8, and 1, was included in the testing. Simulation results show that MCS = 9 with Prk = 0.8 achieves the best performance for the 95% PDR threshold distance breaking point.

References

1. Vukadinovic, V., et al.: 3GPP C-V2X and IEEE 802.11p for vehicle-to-vehicle communications in highway platooning scenarios. Ad Hoc Netw. 74, 17–29 (2018)
2. Bazzi, A., et al.: Study of the impact of PHY and MAC parameters in 3GPP C-V2V mode 4. IEEE Access 6, 71685–71698 (2018)
3. Burbano-Abril, A., et al.: MCS adaptation within the cellular V2X sidelink. arXiv preprint arXiv:2109.15143 (2021)
4. Liu, Z., et al.: Implementation and performance measurement of a V2X communication system for vehicle and pedestrian safety. Int. J. Distrib. Sens. Netw. 12 (2016)
5. Gonzalez-Martin, M., et al.: Analytical models of the performance of C-V2X mode 4 vehicular communications. IEEE Trans. Veh. Technol. 68, 1155–1166 (2019)
6. Hirai, T., Murase, T.: Performance evaluations of PC5-based cellular-V2X mode 4 for feasibility analysis of driver assistance systems with crash warning. Sensors 20 (2020)
7. Hirai, T., Murase, T.: Performance evaluation of NOMA for sidelink cellular-V2X mode 4 in driver assistance system with crash warning. IEEE Access 8, 168321–168332 (2020)
8. Bayu, T.I., Huang, Y.F., Chen, J.K.: Performance of C-V2X communications for high density traffic highway scenarios. In: The 26th International Conference on Technologies and Applications of Artificial Intelligence, pp. 147–152. Taiwan (2021)
9. McCarthy, B., et al.: OpenCV2X: modelling of the V2X cellular sidelink and performance evaluation for aperiodic traffic. arXiv preprint arXiv:2103.13212 (2021)
10. Kang, B., et al.: ATOMIC: adaptive transmission power and message interval control for C-V2X mode 4. IEEE Access 9, 12309–12321 (2021)
11. Zhang, M., et al.: Fuzzy logic-based resource allocation algorithm for V2X communications in 5G cellular networks. IEEE J. Sel. Areas Commun. 39, 2501–2513 (2021)
12. Kyösti, P., Jämsä, T., Meinilä, J., Byman, A.: MIMO channel model approach and parameters for 802.16m. https://ieee802.org/16/tgm/contrib/C80216m-07_104.pdf. Accessed 30 Mar 2022

Extensions of the Diffie-Hellman Key Agreement Protocol Based on Exponential and Logarithmic Functions

Zbigniew Lipiński¹ and Jolanta Mizera-Pietraszko²

¹ University of Opole, Opole, Poland
[email protected]
² Military University of Land Forces, Wroclaw, Poland
[email protected]

Abstract. We propose a method of constructing cryptographic key exchange protocols of the Diffie-Hellman type based on exponential and logarithmic functions over the multiplicative group of integers modulo n. The security of the proposed protocols is based on the computational difficulty of solving a set of congruence equations containing a discrete logarithm. For the multiplicative group of integers modulo n we define the non-commutative group of its automorphisms. On the defined group we construct a non-commutative key exchange protocol similar to the Anshel-Anshel-Goldfeld key exchange scheme.

Keywords: Diffie-Hellman key agreement · Primitive roots of finite field · Non-commutative cryptography

1 Introduction

The Diffie-Hellman key exchange is one of the most utilized key agreement protocols in secure network communication. There are symmetric and asymmetric (public) key exchange variants of the protocol. In the symmetric Diffie-Hellman key exchange protocol, the users A and B agree on the modulus p, a prime number, and a primitive root r from the multiplicative group (Z/(p))^* of integers modulo p. The user A selects a number a and the user B selects a number b from (Z/(p))^*. Both communicating parties exchange the numbers r^a mod p and r^b mod p. The common key is k_{a,b} = (r^b)^a = (r^a)^b mod p [1,2]. Security of the protocol is based on the computational difficulty of determining the value of the discrete logarithm log_r(r^a) mod ϕ(p). The standard X9.42 defines the Diffie-Hellman public key exchange scheme in which the private-public key pair is (a, g^a mod p), a, g ∈ (Z/(p))^* [2]. According to the standard, the modulus should be a prime number of the form p = jq + 1, where q is a large prime and j ≥ 2. The base g of the public key g^a mod p is of the form g = h^j mod p, where h is any integer with 1 < h < p − 1 such that h^j mod p > 1. The base g does not have to be a generator of the cyclic group
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 569–581, 2022. https://doi.org/10.1007/978-3-031-21967-2_46


(Z/(p))^*. For j = 2 we obtain the so-called 'safe prime' numbers p = 2q + 1, where q is prime. The set of primitive roots S^{(r)}_{2q+1} can be obtained by removing from the group (Z/(2q + 1))^* all elements raised to the power two (the quadratic residues) and removing the element 2q, which is the unique element such that (2q)^2 = 1 mod (2q + 1). For the symmetric Diffie-Hellman key exchange protocol the crucial problem is the determination of primitive root elements of the group (Z/(p))^*. Unfortunately, there is no general formula which allows one to determine them. In the preliminary section we discuss the problem of finding primitive root elements in cyclic groups. In the Lemma we define the set of primitive roots in the group (Z/(p))^* as the common part of the complements of the sets of p_i-residues, see formula (2). This definition allows one to construct, in an efficient way, algorithms for searching for primitive root elements in cyclic groups. In the same section we define four types of functions which permute the elements of the group (Z/(p))^*. In Sect. 4 we propose a method of constructing Diffie-Hellman-like key agreement protocols based on the defined exponential and monomial functions. The security of the discussed protocols is based on the computational difficulty of solving a set of congruence equations containing the discrete logarithm. The public key exchange variants of the proposed protocols can be obtained straightforwardly. With each of the defined functions we associate a subgroup of the symmetric group on p − 1 elements. In Sect. 5 we use these groups to construct a symmetric key agreement protocol based on the idea of the Anshel-Anshel-Goldfeld (AAG) key agreement scheme [3,4].
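As a toy illustration of the symmetric protocol and of the safe-prime observation above (removing the quadratic residues and the element 2q), consider the following sketch with a deliberately small safe prime p = 23; real deployments use far larger parameters:

```python
# Toy sketch: primitive roots of a safe prime p = 2q + 1 and a symmetric
# Diffie-Hellman exchange. p = 23 is illustratively small, not secure.
def primitive_roots_safe_prime(p):
    """For a safe prime p = 2q + 1: all non-squares mod p except 2q = p - 1."""
    squares = {pow(x, 2, p) for x in range(1, p)}
    return [g for g in range(2, p - 1) if g not in squares]

def dh_exchange(p, r, a, b):
    """Symmetric DH: both sides derive the common key r^(a*b) mod p."""
    ra, rb = pow(r, a, p), pow(r, b, p)      # the exchanged values
    k_a, k_b = pow(rb, a, p), pow(ra, b, p)  # computed independently by A and B
    assert k_a == k_b
    return k_a

p = 23  # = 2*11 + 1
roots = primitive_roots_safe_prime(p)
print(roots)       # 10 elements, matching phi(phi(23)) = phi(22) = 10
print(dh_exchange(p, roots[0], a=6, b=9))
```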

2 Related Work

The idea of using non-abelian groups in cryptography has its origin in the solutions of three famous problems in combinatorial group theory proposed by M. Dehn in 1911 [5]: the word problem, the conjugacy problem and the isomorphism problem for finitely presented groups [6]. In [7] Wagner and Magyarik devised the first public-key protocol based on the unsolvability of the word problem for finitely presented groups. A non-deterministic public-key cryptosystem based on the conjugacy problem on braid groups, similar to the Diffie-Hellman key exchange system, was proposed in [8]. Of particular importance are the Anshel-Anshel-Goldfeld key agreement system and the public-key cryptosystem proposed in [3,4,9]. The authors used braid groups, in which the word problem is easy to solve but the conjugacy problem is intractable. This is due to the fact that on braid groups the best known algorithm to solve the conjugacy problem requires at least exponential running time. The symmetric key agreement protocol proposed in Sect. 5 is based on the idea of the Anshel-Anshel-Goldfeld (AAG) key agreement scheme. In Sect. 4 we construct a family of Diffie-Hellman-like key agreement protocols based on exponential and monomial functions which permute the cyclic group (Z/(p))^*. This is one of many generalizations of the Diffie-Hellman key exchange protocol. For example, in [10] a generalization of the


Diffie-Hellman scheme called Algebraic generalization of Diffie-Hellman (AGDH) was proposed. Its security is based on the hardness of the homomorphic image problem, which requires computing the image of a given element under an unknown homomorphism of two algebras selected as an encryption platform. In [11] a matrix-based Diffie-Hellman key exchange protocol was proposed; the security of the protocol is based on the exploitation of a non-invertible public matrix in the key generating process. In [12] a polynomial time algorithm to solve the Diffie-Hellman conjugacy problem in braid groups was proposed, and its security aspects were analysed and proved. In [13] a Diffie-Hellman-like key exchange procedure was considered whose security is based on the difficulty of computing discrete logarithms in a group of matrices over a noncommutative ring; in the algorithm the exponentiation is hidden by a matrix conjugation, which ensures its security. In [14] a new computational problem called the twin Diffie-Hellman problem was proposed. The twin DH protocol allows one to avoid the problem of an attack on the public keys exchanged in the standard Diffie-Hellman scheme. I. F. Blake and T. Garefalakis analyzed the complexity of the discrete logarithm problem, the Diffie-Hellman problem and the decision DH problem [15]. The authors showed that if the decision DH problem is hard then computing the two most significant bits of the DH function is hard. In [16] the authors analyze the complexity of the Group Computational Diffie-Hellman and Decisional Diffie-Hellman problems and their application in cryptographic protocols. The security of the group Diffie-Hellman key exchange is discussed in [17]. In [18] the authors apply symbolic protocol analysis to cryptographic protocols which utilize exponentiation calculations on cyclic groups; using this analysis several security properties of Diffie-Hellman-like operations were proved.

3 Preliminaries

The multiplicative group (Z/(n))^* is the set of elements of the ring Z/(n) of integers modulo n coprime to n. The group (Z/(n))^* is cyclic if and only if n is of the type n = 1, 2, 4, p^m, 2p^m, where p is an odd prime number and m ≥ 1 [19]. The generators of (Z/(n))^* are called primitive roots. The order of a primitive root r ∈ (Z/(n))^* is equal to the Euler totient function ϕ(n), i.e., ϕ(n) is the smallest number such that r^{ϕ(n)} = 1 mod n. By S^{(r)}_n we denote the set of all primitive roots in (Z/(n))^*. The cardinality of the set S^{(r)}_n equals ϕ(ϕ(n)). By (Z/ϕ(n))^* we denote the multiplicative group of integers coprime to ϕ(n), i.e., (Z/ϕ(n))^* = {k ∈ Z/ϕ(n) : gcd(k, ϕ(n)) = 1}. For any i ∈ (Z/(n))^* and k ∈ (Z/ϕ(n))^* the equation i · k = 0 mod ϕ(n) implies that i = 0 mod ϕ(n). From this implication it follows that the set {i^k mod n : i ∈ (Z/(n))^*} is equal to (Z/(n))^* and the elements r^k mod n, for a given primitive root r, generate the whole set

S^{(r)}_n = {r^k mod n : k ∈ (Z/ϕ(n))^*}.   (1)

Let p be a prime number and ϕ(p) = p_0^{a_0} p_1^{a_1} ··· p_k^{a_k}, where p_0 = 2. The element g ∈ (Z/(p))^* is a residue of degree p_i modulo p if there exists a ∈ (Z/(p))^* such that a^{p_i} = g mod p. By ((Z/(p))^*)^{p_i} we denote the group of p_i-residues mod p and by (((Z/(p))^*)^{p_i})^C = (Z/(p))^* \ ((Z/(p))^*)^{p_i} the set of p_i-non-residues, i.e., the complement of ((Z/(p))^*)^{p_i} in (Z/(p))^*. The group ((Z/(p))^*)^{p_i} is cyclic of order ϕ(p)/p_i.

Lemma. For a prime number p and ϕ(p) = p_0^{a_0} p_1^{a_1} ··· p_k^{a_k} the set of primitive roots can be obtained from the formula

S^{(r)}_p = ∩_{i=0}^{k} (((Z/(p))^*)^{p_i})^C.   (2)
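The Lemma can be checked numerically: a sketch that computes S_p^{(r)} as the intersection of the p_i-non-residue sets and compares it with a direct order test (the helper names are ours):

```python
# Numerical check of the Lemma: the primitive roots mod p are exactly the
# common p_i-non-residues for all primes p_i dividing p - 1 (Eq. (2)).
def prime_divisors(n):
    ps, d = [], 2
    while d * d <= n:
        if n % d == 0:
            ps.append(d)
            while n % d == 0:
                n //= d
        d += 1
    if n > 1:
        ps.append(n)
    return ps

def roots_by_lemma(p):
    group = set(range(1, p))
    roots = set(group)
    for pi in prime_divisors(p - 1):
        residues = {pow(x, pi, p) for x in group}  # ((Z/(p))*)^(p_i)
        roots &= group - residues                  # intersect the complements
    return sorted(roots)

def is_primitive_root(g, p):
    return all(pow(g, (p - 1) // pi, p) != 1 for pi in prime_divisors(p - 1))

p = 41  # p - 1 = 2^3 * 5
assert roots_by_lemma(p) == [g for g in range(1, p) if is_primitive_root(g, p)]
print(len(roots_by_lemma(p)))  # phi(phi(41)) = phi(40) = 16
```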

Proof. Let g = r^b, x = r^a ∈ (Z/(p))^* and x^{p_i} = g mod p for some root r. From r^{a·p_i} = r^b mod p it follows that a · p_i = b mod ϕ(p). If gcd(b, ϕ(p)) = 1 then p_i does not divide b, which means that b ∈ (Z/ϕ(p))^* and r^b mod p ∈ S^{(r)}_p.

The formula (2) states that for a primitive root r ∈ (Z/(p))^* the set of congruence equations x^{p_i} = r mod p, i ∈ [0, k], is not solvable. By

G^p_{p_0···p_k} = ∩_{i=0}^{k} ((Z/(p))^*)^{p_i} = ((Z/(p))^*)^{p_0 p_1···p_k}

we denote the intersection of all p_i-residue groups ((Z/(p))^*)^{p_i}, i ∈ [0, k]. Because any root is of the form r = r_0^e mod p, where r_0 ∈ S^{(r)}_p and e ∈ (Z/ϕ(p))^*, the group G^p_{p_0···p_k} leaves the set of roots invariant and acts on (Z/ϕ(p))^* as the translations

T_g(r_0^e) = g · r_0^e = r_0^{e+ind(g)} mod p,

where g ∈ G^p_{p_0···p_k} and ind(g) is the index of g. The order of the group is |G^p_{p_0 p_1···p_k}| = ϕ(p)/(p_0 p_1 ··· p_k). One of the first algorithms which allows one to find a primitive root was defined by Gauss. The algorithm uses the fact that for an element g ∈ (Z/(p))^* such that g^{(p−1)/p_i} ≠ 1, the orders of g^{(p−1)/p_i} and g^{(p−1)/p_i^{a_i}} are equal to Ord_p(g^{(p−1)/p_i}) = p_i and Ord_p(g^{(p−1)/p_i^{a_i}}) = p_i^{a_i}, respectively. If p_i ≠ p_j, then by multiplying g_i^{(p−1)/p_i^{a_i}} and g_j^{(p−1)/p_j^{a_j}} one can obtain an element of order p_i^{a_i} p_j^{a_j}.

Gauss's algorithm to find a primitive root:

1. Find the first p_i-non-residue for each i ∈ [0, k], i.e., the numbers g_i such that g_i^{(p−1)/p_i} ≠ 1 mod p.
2. From the formula g_i^{(p−1)/p_i^{a_i}} mod p determine the elements of order p_i^{a_i}.
3. The product ∏_{i=0}^{k} g_i^{(p−1)/p_i^{a_i}} mod p is a primitive root.

As an example of application of Gauss's algorithm we find a primitive root in (Z/(p))^*, where p = 4441. Because p − 1 = 2^3 · 3 · 5 · 37, for each prime p_0 = 2, p_1 = 3,

p_2 = 5, p_3 = 37, the first 2-non-residue is 7, and the first 3-, 5- and 37-non-residue is 2. The element r = 7^{4440/2^3} · 2^{4440/3} · 2^{4440/5} · 2^{4440/37} = 2749 mod 4441 is a primitive root. E. Bach modified Gauss's algorithm to the following form:

1. Find B ≥ 1 such that B log B = C log p, C = 30.
2. Factor p − 1 = p_0^{a_0} ··· p_s^{a_s} Q, where p_i < B and Q is free of primes < B.
3. For each i = 0, ..., s choose a prime g_i ≤ 2(log p)^2 such that g_i^{(p−1)/p_i} ≠ 1.
4. For g = ∏_{i=0}^{s} g_i^{(p−1)/p_i^{a_i}} construct the set

S = {g · q^{(p−1)/Q} mod p : q is prime and q ≤ 5 (log p)^4 / (log log p)^2}.
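Gauss's algorithm above is easy to implement; the following sketch scans for the first p_i-non-residues exactly as in step 1 (the function names are ours):

```python
def factorize(n):
    """Prime factorization of n as {prime: exponent}."""
    fac, d = {}, 2
    while d * d <= n:
        while n % d == 0:
            fac[d] = fac.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        fac[n] = fac.get(n, 0) + 1
    return fac

def gauss_primitive_root(p):
    """Steps 1-3 of Gauss's algorithm for a prime p."""
    r = 1
    for pi, ai in factorize(p - 1).items():
        # Step 1: the first p_i-non-residue g_i (g_i^((p-1)/p_i) != 1 mod p).
        gi = next(g for g in range(2, p) if pow(g, (p - 1) // pi, p) != 1)
        # Step 2: g_i^((p-1)/p_i^a_i) has order exactly p_i^a_i.
        r = r * pow(gi, (p - 1) // pi**ai, p) % p
    # Step 3: the product of elements of coprime prime-power orders.
    return r

r = gauss_primitive_root(4441)
print(r)  # the text obtains 2749 with these non-residue choices
# verify: r^((p-1)/q) != 1 for every prime q | p-1, so r is a primitive root
assert all(pow(r, 4440 // q, 4441) != 1 for q in factorize(4440))
```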

In [20] E. Bach proved, assuming that the extended Riemann hypothesis is true, that the algorithm computes a set S of residues mod p such that S contains a primitive root, and that the cardinality of the set S is of order O((log p)^4 / (log log p)^3). As an example of application of E. Bach's algorithm we find a primitive root in (Z/(p))^*, where p = 4441. For the prime number p = 4441 take C = 1, B = 5, p_0 = 2, p_1 = 3 and write p − 1 = 2^3 · 3 · Q, where Q = 5 · 37 and g = 7^{4440/2^3} · 2^{4440/3} = 2690 mod 4441. For q = 2 the element r = g · q^{4440/Q} = 2690 · 2^{4440/185} = 3355 mod 4441 is a primitive root.

Table 1. The primitive roots in (Z/(p))^*, p − 1 = 2^3 · 3 · 5 · 37.

  p_i        p_i-residue mod p   p_i-non-residue mod p                  primitive root
  p_0 = 2    1, 2, 3, ...        7, 11, 13, ...                         (2 · 3) · 13, ...
  p_1 = 3    1, 6, ...           2, 3, 4, 5, 7, 9, 10, 12, 13, ...      6 · 13, ...
  p_2 = 5    1, 6, ...           2, 3, 4, 5, 8, 9, 10, 11, 12, 13, ...  6 · 13, ...
  p_3 = 37   1, 13, ...          2, 3, ...                              13 · (2 · 3), ...

From formula (2), and the fact that the product of two k-residues is a k-residue while the product of a k-residue and a k-non-residue is a k-non-residue, one can find a non-prime primitive root by the following rule: for each i ∈ [0, k] find a pair (g_i, g'_i) of elements g_i ∈ ((Z/(p))^*)^{p_i} and g'_i ∈ (((Z/(p))^*)^{p_i})^C such that g_i · g'_i = g_0 · g'_0 mod p for all i ∈ [1, k]; then the common value is a primitive root. Table 1 shows how to find a primitive root using the proposed rule for p = 4441. The best known unconditional estimate for the smallest primitive root is due to Burgess [21]. Based on Burgess's results, in [22] Shparlinski constructed a deterministic algorithm which finds a primitive root in a finite field F_p in time O(p^{1/4+ε}), for any ε > 0. V. Shoup in [23], assuming the extended Riemann hypothesis, proved that the least primitive root g_p is bounded by a polynomial in log p: g_p = O(ω^4 (1 + log ω)^4 (log p)^2) = O((log p)^6), where ω ≡ ω(p − 1) is the number of distinct prime factors of p − 1. E. Bach conjectured that the least primitive root for prime p is O((log p)(log log p)) [24].


Z. Lipiński and J. Mizera-Pietraszko

Let us define four invertible functions on (Z/(p))^* determined by elements of the group:

  R(x) = r^x mod p,        r ∈ S_p^(r),
  E(x) = x^e mod p,        e ∈ (Z/ϕ(p))^*,                    (3)
  M(x) = m · x mod p,      m, x ∈ (Z/(p))^*,
  C_n(x) mod p,            gcd(n, p^2 − 1) = 1,

where p is a prime number and C_n(x) are the Chebyshev polynomials of the first kind [25]. The exponential function R(x) is determined by the primitive root r of the group (Z/(p))^*. The monomial function E(x) is determined by the element e from (Z/ϕ(p))^*. Each of the functions (3) defines a permutation of the set (Z/(p))^*. By R, E, M, C_n we denote the permutation matrices determined by the functions R(x), E(x), M(x), C_n(x). Composition of these functions corresponds to the multiplication of permutation matrices. For example, the composition of two functions R_1(x) and R_2(x), defined by R_1 ∘ R_2(x) = R_2(R_1(x)), corresponds to the multiplication of matrices R_1 R_2. We denote by G_p^(r) the permutation group determined by the exponential functions R(x). By G_p^(e), G_p^(m), G_p^(C) we denote the groups generated by the functions E(x), M(x) and C_n(x) respectively. The group G_p^(r) is non-abelian, and G_p^(e), G_p^(m) and G_p^(C) are abelian subgroups of G_p^(r) for prime p > 11.
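The commutativity claims are easy to check numerically by realising R(x) and E(x) as permutations of (Z/(p))^* (a sketch for p = 13; the dictionary representation and helper names are our choices, not the paper's):

```python
# Represent R(x) = r^x and E(x) = x^e as permutations of (Z/(p))^* for p = 13,
# and check the commutation properties claimed in the text.
p = 13
dom = tuple(range(1, p))                       # the set (Z/(p))^*

def as_perm(f):
    """Tabulate a function on (Z/(p))^* as a dict x -> f(x)."""
    return {x: f(x) for x in dom}

def compose(f, g):
    # Convention of the text: R1 o R2 (x) = R2(R1(x)), i.e. matrix product R1 R2.
    return {x: g[f[x]] for x in dom}

proots = [r for r in dom
          if all(pow(r, (p - 1) // q, p) != 1 for q in (2, 3))]   # p - 1 = 2^2 * 3
R = {r: as_perm(lambda x, r=r: pow(r, x, p)) for r in proots}     # R(x) = r^x
E = {e: as_perm(lambda x, e=e: pow(x, e, p)) for e in (1, 5, 7, 11)}  # E(x) = x^e
```

The E-permutations commute because E_i ∘ E_j (x) = x^{e_i e_j}, while already the generators R_2 and R_6 fail to commute, reflecting that G_13^(r) is non-abelian.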

4 Extensions of Diffie-Hellman Protocol

The first extension of the Diffie-Hellman key exchange protocol utilises the exponential and monomial functions defined in (3).

The DH^{r,e} algorithm.
1. The users A and B agree the primitive root r from (Z/(p))^* and the element e ∈ (Z/ϕ(p))^*.
2. The user A selects a number a, calculates r^{a^e} mod p and sends it to B. The user B selects a number b, calculates r^{b^e} mod p and sends it to A.
3. For given a^e and r^{b^e} the user A calculates

  k_A = (r^{b^e})^{a^e} = r^{b^e a^e} = r^{(ba)^e} mod p.

For given b^e and r^{a^e} the user B calculates

  k_B = (r^{a^e})^{b^e} = r^{a^e b^e} = r^{(ab)^e} mod p.

The common key is k_{a,b} = r^{(ab)^e} mod p. Let us illustrate the algorithm with a simple example on the (Z/(4441))^* group. Let r = 21, e = 53, a = 121, b = 61 ∈ (Z/(4441))^*; then a^e = 3535, r^{b^e} = 3209 and b^e = 972, r^{a^e} = 862 mod 4441. The common key is k_A = 3209^3535 = k_B = 862^972 = 385 mod 4441.

The attacker can intercept the numbers r, e and u = r^{a^e} mod p. To decipher the number a^e mod p he has to solve the logarithmic equation log_r(u) = w mod ϕ(p). Using encryption for multiple exponents {e_i}_i we force the attacker to decipher the secret a, which requires solving the set of equations log_r(u) = w mod ϕ(p), w = a^{e_i} mod p, with two unknown parameters w and a.

Extensions of the Diffie-Hellman Key Agreement Protocol

The next modification of the Diffie-Hellman protocol utilizes composition of the exponential functions from (3).

The DH^{2r} algorithm.
1. The users A and B agree two primitive roots r_1, r_2 from (Z/(p))^*.
2. The user A selects a number a, calculates r_1^{r_2^a} mod p and sends it to B. The user B selects a number b, calculates r_1^{r_2^b} mod p and sends it to A.
3. For given r_1^{r_2^b} mod p and r_2^a mod p the user A calculates

  k_A = (r_1^{r_2^b})^{r_2^a} = r_1^{r_2^a r_2^b} = r_1^{r_2^{a+b}} mod p.

For given r_1^{r_2^a} mod p and r_2^b mod p the user B calculates

  k_B = (r_1^{r_2^a})^{r_2^b} = r_1^{r_2^a r_2^b} = r_1^{r_2^{a+b}} mod p.

The common key is k_{a,b} = r_1^{r_2^{a+b}} mod p.

The following example illustrates how the algorithm works. Let r_1 = 44, r_2 = 94, a = 121, b = 61 ∈ (Z/(4441))^*; then r_1^{r_2^a} = 939, r_2^a = 3386 and r_1^{r_2^b} = 2593, r_2^b = 3031 mod 4441. The common key is k_A = 2593^3386 = k_B = 939^3031 = 450 mod 4441.

The attacker has the three numbers r_1, r_2 and u = r_1^{r_2^a} mod p. To decipher the secret a he has to solve the set of two discrete logarithm equations log_{r_1}(u) = w mod ϕ(p) and log_{r_2}(w) = a mod ϕ(p) for the unknown variables w and a.

The third modification of the Diffie-Hellman protocol utilizes the inverse of the exponential function and the monomial functions from (3).

The DH^{e,log} algorithm.
1. The users A and B agree the primitive root r from (Z/(p))^* and the element e ∈ (Z/ϕ(p))^*.
2. The user A selects a primitive root r_a, calculates (log_{r_a}(r))^e mod ϕ(p) and sends it to B. The user B selects a primitive root r_b, calculates (log_{r_b}(r))^e mod ϕ(p) and sends it to A.
3. For given (log_{r_b}(r))^e and (log_r(r_a))^e the user A calculates

  k_A = (log_{r_b}(r))^e (log_r(r_a))^e = (log_{r_b}(r_a))^e mod ϕ(p).

For given (log_{r_a}(r))^e and (log_r(r_b))^e the user B calculates

  k_B = (log_{r_a}(r))^e (log_r(r_b))^e = (log_{r_a}(r_b))^e mod ϕ(p).

Because (log_{r_b}(r_a))^e = (log_{r_a}(r_b))^{-e} mod ϕ(p), the common key is k(r_a, r_b) = (log_{r_b}(r_a))^e mod ϕ(p).
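The first two extensions are easy to replay numerically (an illustrative Python sketch; we reduce every exponent mod ϕ(p) = p − 1, which Fermat's little theorem justifies, so intermediate values may differ from those quoted in the worked examples while the key-agreement property still holds):

```python
# Sketch of the DH^{r,e} and DH^{2r} exchanges on (Z/(4441))^*.
p = 4441                       # prime modulus, phi(p) = p - 1 = 4440
r, e = 21, 53                  # public data of DH^{r,e} (values from the text)
a, b = 121, 61                 # the users' secrets

# DH^{r,e}: A sends r^{a^e}, B sends r^{b^e}; both reach r^{(ab)^e}.
a_e, b_e = pow(a, e, p - 1), pow(b, e, p - 1)
k_A = pow(pow(r, b_e, p), a_e, p)       # A raises B's message to a^e
k_B = pow(pow(r, a_e, p), b_e, p)       # B raises A's message to b^e

# DH^{2r}: with two primitive roots the shared key is r1^{r2^{a+b}}.
r1, r2 = 44, 94
k2_A = pow(pow(r1, pow(r2, b, p - 1), p), pow(r2, a, p - 1), p)
k2_B = pow(pow(r1, pow(r2, a, p - 1), p), pow(r2, b, p - 1), p)
```

The symmetry of the exponent products is all that both parties rely on; the built-in three-argument pow keeps every step a fast modular exponentiation.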


As an example we will calculate the common key k(r_a, r_b) on the (Z/(4441))^* group. Let e = 53, r = 21, r_a = 104, r_b = 168 ∈ (Z/(4441))^*; then log_{r_b}(r) = 2527, log_r(r_a) = 1913, log_{r_a}(r) = 3377, log_r(r_b) = 1063 mod 4440. Because k_A = 2527^53 · 1913^53 = 2231 and k_B = 3377^53 · 1063^53 = 3431 mod 4440, the common key is k(r_a, r_b) = 2231 mod 4440. The attacker has the primitive root r, the exponent e and u_a = (log_{r_a}(r))^e mod ϕ(p). To determine r_a he has to solve the set of equations u_a = (w_a)^e mod ϕ(p) and w_a = log_{r_a}(r) mod ϕ(p) for the two unknown variables w_a and r_a.
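The identity used in step 3, log_{r_b}(r) · log_r(r_a) = log_{r_b}(r_a) mod ϕ(p), can be verified with a brute-force discrete logarithm on a toy prime (the helper dlog and the prime p = 101 are our illustrative choices, not from the paper; brute force is of course infeasible at cryptographic sizes):

```python
# Brute-force discrete logarithm; workable only for toy-sized primes.
def dlog(base, target, p):
    """Smallest k >= 1 with base^k == target (mod p)."""
    acc = 1
    for k in range(1, p):
        acc = acc * base % p
        if acc == target:
            return k
    raise ValueError("no logarithm found")

p, qs = 101, (2, 5)                          # p - 1 = 100 = 2^2 * 5^2
proots = [g for g in range(2, p)
          if all(pow(g, (p - 1) // q, p) != 1 for q in qs)]
r, ra, rb = proots[0], proots[1], proots[2]  # three primitive roots mod 101
e = 7                                        # gcd(7, 100) = 1, so e is valid

# A's key: (log_rb(r))^e * (log_r(ra))^e, all arithmetic mod phi(p).
k_A = pow(dlog(rb, r, p), e, p - 1) * pow(dlog(r, ra, p), e, p - 1) % (p - 1)
```

Because discrete logarithms compose like a chain rule when every base is a primitive root, k_A collapses to (log_{r_b}(r_a))^e, exactly as in the protocol.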

5 The Key Agreement Protocol on Permutation Group

Let us denote by V((Z/(p))^*) the set of ordered ϕ(p)-tuples with elements from the group (Z/(p))^*. By v_0 = (1, . . . , ϕ(p)) ∈ V((Z/(p))^*) we denote the tuple ordered in ascending order. The functions from (3) define permutations of V((Z/(p))^*):

  R v_0 = (r^1, . . . , r^{p−1}) mod p,               r ∈ S_p^(r),
  E v_0 = (1^e, . . . , (p − 1)^e) mod p,             e ∈ (Z/ϕ(p))^*,         (4)
  M v_0 = (m · 1, . . . , m · (p − 1)) mod p,         m ∈ (Z/(p))^*,
  C_n v_0 = (C_n(1), . . . , C_n(p − 1)) mod p,       gcd(n, p^2 − 1) = 1.

The sets of matrices (4) we denote as S_p^(R), S_p^(E), S_p^(M) and S_p^(C) respectively. From the composition rule of the functions R_i ∘ R_j(x) = r_j^{r_i^x} mod p follows the non-commutativity of the group G_p^(r) generated by the matrices R, i.e., R_i R_j ≠ R_j R_i ∈ G_p^(r). From the composition rule of the monomials E_i ∘ E_j(x) = (x^{e_i})^{e_j} = x^{e_i e_j} mod p, x ∈ (Z/(p))^*, e_i, e_j ∈ (Z/ϕ(p))^*, follows the commutativity of the group G_p^(e), i.e., E_i E_j = E_j E_i. The matrices M_i, i ∈ [1, ϕ(p)], generate the finite abelian group G_p^(m) isomorphic to (Z/(p))^*, called the automorphism group of (Z/(p))^* [26]. From the equation R ∘ E(x) = r^{xe} = (r^e)^x mod p it follows that RE ∈ S_p^(R), which implies that G_p^(e) ⊂ G_p^(r). The formula (1) written in terms of matrices has the form S_p^(R) = {R E_i}_{i=1}^{ϕ(ϕ(p))}. From the equations E ∘ M(x) = M(E(x)) = m x^e mod p and M ∘ E(x) = E(M(x)) = (mx)^e mod p it follows that the abelian groups G_p^(e) and G_p^(m) do not commute, i.e., [G_p^(e), G_p^(m)] ≠ 0. For p = 11 the group G_11^(r) is the alternating group of the set of ten elements. For a prime number p > 11 the group G_p^(r) is the symmetric group Sym((Z/(p))^*) of the set (Z/(p))^*. The orders of the groups G_p^(e) and G_p^(m) are equal to ϕ(ϕ(p)) and ϕ(p) respectively. The construction of the symmetric cryptographic key on the group G_p^(r) was motivated by the AAG symmetric key exchange protocol [3,4]. The cryptographic key exchange we define by the formula

  k_{a,b} = a_0 b_{N+1} a_{N+1}^{−1} b_0^{−1},


where a_0, b_{N+1}, a_{N+1}, b_0 ∈ G_p^(r). The construction of the protocol can be explained on the following example. The users A and B agree the set of elements S^(g) = {g_{0,1}, g_{1,1}, g_{1,2}, g_{2,1}} from the group G_p^(r). The user A selects the set S_a = {a_0, a_1, a_2, a_3}, such that a_3 = g_{0,1} g_{1,1} g_{2,1}, calculates

  S_a^(g) = {u_{0,1} = a_0 g_{0,1} a_1^{−1}, u_{1,1} = a_1 g_{1,1} a_2^{−1}, u_{1,2} = a_1 g_{1,2} a_2^{−1}, u_{2,1} = a_2 g_{2,1} a_3^{−1}}

and sends S_a^(g) to B. Similarly, the user B selects the set S_b = {b_0, b_1, b_2, b_3}, such that b_3 = g_{0,1} g_{1,2} g_{2,1}, calculates

  S_b^(g) = {w_{0,1} = b_0 g_{0,1} b_1^{−1}, w_{1,1} = b_1 g_{1,1} b_2^{−1}, w_{1,2} = b_1 g_{1,2} b_2^{−1}, w_{2,1} = b_2 g_{2,1} b_3^{−1}}

and sends S_b^(g) to A. The user A, using the formula w_{0,1} w_{1,1} w_{2,1} = b_0 a_3 b_3^{−1}, calculates the common key k_{a,b}^{−1} = (b_0 a_3 b_3^{−1}) a_0^{−1}. The user B, using the formula u_{0,1} u_{1,2} u_{2,1} = a_0 b_3 a_3^{−1}, calculates the common key k_{a,b} = (a_0 b_3 a_3^{−1}) b_0^{−1}. In this trivial example the attacker can easily calculate the cryptographic key k_{a,b}, because for given matrices g and u from the sets S^(g), S_a^(g), to determine the unknown variable a_0 it is enough to solve the set of equations

  a_0 g_{0,1} = u_{0,1} a_1,
  a_1 g_{1,1} = u_{1,1} a_2,
  a_1 g_{1,2} = u_{1,2} a_2,
  a_2 g_{2,1} = u_{2,1} a_3.

Because we know the formula a_3 = g_{0,1} g_{1,1} g_{2,1}, the set of equations can be trivially solved. In the proposed algorithm we hide the last variable in the set S_a in order to make it difficult to find the element a_0 and the key k_{a,b}. To do this, we introduce a graph G_{2N}^{cipher} whose nodes are elements of the set

  S_x^(g) = {u_{0,1} = x_0 g_{0,1} x_1^{−1}, u_{i,1} = x_i g_{i,1} x_{i+1}^{−1}, u_{i,2} = x_i g_{i,2} x_{i+1}^{−1}, u_{N,1} = x_N g_{N,1} x_{N+1}^{−1}}_{i=1}^{N−1},

where x_i, g_{i,1}, g_{i,2} ∈ G_p^(r). An example of such a graph, G_6^{cipher}, built of six nodes, is shown in Fig. 1.

Fig. 1. The graph G_6^{cipher} for generation of the cryptographic key. (The root node x_0 g_{0,1} x_1^{−1} branches into the pairs {x_1 g_{1,1} x_2^{−1}, x_1 g_{1,2} x_2^{−1}} and {x_2 g_{2,1} x_3^{−1}, x_2 g_{2,2} x_3^{−1}}, which merge into the leaf node x_3 g_{3,1} x_4^{−1}.)

A path

  p_a(u) = (u_{0,1}, . . . , (u_{i,1}||u_{i,2}), . . . , u_{N,1})

from the root node u_{0,1} to the leaf node u_{N,1} defines uniquely the product of matrices

  x_0 = u_{0,1} . . . (u_{i,1}||u_{i,2}) . . . u_{N,1},  i ∈ [1, N − 1],        (5)

from which follows the form of the element x_{N+1}:

  x_{N+1} = g_{0,1} . . . (g_{i,1}||g_{i,2}) . . . g_{N,1},  i ∈ [1, N − 1].    (6)

The expression (g_{i,1}||g_{i,2}) means that in the product (6) either the matrix g_{i,1} or g_{i,2} appears, depending on the selected path in G_{2N}^{cipher}. The number of possible values for the matrix x_{N+1} is equal to the number of all possible paths from the root node u_{0,1} to the leaf node u_{N,1}. For the graph G_{2N}^{cipher} built of 2N nodes there are 2^{N−1} such paths. For a sufficiently large number of nodes in G_{2N}^{cipher} one can regard the matrix x_{N+1} as an unknown parameter. Fig. 1 shows the graph G_6^{cipher} built of six nodes, N = 3, in which there are four paths between the root and the leaf node. To calculate the key k_{a,b}, calculated for two paths p_a(u) and p_b(w) in the graph G_{2N}^{cipher}, the attacker should solve the set of matrix equations

  x_i g_{i−1,1}^{−1} g_{i−1,2} = u_{i−1,1}^{−1} u_{i−1,2} x_i,
  x_i g_{i,1} g_{i,2}^{−1} = u_{i,1} u_{i,2}^{−1} x_i,        i ∈ [2, N],       (7)

where g_{i,1}, g_{i,2}, u_{i,1}, u_{i,2} are known parameters. For a given solution x_i = a_i of (7), using the formulas for u_{i,1}, u_{i,2} from S_a^(g) and x = a_i^{−1} x_i, we obtain the following set of matrix equations

  x g_{i−1,1}^{−1} g_{i−1,2} = g_{i−1,1}^{−1} g_{i−1,2} x,
  x g_{i,1} g_{i,2}^{−1} = g_{i,1} g_{i,2}^{−1} x,            i ∈ [2, N].       (8)

If the solution of (8) is trivial then the solution of (7) is unique. For a non-trivial solution g_0 of the equations (8), any matrix g_0^k is also a solution of (8). Two solutions a_i and a_i' of (7) are related by the formula a_i' = a_i (g_0)^k, k ∈ [1, |g_0|], where g_0 is a solution of (8), i.e., it belongs to the centralizer C_{g_{i,1} g_{i,2}^{−1}} of the element g_{i,1} g_{i,2}^{−1}, and |g_0| is the order of the element g_0. If the centralizers C_{g_{i,1} g_{i,2}^{−1}}, i ∈ [2, N], are nontrivial then the solution of the set of equations (7) is not unique. The security of the proposed algorithm depends on the difficulty of finding the proper solution of the equations (7), i.e., it depends on the number of solutions of (7). Below we give a detailed description of the algorithm.

The key agreement algorithm in G_p^(r).
1. The users A and B agree the set of elements S^(g) = {g_{0,1}, g_{i,1}, g_{i,2}, g_{N,1}}_{i=1}^{N−1} from the group G_p^(r) with non-trivial centralizers C_{g_{i,1} g_{i,2}^{−1}}.
2. The user A selects the set S_a = {a_i}_{i=0}^{N+1} ⊂ G_p^(r) and the path p_a(u) in the graph G_{2N}^{cipher}, such that formula (5) for a_0 is satisfied. The user A calculates S_a^(g) = {u_{0,1}, u_{i,1}, u_{i,2}, u_{N,1}}_{i=1}^{N−1} and sends the set S_a^(g) to the user B. Similarly, the user B selects the set S_b = {b_i}_{i=0}^{N+1} ⊂ G_p^(r) and the path p_b(w) in the graph G_{2N}^{cipher}, such that formula (5) for b_0 is satisfied. The user B calculates S_b^(g) = {w_{0,1}, w_{i,1}, w_{i,2}, w_{N,1}}_{i=1}^{N−1} and sends the set S_b^(g) to A.
3. For the selected path p_a(u) the user A calculates k_{a,b} = ((b_0 a_{N+1} b_{N+1}^{−1}) a_0^{−1})^{−1}. For the selected path p_b(w) the user B calculates k_{a,b} = (a_0 b_{N+1} a_{N+1}^{−1}) b_0^{−1}. The common key is k_{a,b}.

We apply the proposed algorithm to the group G_17^(r), which is isomorphic to the symmetric group of the set (Z/(17))^*. The graph G_8^{cipher}, N = 4, is built of eight nodes. There are 2^3 paths in G_8^{cipher}. The agreed set S^(g) and the selected secret sets S_a and S_b are

  S^(g) = {E_3, E_5, E_7, E_9, E_11, E_13, E_15, E_3},
  S_a = {R_14, R_12, R_11, R_10, R_7, E_3 E_5 E_11 E_15 E_3},
  S_b = {R_3, R_5, R_7, R_10, R_11, E_3 E_7 E_9 E_13 E_3},

where the matrices R_j ∈ G_17^(r) are determined by the primitive roots j ∈ S_17^(r) and E_i ∈ G_17^(e). The paths selected by the users A and B are

  p_a(u) = (u_{0,1}, u_{1,1}, u_{2,2}, u_{3,2}, u_{4,1}),
  p_b(w) = (w_{0,1}, w_{1,2}, w_{2,1}, w_{3,1}, w_{4,1}),

where u_{i,j} = a_i g_{i,j} a_{i+1}^{−1} and w_{i,j} = b_i g_{i,j} b_{i+1}^{−1}, i ∈ [0, 4], j = 1, 2. The common key calculated by the communicating parties is given by the formula

  k_{a,b} = R_14 (E_3 E_7 E_9 E_13 E_3)(E_3 E_5 E_11 E_15 E_3)^{−1} R_3^{−1},

and can be written in matrix form

  k_{a,b} v_0 = (3, 6, 9, 12, 15, 2, 5, 8, 11, 14, 1, 4, 7, 10, 13, 16),

where v_0 = (1, . . . , 16).
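The telescoping cancellation that lets both parties arrive at the same key can be replayed with small random permutations standing in for the matrices of G_p^(r) (a self-contained sketch of the N = 2 toy example from the beginning of this section; the helpers and the degree-8 permutations are our illustrative choices):

```python
import random

random.seed(7)
n = 8                                    # degree of the toy permutation group

def rand_perm():
    t = list(range(n))
    random.shuffle(t)
    return tuple(t)

def mul(p, q):
    # Group multiplication in matrix-product order: (pq)(i) = p[q[i]].
    return tuple(p[q[i]] for i in range(n))

def inv(p):
    t = [0] * n
    for i, v in enumerate(p):
        t[v] = i
    return tuple(t)

# Agreed elements g and the parties' secrets (N = 2 example from the text).
g01, g11, g12, g21 = (rand_perm() for _ in range(4))
a0, a1, a2 = (rand_perm() for _ in range(3))
b0, b1, b2 = (rand_perm() for _ in range(3))
a3 = mul(mul(g01, g11), g21)             # a_3 = g_{0,1} g_{1,1} g_{2,1}
b3 = mul(mul(g01, g12), g21)             # b_3 = g_{0,1} g_{1,2} g_{2,1}

# Published elements along the two selected paths; the inner a_i, b_i cancel.
u01 = mul(mul(a0, g01), inv(a1))
u12 = mul(mul(a1, g12), inv(a2))
u21 = mul(mul(a2, g21), inv(a3))
w01 = mul(mul(b0, g01), inv(b1))
w11 = mul(mul(b1, g11), inv(b2))
w21 = mul(mul(b2, g21), inv(b3))

W = mul(mul(w01, w11), w21)              # telescopes to b0 a3 b3^{-1}
U = mul(mul(u01, u12), u21)              # telescopes to a0 b3 a3^{-1}
k_A = inv(mul(W, inv(a0)))               # A inverts (b0 a3 b3^{-1}) a0^{-1}
k_B = mul(U, inv(b0))                    # B computes (a0 b3 a3^{-1}) b0^{-1}
```

Both expressions equal a_0 b_3 a_3^{−1} b_0^{−1}, which is exactly the key formula with N = 2.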

6 Conclusions

We proposed three cryptographic key exchange protocols of the Diffie-Hellman type based on the exponential and logarithmic functions over the multiplicative group of integers modulo a prime number p. The security of the proposed protocols is based on the computational complexity of solving a set of congruence equations containing the discrete logarithm. For the multiplicative group of integer numbers modulo p we constructed the non-commutative group G_p^(r) of its automorphisms. On the defined group we constructed a non-commutative key exchange protocol similar to the Anshel-Anshel-Goldfeld key exchange scheme. The security of the proposed protocols is based on the difficulty of finding a path in the defined cipher graph G_{2N}^{cipher} built of 2N nodes and of solving a set of certain matrix equations in G_p^(r).


References

1. Diffie, W., Hellman, M.E.: New directions in cryptography. IEEE Trans. Inform. Theor. IT-22(6), 644–654 (1976)
2. Rescorla, E.: Diffie-Hellman Key Agreement Method. RFC 2631, http://www.rfc-editor.org (1999)
3. Anshel, I., Anshel, M., Goldfeld, D.: An algebraic method for public-key cryptography. Math. Res. Lett. 6, 1–5 (1999)
4. Anshel, I., Anshel, M., Goldfeld, D.: Non-abelian key agreement protocols. Discrete Appl. Math. 130, 3–12 (2003)
5. Dehn, M.: Über unendliche diskontinuierliche Gruppen. Math. Annalen 71, 116–144 (1911)
6. Myasnikov, A., Shpilrain, V., Ushakov, A.: Non-commutative Cryptography and Complexity of Group-theoretic Problems. Mathematical Surveys and Monographs, vol. 177, AMS (2011)
7. Wagner, N.R., Magyarik, M.R.: A public-key cryptosystem based on the word problem. In: Blakley, G.R., Chaum, D. (eds.) CRYPTO 1984. LNCS, vol. 196, pp. 19–36. Springer, Heidelberg (1985). https://doi.org/10.1007/3-540-39568-7_3
8. Ko, K.H., Lee, S.J., Cheon, J.H., Han, J.W., Kang, J., Park, C.: New public-key cryptosystem using braid groups. In: Bellare, M. (ed.) CRYPTO 2000. LNCS, vol. 1880, pp. 166–183. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-44598-6_10
9. Anshel, I., Anshel, M., Fisher, B., Goldfeld, D.: New key agreement protocols in braid group cryptography. In: Naccache, D. (ed.) CT-RSA 2001. LNCS, vol. 2020, pp. 13–27. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-45353-9_2
10. Partala, J.: Algebraic generalization of Diffie-Hellman key exchange. J. Math. Cryptol. 12, 1–21 (2018)
11. Chefranov, A.G., Mahmoud, A.Y.: Commutative matrix-based Diffie-Hellman-like key-exchange protocol. In: Gelenbe, E., Lent, R. (eds.) Proceedings of the 28th International Symposium on Computer and Information Sciences, pp. 317–324. Springer (2013). https://doi.org/10.1007/978-3-319-01604-7_31
12. Cheon, J.H., Jun, B.: A polynomial time algorithm for the braid Diffie-Hellman conjugacy problem. In: Boneh, D. (ed.) CRYPTO 2003. LNCS, vol. 2729, pp. 212–225. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45146-4_13
13. Eftekhari, M.: A Diffie-Hellman key exchange protocol using matrices over noncommutative rings. Groups Complex. Cryptol. 4, 167–176 (2012)
14. Cash, D., Kiltz, E., Shoup, V.: The twin Diffie-Hellman problem and applications. In: Smart, N. (ed.) EUROCRYPT 2008. LNCS, vol. 4965, pp. 127–145. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78967-3_8
15. Blake, I.F., Garefalakis, T.: On the complexity of the discrete logarithm and Diffie-Hellman problems. J. Complexity 20, 148–170 (2004)
16. Bresson, E., Chevassut, O., Pointcheval, D.: The group Diffie-Hellman problems. In: Nyberg, K., Heys, H. (eds.) SAC 2002. LNCS, vol. 2595, pp. 325–338. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36492-7_21
17. Steiner, M., Tsudik, G., Waidner, M.: Diffie-Hellman key distribution extended to group communication. In: Proceedings of ACM CCS '96, pp. 31–37. ACM Press (1996)
18. Dougherty, D.J., Guttman, J.D.: Symbolic protocol analysis for Diffie-Hellman. arXiv:1202.2168 (2012)


19. Niven, I.M., Zuckerman, H.S., Montgomery, H.L.: An Introduction to the Theory of Numbers. John Wiley & Sons (1991)
20. Bach, E.: Comments on search procedures for primitive roots. Math. Comp. 66(220), 1719–1727 (1997)
21. Burgess, D.A.: Character sums and primitive roots in finite fields. Proc. London Math. Soc. 3(17), 11–25 (1967)
22. Shparlinski, I.: On finding primitive roots in finite fields. Theor. Comput. Sci. 157, 273–275 (1996)
23. Shoup, V.: Searching for primitive roots in finite fields. Math. Comput. 58(197), 369–380 (1992)
24. Bach, E., Shallit, J.: Algorithmic Number Theory, Volume I: Efficient Algorithms. MIT Press (1996)
25. Lidl, R., Niederreiter, H., Cohn, P.M.: Finite Fields. Cambridge University Press (1997)
26. Rose, H.E.: A Course on Finite Groups. Springer-Verlag (2009). https://doi.org/10.1007/978-1-84882-889-6

Music Industry Trend Forecasting Based on MusicBrainz Metadata

Marek Kopel and Damian Kreisich

Faculty of Information and Communication Technology, Wroclaw University of Science and Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wroclaw, Poland
[email protected]

Abstract. In this paper a forecast analysis for the music industry is performed. The trends for the years 2020–2024 are calculated based on forecasting for time-series metadata from the online MusicBrainz dataset. The analysis looks at music releases throughout the years from different perspectives, e.g. the release format, the type of music release or the release length (playback time). In the end, all the results are discussed before final conclusions are drawn.

Keywords: Forecast · Trend · MusicBrainz · Music industry

1 Introduction

For a long time, the music industry has been a point of interest for different people around the globe: artists, musicians, music lovers or even business analysts and investors. Basically anybody who once got into music never left the idea of following their favourite musicians and their music. The dataset used in this research is the metadata for music releases, artists, recordings and other entities connected to information about the music industry. This paper's theme is data analysis within the music industry, focusing on trends within the data. The important parts are: preparing a proper analysis, visualization and statistical significance checking, but mainly trending and forecasting using automated methods. The analysis topics are connected directly to the music industry, especially focusing on music releases, which bring lots of information on what was released and how it was released - what is featured within the releases and many other parameters that make the analysis worthy and interesting. The data source for the analysis is the open-sourced database called MusicBrainz (https://musicbrainz.org/), which is available for download, but also for live querying and editing. MusicBrainz along with the Discogs (https://www.discogs.com/) database are probably the best ones to consider when it comes to music metadata. Additional aspects that should be considered are the data quality, the importance of the analysis findings and different topics that can be analysed in a similar way, which may lead to follow-up analyses or cross-checking results among other data sources on the same subject. The specific goal here is to gather knowledge on trends and changes in the music industry, based on MusicBrainz metadata analysis, and to prepare a solution to perform similar analyses on demand.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 582–594, 2022. https://doi.org/10.1007/978-3-031-21967-2_47

2 Related Works

MusicBrainz is a community-maintained database on music releases. It is often used as a music research data set. Previous work by the authors using the data set was [4], an analysis of artists' mutual influence. MusicBrainz is usually augmented with complementary data sets. The author of [11] uses MusicBrainz with The Echo Nest to analyze the gender gap and gender differences in the music industry. Data sets like AudioBrainz [3] or MuSE [1], when combined with MusicBrainz, allow for music sentiment analysis. And combining MusicBrainz data with lyrics data sets allowed the authors of [2] to experiment with music genre classification. On the other hand, trend forecasting is a field of its own. Trends can be analyzed for any kind of time-related data. Just to name a few examples: in [9] the authors try to foretell the price of a cryptocurrency, in [6] forecasting of fashion trends is researched, the authors of [12] propose a method for forecasting the oil price, and in [7] a forecast analysis for seasonal climate conditions under climate change is presented. But to the authors' best knowledge no meaningful trend forecasting has been researched on a MusicBrainz data basis.

3 Forecasting Methods

Automated forecasting uses one of the known forecasting methods (Auto.arima, Ets, Prophet, Snaive, Tbats) in an automated modeler. The whole concept of automated forecasting is to provide a model that accepts hyperparameters and is then able to fit the data. Such an automated modelling method should also provide ways to determine the accuracy of the model, usually based on n-fold cross-validation, since there is no other way to relate to the real data. Cross-validation makes it possible to use historical data and provide relatively good results in terms of model accuracy. Every method works well with data that contains seasonality, mostly focused on daily, weekly or eventually monthly data. But here, for the analysis, there is a need to adapt the method for yearly data. Prophet is the model that works out of the box for yearly data, so for this study Prophet is chosen as the forecasting method, as presented in [10]. The components of the Prophet model are linear regression with changepoints, seasonality and holidays/events. Every parameter of the Prophet model can be set by hand using the analyst-in-the-loop approach: visualize, adapt the model and repeat. Regarding the parameters, Prophet has capabilities to automate the process when the hyperparameters are set properly. The parameters that will be sought after during the tuning process are: changepoint prior scale, which directly affects the number of changepoints set in the model, and seasonality prior scale and seasonality prior mode, which are both responsible for the seasonality component itself. If no seasonality is present in the data, the minimum value of seasonality prior scale should work best for the model, almost disabling the seasonality component.
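Prophet's trend component is essentially a piecewise-linear function whose slope changes at fitted changepoints; the idea can be sketched in a few lines (our simplification of the model described in [10], not Prophet's actual code):

```python
# Piecewise-linear trend with changepoints, as used in Prophet's trend term.
def trend(t, k, m, changepoints, deltas):
    """Base slope k and offset m; the slope gains delta_j after each
    changepoint s_j, with the offset adjusted so the line stays continuous."""
    slope, offset = k, m
    for s, d in zip(changepoints, deltas):
        if t >= s:
            slope += d
            offset -= s * d      # gamma_j = -s_j * delta_j keeps g continuous
    return slope * t + offset
```

The changepoint prior scale mentioned above controls how large the delta_j adjustments are allowed to become during fitting.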

584

M. Kopel and D. Kreisich

Using Prophet requires specific column names for the data, which is ds and y, where ds stands for the time and y is the value of a variable at the time. Data with the labels specified properly can be directly fed into the Prophet model, which results in setting up all the parameters in the model, taking into account all the parameters set by analyst. In the end, model can produce forecast with uncertainty, in correlation to dataset, with built-in Python visualisation tools. It makes it the best solution to use for music metadata based forecasting, as it provides all the desired tools, ways to visualise the forecast along with counting the errors and accuracy for model, based on cross-validation.

4

The Experiment

Each analysis in the experiment was carried out according to the following steps: 1. Motivation to cover chosen subject. Every subject that has been chosen for analysis process should make sense, business-wise or knowledge-wise. The goal of this step is to answer the question how the information derived from analysis can change the approach or decision making connected to the topic. This step can often be specified in different analysis methodologies shared by researchers as “Defining questions” or “Specifying questions for analysis”, but the motivation fits better in terms of a particular subject, whereas the popular specification for analysis entry point is better for the whole thesis defining subject. The handbook [8] with handy approach to data analysis generalizes the motivation process as a purpose. 2. Basic statistical and data quality analysis. This step describes how the analysis process is continued, which dimensions are used for later measurements. The important thing is to use sample for detailed description of the data, that is needed for next steps. After the data is gathered and transformed, it is checked whether the data is good enough to perform a proper analysis and forecasting. While considering different subjects, some of them may contain too many null values, so that analysis is based only on a small fraction of the whole database. If that fraction is not big enough, the basic statistical analysis show no statistical significance. The data is tested for correlation, average comparison and trend significance. The analysis on historical time-series data should be on par with events that happened during that time. 3. Forecasting hyper parameter selection and model fitting. This step focuses on what can be derived from the data, that covers the years 2020 - 2024. The whole forecasting process consists of model fitting and tuning, what parameters gave best results in terms of forecasting, why and what are the conclusions based on model forecasts. 
Each forecast is accompanied with error analysis, the model diagnostics, which provides additional information on how much can the forecast be correct and precise. Precision is one of the most important points. Forecasts are made using Prophet model, since it got some of the best results in papers on automated forecasting.


4. Result analysis based on probable scenarios. This step focuses on different information correlated with the test results, to match the results with real examples and approaches. This should lead to answering all the questions stated in the motivation for the particular analysed subject.

4.1 Dataset: MusicBrainz

According to the MusicBrainz wiki, the general description of what MusicBrainz stands for is: "MusicBrainz is a community-maintained open source encyclopedia of music information". MusicBrainz has a wide community of operators and contributors that are interested in music, open source technology and music metadata. The database can be split into a few smaller parts of the schema, and the entities are separated into primary and secondary ones. The primary entities are: Area, Artist, Event, Label, Place, Recording, Release, Release group, Series, URL, Work. Primary entities contain all the data that is frequently queried for gathering music metadata in any kind of music industry analysis. Secondary entities are made for the convenience of data usage; queries on those can help with gathering more specific information on music. The secondary entities are: Artist Credit, Medium, Track.

4.2 Data Processing

The experiments do not require very intricate data processing. Most of the needs are fulfilled during the data acquisition process, thanks to the MusicBrainz API capabilities, which provide the number of elements for each query after performing proper filtering. The only thing that has to be aligned for the experiments is structuring the data in a way that Prophet understands: two columns named ds, for the point in time, and y, for the value representing that point in time. In this case it is, e.g., a year along with the number of releases. Python code doing so can be found in the attachments. Preparing the data in that way makes it possible to perform similar experiments and analyses using the aforementioned scripts as well.
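One way such per-year counts can be requested is sketched below (the ws/2 release-search endpoint and its Lucene date field are our reading of the public MusicBrainz API documentation, not code from the paper; verify the field names before relying on this):

```python
# Sketch: building a MusicBrainz search URL whose JSON response reports the
# total number of releases matching a one-year date range.
from urllib.parse import urlencode

def year_count_url(year: int) -> str:
    params = {
        "query": f"date:[{year}-01-01 TO {year}-12-31]",
        "limit": 1,              # we only need the reported total count
        "fmt": "json",
    }
    return "https://musicbrainz.org/ws/2/release?" + urlencode(params)
```

Fetching each URL and reading the count field of the response would yield the {year: count} mapping fed to the ds/y restructuring step.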

5 Results

This section contains the results of the performed data analysis. The subsections can be considered as separate subjects for further analysis.

5.1 Releases as Single and Album

Popular opinions of independent musicians and labels often point to the heavy need of promotion for releases. Overly long formats of music releases may not be approachable anymore, since people tend to have less and less time for music, in favour of other media. The important thing is to check whether the music industry has properly shifted to the latest trends and is able to compete for people's attention. The analysis is based on data collected through the years 1980–2019, with a specific release type. The analysis separates data about long-playing albums and singles only. Quality-wise, MusicBrainz data can have flaws and inconsistencies, which means that the numbers of albums and singles may not sum up to the total number of releases within MusicBrainz, since some of them, e.g., do not contain a release type, but the covered portion of the whole database is statistically high enough. The population size of the releases in MusicBrainz is about 2,900,000, while the sample size in the analysis is 1,900,000. The huge sample size means that the sample data can be considered representative for the data analysis of releases, with significant accuracy bigger than 99%.

When looking just at the dots representing the historical release data in Fig. 1, it shows that the music industry is constantly rising toward a higher number of releases every year, but in the end, after 2012, there is a small crash which affects the number of released albums; it did not happen to singles at all. Single releases are constantly rising, with a small dip in the 1996–2004 period, where the number of released, independent singles was almost constant. The assumptions on the rising importance of single releases in the picture of releasing a whole album, which most likely will contain the preceding singles, seem to be confirmed. The music industry heads toward releasing fewer albums, which may be connected to different factors, e.g.: Albums are released worldwide, with no specific releases for different countries, especially for smaller artists. They are not likely to affect the global number of album releases, since releasing music is now more approachable than ever. Fewer albums are being released in exchange for releasing singles, due to the current overload of information on the internet and the short attention span of the audience, as noticed in [5]. The constant rise of single releases can be derived from the last point on why album numbers are going down.

Albums often consist of 8–12 tracks, depending on many different factors. The topic of album length is another subject of data analysis; for this part let us assume 10 tracks as the average number of tracks on an album release. According to [5], people's attention span has lowered significantly throughout the years, which means that it fits the trend visible on the graph: singles in recent years are getting much more attention from the release perspective, simply because they are more approachable and can make artists more visible to the public. If the attention span of a regular music listener is around 1 day for each release, it does not matter if it is an album or a single; an artist gets more attention as long as they put out more releases, instead of one big one. Simply, an album can be decomposed into a few singles and "the rest", which will be the unique content of the whole long-play release. So instead of one album release, people will be able to receive, e.g., 5 different premieres and the album that sums up all the singles and adds something extra to it. Smaller releases perfectly fit the new mentality of internet users in the information era, which lets an artist be on the front page of social media communities for longer periods, and that is one of the most important factors to date for building a fan base or recognition.

Music Industry Trend Forecasting Based on MusicBrainz Metadata

587

Another important reason why singles surged after 2005 is the distribution model: this is when internet music sales started in the U.S., and from the very beginning of digital distribution, single songs were sought after instead of full albums. Singles had a much larger impact on the financial side of the music industry at that point, which no longer applies to the trend of the last 5–10 years (on-demand streaming), but the attention-span argument and social changes are definitely important for those years. Sales data backing up these arguments can be found in the RIAA U.S. Recorded Music Revenues. The forecasting starts with finding proper diagnostic parameters to evaluate the model's accuracy, which later leads to valuable forecasts. Given the data in yearly format and forecasts covering the years 2020–2024, cross-validation is needed to show yearly error increments for at least three consecutive years. Since the whole data set spans 40 years, the initial learning window can be set to 10 years, making it possible to perform many cross-validation folds over the data set. Cross-validation yields the following parameter values: Changepoint Prior Scale = 10.0, Seasonality Prior Scale = 0.01, Seasonality Mode = additive. The forecast obtained with the model using these parameters is shown in Fig. 1.
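The expanding-window cross-validation described above (40 yearly points, an initial 10-year learning window, a 3-year horizon) can be sketched in plain Python. The naive last-value model, the synthetic series, and the MAE metric are illustrative assumptions; the paper actually tunes a Prophet model over the listed hyperparameters.

```python
# Sketch of expanding-window cross-validation for yearly release counts.
# The synthetic data, the naive "last value" model and the MAE metric are
# illustrative assumptions; the paper uses Facebook Prophet instead.

def expanding_window_cv(series, initial=10, horizon=3):
    """Train on the first `initial` points, then repeatedly forecast
    `horizon` steps ahead, sliding the cutoff forward by one year."""
    errors = []
    for cutoff in range(initial, len(series) - horizon + 1):
        train, test = series[:cutoff], series[cutoff:cutoff + horizon]
        forecast = [train[-1]] * horizon          # naive stand-in model
        mae = sum(abs(f - t) for f, t in zip(forecast, test)) / horizon
        errors.append(mae)
    return sum(errors) / len(errors)              # mean MAE over all folds

# 40 yearly observations, as in the paper's 1980-2019 window (synthetic)
releases = [1000 + 50 * i for i in range(40)]
score = expanding_window_cv(releases, initial=10, horizon=3)
```

In a real run, the loop body would fit one Prophet model per hyperparameter combination and keep the combination with the lowest cross-validated error.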

Fig. 1. Single and album releases history and forecasts. The dots represent the actual data of album releases (forming the upper line) and single releases (forming the lower line). The blue areas are the forecasting models' fit for each release trend individually. (Color figure online)

588

M. Kopel and D. Kreisich

The dots on the figure stand for real data points, and the plot line is the model's approximation of the data. At the end of the plot the forecast begins, with a widening cone showing how the forecasting error is handled by the model: the bigger the cone, the higher the error value. The forecast for album releases leads to a simple conclusion: the number of albums released in the next few years will most likely keep falling, but the error bounds show that it is also possible for album releases to stay constant or even rise again within the next two or three years. The scenario here is not clear-cut, and it would be wrong to conclude that releasing albums is not worthwhile; an album will always be a great opportunity to reach the market with new music. Because of that, and because of factors making the music industry more approachable for anybody, such as easy access to recording technology without a very high budget, the number of released albums may be expected to go up again after some downtime. For singles the situation is straightforward: the only outcome the forecast allows is a rise in the number of released singles, which also fits the assumptions made during the basic analysis. The uncertainty here concerns only the magnitude of the rise. Singles and albums may eventually reach the same number of releases per year if album releases keep falling; there is a relatively high probability that this happens by 2024 and a small probability of reaching that state one year earlier. This can be observed at the very end of the plot lines in Fig. 1, where the possible-value areas of singles and albums overlap.

5.2 Medium Used for Releases

Every medium used for releasing music has had its time to gain popularity, driven by constant technological progress in medium devices and data storage. Vinyl was surely the most important format up to 1985, since only cassettes were available at the same time, and that format was not a serious competitor to vinyl. Then, from 1985, CDs gained popularity very fast, until digital media and the internet became a new official way of releasing. General trends are clearly visible here, but the motivation is to examine the niche and focus on older release media: how big was the shift toward the newest technologies, and how music is and will be released in the next few years. Good questions are whether cassettes get any new releases nowadays, or whether they are a forgotten, unneeded medium that musicians should avoid, or perhaps a niche that musicians can use to differentiate themselves on the hard music market. Another question concerns vinyl directly: why are records still being released, and are people still interested in this format? Answering these makes it possible to identify niche formats that still make sense and formats that should no longer be used. This part of the analysis is focused on percentage-based data, which can be treated as the market share of a particular medium and gives a lot of information on


how the world has been changing in this respect. Figure 2 clearly shows how the music industry shifted between media formats. A few statements to consider:
• Vinyl has been used for a very long time and still has its niche; making vinyl available to people often considered music lovers or audiophiles still generates additional sales. Traditionally minded listeners will keep using vinyl, and it may be worth following that path when music is targeted at a classical audience, for example.
• An artist with influence over a broad spectrum of listeners should also consider releasing on traditional media, not only digitally: more often than not, music is bought for collection purposes, and even people who listen digitally still buy CDs or vinyl records to complete their collections. These are the reasons why CDs and vinyl are not heading straight toward 0% of market share.
• Digital releases are the most approachable, cheapest and most convenient for listeners, and the data supports this statement clearly.
• Cassettes are a tiny niche in today's world and music industry; more and more people are losing access to players, and the world is moving toward fully digital access to media over the internet. Cassettes had their time in the early '80s, especially on markets such as Poland, which were technologically behind, e.g., the USA or Great Britain.

Fig. 2. Release format market share throughout the years.

CDs reached a reasonable share of the market, but never quite got to the point where vinyl once was. What changed is that more media formats were available and the market did not abandon the old format entirely


in favour of the new one. The same holds for digital releases. Recent years clearly show that the new king among release formats is digital, while all the other formats are flat at this point, except for the CD, which is steadily declining. Most probably CDs will eventually reach the same point as vinyl: releases will still appear, on a limited level, mainly for collectors, while the mainstream of music releases goes through the digital format. This is visible in Fig. 2 for the years 2004–2019, where the CD and digital curves form a big X. The parameters are chosen in the same way as in Sect. 5.1; the final sets of parameters derived for forecasting are listed in Table 1.

Table 1. Hyperparameter values for different release formats

Medium    Changepoint prior scale   Seasonality prior scale   Seasonality mode
Cassette  10.0                      0.1                       Multiplicative
Vinyl     0.5                       0.01                      Additive
CD        10.0                      0.01                      Multiplicative
Digital   10.0                      0.01                      Additive

Forecast values may exceed 100% or fall below 0%; the meaningful range is between 0 and 100%, but the model cannot be constrained to those limits, since the Prophet model treats market share as ordinary data. The digital release forecast is straightforward: the trend is almost linear, and digital releases look set to climb ever higher in market share. By 2024 we can expect from 94% up to even 99% of releases to be in digital format. The rise of digital releases is necessarily tied to significant decreases in the other media formats, since the shares sum to roughly 100%. The other format forecasts are visualized in the same way as the digital one. Most of the market share not held by the digital format belongs to the CD format, which has been in decline recently and will most likely fall further, toward levels similar to vinyl releases. It is quite striking how large this "digital hegemony" is and that the other release formats will most likely not be considered for future music releases. Only music lovers and musicologists are likely to use and buy formats other than digital. Even now, although CDs hold more than 10% of the market, people who buy them do so to collect rather than to listen. All the niche formats, namely CDs, vinyls and cassettes, are becoming collectible items instead of serving a practical purpose. The forecasts for these formats are collected in Fig. 3.
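Since the model itself cannot be bounded, out-of-range values can be clipped after forecasting. A minimal sketch of this post-processing idea, with synthetic forecast values rather than the paper's actual output:

```python
# Post-hoc clipping of market-share forecasts to the valid [0, 100] % range.
# Prophet treats the share as unbounded data, so forecasts can spill over;
# the sample values below are synthetic.

def clip_share(values, lo=0.0, hi=100.0):
    return [min(max(v, lo), hi) for v in values]

forecast = [97.5, 101.2, 103.8, -0.4]   # raw model output, in percent
bounded = clip_share(forecast)          # [97.5, 100.0, 100.0, 0.0]
```

An alternative would be fitting the model on logit-transformed shares so the output is bounded by construction, but simple clipping is the least invasive fix.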


Fig. 3. Niche release format forecasts. The niche formats are CD (highest releases in the '90s and '00s), cassettes (dropping drastically throughout the '80s) and vinyl (staying statically low).

All the niche formats fall into the same market-share area, and all of them can be considered physical releases. Physical releases are unlikely to gain popular new distribution ideas or release formats, and their combined market share will not rise above 10%. The models do not guarantee this, but based on the knowledge gained during the basic analysis, this is the most likely scenario for the near future. As an answer to the question of how musicians or labels should release music, the easy choice is digital. Digital music releases have enormous advantages over older media, mostly in cost, durability and accessibility. It is much easier to open a music file on a PC or start listening on a media streaming platform such as Spotify than to look for a CD or vinyl record, put it into the proper device and only then play it. At the same time, digital music is not vulnerable to physical damage: there is no situation where scratches on the back of a CD prevent the device from playing music. Even these factors pale next to the cost that artists and labels pay to reach the "physical" market. The costs of pressing physical copies and the associated promotion, which weigh heavily in music release budgets, simply disappear, and that is the main reason to avoid releasing physical records, especially without a reasonable fan base asking for such a release of the artist's music. Another popular way to deliver a physical product is a distribution deal, but the smaller a musician's influence on the market, the less likely distribution deals become. All in all, the study clearly shows that the value in physical


releases exists, but only in niche scenarios and for bigger projects and artists, while the majority of the industry is moving toward shorter releases (based on Sect. 5.1) that fit much better into the digital release format. Regardless of anyone's private opinion on the topic, the data clearly shows that digital is the future, and musicians cannot deny that trend if they want to become successful.

5.3 Album Total Time Length and Tracks

Fig. 4. Average number of tracks forecast.

This subject covers the time length of the album format, which according to the research in Sect. 5.1 may have shifted toward shorter forms even for the full albums that are still being released. This may be connected to the attention-span changes in social media and society in general. The analysis answers the question of how long the ideal full album release has been at different times, and whether it has changed over the years. Seemingly this can be connected to the medium an album was released on, but changes over time may also have a globally important impact on the topic. If the analysed factors were constant throughout the years, it would mean that the music industry set proper standards for album length long ago, so as not to overwhelm anybody while providing proper content to the audience. The analysis is focused on the average number of tracks per release and the average release length. The first chart (Fig. 4) shows the number of tracks on album releases throughout the years. The information found about album length and track counts is interesting, and the results are not obvious. A somewhat lower number of tracks per album is expected, leading to shorter forms, but at some point this trend definitely


should stop. Over the horizon up to 2024, the average seems to be moving very slowly toward fewer tracks and may become more or less constant by 2025, but that will have to be checked over the next few years. The forecast of the number of tracks for 2019–2024 is shown in Fig. 4. The forecast follows the trend toward shorter and shorter releases, but for albums the magnitude of the change is small: the expected number of tracks is between 10 and 11 on average, not far from the peak of 12 tracks. The forecast for album length is a derivative of the number-of-tracks forecast and shows a similar trend, which is not surprising, since a high correlation between the two has been verified. There are no big changes in magnitude here either: album length stays around 40–50 min the whole time, but statistically there is a significant decreasing trend that cannot be ignored. Based on the information gathered here, musicians should probably stick with the traditional approach to album releases, aiming for 40+ minutes and not much more. This makes perfect sense when matched with the attention-span research: the overall amount of information is rising, even though the average release length keeps shrinking.
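The correlation between average track count and average album length mentioned above can be checked with a plain Pearson coefficient. The sketch below uses synthetic yearly averages as stand-ins for the MusicBrainz aggregates:

```python
# Pearson correlation between yearly average track counts and average
# album lengths (minutes). Both series are synthetic stand-ins for the
# MusicBrainz aggregates used in the paper.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

tracks = [12.0, 11.8, 11.5, 11.1, 10.6, 10.3]    # avg tracks per album
length = [49.0, 48.1, 47.0, 45.5, 43.8, 42.9]    # avg length, minutes
r = pearson(tracks, length)                      # close to +1 here
```

A value of r near +1, as these two jointly declining series produce, is what justifies deriving the length forecast from the track-count forecast.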

6 Conclusions

The main goal of the paper is to test forecasting analysis methods on music metadata from MusicBrainz. The time-series data mostly come from music release metadata. The main aspect of the analysis is the set of changes in the music industry throughout the years and their trends as estimated for the near future. The outcomes should be interesting for musicians and anybody wanting to release music in the next few years. The timeline covered by the analysis is 1980–2019, with forecasts reaching up to 2024. The trends are connected to changes in society and technology; this is what drives the music industry to new grounds every few years. In terms of media it was, e.g., the CD in the mid-'80s, or the digital era in music that formed in 2004 in the shape of online music stores and streaming services. Today no musician or label can afford to avoid the internet as the main medium to release music, socialize with fans, and target and market their products. The tools used worked even better than expected, especially regarding the forecasting accuracy obtained with the Prophet model. Its approach to analysis covers all the topics and specifically addresses the need for clarity in data analysis. A downside of the chosen approach was the inconvenience of cross-validating model accuracy: a forecast may fail to fit reality given the historical information. Thankfully, Prophet's analyst-in-the-loop approach to modelling makes it possible to adapt the model when the achieved results are not satisfying. Another downside was collecting the data through the MusicBrainz API: collecting the more intricate data, like album lengths and numbers of tracks, took more than 40 h of processing. This problem could be resolved with direct database queries, the way it has been done in [4].


Future analysis could be even more detailed if the data were reprocessed following a business intelligence methodology, with a data warehouse built on top of the MusicBrainz database. This could enable faster data acquisition and an easier visualisation process.

References

1. Akiki, C., Burghardt, M.: MuSe: the musical sentiment dataset. J. Open Humanities Data 7(6) (2021)
2. Bodo, Z., Szilagyi, E.: Connecting the Last.fm dataset to LyricWiki and MusicBrainz. Lyrics-based experiments in genre classification. Acta Univ. Sapientiae 10(2), 158–182 (2018)
3. Bogdanov, D., Porter, A., Schreiber, H., Urbano, J., Oramas, S.: The AcousticBrainz genre dataset: multi-source, multi-level, multi-label, and large-scale. In: Proceedings of the 20th Conference of the International Society for Music Information Retrieval (ISMIR 2019), Delft, The Netherlands. ISMIR (2019)
4. Kopel, M.: Analyzing music metadata on artist influence. In: Nguyen, N.T., Trawiński, B., Kosala, R. (eds.) ACIIDS 2015. LNCS (LNAI), vol. 9011, pp. 56–65. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-15702-3_6
5. Lorenz-Spreen, P., Mønsted, B., Hövel, P., Lehmann, S.: Accelerating dynamics of collective attention. Nat. Commun. 10, 1759 (2019)
6. Ma, Y., Ding, Y., Yang, X., Liao, L., Wong, W.K., Chua, T.S.: Knowledge enhanced neural fashion trend forecasting. In: Proceedings of the 2020 International Conference on Multimedia Retrieval, pp. 82–90 (2020)
7. Shao, Y., Wang, Q.J., Schepen, A., Ryu, D.: Going with the trend: forecasting seasonal climate conditions under climate change. Monthly Weather Rev. 149(8), 2513–2522 (2021)
8. Start, S.: Introduction to data analysis handbook. Migrant & Seasonal Head Start Technical Assistance Center, Academy for Educational Development. J. Acad. 2(3), 6–8 (2006)
9. Sun, X., Liu, M., Sima, Z.: A novel cryptocurrency price trend forecasting model based on LightGBM. Finance Res. Lett. 32, 101084 (2020)
10. Taylor, S., Letham, B.: Forecasting at scale. PeerJ Preprints (2017)
11. Wang, Y., Horvát, E.Á.: Gender differences in the global music industry: evidence from MusicBrainz and the Echo Nest. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 13, pp. 517–526 (2019)
12. Zhao, L.T., Wang, Y., Guo, S.Q., Zeng, G.R.: A novel method based on numerical fitting for oil price trend forecasting. Appl. Energy 220, 154–163 (2018)

AntiPhiMBS-TRN: A New Anti-phishing Model to Mitigate Phishing Attacks in Mobile Banking System at Transaction Level

Tej Narayan Thakur and Noriaki Yoshiura(B)

Department of Information and Computer Sciences, Saitama University, Saitama 338-8570, Japan [email protected]

Abstract. With the continuous, rapid improvement and growth of mobile banking payment technologies, fraudulent mobile banking transactions are multiplying sharply using bleeding-edge technologies, and significant economic losses occur every year around the world. Phishers seek new vulnerabilities with every advance in fraud prevention, and phishing has become an ever more pressing security challenge for banks and financial institutions. However, researchers have focused mainly on the prevention of fraudulent transactions in online banking systems. This paper proposes a new anti-phishing model for mobile banking systems at the transaction level (AntiPhiMBS-TRN) that mitigates fraudulent transactions in the mobile banking payment system. The model applies a unique id for transactions and an application id for the bank application, known to the bank, the bank application, the users, and the mobile banking system. In addition, AntiPhiMBS-TRN utilizes the international mobile equipment identity (IMEI) number of the registered mobile device to prevent fraudulent transactions. Phishers cannot execute fraudulent transactions without knowing the unique id for the transaction, the application id, and the IMEI number of the mobile device. This paper employs a process meta language (PROMELA) to specify system descriptions and security properties and builds a verification model of AntiPhiMBS-TRN. Finally, AntiPhiMBS-TRN is successfully verified using the simple PROMELA interpreter (SPIN). The SPIN verification results prove that the proposed AntiPhiMBS-TRN is error-free, and banks can implement the verified model to mitigate fraudulent transactions in mobile banking systems globally. Keywords: Mobile banking system · Fraudulent transaction · Anti-phishing model · Verification

1 Introduction

The “Mobile Banking System” in this paper refers to the concept of a mobile banking system in general. With the advancement of mobile technologies, most modern commerce payments depend on the mobile banking system that is always open 24/7, 365 days

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 595–607, 2022. https://doi.org/10.1007/978-3-031-21967-2_48

596

T. N. Thakur and N. Yoshiura

a year for financial transactions. However, the unfortunate truth is that fraudulent transactions are also around-the-clock operations. With the continuous, rapid growth of mobile banking payment technologies, fraudulent mobile banking transactions are multiplying sharply using bleeding-edge technologies, and significant economic losses occur every year around the world. The 2019 Iovation financial services fraud and consumer trust report [24] shows that 61% of financial transactions originate from mobile, and 50% of suspected fraudulent transactions seen by Iovation come from mobile devices. Fraudsters seek new vulnerabilities with every advance in fraud prevention and have become an ever more pressing security challenge for banks and financial institutions. Fraud the Facts 2021 [25] revealed that mobile banking fraud losses increased by 41% in 2020 in the UK. Mobile banking users unknowingly download phishing apps, install them on their mobile devices, and unintentionally input login credentials (username and password) into the phishing app. They follow links in phishing emails/SMS, are redirected to a phishing login interface, and input their login credentials there. Thus, phishers steal login credentials from mobile banking users through phishing apps or phishing login interfaces and employ the stolen credentials to log in to the mobile banking system. As the login credentials are valid, phishers succeed in logging in. Phishers then request a transaction, and the MBS sends a one-time password (OTP) for the security of the transaction. However, phishers reply to the OTP and execute fraudulent transactions, as shown in the threat model in Fig. 1.

Fig. 1. Threat model for phishing in mobile banking system at transaction level

Generally, mobile banking users provide their username and password only once on their mobile device, and the MBS does not ask for the password each time it is used. The MBS uses an OTP for security when a transaction is requested. Phishers who obtain stolen or lost mobile devices do not have to enter the username and password to use the MBS; they request fraudulent transactions using the stolen or lost devices. Currently, security in the MBS at the transaction level relies on login credentials and the OTP mechanism (two-factor authentication). Phishers do not have to input login credentials (username and password) on the mobile device. The MBS sends the OTP for security and asks for a reply, but phishers have physical access to the device, so they reply to the OTP easily and execute fraudulent transactions. Hence, the two-factor authentication mechanism (login credentials and OTP) alone cannot stop phishers from executing fraudulent transactions, and multi-factor authentication is needed to stop fraudulent transactions in the mobile banking system.

AntiPhiMBS-TRN

597

Researchers have worked to prevent these frauds and enhance security measures against fraudulent transactions. Researchers developed a layered approach for near-field communication (NFC) enabled mobile payment systems [1] and used machine learning algorithms for mobile fraud detection systems [2, 3]. However, machine learning can solve simple fraud cases, while more complex frauds require human intervention. Fraudsters also use emerging technologies to mimic the transaction behavior of genuine customers, and they keep changing their methods, so it is difficult to detect fraud using machine learning. The authors of [4–8] presented fraud detection systems for electronic banking, in which [4] proposed biometric security and systems, and [5–8] employed machine learning for fraud detection in electronic banking transactions. The authors of [9–20] used machine learning, deep learning, and neural networks for fraud detection in online and banking transactions. The authors of [21–23] developed fraud detection systems for financial databases. The above research adopted machine learning models to mitigate fraudulent transactions in online transactions. However, these approaches are inefficient and insufficient to account for fraudulent transactions in the mobile banking system. To overcome this gap, this paper presents a new anti-phishing model for mobile banking systems at the transaction level (AntiPhiMBS-TRN), whose objective is to mitigate fraudulent transactions executed using stolen login credentials and stolen or lost mobile devices. Banks and financial institutions can implement AntiPhiMBS-TRN to mitigate fraudulent transactions in the mobile banking industry. The paper is structured as follows: Sect. 2 describes related works, Sect. 3 presents the novel anti-phishing model to mitigate fraudulent transactions in the mobile banking system, Sect. 4 describes the results and discussion, and Sect. 5 presents conclusions and future work.

2 Related Works

Researchers have worked on the prevention of fraudulent transactions in digital banking. Vishwakarma, Tripathy, and Vemuru [1] proposed a layered approach for near field communication (NFC) enabled mobile payment systems to prevent fraudulent transactions. Delecourt and Guo [2] took the potential reactions of fraudsters into consideration to build a robust mobile fraud detection system using adversarial examples. Zhou, Chai, and Qiu [3] introduced several traditional machine learning algorithms for fraud detection in the mobile payment system. Eneji, Angib, Ibe, and Ekwegh [4] focused on the integration of biometric security to mitigate and combat electronic banking frauds. Ali, Hussin, and Abed [5] reviewed various attack detection systems and identified transaction monitoring as the most effective model for electronic banking (e-banking). Pracidelli and Lopes [6] proposed artifacts capable of minimizing electronic payment fraud problems using unsupervised and supervised algorithms. Guo, Wang, Dai, Cheng, and Wang [7] proposed a novel fraud risk monitoring system for e-banking transactions. Seo and Choi [8] used machine learning techniques for predicting abusers in electronic transactions. Minastireanu and Mesnita [9] reviewed the existing research in fraud detection and found that the best results in terms of accuracy and coverage were achieved by supervised learning techniques. Zhou, Zhang, Wang, and Wang [10] used a siamese neural network structure to solve the problem of sample imbalance in online transactions. Khattri and Singh [11] proposed a new distance authentication mechanism for committing a


valid and secure online transaction using a credit card or debit card. Kanika and Singla [12] reviewed the use of deep learning techniques for online transaction fraud detection. Hartl and Schmuntzsch [13] focused on user-end fraud detection and protection against social engineering in online banking. Kataria and Nafis [14] compared the hidden Markov model, deep learning, and neural networks for detecting fraud in online banking transactions. Masoud and Mehdi [15] used the k-nearest-neighbor technique with association rules to improve algorithms for detecting outliers in credit card transactions for electronic banking. Eshghi and Kargari [16] proposed a multi-criteria decision method, an intuitionistic fuzzy set, and evidential reasoning to handle the effects of uncertainty on transactions. Kargari and Eshghi [17] proposed a semi-supervised combined model based on clustering algorithms and association rule mining for detecting frauds and suspicious behaviors in banking transactions. Sarma, Alam, Saha, Alam, Alam, and Hossain [18] proposed a system to detect bank fraud using a community detection algorithm that identifies patterns that can lead to fraud occurrences. Gyamfi and Abdulai [19] used supervised learning with support vector machines with Spark (SVMS) to build models representing normal and abnormal customer behavior for detecting fraud in new transactions. Shaji and Panchal [20] used an adaptive neuro-fuzzy inference system for detecting fraudulent transactions. Susto, Terzi, Masiero, Pampuri, and Schirru [21] proposed a machine learning based decision support system for fraud detection in online/mobile banking systems. Sapozhnikova, Nikonov, Vulfin, Gayanova, Mironov, and Kurennov [22] employed three classifiers and developed an algorithm for analyzing information about the user environment to monitor transactions. Mubalaike and Adali [23] emphasized deep learning (DL) models to detect fraudulent transactions with high accuracy.
The above-mentioned works do not mitigate fraudulent transactions executed with stolen login credentials and phishing apps in the mobile banking system. This paper proposes a new anti-phishing model to mitigate fraudulent transactions in the mobile banking system in the world of mobile payment transactions. Banks and financial institutions can implement this model to mitigate phishing attacks in the mobile banking system.

3 Proposed Anti-phishing Model AntiPhiMBS-TRN

This paper proposes an anti-phishing model for mobile banking systems at the transaction level (AntiPhiMBS-TRN). AntiPhiMBS-TRN aims to mitigate phishing attacks in the mobile banking system for three categories of users. The first category comprises users who mistakenly download phishing apps and provide login credentials in them. The second category comprises users who mistakenly visit a phishing login interface and provide login credentials there. Phishers steal login credentials from banking users through phishing apps or phishing login interfaces and exploit them to request fraudulent transactions. The third category comprises users whose mobile devices are stolen by phishers or unknowingly lost. When phishers get such stolen or lost mobile devices, they do not have to enter the login credentials to use the mobile banking system; they request transactions directly from the stolen or lost devices.


Fig. 2. Fraudulent transaction protection in the mobile banking system using AntiPhiMBS-TRN

In all of the above cases, the mobile banking system sends an OTP to the phishers for the security of the transactions. However, the phishers reply to the OTP using the mobile devices, as they have physical access to them, and execute fraudulent transactions. Our proposed model AntiPhiMBS-TRN protects mobile banking users from phishers' fraudulent transactions as shown in Fig. 2. We propose the architecture of the anti-phishing model AntiPhiMBS-TRN to describe its detailed working mechanism.

3.1 Architecture of Anti-phishing Model AntiPhiMBS-TRN

The architecture of the anti-phishing model AntiPhiMBS-TRN consists of the model for defending against phishing attacks in the mobile banking system at the transaction level. The participating agents in our proposed model are the mobile user, bank, bank application, mobile banking system, phishing application, and phisher. We specify the following agents and initial conditions for the working of AntiPhiMBS-TRN.

• A mobile user (U) opens an account in the Bank (B) and provides the international mobile equipment identity number (imeiNo) of the mobile device to the Mobile Banking System (MBS).
• The bank provides an application id (appId), user id (uId), login password (lgnPwd), transaction password (trnPwd), and a unique id for transactions (unqIdTrn) to the user for transactions in the MBS. The bank generates unqIdTrn once, when a new bank account is opened; a new unqIdTrn is not needed for each transaction.
• The bank shares the imeiNo, appId, uId, lgnPwd, trnPwd, unqIdTrn, and mobNo of each user with the MBS.
• Each banking application is identified by an appId and knows the relationship among uId, appId, imeiNo, and unqIdTrn for each user.
• The bank and the MBS know the relationship among uId, lgnPwd, trnPwd, appId, imeiNo, and unqIdTrn for each user.
• Users download the mobile banking app (BA), install and run it, and provide login credentials (username and password) to log on to the mobile banking system.
• Users always share the IMEI number of the mobile device with the MBS, and the MBS verifies the IMEI number of the mobile device during the transaction.
• Users do not reveal the information provided by the bank to others.
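The initial conditions above amount to a per-user record that the bank registers with the MBS. A minimal Python sketch (the field names follow the paper's notation; the record layout and sample values are our assumptions, not a prescribed format):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserRecord:
    """Per-user data the bank shares with the MBS (Sect. 3.1)."""
    uId: str       # user id
    lgnPwd: str    # login password
    trnPwd: str    # transaction password
    appId: str     # id of the genuine bank app
    imeiNo: str    # IMEI of the registered mobile device
    unqIdTrn: str  # unique transaction id, generated once per account
    mobNo: str     # mobile number (OTP delivery)

# The MBS keeps one record per user; unqIdTrn is NOT regenerated per transaction.
mbs_db: dict[str, UserRecord] = {}
rec = UserRecord("u01", "lgn-pw", "trn-pw", "bank-app-1",
                 "356938035643809", "unq-7x", "+84-90-000-0000")
mbs_db[rec.uId] = rec
```

The frozen dataclass reflects that the record is fixed at account opening and only consulted afterwards.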

600

T. N. Thakur and N. Yoshiura

Model for Preventing Fraudulent Transactions in the Mobile Banking System

We consider that banks train mobile users in the detailed procedure for executing transactions using the mobile banking system. This paper presents the scenario of transactions in the mobile banking system using the mobile banking app and the scenario of fraudulent transactions.

Scenario of Transaction in the Mobile Banking System Using the Mobile Banking App

All the participating agents (user, bank, bank app, and mobile banking system) of the model must follow the steps below to mitigate phishing attacks at the transaction level.

• Step 1. A mobile user (U) opens a bank account in the Bank (B) and provides an imeiNo to the bank as a security parameter.
• Step 2. The bank sends the uId, lgnPwd, trnPwd, and a unqIdTrn to the user for transactions in the MBS.
• Step 3. The bank sends the appId, uId, lgnPwd, trnPwd, unqIdTrn, and mobile number (mobNo) of each user to the MBS.
• Step 4. The mobile user downloads the app, logs in, is authenticated by the MBS, and requests a transaction.
• Step 5. The bank app (BA) asks the user for the unique id for the transaction.
• Step 6. The user provides the unqIdTrn to the BA.
• Step 7. The BA sends the uId, lgnPwd, and unqIdTrn to the MBS and requests the transaction.
• Step 8. The MBS asks the BA for its appId.
• Step 9. The BA provides the appId to the MBS.
• Step 10. The MBS asks the BA for the transaction password if the application id matches the one stored in the MBS database for that user id.
• Step 11. The BA asks the user for the transaction password.
• Step 12. The user provides the trnPwd to the BA.
• Step 13. The BA sends the trnPwd to the MBS.
• Step 14. The MBS verifies the transaction password and sends an OTP to the user.
• Step 15. The user replies with the OTP to the MBS. The MBS verifies the OTP and the IMEI number of the mobile device.

The scenario of transactions in the mobile banking system is shown in Fig. 3.
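Steps 7–15 can be condensed into the sequence of checks the MBS performs before executing a transaction. The sketch below is our reading of the protocol, not code from the paper; the record layout is assumed and the message passing between BA and MBS is simplified into a single function:

```python
# Registered data the MBS holds for one user (from Steps 1-3).
MBS_DB = {
    "u01": {"lgnPwd": "lgn-pw", "trnPwd": "trn-pw", "appId": "bank-app-1",
            "imeiNo": "356938035643809", "unqIdTrn": "unq-7x"},
}

def authorize(uId, lgnPwd, unqIdTrn, appId, trnPwd, otp_reply, imeiNo, sent_otp):
    """Order of checks mirrors Steps 7-15 of AntiPhiMBS-TRN."""
    rec = MBS_DB.get(uId)
    if rec is None or lgnPwd != rec["lgnPwd"] or unqIdTrn != rec["unqIdTrn"]:
        return "reject: credentials/unqIdTrn"   # Step 7
    if appId != rec["appId"]:
        return "reject: phishing app"           # Steps 8-10
    if trnPwd != rec["trnPwd"]:
        return "reject: trnPwd"                 # Steps 11-14
    if otp_reply != sent_otp or imeiNo != rec["imeiNo"]:
        return "reject: OTP/IMEI"               # Step 15
    return "execute"
```

Note how a phishing app fails at the appId check even with stolen credentials, and a stolen device without the unqIdTrn fails before the OTP is ever reached.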
Mobile banking users request transactions, and the bank app asks the user to input the unique id for the transaction, which the user then enters in the bank app. The bank app already knows the login credentials (user id and password), as the user entered them to authenticate with the mobile banking system before requesting the transaction. The bank app sends the login credentials and the unique id for the transaction to the MBS. The MBS wants to verify the identity of the bank app and asks it to provide its application id. The bank app supplies the app id to the MBS, which verifies it against its database. If the application id is not valid, the MBS knows that the app is a phishing app. If the application id is valid, the MBS asks the bank app to provide a transaction password. The bank app in turn asks the user, who provides the transaction password to the bank app, and the bank app forwards it to the MBS. The MBS verifies the two-factor authentication (login credentials and unique id for transactions). If both are correct, the MBS sends an OTP to the user as third-factor authentication, and the user replies with the OTP. The MBS always detects the IMEI number of the mobile device originating the transaction. Finally, the MBS verifies the OTP and the IMEI number of the mobile device; if both are valid, the MBS executes the transaction.

Fig. 3. Scenario of transaction by using bank app

Generally, the execution of a transaction depends on two-factor authentication (login credentials and OTP) only. In AntiPhiMBS-TRN, the execution of a transaction depends on multi-factor authentication (login credentials, unique id for transaction, OTP, and IMEI number of the mobile device).

Scenario of Fraudulent Transaction by Phisher

Fraudulent transactions can be executed in the mobile banking system using the following methods:

• Fraudulent transaction using stolen login credentials and phishing apps
• Fraudulent transaction using stolen/lost mobile devices.

Fraudulent Transaction Using Stolen Login Credentials and Phishing Apps

When phishers request transactions using stolen login credentials, the MBS detects the new mobile device with the help of the IMEI number registered for that user. The bank app asks the phisher for the valid IMEI number of the registered mobile device, as shown in Fig. 4. The phisher cannot provide the IMEI number of the old mobile device, and the new mobile device is not allowed to perform the fraudulent transaction. Generally, an MBS does not notice a change of mobile device for financial transactions; AntiPhiMBS-TRN uses the IMEI number of the mobile device to secure against such changes. If the phishers nevertheless reply with the valid IMEI number and request the transaction, the bank app asks for the unique id for the transaction, which the phishers cannot provide.

Fig. 4. Scenario of fraudulent transaction using stolen login credentials

If phishers use a phishing app (PA) to request a fraudulent transaction, the mobile banking system asks the phishing app for its application id in order to identify phishing apps. The phishing app provides a fake app id to the MBS. However, the MBS detects the phishing app as a fake app because the id differs from the registered app id. Thus, the phishers cannot execute fraudulent transactions using phishing apps.

Fraudulent Transaction Using Stolen/Lost Mobile Devices

In practice, mobile banking users input their login credentials (username and password) in the mobile banking system only once, and the credentials are saved on the mobile device. Users do not have to input first-factor authentication (login credentials) each time they request a transaction; the MBS uses an OTP as second-factor authentication for the security of the transactions. Phishers obtain stolen/lost mobile devices and request fraudulent transactions using the mobile banking system running on those devices, as shown in Fig. 5. If AntiPhiMBS-TRN is not used, the MBS asks the phishers only for the OTP, which they can easily supply using the stolen/lost mobile device, and the fraudulent transaction succeeds. In AntiPhiMBS-TRN, however, the bank app asks for the unique id for the transaction as part of multi-factor authentication. The phishers cannot provide the unique id for the transaction to the bank app, and fraudulent transactions are not executed from those stolen/lost mobile devices.

Fig. 5. Scenario of fraudulent transaction using stolen/lost mobile devices

The advantage of AntiPhiMBS-TRN is that phishers cannot execute fraudulent transactions using stolen login credentials or stolen/lost mobile devices.

3.2 Verification of Proposed Anti-phishing Model AntiPhiMBS-TRN

We develop a verification model of AntiPhiMBS-TRN by specifying system properties and safety properties using PROMELA. The verification model consists of processes, message channels, and data types. The processes (mobileUser, bank, mobileBankingSystem, bankApp, phisher, and phishingApp) in AntiPhiMBS-TRN communicate with each other through defined message channels. We specify the following temporal property using linear temporal logic (LTL) in the verification model of AntiPhiMBS-TRN:

[] (((usrId==bankUsrId) && (lgnPwd==bankLgnPwd) && (usrTrnPwd==bankTrnPwd) && (usrUnqIdTrn==bankUnqIdTrn) && (usrOTP==bankOTP)) -> (transactionSuccess==true))

The LTL property states that the transaction of the banking user in the mobile banking system will succeed if (i) the user id provided by the user equals the one the MBS received from the bank, (ii) the login password provided by the user equals the one the MBS received from the bank, (iii) the transaction password provided by the user equals the one the MBS received from the bank, (iv) the unique id for the transaction provided by the user equals the one the MBS received from the bank, and (v) the OTP provided by the user equals the one the MBS sent to the user.
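For intuition, the implication inside the LTL property can be evaluated per state in plain Python; the `[]` ("always") operator then requires it to hold in every state of a trace. This is only a reading aid for the formula — the actual exhaustive check over all reachable states is performed by SPIN on the PROMELA model:

```python
def property_holds(s: dict) -> bool:
    """(all five values match) -> transactionSuccess; an implication
    fails only when the antecedent is true and the consequent false."""
    match = (s["usrId"] == s["bankUsrId"]
             and s["lgnPwd"] == s["bankLgnPwd"]
             and s["usrTrnPwd"] == s["bankTrnPwd"]
             and s["usrUnqIdTrn"] == s["bankUnqIdTrn"]
             and s["usrOTP"] == s["bankOTP"])
    return (not match) or s["transactionSuccess"]

def always(trace) -> bool:
    """The [] operator: the property must hold in every state of the trace."""
    return all(property_holds(s) for s in trace)
```

A state where some credential mismatches satisfies the property vacuously, which matches the intent: the formula only constrains executions in which all five checks pass.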

4 Results and Discussion

This paper verifies the safety properties and the LTL property of the proposed model AntiPhiMBS-TRN. We performed the experiments using SPIN version 6.4.9 running on a computer with the following specifications: Intel Core i5-6500 CPU @ 3.20 GHz, 16 GB RAM, and Windows 10 64-bit.

We ran SPIN to verify the safety properties of AntiPhiMBS-TRN for up to 50 users. SPIN checked the state space for deadlocks during the verification of the safety properties. The SPIN verification results for the safety properties are in Table 1, which shows the elapsed time, total memory usage, transitions, states stored, depth reached, and verification status for various numbers of users. The results show a continuous rise in verification time, transitions, and depth as the number of users increases. Moreover, SPIN did not detect any deadlock or errors during the execution of the AntiPhiMBS-TRN model.

After that, we executed SPIN in the same computing environment to verify the LTL property for up to 50 users. The SPIN verification results for the LTL property are in Table 2. SPIN checked the state space for never-claim and assertion violations in the run of the LTL property. The results show a continuous rise in verification time and depth with the increase in banking users. SPIN verified the LTL property successfully and, again, did not detect any deadlock or errors during the execution of the AntiPhiMBS-TRN model.

Table 1. Verification results for safety properties

Users | Time (seconds) | Memory (Mbytes) | Transitions | States stored | Depth | Verification status
1     | 4.62           | 39.026          | 8291118     | 572263        | 3737  | Verified
2     | 6.93           | 39.026          | 8922211     | 580992        | 3777  | Verified
5     | 10.3           | 39.026          | 9317685     | 585438        | 6411  | Verified
10    | 15.8           | 39.026          | 9624342     | 585278        | 10054 | Verified
20    | 28.8           | 39.026          | 10183740    | 590008        | 19795 | Verified
30    | 41.2           | 39.026          | 10400937    | 586723        | 24735 | Verified
40    | 53.6           | 39.026          | 10529304    | 587235        | 37421 | Verified
50    | 66.6           | 39.026          | 10552524    | 587991        | 45013 | Verified

Table 2. Verification results for LTL property

Users | Time (seconds) | Memory (Mbytes) | Transitions | States stored | Depth | Verification status
1     | 4.12           | 39.026          | 7134189     | 572388        | 684   | Verified
2     | 6.14           | 39.026          | 7856907     | 579632        | 967   | Verified
5     | 8.23           | 39.026          | 7913480     | 585144        | 1607  | Verified
10    | 12.9           | 39.026          | 7647999     | 585217        | 2564  | Verified
20    | 22.9           | 39.026          | 8152402     | 581929        | 4063  | Verified
30    | 33.3           | 39.026          | 8443322     | 585684        | 5408  | Verified
40    | 43.4           | 39.026          | 8582050     | 586992        | 6504  | Verified
50    | 54.8           | 39.026          | 8859879     | 586604        | 8056  | Verified

SPIN did not generate any counterexample during these experiments, and we concluded that there is no error in the design of the AntiPhiMBS-TRN model.
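The near-linear growth of verification time with the number of users can be quantified with a quick least-squares fit over the safety-property figures (times transcribed from Table 1):

```python
# (users, elapsed seconds) pairs from Table 1 (safety properties).
users = [1, 2, 5, 10, 20, 30, 40, 50]
times = [4.62, 6.93, 10.3, 15.8, 28.8, 41.2, 53.6, 66.6]

# Ordinary least-squares slope and intercept, no external libraries needed.
n = len(users)
mean_x = sum(users) / n
mean_y = sum(times) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(users, times))
         / sum((x - mean_x) ** 2 for x in users))
intercept = mean_y - slope * mean_x
# slope comes out at roughly 1.25 s per additional user
```

The good linear fit supports the observation that verification cost grows steadily, rather than explosively, with the number of modeled users.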

5 Conclusion and Future Work

In this digital era, fraudulent transactions are escalating sharply with the rise of mobile banking transactions in banks and financial institutions. Phishers utilize phishing apps or phishing login interfaces to collect login credentials and employ them to perform fraudulent transactions in the mobile banking industry. Moreover, phishers exploit mobile banking users' stolen/lost mobile devices to execute fraudulent transactions. Even though fraudulent transactions are rising steadily, no anti-phishing model against fraudulent transactions has been developed so far for the mobile banking system. Therefore, this paper developed a new anti-phishing model for the mobile banking system at the transaction level (AntiPhiMBS-TRN) to mitigate fraudulent transactions globally.


Phishers exploit stolen login credentials for fraudulent transactions using a new mobile device. However, AntiPhiMBS-TRN detects the new mobile device and queries the IMEI number of the old mobile device before executing the transaction. The phishers can deliver neither the IMEI number nor the unique id for the transaction, and therefore cannot succeed in executing fraudulent transactions in the mobile banking system. Likewise, phishing apps cannot provide a valid application id to the mobile banking system, so phishers fail to execute fraudulent transactions using phishing apps. Phishers who obtain stolen/lost mobile devices and request transactions through the mobile banking system installed on those devices also fail: AntiPhiMBS-TRN employs a unique id for transactions, and phishers cannot provide it. Hence, phishers fail to execute fraudulent transactions using stolen/lost mobile devices. We observed from the SPIN results for the PROMELA model of AntiPhiMBS-TRN that the model does not contain any deadlocks or errors. Moreover, SPIN verified the safety properties and the LTL property of the PROMELA model. Hence, banks and financial institutions can implement this verified AntiPhiMBS-TRN model to mitigate ongoing fraudulent transactions and increase mobile banking adoption on the way to a cashless society in this era of digital banking. In future research, we will propose a new secure model to detect changes of location and to mitigate other probable attacks, such as man-in-the-middle (MITM), SQL injection, man-in-the-browser (MITB), and replay attacks, in the mobile banking system.


Blockchain-Based Decentralized Digital Content Management and Sharing System

Thong Bui(1,3), Tan Duy Le(2,3)(B), Tri-Hai Nguyen(4), Bogdan Trawinski(5), Huy Tien Nguyen(1,3), and Tung Le(1,3)(B)

(1) Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
{bhthong,ntienhuy,lttung}@fit.hcmus.edu.vn
(2) School of Computer Science and Engineering, International University, Ho Chi Minh City, Vietnam
[email protected]
(3) Vietnam National University, Ho Chi Minh City, Vietnam
(4) Department of Computer Science and Engineering, Seoul National University of Science and Technology, Seoul, South Korea
[email protected]
(5) Department of Applied Informatics, Wroclaw University of Science and Technology, Wroclaw, Poland
[email protected]

Abstract. With the explosion of the big data era, storing and sharing data has become a necessity. However, sharing and using misleading data can lead to serious consequences. It is therefore necessary to develop safe frameworks and protocols for storing and sharing data among parties on the Internet. Unfortunately, current systems struggle with the typical challenges of maintaining confidence, integrity, and privacy. This paper proposes a blockchain network within a decentralized storage system that supports the management and sharing of digital content while preserving its integrity and privacy. The system allows users to access authenticated data and provides protocols to protect the privacy of data owners. The detailed analysis and proof show that the proposed system is a promising solution for sharing and storing data in the big data era.

Keywords: Blockchain · Decentralized storage system · Digital content sharing model

1 Introduction

With the fast expansion of the Internet, massive amounts of data are being generated and exchanged across several frameworks. According to a recent analysis, we may face approximately five billion raw data records every day [10]. With so much data publicly available, the need for trustworthy and private information is increasingly indispensable. In most organizations, this information is considered valuable and restricted property. The growth of data, especially valuable data, requires an effective and safe system for sharing and storing it with others.

Despite the practical demand, this problem also faces many challenges. In particular, a sharing protocol must meet typical constraints such as correctness and privacy between parties. Privacy can be achieved by encrypting data and sharing it through traditional methods, such as email with encrypted attachments combined with a key-sharing protocol based on asymmetric encryption between Bob and Alice. However, this method relies on complete trust between the two partners. For example, the doctor/hospital should have complete trust in the medical record shared by the patient. Yet in many situations the shared data may not be completely correct: it is plausible that one party modifies the data to gain benefits or hide sensitive information. In many situations, an integrated blockchain solution is used to encourage the honesty of information providers by keeping their identities anonymous. However, this solution creates more disadvantages for the information receiver, because the information provider may create multiple fake documents and send them through multiple blockchain addresses to exploit even more benefits. That is why a well-trusted third party takes part in the data sharing process. Besides verifying the correctness of the data from the information provider, this third party may also ensure information-receiver behavior, such as payment, or keep the shared data private.

Along with the increasing need to share data, the demand for data storage and retrieval is also increasing. However, storing data in traditional data centers is expensive and lacks scalability [2]. A cloud storage service is a much cheaper and more easily scalable solution: it offers unlimited storage space, convenient sharing and access, and offsite backup [18].

(c) The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 608–620, 2022. https://doi.org/10.1007/978-3-031-21967-2_49
Based on their architecture, we can categorize storage services into centralized and decentralized storage systems. A centralized storage system has been shown to lack availability and data privacy, as follows [19]:

– Data availability: The servers of a centralized storage system are vulnerable to DoS/DDoS attacks, system errors, and even external events such as earthquakes or fires. Crashed servers may bring down the whole system.
– Data privacy: Despite many security methods to protect the privacy of uploaded data, user data stored in a centralized storage system can still be compromised or taken over by the system admin or attackers.

Decentralized storage systems that utilize blockchain technology are promising solutions to overcome these limitations. Data uploaded to decentralized storage systems is encrypted, split, and stored on different nodes; only the Data Owner has the authorization to share or access it. Since a decentralized storage system is a peer-to-peer network, no administration is required, so the risk of losing control of data to a higher authority is removed. Decentralized storage systems also possess the availability property: if some nodes become unavailable due to attacks, the remaining nodes can still keep the whole system operating.


As a consequence of the above challenges of storing and sharing data, in this paper we introduce a model for storing and sharing data that integrates a blockchain network and a decentralized storage system, preserving the information provider's anonymity as well as the correctness, privacy, and integrity of the provided information. Our contributions in this paper can be summarized as follows:

– We propose a data storing model integrated with a blockchain network and a decentralized storage system to form a secure storage system.
– We design a data sharing method to manage the sharing process that maintains the privacy and integrity of shared data.
– We analyze the proposed model and method to show their strong properties.

2 Related Work

2.1 Blockchain

With the extensive recent attraction to cryptocurrency from both industry and academia, blockchain technology has also attracted great interest. A blockchain can be considered a decentralized ledger in which all committed transactions are stored in a chain of blocks. Each block in the blockchain points to the previous block through a hash pointer containing the previous block's hash value. The blockchain keeps expanding as new transactions are committed, thereby adding new blocks to the chain. A blockchain network consists of multiple nodes, each miner node holding a copy of the ledger. Nodes in the blockchain network communicate directly through a peer-to-peer network, and the miner nodes' ledgers are synchronized through a consensus algorithm [8]. The two most well-known consensus algorithms are Proof-of-Work (PoW) [6] and Proof-of-Stake (PoS) [5].

– Proof-of-Work requires miners to perform a complicated computational task: calculating the hash value of the constantly changing block header. The calculated value must be equal to or smaller than a given target value.
– Proof-of-Stake assumes that a miner with more currency would be less likely to attack the network; hence, at each mining round, a miner possessing more value has a higher probability of proposing the new block.

Blockchain plays an important role in the financial sector through cryptocurrencies such as Bitcoin [6] and Ethereum [12]. Ethereum provides an abstraction layer enabling anyone in the system to create their own smart contract, a set of rules that is executed only if certain conditions are met. Other studies have shown that blockchain also plays an important role in many other fields, such as education, health, transportation, and the environment [1,11].
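The PoW puzzle just described can be sketched in a few lines: keep incrementing a nonce until the hash of (header, nonce) falls below the target. The fixed `difficulty_bits` parameter is illustrative; real networks adjust the target adaptively:

```python
import hashlib

def mine(block_header: bytes, difficulty_bits: int):
    """Search for a nonce such that SHA-256(header || nonce), read as an
    integer, is below a target with `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_header + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest
        nonce += 1  # on average about 2**difficulty_bits attempts are needed
```

Verification takes a single hash, which is what makes PoW usable as a consensus primitive: expensive to produce, cheap for every other node to check.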

2.2 Decentralized Storage System

Decentralized storage systems are one of the most notable applications of blockchain technology. They solve the remaining problems of centralized storage systems, such as data availability and data privacy [19]. Storj [15] supports end-to-end encryption and shards and distributes data to nodes around the world for storage. Sia [13] divides a document into pieces and encrypts them before delivering each piece to storage nodes through smart contracts. IPFS [3] is a peer-to-peer distributed storage system built on content-based addressing. Although IPFS data are distributed to nodes worldwide, these nodes do not follow any consensus protocol. Instead, IPFS utilizes a cryptographic hash function to create a unique identifier address for each stored object. The IPFS routing algorithm works on a two-column distributed hash table maintained by multiple peers.

2.3 Decentralized Data Sharing

Wang et al. [14] propose a model integrating IPFS and the Ethereum blockchain that allows the Data Owner to distribute secret keys to the corresponding data users, giving them access to specified data that has been encrypted under a specified access policy. Patel et al. [9], Xia et al. [17], and Zheng et al. [20] proposed blockchain-based data sharing systems for healthcare and medical data. Naz et al. [7] and Wu et al. [16] introduced data sharing solutions based on blockchain and IPFS. However, in these systems the data receiver cannot verify the correctness and integrity of a document before submitting a data access request or making an escrow payment to a smart contract. Huynh et al. [4] proposed a model consisting of a data-producing scheme, a data-storing scheme, and a data-sharing scheme that guarantees the privacy of the Data Owner and prevents fraud in data sharing.
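The content-based addressing that IPFS builds on (Sect. 2.2), and that the IPFS-based sharing systems above inherit, can be illustrated with a toy content-addressed store. This is illustrative only: a real deployment uses multihash CIDs and a distributed hash table across peers rather than a local dict:

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: an object's address is the hash of its
    bytes, so a retriever can verify integrity without trusting the node."""

    def __init__(self):
        self._blocks = {}  # address -> bytes; stands in for the routing DHT

    def put(self, content: bytes) -> str:
        addr = hashlib.sha256(content).hexdigest()
        self._blocks[addr] = content
        return addr

    def get(self, addr: str) -> bytes:
        content = self._blocks[addr]
        if hashlib.sha256(content).hexdigest() != addr:
            raise ValueError("block does not match its address (tampered)")
        return content
```

Because identical bytes always map to the same address, the scheme also deduplicates repeated uploads for free.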

3 Digital Content Sharing Model

In this section, we introduce the model that supports sharing digital content. Digital content can be understood as digitized documents, diplomas, certificates, medical records, or transcripts, verified by the issuing agency. Our model includes three main entities, each playing a distinct role that represents real-life participants in the digital content sharing process.

– Data Owner (DO): The person or organization that owns the original data that is to be accredited and shared. The original data can be regarded as the key material for generating digital content. For example, the original data may be an assignment of a student (DO), which is used to generate a digital certificate (transcript), or a blood sample taken from a patient (DO) and used to sequence DNA data that can be digitized.
– Data Provider (DP): The person or organization that has the authority and capability to generate reliable digital content by performing reviews, assessments, analyses, or experiments on the material provided and authorized by the DO. We consider that the identification and reputation of the DP are widely publicized (through multimedia channels or authenticated personal channels of the DP); thus, the DP can be considered a trusted party.
– Data User (DU): A person or organization that needs the digital content to fulfill its purposes, such as recruitment or experiments.

The involvement of these entities can be observed in multiple real-life data-sharing circumstances. For example, a candidate (DO) takes an IELTS test at an assessment center (DP), and an employer (DU) needs the IELTS result (a digital certificate generated by the DP) for qualification evaluation; the result must meet accuracy, integrity, and privacy requirements. Another realistic situation is a scientist (DU) who requires data from samples taken by hospitals (DP) from patients (DO) with a rare disease for research purposes; these sample data must also be highly correct and private.

We introduce two phases: (1) data sharing between the Data Provider and the Data Owner, and (2) data sharing between the Data Owner and the Data User (Fig. 1).

3.1 Digital Content Sharing Between Data Provider and Data Owner

– Step 1: DO requests DP to issue the digital content from their original data. Depending on its nature, the original data can be obtained from DO in a previous procedure (e.g., blood sample indices from a healthcare procedure) or during the digital content generating process (e.g., an online examination). The digital content results from DP's analysis, evaluation, and inspection of DO's original data. DP is considered a trusted party, so the generated digital content is reliable. DO's request should be presented in an available structure, such as an email template or a fillable form, in which DO can specify the amount of information they expect to share, especially sensitive information or DO's identity. In many situations, DO can choose to stay anonymous due to a dangerous or delicate situation; however, DO can also choose to share personal information for reasons such as recruitment or registration.
– Step 2: DP conducts the digital content generating process; the generated content will later be shared with DO (and DO will also share this digital content). DP must also adjust the digital content output to satisfy DO's specification. In many cases, if DO requests anonymous digital content, DP must remove all identity-related information that could expose DO and label the generated digital content. The labeling serves the purpose of tracing the original data in case DO agrees to share their identity later, while preserving DO's current incognito status.
– Step 3: DP encrypts the generated digital content and issues a certificate attached to the digital content. The operations are described below:

Blockchain-Based Decentralized Digital Content System

613

Fig. 1. Digital content sharing between Data Provider and Data Owner: (1) DO requests DP to issue the digital content from their original data; (2) DP generates and encrypts the digital content, then creates the corresponding certificate; (3) DP uploads the digital content and the certificate to a decentralized storage system; (4) DP creates a blockchain transaction to inform DO about the shared content; (5) DO downloads and verifies the shared digital content

(i) The digital content is encrypted using a symmetric encryption algorithm with a secret key k, where k is generated randomly. DP can build its own symmetric encryption algorithm or utilize available algorithms. DO must be informed about the encryption algorithm, including its source code. We denote this symmetric encryption algorithm as SyE:

C = SyE_k(Digital content)

(ii) DP then issues a certificate file attached to the digital content. This certificate, denoted as Cert, is created to verify the integrity of the digital content. The certificate must be created so that everyone in the blockchain system can access it to verify the integrity of the digital content while the privacy of the digital content is still maintained. Cert is created by applying the digital signature of DP on the hashed value of the encrypted digital content C, using DP's private key DPpriKey. The hash function used must also be communicated to DO:

Cert = Sign_DPpriKey(Hash(C))
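The two operations above (symmetric encryption of the content, then signing its hash) can be sketched in a few lines. The sketch below is an illustrative toy, not the paper's implementation: SyE is replaced by a SHA-256 keystream XOR, the signature by textbook RSA with tiny primes, and the content and key values are invented for the example.

```python
import hashlib

def sye(key: bytes, data: bytes) -> bytes:
    """Toy SyE: XOR with a SHA-256-derived keystream (self-inverse)."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))

# Textbook RSA with toy primes p=61, q=53 -- for illustration only.
N, E, D = 3233, 17, 2753          # modulus, public exponent, private exponent

def sign(digest: bytes) -> list:
    """Cert = Sign_DPpriKey(Hash(C)): sign the digest byte by byte."""
    return [pow(b, D, N) for b in digest]

def verify(digest: bytes, cert: list) -> bool:
    return len(cert) == len(digest) and all(
        pow(s, E, N) == b for s, b in zip(cert, digest))

digital_content = b"IELTS result: band 7.5"     # made-up example content
k = b"randomly-generated-secret-key-k"          # secret key k
C = sye(k, digital_content)                     # C = SyE_k(digital content)
cert = sign(hashlib.sha256(C).digest())         # the attached certificate

assert verify(hashlib.sha256(C).digest(), cert)  # integrity check succeeds
assert sye(k, C) == digital_content              # SyDE_k(C) recovers the content
```

Because the toy keystream cipher is its own inverse, SyDE coincides with SyE here; a real deployment would use AES for SyE and RSA or ECDSA with proper padding for the signature.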


The Cert file is created not only to verify the integrity of C, but also to prove that it was issued by the trusted party DP (as it is signed using DP's private key).
– Step 4: DP uploads C and Cert to a decentralized online storage system and receives the corresponding hyperlink.
– Step 5: DP performs a transaction on the blockchain system and notifies DO of the uploaded digital content and the blockchain transaction path. The notification can be sent through an available application such as SMS or email. In addition to the information about the sender and receiver, the transaction must also contain the following information:
(i) The encrypted data Ck of the secret key k, encrypted using DO's public key (DOpubKey). We denote the asymmetric encryption algorithm as AsyE: Ck = AsyE_DOpubKey(k)
(ii) The Cert file.
(iii) The corresponding hyperlink to the uploaded C and Cert on the decentralized online storage.
(iv) DP's public key (DPpubKey).
(v) Important information or certified proof that DP is a Data Provider with the authority and reputation to issue and share the digital content.
– Step 6: After receiving DP's notification, DO may download the digital content:
(i) DO accesses the transaction on the blockchain to get the hyperlink to the uploaded C and Cert on the decentralized online storage.
(ii) DO connects to the decentralized online storage and downloads the C and Cert files.
(iii) DO uses their private key to obtain the secret key k by decrypting Ck. We denote the asymmetric decryption algorithm as AsyDE: k = AsyDE_DOpriKey(Ck)
(iv) DO uses k to decrypt the cipher C to get the digital content. We denote the symmetric decryption algorithm as SyDE: Digital content = SyDE_k(C)
– Step 7: Besides downloading the digital content, DO can also verify its integrity. Not only DO, but any person or organization with access to the blockchain system may perform the verification without violating the privacy of the digital content.
(i) DO (or anyone else) accesses the transaction on the blockchain to get the hyperlink to the uploaded C and Cert on the decentralized online storage, the certificate file, which we denote here as CertBC, and DP's public key (DPpubKey).
(ii) DO (or anyone else) connects to the decentralized online storage and downloads the cipher C and the certificate file, which we denote here as Certstorage.


(iii) DO (or anyone else) verifies that CertBC and Certstorage match, to guarantee that the Cert file on the decentralized online storage has not been modified.
(iv) DO (or anyone else) verifies the integrity of C using the hash function, Certstorage, and DPpubKey:

Integrity = Verify(Hash(C), Certstorage, DPpubKey)

3.2 Digital Content Sharing Between Data Owner and Data User

After receiving the digital content from DP, DO can now actively share the digital content with any DU. The sharing protocol is as follows:
– Step 1: DU sends a request to access the digital content belonging to DO. The request must clarify DU's demand. If DO accepts the request, they begin the data sharing process. In some situations, such as recruitment applications, DO may share the digital content with DU themselves, without DU's request.
– Step 2: DO accesses the transaction on the blockchain to get Ck and the hyperlink to the uploaded C and Cert on the decentralized online storage. DO then uses their private key to obtain the secret key k by decrypting Ck (Fig. 2):

k = AsyDE_DOpriKey(Ck)


Fig. 2. Digital content sharing between Data Owner and Data User: (1) DU sends a request to access the digital content belonging to DO; (2) DO accesses the transaction on the blockchain to get the key k and the necessary data; (3) DO performs a transaction on the blockchain system and notifies DU about the shared digital content; (4) DU downloads and verifies the shared digital content


– Step 3: DO performs a transaction on the blockchain system and notifies DU that the digital content has been shared, along with the blockchain transaction path. The following information must be mentioned within the transaction:
(i) The encrypted data Ck of the secret key k, encrypted using DU's public key (DUpubKey): Ck = AsyE_DUpubKey(k)
(ii) The path to the transaction between DP and DO on the blockchain system. Re-sharing the transaction indirectly assures DU that DO did not manage to change the data shared by DP.
– Step 4: DU can now access the secret key k to decrypt C.
(i) DU accesses the transaction on the blockchain to get the hyperlink to the uploaded C and Cert on the decentralized online storage.
(ii) DU connects to the decentralized online storage and downloads the C and Cert files.
(iii) DU uses their private key to obtain the secret key k by decrypting Ck: k = AsyDE_DUpriKey(Ck)
(iv) DU uses k to decrypt the cipher C to get the digital content: Digital content = SyDE_k(C)
(v) In case DU received anonymous data from DO and expects to access the hidden content related to DO's identity, DU must request permission from DO to issue another sharing process.
– Step 5: DU can also verify the integrity of the shared digital content by performing Step 7 of the digital content sharing phase between Data Provider and Data Owner.
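The AsyE/AsyDE key re-wrapping that moves k from DO to DU can be sketched as follows. This is a hedged toy illustration using textbook RSA with tiny invented key pairs; a real system would use a proper scheme such as RSA-OAEP.

```python
# Two toy RSA key pairs: (n, e) is public, d is private.
DO_PUB, DO_PRIV = (3233, 17), 2753    # toy pair from p=61, q=53
DU_PUB, DU_PRIV = (3127, 5), 2413     # toy pair from p=53, q=59

def asye(pub, data: bytes) -> list:
    """Ck = AsyE_pubKey(k): encrypt each byte with the public key."""
    n, e = pub
    return [pow(b, e, n) for b in data]

def asyde(pub, priv, ck: list) -> bytes:
    """k = AsyDE_priKey(Ck): decrypt with the matching private exponent."""
    n, _ = pub
    return bytes(pow(c, priv, n) for c in ck)

k = b"secret-key-k"
ck_for_do = asye(DO_PUB, k)                      # DP wrapped k for DO (Sect. 3.1)
k_recovered = asyde(DO_PUB, DO_PRIV, ck_for_do)  # Step 2: DO unwraps k
ck_for_du = asye(DU_PUB, k_recovered)            # Step 3(i): DO re-wraps k for DU
assert asyde(DU_PUB, DU_PRIV, ck_for_du) == k    # Step 4(iii): DU recovers k
```

Note that the encrypted content C never has to be re-encrypted; only the small key k is re-wrapped for each new recipient.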

4 Security Analysis

4.1 Confidentiality

The digital content is always encrypted during the sharing process. Although anyone in the blockchain network can access the encrypted digital content, the content is protected by a strong encryption algorithm. Since DP is considered a trusted party, the only way to decipher the encrypted data is to obtain the key k, which only DO can share. In other words, without DO's permission, the confidentiality of the digital content cannot be violated.

4.2 Integrity

In many scenarios, people (even DO) may try to modify the digital content generated by DP and deliver the modified content to DU for self-benefiting purposes or to edit adverse information. Parties may try to forge the digital content directly on the decentralized storage system or to fake the hyperlink to it. Attackers may try to modify the digital content directly on the decentralized storage system; however, since they do not have the private key of DP, they cannot create the corresponding certificate file. Given the public key of DP from the transaction, any miner in the blockchain system can determine whether the certificate has been forged and then verify the integrity of the encrypted digital content. Moreover, we proposed using IPFS as the decentralized storage. Data stored on IPFS is identified by the hash of its content. Thus, changing the data also changes the hyperlink to it. In our proposed model, DP shares the hyperlink to the data through a blockchain transaction, and DO re-shares the hyperlink by issuing a new transaction with a path to DP's previous transaction, without interacting directly with the digital content. On the other side, blockchain transactions are immutable. Modifying data on a block requires the agreement of a majority of miners, which is very difficult to achieve. This means no party can falsify the digital content by modifying the hyperlink shared by DP. There might be a situation where DO issues a transaction to DU that shares the path to a transaction that DP did not issue. This transaction may contain a hyperlink leading to fake digital content with a corresponding fake certificate. In this case, DU has to double-check DP's information from the blockchain carefully to ensure the transaction was issued by a well-trusted party.

4.3 Anonymity

The model keeps DO's identity completely anonymous. The blockchain network only requires a signature with DO's private key to create a transaction; hence, DO's identity cannot be revealed. The only risk is that DU intentionally reveals DO's identity from the digital content. However, DO can avoid that by simply taking the option of removing all identity-related information from the original data before the digital content is generated.

4.4 Non-repudiation

In some situations, DP or DO may want to deny their action of sharing the digital content due to erroneous or illegal content or a license violation. However, there is no scenario in which DO or DP can deny a transaction they issued on the blockchain, since they have to sign the transaction using their private key to create it. Miners will drop transactions with invalid signatures; hence, DP and DO cannot repudiate their course of action.

4.5 Scalability

The model integrates a blockchain network and a decentralized storage system, which are both peer-to-peer networks. These peer-to-peer networks can easily be expanded by adding more storage nodes to the network.


4.6 Availability

Blockchain networks and decentralized storage systems are peer-to-peer networks consisting of many nodes. Therefore, the work of any miner node made inactive by DoS/DDoS attacks can be taken over by other miner nodes; hence, it is extremely difficult to crash the system. Moreover, in many decentralized storage systems, the uploaded content is cached and stored in different locations, so if the storage node holding the digital content goes down, its cached copies remain accessible on other nodes.

5 Conclusion and Future Work

This paper proposes a model integrating a blockchain network and a decentralized storage system that supports managing and sharing digital content while preserving its integrity and privacy. The model consists of two phases: digital content sharing between DP and DO, and digital content sharing between DO and DU, in which DP acts as a trusted party that authorizes and evaluates data from DO before digitizing it and issuing a certificate to confirm the data's correctness. The digital content is then uploaded to the decentralized storage system. DP and DO later issue new transactions on the blockchain network to share the hyperlink to the uploaded content. We analyzed the model to establish its properties: confidentiality, integrity, anonymity, non-repudiation, scalability, and availability. Our model can be implemented on any blockchain system and easily extended with new features:
– The model considers DP a well-trusted party by default. Future research may focus on building a real-time reliability scale for DP through DU's feedback or a registration procedure.
– The verify operation that determines the integrity of the digital content depends on the data validation of the blockchain network, which requires much time to finish. One future orientation is to reduce this dependence by performing a validation check on a certain number of blocks combined with verification information provided by DP.
– One of the most promising directions is to optimize anonymous digital content with integrated two-layer encryption that gives DO the right to decide whether to share their identity, by providing DU with different keys, each of which may or may not decrypt DO's identity.
– Other copyright protection methods for digital data can also be integrated into the model.

Acknowledgements. This research is funded by University of Science, VNU-HCM under grant number CNTT 2020-16.



Analysis of Ciphertext Behaviour Using the Example of the AES Block Cipher in ECB, CBC, OFB and CFB Modes of Operation, Using Multiple Encryption

Zhanna Alimzhanova, Dauren Nazarbayev(B), Aizada Ayashova, and Aktoty Kaliyeva

Al-Farabi Kazakh National University, 71 Al-Farabi Avenue, 050040 Almaty, Kazakhstan
[email protected]

Abstract. This paper explores the Advanced Encryption Standard (AES) block cipher in Electronic Code Book (ECB), Cipher Block Chaining (CBC), Output Feedback (OFB) and Cipher Feedback (CFB) modes of operation to compare the characteristic properties of the ciphertext and the complexity of the block schemes used to build the ciphertext, using a methodology based on periodic regularities. The investigation of the four block modes of operation rests on two analytical principles: the first defines periodicity with respect to the ciphertext; the second uses repeated cipher iterations to reveal characteristic manifestations of the ciphertext under certain control input data. In accordance with these principles, the results of the analysis of the regularities of the ciphertext with respect to blocks and with respect to encryption iterations are presented in tables and in the obtained formulae, respectively. The Matplotlib package of the Python programming language was used for graphical visualization of the ciphertexts of the first encryption iteration in all investigated modes of operation under different key sizes. The AES algorithm was implemented and the encryption results were obtained using the package Crypto. Keywords: AES · Block cipher · ECB · CBC · OFB · CFB · Mode of operation · Ciphertext · Periodicity · Multiple encryption

1 Introduction

This paper investigates the AES block cipher in four modes of operation: ECB, CBC, OFB and CFB [1, 2]. Block ciphers handle text of a constant length, called the block size. If the text is longer than the block size, it must be divided into several blocks. Typically, the last block of plaintext must be padded to the size of a block. Modes of operation are special constructions designed to enhance the cryptographic strength of block ciphers and ensure the confidentiality of information [3].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 621–629, 2022. https://doi.org/10.1007/978-3-031-21967-2_50


In this paper, the following encryption modes were investigated using the example of the AES block cipher: ECB, CBC, OFB and CFB [4]. The study was conducted by the method of multiple encryption to examine the schemes by which the modes of operation are constructed [5]. AES is a symmetric block cipher that was adopted in 2001 by the National Institute of Standards and Technology (NIST) [6]. The AES block cipher uses three fixed key sizes: 128 bits, 192 bits, and 256 bits. For each key size, the number of rounds is defined as follows: 10 rounds for a 128-bit key, 12 rounds for a 192-bit key, and 14 rounds for a 256-bit key [7]. The plaintext is split into blocks of 16 bytes; if the length of the last block is not a multiple of 16 bytes, the block is padded to a size of 16 bytes. Each block is represented as a 4 × 4 matrix [8]. In this paper, the analysis of ciphertext behaviour was performed for the four modes of operation ECB, CBC, OFB and CFB under certain control data. The analysis consists of identifying the periodicity of the ciphertexts [9, 10]. The authors applied the method of multiple encryption to analyze the ciphertexts in the four modes of operation for a 128-bit key. Visualization in the Python programming language using the package Matplotlib was applied to illustrate the behaviour of the changes in the dynamics of the ciphertexts under different key sizes. The AES algorithm was implemented and the encryption results were obtained using the package Crypto [11, 12]. The scientific novelty of the research consists in identifying the main characteristics of the ciphertext behaviour and in detecting the periodicity of ciphertexts by the method of multiple encryption in the analysis of the schemes for constructing the modes of operation, using the example of the AES block cipher in the investigated modes.
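The 16-byte block handling described above can be sketched as follows. The PKCS#7-style padding rule used here is our assumption for illustration (the paper does not specify a padding scheme, and its 48-byte test plaintext is already an exact multiple of the block size, so no padding applies there):

```python
BLOCK = 16  # AES block size in bytes

def pad(data: bytes, block: int = BLOCK) -> bytes:
    """PKCS#7-style padding (an assumption): always adds 1..block bytes."""
    n = block - len(data) % block
    return data + bytes([n]) * n

def split_blocks(data: bytes, block: int = BLOCK) -> list:
    """Pad the plaintext and split it into fixed-size blocks."""
    padded = pad(data, block)
    return [padded[i:i + block] for i in range(0, len(padded), block)]

# A 20-byte input yields one full data block plus one padded block:
blocks = split_blocks(b"\x00" * 20)
assert len(blocks) == 2 and all(len(b) == BLOCK for b in blocks)
```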

2 Analysis of Characteristics of Ciphertext Behaviour in the Modes of Operation

2.1 Analysis of Characteristics of Ciphertext Behaviour in ECB Mode

The first object of research is the ECB mode. In ECB mode each block is encrypted independently of the others. Thus, the individual blocks of plaintext are converted into separate blocks of ciphertext. Decryption takes place according to an analogous scheme. In ECB mode, multiple blocks can be encrypted and decrypted in parallel. For the analysis of the ECB mode and the other modes of operation, the control input data shown in Table 1 were selected. In Table 1, the plaintext consists of three blocks. We use three keys of different sizes with the same text to determine the dynamics of change. To illustrate the behaviour of the ciphertext, we used visualization technology in the Python programming language. In all tables, the data are presented in decimal notation, and each number value represents one byte (8 bits).

Analysis of Ciphertext Behaviour Using the Example of the AES Block Cipher

623

Table 1. Control input data.

Plaintext: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Key (128 bits): [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Key (192 bits): [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Key (256 bits): [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Initialization vector: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Table 2. Results on the first encryption iteration in ECB mode under different key sizes.

128-bit key: [102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46, 102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46, 102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46]
192-bit key: [170, 224, 105, 146, 172, 191, 82, 163, 232, 244, 169, 110, 201, 48, 11, 215, 170, 224, 105, 146, 172, 191, 82, 163, 232, 244, 169, 110, 201, 48, 11, 215, 170, 224, 105, 146, 172, 191, 82, 163, 232, 244, 169, 110, 201, 48, 11, 215]
256-bit key: [220, 149, 192, 120, 162, 64, 137, 137, 173, 72, 162, 20, 146, 132, 32, 135, 220, 149, 192, 120, 162, 64, 137, 137, 173, 72, 162, 20, 146, 132, 32, 135, 220, 149, 192, 120, 162, 64, 137, 137, 173, 72, 162, 20, 146, 132, 32, 135]

Fig. 1. Visualization of results on the first encryption iteration in ECB mode under 128-, 192- and 256-bit key sizes.

Figure 1 shows the plaintext and ciphertext in ECB mode under different key sizes: 128, 192 and 256 bits. Regardless of the key size, under the control input data from Table 1, in Fig. 1 we observe the periodicity of the ciphertexts with respect to each block.
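The per-block periodicity observed above is a structural property of ECB: any deterministic block cipher maps equal plaintext blocks to equal ciphertext blocks. A toy stand-in cipher (a SHA-256-derived XOR pad, not AES) is therefore enough to reproduce the effect seen in Table 2:

```python
import hashlib

BLOCK = 16

def E(key: bytes, block: bytes) -> bytes:
    """Toy invertible block transform standing in for AES."""
    pad = hashlib.sha256(key + b"block-pad").digest()[:BLOCK]
    return bytes(b ^ p for b, p in zip(block, pad))

def ecb_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # ECB: each block is encrypted independently of the others.
    return b"".join(E(key, plaintext[i:i + BLOCK])
                    for i in range(0, len(plaintext), BLOCK))

ct = ecb_encrypt(b"\x00" * 16, b"\x00" * 48)   # three identical zero blocks
# All three ciphertext blocks coincide, i.e. the period is one block:
assert ct[0:16] == ct[16:32] == ct[32:48]
```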


2.2 Analysis of Characteristics of Ciphertext Behaviour in CBC Mode

The next object of research is the CBC mode. In CBC mode, each plaintext block is XORed with the result of encrypting the previous block before being encrypted itself. Thus, to encrypt each subsequent block, the encryption result of the previous block is required, so several blocks cannot be encrypted in parallel. For the analysis of this mode, the control input data shown in Table 1 were selected. Table 3 and Fig. 2 show the ciphertexts in the CBC mode under different key sizes: 128, 192 and 256 bits.

Table 3. Results on the first encryption iteration in CBC and OFB modes under different key sizes.

128-bit key: [102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46, 247, 149, 189, 74, 82, 226, 158, 215, 19, 211, 19, 250, 32, 233, 141, 188, 161, 12, 246, 109, 15, 221, 243, 64, 83, 112, 180, 191, 141, 245, 191, 179]
192-bit key: [170, 224, 105, 146, 172, 191, 82, 163, 232, 244, 169, 110, 201, 48, 11, 215, 82, 246, 116, 183, 185, 3, 15, 218, 177, 61, 24, 220, 33, 78, 179, 49, 24, 78, 39, 252, 152, 137, 153, 79, 178, 31, 190, 232, 49, 66, 137, 85]
256-bit key: [220, 149, 192, 120, 162, 64, 137, 137, 173, 72, 162, 20, 146, 132, 32, 135, 8, 195, 116, 132, 140, 34, 130, 51, 194, 179, 79, 51, 43, 210, 233, 211, 139, 112, 197, 21, 166, 102, 61, 56, 205, 184, 230, 83, 43, 38, 100, 145]

Fig. 2. Visualization of results on the first encryption iteration in CBC and OFB modes under 128-, 192- and 256-bit key sizes. (All encryption results in CBC and OFB modes are the same.)

Under the control input data from Table 1 and the encryption results from Table 3, one can note that in the CBC mode no periodicity with respect to each block is revealed. For additional analysis of the CBC mode, the authors applied the multiple encryption approach. Applying this approach for three iterations of encryption, we obtained the following results:


Table 4. Results of multiple encryption in CBC mode.

Iteration 1: [102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46, 247, 149, 189, 74, 82, 226, 158, 215, 19, 211, 19, 250, 32, 233, 141, 188, 161, 12, 246, 109, 15, 221, 243, 64, 83, 112, 180, 191, 141, 245, 191, 179]
Iteration 2: [247, 149, 189, 74, 82, 226, 158, 215, 19, 211, 19, 250, 32, 233, 141, 188, 102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46, 237, 109, 134, 95, 127, 189, 108, 89, 38, 218, 69, 177, 16, 193, 83, 45]
Iteration 3: [161, 12, 246, 109, 15, 221, 243, 64, 83, 112, 180, 191, 141, 245, 191, 179, 237, 109, 134, 95, 127, 189, 108, 89, 38, 218, 69, 177, 16, 193, 83, 45, 102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46]

In order to analyze the mode behaviour, all multiple encryption results in which some regularities emerge are best represented as a two-dimensional matrix. Here, the number of rows equals the number of encryption iterations, and the number of columns equals the number of blocks of ciphertext. We represent the results of encryption in the CBC mode with three iterations in the form of a matrix (a_ij)_{3×3}, in which each element is a sequence (list) of 16 number values, so the matrix has the following form:

    ( a11 a12 a13 )
    ( a21 a22 a23 )    (1)
    ( a31 a32 a33 )

here, each element is the ciphertext sequence of one block:

a11 = [102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46]
a12 = [247, 149, 189, 74, 82, 226, 158, 215, 19, 211, 19, 250, 32, 233, 141, 188]
a13 = [161, 12, 246, 109, 15, 221, 243, 64, 83, 112, 180, 191, 141, 245, 191, 179]
a21 = [247, 149, 189, 74, 82, 226, 158, 215, 19, 211, 19, 250, 32, 233, 141, 188]
a22 = [102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46]
a23 = [237, 109, 134, 95, 127, 189, 108, 89, 38, 218, 69, 177, 16, 193, 83, 45]
a31 = [161, 12, 246, 109, 15, 221, 243, 64, 83, 112, 180, 191, 141, 245, 191, 179]
a32 = [237, 109, 134, 95, 127, 189, 108, 89, 38, 218, 69, 177, 16, 193, 83, 45]
a33 = [102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46]

Analyzing the elements of the matrix (a_ij)_{3×3}, we see a regularity: the matrix is symmetric with respect to the main diagonal. In CBC mode, the periodicity appears only along the main diagonal, and all elements of the main diagonal coincide with the result of data encryption in ECB mode. Thus, we observe that the ciphertext of the first block over n iterations in ECB mode corresponds to the ciphertext of the first iteration over n blocks in CBC mode.
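The symmetry and the diagonal property hold for any block cipher under all-zero plaintext and IV, not only AES, because they follow from the XOR chaining structure of CBC alone. The following sketch reproduces both with a toy key-derived S-box standing in for AES:

```python
import hashlib

BLOCK = 16

def make_sbox(key: bytes) -> bytes:
    """Key-derived byte permutation: a toy, invertible block-cipher stand-in."""
    order = sorted(range(256),
                   key=lambda b: hashlib.sha256(key + bytes([b])).digest())
    return bytes(order)

def E(sbox: bytes, block: bytes) -> bytes:
    return bytes(sbox[b] for b in block)

def cbc_encrypt(sbox: bytes, plaintext: bytes, iv: bytes) -> bytes:
    out, prev = [], iv
    for i in range(0, len(plaintext), BLOCK):
        mixed = bytes(a ^ b for a, b in zip(plaintext[i:i + BLOCK], prev))
        prev = E(sbox, mixed)            # C_i = E(P_i XOR C_{i-1})
        out.append(prev)
    return b"".join(out)

sbox = make_sbox(b"\x00" * 16)
iv, pt = b"\x00" * BLOCK, b"\x00" * 48   # all-zero control data, as in Table 1
rows = []
for _ in range(3):                       # three encryption iterations
    pt = cbc_encrypt(sbox, pt, iv)
    rows.append([pt[j:j + BLOCK] for j in range(0, 48, BLOCK)])

# The 3x3 block matrix (a_ij) is symmetric, and its main diagonal equals
# the ECB encryption of a zero block:
assert all(rows[i][j] == rows[j][i] for i in range(3) for j in range(3))
assert all(rows[i][i] == E(sbox, b"\x00" * BLOCK) for i in range(3))
```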


2.3 Analysis of Characteristics of Ciphertext Behaviour in OFB Mode

The next object of analysis is the OFB mode. To analyze the OFB mode, the control input data shown in Table 1 were selected. Figure 2 shows the plaintext and ciphertexts in OFB mode under different key sizes (128, 192 and 256 bits); the results coincide with those of the CBC mode on the first iteration, and therefore no periodicity with respect to each block is found here either. Next, we apply the multiple encryption approach, under which a clear periodicity appears. Table 5 below shows the result of three iterations:

Table 5. Results of multiple encryption in OFB mode.

Iteration 1: [102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46, 247, 149, 189, 74, 82, 226, 158, 215, 19, 211, 19, 250, 32, 233, 141, 188, 161, 12, 246, 109, 15, 221, 243, 64, 83, 112, 180, 191, 141, 245, 191, 179]
Iteration 2: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Iteration 3: [102, 233, 75, 212, 239, 138, 44, 59, 136, 76, 250, 89, 202, 52, 43, 46, 247, 149, 189, 74, 82, 226, 158, 215, 19, 211, 19, 250, 32, 233, 141, 188, 161, 12, 246, 109, 15, 221, 243, 64, 83, 112, 180, 191, 141, 245, 191, 179]
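The alternation visible in Table 5 (ciphertext, plaintext, ciphertext, ...) follows from the structure of OFB: the keystream depends only on the key and IV, so encrypting twice XORs the same keystream in and back out. A toy sketch, with SHA-256 standing in for the AES block function, reproduces it:

```python
import hashlib

BLOCK = 16

def ofb_keystream(key: bytes, iv: bytes, nblocks: int) -> bytes:
    """O_i = E(O_{i-1}): the keystream depends only on the key and IV."""
    out, o = b"", iv
    for _ in range(nblocks):
        o = hashlib.sha256(key + o).digest()[:BLOCK]   # toy E, stand-in for AES
        out += o
    return out

def ofb(key: bytes, iv: bytes, data: bytes) -> bytes:
    ks = ofb_keystream(key, iv, (len(data) + BLOCK - 1) // BLOCK)
    return bytes(a ^ b for a, b in zip(data, ks))

key = iv = b"\x00" * BLOCK
pt = b"\x00" * 48
c1 = ofb(key, iv, pt)    # iteration 1
c2 = ofb(key, iv, c1)    # iteration 2: the same keystream cancels out
c3 = ofb(key, iv, c2)    # iteration 3
assert c2 == pt and c3 == c1   # period T = 2, matching Table 5
```

Note that the toy E need not even be invertible: OFB never calls the block function's inverse, which is also why encryption and decryption coincide in this mode.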

Regardless of the number of blocks, with multiple encryption we observe that the period relative to the iterations is T = 2, and the following equality holds:

C_i = C_1, if i = 2k + 1; C_i = P, if i = 2k (k = 1, ..., n)   (2)

here, i is the iteration number, C_i is the ciphertext of the i-th iteration, and P is the plaintext. In the OFB mode of operation, the encryption algorithm coincides with the decryption algorithm, so with the encryption algorithm alone we can both encrypt and decrypt messages.

2.4 Analysis of Characteristics of Ciphertext Behaviour in CFB Mode

The next object of analysis is the CFB mode. To analyze the CFB mode, the control input data shown in Table 1 were selected. Figure 3 shows the plaintext and ciphertexts in CFB mode under different key sizes: 128, 192 and 256 bits.

Analysis of Ciphertext Behaviour Using the Example of the AES Block Cipher

627

Table 6. Results on the first encryption iteration in CFB mode under different key sizes.

128-bit key: [102, 22, 249, 46, 66, 168, 241, 26, 145, 22, 104, 87, 142, 195, 170, 15, 147, 0, 74, 171, 34, 177, 43, 38, 209, 40, 150, 115, 28, 209, 18, 246, 156, 86, 182, 80, 202, 117, 174, 19, 203, 115, 211, 222, 150, 117, 74, 20]

192-bit key: [170, 208, 222, 119, 226, 246, 187, 90, 31, 118, 59, 243, 255, 101, 212, 8, 234, 72, 204, 107, 46, 221, 1, 91, 17, 50, 171, 218, 183, 93, 100, 108, 83, 82, 65, 92, 27, 136, 241, 142, 191, 125, 237, 249, 185, 27, 122, 134]

256-bit key: [220, 168, 57, 92, 161, 137, 59, 5, 250, 216, 181, 118, 95, 143, 64, 207, 167, 255, 134, 230, 48, 103, 107, 173, 87, 184, 251, 116, 189, 179, 10, 109, 116, 206, 105, 205, 203, 67, 15, 170, 107, 215, 192, 142, 167, 244, 183, 236]

Fig. 3. Visualization of results on the first encryption iteration in CFB mode under 128-, 192- and 256-bit key sizes.

In CFB mode, as in the CBC and OFB modes under the control input data from Table 1, no periodicity with respect to each block is detected either. However, using the multiple encryption approach, CFB mode exhibits regularities different from those of the CBC and OFB modes. This regularity manifests itself both within each block element and across blocks. Analyzing the encryption results in Table 7, we see the following regularities with respect to iterations: the first element in the iteration sequence repeats with a step of 2, the first two elements repeat with a step of 4, the first three elements with a step of 8, the first four with a step of 16, and so on. Thus, the repetition step of the first k elements equals:

    τ_k = 2^k    (3)

Since each block has 16 elements (16 bytes), it can be predicted that the repetition step of one data block equals τ_1block = 2^16, of two blocks τ_2block = 2^32, and thus, for n blocks:

    τ_nblock = 2^(n×16) = T_n    (4)

where T_n is the period of the encrypted sequence of n blocks.
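The stepwise repetitions of formula (3) can be reproduced in simulation. The sketch below makes two illustrative assumptions beyond the text: it uses a byte-oriented CFB variant (8-bit feedback segments), which produces exactly this per-byte repetition pattern, and a hash-based stand-in for the AES block function so the example is self-contained:

```python
import hashlib

def toy_block_encrypt(block: bytes, key: bytes) -> bytes:
    # Stand-in for the AES block function (forward direction only).
    return hashlib.sha256(key + block).digest()[:16]

def cfb8_encrypt(data: bytes, key: bytes, iv: bytes) -> bytes:
    # CFB with 8-bit feedback: one keystream byte per input byte,
    # with each ciphertext byte shifted into the feedback register.
    sr, out = iv, bytearray()
    for p in data:
        s = toy_block_encrypt(sr, key)[0]
        c = p ^ s
        out.append(c)
        sr = sr[1:] + bytes([c])
    return bytes(out)

key, iv = b"K" * 16, b"I" * 16
seq = [bytes(range(32))]              # index 0: the plaintext
for _ in range(9):                    # iterations 1..9
    seq.append(cfb8_encrypt(seq[-1], key, iv))

# first element repeats with step tau_1 = 2 (iterations 1 and 3)
assert seq[1][0] == seq[3][0]
# first two elements repeat with step tau_2 = 4 (iterations 1 and 5)
assert seq[1][:2] == seq[5][:2]
# first three elements repeat with step tau_3 = 8 (iterations 1 and 9)
assert seq[1][:3] == seq[9][:3]
```

The pattern follows by telescoping: each output byte is the previous iteration's byte XORed with a keystream byte whose value cycles with the period of the preceding output bytes, doubling the period at every position.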

628

Z. Alimzhanova et al.

Table 7. Results of multiple encryption in CFB mode.

Iteration 1: [102, 22, 249, 46, 66, 168, 241, 26, 145, 22, 104, 87, 142, 195, 170, 15, 147, 0, 74, 171, 34, 177, 43, 38, 209, 40, 150, 115, 28, 209, 18, 246, 156, 86, 182, 80, 202, 117, 174, 19, 203, 115, 211, 222, 150, 117, 74, 20]

Iteration 3: [102, 102, 182, 18, 22, 166, 154, 189, 184, 158, 225, 198, 153, 164, 243, 212, 244, 12, 243, 90, 24, 51, 14, 108, 67, 91, 136, 243, 57, 244, 219, 165, 45, 21, 115, 26, 33, 72, 3, 227, 151, 220, 130, 115, 100, 13, 5, 214]

Iteration 5: [102, 22, 41, 67, 161, 220, 3, 251, 255, 103, 43, 170, 30, 40, 70, 48, 29, 121, 84, 136, 144, 189, 244, 140, 245, 245, 39, 125, 117, 253, 237, 47, 170, 96, 154, 250, 155, 239, 219, 18, 40, 146, 224, 5, 21, 199, 118, 183]

Iteration 9: [102, 22, 249, 109, 216, 178, 100, 38, 172, 71, 69, 12, 82, 64, 246, 117, 199, 166, 146, 101, 34, 253, 204, 227, 230, 87, 162, 219, 125, 145, 69, 61, 226, 240, 85, 145, 90, 191, 237, 245, 183, 118, 186, 174, 44, 146, 206, 243]

Iteration 257: [102, 22, 249, 46, 66, 168, 241, 26, 43, 200, 0, 42, 39, 1, 68, 4, 112, 95, 168, 248, 169, 20, 125, 226, 240, 78, 28, 30, 198, 211, 233, 114, 28, 40, 174, 224, 219, 165, 148, 7, 241, 148, 174, 182, 43, 19, 196, 233]

…

Iteration 65537: [102, 22, 249, 46, 66, 168, 241, 26, 145, 22, 104, 87, 142, 195, 170, 15, 212, 38, 118, 11, 224, 33, 33, 236, 76, 59, 21, 165, 30, 194, 159, 14, 169, 90, 216, 206, 94, 97, 146, 85, 77, 57, 103, 94, 54, 54, 107, 164]

3 Conclusion

In this paper, the ECB mode was selected as the first object of research; there, all encryption results are periodic sequences relative to each block. The CBC, OFB and CFB modes were then investigated in turn. Under certain control data, in these three modes of operation the periodicity relative to the blocks was not detected, and the multiple encryption approach was therefore applied to them. As a result of exploring the CBC, OFB and CFB modes, certain relations were identified, represented in Tables 3, 4, 5, 6 and 7 and in formulae (2)–(4). In CBC mode, multiple encryption iterations yield a symmetric matrix (1), the elements of whose main diagonal correspond to the result of encryption in ECB mode. We observe a connection between the ECB and CBC modes: the ciphertext of the first block over n iterations in ECB mode corresponds to the ciphertext of the first iteration over n blocks in CBC mode. In OFB mode, there is an obvious periodicity with respect to the encryption iteration: at odd iterations the ciphertext corresponds to the ciphertext of the first iteration, and at even iterations the ciphertext corresponds to the plaintext, as reflected in formula (2). In this mode, encryption and decryption are the same operation, so the encryption algorithm alone can encrypt and decrypt messages. In CFB mode, a sequence of repetitions was determined, which


changed in accordance with the geometric progression formula (3), and formula (4) was obtained for detecting the period (absolute cycle) of an n-block ciphertext. All encryption results were obtained in the Python programming language using the Matplotlib and Crypto packages. Based on the obtained results of the analysis of ciphertext behaviour in the investigated modes of operation, using the example of the AES block cipher, the authors plan further research and the development of an expert system model to analyze and estimate the complexity of mode-of-operation schemes, to detect the periodicity of ciphertexts, and to recognize different modern cryptographic ciphers with the assistance of machine learning tools.


Vulnerability Analysis of IoT Devices to Cyberattacks Based on Naïve Bayes Classifier

Jolanta Mizera-Pietraszko and Jolanta Tańcula

1 Military University of Land Forces, Wrocław, Poland
[email protected]
2 Opole University, Opole, Poland
[email protected]

Abstract. IoT, or the Smart World, as a global technology, is a rapidly growing concept of ICT systems interoperability covering many areas of life. Increasing the speed of data transmission, increasing the number of devices per square meter, reducing delays: all this is guaranteed by modern technologies in combination with the 5G standard. However, a key role is played by the protection and security of the network infrastructure and of the network itself. No matter what functions IoT is to perform, all devices included in such a system are connected by networks. IoT does not create a uniform environment, hence its vulnerability in the context of cybersecurity. This paper deals with the selection of a method to classify software vulnerabilities to cyberattacks and threats in the network. The classifier is built with the Naive Bayes method; its quality, i.e. whether it classifies vulnerabilities correctly, is assessed by plotting the ROC curve and analyzing the Area Under the Curve (AUC).

Keywords: Internet of Things · Cyberattacks · Naïve Bayes · Networking · 5G standard

1 Introduction

The offer of smart solutions such as Smart City or Smart Home is nowadays very wide, and the development of this technology in combination with 5G networks expands its functionalities even more. The specific nature of IoT technology exposes many vulnerabilities to cyberattacks and security gaps, so IoT security is nowadays a priority issue. The threat posed by the very structure of IoT lies primarily in its distributed organization. Due to the distributed management system, ensuring a substantial security level and monitoring the devices poses a challenge. Facing threats or device failures requires both quick action and in-depth knowledge of IoT security and reliability. Quite often the network is insufficiently secured in terms of authentication, through weak or missing passwords. An unsecured network can be an entry point for spying devices. IoT security is particularly important in the industrial sector. Some enterprises constitute critical infrastructure, important for the security of the region or country. Securing against a loss of control over such systems, e.g. the energy supply, calls for considerable security measures. Figure 1 shows the structure of an IoT network.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. N. T. Nguyen et al. (Eds.): ACIIDS 2022, LNAI 13758, pp. 630–642, 2022. https://doi.org/10.1007/978-3-031-21967-2_51

Vulnerability Analysis of IoT Devices to Cyberattacks

631

Fig. 1. Structure of IoT network

2 CVSS System

The Common Vulnerability Scoring System (CVSS) measures the level of software vulnerability on a point scale. It consists of three groups of metrics: Base, Temporal and Environmental. The Base group represents vulnerability characteristics that are constant over time and may exist across different user environments, the Temporal group reflects characteristics that change over time, and the Environmental group represents characteristics unique to a user's environment. The Base metrics produce a score ranging from 0 to 10, which can then be modified by the Temporal and Environmental metrics. A detected vulnerability has one of the following severity levels in CVSS v3:

– Secure: no vulnerabilities identified
– Critical: 9.0–10.0 on the CVSS scale
– High: 7.0–8.9 on the CVSS scale
– Medium: 4.0–6.9 on the CVSS scale
– Low: 0.1–3.9 on the CVSS scale
– None: 0.0 on the CVSS scale

At the critical and high levels, a vulnerability can lead to taking control of servers or infrastructure devices, resulting in data leakage or loss, in some cases system downtime, or an increase in the attacker's privilege level. It is therefore recommended to patch or update every vulnerability that occurs as soon as possible. Low-level vulnerabilities, on the other hand, have a low impact on the functioning of devices, their software and operating systems. An example of a critical vulnerability detected in March 2022 was unique in the sense that it allowed hackers to launch DDoS attacks of exceptional strength. This vulnerability was detected in the MiCollab and MiVoice Business Express business phone systems manufactured by Mitel, which act as gateways. It was named CVE-2022-26143, and is also known as TP240PhoneHome. The attack vector for this vulnerability has been identified as:

632

J. Mizera-Pietraszko and J. Tańcula

CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H. According to NIST, it has been assigned a critical level with a Base score of 9.8. Our motivation is to use the parameters of the model according to the CVSS classification, i.e., to take each software vulnerability together with its corresponding score value.
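The severity bands listed above and the CVSS v3.1 vector notation can be captured in a few lines. The sketch below is a simplified illustration, not a full CVSS calculator: it only splits the vector string into its metric fields and maps an externally supplied Base score to a severity label:

```python
def cvss_severity(score: float) -> str:
    # Severity bands for CVSS v3.x, as listed in the text above.
    if score == 0.0:
        return "None"
    if score <= 3.9:
        return "Low"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "High"
    return "Critical"

def parse_vector(vector: str) -> dict:
    # Split "CVSS:3.1/AV:N/..." into {"AV": "N", ...};
    # the first component is the version tag.
    parts = vector.split("/")
    assert parts[0].startswith("CVSS:")
    return dict(p.split(":") for p in parts[1:])

v = parse_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
assert v["AV"] == "N" and v["C"] == "H"       # network vector, high confidentiality impact
assert cvss_severity(9.8) == "Critical"       # CVE-2022-26143 Base score
```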

3 Classification in Data Mining

There are many classification methods in the literature, including classification by induction of decision trees [1–4, 26], Bayesian classifiers [5–10], neural networks [11–15], statistical analysis [16, 17], metaheuristics (e.g. genetic algorithms) [18–20], rough sets [21–25], and many other statistical methods still being developed. Among these, the most frequently used is classification by induction of decision trees, which is particularly attractive for data mining [21–23]. Due to its intuitive representation, the resulting classification model is user-friendly; moreover, decision trees can be constructed relatively quickly compared to other classification methods. However, their main drawback is the inability to capture correlations between attributes without further computation. In the following sections, we take a closer look at the steps of threat classification in computer networks and at the use of Naive Bayes classification as a method for detecting system vulnerabilities.
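As a sketch of the Naive Bayes approach adopted here, the following minimal Gaussian Naive Bayes classifier is written from scratch; the feature values (e.g. a CVSS base score and an exploitability score) and class labels are hypothetical illustrations, not the paper's data set:

```python
import math
from collections import defaultdict

class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes, for illustration only."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.stats = {}
        for label, rows in groups.items():
            prior = len(rows) / len(y)
            params = []
            for col in zip(*rows):          # per-feature mean/variance
                mean = sum(col) / len(col)
                var = sum((v - mean) ** 2 for v in col) / len(col) or 1e-9
                params.append((mean, var))
            self.stats[label] = (prior, params)
        return self

    def predict(self, x):
        def log_posterior(label):
            prior, params = self.stats[label]
            total = math.log(prior)
            for v, (mean, var) in zip(x, params):
                # log of the Gaussian density, feature independence assumed
                total += -0.5 * math.log(2 * math.pi * var) \
                         - (v - mean) ** 2 / (2 * var)
            return total
        return max(self.stats, key=log_posterior)

# Hypothetical rows: [CVSS base score, exploitability score]
X = [[9.8, 3.9], [9.1, 3.1], [2.3, 1.0], [3.1, 1.4]]
y = ["vulnerable", "vulnerable", "safe", "safe"]
clf = TinyGaussianNB().fit(X, y)
assert clf.predict([9.0, 3.5]) == "vulnerable"
assert clf.predict([2.0, 1.2]) == "safe"
```

The "naive" part is the independence assumption in log_posterior: class-conditional feature densities are multiplied (log-summed) as if the features were independent.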

4 ROC Curves

The quality of our classifier is assessed using ROC (Receiver Operating Characteristic) curves, which constitute a graphical representation of the prediction model. They show the relationship between sensitivity and specificity. Sensitivity is the ability of the model to capture positive cases; specificity, in contrast, is its ability to capture negative cases. Decision support systems are widely used in many fields of life. In this paper, we deal with a two-class problem, i.e. assigning a network vulnerability to one of two groups: the vulnerability exists or it does not. A common problem in decision making is the making of mistakes. Erroneous decisions are inevitable because classes are often not cleanly separated. Our goal is to predict the distinguished class and therefore to make a decision according to the following scoring:

– TP (true positive): the number of correctly classified examples from the selected class (hit),
– FN (false negative): the number of misclassified examples from this class, i.e., a negative decision while the example is actually positive (miss),
– TN (true negative): the number of examples correctly recognized as not belonging to the selected class (correct rejection),
– FP (false positive): the number of examples incorrectly assigned to the selected class while in fact they do not belong to it (false alarm).


4.1 Confusion Matrix

Combinations of the above four values, actual and predicted, form a confusion matrix, shown in Table 1.

Table 1. Confusion matrix

              Actual P   Actual N
Predicted P      TP         FP
Predicted N      FN         TN

On the basis of the results presented in the confusion matrix we define the following measures of the model. Sensitivity describes the probability that the classification is correct provided the case is positive:

    SE (sensitivity) = TP / (TP + FN)

Specificity represents the probability that the classification is correct provided the case is negative:

    SP (specificity) = TN / (TN + FP)

The values of the sensitivity and specificity measures form the coordinates of points on the plane, called cutoff points. For each cutoff point we calculate both measures and mark them in the coordinate system: 1 − specificity on the abscissa axis and sensitivity on the ordinate axis. A high sensitivity value means that the system correctly classifies the objects it recognizes; a high specificity value means that the system classifies only a few non-relevant objects as relevant. Therefore, high values of both sensitivity and specificity are desirable. Connecting the resulting points on the plane yields the plot of the ROC curve.

4.2 Area Under the Curve (AUC)

The area under the plot of the ROC curve, denoted AUC, can be taken as a measure of the goodness and accuracy of a given model. The value of AUC represents the probability that the classifier ranks a randomly selected object from the distinguished class higher than a randomly selected object from the non-distinguished class. The scale of AUC values is shown in Table 2. The more the ROC curve bows toward the upper-left corner, the more powerful the test.
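The two measures, together with a trapezoidal estimate of AUC from a handful of cutoff points, can be sketched as follows; the confusion-matrix counts and ROC points are made up for illustration:

```python
def sensitivity(tp: int, fn: int) -> float:
    # SE = TP / (TP + FN): fraction of actual positives caught
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    # SP = TN / (TN + FP): fraction of actual negatives rejected
    return tn / (tn + fp)

# Hypothetical counts for one cutoff point
tp, fn, tn, fp = 40, 10, 45, 5
assert sensitivity(tp, fn) == 0.8
assert specificity(tn, fp) == 0.9

def auc(points):
    # Trapezoidal area under ROC points given as
    # (1 - specificity, sensitivity) pairs, sorted by x.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

roc = [(0.0, 0.0), (0.1, 0.8), (1.0, 1.0)]   # three cutoff points
assert abs(auc(roc) - 0.85) < 1e-9
```

An AUC of 0.5 (the diagonal) corresponds to random guessing; values approaching 1.0 indicate an increasingly powerful classifier, which is the interpretation used in Table 2.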

Table 2. Scale values of ROC curves

Value

Excellent