Advances in Intelligent Systems and Computing 1445
Shahram Latifi Editor
ITNG 2023 20th International Conference on Information Technology-New Generations
Advances in Intelligent Systems and Computing
Volume 1445

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Győr, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines, such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, and life science, are covered. The list of topics spans all the areas of modern intelligent systems and computing, such as: computational intelligence; soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms; social intelligence; ambient intelligence; computational neuroscience; artificial life; virtual worlds and society; cognitive science and systems; perception and vision; DNA and immune-based systems; self-organizing and adaptive systems; e-learning and teaching; human-centered and human-centric computing; recommender systems; intelligent control; robotics and mechatronics including human-machine teaming; knowledge-based paradigms; learning paradigms; machine ethics; intelligent data analysis; knowledge management; intelligent agents; intelligent decision making and support; intelligent network security; trust management; interactive entertainment; and Web intelligence and multimedia.

The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and worldwide distribution, which permits a rapid and broad dissemination of research results.

Indexed by DBLP, INSPEC, WTI Frankfurt eG, zbMATH, Japanese Science and Technology Agency (JST). All books published in the series are submitted for consideration in Web of Science.

For proposals from Asia please contact Aninda Bose ([email protected]).
Editor
Shahram Latifi
Department of Electrical and Computer Engineering, University of Nevada, Las Vegas, NV, USA
ISSN 2194-5357  ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-3-031-28331-4  ISBN 978-3-031-28332-1 (eBook)
https://doi.org/10.1007/978-3-031-28332-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Contents

Part I Machine Learning

1 Loop Closure Detection in Visual SLAM Based on Convolutional Neural Network
  Fabiana Naomi Iegawa, Wagner Tanaka Botelho, Tamires dos Santos, Edson Pinheiro Pimentel, and Flavio Shigeo Yamamoto
2 Getting Local and Personal: Toward Building a Predictive Model for COVID in Three United States Cities
  April Edwards, Leigh Metcalf, William A. Casey, Shirshendu Chatterjee, Heeralal Janwa, and Ernest Battifarano
3 Integrating LSTM and EEMD Methods to Improve Significant Wave Height Prediction
  Ashkan Reisi-Dehkordi, Alireza Tavakkoli, and Frederick C. Harris Jr.
4 A Deep Learning Approach for Sentiment and Emotional Analysis of Lebanese Arabizi Twitter Data
  Maria Raïdy and Haidar Harmanani
5 A Two-Step Approach to Boost Neural Network Generalizability in Predicting Defective Software
  Alexandre Nascimento, Vinicius Veloso de Melo, Marcio Basgalupp, and Luis Alberto Viera Dias
6 A Principal Component Analysis-Based Scoring Mechanism to Quantify Crime Hot Spots in a City
  Yu Wu and Natarajan Meghanathan
7 Tuning Neural Networks for Superior Accuracy on Resource-Constrained Edge Microcontrollers
  Alexandre M. Nascimento, Vinícius V. de Melo, and Márcio P. Basgalupp
8 A Deep Learning Approach for the Intersection Congestion Prediction Problem
  Marie Claire Melhem and Haidar Harmanani
9 A Detection Method for Stained Asbestos Based on Dyadic Wavelet Packet Transform and a Locally Adaptive Method of Edge Extraction
  Hikaru Tomita and Teruya Minamoto
10 Machine Learning: Fake Product Prediction System
  Okey Igbonagwam

Part II Cybersecurity and Blockchain

11 Ontology of Vulnerabilities and Attacks on VLAN
  Marcio Silva Cruz, Ferrucio de Franco Rosa, and Mario Jino
12 Verifying X.509 Certificate Extensions
  Cody Welu, Michael Ham, and Kyle Cronin
13 Detecting Malicious Browser Extensions by Combining Machine Learning and Feature Engineering
  Jacob Rydecki, Jizhou Tong, and Jun Zheng
14 A Lightweight Mutual Authentication and Key Generation Scheme in IoV Ecosystem
  Samah Mansour, Adrian Lauf, and Mostafa El-Said
15 To Reject or Not Reject: That Is the Question. The Case of BIKE Post Quantum KEM
  Nir Drucker, Shay Gueron, and Dusan Kostic
16 IoT Forensics: Machine to Machine Embedded with SIM Card
  Istabraq Mohammed Alshenaifi, Emad Ul Haq Qazi, and Abdulrazaq Almorjan
17 Streaming Platforms Based on Blockchain Technology: A Business Model Impact Analysis
  Rendrikson Soares, André Araújo, Gabriel Rodrigues, and Charles Alencar
18 Digital Forensic Investigation Framework for Dashcam
  Saad Alboqami, Huthifah Alkurdi, Nawar Hinnawi, Emad Ul Haq Qazi, and Abdulrazaq Almorjan

Part III Software Engineering

19 Conflicts Between UX Designers, Front-End and Back-End Software Developers: Good or Bad for Productivity?
  Tea Pavicevic, Dejana Tomasevic, Alessio Bucaioni, and Federico Ciccozzi
20 Generalized EEG Data Acquisition and Processing System
  Vinh D. Le, Chase D. Carthen, Norhaslinda Kamaruddin, Alireza Tavakkoli, Sergiu M. Dascalu, and Frederick C. Harris, Jr.
21 Supporting Technical Adaptation and Implementation of Digital Twins in Manufacturing
  Enxhi Ferko, Alessio Bucaioni, and Moris Behnam
22 Towards Specifying and Evaluating the Trustworthiness of an AI-enabled System
  Mark Arinaitwe and Hassan Reza
23 Description and Consistency Checking of Distributed Algorithms in UML Models Using Composite Structure and State Machine Diagrams
  Yu Manta and Katsumi Wasaki
24 Simulation and Comparison of Different Scenarios of a Workflow Net Using Process Mining
  Felipe Nedopetalski, Franciny Medeiros Barreto, Joslaine Cristina Jeske de Freitas, and Stéphane Julia
25 Making Sense of Failure Logs in an Industrial DevOps Environment
  Muhammad Abbas, Ali Hamayouni, Mahshid H. Moghadam, Mehrdad Saadatmand, and Per E. Strandberg

Part IV Data Science

26 Analysis of News Article Various Countries on a Specific Event Using Semantic Network Analysis
  Hyun Park, Jinie Pak, and Yanggon Kim
27 An Approach to Assist Ophthalmologists in Glaucoma Detection Using Deep Learning
  Tatiane Martins Bistulfi, Marcelli Marques Monteiro, and Juliano de Almeida Monte-Mor
28 Multtestlib: A Parallel Approach to Unit Testing in Python
  Ricardo Ribeiro de Alvarenga, Luiz Alberto Vieira Dias, Adilson Marques da Cunha, and Lineu Fernando Stege Mialaret
29 DEFD: Adapted Decision Tree Ensemble for Financial Fraud Detection
  Chergui Hamza, Abrouk Lylia, Cullot Nadine, and Cabioch Nicolas
30 Prediction of Bike Sharing Activities Using Machine Learning and Data Analytics
  Shushuang Man, Ryan Zhou, Ben Kam, and Wenying Feng

Part V E-Learning

31 ICT: Attendance and Contact Tracing During a Pandemic
  Shawn Zwach and Michael Ham
32 Towards Cloud Teaching and Learning: A COVID-19 Era in South Africa
  Dina Moloja
33 Learning Object as a Mediator in the User/Learner's Zone of Proximal Development
  Parcilene Fernandes de Brito, Douglas Aquino Moreno, Giovanna Biagi Filipakis de Souza, and José Henrique Coelho Brandão
34 Quality Assessment of Open Educational Resources Based on Data Provenance
  Renata Ribeiro dos Santos, Marilde Terezinha Prado Santos, and Ricardo Rodrigues Ciferri
35 Quality Assessment of Open Educational Resources: A Systematic Review
  Renata Ribeiro dos Santos, Marilde Terezinha Prado Santos, and Ricardo Rodrigues Ciferri

Part VI Health

36 Predicting COVID-19 Occurrences from MDL-based Segmented Comorbidities and Logistic Regression
  Ana Patrícia de Sousa, Valéria Cesário Times, and André Araújo
37 Internet of Things Applications for Cold Chain Vaccine Tracking: A Systematic Literature Review
  Alex Fabiano Garcia and Wanderley Lopes de Souza
38 GDPR and FAIR Compliant Decision Support System Design for Triage and Disease Detection
  Alper Karamanlioglu, Elif Tansu Sunar, Cihan Cetin, Gulsum Akca, Hakan Merdanoglu, Osman Tufan Dogan, and Ferda Nur Alpaslan

Part VII Potpourri I

39 Truckfier: A Multiclass Vehicle Detection and Counting Tool for Real-World Highway Scenarios
  Murilo S. Regio, Gabriel V. Souza, Roberto Rosa, Soraia R. Musse, Isabel H. Manssour, and Rafael H. Bordini
40 Explaining Multimodal Image Retrieval Using A Vision and Language Task Model
  Md Imran Sarker, Mariofanna Milanova, and John R. Talburt
41 Machine Vision Inspection of Steel Surface Using Combined Global and Local Features
  Mohammed W. Ashour, M. M. Abdulrazzaq, and Mohammed Siddique
42 A Process to Support Heuristic Evaluation and Tree Testing from a UX Integrated Perspective
  Freddy Paz, Adrian Lecaros, Fiorella Falconi, Alejandro Tapia, Joel Aguirre, and Arturo Moquillaza
43 Description and Verification of Systolic Array Parallel Computation Model in Synchronous Circuit Using LOTOS
  Yuya Chiba and Katsumi Wasaki
44 A Virtual Reality Mining Training Simulator for Proximity Detection
  Erik Marsh, Joshua Dahl, Alireza Kamran Pishhesari, Javad Sattarvand, and Frederick C. Harris Jr.
45 A Performance Analysis of Different MongoDB Consistency Levels
  Caio Lazarini Morceli, Valeria Cesario Times, and Ricardo Rodrigues Ciferri

Part VIII Potpourri II

46 Information Extraction and Ontology Population Using Car Insurance Reports
  Hamid Ahaggach, Lylia Abrouk, and Eric Lebon
47 Description of Restricted Object Reservation System Using Specification and Description Language VDM++
  Aoto Makita and Katsumi Wasaki
48 Applying Scrum in Interdisciplinary Case Study Projects for Literacy in Fluency Analysis
  Matheus Silva Martins Mota, Breslei Max Reis da Fonseca, Gildarcio Sousa Goncalves, Jean Claudio de Souza, Vitor Eduardo Sabadine da Cruz, Odair Oliveira de Sá, Adilson Marques da Cunha, Luiz Alberto Vieira Dias, Lineu Fernando Stege Mialaret, and Johnny Cardoso Marques
49 A Demographic Model to Predict Arrests by Race: An Exploratory Approach
  Alice Nneka Ottah, Yvonne Appiah Dadson, and Kevin Matthe Caramancion
50 An Efficient Approach to Wireless Firmware Update Based on Erasure Correction Coding
  Berk Kivilcim, Daniel Zhou, Zhijie Shi, and Kaleel Mahmood
51 Complex Network Analysis of the US Marine Highway Network
  Natarajan Meghanathan
52 Directed Acyclic Networks and Turn Constraint Paths
  Laxmi Gewali and Sabrina Wallace

Index
Chair Message
Welcome to the 20th International Conference on Information Technology – New Generations (ITNG 2023). Due to the continued global pandemic and the problems associated with traveling and meeting in person, we are running the conference virtually. We hope and trust that in the following year we will be able to meet in person, when attendees feel safer and more comfortable traveling and participating in this event.

ITNG 2023 attracted quality submissions globally. The papers were reviewed for their technical soundness, originality, clarity, and relevance to the conference. The conference enjoyed the expert opinions of over 40 author and non-author scientists who participated in the review process. Each paper was reviewed by at least two independent reviewers. In the end, 52 papers were accepted for presentation and publication in the ITNG 2023 program. The chapters in this book address the most recent advances in areas such as machine learning, big data analytics, cybersecurity, blockchain technology, data mining, e-health, IoT and CPS, software engineering, and social computing. In addition to technical presentations by the authors, the conference features an invited young scholar on Monday.

Several people contributed to the success of this year's conference by organizing technical tracks. Dr. Doina Bein served in the capacity of conference vice chair. We benefited from the professional and timely services of the major track organizers and associate editors, namely Drs. Azita Bahrami, Doina Bein, Alessio Bucaioni, Glauco Carneiro, Luiz Alberto Vieira Dias, Ray Hashemi, Kashif Saleem, Fangyan Shen, and Hossein Zare. Others who were responsible for soliciting, reviewing, and handling the papers submitted to their respective tracks/sessions include Drs. Noha Hazzazi and Poonam Dharam.

The help and support of Springer in preparing the ITNG proceedings are especially appreciated. Many thanks are due to Michael Luby, the Senior Editor and Supervisor of Publications, and Brian Halm, the Production Editor at Springer, for the timely handling of our publication order. We also appreciate the efforts of the Springer Project Coordinator, Praveena John, who spent much time looking very closely at revised articles to make sure they were formatted correctly according to the publisher's guidelines. We also thank Prabhas Kumra for technical assistance in setting up the conference program and Zoom communication for us. Finally, the great efforts of the conference secretary, Ms. Mary Roberts, who dealt with day-to-day conference affairs, including the timely handling of volumes of email, are acknowledged.

I hope that you all enjoy the ITNG 2023 program and find it technically and socially fulfilling.

Shahram Latifi
The ITNG General Chair
ITNG 2023 Reviewers
Abdollahian, Mali
Ahmad, Shafiq
Ardakani, Mali
Bagherzadeh, Nader
Bahrami, Azita
Bein, Doina
Bucaioni, Alessio
Campos, Jorge
Carneiro, Glauco
Chen, Hsing-bung
Chen, Yuwen
Danish, Subhan
Dharam, Poonam
Dias, Luiz Alberto
Fong, Anthony
Gawanmeh, Amjad
Gewali, Laxmi
Hashemi, Ray
Hassan, Saima
Hazzazi, Noha
Imran, Muhammad
Kan, Shaobai
Li, Xiangdong
Liu, Yijun
Magalhães, Ana
Mialaret, Lineu
Mineda Carneiro, Emanuel
Nunes, Eldman
Saleem, Kashif
Sheikh, Farah
Shen, Fangyang
Sousa, Gildarcio
Souto, Thiago
Sun, Weiqing
Suzana, Rita
Syed, Asad
Talburt, John
Teo, Honjie
Yang, Mei
Yuan, Shengli
Zare, Hossein
Zhang, Jun
Part I Machine Learning
1 Loop Closure Detection in Visual SLAM Based on Convolutional Neural Network

Fabiana Naomi Iegawa, Wagner Tanaka Botelho, Tamires dos Santos, Edson Pinheiro Pimentel, and Flavio Shigeo Yamamoto
Abstract
In Robotics, autonomous navigation has been addressed in recent years due to the potential of applications in different areas, such as industrial, commercial, health and entertainment. The capacity to navigate, whether for autonomous vehicles or service robots, is related to the problem of Simultaneous Localization And Mapping (SLAM). Loop closure, in the context of Visual SLAM, uses information from the images to identify previously visited environments, which allows for correcting and updating the map and the robot's localization. This paper presents a system that identifies loop closure using a Convolutional Neural Network (CNN) trained in a Gazebo simulated environment. Based on the concept of transfer learning, the CNN of VGG-16 architecture is retrained with images from a scenario in Gazebo to enhance the accuracy of feature extraction. This approach allows for the reduction of the descriptors' dimension. The features from the images are captured in real-time by the robot's camera, and its control is performed by the Robot Operating System (ROS). Furthermore, loop closure is addressed from image preprocessing and its division into right and left regions to generate the descriptors. Distance thresholds and sequences are defined to enhance performance during image-to-image matching. A virtual office designed in Gazebo was used to evaluate the proposed system. In this scenario, loop closures were identified while the robot navigated through the environment. Therefore, the results showed good accuracy and a few false negative cases.

Keywords
Visual SLAM · Loop closure · Deep learning · Feature extraction · Mobile robot · CNN · Robot vision · Image recognition · Virtual environment · Image descriptors

F. N. Iegawa (✉) · W. T. Botelho · T. dos Santos · E. P. Pimentel
Federal University of ABC (UFABC), Centre of Mathematics, Computation and Cognition (CMCC), Santo André, São Paulo, Brazil
e-mail: [email protected]; [email protected]; [email protected]; [email protected]

F. S. Yamamoto
Startup NTU Software Technology, São Paulo, Brazil
e-mail: [email protected]
1.1 Introduction
In Mobile Robotics, autonomous navigation is widely studied due to its importance in applications such as the exploration of hard-to-reach areas, load transportation in industries, and delivery robots, among others. An autonomous navigation system comprises localization, mapping, tracking, and locomotion tasks. Localization aims to estimate the robot's pose (position and orientation). Mapping is responsible for identifying points of interest in the environment and representing them on a map, while finding the best route to the destination relies on tracking. Finally, locomotion concerns the robot's physical aspects and the detection of objects during movement [1]. SLAM is the problem of Simultaneous Localization And Mapping, which deals with building a map of the environment at the same time the robot calculates its position. The difficulty lies in the interdependence of the two tasks, as localization estimates the position based on the map, and mapping uses the position information to update the map and decrease errors. According to [2], Visual SLAM refers to the type of sensor implemented, as both the position and the 3D representation of the environment are estimated using images. In autonomous navigation, one of the main abilities of the robot should be to minimize errors during mapping and localization.
While exploring unknown environments, sensor inaccuracy generates noise that accumulates over time. Consequently, it interferes with map creation and updates, as well as with precise localization. Therefore, loop closure is used to recognize places the robot has already visited and to correct its position and the map [3]. Loop closure approaches have mainly been implemented with so-called manual techniques for feature extraction, such as Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB). However, driven by the advancement of Convolutional Neural Networks (CNN) in Computer Vision for image classification and recognition, Artificial Neural Networks (ANN) have been applied to enhance loop closure detection [4]. Several state-of-the-art loop closure systems have been proposed in the last few years. Nevertheless, the main challenges discussed by Arshad and Kim [5] are still open and might be solved with Deep Learning. The first challenge regards false positives, also known as perceptual aliasing, when images are similar but from different places. The second is false negatives, when an already visited place is not recognized due to significant differences between the images, such as illumination changes. The third is dynamic environments, where objects constantly change their position. Finally, the fourth challenge is real-time implementation of the algorithm. Li et al. [6] and Camara et al. [7] implemented CNNs to extract image features. However, [6] adopted the ResNet V2 [8] network with no additional training and applied cosine similarity to detect loop closure, increasing performance time. On the other hand, [7] proposed extracting features from two convolutional layers for a geometric comparison approach, achieving high accuracy. Against this background, the main target of this paper is to present a solution for a mobile robot controlled by the Robot Operating System (ROS) to detect loop closure in a simulated scenario in Gazebo. The system is divided into three steps: capture the image, train the CNN, and detect loop closure. The first step creates a dataset with captured images from Gazebo to be used in the second step. Next, a pre-trained CNN, the VGG-16, is retrained to extract features from images based on the concept of transfer learning. Finally, image preprocessing, feature extraction, descriptor generation, and image-to-image matching are presented to detect loop closure. Different ANN architectures, such as CNNs, Autoencoders, and Generative Adversarial Networks, can be applied to images. However, the work presented in this paper uses a CNN because of its good performance in the Computer Vision field. In addition, a CNN applies convolution operations that allow filters on small regions of the image to preserve important information between layers and increase feature extraction accuracy.
The main contribution of this paper is the proposed approach to detecting loop closure by extracting features with a VGG-16 CNN trained on virtual images from Gazebo. Based on transfer learning, the VGG-16 pre-trained on real images is the starting point for training a new CNN with virtual images. Combining images from physical and virtual environments aims to increase loop closure detection accuracy; virtual and real images are combined, for example, in Augmented Reality applications and can be used in other areas. In the loop closure detection literature, simulators are mainly used for system validation. Safin et al. [9] used Gazebo and ROS to create validation datasets for Visual SLAM systems. Thus, this paper presents a loop closure approach based on both real and virtual environments. Another contribution of this paper is the 512-channel dimension of the descriptor tensors, which reduces time and computational processing. Nonetheless, it is important to point out that calculating the distance threshold between images during navigation avoids manual set-up according to the environment, as discussed in [10]. The remainder of this paper is organized as follows. Section 1.2 describes the related works. Section 1.3 presents the development steps of the loop closure detection system with the CNN and image preprocessing. The results are discussed in Sect. 1.4. Lastly, Sect. 1.5 addresses the conclusion and future work.
1.2 Related Works
According to [11], without loop closure, SLAM reduces to odometry, as wheel encoders alone estimate the robot's positions. Also, when loop closure is neglected, the environment is interpreted as an infinite corridor that intersects a visited place, disregarding the space layout and affecting the mapping and tracking tasks. In loop closure detection, the problem is based on image recognition, meaning that relevant features of the images are identified and transformed. These features are saved as descriptors and extracted by probabilistic or Deep Learning approaches, which may be either local or global. Probabilistic approaches, such as SIFT and SURF, are computationally expensive techniques for finding local image descriptors and are not suitable for real-time Visual SLAM [5]. Deep Learning generates the descriptors using ANNs, CNNs and Autoencoders. The local approach filters some parts of the images and requires a region selection step. On the other hand, the global approach uses the whole image to generate the descriptors [12]. Chen et al. [13] proposed to use one CNN convolutional layer to extract features and a deeper layer to identify high activation values in the layer, called salient image regions.
Thus, each image was described as a set of regions, allowing image matching and, therefore, loop closure detection. In another approach, developed by Wang et al. [14], the image was segmented and vectorized before being sent to the Neural Network (NN). Based on the Stacked Denoising Auto-Encoder (SDAE) [15], the authors proposed the Graph-regularization Stacked Denoising Auto-Encoder (G-SDAE), an unsupervised network that uses a graph regularizer to learn abstract geometries between features. Also, a similarity matrix was created to detect the loop closure. A combination of supervised and unsupervised learning methods to detect loop closure was proposed by Memon et al. [16]. To do that, a classifier CNN was used to identify objects in the image, and an Autoencoder network detected new environments. In parallel, the Autoencoder was also retrained with image features from new environments to detect loop closure. In addition, a Super Dictionary was described, with one dictionary containing only key frames and another with all frames, to improve the image-matching process and reduce false positives. DeepSLAM, developed by Li et al. [6], is a Visual SLAM system with a monocular camera based on unsupervised Deep Learning. The authors defined the system with an Autoencoder to map the 3D environment, a Recurrent Convolutional Neural Network (RCNN) that captures camera movement and estimates the robot's position, and a CNN (ResNet [8]) that detects loop closure. Besides, pose graphs are constructed and optimized to integrate the system and minimize accumulated errors. Zhang et al. [12] proposed an approach that considers local and global descriptors in parallel to find candidates for loop closure. The combination of the probabilistic method for local descriptors and the CNN to extract global descriptors increased the performance of the algorithm when compared to nearest-neighbor techniques.
1.3 Proposed System
Figure 1.1 shows the three steps of loop closure detection, defined as (1) Capture the Image, (2) Train the CNN, and (3) Loop Closure Detection. To achieve (3) Loop Closure Detection, steps (1) Capture the Image and (2) Train the CNN create the (a) CNN, based on the pre-trained (b) VGG-16 network. In addition, to increase the accuracy of (c) Feature Extraction and (d) Descriptors Generation, (2) Train the CNN uses images captured by the robot during the (e) Navigation Task in the Gazebo scenario.
1.3.1 Capture the Image
In Fig. 1.1, the robot must navigate to (1) Capture the Image in the open-source 3D simulator Gazebo to create the (f) Dataset. Python scripts are implemented in the (e) Navigation Task, and the Rospy library [17] is used to interface Python with ROS. In ROS, the velocity topic controls the robot's motor nodes, and the camera topic captures images of the environment and saves them in the (f) Dataset.
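The capture script is not reproduced in the paper; the following is a minimal sketch of how such a rospy node might look. The topic names (/front/image_raw for the camera, /cmd_vel for velocity), the forward speed, and the file layout are assumptions, not taken from the text.

import cv2
import rospy
from cv_bridge import CvBridge
from geometry_msgs.msg import Twist
from sensor_msgs.msg import Image

bridge = CvBridge()
count = 0

def save_frame(msg):
    # Convert the ROS image message to OpenCV format and store it in the dataset folder.
    global count
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')
    cv2.imwrite('dataset/img_%05d.png' % count, frame)
    count += 1

rospy.init_node('dataset_capture')
rospy.Subscriber('/front/image_raw', Image, save_frame)     # camera topic (assumed name)
vel_pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)  # velocity topic (assumed name)

rate = rospy.Rate(10)
while not rospy.is_shutdown():
    cmd = Twist()
    cmd.linear.x = 0.2  # drive forward slowly while frames are collected
    vel_pub.publish(cmd)
    rate.sleep()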
1.3.2 Train the CNN
In (2) Train the CNN, the (b) VGG-16 architecture, illustrated in Fig. 1.2, is trained in Google Colab with the (f) Dataset created in Fig. 1.1. The (b) VGG-16 has 13 convolutional and three fully connected layers. Convolutional layers are responsible for extracting information such as edges in the image and applying the activation function, known as the Rectified Linear Unit (ReLU), to limit the output to positive values [20]. During the convolution operation, the filters, known as kernels, are applied to input regions called receptive fields. Max pooling layers decrease the image dimension by selecting the maximum value of a receptive field and are important to reduce computational processing. The classifier layers are fully connected, using the ReLU activation function and Softmax, an exponential function that calculates the network probabilities. Figure 1.2 shows each layer in the pre-trained VGG-16 network and its respective output dimensions. One image captured by the robot, with a dimension of 224 × 224 pixels and three channels, is provided as input, and the output is a tensor of 1000 channels corresponding to the VGG-16 classes. Also, the original weights of the convolutional layers are frozen and can be used in new training. Before training, the third fully connected layer is replaced by another two fully connected layers, one with a 4096-channel output and the other with 256. Therefore, the network is trained, and the output tensor is reduced to 256 channels. The transfer learning concept is applied to the (b) VGG-16 of Fig. 1.1, which is the starting point for a new network, defined as the (a) CNN. In (2) Train the CNN, the (b) VGG-16 is implemented using the PyTorch library and the (f) Dataset. Thus, the (a) CNN is saved in the convnet.pth file created by PyTorch to be used in (3) Loop Closure Detection.
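A minimal PyTorch sketch of this transfer-learning set-up is given below: the convolutional weights are frozen and the last fully connected layer is replaced by two new ones (4096 and 256 channels), matching the description above. Only the file name convnet.pth comes from the text; the optimizer settings are assumptions, and the data loading and training loop are omitted.

import torch
import torch.nn as nn
from torchvision import models

# Load VGG-16 pre-trained on real images and freeze its convolutional layers.
net = models.vgg16(pretrained=True)
for p in net.features.parameters():
    p.requires_grad = False

# Keep the first two fully connected blocks and replace the third FC layer
# with two new layers: 4096 -> 4096 and 4096 -> 256.
net.classifier = nn.Sequential(
    *list(net.classifier.children())[:-1],
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 256),
)

# Optimize only the new (trainable) parameters on the Gazebo dataset.
opt = torch.optim.Adam((p for p in net.parameters() if p.requires_grad), lr=1e-4)
# ... training loop over the (f) Dataset goes here (loss and labels depend on the set-up) ...

torch.save(net, 'convnet.pth')  # network used later in (3) Loop Closure Detection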
1.3.3 Loop Closure Detection
In (3) Loop Closure Detection, shown in Fig. 1.1, the robot navigates in the (g) Scenario, and (h) Preprocessing is applied to the images captured by the camera. Next, the CNN is responsible for (c) Feature Extraction and (d) Descriptors Generation, which are used in (i) Image-to-Image Matching against the (j) Descriptors Database, indicating whether loop closure is detected or not. It is important to point out that the (j) Descriptors Database is formed with images already captured in the (g) Scenario and is updated as the robot moves.
Fig. 1.1 System development steps
Fig. 1.2 VGG-16 architecture [18] designed with [19]
Image and Preprocessing
In Fig. 1.3, the robot captures the (1) Image in real time during navigation. In (2) Preprocessing, the (a) Image Resize step scales the image to 540 × 540 pixels to standardize its size for the CNN input. In addition, the (1) Image is divided into a (b) Left Region and a (c) Right Region. According to [21], this division improves the performance of recognizing a path when it is revisited in the opposite direction. After the division, each region is transformed into a (d) Normalized Tensor using the Transforms module [22] of the Torchvision library [23] to improve CNN performance and reduce image distortion.
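A sketch of this preprocessing with the Torchvision Transforms module follows. The resize to 540 × 540 and the left/right split are taken from the text; resizing each half to the 224 × 224 network input and the ImageNet normalization statistics are assumptions.

from PIL import Image
import torchvision.transforms as T

to_tensor = T.Compose([
    T.Resize((224, 224)),  # assumed: each half is scaled to the CNN input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # assumed ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

def preprocess(path):
    img = Image.open(path).convert('RGB').resize((540, 540))  # (a) Image Resize
    w, h = img.size
    left = img.crop((0, 0, w // 2, h))    # (b) Left Region
    right = img.crop((w // 2, 0, w, h))   # (c) Right Region
    return to_tensor(left), to_tensor(right)  # (d) Normalized Tensors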
Fig. 1.3 Detect loop closure diagram
Fig. 1.4 Features extraction by layer from VGG-16. (a) Original image. (b) Conv1. (c) Conv6. (d) Conv13
Feature Extraction
The detection and extraction of relevant features from the (1) Image happen in Feature Extraction, shown in Figs. 1.1(c) and 1.3(3). The (d) Normalized Tensor is fed into the (e) CNN (convnet.pth), and the output of the fourth fully connected layer is used. Each (e) CNN layer extracts relevant features to be sent to the next layer; therefore, the output of a specific layer can be used to detect loop closure. It is important to note that the output is related to the activations of layers such as pooling, ReLU, and the convolutional filters in the VGG-16 of Fig. 1.2. Figure 1.4 illustrates the (a) Original Image and three convolutional layer outputs, (b) Conv1, (c) Conv6, and (d) Conv13, from the VGG-16 in Fig. 1.2. Understanding how features evolve from layer to layer is not always easy. For that reason, visualization techniques have been developed based on the feature map proposed by Zeiler and Fergus [24]. Feature activations are projected on the map to allow the identification of important parts of the image. For example, the (b) Conv1 output displays the limits and color differences, while the (c) Conv6 output shows the edges of a possible table. As the layer depth increases, its complexity also varies, leading to a feature map representation as in (d) Conv13, where the layout and color scale indicate the possibility that the object is a table. In this case, each color represents an important feature that can be an edge or part of an object. Therefore, the objective is to preserve the sequence of information that describes the image content.

Descriptors Generation
Descriptors Generation, shown in Figs. 1.1(d) and 1.3(4), is always performed by concatenating the tensors obtained after (3) Feature Extraction, from left to right. The final tensor is called the (f) Descriptors and has dimension 1 × 1 × 512, i.e., a tensor with 512 channels. This step combines the features from the left and right regions in one sequence, which increases the amount of extracted information and reduces execution time in loop closure detection.
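A sketch of this step, given the network saved earlier, might look as follows; batch and shape handling are simplified (the paper reports the descriptor as a 1 × 1 × 512 tensor).

import torch

net = torch.load('convnet.pth')
net.eval()

def describe(left_t, right_t):
    # Run each half through the trained CNN and concatenate the two
    # 256-channel outputs, left to right, into one 512-channel descriptor.
    with torch.no_grad():
        f_left = net(left_t.unsqueeze(0))    # shape 1 x 256
        f_right = net(right_t.unsqueeze(0))  # shape 1 x 256
    return torch.cat([f_left, f_right], dim=1)  # shape 1 x 512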
Image-to-Image Matching
Before (5) Image-to-Image Matching in Fig. 1.3, the (g) Threshold is calculated through a distance test. The (g) Threshold represents the maximum difference between the descriptors of two images for them to be considered similar. An environment is randomly selected in the (h) Query Database for Matching to register the base image of the test. The next step is to capture multiple images of the same environment with different rotations and lighting. The (i) Euclidean Distance is calculated between the base image and the others, and, in the end, the average of the distances is set as the (g) Threshold.

The main target of (5) Image-to-Image Matching, shown in Figs. 1.1(i) and 1.3, is to identify cases of loop closure while avoiding false positives. Initially, with the (f) Descriptors of the (1) Image, the (h) Query Database for Matching is consulted, and all stored descriptors are considered candidates for loop closure. Then, the (i) Euclidean Distance between the (f) Descriptors of the query image and those consulted in the (h) Query Database for Matching is calculated. The (i) Euclidean Distance represents the difference between images as a tensor, and its maximum value is used to create the (j) Three Distances Sequences; only the maximum distances are grouped into three sequences, reducing the processing time. With the (j) Three Distances Sequences in Fig. 1.3, each (i) Euclidean Distance of the sequence is compared with the (g) Threshold. In cases where all values of the (j) Three Distances Sequences are less than the (g) Threshold, the result is (k) Loop Closure Detected. If the condition is not satisfied, the image descriptor is added to the (l) Descriptors Database for further matching. It is important to note that the (j) Three Distances Sequences ensure that a sequence is being observed, eliminating false positives where only one or two of the (i) Euclidean Distances fall below the (g) Threshold.
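The matching rule can be sketched as below. The paper does not fully specify how the three distance sequences are formed, so the consecutive-frame reading used here is our interpretation; only the threshold value 1.8 (Sect. 1.4) comes from the text.

import torch

THRESHOLD = 1.8  # distance threshold used in the simulation (Sect. 1.4)

def is_loop_closure(query, database, recent):
    # Distance from the query descriptor to every stored descriptor.
    dists = [torch.dist(query, d).item() for d in database]
    best = min(dists) if dists else float('inf')
    recent.append(best)
    # Declare a loop closure only when the last three best distances
    # all fall below the threshold, which filters out isolated matches.
    if len(recent) >= 3 and all(d < THRESHOLD for d in recent[-3:]):
        return True  # (k) Loop Closure Detected
    database.append(query)  # otherwise store the descriptor for later matching
    return False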
1.4 Simulation Results

This section presents the results found in the Gazebo scenario to validate the loop closure detection proposed in Sect. 1.3; the implementation is available at [25]. The Gazebo simulation was performed on an Intel Core i7-4510U 2.00 GHz notebook with 8 GB of RAM, and (2) Train the CNN in Fig. 1.1 was executed in Google Colab with 12 GB of RAM. In addition, 30% of the dataset was used for testing, the code was implemented in Python, and ROS controls the Jackal robot in Gazebo and captures its monocular camera images.

1.4.1 Scenario

Figure 1.5a shows the scenario implemented to validate loop closure detection. It is an indoor office environment and is available in [26]. Objects such as chairs, cabinets, and armchairs were added for a more realistic environment. As this paper does not focus on navigation, the (R) Robot followed a pre-defined path, shown in white. Its start and end positions are represented by (S) and (E), respectively. A Python script was executed to capture scenario images at intervals of four seconds and save them in the (j) Descriptors Database of Fig. 1.1 for (3) Loop Closure Detection. Another script recorded the robot's positions according to the ROS Odometry and Gazebo Robot State topics. As shown in Fig. 1.3, images were captured in real time by the robot and compared to the ones available in the (l) Descriptors Database to calculate the Euclidean Distance. In the end, 416 images were captured from the scenario in Fig. 1.5a, and the distance threshold between two similar images was set to 1.8 based on an analysis of the simulation data.

Fig. 1.5 (a) Office scenario [26] used in the simulation and (b) sequence of images captured by the robot during the loop closure detection [25]

In Fig. 1.5a, whenever the white trajectory crosses itself, it indicates a loop closure. In (b), images (1)–(21) were captured by the (R) Robot and fed into the (e) CNN of Fig. 1.3 to extract features and generate descriptors. Loop closure was then detected after image-to-image matching. It is important to highlight that (1), (5), (9), (12), and (17) in (a) are part of the path followed by the (R) Robot and correspond to the captured images (1), (5), (9), (12), and (17) in (b). Also, in (b), images (1) and (21) are similar, which means the (R) Robot has been at the same point twice and has detected loop closure.

1.4.2 Performance Metrics

To evaluate the system performance, all images were labeled as ground truth, indicating whether each was a loop closure or not. This also helped to identify false negatives and false positives. Based on these images, the results presented in Fig. 1.5 were analyzed according to the Accuracy, F1-Score, and Area Under the Curve (AUC) performance metrics. The Accuracy achieved was 96.8%, indicating system performance in terms of the number of times loop closure was correctly detected. The F1-Score was 58%; it measures the reliability of the Accuracy, as it calculates the harmonic mean between precision and sensitivity. Precision verifies how many positive results were indeed positive, and sensitivity shows how many of all positives were correctly labeled. Thus, the F1-Score allows interpreting both metrics together and may indicate possible system failures when it is close to zero. Another analyzed metric was the AUC of 83.5%, as illustrated in Fig. 1.6. The AUC is the area under the Receiver Operating Characteristic (ROC) curve and ranges from 0 to 1. This curve plots sensitivity against the false-positive rate to indicate classification performance, and values closer to 1 indicate better predictions.

Fig. 1.6 ROC curve for loop closure detection
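Given the hand-labeled ground truth, these metrics can be computed with scikit-learn as sketched below; the label and score arrays are illustrative placeholders, not the experiment's data.

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]    # ground-truth loop-closure labels (illustrative)
y_pred = [0, 0, 1, 0, 0, 1]    # detector decisions (illustrative)
y_score = [-2.1, -1.9, -0.4, -1.2, -2.5, -0.3]  # negated distances: higher = more similar

print('Accuracy:', accuracy_score(y_true, y_pred))
print('F1-Score:', f1_score(y_true, y_pred))
print('AUC:', roc_auc_score(y_true, y_score))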
1.5 Conclusions and Future Works
Loop closure is important for developing Visual SLAM systems because it allows the robot to minimize errors in localization and mapping caused by accumulated sensor inaccuracy, and it can also be used for path-planning tasks. Furthermore, Deep Learning has improved the robustness of feature extraction. The main objective of this paper was to develop a loop closure detection system based on a VGG-16 network, trained and implemented in a virtual environment. The simulator images were used for feature extraction training to increase the system's accuracy. Besides, developing solutions related to Visual SLAM that can be validated in simulators is key for high-cost projects. One of the main contributions was the CNN trained with virtual images, based on transfer learning from real images, to extract features; it was implemented and validated in the Gazebo simulator. Another contribution was the 512-channel dimension of the descriptor tensors used in image-to-image matching. The results showed high accuracy in loop detection but with some false positive cases. As future work, the system proposed in this paper may also be validated in outdoor environments, such as on the KITTI dataset [27], which includes sequences with loop closures to be detected. In addition, it should be validated on a physical robot, and the results compared with those found in the simulation environment.

Acknowledgments This work was supported by a Technical Training Fellowship (TT-3/Process Number: 2019/12080-5) funded by the São Paulo Research Foundation (FAPESP)/PIPE Grant Program from July/2019 to December/2020 offered by the Startup NTU Software Technology (Process Number: 2018/04306-0).
References

1. Y.D.V. Yasuda, L.E.G. Martins, F.A.M. Cappabianco, Autonomous visual navigation for mobile robots: a systematic literature review. ACM Comput. Surv. 53, 1–34 (2020)
2. M.R.U. Saputra, A. Markham, N. Trigoni, Visual slam and structure from motion in dynamic environments: a survey. ACM Comput. Surv. 51, 1–36 (2018)
3. C. Chen, B. Wang, C.X. Lu, N. Trigoni, A. Markham, A survey on deep learning for localization and mapping: towards the age of spatial machine intelligence. CoRR, abs/2006.12567 (2020)
4. J.Y. Ng, F. Yang, L.S. Davis, Exploiting local features from deep networks for image retrieval, in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2015)
5. S. Arshad, G. Kim, Role of deep learning in loop closure detection for visual and lidar slam: a survey. Sensors 21, 1243 (2021)
6. R. Li, S. Wang, D. Gu, DeepSLAM: a robust monocular slam system with unsupervised deep learning. IEEE Trans. Ind. Electron. 68, 3577–3587 (2021)
7. L.G. Camara, C. Gäbert, L. Přeučil, Highly robust visual place recognition through spatial matching of CNN features, in IEEE International Conference on Robotics and Automation (ICRA) (2020)
8. C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI Press, Palo Alto, 2017)
9. R. Safin, R. Lavrenov, E.A. Martínez-García, Evaluation of visual slam methods in usar applications using ros/gazebo simulation, in Proceedings of 15th International Conference on Electromechanics and Robotics "Zavalishin's Readings", ed. by A. Ronzhin, V. Shishlakov (Springer Singapore, 2021)
10. A. Mukherjee, S. Chakraborty, S.K. Saha, Detection of loop closure in slam: a deconvnet based approach. Appl. Soft Comput. 80, 650–656 (2019)
11. C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, J.J. Leonard, Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Trans. Robot. 32, 1309–1332 (2016)
12. P. Zhang, C. Zhang, B. Liu, Y. Wu, Leveraging local and global descriptors in parallel to search correspondences for visual localization. Pattern Recognit. 122, 108344 (2022)
13. Z. Chen, F. Maffra, I. Sa, M. Chli, Only look once, mining distinctive landmarks from convnet for visual place recognition, in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2017)
14. Z. Wang, Z. Peng, Y. Guan, L. Wu, Manifold regularization graph structure auto-encoder to detect loop closure for visual slam. IEEE Access 7, 59524–59538 (2019)
15. P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P. Manzagol, Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)
16. A.R. Memon, H. Wang, A. Hussain, Loop closure detection using supervised and unsupervised deep neural networks for monocular slam systems. Robot. Auton. Syst. 126, 103470 (2020)
17. Open Robotics, rospy Package Summary. http://wiki.ros.org/rospy. Accessed 5 June 2022
18. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556 (2015)
19. H. Iqbal, Harisiqbal88/plotneuralnet v1.0.0 (2018). http://bit.ly/3GfGiwa
20. S. Albawi, T.A. Mohammed, S. Al-Zawi, Understanding of a convolutional neural network, in International Conference on Engineering and Technology (ICET) (2017)
21. S. Garg, N. Suenderhauf, M. Milford, Don't look back: robustifying place categorization for viewpoint- and condition-invariant place recognition, in IEEE International Conference on Robotics and Automation (ICRA) (IEEE Press, Piscataway, 2018)
22. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An Imperative Style, High-Performance Deep Learning Library (Curran Associates, Red Hook, 2019)
23. PyTorch, Torchvision (2017), http://bit.ly/3EvfGWH. Accessed 22 July 2022
24. M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in Computer Vision – ECCV, ed. by D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Springer, Berlin, 2014)
25. F.N. Iegawa, W.T. Botelho, Loop closure (2022), http://bit.ly/3UJtqlQ. Accessed 27 Oct 2022
26. Clearpath Robotics, Clearpath additional simulation worlds (2022), http://bit.ly/3gmpicL. Accessed 28 Aug 2022
27. A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: the kitti dataset. Int. J. Robot. Res. 32, 1231–1237 (2013)
2 Getting Local and Personal: Toward Building a Predictive Model for COVID in Three United States Cities

April Edwards, Leigh Metcalf, William A. Casey, Shirshendu Chatterjee, Heeralal Janwa, and Ernest Battifarano
Abstract
The COVID-19 pandemic was lived in real-time on social media. In the current project, we use machine learning to explore the relationship between COVID-19 cases and social media activity on Twitter. We were particularly interested in determining if Twitter activity can be used to predict COVID-19 surges. We also were interested in exploring features of social media, such as replies, to determine their promise for understanding the views of individual users. With the prevalence of mis/disinformation on social media, it is critical to develop a deeper and richer understanding of the relationship between social media and real-world events in order to detect and prevent future influence operations. In the current work, we explore the relationship between COVID-19 cases and social media activity (on Twitter) in three major United States cities with different geographical and political landscapes. We find that Twitter activity resulted in statistically significant correlations using the Granger causality test, with a lag of one week in all three cities. Similarly, the use of replies, which appear more likely to be generated by individual users, not bots or public relations operations, was also strongly correlated with the number of COVID-19 cases using the Granger causality test. Furthermore, we were able to build promising predictive models for the number of future COVID-19 cases using correlation data to select features for input to our models. In contrast, significant correlations were not identified when comparing the number of COVID-19 cases with mainstream media sources or with a sample of all US COVID-related tweets. We conclude that, even for an international event such as COVID-19, social media tracks closely with local conditions. We also suggest that replies can be a valuable feature within a machine learning task that is attempting to gauge the reactions of individual users.

Keywords
COVID-19 pandemic · Feature selection · Granger causality · Machine learning features · Natural language processing · Pearson correlation · Predictive modeling · Regression analysis · Social media mining · Twitter replies

A. Edwards (✉) · W. A. Casey
United States Naval Academy, Cyber Science Department, Annapolis, MD, USA
e-mail: [email protected]

L. Metcalf
Carnegie-Mellon University, Pittsburgh, PA, USA

S. Chatterjee
Department of Mathematics, City University of New York, New York, NY, USA

H. Janwa
Department of Mathematics, University of Puerto Rico, Rio Piedras, Puerto Rico

E. Battifarano
Finance Department, NYU School of Professional Studies, New York, NY, USA
2.1 Introduction
Both individuals and organizations have been shown to use social media to manipulate public opinion in an attempt to influence political outcomes [1–3]. Furthermore, Twitter has been used as a predictor for real-world events, such as vulnerability exploits [4]. In this article, we explore the relationship between social media activity on Twitter and a global event, the COVID-19 pandemic. In particular, we are interested in understanding how social media is used and consumed, because this can greatly impact our understanding of how information is shared, which can lead to an improved understanding of how we combat the spread of misinformation and improve our ability to spread accurate information organically [5].

COVID-19 is a global pandemic that is experienced locally and regionally as well as globally. Rates of infection vary by country, state, region, and county. In the current study, we are interested in how social media correlates with outbreaks of COVID-19. In particular, we are interested in comparing the volume of COVID-19 social media posts to determine if increased social media activity can predict a potential outbreak in a particular region. Conversely, we are also interested in knowing if social media activity in a region increases in response to a local outbreak. We first compare social media activity for three US cities, Miami, FL; Las Vegas, NV; and Seattle, WA, to determine correlations between social media activity and COVID-19 cases. Once features are identified, we explore their usefulness for predicting the number of cases in the near term.
2.2 Data Sources
To determine the relationships between social media activity and COVID-19 cases, we needed multiple data sets. Data on COVID-19 related tweets with location data was available from [6]. Huang et al. collected and posted COVID-related tweet IDs daily, beginning on Feb 6, 2020 [6]. The collection is regularly updated with additional data (as of the time of writing, data was available through Sept 30, 2022). Huang et al. select all tweets that contain one or more of the following terms in the tweet text or hashtags: coronavirus, wuhan, 2019ncov, sars, mers, 2019-ncov, wuflu, COVID-19, COVID19, COVID, covid-19, covid19, covid, SARS2, and SARSCOV19. In addition to the tweet ID for each post, the data contains the date, keywords related to COVID-19, and the inferred geolocation, when available (country, state, and city). We downloaded COVID-19 case data from USA Facts [7], which provides information on daily COVID-19 cases, hospitalizations, and deaths as reported by local public health agencies. Data is available for the entire US and also disaggregated by state (including US territories), county, and city. Mainstream news data was retrieved from the GDELT project [8], which consolidates data from broadcast, print, and online news sources. For the purposes of this project, we used a compilation of URLs and brief snippets of worldwide English-language news coverage mentioning COVID-19 from GDELT [8]. The normalized data of interest appears in Fig. 2.1. At first glance there appear to be commonalities in graph shape, but these commonalities are not consistent.
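The tweet-selection rule of Huang et al. described above can be reproduced with a simple keyword filter; a sketch follows (the function and variable names are ours, and the exact matching semantics are assumed).

KEYWORDS = {
    'coronavirus', 'wuhan', '2019ncov', 'sars', 'mers', '2019-ncov',
    'wuflu', 'covid-19', 'covid19', 'covid', 'sars2', 'sarscov19',
}

def is_covid_tweet(text, hashtags):
    # Case-insensitive match against tweet text and hashtags, per [6].
    blob = (text + ' ' + ' '.join(hashtags)).lower()
    return any(k in blob for k in KEYWORDS)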
Fig. 2.1 Graphical depiction of fluctuations in data by week. (a) Cases, Social Media, and News for Las Vegas. (b) Cases, Social Media, and News for Miami. (c) Cases, Social Media, and News for Seattle
We therefore undertook a systematic approach toward understanding fluctuations in the data.
2.3
Correlations for Select US Cities
This section describes the methodology and results from our correlation studies.
2.3.1
Methodology
For purposes of analysis, we collated all data (news stories, social media activity, and COVID-19 cases) by week for three selected US cities in different regions and with different demographic and political profiles: Las Vegas, NV; Miami, FL; and Seattle, WA. These cities were selected because they are in different regions of the country and appeared in the top 15 for the number of Twitter posts related to COVID-19 over the period of interest (Table 2.1). We deliberately chose cities near the bottom of the top 15 under the assumption that data from the largest cities would not necessarily generalize, as their case averages are substantially higher than other cities on the list. Weekly data was used because public health agencies vary in their reporting habits, with weekend data often reported on the following Monday or Tuesday. We also assume that both mainstream news (GDELT) and social media (Twitter) activity may vary based on the day of the week [9, 10]. For our study, we used data from the week of Jan 19, 2020 (the first US cases were reported in January 2020) to Jan 16, 2022 (the end date for the GDELT data). This is 105 weeks of data (just over 2 years). As noted above, the three selected cities also have different demographic profiles (Table 2.2). Miami is by far the most diverse in terms of race/ethnicity, with only 11.5% of the population identifying as non-Hispanic white. Seattle has over
Table 2.1 Top 15 cities by number of COVID-19 tweets as of Jan 2022
City, State      | Region       | Avg daily COVID-19 tweets
Los Angeles CA   | West         | 5404
New York NY      | Northeast    | 5325
Washington DC    | Mid-Atlantic | 5301
Chicago IL       | Midwest      | 3653
Houston TX       | South        | 2471
Atlanta GA       | South        | 2298
San Francisco CA | West         | 1793
Dallas TX        | South        | 1749
Boston MA        | New England  | 1745
Seattle WA       | Northwest    | 1741
Philadelphia PA  | Mid-Atlantic | 1556
Austin TX        | South        | 1475
San Diego CA     | West         | 1473
Miami FL         | South        | 1467
Las Vegas NV     | West         | 1398
double the percentage of adults 25 or older with a bachelor's or advanced degree in comparison with the other two cities. Las Vegas has more families with children (population age 18 or younger is 23.7%) (demographic data from the US Census Bureau [11]). Voting data were available at the county level and, like most urban centers, all three cities favored the Democratic candidate (Biden) in the 2020 US presidential election over the Republican candidate (Trump). Seattle voted heavily Democratic, while Las Vegas and Miami were more balanced politically (Table 2.3) [12]. For this project, we rehydrated all the tweets for Miami, Seattle, and Las Vegas, as well as a random sample of 100,000 tweets per month from over 137 million available US-based tweets. Some of the tweets could not be rehydrated because they had been removed, and some contained only partial geolocation data (country and state, or just country) or none at all. Table 2.4 shows the number of available tweets by location. Of those rehydrated, between 58 and 63 percent were retweets, and between 10 and 16 percent were replies. We segregated the Twitter data into three data sets for each sample: All Tweets, No Retweets (i.e., all tweets with retweets removed), and Replies Only. The Replies Only collection was of interest because it appeared that, while much of the Twitter data that was retrieved was from "corporatized" sources (i.e., news agencies, public health/governmental agencies, etc.), the replies appeared to be from individual users who were reacting to information about the pandemic. For purposes of comparing social media with mainstream media, we retrieved news data from the GDELT project that contained references to our target cities between January 1, 2020 and January 16, 2022. We retrieved 68,307 instances for Las Vegas; 105,995 for Miami; and 70,801 for Seattle. Seattle recorded its first case in January 2020, and by April 17, 2022, 388,271 total cases had been reported and recorded in the USA Facts dataset. Las Vegas' first case was reported in early March 2020, and it had 537,776 cases as of April 17, 2022. Miami reported its first case on March 12, 2020, and over 1.2 million cases had been reported as of June 2022. Reporting for Miami also switched to weekly in mid-2021, and in one case the case count for Miami was altered due to the reporting of a negative number of cases (on June 11, 2021). Our motivating task was to predict the number of cases based on social media and/or news data. To identify features of interest, we first conducted a statistical analysis to determine features that appeared to be promising for use in the learning algorithms. As this is a predictive task with numeric output, we focused our machine learning experiments on linear, multilinear, and polynomial regression. Statistical analysis was conducted in Python3 using the pearsonr function in scipy [13], and the grangercausalitytests
Table 2.2 Demographic profile for selected cities as of July 1, 2021
City, State  | Total population | Age 18–64 | White (not Hispanic) | Black | Hispanic | Bachelors or higher
Las Vegas NV | 646,790          | 61.4%     | 42.9%                | 12.1% | 33.2%    | 25.2%
Miami FL     | 439,890          | 66.2%     | 11.5%                | 16.0% | 72.5%    | 31.5%
Seattle WA   | 733,919          | 73.0%     | 62.6%                | 7.1%  | 7.1%     | 65.0%

Table 2.3 Political profile for selected cities
County               | %Republican (Trump) | %Democrat (Biden)
Clark County NV      | 44.3%               | 53.7%
Miami-Dade County FL | 46.0%               | 53.3%
King County WA       | 22.2%               | 75.0%

Table 2.4 COVID-19 Twitter data
Dataset   | Num tweets | Num rehydrated | Num no retweets | Num replies only
Miami     | 1,010,235  | 733,743        | 295,469         | 77,100
Seattle   | 1,197,313  | 904,401        | 338,047         | 146,217
Las Vegas | 962,992    | 663,309        | 277,245         | 99,848
All US    | 2,700,000  | 2,027,919      | 790,502         | 308,278

Table 2.5 Pearson correlation coefficient, with lag
function in statsmodels [14]. Regression function calculations relied on the statsmodels.api library, with preprocessing using the sklearn PolynomialFeatures function [15, 16]. We used pandas for efficient data storage and retrieval [17].
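To make the lag mechanics concrete, the sketch below shows how such a lagged correlation can be computed with pandas and scipy. It is a minimal illustration rather than the authors' code; the series names, the weekly alignment, and the maximum lag of 4 are assumptions.

```python
import pandas as pd
from scipy.stats import pearsonr

def lagged_pearson(tweets_weekly: pd.Series, cases_weekly: pd.Series, max_lag: int = 4):
    """Correlate weekly tweet volume with case counts at lags 0..max_lag."""
    results = {}
    for lag in range(max_lag + 1):
        # A lag of k means the tweet data precedes the case data by k weeks.
        shifted = tweets_weekly.shift(lag)
        paired = pd.concat([shifted, cases_weekly], axis=1).dropna()
        r, p = pearsonr(paired.iloc[:, 0], paired.iloc[:, 1])
        results[lag] = (r, p)
    return results
```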
2.3.2
Results
2.3.2.1 Pearson Correlation Coefficient Table 2.5 shows the results when social media and news data were correlated with COVID-19 cases for each of the three cities. The lag column was introduced to incorporate temporal components. A lag of 0 indicates that the case data and social media/news data are from the same week; a lag of 1 indicates that the tweet data preceded the case data by one week; 2 is tweet data two weeks in advance, and so on. Pearson coefficients between 0.5 and 0.7 indicate moderate correlation, and coefficients above 0.7 indicate high correlation. Cases with moderate or high correlation are indicated in red in Table 2.5. From the data, RepliesOnly with a lag of one or two weeks appears to be the most promising feature for prediction. We also see a peak in the correlation coefficient at a one-week lag time, with correlation diminishing each week afterward. Interestingly, the GDELT data had no correlation with the number of cases, and none of the cities had a strong correlation between the case counts and the Twitter data from the all-US sample. 2.3.2.2 Granger Causality Granger causality [18] is used to determine if one time series data trend can forecast another. For this study, we
use Granger causality to determine if the number of cases can be forecast from the social media or mainstream news data. We used the grangercausalitytests function of the statsmodels [14] library in Python to determine if there is a Granger causal relationship that may be used to predict the number of cases. The statsmodels documentation states [19]: "The Null hypothesis for grangercausalitytests is that the time series in the second column, x2, does NOT Granger cause the time series in the first column, x1. Granger causality means that past values of x2 have a statistically significant effect on the current value of x1, taking past values of x1 into account as regressors. We reject the null hypothesis that x2 does not Granger cause x1 if the p-values associated with the hypothesis tests corresponding to different lags are small (below a desired size of the test)." The grangercausalitytests method accepts a maximum number of lags as an input parameter (along with the predictor, x2, and the resultant, x1, data to be compared) and returns a p-value for each lag. The p-value measures the likelihood of
Table 2.6 Granger causality p-value with lag
the observed test statistic assuming that the null hypothesis is true. So, in our case, a small p-value allows us to reject the null hypothesis and provides evidence that the ability of x2 to predict x1 is statistically significant. Table 2.6 shows the p-values for each input type and lag. Statistically significant results at the 90% level or above are shown in red. We conclude from these data that there are strong relationships between local Twitter activity and COVID-19 cases for these three cities. As with the Pearson correlations, the strongest predictor is at the individual (RepliesOnly) level, although it is clear that original tweets also play a role, at least for Seattle and Las Vegas. What is interesting is that retweets (which are a measure of engagement) do not play a strong role, with the exception of Las Vegas. Once again, mainstream news sources have no Granger causal relationship with the COVID-19 case counts. While aspects of the All US Sample data do show Granger causal relationships with case counts, there is no commonality across cities (data not shown due to space limitations). In analyzing these data, a question arises regarding the impact of cases on news/social media activity. To test the hypothesis that case counts may be Granger-causally related to mainstream news, we reversed the data inputs to the Granger causal test functions (i.e., used case counts to predict news stories and social media activity). These experiments showed an even stronger Granger causal relationship between social media and COVID-19 cases (which makes sense, as both organizations and individuals respond to local conditions). Unsurprisingly, the All US Sample did not show a pattern of Granger causal relationships. Surprisingly, however, mainstream news sources for individual cities also did not
demonstrate a Granger causal relationship. We believe this occurred because news sources, even those originating in the selected cities, tend to focus on national and international news. Perhaps the phrase "all news is local" is, in fact, more true for social media than mainstream news, even in a global pandemic!
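The following sketch shows how the grangercausalitytests call described above can be wrapped for this kind of analysis. It is an illustrative reconstruction rather than the authors' implementation; the column ordering follows the statsmodels convention quoted above, and reporting the ssr_ftest p-value is an assumption.

```python
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

def granger_pvalues(predictor: pd.Series, resultant: pd.Series, maxlag: int = 4):
    """Test whether `predictor` (x2) Granger-causes `resultant` (x1)."""
    # grangercausalitytests expects a 2-column array with x1 first, x2 second.
    data = pd.concat([resultant, predictor], axis=1).dropna()
    res = grangercausalitytests(data, maxlag=maxlag, verbose=False)
    # One p-value per lag; ssr_ftest is one of the tests reported for each lag.
    return {lag: out[0]["ssr_ftest"][1] for lag, out in res.items()}

# Reversing the inputs tests the opposite direction, e.g. whether case
# counts Granger-cause social media activity:
# granger_pvalues(cases_weekly, replies_weekly)
```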
2.3.2.3 Predictive Model A particular item of interest was the usefulness of the correlation results for determining the best features to use to predict the number of cases based on Twitter activity at a local level. In the discussion below we restrict our attention to Las Vegas, because its correlation and Granger causality results were exceptionally promising; the results were similar for all cities. Figures 2.2, 2.3, and 2.4 show a typical set of results for multilinear and polynomial regression using different features. In all cases we trained on the first 80 weeks of available data and tested using the subsequent five weeks (roughly trying to predict a month in advance). The results did not change dramatically as long as we used at least 60 weeks of training data (always predicting 5 weeks). Figure 2.2 shows the results of the multilinear model using Replies Only, No Retweets, and All Tweets as our predictors and a lag of 1, as these were the best values for our correlations in Table 2.5. In Fig. 2.3 we approximate the model based on the Granger causality in Table 2.6 by building a multilinear regression model using four weeks of prior data for both cases and Replies Only. In Fig. 2.4 we reverted to using Replies Only, No Retweets, and All Tweets as our predictive variables in a polynomial regression model with a lag of 1 and a degree of 2. The polynomial regression was by far the most promising among the models tested. Increasing the degree did not improve the predictive outcomes.
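A minimal sketch of the polynomial regression setup described above, using the statsmodels and sklearn functions named in Sect. 2.3.1. The array layout and the 80/5-week split follow the text, while the helper name and variable names are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures

# X: weekly [RepliesOnly, NoRetweets, AllTweets] counts, already shifted by a
# one-week lag; y: weekly case counts (hypothetical array names).
def fit_poly(X: np.ndarray, y: np.ndarray, train_weeks: int = 80, degree: int = 2):
    X_poly = PolynomialFeatures(degree=degree).fit_transform(X)  # includes a bias column
    model = sm.OLS(y[:train_weeks], X_poly[:train_weeks]).fit()
    y_pred = model.predict(X_poly[train_weeks:train_weeks + 5])  # next five weeks
    return model, y_pred
```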
2.4
Conclusions
In [20], the authors identified 81 articles that analyzed social media communication surrounding the COVID-19 pandemic. The authors concluded that there was a lack of machine learning applications that use social media data for prediction of cases during the COVID-19 pandemic. Tsao et al. also concluded that there was little evidence that social media data was used for real-time surveillance. In this article we fill a crucial gap in our collective understanding of Twitter activity in the context of a global event, such as COVID-19. Specifically, we show that responses from individual users in a geographic area are strongly correlated with local case counts for COVID-19. We also show that Twitter was a better source for understanding the impact of COVID-19 on a community than mainstream news data. These are novel findings that identify potential new features, such as Replies Only
Fig. 2.2 Multilinear regression with replies only, no retweets, and all tweets for Las Vegas. Training model (top) vs. test results (bottom)
Fig. 2.3 Multilinear regression mimicking Granger causality for Las Vegas. Training model (top) vs. test results (bottom)
Fig. 2.4 Polynomial regression with degree 2 and replies only, no retweets, and all tweets for Las Vegas. Training model (top) vs. test results (bottom)
that should be explored when conducting machine learning research using social media data. Use of these features in multilinear and polynomial regression models shows promise for predicting COVID-19 case counts for a month in advance. With the prevalence of mis/disinformation on social media [5], it is critical to develop a deeper and richer understanding of the relationship between social media and real-world events in order to detect and prevent future influence operations. This project contributes to our understanding of how social media activity within a geographic region tracks with real-world events and identifies both novel feature sets and new approaches to applying machine learning to these tasks.
Acknowledgements Janwa's research was supported in part by the NASA grants 80NSSC21M0156 and 80NSSC22M0248. Edwards and Casey were funded in part by the Office of Naval Research. The views expressed are those of the authors and do not reflect the official policy or position of the United States Naval Academy, United States Navy, United States Marine Corps, the Department of Defense or the United States Government.
References 1. C. Cadwalladr, E. Graham-Harrison, Revealed: 50 million Facebook profiles harvested for Cambridge Analytica in major data breach. The Guardian 17, 22 (2018) 2. A. Dandekar, V. Narawade, Twitter sentiment analysis of public opinion on COVID-19 vaccines, in Computer Vision and Robotics (Springer, 2022), pp. 131–139 3. J. Isaak, M.J. Hanna, User data privacy: Facebook, Cambridge Analytica, and privacy protection. Computer 51(8), 56–59 (2018) 4. H. Chen, R. Liu, N. Park, V.S. Subrahmanian, Using Twitter to predict when vulnerabilities will be exploited, in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD'19 (Association for Computing Machinery, New York, 2019), pp. 3143–3152 5. R. Chittari, M.S. Nistor, D. Bein, S. Pickl, A. Verma, Classifying sincerity using machine learning, in ITNG 2022 19th International Conference on Information Technology-New Generations. Advances in Intelligent Systems and Computing, ed. by S. Latifi, vol. 1421 (Springer, Cham, 2022). https://doi.org/10.1007/978-3-030-97652-1_31 6. X. Huang, A. Jamison, D. Broniatowski, S. Quinn, M. Dredze, Coronavirus Twitter Data: A collection of COVID19 tweets with automated annotations, March 2020. http://twitterdata.covid19dataresources.org/index
7. USAFACTS, US COVID-19 cases and deaths by state. https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/ 8. The GDELT Project Blog, Now live updating & expanded: A new dataset for exploring the coronavirus narrative in global online news 9. E. Avraam, A. Veglis, C. Dimoulas, News article consumption habits of Greek internet users, in 6th Annual International Conference on Communication and Management (ICCM2021), Athens, Greece, August 2021, pp. 1–5 10. C. Singh, What Is the Best Time to Post on Twitter in 2022? (SocialPilot, 2022) 11. U.S. Census Bureau, Quickfacts. https://www.census.gov/quickfacts/fact/table/US/PST045221, 2021 12. Wikipedia, 2020 United States presidential election. https://en.wikipedia.org/wiki/2020_United_States_presidential_election, 2020 13. P. Virtanen, R. Gommers, T.E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S.J. van der Walt, M. Brett, J. Wilson, K.J. Millman, N. Mayorov, A.R.J. Nelson, E. Jones, R. Kern, E. Larson, C.J. Carey, I. Polat, Y. Feng, E.W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E.A. Quintero, C.R. Harris, A.M. Archibald, A.H. Ribeiro, F. Pedregosa, P. van Mulbregt, SciPy 1.0 Contributors, SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020)
14. S. Seabold, J. Perktold, statsmodels: Econometric and statistical modeling with Python, in 9th Python in Science Conference, 2010 15. L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, G. Varoquaux, API design for machine learning software: Experiences from the scikit-learn project, in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122, 2013 16. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 17. W. McKinney, Data structures for statistical computing in Python, in Proceedings of the 9th Python in Science Conference, ed. by S. van der Walt, J. Millman, pp. 56–61, 2010 18. C.W. Granger, Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 424–438 (1969) 19. J. Perktold, S. Seabold, J. Taylor, statsmodels.tsa.stattools.grangercausalitytests. https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.grangercausalitytests.html 20. S.F. Tsao, H. Chen, T. Tisseverasinghe, Y. Yang, L. Li, Z.A. Butt, What social media told us in the time of COVID-19: A scoping review. Lancet Digit. Health 3(3), e175–e194 (2021)
3
Integrating LSTM and EEMD Methods to Improve Significant Wave Height Prediction Ashkan Reisi-Dehkordi, Alireza Tavakkoli, and Frederick C. Harris Jr.
Abstract
Wave energy is one of the most significant reliable and renewable energy sources, having the highest energy density among renewable sources. Significant Wave Height (SWH) plays a major role in wave energy, and hence this study aims to predict wave height using time series of wave characteristics as input to various machine learning approaches and to analyze these approaches under several scenarios. Two different machine learning algorithms will be implemented to forecast SWH. In the first approach, the SWH will be forecasted directly using a Long Short Term Memory (LSTM) network; in the second approach, an LSTM combined with an Ensemble Empirical Mode Decomposition (EEMD) method is proposed for SWH prediction. For this purpose, the elements of wave height will be initially decomposed and used for training an LSTM network to calculate the time series of SWH. Also, the calibration and verification of the modeled wave characteristics will be done using real data acquired from buoys. The results imply that the EEMD approach provides more accurate results: calculating the wave height through the decomposition and prediction of its main wave components delivers better outcomes across various error indices. The results also show that the accuracy of the predictions decreases as the forecasting time horizon increases.
Keywords
Deep learning optimization · Ensemble empirical mode decomposition · Long short term memory network · Neural network in coastal engineering · Ocean wave decomposition · Ocean wave height forecasting · Regression algorithms · Time series analysis · Wave characteristics prediction · Wave energy prediction
A. Reisi-Dehkordi () · A. Tavakkoli · F. C. Harris, Jr. Computer Science and Engineering, University of Nevada, Reno, NV, USA e-mail: [email protected]; [email protected]; [email protected]
3.1
Introduction
Fossil fuel combustion has been shown to have negative effects on our living environment and is one of the main drivers of global climate change. As such, the world is trying to move on from polluting energy sources to clean and renewable ones [1]. Renewable energy resources including wind, solar, and ocean energy (i.e., thermal, tidal, waves, and currents) are among the common types of renewable energy sources that are employed by industries throughout the world. Ocean waves provide energy densities which are significantly greater than wind and solar resources [2]. This energy density and its renewable nature have triggered a surge of efforts to harness ocean wave energy. Because of this, wave energy prediction plays a crucial role in planning the placement of wave energy converters. The most crucial element of wave energy is Significant Wave Height (SWH), and its prediction plays a significant role in planning for a proper energy converter. Waves are generally generated by winds, and the fluctuations in wave periods and heights derive from the continuous shifts and changes in various wind features [3]. Also, wave conditions vary over monthly, seasonal,
and annual timescales [4]. Through analyzing ocean wave characteristics obtained from buoy or satellite measurements in various locations, and employing deep-water numerical models, the average wave energy can be determined in those locations [5, 6]. The rest of this paper is structured as follows: Sect. 3.2 covers the related work, Sect. 3.3 presents the employed methodology, Sect. 3.4 presents the results, provides discussion about them and compares the results of the two developed frameworks, and finally Sect. 3.5 draws conclusions and presents ideas for future work.
3.2
Related Work
Researchers have investigated solutions to accurately predict the oceanographic parameters altering wave power and height. The implemented approaches have included a wide range of methods, namely statistical methods, numerical methods, empirical models, and hybrid approaches [7]. Of these, the empirical methods are easy and quick to use but cannot provide proper accuracy unless utilized over large horizons [3]. Numerical models can be employed to achieve increased accuracy and wider applicability [3]. For instance, numerical forecasting studies have been applied to the Persian Gulf [8] and China Sea datasets [9]. Finite element methods have also been widely used for prediction purposes. However, these models suffer from some inherent uncertainties in real-world cases [10]. Besides, in most cases the accuracy of numerical models is highly dependent on mesh sizes, which also directly affects the computational time [11]. Soft computing methods analyze data structure to find potential relations to predict outcomes. To be more specific, knowing the inputs and desired outputs, supervised learning neural networks can be developed using the back-propagation algorithm; these methods are among the most significant tools that use approximation to determine patterns and relations within the provided data [12, 13]. Using a back-propagation neural network, [14] predicted the ocean wave height. Their method could provide the anticipated outcomes quickly while reaching a certain accuracy. In another study, it was concluded that a machine learning-based method could present better results when compared to physics-based models; however, the accuracy decreased as the prediction period increased [15]. Also, [16] and [17] investigated SWH prediction based on an artificial neural network. In their studies, classical time series models and neural network models were applied to observed SWHs along the Indian coast. The outcomes implied that the neural network could predict short-term outcomes with more accuracy, whereas the results of the neural network model for long-term predictions were similar to classical models.
Considering soft computing's limited forecasting capability, it has not gained enough trust to be widely applied in operational marine forecasting systems [18]. Hence, in recent years, machine learning, especially deep learning, has been applied in marine and meteorological forecasting [19]. One of the most widely applied regression prediction neural network algorithms is the Recurrent Neural Network (RNN) [19]. RNNs are a type of artificial neural network that use internal memories to model temporal dynamic behavior. They are a good fit for learning from non-linear time series and hence have been applied to many time series problems [20]. Accordingly, they can be a good framework for forecasting systems. The problem with RNNs is that they suffer from vanishing gradients; as a result, errors cannot be back-propagated to neurons in faraway layers [16]. A solution to this problem is the LSTM network. In these networks, long and short-term memory components take the place of the hidden neurons containing activation functions. Consequently, the network can store data values over arbitrary lengths of time, and the vanishing gradient problem of RNNs is solved. An example of the LSTMs' efficiency was demonstrated by Fan et al. [21], who conducted predictions using a hybrid Simulating Waves Nearshore (SWAN) LSTM framework. This framework enhanced the accuracy of predictions by nearly 65% when compared to SWAN model simulations. Another study used various datasets to compare the wave model from the European Centre for Medium-range Weather Forecasts (ECMWF) with LSTM and multi-layer perceptron models [22]. The measured marine wave data used as input for neural networks consist of several components having different properties, including various frequencies and periods, which together form non-stationary time series. Accordingly, in this study an LSTM framework is developed and used to predict the SWH time series for various forecasting windows. In the next step, the developed framework is integrated with a decomposition method and employed for the same prediction task. Afterwards, the performance of the two developed frameworks is compared.
3.3
Methodology and Implementation
Figure 3.1 shows the flow chart of the wave height prediction models used in this study. As depicted, the buoy data is first processed to find the missing records in the time series dataset, and linear interpolation is employed to prepare the training set. In this study, two LSTM model structures are developed to predict the wave heights for various lead times. The first LSTM model uses a sequence of wave heights as its input. As shown in Fig. 3.2, for the second model, an EEMD is used as a time-frequency data analysis method
Fig. 3.1 Flow chart of the LSTM and EEMD-LSTM wave height models
Fig. 3.2 EEMD-LSTM wave height framework
which divides the wave height time series into a number of components, called intrinsic mode functions (IMFs), corresponding to various frequencies, plus a residue. To be more specific, EEMD has the ability to adaptively analyse a signal without any prior assumption about its composition. To do that, it outlines IMFs consecutively by interpolating between peak values. In this study, the underlying abilities of EEMD are utilized since the nonlinear nature of the waves can be processed more efficiently by neural networks after the waves are decomposed. The decomposed components are independently learned by the LSTM. The reversibility of the decomposition is a beneficial asset for analysing the results here, and hence the LSTM outcomes can be merged to create the final prediction results, as depicted in Fig. 3.2.
A three-layer LSTM framework is developed where each of the decomposed waves in the EEMD model, as well as the waves in the non-EEMD one, is trained for 100 epochs with a batch size of 64. The dataset includes more than 25,000 wave records that were measured using a buoy with one-hour intervals between the records. This data was then normalized between 0 and 1 for this study. Also, 85% of the data is used for training and verification of the frameworks and the rest is dedicated to testing purposes.
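The following sketch illustrates the EEMD-LSTM pipeline described above. It is not the authors' code: the PyEMD library, Keras, the layer widths, and the input window length are assumptions; only the 100 epochs, the batch size of 64, and the decompose-train-merge structure come from the text.

```python
import numpy as np
from PyEMD import EEMD          # from the EMD-signal package (an assumed choice)
from tensorflow import keras

def make_lstm(n_steps: int) -> keras.Model:
    """Three stacked LSTM layers, as in the framework described above (widths assumed)."""
    model = keras.Sequential([
        keras.layers.LSTM(64, return_sequences=True, input_shape=(n_steps, 1)),
        keras.layers.LSTM(64, return_sequences=True),
        keras.layers.LSTM(64),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def eemd_lstm_forecast(swh: np.ndarray, n_steps: int = 24, epochs: int = 100):
    """swh: normalized hourly significant wave heights in [0, 1]."""
    imfs = EEMD().eemd(swh)                      # IMFs plus the residue
    components = []
    for imf in imfs:
        # Sliding windows of n_steps past values predicting the next value.
        X = np.array([imf[i:i + n_steps] for i in range(len(imf) - n_steps)])
        y = imf[n_steps:]
        model = make_lstm(n_steps)
        model.fit(X[..., None], y, epochs=epochs, batch_size=64, verbose=0)
        components.append(model.predict(X[-1:][..., None], verbose=0).ravel())
    return np.sum(components)                    # merge component forecasts into one SWH value
```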
3.4
Comparisons and Results
In order to quantitatively evaluate the results and measure the performance of the models, their error indices are calculated and compared.
Table 3.1 Error indices of prediction results for the two algorithms—prediction window of 1 hour
Method    | RMSE | MAE  | MAPE  | CC   | SI   | Bias
EEMD-LSTM | 0.07 | 0.04 | 12.85 | 0.97 | 0.12 | −0.005
LSTM      | 0.20 | 0.12 | 32.51 | 0.86 | 0.32 | 0.02
These indices are derived for the two developed LSTM model structures separately and for various prediction scenarios. The employed error indices are bias, root mean square error (RMSE), scatter index (SI), mean absolute error (MAE), mean absolute percentage error (MAPE), and correlation coefficient (CC):

$$\mathrm{Bias} = \bar{y} - \bar{x}$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - y_i)^2}{n}}$$

$$\mathrm{SI} = \frac{\mathrm{RMSE}}{\bar{x}}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - y_i \rvert$$

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\frac{\lvert x_i - y_i \rvert}{x_i}$$

$$\mathrm{CC} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
where $n$ is the total number of data points, and $x_i$ and $y_i$ represent the observed and predicted values, respectively. The term $\bar{x}$ is the mean value of the buoy-measured data and $\bar{y}$ that of the predicted data. It should also be noted that the calibration of the model was carried out using the Mean Square Error (MSE) index. Table 3.1 shows the comparison of error indices associated with the two models for the testing data. It can be inferred that both models' predictions are in good agreement with the ground-truth wave data. The EEMD-LSTM framework achieved a decrease of 0.12 in the RMSE value compared with the LSTM framework; both frameworks were devised to minimize the RMSE index, and the RMSE values of the two models are listed in Table 3.1. Additionally, it can be noted in Figs. 3.3, 3.4, 3.5, and 3.6 that the EEMD-LSTM framework successfully reconstructed the decomposed wave components into one final wave height time series after processing the components with the LSTM network.

As depicted in Fig. 3.3, both LSTM models achieved good performance, demonstrating the capability of the LSTM network components to learn and simulate the underlying relationships that exist between various wave height input elements. Additionally, the EEMD-based method established relationships between the various decomposed waves impressively well and performed much better than the non-EEMD LSTM algorithm. This accuracy improvement of EEMD over the non-EEMD LSTM framework is easier to see in the error indices presented in Table 3.1. To analyse the frameworks' capability in various scenarios, and the efficiency of the EEMD implementation for longer forecasting windows, the frameworks were again trained and tested for 6, 8, and 12 hour prediction windows, shown in Figs. 3.4, 3.5, and 3.6. The networks' performance can also be seen from the error indices of Tables 3.2, 3.3, and 3.4. From Tables 3.1, 3.2, 3.3, and 3.4 it can be inferred that the EEMD-LSTM framework always outperformed the non-EEMD LSTM framework. In other words, the implementation of the EEMD algorithm helps the framework to significantly enhance the results. For instance, in the case of the 6 hour prediction window, although the implemented LSTM framework provided good results, the implementation of the EEMD decomposition method helped the framework to increase its accuracy by 62%. Similar conclusions can be drawn for the other prediction windows from the provided tables and error indices. The various error indices of the different prediction windows also indicate that the accuracy of both frameworks decreases as the prediction horizon increases. This is shown in Figs. 3.3, 3.4, 3.5, and 3.6, which also present the measured wave heights recorded by buoys along with the models' predictions. The LSTM network outputs follow the approximate waveform of the targets; however, they fail to correctly predict several local peaks. In contrast, the proposed EEMD-LSTM framework showed its ability to capture local peaks thanks to explicitly decomposing the frequency content of the training data. This decomposition gives the framework better insight into the characteristics of the data, a feature that becomes more apparent as the prediction window grows, as seen in Figs. 3.5 and 3.6.
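For reference, the error indices defined above translate directly into numpy, as in the following sketch (x holds the buoy measurements, y the predictions; the function name is illustrative):

```python
import numpy as np

def error_indices(x: np.ndarray, y: np.ndarray) -> dict:
    """Compute the six indices defined above for observed x and predicted y."""
    bias = y.mean() - x.mean()
    rmse = np.sqrt(np.mean((x - y) ** 2))
    si = rmse / x.mean()
    mae = np.mean(np.abs(x - y))
    mape = 100.0 / len(x) * np.sum(np.abs(x - y) / x)
    cc = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
        np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
    return {"Bias": bias, "RMSE": rmse, "SI": si, "MAE": mae, "MAPE": mape, "CC": cc}
```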
3.5
Conclusions and Future Work
In this study, two LSTM frameworks were developed for predicting Significant Wave Height (SWH), one of which is combined with the EEMD method. It was shown that the LSTM framework can demonstrate good performance. It can explore relationships between various records of data and establish waveform trends effectively. Additionally, by integrating the EEMD decomposition method with an LSTM neural network, the model can significantly outperform the non-EEMD LSTM prediction model, by nearly 60%. It can be inferred that the EEMD-LSTM model's superiority lies within the embedded decomposition asset that
Fig. 3.3 EEMD LSTM model—1 hour forecasting window
Fig. 3.4 EEMD and LSTM models—6 hour forecasting window
Fig. 3.5 EEMD and LSTM models—8 hour forecasting window
feeds the network with additional, segregated data elements for training. Accordingly, the results highlight the underlying benefits of wavelet-style decomposition and reconstruction by EEMD. Besides, it is revealed that LSTM and EEMD methods can be combined for SWH prediction
purposes. While this is currently a good asset, future work can involve the integration of numerical simulation methods to make the EEMD-LSTM framework more robust against longer prediction horizons and different wave characteristics.
Fig. 3.6 EEMD and LSTM models—12 hour forecasting window

Table 3.2 Error indices of prediction results for the two algorithms—prediction window of 6 hours
Method    | RMSE | MAE  | MAPE  | CC   | SI   | Bias
EEMD-LSTM | 0.14 | 0.11 | 29.66 | 0.91 | 0.25 | −0.004
LSTM      | 0.37 | 0.24 | 73.26 | 0.72 | 0.58 | 0.35

Table 3.3 Error indices of prediction results for the two algorithms—prediction window of 8 hours
Method    | RMSE | MAE  | MAPE  | CC   | SI   | Bias
EEMD-LSTM | 0.19 | 0.13 | 38.28 | 0.87 | 0.31 | −0.008
LSTM      | 0.41 | 0.29 | 86.3  | 0.39 | 0.63 | 0.67

Table 3.4 Error indices of prediction results for the two algorithms—prediction window of 12 hours
Method    | RMSE | MAE  | MAPE  | CC   | SI   | Bias
EEMD-LSTM | 0.25 | 0.18 | 56.24 | 0.77 | 0.40 | −0.018
LSTM      | 0.46 | 0.33 | 104.7 | 0.25 | 0.74 | 0.05
Acknowledgments This material is based in part upon work supported by the National Science Foundation under grant number OIA-2148788. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References 1. E.S. Keith, Optimizing local least squares regression for short term wind prediction. University of Nevada, Reno (2015). https://www.cse.unr.edu/~fredh/papers/thesis/060-keith/thesis.pdf 2. R.P. Mendes, M.R. Calado, S.J. Mariano, Wave energy potential in Portugal—assessment based on probabilistic description of ocean waves parameters. Renew. Energy 47, 1–8 (2012). https://doi.org/10.1016/j.renene.2012.04.009 3. P. Salah, A. Reisi-Dehkordi, B. Kamranzad, A hybrid approach to estimate the nearshore wave characteristics in the Persian Gulf. Appl. Ocean Res. 57, 1–7 (2016). https://doi.org/10.1016/j.apor.2016.02.005 4. B. Cahill, T. Lewis, Wave period ratios and the calculation of wave power, in Proceedings of the 2nd Marine Energy Technology Symposium, METS2014, April 15–18, 2014, Seattle, WA 5. A. Uihlein, D. Magagna, Wave and tidal current energy–a review of the current state of research beyond technology. Renew. Sustain. Energy Rev. 58, 1070–1081 (2016) 6. A. Muñoz, C. Carthen, V. Le, S.D. Strachan, S.M. Dascalu, F.C. Harris, LDAT: a lidar data analysis and visualization tool, in ITNG 2022 19th International Conference on Information Technology-New Generations (Springer, Berlin, 2022), pp. 293–301 7. P.M. Bento, J.A. Pombo, R.P. Mendes, M.R. Calado, S.J. Mariano, Ocean wave energy forecasting using optimised deep learning neural networks. Ocean Eng. 219, 108372 (2021). https://doi.org/10.1016/j.oceaneng.2020.108372 8. M.H. Moeini, A. Etemad-Shahidi, V. Chegini, I. Rahmani, M. Moghaddam, Error distribution and correction of the predicted wave characteristics over the Persian Gulf. Ocean Eng. 75, 81–89 (2014). https://doi.org/10.1016/j.oceaneng.2013.11.012 9. C.W. Zheng, C.Y. Li, X. Chen, J. Pan, Numerical forecasting experiment of the wave energy resource in the China Sea. Adv. Meteorol. 2016 (2016). https://doi.org/10.1155/2016/5692431 10. F. Ghahari, N. Malekghaini, H. Ebrahimian, E. Taciroglu, Bridge digital twinning using an output-only Bayesian model updating method and recorded seismic measurements. Sensors 22(3), 1278 (2022) 11. A. Reisi, P. Salah, M.R. Kavianpour, Impact of chute walls convergence angle on flow characteristics of spillways using numerical modeling. Int. J. Chem. Environ. Biol. Sci. 3(3), 245–251 (2015) 12. S.O. Erikstad, S. Ove, Design patterns for digital twin solutions in marine systems design and operations (2018). https://www.researchgate.net/publication/325871050 13. A. Reisi-Dehkordi, R. Eslami-Farsani, Prediction of high performance fibers strength using back propagation neural network. J. Macromol. Sci. A 52(8), 642–647 (2015) 14. W. Wang, R. Tang, C. Li, P. Liu, L. Luo, A BP neural network model optimized by mind evolutionary algorithm for predicting the ocean wave heights. Ocean Eng. 162, 98–107 (2018). https://doi.org/10.1016/j.oceaneng.2018.04.039 15. S.J. Gumiere, M. Camporese, A. Botto, et al., Machine learning vs. physics-based modeling for real-time irrigation management. Front. Water 2, 13 (2020). https://doi.org/10.3389/frwa.2020.00008 16. M. Deo, C.S. Naidu, Real time wave forecasting using neural networks. Ocean Eng. 26(3), 191–203 (1998) 17. J. Agrawal, M. Deo, On-line wave prediction. Marine Struct. 15(1), 57–74 (2002)
18. S. Gao, J. Huang, Y. Li, G. Liu, F. Bi, Z. Bai, A forecasting model for wave heights based on a long short-term memory neural network. Acta Ocean. Sin. 40(1), 62–69 (2021) 19. S. Gao, P. Zhao, B. Pan, et al., A nowcasting model for the prediction of typhoon tracks based on a long short term memory neural network. Acta Ocean. Sin. 37, 8–12 (2018). https://doi.org/10.1007/s13131-018-1219-z 20. Y.Y. Chen, Y. Lv, Z. Li, F.Y. Wang, Long short-term memory model for traffic congestion prediction with online open data, in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC) (Institute of Electrical and Electronics Engineers, Piscataway, 2016), pp. 132–137. https://doi.org/10.1109/ITSC.2016.7795543 21. S. Fan, N. Xiao, S. Dong, A novel model to predict significant wave height based on long short-term memory network. Ocean Eng. 205, 107298 (2020) 22. G. Reikard, P. Pinson, J.R. Bidlot, Forecasting ocean wave energy: The ECMWF wave model and time series methods. Ocean Eng. 38, 1089–1099 (2011). https://doi.org/10.1016/j.oceaneng.2011.04.009
4
A Deep Learning Approach for Sentiment and Emotional Analysis of Lebanese Arabizi Twitter Data Maria Raïdy and Haidar Harmanani
Abstract
Arabizi is an Arabic dialect that is represented in Latin transliteration and is commonly used in social media and other informal settings. This work addresses the problem of Arabizi text identification and emotional analysis based on the Lebanese dialect. The work starts with the extraction and construction of a dataset and uses two machine learning models. The first is based on fastText for learning the embeddings, while the second uses a combination of recurrent and dense deep learning models. The proposed approaches were attempted on the Arabizi dataset that we extracted and curated from Twitter. We compared our results with six classical machine learning approaches using separate sentiment and emotion analyses. We achieved the highest result in the literature for the binary sentiment analysis, with an F1 score of 81%. We also present baseline results for the 3-class sentiment classification of Arabizi tweets with an F1 score of 64%, and for emotion classification of Arabizi tweets with an F1 score of 61%.
Keywords
Emotional Analysis · Arabizi · Deep Learning
4.1
Introduction
M. Raïdy · H. Harmanani ()
Department of Computer Science and Mathematics, Lebanese American University, Byblos, Lebanon
e-mail: [email protected]

Social media has been at the center of the digital age causing disruptions from the rise of political polarization to the COVID-19 pandemic. According to the Pew Research Center, seven-in-ten Americans use social media to connect
with one another, engage with news content, share information, and express their happiness, frustrations, and anger [1]. Various countries have resorted to social media in order to measure the wellness of their own citizens, not an easy task given that such measures include evolving economic, environmental, and social indicators. Some countries regularly mine social media in order to report welfare statistics and use them to detect social anxiety or sudden decreases in satisfaction. For example, the United Kingdom looks at societal and personal well-being through areas such as health, relationships, education, skills, finance, and the environment [2]. Although there are no government-led efforts in Lebanon to study societal well-being, various researchers looked at Lebanese well-being in a variety of psychological contexts [3–10]. Other researchers examined the impact of plurality and communitarianism on well-being [11–13] in addition to the Lebanese social media habits during events such as the Beirut Port Blast [14–20] or during social unrest [21–23]. During the 2019 social unrest and the economic meltdown of 2020, the New Economics Foundation (NEF) ranked Lebanon 120th with a Happy Planet Index (HPI) score of 21.9 [24]. This paper uses a deep learning approach in order to analyze sentiment and emotions of Lebanese Arabizi tweets during the 2019 social unrest and the economic meltdown of 2020. We create, curate, and label a Lebanese Arabizi dataset and then compare our results with six classical machine learning approaches on separate sentiment and emotion analysis tasks. Finally, we use these results in order to determine the Lebanese Social Happiness Index (LSHI) during the same period.
4.2
Arabizi
Arabizi is a word that is composed of “Arabic” and “englizi,” the Arabic word for English. It is an Arabic dialect that is
represented in Latin transliteration and used increasingly by millennials as it makes writing what's being spoken easier [25–27]. Arabizi is easy to use, fast to type, and lacks complex grammatical or syntax rules. In fact, it is the language of choice for younger bilingual Arab generations who code-switch between languages on social media as well as in texting or emailing. In multilingual countries such as Algeria, Tunisia, and Lebanon, Arabizi has evolved to include French. In an extensive study of Arabizi, Sullivan [28] noted that Lebanese Arabizi is more complex than other Arabizis as most Lebanese are trilingual and tend to use several languages (Arabic, English, French, and/or Armenian) in the same sentence. The author illustrates this case using the classical Lebanese salutation "Hi! kifak? cava?" ("Hello! How are you? Is all good?"), which mixes Lebanese Arabic (Kifak), French (Ça va), and English (Hi) in the same sentence. Note that in this case the French cédille is omitted from "Ça va." There are various Arabizi versions depending on the local Arabic dialect. Boudad et al. [29] noted that social media is found to include 30 different Arabizi versions. These informal texts lend themselves to ambiguity as they disregard grammar, capitalization, and punctuation rules. There are no formal rules on how to mix Arabic and Latin characters in Arabizi. However, in most variations, Latin characters and numbers are used in order to represent sounds that do not exist in those languages. For example, the guttural Arabic letter ع or ayn is typically represented using "3", "aa" or even "3a", while the voiceless fricative ح or Haa is typically represented using "7." Some variations may completely omit all vowels for simplicity.
4.3
Related Work
Although a lot of research investigated NLP techniques for sentiment analysis, emotion analysis, or translation, few tackled Arabizi due to the recent emergence of the dialect and its informality and complexity. Guellil et al. [30] and Duwairi et al. [31] used Naïve Bayes and SVM machine learning techniques in order to classify sentiments of Arabizi messages into positives or negatives. Both transliterated the local Arabizi dialect (Algerian and Jordanian, respectively) into formal Arabic first. Barr et al. [32] chained two distinct methodologies, word2vec and a graph-theoretical algorithm, in order to classify documents into distinct classes. Liu et al. [33] proposed an adaptive DNN inference acceleration framework that accelerates DNN inference. Srour et al. [34, 35] used a mixed theme/event based approach to rate highly influential users in Twitter. Hajjar et al. [36] and Sarkissian et al. [37] used an unsupervised approach to identify sentences 1 hello!
How are you? Is all good?
having the highest potential to represent informative content in a document. Wang et al. [38] used a deep learning model that combined Conditional Random Field and Bidirectional LSTM for NER with unbalanced labels. Thomas et al. [39] explored the relative frequency of social attributes in Tweets across both English and Arabic. Abd El-Wahab et al. [40] explored trends in transliteration using deep learning models and propose an approach for Arabic-English transliteration. Darwish [41] used CRF sequence labeling in order to identify Arabizi in texts prior to converting it to formal Arabic. Bies et al. [42] created a corpus for the purpose of Egyptian Arabizi transliteration into Arabic. Tobaili [43] tackled sentiment analysis for Lebanese Arabizi using an SVM classifier. Kiritchenko et al. [44] detected sentiments behind short informal texts extracted from twitter and SMS using a lexicon-based approach. Kundi et al. [45] developed a tool to detect and analyze interesting slang words while Bahrainian et al. [46] created and used a slang dictionary for classification. Dhaoui et al. [47] tested a lexical sentiment analysis approach versus machine learning approaches.
4.4
Dataset Generation
Twitter-based textual data provide a trove of information that is domain specific and tackles various topics. Tweets are attractive since they are limited in length to a maximum of 280 characters, highly available through APIs, and easily accessible from different media. Since there is no publicly available Lebanese Arabizi dataset that can be used for sentiment and emotion analysis, we curated, annotated, and labeled our own dataset. We scraped data from Twitter based on a set of 30 common keywords in Lebanese, such as "lech, chou, hayda, kamen, hasset, chwei, ..." The keywords served as seeds to collect tweets published between 01-01-2017 and 31-04-2020. The data was also geographically filtered based on the tweet and account location. We excluded all spam, images, noise, duplicates, retweets, tweets containing Arabic characters, and tweets that consist only of hashtags and/or URLs. The initial dataset included 135,000 Lebanese Arabizi tweets written in English, French, and Armenian. The dataset was next manually inspected and verified. Tweets that have at least 50% Arabizi words were kept, resulting in 46,000 Arabizi tweets. Next, two independent annotators performed sentiment and emotional labeling on the dataset, and we only kept the tweets the two annotators agreed on. The labels varied between polarity (positive, negative, or neutral) and emotion according to the Ekman model (happiness, anger, sadness, fear, surprise, and disgust). Finally, we added an additional set of social labels that are of particular interest in Lebanon, such as sectarianism, gossip, sexism,
Table 4.1 Examples of sentiment classifications in our dataset
Polarity | Emotion  | Arabizi tweet                                           | Translation
Positive | Joy      | Khay Shou helo habbet hal jemle !:p                     | Oh, how beautiful! I liked this sentence!
Positive | Joy      | nzale btetsalle ktir :p                                 | Go! You'll enjoy it!
Positive | Joy      | Shu b7bk ya beirut                                      | Beirut, how I love you
Negative | Sadness  | Ana kamen bkit bas ma habbet abadan. Kteir z3elet.. :(  | I also cried but I didn't like it at all. It made me so sad
Negative | Bullying | Lea ktir pedophile                                      | Lea is too much of a pedophile
Negative | Gossip   | Aade manna kitr "waw" ya3ne. Elle e troo aade           | She's not that pretty. She's average.
Negative | Fear     | aywaa aywaa nshalla nnja7 sara7a ana ktir khayfe :/     | I hope we succeed, I'm honestly so worried.

Fig. 4.1 Classification system architecture (pipeline stages: Text Preprocessing → Text Vector → Feature Extraction → Sentiment and Emotion Classification; Courtesy Words)
racism, bullying, sarcasm, and foul language. A sample set of tweets and corresponding labels are shown in Table 4.1. The dataset is available on Kaggle.
4.5
Classification System
We analyzed and classified sentiment and emotions of Lebanese Arabizi tweets and then compared our results with six classical machine learning approaches. In what follows, we describe our classification system with reference to Fig. 4.1.
4.5.1
Text Pre-processing
Preprocessing is an important step in our approach because of the informality of the Arabizi language and the multiple typos and errors that are byproducts of social media. During this step, we convert all the characters to lowercase, and remove user mentions, URLs, special characters, numbers, measurements, and timings. We do, however, keep all the words that might give an insight into the polarity and emotion of the tweep. We also simplified exaggerations in the texts and deleted all stop-words using a list of 398 tokens specific to our data. The stop-words list included a combination of English and French words as well as the most common Arabizi terms found in our corpus. Finally, we cleaned the dataset of empty tweets and balanced it using random over-sampling. The result was two balanced datasets of 1800 tweets for sentiment analysis and 1500 for emotion analysis.
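A minimal sketch of these preprocessing steps; the regular expressions and the tiny stop-word set are illustrative stand-ins for the paper's 398-token list.

```python
import re

# Illustrative stand-in; the paper's actual stop-word list has 398 tokens.
STOPWORDS = {"w", "el", "la", "ya", "enno", "the", "le", "et"}

def preprocess(tweet: str) -> str:
    text = tweet.lower()
    text = re.sub(r"@\w+", " ", text)              # drop user mentions
    text = re.sub(r"https?://\S+", " ", text)      # drop URLs
    text = re.sub(r"\b\d+(?::\d+)?\b", " ", text)  # drop standalone numbers/timings
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # simplify exaggerations: "ktiiiir" -> "ktiir"
    text = re.sub(r"[^a-z0-9\s]", " ", text)       # drop special characters (keep Arabizi digits)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("@sara lech ktiiiir 7elo!! http://t.co/x"))  # -> "lech ktiir 7elo"
```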
Table 4.2 Sound-effects tags with examples
Tag           | Examples
@laughter     | hahahahhahaaah, lololool, hihihihihihi, wahaha
@amazement    | yaayyyy, wooowow, wouw
@surprise     | ohhh, aaahh, owww
@annoyance    | mehhh, ufft, ughh, huh, hush, hushtt
@thinking     | uhmm, ummm, mmm, heinn
@kiss         | mwaahhh, mmmmwh
@sound-effect | uff, pshht, psst, shhhht, a777

Table 4.3 Emojis and emoticons
Emoticons                | Tag
:-) :) :-] :3 8-) :-} =) | happy_face_smiley
:-( :( :c :-[ :-| :@     | frown_sad_angry_or_pouting
:-* :* :X                | kiss
👍                       | thumbs_up

4.5.2
Feature Extraction

The next step was analyzing the dataset for possible feature extraction. In this stage, we kept negating words (la2/la 'no', ma/mich 'not', ...) to preserve the purpose of the tweet. We also tamped down the enthusiasm in the tweets and used different tags to replace exaggerations and sound effects, such as @laughter, @amazement, @surprise, @annoyance, @thinking, @kiss, and/or simply @sound-effect (Table 4.2). We also replaced the emojis and emoticons by their descriptions (Table 4.3). All symbols and sounds were normalized in order to account for different regional local dialects (Table 4.4).
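The tagging step can be sketched as follows, using small illustrative subsets of Tables 4.2 and 4.3; the patterns shown are assumptions rather than the authors' exact rules.

```python
import re

# Small excerpts of the mappings in Tables 4.2 and 4.3 (illustrative).
SOUND_EFFECTS = {
    "@laughter":  r"\b(?:a*h+a+h+[ha]*|lo+l+o*|hi+(?:hi+)+)\b",
    "@amazement": r"\b(?:ya+y+|wo+w+|wouw)\b",
    "@annoyance": r"\b(?:me+h+|uf+t*|ug+h+|huh|hush+t*)\b",
}
EMOTICONS = {":)": "happy_face_smiley", ":-(": "frown_sad_angry_or_pouting", ":*": "kiss"}

def tag_features(text: str) -> str:
    for tag, pattern in SOUND_EFFECTS.items():
        text = re.sub(pattern, tag, text)       # replace sound effects with tags
    for emoticon, description in EMOTICONS.items():
        text = text.replace(emoticon, description)  # replace emoticons with descriptions
    return text

print(tag_features("hahahaha shu helo :)"))  # -> "@laughter shu helo happy_face_smiley"
```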
Table 4.4 Normalized representation of complex sounds
Phonetic alphabet | Common Arabizi representations | Normalized representation
ʒ                 | j/g                            | j
ħ                 | 7/h                            | h
x                 | 7'/5/kh                        | kh
dˤ                | d/z                            | (can't normalize)
ʃ                 | sh/ch                          | ch
ðˤ                | th/z                           | (can't normalize)
ʕ                 | 3/aa                           | aa
ɣ                 | 3'/8/gh                        | gh
q                 | 9/q                            | q
w/u               | o/w/ou/u                       | u
y/i               | i/y/ey/ei                      | i
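A sketch of how the normalization in Table 4.4 can be applied; the ordered replacement list is an illustrative subset, and applying multi-character forms first is a design choice to avoid clobbering digraphs.

```python
# Ordered replacement list in the spirit of Table 4.4 (illustrative subset);
# multi-character forms are applied before their single-character prefixes.
NORMALIZATION = [
    ("7'", "kh"), ("5", "kh"),
    ("3'", "gh"), ("8", "gh"),
    ("3", "aa"),
    ("7", "h"), ("9", "q"),
    ("sh", "ch"),
    ("ey", "i"), ("ei", "i"), ("y", "i"),
    ("ou", "u"), ("o", "u"), ("w", "u"),
]

def normalize(token: str) -> str:
    for src, dst in NORMALIZATION:
        token = token.replace(src, dst)
    return token

print(normalize("5ayfe"), normalize("7abibi"))  # -> "khaife habibi"
```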
4.6
Sentiment and Emotion Classification
Dictionary-based approaches are not efficient when dealing with an informal, morphologically rich, and constantly evolving language like Arabizi. In order to classify the Arabizi text, we used multiple models that we will describe next.
4.6.1
Text Vectorization Using Fasttext
We used the skip-gram word embeddings model, resulting in a V × D matrix where V is the vocabulary size and D is the embeddings dimension [48]. One limitation of this model for morphologically rich languages with large vocabularies and many rare words, such as Arabizi, is that it assigns a distinct vector to each word, ignoring its morphology. To resolve this problem, we use the fastText library, which includes subword information [49]. Each word vector is a sum of the vectors of its character n-grams. As a result, fastText deals with out-of-vocabulary words, such as misspelled words, by representing them through their subword representations. We then trained the model on our dataset, using our labeled data in order to predict whether a tweet is in Arabizi or not. We used a learning rate of 0.05 and 3 epochs. We set the word n-gram window size to 3, and based our training on subword features with character n-grams of size 2–6. We also used the hierarchical softmax loss function. We obtained 95% precision, recall, and F1 following five-fold validation, and 98% precision, recall, and F1 score on the Arabizi-detection validation set. These results are the highest reported for Arabizi classification. In order to assess the impact of the text pre-processing stage, we repeated the above experiment using the same model and parameters on the raw unprocessed dataset. We received 94% precision, recall, and F1 score on the balanced test set, and 94% precision, 93% recall, and F1 score on the Arabizi detection validation set. Hence, the preprocessing method improved our results by 4%. Finally, combining both methods, using the automatically Arabizi-classified tweets with a ratio of 50% to train our model, resulted in an F1 score of 96% on testing and 97% on validation.
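The training setup described above maps onto the fastText Python API roughly as follows; the input file name and label scheme are assumptions, while the hyperparameters mirror those reported in the text.

```python
import fasttext

# train.txt: one tweet per line, prefixed with __label__arabizi or
# __label__not_arabizi (an assumed file layout).
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.05,          # learning rate of 0.05
    epoch=3,          # 3 epochs
    wordNgrams=3,     # word n-gram window of 3
    minn=2, maxn=6,   # subword character n-grams of size 2-6
    loss="hs",        # hierarchical softmax
)
print(model.predict("chou hayda ktir helo"))  # -> (predicted label, probability)
```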
4.6.2
Machine Learning Approaches
We next attempted six classical machine learning models, including Multinomial Naive Bayes, Random Forest, SVM (with linear, Gaussian, and sigmoid kernels), decision trees, logistic regression, and a vanilla neural network model. The neural network model was based on two hidden layers of size 100, the 'Adam' optimizer, 300 epochs, and warm start turned on. Each model was run 768 times with different choices at the different stages.
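A sketch of two of these classical models as sklearn pipelines; the exact vectorizer settings varied across the 768 runs, so the n-gram range shown is only one of the explored choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Trigram count features with logistic regression (the best sentiment model).
lr_model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)

# The vanilla neural network configuration described above.
mlp_model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),
    MLPClassifier(hidden_layer_sizes=(100, 100), solver="adam",
                  max_iter=300, warm_start=True),
)

# Usage: lr_model.fit(train_texts, train_labels); lr_model.predict(test_texts)
```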
4.7
Deep Learning Model
The last model we attempted was a deep learning model that uses a combination of an embedding layer, bidirectional LSTM recurrent layers, dropout layers, and dense layers. The model, shown in Fig. 4.2, used two LSTM layers: the first with 400 units and the second with 200. We also used three dense layers, with the first having 400 units and the second 100. The last layer was the prediction layer, whose number of units depended on the prediction label: in the polarity case it used a sigmoid unit, while in the emotional analysis it used a softmax activation function.
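A minimal Keras sketch of this architecture; the vocabulary size, embedding dimension, and dropout rates are assumptions, while the layer widths follow the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM = 20000, 100   # illustrative values

def build_model(n_classes: int) -> keras.Model:
    """Embedding -> two bidirectional LSTMs (400, 200) -> dense (400, 100) -> prediction."""
    model = keras.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.Bidirectional(layers.LSTM(400, return_sequences=True)),
        layers.Dropout(0.5),
        layers.Bidirectional(layers.LSTM(200)),
        layers.Dropout(0.5),
        layers.Dense(400, activation="relu"),
        layers.Dense(100, activation="relu"),
        # Sigmoid for binary polarity, softmax over classes for emotions.
        layers.Dense(1 if n_classes == 2 else n_classes,
                     activation="sigmoid" if n_classes == 2 else "softmax"),
    ])
    loss = "binary_crossentropy" if n_classes == 2 else "sparse_categorical_crossentropy"
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model
```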
4.8 Experimental Results
We present separate results of our finalized models for sentiment and emotion analysis in Tables 4.5 and 4.6, respectively. For sentiment analysis, the model with the best accuracy (65%) is the Logistic Regression approach with a count vectorizer of trigrams, tagged sound effects, handled negation, replaced emojis and emoticons, and normalized phonetics. Notice that we did not remove stop words in this case. The same model run on an unprocessed dataset achieves an accuracy of 63%. The downside of our approach is that we used all our data for training and testing, since it is a rather small dataset, and therefore did not have a validation set to validate the accuracy of our models on unseen data. To be able to compare our results to those in the literature, we trained the best models obtained for our 3-class sentiment analysis (Positive, Negative, Neutral), combined with the preprocessing methods adopted by each, on a 2-class sentiment analysis task (Positive, Negative). The best accuracy obtained is 81% with a decision tree model joined with tagged sound effects, removed stop words, handled negations, and normalized phonetics. As for the best performing model, it was the deep learning model shown in Fig. 4.2, which achieved a validation accuracy of 68% as shown in Fig. 4.3. One of the applications of our study is to measure the Lebanese Social Happiness Index (LSHI) based on Arabizi Twitter data.
Fig. 4.2 Deep learning model architecture

Table 4.5 Sentiment analysis results for machine learning models

| Algorithm | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Multinomial naive bayes | 0.61 | 0.63 | 0.61 | 0.61 |
| Random forest | 0.60 | 0.63 | 0.60 | 0.60 |
| SVM linear | 0.63 | 0.64 | 0.63 | 0.63 |
| SVM Gaussian | 0.62 | 0.63 | 0.62 | 0.62 |
| SVM sigmoid | 0.62 | 0.63 | 0.62 | 0.62 |
| Decision trees | 0.56 | 0.56 | 0.56 | 0.55 |
| Logistic regression | 0.65 | 0.65 | 0.65 | 0.64 |

Table 4.6 Emotion analysis results for each model

| Algorithm | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Multinomial naive bayes | 0.58 | 0.58 | 0.58 | 0.58 |
| Random forest | 0.60 | 0.62 | 0.60 | 0.60 |
| SVM linear | 0.60 | 0.61 | 0.60 | 0.60 |
| SVM Gaussian | 0.59 | 0.61 | 0.59 | 0.60 |
| SVM sigmoid | 0.59 | 0.62 | 0.59 | 0.60 |
| Decision trees | 0.54 | 0.58 | 0.54 | 0.54 |
| Logistic regression | 0.60 | 0.62 | 0.61 | 0.61 |
The sample represents 25% of the population who use Twitter frequently, and 33% of those who use Arabizi on social media. It is interesting to note how well the tweets' mood correlated with the events that occurred in Lebanon from the start of the October 17 revolution as well as with the start of the COVID-19 pandemic. Figures 4.4 and 4.5 show the results of sentiment and emotion analysis of our best models. If we align these results with the COVID-19 updates in Lebanon, we notice an increase in the number of negative tweets whenever a pessimistic event happened. For example, on February 21, 2020, when the first COVID-19 case was announced in Lebanon, we see a clear peak in anger and fear tweets. This is also the case when the first death was announced, followed by the first complete lockdown starting on March 10, 2020. Moreover, anger, fear, and sadness dominated as the number of COVID-19 cases increased from May 2020 onward.

4.9 Conclusion
We have presented a deep learning approach for Arabizi text identification and sentiment and emotional analysis. We evaluated six classical machine learning models in addition to a deep learning model. We achieved the highest result in the literature for binary Arabizi sentiment analysis. We have also presented baseline results for the 3-class sentiment and emotional classification of Arabizi tweets. The reported results and comparisons were favorable (Fig. 4.6).
Fig. 4.3 Deep learning model validation and training accuracy
[Figure: time series of counts of Negative and Positive sentiments by date, 7 Feb–27 May]
Fig. 4.4 Binary sentiments detected by the DT model in COVID-19 tweets between February 2020 and May 2020 (Lebanon) with 65% accuracy
[Figure: time series of counts of Negative, Neutral, and Positive sentiments by date, 7 Feb–27 May]
Fig. 4.5 Sentiments detected by the DT model in COVID-19 tweets between February 2020 and May 2020 (Lebanon) with 65% accuracy
References 1. Pew Research Center, Social Media Fact Sheet (Pew Research Center, Washington, 2021) 2. UK Office of Online Statistics, Well-Being (UK Office of Online Statistics, Newport, 2022) 3. P. Tohme, R. Abi-Habib, E. Nassar, N. Hamed, G. Abou-Ghannam, G. Chalouhi, The psychological impact of the covid-19 outbreak on pregnancy and mother-infant prenatal bonding. J. Child Health 26(11) (2022) 4. P. Tohme, I. Grey, M. El-Tawil, M. El Maouch, R. Abi-Habib, Prevalence and correlates of mental health difficulties following the beirut port explosion: the roles of mentalizing and resilience. Psychol. Trauma Theory Res. Pract. Policy (2022). http://dx.doi. org/10.1037/tra0001328 5. R. Abi-Habib, R. Chaaya, N. Yaktine, M. Maouch, P. Tohme, Predictors of the impostor phenomenon. Asian J. Psychiatr. 75, 1–2 (2022) 6. M. Greaves, The reflexive self; adapting to academic life in a time of social turmoil. Teach. High. Educ. 1–16 (2021) 7. D. Nauffal, J. Nader, Organizational cultures of higher education institutions operating amid turbulence and an unstable environment: the lebanese case. High. Educ. 84, 343–371 (2022) 8. A. Al-Shehhi, I. Grey, J. Thomas, Big data and wellbeing in the arab world, in Positive Psychology in the Middle East/North Africa: Research, Policy, and Practise, ed. by L. Lambert, N. Pasha-Zaidi (Springer, Berlin, 2019), pp. 159–182 9. N.B. Zakhem, P. Farmanesh, P. Zargar, A. Kassar, Wellbeing during a pandemic: an empirical research examining autonomy, workfamily conflict and informational support among SME employees. Front. Psychol. 13 (2022) 10. R. Gök, E. Bouri, E. Gemici, Can Twitter-based economic uncertainty predict safe-haven assets under all market conditions and investment horizons? Technol. Forecast. Soc. Chang. 185, 1–21 (2022)
11. A. Desatnik, C. Jarvis, N. Hickin, L. Taylor, D. Trevatt, P. Tohme, N. Lorenzini, Preliminary real-world evaluation of an intervention for parents of adolescents: the open door approach to parenting teenagers (apt). J. Child Fam. Stud. 30, 38–50 (2021) 12. I. Salamey, The communitarian nation, in The Communitarian Nation-State Paradox in Lebanon, ed. by I. Salamey, chap. 10 (Nova Science Publishers, Hauppauge, 2021), pp. 266–290 13. I. Salamey, Reconstructing the communitarian state, in The Communitarian Nation-State Paradox in Lebanon, ed. by I. Salamey, chap. 13 (Nova Science Publishers, Hauppauge, 2021), pp. 309– 333 14. S.E. Hajj, Archiving the political, narrating the personal: the year in lebanon. Biography 42, 84–91 (2019) 15. S.E. Hajj, Golgotha, beirut: a feminist memoir of the port blast. J. Int. Women’s Stud. 24, 1–4 (2022) 16. G. King, Radical media education practices from social movement media, in The Handbook of Media Education Research, ed. by D. Frau-Meigs, S. Kotilainen, M. Pathak-Shelat, M. Hoechsmann, S.R. Poyntz, chap. 40 (Nova Science Publishers, Hauppauge, 2020) 17. C. Kozman, R. Cozma, Keeping the gates on twitter: interactivity and sourcing habits of lebanese traditional media. Int. J. Commun. 15, 1000–1020 (2021) 18. C.-J. El-Zouki, A. Chahine, M. Mhanna, S. Obeid, S. Hallit, Rate and correlates of post-traumatic stress disorder (ptsd) following the beirut blast and the economic crisis among lebanese university students: a cross-sectional study. BMC Psychiatry 22(1), 532 (2022) 19. G. Sadaka, Illnesses of illusion and disillusionment: from euphoria to aporia. Life Writing J. 1–12 (2022). Published Online 20. P. M. B. Doleh, Truman in Beirut: Journeying through fear and immobility,” Life Writing, pp. 1–13, 2022. 21. J.G. Karam, Teaching political science during crisis: The three-c approach and reflections from lebanon during a social uprising, an economic meltdown, and the covid-19 pandemic. J. Polit. Sci. Educ. 18, 492–510 (2022)
[Figure: five stacked time-series panels of emotion counts by date (Anger, Fear, Sadness, Joy, None), 9 Feb–19 May]
Fig. 4.6 Emotions detected by the LR model in COVID-19 tweets between February 2020 and May 2020 (Lebanon) with 61% accuracy
22. T. Fakhoury, F. Al-Fakih, Consociationalism and political parties in the middle east: the lebanese case, in Routledge Handbook on Political Parties in the Middle East and North Africa, ed. by F. Cavatorta, L. Storm, V. Resta, chap. 10 (Routledge, London, 2020), pp. 179–191 23. A. Fakih, R. Khayat, Social identity, confidence in institutions, and youth: evidence from the arab spring. Soc. Sci. Q. 103, 997–1018 (2022) 24. N.E. Foundation, Happy planet index 2016 methods paper (2016). Accessed 12 Nov 2022 25. M.A. Yaghan, Arabizi: a contemporary style of arabic slang. Des. Issues 24(2), 39–52 (2008) 26. H. Alghamdi, E. Petraki, Arabizi in Saudi Arabia: a deviant form of language or simply a form of expression? Soc. Sci. 7(9), 1–19 (2018) 27. R. Akbar, H. Taqi, T. Sadiq, Arabizi in Kuwait: an emerging case of digraphia. Lang. Commun. 74, 204–216 (2020) 28. N. Sullivan, Writing Arabizi: Orthographic Variation in Romanized Lebanese Arabic on Twitter. Ph.D Thesis, University of Texas, 2017 29. N. Boudad, R. Faizi, R. Oulad Haj Thami, R. Chiheb, Sentiment analysis in arabic: a review of the literature. Ain Shams Engineering J. 9(4), 2479–2490 (2018) 30. I. Guellil, A. Adeel, F. Azouaou, F. Benali, A.-E. Hachani, A. Hussain, Arabizi sentiment analysis based on transliteration and automatic corpus annotation, in Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (2019) 31. R. Duwairi, M. Alfaqeh, M. Wardat, A. Alrabadi, Sentiment analysis for Arabizi text, in 7th International Conference on Information and Communication Systems (2016) 32. J.R. Barr, P. Shaw, F.N. Abu-Khzam, J. Chen, Combinatorial text classification: the effect of multi-parameterized correlation clustering, in First International Conference on Graph Computing (2019), pp. 29–36 33. G. Liu, F. Dai, X. Xu, X. Fu, W. Dou, N. Kumar, M. Bilal, An adaptive DNN inference acceleration framework with end-edgecloud collaborative computing. Futur. Gener. Comput. Syst. 140, 422–435 (2023) 34. A. Srour, H. Ould-Slimane, A. Mourad, H. Harmanani, C. Jenainati, Joint theme and event based rating model for identifying relevant influencers on twitter: Covid-19 case study. Online Soc. Netw. Media 31, 1–15 (2022) 35. A. Mourad, A. Srour, H. Harmanani, C. Jenainati, M. Arafeh, Critical impact of social networks infodemic on defeating coronavirus COVID-19 pandemic: twitter-based study and research directions. IEEE Trans. Netw. Serv. Manag. 17(4), 2145–2155 (2020) 36. A. Hajjar, J. Tekli, Unsupervised extractive text summarization using frequency-based sentence clustering. Commun. Comput. Inform. Sci. 1652, 245–255 (2022)
37. S. Sarkissian, J. Tekli, Unsupervised topical organization of documents using corpus-based text analysis, in Proceedings of the International Conference on Management of Digital EcoSystems (2021), pp. 87–94 38. Y. Wang, P. Vijayakumar, B.B. Gupta, W. Alhalabi, A. Sivaraman, An improved entity recognition approach to cyber-social knowledge provision of intellectual property using a CRF-LSTM model. Pattern Recogn. Lett. 163, 145–151 (2022) 39. J. Thomas, A. Al-Shehhi, M. Al-Ameri, I. Grey, We tweet arabic; I tweet english: self-concept, language and social media. Heliyon 5(7), 1–5 (2019) 40. M.M.A. El-Wahab, F.N. Abu-Khzam, J.E. Den, An effective machine learning approach for english-arabic transliteration, in Proceedings of the 4th International Conference on Natural Language Processing (2022), pp. 345–349 41. K. Darwish, Arabizi detection and conversion to Arabic, in Proceedings of the 2014 Workshop on Arabic Natural Language Processing (2014) 42. A. Bies, Z. Song, M. Maamouri, S. Grimes, H. Lee, J. Wright, S. Strassel, N. Habash, R. Eskander, O. Rambow, Transliteration of Arabizi into Arabic orthography: developing a parallel annotated Arabizi-Arabic script SMS/chat corpus, in Proceedings of the 2014 Workshop on Arabic Natural Language Processing (ACL, 2014), pp. 93–103 43. T. Tobaili, M. Fernandez, H. Alani, S. Sharafeddine, H. Hajj, G. Glavaš, SenZi: a sentiment analysis lexicon for the latinised Arabic (Arabizi), in Proceedings of the International Conference on Recent Advances in Natural Language Processing (2019), pp. 1203–1211 44. S. Kiritchenko, X. Zhu, S.M. Mohammad, Sentiment analysis of short informal texts. J. Artif. Intell. Res. 50, 23–762 (2014) 45. F.M. Kundi, S. Ahmad, A. Khan, M.Z. Asghar, Detection and scoring of Internet Slangs for sentiment analysis using SentiWordNet. Life Sci. J. 11(9), 66–72 (2014) 46. S.A. Bahrainian, A. Dengel, Sentiment analysis and summarization of twitter data, in Proceedings of the 16th IEEE International Conference on Computational Science and Engineering (2013) 47. C. Dhaoui, C. Webster, L.P. Tan, Social media sentiment analysis: lexicon versus machine learning. J. Consum. Mark. 34(6), 480–488 (2017) 48. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in Proceedings of the 26th International Conference on Neural Information Processing Systems (2013), pp. 3111–3119 49. P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
5 A Two-Step Approach to Boost Neural Network Generalizability in Predicting Defective Software

Alexandre Nascimento, Vinicius Veloso de Melo, Marcio Basgalupp, and Luis Alberto Viera Dias
Abstract
With society’s digitalization, the ever-growing dependence on software increased the negative impact of poor software quality. That impact was estimated at $2.41 trillion to the US economy in 2022. In searching for better tools for supporting quality assurance efforts, such as software testing, many studies have demonstrated the use of Machine Learning (ML) classifiers to predict defective software modules. They could be used as tools to focus test efforts on the potentially defective modules, enhancing the results achieved with limited resources. However, the practical applicability of many of those studies is arguable because of (1) the misuse of their training datasets; (2) the improper metrics used to measure those classifiers’ performance; (3) the use of data from only a system or project; and (4) the use of data from only a computer programing language. When those factors are not considered, the experiments’ results are biased towards a very high accuracy, leading to improper conclusions related to the generalizability of classifiers to practical uses. This study sheds light on those issues and points out promising results by proposing and testing the cross-project and cross-language generalizability of A. Nascimento () Lemann Center for Educational Entrepreneurship and Innovation, Stanford University, Stanford, CA, USA e-mail: [email protected] V. V. de Melo Data Science Department, Verafin Inc., Winnipeg, MB, Canada M. Basgalupp Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo, São José dos Campos, Brazil e-mail: [email protected] L. A. V. Dias Software Engeneering Department, Instituto Tecnologico de Aeronautica, São José dos Campos, Brazil
a novel 2-step approach for artificial neural networks (ANN) using a large dataset of 17,147 software modules from 12 projects with distinct programming languages (C, C++, and Java). The results demonstrated that the proposed approach could deal with an imbalanced dataset and outperform a similar ANN trained with the conventional approach. Moreover, the proposed approach was able to improve by 277% the number of defective modules found with the same software test effort. Keywords
Machine learning · Artificial neural network · Generalizability · Software testing · Efficiency · Efficacy · Defect prediction · Software testing effort · Tuning
5.1 Introduction
As society becomes more digitalized, the economic impact of software quality issues is increasing. The cost of software bugs to economies has been growing despite all the efforts for better software quality assurance. A study commissioned by the Department of Commerce's National Institute of Standards and Technology (NIST) estimated in 2002 that software bugs cost the US economy from $22.2 to $59.5 billion annually [1, 2]. In 2019, Boeing alone lost $29 billion [3] of its market value because of issues with the Boeing 737 MAX 8 autopilot [4] that resulted in 2 fatal accidents accounting for over 350 deaths [5], and all those aircraft being grounded for months worldwide. This value does not account for the costs of compensating the victims' families and the airlines whose aircraft were grounded. A study by the Consortium for Information and Software Quality (CISQ) recently estimated the cost of
poor-quality software in the US in 2022 as approximately $2.41 trillion [6]. Even though it is impossible to test all the possible conditions of software [7, 8], software testing consumes from 25% to 50% of the total budget of development projects [9]. Since testing is critical to the final product's quality [10] and the available resources are always scarce, managers must master the art of safely reducing the test scope [11] and wisely allocating the available resources [12] to test the software as much as possible [9]. Machine learning (ML) models to predict software defects [24] can shed light on the portion of the system where the majority of defects are concentrated [13–15]. That can help managers focus their available resources on testing those portions more intensively and reduce the test scope. Those models are classifier algorithms, such as artificial neural networks (ANN), which are trained with a historical dataset encompassing a set of static code metrics extracted from each module's source code and a feature (class) indicating whether the module had a defect or not [16–20]. Many studies on ML models to predict software defects can be found. Most of them are based on NASA open datasets [21, 22], which may explain why the models' performances are reported to be similar, as observed by [23]. However, those studies have common limitations [22]. First, each classifier is trained with a unique dataset corresponding to one software system, and the test and validation are performed with a portion of the unused data from the same dataset. Therefore, all the reported results do not account for a cross-project test, which would happen if a classifier trained with the dataset from one project were tested with the dataset from another project. Moreover, they do not account for a cross-language test, which would happen if a classifier trained with the dataset of a system written in one programming language (e.g., Java) were tested with the dataset of another system written in a distinct programming language (e.g., C++). Those two tests would validate the classifier's generalization capability. Second, most studies do not account for the imbalance of the dataset used to induce the ML models, since they have more instances of the non-defective class than of the defective class. Thus, the reported results are biased towards the majority class, resulting in high accuracy levels. Those high accuracies are misinterpreted as good results, fueling unrealistic optimism, which could result in bad decisions if those models are used to support managers. Moreover, they hide the classifiers' actual performance, which could be uncovered if more appropriate evaluation metrics were used. In this context, the present study tackles those limitations, proposing a novel 2-step ANN learning approach to reduce the effects of dataset imbalance and improve classifier performance, and testing it on a dataset merging multiple datasets from distinct projects based on different programming languages. To the best of our knowledge, no other study has used the proposed approach in the defect prediction domain and validated its generalization ability in the way executed here. This study is organized into five sections. Section 5.2 presents the related studies. Section 5.3 explains the proposed approach and presents the materials and methods used to support the experiments. Section 5.4 presents the experimental results and discussion. Section 5.5 presents this study's conclusions.
5.2 Related Studies
There are many studies on defect prediction models based on software source code metrics in the literature [13, 15, 17, 18, 23–45]. Those studies frequently rely on many types of source-code metrics, such as size [42], complexity [42], McCabe [43], or object-oriented [44] metrics. Authors hold distinct points of view on the importance of each metric type, supported by different studies' findings, which indicates issues with those models' generalizability. For example, while [46] found the metrics related to size and complexity were not enough for predicting real-time software defects, [44] found object-oriented metrics to enhance the models' results, and [40] found that the way the source code metrics are used has higher importance than the type of metrics used. The literature also reports many distinct ML techniques used to create those models, such as logistic regression [39], fuzzy logic [43], Naive Bayes [24], support vector machines (SVM) [17], decision trees [39], ANN [15], Bayesian networks [27], deep learning [47, 48], Random Forest [39], or other ensemble-based techniques [29]. Most of those studies use one or more of the modules from the NASA MDP datasets [21, 22]. However, a study [22] has pointed out that the misuse of this dataset in defect prediction experiments leads to erroneous findings. Studies frequently rely on the ML classifiers' accuracies for comparing their performance, which can lead to wrong conclusions when training datasets are imbalanced. In addition, many studies rely on an improper strategy of splitting data into training and validation sets (as further discussed), which can cause ML classifiers to reach high accuracy levels that are not sustained when predicting defects in new modules (low generalizability). Also, most studies do not induce a single classifier using data from multiple projects (cross-project) and multiple programming languages (cross-language); when they do, they segment the datasets and induce a different classifier per segment, which can result in low generalizability. Therefore, to the best of our knowledge, no study besides the present one has tested an ANN-based software defect predictor trained with a novel 2-step learning approach using an extensive software metrics dataset of 17,147 modules from 12 projects based on 3 distinct programming languages (C, C++, and Java).
5.3 Materials and Methods
5.3.1 Dataset
The dataset was built by merging the data from the pre-cleaned [22] datasets of 12 NASA projects [49]. Because each source dataset contains distinct sets of software metrics, an effort was made to identify the common set of metric types among them before merging their data. Nineteen numerical features (metrics) were identified as the common set of metrics across all datasets. They characterize code features associated with software quality: distinct lines-of-code measures, McCabe metrics, base and derived Halstead measures, and a branch count [50–53]. The other metrics present in each dataset were removed. Then, the 12 datasets were merged into one single dataset. The resulting cross-project and cross-language dataset contains source code metrics encompassing 17,147 software modules in C, C++, and Java, where 17.1% of the modules have defects. There are no missing values. Since the number of instances for each class is not similar, the dataset is imbalanced. However, no technique, such as oversampling [54], undersampling [54], case weighting [55], cost-sensitive learning [39], or the synthetic minority oversampling technique (SMOTE) [56], was used to balance the dataset classes. That was intentional, because the present study seeks to validate a novel approach for dealing with imbalanced datasets.
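A minimal pandas sketch of the merge just described; the file names and the defect-label column name are assumptions, not the actual NASA MDP file layout.

```python
# Hedged sketch: merge the 12 cleaned project datasets on their common
# metric columns (file and column names are assumptions).
from functools import reduce
import pandas as pd

paths = ['CM1.csv', 'JM1.csv', 'KC1.csv']  # ...extend to the 12 project files
frames = [pd.read_csv(p) for p in paths]

# Keep only the columns present in every project (19 metrics + defect label).
common = sorted(reduce(lambda a, b: a & b, (set(f.columns) for f in frames)))
merged = pd.concat([f[common] for f in frames], ignore_index=True)

print(len(merged))               # 17,147 modules in the paper's dataset
print(merged['defects'].mean())  # ~0.171 defective (label name assumed)
```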
5.3.2 Machine Learning Approach and Experimental Protocol
An ANN learns by having its weights adjusted to minimize the error given by the difference between its predictions and the ground truth. Thus, ANN training is an optimization problem. Traditionally, ANN training is executed in a single step. Here, a novel 2-step learning approach (Fig. 5.1) is used to reduce the effects of the imbalanced dataset and improve the classifier's performance in predicting which software modules are defective. The method adds a step that uses a second-order optimization method for tuning the ANN right after its training with a first-order optimization method. The stochastic gradient descent algorithm known as backpropagation was the first-order optimization method used to adjust the weights of the neurons during step 1. The second-order optimization method used during step 2 to tune the weights, combined with the coefficients of each neuron's activation function individually, was the quasi-Newton method known as Broyden-Fletcher-Goldfarb-Shanno (BFGS) [57, 58].
The novelty here is adding a learning step and adjusting the shapes of the neurons' activation functions combined with the traditional weight adjustments, seeking superior results. Since the second step is based on a higher-order optimization, it uses a Hessian, which brings more information to drive the optimization process toward a better result. Moreover, since it also optimizes each neuron's activation function shape by adjusting its coefficients, the second step creates more opportunities for superior results. Thus, the proposed approach is expected to enhance the explore-exploit ability during the ANN learning phase. Both steps are executed until no change greater than float precision is detected in the loss function, which is considered convergence. It is noteworthy that the first step is the most traditional method to train an ANN; therefore, in the present study's experimental protocol, its results were considered the baseline for comparing the improvements added by the second step. At the end of the second step, the ANN neurons' sigmoidal activation functions achieve a heterogeneous pattern of shapes (in red in Fig. 5.1). The nonlinearity of the sigmoidal activation function was expected to enhance the opportunities for superior ANN tuning in the second step. The ANN architecture is a fully connected MLP (each neuron's output is connected to all the following layer's neurons) (Fig. 5.1) because it has an excellent cost-benefit ratio, requiring less computation power than deep ANNs. It encompassed an input layer with an input neuron for each dataset attribute (nf = 19 neurons, excluding the classification attribute), an intermediate layer with ni = 8 neurons, and an output layer with two neurons corresponding to the two dataset classes: defective and non-defective. The experimental protocol used a 10-fold cross-validation training/testing strategy to ensure a more reliable validation of the proposed technique [59] by smoothing the extreme effects of the luckiest and unluckiest data selections for training and testing. In the present study's domain, cross-validation supports more realistic conclusions [59] than the traditional train/test dataset split. The learning rate was set to 0.01. At the beginning of stage 1, the MLP weights were initialized with random values (normal distribution [0,1]), and the 10-fold cross-validation splits were randomly drawn. Then, the trained MLP (pre-tuning) is submitted to the second stage for tuning. During the second stage, the MLP weights and each neuron's sigmoidal activation function are adjusted towards a lower sum of squared errors until convergence. At the end of each stage, the performance evaluation metrics (Sect. 5.3.3) are computed and stored. Finally, a t-test is used to compare their values before and after ANN tuning, aiming to test the statistical significance of the differences. Everything was implemented in Python 3.6.13 code in a Jupyter Notebook, and the Python library scipy 1.5.4 was used to implement the BFGS optimizer in the second step.

Fig. 5.1 Experimental protocol
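To make the two-step idea concrete, here is a compact NumPy/SciPy sketch, not the authors' implementation: it flattens the MLP weights together with per-neuron sigmoid slope coefficients and lets BFGS minimize the sum of squared errors over all of them. The toy data, the single-output sigmoid (the paper uses two output neurons), and the slope parameterization of the activation shapes are simplifying assumptions; step 1's backpropagation pass is skipped for brevity.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
nf, ni = 19, 8  # input features and hidden units, as in the paper

def unpack(theta):
    """Split the flat vector into weights, biases, and per-neuron sigmoid
    slopes (the extra coefficients tuned jointly in step 2)."""
    i = 0
    W1 = theta[i:i + nf * ni].reshape(nf, ni); i += nf * ni
    b1 = theta[i:i + ni]; i += ni
    w2 = theta[i:i + ni]; i += ni
    b2 = theta[i]; i += 1
    a1 = theta[i:i + ni]; i += ni   # hidden-layer activation slopes
    a2 = theta[i]                   # output activation slope
    return W1, b1, w2, b2, a1, a2

def predict(theta, X):
    W1, b1, w2, b2, a1, a2 = unpack(theta)
    h = 1.0 / (1.0 + np.exp(-a1 * (X @ W1 + b1)))    # shaped sigmoids
    return 1.0 / (1.0 + np.exp(-a2 * (h @ w2 + b2)))

def sse(theta, X, y):
    # Loss used in the paper: sum of squared errors.
    return np.sum((predict(theta, X) - y) ** 2)

# Toy stand-in for the 19-metric module dataset, ~17% defective.
X = rng.normal(size=(200, nf))
y = (rng.random(200) < 0.17).astype(float)

n_params = nf * ni + ni + ni + 1 + ni + 1
theta0 = rng.normal(scale=0.5, size=n_params)
theta0[-(ni + 1):] = 1.0  # start from standard sigmoid shapes

# Step 2: BFGS jointly tunes weights AND activation coefficients.
res = minimize(sse, theta0, args=(X, y), method='BFGS')
print('final SSE:', res.fun)
```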
5.3.3 Evaluation Metrics
An ML classifier should not output false negatives (FN) or false positives (FP). That is, it should not classify any positive class as a negative one (FN) or a negative class as a positive one (FP). A good classifier can correctly identify the true positive (TP) and true negative (TN) classes. Most evaluation metrics capture the relationship between TP, TN, FP, and FN. Accuracy (A) [60] is the proportion of instances classified correctly [(TP + TN)/(TP + TN + FP + FN)]. Recall (R) [60] is the proportion of actual defective modules that were classified correctly [TP/(TP + FN)]. Precision (P) [60] measures the accuracy of instances classified as positive, that is, the ratio of instances classified as positive that actually belong to the positive class [TP/(TP + FP)]. There is a trade-off between P and R because, during the learning phase, the ML adjusts the threshold value used to determine whether a class is positive or negative. Thus, depending on the adjustment, it might increase R and decrease P or vice-versa. Therefore, the F1-score (F1) supports a combined evaluation of both metrics. F1 is the harmonic mean of P and R [(2 × P × R)/(P + R)]. Because of that trade-off, the area under the Precision-Recall curve (AUC-PR) is another metric used to evaluate model quality. AUC-PR is the area under the curve plotting the P-R trade-off values, and it is more appropriate than AUC-ROC for evaluating imbalanced datasets [61, 62]. Higher AUC-PR values indicate superior ANNs. Finally, the machine learning algorithm seeks to minimize the loss (L) function during the ANN training process. L was chosen as the sum of the squared errors (between ground truth and prediction).
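The metrics above map directly onto scikit-learn helpers; a small self-contained sketch with placeholder predictions:

```python
# Computing A, P, R, F1, and AUC-PR for a toy set of predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, precision_recall_curve, auc)

y_true = [0, 0, 1, 1, 0, 1, 0, 0]                    # 1 = defective module
y_score = [0.1, 0.4, 0.8, 0.3, 0.2, 0.9, 0.5, 0.05]  # classifier scores
y_pred = [int(s >= 0.5) for s in y_score]

print('A :', accuracy_score(y_true, y_pred))
print('P :', precision_score(y_true, y_pred))
print('R :', recall_score(y_true, y_pred))
print('F1:', f1_score(y_true, y_pred))

# AUC-PR, preferred here over AUC-ROC because the dataset is imbalanced.
p, r, _ = precision_recall_curve(y_true, y_score)
print('AUC-PR:', auc(r, p))
```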
5.4 Results and Discussion
Table 5.1 shows the results for the eleven performance metrics measured in each learning step (training and tuning) to evaluate the proposed technique's performance when dealing with an imbalanced dataset. The table presents the average, standard deviation, minimum, and maximum values measured during each step for each performance metric. A column (Improv.) indicates the percentage improvement achieved by the proposed technique. The last column (Sig.) indicates the statistical significance level of the differences in the measurements between the steps. As expected and reported in the literature, the ANN trained by the traditional approach (step 1) reached a considerably high average accuracy (82.9%) in predicting the defective classes. However, the analysis of TP and FP reveals the common reason why this result, favorable at first glance, actually denotes poor performance: the ANN classified all the dataset's instances as non-defective, and since the dataset is imbalanced with only 17.1% defective modules, this misclassification strategy still results in high average accuracy. The average accuracy has a slight, statistically insignificant drop when the tuning is applied. However, the ANN presents a considerable improvement by starting to try to predict the defective classes, as indicated by TP and FP. Those values presented very high statistical significance in the improvement. Those results translate into a statistically significant superiority of the R, P, and F values after the ANN tuning, although those values are not very high. L values were also slightly reduced, but no statistical significance was found for the improvement. Finally, both AUC-ROC and AUC-PR improved considerably, with statistical significance. Therefore, the results of the ANN trained with the conventional approach using a cross-project and cross-language dataset were poor.
Table 5.1 Experiment results

| Performance metric | Training (step 1) Avg | Std. Dev | Min | Max | Tuning (step 2) Avg | Std. Dev | Min | Max | Improv. | Sig. |
|---|---|---|---|---|---|---|---|---|---|---|
| True Positive [TP] | 0 | 0 | 0 | 0 | 11.7 | 7.8 | 2 | 31 | Inf | *** |
| True Negative [TN] | 1422.3 | 15.284 | 1399 | 1446 | 1415.9 | 14.7 | 1395 | 1440 | −0.4% | |
| False Positive [FP] | 0 | 0 | 0 | 0 | 6.4 | 5.8 | 1 | 22 | Inf | ** |
| False Negative [FN] | 292.4 | 15.193 | 269 | 316 | 280.7 | 18.0 | 258 | 308 | −4.0% | |
| Accuracy [A] | 82.9% | 0.9% | 81.6% | 84.3% | 83.3% | 0.9% | 82.0% | 84.5% | 0.5% | |
| Precision [P] | N/A | N/A | N/A | N/A | 67.0% | 10.3% | 50.0% | 90.0% | N/A | *** |
| Recall [R] | 0.0% | 0.0% | 0.0% | 0.0% | 4.0% | 2.7% | 0.6% | 10.7% | Inf | *** |
| F-Measure [F] | N/A | N/A | N/A | N/A | 7.4% | 4.5% | 1.3% | 18.0% | N/A | *** |
| Loss [L] | 14.6% | 0.5% | 13.9% | 15.5% | 13.9% | 1.2% | 12.8% | 17.0% | −4.8% | |
| [AUC-PR] | 73.6% | 2.7% | 67.4% | 76.8% | 80.2% | 1.4% | 78.0% | 82.3% | 9.0% | *** |
| [AUC-ROC] | 53.9% | 5.6% | 44.8% | 60.1% | 68.0% | 1.7% | 65.6% | 71.1% | 26.2% | *** |

Significance levels: * 5%, ** 0.1%, *** 0.01%
The poor generalization ability of a model based on a heterogeneous dataset can be dangerous because of the high average accuracy due to its imbalance. That could mislead unadvised managers who might trust its predictions and mistakenly reduce the software testing effort drastically because the predictions indicate non-defective modules. However, if the proposed approach tunes the ANN, its generalization ability with a heterogeneous and imbalanced dataset is considerably improved, as demonstrated by the experiments. Therefore, the results indicate that the proposed approach can enhance ANN learning and significantly improve the results achieved with the available software test resources by suggesting a wiser allocation of the test effort. If a manager relied solely on the model predictions presented here, after the ANN tuning, the test scope would be limited to 181 modules, which is calculated as 10 times (the number of folds in the cross-validation) the average number of modules indicated as defective by the ANN (TP + FP). Those tests would uncover 117 defective modules (10 × TP). On the other hand, as a benchmark, if the same effort to test 181 modules were not driven by the ANN or another analytics tool, the number of defective modules expected to be identified would be 31 (17% of 181), since the probability of finding a defective module given the class distribution is 17%. Thus, the proposed approach would help a software test team improve their efficacy by over 277%, which is a remarkable result. The presented results could potentially be enhanced if the training dataset were balanced. However, no balancing approach was used, in order to validate the proposed technique's ability to deal with an imbalanced dataset and to prove it handles the problem better than regular ANN training.
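The effort arithmetic above follows directly from Table 5.1's per-fold averages; a short sketch of the calculation:

```python
# Reproducing the 181/117/31 benchmark numbers from Table 5.1.
folds = 10
tp, fp = 11.7, 6.4                   # per-fold averages after tuning

scope = round(folds * (tp + fp))     # modules flagged for testing: 181
found_with_ann = round(folds * tp)   # defective modules uncovered: 117
found_random = round(scope * 0.171)  # expected at the 17.1% base rate: 31

print(scope, found_with_ann, found_random)
print(f'{found_with_ann / found_random - 1:.0%} improvement')  # ~277%
```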
5.5 Conclusion
There are many studies on ML classifiers to predict defective software modules. They can become a helpful test planning tool for supporting testers to invest their scarce resources selectively to test the software modules most prone to defects. Beyond the wiser allocation, those tools could reduce the economic impact of defective software in a world with an ever-growing dependence on software. However, most studies have issues that could lead to wrong conclusions and low generalizability. In this context, this study proposed an approach to enhance the use of ML classifiers for predicting defective software modules even when the dataset is imbalanced. By using a novel ANN 2-step learning approach, the results of an ANN classifier were improved: (1) TP, A, P, R, F, AUC-ROC, and AUC-PR were increased; and (2) FN and L were reduced. The present study has some limitations. Only a single ANN architecture was tested. Furthermore, only a single ANN configuration and the sigmoidal activation function were evaluated. Although keeping the dataset imbalanced was intended to evaluate the proposed technique's ability to deal with such situations, this can be considered another limitation, since superior results could be achieved with balancing. Future studies will address those limitations.
References 1. G. Tassey, The economic impacts of inadequate infrastructure for software testing. Natl. Inst. Stand. Technol. RTI Proj. 7007(011), 429–489 (2002) 2. R. Cohane, Financial Cost of Software Bugs, Medium (2017). [Online]. Available: https://medium.com/@ryancohane/financialcost-of-software-bugs-51b4d193f107 3. A. Root, Boeing Stock’s $29 Billion in Lost Value Tells a Story About Earnings, Barron’s (2019, March 13). [Online]. Available: https://www.barrons.com/articles/boeing-stockcrash-market-value-earnings-51552425879 4. Ben, Regulators Discover New 737 MAX Autopilot Problem, One Mile at a Time (2019) [Online]. Available: https:// onemileatatime.com/737-max-autopilot-problems/ 5. N. Rivero, Everything we know about the Boeing 737 Max 8 crisis, Quartz (2019). [Online]. Available: https://qz.com/1578227/ everything-we-know-about-the-boeing-737-max-8-crashes/
6. H. Krasner, The cost of poor software quality in the US: A 2022 report, in Proc. Consort. Inf. Softw. Qual., 2022 7. B. Beizer, Software Testing Techniques (1990) 8. L. Copeland, A Practitioner's Guide to Software Test Design (Artech House, 2004) 9. K. Li, M. Wu, Effective Software Test Automation: Developing an Automated Software Testing Tool (John Wiley & Sons, 2006) 10. R.M. Trayahú Filho, E. Rios, Projeto & engenharia de software: teste de software (Alta Books, Rio de Janeiro, 2003) 11. M. Rätzmann, C. De Young, Software Testing and Internationalization (Lemoine International, Incorporated, 2003) 12. B. Broekman, E. Notenboom, Testing Embedded Software (Pearson Education, 2003) 13. G. Mauša, T.G. Grbac, Co-evolutionary multipopulation genetic programming for classification in software defect prediction: An empirical case study. Appl. Soft Comput. 55, 331–351 (2017) 14. T.G. Grbac, P. Runeson, D. Huljenić, A second replicated quantitative analysis of fault distributions in complex software systems. IEEE Trans. Softw. Eng. 39(4), 462–476 (2013) 15. A.M. Nascimento, V.V. de Melo, L.A.V. Dias, A.M. da Cunha, Increasing the prediction quality of software defective modules with automatic feature engineering. Inf. Technol. Gener., 527–535 (2018) 16. J. Li, P. He, J. Zhu, M.R. Lyu, Software Defect Prediction via Convolutional Neural Network, in Software Quality, Reliability and Security (QRS), 2017 IEEE International Conference on, 2017, pp. 318–328 17. K.O. Elish, M.O. Elish, Predicting defect-prone software modules using support vector machines. J. Syst. Softw. 81, 649–660 (2008) 18. H. Zhang, X. Zhang, M. Gu, Predicting defective software components from code complexity measures, in Dependable Computing, 2007. PRDC 2007. 13th Pacific Rim International Symposium on, 2007, pp. 93–96 19. M.J. Ordonez, H.M. Haddad, The state of metrics in software industry, in Information Technology: New Generations, 2008. ITNG 2008. Fifth International Conference on, 2008, pp. 453–458 20. S.G. Shiva, L.A. Shala, Software reuse: Research and practice, in Information Technology, 2007. ITNG'07. Fourth International Conference on, 2007, pp. 603–609 21. M. Shepperd, Q. Song, Z. Sun, C. Mair, Data quality: Some comments on the NASA software defect datasets. IEEE Trans. Softw. Eng. 39(9), 1208–1215 (2013) 22. D. Gray, D. Bowes, N. Davey, Y. Sun, B. Christianson, The misuse of the NASA metrics data program data sets for automated software defect prediction, in Evaluation & Assessment in Software Engineering (EASE 2011), 15th Annual Conference on, 2011, pp. 96–103 23. D. Bowes, T. Hall, J. Petrić, Software defect prediction: Do different classifiers find the same defects? Softw. Qual. J., 1–28 (2017) 24. T. Menzies, J. DiStefano, A. Orrego, R. Chapman, Assessing predictors of software defects, in Proceedings of Workshop Predictive Software Models, 2004 25. Y. Zhou, H. Leung, Predicting object-oriented software maintainability using multivariate adaptive regression splines. J. Syst. Softw. 80(8), 1349–1361 (2007) 26. C. Chang, C. Chu, Y. Yeh, Integrating in-process software defect prediction with association mining to discover defect pattern. Inf. Softw. Technol. 51(2), 375–384 (2009) 27. D. Rodriguez, J. Dolado, J. Tuya, Bayesian concepts in software testing: an initial review, in Proceedings of the 6th International Workshop on Automating Test Case Design, Selection and Evaluation, 2015, pp. 41–46 28. Z. Ali, M.A. Mian, S.
Shamail, Knowledge-based systems improving recall of software defect prediction models using association
mining. Knowledge-Based Syst. 90, 1–13 (2015) 29. S.S. Rathore, S. Kumar, Towards an ensemble based system for predicting the number of software faults. Expert Syst. Appl. 82, 357–382 (2017) 30. T. Menzies, J.S. Di Stefano, How good is your blind spot sampling policy, in High Assurance Systems Engineering, 2004. Proceedings. Eighth IEEE International Symposium on, 2004, pp. 129–138 31. L. Kumar, S. Misra, S. Ku, An empirical analysis of the effectiveness of software metrics and fault prediction model for identifying faulty classes. Comput. Stand. Interfaces 53, 1–32 (2017) 32. R. Moussa, D. Azar, A PSO-GA approach targeting fault-prone software modules. J. Syst. Softw. 132, 41–49 (2017) 33. S.S. Rathore, S. Kumar, Knowledge-based systems linear and nonlinear heterogeneous ensemble methods to predict the number of faults in software systems. Knowledge-Based Syst. 119, 232–256 (2017) 34. M.J. Siers, Z. Islam, Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem. Inf. Syst. 51, 62–71 (2015) 35. L. Tian, A. Noore, Evolutionary neural network modeling for software cumulative failure time prediction. Reliab. Eng. Syst. Saf. 87(1), 45–51 (2005) 36. C. Catal, B. Diri, Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem. Inf. Sci. (Ny). 179(8), 1040–1058 (2009) 37. C. Andersson, P. Runeson, A replicated quantitative analysis of fault distributions in complex software systems. IEEE Trans. Softw. Eng. 33(5) (2007) 38. C. Catal, B. Diri, A systematic review of software fault prediction studies. Expert Syst. Appl. 36(4), 7346–7354 (2009) 39. A. Moreira Nascimento, L.F. Vismari, P.S. Cugnasca, J.B. Camargo Júnior, J. de Almeira Júnior, A Cost-Sensitive Approach to Enhance the use of ML Classifiers in Software Testing Efforts, in 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), 2019, pp. 1806–1813 40. T. Menzies, J. Greenwald, A. Frank, Data mining static code attributes to learn defect predictors. IEEE Trans. Softw. Eng. 33(1), 2–13 (2007) 41. E. Arisholm, L.C. Briand, E.B. Johannessen, A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J. Syst. Softw. 83(1), 2–17 (2010) 42. N.E. Fenton, M. Neil, A critique of software defect prediction models. IEEE Trans. Softw. Eng. 25(5), 675–689 (1999) 43. P. Ranjan, S. Kumar, U. Kumar, Software fault prediction using computational intelligence techniques: a survey. Indian J. Sci. Technol. 10(18) (2017) 44. D. Radjenović, M. Heričko, R. Torkar, A. Živkovič, Software fault prediction metrics: A systematic literature review. Inf. Softw. Technol. 55(8), 1397–1418 (2013) 45. K. Pan, S. Kim, E.J. Whitehead, Toward an understanding of bug fix patterns. Empir. Softw. Eng. 14(3), 286–315 (2009) 46. V.U.B. Challagulla, F.B. Bastani, I. Yen, R.A. Paul, Empirical assessment of machine learning based software defect prediction techniques, in Object-Oriented Real-Time Dependable Systems, 2005. WORDS 2005. 10th IEEE International Workshop on, 2005, pp. 263–270 47. E.N. Akimova et al., A survey on software defect prediction using deep learning. Mathematics 9(11), 1180 (2021) 48. L. Qiao, X. Li, Q. Umer, P. Guo, Deep learning based software defect prediction. Neurocomputing 385, 100–110 (2020) 49. M. Chapman, P. Callis, T.
Menzies, JM1/software defect prediction, December 2004 50. J.E. Gaffney Jr, Metrics in software quality assurance, in Proceedings of the ACM’81 Conference, 1981, pp. 126–130
51. M.H. Halstead, Toward a theoretical basis for estimating programming effort, in Proceedings of the 1975 Annual Conference, 1975, pp. 222–224 52. T.J. McCabe, C.W. Butler, Design complexity measurement and testing. Commun. ACM 32(12), 1415–1425 (1989) 53. T.J. McCabe, A complexity measure. IEEE Trans. Softw. Eng. 4, 308–320 (1976) 54. R. Mohammed, J. Rawashdeh, M. Abdullah, Machine learning with oversampling and undersampling techniques: overview study and experimental results, in 2020 11th international conference on information and communication systems (ICICS), 2020, pp. 243–248 55. R.L. Chambers, Robust case-weighting for multipurpose establishment surveys. J. Off. Stat. 12, 3–32 (1996) 56. A. Fernández, S. Garcia, F. Herrera, N.V. Chawla, SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905 (2018)
57. C. Zhu, R.H. Byrd, P. Lu, J. Nocedal, Algorithm 778: L-BFGSB: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 23(4), 550–560 (1997) 58. D.-H. Li, M. Fukushima, On the global convergence of the BFGS method for nonconvex unconstrained optimization problems. SIAM J. Optim. 11(4), 1054–1064 (2001) 59. M. Stone, Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B 36(2), 111–133 (1974) 60. N. Seliya, T.M. Khoshgoftaar, J. Van Hulse, A study on the relationships of classifier performance metrics, in 2009 21st IEEE international conference on tools with artificial intelligence, 2009, pp. 59–66 61. T. Saito, M. Rehmsmeier, The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 10(3), e0118432 (2015) 62. H.R. Sofaer, J.A. Hoeting, C.S. Jarnevich, The area under the precision-recall curve as a performance metric for rare binary events. Methods Ecol. Evol. 10(4), 565–577 (2019)
6 A Principal Component Analysis-Based Scoring Mechanism to Quantify Crime Hot Spots in a City

Yu Wu and Natarajan Meghanathan
Abstract
Hot spots policing is a tactic that judiciously distributes police resources in accordance with regional historical data on criminal occurrences and local crime patterns. Unquestionably, the key to this method is identifying crime hot spots. A growing number of studies are looking into how to pinpoint crime hot spots with greater accuracy. Nevertheless, the majority of them merely treat the task as a binary classification problem. Our research proposes the notion of a Crime Hot Spot Score, a Principal Component Analysis (PCA)-based linear scoring mechanism for assessing regional crime severity, which equips data users with a more flexible way of utilizing crime hot spot analysis results. We conducted our study on a three-year crime dataset from the Boston Police Department. Our preliminary results are encouraging: we not only provide a new perspective on hot spot detection, but also reveal the correlation between a crime hot spot and its adjacent area.

Keywords
Crime · Crime patterns · Hot spots policing · Complex network analysis · Machine learning · Data-driven policing · Spatio-temporal data · Heat map · Crime hot spots detection · Principal component analysis
Y. Wu · N. Meghanathan () Department of Electrical & Computer Engineering & Computer Science, Jackson State University, Jackson, MS, USA e-mail: [email protected]; [email protected]
6.1 Introduction
Data science experts and law enforcement organizations can now analyze crime trends more precisely by making use of recent breakthroughs in the area. More and more police forces are turning to data-driven policing, a type of law enforcement method that relies on data and analytics to prevent or deter crimes. This approach has become increasingly popular in recent years as police departments gain access to larger amounts of data and better tools for analyzing, predicting, and using it. There are many benefits of data-driven policing, including the ability to target resources more effectively, identify crime patterns more quickly, and improve overall public safety. Additionally, data-driven policing can help build trust between the police and the community by increasing transparency and accountability. Despite these advantages, data-driven policing is not without its critics. Some worry about its accuracy and robustness. Unlike the weather, which can be forecasted relatively accurately, crime is much harder to predict because of its substantial occasionality. Others argue that data-driven policing relies too heavily on technology and could never replace the role of humans in law enforcement. Regardless of the debate, it is clear that data-driven policing is here to stay. As police departments around the world continue to collect more data, they will need to find better ways to use it effectively to solve crime and keep the public safe. Hot spot policing is a frequently used method to smartly allocate police resources. On the basis of historical crime records, police departments often work to develop more effective measures to deter crime. By using various methods to locate areas with high crime rates, they can then prioritize where to allocate police resources. This may include increasing patrols in high-crime areas, as well as working with community members and local businesses to
address any underlying issues that may be contributing to crime. Most researchers in the field of crime hot spot detection use machine learning methods to carry out their predictions and detections [1]. This is because machine learning is able to take in large amounts of data and identify patterns that may not be immediately obvious to the human eye. But because its results usually cannot be interpreted, this is also why some people feel it is unreliable. However, only a few studies have employed Principal Component Analysis (PCA) as a tool to pinpoint crime hot spots; for instance, [2] analyzes the monthly spatio-temporal distribution of crime hot spots in Shanghai and integrates PCA to unfold the underlying correlation of 18 indicators included in the crime distribution. The problem mentioned above serves as the inspiration for our study. We are striving for a balance of accuracy and trustworthiness in our hot spot detection study. Thus, this study looks for a novel method for identifying urban crime hot spots as well as quantifying the severity of the hot spots.

Contribution: We propose a Principal Component Analysis-based approach to quantify the severity with which a location could serve as a hot spot in a city (referred to as the Crime Hot Spot Score). We enrich the methods that utilize historical crime data to find crime hot spots within an area and provide an innovative perspective on the application of data mining in criminology. Our work provides crime data users and law enforcement a reliable, interpretable crime hot spot detection method.

This paper is organized as follows: Section 6.2 lists some related works that use different approaches to detect crime hot spots. We demonstrate our dataset and the experiment method in Sect. 6.3. We present our experimental results in Sect. 6.4. Section 6.5 sums up our work in this paper and identifies some potential areas we could explore more in the future.
6.2 Related Work
Hot spot policing is a policing method that adjusts the allocation of police resources according to regional crime levels. Numerous studies have demonstrated its effectiveness in preventing crime through controlled experiments, especially for violent and recurring crimes. [3] demonstrates in great detail how hot spot policing can reduce crime even more than previously believed: according to the results of those tests, hot spot-based police initiatives cause crime rates at treatment locations to drop by a statistically significant 16% compared to control locations. In another experiment from [4],
hot spot policing interventions in 20 of 25 trials resulted in a discernible drop in crime and disturbance. The rapid development of data science is leading to its increased use in criminology, particularly in the detection of crime hot spots. Many scholars have used deep learning and machine learning methods to identify crime hot spots and offer theoretical support for law enforcement decision-making. [5] uses a Random Forest model to execute binary classification to identify whether a region is a hot spot, so as to analyze the effectiveness of several data sources for predicting potential hot spots of rising interpersonal violence. [6] employs a spatio-temporal neural network model to forecast crime hot spots, and the model shows outstanding accuracy in comparison to traditional machine learning algorithms. To the best of our knowledge, ours is the first approach to use PCA to quantify the extent to which a location (among several other locations) could serve as a hot spot. The above studies demonstrate the advantages of data-driven hot spot policing; however, this strategy is still not without its difficulties. [7] points out that some assert that concentrating too much attention on regions with high crime rates may escalate tensions with specific communities. Additionally, despite the fact that experts from all over the world commit their efforts to finding a more accurate method of predicting crime hot spots, their studies have not been well accepted in practice. The unique characteristics of crime data make it challenging for academicians to find the specific information they require. Consequently, the limitation of data sources leads to predicted results that are sometimes unsatisfactory or lack generalizability.
6.3 Methodology
6.3.1 Data Description
This paper analyzes the crime incident reports documented by the Boston Police Department (BPD) for the years 2015–2018. It is an open-source dataset that contains more than 310,000 incident events responded to by BPD officers, with details such as the time, location, and crime category [8]. The top 3 categories among all the entries are Motor Vehicle Accident Response, Larceny, and Medical Assistance, which account for more than 20% of the total records. Given that Motor Vehicle Accident Response and Medical Assistance are not actually crimes/violations, we only target Larceny incidents in our work. The time information of an incident contains its occurring date and time, while the location information captures the street, district, latitude, and longitude of where it most likely took place. Since the aim of our research is to
recognize crime hot spots within the Boston area using a novel hybrid method, the street and district information is unsuitable: even though these two features give us a general idea of where an incident happens, they are too coarse in scope. Instead, the latitude and longitude data give us a more specific location of the incident, so that we can utilize them to define an appropriately sized area to be observed. For the opposite reason, we discard the specific time-stamp feature and keep only the date of occurrence. Intuitively, it is unreasonable to use a small time span (less than 24 h) to detect crime hot spots in our case, since the notion of a crime hot spot is only well-founded when the time span is reasonably long. For example, the number of crimes in an area within a short time span is more easily affected by special events happening in that area. If an event is held in a place, it is very likely that the probability of crime there will increase slightly due to the sudden influx of visitors. However, when the event is over, the probability of crime in the area is likely to fall back to previous levels. For these reasons, we needed to choose an appropriate geographic range size and time range.
6.3.2 Data Segmentation
In order to more precisely assess the extent to which a location could serve as a crime hot spot, we need to keep track of the frequency of crimes occurring in a certain region at different times. As discussed in the previous section, choosing an unreasonable area size and time span will lead to illogical results, so a practical definition of area size and time-span length is crucial. Based on the geographic information of all larceny incidents, we establish a new way of defining inspection areas in Boston. All the areas where at least one larceny incident happened are split into a 17-by-19 grid of 0.01-degree cells (0.01° latitude by 0.01° longitude), each measuring about 230 acres. Accordingly, we build constant time slices in 14-day increments; this span is long enough to reveal hot spot patterns, yet short enough to expose the variance between different time spans. Figure 6.1 displays an overview of our dataset segmentation procedure. We take the minimum latitude and longitude among the locations where larceny crimes occurred as the origin, and grow the grid in steps of 0.01 degrees until it encompasses all the locations involved. The origin is thus the extreme southwest point of the map, and the grid grows toward the northeast. The starting point of the timeline is the record of the first larceny incident in the dataset, on June 15, 2015. The timeline increases in steps of 14 days until the last recorded larceny incident in our dataset. A minimal sketch of this segmentation is given below.
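The following is a minimal sketch of the segmentation step, assuming the incidents sit in a pandas DataFrame with Lat, Long, and OCCURRED_ON_DATE columns (the column names in the actual BPD export may differ):

```python
import pandas as pd

def segment_incidents(df: pd.DataFrame) -> pd.DataFrame:
    """Assign each larceny incident to a 0.01-degree grid cell
    and a 14-day time window, as described in Sect. 6.3.2."""
    df = df.copy()
    df["OCCURRED_ON_DATE"] = pd.to_datetime(df["OCCURRED_ON_DATE"])

    # Origin: the extreme southwest point of the incident map
    lat0, lon0 = df["Lat"].min(), df["Long"].min()
    t0 = df["OCCURRED_ON_DATE"].min()

    # 0.01-degree steps toward the northeast
    df["row"] = ((df["Lat"] - lat0) // 0.01).astype(int)
    df["col"] = ((df["Long"] - lon0) // 0.01).astype(int)

    # Constant 14-day time slices starting at the first recorded incident
    df["window"] = ((df["OCCURRED_ON_DATE"] - t0).dt.days // 14).astype(int)
    return df
```

Counting incidents per (row, col, window) triple then yields the region-by-time matrix used in the next subsection.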
This dataset segmentation procedure divided our dataset into 323 grid cells, which cover all the locations (a total of 58 sub-regions) in Boston where at least one crime occurred and was documented. At the same time, from the temporal view, there are 85 time windows in our dataset, each with a fixed length of 14 days.
6.3.3 Principal Component Analysis (PCA)
Principal Component Analysis (PCA) [9] is a statistical technique used to reduce the dimensionality of data, often to make it more manageable for machine learning algorithms. Instead of keeping every feature of the original dataset, the method re-expresses the dataset along its principal components, the directions of maximal variation. PCA is effective at removing noise while preserving the key features of the dataset, making the result easier to interpret. We apply the PCA procedure to the segmented dataset introduced above. The processed information may be seen as a snapshot of the extent of crimes occurring throughout 85 time slices in 59 sub-regions of Boston. To examine crime hot spots, we take the regions as objects and the time periods as their features; the value of each feature is the number of crimes that occurred in that sub-region during that time period. PCA enables us to capture the crime occurrence patterns that are most indicative of a region, based on its logged incident data in 85 historical time periods. A sketch of this step follows.
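As a sketch, assuming the segmented DataFrame from the previous listing (with row, col, and window columns), the region-by-time count matrix and the PCA fit could look like this:

```python
import pandas as pd
from sklearn.decomposition import PCA

def fit_crime_pca(df: pd.DataFrame, n_regions: int = 59):
    """Build the (regions x time windows) count matrix and fit PCA.
    `n_regions` reflects the paper's choice of keeping the most
    active sub-regions (see Sect. 6.4)."""
    df = df.copy()
    df["cell"] = df["row"].astype(str) + "," + df["col"].astype(str)

    # Rows: sub-regions (objects); columns: 14-day windows (features);
    # values: number of incidents in that region and window.
    counts = pd.crosstab(df["cell"], df["window"])

    # Keep the sub-regions with the most incidents overall
    counts = counts.loc[counts.sum(axis=1).nlargest(n_regions).index]

    pca = PCA()  # keep all components; eigenvalues = explained_variance_
    scores = pca.fit_transform(counts.values)
    return counts, pca, scores
```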
6.3.4 Quantifying the Crime Hot Spots
Incorporating scientific and logical methods to pinpoint crime hot spots is essential to our research. We base our definition of crime hot spots on shorter time periods, instead of the complete dataset's time range, because it is simpler to determine what is "normal" in a region this way. If, on the other hand, the analysis is based on the time span of the complete dataset, the criminal incidence pattern of each region/cell is easily overlooked; moreover, it is not advisable for data users to draw conclusions from research with overly broad time periods. Additionally, our study employs scores to assess the level of crime in a location. Unlike the alternative definition, which uses an either/or approach to determine whether an area is a hotspot, our definition of crime hot spots in this paper is based on a ranking of scores. This makes the severity of crime in an area easier to see, and it is also practical for decision-makers who must focus on various locations and adopt the appropriate solutions.
Fig. 6.1 Data segmentation procedure
Fig. 6.2 Example to illustrate the computation of the Crime Hot Spot Scores using PCA
This score is calculated as the weighted average of the principal components of a location. For each location c, we define a PCA-based Crime Hot Spot Score S(c) to evaluate its crime severity:

$$S(c) = \frac{\sum_{i=1}^{n} e_i \, P_i(c)}{\sum_{i=1}^{n} e_i} \qquad (6.1)$$

where $P_i(\cdot)$ refers to the PCA mapping function in the corresponding dimension $i$, $e_i$ refers to the Eigenvalue of the corresponding Eigenvector for principal component $P_i$, and $n$ denotes the number of PCA components.
6.3.5 Example
Consider the synthetic dataset shown in Fig. 6.2, which lists the number of 'robbery'-based crimes that occurred during 16 consecutive weeks of a year at five different locations (spaces) 1 . . . 5. PCA is run on this dataset with the five locations as the records and the 16 weeks as their features. We identify the Eigenvalues and the corresponding principal components, totaling 16 dimensions. Of these, only four Eigenvalues are greater than 0 and the corresponding principal components have non-zero entries. For principal components whose Eigenvalues are 0, the entries in these
Fig. 6.3 A heat map that shows the Crime Hot Spot Scores of the sub-regions of Boston
principal components are 0 as well. The crime hot spot score for a location (space) is the weighted average of the entries in the principal components for that location, with the weights being the Eigenvalues of the corresponding principal components. Figure 6.2 shows the computation of the crime hot spot scores for the five locations. The larger the crime hot spot score, the higher a location's ranking with respect to the extent to which it could serve as a crime hot spot. In the example of Fig. 6.2, location 5 incurs the largest crime hot spot score of 2.5399 and location 4 the lowest, −2.6368. A visual inspection of the number of crimes occurring at these two locations over the 16-week period justifies their crime hot spot scores; a sketch of this computation follows.
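As an illustrative sketch (with made-up weekly counts, not the actual values from Fig. 6.2), the score of Eq. (6.1) can be computed from a fitted PCA as follows:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 5 locations x 16 weeks of robbery counts
# (illustrative numbers only, not the values from Fig. 6.2)
rng = np.random.default_rng(0)
X = rng.poisson(lam=[[1], [2], [3], [5], [9]], size=(5, 16)).astype(float)

pca = PCA()
P = pca.fit_transform(X)        # P[c, i] = P_i(c), projection of location c
e = pca.explained_variance_     # eigenvalues e_i of the covariance matrix

# Eq. (6.1): eigenvalue-weighted average of each location's components
S = (P @ e) / e.sum()

# Rank locations from most to least hot-spot-like
print(np.argsort(S)[::-1], S.round(4))
```

With only five samples, at most four eigenvalues are non-zero, mirroring the observation in the text above.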
6.4 Results
In this section, by applying the method above, we assign a Crime Hot Spot Score to each sub-region. We only keep the top 59 target regions in order to properly convey the results of the crime hot spot analysis, because fewer than 100 larceny events occurred across all time periods (over three years) in the location cells ranked 60th and beyond; such a low crime occurrence count is concrete evidence that an area is not a hot spot. In Fig. 6.3, we show a heat map of Crime Hot Spot Scores overlaid on the actual map of Boston. The sub-regions in this 13-by-13 grid are divided according to the geographical slicing method introduced before. Roughly 35% of them are assigned a valid score; the remaining locations either have no score because no larceny occurrences were reported there, or were discarded because there were not enough records. The darker a cell's color, the higher its score, and vice versa. Aside from the cells without scores, the darkest cell is cell A, with the highest Crime Hot Spot Score of 37.045, indicating an obvious larceny hot spot. The lightest cell is cell B, with the lowest score of −5.114. At the same time, it is not difficult to observe from the figure that crime hot spots are clustered: most of the darker areas are concentrated around block A, while regions adjacent to block B are significantly lighter (lower scores). That is to say, an area close to a crime hot spot has a higher probability of becoming a crime hot spot itself. This is consistent with crime pattern theory [10].
Fig. 6.4 Sorted distribution of the Crime Hot Spot Scores across the 58 sub-regions of Boston
That theory points out that a person tends to choose a place that he/she is familiar with, within his/her routine activity areas, to commit a crime. Once such a routine has been established, it remains relatively stable, which explains why crime hot spots are clustered in most cases. It is also noteworthy that the Crime Hot Spot Scores are not uniformly distributed. Figure 6.4 displays the sorted distribution of the crime hot spot scores in the Boston area. Among all the locations, only 14 receive a positive score. Moreover, the dispersion (measured as the standard deviation, STD) of the first part (ranks 1–14) is significantly higher than that of the remaining part (ranks 15–59), and also significantly higher than the overall level. We can therefore speculate that crime hot spots are also relatively stable from a temporal perspective: hot spots that score high in one time period are more likely to remain hot spots in the next time span.
6.5 Conclusions and Future Work
This study set out to explore a more effective method to quantify the extent to which different locations within a region could be crime hot spots. We created a Principal Component Analysis (PCA)-based scoring mechanism to evaluate the severity of crime occurrence. Instead of running our experiment on the actual administrative divisions, we used a smaller theoretical division to achieve a more specific visualization of the hot spots, and we sliced the entire time series into shorter time spans for the same purpose. As a result, we assigned a Crime Hot Spot Score, a weighted score based on the PCA components and their Eigenvalues, to every sub-region, enabling data users to assess whether it can be considered a crime hot spot. The results reported here shed new light on machine-learning-enhanced crime hot spot detection. Unlike the traditional binary classification used to define crime hot spots, this scoring method gives data users a precise conception of how severe crime activities in a certain area are. Furthermore, a linear assessment measure equips law enforcement agencies with a more flexible way of using data: using the crime hot spot scores, one can not only classify an area as "hot spot" or "non-hot spot", but also rank areas for policing tasks that require strict differentiation. At the same time, our experimental results contribute to the spatio-temporal study of crime. In our study, we treated the time slices as features of locations and used PCA to reduce the dimensionality; ultimately, our data was reduced to one dimension. That is to say, the importance of temporal features is very slight when conducting a spatio-temporal analysis of crime, in contrast to its spatial character. The crime heat map of Boston in this paper shows that the area around a crime hot spot has a greater probability of becoming a crime hot spot, which also suggests that, for a sub-region within an area, the proximity to a local crime hot spot and the Crime Hot Spot Score may be positively correlated. However, this study is subject to some limitations. First, law enforcement departments in different cities or countries may collect crime data in different ways; the source of crime data can influence the results. Second, it is hard to completely avoid hysteresis when using a historical-data-based scoring method: the nature of each crime category, or of the crime pattern itself, is not fixed and can change or evolve over time. Finally, the study is limited by the lack of detail on each crime event. Most open-source crime data contain only simple information such as time, location, and crime category; only a few include demographic information. With more detailed information to analyze, the outcome of our research could undoubtedly be improved. Our future research will consider generalization of the work. We will make our dataset more diverse by collecting data from various sources and run our PCA-based approach on several more crime categories. Besides, we will
start working on machine-learning integrated crime prevention research. More data mining techniques such as network science and deep learning will also be applied for better performance. Acknowledgement Research was sponsored by the Army Research Office and was accomplished under Grant Number W911NF-21-10264. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
References

1. U.M. Butt, S. Letchmunan, F.H. Hassan, M. Ali, A. Baqir, H.H.R. Sherazi, Spatio-temporal crime hotspot detection and prediction: A systematic literature review. IEEE Access 8, 166553–166574 (2020)
2. Z. Wang, J. Wu, B. Yu, Analyzing spatio-temporal distribution of crime hot-spots and their related factors in Shanghai, China, in 2011 19th International Conference on Geoinformatics, IEEE, 2011, pp. 1–6
3. A.A. Braga, D.L. Weisburd, Does hot spots policing have meaningful impacts on crime? Findings from an alternative approach to estimating effect sizes from place-based program evaluations. J. Quant. Criminol. 38, 1–22 (2020)
4. A. Braga, A. Papachristos, D. Hureau, Hot spots policing effects on crime. Campbell Syst. Rev. 8(1), 1–96 (2012)
5. D.A. Bowen, L.M. Mercer Kollar, D.T. Wu, D.A. Fraser, C.E. Flood, J.C. Moore, E.W. Mays, S.A. Sumner, Ability of crime, demographic and business data to forecast areas of increased violence. Int. J. Inj. Control Saf. Promot. 25(4), 443–448 (2018)
6. Y. Zhuang, M. Almeida, M. Morabito, W. Ding, Crime hot spot forecasting: A recurrent model with spatial and temporal information, in 2017 IEEE International Conference on Big Knowledge (ICBK), IEEE, 2017, pp. 143–150
7. A.R. Khairuddin, R. Alwee, H. Haron, A review on applied statistical and artificial intelligence techniques in crime forecasting, in IOP Conference Series: Materials Science and Engineering, vol. 551, no. 1, IOP Publishing, 2019, p. 012030
8. Boston crime incident reports, https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system, 2018
9. M. Ringner, What is principal component analysis? Nat. Biotechnol. 26(3), 303–304 (2008)
10. P. Brantingham, P. Brantingham, Crime pattern theory, in Environmental Criminology and Crime Analysis (Willan, 2013), pp. 100–116
7 Tuning Neural Networks for Superior Accuracy on Resource-Constrained Edge Microcontrollers

Alexandre M. Nascimento, Vinícius V. de Melo, and Márcio P. Basgalupp
Abstract
Approaches to tuning artificial neural networks (ANN) to run on edge devices, such as weight quantization, knowledge distillation, weight low-rank approximation, and network pruning, usually reduce their accuracy (gap 1). Moreover, they usually require at least 32-bit microcontrollers, leaving out widely used and much cheaper platforms based mostly on 8-bit microcontrollers (e.g., ATMega328p and ATMega2560), such as Arduino (gap 2). Those microcontrollers can cost between $0.01 and $0.10 at large scale and can make it viable to extend IoT applications to a wider range of cheaper personal objects, such as bottles, cans, and cups. In this context, the present study addresses those two identified gaps by proposing and evaluating a technique for tuning ANNs to run on 8-bit microcontrollers. 16,000 ANNs with distinct configurations were trained and tuned with four widely used datasets and evaluated on two 8-bit microcontrollers. Using less than 3.5 Kbytes, the embedded ANNs' average accuracies outperformed their benchmarks on a 64-bit computer.
Keywords: Edge computing · Neural network · 8-bit · IoT · Arduino · Embedded system · Machine learning · Microcontroller · Resource-constrained · Tuning

A. M. Nascimento, Stanford University, Stanford, CA, USA, e-mail: [email protected]
Vinícius V. de Melo, Verafin Inc., Winnipeg, MB, Canada, e-mail: [email protected]
M. P. Basgalupp, ICT - UNIFESP, São José dos Campos, Brazil, e-mail: [email protected]
7.1 Introduction
Artificial neural networks (ANN) have existed for many decades. Since then, many strategies to increase their learning capacity have been proposed, including different training strategies, ever-increasing training datasets, and increasing numbers of layers. Increasing the number of layers and the size of datasets generates demand for increased computational capacity and processing time, limiting possible applications on less powerful processors. It is possible to perform the ANN training in a more robust computational environment and then deploy the trained network on less powerful target devices to perform inferences [1]. However, the larger the network, the larger the target processor's minimum requirements. For this reason, most studies test optimized ANNs on 32-bit processors [5, 16]; even TensorFlow Lite requires at least a 32-bit target environment. Consequently, many applications for 8-bit processors or microcontrollers cannot be easily supported [37]. As a cheaper option, 8-bit microcontrollers still have strong market demand: their global market reached $7.14 billion in 2021 and is projected to grow 4.7% annually (CAGR) until 2027 [8]. Beyond the higher cost of powerful processors, they usually have higher energy consumption, which compromises the battery autonomy of embedded environments [1]. More complex networks demand more energy for training and inference, so the higher consumption persists throughout the network's entire life cycle, increasing its total lifetime cost. Therefore, enhancing the results of lighter ANNs, such as the widely used multilayer perceptrons (MLP), could enable a broader range of applications on cheaper microprocessors/microcontrollers and reduce the lifetime cost of smarter IoT devices. It also helps to mitigate the current post-pandemic shortage, and higher costs, of
microcontrollers/microprocessors [23]. Moreover, the lower the hardware requirements of ANNs, the more accessible and democratized the research topic becomes for countries with fewer resources. Seeking to run ANNs on constrained devices, many tuning techniques are used, such as weight quantization [20], knowledge distillation [36], weight low-rank approximation [34], and network pruning [21]. However, those techniques usually reduce the network accuracy on the target device (microcontroller) compared to the host (computer), and they are mostly applied to ANNs too large for 8-bit microcontrollers. Few studies testing ANNs and other machine learning techniques on resource-constrained 8-bit microcontrollers were identified (see Sect. 7.2). Although some of them run MLPs on microcontrollers [17, 18, 26, 29], no study proposed an approach where the tuned ANN on the resource-constrained microcontroller outperformed the accuracy of the non-tuned ANN on the computer. Moreover, no study proposed a combined optimization of weights and activation functions that tunes each activation function's shape to seek superior performance on the target device. Those two gaps are addressed in the present study. The rest of this paper is organized as follows. Section 7.2 briefly describes related work. Section 7.3 presents the techniques, datasets, and evaluation criteria used for empirically analyzing the proposed method. Section 7.4 reports the results obtained from the empirical evaluation. Finally, the conclusions and pointers to future work are presented in Sect. 7.5.
7.2 Related Work
A systematic literature review protocol [12], traditionally used in many research fields, was used to identify the related literature. A metadata search (TAK: Title, Abstract, and Keyword) using the term {(“machine learning” OR “neural network” OR “tinyML” OR “tiny ML” OR “MLP” OR “multilayer perceptron”) AND (“Arduino Uno” OR “Arduino Mega” OR “low power processor” OR “low cost processor”)} was performed on Scopus, Engineering Village, IEEExplore, ACM DL, and SpringerLink. A Python script was used to enforce the TAK search on Springer papers, reducing the overall search result to 938 papers. Only 267 studies remained after duplicate removal and level-5 exclusion (non-peer-reviewed works, books, magazines, etc.) [12]. After all the abstracts were reviewed to check the papers' relationship to the topic of interest, only 51 studies remained. That considerable reduction in the number of papers occurred mainly for three reasons. Some works used the searched keyword combination in another context [13]. Also, most papers used Arduino boards for data acquisition while all the machine learning was executed in a more powerful computing
environment (laptops, mobile phones, or cloud) [15]. Other works target ANNs for low-cost and low-power microprocessors, but use processors much more powerful than 8-bit ones, such as 32-bit Cortex-M microcontrollers [6] or the NVIDIA Jetson Nano [33]. Those platforms have lower power and cost than high-end computers with NVIDIA GPUs, but they are orders of magnitude more expensive and powerful than the microcontrollers targeted by the present research. Those studies targeted much higher-end processors, guided by the paradigm of improving results by adding layers, while the present research aims to augment the results of simpler ANNs such as MLPs. Only 21 studies actually embedded ANNs on Arduino microcontrollers [1–4, 7, 9–11, 17–19, 22, 24–27, 29–32, 35]. Moofarry and Garcia [18] trained and verified an MLP with MATLAB, similarly to [25, 31], and evaluated the memory space required for several topologies on Arduino. Ofoli et al. [19] implemented neuro-fuzzy logic on an Arduino Mega 2560 using the embedded Fuzzy Logic Library (eFLL) for resource-constrained platforms like microcontrollers and applied it to a fire detection subsystem. Daigavane et al. [4] implemented an ANN approach for discriminating internal from non-internal faults in a transformer using MATLAB/SIMULINK and ported it to an Arduino Uno ATmega328P platform; their experiments achieved a classification accuracy of 95.63%. Sugawara et al. [28, 29] implemented an MLP on Arduino to monitor the distinctive electromagnetic (EM) traces corresponding to the activation function's operations and to evaluate the use of simple electromagnetic analysis (SEMA) for performing cyberattacks. Seshadri and Sharma [24] proposed and implemented Shiftry, an automatic compiler from high-level floating-point ML models to fixed-point C programs with 8-bit and 16-bit integers, aiming to reduce memory requirements; they implemented, ported, and tested an RNN on Arduino, achieving good speed and accuracy. Bernal-Escobedo and Santa [3] proposed and implemented emlearn, a library for machine learning (including simple ANNs) inference on microcontrollers; however, those ANNs were not optimized to achieve superior performance on microcontrollers, which is the goal of the present research. Yu [37] implemented and tested a proximal SVM to demonstrate the possibility of implementing ANN learning on an Arduino Uno. Velichko and Boriskov [35] implemented handwritten digit recognition on an Arduino Uno using LogNNet. Mattos et al. [17] ported an MLP to an Arduino Uno to implement autonomous embedded Electrooculography (EOG). Belikov and Petlenkov [2] implemented a Simplified Additive Nonlinear AutoRegressive eXogenous (SANARX) model on an Arduino board to control the water level in a tank system; although the ANN was executed on Arduino, a portion of the control intelligence was developed and executed on
Matlab/Simulink, working together in a hybrid control architecture with Master and Slave. Subirats et al. [27] and Jerez et al. [10] ported and tested the C-Mantec ANN algorithm on Arduino; Subirats et al. [27] compared the accuracies achieved on Arduino and on the computer, concluding that the inferences on Arduino achieved inferior performance. Hernández and Leal [7] implemented an ANN inverse-model controller on an Arduino Mega board to track a photovoltaic module's maximum power point. Kirdpipat [11] implemented the Levenberg-Marquardt algorithm for training an ANN on Arduino to predict how long people need to use an electric kettle, aiming to control its power (turn on/off) for energy saving. Jerez et al. [10] stood out from the other studies as the only one performing standalone training and retraining, to reduce the dependence on connectivity infrastructure and delays, and presented case studies where local retraining is needed. Abadeh and Rawassizadeh [1] implemented k-Nearest Neighbor (KNN) on Arduino. Taluk [32] proposed a novel machine learning method inspired by behavioral science theories, tested it on Arduino, and compared it to other techniques. Panigrahi and Parhi [22] implemented a wavelet ANN on an Arduino Mega 2560, first tested and simulated in Matlab, and compared the results with the physical implementation. Abadeh and Rawassizadeh [1] also proposed and implemented SEFR, an ultra-low-power binary classifier with linear time complexity in the training and testing phases, on Arduino. Suggala et al. [30] implemented ProtoNN, an ANN library, on Arduino and compared three baselines (LDKL-L1, NeuralNet Pruning, L1 Logistic) on four binary datasets, achieving good accuracy compared with benchmark techniques and low energy consumption. Suarez and Varela [26] implemented ANNs such as MLPs and compared performance on several boards, including Arduino, to evaluate which achieves the best result. In all these studies, the ANN accuracy on Arduino was lower than or comparable to that achieved on a computer.
7.3 Methods and Materials
The proposed methodology executes the ANN learning phase in two stages (training and tuning) and validates it in a third stage. In the first stage, a first-order optimization method is used to adjust the ANN's weights until convergence (no changes greater than a float precision of 2.220446049250313E-09). In the second stage, a quasi-Newton optimization algorithm is used to tune the weights combined with the coefficients of each neuron's activation function individually. The second stage's results are therefore expected to be better, since it uses more information and parameters (it requires a Hessian), while the first-order optimization method (which requires only
the function’s first derivative) used in first step works with limited local information, resulting in more limited results. In a third stage, the tuned ANN runs on a target microcontroller and its performance metrics are compared to original ANN’s ones (executed on computer before tuning stage). The present study’s main research question is: “Is it possible to improve the network obtained in conventional training using a combined optimization of weights and activation functions?”. Therefore, the present research sought the validation of the following hypotheses to answer the research question: H1: The proposed method allows simple ANNs to achieve a lower loss function value (sum of squared errors); H2: The proposed method allows ANNs to achieve greater performance; and H3: The proposed method allows the results from H1 and H2 be achieved in target devices with a low-end microcontroller. Four classification datasets available in python’s scikitlearn library were used: Iris (150 instances, 3 classes), Wine (178 instances, 3 classes), Cancer (569 instances, 2 classes), and Digits (1797 instances, 10 classes). They were selected because they are widely used, have different complexities and sizes, and, are compatible with fully connected MLPs, the ANN architecture adopted in the present study. Different intermediate layer sizes (number of neurons) were tested, always seeking to use the smallest number of neurons possible to validate whether the proposed method allows obtaining superior results using lighter networks (third hypothesis of the study). The Table 7.1 shows the different MLP settings and the learning rates for testing the proposed method with each dataset. The sigmoid activation function was adopted. Beyond its frequently use in MLPs, it was selected because its non-linearity could result in potentially more opportunities for tuning after the ANN’s weights adjustments in first steps. The first-order optimization method selected for the training stage was the backpropagation, a stochastic gradient descendent algorithm traditionally used for MLP. The second-order optimization method selected for the tuning stage was the Broyden-Fletcher-Goldfarb-Shanno (BFGS) [14, 38]. Thus, the second stage tunes the sigmoid shapes as well using the BFGS optimizer from Python library scipy 1.5.4. Therefore, the resulting MLP has distinct sigmoid shapes as illustrated by Fig. 7.1. Each experiment consisted of 1000 executions of the two steps for each combination of data set and the 16 possible MLP configurations, totaling 16,000 executions. In each run, the weights were started with random values (normal distribution between 0 and 1), and the instances that make up the training and test datasets were randomly drawn. The performance metric selected for evaluation was the average classification accuracy. The ANNs were evaluated using the training and testing dataset in each epoch/iteration executed in stages 1 and 2. In the third stage, the tuned ANNs were ported into the target microcontroller to classify the evalu-
Table 7.1 Experiment settings: datasets, learning rates, #neurons per layer L1/L2/L3, #configurations tested, and code and runtime RAM footprints on the microcontrollers

Dataset | Learn. rate | L1 | L2      | L3 | #Cfg | Code (Kb) | RAM (Kb)
Iris    | 1E-04       | 4  | 1,2,4,8 | 3  | 4    | 9.2       | 0.6; 0.6; 0.7; 0.9
Wine    | 1E-04       | 13 | 1,2,4,8 | 3  | 4    | 9.2       | 0.7; 0.7; 0.9; 1.3
Cancer  | 1E-04       | 30 | 1,2,4,8 | 2  | 4    | 9.2       | 0.8; 0.9; 1.2; 1.8
Digits  | 1E-03       | 64 | 1,2,4,8 | 10 | 4    | 9.2–9.5   | 1.3; 1.6; 2.2; 3.4
Fig. 7.1 Illustration of an ANN with optimized activation functions’ shapes
In the third stage, the tuned ANNs were ported to the target microcontroller to classify the evaluation datasets (training and testing), and their performance was compared to that of the ANNs run on the computer before tuning. In all experiments, each dataset was divided into two parts, for training (75%) and testing (25%); a thousand random 75/25% splits were performed for each combination of network configuration and dataset. That approach was preferred over 10-fold cross-validation because it yields higher dataset variability while using a smaller proportion for training and a larger proportion for testing. The two low-end/low-cost microcontrollers used for the experiments were the ATmega328p (Arduino Uno board) and the ATmega2560 (Arduino Mega 2560 board). Both are 8-bit RISC microcontrollers that can reach up to 16 MIPS at 16 MHz. They have 32 Kbytes and 256 Kbytes of flash program memory, respectively, and 2 Kbytes and 8 Kbytes of internal SRAM, respectively; the ATmega2560 can therefore hold programs up to 8 times bigger and support 4 times the memory for variables compared to the ATmega328p. Both can be found in China for less than $0.10/piece in large quantities. These boards already offer a serial interface over a USB connection and are supported by a development environment (Arduino IDE). Because they are open source, they are widely used for IoT and robotics prototyping and projects; porting ANNs to Arduinos can thus enable applications such as image recognition for those developers and hobbyists, creating an impact beyond the research community. All experiment stages were controlled by Python 3.6.13 code. A C program was developed and compiled in Arduino IDE 1.8.15 to perform the
inferences on Arduino. The C code was configured and compiled for each network architecture. The first compilation attempt was always made for the Arduino Uno board; in the few cases where the network architecture exceeded the maximum memory available for variables, the compilation was redone for the Arduino Mega 2560 board. Table 7.1 shows the memory required for each compiled version. The compiled binary was uploaded to the appropriate Arduino board over a USB cable. Step 1 was executed for each ANN configuration, resulting in 1000 trained ANNs per configuration. Those networks were tested using the testing datasets, and the average and standard deviation of the accuracy were computed. They were then tuned during step 2, resulting in 16,000 tuned ANNs in total. Right after, step 3 (Fig. 7.2) was executed for each of the 16 configurations. For each dataset, the Python controller code uploaded a tuned ANN to the microcontroller over a serial connection between the computer and the Arduino board at 115,200 baud. It then loaded one line of data from the evaluation datasets and sent it to the microcontroller; the microcontroller executed the inference and returned the classification to the computer, which stored the result. The process was repeated until all instances of the evaluation datasets had been tested. The controller then loaded the next tuned ANN and repeated the above process. After all ANNs had been tested on the microcontroller, the average and standard deviation of the accuracy for that configuration were computed. This process then restarted for another configuration until all four configurations for each dataset had been tested. A sketch of the controller's side of this loop appears below.
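The paper does not publish the controller code; the following pySerial sketch is a hypothetical reconstruction of the host side of step 3. The message format (newline-terminated, comma-separated floats, with the board replying with a class index) is an assumption, not the authors' actual protocol:

```python
import serial  # pySerial

def evaluate_on_board(port: str, weights, dataset, labels) -> float:
    """Hypothetical host-side loop of step 3: upload a tuned ANN,
    stream evaluation instances, and collect the board's predictions."""
    correct = 0
    with serial.Serial(port, baudrate=115200, timeout=5) as board:
        # Upload the tuned parameters (assumed: one float per line)
        for w in weights:
            board.write(f"{w:.9f}\n".encode())

        for features, label in zip(dataset, labels):
            # Send one instance as comma-separated floats (assumed format)
            line = ",".join(f"{x:.6f}" for x in features) + "\n"
            board.write(line.encode())
            # The board runs the inference and replies with a class index
            predicted = int(board.readline().strip())
            correct += (predicted == label)
    return correct / len(labels)

# Example call (hypothetical port name):
# acc = evaluate_on_board("/dev/ttyUSB0", tuned_weights, X_test, y_test)
```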
Fig. 7.2 Step 3 illustration
Table 7.2 Iris: experiments’ results (*p-value.