Lecture Notes in Networks and Systems 544
Kohei Arai Editor
Intelligent Systems and Applications Proceedings of the 2022 Intelligent Systems Conference (IntelliSys) Volume 3
Lecture Notes in Networks and Systems Volume 544
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Advisory Editors Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas— UNICAMP, São Paulo, Brazil Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA Institute of Automation, Chinese Academy of Sciences, Beijing, China Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus Imre J. Rudas, Óbuda University, Budapest, Hungary Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science. For proposals from Asia please contact Aninda Bose ([email protected]).
More information about this series at https://link.springer.com/bookseries/15179
Editor Kohei Arai Saga University Saga, Japan
ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-031-16074-5 ISBN 978-3-031-16075-2 (eBook) https://doi.org/10.1007/978-3-031-16075-2 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Editor’s Preface
It gives me immense pleasure and privilege to present the proceedings of the Intelligent Systems Conference (IntelliSys) 2022, which was held in a hybrid mode on 1 and 2 September 2022. IntelliSys was designed and organized in such a wonderful manner in Amsterdam, the Netherlands, that in-person and online delegates shared their valuable research in an engaging discussion, and hence, we were able to take advantage of the best that the two modalities can offer. IntelliSys is a prestigious annual conference on artificial intelligence and aims to provide a platform for discussing the issues, challenges, opportunities and findings of its applications to the real world. This conference was hugely successful in discussing the erstwhile approaches, current research and future areas of study in the field of intelligent systems. The researchers managed to give workable solutions to many intriguing problems faced across different fields covering deep learning, data mining, data processing, human–computer interactions, natural language processing, expert systems, robotics, ambient intelligence, to name a few. They also let us see what the future would look like if artificial intelligence were entwined in our lives. One of the meaningful and valuable dimensions of this conference is the way it brings together researchers, scientists, academicians and engineers on one platform from different countries. The aim was to further increase the body of knowledge in this specific area by providing a forum to exchange ideas and discuss results and to build international links. Authors from 50+ countries submitted a total of 494 papers to be considered for publication. Each paper was reviewed on the basis of originality, novelty and rigorousness. After the reviews, 193 were accepted for presentation, out of which 176 (including 8 posters) papers are finally being published in the proceedings. We would like to extend our gratitude to all the learned guests who participated on site as well as online to make this conference extremely fruitful and successful, and also a special note of thanks to the technical committee members and reviewers for their efforts in the reviewing process.
We sincerely believe this event will help to disseminate new ideas and inspire more international collaborations. We kindly invite all to continue to support future IntelliSys conferences with the same enthusiasm and fervour. Kind Regards, Kohei Arai
Contents
How to Improve the Teaching of Computational Machine Learning Applied to Large-Scale Data Science: The Case of Public Universities in Mexico . . . 1
Sergio Rogelio Tinoco-Martínez, Heberto Ferreira-Medina, José Luis Cendejas-Valdez, Froylan Hernández-Rendón, Mariana Michell Flores-Monroy, and Bruce Hiram Ginori-Rodríguez

Fingerprinting ECUs to Implement Vehicular Security for Passenger Safety Using Machine Learning Techniques . . . 16
Samuel Bellaire, Matthew Bayer, Azeem Hafeez, Rafi Ud Daula Refat, and Hafiz Malik

Knowledge Graph Enrichment of a Semantic Search System for Construction Safety . . . 33
Emrah Inan, Paul Thompson, Fenia Christopoulou, Tim Yates, and Sophia Ananiadou

Leboh: An Android Mobile Application for Waste Classification Using TensorFlow Lite . . . 53
Teny Handhayani and Janson Hendryli

Mobile Application for After School Pickup Solution: Malaysia Perspectives . . . 68
Check-Yee Law, Yong-Wee Sek, Choo-Chuan Tay, and Wei-Wei Goh

Were Consumers Less Price Sensitive to Life Necessities During the COVID-19 Pandemic? An Empirical Study on Dutch Consumers . . . 79
Hao Chen and Alvin Lim

Eigen Value Decomposition Utilizing Method for Data Hiding Based on Wavelet Multi-resolution Analysis . . . 101
Kohei Arai
Long Short-Term Memory Neural Network for Temperature Prediction in Laser Powder Bed Additive Manufacturing . . . 119
Ashkan Mansouri Yarahmadi, Michael Breuß, and Carsten Hartmann

Detection and Collection of Waste Using a Partially Submerged Aquatic Robot . . . 133
Malek Ayesh and Uvais Qidwai

Detecting Complex Intrusion Attempts Using Hybrid Machine Learning Techniques . . . 150
Mustafa Abusalah, Nizar Shanaah, and Sundos Jamal

Gauging Biases in Various Deep Learning AI Models . . . 171
N. Tellez, J. Serra, Y. Kumar, J. J. Li, and P. Morreale

Improving Meta-imitation Learning with Focused Task Embedding . . . 187
Yu-Fong Lin, ChiKai Ho, and Chung-Ta King

Firearm Detection Using Deep Learning . . . 200
Akhila Kambhatla and Khaled R Ahmed

PID Parameter Tables for Time-Delayed Systems, Found with Learning Algorithms According to Minimum IAE, ITAE and ISE Criteria . . . 219
Roland Büchi

Principles of Solving the Symbol Grounding Problem in the Development of the General Artificial Cognitive Agents . . . 231
Roman V. Dushkin and Vladimir Y. Stepankov

Enabling Maritime Digitalization by Extreme-Scale Analytics, AI and Digital Twins: The Vesselai Architecture . . . 246
Spiros Mouzakitis, Christos Kontzinos, John Tsapelas, Ioanna Kanellou, Georgios Kormpakis, Panagiotis Kapsalis, and Dimitris Askounis

Computing with Words for Industrial Applications . . . 257
Aisultan Kali, Pakizar Shamoi, Yessenaly Zhangbyrbayev, and Aiganym Zhandaulet

Improving Public Services Accessibility Through Natural Language Processing: Challenges, Opportunities and Obstacles . . . 272
Ilaria Mariani, Maryam Karimi, Grazia Concilio, Giuseppe Rizzo, and Alberto Benincasa

Neural Machine Translation for Aymara to Spanish . . . 290
Honorio Apaza Alanoca, Brisayda Aruhuanca Chahuares, Kewin Aroquipa Caceres, and Josimar Chire Saire

Hand Gesture and Human-Drone Interaction . . . 299
Bilawal Latif, Neil Buckley, and Emanuele Lindo Secco

Design and Develop Hardware Aware DNN for Faster Inference . . . 309
S. Rajarajeswari, Annapurna P. Patil, Aditya Madhyastha, Akshat Jaitly, Himangshu Shekhar Jha, Sahil Rajesh Bhave, Mayukh Das, and N. S. Pradeep

Vision Transformers for Medical Images Classifications . . . 319
Rebekah Leamons, Hong Cheng, and Ahmad Al Shami

A Survey of Smart Classroom: Concept, Technologies and Facial Emotions Recognition Application . . . 326
Rajae Amimi, Amina Radgui, and Hassane Ibn El Haj El

The Digital Twin for Monitoring of Cargo Deliveries to the Arctic Territories . . . 339
Nikishova Maria Igorevna and Kuznetsov Mikhail Evgenievich

Intelligent Bixby Recommender Systems . . . 348
S. Rajarajeswari, Annapurna P. Patil, Manish Manohar, Mopuru Vinod Reddy, Laveesh Gupta, Muskan Gupta, and Nikunj Das Kasat

A Deep Reinforcement Learning Algorithm Using A New Graph Transformer Model for Routing Problems . . . 365
Yang Wang and Zhibin Chen

Building a Fuzzy Expert System for Assessing the Severity of Pneumonia . . . 380
Rustam Burnashev, Adelya Enikeeva, Ismail F. Amer, Alfira Akhmedova, Marina Bolsunovskaya, and Arslan Enikeev

Evaluating Suitability of a Blockchain Platform for a Smart Education Environment . . . 397
Manal Alkhammash, Natalia Beloff, and Martin White

Towards End-to-End Chase in Urban Autonomous Driving Using Reinforcement Learning . . . 408
Michał Kołomański, Mustafa Sakhai, Jakub Nowak, and Maciej Wielgosz

Methodology for Training Artificial Neural Networks for Islanding Detection of Photovoltaic Distributed Generators . . . 427
Luiza Buscariolli, Ricardo Caneloi dos Santos, and Ahda P. Grilo Pavani

Using the Cramer-Gauss Method to Solve Systems of Linear Algebraic Equations with Tridiagonal and Five-Diagonal Coefficient Matrices . . . 442
Anarkul Urdaletova, Sergey Sklyar, Syrgak Kydyraliev, and Elena Burova

An Interoperable Cloud Platform for the Garment Industry . . . 457
Francisco Morais, Nuno Soares, Rui Ribeiro, Marcelo Alves, Pedro Rocha, Ana Lima, and Ricardo J. Machado

Optimal Scheduling of Processing Unit Using Convolutional Neural Network Architecture . . . 478
Bhavin G. Chennur, Nishanth Shastry, S. Monish, Vibha V. Hegde, Pooja Agarwal, and Arti Arya

Facial Gesture Recognition for Drone Control . . . 488
Aloaye Itsueli, Nicholas Ferrara, Jonathan Kamba, Jeremie Kamba, and R. Alba-Flores

Job Recommendation Based on Extracted Skill Embeddings . . . 497
Atakan Kara, F. Serhan Daniş, Günce K. Orman, Sultan N. Turhan, and Ö. Anıl Özlü

Hard Samples Make Difference: An Improved Training Procedure for Video Action Recognition Tasks . . . 508
Alexander Zarichkovyi and Inna V. Stetsenko

Similarity-Based Résumé Matching via Triplet Loss with BERT Models . . . 520
Ö. Anıl Özlü, Günce Keziban Orman, F. Serhan Daniş, Sultan N. Turhan, K. Can Kara, and T. Arda Yücel

IOT Based Crowd Management Tool . . . 533
Fai Aldosari, Sarah Alamri, Rahaf Alharbi, Ghaydaa Alghamdi, and supervised by Ala Alluhaidan

The Growing Need for Metaverse Regulation . . . 540
Louis B. Rosenberg

Weak Supervision Can Help Detecting Corruption in Public Procurement . . . 548
Bedri Kamil Onur Tas

Applicability of Systems Engineering for Intelligent Transportation Systems: A Roadmap for Model-Based Approach to Manage Future Mobility Needs . . . 556
Harsha Deshpande and Vasudha Srikumar

Implementation of Two-Layered Dynamic Pragmatics . . . 572
Ádám Szeteli, Attila Friedszám, Anna Szeteli, Laura Kárpáti, Judit Hagymási, Judit Kleiber, and Gábor Alberti

Ambient Intelligence Security Checks: Identifying Integrity Vulnerabilities in Industry Scripts . . . 590
Suzanna Schmeelk, Shannon Roth, Julia Rooney, Mughees Tariq, Khalil Wood, John Kamen, and Denise Dragos

Demystifying xAOSF/AOSR Framework in the Context of Digital Twin and Industry 4.0 . . . 600
Fareed Ud Din and David Paul

Mitigating IoT Enterprise Vulnerabilities Using Radio Frequency Security Architecture . . . 611
John Irungu and Anteneh Girma

Tantrum-Track: Context and Ontological Representation Model for Recommendation and Tracking Services for People with Autism . . . 620
Hamid Mcheick, Fatima Ezzeddine, Fatima Lakkis, Batoul Msheik, and Mariam Ezzeddine

Video Analysis of Solid Flame Combustion of the Ni-Al System . . . 636
Isaeva Oksana, Boronenko Yuri, and Gulyaev Pavel

Intrusion Detection System for Industrial Network . . . 646
Woo Young Park, Sang Hyun Kim, Duy-Son Vu, Chang Han Song, Hee Soo Jung, and Hyeon Jo

Proposing Theoretical Frameworks for Including Discreet Cues and Sleep Phases in Computational Intelligence . . . 659
Aishwarya Seth and Wanyi Guo

DataWords: Getting Contrarian with Text, Structured Data and Explanations . . . 675
Stephen I. Gallant and Mirza Nasir Hossain

Extractive Text Summarization for Turkish: Implementation of TF-IDF and PageRank Algorithms . . . 688
Emre Akülker and Çiğdem Turhan

Recognition of Similar Habits Using Smartwatches and Supervised Learning . . . 705
Maren Hassemer, Edmond Cudjoe, Janina Dohn, Claudia Kredel, Yannika Lietz, Johannes Luderschmidt, Lisa Mohr, and Sergio Staab

Brand Recommendations for Cold-Start Problems Using Brand Embeddings . . . 724
David Azcona and Alan F. Smeaton

Commentary on Biological Assets Cataloging and AI in the Global South . . . 734
Issah Abubakari Samori, Xavier-Lewis Palmer, Lucas Potter, and Saltuk Karahan

FI-SHAP: Explanation of Time Series Forecasting and Improvement of Feature Engineering Based on Boosting Algorithm . . . 745
Yuyi Zhang, Ovanes Petrosian, Jing Liu, Ruimin Ma, and Kirill Krinkin

Deep Learning and Support Vector Machine Algorithms Applied for Fault Detection in Electrical Power Transmission Network . . . 759
Nouha Bouchiba and Azeddine Kaddouri

Perception of the Situation: Social Stress and Well-Being Indices . . . 778
Alexander A. Kharlamov and Maria Pilgun

Smart Hardware Trojan Detection System . . . 791
Iyad Alkhazendar, Mohammed Zubair, and Uvais Qidwai

A Study of Extracting Causal Relationships from Text . . . 807
Pranav Gujarathi, Manohar Reddy, Neha Tayade, and Sunandan Chakraborty

Blending Case-Based Reasoning with Ontologies for Adapting Diet Menus and Physical Activities . . . 829
Ana Duarte and Orlando Belo

Author Index . . . 845
How to Improve the Teaching of Computational Machine Learning Applied to Large-Scale Data Science: The Case of Public Universities in Mexico Sergio Rogelio Tinoco-Martínez1 , Heberto Ferreira-Medina2,3(B) , José Luis Cendejas-Valdez4(B) , Froylan Hernández-Rendón1 , Mariana Michell Flores-Monroy1 , and Bruce Hiram Ginori-Rodríguez1 1 Escuela Nacional de Estudios Superiores Unidad Morelia, UNAM Campus Morelia,
58190 Morelia, Michoacán, Mexico {stinoco,fhernandez}@enesmorelia.unam.mx 2 Instituto de Investigaciones en Ecosistemas y Sustentabilidad, UNAM Campus Morelia, 58190 Morelia, Michoacán, Mexico [email protected] 3 Tecnológico Nacional de México, Campus Morelia, DSC, Morelia, Michoacán, Mexico 4 Departamento de TI, Universidad Tecnológica de Morelia, Cuerpo académico TRATEC-PRODEP, 58200 Morelia, Michoacán, México [email protected] Abstract. Teaching along with training on Machine Learning (ML) and Big Data in Mexican universities has become a necessity that requires the application of courses, handbooks, and practices that allow improvement in the learning of Data Science (DS) and Artificial Intelligence (AI) subjects. This work shows how academia and the Information Technology industry use tools to analyze large volumes of data, which are hard to treat and interpret directly, to support decision-making. A solution to some large-scale national problems is the inclusion of these subjects in related courses within specialization areas that universities offer. The methodology in this work is as follows: 1) Selection of topics and tools for ML and Big Data teaching, 2) Design of practices with application to real data problems, and 3) Implementation and/or application of these practices in a specialization diploma. Results of a survey applied to academic staff and students are shown. The survey respondents have already taken related courses covering those specific topics that the proposed courses and practices will seek to strengthen, developing the skills needed for solving problems where ML/DL and Big Data are an outstanding solution alternative. Keywords: Machine learning · Deep learning · Big data · Data science · Teaching skills
1 Introduction

The use of tools that allow the analysis of large volumes of data has enabled the exact sciences to play an important role in decision-making in organizations [1]. In the Bachelor
of Information Technologies for Sciences (ITCs) of the Escuela Nacional de Estudios Superiores (ENES) Morelia, Mexico, there are subjects related to Data Science (DS) [2] that are included in the curriculum starting from the 6th semester, known as subjects of the deepening area; these represent a challenge for students when trying to put the theory learned into practice, in addition to their lacking the necessary tools for its application to real problems. The need for teachers and students to know new frontiers in Artificial Intelligence (AI) is observed, specifically in the application of mathematical models of Machine Learning (ML). ML is the branch of AI that is responsible for developing techniques, algorithms, and programs that give computers the ability to learn. A machine learns each time it changes its structure, programs, or data, based on input or in response to external information, in such a way that better performance is expected in the future [3]. In [4] Deep Learning (DL) is used to explain new architectures of Neural Networks (NNs) that are capable of learning. DL is a class of ML techniques that exploit many layers of nonlinear processing for extraction and transformation of supervised and unsupervised features and for pattern analysis and classification [5].

The 21st century has become the golden age for AI; this is due, in large part, to greater computing capacity and the use of GPUs to speed up the training of these systems, with the ingestion of large amounts of data. Currently, numerous frameworks have ML and DL tools implemented, such as PyTorch [6], fast.ai [7], TensorFlow [8], Keras [9], and DL4J [10], among others. Some of the main uses of DL today are, for example, identifying brand names and company logos in photos posted on social media, real-time monitoring of reactions on online channels during product launches, ad recommendation and prediction of preferences, as well as identification and monitoring of customer confidence levels. The use of AI has allowed a better understanding of genetic diseases and therapies, and the analysis of medical images (such as X-rays and magnetic resonance imaging), increasing the accuracy of diagnosis in less time and at a lower cost than traditional methods [11].

DL forms a subcategory of ML. What differentiates it from the rest of ML algorithms is that large-scale NNs allow a machine to learn to recognize complex patterns by itself, which is difficult to achieve otherwise [12]. An NN is made up of several layers or levels, with a certain number of neurons in each of them; these constitute the processing units, whose mathematical model takes several data inputs and produces an output that is a weighting of those inputs [13, 14]. The connections of several neurons within an NN constitute a powerful parallel computation tool, capable of delivering approximate and non-definitive outputs. Furthermore, NNs can be structured in various ways and can be trained with various types of algorithms [15]. On the other hand, the Internet of Things (IoT) and Industry 4.0 have required the introduction of autonomous and intelligent machinery in the industrial sector [16].
In this industry, Convolutional Neural Networks (CNNs) are applied: a type of DL inspired by the functioning of the visual cortex of the human brain, which differs from other NNs in that each neuron in its layers does not receive incoming connections from all the neurons of the previous layer, but only from some of them. This simplifies the learning of the network, generating lower computational and storage costs. All of the above makes DL models more accurate [12].
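To make the preceding description concrete, the following minimal sketch defines a small CNN in PyTorch, one of the frameworks mentioned above. It is illustrative only: the layer sizes, input resolution, and class count are assumptions, not part of the course material.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Illustrative CNN: each convolutional neuron sees only a local
    patch of the previous layer, unlike a fully connected network."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local 3x3 connections
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample by 2
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # for 32x32 inputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

# A batch of four 32x32 RGB images yields one score per class:
logits = SmallCNN()(torch.randn(4, 3, 32, 32))  # shape: (4, 10)
```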
The main contribution of this paper is to present a proposal to improve the learning of ML, DL, and Big Data subjects with practical training focused on real-life use cases. The rest of the paper is organized as follows: Sect. 2 describes the work related to AI (ML, DL, and Big Data) and the implementation of its teaching in courses or diplomas oriented to data science; it shows the difficulty that students have in learning and applying the topics of ML and DL in data analysis, a problem that must be solved with a practical approach or orientation. Section 3 describes the methodology used to develop the improvement proposal. Section 4 presents the results, as well as a brief discussion of them. Finally, Sect. 5 presents the conclusions obtained from the developed proposal.
2 Related Work

In current terms, [17] explains that within AI there is a branch called ML whose purpose is to improve the performance of algorithms through their experience. ML uses statistics, computer science, mathematics, and computational sciences, having its foundation in data analysis. The definition coincides with other proposals that consider ML as the technique of creating systems that are capable of learning by themselves, using large volumes of data, making them suitable for analysis and, thus, being able to predict future behavior [18].

Regarding ML as an area, [12] points out that there are different approaches for the design of these systems. The approaches are divided into supervised, unsupervised, and reinforcement learning, depending on whether the system is trained under human supervision or not (the most used division); online or offline learning, depending on whether the system can learn on the fly or not; and, finally, instance-based or model-based learning, depending on whether the system detects training patterns or compares new data against existing data. The supervised ML approach is used when you already have data, and you know the response you want to predict. This knowledge is then used to predict the labels of new data whose label is unknown. The main problems that are solved with this type of learning are regression (predicting the future value of a given element, whose values can only be numerical, from relevant characteristics and previous values [19]) and classification (assigning a label to a given element, from a discrete set of possibilities [20]). In the unsupervised ML approach, the data is not labeled; this means that the system must learn by itself without being told if the classification is correct or not [21]. To classify the data, a grouping technique is used whose objective is to combine data whose characteristics are similar to each other [22]. Regarding the reinforcement ML approach, the goal is that the system learns in an environment in which the only feedback consists of a scalar reward, which can be positive or negative (punishment) [23]. That is, the system receives in each iteration a reward and the current state of the environment, then takes an action according to these inputs, and the result is considered as an output, which will change the state in the next iteration [24].

Arguably, the most used supervised ML algorithms are linear or logistic regression, for regression [25]; and decision trees, k-nearest neighbors, support vector machines, or artificial NNs, for classification [26]. The algorithms for unsupervised ML are k-means, visualization and dimensionality reduction, or association rules [27]. Artificial NNs serve to solve both regression and classification problems and even some unsupervised learning problems, due to their versatility and
performance, which has recently surpassed even human performance (at the cost of requiring example data in large quantities, which the IoT has made it possible to obtain).

The study of DL began in 1943 as a computer model inspired by the NNs of the human brain [28]; however, it was not until 1985 that in [4] it was demonstrated that backpropagation in an NN could provide distributed representations of great utility for learning, generating a reborn interest in the area. In [29] the first practical demonstration of backpropagation is provided. The team combined CNNs with backpropagation to read handwritten digits in a system that, finally, was used to read handwritten check numbers. The model was based on the hierarchical multi-layer design of the Neocognitron, inspired by the human visual cortex and introduced in 1979 [30]. In the late 1990s, the problem of the vanishing or exploding gradient was detected in DL models. The problem originates in the activation functions of artificial neurons, whose gradient (based on the derivative) decreased when calculated in each layer, until it reached practically zero (or tended to an infinite value), which implies loss of learning. The proposed solution is to store the gradient within the network itself [31] or to make it enter a layer and simultaneously bypass it through skip connections [32]. In [31] Recurrent NNs (RNNs) of the Long Short-Term Memory (LSTM) type are proposed, which specialize in the analysis and prediction of time series and of problems that must recall previous states, such as Natural Language Processing (NLP). In addition, this architecture allows solving the gradient problems mentioned above.

In 2009 ImageNet [33] was launched, a free, tagged database of more than 14 million images, which features a thousand different categories of objects as varied as 120 dog breeds. With this resource, coupled with the computing power that the evolution of GPUs had already reached by 2011, it became possible to train CNNs without previous layer-by-layer training and with architectures having an increasing number of layers (hence the term DL). Examples of the efficiency and speed that DL algorithms have achieved are the computer vision algorithms that won the ImageNet Large Scale Visual Recognition Challenge (LSVRC) [34] between 2012 and 2017 (all CNN architectures); the main challenge is based on the classification of images from the ImageNet database and, as of 2017, it is considered solved in practice, with performance superior to that of human beings.

To our knowledge, there are very few works in the recent literature related to the improvement of AI teaching. Some of the most representative ones are mentioned below, as a review of the approaches they address; they are aimed at teaching AI as a secondary objective or use case. According to [35], ML is a discipline that focuses on building a computer system that can improve itself using experience. ML models can be used to detect patterns from data and recommend strategic marketing actions, showing how educators can improve the teaching of these topics using the AI approach. The availability of Big Data has created opportunities and challenges for professionals and academics in the area. Therefore, study programs must be constantly updated to prepare graduates for rapidly changing trends and new approaches.
In [36] it is described that, with the pandemic caused by the COVID-19 virus and the advent of Industry 4.0, graduate students are confronted with the need
to develop competencies in ML, which are applied to solve many industrial problems that require prediction and classification, as well as the availability and management of large amounts of data. A proposal for how to apply AI practices in a virtual laboratory is shown, in addition to an evaluation of the performance of students in this type of environment. In the same way, in [37], an innovative practice of teaching applied ML to first-year multidisciplinary engineering university students is proposed, using a learning tool that consists of a public repository in the cloud and a course project. A set of practices for ML and how to apply it in real cases is offered as a use case for online collaborative work. The inclusion of DL and Big Data is mentioned as future work. In Mexico, there are many academic degrees oriented to DS, ML, and DL. However, there are no uniform curricula on the areas of knowledge, topics, and tools that students require. In this paper, we offer an alternative way to solve this problem based on the experience of a public university such as the ENES Morelia - UNAM.
3 Methodology

Based on the review of the literature, a series of steps were generated, which allowed us to obtain the level of knowledge that the ENES Morelia population has about ML/DL and thus be able to define axes that support the design of the pedagogical strategy that will give rise to the proposal of a uniform curriculum through courses of practical experience. The present research is characterized as a study of the following types: 1) exploratory, 2) descriptive, 3) correlational, and 4) pre-experimental, having a case study through a single measurement. To this end, a survey was generated and applied to a population made up of professors, students, and researchers of the UNAM (Morelia, Michoacan campus). The methodology followed for this work is shown in Fig. 1. Results of the analysis of the application of this survey and the monitoring of test groups are described in the following sections.
Fig. 1. Block diagram of the methodology used in this work.
3.1 Population

Primarily, the survey was created through the “e-Encuesta®” Web platform and distributed via email and social media to randomly selected individuals affiliated with the campus mentioned before. In the first place, the sample was calculated using the finite population method based on 600 people. This sample has a confidence interval of 95% and a margin of error of 10%, as shown in Table 1. The link to the survey within the Web platform was distributed through the official email of students and teachers. It was validated that the information in each of the answers was consistent and complete.

Table 1. Population sample.
Description | Value
Population size | 600
Trust level | 95%
Margin of error | 10%
Sample size | 83
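The sample size in Table 1 can be reproduced with the standard finite-population (Cochran) calculation; the values z = 1.96 (95% confidence) and p = 0.5 (maximum variability) are the usual assumptions and are not stated explicitly in the text.

```python
import math

def sample_size(population: int, z: float = 1.96, p: float = 0.5, e: float = 0.10) -> int:
    """Cochran's formula with finite-population correction."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)    # infinite-population sample size
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(600))  # -> 83, matching Table 1
```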
3.2 Survey

An eight-question survey was generated from a critical review of the literature related to DS of ML, DL and Big Data. Experts on the subject validated the selected questions. Its measurement was: a) different options, b) dichotomous responses, and c) Likert scale. The survey was refined by dividing it into four axes: Axis I: Machine Learning; Axis II: Deep Learning; Axis III: Big Data; and Axis IV: Tools, as shown in Table 2. The Likert scale applied was: Very important, Important, Neutral, Less important, Nothing important, I do not know.

Table 2. Survey questions and format.
# | Question description | Axis | Type
Q1 | UNAM account or employee number, your name if you do not have them | – | Options
Q2 | Bachelor’s degree you are studying: 2.1 Sciences, 2.2 Agroforestry, 2.3 Environmental Sciences, 2.4 Sustainable Materials Science, 2.5 Ecology, 2.6 Social Studies and Local Management, 2.7 Geosciences, 2.8 Geohistory, 2.9 Information Technologies in Sciences, 2.10 Other | – | Options
Q3 | Semester you are studying (1–12), does not apply to teachers and researchers | – | Number
Q4 | You consider the following topics related to ML to be: 4.1 Data Science, 4.2 Web Scraping, 4.3 Data Wrangling, 4.4 Machine Learning, 4.5 Data Mining, 4.6 Ensemble Learning, 4.7 Data visualization, 4.8 ML: supervised/unsupervised, 4.9 Binary and multiclass classification, 4.10 EDA, 4.11 Clustering, 4.12 ML model, 4.13 ML evaluation: underfitting, overfitting, 4.14 Cross validation, 4.15 Hyperparameters, regularization, feature engineering, 4.16 PCA | I | Likert options
Q5 | You consider the following topics related to DL to be: 5.1 NN Shallow & Deep, 5.2 CNN, 5.3 RNN, 5.4 Transfer Learning & Fine-Tuning, 5.5 Dropout, 5.6 Data Augmentation, 5.7 Batch Normalization | II | Likert options
Q6 | You consider the following topics related to Big Data to be: 6.1 Concept, 6.2 Model Scaling, 6.3 Large-Scale Analytics, 6.4 Distributed File System, 6.5 Map-Reduce | III | Likert options
Q7 | Your skill in handling the following tools is: 7.1 TensorFlow, 7.2 Spark, 7.3 Keras, 7.4 Fast.ai, 7.5 PyTorch, 7.6 HDFS, 7.7 Kafka, 7.8 Python, 7.9 Scikit-Learn | IV | Likert options
Q8 | Is it important to include some additional topics related to ML, DL and Big Data, not mentioned above? | – | Open
3.3 Data Analysis

In the second place, with the information obtained from the survey, descriptive analyses were generated, in which the reliability study was carried out by applying Cronbach’s Alpha, obtaining as a result 0.956 and demonstrating that the information obtained is consistent. Third, the study of correlations was applied using Pearson’s bivariate correlation, selecting only the correlations obtained at the high and very high levels [0.7–0.93], shown in Fig. 2. According to the analysis of correlations, we identified areas of opportunity according to the percentage of importance (scale) that the respondents answered. In Fig. 3 this importance is shown.

Fig. 2. Correlation matrix (heat map). See Table 2, for tag details.
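As an illustration of this analysis step, Cronbach’s Alpha and the Pearson correlation matrix can be computed from the coded Likert responses as sketched below; the `responses` DataFrame of randomly generated answers is a placeholder for the actual survey data, and the item count is an assumption.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's Alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# One row per respondent, one column per Likert item (coded 1-6); placeholder data
responses = pd.DataFrame(np.random.randint(1, 7, size=(83, 37)))
print("Cronbach's Alpha:", cronbach_alpha(responses))

# Pearson bivariate correlations, keeping only the high/very high range [0.7, 0.93]
corr = responses.corr(method="pearson")
high_corr = corr[(corr.abs() >= 0.7) & (corr.abs() <= 0.93)]
```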
Fig. 3. Level of importance (Likert) according to survey respondents; a) ML, b) DL, c) big data and d) tools. See Table 2, for tag details.
4 Results and Discussion

According to the developed survey, a certain lack of knowledge of the respondents was observed in some topics. In Fig. 4 topics are shown by axis, ordered by level of unfamiliarity: I do not know (dnK), Nothing important (NImp), Less Important (LImp), Neutral, Important (Imp), Very Important (VImp). And, for tools, the scale is: I do not know (dnK), Short, Half, and High.

Fig. 4. Levels of unfamiliarity by axes. See Table 2, for tag details.
Based on the analysis of these results and considering the classic progression in the recent literature regarding the teaching of basic topics of ML, DL, and Big Data, coupled with our personal experience in teaching courses on these topics at the undergraduate level (several of them aimed at teachers in the area of Information Technology), it is proposed to improve learning with practical training to strengthen those topics with the greatest lack of knowledge (dnK level of unfamiliarity in Fig. 4). This practical knowledge is shown in Tables 3 and 4 as a series of practices we recommend to take advantage of these areas for improvement. It was observed that respondents prefer an intervention oriented towards the practical application of knowledge.

Table 3. Proposed ML practices.
# | Name | Dataset | Evaluation metric | Description
1 | Classification using decision trees | Titanic passengers [38] | Accuracy and/or Fbeta metrics | Build a decision tree for the survival analysis of Titanic passengers (Classification)
2 | Housing cost prediction | California Housing [39] | RMSE and/or MAE | Build a real estate cost prediction model (Linear Regression/Logistic Regression)
3 | k-Nearest Neighbors | Water wells [40] | Precision Score Fbeta | Build a prediction model of water well uses (Supervised)
4 | k-Means | Online retail K-means & Hierarchical clustering [41] | Not applicable | Design a model to classify the transactions of a bank’s customers (Unsupervised)
5 | Installing and using Dask [42] | Not applicable | Not applicable | Show Dask installation and how it is used for Big Data manipulation
6 | Installing and using HDFS [43] | Not applicable | Not applicable | Teach how the installation of HDFS and its basic use is carried out
7 | Weather forecasting | RUOA (UNAM, 2015) [44] | RMSE and/or MAE | Analyze climate data from the RUOA to predict weather on a daily horizon (Linear Regression)
9 | Car price prediction | 100,000 UK Used Car Data set [45] | RMSE and/or MAE | Analyze car data to estimate prices (Multiple regression)
10 | Special case | Public data information | Several | Analyze data to apply the best strategy to solve a problem
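As a sketch of how Practice 1 of Table 3 might be implemented with the tools used in the diploma (Sect. 5), the following trains a decision tree on the Kaggle Titanic data [38] and reports the proposed metrics. The file path, feature selection, and preprocessing are illustrative assumptions, not the official practice solution.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, fbeta_score

# Kaggle Titanic training file [38]; path and feature choices are assumptions
df = pd.read_csv("train.csv")
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Age"] = df["Age"].fillna(df["Age"].median())
X = df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print("Fbeta (beta=1):", fbeta_score(y_test, pred, beta=1.0))
```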
Table 4. Proposed DL practices.
# | Name | Dataset | Evaluation metric | Description
1 | Binary classification with CNN | 800 images of mosquitoes, UNAM [46] | Accuracy | Differentiate between the species Aedes Albopictus and Aedes Aegypti (Visualization)
2 | Binary classification with CNN | Covid-19 pneumonia screening [47] | Precision & confusion matrix | X-ray tomography analysis for identification of lungs affected by the SARS-CoV-2 virus
3 | CNN & data augmentation | Ship Classification [48] | Precision & confusion matrix | Classification of 6,252 images of ships (5 categories)
4 | RNN | Sarcasm Detection [49] | Accuracy and precision | Identify news titles that are sarcastic or satirical (NLP)
5 | Installing and using PyTorch over Dask | Not applicable | Not applicable | Installation and use of PyTorch in Dask
6 | Transfer learning | Sports images [50] | Precision and accuracy | Classification of sports images
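Similarly, Practice 6 of Table 4 (transfer learning) could start from a pretrained backbone, as sketched below with PyTorch/torchvision (0.13+ API). The class count for the sports-image dataset [50] and the hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and freeze its feature extractor
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for the sports-image classes of [50] (count assumed)
num_classes = 100
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head is trained; fine-tuning could later unfreeze more layers
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 224, 224)                # a batch of preprocessed images
labels = torch.randint(0, num_classes, (8,))   # placeholder labels
loss = criterion(model(x), labels)
loss.backward()
optimizer.step()
```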
5 Conclusions

Drawing on the experience of practical teaching to a mix of students and teachers of the Morelia campus of UNAM, divided into two heterogeneous groups for applying the proposed practices, two courses were offered as part of the diploma described below [51]:

MODULE I. Machine Learning (ML). “Theory and Practice for the Improvement of the Teaching of ML Applied to Data Science”. Topics: 1. Artificial Intelligence and Machine Learning, 2. Phases of an ML Project, 3. Regression Methods, 4. Classification Methods, 5. Prediction Methods, 6. Supervised Learning, 7. Unsupervised Learning, 8. Metrics. Practices to be developed: see Table 3.

MODULE II. Deep Learning (DL). “Theory and Practice for the Improvement of the Teaching of DL Applied to Data Science”. Topics: 1. Artificial Intelligence and Deep Learning (DL), 2. Phases of a DL Project, 3. Convolutional Neural Networks (CNNs), 4. Learning Transfer, 5. Recurrent Neural Networks (RNNs), 6. Visualization and Treatment of Natural Language Processing (NLP). Practices: see Table 4.

Tools used in the diploma: Anaconda Python, Scikit-Learn, Matplotlib, Dask, PyTorch, fast.ai 2, TensorFlow, Keras, HDFS, among others.

At the end of the first course where the intervention was carried out, it was observed that 50% of the attendees, of a total of 40, had various problems solving the practices. These problems are shown as percentages of solved practices in Fig. 5. In this figure, the expected results (according to our experience in previous courses) are compared against the actual results, as practices solved and delivered by the attendees. In addition, the figure shows the efficiency of teaching according to:

%efficiency = 100 × (solved practices − expected practices) / expected practices    (1)
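Equation (1) can be evaluated directly, as in the brief sketch below; the example numbers are hypothetical and are not the values behind Fig. 5.

```python
def teaching_efficiency(solved: int, expected: int) -> float:
    """Percentage efficiency of teaching, from Eq. (1)."""
    return 100 * (solved - expected) / expected

# Hypothetical example: 7 practices solved where 10 were expected -> -30.0%
print(teaching_efficiency(solved=7, expected=10))
```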
Fig. 5. ML teaching effectiveness.
We observed that students quit working on the most complex ML practices. The reasons were the increase in data analysis tasks, on top of having to apply statistical and mathematical theory using a programming language (Python). The solution to these problems is to give higher priority to practice with real data than to abstract theory. In works similar to this one, the teaching of ML is used only as a use case to address education in other topics in real cases, indeed with an AI approach, but no improvements are made to the teaching of ML itself, nor are experiences on how to improve the teaching of data science included. The practical proposal in this work allows for establishing a more complete and broader curriculum, in that it includes not only ML but also DL, Big Data, and the computer tools associated with data science. Our proposal is still under development and, among other issues, it is necessary to evaluate the efficiency of teaching according to this practical approach in a DL course, as well as to include other tools that may help facilitate the learning of data science, such as collaborative learning platforms in the cloud.

Acknowledgment. We are grateful for the support of the Instituto de Investigaciones en Ecosistemas y Sustentabilidad (IIES), the CA TRATEC - PRODEP of the Universidad Tecnológica de Morelia (UTM), the TecNM campus Morelia, the Escuela Nacional de Estudios Superiores (ENES), UNAM Campus Morelia, and DGAPA UNAM PAPIME PE106021. Especially thanks to MGTI. Atzimba G. López M., MTI. Alberto Valencia G., Eng. Oscar Álvarez, MTI. Pablo García C. and Eng. Javier Huerta S., for their technical support, comments, and analysis of the
statistical calculations. We thank Eng. Alfredo A. Aviña, for his help in applying the survey and Web page support.
References

1. Haenlein, M., Kaplan, A.: A brief history of artificial intelligence: on the past, present, and future of artificial intelligence. Calif. Manage. Rev. 61(4), 5–14 (2019)
2. ENES-UNAM Homepage: http://www.enesmorelia.unam.mx/. Last accessed 18 Jan 2021
3. Nilsson, N.J.: Introduction to Machine Learning. Not published, Stanford, CA (1996)
4. Rumelhart, D., Hinton, G., Williams, R.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
5. Deng, L., Yu, D.: Deep learning: methods and applications. FNT Sign. Process. 7, 197–387 (2014)
6. PyTorch Homepage: https://pytorch.org/. Last accessed 28 Jan 2021
7. Fast.ai Homepage: https://www.fast.ai/. Last accessed 28 Jan 2021
8. TensorFlow Homepage: https://www.tensorflow.org/. Last accessed 28 Jan 2021
9. Keras Homepage: https://keras.io/. Last accessed 28 Jan 2021
10. Deeplearning4j Homepage: https://deeplearning4j.konduit.ai. Last accessed 28 Jan 2021
11. El Naqa, I., Murphy, M.J.: What is machine learning? In: El Naqa, I., Li, R., Murphy, M.J. (eds.) Machine Learning in Radiation Oncology, pp. 3–11. Springer International Publishing, Cham (2015). https://doi.org/10.1007/978-3-319-18305-3_1
12. Géron, A.: Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media (2019)
13. Hagerty, J., Stanley, R., Stoecker, W.: Medical image processing in the age of deep learning. In: Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP), pp. 306–311 (2017)
14. Basheer, I., Hajmeer, M.: Artificial neural networks: fundamentals, computing, design, and application. J. Microbiol. Methods 43(1), 3–31 (2000)
15. Lévy, J., Flórez, R., Rodríguez, J.: Las redes neuronales artificiales, 1a. Edición. Tirant lo Blanch, Netbiblo (2008)
16. Lasi, H., Fettke, P., Feld, T., Hoffmann, M.: Industry 4.0. Bus. Inform. Syst. Eng. 6(4), 239–242 (2014)
17. Jordan, M., Mitchell, T.: Machine learning: trends, perspectives, and prospects. Science 349(6245), 255–260 (2015)
18. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. MIT Press (2018)
19. Jagielski, M., Oprea, A., Biggio, B., Liu, C., Nita-Rotaru, C., Li, B.: Manipulating machine learning: poisoning attacks and countermeasures for regression learning. In: 2018 IEEE Symposium on Security and Privacy (SP), pp. 19–35 (2018)
20. Li, W., Han, J., Pei, J.: CMAR: accurate and efficient classification based on multiple class-association rules. In: Proceedings 2001 IEEE International Conference on Data Mining, pp. 369–376 (2001)
21. Raschka, S., Mirjalili, V.: Python Machine Learning: Machine Learning and Deep Learning with Python, Scikit-learn, and TensorFlow 2. Packt Publishing Ltd. (2019)
22. Dayan, P., Sahani, M., Deback, G.: Unsupervised learning. In: The MIT Encyclopedia of the Cognitive Sciences, pp. 857–859 (1999)
23. Wiering, M., van Otterlo, M. (eds.): Reinforcement Learning. Springer Berlin Heidelberg, Berlin, Heidelberg (2012)
24. Kaelbling, L., Littman, M., Moore, A.: Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285 (1996)
25. Montgomery, D., Peck, E., Vining, G.: Introduction to Linear Regression Analysis. John Wiley & Sons (2021)
26. Kotsiantis, S., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 160(1), 3–24 (2007)
27. Celebi, M.E., Aydin, K. (eds.): Unsupervised Learning Algorithms. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-24211-8
28. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5(4), 115–133 (1943)
29. LeCun, Y., et al.: Handwritten digit recognition with a back-propagation network. In: Touretzky, D. (ed.) Advances in Neural Information Processing Systems, vol. 2. Morgan-Kaufmann (1990)
30. Fukushima, K.: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980)
31. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
32. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
33. Deng, J.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
34. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)
35. Thontirawong, P., Chinchanachokchai, S.: Teaching artificial intelligence and machine learning in marketing. Mark. Educ. Rev. 31(2), 58–63 (2021)
36. Miller, E., Ceballos, H., Engelmann, B., Schiffler, A., Batres, R., Schmitt, J.: Industry 4.0 and international collaborative online learning in a higher education course on machine learning. In: Machine Learning-Driven Digital Technologies for Educational Innovation Workshop, pp. 1–8 (2021)
37. Huang, L., Ma, K.: Introducing machine learning to first-year undergraduate engineering students through an authentic and active learning labware. In: IEEE Frontiers in Education Conference (FIE), pp. 1–4 (2018)
38. Kaggle Homepage: https://www.kaggle.com/c/titanic/data. Last accessed 28 Jan 2021
39. Kaggle Homepage: https://www.kaggle.com/camnugent/california-housing-prices. Last accessed 28 Jan 2021
40. RePDA Homepage: https://app.conagua.gob.mx/consultarepda.aspx. Last accessed 28 Jan 2021
41. Kaggle Homepage: https://www.kaggle.com/hellbuoy/online-retail-k-means-hierarchical-clustering/data. Last accessed 28 Jan 2021
42. Dask Homepage: https://dask.org/. Last accessed 28 Jan 2021
43. Hadoop Homepage: https://hadoop.apache.org/. Last accessed 28 Jan 2021
44. RUOA UNAM Homepage: https://www.ruoa.unam.mx/index.php?page=estaciones. Last accessed 28 Jan 2021
45. Kaggle Homepage: https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes. Last accessed 28 Jan 2021
46. Webmosquito Homepage: http://basurae.iies.unam.mx/webmosquito/html/. Last accessed 28 Jan 2021
47. Kaggle Homepage: https://www.kaggle.com/khoongweihao/covid19-xray-dataset-train-test-sets. Last accessed 28 Jan 2021
48. Kaggle Homepage: https://www.kaggle.com/arpitjain007/game-of-deep-learning-ship-datasets. Last accessed 28 Jan 2021
49. Kaggle Homepage: https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection. Last accessed 28 Jan 2021
50. Kaggle Homepage: https://www.kaggle.com/gpiosenka/sports-classification. Last accessed 28 Jan 2021
51. EscuelaMLDL Homepage: http://www.escuelamldl.enesmorelia.unam.mx/index.php/es/. Last accessed 28 Jan 2022
Fingerprinting ECUs to Implement Vehicular Security for Passenger Safety Using Machine Learning Techniques

Samuel Bellaire(B), Matthew Bayer, Azeem Hafeez, Rafi Ud Daula Refat, and Hafiz Malik

University of Michigan-Dearborn, Dearborn, MI 48128, USA
[email protected]

Abstract. The Controller Area Network (CAN) protocol used in vehicles today was designed to be fast, reliable, and robust. However, it is inherently insecure due to its lack of any kind of message authentication. Despite this, CAN is still used extensively in the automotive industry for various electronic control units (ECUs) and sensors which perform critical functions such as engine control. This paper presents a novel methodology for in-vehicle security through fingerprinting of ECUs. The proposed research uses the fingerprints injected in the signal due to material imperfections and semiconductor impurities. By extracting features from the physical CAN signal and using them as inputs for a machine learning algorithm, it is possible to determine the sender ECU of a packet. A high classification accuracy of up to 100.0% is possible when every node on the bus has a sufficiently different channel length.

Keywords: Machine learning · Fingerprinting · Classification · k-NN · Gaussian naive bayes · Multinomial logistic regression · Vehicle cybersecurity · In-vehicle networks · CAN

1 Introduction

Nearly every vehicle on the road today utilizes Controller Area Network (CAN) as one of its primary in-vehicle networks (IVNs) to interface various sensors and electronic control units (ECUs) inside the vehicle [9]. The CAN standard was adopted in the automotive industry due to its high reliability and noise immunity, but one significant drawback of CAN is its lack of a security layer. CAN bus is a message-based communication protocol where each message sent on the bus is given a unique identifier. When an ECU receives a message, it is impossible to determine the source of the message since there is no identifier for the sender in the CAN frame. This architecture makes CAN bus vulnerable to spoofing attacks, where an external attacker can gain control of any ECU on the bus and impersonate another ECU. In a vehicle, this type of attack could endanger the occupants if critical vehicle functions are disrupted at high speeds.
This was demonstrated in 2015 by Miller and Valasek [26], who discovered vulnerabilities in a 2014 Jeep Cherokee that allowed them remote access to the vehicle's engine controller and steering module, among other things. No modifications to the vehicle were required to disrupt these critical vehicle functions, and it was discovered that this type of exploitation could be accomplished at great distance through a cellular network. In this paper, we propose a novel intrusion detection system to defend against spoofing attacks in IVNs. This system links CAN packets to the sender ECU by using unique characteristics extracted from voltage measurements of the CAN-High signal. The features are then used in machine learning algorithms to associate the message with an ECU and determine if the message is legitimate. In the past, several researchers have used voltage characteristics to fingerprint ECUs [5,20]. Unlike them, we have used seven novel signal characteristics, extracted mostly from noise and rising edge transients, as features for our machine learning classifiers. The extraction and selection of these features is an important part of machine learning based IDSs, because the performance of a machine learning system depends on the quality of the selected features just as much as on the algorithm itself. To increase the interpretability of our IDS, the feature extraction and selection process is explained in detail in Sect. 4. The effectiveness of the selected features was evaluated with k-nearest neighbors, Gaussian Naive Bayes, and multinomial logistic regression using a publicly available dataset. The paper is organized as follows. In Sect. 2, the threat model is defined for the CAN bus network. In Sect. 3, related work is reviewed. In Sects. 4, 5, and 6, the machine learning pipeline is detailed (i.e., data acquisition and preprocessing, model selection and parameter tuning, and model validation and results). Finally, the paper is concluded in Sect. 7.
2 Threat Model
The primary threat model that this paper will examine is the case where an external attacker gains control of one or more ECUs in the vehicle. Systems such as the infotainment system, which have wireless connectivity features, allow the attacker to gain remote access without physical modifications to the vehicle. If the compromised ECU has access to the vehicle's CAN bus, then it would be capable of manipulating other ECUs on the bus to a certain degree, as shown in Fig. 1. Once an attacker gains control of an ECU, there are three types of attacks they can launch on the CAN bus: spoofing, fabrication, and bus-off attacks. Each attack is explained in more detail below. In this paper, the primary focus is on the spoofing attack.
2.1 Spoofing Attack
Spoofing attacks, sometimes called impersonation attacks, occur when an attacker takes control of an ECU and mimics the behavior of another ECU on the bus. Typically, the ECU to be impersonated is disabled before the attacking ECU begins to send CAN messages in its place.
Fig. 1. IVN threat model
2.2 Fabrication Attack
Fabrication attacks occur when an attacker injects additional messages on the CAN bus through a compromised ECU. These messages can override previously sent messages to disrupt communication. A fabrication attack can also result in denial of service (DoS) if the adversary floods the bus with high-priority messages that block other ECUs from transmitting.
2.3 Bus-Off Attack
A bus-off attack disables the compromised ECU, causing it to cease all activity on the CAN bus. This can disrupt time-sensitive communications if other ECUs are dependent on the messages sent by the compromised ECU.
3 Related Work
As shown in Fig. 2, existing approaches for IVN security can be separated into two primary categories: message authentication code (MAC) based approaches [6,17,31,35,37] and intrusion detection system (IDS) based approaches. IDSs can be further broken down into parameter monitoring [16,21,29,32], information theory [23,30,38], machine learning [10,18,19,22,24,27,33,36], and fingerprinting [1–5,8,11–15,20,28,34] based approaches.
3.1 MAC Based Approach
MAC based systems are the most traditional implementations of IVN security. In CAN, this is accomplished by appending some form of encrypted sender authentication to the message. One such implementation of a MAC based system is LCAP [17], developed in 2012 by Hazem and Fahmy. The system works by implementing a 16-bit “magic number”, either occupying 2 bytes in the message or utilizing the extra identifier bits provided by extended CAN.
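The general mechanism is easy to sketch. The following illustrative Python fragment shows the idea of a truncated, counter-protected MAC carried inside the payload; this is not LCAP itself, and the key handling, counter width and tag placement are assumptions made purely for illustration:

```python
import hmac, hashlib, struct

SECRET_KEY = b"shared-ecu-key"  # assumption: pre-shared key between sender and receiver

def tag_message(can_id: int, counter: int, payload: bytes) -> bytes:
    """Append a 16-bit truncated HMAC ("magic number") to a CAN payload.

    The counter guards against replay; truncating the HMAC to 2 bytes mirrors
    the idea of a 16-bit authenticator occupying part of the 8-byte payload.
    """
    assert len(payload) <= 6, "2 of the 8 payload bytes are reserved for the tag"
    msg = struct.pack(">IQ", can_id, counter) + payload
    tag = hmac.new(SECRET_KEY, msg, hashlib.sha256).digest()[:2]
    return payload + tag

def verify_message(can_id: int, counter: int, frame: bytes) -> bool:
    """Recompute the truncated tag and compare in constant time."""
    payload, tag = frame[:-2], frame[-2:]
    expected = hmac.new(SECRET_KEY, struct.pack(">IQ", can_id, counter) + payload,
                        hashlib.sha256).digest()[:2]
    return hmac.compare_digest(tag, expected)

frame = tag_message(0x123, 1, b"\x01\x02\x03")
print(verify_message(0x123, 1, frame))  # True
```

The sketch also makes the bandwidth cost visible: the 2-byte tag permanently consumes a quarter of the 8-byte CAN payload, which is exactly the overhead criticised below.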
Fig. 2. IVN security categories
While MAC based security can be effective, it can be computationally expensive and is easily nullified if the adversary gains access to the encryption key. In addition, many protocols such as CAN already have a significant overhead without a MAC. In the case of LCAP, which can occupy 25% of the payload with its "magic number" [17], the data bandwidth is reduced even further.
3.2 Parameter Monitoring IDS Approach
Since many CAN bus messages are broadcast periodically, IDSs based on parameter monitoring typically examine message frequency or the time difference between a remote frame and its reply. A database of expected message frequencies or timings can be created, and deviations from these timings can be marked as anomalies, as seen in Fig. 3. This approach is excellent at detecting DoS and bus-off attacks, but can fall short if the adversary is timing-aware.
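A minimal sketch of this idea, assuming a pre-built database of expected inter-arrival times per CAN ID (the IDs, periods and tolerance below are illustrative, not taken from any cited system):

```python
from collections import defaultdict

# Assumed database of expected inter-arrival times (seconds) per CAN ID,
# e.g. learned from attack-free traffic; the values here are illustrative.
EXPECTED_PERIOD = {0x100: 0.010, 0x200: 0.100}
TOLERANCE = 0.5  # flag deviations beyond +/-50% of the expected period

last_seen = defaultdict(lambda: None)

def check_frame(can_id: int, timestamp: float) -> bool:
    """Return True if the frame's timing is anomalous for its ID."""
    anomalous = False
    prev = last_seen[can_id]
    if prev is not None and can_id in EXPECTED_PERIOD:
        period = EXPECTED_PERIOD[can_id]
        gap = timestamp - prev
        # Too fast suggests injection/flooding; too slow suggests deletion or bus-off.
        anomalous = abs(gap - period) > TOLERANCE * period
    last_seen[can_id] = timestamp
    return anomalous
```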
Fig. 3. Concept of frequency based IDS
In 2015, Taylor et al. [32] developed a timing-based system that detected upwards of 96.2% of attacks that altered the frequency of messages on a CAN bus, ranging from message deletion to injection at up to ten times the typical frequency.
3.3 Information Theory IDS Approach
IDS’s based on information theory examine the information entropy of messages on the CAN bus. In 2016, Marchetti et al. [23] demonstrated that certain types of attacks can cause changes in entropy compared to the typical value for the message. This led to accurate detection of anomalies for most messages on the bus, but for the few IDs with varying entropy, the method proved to be less effective. 3.4
3.4 Machine Learning IDS Approach
Machine Learning IDS’s utilize various artificial intelligence models, such as neural networks and AI-based classifiers, to implement IVN security. One example of such a system is the deep learning IDS developed by Kang et al. [19]. In their IDS, the probability distributions of bits in the CAN packet’s payload was used as a set of features for the model. This approach was able to classify legitimate and malicious messages with an accuracy of 97.8%. 3.5
3.5 Fingerprinting IDS Approach
Fingerprinting is a method that examines the non-ideal characteristics of a transmitter, such as semiconductor impurities or parasitics in its communication channel. These characteristics can be exploited to create a unique fingerprint for each ECU. A rather extreme case can be seen in Fig. 4, where it would be easy to differentiate the two ECUs through features such as transient response length. This ability to differentiate between ECUs makes fingerprinting an excellent candidate to detect spoofing attacks. It is, however, ineffective against bus-off attacks.
Fig. 4. Physical signal differences between ECUs
In 2018, Choi et al. developed VoltageIDS [5], a fingerprinting based IDS that utilizes various time-domain and frequency-domain features, such as mean,
standard deviation, and kurtosis, to classify the source ECU of a signal. Using Linear SVM, they achieved F-scores of up to 99.7% with just 70 samples per ECU. The technology used in many IDSs for IVNs can also be utilized in broader applications. In 2020, Hady et al. [7] created the Enhanced Healthcare Monitoring System, an IDS for a hospital's network that combines patient biometric data with network data to act as features. In addition to ECUs, fingerprinting based approaches can also be applied to various other sensors and signals. Zuo et al. [39] collected radio frequency emission data from UAV video signals, and were able to successfully differentiate the signals from WiFi interference.
4 Data Acquisition and Preprocessing
4.1 Data Collection
Data collection was carried out using seven different MCP2515 CAN bus shields mounted on Arduino UNO-R2 microcontrollers, each with a different channel length, varying from 2 m to 10 m. The CAN-High signal was recorded using a DSO1012A oscilloscope sampling at 20 MSa/s. For each of the ECUs, thirty records were captured with approximately 600 samples in each record, amounting to a pulse train 4–5 pulses in length. Fig. 5 shows a few pulses of CAN-High data from ECU 1.
4.2 Pulse Segmentation
Segmenting each record into individual pulses was done using a simple thresholding algorithm, the pseudocode for which can be seen in Algorithm 1.

Algorithm 1. Pulse Segmentation Algorithm
k ← 0
for i ← 0 to N − 2 do
    if y(i) < 3 and y(i + 1) ≥ 3 then
        indexList(k) ← (i + 1)
        k ← (k + 1)
        i ← (i + B − 1)
    end if
end for
Whenever the raw signal y(n) (which is of length N samples) transitions from below 3 V to above 3 V, the current index is saved in a list of indices. This list is later used to segment the data. In some cases the transient period of the signal’s rising edge has a large amplitude and can cross the 3 V threshold multiple times. To bypass this issue, B samples can be skipped after finding a valid rising edge to ensure the transient has dissipated sufficiently. B = 20 was used in this case.
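A direct Python translation of Algorithm 1, with the B-sample skip, might look as follows (a sketch; the toy waveform is fabricated for demonstration):

```python
import numpy as np

def segment_pulses(y: np.ndarray, threshold: float = 3.0, skip: int = 20) -> list:
    """Indices of rising edges in a CAN-High record (Algorithm 1).

    `skip` is the B parameter: samples ignored after a detected edge so that
    threshold re-crossings during the transient are not counted as new pulses.
    """
    edges = []
    i = 0
    while i < len(y) - 1:
        if y[i] < threshold and y[i + 1] >= threshold:
            edges.append(i + 1)
            i += skip          # let the transient dissipate (B = 20 in the paper)
        else:
            i += 1
    return edges

# Toy example: two clean 2.5 V -> 3.5 V pulses
sig = np.array([2.5] * 30 + [3.5] * 40 + [2.5] * 30 + [3.5] * 40 + [2.5] * 10)
print(segment_pulses(sig))  # [30, 100]
```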
Fig. 5. Pulse train from ECU 1
Figure 6 shows an example of a single pulse extracted from the pulse train in Fig. 5. Extracting individual pulses makes it easier to compute certain signal attributes (e.g., from only the dominant bit, or only the transient period).
4.3 Feature Extraction
After singular pulses were isolated from the data, seven features were extracted from each pulse to form a feature set for the classifiers. Each of these features is described below.

Transient Response Length. The time required for the transient response to settle to within some threshold α of the signal's steady-state value VS. This can be found by starting near the falling edge of the pulse, denoted by the sample NF, where we have good confidence that the signal is in the steady state, and working backwards until the first value that exceeds the threshold α is found. Pseudocode for this algorithm can be seen in Algorithm 2.

Maximum Transient Voltage. The maximum value of the signal observed during the transient period, as shown in Eq. 1.
Fig. 6. Single pulse from ECU 1
V_{max} = \max(y(n))  (1)
Energy of the Transient Period. The sum of the squared signal over the transient period, as seen in Eq. 2, where N_T is the length of the transient period in samples.

E_y = \sum_{n=0}^{N_T - 1} |y(n)|^2  (2)
Average Dominant Bit Steady-State Value. The average value of the dominant bit during the steady state. As seen in Eq. 3, this can be done by averaging the samples from N_S to N_F, where N_S is the sample at the beginning of the steady state, and N_F is the last sample before the falling edge of the pulse.

V_D = \frac{1}{N_F - N_S + 1} \sum_{n=N_S}^{N_F} y(n)  (3)
Algorithm 2. Transient Length Algorithm
for i ← NF to 0 do
    if |y(i) − VS| > α then
        ltr ← i + 1
        break
    end if
end for
Peak Noise Frequency. The frequency at which the peak value occurs in the magnitude spectrum. To determine this, the noise of the signal must first be isolated to ensure data independence in the Fast Fourier Transform (FFT). Thus, the ideal signal must be approximated from the pulse. Since the pulse begins at the rising edge, the only parameter that must be found to construct the ideal signal is the sample NF at which the falling edge occurs. This can be done using a simple 3 V threshold, as seen in Algorithm 3.

Algorithm 3. Finding Falling Edge Algorithm
for i ← 0 to N − 1 do
    if y(i) ≤ 3 then
        NF ← i
        break
    end if
end for
With NF calculated, the ideal signal can be acquired, as seen in Fig. 7. This signal is 3.5 V during the dominant bit and 2.5 V during the recessive bit. The ideal pulse can then be subtracted from the actual pulse, leaving the noise behind. Additionally, the mean value of the noise can also be subtracted to remove the DC component of the noise signal, as seen in Fig. 8. The FFT of the noise signal with DC component removed can then be taken to determine the peak frequency of the pulse's noise, such as in Fig. 9.

Average Noise. The average value of the noise signal whose length is N samples, such as the one shown in Fig. 8. See Eq. 4.

\bar{V} = \frac{1}{N} \sum_{n=0}^{N-1} y(n)  (4)
Standard Deviation of Noise. The standard deviation of the noise signal whose length is N samples, such as the one shown in Fig. 8. See Eq. 5.

\sigma = \sqrt{\frac{\sum_{i} (x_i - \bar{V})^2}{N}}  (5)
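Putting the noise-related features together, a rough NumPy sketch of the computation described above (assuming the 20 MSa/s sampling rate and the 3.5 V/2.5 V ideal levels from the text) could be:

```python
import numpy as np

FS = 20e6  # oscilloscope sampling rate, 20 MSa/s

def noise_features(pulse: np.ndarray, n_f: int):
    """Average noise, noise standard deviation, and peak noise frequency.

    `n_f` is the falling-edge sample found with Algorithm 3; the ideal pulse
    is 3.5 V for the dominant bit and 2.5 V for the recessive bit.
    """
    ideal = np.where(np.arange(len(pulse)) < n_f, 3.5, 2.5)
    noise = pulse - ideal
    noise_ac = noise - noise.mean()          # remove the DC component
    spectrum = np.abs(np.fft.rfft(noise_ac))
    freqs = np.fft.rfftfreq(len(noise_ac), d=1.0 / FS)
    return noise.mean(), noise.std(), freqs[np.argmax(spectrum)]
```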
Fig. 7. Actual vs. Ideal signal
4.4 Outlier Removal
Outlier removal was performed as in Eq. 6. For a given class c_i, the j-th data point x_{ij} is removed if any of its features differ from the class mean \mu_i of that feature by more than 3 standard deviations.

|x_{ij} - \mu_i| > 3\sigma_i  (6)
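A compact NumPy sketch of this per-class 3-sigma filter might be:

```python
import numpy as np

def remove_outliers(X: np.ndarray, y: np.ndarray):
    """Drop samples whose features deviate from their class mean by > 3 sigma (Eq. 6)."""
    keep = np.ones(len(X), dtype=bool)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        mu = X[idx].mean(axis=0)
        sigma = X[idx].std(axis=0)
        # a point is removed if ANY of its features exceeds the 3-sigma band
        outlier = (np.abs(X[idx] - mu) > 3 * sigma).any(axis=1)
        keep[idx[outlier]] = False
    return X[keep], y[keep]
```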
5 Model Selection and Parameter Tuning
Three different machine learning models were used to perform ECU classification using the extracted features: k-Nearest Neighbor (k-NN), Gaussian Naive Bayes (GNB), and Multinomial Logistic Regression (MLR). All of these models were implemented using MATLAB’s Statistics and Machine Learning Toolbox [25]. A brief overview of each of these methods can be seen below.
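The paper implements these models in MATLAB; as an approximate Python equivalent, the following scikit-learn sketch fits the same three classifiers on a 70–30 split (the feature matrix here is random placeholder data, not the paper's dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X: pulse feature matrix, y: sender-ECU labels (placeholder random data here)
rng = np.random.default_rng(0)
X = rng.normal(size=(840, 7))
y = rng.integers(0, 7, size=840)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "1-NN": KNeighborsClassifier(n_neighbors=1),
    "GNB": GaussianNB(),
    "MLR": LogisticRegression(max_iter=1000),  # multinomial logistic regression
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```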
Fig. 8. Noise of pulse from ECU 1
5.1 k-Nearest Neighbor
k-NN is a simple machine learning algorithm that maps observed data points xi to an n-dimensional feature space. Fig. 10 shows an example of three different features represented in a 3-dimensional space. When an unknown data point u is input to the model, k-NN calculates the distance between u and every observed data point xi, and uses the k closest observations to make a classification decision.
5.2 Gaussian Naive Bayes
GNB is a specific case of the Naive Bayes (NB) classification method that uses a Gaussian distribution. To predict the class of an unknown data point u, the GNB classifier leverages Bayes' Theorem (see Eq. 7). Since GNB is an implementation of NB, it makes the "naive" assumption that all of the features are independent, which greatly simplifies the calculation of the likelihood P(u|c_i), since it factorises into a product of per-feature probabilities. After calculating P(c_i|u) for each class c_i, a prediction is made based on the highest probability.

P(c_i|u) = \frac{P(u|c_i) P(c_i)}{P(u)}  (7)
Fig. 9. Magnitude spectrum of pulse noise
5.3 Multinomial Logistic Regression
MLR is a generalization of linear regression to a classification problem with more than two classes. Similarly to GNB, MLR assigns a probability for u belonging to class ci for each class i. The classification decision is then made based on the highest probability.
5.4 Model Tuning and Parameter Selection
The tuning phase was implemented by adjusting various parameters of each model to determine which sets of parameters result in the best model performance. Each of the three methods implemented using MATLAB has its own unique parameters that can be changed when fitting the model.

K-NN Tuning. A list of parameters that were tested for the k-NN classifier can be seen below. Ultimately, the parameters selected were k = 1 and a Euclidean distance metric with equal distance weighting.
– Value of k from 1 → 31
– Distance calculation method (euclidean, hamming, cityblock)
– Distance weights (equal, inverse, inverse squared)
A set of three features was used in the k-NN classifier, listed below.
– Transient response length
– Maximum transient voltage
– Energy of the transient period
Fig. 10. Example of a 3-D feature space
GNB Tuning. A list of parameters adjusted for the GNB classifier can be seen below. For this classifier, a kernel distribution with a normal kernel type produced the best results.
– Distribution type (kernel, multinomial, multivariate multinomial, normal)
– Kernel type (box, epanechnikov, Gaussian, triangular, normal)
For the GNB classifier, an extra feature was added compared to k-NN, giving a 4-dimensional feature set. These features are listed below.
– Transient response length
– Maximum transient voltage
– Energy of the transient period
– Average dominant bit steady-state value
MLR Tuning. A list of parameters adjusted for the MLR classifier can be seen below. A hierarchical model with a logit link function was utilized for MLR.
– Model type (nominal, ordinal, hierarchical)
– Link function (logit, probit, log-log, complementary log-log)
For the MLR classifier, poor performance was observed for smaller feature sets. The model performed best when all seven features, detailed in Sect. 4, were used in the classification algorithm.
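For readers working outside MATLAB, a comparable hyperparameter search for the k-NN model can be sketched with scikit-learn's GridSearchCV (the data are placeholders; scikit-learn's "manhattan" metric corresponds to MATLAB's "cityblock", and the hamming metric is omitted since it is not meaningful for continuous features):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 3)), rng.integers(0, 7, size=200)

# Search space analogous to the paper's k-NN tuning.
param_grid = {
    "n_neighbors": list(range(1, 32)),
    "metric": ["euclidean", "manhattan"],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)
print(search.best_params_)
```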
6 Model Validation and Results
Model validation and testing was performed using a 70–30 split of the data set, which amounted to approximately 560 training data points and around 280 testing data points, split evenly across all 7 ECUs. Accuracy was the primary metric used in evaluating model performance, as in Eq. 8, where N_C denotes the number of correct classifications and N_T the total number of classifications made.

Acc. = \frac{N_C}{N_T}  (8)
Table 1. Confusion matrix for 1-NN and GNB classifiers (7 ECUs)

True Class    Predicted Class                                    Acc. (%)
              E1     E2     E3     E4     E5     E6     E7
E1            41      0      0      0      0      0      0      100.0
E2             0     40      0      0      0      0      0      100.0
E3             0      0     42      0      0      0      0      100.0
E4             0      0      0     42      0      0      0      100.0
E5             0      0      0      0     36      0      0      100.0
E6             0      0      0      0      0     42      0      100.0
E7             0      0      0      0      0      0     42      100.0
Acc. (%)    100.0  100.0  100.0  100.0  100.0  100.0  100.0     100.0
Table 2. Confusion matrix for MLR classifier (7 ECUs)

True Class    Predicted Class                                    Acc. (%)
              E1     E2     E3     E4     E5     E6     E7
E1            36      0      0      0      0      1      0       97.3
E2             0     37      0      0      0      1      0       97.4
E3             0      0     38      0      0      0      1       97.4
E4             0      0      0     35      0      1      3       89.7
E5             0      0      0      0     37      0      0      100.0
E6             1      3      0      0      0     34      0       89.5
E7             0      0      0      0      0      0     41      100.0
Acc. (%)     97.3   92.5  100.0  100.0   97.4   94.4   91.1      95.9
As can be seen in Table 1, both the k-NN classifier and GNB classifier deliver perfect classification for this dataset, with an accuracy of 100.0%. The MLR classifier, while faring much worse with a per-class accuracy as low as 89.5%, still delivers an overall accuracy of 95.9% as shown in Table 2.
The accuracy is so high for this dataset because of the distinct separation between ECUs in the feature space, as can be seen in Fig. 10. Since each of the ECUs in the dataset has a different channel length, they all have different transient characteristics that can be easily distinguished.
7 Conclusion
Connected vehicles are becoming more prevalent every day, and with autonomous vehicle technology on the horizon, security for IVNs is becoming more important than ever before. The fingerprinting methodology we presented in this paper was able to correctly identify the sender ECU of a CAN packet with a high degree of accuracy. Perfect classification was achieved when the ECU channel lengths were sufficiently different, and up to 95.9% accuracy was achieved when this was not the case. Such a system would be cost-effective and could be easily implemented into an IVN with minimal modifications to the network, only requiring an additional ECU to be inserted as the IDS. Thus, we believe that this IDS would be an effective solution for combating spoofing attacks in IVNs.

Acknowledgment. The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia for supporting this work through the project # DRI-KSU-934. This research is also partly supported by the National Science Foundation (NSF) under award # 2035770.
References
1. Avatefipour, O., Hafeez, A., Tayyab, M., Malik, H.: Linking received packet to the transmitter through physical-fingerprinting of controller area network. In: 2017 IEEE Workshop on Information Forensics and Security (WIFS). IEEE (2017)
2. Cho, K.T., Shin, K.G.: Viden: attacker identification on in-vehicle networks. In: 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1109–1123. ACM (2017)
3. Cho, K.T., Shin, K.G.: Fingerprinting electronic control units for vehicle intrusion detection. In: USENIX Security Symposium, pp. 911–927 (2016)
4. Choi, W., Jo, H.J., Woo, S., Chun, J.Y., Park, J., Lee, D.H.: Identifying ECUs using inimitable characteristics of signals in controller area networks. IEEE Trans. Veh. Technol. 67(6), 4757–4770 (2018)
5. Choi, W., Joo, K., Jo, H.J., Park, M.C., Lee, D.H.: VoltageIDS: low-level communication characteristics for automotive intrusion detection system. IEEE Trans. Inf. Forensics Secur. 13(8), 2114–2129 (2018)
6. Doan, T.P., Ganesan, S.: CAN Crypto FPGA Chip to Secure Data Transmitted through CAN FD Bus using AES-128 and SHA-1 Algorithms with a Symmetric Key. Technical report, SAE Technical Paper (2017)
7. Hady, A.A., Ghubaish, A., Salman, T., Unal, D., Jain, R.: Intrusion detection system for healthcare systems using medical and network data: a comparison study. IEEE Access 8, 106576–106584 (2020)
8. Hafeez, A.: A robust, reliable and deployable framework for in-vehicle security (2020)
9. Hafeez, A., Malik, H., Avatefipour, O., Raj Rongali, P., Zehra, S.: Comparative study of CAN-bus and FlexRay protocols for in-vehicle communication. Technical report, SAE Technical Paper (2017)
10. Hafeez, A., Mohan, J., Girdhar, M., Awad, S.S.: Machine learning based ECU detection for automotive security. In: 2021 17th International Computer Engineering Conference (ICENCO). IEEE (2021)
11. Hafeez, A., Ponnapali, S.C., Malik, H.: Exploiting channel distortion for transmitter identification for in-vehicle network security. SAE International Journal of Transportation Cybersecurity and Privacy 3(11-02-02-0005) (2020)
12. Hafeez, A., Rehman, K., Malik, H.: State of the Art Survey on Comparison of Physical Fingerprinting-Based Intrusion Detection Techniques for In-Vehicle Security. Technical report, SAE Technical Paper (2020)
13. Hafeez, A., Tayyab, M., Zolo, C., Awad, S.: Finger printing of engine control units by using frequency response for secure in-vehicle communication. In: 2018 14th International Computer Engineering Conference (ICENCO), pp. 79–83. IEEE (2018)
14. Hafeez, A., Topolovec, K., Awad, S.: ECU fingerprinting through parametric signal modeling and artificial neural networks for in-vehicle security against spoofing attacks. In: 2019 15th International Computer Engineering Conference (ICENCO), pp. 29–38. IEEE (2019)
15. Hafeez, A., Topolovec, K., Zolo, C., Sarwar, W.: State of the Art Survey on Comparison of CAN, FlexRay, LIN Protocol and Simulation of LIN Protocol. Technical report, SAE Technical Paper (2020)
16. Han, M.L., Lee, J., Kang, A.R., Kang, S., Park, J.K., Kim, H.K.: A statistical-based anomaly detection method for connected cars in Internet of Things environment. In: Hsu, C.-H., Xia, F., Liu, X., Wang, S. (eds.) IOV 2015. LNCS, vol. 9502, pp. 89–97. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27293-1_9
17. Hazem, A., Fahmy, H.: LCAP – a lightweight CAN authentication protocol for securing in-vehicle networks. In: 10th ESCAR Embedded Security in Cars Conference, vol. 6 (2012)
18. Jain, N., Sharma, S.: The role of decision tree technique for automating intrusion detection system. International Journal of Computational Engineering Research 2(4) (2012)
19. Kang, M.J., Kang, J.W.: Intrusion detection system using deep neural network for in-vehicle network security. PLoS ONE 11(6), e0155781 (2016)
20. Kneib, M., Huth, C.: Scission: signal characteristic-based sender identification and intrusion detection in automotive networks. In: 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 787–800. ACM (2018)
21. Lee, H., Jeong, S.H., Kim, H.K.: OTIDS: a novel intrusion detection system for in-vehicle network by using remote frame. In: 2017 15th Annual Conference on Privacy, Security, and Trust (PST). IEEE (2017)
22. Marchetti, M., Stabili, D.: Anomaly detection of CAN bus messages through analysis of ID sequences. In: 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 1577–1583. IEEE (2017)
23. Marchetti, M., Stabili, D., Guido, A., Colajanni, M.: Evaluation of anomaly detection for in-vehicle networks through information-theoretic algorithms. In: 2016 IEEE 2nd International Forum on Research and Technologies for Society and Industry Leveraging a Better Tomorrow (RTSI). IEEE (2016)
24. Markovitz, M., Wool, A.: Field classification, modeling and anomaly detection in unknown CAN bus networks. Veh. Commun. 9, 43–52 (2017)
25. MathWorks: MATLAB Statistics and Machine Learning Toolbox (2021)
26. Miller, C., Valasek, C.: Remote exploitation of an unaltered passenger vehicle. Black Hat USA (2015)
27. Narayanan, S.N., Mittal, S., Joshi, A.: Using data analytics to detect anomalous states in vehicles (2015). arXiv:1512.08048
28. Refat, R.U.D., Elkhail, A.A., Hafeez, A., Malik, H.: Detecting CAN bus intrusion by applying machine learning method to graph based features. In: Arai, K. (ed.) IntelliSys 2021. LNNS, vol. 296, pp. 730–748. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-82199-9_49
29. Song, H.M., Kim, H.R., Kim, H.K.: Intrusion detection system based on the analysis of time intervals of CAN messages for in-vehicle network. In: 2016 International Conference on Information Networking (ICOIN), pp. 63–68. IEEE (2016)
30. Stabili, D., Marchetti, M., Colajanni, M.: Detecting attacks to internal vehicle networks through Hamming distance. In: 2017 AEIT International Annual Conference. IEEE (2017)
31. Sugashima, T., Oka, D.K., Vuillaume, C.: Approaches for secure and efficient in-vehicle key management. SAE International Journal of Passenger Cars – Electronic and Electrical Systems 9(2016-01-0070), 100–106 (2016)
32. Taylor, A., Japkowicz, N., Leblanc, S.: Frequency-based anomaly detection for the automotive CAN bus. In: 2015 World Congress on Industrial Control Systems Security (WCICSS), pp. 45–49. IEEE (2015)
33. Taylor, A., Leblanc, S., Japkowicz, N.: Anomaly detection in automobile control network data with long short-term memory networks. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 130–139. IEEE (2016)
34. Tayyab, M., Hafeez, A., Malik, H.: Spoofing attack on clock based intrusion detection system in controller area networks. In: Proceedings of the NDIA Ground Vehicle Systems Engineering Technology Symposium, pp. 1–13 (2018)
35. Ueda, H., Kurachi, R., Takada, H., Mizutani, T., Inoue, M., Horihata, S.: Security authentication system for in-vehicle network. SEI Tech. Rev. 81, 5–9 (2015)
36. Wasicek, A.R., Pesé, M.D., Weimerskirch, A., Burakova, Y., Singh, K.: Context-aware intrusion detection in automotive control systems. In: ACM/IEEE 6th International Conference on Cyber-Physical Systems (ICCPS), pp. 41–50. IEEE (2015)
37. Wolf, M., Weimerskirch, A., Paar, C.: Security in automotive bus systems. In: Workshop on Embedded Security in Cars (2004)
38. Wu, W., Huang, Y., Kurachi, R., Zeng, G., Xie, G., Li, R., Li, K.: Sliding window optimized information entropy analysis method for intrusion detection on in-vehicle networks. IEEE Access 6, 45233–45245 (2018)
39. Zuo, M., Xie, S., Zhang, X., Yang, M.: Recognition of UAV video signal using RF fingerprints in the presence of WiFi interference. IEEE Access 9, 88844–88851 (2021)
Knowledge Graph Enrichment of a Semantic Search System for Construction Safety

Emrah Inan1, Paul Thompson1, Fenia Christopoulou1, Tim Yates2, and Sophia Ananiadou1(B)

1 National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
{emrah.inan,paul.thompson,efstathia.christopoulou,sophia.ananiadou}@manchester.ac.uk
2 Health and Safety Executive, HSE Science and Research Centre, Buxton, UK
[email protected]

Abstract. Minimising accident risk for new construction projects requires a thorough analysis of previous accidents, including an examination of circumstances and reasons for their occurrence, their consequences, and measures used for future mitigation. Such information is often recorded only within the huge amounts of unstructured textual documentation that are typically generated for large-scale construction projects. Accordingly, the process of locating, understanding, analysing and combining sufficient intelligence from documentation about previous projects can be overwhelming. Previously, text mining (TM) technology has been used to develop a semantically-enhanced search system over a repository of workplace accident reports, to increase the efficiency and effectiveness of locating important information relating to construction safety. In this article, we describe our enhancement of this system, through the generation and integration of a knowledge graph (KG). We extract triples consisting of subject, predicate and object (SPO) from unstructured text and map them to concepts in a knowledge base (KB). The mapping is carried out by comparing the contextualised representations of the text surrounding the SPO elements with concept descriptions in the KB. Subsequently, a Coreference Resolution method is used to detect mentions in text that correspond to pronominal references occurring in SPO triples, to ensure that relevant knowledge is not overlooked. Finally, SPO triples are filtered to retain only those containing knowledge that is relevant to construction safety. We show that our approach can significantly outperform related methods, both in terms of detecting the elements of triples and linking them to entries in a KB.
Keywords: Knowledge graph · Semantic search system · Workplace accidents · Construction
F. Christopoulou—Currently a Research Scientist at Huawei Noah's Ark Lab, London, UK.
1 Introduction
Compared to other industries, construction is regarded as one of the most dangerous [25], mainly because of the complexity and uncertainty of the work environment [24]. It is therefore of critical importance to manage relevant knowledge, in order to minimise the occurrence of accidents and to respond quickly to uncertainties at construction sites [12]. The task of risk management aims to identify potential risks to workers at each stage of the project, along with safety measures that can mitigate these risks effectively. An important outcome of risk management is a structured risk register. Recent recommendations about health and safety knowledge in construction [5] suggest that each entry in the register should characterize a situation that could constitute a risk, using a combination of project attributes, such as a construction activity (e.g., demolition); materials or equipment used during this activity (e.g., timber or ladders); and types of location where the activity is carried out (e.g., the terrain or part of a building). Potential hazards in this situation are identified (e.g., a fall from a height), along with the level of risk, severity of consequences and possible mitigation strategies. Compiling effective and comprehensive risk registers requires information about safety measures that have been employed in previous similar projects, along with an understanding of how and why any accidents occurred. A wealth of valuable information is collected by organizations such as the UK Health and Safety Executive (HSE)1 , whose rich and varied archive of health and safety data is accrued from its workplace inspection, incident investigation and enforcement activities, along with the incident information reported to HSE by its duty holders. However, the sheer volumes of data that are produced during the lifecycle of every construction project [4], most of which take the form of unstructured textual documents [19], mean that important knowledge about previous projects is essentially hidden, and can only be located by investing significant amounts of time and effort. The most relevant documents to a new project will normally be those relating to previous projects whose attributes overlap with the new project; such documents can facilitate an understanding of how interactions between these attributes can lead to potential hazards. However, the complex nature of construction projects means that using standard keyword-based search techniques to try to retrieve such documents is likely to be ineffective, especially since a given attribute may be described in text using a variety of different phrases, synonyms and/or abbreviations. The task is further complicated by the fact that information about incidents, potential reasons and possible mitigation strategies is often fragmented, and may be deeply buried within longer documents. This means that health and safety experts may potentially overlook a vital piece of intelligence, which could make the critical difference between the life or death of a construction worker. Accordingly, there is a strong need for systems that provide more efficient and systematic ways of locating information within construction industry document archives. 1
1 https://www.hse.gov.uk/.
The application of text mining (TM) methods to such archives can make it far easier and more efficient to find documents that have direct relevance to complex sets of information requirements; to locate important details within these documents; and to establish potential links between fragmented pieces of information. TM methods have previously been integrated within systems that aim to improve retrieval of construction-related documents, using techniques such as automatic expansion of search queries with additional semantically-related terms [15], retrieval of semantically similar documents [38] and automatic concept recognition [11]. In this article, we describe our work to enhance an existing, semantic search system for RIDDOR workplace accident reports, through the generation and integration of a knowledge graph (KG), which captures enriched relationships between key construction-related concepts that are described in these reports. As such, our search system goes beyond existing related efforts, by allowing users to efficiently search for specific types of knowledge contained within the reports. In other domains, semantic search systems that allow exploration and refinement of initial search results based on various facets of the semantic content of documents have proven to be effective [23,27,29]. To alleviate the need for users to formulate complex initial keyword queries, the automatic identification of different facets is used to present them with semantic-level information that can be used to iteratively refine and filter even large sets of initial search results, in order to rapidly “drill down” to documents that closely match their search requirements. Facets such as clusters of semantically similar documents, important terms appearing within search results or various categories of automatically recognized named entities (NEs) may be used in combination. For example, identifying a document cluster covering a relevant topic may constitute an initial filter. Subsequently, the NEs mentioned within the cluster may be explored and used for further refinement. After applying these filters, only those documents covering a topic of interest and which mention (combinations of) NEs of interest will need to be reviewed in detail by the user. Going a step further, the ability to explore relations involving NEs can further increase the efficiency with which information relevant to compiling risk registers can be found. For instance, relations could be used to help to identify specific evidence about which types of hazards have previously occurred when using a project-specific piece of equipment (such as a ladder), and/or which types of harmful consequences have specifically been reported to result from a given hazard (e.g., a fall from a ladder may result in a broken leg). Such relationships can be detected through the application of Open Information Extraction (OpenIE) methods, whose aim is to automatically extract structured knowledge from a text in the form of subject-predicate-object (SPO) triples. SPO triples provide information about which pairs of entities that are mentioned in text are linked together, and how they are linked. For example, in the sentence “Operative slid down the ladder”, the subject is operative, the object is ladder and the predicate is slid down. The SPO triples that are extracted by processing a document collection can be used as the basis for creating a knowledge graph (KG),
whose nodes correspond to subjects and objects, and whose edges correspond to predicates. A KG resulting from processing workplace accident reports could be easily queried, e.g., to discover the range of predicates that occur when ladder is the object, since these predicates could be indicative of the types of hazards that typically involve ladders. Further value can be added to automatically constructed KGs if their nodes and edges are linked to concepts in existing knowledge bases (KBs), such as Wikidata2 and DBPedia3. Such entries often include comprehensive information, like pictures, videos, relationships to other concepts, links to information in encyclopedias, etc. As such, automatically constructed KGs can be considerably enriched if their nodes and edges can be linked to appropriate information in KBs. In this article, we describe our work to enhance an existing, faceted semantic search system, by generating and integrating a KG that captures knowledge pertaining to construction safety expressed within the free text of RIDDOR workplace accident reports, in the form of structured SPO triples. The generated KG is further enhanced by automatically linking the elements of the triples to Wikidata entries, wherever possible. Although Wikidata is geared towards topics of general interest, and so does not encode detailed knowledge about specialised domains such as health and safety in construction, it does include entries for many concepts and relations that are relevant to workplace accidents, such as falls, ladders or scaffolding. Therefore, given the absence of a comprehensive and detailed KB that is specific to the construction industry, we selected Wikidata as a suitable resource that could help to enrich our KG. We compare our results to those achieved by related methods, and demonstrate that our approach can significantly outperform them, both in terms of detecting the elements of triples and linking them to entries in Wikidata. The existing semantic search system already facilitates search over a range of RIDDOR reports from the HSE archive, by integrating standard keyword-based search with state-of-the-art TM methods to provide a number of different types of faceted search refinements. In addition, the system provides automatic generation of summaries, as a means of increasing the efficiency of scanning longer documents for potential relevancy to a search. While the existing system already allows filtering of reports based on mentions of important concepts (NEs), the KG provides the means to explore and search for knowledge expressed about these NEs within the reports. Through the integration of mechanisms into the search system to allow exploration, searching and filtering of the contents of the KG in different ways (i.e., as a graph or as a table), users of the enhanced system can more easily and rapidly gain an overview of the types of knowledge contained within the RIDDOR reports, without having to read the reports in detail. The remainder of the paper is structured as follows. Section 2 introduces related work, while Sect. 3 describes our method to generate a KG that encodes structured information relating to workplace accidents, extracted from RIDDOR reports, and explains how we have integrated the KG within the existing search
https://www.wikidata.org/wiki/Wikidata:Main Page. https://www.dbpedia.org/.
Knowledge Graph Enrichment
37
engine to enhance semantic search functionality. Section 4 describes our experiments to evaluate the generated KG. Finally, Sect. 5 concludes the study and explores directions for future work.
2
Related Work
A detailed analysis of the semantics of domain-specific terms used within documents is important to facilitate effective search and retrieval capabilities, e.g., through automatic expansion of user-entered query terms with synonyms and other semantically related terms, or through automatic named entity recognition (NER), which can support filtering of search results based on mentions of specific concepts or concept categories. Important for these tasks is a knowledge about the different ways in which a given concept can be described in text and/or an understanding of semantic relations among different concepts. While domain-specific lexical resources or ontologies can be helpful for such tasks if they list synonyms for each concept, the domain-specific Industry Foundation Classes (IFC) ontology [13] does not include synonyms. The semantic search system described in [11] instead uses the general language WordNet [21] to identify synonyms of IFC concepts, to aid in both query expansion and to improve coverage of ontology-driven NER. However, WordNet is not designed to provide comprehensive coverage of construction-related vocabulary. As an alternative, other efforts have used the Word2Vec algorithm [20] which, given a large corpus of domain-specific text, can automatically induce a comprehensive network of semantic relationships among the words and phrases used in the corpus (e.g., [33]). Relationships identified by Word2Vec have been employed by both [38] and [15] to facilitate more comprehensive query expansion, and by [6] to create knowledge graphs that encode the semantic relatedness between terms used in different types of sentences in literature abstracts relating to construction management. Given that construction ontologies alone are generally not suitable to support NER, other efforts have instead developed hand-crafted rules and/or exploited syntactic document structure for this purpose. In [1], shallow parsing is applied to construction contract documents to identify active and passive concepts, together with relations between them, while [36] enrich document-level accident classification with automatic extraction of accident causes from document titles, using syntactic patterns. The rule-based approach described in [37] combines syntactic information with semantic information from dictionaries and an ontology to extract concepts and relations from construction regulatory documents. In [15], an initial set of rules was manually created to identify four different types of contextual information about accidents. However, since writing rules is a labourintensive process, machine learning is used to augment the initial set. A more comprehensive system for automated analysis of the content of construction injury reports is reported in [31]; the recognition of seven injury types, five categories of injured body parts and nine energy sources within these reports is complemented by the identification of eighty distinct types of injury precursors, including equipment, materials, working environments and actions/activities. A
38
E. Inan et al.
complex combination of manually constructed dictionaries and rules is used to account for the variable ways in which these different types of information could be described in the text. The application of the system in conjunction with unsupervised data mining techniques was used to discover precursor combinations that most frequently contribute to construction injuries [32]. A drawback of [31] is that only a fixed number of concepts is recognised, and the method is difficult to extend; the variability of construction projects means that a static list is unlikely to cover all relevant concepts in all projects. In this respect, supervised NER methods are more flexible, since they learn how to recognise mentions of concepts that never occur in the training data. Recent work to create an annotated corpus of documents relating to construction safety [30] aims to stimulate the development of supervised NER tools for the construction domain. The corpus consists of a set of sentences taken from RIDDOR workplace accident reports, manually annotated with six different categories of concepts relating to safety in construction, covering hazards, consequences, mitigation strategies and project attributes. An overall interannotator agreement score of 0.79 F-score confirms the consistency of the annotated dataset, and hence its suitability for training high-performance machine learning tools. The automatic identification of relationships among NEs can be achieved by applying methods to extract knowledge triples from free-text documents; the extracted relationships can be used as the basis for generating KGs. An example of a triple extraction method is Stanford OpenIE [2], which applies natural logic inference to shorten self-contained clauses which are predicted via a classifier relying on a set of logically entailed shorter utterances by recursively traversing its dependency tree. Meanwhile, AllenNLP OpenIE is a reimplementation of a deep Bidirectional Long-Short Term Memory (BiLSTM) sequence prediction model [28] which defines the information extraction task as a sequence tagging problem and predicts the class of each token as either a relation or an argument. Other work has focused on the development of KGs for specific domains. For example, important information in cancer genetics articles includes details about the populations involved in the study, the hereditary cancer genes concerned and and the risk estimates associated with these genes. The problem is approached in [34] by firstly applying distant supervision to identify sentences describing the ascertainment of populations studied in the articles. Secondly, a transformer-based model jointly identifies text spans corresponding to genes and risk estimates, and predicts whether or not their is a relation between each gene-risk pair. Further methods of generating KGs aim to go beyond the extraction of factual triples whose nodes correspond to entities. Eventualities correspond to activities, states and events; capturing relations amongst eventualities thus goes a step further than extracting relations between entities. [35] build a large-scale eventuality graph that aims to encode entailment relations between eventualities (for example, “eat an apple” entails “eat fruit”). The graph includes 10 million eventuality nodes and 103 million entailment edges. Acknowledging that not all information
Knowledge Graph Enrichment
39
in text is factual, SAMPO [3] constructs knowledge graphs from user reviews in different domains, which can capture both opinions and their implications, via matrix factorisation techniques [17]. SAMPO initially extracts opinions, including modifier and aspect pairs, from a given review. Matrix factorisation is then employed to learn low-dimensional embeddings for each opinion, to facilitate the identification of implication relations. To infer implications between opinions, a nearest neighbour graph is constructed, based on the proximity between the opinion embeddings. The ability to link entities and relations that occur in free text to relevant concept entries in KBs can help practitioners to gain a deeper understanding of the information stated in the text. However, such linking can be problematic due to the ambiguity of entity mentions. As an example, consider the word saw which, in free text, may correspond, amongst other things, to the past tense of the verb see or to a noun describing a particular type of construction equipment. Several studies have therefore aimed to develop knowledge resources (e.g., FrameNet [10], which describes the semantic and distributional properties of English words) and methods that can effectively disambiguate entity mentions and relations, as well as link them to appropriate entities and relations in KBs. OpenTapioca [8] is an example of such a method, which aims to link entity mentions in a text to Wikidata entities. It employs an end-to-end approach, relying on topic similarities and local entity context. The local entity context considers the popularity of a candidate entity, computed by using its PageRank or site links. It also uses a unigram model to compute the probability of a mention occurring in a text as a semantic similarity score. It subsequently trains a Support Vector Machine (SVM) over the combined Markov chain of these features. FALCON 2.0 [26] is a rule-based entity-relation linking system, which means that there is no requirement for a large training dataset. The system takes as it input short texts, which undergo various types of linguistic processing to extract entities and predicates that are subsequently linked to Wikidata entries. Each extracted entity or predicate is queried in a background knowledge graph (BK), which is generated from Wikidata and consists of words/phrases with links to the Wikidata concepts that they could potentially represent; synonyms/aliases for the Wikidata concepts are included to enhance matching performance. Querying the BK retrieves a set of Wikidata concepts with similar labels to the words/phrases extracted from the text. These initial candidates are subsequently ranked by checking whether triples exist in the Wikidata KG that link together candidate entries for the extracted entities and predicates. Rules are also used to rank and filter the candidate concepts. For example, query headwords in the input text (such as “where”) can be used to determine that a triple element should be linked to a location concept. ImoJIE [16] is a Transformer-based encoder-decoder model for OpenIE (specifically employing BERT [9]) that leverages multi-head attention and a copying mechanism to automatically extract tuples for each input text. It boot-
40
E. Inan et al.
straps the training data developed for non-neural existing systems, which have been automatically filtered to reduce redundancy and noise.
3
Method
As mentioned above, a previously developed semantic search system constitutes the starting point for the work described in this article. The system provides a number of TM-driven functionalities that help users to filter and explore the content of RIDDOR workplace accident reports. These include: – Word Cloud - The most pertinent terms occurring in the documents returned by an initial keyword search are displayed as “bubbles” in a word cloud, in which the bubbles are sized according to their relative importance within the search results. – Automatic Descriptive Clustering - Search results are automatically clustered according to the similarity of topics covered within them; descriptive labels are generated to characterise the contents of each cluster, and users can filter their initial search results to focus of specific cluster(s) of interest. – Automatic Summarisation - For retrieved reports whose length is longer than 10 sentences, a summary is automatically produced to make it easier for users to review the most important information within these results. – Named Entities - Domain-specific categories of NEs are automatically identified within search results, and the most frequent of these are displayed within the interface. Search results can be filtered to retain only those documents containing entities of interest. The NE facets are particularly important, given that they can be very powerful in terms of their ability to help locate documents whose content is likely to match closely with a user’s requirements. Accurate recognition of domainspecific NEs, for example, makes it possible to filter documents that mention a specific combination of project attributes, and then to explore which types of accidents and/or harmful consequences are mentioned in these documents. The automatic generation and integration of a KG relevant to workplace accident reports can further complement the automatically extracted NEs, by allowing users to explore exactly what is being said about NEs of interest within the collection of accident reports. For example, rather than having to review all reports that mention a type of equipment, such as scaffolding, the KG provides a structured and systematic means for users to explore what is being said about scaffolding across the document collection, including links to other entities. Users can then focus on documents in which knowledge of interest is being expressed. NEs are recognised by training a layered neural model [14] that integrates a BiLSTM network with a Conditional Random Fields classifier to automatically detect and categorise the types of entities that are annotated in the construction safety corpus presented in [30]. The starting point for the construction of our
Knowledge Graph Enrichment
41
Fig. 1. Knowledge graph construction procedure.
KG is a corpus of approximately 3, 000 RIDDOR reports, which have been split into sentences, resulting in approximately 12, 000 sentences in total. Our main objectives when constructing the KG were the following: – To generate a KG whose triples refer to entities of interest within the construction safety domain. To achieve this, either the subject or object of each triple should contain a domain-specific NE that has been recognised by the neural model mentioned above. – To take into account that entities previously introduced in text may be subsequently referred to using pronouns (e.g., it, them etc.). Failure to handle such pronominal references could result in relevant knowledge contained within the reports being missed. Accordingly, we apply a Coreference Resolution (coref) method to detect mentions in text that refer to the same entity, by using the AllenNLP library4 . – To add value to the automatically extracted knowledge graph, by linking entity and relation mentions to Wikidata entries, wherever possible. As illustrated in Fig. 1, a total of five different steps carried out within two parallel processes allow the raw text to be transformed into a KG of information about workplace accidents. 3.1
Triple Extraction
The first step in the top process in Fig. 1 involves the application OpenIE to automatically extract knowledge triples from raw text. We choose to use the neural-based IMoJIE [16] tool for this purpose. While traditional rule-based approaches to OpenIE have generally tended to outperform neural approaches, IMoJIE has been proven to outperform previous approaches belonging to both families of methods.
4
https://demo.allennlp.org/coreference-resolution.
42
E. Inan et al.
3.2
Entity and Relation Linking
The elements of the SPO triples (i.e., subjects, objects and predicates) of the triples extracted from raw text by IMoJIE are subsequently automatically linked to concept entries in an existing KB, wherever possible. In the absence of a domain-specific KB for the construction domain, we have chosen to use Wikidata5 as the target knowledge base, since, as described above, it contains a broad range of information, including both concepts and relations that are relevant to safety in the construction industry. As mentioned above, issues of ambiguity can pose a major challenge when trying to link entity mentions in text to the most appropriate concept in large KBs with broad ranging coverage, since a given word or phrase in text could potentially refer to many concepts. This means that, in order to carry out automatic linking to the most appropriate KB concepts, it is necessary to go beyond simple string matching and to consider more complex approaches that aim to disambiguate the elements of triples. With this in mind, we have chosen to exploit the ELMo deep contextualised word representations [22] from the AllenNLP library6 to facilitate the linking process. ELMo representations consider the sentential context of words, hence enabling disambiguation. The ELMo model that we use has been pre-trained on a corpus of 800 million tokens from News Crawl data. For computational efficiency, we use the smallest model, which has 13.6 million parameters and a dimensionality of 256. We take advantage of the fact that most Wikidata items have an associated description7 , i.e., a short phrase designed to differentiate items with the same or similar labels. We essentially assume that the most appropriate KB concept to which a triple element (i.e., subject, object or predicate) should be linked is the one whose Wikidata description shares the most similar contextualised representation to the element’s representation in text. Specifically, we apply the ELMo model to each sentence of the RIDDOR corpus to obtain contextualised representations (embeddings) for each word in the sentence. Then, for each element of the triples extracted from the sentence, we construct a single representation. If a triple element consists of multiple words, then their representation is constructed as the average of the representations of the constituent words. In a similar manner, descriptions of entities and predicates in Wikidata are also encoded into a single embedding with the same dimensionality as the SPO representations, by applying ELMo to the corresponding description and keeping the last hidden state. Each subject, predicate and object in an extracted triple is then linked to the Wikidata object whose description phrase has the most similar representation. Given the huge size of Wikidata, it would be computationally inefficient to perform a comparison between the contextualised representation of each element 5 6 7
https://www.wikidata.org. https://allennlp.org/elmo. https://www.wikidata.org/wiki/Help:Description.
Knowledge Graph Enrichment
43
of extracted triples against the contextualised representations of the descriptions of all items in the KB. Accordingly, we select a maximum of eight candidates from Wikidata. The first three of these candidates are selected using a combination of the spaCy8 library and the English Wikipedia9 . Specifically, we estimate the prior probability of a certain Wikidata entity being a candidate for a textual mention in the RIDDOR corpus based on the intra-link Wikipedia count statistics. From these candidates, we retain the three with the highest probabilities, while the remaining five candidates are selected based on the similarity between the surface form of a Wikidata concept and the SPO element.
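A condensed sketch of the embedding-based comparison follows. It assumes a pre-2.0 allennlp release that still ships the ElmoEmbedder helper, and it simplifies the description encoding by averaging token vectors (the system described above keeps the last hidden state); the example description text is invented:

```python
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

# Defaults to the publicly hosted ELMo option/weight files; the system
# described above uses the small 256-dimensional model for efficiency.
elmo = ElmoEmbedder()

def embed_phrase(tokens):
    # embed_sentence returns (3 layers, n_tokens, dim); keep the top layer
    # and average over tokens, as done for multi-word SPO elements.
    layers = elmo.embed_sentence(tokens)
    return layers[-1].mean(axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

spo_vec = embed_phrase(["angle", "iron", "protrusion"])
# Invented stand-in for a Wikidata description of a candidate concept.
desc_vec = embed_phrase("length of metal with an L-shaped cross section".split())
print(cosine(spo_vec, desc_vec))
```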
Fig. 2. Snapshot from the Knowledge Graph View of a Sample Triple (The Knife; Stabbed; The Operative in the Left Thumb).
Since the goal of OpenIE systems is to extract all SPO triples from all sentences in a document, a large number of such triples are identified by IMoJIE when processing the RIDDOR corpus. However, as our focus is on information pertaining to safety in construction, we filter this original set of triples, assuming that those encoding knowledge relevant to construction safety will mention an NE recognised by our domain-specific NER model in either their subject or object. In order to retain only triples that mention NEs of interest, a parallel process applies the same NER model that is already used in the existing system to recognise entities. Subsequently, we use the SequenceMatcher class from the difflib library (https://docs.python.org/3/library/difflib.html) to find SPO triples whose subject or object contains an automatically recognised NE, by finding the longest common substring between an NE and the subject/object of each triple; a minimal sketch follows.
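The filtering step itself reduces to a longest-common-substring test. A minimal sketch follows, in which the minimum match length is our own illustrative threshold rather than a value taken from the system:

```python
# Minimal sketch of the NE-based filter: keep a triple if its subject or
# object shares a sufficiently long common substring with a recognised NE.
from difflib import SequenceMatcher

def longest_common(a: str, b: str) -> int:
    m = SequenceMatcher(None, a, b)
    return m.find_longest_match(0, len(a), 0, len(b)).size

def mentions_entity(triple, entities, min_len=4):
    subj, _, obj = triple
    return any(
        longest_common(part.lower(), ne.lower()) >= min_len
        for part in (subj, obj)
        for ne in entities
    )

triple = ("the knife", "stabbed", "the operative in the left thumb")
print(mentions_entity(triple, ["left thumb", "knife"]))  # True
```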
Fig. 3. Snapshot from the text view of a particular object, in this case thumb.
3.3 Integration of the Knowledge Graph in the Semantic Search Interface
After a query has been performed in the interface, both a tabular and a graphical view of the generated KG can be viewed by clicking on the “KG” button. Figure 2 illustrates the graph visualisation of a sample triple. The figure shows a triple whose subject is knife and whose predicate is stabbed. We can also see that a link has been established between the subject and an entry in Wikidata; when such a link has been found and an image is available in Wikidata, the image is displayed in the interface. Therefore, we see an image of some knives, and clicking on the Wikidata link will cause the corresponding Wikidata page to be displayed. The object contains the word operative, denoting that the operative was stabbed with a knife. Furthermore, this object node in the graph is colour-coded, following the convention shown at the top left of the screen, based on the type of NE that it contains. In this case, the object contains information about a body part injured by a knife. By clicking on the object node (as illustrated in Fig. 3), it is revealed that the injured body part was the left thumb.

On the List view screen, triples are displayed in tabular format. Results can be filtered by selecting specific relations or objects on the left-hand side of the screen. Furthermore, it is possible to filter triples by entering values into one or more of the subject, relation and object text boxes at the top of the screen. For example, as shown in Fig. 4, by entering stabbed as the relation and thumb as the object, we can view information about the same triple that is illustrated in the graph view in Figs. 2 and 3. We use Neo4j (https://neo4j.com/) as the graph store, and D3.js (https://d3js.org/) for the visualisation of the KG search process; a sketch of how triples might be loaded into the graph store is shown below.
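As an illustration of the storage step, the sketch below loads one SPO triple into Neo4j with the official Python driver (5.x API); the connection details, node labels and property names are assumptions made for the example:

```python
from neo4j import GraphDatabase

# Connection details are placeholders for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def add_triple(tx, subj, pred, obj):
    # MERGE keeps entity nodes unique, so repeated mentions share one node.
    tx.run(
        "MERGE (s:Entity {text: $subj}) "
        "MERGE (o:Entity {text: $obj}) "
        "MERGE (s)-[:RELATION {text: $pred}]->(o)",
        subj=subj, pred=pred, obj=obj,
    )

with driver.session() as session:
    session.execute_write(
        add_triple, "The knife", "stabbed", "the operative in the left thumb"
    )
driver.close()
```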
Fig. 4. Faceted triple search view.
4 Evaluation
Our evaluation of the generated KG consists of two subtasks, i.e., triple extraction and entity/relation linking. For the triple extraction task, we evaluate the extent to which triples that constitute domain-specific knowledge are extracted by our system. The entity/relation linking task evaluates the end-to-end performance of the system, i.e., its ability to recognise the different elements of relevant triples (subjects, objects and predicates) and to link them to entries in Wikidata. If a match can be found, we assign the Wikidata ID and URL to the corresponding triple element. The evaluation datasets for both tasks have been created by construction industry domain experts, who identified 500 triples and carried out entity/relation concept mappings for 454 nodes within the identified triples.

4.1 Evaluation Metrics
We compute the performance of the entity-relation linking and triple extraction tasks using standard Information Extraction (IE) metrics (i.e., Precision, Recall and F1) on the generated domain-specific dataset. Precision (P) refers to the proportion of elements assigned to a class by an automated method whose class label is correctly predicted. In our case, the classes in question are the subject, object and predicate elements for the triple extraction task, and correctly assigned Wikidata IDs for the entity/relation linking task.

P = \frac{\text{Correct items}}{\text{Total items returned by the system}}   (1)

Recall (R) represents the proportion of the expert-identified class elements (i.e., triple elements or links) that our methods have correctly identified.

R = \frac{\text{Correct items}}{\text{Total gold items}}   (2)

Finally, F1 is the harmonic mean of precision and recall.

F1 = \frac{2PR}{P + R}   (3)

Following [7], we report our results in terms of macro-averaged values.
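A direct transcription of Eqs. (1)–(3), with invented per-class counts used only to show the macro-averaging:

```python
# Eqs. (1)-(3); macro-averaging takes the mean of the per-class F1 scores.
def precision(correct: int, returned: int) -> float:
    return correct / returned if returned else 0.0

def recall(correct: int, gold: int) -> float:
    return correct / gold if gold else 0.0

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical per-class (correct, returned, gold) counts for Sub, Pred, Obj.
counts = {"sub": (87, 100, 95), "pred": (80, 100, 92), "obj": (78, 100, 90)}
scores = [f1(precision(c, ret), recall(c, g)) for c, ret, g in counts.values()]
print(round(sum(scores) / len(scores), 3))  # macro-averaged F1
```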
4.2 Results of Entity-Relation-Triple Linking and Triple Extraction
Table 1 reports on the ability of our system to recognise subjects (Sub), predicates (Pred), objects (Obj) and complete triples (Tri).

Table 1. Performance comparison of different open information extraction tools for subject (Sub), predicate (Pred), object (Obj) and triple (Tri) recognition.

Open IE tool            P      R      F1
Stanford OpenIE-Sub     0.593  0.507  0.547
Stanford OpenIE-Pred    0.243  0.208  0.224
Stanford OpenIE-Obj     0.465  0.398  0.429
Stanford OpenIE-Tri     0.138  0.265  0.182
AllenNLP OpenIE-Sub     0.692  0.683  0.687
AllenNLP OpenIE-Pred    0.455  0.449  0.452
AllenNLP OpenIE-Obj     0.314  0.310  0.312
AllenNLP OpenIE-Tri     0.180  0.408  0.250
IMoJIE-Sub              0.873  0.916  0.894
IMoJIE-Pred             0.866  0.909  0.887
IMoJIE-Obj              0.847  0.888  0.867
IMoJIE-Tri              0.634  0.662  0.648
As illustrated in Table 1, the AllenNLP OpenIE system exhibits superior performance to Stanford OpenIE in the subject, predicate and triple categories. However, Stanford OpenIE performs better when extracting objects, since it tends to predict objects with shorter span lengths, which correspond more closely to the types of objects in our evaluation dataset. Furthermore, IMoJIE significantly outperforms the other two systems, through its use of a copy-attention mechanism that produces the next tuple conditioned on all previously extracted tuples.

In Table 2, we compare the results of our entity-relation linking method with those obtained using Falcon 2.0 [26]. We have selected this system as a fair means of comparison, since it is linguistically oriented and also based on Wikidata. Following the experimental setup described in [18], Table 2 illustrates the results obtained by our method and Falcon 2.0 for different tasks: “Entity only” considers recognition and linking performance for subjects and objects within triples;
“Relation only” considers recognition and linking of predicates, while “Triple” considers the extent to which all elements of triples (i.e., subject, object and predicate) are correctly recognised and linked. Since Falcon 2.0 is not specifically geared to extracting SPO triples from text, the “Triple” results are reported only for our method. As Falcon 2.0 is designed to focus primarily on short text questions, we split long texts in the evaluation set into short sentences prior to performing the linking process.

Table 2. Performance comparison for entity, relation and triple linking.

Method           Task           P      R      F1
FALCON 2.0 [26]  Entity only    0.491  0.271  0.349
Ours             Entity only    0.875  0.365  0.515
FALCON 2.0 [26]  Relation only  0.659  0.474  0.551
Ours             Relation only  0.914  0.561  0.696
Ours             Triple         0.826  0.250  0.384
Both Falcon 2.0 and our system exhibit their best performance on the “Relation only” task. However, our system outperforms Falcon 2.0 for both entities and relations, thus demonstrating that our embedding-based joint triple disambiguation approach is better suited to this domain-specific task. In particular, the levels of precision achieved by our method are considerably higher than those achieved by Falcon 2.0: if our method is able to correctly identify and link one or more elements of a triple, then it is highly likely that these predictions will be correct.

4.3 Qualitative Error Analysis
Table 3 shows some example outputs from our chosen OpenIE method, along with those obtained from the other OpenIE methods that we compare. Both Stanford OpenIE and IMoJIE predict the correct results for the triple in the first sentence. However, the fact that the predicate consists of more than one token, i.e., stumbled over, causes problems for the AllenNLP OpenIE system, which also predicts the object phrase incorrectly. The second sentence illustrates the difficulty of extracting triples from complex sentences. The IMoJIE system performs best here: it identifies the subject, the predicate and the object correctly. In contrast, the Stanford OpenIE system only partially identifies each element of the triple, while AllenNLP OpenIE extracts completely the wrong information.
Table 3. Qualitative error analysis between different open information extraction systems.

Text:             “... the ip stumbled over an angle iron protrusion on the river bank causing him to fall”
Correct triple:   (ip; stumbled over; angle iron protrusion)
Stanford OpenIE:  (ip; stumbled over; angle iron protrusion)
AllenNLP OpenIE:  (the ip; stumbled; on the river bank causing him to fall)
IMoJIE:           (ip; stumbled over; angle iron protrusion)

Text:             “the ip then moved the scaffold across around 4 in. by skating instead of exiting the scaffold the correct way to move it into the correct position.”
Correct triple:   (the ip; moved; the scaffold across around 4 in. by’ skating then)
Stanford OpenIE:  (ip; moved scaffold across; around 4 in.)
AllenNLP OpenIE:  (the scaffold; exiting; the correct way)
IMoJIE:           (the ip; moved; the scaffold across around 4 in. by’ skating then)
Example outputs from both FALCON 2.0 and our method for the entity and relation linking tasks are depicted in Table 4. The word “fall” in the first example occurs within the following sentence: “The operative had just started plastering a ceiling when his trowel clipped an angle bead causing plaster to fall from the trowel into his left eye”. Both FALCON 2.0 and our own method achieve the correct mapping for this example. For the second example, “... had positioned his ladder to the right hand side of the first floor bedroom window...”, both systems correctly map the entity ladder to the concept ID Q168639. The third example concerns the entity right hand in the sentence “... causing the ip to extend his right hand in order to try to arrest his fall”. The expert-assigned mapping here is to the body part right hand, which has the Wikidata concept ID Q24206677. Although FALCON 2.0 partially detects the entity correctly, the Wikidata mapping is wrong, since Q1366301 refers to a mnemonic used in physics and mathematics. In contrast, our method links right hand to the correct Wikidata ID. The final two rows in Table 4 concern entities in the sentence “He was standing on the edge of the pavement when his foot slipped off the kerb and he twisted his ankle which caused him to fall into the road”. While our system correctly links the mention of ankle to the body part Q168002, FALCON 2.0 erroneously links the mention to the concept Q61831605, which corresponds to a scholarly article entitled “Ankle” in a manual about sports injuries. Furthermore, instead of mapping the mention of road to the correct Wikidata concept Q34442, which our approach achieves successfully, FALCON 2.0 instead identifies the phrase into the road and links it to the Wikidata property P1408, i.e., licensed to broadcast to.
Table 4. Qualitative results for different entity and relation linking methods. Wikidata IDs are included in parentheses.

Method       Correct mapping         Predicted mapping
FALCON 2.0   Fall (Q11868838)        Fall (Q11868838)
Ours         Fall (Q11868838)        Fall (Q11868838)
FALCON 2.0   Ladder (Q168639)        Ladder (Q168639)
Ours         Ladder (Q168639)        Ladder (Q168639)
FALCON 2.0   Right hand (Q24206677)  Right-hand rule (Q1366301)
Ours         Right hand (Q24206677)  Right hand (Q24206677)
FALCON 2.0   Ankle (Q168002)         Ankle (Q61831605)
Ours         Ankle (Q168002)         Ankle (Q168002)
FALCON 2.0   Road (Q34442)           ... into the road (P1408)
Ours         Road (Q34442)           Road (Q34442)
5 Conclusion
In this article, we have described the generation and integration of a KG within an existing semantic search system, as a means to improve the interactivity and efficiency of the system in allowing users to discover and explore information relating to workplace accidents. The KG structures relevant knowledge contained within workplace accident reports by automatically extracting SPO triples that mention NEs of interest relating to construction safety. Furthermore, the KG is enriched through links to corresponding concepts in the Wikidata KB, in which further structured information about entities and relations of interest can often be found.

Complementing the existing functionalities provided through the application of TM methods (i.e., automatic recognition of terms and NEs, dynamic descriptive clustering and automatic summarisation), the structured knowledge provided in the KG (together with the querying methods provided) goes a step further towards (semi-)automating the process of constructing structured risk registers from knowledge that is hidden within the vast volumes of textual data typically generated for construction projects. We have evaluated both the triple extraction (0.648 F1 with IMoJIE) and entity/relation linking (0.515 and 0.696 F1 for entity only and relation only, respectively) aspects of our approach, and have shown that the methods we employ outperform other related approaches when applied to our dataset.

As future work, we plan to augment the system to allow search over a wider range of document types generated during the construction project lifecycle, including workplace inspection reports, reports of prosecutions, and health and safety guidelines. The varying nature of these reports may also require the recognition of additional NE types and relations between them, to allow full use to be made of their content. Hence, we also intend to extend the size of the annotated dataset.
References

1. Al Qady, M., Kandil, A.: Concept relation extraction from construction documents using natural language processing. J. Construct. Eng. Manag. 136(3), 294–302 (2009)
2. Angeli, G., Premkumar, M.J.J., Manning, C.D.: Leveraging linguistic structure for open domain information extraction. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 344–354 (2015)
3. Bhutani, N., et al.: Sampo: unsupervised knowledge base construction for opinions and implications. In: Das, D., Hajishirzi, H., McCallum, A., Singh, S. (eds.) Conference on Automated Knowledge Base Construction, AKBC 2020, Virtual, 22–24 June 2020 (2020)
4. Bilal, M., et al.: Big data in the construction industry: a review of present status, opportunities, and future trends. Adv. Eng. Inf. 30(3), 500–521 (2016)
5. BSI: PAS 1192-6:2018: Specification for collaborative sharing and use of structured health and safety information using BIM (2018)
6. Chen, H., Luo, X.: An automatic literature knowledge graph and reasoning network modeling framework based on ontology and natural language processing. Adv. Eng. Inf. 42, 100959 (2019)
7. Cornolti, M., Ferragina, P., Ciaramita, M.: A framework for benchmarking entity-annotation systems. In: Proceedings of the International World Wide Web Conference (WWW) (Practice and Experience Track) (2013)
8. Delpeuch, A.: OpenTapioca: lightweight entity linking for Wikidata. arXiv preprint arXiv:1904.09131 (2019)
9. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019, pp. 4171–4186. Association for Computational Linguistics (2019)
10. Fillmore, C.J., Lee-Goldman, R., Rhodes, R.: The FrameNet constructicon. Sign-Based Construction Grammar 193, 309–372 (2012)
11. Gao, G., Liu, Y.-S., Lin, P., Wang, M., Ming, G., Yong, J.-H.: BIMTag: concept-based automatic semantic annotation of online BIM product resources. Adv. Eng. Inf. 31, 48–61 (2017)
12. Hallowell, M.R.: Safety-knowledge management in American construction organizations. J. Manag. Eng. 28(2), 203–211 (2012)
13. ISO: Industry foundation classes (IFC) for data sharing in the construction and facility management industries (2013)
14. Ju, M., Miwa, M., Ananiadou, S.: A neural layered model for nested named entity recognition. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1446–1459 (2018)
15. Kim, T., Chi, S.: Accident case retrieval and analyses: using natural language processing in the construction industry. J. Construct. Eng. Manag. 145(3), 04019004 (2019)
16. Kolluru, K., Aggarwal, S., Rathore, V., Chakrabarti, S., et al.: IMoJIE: iterative memory-based joint open information extraction. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5871–5886 (2020)
17. Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009)
18. Martinez-Rodriguez, J.L., López-Arévalo, I., Rios-Alvarado, A.B.: OpenIE-based approach for knowledge graph construction from text. Exp. Syst. Appl. 113, 339–355 (2018)
19. Martínez-Rojas, M., Marín, N., Miranda, M.A.V.: An intelligent system for the acquisition and management of information from bill of quantities in building projects. Exp. Syst. Appl. 63, 284–294 (2016)
20. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
21. Miller, G.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
22. Peters, M.E., et al.: Deep contextualized word representations. In: Proceedings of NAACL (2018)
23. Przybyla, P., et al.: Prioritising references for systematic reviews with RobotAnalyst: a user study. Res. Synth. Methods 9(3), 470–488 (2018)
24. Qazi, A., Quigley, J., Dickson, A., Kirytopoulos, K.: Project complexity and risk management (ProCRiM): towards modelling project complexity driven risk paths in construction projects. Int. J. Project Manag. 34(7), 1183–1198 (2016)
25. Sacks, R., Rozenfeld, O., Rosenfeld, Y.: Spatial and temporal exposure to safety hazards in construction. J. Construct. Eng. Manag. 135(8), 726–736 (2009)
26. Sakor, A., Singh, K., Patel, A., Vidal, M.-E.: Falcon 2.0: an entity and relation linking tool over Wikidata. In: Proceedings of the 29th ACM International Conference on Information and Knowledge Management, pp. 3141–3148 (2020)
27. Soto, A., Przybyla, P., Ananiadou, S.: Thalia: semantic search engine for biomedical abstracts. Bioinformatics 35(10), 1799–1801 (2018)
28. Stanovsky, G., Michael, J., Zettlemoyer, L., Dagan, I.: Supervised open information extraction. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 885–895 (2018)
29. Thompson, P., et al.: Text mining the history of medicine. PLOS ONE 11(1), e0144717 (2016)
30. Thompson, P., Yates, T., Inan, E., Ananiadou, S.: Semantic annotation for improved safety in construction work. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 1990–1999 (2020)
31. Tixier, A.J.-P., Hallowell, M.R., Rajagopalan, B., Bowman, D.: Automated content analysis for construction safety: a natural language processing system to extract precursors and outcomes from unstructured injury reports. Automat. Construct. 62, 45–56 (2016)
32. Tixier, A.J.-P., Hallowell, M.R., Rajagopalan, B., Bowman, D.: Construction safety clash detection: identifying safety incompatibilities among fundamental attributes using data mining. Automat. Construct. 74, 39–54 (2017)
33. Tixier, A.J.-P., Vazirgiannis, M., Hallowell, M.R.: Word embeddings for the construction domain. arXiv preprint arXiv:1610.09333 (2016)
34. Wadhwa, S., Yin, K., Hughes, K.S., Wallace, B.C.: Semi-automating knowledge base construction for cancer genetics. In: Das, D., Hajishirzi, H., McCallum, A., Singh, S. (eds.) Conference on Automated Knowledge Base Construction, AKBC 2020, Virtual, 22–24 June 2020 (2020)
35. Yu, C., Zhang, H., Song, Y., Ng, W., Shang, L.: Enriching large-scale eventuality knowledge graph with entailment relations. In: Das, D., Hajishirzi, H., McCallum, A., Singh, S. (eds.) Conference on Automated Knowledge Base Construction, AKBC 2020, Virtual, 22–24 June 2020 (2020)
36. Zhang, F., Fleyeh, H., Wang, X., Minghui, L.: Construction site accident analysis using text mining and natural language processing techniques. Automat. Construct. 99, 238–248 (2019)
37. Zhang, J., El-Gohary, N.M.: Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking. J. Comput. Civil Eng. 30(2), 04015014 (2013)
38. Zou, Y., Kiviniemi, A., Jones, S.W.: Retrieving similar cases for construction project risk management using natural language processing techniques. Automat. Construct. 80, 66–76 (2017)
Leboh: An Android Mobile Application for Waste Classification Using TensorFlow Lite

Teny Handhayani and Janson Hendryli

Universitas Tarumanagara, Jl. S. Parman No. 1 Gedung R Lantai 11, Jakarta 11440, Indonesia
{tenyh,jansonh}@fti.untar.ac.id

Abstract. Waste is an important part of human life and can become a serious issue for health and the environment if it is not dealt with properly. In this paper, the authors develop an Android mobile application for waste classification using the EfficientNet-Lite model from TensorFlow Lite. The model is trained and validated using a dataset containing 15,190 images from 11 classes: cardboard, paper, glass, metal, electronics, battery, plastic, textile, shoes, organic, and trash. The model is evaluated using 655 images from the testing dataset and produces an accuracy of 95.39%. The model training and validation are done in Google Colab (Python). The model is then used as a classifier for an Android application. The application is named Leboh and is developed for Indonesian speakers. User testing of the Android application obtains an accuracy of 82.5%. Based on user testing, the quantity of plastic waste is higher than that of other types. EfficientNet-Lite works well to classify municipal solid waste and runs fast on mobile devices.

Keywords: Waste · EfficientNet-Lite · TensorFlow Lite · Android · Mobile

1 Introduction
Waste is unwanted material that is rejected as useless, unneeded or excess to requirements [1]. Waste is a global issue that poses a threat to public health and the environment if it is not dealt with properly. The main types of waste are municipal solid waste (MSW), hazardous waste (HW), e-waste and other waste. MSW is grouped into organic material, textiles, metals, glass, plastics, paper and others.

Waste management is one of the problems in society. Some countries have well-organized waste management systems. For instance, in England, it is easy to find rubbish bins for different kinds of waste, and people usually sort their waste and then put it in the proper bins. In other countries, a waste management system is not the main concern and becomes one of the serious problems. Sorting waste is still rarely done by people because of some factors: minimum information about the waste management
systems and a lack of specific bins, especially in public areas. Figure 1 shows examples of good and bad waste collection. It is a good idea to sort waste before sending it to the waste collectors. Improper waste collection might cause problems, e.g., bad smells, infectious diseases, land and water pollution, obstruction of drains, and loss of biodiversity.
Fig. 1. Examples of good and bad waste collection
Several classifiers have been implemented to develop systems for waste classification. Support Vector Machines (SVM) and Convolutional Neural Networks (ConvNet) were used to classify waste into 6 classes: glass, paper, metal, plastic, cardboard, and trash [2]. The experimental results using SVM with scale-invariant feature transform (SIFT) features and ConvNet produced accuracies of 63% and 22%, respectively. SpotGarbage implements a ConvNet to automatically detect and locate waste in an image [3]. A study in waste classification shows that a ConvNet produced the highest accuracy, around 89.81%, over Random Forest, eXtreme Gradient Boosting, and k-Nearest Neighbour [4]. Moreover, ConvNets have been implemented to classify plastic waste [5,6]. CompostNet is a system to identify compost and recyclable material from an image [7]. DNN-TC is a framework for waste classification using Deep Neural Networks, and it produced an accuracy of up to 98% [8]. ConvNets and region-based Convolutional Neural Networks have been successfully used to detect the size of electronic waste [9]. A multilayer hybrid convolutional neural network (MLH-CNN) has been proposed to classify the TrashNet dataset and produced an accuracy of 92.6% [10].

The advantages of implementing deep learning on mobile devices are saving bandwidth and cloud computing cost, quick response time, and improved data privacy [11]. Android supports machine learning tools, e.g., ML Kit [12] and the TF Lite Model Maker [13]. Android Studio provides features to integrate these models into apps, which makes development easy for Android developers. Mobile applications have implemented deep learning for various objectives [14,15].

The problems of bad waste collection and high smartphone usage in society encourage the authors to develop a mobile application to classify waste. The research question in this paper is how to classify municipal solid waste on a mobile
device? The authors propose the EfficientNet-Lite model as the classifier on a mobile device. In this paper, the authors introduce an Android mobile application to classify municipal solid waste. The waste is classified into 11 classes (cardboard, paper, glass, metal, electronics, battery, plastic, textiles, shoes, organic, and trash). The contribution of this paper is an Android mobile application for waste classification that is friendly for Indonesian speakers. This paper contains sections covering the introduction, literature review, methods, experimental results and discussions, and conclusions, which are detailed in the following.
2 Literature Review
Convolutional Neural Networks (ConvNets) are designed to process datasets in the form of multiple arrays [16]. A colour image consisting of three 2D arrays of pixel intensities is an example of such a dataset. The architecture of a typical ConvNet is arranged as a series of stages. There are two types of layers in the first few stages: convolutional layers and pooling layers. Feature maps organize the units in a convolutional layer, where each unit is connected to local patches in the feature maps of the previous layer through a set of weights, called a filter bank. A nonlinear function, e.g., ReLU, is applied to the resulting local weighted sum. The same filter bank is shared by all units in a feature map, while different feature maps in a layer use different filter banks. There are two main reasons for this architecture: local groups of values are likely to be correlated, forming distinctive local motifs that are easy to recognize, and the local statistics of images and other signals are invariant to location. The convolutional layer detects local conjunctions of features from the previous layer, and the pooling layer merges semantically similar features into one. In one or a few feature maps, a typical pooling unit computes the maximum of a local patch of units. To reduce the dimensionality of the representation and create invariance to small shifts and distortions, neighbouring pooling units take input from patches that are shifted by more than one row or column. Several stages of convolution, non-linearity and pooling are stacked, followed by further convolutional and fully connected layers.

Numerous network architectures for deep networks have been developed for several purposes, mostly to process image datasets, although they are not restricted to such tasks. Keras provides functions for several network architectures, so they are easy to implement. Keras is a deep learning API written in Python, running on the TensorFlow platform [17]. Some deep network architectures are explained as follows.

Inception is a network architecture developed by Google for image classification and detection [18]. Inception increases the use of computational resources in the network. Inception finds the optimal local construction and repeats it spatially. This approach significantly improves quality with a lower computational requirement compared to shallower and less wide networks. Inception-V2 [19] and Inception-V3 [20] are later versions of Inception.
VGG is a network architecture developed by adding more convolutional layers with tiny 3 × 3 convolution filters in all layers [21], creating a more accurate ConvNet architecture. The training input for this network is RGB images of size 224 × 224. An image is passed through a stack of convolutional layers with 3 × 3 filters, followed by three fully connected layers. The first and second fully connected layers have 4096 channels each, while the third performs 1000-way ILSVRC classification and contains 1000 channels. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in large-scale object recognition [22]. The configuration of the fully connected layers is the same in all networks, and the last layer is the softmax layer. All hidden layers are equipped with ReLU non-linearity. VGG16 and VGG19 are versions of VGG.

A residual learning framework was proposed to ease the training of deep networks, since deeper neural networks are difficult to train [23]. It explicitly reformulates the layers as learning residual functions with reference to the layer inputs, and provides evidence that residual networks are easier to optimize and gain accuracy from increased depth. In the plain network, convolutional layers mostly have 3 × 3 filters. The network has a global pooling layer and a 1000-way fully connected layer with softmax; overall, there are 34 weighted layers. Shortcut connections are then inserted, turning the network into its counterpart residual version. A new residual unit was later developed to ease training and improve generalization [24]. Some ResNet functions in Keras are ResNet50V2, ResNet101V2, and ResNet152V2 [17].

Extreme Inception (Xception) is a convolutional neural network architecture based entirely on depthwise separable convolution layers [25]. Xception has 36 convolutional layers structured into 14 modules, all of which have linear residual connections around them, except for the first and last modules. These layers form the feature extraction base of the network.

The Dense Convolutional Network (DenseNet) is a network architecture that connects all layers directly with each other in a feed-forward fashion to ensure maximum information flow between layers [26]. In each layer, the feature maps from all previous layers are used as input for subsequent layers. The features are not combined through summation before being passed into a layer; instead, they are combined by concatenation. DenseNet has L(L+1)/2 direct connections. Versions of DenseNet of different depths are DenseNet121, DenseNet169, and DenseNet201 [17].

MobileNets is a network architecture based on depthwise separable convolutions to build lightweight deep neural networks [27]. MobileNetV2 contains an initial fully convolutional layer with 32 filters, followed by 19 residual bottleneck layers [28]. MobileNetV2 uses ReLU6 as the non-linearity and a kernel size of 3 × 3, and utilizes dropout and batch normalization during training. MobileNetV2 is a highly efficient model for mobile applications.

NasNet uses the Neural Architecture Search (NAS) framework as a search method to find good convolutional architectures on a dataset [29]. The overall architectures of the convolutional nets are manually predetermined, and they
consist of convolutional cells repeated many times. Each convolutional cell has the same architecture but different weights. NasNet uses two types of convolutional cells, Normal Cells and Reduction Cells, which serve two main functions when taking in a feature map: a Normal Cell returns a feature map of the same dimension, while a Reduction Cell returns a feature map whose size is reduced by a factor of two. The two versions of NasNet are NasNetLarge and NasNetMobile [17].

EfficientNets is a neural network architecture that implements a new scaling method to scale all dimensions of depth, width, and resolution [30]. The baseline network is developed by utilizing a multi-objective neural architecture search that optimizes accuracy and FLOPS. The scaling method searches once on the small baseline network and then uses the same scaling coefficients for all other models. A mobile-size EfficientNet model is accurate and effective with an order of magnitude fewer parameters and FLOPS. EfficientNet-Lite is a set of image classification models for mobile and IoT [31].
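Several of the architectures above (Xception, MobileNets, EfficientNet-Lite) rely on depthwise separable convolutions to reduce model size. A toy Keras comparison of parameter counts, included only to illustrate the idea:

```python
# Compare the parameter count of a standard convolution with that of a
# depthwise separable convolution over the same input and filter shape.
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 3))
standard = tf.keras.layers.Conv2D(64, 3, padding="same")(inputs)
separable = tf.keras.layers.SeparableConv2D(64, 3, padding="same")(inputs)

m1 = tf.keras.Model(inputs, standard)
m2 = tf.keras.Model(inputs, separable)
print("standard conv params: ", m1.count_params())   # 1792
print("separable conv params:", m2.count_params())   # 283
```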
3 Method
The development of the application in this paper consists of two main stages: custom model training and mobile application development (see the flowchart in Fig. 2). The authors implement EfficientNet-Lite0, a model from TensorFlow Lite, for classification. The advantages of using the TensorFlow Lite Model Maker are efficient code and ease of implementation. In the first stage, a model is trained using the waste image dataset. The training process is run in Google Colab (Python). After the training phase, the model is saved as a .tflite file. This model is then used as the classifier in the mobile application.
Fig. 2. A flowchart of developing a mobile application
TensorFlow Lite is a set of tools for on-device machine learning. TensorFlow Lite provides facilities for developers to run their models on mobile, embedded and IoT (Internet of Things) devices [13]. The two main steps of the development
workflow are generating a TensorFlow Lite model and running inference. A TensorFlow Lite model is stored in an efficient portable format based on FlatBuffers (an efficient cross-platform serialization library) and is identified by the .tflite file extension. The advantages of the TensorFlow Lite model format are a reduced size and faster inference on devices with limited compute and memory resources. A TensorFlow Lite model can be generated in three ways: using an existing TensorFlow Lite model, creating a custom model using the TensorFlow Lite Model Maker, or converting a TensorFlow model into a TensorFlow Lite model using the TensorFlow Lite converter. The TensorFlow Lite Model Maker is a library that simplifies the process of training a TensorFlow Lite model on a particular dataset. At the time of writing, TensorFlow Lite image classification supports EfficientNet-Lite and MobileNetV2. In this paper, the authors build a model for waste classification using the TensorFlow Lite Model Maker; a sketch of this stage is shown below.

In the second stage, the .tflite file is integrated into an Android application. The application is developed in Android Studio with Kotlin. The reasons for using Kotlin are that it is lightweight, compiles faster, and prevents applications from increasing in size.
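A minimal sketch of the first stage with the TensorFlow Lite Model Maker, assuming the waste images are organised in one sub-folder per class as described in Sect. 4.1 (the folder name is a placeholder):

```python
# Sketch of training and exporting the EfficientNet-Lite0 classifier with
# the TensorFlow Lite Model Maker.
from tflite_model_maker import image_classifier
from tflite_model_maker.image_classifier import DataLoader

data = DataLoader.from_folder("waste_images/")  # 11 class sub-folders
train_data, validation_data = data.split(0.8)   # 80/20 split, as in Sect. 4.1

model = image_classifier.create(
    train_data,
    model_spec="efficientnet_lite0",
    validation_data=validation_data,
    epochs=5,
)
model.evaluate(validation_data)
model.export(export_dir=".")  # writes model.tflite for the Android app
```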
4 Experimental Results and Discussions

4.1 Data
This paper uses the dataset from https://www.kaggle.com/mostafaabla/garbage-classification. The original dataset consists of 12 folders: battery, biological, cardboard, clothes, green-glass, brown-glass, white-glass, metal, paper, plastic, shoes, and trash. The green-glass, brown-glass, and white-glass folders are reorganized into a single class named ‘glass’ (a sketch of this step is shown below). The authors also add their own collection of electronic waste images to the dataset, and use ‘textile’ to refer to clothes in general terms. Overall, there are 15,190 images in 11 classes: paper (1000), metal (719), textile (5275), plastic (854), organic (935), shoes (1927), cardboard (841), battery (895), glass (1861), electronic (236), and trash (647). The dataset is divided into 80% training and 20% validation. The authors use another 655 images from the 11 classes to evaluate the model.
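The folder merging mentioned above can be done with a few lines of Python; the sketch below assumes the Kaggle folder layout, and the file-renaming scheme is our own:

```python
# Merge the three glass folders into a single 'glass' class folder.
import shutil
from pathlib import Path

root = Path("garbage_classification")
glass = root / "glass"
glass.mkdir(exist_ok=True)

for colour in ("green-glass", "brown-glass", "white-glass"):
    src = root / colour
    for img in src.glob("*"):
        # Prefix with the source folder name to avoid file-name collisions.
        shutil.move(str(img), glass / f"{colour}_{img.name}")
    src.rmdir()  # folder is empty after all images have been moved
```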
4.2 Experimental Results
The experimental results of EfficientNet-Lite0 are compared to those of MobileNetV2, InceptionV3, NASNetMobile, VGG16, ResNet50V2, Xception, and DenseNet121. The authors use EfficientNet-Lite0 and MobileNetV2 from TensorFlow Lite; the other models are from TensorFlow. The training and validation of the models use the training and validation datasets, and the models are evaluated using the testing dataset. The experiment for each model runs for 5 epochs. After 5 epochs, all models produce an accuracy of more than 80%. EfficientNet-Lite0 and MobileNetV2 produce the highest training and validation accuracy among all models. Based on the experimental results, MobileNetV2 runs faster than
EfficientNet-Lite0 and the other models. Both models also have training and validation losses of around 0.6–0.8. NasNetMobile has the highest training and validation loss values. The models are evaluated using 655 images from the testing dataset. Table 1 shows the experimental results of the models on the testing dataset. The top three highest accuracies are produced by EfficientNet-Lite0 (95%), InceptionV3 (90%) and MobileNetV2 (87%). Meanwhile, VGG16 produces the worst accuracy, around 61%. The other models obtain accuracies of around 77%–79%.

Table 1. The experimental results of the models using a testing dataset

Architecture        Accuracy  Time
EfficientNet-Lite0  95.39%    198
MobileNetV2         87.18%    65
InceptionV3         90.91%    113
NASNetMobile        77.56%    52
VGG16               61.07%    329
ResNet50V2          79.39%    340
Xception            79.24%    138
DenseNet121         78.78%    336
Fig. 3. Running time of 5 epochs
Figure 3 shows a comparison of the running time of all models. After the first epoch, EfficientNet-Lite0, MobileNetV2, VGG and ResNet50V2 run faster than the others. These results are consistent with the Keras documentation, where the time (ms) per inference step (GPU) for MobileNetV2, VGG16, ResNet50V2, and EfficientNetB0 is around 3–5 ms, and more than 5–7 ms for the other models [17]. It is not surprising that EfficientNet-Lite0 and MobileNetV2 from TensorFlow Lite, which are designed for mobile devices, run faster than the other models.
Fig. 4. Training accuracy and training loss
Figure 4 shows the training accuracy and loss for 5 epochs. For all models, there is no significant difference in accuracy from epoch 1 to 5. The loss values of EfficientNet-Lite0, MobileNetV2 and VGG16 drop only slightly from epoch 1. In contrast, the loss values of ResNet50V2, Xception, and DenseNet121 decrease significantly from epoch 1. Figure 5 shows the validation accuracy and validation loss. EfficientNet-Lite0, MobileNetV2, InceptionV3, ResNet50V2, Xception, and DenseNet121 produce accuracies higher than their loss values. Based on the training, validation and evaluation results, EfficientNet-Lite0 slightly outperforms MobileNetV2 and shows better performance than the other methods. The advantages of implementing a model from TensorFlow Lite are that it is easy to integrate with Android Studio, needs only simple code, and its size fits mobile applications.
Fig. 5. Validation accuracy and validation loss
Comparing the results in this paper to previous research [2,4,10], EfficientNet-Lite0 obtains a higher accuracy. However, its accuracy is slightly lower than that reported in [8].

4.3 Discussions
This mobile application is named Leboh and is designed for Indonesian speakers. Figure 6 shows the interface of the mobile application. There are two buttons: KAMERA (camera) and KELOMPOKKAN (classify). First, the user takes an image by pressing the KAMERA button and then presses the KELOMPOKKAN button to classify the image. The class label is displayed under the image. In this example, Leboh classifies shredded coconut as organic waste.
Fig. 6. The interface of the mobile application
The user testing involves 10 testers aged 20–40 years. The testers are instructed to capture and classify the waste around them using this application. The testers are allowed to capture any kind of solid waste around them, or any objects that they consider to be solid waste, and report the screenshots of the application, the name of each object, and its material. This testing is to make sure that the application works well in classifying various garbage in the environment, not restricted to the waste in the datasets.

Figure 7 shows screenshots of single types of waste that are classified correctly (in the form object/class): (i) battery/battery, (ii) remote control/electronic, (iii) aluminium lid/metal, (iv) cardboard/cardboard, (v) duster/textile, (vi) flip flops/shoes, (vii) glass/glass, (viii) plastic/plastic, (ix) paper/paper, (x) beauty cotton/trash, (xi) banana peel/organic, and (xii) mangosteen peel/organic. The users collect 200 images and the application produces an accuracy of 82.5%. The application classifies a single item of waste in an image accurately. If an image contains multiple types of waste, the application only detects one, because the application is designed to classify a single kind of waste. Figure 8 shows a screenshot of the application when it is used to classify an image containing multiple types of waste.

Figure 9 shows a confusion matrix of the user testing results. The major misclassifications happen among plastic, glass and metal. Under some conditions of lighting and image capture angle, objects made from metal, plastic, and glass look glossy, which makes them difficult to classify correctly. In the training dataset, blue surgical masks are labelled as trash, but the application classifies them as textile. This might be caused by the blue surgical masks having a texture that looks like textile. Surprisingly, the testers collect plastic waste more than other types. The plastic waste comprises plastic bottles and cups for beverages, plastic bags, cases of various personal care products (i.e., cosmetics and shower care), and cutlery from food
Fig. 7. Examples of user testing for a single type of waste

Fig. 8. An example of user testing for multiple types of waste
Fig. 9. Confusion matrix
delivery, and snack bags. Although this study works with limited participants, it indicates that the use of plastics in society is undeniable. The testers also submit other types of waste that do not exist in the dataset: styrofoam, rubber waste, and wood waste.

A tester walked for 20 min at around 11 AM on a pavement located in West Jakarta to capture and classify any solid waste using this application. The tester found more than 100 cigarette stubs spread on the street. This small study provides the important information that the number of smokers in this area is high and that the city council should put some cigarette bins on the street.

None of the testers complained about the memory usage of their devices. There are no reports of the application stopping, crashing, or running slowly, which indicates that the application works well and runs fast from the user's perspective. The users do not need an internet connection to run this application.

The main contribution of this paper is a mobile application for solid waste classification that is easy to use. Moreover, the contribution enriches the research on deep learning on mobile devices. The user testing helps the authors to enhance their understanding of the types of waste in society, which is important knowledge for developers to update the system to fit users' needs.
5 Conclusion
In summary, the EfficientNet-Lite model from TensorFlow Lite works well to classify municipal solid waste and is suitable for use as the classifier in a mobile application for waste classification. The model obtains training and validation accuracies of 96.84% and 95.33%, respectively. The evaluation using a testing dataset produces an accuracy of 95.39%. The model is used to develop a mobile application called Leboh. User testing produces an accuracy of 82.5%, and the application works accurately in classifying a single type of waste in an image. The advantages of developing the application using a TensorFlow Lite model are cost efficiency and fast computation on the device. This application is a native mobile application that does not need an internet connection. The user testing in this study yields important information: first, plastic waste is a prevalent municipal solid waste in society; second, cigarette bins are needed in public areas. For future work, the authors will develop Leboh 2.0 for waste detection based on object detection methods.

Acknowledgment. This paper was supported by research funding from Lembaga Penelitian dan Pengabdian kepada Masyarakat Universitas Tarumanagara, No 1689-Int-KLPPM/UNTAR/XI/2021.
References

1. Wilson, D.C., et al.: Global Waste Management Outlook. ISWA, Vienna (2015)
2. Yang, M., Thung, G.: Classification of trash for recyclability status. Technical report, pp. 1–6 (2016). https://cs229.stanford.edu/proj2016/report/ThungYangClassificationOfTrashForRecyclabilityStatus-report.pdf
3. Mittal, G., Yagnik, K.B., Garg, M., Krishnan, N.C.: SpotGarbage: smartphone app to detect garbage using deep learning. In: ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 940–945. ACM, Heidelberg (2016)
4. Satvilkar, M.: Image based trash classification using machine learning algorithms for recyclability status. Technical report, School of Computing, National College of Ireland (2018)
5. Kokoulin, A.N., Tur, A.I., Yuzhakov, A.A.: Convolutional neural networks application in plastic waste recognition and sorting. In: Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus), pp. 1094–1098. IEEE, Moscow and St. Petersburg (2018)
6. Bobulski, J., Kubanek, M.: Waste classification system using image processing and convolutional neural networks. In: Rojas, I., Joya, G., Catala, A. (eds.) IWANN 2019. LNCS, vol. 11507, pp. 350–361. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20518-8_30
7. Frost, S., Tor, B., Agrawal, R., Forbes, A.G.: CompostNet: an image classifier for meal waste. In: Proceedings of the IEEE Global Humanitarian Technology Conference (GHTC), Seattle, WA. IEEE (2019)
8. Vo, A.H., Son, L.H., Vo, M.T., Le, T.: A novel framework for trash classification using deep transfer learning. IEEE Access 7, 178631–178639 (2019). https://doi.org/10.1109/ACCESS.2019.2959033
9. Nowakowski, P., Pamula, T.: Application of deep learning object classifier to improve e-waste collection planning. Waste Manage. 109, 1–9 (2020). https://doi.org/10.1016/j.wasman.2020.04.041
10. Shi, C., Tan, C., Wang, T., Wang, L.: A waste classification method based on a multilayer hybrid convolution neural network. Appl. Sci. 11, 1–19 (2021). https://doi.org/10.3390/app11188572
11. Deng, Y.: Deep learning on mobile devices - a review (2019). https://arxiv.org/abs/1904.0927
12. Machine learning for mobile developers. https://developers.google.com/ml-kit
13. Abadi, M., et al.: TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems (2015). https://www.tensorflow.org/
14. Pudaruth, S., Mahomoodally, M.F., Kissoon, N., Chady, F.: MedicPlant: a mobile application for the recognition of medicinal plants from the Republic of Mauritius using deep learning in real-time. IAES Int. J. Artif. Intell. (IJ-AI) 10, 938–947 (2021). https://doi.org/10.11591/ijai.v10.i4.pp938-947
15. Nasir, H.M., Brahin, N.M.A., Aminuddin, M.M.M., Mispan, M.S., Zulkifli, M.F.: Android based application for visually impaired using deep learning approach. IAES Int. J. Artif. Intell. (IJ-AI) 10, 879–888 (2021). https://doi.org/10.11591/ijai.v10.i4.pp879-888
16. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015). https://www.nature.com/articles/nature14539
17. Keras. https://keras.io
18. Szegedy, C., et al.: Going deeper with convolutions. In: Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA. IEEE (2015)
19. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. Proc. Mach. Learn. Res. 37, 448–456 (2015)
20. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Conference on Computer Vision and Pattern Recognition, Las Vegas. IEEE (2016)
21. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2015). https://arxiv.org/abs/1409.1556
22. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
23. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Las Vegas (2016)
24. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
25. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Honolulu (2017)
26. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Honolulu (2017)
27. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications (2017). https://arxiv.org/abs/1704.04861
28. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520. IEEE, Salt Lake City (2018)
29. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. IEEE, Salt Lake City (2018)
30. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings of Machine Learning Research, pp. 6105–6114. PMLR (2019)
31. Wang, A.: EfficientNet-Lite (2021). https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/lite
Mobile Application for After School Pickup Solution: Malaysia Perspectives

Check-Yee Law1, Yong-Wee Sek2, Choo-Chuan Tay2, and Wei-Wei Goh1
1 Multimedia University, Melaka, Malaysia
{cylaw,wwgoh}@mmu.edu.my
2 Universiti Teknikal Malaysia Melaka, Melaka, Malaysia
{ywsek,tay}@utem.edu.my
Abstract. School pickup is a daily routine for most parents during the school term. The use of private vehicles as the daily commute to transport students to and from schools has caused school-related traffic congestion and air pollution problems. The after school waiting bay of most schools in urban areas is normally very crowded with students waiting for transport to travel home. Teachers on duty and school student council members are assigned to monitor and control the crowded situation. Some schools implement a drive-thru pickup procedure, in which the teacher on duty needs to be very attentive and observe the cars approaching the bay for pickup in order to queue up the students at the pickup bay. With the outbreak of the COVID-19 pandemic, the teachers not only need to observe the vehicles heading to the pickup bay but also to ensure that the students obey social distancing norms. To alleviate after school pickup related problems, a mobile application for after school pickup is proposed. With this application, school teachers can obtain a list of pickup requests from parents via the mobile application and queue up the students accordingly, with safe social distancing, at the pickup bay. This can ease the crowded situation, minimize parents' waiting time at the pickup lane, and reduce the air pollution caused by vehicle engines that are left running while waiting at the pickup lane.

Keywords: Drive-thru pickup · Private vehicles · School pickup · School-related traffic congestion
1 Introduction

School pickup is a daily routine for most parents during the school term. A thorough review of school pickup related articles reveals that the pickup activity causes school-related traffic congestion. School-related traffic congestion refers to the overcrowding of roads or streets around the school area during drop-off and pickup periods due to the use of private vehicles to transport children to and from school [1]. School-related traffic congestion mostly happens around middle/secondary and elementary/primary schools [1]. This problem occurs in various countries such as the United States [1, 2], the United Kingdom [1], China [3, 4], Vietnam [5], and Malaysia [6]. This issue has posed potential
safety and health concerns, especially for the communities in and around school areas. An increase in child pedestrian injuries around school areas caused by on-street parking and high traffic flow has been reported [7, 8]. Lu et al. [3] found that the traffic congestion index for workdays during the school holidays is 20% lower than during the school terms, and that this leads to a remarkable reduction of a vehicle criteria pollutant, particulate matter (PM).

Various considerations and strategies have been proposed to address and ease the school-related traffic congestion problem. One way is to educate parents so that they are aware of the consequences of the problem. Parents are educated to strictly follow traffic rules, such as no on-street parking around the school zone, to reduce child pedestrian injuries. Parents are also urged to consider alternative modes of school transport such as walking and biking. Besides that, school curricula are designed to introduce the physical health and environmental benefits of active commuting, that is, walking or biking to school, as well as the different modes of transport to school and the impact of each mode, especially on the community and the environment. In addition, various programs or campaigns such as the Walk to School Program [9] and the Walking School Bus Program [10, 11] are held to raise awareness about road safety and physical activity. In this way, students are encouraged to walk or bike to school. To ensure student safety while walking or biking on the road, pedestrian walkways are built and crossing guards are employed. Carpooling and school bus services are suggested to reduce the number of students taken to school by private vehicles and thus alleviate the school-related traffic congestion problem.

Despite the various considerations and strategies that have been proposed to ease school-related traffic congestion caused by private vehicles, cars remain the choice of some parents. The primary reasons why parents transport children to and from school in private vehicles include distance from home, car or motorcycle ownership, time constraints, family schedule, weather considerations, urban form, stranger danger, traffic hazards, and safety concerns [1, 5, 6, 12–14]. For these reasons, private vehicles, especially cars, have become the choice of many parents for dropping off and picking up children from school. It was reported that cars are used as the daily commute for around 75% of school-aged children in the United States [1]. In the United Kingdom, cars are used as the mode of transport by roughly 50% of school children [15]. It was also reported that active commuting to school has continued to decline over time [16].

In Malaysia, school-related traffic congestion is a common scenario, especially in urban areas [6, 17, 18]. The after school waiting bay of most schools in urban areas is normally very crowded with students waiting for transport to travel home. Teachers on duty and school student council members are assigned to monitor and control the crowded situation. Some schools implement a drive-thru pickup procedure, in which parents are asked to show a pickup tag with the student's name and class on it as the car approaches the school pickup bay. The teacher on duty announces the student's name and class and the car plate number as the car approaches the bay for pickup. Students can then be queued accordingly to wait for pickup. This procedure can shorten the pickup time.
As parents' waiting time at the pickup lane is shortened, petrol consumption and air pollution are also reduced: vehicle engines left running while waiting at the pickup lane not only consume fuel, but their exhaust emissions also pollute the air. However, the teachers on duty have to pay attention to read the student's name on the pickup tag as each car approaches for pickup, and on rainy days this becomes a challenge. With the outbreak of the COVID-19 pandemic, the World Health Organization (WHO) developed a list of preventive measures to curb the spread of the viral infection, one of which is to keep a social distance of 1–2 m from each other [19]. The outbreak makes the pickup process more challenging than ever, especially for teachers on duty at the pickup bay: while observing which car is approaching the school for pickup, they also need to ensure that the children obey the preventive measure of keeping a safe social distance from each other.

This paper reports the research work of designing and developing a mobile application for after-school pickup. Section 2 presents the prototype design of the application system. The implementation of this application system is explained with user flow charts and some screenshots of the prototype user interface in Sect. 3. This is followed by Sect. 4, which presents the security features of the application. Section 5 shows the functional testing results in tabular format. Section 6 concludes the research work with suggestions for future work.
2 The Prototype Design of the Proposed School Pickup Application System

The School Pickup mobile application facilitates the after-school pickup process and is built with Android Java and a Firebase database. The application system has three types of users: administrators, parents, and teachers. Figure 1 is the use case diagram of the application system; it presents the general functions of School Pickup and illustrates the interaction between the users and the functions of the system. In this system, school administrators and teachers can update the pickup timetable, which can be viewed by parent users. Parent users interact with the system to indicate whether they are coming for after-school pickup. Based on the parents' inputs, the teachers can track all pickup requests through the application, queue the students at the pickup bay accordingly, and manage the students waiting there, which eases the crowding at the after-school pickup bay. The presentation of this paper focuses on the teacher user interface and the parent user interface.
Fig. 1. Use case diagram of school pickup.
3 The User Interface Design of the School Pickup System

Figure 2(a) shows the home screen of the School Pickup System. All parent users have to register with their name, email address, a login password, contact number, and vehicle plate number, as in Fig. 2(b), before they can log in to use the system. When registered parents log in, they see the navigation drawer shown in Fig. 3: Fig. 3(a) is the navigation drawer of the parent user interface, whereas Fig. 3(b) is that of the teacher user interface. The parent user interface offers six navigation options: Pickup Student, Add Student, My Students, View Teacher, View Timetable, and Logout. Add Student lets parents add a child's information for pickup; the added information can then be viewed via the My Students option. View Teacher lets parents view a teacher's contact number and communicate with the teacher about pickup-related matters. View Timetable lets parents view the pickup timetable for the week; this option is useful whenever the pickup schedule changes for special occasions, as parents can view the updated schedule in the application. The Pickup Student option allows parent users to inform the teacher at the pickup bay that they are on the way for pickup. As for the teacher user interface, View Pickup List (Fig. 3b) shows the teachers the list of students whose parents are On The Way or Arriving Soon for pickup that day. Figure 4 presents the pickup flow chart.
Fig. 2. (a) Home screen of the School Pickup System. (b) Registration page.
Fig. 3. Navigation drawer of school pickup for a parent user (a), and a teacher user (b).
Fig. 4. Pickup flow chart.
When parent users start the pickup trip, they tap the Pickup Student option (Fig. 3) in the navigation drawer. A location map that indicates the parent's location, with a Pick Up button, is then shown (Fig. 5a). When the parent user taps the Pick Up button (Fig. 5a), the parent user interface view changes to Arriving Soon (Fig. 5b), and a pickup message is displayed on the teacher user interface as On The Way, with the student's name, gender, class, and the parent's car plate number, as in Fig. 6(a). When parents are near the school compound, they can tap Arriving Soon (Fig. 5b), and the pickup status changes to Arriving Soon in the teacher user interface (Fig. 6b). The mobile application also provides parents with useful information related to pickup matters.
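To make the flow in Figs. 4–6 concrete, the following sketch models the pickup-status transitions as a small state machine. It is an illustration only: the actual app is implemented in Android Java on Firebase, and the class, field, and status names here are hypothetical.

```python
# Illustrative model of the pickup-status flow described above; the real app
# is Android Java backed by Firebase, and all names here are hypothetical.
from dataclasses import dataclass

@dataclass
class PickupRequest:
    student_name: str
    student_gender: str
    student_class: str
    car_plate: str
    status: str = "Idle"

    def tap_pickup(self) -> None:
        # Parent taps Pick Up: the teacher's list shows the request as On The Way
        assert self.status == "Idle"
        self.status = "On The Way"

    def tap_arriving_soon(self) -> None:
        # Parent taps Arriving Soon near the school compound
        assert self.status == "On The Way"
        self.status = "Arriving Soon"

    def tap_done_pickup(self) -> None:
        # The request is removed from the teacher's pickup list
        assert self.status == "Arriving Soon"
        self.status = "Done"
```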
Fig. 5. Pickup button (a) and arriving soon button (b) in parent user interface.
Fig. 6. Pickup list showing some parents are on the way (a) and arriving soon (b) in the teacher user interface.
4 Security Features

As the system captures users' personal information, it is important that the proposed application possesses security features to safeguard that information. Several security measures are implemented to help prevent misuse of the app and to keep it secure. One measure is the implementation of Firebase Rules, which protect the database with a set of rules. In this project, the rules allow reading and writing of data by authenticated users only, i.e., only users who have logged into Firebase with their account can read and write data. Figure 7 shows the Firebase rules for the Cloud Firestore database.
Fig. 7. Firebase rules of the Cloud Firestore database.
Password rules are also applied to ensure that user passwords have at least 8 characters, so that they are not easily guessable. Figure 8 shows the code snippet that checks that the password used to register an account is at least 8 characters long. Besides that, to prevent a normal user from creating a dummy teacher account and accessing the functions of the teacher interface, a teacher's account can only be created through the admin function. In addition, each student has a unique identification number (ID), and only the Admin can create or add student identification numbers in the database.
Fig. 8. Snippet of code for checking password length.
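Since Fig. 8 is not reproduced here, the fragment below restates the same length rule as a minimal sketch; the app itself is written in Java, and the function and constant names are ours, not the app's.

```python
MIN_PASSWORD_LENGTH = 8  # rule described above; constant name is illustrative

def is_valid_password(password: str) -> bool:
    # Reject registration passwords shorter than 8 characters
    return len(password) >= MIN_PASSWORD_LENGTH
```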
5 Prototype Functionality Testing Results

Functionality testing was conducted to verify the performance of the prototype system. Complete testing covering the Admin, Teacher, and Parent interfaces, software navigation, user registration, and login was conducted. All results confirmed the performance of the prototype system, no functionality issues were found, and the application system fulfils the expected functionality requirements. The presentation of this paper focuses on the functional testing of the Teacher and Parent interfaces. Table 1 presents the functionality testing results with the test input, expected output, and testing outcome.

Table 1. Functionality testing results.

| Test input | Expected output | Outcome |
|---|---|---|
| Teacher interface | | |
| a) Home | | |
| i. Tap on home button | Teacher information including name, email, and contact number is shown | Fulfilled |
| ii. Tap on edit button | Teacher information can be edited | Fulfilled |
| b) View pickup list | | |
| i. Tap on view pickup list | Display pickup list information that includes parent vehicle plate number, pickup status, student name, gender, and class | Fulfilled |
| ii. Tap on clear pickup list | Pickup list is cleared | Fulfilled |
| c) View students | | |
| i. Tap on view students list | Display a list of all registered students | Fulfilled |
| d) View timetable | | |
| i. Tap on view timetable | Display pickup timetable for the week | Fulfilled |
| ii. Tap on edit timetable | Timetable is editable | Fulfilled |
| iii. Tap on confirm timetable | Timetable is updated | Fulfilled |
| e) Logout | | |
| i. Tap on logout | Logout a teacher user | Fulfilled |
| Parent interface | | |
| a) Home | | |
| i. Tap on home button | Parent information including parent name, email, contact number, and vehicle plate number is shown | Fulfilled |
| ii. Tap on edit button | Parent information can be edited | Fulfilled |
| b) Pickup student | | |
| i. Tap on pickup button | Parent user interface view changes to arriving soon and the teacher user interface shows on the way message with student name, student gender, class and parent car plate number | Fulfilled |
| ii. Tap on arriving soon button | The pickup status in the teacher user interface changes from on the way to arriving soon | Fulfilled |
| iii. Tap on done pickup button | The pickup request is removed from the pickup list in the teacher user interface | Fulfilled |
| c) Add student | | |
| i. Add student whose ID is in the database | Student name and information are added for pickup | Fulfilled |
| ii. Add student whose ID is not in the database | Show error message and parents are asked to contact the Admin | Fulfilled |
| d) My students | | |
| i. Tap on my students | Display a list of student names registered for pickup | Fulfilled |
| e) View teacher | | |
| i. Tap on view teacher | Display teacher information that includes teacher name, email and contact number | Fulfilled |
| f) View timetable | | |
| i. Tap on view timetable | Display timetable for pickup | Fulfilled |
| g) Logout | | |
| i. Tap on logout | Logout a parent user | Fulfilled |
6 Conclusion

In this research work, a drive-thru school pickup mobile application is proposed to mitigate school-related traffic congestion. With this application, teachers can queue the students at the pickup bay based on the pickup requests sent by parents, which helps the teachers manage the students at the bay and eases crowding there. With the crowding under control, the teachers can then ensure the children observe the preventive measure of social distancing. Because students are ready in the queue, parents can simply drive through, pick up their children, and start the journey home. This minimizes parents' waiting time at the pickup lane and thus reduces petrol consumption and air pollution. Future enhancements may include a pickup dashboard and data analytics to provide useful pickup-related information to teachers and parents.
References

1. La Vigne, N.G.: Traffic congestion around schools. US Department of Justice, Office of Community Oriented Policing Services (2007)
2. Plummer, D.: School zone traffic congestion study. Miami-Dade County Metropolitan Planning Organization, Miami (2002)
3. Lu, M., Sun, C., Zheng, S.: Congestion and pollution consequences of driving-to-school trips: a case study in Beijing. Transp. Res. Part D: Transp. Environ. 50, 280–291 (2017)
4. Sun, W., Guo, D., Li, Q., Fang, H.: School runs and urban traffic congestion: evidence from China. Reg. Sci. Urban Econ. 86, 103606 (2021)
5. Hiep, D.V., Huy, V.V., Kato, T., Kojima, A., Kubota, H.: The effects of picking up primary school pupils on surrounding street's traffic: a case study in Hanoi. The Open Transport. J. 14(1), 237–250 (2020)
6. Nasrudin, N., Md Nor, A.R.: Travelling to school: transportation selection by parents and awareness towards sustainable transportation. Procedia Environ. Sci. 17, 392–400 (2013)
7. LaScala, E.A., Gruenewald, P.J., Johnson, F.W.: An ecological study of the locations of schools and child pedestrian injury collisions. Accid. Anal. Prev. 36(4), 569–576 (2004)
8. Schwebel, D.C., Davis, A.L., O'Neal, E.E.: Child pedestrian injury: a review of behavioral risks and preventive strategies. Am. J. Lifestyle Med. 6(4), 292–302 (2012)
9. Woodruff, A.: Strategies to encourage active travel to school: walk to school and beyond, pp. 1–16 (2019)
10. Benson, S.S., Bruner, B., Mayer, A.: Encouraging active transportation to school: lessons learned from implementing a walking school bus program in Northeastern Ontario. J. Transp. Health 19, 100914 (2020)
11. Carlson, J.A., et al.: Walking school bus programs: implementation factors, implementation outcomes, and student outcomes, 2017–2018. Prev. Chronic Dis. 17, E127 (2020)
12. Bradshaw, R.: Why do parents drive their children to school? Traffic Eng. Control 36(1), 16–19 (1995)
13. McMillan, T.E.: The relative influence of urban form on a child's travel mode to school. Transport. Res. Part A: Policy Pract. 41(1), 69–79 (2007)
14. Pooley, C.G., Turnbull, J., Adams, M.: The journey to school in Britain since the 1940s: continuity and change. Area 37(1), 43–53 (2005)
15. Dickson, M.: Characteristics of the escort education journey. Transport Trends 47–54 (2000)
16. Villa-González, E., Barranco-Ruiz, Y., Evenson, K.R., Chillón, P.: Systematic review of interventions for promoting active school transport. Prev. Med. 111, 115–134 (2018)
17. Liew, J.X.: Easing jams near schools, The Star, Malaysia. https://www.thestar.com.my/metro/metro-news/2021/04/07/easing-jams-near-schools (2021). Last accessed 31 Jan 2022
18. Michael, K., Shukri, N.: Three schools and a massive traffic problem, The Star, Malaysia. https://www.thestar.com.my/metro/community/2017/06/23/chaotic-trafficon-road-with-three-schools (2017). Last accessed 31 Jan 2022
19. Elengoe, A.: COVID-19 outbreak in Malaysia. Osong Public Health Res. Perspect. 11(3), 93–100 (2020)
Were Consumers Less Price Sensitive to Life Necessities During the COVID-19 Pandemic? An Empirical Study on Dutch Consumers

Hao Chen¹(B) and Alvin Lim¹,²

¹ Retailer Products Research and Development, NielsenIQ, Chicago, IL 60606, USA
[email protected]
² Goizueta Business School, Emory University, Atlanta, GA 30322, USA
[email protected]

Abstract. This research investigates if consumers were less price sensitive to life necessities during the COVID-19 pandemic via a demand modeling system. Consumers' price sensitivity was explicitly quantified by the price elasticity of demand. Consumer behavior in nine categories of products considered as life necessities was studied in two non-overlapping time periods: a year before the onset of the COVID pandemic and a year following the initial panic buying caused by the declaration of the COVID pandemic. The changes in price elasticity of demand between the two periods across the nine product categories were determined from the weekly sales data of a Dutch retailer. Using the proposed demand modeling system applied to available data, it was empirically found that the majority of essential food products were price inelastic, while the majority of non-food products were price elastic during the COVID period. Among the nine product categories, four categories were identified to have significantly different elasticities across the two time periods, while eight categories were observed to have practically smaller magnitudes in elasticities. These insights not only prove the usefulness of the proposed demand modeling system, but also provide valuable theoretical and managerial implications for retail business practitioners, particularly in pricing and inventory planning.

Keywords: Consumer behavior · Retail analytics · Price elasticity of demand · Regression analysis · COVID-19

1 Introduction
The outbreak of COVID-19 in December of 2019 [62] has shaken the world in an unprecedented way not seen since the great influenza pandemic of 1918 [42]. Even at the time of this writing, the spread of COVID-19 is yet to be brought under reasonable control, and a surge due to the new Omicron variant is taking root worldwide according to the World Health Organization [63]. Under such a severe health
crisis, consumers' grocery shopping behavior has inevitably been impacted in a non-trivial way all over the globe [53]. Due to the countermeasures imposed by governments, such as community lockdowns, travel restrictions, and the closure of sites of dense gatherings, consumers resorted much more to online shopping [29] and tended to visit grocery stores for essential products only [27]. Toilet paper, disposable gloves, hand wash, and sanitizers were among the most popular products that consumers stockpiled during the early stages of COVID-19 [7]. Different theories have been developed in the literature to explain such panic buying and hoarding behaviors, which are hypothesized to originate from complex human psychological emotions, such as perceived scarcity [13], future scarcity [6], and anticipated regret [10]. In addition, according to a survey by Slickdeals.net, US consumers were spending 18% more during the COVID-19 period [9], which was echoed by another study released by J.P. Morgan, which found that the more time consumers stayed at home, the more frequently they made impulse purchases [32]. In a nutshell, the COVID-19 pandemic has inevitably transformed consumers' spending behaviors.

In retail analytics, the price elasticity of demand (PED) is a useful measure of the effect of price on sales volume, widely perceived as the sensitivity of demand to a change in price [48]. For a branded product, pricing is among the most important decisions in the marketing mix, because price changes can usually have an immediate effect on sales [43]. In practice, the price elasticity of demand is often estimated by assuming a functional sales-response equation. In the appendix of [64], the authors provided a detailed derivation of how PED can be estimated by assuming a multiplicative functional form. More recently, Chen and Lim [11] explicitly characterized the transfer of demand between products when analyzing price elasticity: instead of using pairs of products, they proposed a network-based PED model taking all substitute products into account, and their numerical analysis confirmed the feasibility and superiority of the network-based model compared to the traditional model. We detail the formal definition of PED and review three commonly used functional forms in Sect. 2.1. Moreover, it is generally accepted that price and demand are inversely correlated, which is supported both by the classical economic theory of supply and demand [61] and by empirical studies such as Cooper [16] and Andreyeva et al. [2]. More relevant literature is reviewed in Sect. 2.1 as well.

The primary objective of this research is to explore how the COVID-19 pandemic affected consumers' price sensitivity to life-essential products. In other words, the paper tries to answer the following question: were consumers less price sensitive to life necessities during the COVID-19 period? To answer this question, a PED modeling system requiring minimal manual configuration was developed. The available retail data is fed into the modeling system, and the output results are used in data-driven business decision making. Several papers have discussed the impact of COVID-19 on retail sales in the extant literature, such as Goddard [25] for the impact on food retailing and Bhatti et al. [7] for the influence on e-commerce. In a more recent paper, Lim et al. [39] observed via quantitative analyses that the
COVID-19 pandemic boosted online shopping at the expense of in-store sales. The authors concluded that the pandemic not only drove consumers to stock up on shelf-storable fruits, but also triggered a demand shift towards organic poultry and beef products over conventional ones that is trending towards a new normal. The fact that organic food products are generally more expensive than conventional ones motivates us to investigate a potential shift in consumer price sensitivity. While research articles exploring the impact of COVID-19 on consumer behavior exist in the literature, to the best of our knowledge, no prior work specifically addresses the change in the price elasticity of demand of life-essential products during the COVID-19 pandemic. Therefore, the primary contribution of this study is to fill this research gap and provide important insights into how COVID-19 impacted consumers' response to price changes. Via the proposed PED model, several important insights were obtained, enabling practitioners to gain a deeper understanding of the effect of the pandemic on consumers' perception of price changes and to adjust marketing strategies and supply planning practices accordingly.

The rest of the paper is organized as follows. In Sect. 2, we provide some theoretical background and develop hypotheses on consumer price perception. The data, variables, and the proposed PED model are detailed in Sect. 3. The analysis results are presented in Sect. 4. We conclude with some remarks and describe future research work in Sect. 5.
2 Theoretical Background and Literature Review

2.1 Price Elasticity of Demand
For a given product, its price elasticity of demand, η, is defined as the percentage change in demand for the product per one percent change in its price:

\[
\eta = \frac{\delta S_t / S_t}{\delta P_t / P_t} = \frac{\delta S_t}{\delta P_t} \cdot \frac{P_t}{S_t}, \tag{1}
\]
where δS_t is the change in sales volume at time t, δP_t is the change in price at time t, and S_t and P_t are the sales volume and price at time t, respectively. It is worth noting that this definition of the price elasticity of demand is, technically speaking, the so-called short-run elasticity [1], which represents the sales reaction to temporary price changes, since data collected at the retail level primarily reflect temporary price changes [44]. Compared to long-run elasticities, which may reflect major structural shifts in the business environment and consumer behavior, short-run elasticities are usually more elastic due to factors such as consumer expectations and stockpiling behavior [8]; short-run elasticity is the focus of this research.

In terms of estimating price elasticity in practice, a functional form of the sales volume is usually assumed to describe how it is affected by price and other control variables [38]. In previous studies, there are three commonly employed
functional forms: linear, multiplicative, and exponential. Bolton [8] did a comprehensive summary of these three canonical functional forms and found that the linear form exhibits a larger absolute magnitude of apparent bias than either of the interactive models, which advocates for either the multiplicative or the exponential form. In particular, the multiplicative form is given in Eq. (2):

\[
S_t = \alpha_0 \, P_t^{\alpha_1} X_t^{\alpha_2} \epsilon_t, \tag{2}
\]
\[
\eta = \alpha_1, \tag{3}
\]
where S_t is the product's sales volume at time t, P_t is the product's price at time t, and X_t is (are) other explanatory variable(s) representing other non-trivial factors affecting sales. By taking the first derivative with respect to P_t on both sides of Eq. (2), it can straightforwardly be shown that the price elasticity of demand is α₁, as in Eq. (3), under the multiplicative functional form. Equation (2) is linearizable by logarithmic transformation, after which statistical inference on the unknown model parameters can be conducted as usual. The exponential form is an alternative, for which the right-hand side of Eq. (2) changes to exp(α₀ + α₁ P_t + α₂ X_t + ε_t).

Price elasticity has always been a metric of focus in retail analytics. Whether a product is elastic (|η| > 1) or inelastic (|η| < 1) determines its sales and pricing strategy. Colchero et al. [15] studied the price elasticity of demand for sugar-sweetened beverages and soft drinks in Mexico and found that the demand for sugar-sweetened beverages was indeed elastic: a 10% increase in price was associated with an 11.6% decrease in sales volume. Andreyeva et al. [2] conducted a systematic review of research on the price elasticity of demand for food, covering 160 studies on major food categories. Price elasticities for food and non-alcoholic beverage categories ranged from −0.81 to −0.27, and all mean estimates from the meta-analysis were inelastic, such as soft drinks (−0.79), juices (−0.76), and fruits (−0.70). Moreover, in a study on dairy products, Subramanian et al. [57] reported that both butter and milk powder in rural India were elastic, with PEDs of −1.342 and −1.266, respectively. Green et al. [28] pulled together 3495 own-price elasticities for food categories from 136 studies, and reported that cereals have average PEDs of −0.61 and −0.43 for low- and high-income countries, respectively; the corresponding numbers were −0.78 and −0.60 for meat, both signaling inelastic demand. Evidently, even for the same product category, the specific PED numbers reported can differ due to factors such as data availability, methodology employed, and demographic differences. In this paper, we propose a hybrid of the multiplicative and exponential forms to estimate price elasticities at the aggregated product category level for a variety of life necessities for a Dutch retailer. More technical details are presented in Sect. 3.
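As a simple illustration of Eqs. (2)–(3), the sketch below simulates weekly data from the multiplicative form and recovers the elasticity as the slope of a log-log regression. All numbers are made up for demonstration; this is not the model of Sect. 3.

```python
# Toy illustration of Eqs. (2)-(3): under the multiplicative form, the
# coefficient on log(price) in a log-log regression is the PED.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 52                                  # one year of weekly observations
price = rng.uniform(0.8, 1.2, size=n)   # weekly price index
true_eta = -0.8                         # assumed (made-up) elasticity
volume = 100.0 * price**true_eta * np.exp(rng.normal(0.0, 0.05, size=n))

X = sm.add_constant(np.log(price))
fit = sm.OLS(np.log(volume), X).fit()
print(fit.params[1])                    # estimate close to -0.8, i.e. alpha_1
```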
2.2 Life Necessities
Despite the lack of a globally recognized list of life necessities, there is some extant literature in the fields of sociology, policy studies, and poverty prevention
outlining life-essential products in each respective context. According to the pioneering work by Rowntree [51], poverty was measured as insufficient income to obtain the minimum necessaries for the maintenance of merely physical efficiency, and minimum necessaries were classified into two categories: food necessities, such as bread and vegetables, and non-food necessities, such as basic clothing. In the paper [36], the authors investigated the opinions of 250 Korean adults about their perception of the necessities of life and found that products or services related to food, clothing, and housing were among the highest-ranked necessities. The UK Poverty and Social Exclusion (PSE) Survey (https://www.poverty.ac.uk/) listed six categories that constitute life necessities in the UK: items related to housing, diet and clothing, household maintenance, social and family life, financial needs, and children's needs [19]. We generally agree with the opinion expressed by Pantazis et al. [45] that there is hardly a consensus among the general population regarding what counts as a life necessity, since opinions differ by geographical location, education, income, gender, age, and ethnic group. In a more recent paper, Fahmy et al. [22] also suggested that the public's understanding of the term "necessity" is diverse and may not always align with researchers' interpretations. In that regard, we pulled together the following list of 9 product categories in a retail environment that are considered in this paper as life necessities:

– (1) bread; (2) ham and sausage; (3) fresh vegetable; (4) milk; (5) salad; (6) cheese
– (7) hand sanitizer; (8) shampoo & conditioner; (9) toothpaste

This list combines common life-essential products from the literature reviewed above with the data available to us. Therefore, although it is sufficient for this research, the list may not be applicable in other scenarios. Among the 9 categories, the first 6 are food products and the remaining 3 are non-food products, which shall shed light on how the pandemic impacted consumers' price sensitivity to both food and non-food life-essential products. In particular, hand sanitizer is included to form a comprehensive view, especially during the COVID pandemic period; consumer behavior towards it could differ from the rest of the categories, since it is closely associated with consumers' perception of COVID-19 and panic buying [21, 37].
2.3 Panic Buying Behavior
Panic buying occurs when consumers buy unusually large amounts of a product in anticipation of, or after, a disaster or perceived disaster, or in anticipation of a large price increase or shortage [54]. It is an individual's reaction to a combination of psychological perceptions, such as the threat of a crisis, scarcity of products, and fear of the unknown [65], and it can manifest in large groups, leading to herd behavior [5]. Under the health crisis due to COVID-19, quite a few studies have focused on the impact of panic buying in the retail industry. Prentice et al. [47] observed the non-trivial influence of COVID-19 on consumer behavior, i.e.,
unusual panic buying behavior, and relied on the scarcity principle, crowd psychology, and contagion theory to investigate its antecedents and consequences. Islam et al. [31] confirmed that scarcity messages with limited quantity and time significantly heightened perceived consumer arousal, which was positively correlated with impulsive buying. In an article by Arafat et al. [3], the authors analyzed a total of 784 media reports and found that a sense of scarcity was the most important factor in about 75% of the reports, followed by increased demand, the importance of the product (45.02%), and anticipation of a price hike (23.33%). Moreover, Chronopoulos et al. [12] reported a strong increase in grocery spending in the UK, consistent with panic buying and stockpiling behavior, in the two weeks following the WHO announcement describing COVID-19 as a pandemic. In a similar fashion, Wang et al. [58] showed that after COVID-19, Chinese consumers extended their food reserves from an average of 3.37 to 7.37 days and were willing to pay an extra premium of 61% on perishable food reserves due to their perception of COVID-19. Weersink et al. [60] reported that due to panic buying, overall sales in Canadian grocery stores were 46% higher for the week ending March 14, 2020 compared to the same period of 2019; in terms of specific grocery store product categories, milk sales increased by 31%, butter by 76%, and fresh chicken by 50%.

There also exists research that rigorously quantified the effects of COVID-19 panic buying. Keane and Neal [33] proposed to use keyword search data from Google Trends (https://trends.google.com/trends/), and then composed the so-called panic index and search index through factor analysis [52] to explicitly measure panic buying. Singh et al. [55] collected data from questionnaires distributed to 357 participants in Fiji and employed structural equation modeling (SEM) [34] to break down customers' panic buying behavior into different factors. In the paper by Prentice et al. [46], the researchers quantitatively concluded that the interventions and support from government and businesses influenced consumers' panic buying engagement.

In this paper, we specify two non-overlapping periods, (1) the pre-COVID period and (2) the COVID period, and investigate how the price elasticity of demand for each of the product categories changed between them. Since the impact of COVID on consumers' panic buying was not negligible according to the reviewed literature, the period of COVID-induced panic buying was excluded from the COVID period in our analysis to obtain a clearer picture of consumers' grocery shopping behavior in that period.
2.4 Hypotheses Development
Wang et al. [58] reported that consumers were willing to pay an extra premium of 61% on perishable food reserves in the initial weeks when COVID-19 was spreading in China. It was also shown by Weersink et al. [60] that grocery sales for fresh product categories such as milk, butter, and fresh chicken dramatically increased after the outbreak of COVID-19. However, to the best of our knowledge, there are hardly any research articles indicating a decrease in
prices for grocery food products. Considering factors such as inflation and supply chain disruptions [49], as well as governments' countermeasures against COVID-19, e.g., localized lockdowns [4, 20], it was extremely unlikely, if not impossible, that the prices of food products would decrease during the COVID period. Goolsbee and Syverson [26] concluded that consumers avoided visiting busy and crowded stores and supermarkets, and more consumers globally tended to shift their purchasing online [7]. As a consequence of this shift, it was assumed that consumers made fewer visits to grocery stores but spent more per visit, which is corroborated by the exploratory data analysis presented in Sect. 3 of this paper for the Dutch retailer. Focusing on the COVID period, except for cheese, which has a long shelf life [40], the remaining 5 food necessities specified in Sect. 2.2 are all perishable [56], so it is unlikely that consumers were willing to store a large amount of perishable foods even when they were on promotion. Therefore, for the 6 food necessities considered in this paper, we hypothesize that the demands were inelastic during the COVID period, reflecting our speculation that, due to limited grocery store visits, consumers would tend to buy essential food groceries at reasonably increased prices. Hence, we propose to test the following hypothesis:

Hypothesis 1: The demands for the 6 food necessities are inelastic during the COVID period, i.e., the absolute value of PED is less than 1 for the COVID period.

However, the 3 non-food necessities, such as toothpaste and shampoo & conditioner, are non-perishable products that can be stored for an extended amount of time [35], and consumers would likely buy more during promotions. Therefore, it is reasonable to expect that the demands for non-food necessities will be elastic even during the COVID period. In other words, we hypothesize that the pandemic does not fundamentally change the price-demand nature of non-food necessities such as toothpaste, shampoo, and conditioner. In terms of hand sanitizer products, however, the situation could be different, since they were tightly correlated with consumers' perception of COVID-19 and panic buying [37]. That said, the general hypothesis to test for non-food necessities is as follows:

Hypothesis 2: The demands for the 3 non-food necessities are elastic during the COVID period, i.e., the absolute value of PED is greater than 1 for the COVID period.

The above two hypotheses test price elasticities based on plausible consumer behavior in the COVID period. In addition, we were also interested in testing the change in PED across the two non-overlapping periods: pre-COVID and COVID. Goolsbee and Syverson [26] reported that consumers tended to avoid crowded grocery stores and supermarkets under the pandemic, and Delasay et al. [18] observed a similar pattern of less frequent in-store shopping during COVID. We therefore hypothesize that the magnitude of PED became smaller, reflecting either decreased volume at stable prices due to less in-store shopping, or stable volume at increased prices. This hypothesis is as follows:
Hypothesis 3: The absolute values of PEDs across life necessities for the COVID period are smaller than those of the pre-COVID period. For Hypothesis 3, if the magnitudes of PEDs are indeed smaller than those of the pre-COVID period, it will then have important practical implications since price elasticity is an important input to the subsequent price optimization process [23,59] in retail pricing.
3 Data, Variables and the Proposed PED Model

3.1 Data
A Dutch retailer made its weekly sales data available to us for research purposes. The weekly sales data were aggregated to the product category level. As discussed in Sect. 2.3, we identified 9 product categories covering both food and non-food necessities among the available categories. We specify two non-overlapping periods to investigate the change of price elasticity for each of the product categories: (1) the pre-COVID period, from December 30, 2018 to December 28, 2019 (52 weeks in total); and (2) the COVID period, from July 5, 2020 to July 3, 2021 (52 weeks as well). Since the available data came from a Dutch retailer, the COVID period was defined to align with the Dutch government's announcement that supermarkets, restaurants, cafes, and bars could reopen from June 1, 2020 after a lockdown period.

We first present an exploratory data analysis to provide a general picture of the sales information for the selected 9 product categories in both periods. Summary sales information about the 9 product categories is presented in Table 1, and the total sales information is broken down by product category in Table 2. The terminology used is defined as follows (a computational sketch follows Table 1):

– Average Basket Size: total sales amount / total number of transactions
– Category Penetration Rate (PR) [17]: total number of transactions containing at least one product from the category / total number of transactions across all categories

Note that the inflation rate in the Netherlands, quantified by the consumer price index (CPI), was moderate: up 1.28% in the year 2020 and 2.40% in 2021 [30]. From Table 1, it turns out that the total number of transactions dwindled by about 4.587%, but the average basket size and total sales amount increased by 15.465% and 10.168%, respectively, which were significantly larger than the increase in CPI. This clearly indicates that, compared with the pre-COVID period, consumers visited less frequently but spent over 15% more money per transaction at the retailer. In addition, from Table 2, it is clear that both the sales amount and penetration rate in the COVID period were larger than those of the pre-COVID period: for example, the sales of cheese increased by 10.714%, and its penetration rate increased by 12.761%. The most interesting category is hand sanitizer, whose sales increased by more than 66%, indicating that consumers bought many more
hand sanitizers in the COVID period than in the pre-COVID period. It was also observed that the 6 categories of food necessities in fact dominated sales in both periods. These preliminary observations are consistent with the literature, such as Wang et al. [58], who reported that consumers were willing to pay more on food consumption during the COVID-19 pandemic, and Delasay et al. [18], who observed that consumers had fewer in-store visits.

Table 1. Summarized sales information for the selected 9 product categories.

| | Pre-COVID | COVID | Change |
|---|---|---|---|
| Total number of transactions | 240,415,552 | 229,387,471 | −4.587% |
| Total sales amount | €1,151,227,630 | €1,268,288,223 | 10.168% |
| Average basket size (€/transaction) | €4.788 | €5.530 | 15.465% |
| Number of categories | 9 | 9 | 0.000% |
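The sketch below shows how the two metrics defined above can be computed from transaction-level data; the table layout and column names are illustrative, not the retailer's actual schema.

```python
# Hypothetical computation of Average Basket Size and Category Penetration
# Rate from a transaction-level table; all column names are illustrative.
import pandas as pd

tx = pd.DataFrame({
    "transaction_id": [1, 1, 2, 3, 3],
    "category": ["Milk", "Bread", "Cheese", "Milk", "Salad"],
    "amount": [1.10, 2.00, 4.50, 1.10, 2.80],
})

n_tx = tx["transaction_id"].nunique()
avg_basket_size = tx["amount"].sum() / n_tx  # total sales / total transactions

def penetration_rate(category: str) -> float:
    # Share of transactions containing at least one product of the category
    with_cat = tx.loc[tx["category"] == category, "transaction_id"].nunique()
    return with_cat / n_tx

print(avg_basket_size)           # 11.50 / 3, about 3.83
print(penetration_rate("Milk"))  # 2 of 3 transactions, about 0.67
```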
Table 2. Detailed sales and Penetration Rate (PR) information for the selected 9 categories.

| Category | Pre-COVID Sales | Pre-COVID PR | COVID Sales | COVID PR | Sales Change | PR Change |
|---|---|---|---|---|---|---|
| Cheese | €244,297,892 | 0.215 | €270,472,524 | 0.242 | 10.714% | 12.761% |
| Salad | €241,425,512 | 0.166 | €259,198,555 | 0.198 | 7.362% | 19.670% |
| Ham and Sausage | €227,843,670 | 0.166 | €265,724,585 | 0.194 | 16.626% | 17.395% |
| Fresh vegetable | €173,733,371 | 0.156 | €195,796,058 | 0.167 | 12.699% | 7.029% |
| Milk | €134,578,748 | 0.154 | €143,802,142 | 0.164 | 6.854% | 6.616% |
| Bread | €92,073,174 | 0.143 | €92,144,947 | 0.160 | 0.078% | 12.355% |
| Toothpaste | €23,939,323 | 0.014 | €24,884,618 | 0.015 | 3.949% | 6.726% |
| Shampoo & Conditioner | €9,408,591 | 0.005 | €9,713,717 | 0.007 | 3.243% | 26.525% |
| Hand Sanitizer | €3,927,350 | 0.004 | €6,551,077 | 0.005 | 66.807% | 21.983% |

3.2 Variables
The model variables are described in Table 3. In the context of a regression model, sales volume is the dependent variable. Following Clapp et al. in [14], a transformation for equivalization was made on the original sales volume by utilizing fixed product-specific weighting factors developed by Nielsen Consumer LLC. The primary reason for using the equivalized sales volume over the raw sales volume is that equivalized sales volume of products of different brands and packaging in a category can be compared and aggregated. Henceforth, sales volume refers to equivalized sales volume for the remainder of the paper unless otherwise specified. Since data has been aggregated to the product category level, price index was computed for each category per unit sales volume for each week, and was the
independent variable of interest. For each product category, the price index in each period was also divided by its period-specific, category-specific fair price [41], denoted r_{m,n}, where m = 1, 2 indicates the pre-COVID or the COVID period and n = 1, ..., 9 indexes the product categories; r_{m,n} is computed as the total sales amount of category n in period m divided by the total volume of the same category in the same period. The main purpose is to take possibly different fair prices into account, as COVID could have had an impact on consumers' perception of the reference (fair) price (see [24, 50]). The point estimate of the price elasticity of demand and its corresponding standard error (S.E.) are then computed as

\[
\hat{\eta} = \hat{\alpha}_{m,n} / r_{m,n}, \qquad \hat{s}(\hat{\eta}) = \mathrm{S.E.}(\hat{\alpha}_{m,n}) / r_{m,n},
\]

where α̂_{m,n} and S.E.(α̂_{m,n}) denote the original estimate and its associated S.E. obtained from the model proposed in Sect. 3.3. Moreover, several control variables are also included in the model: binary holiday and event indicators, number of stores, number of products, and a seasonality index.

Table 3. Description of variables in a model of a product category.
| Variable name | Data type: Description | Variable type |
|---|---|---|
| Sales Volume | Float: the weekly sales volume | Dependent |
| Price Index | Float: the weekly price index | Independent |
| Seasonality Index | Float: the index quantifies seasonal effects | Control |
| Number of Stores | Integer: number of stores that carry such category | Control |
| Number of Products | Integer: number of products that each category carries | Control |
| Holiday-New Year | Binary: 1 if the week is a NY week; 0 otherwise | Control |
| Holiday-Easter | Binary: 1 if the week is an Easter week; 0 otherwise | Control |
| Holiday-Easter Monday | Binary: 1 if the week is an Easter Monday week; 0 otherwise | Control |
| Holiday-Pentecost | Binary: 1 if the week is a Pentecost week; 0 otherwise | Control |
| Holiday-White Monday | Binary: 1 if the week is a White Monday week; 0 otherwise | Control |
| Holiday-Ascension | Binary: 1 if the week is an Ascension week; 0 otherwise | Control |
| Holiday-Christmas | Binary: 1 if the week is a Christmas week; 0 otherwise | Control |
| Event-Carnival | Binary: 1 if the week is a Carnival week; 0 otherwise | Control |
| Event-Mother's Day | Binary: 1 if the week is a Mother's Day week; 0 otherwise | Control |
| Event-Father's Day | Binary: 1 if the week is a Father's Day week; 0 otherwise | Control |
| Event-Back to School | Binary: 1 if the week is a Back to School week; 0 otherwise | Control |
3.3 The Proposed PED Model
The PED modeling system is centered around a hybrid multiplicative-exponential model to estimate the price elasticity of demand, because of the existence of binary control variables:

\[
y_t = \beta_0 \, p_t^{\alpha} \left( \prod_{k=1}^{K} x_{t,k}^{\beta_k} \right) \exp\!\Big( \sum_{j=1}^{J} \gamma_j z_{t,j} + \epsilon_t \Big), \tag{4}
\]

where, for a week t, y_t is the sales volume and p_t is the price index. The model also includes K continuous (positive) control variables and J binary control
variables. Taking the natural logarithm on both sides, we obtain the linearized form:

\[
\log(y_t) = \beta_0 + \alpha \log(p_t) + \sum_{k=1}^{K} \beta_k \log(x_{t,k}) + \sum_{j=1}^{J} \gamma_j z_{t,j} + \epsilon_t, \tag{5}
\]
where β₀ is the constant propensity and ε_t is assumed to follow a parametric distribution so that the likelihood approach can be applied to estimate the unknown parameters. Usually, a stationary white noise sequence is assumed: ε_t ∼ N(0, σ²). However, a more complicated form, such as ARMA(p, q) errors, can be assumed if the residuals show significant lower-order serial correlation. The proposed modeling process is given in Algorithm 1. For each category, we fit a separate model for each period, so we have 9 × 2 = 18 models.

Algorithm 1. The PED Modeling System

1: procedure PED-Modeling-System(X, Y)
2:   Initialize o ← {}
3:   Fit the regression with ε_t ∼ N(0, σ²) with all control variables.
4:   Perform model selection with stepwise AIC to exclude control variables that are not significant.
5:   Refit the regression with only the selected control variables.
6:   Denote the model as Model 1.
7:   Extract the model residuals and draw their autocorrelation and partial autocorrelation.
8:   if no significant serial correlation is detected within lag 3 then
9:     Set o ← Model 1
10:  else
11:    Identify the proper orders p and q from Step 7 for the serial correlation.
12:    Refit Model 1 with ε_t ∼ ARMA(p, q).
13:    Denote the above model as Model 2.
14:    Set o ← Model 2
15:  Output results from o.
16: stop
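For concreteness, a compact sketch of Algorithm 1 in Python is given below. The input column names, the lag-3 Ljung-Box check, and the ARMA(1,1) fallback are our illustrative choices; the authors' implementation is not published, so treat this as one possible realization under those assumptions.

```python
# A sketch of Algorithm 1, assuming a pandas DataFrame `df` with columns
# "volume" and "price_index" plus the control variables of Table 3; all
# column names and the ARMA(1,1) fallback are illustrative choices.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.statespace.sarimax import SARIMAX

CONTINUOUS = ["seasonality_index", "num_stores", "num_products"]
BINARY = ["holiday_easter", "holiday_christmas"]  # abbreviated list

def fit_ped_model(df: pd.DataFrame, fair_price: float):
    y = np.log(df["volume"])
    X = pd.concat([np.log(df[["price_index"] + CONTINUOUS]), df[BINARY]], axis=1)
    X = sm.add_constant(X)
    model = sm.OLS(y, X).fit()
    # Steps 4-5: backward elimination by AIC, a stand-in for stepwise AIC
    while True:
        candidates = [c for c in X.columns if c not in ("const", "price_index")]
        if not candidates:
            break
        trials = {c: sm.OLS(y, X.drop(columns=c)).fit().aic for c in candidates}
        best = min(trials, key=trials.get)
        if trials[best] >= model.aic:
            break
        X = X.drop(columns=best)
        model = sm.OLS(y, X).fit()
    # Steps 7-14: refit with ARMA errors if residuals are serially correlated
    lb = acorr_ljungbox(model.resid, lags=3)
    if (lb["lb_pvalue"] < 0.05).any():
        model = SARIMAX(y, exog=X, order=(1, 0, 1)).fit(disp=False)
    # Divide the price coefficient and its S.E. by the fair price r_{m,n}
    return (model.params["price_index"] / fair_price,
            model.bse["price_index"] / fair_price)
```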
Note that the modeling process only outputs the original estimated coefficient of the price index, α̂, and its associated S.E., S.E.(α̂), which must be divided by the fair price r_{m,n} to obtain the PED estimate η̂ and its corresponding S.E. ŝ(η̂). With a slight abuse of notation in the following part only, we let η denote the magnitude of PED, i.e., η = −1 × PED = |PED|, and its estimate η̂ > 0 accordingly, so η > 0 and η̂ > 0. We employ two different types of tests:

– To test the magnitude of PED against 1, we have

\[
H_0: \eta = 1 \quad \text{vs.} \quad H_a: \eta < 1, \qquad \text{p-value: } P\!\left(T < \frac{\hat{\eta} - 1}{\hat{s}(\hat{\eta})}\right),
\]

and a similar one-sided test can be constructed for the alternative η > 1.

– Statistically testing the change of the absolute values of PED across the two non-overlapping time intervals is a more difficult problem, since for each period we only obtained two numbers from the modeling process: the point estimate η̂ and its S.E. ŝ(η̂). Following the formulation of a vanilla two-sample t-test, we propose to conduct the following test:

\[
H_0: \eta_2 = \eta_1 \quad \text{vs.} \quad H_a: \eta_2 < \eta_1, \qquad \text{p-value: } P\!\left(T < \frac{\hat{\eta}_2 - \hat{\eta}_1}{\hat{s}_p}\right),
\]

where

\[
\hat{s}_p^2 = \frac{(N_1 - d_1)\,\hat{s}^2(\hat{\eta}_1) + (N_2 - d_2)\,\hat{s}^2(\hat{\eta}_2)}{(N_1 - d_1) + (N_2 - d_2)}
\]

is the pooled standard error, η₁ is the PED in the pre-COVID period, and η₂ is the PED in the COVID period. N₁ = N₂ = 52 are the sample sizes, and d₁, d₂ are the numbers of variables in the final models selected by Algorithm 1 for the pre-COVID and COVID periods, respectively. The test statistic T follows a t distribution with degrees of freedom equal to N₁ + N₂ − d₁ − d₂. By using the pooled standard error, we aim to take the standard errors from both periods into account. Similarly, we could also construct a one-sided test for the alternative η₂ > η₁.
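The two tests can then be run directly on the output of the fitted models; a minimal sketch, assuming the (η̂, ŝ(η̂)) pairs and model sizes d₁, d₂ from Algorithm 1, is shown below.

```python
# Minimal sketch of the two one-sided tests above; magnitudes eta_hat are
# assumed positive, as in the text, and all inputs come from the fitted models.
from scipy import stats

def p_inelastic(eta_hat: float, se: float, dof: int) -> float:
    """H0: eta = 1 vs. Ha: eta < 1; a small p-value indicates inelastic demand."""
    return stats.t.cdf((eta_hat - 1.0) / se, dof)

def p_smaller_in_covid(eta1, se1, eta2, se2, d1, d2, n1=52, n2=52) -> float:
    """H0: eta2 = eta1 vs. Ha: eta2 < eta1, using the pooled standard error."""
    sp2 = ((n1 - d1) * se1**2 + (n2 - d2) * se2**2) / ((n1 - d1) + (n2 - d2))
    dof = n1 + n2 - d1 - d2
    return stats.t.cdf((eta2 - eta1) / sp2**0.5, dof)
```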
The proposed PED modeling system qualifies as a solution to our research problem in accordance with the research objectives outlined in Sect. 1, i.e., how the COVID-19 pandemic affects consumers' price sensitivity as quantified by the price elasticity of demand, for several reasons. First, the proposed model separately estimates the PED for the two non-overlapping time periods through a regression model, which also provides the associated standard error of the estimate, facilitating statistical inference. Second, a comparison is made on the estimates, and statistical significance is assessed by the two tests proposed above: one is a test of PED against 1, the other a test of the change of the absolute values of PED across the two non-overlapping periods. Third, the proposed model also takes variable selection and statistical model diagnostics into account, ensuring rigor in the process. Finally, the results obtained from the proposed model make real business sense, leading to a better understanding of how the pandemic affects consumers' price sensitivity to life-essential products.
4 Findings and Discussion
The 9 estimated PEDs and their corresponding S.E.s for the COVID period are reported in Table 4, with the same metrics for the pre-COVID period included for comparison and reference. The fair prices and the modeling details
are presented in Tables 5, 6 and 7 in the Appendix. For reproducibility, for each category, dividing the original estimate of the price index coefficient reported in Tables 6 and 7 by the corresponding fair price in Table 5 should reproduce the estimate of price elasticity in Table 4. In addition, all model fits are acceptable, with R² above 0.50, as shown in Tables 6 and 7. Figure 1 shows the estimates of the price elasticities and their associated 90% confidence intervals (CIs). Note that although CIs correspond to two-sided tests, the conclusions of the two-sided and one-sided tests for the 9 categories were consistent when testing Hypotheses 1 and 2, which we elaborate on below.

Focusing on the COVID period, from the last column of Table 4, we observe that the price elasticities of 5 out of the 6 food product categories are significantly smaller than 1 in magnitude, supporting Hypothesis 1 and indicating that Dutch consumers were indeed price-insensitive to food necessities during the COVID period. This may be due to changes in shopping behavior, such as fewer in-store visits, as revealed in the exploratory data analysis in Sect. 3.1. The only price-elastic category is cheese, which may be due to the relatively long shelf lives of cheese products [40]: since cheese is not fast-perishable, consumers can buy more when prices drop, which explains its elastic price behavior. That said, the fact that 5 out of 6 food necessities are inelastic in the COVID period is strong evidence in support of Hypothesis 1.

Hypothesis 2 is also firmly supported, because the magnitudes of the price elasticities of both toothpaste and shampoo & conditioner are significantly higher than 1. Hand sanitizer is indeed inelastic, but among the 3 non-food necessities, the sales of toothpaste and shampoo & conditioner accounted for 90% of the total. Therefore, it is safe to conclude that the majority of the non-food necessities considered in this paper are elastic in the COVID period, i.e., consumers tend to purchase more when the price is low. This observation also confirms our intrinsic speculation that although the pandemic had a non-trivial impact on consumers' shopping behavior, it did not dramatically change their behavior towards a product category: even during the pandemic, consumers were price elastic to toothpaste and shampoo & conditioner, as they were before the pandemic.

In addition, comparing across the two periods, we observe that all price elasticities in the COVID period are smaller in magnitude than those in the pre-COVID period, except for hand sanitizer. This strongly supports Hypothesis 3. Hand sanitizer is a very interesting product category; it is the only category with a higher magnitude of PED in the COVID period. One possible reason is the increase in consumption of hand sanitizers: in the pre-COVID period, only consumers who paid extreme attention to health and hygiene were buying hand sanitizer, and there was no appetite to stock it at lower prices. In the COVID period, however, the general public realized the importance of hygiene, consumption of hand sanitizers increased, and they
tended to stock them when prices were low, leading to a higher price elasticity. This is corroborated by the dramatic increase in sales: the sales amount of hand sanitizer was €3,927,350 in the pre-COVID period, but increased by 66% to €6,551,077 in the COVID period. It is also observed that 4 out of the 9 categories had a significantly different PED when comparing across the intervals: (1) fresh vegetable with p-value 0.001; (2) milk with p-value 0.009; (3) salad with p-value 0.000; (4) hand sanitizer with p-value 0.000. Among them, fresh vegetable, milk, and salad have a significantly lower PED magnitude, while hand sanitizer has a significantly higher PED magnitude compared to the pre-COVID period. For the other 5 categories, the data collected do not support a statistically significant change in PEDs across the two time periods. Because most price optimization systems deployed in retail optimize prices regardless of statistical significance, the smaller magnitudes of PEDs in the COVID period will likely lead to higher prices among the inelastic categories and fewer discounts among the elastic categories.

The above findings also lead to important theoretical and managerial implications. Empirically confirming Hypothesis 1 validates the observation made by Wang et al. [58] that consumers were willing to pay an extra premium on perishable food reserves in the initial weeks when COVID cases were first reported. This behavior could be due to perceived scarcity of products or fear of the unknown [65] under a health crisis; the exact cause of such behavior would be the topic of another study. In addition, the empirical analysis also supports Hypothesis 2, which indicates that the pandemic does not fundamentally change the price-demand dynamics of non-food necessities, such as toothpaste and shampoo & conditioner.

Table 4. The magnitude of the estimated price elasticity and its associated standard error. Each magnitude was tested against 1 with a one-sided test: * significant at 0.1; ** significant at 0.05; *** significant at 0.01.

| Category | Pre-COVID | COVID |
|---|---|---|
| Bread | 0.286 (0.151)*** | 0.267 (0.282)*** |
| Ham and Sausage | 0.319 (0.323)** | 0.146 (0.328)*** |
| Fresh vegetable | 0.840 (0.344) | 0.078 (0.048)*** |
| Milk | 1.016 (0.213) | 0.385 (0.301)** |
| Salad | 2.293 (0.552)*** | 0.471 (0.153)*** |
| Cheese | 1.296 (0.304) | 1.273 (0.133)** |
| Hand sanitizer | 0.003 (0.010)*** | 0.432 (0.118)*** |
| Shampoo and Conditioner | 2.108 (0.091)*** | 1.999 (0.120)*** |
| Toothpaste | 2.003 (0.186)*** | 1.638 (0.270)** |
Fig. 1. Price elasticity of demand of the chosen 9 product categories. Each short vertical line has 3 numbers: the point estimate of price elasticity and the lower and upper bounds of the 90% confidence interval. The dashed horizontal line is 1.

This is not only of theoretical importance but also provides much-needed managerial insights to retail practitioners for designing sales and promotion strategies during the pandemic or similar crises in the future. Furthermore, the empirical support of Hypothesis 3 is consistent with the observation of Delasay et al. [18] that consumers showed a pattern of less frequent in-store shopping during the pandemic. Consequently, it makes sense that whenever consumers visited a grocery store, a majority of them were less picky about price when compared to the pre-COVID period.
5 Concluding Remarks and Future Work
The purpose of this study is to explore, via the proposed PED modeling system, how the COVID-19 pandemic impacted consumers' price sensitivity to life-essential products, which were categorized as food necessities and non-food necessities. Consumers' price sensitivity was quantified using the price elasticity of demand, and 18 separate models were fitted to provide estimates of PEDs for each combination of (1) pre-COVID period or COVID period; and
(2) each of the 9 product categories. We used data from a Dutch retailer to test three hypotheses. It was empirically found that the majority of essential food products were price inelastic, while the majority of non-food products remained price elastic in the COVID period, as before. Among the 9 product categories, 4 were identified to have different elasticities with statistical significance across the two time periods, while 8 categories were observed to have smaller magnitudes in elasticities. As a result of this research, we not only prove the usefulness of the proposed PED modeling system, but also identify a set of theoretical and managerial contributions, as reported below.

First, we treated the 9 product categories independently to allow inference of consumers' reaction to price changes in each category before and during the COVID-19 pandemic. The separation allowed for possibly distinct structural differences across the categories. Although, except for hand sanitizer, the magnitudes of the price elasticities of all product categories became smaller in the COVID period, consumers' price sensitivity towards the product categories remained largely unchanged, with 5 food necessities staying price inelastic, while the non-food necessities and cheese remained price elastic.

Second, in terms of data, unlike extant research that utilizes data collected from questionnaires to identify, compare, and analyze consumers' perception of COVID-19 and its impact on retail, we collected sales data directly from a Dutch retailer, which allowed the estimation of price elasticity at the product category level. This provides a novel perspective from the retailer's side, and the results of the data analysis can be directly applied by the retailer to make important managerial decisions.

Third, in terms of handling the impact of the pandemic, we deliberately selected two non-overlapping time intervals, (1) the pre-COVID period and (2) the COVID period, to explore the impact of the pandemic on the price elasticity of each product category. Since the effect of COVID on consumers' panic buying was not negligible, as reviewed in Sect. 2.3, the COVID panic buying period was excluded from the analysis to eliminate its influence on consumers' grocery shopping behavior during the COVID period. Although the COVID period was the research focus, the pre-COVID period was also included to give a more comprehensive view of the situation.

Fourth, the research also provides important managerial insights. The Dutch retailer needs to adjust its sales and promotion strategies accordingly for all the analyzed categories that showed practically smaller magnitudes of price elasticity. For example, a completely new pricing strategy for the salad category is warranted, since it became significantly more inelastic in the COVID period. At a minimum, it is strongly recommended that any retailer closely monitor the possible
Were Consumers Less Price Sensitive During the COVID-19 Pandemic?
95
change of price elasticity of demand for important product categories, and act accordingly when a significant trend is identified through analysis. Last but not the least, although it was observed that the magnitudes of PEDs became practically smaller during the pandemic, the current data support statistical significance on the difference in PEDs between the two periods in just 4 of the 9 categories. More data and information are needed on the remaining 5 categories to gain more insights and reach more decisive conclusions. While the scope of the current analysis is limited to the Dutch retail data, the methodology employed can be used on more product categories of other retailers in other parts of the world as data become available. As the COVID-19 pandemic is far from over, it remains to be seen if the findings and insights from this study constitute a “new normal” consumer behavior towards prices of groceries or a temporary reaction during the pandemic. This will be an interesting future direction of research.
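For readers who wish to see how such category-level PED estimates could be produced, the following is a minimal sketch of a log-log elasticity regression. The specification is an assumption for illustration only: the fitted models reported in Tables 6 and 7 include an intercept, a price index, seasonality and holiday dummies, but the exact functional form is defined in the paper's methodology section, and the column names and statsmodels usage below are hypothetical.

import numpy as np
import statsmodels.formula.api as smf

def fit_ped(df):
    """Fit one category-period model; df holds weekly observations with
    illustrative columns: sales, price_index, seasonality, easter, christmas."""
    model = smf.ols(
        "np.log(sales) ~ np.log(price_index) + seasonality + easter + christmas",
        data=df,
    ).fit()
    # The coefficient on log price is read as the price elasticity of demand:
    # |PED| < 1 means price inelastic, |PED| > 1 means price elastic.
    return model.params["np.log(price_index)"], model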
Appendix
Table 5. The category-specific, period-specific fair prices

Category                   pre-COVID   COVID
Bread                      1.002       1.036
Ham and Sausage            1.026       1.087
Fresh vegetable            0.918       0.871
Milk                       1.008       1.031
Salad                      0.997       0.988
Cheese                     1.026       1.027
Hand sanitizer             1.038       0.949
Shampoo and Conditioner    0.915       0.952
Toothpaste                 0.982       0.988
Table 6. The 9 fitted models for the pre-COVID period. Control variables that were included in at least one model are reported; the model R² is reported in the last row. Each parameter was tested against 0: * significant at 0.1; ** significant at 0.05; *** significant at 0.01. S & C stands for Shampoo & Conditioner.

Table 7. The 9 fitted models for the COVID period. Control variables that were included in at least one model are reported; the model R² is reported in the last row. Each parameter was tested against 0: * significant at 0.1; ** significant at 0.05; *** significant at 0.01. S & C stands for Shampoo & Conditioner.

The covariates reported across Tables 6 and 7 comprise the intercept, the price index, seasonality, the number of stores, and holiday dummies (New Year's Day, Easter, Easter Monday, Pentecost, Ascension, Mother's Day and Christmas).
Eigen Value Decomposition Utilizing Method for Data Hiding Based on Wavelet Multi-resolution Analysis Kohei Arai(B) Saga University, Saga 840-8502, Japan [email protected]
Abstract. An eigen value decomposition utilizing method for data hiding based on wavelet Multi-Resolution Analysis (MRA) is proposed. It is possible to improve the invisibility of the hidden information by using eigen value decomposition as a preprocessing of the conventional wavelet MRA-based method. In the proposed method, the information of the key image is protected by the existence of the eigenvector; that is, the key image information can be restored only if the true original image information is known. The coefficient of eigen value decomposition differs for each original image and is composed of the eigenvectors of the original image. For a 3-band color image, a method involving Hue Saturation Intensity (HSI) conversion or the like can be considered for the purpose of protecting the information of the key image, but since the conversion coefficients of HSI conversion are well known, there is a possibility that a third party may obtain the information of the key image. The effectiveness of the proposed method is confirmed from the viewpoint of protecting the information of the key image.
Keywords: Eigen value decomposition · Multi-resolution analysis (MRA) · Hue Saturation Intensity (HSI)
1 Introduction
Data hiding is a general term for watermarking and steganography technologies; it embeds a key image that does not appear in the original image. Data hiding methods [1–3] can be roughly classified as follows.
• Methods that embed a key image in the real space of the original image [4]
• Methods that embed a key image in the frequency space of the original image [5, 6]
Compared to the former, the latter has the advantage that a key image can be embedded in a specific frequency band that is relatively unaffected even if the image after embedding is subjected to processing such as compression, and it therefore has a greater ability to hide information. The former requires a device to embed the key image by manipulating the edge parts of the original image [4]. The latter requires determination of the frequency band of the original image in which the key image should be embedded [5, 6]. In addition, data hiding methods using RGB color original images have also been proposed [7, 8].
The purpose of this study is to improve the invisibility of the hidden key image in data hiding images. From the viewpoint of the amount of information in the original image, data hiding using a color original image has a greater ability to hide the information of the key image than data hiding using a non-color original image. Data hiding using a color original image embeds the key image in a certain component of the original image; for example, a method of embedding the key image in the G component of the original image is adopted [3]. In that case, the information of the R and B components of the original image is not used. Furthermore, the following points are not taken into consideration in the existing method.
• The original image is not always 3-band data.
• In the case of a method that uses only a specific component of the original image, it is hard to say that the fact that the original image is multi-band data is effectively used.
• The conversion coefficients for HSI conversion, etc., are based on the conditions that the R component is the observed brightness value in the 0.7 μm wavelength band, the G component in the 0.5 μm wavelength band, and the B component in the 0.4 μm wavelength band.
In this paper, in order to overcome the drawbacks of methods that embed the key image in one band image of a multi-band original image (including RGB color images), a data hiding method for multi-band images based on multi-resolution analysis with eigen value decomposition is proposed, together with a confirmation of its effectiveness.
In the next section, related research works are described, followed by the theoretical background. Then, the proposed method is described, followed by some experiments, together with a conclusion and some discussion.
2 Related Research Works
There are the following eigen value decomposition related papers. Decomposition of SAR polarization signatures by means of eigen-space representation was proposed [9], together with polarimetric SAR image classification with maximum curvature of the trajectory in the eigen space domain on the polarization signature [10]; this classification was subsequently conducted in [11]. Meanwhile, a method for face identification with the Facial Action Coding System (FACS) based on eigen value decomposition was proposed [12], and a prediction method for time series of imagery data in eigen space was proposed [13].
Also, there are the following data hiding related papers. A data hiding method which is robust to run-length data compression, based on lifting dyadic wavelet transformation, was proposed [14]. An improvement of secret image invisibility in circulation images with dyadic wavelet-based data hiding with run-length coding was also proposed [15], as well as a data hiding method based on dyadic wavelet transformation that improves the invisibility of secret images on circulation images by means of run-length coding with a permutation matrix based on random numbers [16]. A method for data hiding based on the LeGall 5/2 (Cohen-Daubechies-Feauveau: CDF 5/3) wavelet with data compression and random scanning of secret imagery data was proposed [17]. On the other hand, a data hiding method replacing the LSB of the hidden portion of the secret image with a run-length coded image was proposed [18]. A method for data hiding using steganography with discrete wavelet transformation and cryptography with the Triple Data Encryption Standard (DES) was proposed and its effectiveness demonstrated [19]. A data hiding method with eigen value analysis (PCA) and image coordinate conversion was proposed as well [20].
3 Theoretical Background and Proposed Method
3.1 Theoretical Background
Watermark technology has the following requirements.
• Information that appears outside is important.
• A small amount of landmark information (a signature) is embedded.
• The embedded information (mark information) must not be destroyed.
On the other hand, steganography technology has the following tacit understanding.
• Information that does not appear outside is important.
• It is desirable that the capacity that can be embedded is large.
• In some cases, the embedded information may be corrupted.
The following are requirements common to watermarking and steganography technologies.
• Do not let a third party be aware that something was hidden.
• The parties can restore what was hidden.
• It is difficult for a third party to restore what was hidden.
If the amount of information in the key image increases, it becomes easier for a third party to become aware of what was hidden. In order not to make a third party aware, it is necessary to suppress the deterioration due to hiding.
3.2 Proposed Method
When wavelet decomposition is performed on a two-dimensional signal, four components (one low-frequency component, the LL1 component, and three high-frequency components, the LH1, HL1 and HH1 components) are generated. In addition, when wavelet decomposition is performed on the LL1 component, four further components (the LL2, LH2, HL2 and HH2 components) are generated. Figure 1 shows a conceptual diagram of wavelet multi-resolution analysis and an example of wavelet decomposition performed up to three levels. If a biorthogonal wavelet is adopted and the four components after wavelet decomposition are available, the given two-dimensional signal is restored with zero error. Orthogonal wavelets are a type of biorthogonal wavelet. That is, the biorthogonal wavelet transform can be inversely transformed.
Fig. 1. Laplacian pyramid representation of MRA
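As a minimal illustration of this perfect-reconstruction property, the following sketch assumes the PyWavelets package; the image array, the 'bior2.2' wavelet and the periodization boundary mode are illustrative choices, not taken from the paper.

import numpy as np
import pywt

img = np.random.rand(128, 128)          # placeholder for an original band image

# Level-1 decomposition: LL1 plus the three high-frequency components
LL1, (LH1, HL1, HH1) = pywt.dwt2(img, 'bior2.2', mode='periodization')

# Decomposing LL1 again yields the level-2 components LL2, LH2, HL2, HH2
LL2, (LH2, HL2, HH2) = pywt.dwt2(LL1, 'bior2.2', mode='periodization')

# The inverse transform restores the signal with (numerically) zero error
rec = pywt.idwt2((LL1, (LH1, HL1, HH1)), 'bior2.2', mode='periodization')
assert np.allclose(rec, img)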
The outline of the data hiding method based on multi-resolution analysis is as follows. Data hiding is performed according to the procedure shown below.
1. Wavelet decomposition is performed on one of the band images of the multi-band original image (Fig. 2).
2. The key image is inserted into a high-frequency component after wavelet decomposition (Fig. 3 (a)).
3. A data hiding image is generated by wavelet reconstruction (Fig. 3 (b)).
Note that Fig. 3 shows an example in which the key image is inserted into the LH1 component. It is also possible to insert the key image into the HL1 component, HH1 component, HH2 component, etc.; that is, components other than the HH component can also carry the key image. The fact that the component into which the key image is inserted can be changed means that data hiding based on multi-resolution analysis has the ability to protect the information in the key image.
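A sketch of steps 1–3, again assuming the PyWavelets package; the subband choice follows the Fig. 3 example, and the names `band` and `key` are placeholders, not the author's implementation.

import numpy as np
import pywt

def embed_key(band, key, wavelet='bior2.2'):
    """Hide `key` in the LH1 component of `band` (steps 1-3 above)."""
    LL1, (LH1, HL1, HH1) = pywt.dwt2(band, wavelet, mode='periodization')
    assert key.shape == LH1.shape          # e.g. a 64x64 key for a 128x128 band
    # Step 2: substitute the key image for the chosen high-frequency component
    return pywt.idwt2((LL1, (key, HL1, HH1)), wavelet, mode='periodization')

def extract_key(stego, wavelet='bior2.2'):
    """Decompose the data hiding image again to read the key back out."""
    _, (LH1, _, _) = pywt.dwt2(stego, wavelet, mode='periodization')
    return LH1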
Fig. 2. Resultant images of MRA: (a) Level 1; (b) Level 2
Fig. 3. Example of MRA-based data hiding: (a) key image hidden in the LH1 component of the MRA; (b) the reconstructed image
The problem lies in step 1 of the data hiding procedure, "on one of the band images of the multi-band original image": the existing method uses only a specific component of the multi-band original image to hide the key image information. In the proposed method, eigen value conversion is used as a preprocessing to realize energy concentration of the multi-band original image, and the key image information is hidden in the first eigenvalue image. Eigenvalue decomposition is a type of orthogonal transformation and can be inversely transformed. The proposed method can also be applied when the original image is not a 3-band original image; in contrast, HSI conversion is not applicable in that case. In other words, the proposed method performs eigenvalue decomposition on the multi-band original image and hides the key image information in the first eigenvalue image for the purpose of suppressing image quality deterioration due to hiding. The reason for hiding the key image information in the first eigenvalue image is that the first eigenvalue image is the image in which the energy of the multi-band original image is most concentrated.
Next, the decryption method of the key image information is described. The existing method is realized by performing wavelet decomposition only on the specific component of the multi-band image in which the key image information is hidden. The proposed method uses the coefficients obtained by performing eigenvalue decomposition on the multi-band original image before the key image information is hidden, constructs the first eigenvalue image of the multi-band image in which the key image information is hidden, and performs wavelet decomposition on that first eigenvalue image. Decoding of the key image information by the proposed method can be performed only when the coefficients of the eigenvalue decomposition of the multi-band original image before hiding are known. That is, the coefficients of the eigenvalue decomposition differ depending on the multi-band original image before hiding the key image information. Coefficients such as those of HSI conversion are well known; if the conversion coefficients are well known, a third party may obtain the key image information. Since the existing method hides the key image information only in a specific component of the multi-band original image, there is a possibility that a third party can obtain the key image information by performing wavelet decomposition on that specific component, or, more generally, on each band image.
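To make the preprocessing concrete, a minimal sketch follows, under the assumption that the eigenvalue decomposition is computed from the inter-band covariance matrix; all names are illustrative and this is not the author's exact implementation.

import numpy as np

def eigen_transform(bands):
    """bands: (n_bands, H, W) -> eigenvalue images, eigenvalues, eigenvectors, mean."""
    n, H, W = bands.shape
    X = bands.reshape(n, -1).astype(float)        # one row per band
    mean = X.mean(axis=1, keepdims=True)
    lam, E = np.linalg.eigh(np.cov(X))            # inter-band covariance, ascending
    lam, E = lam[::-1], E[:, ::-1]                # sort eigenvalues descending
    Y = E.T @ (X - mean)                          # eigenvalue (principal) images
    return Y.reshape(n, H, W), lam, E, mean

def inverse_eigen_transform(Y, E, mean):
    n = Y.shape[0]
    X = E @ Y.reshape(n, -1) + mean
    return X.reshape(Y.shape)

The key image would be hidden in Y[0], the first eigenvalue image, using the MRA embedding of Sect. 3.2, and the stego multi-band image recovered with inverse_eigen_transform; decoding then requires the eigenvector matrix E of the original image, which is precisely the protection property described above.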
4 Experiment
In this paper, an experiment with a 3-band image is conducted from the viewpoint of the color display of a multi-band image. By setting up an experiment for a 3-band image, it can be treated as an experiment for an RGB color image. As the existing method, a method for hiding a key image in the G component image of an RGB color image is demonstrated. Furthermore, as an example, the wavelet function adopts the Daubechies basis, which is an orthogonal basis. The purpose of this experiment is to demonstrate the superiority of the proposed method, which suppresses image quality deterioration due to hiding, over the existing method.
4.1 Data Used
As the data used, the images shown in Fig. 4 are used as the original images, and the one shown in Fig. 5 is used as the key image. Figure 4 shows 3-band images with an image size of 128 × 128 and 256 gradations × 3 per pixel. Figure 5 is a single-band image with an image size of 64 × 64 and 256 gradations per pixel. The data shown in Fig. 5 is hidden in the data of Fig. 4. Figure 4 (a) is a 3-band image in which the brightness values of the same pixels are the same; as a result, Fig. 4 (a) is a monochrome image. Figure 4 (b) shows actual observation data from the sensor TM mounted on the Landsat satellite. It is a 3-band image generated by assigning band 3 data (red region: 0.63–0.69 μm) to the R component, band 4 data (near-infrared region: 0.76–0.90 μm) to the G component, and band 2 data (green region: 0.52–0.60 μm) to the B component.
Fig. 4. Original images (128 by 128 pixels): (a) Lena (SIDBA); (b) Landsat TM image
Fig. 5. Key image (64 by 64 pixels of the image to be hidden: secret image)
Figure 6 shows the luminance values of each RGB component of Fig. 4 (a), and Fig. 8 shows the luminance values of each RGB component of Fig. 4 (b). Figure 7 shows a scatter plot of the RGB space of Fig. 4 (a), and Fig. 9 shows a scatter plot of the RGB space of Fig. 4 (b). Figure 4 (a) is used for the simulation experiments, and Fig. 4 (b) is used for the application to actual observation multi-band images. Figure 10 shows an example of the result of performing eigenvalue decomposition on Fig. 4 (a). Figure 10 (a) is the first eigenvalue image, Fig. 10 (b) is the second eigenvalue image, and Fig. 10 (c) is the third eigenvalue image. Because Fig. 4 (a) is a monochrome image, Fig. 10 (a) and Fig. 6 (a) are the same “Lena” images. The second and third eigenvalue contribution rates in Fig. 10 are zero. From Fig. 10, it can be seen that the first eigenvalue image is the image in which the energy of the multi-band original image is most concentrated. Here, consider colorizing Fig. 4 (a). Colorization is performed by independently adding normal random numbers with mean zero and standard deviation s to each RGB component image in Fig. 6. Then, the contribution rate α (0 < α ≤ 1) of the first eigenvalue is introduced for the purpose of expressing the degree of inter-band correlation of the original image:

α = λ_1 / (λ_1 + λ_2 + λ_3),  λ_1 ≥ λ_2 ≥ λ_3    (1)
Fig. 6. RGB components of the original image (Lena): (a) red; (b) green; (c) blue
Fig. 7. Scatter diagram of the RGB components of the original image
Fig. 8. RGB components of the original Landsat TM image: (a) red; (b) green; (c) blue
λ_k is the k-th eigenvalue. If the original image is a monochrome image, α = 1. Figure 11 shows scatter plots of the RGB space of the colorized image of Fig. 4 (a). By changing the standard deviation s of the normal random numbers, the contribution rate α changes. When the contribution rate α is calculated from the scatter in Fig. 11, α = 0.952 in Fig. 11 (a), α = 0.833 in Fig. 11 (b), α = 0.692 in Fig. 11 (c), and α = 0.564 in Fig. 11 (d). Note that α = 1.000 in Fig. 4 (a) and α = 0.678 in Fig. 4 (b). It can be seen that the value of the contribution rate α is smaller in a color image than in a monochrome image.
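The contribution rate of Eq. (1) follows directly from the eigenvalues; a two-line sketch, reusing the hypothetical eigen_transform above:

_, lam, _, _ = eigen_transform(bands)  # eigenvalues sorted in descending order
alpha = lam[0] / lam.sum()             # contribution rate of Eq. (1)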
Fig. 9. Scatter diagram of the RGB components of the original Landsat TM image
Fig. 10. First to third eigenvalue component images of the original Lena image: (a) first eigenvalue; (b) second eigenvalue; (c) third eigenvalue
4.2 Evaluation Function
Equation (2) is used as the evaluation function for the purpose of comparing the proposed method with the existing method:

J_1 = (J_2 of the conventional method) / (J_2 of the proposed method)    (2)

where J_2 is expressed with Eq. (3):

J_2 = (1/N) Σ_{i,j} (ε_R(i,j) + ε_G(i,j) + ε_B(i,j))    (3)

where

ε_R(i,j) = I_R(i,j) (g′_R(i,j) − g_R(i,j))²    (4)
ε_G(i,j) = I_G(i,j) (g′_G(i,j) − g_G(i,j))²    (5)
ε_B(i,j) = I_B(i,j) (g′_B(i,j) − g_B(i,j))²    (6)
Fig. 11. Scatter diagrams of the RGB components of the original image with contribution factors (a) α = 0.952; (b) α = 0.833; (c) α = 0.692; (d) α = 0.564
Note that g′_R(i,j), g′_G(i,j) and g′_B(i,j) represent the brightness values of the R, G and B components at the pixel position (i,j) after hiding, while g_R(i,j), g_G(i,j) and g_B(i,j) are the brightness values of the R, G and B components at the pixel position (i,j) of the original image. Moreover, I_K(i,j) and N are given by Eqs. (7) and (8):

I_K(i,j) = 1 if g′_K(i,j) ≠ g_K(i,j), and 0 if g′_K(i,j) = g_K(i,j)    (7)

N = Σ_{i,j} (I_R(i,j) + I_G(i,j) + I_B(i,j))    (8)

where K = R, G, B. In other words, the effectiveness of the proposed method is evaluated by computing the error only over the pixels with non-zero error and taking the ratio of the existing method's error to that of the proposed method; if the ratio exceeds 1.0, the effectiveness of the proposed method is confirmed. The reason why pixels with zero error are not included in the calculation is to account for the fluctuation of the value depending on the number of bands in the multi-band image.
4.3 Evaluation Procedure
It is assumed that the average of the key image is zero. Hiding is performed by substituting the key image only for the HH1 component of the original image. That is, in the proposed method the key image is substituted for the HH1 component of the first eigenvalue image of the original image, and in the existing method it is substituted for the HH1 component of the G component image of the original image. If the HH1 component image is F(i,j) and the key image is S(i,j) with respect to the pixel position (i,j), the raster scan method is used; with raster scanning, hiding is performed by F(i,j) ← S(i,j). It is also possible to set F(i1,j1) ← S(i2,j2) by adopting the Hilbert scan method. The purpose of this experiment is to demonstrate the effectiveness of the proposed method in suppressing image quality deterioration due to hiding. Therefore, hiding is performed by matching the HH1 component image size and the key image size and substituting only the HH1 component. By limiting the hiding to one component, a comparison with the existing method is possible. By matching the HH1 component image size and the key image size, an experiment is set up for the case where the amount of information in the key image is large. When the signature is a key image, the amount of information in the key image is generally small; the existence of multiple key images is also included in the content of this experiment. That is, Fig. 5 can be regarded as data in which a plurality of key images is combined. Furthermore, in the experiment with the 3-band original image whose same-pixel brightness values are equal, the first eigenvalue component image and the G component image of the original image both have the structure of the “Lena” image, so a comparison with the existing method is possible. Since the biorthogonal wavelet transform and the eigen value transform can be inversely transformed, both the proposed method and the existing method can restore the key image.
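A sketch of Eqs. (2)–(8) in code; the array names are illustrative, and at least one changed pixel is assumed.

import numpy as np

def masked_error(g_orig, g_hid):
    """J2 of Eqs. (3)-(8): mean squared error over changed pixels only.
    g_orig, g_hid: arrays of shape (3, H, W) for the R, G, B components."""
    changed = (g_hid != g_orig)                  # indicator I_K(i,j), Eq. (7)
    N = changed.sum()                            # Eq. (8)
    eps = changed * (g_hid - g_orig) ** 2        # Eqs. (4)-(6)
    return eps.sum() / N

# J1 of Eq. (2): ratio of conventional to proposed error; J1 > 1 favors the proposal
# J1 = masked_error(orig, hid_conventional) / masked_error(orig, hid_proposed)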
4.4 Preliminary Experimental Result
Firstly, a demonstration of the validity of hiding in the first eigenvalue image is made, using the monochrome image of Fig. 4 (a). Figure 12 shows the result images obtained by hiding in the k-th eigenvalue image of the original image (k = 1, 2, 3). From Fig. 12, the result images for k = 2 and k = 3 show a larger degree of coloring (color distortion) than the result image for k = 1, so it is appropriate to embed the key image in the first eigenvalue image of the original image. The reason why the degree of coloring of the result image for k = 1 is smaller than that for k = 2 and k = 3 is that the first eigenvalue image is the image in which the energy of the multi-band original image is most concentrated.
Fig. 12. The resultant images for (a) k = 1, (b) k = 2, (c) k = 3, and (d) the original image
4.5 Simulation Result
Figure 13 shows the images of the proposed method and the existing method after data hiding in the case of α = 1. Figure 13 (a) shows the result of the proposed method, Fig. 13 (b) shows the result of the existing method, and Fig. 13 (c) shows the original image (Fig. 4 (a)). Note that J1 = 1.718. Figure 14 shows the value of J1 when the contribution rate α is changed; the horizontal axis represents the contribution rate α, and the vertical axis represents J1.
Fig. 13. Comparison of the resultant images of (a) the proposed and (b) the conventional methods, with (c) the original image
4.6 Experimental Result with Remote Sensing Satellite Image
Figure 15 shows the images of the proposed method and the existing method after data hiding for Fig. 4 (b). Figure 15 (a) shows the result of the proposed method, Fig. 15 (b) shows the result of the existing method, and Fig. 15 (c) shows the original image (Fig. 4 (b)). J1 is 1.732. Furthermore, as in the simulation image experiment, experimental results are obtained by independently adding normal random numbers with mean zero and standard deviation s to each component image; these are shown in Fig. 16. Figure 16 indicates the value of J1 when the contribution rate α is changed; the horizontal axis represents the contribution rate α, and the vertical axis represents J1. From Fig. 13 (b), after the data hiding of the existing method, the image quality of the green component deteriorates. The cause of this deterioration is that the key image is hidden in the G component of the original image. From Fig. 13 (a), the image after the data hiding of the proposed method does not show deterioration of the green component only; rather, the information of the key image is whitened over the components while the data hiding is performed.
Fig. 14. Evaluation function J1 against the contribution factor α
Fig. 15. Comparison of the reconstructed images of (a) the proposed and (b) the conventional methods, with (c) the original image
Fig. 16. Evaluation function J1 against the contribution factor α
Therefore, from the viewpoint of hiding the information of the key image, the proposed method is superior to the existing method in data hiding using multi-band images. In addition, in the existing method, if the key image information were hidden in all three components of the original image so that it is whitened, the deterioration error of the image after hiding would become large; in other words, hiding the key image in the three components is equivalent to adding a large amount of noise to the original image. As an example, Fig. 17 shows a comparison, for Fig. 4 (a), between hiding the key image information in the G component only and hiding it in all three components. Figure 17 (a) is the result of hiding in the G component only, Fig. 17 (b) is the result of hiding in all three components, and Fig. 17 (c) shows the original image. Compared to Fig. 17 (a), Fig. 17 (b) whitens the key image information, but the error of Fig. 17 (a) is 39.978, whereas the error of Fig. 17 (b) is 69.246. From Fig. 14, it can be seen that J1 exceeds 1.0 for all contribution rates α. From Fig. 15, it can be seen that the image quality deterioration of the existing method is larger than that of the proposed method. From Fig. 16, it can likewise be seen that J1 exceeds 1.0 for all contribution rates α. Furthermore, for the purpose of showing the effectiveness of principal component conversion as a preprocessing when the original image is limited to 3 bands, Fig. 18 shows an example comparison, using Fig. 4 (a), between the method of hiding the key image in the first principal component after the principal component conversion (the proposed method) and the method of hiding the key image in an HSI-converted component; if the original image has 3 bands, HSI conversion and the like can be applied. As a result, J1 = 1.708, confirming the effectiveness of the proposed method.
Fig. 17. The resultant images of the conventional method: (a) key image hidden in the G component; (b) key image hidden in all the components; (c) original image
Fig. 18. Comparison between the reconstructed images of (a) the proposed method and (b) the proposed method with HSI conversion as a preprocessing
5 Conclusion
It was confirmed that J1 exceeded 1.0 both in the experiment using simulation data and in the experiment using actual observation satellite data; that is, the effectiveness of the proposed method was confirmed in both experiments. In this paper, the existing method was exemplified by hiding the key image in the G component of the original image, but comparable results are obtained when the R or B component is used instead. In addition, it was confirmed by experiments using simulation data that J1 increases as the contribution rate α decreases; in other words, the smaller the inter-band correlation, the greater the superiority of the proposed method over the existing method. Furthermore, in the proposed method, the information of the key image is protected by the existence of the eigenvector. That is, the key image information can be restored only if the true original image information is known. The coefficients of the principal component conversion differ for each original image and are composed of the eigenvectors of the original image. For a 3-band color image, a method involving HSI conversion or the like can be considered for the purpose of protecting the information of the key image, but such conversion coefficients are well known. In this paper, the Daubechies basis (a biorthogonal wavelet) was adopted as the wavelet, with which the key image information can be restored; the key image information can also be protected by concealing which biorthogonal wavelet is adopted. Furthermore, in this paper, the experiments were conducted by inserting the key image only into the HH1 component, but it is also possible to divide the bit data of the key image and insert the key image information into the other high-frequency components. When the image quality deterioration due to hiding is expressed by the error from the original image, it can be shown that the error of the proposed method is about 40% (= {1 − (1/1.7)} × 100%) smaller. As a development of the proposed method, in this paper we conducted an experiment using 3 bands out of a 3-band original image, but it is also possible to perform hiding using m bands of an n-band original image. In other words, since there are nCm hiding combinations, the proposed method is superior to the conventional method in its ability to protect the information in the key image; the conventional method uses only one band of the n-band original image, and it is difficult for a third party to obtain information on how many and which bands of the original image were used for hiding.
6 Future Research Works
Since the conversion coefficients are well known in methods involving HSI conversion, a third party may obtain the information of the key image; in future work, the effectiveness of the proposed method will be further confirmed from the viewpoint of protecting the information of the key image. In this paper, hiding was performed using a still original image, but the method can also be applied to a moving original image: for example, the still image of band 1 of the sensor may be regarded as the image of time #1, and the still image of band 2 of the sensor may be regarded as the image of time #2. Since it can be applied to moving images, it can also be applied to 3D still images.
Acknowledgment. The authors would like to thank Professor Dr. Hiroshi Okumura and Professor Dr. Osamu Fukuda of Saga University for their valuable discussions.
References
1. Koshio, M.: Basics of Digital Watermarking. Morikita Publishing (2000)
2. Kawai, J.: Watermark-genuine proof/paper plow (watermark). J. Inst. Image Electron. 31(2), 253–260 (2002)
3. Kawaguchi, E., Noda, H., Niimi, M.: On digital steganography technology. J. Soc. Image Electron. 31(3), 414–420 (2002)
4. Cox, I.J., Killian, J., Leighton, T., Shamoon, T.: Secure spread spectrum watermarking for multimedia. IEEE Trans. Image Process. 6(12), 1673–1687 (1997)
5. Okamoto, A., Miyazaki, A.: Digital watermarking method using morphological signal processing. J. Inst. Electron. Inform. Commun. Eng. 84(8), 1037–1044 (2001)
6. Inoue, N., Miyazaki, A., Shimazu, M., Katsura, T.: Digital watermarking method for image signals using wavelet transform. J. Video Inform. Media 52(12), 1832–1839 (1998)
7. Ouellette, R., Noda, H., Niimi, M., Kawaguchi, E.: Topological ordered color table for BPCS steganography using indexed color images. IPSJ J. 42(1), 110–113 (2001)
8. Hidetoshi, S., Yuji, M., Suzu, K., Yoshinao, A.: Color image digital watermarking method using Fresnel transformation in image compression. IEICE Technical Report, pp. 43–48 (2000)
9. Arai, K.: Decomposition of SAR polarization signatures by means of eigen-space representation. In: Proceedings of the Synthetic Aperture Radar Workshop '98 (1998)
10. Arai, K., Wang, J.: Polarimetric SAR image classification with maximum curvature of the trajectory in eigen space domain on the polarization signature. In: Abstracts of the 35th Congress of the Committee on Space Research of the ICSU, A3.1-0061-04 (2004)
11. Arai, K., Wang, J.: Polarimetric SAR image classification with maximum curvature of the trajectory in eigen space domain on the polarization signature. Adv. Space Res. 39(1), 149–154 (2007)
12. Arai, K.: Method for face identification with facial action coding system: FACS based on eigen value decomposition. Int. J. Adv. Res. Artif. Intell. 1(9), 34–38 (2012)
13. Arai, K.: Prediction method for time series of imagery data in eigen space. Int. J. Adv. Res. Artif. Intell. 2(1), 12–19 (2013)
14. Arai, K., Yamada, Y.: Data hiding method which is robust to run-length data compression based on lifting dyadic wavelet transformation. In: Proceedings of the 11th Asian Symposium on Visualization, ASV-11-08-11, pp. 1–8 (2011)
15. Arai, K., Yamada, Y.: Improvement of secret image invisibility in circulation image with dyadic wavelet-based data hiding with run-length coding. Int. J. Adv. Comput. Sci. Appl. 2(7), 33–40 (2011)
16. Arai, K.: Data hiding method based on dyadic wavelet transformation improving invisibility of secret images on circulation images by means of run length coding with permutation matrix based on random number. In: Proceedings of the Information Technology: New Generations (ITNG) Conference 2012, p. 189 (2012)
17. Arai, K.: Method for data hiding based on Legall 5/2 (Cohen-Daubechies-Feauveau: CDF 5/3) wavelet with data compression and random scanning of secret imagery data. Int. J. Wavelets Multiresolut. Inf. Process. 11(4), 1–18 (2013)
18. Arai, K.: Data hiding method replacing LSB of hidden portion for secret image with run-length coded image. Int. J. Adv. Res. Artif. Intell. 5(12), 8–16 (2016)
19. Rahmad, C., Arai, K., Prasetyo, A., Arigki, N.: Noble method for data hiding using steganography discrete wavelet transformation and cryptography triple data encryption standard: DES. Int. J. Adv. Comput. Sci. Appl. 9(11), 261–266 (2018)
20. Arai, K.: Data hiding method with eigen value analysis and image coordinate conversion. Int. J. Adv. Comput. Sci. Appl. 12(8), 25–30 (2021)
Long Short-Term Memory Neural Network for Temperature Prediction in Laser Powder Bed Additive Manufacturing Ashkan Mansouri Yarahmadi(B) , Michael Breuß, and Carsten Hartmann BTU Cottbus-Senftenberg, Institute for Mathematics, Platz der Deutschen Einheit 1, 03046 Cottbus, Germany {yarahmadi,Ashkan.MansouriYarahmadi,breuss,hartmanc}@b-tu.de
Abstract. In the context of laser powder bed fusion (L-PBF), it is known that the properties of the final fabricated product highly depend on the temperature distribution and its gradient over the manufacturing plate. In this paper, we propose a novel means to predict the temperature gradient distributions during the printing process by making use of neural networks. This is realized by employing heat maps produced by an optimized printing protocol simulation, which are used for training a specifically tailored recurrent neural network in terms of a long short-term memory architecture. The aim of this is to avoid extreme and inhomogeneous temperature distributions that may occur across the plate in the course of the printing process. In order to train the neural network, we adopt a well-engineered simulation and unsupervised learning framework. To maintain a minimized average thermal gradient across the plate, a cost function is introduced as the core criterion, which is inspired and optimized by considering the well-known traveling salesman problem (TSP). As time evolves, the unsupervised printing process governed by the TSP produces a history of temperature heat maps that maintain a minimized average thermal gradient. All in all, we propose an intelligent printing tool that provides control over the substantial printing process components for L-PBF, i.e. optimal nozzle trajectory deployment as well as online temperature prediction for controlling printing quality.
Keywords: Additive manufacturing · Laser beam trajectory optimization · Powder bed fusion printing · Heat simulation · Linear-quadratic control
1 Introduction
In contrast to traditional machining, additive manufacturing (AM) builds objects layer by layer through a joining process of materials, making the fabrication of individualized components possible across different engineering fields. The laser powder bed fusion (L-PBF) technique, the AM process that we focus on in
this study, uses a deposited powder bed which is selectively fused by a computer-controlled laser beam [17]. The extreme heating by the laser on the one hand, and the influence of the degree of homogeneity of the heat distribution on the printing quality in L-PBF on the other hand, make it highly challenging to conduct the printing process in an intelligent way that may guarantee high-quality printing results. As explained in more detail when discussing related work, there has thus been a continuous effort to (i) propose beneficial printing paths that help to avoid unbalanced heating and (ii) forecast the heat distribution in order to assess the potential printing quality and terminate printing in case of foreseeable flaws.
In this paper, we propose to couple a laser beam trajectory devised on the basis of a heuristic control during the fabrication phase of L-PBF with prediction based on neural networks. The developed novel framework addresses both of the above-mentioned main issues in L-PBF and represents an intelligent printing tool that provides control over the printing process. To this end, we aim at conducting a controlled laser beam simulation that approximately achieves temperature constancy on a simulated melted powder bed. In addition, we opt to perform temperature rate-of-change prediction as an important factor for the microscopic structure of the final fabricated product.
The main novelty of the current paper is to adopt the long short-term memory (LSTM) [8] prediction framework, which is introduced in Sect. 4, to predict the temperature distribution and its gradient during printing. This consequently can be used to avoid any overheating by taking necessary actions in advance, namely stopping the printing process to avoid printer damage due to overheated, deformed parts of the printed product. Based on this, we conjecture that our developed pipeline may provide a highly valuable step for practical printing that provides quality control of the printed product, while being efficient with regard to energy consumption and use of material. Finally, in Sect. 5, we present an effective numerical test concerning the predicted temperature gradients.
In Sect. 3 of this paper, a simulation framework is presented by recalling the heat transfer model together with a cost function that consists of two terms aiming to maintain an almost constant temperature with a low spatial gradient across the powder bed area. For simplicity, we confine ourselves to a 2-dimensional domain, which is still a realistic description of printing over the manufacturing plate. In Subsect. 3.2, the idea of the travelling salesman problem (TSP) as a heuristic for the laser beam steering is explained; being one of the most fundamental and well-studied NP-hard problems in the field of combinatorial optimization (e.g. [4,7]), we will use a stochastic optimization strategy (simulated annealing) to establish an optimal laser trajectory.
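Although the LSTM architecture itself is specified in Sect. 4, the following minimal sketch illustrates the type of sequence-to-one prediction we have in mind; it assumes PyTorch, and the layer sizes, the flattening of heat maps into feature vectors, and the one-step-ahead training target are illustrative assumptions.

import torch
import torch.nn as nn

class HeatMapLSTM(nn.Module):
    """Predict the next flattened temperature map from a history of maps."""
    def __init__(self, n_pixels, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_pixels, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_pixels)

    def forward(self, seq):            # seq: (batch, time, n_pixels)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])   # one-step-ahead prediction

model = HeatMapLSTM(n_pixels=64 * 64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
# A training step would use maps[:, :-1] as the history and maps[:, -1] as target.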
2 Related Work in Laser Powder Bed Additive Manufacturing
In general, a variety of different laser beam parameters, such as laser power, scan speed, building direction and layer thickness, influence the final properties
of the fabricated product. Due to the intense power of the laser during additive manufacturing, the printed product can have defects, such as deviations from the target geometry or cracks caused by large temperature gradients. For example, inhomogeneous heating may lead to unmelted powder particles that can locally induce pores and microscopic cracks [6]. At the same time, the cooling process determines the microstructure of the printed workpiece and thus its material properties, such as strength or toughness, which depend on the proportion of carbon embedded in the crystal structure of the material [1].
In a broader view, machine learning approaches may be deployed to provide monitoring capabilities over varying factors of L-PBF, namely the used metal powder and its properties, both at the initial time of spread and during the printing process, as well as the laser beam parameters, aiming to investigate and avoid any defect generation during fabrication. See [2] for a survey. Concerning the powder properties, different capturing technologies along with machine learning tools are used to automate the task of defect detection and avoidance during the printing process. In [19,20], k-means clustering [14] and a convolutional neural network (CNN) [12], respectively, were used to detect and classify defects at the time of initial powder spread, and their probable consequences during the entire printing phase, based on captured grey-level images. In [11], high-resolution temporal and evolving patterns are captured using a commercial EOS M270 system to find layer-wise heat inhomogeneities induced by the laser. In [10], an inline coherent imaging (ICI) system was used to monitor the defects and unstable process regimes concerning the morphology changes and also the stability of the melt pools. Here, the back-scattered intensities from the melt pool samples are measured as a function of their heights, called A-lines. Later, a Gaussian fitting of individual A-lines is performed to determine the centroid height and amplitude of melt pools as a function of time, corresponding to a range of different stainless steel powders with different properties.
Concerning the laser beam and its parameter optimization task, one can avoid conducting expensive real experiments, in terms of material and power usage, by simulating the printing process by means of the finite element method (FEM) [5], the Lattice Boltzmann method (LBM) or the finite volume method (FVM); see [3,18] for extensive surveys. The gathered simulated data may later be used in a data-driven machine learning approach within an L-PBF framework. In this context, a prediction task of thermal history was performed in [13] by adopting a recurrent neural network (RNN) structure with a Gated Recurrent Unit (GRU) in an L-PBF process. A range of different geometries are simulated by FEM while accounting for different laser movement strategies, laser power and scan speed. A three-dimensional FEM is adopted in [24] to simulate the laser beam trajectory and investigate its effects on the residual stresses of the parts. The simulation results show modifications of the residual stress distributions and their magnitudes, validated through experimental tests, as a result of varying the laser beam trajectory type. A parametric study [25] used the same FEM simulation setup as [24] with three varying factors, namely the laser beam speed, the layer thickness and the laser deposition path width. While each factor value varies in its range
from low, medium to high, the hidden relations among the factors and their effects on residual stresses and part distortions are revealed. In the context of FEM simulation with a steering heat source representing the laser movement, one can refer to the work developed in [21]; here, the residual stresses during printing are predicted, though the laser nozzle steering rule is not revealed.
3 Heat Transfer Model and TSP Formulation
As indicated, we first describe our heat simulation setting, which is the framework for the TSP optimization protocol described in the second part of this section.
3.1 Heat Simulation Framework
We set up a simulation environment, namely (i) a moving source of heat (cf. (3)) acting as a laser beam on (ii) an area Ω ⊂ R², simulated as a deposition of aluminium metal powder, called a plate. We assume that the plate is mounted to a base plate with large thermal conductivity, which makes the choice of Dirichlet boundary conditions with constant boundary temperature appropriate; if the surrounding is an insulator, then a reflecting, i.e. zero-flux or von Neumann, boundary condition is more suitable. A sequence of laser beam movements, called a trajectory, is followed so that at each point the heat equation (1) is solved by FEM, providing us a temperature map that varies over the plate locations as time evolves. Letting u be the temperature across an open subset Ω ⊂ R² as time t evolves in [0, T], the heat equation that governs the time evolution of u reads

$$\frac{\partial}{\partial t}u(x,y,t) = \alpha\,\nabla^2 u(x,y,t) + \beta I(x,y), \qquad (x,y,t) \in \Omega^\circ \times (0,T), \tag{1a}$$

$$u(x,y,t) = \theta_0, \qquad (x,y,t) \in \partial\Omega \times [0,T], \tag{1b}$$

$$u(x,y,0) = u_0(x,y), \qquad (x,y) \in \Omega, \tag{1c}$$

where we denote by

$$\nabla^2 \varphi = \frac{\partial^2 \varphi}{\partial x^2} + \frac{\partial^2 \varphi}{\partial y^2} \tag{2}$$

the Laplacian of some function φ ∈ C², by Ω° the interior of the domain Ω, and by ∂Ω its piecewise smooth boundary; here u₀ is some initial heat distribution, θ₀ is the constant ambient space temperature (20 °C), and we use the shorthands

$$\alpha := \frac{\kappa}{c\rho} \quad \text{and} \quad \beta := \frac{1}{c\rho}$$

with κ, c and ρ all in R₊, denoting thermal conductivity, specific heat capacity and mass density. Our power density distribution of choice to simulate a laser beam is a Gaussian function:
$$I(x,y) = I_0 \cdot \exp\left\{-2\left[\left(\frac{x - x_c}{\omega}\right)^2 + \left(\frac{y - y_c}{\omega}\right)^2\right]\right\} \tag{3}$$
with an intensity constant

$$I_0 = \frac{2P}{\pi\omega^2} \tag{4}$$

where ω and P denote the radius of the Gaussian beam waist and the laser power, respectively. In our study we let x ∈ [−1, +1], y ∈ [−1, +1] with (x, y) ∈ Ω°, t ∈ R₊, and also u(x, y, 0) = 0. The aluminium thermal properties are used to simulate the metal powder spread across the manufacturing plate. We solved (1) using [15], setting P = 4200 W and ω = 35 pixels, while letting (x_c, y_c) take all possible trajectory points such that the domain Ω° is always affected by five consecutive heat source moves. In this way, we simulated the heat source movement across the board.
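For illustration, the following is a minimal finite-difference sketch of solving (1) with a moving Gaussian source. The paper itself uses the MATLAB Partial Differential Equation Toolbox [15] with FEM, so the grid size, time step, material constants and beam radius below are placeholder values, not the authors' settings.

```python
import numpy as np

N = 128                          # grid points per axis on [-1, 1] x [-1, 1] (placeholder)
h = 2.0 / (N - 1)                # spatial step
alpha, beta = 1e-3, 1e-3         # kappa/(c*rho) and 1/(c*rho); illustrative values
dt = 0.2 * h**2 / alpha          # well inside the explicit-Euler stability bound
theta0 = 20.0                    # constant ambient boundary temperature (1b), in Celsius

xs = np.linspace(-1.0, 1.0, N)
X, Y = np.meshgrid(xs, xs, indexing="ij")

def gaussian_source(xc, yc, P=4200.0, omega=0.1):
    """Power density (3) with intensity constant I0 = 2P/(pi*omega^2); omega here
    is in plate coordinates, unlike the paper's pixel units, so it is illustrative."""
    I0 = 2.0 * P / (np.pi * omega**2)
    return I0 * np.exp(-2.0 * (((X - xc) / omega)**2 + ((Y - yc) / omega)**2))

def step(u, xc, yc):
    """One explicit Euler step of (1a) with Dirichlet boundary condition (1b)."""
    lap = np.zeros_like(u)
    lap[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:]
                       + u[1:-1, :-2] - 4.0 * u[1:-1, 1:-1]) / h**2
    u = u + dt * (alpha * lap + beta * gaussian_source(xc, yc))
    u[0, :] = u[-1, :] = u[:, 0] = u[:, -1] = theta0
    return u

# Initial condition (1c); the paper sets u(x, y, 0) = 0, here we start at ambient
u = np.full((N, N), theta0)
for xc, yc in [(-0.5, -0.5), (0.0, 0.0), (0.5, 0.5)]:  # toy trajectory, not TSP-derived
    for _ in range(50):
        u = step(u, xc, yc)
```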
Control Objective. By adopting the TSP-based protocol, we aim to minimize the value of a desired objective function:

$$J(m) = \frac{1}{2}\int_0^T \!\!\int_\Omega \left[\,|\nabla u_m(z,t)|^2 + \big(u_m(z,t) - u_g\big)^2\right] dz\, dt \tag{5}$$

with

$$u_m = u_m(z,t), \qquad z = (x,y) \in \Omega^\circ,\ t \in [0,T] \tag{6}$$

being the solution of the heat equation (1) on the interval [0, T] under the control

$$m : [0,T] \to \Omega^\circ, \qquad t \mapsto (x_c(t), y_c(t)) \tag{7}$$
that describes the trajectory of the center of the laser beam. Moreover, we have introduced u_g as the desired target temperature to be maintained over the domain Ω° as time t evolves. The motivation behind (5) is to maintain a smooth temperature gradient over the entire plate for all t ∈ [0, T], achieved by minimizing the L²-norm of the gradient ∇u, while at the same time keeping the (average) temperature near a desired temperature u_g at all times. We proceed by dividing the entire plate Ω° into 4×4 sub-domains (see Fig. 1) and investigate our objective function (5) within each sub-domain, as explained in Sect. 5.

3.2 TSP-Based Formulation
A common assumption among numerous variants of the TSP [4], an NP-hard problem [7], is that a set of cities has to be visited on a shortest possible tour. Let C_{n×n} be a symmetric matrix specifying the distances corresponding to the paths connecting a vertex set V = {1, 2, 3, ..., n} of cities to each other, with n ∈ N being the number of cities. A tour over the complete undirected graph
Fig. 1. We divide the entire domain Ω° containing the diffused temperature values into 4×4 sub-domains separated by white lines. Within each sub-domain, (5) is computed to reveal how the temperature gradient |∇u_m(·, t)| evolves as a function of time t and how the average temperature ū is maintained near a target value u_g. Note that the laser beam positions are irrelevant in this image.
G(V, C) is defined as a cycle passing through each vertex exactly once. The traveling salesman problem seeks a tour of minimum distance. To adopt the TSP into our context, we formulate its input V as the set of all 16 × 16 stopping points of the heat source over the board Ω°, and the set C as a penalty matrix with each element C_{ij} ≥ 0 being the impact (i.e. cost) of moving the heat source from i ∈ V to j ∈ V. For every vertex i ∈ V, the possible movements to all j ≠ i with the associated cost (5) are computed and assigned to C_{ij} (see below for details). With this formulation, we remind the reader that the elements of C are non-negative and follow the triangle inequality:

$$C_{ij} \le C_{ik} + C_{kj} \tag{8}$$
with i, j, k ∈ V. Note that the matrix C is obtained from a prior set of temperature maps produced using FEM without enforcing any particular protocol on them. With this general formulation at hand, let us have a closer look at the discretized form of (5) that was used in the current study to compute the elements of the penalty matrix:

$$C_{ij} = \left|\,\sum_{l=1}^{4\times 4}\left[\Psi(j,l)^2 + \big(\Lambda(j,l) - u_g\big)^2\right] - \sum_{l=1}^{4\times 4}\left[\Psi(i,l)^2 + \big(\Lambda(i,l) - u_g\big)^2\right]\right| \tag{9}$$

with l being the sub-domain index. In addition, $\Psi(\cdot,l) = \sum_{z\in\Omega_l}\sum_{t\in t_l} |\nabla u_m(z,t)|$ represents the temperature gradient aggregation within each sub-domain, and $\Lambda(\cdot,l) = \frac{1}{|\Omega_l|}\sum_{z\in\Omega_l}\sum_{t\in t_l} u_m(z,t)$ is the average temperature value of each sub-domain, with t_l being the time period during which the nozzle operates on Ω_l.
Here, by |Ω_l| we mean the number of discrete points in Ω_l ⊂ Ω°. In other words, (9) is the TSP cost of moving the nozzle from the i-th to the j-th stopping point, and it depends on (a) the mean square deviation of the temperature field from constancy and (b) the mean square deviation from the global target temperature u_g. In our simulation, the nozzle moves along the shortest (Euclidean) path connecting two successive stopping points. Thereby we assume the nozzle always adjusts its velocity so that the path between any arbitrarily chosen stopping points i and j takes the same amount of time; the motivation behind this is to avoid heating up the entire domain Ω°, which would result from keeping the nozzle velocity constant. Since no polynomial-time algorithm is known for solving the TSP [7], we adopt a simulated annealing algorithm [23], which was first proposed in statistical physics as a means of determining the properties of metallic alloys at given temperatures [16]. In the TSP context, we use [23] to look for a good (but, in general, sub-optimal) tour corresponding to the movement of the heat source that minimizes (9). In Sect. 5, we present our prediction results obtained by combining the TSP-based heuristic with the LSTM network. Before moving to the next section, let us observe a subset of temperature maps obtained based on the TSP, shown in Fig. 2.
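The following is a minimal sketch of this procedure, assuming the per-stop features Ψ and Λ from (9) have already been extracted into arrays `Psi` and `Lam` (one row per stopping point, one column per sub-domain). The penalty-matrix construction mirrors (9), and the annealing schedule and move set are illustrative, not the authors' settings.

```python
import numpy as np

def penalty_matrix(Psi, Lam, ug):
    """C_ij as in (9); Psi and Lam have shape (n_stops, 16)."""
    s = (Psi**2 + (Lam - ug)**2).sum(axis=1)      # aggregated cost per stopping point
    return np.abs(s[None, :] - s[:, None])        # symmetric, non-negative, triangle inequality

def tour_cost(tour, C):
    """Total penalty accumulated along a closed tour over the stopping points."""
    return sum(C[tour[k], tour[(k + 1) % len(tour)]] for k in range(len(tour)))

def simulated_annealing(C, T0=1.0, cooling=0.999, n_iter=100_000, seed=0):
    """Search for a low-cost (generally sub-optimal) tour; parameters are illustrative."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    tour = rng.permutation(n)
    cost = tour_cost(tour, C)
    T = T0
    for _ in range(n_iter):
        i, j = sorted(rng.integers(0, n, size=2))
        if i == j:
            continue
        cand = tour.copy()
        cand[i:j + 1] = cand[i:j + 1][::-1]       # 2-opt style segment reversal
        c = tour_cost(cand, C)
        # Metropolis acceptance rule [16]: always take improvements, accept
        # worse tours with probability exp(-(c - cost) / T)
        if c < cost or rng.random() < np.exp(-(c - cost) / T):
            tour, cost = cand, c
        T *= cooling                               # geometric cooling schedule
    return tour, cost
```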
Fig. 2. A subset of heat maps produced by FEM as the solution to the heat equation (1). One clearly observes the effect of the previous laser positions on the current status of the map, in terms of diffused temperature. The TSP heuristic steers the heat source across the plate, aiming to keep the temperature constant. Note that all temperatures are in Celsius.
4 The LSTM Approach
Let us start the discussion of our deep learning framework by investigating its LSTM [8] cell building blocks, shown in Fig. 3a, which comprise a stack of three LSTM layers (see Fig. 3b) followed by a fully connected layer. Here, we use the temperature gradient values of μ = 14 previous (i.e., earlier in time) heat maps to predict the gradient values of the current heat map. Letting ζ be the current heat map, its history feature values formally lie in the range of heat maps [ζ − μ, ζ − 1] with ζ > μ. By considering each heat map to have 16 sub-domains and the same number of gradient features Ψ(·, l), each corresponding to one sub-domain, we obtain in total ν = μ × 16 gradient feature history values that we vectorize to form the vector X ∈ R^ν. Our aim is to use sub-sequences from X to train the stack of LSTMs and forecast the sequence of 16 gradient feature values corresponding to the sub-domains of the heat map of interest ζ.

Let us briefly discuss the weight and bias matrix dimensions of each LSTM cell. Here, we use q ∈ N for the number of hidden units of each LSTM cell and n ∈ N for the number of features that we obtain from the FEM-based heat maps and feed to the LSTM cell. More specifically, we have only one feature Ψ(·, l) per sub-domain, i.e. n = 1. In practice, during the training process and at a particular time t′, a batch of input feature values X ⊃ X_{t′} ∈ R^{b×n}, with b ∈ N being the batch size, is fed to each LSTM cell of the lowest stack level in Fig. 3b. Here, the LSTM learns to map each feature value in X_{t′} to its next adjacent value in X as its label. The mapped labels are applied, during training, to the only neuron R_η ∈ R of the last fully connected layer, with η = 1. In addition to X_{t′}, each LSTM cell accepts two other inputs, namely h_{t′−1} ∈ R^{b×q} and c_{t′−1} ∈ R^{b×q}, the so-called hidden state and cell state, both of which are already computed at time t′ − 1. Here, the cell state c_{t′−1} carries information from the intervals prior to t′. A more precise look at the cell state c_{t′} at time t′ reveals that it partially depends on Γ_f ⊙ c_{t′−1}, with c_{t′−1} being the cell state at the previous time t′ − 1:

$$c_{t'} = \Gamma_u \odot \tilde{c}_{t'} + \Gamma_f \odot c_{t'-1} \tag{10}$$
with ⊙ representing element-wise vector multiplication. As (10) further shows, c_{t′} also depends on c̃_{t′}, which is itself computed from the feature vector X_{t′} and the previous hidden state h_{t′−1} as

$$\tilde{c}_{t'} = \tanh\!\left(\left[\,(h_{t'-1})_{b\times q}\ \ (X_{t'})_{b\times n}\,\right] \times W_c + (b_c)_{b\times q}\right) \tag{11}$$

with × denoting standard matrix multiplication, and b_c and W_c being the corresponding bias and weight matrices, respectively.
Fig. 3. (a) A graphical representation of the LSTM cell accepting the hidden state h_{t′−1} and the cell state c_{t′−1} from the previous LSTM and the feature vector X_{t′} at the current time. (b) A schematic representation of the adopted stack of LSTMs comprised of three recurrent layers processing the data. The upper LSTM layer is followed by a fully connected layer. The network performs a regression task and is trained based on the half-mean-square-error loss function (16). Note that the fully connected layer is established between the output of the LSTM stack h³_{t′} and the only neuron of the last layer R_η with η = 1.
Equation (10) contains two further terms, Γ_u and Γ_f, called the update gate and forget gate, defined as

$$\Gamma_u = \sigma\!\left(\left[\,(h_{t'-1})_{b\times q}\ \ (X_{t'})_{b\times n}\,\right] \times W_u + (b_u)_{b\times q}\right) \tag{12}$$

and

$$\Gamma_f = \sigma\!\left(\left[\,(h_{t'-1})_{b\times q}\ \ (X_{t'})_{b\times n}\,\right] \times W_f + (b_f)_{b\times q}\right) \tag{13}$$
that are again based on the feature vector X_{t′} and the previous hidden state h_{t′−1}, with b_u, b_f, W_u and W_f being the corresponding biases and weight matrices. Let us briefly conclude here that the feature vector X_{t′} and the previous hidden state h_{t′−1} are the essential ingredients used to compute c̃_{t′}, Γ_u and Γ_f, all of which are used to update the current cell state c_{t′} (10).
The motivation for using the sigmoid function σ in the structure of the gates shown in (12) and (13) is its activation range [0, 1], which in extreme cases makes them fully on or off, letting everything or nothing pass through them. In non-extreme cases, they partially contribute the previous cell state c_{t′−1} and the on-the-fly computed value c̃_{t′} to the current cell state c_{t′}, as shown in (10). To give a bigger picture, let us visualize the role of the Γ_u and Γ_f gates concerning the cell state c_{t′}. In Fig. 3a, a direct line connecting c_{t′−1} to c_{t′} carries the old data directly from time t′ − 1 → t′. Here, one clearly observes that the Γ_u and Γ_f gates are both connected by + and × operators to this line. They linearly contribute, as shown in (12) and (13), the current feature value X_{t′} and the adjacent hidden state h_{t′−1} to update the current cell state c_{t′}. Meanwhile, Γ_u shares its contribution through the × operator with c̃_{t′} to the passing line. Finally, to activate the current LSTM we need the cell state value at time t′, namely c_{t′}, which we obtain from (10), and also the so-called output gate obtained from

$$\Gamma_o = \sigma\!\left(\left[\,(h_{t'-1})_{b\times q}\ \ (X_{t'})_{b\times n}\,\right] \times W_o + (b_o)_{b\times q}\right) \tag{14}$$
with Γ_o ∈ [0, 1], and b_o and W_o being the corresponding bias and weight matrices. The final activated value of the LSTM cell is computed by

$$h_{t'} = \Gamma_o \odot \tanh\!\left(c_{t'}\right). \tag{15}$$
Here, the activated value h_{t′} obtained from (15) will be used as the input hidden state to the next LSTM cell at time t′ + 1. Let us also mention that all the biases b_c, b_u, b_f, b_o ∈ R^{b×q}, and the weight matrices are further defined as

$$W_c := \begin{bmatrix} (W_{ch})_{q\times q} \\ (W_{cx})_{n\times q} \end{bmatrix},\quad W_u := \begin{bmatrix} (W_{uh})_{q\times q} \\ (W_{ux})_{n\times q} \end{bmatrix},\quad W_f := \begin{bmatrix} (W_{fh})_{q\times q} \\ (W_{fx})_{n\times q} \end{bmatrix},\quad W_o := \begin{bmatrix} (W_{oh})_{q\times q} \\ (W_{ox})_{n\times q} \end{bmatrix},$$
leading to both c̃_{t′}, c_{t′} ∈ R^{b×q}. Finally, we have a fully connected layer that maps the output (h)_{b×q} of the stacked LSTM to the only neuron of the output layer R_η. This is achieved during the training process, while the weight matrix Ŵ ∈ R^{q×η} and bias vector b̂ ∈ R^{b×η} corresponding to the fully connected layer are updated based on a loss L computed as the half-mean-square-error (16) between the network predictions and the target temperature gradient values obtained from the heat maps produced by FEM:

$$L = \frac{1}{2\eta b} \sum_{i_1=1}^{b} \sum_{i_2=1}^{\eta} \left(p_{i_1 i_2} - y_{i_1 i_2}\right)^2 \tag{16}$$

Here, the p and y values represent the predicted and the target gradient temperature values, respectively.
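As a compact summary of (10)-(16), the following is a minimal NumPy sketch of one LSTM cell step. Variable names follow the notation above; this is an illustrative implementation, not the authors' code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell_forward(X_t, h_prev, c_prev, W, b):
    """One LSTM cell step following (10)-(15).

    X_t    : (b, n) batch of input features
    h_prev : (b, q) previous hidden state
    c_prev : (b, q) previous cell state
    W      : dict with 'c', 'u', 'f', 'o' weights of shape (q + n, q),
             each stacked as [W_h; W_x] as in the definitions above
    b      : dict with 'c', 'u', 'f', 'o' biases of shape (q,)
    """
    hx = np.concatenate([h_prev, X_t], axis=1)          # (b, q + n) concatenation
    c_tilde = np.tanh(hx @ W["c"] + b["c"])             # candidate state (11)
    gamma_u = sigmoid(hx @ W["u"] + b["u"])             # update gate (12)
    gamma_f = sigmoid(hx @ W["f"] + b["f"])             # forget gate (13)
    gamma_o = sigmoid(hx @ W["o"] + b["o"])             # output gate (14)
    c_t = gamma_u * c_tilde + gamma_f * c_prev          # cell state update (10)
    h_t = gamma_o * np.tanh(c_t)                        # activated output (15)
    return h_t, c_t

def half_mse(p, y):
    """Half-mean-square-error loss (16) over a (b, eta) prediction batch."""
    return 0.5 * np.mean((p - y) ** 2)
```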
5 Results
To begin with, we consider a set of root-mean-square-error (RMSE) measures computed between the predicted and the target gradient values corresponding to the nozzle moves, as shown in Fig. 4. More precisely, each curve value represents the RMSE computed between the gradient feature values of all 4 × 4 sub-domains of the predicted heat map ζ and their ground-truth counterparts. Since we use a history of μ = 14 previous gradient heat maps, the first prediction can be performed for the 15th nozzle move. Among all the measured RMSE values, we highlight four, marked in Fig. 4, which correspond to the 25th, 50th, 75th and 100th percentiles.
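As a small illustration, the per-move RMSE over the 16 sub-domain gradient features can be computed as follows; the array names are illustrative.

```python
import numpy as np

def per_move_rmse(pred, target):
    """RMSE per nozzle move over the 16 sub-domain gradient features.

    pred, target : arrays of shape (n_moves, 16); moves before the 15th are
    unavailable since mu = 14 history maps are needed for the first forecast.
    """
    return np.sqrt(np.mean((pred - target) ** 2, axis=1))
```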
Fig. 4. RMSE versus movement of the nozzle. Each curve value represents a computed RMSE between the gradient feature values of all 4 × 4 sub-domains of the predicted heat map ζ and their ground-truth counterparts. The RMSE computation can be started from the 15th nozzle move onward, since we use a history of μ = 14 previous gradient maps. The RMSE measures highlighted as ×, in ascending order, correspond to the 25th, 50th, 75th and 100th percentiles, respectively.
As one observes in Fig. 4, relatively low RMSE measures are obtained across almost all nozzle moves on the horizontal axis, though there exist some outliers. We further visualize the corresponding prediction results of the percentiles in Fig. 5. Specifically, consider the 25th RMSE percentile computed between the black curve and the overlapping part of the pink curve shown in Fig. 5a. The black curve in Fig. 5a is comprised of the 4 × 4 forecasted vectorized gradient feature values of the heat map sub-domains produced by the 54th nozzle move, with an RMSE of 0.009 compared to its overlapping pink curve. In this case, we let i, the nozzle move number, range in [ζ − μ, ζ − 1] to produce a history of gradient feature values corresponding to the heat map with ζ = 54. Consequently, the non-overlapping part of the pink curve in Fig. 5a represents the vectorized history feature values of the 40th to 53rd heat maps, which comprise the μ = 14 heat maps preceding ζ = 54, each with 4 × 4 sub-domains. The
Fig. 5. The ground truth and predicted gradient feature values used to compute the RMSE measures of the 25th, 50th, 75th and 100th percentiles, shown in (a), (b), (c) and (d), respectively. The ground-truth pink curves are obtained and vectorized from the μ heat maps preceding the current heat map ζ. Here we use i as the nozzle move number varying in its range [ζ − μ, ζ − 1], producing the set of gradient feature history values shown in pink. In addition, the vectorized ground-truth gradient feature values of the current heat map ζ are also shown as the part of the pink curve that overlaps with the black curve. The black curve is the forecasted vectorized gradient feature values corresponding to the heat map ζ. We also note the RMSE measure for each case (a) to (d) below it. The last case (d), with the worst RMSE (see Fig. 4), shows a clearly asynchronous prediction, though the shape of the forecasted curve appears to conform with its overlapping ground truth. In cases (a), (b) and (c) we have synchronous predictions, though in some parts the black curve does not predict the pink one accurately. (Color figure online)
black curves in Figs. 5b, 5c and 5d are likewise comprised of the predicted gradient feature values of the heat maps ζ equal to 129, 220 and 244, respectively, each forecasted based on its μ previous heat maps. A closer look at the four prediction samples shown in Fig. 5 reveals that even the 100th percentile, which marks some kind of outlier, is accurately predicted in the sense that the shape of the black curve tracks the pink curve (ground truth). Concerning the other three RMSE percentile values, the synchronicity between the black and pink curves is preserved equally well, though in some parts we do not have a full overlap. Finally, regarding the parameters used during the training phase: we applied the Adam optimizer [9] on batches of size 6. The number of epochs was chosen to be 350, which results in a meaningful reduction of the RMSE and loss measures within each batch. The initial learning rate was chosen as 0.008 with a drop factor of 0.99 every 12 epochs. To avoid overfitting, the flow of data within the network structure is randomly adjusted by setting the LSTM outputs to zero with a probability of 0.25 [22].
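For illustration, a PyTorch sketch of an equivalent training configuration follows; the paper's network was built with MathWorks tooling [14, 15], so this mirrors rather than reproduces the original setup. The hidden size q and the `loader` object (assumed to yield batches of 6 sequences with one feature per step) are placeholders.

```python
import torch
from torch import nn, optim

class StackedLSTM(nn.Module):
    def __init__(self, q=64):                      # q (hidden units) is a placeholder
        super().__init__()
        # three recurrent layers, one input feature per step (n = 1); PyTorch
        # applies the 0.25 dropout between LSTM layers, approximating the
        # output dropout described above
        self.lstm = nn.LSTM(input_size=1, hidden_size=q,
                            num_layers=3, batch_first=True, dropout=0.25)
        self.fc = nn.Linear(q, 1)                  # single output neuron (eta = 1)

    def forward(self, x):                          # x: (batch, seq_len, 1)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])              # regression from last hidden state

model = StackedLSTM()
optimizer = optim.Adam(model.parameters(), lr=0.008)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=12, gamma=0.99)

def half_mse(pred, target):                        # half-mean-square-error loss (16)
    return 0.5 * nn.functional.mse_loss(pred, target)

for epoch in range(350):
    for xb, yb in loader:                          # assumed DataLoader, batch size 6
        optimizer.zero_grad()
        loss = half_mse(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()                               # lr drop of 0.99 every 12 epochs
```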
6 Conclusion
We developed a novel and practical pipeline and mathematically justified its components. Our proposed model consists of two major parts, namely the FEM-based simulation of a laser powder bed fusion setup and an intelligent agent based on an LSTM network that actively judges the simulation results via the proposed cost function. The FEM simulation can be robustly applied before conducting expensive real-world printing scenarios, so that the intelligent component of the pipeline can decide on early stopping of the printing process. The LSTM-based network predicts the forthcoming temperature rate of change across the simulated powder bed based on the previously seen temperature history, giving us a means of control to achieve a final optimal printing process, as visualized by our results. Acknowledgments. The current work was supported by the European Regional Development Fund, EFRE 85037495.
References

1. Ali, M., Porter, D., Kömi, J., Eissa, M., Faramawy, H., Mattar, T.: Effect of cooling rate and composition on microstructure and mechanical properties of ultrahigh-strength steels. J. Iron. Steel Res. Int. 26, 1–16 (2019)
2. Abdelrahman, M., Reutzel, E., Nassar, A., Starr, T.: Flaw detection in powder bed fusion using optical imaging. Addit. Manuf. 15, 1–11 (2017)
3. Baturynska, I., Semeniuta, O., Martinsen, K.: Optimization of process parameters for powder bed fusion additive manufacturing by combination of machine learning and finite element method: a conceptual framework. Procedia CIRP 67, 227–232 (2018)
4. Flood, M.: The traveling-salesman problem. Oper. Res. 4, 61–75 (1956)
5. Fish, J., Belytschko, T.: A First Course in Finite Elements. Wiley (2007)
6. Großwendt, F., et al.: Additive manufacturing of a carbon-martensitic hot-work tool steel using a powder mixture - microstructure, post-processing, mechanical properties. Mater. Sci. Eng. A 827, 142038 (2021)
7. Lewis, H.: A guide to the theory of NP-completeness. J. Symbolic Logic 48, 498–500 (1983)
8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
9. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. ArXiv preprint arXiv:1412.6980 (2014)
10. Kanko, J., Sibley, A., Fraser, J.: In situ morphology-based defect detection of selective laser melting through inline coherent imaging. J. Mater. Process. Technol. 231, 488–500 (2015)
11. Krauss, H., Zeugner, T., Zaeh, M.: Layerwise monitoring of the selective laser melting process by thermography. Phys. Procedia 56, 64–71 (2014)
12. LeCun, Y., Bengio, Y.: Convolutional networks for images, speech and time series. In: The Handbook of Brain Theory and Neural Networks, pp. 255–258 (1995)
13. Mozaffar, M., et al.: Data-driven prediction of the high-dimensional thermal history in directed energy deposition processes via recurrent neural networks. Manufact. Lett. 18, 35–39 (2018)
14. The MathWorks: k-Means Clustering (2020)
15. The MathWorks: Partial Differential Equation Toolbox (2020)
16. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087 (1953)
17. Taruttis, A., et al.: Laser additive manufacturing of hot work tool steel by means of a starting powder containing partly spherical pure elements and ferroalloys. Proc. CIRP 94, 46–51 (2020)
18. Schoinochoritis, B., Chantzis, D., Salonitis, K.: Simulation of metallic powder bed additive manufacturing processes with the finite element method: a critical review. Proc. Inst. Mech. Eng. Part B: J. Eng. Manuf. 231, 96–117 (2017)
19. Scime, L., Beuth, J.: Anomaly detection and classification in a laser powder bed additive manufacturing process using a trained computer vision algorithm. Addit. Manuf. 19, 114–126 (2018)
20. Scime, L., Beuth, J.: A multi-scale convolutional neural network for autonomous anomaly detection and classification in a laser powder bed fusion additive manufacturing process. Addit. Manuf. 24, 273–286 (2018)
21. Song, X., et al.: Advances in additive manufacturing process simulation: residual stresses and distortion predictions in complex metallic components. Mater. Des. 193, 108779 (2020)
22. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
23. Tian, P., Ma, J., Zhang, D.: Application of the simulated annealing algorithm to the combinatorial optimisation problem with permutation property: an investigation of generation mechanism. Eur. J. Oper. Res. 118, 81–94 (1999)
24. Zhang, Y., Chou, Y.: Three-dimensional finite element analysis simulations of the fused deposition modelling process. Proc. Inst. Mech. Eng. Part B: J. Eng. Manuf. 220, 1663–1671 (2006)
25. Zhang, Y., Chou, K.: A parametric study of part distortions in fused deposition modelling using three-dimensional finite element analysis. Proc. Inst. Mech. Eng. Part B: J. Eng. Manuf. 222, 959–968 (2008)
Detection and Collection of Waste Using a Partially Submerged Aquatic Robot

Malek Ayesh(B) and Uvais Qidwai

Qatar University, Doha 2713, Qatar
{ma1509162,uqidwai}@qu.edu.qa
Abstract. With the amount of waste being dispersed into oceans on the rise, mitigating this issue has become a global concern. In the past few decades, governments, scientists, organizations, and individuals have been attempting to attenuate the effects of global warming, partially caused by improper waste disposal into oceans. This study presents a solar-powered, partially submerged aquatic robot constructed from recycled, recyclable, upcycled, and sustainable materials. The robot aims to provide flexibility in the choice of construction materials by not being limited to a particular operating system, microcontroller, motor, or robot floater. The robot detects and collects seven categories of commonly littered waste, namely cardboard (95.3%), wrappers (94.1%), metal cans (93.8%), surgical face masks (93.2%), plastic bags (96.2%), polystyrene (92.6%), and plastic bottles (93.8%). The custom detection system was evaluated on whether it is capable of detecting waste, and on how well it performs when little, medium, and high movement is introduced to the robot. Furthermore, the detection system's performance in low-light situations, along with the drivetrain's effectiveness, was tested. Future improvements include forming a larger dataset, enhancing the detection system's low-light capabilities, and attaching a larger battery. Keywords: Object detection · YOLOv3 · Robotics · Recycling · Sustainability
1 Introduction

Due to rising concerns, it is vital that effective approaches are formed to address and mitigate the issue of environmental pollution [1]. Environmental pollution is simply the introduction of harmful materials and substances into the environment [2]. There are three main categories of environmental pollution, namely air pollution, land pollution, and water pollution [3]. This study focuses on water pollution. Water pollution is the contamination of water sources, making them unfit for consumption and usage by their beneficiaries [4]. Water pollution can be categorised into anthropogenic, caused by humans, and non-anthropogenic, caused by natural events [5]. As clarified in Fig. 1, this project tackles the problem of commonly littered landfill anthropogenic environmental pollutants of surface water. More specifically, this project proposes a sustainable, eco-friendly, and autonomous partially submerged aquatic craft that detects and collects commonly found waste categories, namely (1) plastics: plastic bottles, plastic bags, packaging wrappers, polystyrene, and surgical face masks, (2) metal cans, such as beverage
containers and food tins, and (3) cardboard, mainly laminated cardboard. To understand why addressing this issue is important: first, 4.98 billion tonnes of plastic have ended up in landfills since the 1960s [6], and 14 million tonnes of plastic are dispersed into natural bodies of water yearly [7]. The introduction of plastic waste into bodies of water may occur through littering, disposal down drains, and items falling off waste management carrier ships [8]. Furthermore, due to the Coronavirus disease 2019 (COVID-19), production of surgical face masks has increased, leading to them being more commonly found in water bodies [9]. All this plastic waste dumping has led to the alarming issue of microplastics, which is being investigated by scientists and researchers. Microplastics are simply tiny pieces of plastic, smaller than five millimetres in size, that harm the environment [10] and are usually formed when a larger piece of plastic fragments into smaller ones [11]. Due to their small size, microplastics have been found in soil [12], food [13], oceans [14], and even in human placentas [15]. Second, metal cans are another category of waste that this aquatic robot detects and collects. Cans are usually made from aluminium, tin, and stainless steel [16]. These metal cans may corrode, generally by forming metal oxides [17], which may have devastating effects on the quality of water. These effects include altering the pH level of water, which negatively affects the beneficial algae present [18]. Third, cardboard is the final category of waste the proposed robot detects and collects. Cardboard is a common way to package goods. The issue with cardboard packaging is not the cardboard itself but rather the glue, plastic linings, and ink that are on it [19].
Fig. 1. Environmental pollution categories.
Acting now to address environmental pollution is a must in order to preserve the well-being of living and non-living organisms. Furthermore, protecting and cleaning the environment results in better living conditions for future generations. With a large research gap in the field of autonomous water-body cleaning solutions, this paper aims to motivate researchers and spark a trend with the goal of saving our planet. Through innovative techniques, what is considered waste by some can be upcycled or recycled into sustainable and eco-friendly products that are as effective as their polluting counterparts. One instance of these techniques is the solution proposed in this study. This project addresses common landfill anthropogenic environmental pollutants found floating on the surface of bodies of water. This is done by constructing an autonomous partially submerged aquatic robot which detects and collects these pollutants. The robot itself is completely constructed out of recyclable, recycled, upcycled, and sustainable
materials and aims to spark a trend of saving the planet using the exact materials that are harming it. By constructing and testing the proposed system, this study aims to answer the following research questions:

1. How can this autonomous robot contribute to the global efforts of cleaning up bodies of water, and how effective is it?
2. How much of this robot's construction uses recyclable, recycled, upcycled, or sustainable materials?
3. How flexible is this robot's design, and what are some possible alternative configurations?
4. With respect to cost and manufacturing time, how viable are eco-friendly and sustainable solutions?

With the aforementioned in mind, this project presents the designs and schematics of an autonomous, partially submerged, eco-friendly, and sustainable aquatic robot that can detect and collect commonly littered landfill waste. This robot will typically operate on the beaches of seas, rivers, and lakes where garbage is usually washed up. This is possible through a combination of cameras, computation devices, and actuators. A custom object detection system is trained to identify the different categories of waste and their detected locations. The custom detector uses You Only Look Once version 3 (YOLOv3) object detection and feeds its findings into a microcontroller, which then actuates and controls the driving motors. The entire system operates on electrical energy and is capable of charging its batteries using the onboard solar panel. This study aims to accomplish the following:

1. Develop and train a custom object detection system, using YOLOv3, which can identify selected types of commonly littered waste through a common webcam.
2. Design an efficient and optimized drivetrain, power source, and energy generation system for the robot, and build it using commonly found sustainable materials.
3. Design and construct a stable partially submerged watercraft hull, from recyclable, recycled, upcycled, and sustainable materials, that is suitable for the environment and conditions it will be deployed in.
4. Interface and optimize the robot's different systems to interact and function together in harmony.
5. Motivate organizations, institutes, and researchers to investigate and construct more sustainable and eco-friendlier environmental-pollution clean-up solutions.

The remainder of this paper is structured as follows. Section 2 discusses the background and related work, Sect. 3 describes the solution's design and implementation, and Sect. 4 presents the system evaluation and testing. Finally, Sect. 5 discusses the conclusion and future work.
2 Background Concepts and Related Work

It is critical to discuss ideas related to the proposed solution in the form of background concepts and related work. This clarifies the purpose of this robot and the knowledge gaps it tackles in the field of service robots.
Fig. 2. Gargoor fishing net [20].
2.1 Background Concepts

Recently, with the rise in popularity of artificial intelligence, more sophisticated, high-performing, customizable techniques have become more accessible to researchers and developers [21]. Some commonly utilized object detection techniques include Faster Regions with Convolutional Neural Networks (Faster R-CNN), Single Shot Detector (SSD), and You Only Look Once (YOLO). Many studies suggest YOLO's superior performance over other commonly used object detection techniques, in terms of, but not limited to, the accuracy and speed of detection [22, 23].

YOLO [24–26]
YOLO is a real-time object detection system based on Darknet. Darknet is an open-source neural network framework that is compatible with both CPU and GPU, and is written in both C and the Compute Unified Device Architecture (CUDA) language. YOLOv3 is used in the proposed solution as it is an improved version of YOLOv1 and YOLOv2, comprised of a bigger but faster neural network. YOLO works by dividing an image into equal squares of dimensions S × S. Each square detects the contents within itself, forming a class probability map. In parallel, numerous bounding boxes are drawn on the input image. Bounding boxes outline the objects in an image, and each bounding box has a height, width, class, and box centre. The class probability map and the bounding boxes are used together for the final detection. One approach used in making the final detection is Intersection over Union (IoU), which observes the overlap of boxes and determines whether the detected grid cells make up the entire object or not. If yes, then the final detection is that box; otherwise, the final detection box is expanded to fit the entire object.
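The following is a minimal sketch of the IoU measure just described, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples; it is illustrative, not the detector's actual implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: two partially overlapping unit-aligned boxes
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, approximately 0.143
```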
Gargoor (قرقور)
In an effort to use sustainable materials and support local businesses, this project uses a 'Gargoor' fishing net. Gargoor nets are handmade dome-shaped nets made from galvanized steel wires, and their diameter ranges from 0.50 m to approximately 5.00 m. A Gargoor net usually has at least one trap that is large at its opening and gets narrower the further in you go. The nets are specifically designed for seawater use; therefore, they are very durable. These nets are famous in the Arabian Gulf and are strongly linked to the identity of the people in the region. A Gargoor net can be seen in Fig. 2.

2.2 Related Work

Understanding and analysing previous attempts is crucial to filling knowledge gaps in any field. This section discusses work by other researchers related to every component of the proposed solution, followed by a comparison highlighting the novelty of this attempt. Firstly, regarding waste collection robots, study [27] proposes a relatively tiny partially submerged robot, shown in Fig. 3(a), with a conveyor-belt shape. Their robot weighs 10 kg and manages to clean a 0.3 m² area while consuming 45 W of power. An observation made on their design is that no waste detection software was used, but rather a mobile application from which the user controls the robot. Furthermore, with such a small cleaning area, this robot may arguably be considered inefficient and redundant. Shown in Fig. 3(b) is the attempt of study [28] at a water cleaning and surveying robot. The study's robot is made to follow a predetermined path, of size 1.0 km × 0.5 km, and constantly senses the surface of the water using an ultrasonic sensor. Water temperature, humidity, and conductivity data are constantly read by the robot and sent through a NodeMCU to the cloud. The waste collection mechanism utilizes a conveyor belt system, similar to study [27]. The criticism of study [28] concerns mainly the choice of sensors, as it is unclear how such a combination can detect pollution in water. The author in [29] presents a pontoon-shaped robot, shown in Fig. 3(c), that cleans the surface of the water via a motor-controlled robotic arm. The presented robot, which can carry up to 16 kg of waste, is remote controlled and has no waste detection system; instead, human supervision over the robot is required.
Fig. 3. The attempts of (a) study [27], (b) study [28], and (c) study [29].
The user-controlled earthbound robot in [30] can generate its own power with the help of a 40 W solar panel. Shown in Fig. 4(a), the robot has bulldozer-like tracks and a shovel that collects waste and stores it in an onboard box. Controlling the robot is done with the help of a remotely located operator observing the robot through its camera. Measuring 0.52 × 0.74 × 0.17 m, the aluminium robot moves at an average speed of 0.5 m/s. Some criticism of the presented robot includes its arguably small waste storage compartment. An IoT "Aquatic Iguana" robot is presented in [31]; it consists of a live-feed
camera, a pH sensor, a temperature sensor, and a turbidity sensor. The conveyor-like waste collection system, shown in Fig. 4(b), has a 15 kg capacity and is claimed to be able to collect 2 kg of waste in 10 min. The remote-controlled robot's pitfalls may include its inability to generate its own power; therefore, it is limited in terms of operation time. An interesting robot is presented in [32]: a fully submerged torpedo-shaped robot, shown in Fig. 4(c), that aims to make chemical, biological, and physical observations. These observations include oil spills and sea-floor terrain sampling. The robot in [32] is autonomous and can communicate with the base station via satellite phone, radio frequencies, or acoustic telemetry. With the ability to operate in the ocean for months, the robot can also reach depths of 0.5 km under water. The author in [33] claims their robot can clean oil spills and pipeline leakages. This is done by analysing the water quality; if abnormal readings appear, a distress signal is sent to the base station for action to be taken. Water quality data are analysed by a machine learning model which obtains data from a camera, LiDAR, an ultrasonic sensor, a pH sensor, a turbidity sensor, and a temperature sensor. The different labels of their machine learning model are plastic bags, plastic bottles, Styrofoam, algae, metals, and oil spills. The model is implemented on a Raspberry Pi. The conveyor-belt-style waste collection system drives the detected waste through a narrow channel and into the onboard waste bin. Study [33] did not detail which machine learning model was used to detect the waste. Furthermore, the waste collection bin, shown in Fig. 4(d), may not be suitable for entrapping liquid pollutants such as oil spills. Study [34] aims to clean river streams by utilizing deep learning. Their proposed design uses a partially submerged craft which consists of three cameras around the robot to give an all-around view, two paddle wheels to drive the craft, and two arms which guide waste into the waste collection tub via its gate. Their robot is claimed to be multidirectional and to move in the direction of the detected waste. Despite this, the study only presented the implementation of the object detection system but not the robot itself, and there is no mention of the deep learning technique utilized by their proposed method. Secondly, as for current object detection advancements, six categories of waste were detected by [35], namely glass, plastic, paper, trash, metal, and cardboard. The study's object detector is not limited to aquatic use and utilizes a hybrid transfer learning technique for classification, along with Faster R-CNN to obtain the region with the detected waste. A dataset of 400 images was used, and 0.842, 0.878, and 0.859 were obtained for the precision, recall, and F1-score, respectively. The testing criteria of this paper are well chosen, but the small dataset size and the lack of training on real images may set back the algorithm's ability to generalize properly. A custom animal detector is presented in [36]; it shows how footage obtained from a motion-triggered camera can be fed into a machine learning model and optimized to detect and classify different animals. Their presented model uses Faster R-CNN and aims to optimize footage contrast and high false-positive rates. These optimizations led to a 4.5% increase in animal detection accuracy. Finally, considering the aforementioned related work, many knowledge gaps exist.
This paper aims to tackle these knowledge gaps and set a new path for how a waste detection and collection robot should be designed and function. Previous work did not discuss the environmental impact of the materials used to construct the robot itself, as no
Fig. 4. The attempts of study (a) [30], (b) [31], (c) [32], and (d) [33].
mention of using recycled materials was made. This point is crucial to the sustainability and success of such a robot, as when the robot is decommissioned, it will most probably end up as waste similar to what the robot itself was collecting. In contrast, the robot proposed in this paper only uses recycled, recyclable, upcycled, and sustainable eco-friendly materials. Furthermore, another issue that is not well emphasized in previous work is the use of clean energy generation. Most of the discussed robots employ a battery system that needs to be recharged when fully drained. This has two implications: (1) the environment-cleaning robot is not using an environmentally friendly source of power, and (2) the robot cannot be deployed uninterrupted for months at a time but has to return to the base station to be recharged, unlike the robot proposed in this paper. A common and unexplained trend in related work is the use of a paddle system to move the robot and a conveyor-belt design to take in waste. These unexplained design choices raise many questions: why was this design used? Is that design the most efficient available? This paper aims to clarify and justify its design choices by providing reasoning and practical proofs. Many of the previously discussed robots are remote controlled. Despite remote-controlled vehicles having their applications, common waste collection may not be one of them. Constant human supervision of common waste collection is a very time-consuming process, since a person needs to search for waste and collect it. No time- or movement-critical situations are generally present in common waste collection, unlike other applications such as nuclear waste and explosives management. The robot in this paper is completely autonomous and does not require any human intervention. Another issue with previous work is the inefficient consumption of energy. For instance, previous attempts tend to utilize actuators to control the collection compartments in order to entrap waste. This paper provides a mechanical approach to trapping waste that does not require electrical energy. Furthermore, some studies, like study [28], attached unrelated sensors to their robot and claimed to test how polluted the water is without providing a logical explanation of how that is done. Such unrelated sensors waste energy that can be useful elsewhere. The lack of design flexibility is a major concern when constructing such robots. For such a robot to be successful, it should be able to adapt to the construction materials commonly available in the specific region. The proposed robot in this paper comes with suggestions on what parts can be substituted to suit the specific deployment region. Furthermore, the majority of this robot is constructed from globally abundant waste that is free of charge. Finally, almost all previous attempts only demonstrate that their object detection system is able to successfully detect waste, with no further testing criteria, unlike this paper, which aims to improve upon that.
Fig. 5. Proposed system workflow.
Fig. 6. System interactions.
3 Solution Design and Implementation

This section presents how the proposed solution is designed, taking functional requirements into consideration. Figure 5 describes the robot's system workflow. First, the live camera feed is analysed for any present waste. If more than 30 s pass without the robot detecting any waste, i.e., the robot is still, it drives forward for 30 s. Otherwise, the robot checks for waste: if the detected waste is located in the left region of the live footage, the robot drives left for 3 s; if the waste is located straight ahead, the robot drives straight for 3 s; and if the robot detects the waste in the right region, the robot drives right for 3 s. Whether the robot moves left, straight, or right, the counter is reset to zero. If waste was not detected in the left, straight, or right directions and the current timer is less than 30 s, the timer keeps incrementing and the system attempts to detect waste all over again. The timer keeps track of how long the robot has been still.
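A minimal sketch of this workflow follows; `camera`, `detect_waste`, and `drive` are hypothetical stand-ins for the webcam feed, the YOLOv3 detector, and the motor commands described in the following subsections, not the authors' function names.

```python
import time

# detect_waste(frame) -> one of "left", "straight", "right", or None
# drive(direction, seconds) -> runs the motors for the given duration

def control_loop(camera, detect_waste, drive):
    still_since = time.time()
    while True:
        frame = camera.read()
        region = detect_waste(frame)
        if region in ("left", "straight", "right"):
            drive(region, 3)            # steer towards the detected waste for 3 s
            still_since = time.time()   # reset the stillness timer
        elif time.time() - still_since > 30:
            drive("straight", 30)       # no waste for 30 s: explore forwards
            still_since = time.time()
```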
3.1 Hardware Design

This section presents the different hardware used in the proposed solution. Figure 6 illustrates how the different components are connected to one another. One of the main objectives of this project is to allow for design flexibility and ease of parts substitution. With that in mind, a Phillips P506 webcam is utilized for this solution. The proposed solution uses a webcam, as webcams can be cheap, easily found, and convenient to set up. In an effort to reduce e-waste, this project aims to upcycle old and slow computers by giving them a new purpose. A decommissioned Lenovo ThinkPad 440 was obtained and, with some updates and the removal of unnecessary programs, the laptop became capable of running the object detection system. A Windows, Apple, or Linux-based computer can be used for this project, since Python and the Arduino IDE are cross-platform. An Arduino Nano was salvaged from an old project. The Arduino is used to take instructions from the object detection system and, accordingly, send commands to the motor drivers. A total of 6 digital pins are required; therefore, the Arduino Nano used here can be interchanged with most of the popular Arduino models. Using an Arduino and its IDE is a cheaper way of controlling the motors when compared to other solutions, such as LabVIEW, which cost significantly more. To further support the cause of reducing e-waste, two motor drivers were also obtained from a colleague. Two 'Ruler 1500' bilge pumps were obtained and recycled. It is suspected that the bilge pumps' water seals had gone bad, but the motors inside are completely fine. The bilge pumps were modified by disassembling their bottoms and some of their internals, followed by making new steel motor shafts on the lathe. Each motor requires 12 V and 4.8 A to function. The only new but sustainable and eco-friendly parts purchased were the 100 W solar panel, the 20 A solar panel controller, and the 18 Ah battery. Considering that the average lifespan of a solar panel is 25 years and that a battery can be recycled, using electrical energy is a more sustainable and eco-friendly method of supplying the robot with energy.

3.2 Software Design

This section discusses software specifications, the dataset used, and how the software controls the hardware components.

Dataset and Waste Detection
All images were taken through the Phillips P506 webcam on both beaches and swimming pools in Qatar. After surveying multiple beaches in the region, it was determined that 7 main types of littered waste were present in bodies of water, namely cardboard, plastic wrappers, metal cans, surgical face masks, plastic bags, polystyrene, and plastic bottles. The 820-image dataset was manually labelled using the open-source software Labelling. The labelled images were uploaded to Google Colab for training, validation, and testing. The final weights were downloaded and run on the real-time YOLOv3 object detector.

Translating Detection Findings and Driving the Robot
After the real-time waste detector was working, its Python code was modified to obtain the detected object's label, confidence rate, and location on the screen. These variables are needed to drive the motors. The Arduino is then connected to the computer, and the pyFirmata library is used, which allows the Arduino to be controlled through Python code. The previously obtained variables are turned into commands for the motors through the Arduino, which drives the robot towards the detected waste.
Fig. 7. (a) Robot's front view and (b) robot's rear view.

Referring to Fig. 7(a), the robot's floaters are recycled water cooler bottles with hose clamps and wires around their necks, along with plugs to make them completely watertight. These floaters can be partially filled to balance the robot. The water cooler bottles can be replaced with jerry cans, barrels, or even plastic bottles tied together. The waste capturing and storage system, the Gargoor net, does not require any energy to function, unlike the previously mentioned attempts in the related work. The robot's drivetrain, shown in Fig. 7(b), is a system made from recycled pallet wood, a propeller, a 3D printer belt, and a piece of 8 mm threaded rod that acts as a belt tensioner.
4 Solution Evaluation and Testing

4.1 Testing is Based on Four Factors

1. Can the object detection system detect waste?
2. How well does the object detector cope with movement?
3. Can the robot still detect waste in relatively low-light conditions?
4. Can the robot collect waste?

4.2 Can the Object Detection System Detect Waste?

With the webcam placed 26 cm above the surface of the water, the following waste detections were made. From Fig. 8, it can be observed that the object detection system is able to detect all 7 classes.

4.3 How Well Does the Object Detector Cope with Movement?

For this next experiment, movement was simulated to observe the variation in detection confidence provided by the waste detection system. Four categories of movement were studied: no movement, low movement, medium movement, and high movement. Movement for each category was simulated as consistently as possible to provide fairness. The 'Accelerometer' mobile application was downloaded on an iPhone through the App Store and was set to obtain readings at a rate of 20 Hz. The iPhone was attached to the webcam, and readings were obtained, exported as a CSV file, and analysed.
Fig. 8. Waste detections for (a) cardboard (95.29%), (b) wrapper (94.14%), (c) can (93.75%), (d) surgical face mask (93.15%), (e) plastic bag (96.19%), (f) polystyrene (92.60%), and (g) plastic bottle (93.80%).
Fig. 9. Movement level versus detection confidence percentage.
Shown in Fig. 9 are the results for the detection system's confidence values for each waste category. When the camera experiences no movement, cardboard is the best-detected category (98.7%), while wrappers are the worst (93.1%). When little movement is introduced to the camera, cardboard and plastic bags are equally the best-detected categories (94.2%), while wrappers are still the worst (87.5%). An overall decrease in detection
confidence percentages can be observed as camera movement increases to medium. The difference in detection confidence between cardboard and plastic bags is very small, at 92.1% and 91.7%, respectively. Wrappers are still the category with the lowest detection confidence percentage (83.5%), which is significantly less than the second-worst detected category, metal cans, at 87.4%. Finally, when high movement is introduced to the camera, all categories' detection percentages fall below 90.0%, with the wrappers category suffering the most at 72.0%. Larger fluctuations between categories appear as they collectively stop following the overall shape they exhibited at previous movement levels. Overall, the reason why detecting wrappers is more difficult compared to other classes might be the different shapes they take; it is much harder to collect a wrapper dataset that accounts for their malleability and colour, unlike bottles or cans, which have a consistent shape. Despite plastic bags also being malleable and varying in shape, during the waste survey they did not protrude above the surface of the water and usually do not have sharp edges similar to those found on cardboard and polystyrene; therefore, plastic bags can be detected well.

4.4 Can the Robot Still Detect Waste in Relatively Low-Light Conditions?

This experiment's purpose is to test whether the custom detection system can still detect waste when the environment of operation has little light (i.e., during sunset, fog, or on very cloudy days). For this experiment, the camera was fixed in place (no movement). Figure 10 shows sample detections in low-light conditions. A significant decrease in the confidence values can be seen in Fig. 11. The proposed custom waste detection system's dataset consists mainly of images that were captured during the afternoon. The detections under normal and dim light, shown in Fig. 11, have the same overall shape, with cardboard being the best-detected class and wrappers the worst.
Fig. 10. Detections of the categories in dim light: (a) cardboard (78.0%), (b) wrapper (52.5%), (c) can (67.4%), (d) surgical face mask (72.8%), (e) plastic bag (73.9%), (f) polystyrene (74.6%), and (g) plastic bottle (67.7%).
Fig. 11. Detection confidence values with normal versus dim environment lighting.
Dim lighting conditions affect not only detection confidence but also the object detection system's ability to correctly label its findings. For instance, shown in Fig. 12 is a piece of polystyrene that was once detected as cardboard and once as a face mask. The reason for this error could be that when there is little light in the environment of operation, the captured live footage becomes grainy and starts to lose its colour. The loss of colour causes the camera feed to become nearly grayscale, which may cause polystyrene, cardboard, and face masks to have a similar overall appearance.
Fig. 12. Polystyrene misclassified as cardboard and face mask in dim conditions.
4.5 Can the Robot Collect Waste?

The custom waste detection system was successful in detecting the targeted waste categories. This experiment tests the robot's ability to receive instructions and act upon them through its drivetrain, i.e., the motion that drives it to collect the detected waste.
With this dual-motor setup, the robot achieved a speed of 0.67 m/s, or 2.41 km/h, when driving straight with no waves present in the water. The robot was able to turn left or right, as instructed, towards the detected waste and collected it. Some changes can be made to the proposed systems; these are discussed in detail in the future work.
5 Future Work and Conclusion

The living conditions of current and future generations depend on how we treat our planet. One major issue negatively affecting our planet is surface water pollutants. Surface water pollutants can be introduced to bodies of water through littering, during the waste transportation process, and by items being thrown down drains. Surface water pollutants have many negative effects on the environment, such as tangling and suffocating marine creatures, destroying beneficial algae on the surface of the water, and producing microplastics, which are finding their way into our drinking water and food. With the aforementioned in mind, this paper contributes to the ongoing efforts to combat the issue of surface water pollutants by proposing a partially submerged aquatic robot that detects and collects them. More specifically, this robot is capable of detecting and collecting seven categories of surface water environmental pollutants, namely metal cans, surgical face masks, plastic bags, plastic bottles, polystyrene, wrappers, and cardboard. These seven categories were selected after local beaches were surveyed. The robot itself is constructed from recycled, recyclable, upcycled, sustainable, and eco-friendly materials in an effort to further contribute to the cause. The solar-powered robot has an onboard recycled computer running a custom YOLOv3 waste detection system. The waste detection system was trained using images taken on local beaches and pools, and it was tested in terms of whether it can detect waste, how well it deals with movement, and how effective it is in low-light conditions. Finally, the robot was tested to see how effectively it operates to collect the detected waste. The custom waste detection system showed promising results and is able to detect cardboard (95.29%), wrappers (94.14%), cans (93.75%), surgical face masks (93.15%), plastic bags (96.19%), polystyrene (92.60%), and plastic bottles (93.80%). The detection confidence values decrease as the movement level increases. Detection was affected the most when the object detection system was subjected to low-light conditions. In terms of the robot's ability to act upon its detections, the robot is able to move at a relatively high speed of 2.41 km/h to collect the detected waste. Despite the promising outcomes achieved by the proof of concept presented in this paper, there is still room for improvement. Future work includes collecting a larger dataset; a larger dataset will help the detection system generalize, as it will have seen more appearances and forms of waste during training. Increasing the dataset's size includes adding more categories of waste and adding images from different operating environments to accommodate different regions of the world. Improving the robot's ability to work in low-light conditions is another improvement; this can be done by strapping a lighting system to the robot and adding images obtained during low-light conditions to the dataset. In terms of the robot's hull, further experimentation includes testing different types of floaters, such as water bottles and jerrycans. Different drivetrain configurations also need to be tested. This includes experimenting with a single
motor setup for the drivetrain. Initial testing was conducted on a single-motor setup, which resulted in a forward driving speed of 0.4 m/s, or 1.44 km/h. As for turning, a rudder-style system will need to be designed, implemented, and tested. Implementation of a single-motor setup will increase the robot's operation time, but it will probably decrease its ability to drive through larger waves. Finally, regarding energy generation and consumption, a larger battery is planned, which will further increase the robot's operation time.
Detecting Complex Intrusion Attempts Using Hybrid Machine Learning Techniques

Mustafa Abusalah1(B), Nizar Shanaah2, and Sundos Jamal3

1 Consolidated Contractors Group S.A.L., 15125 Marousi, Greece
[email protected]
2 Arab American University, Ramallah, Palestine 3 Paltel, Nablus, Palestine
Abstract. Organizations are in a constant race to secure their services and infrastructure from ever-evolving information security threats. The primary security control in this arsenal is the Intrusion Detection System (IDS), which can automatically detect attacks and intrusion attempts. For the IDS to be effective, it needs to detect all kinds of attacks without disturbing legitimate traffic by erroneously classifying it as attacks and affecting normal operations. Additionally, the IDS needs to detect previously unknown attacks that do not exist in its knowledge base. This capability is traditionally achieved by anomaly detection based on trends and baselines, an approach that is prone to high false-positive rates. This paper explores the most appropriate machine learning algorithms and techniques, specifically hybrid machine learning. This hybrid approach combines unsupervised and supervised machine learning to detect previously unknown attacks while minimizing false positives by analyzing events generated by different connected systems and devices. Keywords: Intrusion detection system · Anomaly · Machine learning · Unsupervised · Supervised · Evaluation · Big Data
1 Introduction

1.1 Intrusion Detection Systems
Intrusion detection is the process of detecting misuse or abuse of an organization's assets through the monitoring and analysis of events generated by various connected systems and devices. Organizations rely on Intrusion Detection Systems (IDSs) to detect such intrusion and misuse attempts (Friedberg, et al., 2015). Traditionally, IDSs operate in two modes: signature-based detection and anomaly detection. Signature-based detection relies on a list of previously known attacks, while anomaly detection looks for events that fall outside normal system operation and the established normal baseline to detect misuse (Syarif, et al., 2012) (Repalle & Kolluru, 2017). The primary advantage of signature-based detection is its ability to accurately identify well-known attacks because it relies on previously programmed indicators like hashes, traffic
content, and known byte sequences. However, it falls short in detecting new attacks, as it does not have the signatures of those attacks in its database. In contrast to signature-based detection, anomaly detection systems do not require a signature database. They learn normal behavior by establishing a normalized baseline to be used as a benchmark to detect abnormal behavior. Hence, they do a better job of detecting new attacks but suffer from a high false-positive detection rate (Amoli, et al., 2016).

1.2 Supervised and Unsupervised Learning
The main difference between supervised and unsupervised learning is the existence of labels in the training data set. Supervised machine learning learns from labeled training data for prediction and classification. There are two main categories of supervised learning algorithms, classification and regression, with methods including logistic regression, decision trees, random forests, Naive Bayes classifiers, K-nearest neighbors, and support vector machines. The main advantage of supervised learning is helping optimize the performance criteria with the help of experience; it also solves various types of real-world computation problems. Conversely, supervised learning has some disadvantages when classifying big data, as it requires a lot of computation time to train the model. Furthermore, it is costly and challenging to obtain labeled data for emerging real-world problems. Meanwhile, unsupervised machine learning recognizes patterns without the need for a target variable. All the variables used in the analysis are included as inputs, and the algorithm needs to find the hidden structure by itself. Hence, these techniques are suitable for clustering and association mining application domains. Here the challenge is to group unsorted information based on similarities, patterns, and differences. There are two main categories of unsupervised learning: clustering and association. Clustering algorithms identify inherent groupings within the unlabeled data and then assign a grouping label to each data value, while association algorithms aim to identify rules that accurately represent relationships between attributes and discover rules that describe large portions of the data. Clustering includes hierarchical clustering and k-means. However, unsupervised learning is computationally complex, and its results are less accurate than those of supervised learning algorithms (Berry, et al., 2019). The hybrid approach involves combining supervised and unsupervised learning to evade overfitting and reduce the high computational costs on high-dimensional big data sets.

1.3 Problem Statement
Malicious entities are increasingly evolving their techniques to evade detection by exploiting novel vulnerabilities for which the IDS does not yet have signatures, also known as zero-day attacks [1]. While the anomaly detection approach can help detect previously unknown attacks, it suffers from high false-positive rates [2]. Such threats require algorithms and techniques to augment the capabilities of the IDS to detect such unseen attack attempts while not disrupting legitimate operational processes.
1.4 Paper Structure and Contributions
This paper investigates whether combining different unsupervised and supervised machine learning methods and strategies improves network access security, using the real-world and recent CSE-CIC-IDS2018 dataset with 16 million records. We also utilize the hold-out method to create training, validation, and testing sets. The second section discusses the literature review, previous work, and research gaps, while the third section overviews the research approach, including prediction methods, evaluation criteria, dataset overview, cleansing, and feature engineering. Section four covers the model development, starting with the unsupervised machine learning methods and a comparison of their results and then, after selecting the best performers among them, applying the supervised machine learning methods; the results are then combined using hard voting. The final section discusses the results and compares them with previous work.
2 Literature Review

2.1 Datasets
The most well-known dataset in experimental work on anomaly detection was KDD'99, but this dataset was issued in 1999 and is considered outdated: most system and device makers have updated their products with the latest security patches, and older attacks are no longer a concern. On the other hand, a new class of services and applications has emerged, introducing new security vulnerabilities [3]. The recent CSE-CIC-IDS2018 dataset, issued by the Canadian Institute for Cybersecurity (CIC), is considered the latest dataset and is optimal for our experiments [3–5]. The CSE-CIC-IDS2018 dataset was compiled from traffic generated by 50 attacking machines targeting a victim organization with five departments, 30 servers, and 420 end-user workstations, producing seven different attack scenarios: Heartbleed, brute-force, DoS, DDoS, web attacks, botnet, and infiltration [6, 7]. The dataset consists of 16 million network flows.

2.2 Previous Work
Most of the previous work used KDD'99. The authors of [8] presented an experimental evaluation framework for supervised and unsupervised learning methods for intrusion detection applications. They used the KDD'99 dataset to demonstrate that supervised learning methods significantly outperform unsupervised ones when dealing with known attacks. Consequently, they highlighted that unsupervised methods are better suited to handle unknown attacks, as the performance of supervised methods drops significantly. In [9], the author performed a literature analysis of various existing machine learning approaches and their ability to detect attacks in network traffic data using unsupervised and supervised learning approaches on the KDD'99 dataset. Their analysis suggests that unsupervised learning has a higher detection rate than supervised learning; however,
the results of their analysis suggest that these methods are susceptible to a high false-positive rate due to changes in the network environment, services, or patterns of normal traffic. The author in [10] discussed the large data size, high dimensionality, and data preprocessing challenges of the real-world CSE-CIC-IDS2018 dataset. The use of a fully connected dense deep neural network (DNN) was proposed, achieving a detection accuracy of 0.90. In addition, the researchers recommended utilizing a proper feature selection method to increase accuracy and detection rate, decrease the False Alarm Report (FAR), and minimize computing time; hyperparameter tuning is also recommended for further efficiency. A hybrid approach that combines unsupervised and supervised techniques was proposed in [11]. Their experimental work showed high accuracy of over 99% using k-means and random forest; those results are based on the ISCX2012 dataset, containing around two million records. K-means was effectively used as an under-sampling technique on the ISCX dataset: DNS flows were reduced from 309K to 55K, while HTTPWeb flows were reduced from 881K to 367K. However, ISCX2012 does not include HTTPS requests, which account for most web traffic in modern applications and environments. The paper [3] surveyed and analyzed the use of intrusion detection models and IDS datasets, starting with ISCX2012, its limitations, and its lack of a realistic and recent nature. They also discussed the CSE-CIC-IDS2018 dataset, which was prepared from a much larger network of simulated client targets and attack machines. They determined that the best performance scores for each study were unexpectedly high overall and attributed this to overfitting. They also found that the information on the data cleaning of CSE-CIC-IDS2018 was insufficient across the board, and suggested significant research gaps in large data processing frameworks, concept drift, and transfer learning.

2.3 Research Gap
As discussed in the literature, most of the previous research focused on the outdated KDD'99 dataset, while the recent CSE-CIC-IDS2018 dataset has not yet received the same research attention. Significant research gaps exist in intrusion detection based on the CSE-CIC-IDS2018 dataset: topics such as big data processing frameworks, concept drift (e.g., changes in data patterns over time), and transfer learning are missing from the literature. Also, most of the literature we found that used the CSE-CIC-IDS2018 dataset did not utilize the whole dataset, as it requires computing resources and processing time, which we were able to obtain through our institute. In the literature, the hybrid ML approach was applied to much smaller datasets; our research findings introduce a unique result that adds to the literature. Transfer learning was done by applying unsupervised ML methods and then using the generated clusters as a base for the supervised ML methods. We then combined three supervised ML methods and used hard voting to make the final label judgment. We have also investigated the correlation between features and applied data cleaning and feature engineering to address the research limitations discussed in [3].
3 Research Approach

3.1 Prediction Method
This work used unsupervised methods to detect attacks among the 16 million network flows in the CSE-CIC-IDS2018 dataset. We will use clustering methods such as K-means and mean shift. In addition, unsupervised outlier detection methods will be used, such as isolation forest and local outlier factor. Moreover, a hybrid approach will be used. First, we will apply the K-means algorithm to the training data, and then we will run two supervised models: logistic regression and random forest. We will then apply local outlier factor and then implement logistic regression, random forest, and gradient boost. Hard voting will be used to aggregate the classifications, and the results will be evaluated and compared. After that, synthetic minority oversampling (SMOTE), a commonly used standard for learning from imbalanced data [12], will be performed on the training data, and the hybrid approach will be repeated on the resampled dataset.

3.2 Evaluation Criteria
Validation techniques are essential to evaluate a model's accuracy. In supervised learning, validation is done mainly by measuring performance metrics such as accuracy, precision, recall, and AUC. Conversely, the validation process is not as simple in unsupervised learning, since we do not have the correct labels. For unsupervised learning, two statistical techniques can be used for validation: internal validation and external validation. Internal validation measures cluster quality without the need for external data and is used when we have no further information. Internal validation metrics are further broken down into cohesion and separation. Cohesion assesses the closeness of the elements of the same cluster and is defined by:

cohesion(C_i) = \sum_{x, y \in C_i} proximity(x, y)    (1)

where x and y are examples in the cluster with centroid C_i. Separation measures quantify the level of separation between clusters and are defined by:

separation(C_i, C_j) = \sum_{x \in C_i, y \in C_j} proximity(x, y)    (2)
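A minimal sketch of these two internal metrics, assuming Euclidean distance as the proximity function (the formulas above leave the choice of proximity function open):

```python
import numpy as np
from scipy.spatial.distance import cdist

def cohesion(cluster: np.ndarray) -> float:
    # Eq. (1): sum of pairwise proximities within a single cluster.
    d = cdist(cluster, cluster)                        # all pairwise distances
    return d[np.triu_indices(len(cluster), k=1)].sum() # each pair counted once

def separation(cluster_i: np.ndarray, cluster_j: np.ndarray) -> float:
    # Eq. (2): sum of pairwise proximities between two clusters.
    return cdist(cluster_i, cluster_j).sum()
```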
The proximity function is used to determine how similar a pair of examples is; we can use similarity, dissimilarity, or distance functions. A clustering is good when it has high separation between clusters and high cohesion within clusters. Internal indices are usually used in two families of clustering algorithms, hierarchical clustering algorithms and partitional algorithms, and several metrics try to combine separation and cohesion into a single measure. For partitional algorithms, we can use the silhouette coefficient, Calinski-Harabasz coefficient, Dunn index, Xie-Beni score, and
Hartigan index. In hierarchical clustering algorithms, the Cophenetic Correlation Coefficient and Hubert Statistic can be used to assess the results [13]. On the other hand, external validation methods can be used when we have additional information, i.e., external class labels for the training examples; for this reason, external validation methods are not used on most clustering problems. External metrics also depend on the clustering algorithm used. They compare the set of clusters obtained from the clustering algorithm, denoted C, with the partition obtained from another source, denoted P, which may come from the expert knowledge of the analyst, prior knowledge of the data in the form of class labels, or the results of another clustering algorithm. Then, a contingency matrix is built to evaluate the clusters, using the terms TP: the number of data pairs found in the same cluster in both C and P; FP: the number of data pairs in the same cluster in C but in different clusters in P; FN: the number of data pairs in different clusters in C but in the same cluster in P; and TN: the number of data pairs found in different clusters in both C and P. One of the external validation methods is matching sets, which compares the clusters from C with ones of the same nature from P and then measures the similarity by precision, recall, and purity. Another external validation method is peer-to-peer correlation, which depends on the correlation between pairs, i.e., measuring the similarity between the results of a grouping process for the same set using two different methods. The metrics used to measure the correlation between pairs are the Jaccard coefficient, Rand coefficient, Folkes and Mallows index, and Hubert statistical coefficient. Since our dataset contains the actual labels, we will apply external validation methods using the precision, recall, F1, and average precision score.

3.3 Data Cleanup and Exploratory Data Analysis
The dataset consists of 10 CSV files; each contains benign and malicious network traffic collected on different days.

3.3.1 Number and Percentage of Benign Network Flows in Relation to Malicious Flows
The first thing we want to examine is how many attack and benign records there are in the dataset. Figure 1 shows there are 13,484,708 benign flows, representing 83% of the dataset, while attacks number 2,748,235, making up only 17% of the dataset. Hence, we can see that the data is highly imbalanced.

3.3.2 Number of Flows Per Attack Type
In this section we examine the different types of attacks in the dataset and the number of records corresponding to each type. Figure 2 shows the number of flows accounting for the different attack types. As we can see, DoS attacks-GoldenEye, DoS attacks-Slowloris, DDOS attack-LOIC-UDP, Brute Force-Web, Brute Force-XSS, and SQL Injection have the fewest records in the
Fig. 1. Benign and attack percentage
dataset, which could make it challenging to train the multi-class supervised classifier to detect those kinds of network traffic. This is not the case for the unsupervised classifiers, as they will not be able to distinguish between the attack types anyway.
Fig. 2. Attack types
3.3.3 Correlation Between Features
Correlation refers to how close two features are to having a linear relationship with each other. A heat map was produced to examine the Pearson correlation [14] between all features. We noticed a clear correlation between some of the feature pairs. High correlation between two features affects feature independence and biases the prediction. For a pair of features, if the absolute value of the correlation is > 0.8, the features appear redundant and one of them can be removed.
For example, tot_fwd_pkts (total packets in the forward direction) and fwd_header_len (total bytes used for headers in the forward direction) are correlated, i.e., an increase in the number of packets leads to a higher number of bytes used as headers. Correlations between features and the target values were also investigated. To examine the features with the highest correlations, the Kolmogorov–Smirnov test [15] was used to check whether those features can predict the network flow label. The p-value, the probability of obtaining a result that equals or exceeds the one observed assuming the null hypothesis of no effect is true, gives us a measure of whether the variable affects the target label [16].
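A sketch of this redundancy filter and of the feature-vs-label test, assuming the flows are held in a pandas DataFrame with a 0/1 label column (the column names here are illustrative, matching the flow features mentioned above):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def drop_redundant(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    # Pearson correlation between all numeric feature pairs.
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=redundant)

# Kolmogorov-Smirnov test: does a feature separate benign from attack flows?
stat, p_value = ks_2samp(df.loc[df["label"] == 0, "tot_fwd_pkts"],
                         df.loc[df["label"] == 1, "tot_fwd_pkts"])
```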
4 Modelling

4.1 Train – Test Split
For the supervised machine learning experimentation, we used holdout evaluation, dividing the dataset into three sub-datasets: training, evaluation, and testing, with a ratio of 80/10/10, respectively. Holdout evaluation aims to test a large-data model (16 million records) on data different from the data used to train the model; this provides an unbiased estimate of learning performance. In the unsupervised methods we used the whole dataset, as we considered the data unlabeled, while in the hybrid approach we used a train/test split with a ratio of 80/20. Recall, precision, F1-score, and average precision score were used to evaluate the performance of the classifiers [17]. Furthermore, the split is stratified using the attack category to guarantee that all attacks are represented in the training and test sets according to their occurrences in the dataset.

4.2 Metrics for Evaluation
There are several ways to evaluate the unsupervised models depending on the algorithm used and other information about the proper labels. External validation methods can be used when we have additional information, i.e., external class labels for training examples. In our case, we have the actual labels, and the distribution of classes shows that the dataset is highly imbalanced, with class 0 (benign) contributing ~83% of all samples. For this reason, accuracy is not a suitable performance measure, so two metrics will be used to evaluate a classifier: recall (weighted average) will be the primary metric, since the goal of the classifier should be to detect as many attacks as possible while reducing the false-positive rate, and this is the metric the classifiers will be optimized for. Precision (weighted average) will be used as a secondary metric, as the number of false positives should be kept to a minimum. The average precision score will also be calculated.

4.3 Unsupervised Machine Learning Models
In this section, we will run several unsupervised models. Since our data is big data [3], not all unsupervised algorithms can be applied, so we will choose
algorithms that can be applied to such data, like K-means, mean shift, local outlier factor, and isolation forest. The outcomes obtained from the applied algorithms will be compared to the actual labels for evaluation.

4.3.1 K-means
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. It simply groups similar data points and discovers underlying patterns. In this method, we must define k (the number of clusters), where a cluster refers to a collection of data points aggregated together because of certain similarities. The number k represents the number of centroids desired in the dataset; a centroid is the imaginary or real location representing the center of a cluster. So, k centroids are identified, and then every data point is assigned to the nearest cluster, with the aim of keeping the distances within each cluster as small as possible. For our case, we first standardize the data and reduce the dimensionality to 2 dimensions to reduce complexity and computational time. Then we fit the algorithm on the dataset and choose k = 2 after using the elbow method to determine the best number of clusters. After that, the data points are plotted. The right side of Fig. 3 shows the data points before clustering, and the left side shows them after clustering (0-benign colored in blue, 1-attack colored in orange).
Fig. 3. Distribution of points before (Right) and after (Left) k-Means clustering
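A minimal sketch of this standardize, reduce-to-2D, k = 2 pipeline using scikit-learn, where X stands for the cleaned feature matrix:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X_std = StandardScaler().fit_transform(X)          # standardize the features
X_2d = PCA(n_components=2).fit_transform(X_std)    # reduce to 2 dimensions
# k = 2 was chosen with the elbow method, as described above.
clusters = KMeans(n_clusters=2, random_state=0).fit_predict(X_2d)
```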
Comparing the results with the true labels, we obtained the results in Table 1 below:

Table 1. Results with the true labels

Labels            Precision  Recall  F1-score
0                 0.89       0.74    0.81
1                 0.31       0.56    0.40
Accuracy                             0.71
Macro average     0.60       0.65    0.60
Weighted average  0.79       0.71    0.74
4.3.2 Mean Shift
Another clustering algorithm is mean shift, which assigns the data points to clusters iteratively by shifting points towards the mode; as such, it is also known as the mode-seeking algorithm. It has applications in the fields of image processing and computer vision. Mean shift is a centroid-based algorithm that performs multiple iterations to select candidate centroids as the mean of the points within a given region. To eliminate near-duplicates, the candidate centroids are filtered in a post-processing step. Unlike K-means clustering, mean shift does not require specifying the number of clusters in advance; instead, the number of clusters is determined by the algorithm with respect to the data. Mean shift is computationally expensive [18].
Fig. 4. Estimated number of clusters after mean shift clustering
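A sketch of the corresponding mean shift run; the bandwidth-estimation parameters shown are illustrative assumptions, not the exact values from our runs:

```python
from sklearn.cluster import MeanShift, estimate_bandwidth

# The bandwidth is estimated from a subsample; MeanShift then infers the
# number of clusters itself rather than taking k as an input.
bandwidth = estimate_bandwidth(X_2d, quantile=0.2, n_samples=10_000)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X_2d)
n_clusters = len(ms.cluster_centers_)   # 44 in our experiments
```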
Mean shift created 44 clusters. The majority cluster contains 8,804,512 records, while the remaining records are distributed among the other 43 clusters. Figure 4 shows the 44 clusters. When analyzing the clusters, it is clear that benign records are heavily distributed across more than one cluster, which affects the overall results. The results obtained with respect to the true labels are shown in Table 2 below:

Table 2. Mean shift results

Labels            Precision  Recall  F1-score
0                 0.83       0.54    0.66
1                 0.17       0.46    0.25
Accuracy                             0.53
Macro average     0.50       0.50    0.45
Weighted average  0.72       0.53    0.59
As we can see, K-means achieved a better precision score, while mean shift provided better recall, since it is more robust for outlier detection.

4.3.3 Local Outlier Factor
Local Outlier Factor (LOF) is another unsupervised machine learning algorithm intended for outlier detection. It computes the local density deviation of a given data point with respect to its neighbors and treats as outliers the samples that have a substantially lower density than their neighbors. It works well on high-dimensional datasets.
Fig. 5. Outliers detected by local outlier factor
After standardizing our dataset and reducing its dimensionality, we fit the algorithm on the data. Figure 5 shows the outliers detected in the dataset. We chose the contamination parameter, which represents the expected percentage of attacks in the dataset, to be 0.1; the model detected 1,623,295 outlier observations. Table 3 compares the results with the actual labels. The model did well in detecting inliers (normal traffic) but performed poorly in detecting the outliers (attacks).

Table 3. Local outlier factor results

Labels            Precision  Recall  F1-score
0                 0.82       0.89    0.86
1                 0.09       0.05    0.07
Accuracy                             0.75
Macro average     0.46       0.47    0.46
Weighted average  0.70       0.75    0.72
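A minimal sketch of this LOF configuration, with the contamination parameter set to 0.1 as described above:

```python
from sklearn.neighbors import LocalOutlierFactor

# contamination=0.1 mirrors the expected share of attacks noted above.
lof = LocalOutlierFactor(contamination=0.1)
pred = lof.fit_predict(X_2d)        # -1 = outlier (attack), +1 = inlier (benign)
y_pred = (pred == -1).astype(int)   # map to the 0/1 labels used in the tables
```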
Since the data is highly imbalanced, we decided to apply under-sampling to the normal data by using only 20% of the benign records and all the attacks, obtaining a balanced dataset of around 5 million benign and attack records. The algorithm performed better in detecting attack cases but showed a decrease in performance when identifying benign data. Table 4 shows LOF performance after applying under-sampling.

Table 4. Local outlier factor after under-sampling

Labels            Precision  Recall  F1-score
0                 0.40       0.41    0.40
1                 0.41       0.41    0.41
Accuracy                             0.41
Macro average     0.41       0.41    0.41
Weighted average  0.41       0.41    0.41
4.3.4 Isolation Forest
Isolation forest is another unsupervised algorithm effective for outlier and novelty detection in high-dimensional data. It isolates observations by randomly selecting a feature and a split value between the maximum and minimum values of the selected feature. Here, we train the isolation forest model on the training dataset and then predict the outliers on the testing dataset. Figure 6 shows the outliers (attacks) on the
original dataset and the predicted ones on the testing data. As demonstrated in Table 5, the algorithm performs well for normal traffic and poorly for detecting attacks.
Fig. 6. Outlier detected by isolation forest
Table 5. Isolation forest results

Labels            Precision  Recall  F1-score
0                 0.81       0.82    0.81
1                 0.05       0.05    0.05
Accuracy                             0.69
Macro average     0.43       0.43    0.43
Weighted average  0.68       0.69    0.68
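A sketch of this train-then-predict use of isolation forest; the contamination value is an assumption carried over from the LOF experiment:

```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.1, random_state=0).fit(X_train)
pred = iso.predict(X_test)          # -1 = outlier (attack), +1 = inlier (benign)
y_pred = (pred == -1).astype(int)
```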
4.4 Hybrid Approach (Unsupervised then Supervised)
This section applies a hybrid approach by combining unsupervised models (K-means and LOF) and then using the results as input to the supervised models (logistic regression, random forest, and gradient boost). After that, a voting classifier is used to combine the outcome of those three classifiers. Furthermore, synthetic oversampling will be used for the attacks, and the hybrid approach will then be repeated to examine whether it performs better once the dataset imbalance issue is addressed. The aim of using a hybrid approach is not only better performance but also avoiding the overfitting problem.

4.4.1 K-means and Logistic Regression
After applying k-means to the training and testing data, the results on the training data are used as training data for the logistic regression model for prediction. The logistic model is used to model the probability of a specified event occurring, such as pass/fail or attack/benign [19]. Table 6 shows that while this combination of algorithms provided good precision and recall for benign traffic, it performed poorly in detecting previously unknown attacks.
Table 6. k-Means with logistic regression performance

Labels                   Precision  Recall  F1-score
0                        0.91       0.96    0.94
1                        0.74       0.55    0.63
Accuracy                                    0.89
Macro average            0.82       0.76    0.78
Weighted average         0.88       0.89    0.88
Average precision score  0.48
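One plausible reading of this coupling, sketched below, feeds the k-means cluster assignment to the supervised model as an additional input feature; the exact hand-off between the two stages is not spelled out above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

km = KMeans(n_clusters=2, random_state=0).fit(X_train)
# Append each flow's cluster assignment as an extra feature column.
X_train_h = np.column_stack([X_train, km.labels_])
X_test_h = np.column_stack([X_test, km.predict(X_test)])
clf = LogisticRegression(max_iter=1000).fit(X_train_h, y_train)
y_pred = clf.predict(X_test_h)
```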
4.4.2 K-means and Random Forest
Random forests, or random decision forests, are an ensemble learning method that works by generating several classifiers and combining their results via a majority vote to classify a new instance, achieving better performance on the same training data [20]. As shown in Table 7, K-means performs better in precision and recall for both benign and attacks when coupled with random forest than with logistic regression. The better results can be attributed to how random forest works: creating a set of decision trees from randomly selected subsets of the training set and then aggregating the votes from the different decision trees to decide the final class of the test object.

Table 7. k-Means with random forest performance

Labels                   Precision  Recall  F1-score
0                        0.99       0.99    0.99
1                        0.97       0.95    0.96
Accuracy                                    0.99
Macro average            0.98       0.97    0.98
Weighted average         0.99       0.99    0.99
Average precision score  0.93
4.4.3 Local Outlier Factor and Random Forest
As shown in Table 8 below, local outlier factor provides comparable results to k-means when combined with random forest. Again, such results can be attributed to the nature of random forest described earlier.

Table 8. Local outlier factor and random forest performance

Labels                   Precision  Recall  F1-score
0                        0.99       0.99    0.99
1                        0.96       0.95    0.96
Accuracy                                    0.99
Macro average            0.98       0.97    0.97
Weighted average         0.99       0.99    0.99
Average precision score  0.988
4.4.4 Local Outlier Factor and Logistic Regression
Applying local outlier factor with logistic regression produces an average precision score of 0.94, compared to 0.63 achieved by logistic regression alone. Table 9 below shows the results in more detail.

Table 9. Local outlier factor and logistic regression performance

Labels                   Precision  Recall  F1-score
0                        0.94       0.97    0.96
1                        0.84       0.70    0.76
Accuracy                                    0.93
Macro average            0.89       0.84    0.86
Weighted average         0.92       0.93    0.92
Average precision score  0.94
4.4.5 Local Outlier Factor and Gradient Boost
Gradient boosting is used for regression and classification; it generates a prediction model by combining ensembles of weak learners (typically decision trees) into a single strong learner in an iterative fashion, building a more robust prediction model [21]. Table 10 shows that gradient boost with local outlier factor offers improvements in the precision of detecting attacks and in benign recall, with an average precision score of 0.986 compared to 0.93 obtained from applying gradient boost alone.

Table 10. Local outlier factor and gradient boost

Labels                   Precision  Recall  F1-score
0                        0.99       1.00    0.99
1                        1.00       0.93    0.97
Accuracy                                    0.99
Macro average            0.99       0.97    0.98
Weighted average         0.99       0.99    0.99
Average precision score  0.9867

4.4.6 Voting Classifier
Here, we apply hard voting, or majority voting. The main idea of hard voting is that every individual classifier votes for a class and the majority wins. In statistical terms, the predicted target label of the ensemble is the mode of the distribution of individually predicted labels.
Therefore, we combine the results obtained from the three classifiers, local outlier factor with logistic regression, random forest, and gradient boost, to see if the performance improves. We take the predictions (attack or benign) from each classifier and use the mode as the voting mechanism (i.e., a flow is ruled an attack or benign if at least two out of the three models agree). Table 11 below shows that the voting classifier offers results similar to those obtained by combining local outlier factor with gradient boost.

Table 11. Hard voting classifier

Labels                   Precision  Recall  F1-score
0                        0.99       1.00    0.99
1                        1.00       0.93    0.96
Accuracy                                    0.99
Macro average            0.99       0.97    0.98
Weighted average         0.99       0.99    0.99
Average precision score  0.9863
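A minimal sketch of this majority-vote rule over the three hybrid classifiers' 0/1 predictions:

```python
import numpy as np

# preds_lr, preds_rf, preds_gb: 0/1 predictions from the three hybrid classifiers.
stacked = np.vstack([preds_lr, preds_rf, preds_gb])
# A flow is labeled an attack when at least two of the three models say so.
final = (stacked.sum(axis=0) >= 2).astype(int)
```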
4.4.7 Synthetic Minority Oversampling Technique "SMOTE"
The dataset is imbalanced by nature; therefore, the model cannot effectively learn the decision boundary because there are too few examples of the minority class (attacks). To overcome this challenge, we can utilize SMOTE for oversampling, instead of simply duplicating attacks in the training dataset to balance it with the benign class, before fitting the model. Such a process can balance the class distribution without providing any additional information to the model [22]. In our dataset, we apply the oversampling to the attack types that have fewer than 100,000 records in the training data. After SMOTE, the hybrid implementation of local outlier factor followed by random forest and gradient boost showed an improvement in the results without overfitting.
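A sketch of this targeted oversampling using the imbalanced-learn library (an assumed implementation choice); the "Benign" majority-class label name is also an assumption:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

counts = Counter(y_train)   # y_train holds the per-attack-type labels here
# Oversample only the attack types with fewer than 100,000 training flows.
strategy = {c: 100_000 for c, n in counts.items()
            if n < 100_000 and c != "Benign"}
X_res, y_res = SMOTE(sampling_strategy=strategy,
                     random_state=0).fit_resample(X_train, y_train)
```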
LOF with random forest after SMOTE provided a recall of 0.97, compared to 0.95 obtained without SMOTE, as shown in Table 12.

Table 12. Local outlier factor & random forest

Labels                   Precision  Recall  F1-score
0                        0.99       0.99    0.99
1                        0.97       0.97    0.97
Accuracy                                    0.99
Macro average            0.98       0.98    0.98
Weighted average         0.99       0.99    0.99
Average precision score  0.992
Similarly, applying SMOTE with LOF and gradient boost achieved a recall of 0.94, compared to 0.93 without SMOTE, as shown in Table 13.

Table 13. Local outlier factor & gradient boost

Labels                   Precision  Recall  F1-score
0                        0.99       1.00    0.99
1                        1.00       0.94    0.97
Accuracy                                    0.99
Macro average            0.99       0.97    0.98
Weighted average         0.99       0.99    0.99
Average precision score  0.988
5 Results and Conclusions
The work discussed in this paper suggested an effective prediction model for network intrusion detection based on the CSE-CIC-IDS2018 dataset. The work included big data processing, cleansing, feature engineering, under-sampling, and oversampling. The research focused on achieving highly accurate intrusion detection utilizing machine learning approaches, specifically by combining unsupervised and supervised machine learning. This unique approach demonstrated promising results. Results were evaluated with external validation methods using the precision, recall, F1, and average precision score. Table 14 compares our results with the previous related work.
With such a big dataset, k-means performed better than mean shift, producing an F1-score of 0.74 compared to 0.59 and an attack recall of 0.56 compared to mean shift's 0.46. Unsupervised outlier detection methods were examined as well, namely local outlier factor and isolation forest. Local outlier factor produced an F1-score of 0.72 but performed poorly in attack recall with a score of 0.05; nevertheless, it scored 0.89 for benign recall. Since the dataset is highly imbalanced, under-sampling was performed by taking only 20% of the benign data and all attacks; applying the local outlier factor then resulted in a poor F1-score of 0.41, with 0.41 attack recall and 0.41 benign recall. On the other hand, isolation forest produced an F1-score of 0.68, 0.05 attack recall, and 0.82 benign recall. Next, a hybrid strategy was performed combining unsupervised and supervised algorithms. After applying k-means to the training data, two supervised models were implemented: logistic regression and random forest. While the former produced 0.89 accuracy and an F1-score of 0.88, the latter performed better, producing an accuracy and F1-score of 0.99.

Table 14. Comparing our results with previous work

Our results (complete CSE-CIC-IDS2018 dataset of 16 million records; hierarchical clustering for feature selection; hard voting to combine results from different hybrid approaches):
  LOF & gradient boost                       Precision 1     Recall 0.93  F1-score 0.99
  Voting Classifier (for hybrid approach)    Precision 1     Recall 0.93  F1-score 0.99
  LOF with gradient boost                    Precision 0.97  Recall 0.97  F1-score 0.99
  SMOTE (LOF & gradient boost)               Precision 1     Recall 0.94  F1-score 0.99

Soheily-Khah, Marteau, & Béchet, 2018 (ISCX2012 dataset, which doesn't include HTTPS, with only 2M records; K-means was used as an under-sampling technique; DNS flows reduced by 82%, from 309K to 55K; HTTPWeb flows reduced by 46%, from 881K to 367K):
  Hybrid: K-means and Random Forest (kM-RF)  Precision 0.999  Recall 1 for SSH  F1-score 1 for SSH

Verkerken, D'hooge, Wauters, Volckaert, & De Turck, 2020 (CIC-IDS-2017 dataset; the training set used data collected on Monday, the benign day, to generate a sample of 50,000 benign-only records for training; 200,000 records for validation and testing):
  Autoencoders (stacked)                     0.95, 0.98

Farhan, Maolood, & Hassan, 2020 (CSE-CIC-IDS2018 dataset; the algorithm was trained on only around 162,500 records; we achieved comparable results for DDoS):
  Dense deep neural network (DNN)            FTP-BruteForce: P 0 / R 0 / F1 0; SSH-BruteForce: P 0.87 / R 0.27 / F1 0.41; DDOS attack-LOIC-UDP: 1/1/1

Kumar, Glisson, & Benton, 2020 (KDD'99, an outdated dataset):
  Mean shift                                 81.2, 0.75, 0.96, 0.99+
Furthermore, the hybrid approach was performed by combining local outlier factor with logistic regression, random forest, and gradient boost. With logistic regression, the approach resulted in 0.93 accuracy and a 0.92 F1-score; the attack recall was 0.70 with an average precision score of 0.94. Local outlier factor with random forest resulted in 0.99 accuracy, 0.99 F1-score, 0.95 attack recall, and 0.988 average precision score. Local outlier factor with gradient boost resulted in 0.99 accuracy, 0.99 F1-score, 0.93 attack recall, and 0.987 average precision score; the gradient boost results are close to those of random forest. In general, using local outlier factor with supervised methods produced better results than using K-means with supervised methods. After applying the local outlier factor with the three supervised methods, hard voting was used in an attempt to obtain better results; the results of hard voting were 0.99 accuracy, 0.99 F1-score, 0.93 recall for attacks, 1.0 recall for benign, and 0.986 average precision score. Thus, the results were close to the ones obtained when applying local outlier factor with gradient boost. To overcome the challenges witnessed by [3] in their research, discussed earlier in the literature review, and in addition to using the hybrid ML approach, which heavily contributed to reducing the overfitting challenge, we used feature correlation analysis to locate the critically important features on which others depend, reducing possible overfitting caused by redundant features, together with feature engineering to optimize the dataset.
We also used synthetic minority oversampling (SMOTE) to overcome the overfitting challenge caused by the heavily imbalanced nature of the dataset. When we applied SMOTE to the training data and performed the hybrid approach again, accuracy improved. For example, the attack recall for local outlier factor and random forest was 0.97 with an average precision score of 0.99, and the attack recall for local outlier factor and gradient boost was 0.94 with an average precision score of 0.988. The dataset subject to this study is big data, and applying the unsupervised methods has some limitations, as some unsupervised algorithms need extensive computation, which makes them unsuitable for big data. We faced challenges detecting anomalies using outlier detection algorithms, since the attack records are not far from the benign ones. We applied hard voting to the hybrid classifications; trying soft voting, by giving different weights to classifiers based on their performance, could be of benefit. Finally, further studies on deep learning methods such as autoencoders should be explored.
References
1. Singh, U.K., Joshi, C., Kanellopoulos, D.: A framework for zero-day vulnerabilities detection and prioritization. J. Inf. Secur. Appl. 46, 164–172 (2019)
2. Grill, M., Pevný, T., Rehak, M.: Reducing false positives of network anomaly detection by local adaptive multivariate smoothing. J. Comput. Syst. Sci. 83(1), 43–57 (2017)
3. Leevy, J.L., Khoshgoftaar, T.M.: A survey and analysis of intrusion detection models based on CSE-CIC-IDS2018 Big Data. Journal of Big Data 7(1), 1–19 (2020)
4. Kumar, A., Glisson, W., Benton, R.: Network attack detection using an unsupervised machine learning algorithm. In: Proceedings of the 53rd Hawaii International Conference on System Sciences (2020)
5. Thakkar, A., Lohiya, R.: A review of the advancement in intrusion detection datasets. Procedia Computer Science 167, 636–645 (2020)
6. Ferrag, M.A., Maglaras, L., Moschoyiannis, S., Janicke, H.: Deep learning for cyber security intrusion detection: approaches, datasets, and comparative study. J. Inf. Secur. Appl. 50, 102419 (2020)
7. Kanimozhi, V., Jacob, T.P.: Calibration of various optimized machine learning classifiers in network intrusion detection system on the realistic cyber dataset CSE-CIC-IDS2018 using cloud computing. Int. J. Eng. Appl. Sci. Technol. 4(6), 2455–2143 (2019)
8. Laskov, P., Düssel, P., Schäfer, C., Rieck, K.: Learning intrusion detection: supervised or unsupervised? In: International Conference on Image Analysis and Processing, Heidelberg, Berlin (2005)
9. Gogoi, P., Borah, B., Bhattacharyya, D.K.: Anomaly detection analysis of intrusion data using supervised & unsupervised approach. J. Convergence Inf. Technol. 5(1), 95–110 (2010)
10. Farhan, R.I., Maolood, A.T., Hassan, N.: Performance analysis of flow-based attacks detection on CSE-CIC-IDS2018 dataset using deep learning. Indonesian J. Electr. Eng. Comput. Sci. 20(3), 1413–1418 (2020)
11. Soheily-Khah, S., Marteau, P.F., Béchet, N.: Intrusion detection in network systems through hybrid supervised and unsupervised machine learning process: a case study on the ISCX dataset. In: 1st International Conference on Data Intelligence and Security (ICDIS) (2018)
12. Fernández, A., Garcia, S., Herrera, F., Chawla, N.V.: SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905 (2018)
13. Palacio-Niño, J.O., Berzal, F.: Evaluation metrics for unsupervised learning algorithms. arXiv preprint arXiv:1905.05667 (2019)
14. Nasir, I.M., et al.: Pearson correlation-based feature selection for document classification using balanced training. Sensors 20(23) (2020)
15. Moscovich, A.: Fast calculation of p-values for one-sided Kolmogorov-Smirnov type statistics. arXiv preprint arXiv:2009.04954 (2020)
16. Goodman, W.M., Spruill, S.E., Komaroff, E.: A proposed hybrid effect size plus p-value criterion: empirical evidence supporting its use. The American Statistician 73(sup1), 168–185 (2019)
17. Abusalah, M.: Cross language information retrieval using ontologies. University of Sunderland, Sunderland (2008)
18. Vatturi, P., Wong, W.K.: Category detection using hierarchical mean shift. In: 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2009)
19. Tolles, J., Meurer, W.J.: Logistic regression: relating patient characteristics to outcomes. JAMA 316(5), 533–534 (2016)
20. Oshiro, T.M., Perez, P.S., Baranauskas, J.A.: How many trees in a random forest? In: International Workshop on Machine Learning and Data Mining in Pattern Recognition, Berlin, Heidelberg (2012)
21. Bentéjac, C., Csörgő, A., Martínez-Muñoz, G.: A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review 54(3), 1937–1967 (2021)
22. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Gauging Biases in Various Deep Learning AI Models

N. Tellez, J. Serra, Y. Kumar(B), J. J. Li, and P. Morreale

School of Computer Science and Technology, Kean University, Union, NJ 07083, USA
{tellezn,serrajo,ykumar,juli,pmorreal}@kean.edu
Abstract. With the broader usage of Artificial Intelligence (AI) in all areas of our life, the accountability of such systems is one of the most important topics for research. Trustworthiness of AI results requires very detailed and careful validation of the applied algorithms, as some errors and biases can reside deep inside AI components, which might affect inclusiveness, equity, and justice and irreversibly influence human lives. It is critical to detect them and to reduce their negative effect on AI users. In this paper, we introduce a new approach to bias detection. Using Deep Learning (DL) models as examples of a broader scope of AI systems, we make the models capable of self-detecting their underlying defects and biases. Our system looks 'under the hood' of AI-model components layer by layer, treating the neurons as similarity estimators, which we claim is the main indicator of hidden defects and bias. In this paper, we report on the result of applying our self-detection approach to a Transformer DL model and its Detection Transformer (DETR) object detection framework, introduced by the Facebook AI Research (FAIR) team in 2020. Our approach automatically measures the weights and biases of transformer encoding layers to identify and eventually mitigate the sources of bias. This paper focuses on the measurement and visualization of the weights and biases of the DETR-model layers. The outcome of this research will be our implementation of a modern Bias Testing and Mitigation platform. It will be open to the public to validate AI applications and mitigate their biases before their usage. Keywords: Artificial Intelligence (AI) · Deep Learning (DL) · Bias detection · Bias mitigation · Bias convergence · Detection Transformer (DETR)
1 Introduction
The Machine Learning (ML) area of Artificial Intelligence (AI) has become mainstream in AI-related research. Deep Learning (DL), the subject of this paper, is a complex multi-layer type of machine learning. A deep Neural Network (NN) has many layers with many nodes, and even understanding its architecture mathematically is challenging; it may take extended time and requires a strong multidisciplinary background in physics, biology, mathematics, algorithms, and AI. We move further and investigate the NN structure in depth, detect and gauge weights and biases, and look for ways to mitigate the found biases and improve the NN model overall. Our research on mitigating biases is currently ongoing. In this paper, we focus on our main questions/hypothesis about
bias gauging, but do not discuss the experimental approaches we took for bias mitigation. Many papers focus on biases in AI, and we agree that biases might be present in the data input (dataset), in the model structure, in the interpretation of results, and in between. Most existing studies focus mainly on the dataset, its classification, and/or on result validation/verification. We discovered that such a focus is not sufficient [1], and thus our research focuses on the hyperparameters, such as optimizers and weights, of the deep NN itself. We train the deep learning models to self-detect and eventually self-mitigate the detected bias. In our research, we work exhaustively with various optimizers to observe their impact on accuracy and biases. We discovered that they indeed play an important role in model optimization (as their name states), as well as in biases. Our main optimizer is AdamW (explained in detail further below). We also improve our AI/DL model utilizing Nesterov momentum. One of the prominent deep learning models is the object detection and classification model developed by Facebook: the Transformer Neural Network (TNN) and the associated framework, Facebook's End-To-End Object Detection with Transformers (DETR). Before DETR's broader usage, it is the right time to investigate its potential biases [2]. Our research identifies and then validates biases in AI/DL models based on our interpretation of neurons as similarity estimators [1]. With this interpretation, we further claim that the structure of Neural Networks (NNs) has an underlying principle similar to the nearest radius neighbor approach that we previously investigated in depth [3, 4]. In this research, we use the same principle to investigate DETR models. Such an approach might have some limitations compared to the relatively simple Radius of Neighbors (RN) method that we previously discovered and successfully applied to natural language prediction and translation [4]. Previously, we discovered that RN is more effective than K-Nearest Neighbors (KNN) in that, instead of picking several closest nodes (by distance), it looks for so-called neighbors inside a set radius (in other words, a circle) representing the most relevant training samples. RN also works in situations of very high dimensionality, even though it may get more complex as the number of dimensions grows. We claim that our research fills a gap: a unique way to interpret NNs as similarity estimators. As RN and its application RN-chatter [4] are novel, unique, and our own, its application to NNs and their nodes is also custom, representing the state of the art. Our RN approach was later adapted into the self-attention concept of transformers. In our investigation, we started by implementing a DETR model [5] and training it on a unique set of pollen data obtained locally. We then ran a variety of simulations to determine the best possible parameters for the model and to validate the model's convergence to the optimal curve with the least error. During this initial step of model training, our in-house tool automatically collects the measurements of the first transformer encoding layer's weights and biases. With the help of the Google Colaboratory (Colab) PRO+ [6] tool, we were able to analyze over 301 epochs at once to get a better understanding of the NN's internal structure. We then collected the weights and biases of the specific layers of interest.
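A minimal sketch of these two optimizer choices in PyTorch, where model is the DETR network being trained and the hyperparameter values are illustrative assumptions, not the exact values from our runs:

```python
import torch

# AdamW: Adam with decoupled weight decay, our main optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# SGD with Nesterov momentum, the alternative we experimented with.
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, nesterov=True)
```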
The rest of the paper is organized as follows. Section 2 describes in detail our experimental setting; Sect. 3 presents our experimental results followed by discussions of results in Sect. 4. Section 5 will conclude with some observations.
2 Experimental Setting

After the initial setup of implementing a DETR model, we proceeded with the manual and automated training of the classification with the help of the transformer. For our research, we chose unique, not publicly available data of microscopic pollen images. The source of our data was the AccuPollen™ Allergy Tracker App [7], recently developed by a group of researchers from Kean University (Union, NJ). The App is connected to a pollen station specially installed for this purpose on the roof of one of the campus buildings. The pollen images were collected to understand the impact of high pollen counts on COVID-19 susceptibility [8]; another group of researchers from our university found that pollen allergy does add to the susceptibility to viral infection and negatively affects human health. The DETR architecture is made up of two main components: a Convolutional Neural Network (CNN), a backbone for feature extraction, and an encoder-decoder transformer, the TNN. In addition, there is a feed-forward network (FFN) that helps set the height and width of the object detected in the pollen image. We later added a Faster R-CNN model that allows one to perform the classification predictions. It helps identify an accurate area of where the detected object is in the image (presented as green boxes in Fig. 1).
Fig. 1. Underlying deep NN architecture
The architecture underlying DETR is explained in more detail in [9]; we built our code on top of this framework. Initially, we had some difficulty with the DETR source code, as parts of the version officially released by the Facebook research team were not up to date and some functions of the open-source machine learning framework PyTorch had already been deprecated. It took us almost six months to update and debug the transformer and make it work properly. The data used for the project is unique and one of a kind, only recently collected (and still being collected); we had no previous knowledge of it and could not predict what to expect from our study. Section 3 below explains it in more detail.
3 Experimental Results

This section presents the raw data we used and the approach we took to detect biases in our AI pollen system.
3.1 Raw Data

Our dataset included more than five thousand microscopic pollen images, assigned to 44 different classes of pollen species. We first evaluated the accuracy of the training set classification (discussed in more detail below), then decided to transform our images to best comply with the transformer's requirements, which achieved an even more accurate classification (with accuracy close to 90%). We then proceeded with model validation. At a glance, our data is straightforward, as can be seen in Table 1:

Table 1. The project dataset.

Parameter      | The value   | Comments
Training data  | 5000 images | After the removal of corrupted images
Testing data   | 318 images  |
Classification | 44 classes  |
Our data came from the pollen station located at Kean University (Union, NJ) and the AccuPollen™ Allergy Tracker App available in the Apple App Store. Figure 2 below shows the graphical user interface of the AccuPollen App [10], demonstrating the types and density of pollen in the area shown on the map. The Tracker is connected to many pollen stations nationwide, as can be seen in the image below.
Fig. 2. Data source: AccuPollen™ allergy tracker app (Developed as a Part of Kean University, Union NJ, Pollen Project) [10, 11]
We used the OpenCV library cv2 [12] and its methods imread() and haveImageReader() to start the image processing. After an initial analysis of the pollen images, we discovered that some of them were corrupted, had too much noise, and were not processed correctly by the DETR model. As a result, we had to remove corrupted images (.JPG files) and their corresponding YOLO annotation files (.TXT) from our dataset. Figure 3 shows some examples of unwanted pollen data.
Fig. 3. Pollen image with pollen detection showing some noisy data.
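The cleaning step described above can be sketched as follows. The directory layout and the .JPG/.TXT pairing are assumptions for illustration, not the authors' exact script.

```python
import os
import cv2

def remove_unreadable(data_dir: str) -> int:
    """Delete images OpenCV cannot decode, together with their YOLO labels."""
    removed = 0
    for name in os.listdir(data_dir):
        if not name.lower().endswith(".jpg"):
            continue
        path = os.path.join(data_dir, name)
        # haveImageReader() reports whether OpenCV has a decoder for the file;
        # imread() returning None catches files that fail to decode anyway.
        if not cv2.haveImageReader(path) or cv2.imread(path) is None:
            os.remove(path)
            label = os.path.splitext(path)[0] + ".txt"  # matching YOLO annotation
            if os.path.exists(label):
                os.remove(label)
            removed += 1
    return removed

print(remove_unreadable("pollen_dataset/train"), "corrupted images removed")
```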
The next step was to download the pretrained weights (initially we used only DETR R50, but later ran simulations with DETR R101) provided by Facebook, for transfer learning to speed up training. The number of corrupted images was not significant (less than 3%) and their removal did not affect the project much. We concluded that corrupted images would not affect the results of this study and from that point proceeded only with the cleaned-up data. An example is shown in Fig. 4.
Fig. 4. Uncorrupted (Clean) image of Juglans Californica pollen species
If any corrupted images were detected by OpenCV, they were removed. A very insignificant number of images were removed from the total dataset, and we still had over 5,000 images to work with, which did not impact the model performance.

3.2 Model Training

Using the PyTorch library [13], the Python programming language, and the Google Colaboratory service [6], we were able to run the model over several epochs and manipulate hyperparameters, for example the number of queries, which determines the maximum number of objects the DETR model can detect in a single image. Since the model contains a no-object class (∅) that can be detected (empty green squares in Fig. 5), the number of object queries can vary; the recommended number for Common Objects in Context (COCO) is 100, but for this model we used 30 [1]. We used the pycocotools library to work with the COCO dataset [14]. Figure 5 demonstrates erroneous (left) vs. valid (right) classification of the pollen species. Several models were trained with different parameter settings to better understand the effects bias has on test data. Some models with a low threshold and a high number of queries were too sensitive and detected null objects as pollen images
Fig. 5. The images displaying multiple no-object classes (∅) erroneously determined to be a type of pollen species (Epoch 301, Left) vs. the test image displaying an accurate number of pollen species detected (Epoch 100, Right)
even with high epochs. The right image in Fig. 5 shows that, with a good threshold and number of queries, an accurate number of pollen species can be detected and all null objects are ignored.

3.3 Optimizers

Optimization in ML algorithms can go a long way in minimizing the objective loss function of a trained model and improving its metrics, such as accuracy. One widely recognized and used optimizer, available both in TensorFlow (by Google) and in PyTorch (developed by Facebook), is the Adam optimizer. Adam is short for Adaptive Moment Estimation and has various applications within the domain of Computer Vision (CV) in the field of AI. One of the underlying insights behind the Adam optimizer is that it implements a form of stochastic gradient optimization [15]. The advantages of the Adam optimizer include the following: it can work with sparse gradients, which could benefit a biased regulatory system; it remains invariant when the gradient is rescaled [16]; and it performs a form of step-size annealing [17]. The Adam algorithm can counteract initialization bias, improving the first-moment and second-raw-moment vector estimates. The exponential decay rates within the model prevent a large bias from having a huge effect on the optimization of the model. In our code, we used the AdamW optimizer [15] and adjusted the learning rate through PyTorch's torch.optim.lr_scheduler [18] based on the number of epochs.

3.4 Visualization of Weights and Biases

Saving checkpoints at every epoch step, we collected 786 bias node values for the first transformer encoding layer over 301 epochs. This allows us to plot 786 lines displaying the change in bias for this transformer layer. Additionally, at each node 256 weight values were also collected, which, if visualized, would give a total of 196,608 unique weights for this layer at one epoch. For visualization purposes, only the first and second weights were selected, to display a cleaner visual representation. Other weights can be visualized and shown with the script, and future work could present each weight in a graphical user interface for analysis, similar to how TensorBoard works in TensorFlow.
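The training setup of Sects. 3.3 and 3.4 can be sketched as follows. The StepLR schedule, the learning rates and the checkpoint naming are assumptions; the paper states only that AdamW and torch.optim.lr_scheduler were used and that a checkpoint was saved at every epoch.

```python
import torch

# DETR R50 with pretrained weights, as published by the Facebook research team
model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)

for epoch in range(301):
    # ... forward pass, loss.backward() and optimizer.step() over the pollen batches ...
    scheduler.step()  # adjusts the learning rate as a function of the epoch count
    # One checkpoint per epoch is what makes the per-epoch weight/bias plots possible.
    torch.save({"epoch": epoch, "model": model.state_dict()}, f"ckpt_{epoch:03d}.pth")
```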
This data set cannot be averaged; hence the variable is not numerical but categorical, giving insight only into the model's performance. All the first weights for all 786 nodes of the first layer and their convergence can be observed in Fig. 6 (left); all 786 biases for the same layer are displayed next to it (right).
Fig. 6. Convergence of the first and second weight values after the 200th epoch
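The per-epoch curves in Figs. 6, 7 and 8 can be reproduced from the saved checkpoints along the following lines. The state_dict key is an assumption about which tensor of the first transformer encoding layer the measurements were read from.

```python
import torch
import matplotlib.pyplot as plt

KEY = "transformer.encoder.layers.0.linear1"  # assumed layer of interest
epochs, biases, w1, w2 = [], [], [], []

for epoch in range(301):
    state = torch.load(f"ckpt_{epoch:03d}.pth")["model"]
    epochs.append(epoch)
    biases.append(state[KEY + ".bias"][:20].tolist())  # first 20 bias nodes
    w1.append(state[KEY + ".weight"][0, 0].item())     # first weight
    w2.append(state[KEY + ".weight"][0, 1].item())     # second weight

plt.plot(epochs, biases)            # one line per bias node, as in Fig. 7
plt.plot(epochs, w1, epochs, w2)    # first and second weights, as in Figs. 6 and 8
plt.xlabel("epoch")
plt.show()
```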
Figure 7 below shows the convergence of the bias values after the 200th epoch, shown only for the first transformer encoder layer, as it was not feasible to create a meaningful graph with more detail. Each color on the graph represents one of the first 20 nodes.
Fig. 7. Convergence of the bias values after the 200th epoch
Figure 8 shows the corresponding graph of weight values, demonstrating the convergence of the first and second weights after the 200th epoch. The first and the second weights, out of the 256 weights per node, were chosen to display the convergence that occurs at and after epoch 200 for the first encoding
layer of the transformer. With a total of 786 bias nodes (values) that could be plotted, the first twenty nodes were chosen to give a visual representation of the bias values also converging at approximately epoch 200. The first twenty nodes were chosen because they show much of the propagation of the bias through the data being analyzed by the multi-layer perceptron (MLP) [19].
Fig. 8. Convergence of the first and second weight values after the 200th epoch
The Python script we used allowed us to tune the optimizer and the learning rate of the model, so a future area of investigation is to see how different hyperparameters affect the bias and weight convergence. One of the significant results we were able to achieve is finding an optimum point in our model training process. Figure 9 below shows that when convergence is reached at epoch 200, our train loss drops significantly, and so does our test loss, although not as dramatically. To verify our results, we ran our model on DETR R101 (previously all trials were done using DETR R50). The test/train loss results when
Fig. 9. The test/train loss results with convergence optima at the 200th epoch for two different backbones: a ResNet-50 (Left) and a ResNet-101 (Right).
using the pretrained weights of both transformers are presented below. We discovered that DETR R101 is a little less accurate. With these results, one will be able to apply a grey-box method to improve the model by adding more layers, reducing the number of nodes, and introducing learning schedules. This will be implemented in future work, showing how visualizing the biases is an important aspect of trying to mitigate them. As the model training passes the 200th epoch, the test loss increases significantly while the train loss continues to drop. This difference in behavior shows that the model is overfitting past epoch 200. We are very interested in testing our model on another dataset to see whether the optimum found persists. The model result can be seen in Table 2.

Table 2. The convergence optima on the 200th epoch

Number of epochs | The graph behaviour | Comments
84% for the Handgun class, whereas Ys has Recall > 46% for the Rifle class. The graphs in Fig. 4 are not specific to a single class but cover all five classes (overall model comparisons). Table 2a presents data on individual class performance in detecting true predictions.
Fig. 3. Detections & confidence scores of YOLOv5 model. Ys : (a,b,c,d); Ym : (e,f,g,h); Yl : (i,j,k,l)
Fig. 4. Training comparison of YOLOv5 Ys, Ym and Yl models with different metrics: Precision, Recall, mAP@0.5 and mAP@0.5:0.95
Fig. 5. Loss graphs: (a) Training loss; (b) Validation loss
5 Conclusions and Amelioration
Different augmentation techniques were applied to increase the number of images in the dataset and to add images with different resolutions. The firearms dataset includes some firearm images that are blurred by more than 50%, with their resolution decreased to as low as 150 × 150 pixels and then rescaled to 640 × 640 without losing their originality. YOLOv5 models (Ys, Ym, Yl) were trained on the generated firearms dataset and more than 70 different experiments were conducted. The experimental results show that Ys has the highest mAP@0.5, Precision and Recall, followed by Ym and Yl respectively. In addition, Ys is the fastest model at detecting and classifying firearms, followed by Ym and Yl. Considering Recall, Ym scored the largest recall value (Recall > 84%) in detecting handguns, followed by Ys and Yl respectively, whereas Yl scored the largest recall value (Recall > 48%) in detecting rifles, followed by Ym and Ys respectively. An extension of this research is to prune the trained models by a certain percentage to increase detection speed and reduce model size, and to ensemble the individual models for live detection, a multi-class classification that avoids false alarms when, for example, a police officer is handling a weapon. A further extension includes a custom neural network module that can replace the C3 module in the YOLOv5 structure to improve the accuracy of the firearm detection model.
PID Parameter Tables for Time-Delayed Systems, Found with Learning Algorithms According to Minimum IAE, ITAE and ISE Criteria

Roland Büchi(B)

School of Engineering, Zurich University of Applied Sciences, Winterthur, Switzerland
Abstract. For single-input, single-output (SISO) systems, PID controllers have long been by far the most frequently used controllers. The control of time-delayed systems is particularly challenging. There are some heuristic methods that can be used in the time and frequency domain, for example those of Ziegler and Nichols or Chien, Hrones and Reswick. The parameters found with them result in asymptotically stable regulated systems, but in most cases they show transient behavior that still requires readjustment. This publication presents a table of PID parameters that can be used to control systems with time-delayed, stable step responses. It minimizes the common quality criteria in the time domain: IAE, ITAE and ISE. The determination of the parameter sets is very computationally intensive; therefore, an approach from the field of artificial intelligence was chosen for their calculation. The application of the parameter sets found is verified using the example of the control of a liquid level. The parameter sets also take into account the controller output limitations that are relevant in practice. They can basically be used for all PID controllers of controlled systems with a time delay.

Keywords: Control theory · Machine learning · ITAE · ISE · IAE criteria
1 Introduction and Related Work

Time-delayed systems have special control requirements because delayed signals are quite difficult to control. In practice, however, they are very common, especially in process engineering or in thermal systems, since the sensor often cannot be placed directly next to the actuator. Different approaches are known for finding PID controller parameters from step responses of time-delayed systems. All of them result in stable control systems; however, these parameters must be further optimized afterwards. The first approach was the parameter set of Ziegler and Nichols [1]. Several others also exist, for example that of Chien, Hrones and Reswick [2]. For further optimization, methods from the field of artificial intelligence are also known, for example particle swarm optimization, PSO [3].
In this paper, the hill climbing method [4] is used. It is another stochastic method for optimizing controllers, related to PSO. The important difference is that the parameter search proceeds serially rather than in parallel. Furthermore, due to the missing social velocity component of the parameter search, more local minima of the criteria are found with hill climbing, and therefore the probability of finding smaller local minima is increased. At the end, the smallest of the local minima is accepted as the result. Various other methods are also used for the optimization of controllers [5–11].
2 Modelling of Time-Delayed Systems

In order to be able to deal with such delayed systems in terms of control technology at all, a mathematical model is required first. There are many such models in the literature, for example with time delay elements or combined with several serial PT1 systems; they are dealt with in the literature [12, 13]. The PID parameter tables, the creation of which is described and used in this document, refer to PTn systems with identical time constants. These are identical first-order systems (PT1 elements) connected in series. Such systems are also very common outside of process engineering and can be found in all engineering disciplines. The step response of such elements leads to delayed signals. In this way, dead times can also be modeled well with linear models. The transfer function for such systems is

Ks / (s·T1 + 1)^n    (1)
2.1 Time Percentage Value Method

A method that can be used to model these PTn systems as described above is the time percentage value method [14]. Here, the step response according to Fig. 1 is measured and the times t10, t50 and t90 are determined.
Fig. 1. Step response of a PTn-system, time percentage value method.
Table 1 shows further parameters that have to be calculated. The order n is determined by the parameter μ, according to formula (2) below:

μ = t10 / t90    (2)
Ks is generally determined by the formula:

Ks = (static final output value after step − static final output value before step) / (input value after step − input value before step)    (3)

And the identical time constants T1 of the n PT1 elements are calculated as

T1 = (α10 · t10 + α50 · t50 + α90 · t90) / 3    (4)
Table 1. Coherences n, μ, α10, α50, α90

n, PTn   | μ     | α10   | α50   | α90
2, PT2   | 0.137 | 1.880 | 0.596 | 0.257
3, PT3   | 0.207 | 0.907 | 0.374 | 0.188
4, PT4   | 0.261 | 0.573 | 0.272 | 0.150
5, PT5   | 0.304 | 0.411 | 0.214 | 0.125
6, PT6   | 0.340 | 0.317 | 0.176 | 0.108
8, PT8   | 0.396 | 0.215 | 0.130 | 0.085
10, PT10 | 0.438 | 0.161 | 0.103 | 0.070
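The identification procedure of formulas (2)-(4) and Table 1 lends itself to a few lines of code. The paper itself works in Matlab/Simulink; the following Python sketch, with hypothetical measured times, chooses the order n whose tabulated μ is closest to t10/t90 and then computes T1.

```python
TABLE = {  # n: (mu, alpha10, alpha50, alpha90) from Table 1
    2: (0.137, 1.880, 0.596, 0.257), 3: (0.207, 0.907, 0.374, 0.188),
    4: (0.261, 0.573, 0.272, 0.150), 5: (0.304, 0.411, 0.214, 0.125),
    6: (0.340, 0.317, 0.176, 0.108), 8: (0.396, 0.215, 0.130, 0.085),
    10: (0.438, 0.161, 0.103, 0.070),
}

def identify(t10: float, t50: float, t90: float):
    mu = t10 / t90                                    # formula (2)
    n = min(TABLE, key=lambda k: abs(TABLE[k][0] - mu))
    _, a10, a50, a90 = TABLE[n]
    T1 = (a10 * t10 + a50 * t50 + a90 * t90) / 3.0    # formula (4)
    return n, T1

# Ks follows separately from formula (3): output step divided by input step.
print(identify(t10=8.0, t50=27.0, t90=62.0))  # hypothetical measured times
```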
3 Calculation of the Optimal Parameters for PID Controllers with the Minimization of the Quality Criteria IAE, ITAE and ISE
Fig. 2. General block diagram of a PTn system, controlled with a PID controller.
The block diagram of a controlled system is shown in Fig. 2. The system to be controlled is the PTn found above with n PT1 elements connected in series. This also includes the parameters Ks, T1 and n found with the methods described above. Finding the PID parameters is a topic in control engineering with great potential, because the parameters can be optimized using various methods in order to achieve good transient behavior
[5–11]. In addition to various methods from the frequency domain, the IAE, ITAE [15] and ISE criteria are also used in the time domain. They describe the error area of a step response of the controlled system. These error areas are shown in Fig. 3.

IAE = ∫₀^∞ |e(t) − e(∞)| dt;   ITAE = ∫₀^∞ |e(t) − e(∞)| · t · dt;   ISE = ∫₀^∞ [e(t) − e(∞)]² dt    (5)

It can be seen from this that the IAE criterion computes the magnitude of the error area. The ITAE criterion additionally takes time into account, which means that the error area is weighted more heavily as time progresses. The IAE and ITAE criteria are also called L1 criteria. The ISE criterion does not compute the error area per se, but its square; this means that it does not have to work with the absolute value, since the negative signs cancel out when squared. The ISE criterion is also called the L2 criterion.
Fig. 3. Error areas in the step response of the closed-loop system according to Fig. 2, for calculating the IAE, ITAE and ISE criteria.
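For a sampled step response, the three criteria of formula (5) reduce to sums. A minimal sketch, assuming a uniform sampling step dt and taking the last sample as the stationary value e(∞):

```python
import numpy as np

def criteria(e: np.ndarray, dt: float):
    dev = e - e[-1]                 # e(t) - e(inf), last sample as steady state
    t = np.arange(len(e)) * dt
    iae = np.sum(np.abs(dev)) * dt
    itae = np.sum(np.abs(dev) * t) * dt
    ise = np.sum(dev ** 2) * dt
    return iae, itae, ise
```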
Because the time-delayed systems were approximated as PTn elements in Sect. 2, all PID controller parameters Kp, Ti and Td could in principle be calculated for all orders n and different Ks and T1: one would simply have to calculate the quality criteria described above for the step responses, and the parameters for the minimum quality criteria could then be presented as table values. The problem is that this has to be done over a multi-dimensional space (for example order n, Kp, Ti, Td, quality criterion, various limitations of the controller output), so it would take a long time with the computing power available today. It would have to be done with nested loops over all parameters; with 6 parameters this results in the complexity f(n) = O(x^6). The problem could still be solved in polynomial time, but the power of 6 is very high. A method from the field of artificial intelligence according to Russell and Norvig, hill climbing, was therefore chosen [4]. In this method, some of the parameters, in this case the parameters of the PID controller, are varied by a heuristic function. Then it is recalculated whether the quality criteria IAE, ITAE and ISE have become smaller. If so, the new parameters are used as the reference; otherwise the old ones remain. In this way, after many iterations, the final parameter values settle at local minima of the quality criteria. The method requires much less computing time than a complete calculation of the multi-dimensional space.
However, since it only finds local minima, many different random tuples of start values for the controller parameters were used in the parameter search. Many of the parameter solutions of the converged minimum quality criteria agree with one another. Therefore, one can assume with reasonably good certainty that the parameters found are actually the PID parameters Kp, Ti and Td that either correspond to the absolute minimum of the quality criteria or at least come very close to it. Figure 4 shows a flow chart of this. The use of this artificial intelligence method is well suited to control engineering: often, due to the time complexity of the calculation steps involved, it is not possible to carry out complete calculations over the entire parameter space. Since such methods calculate only part of the parameter space, the computing time is strongly reduced, and the results are parameter sets with excellent transient behavior. The cross-check of whether these are really the minimum parameters is carried out with several different starting values for the parameters. Compared to the complete calculation, this still represents a significantly reduced computing time.
Fig. 4. Flow chart of the hill climbing method for finding the PID controller parameters according to the minimum ITAE, IAE and ISE criteria.
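The flow chart of Fig. 4 can be condensed into a short script. The sketch below simulates the closed loop of Fig. 2 (PID controller, controller output limitation, n serial PT1 elements) with a simple Euler scheme and accepts a random perturbation of the parameters only if the ITAE criterion improves. Step sizes, iteration counts and the discretization are assumptions, not the values used for Table 2.

```python
import random

def simulate_itae(Kp, Ti, Td, Ks=1.0, T1=1.0, n=3, lim=5.0, dt=0.01, T=40.0):
    """ITAE of the closed loop of Fig. 2 for a unit setpoint step."""
    x = [0.0] * n                    # states of the n serial PT1 elements
    integ, e_prev, itae, t = 0.0, 1.0, 0.0, 0.0
    while t < T:
        e = 1.0 - x[-1]
        integ = max(-lim, min(lim, integ + e * dt))   # anti-windup clamp
        u = Kp * (e + integ / Ti + Td * (e - e_prev) / dt)
        u = max(-lim, min(lim, u))                    # controller output limitation
        x[0] += dt / T1 * (Ks * u - x[0])
        for i in range(1, n):
            x[i] += dt / T1 * (x[i - 1] - x[i])
        itae += abs(e) * t * dt
        e_prev, t = e, t + dt
    return itae

best = [random.uniform(0.5, 10), random.uniform(0.5, 10), random.uniform(0.0, 2)]
best_cost = simulate_itae(*best)
for _ in range(2000):                # one hill-climbing run; restart for new start tuples
    cand = [max(0.01, p + random.gauss(0, 0.2)) for p in best]
    cost = simulate_itae(*cand)
    if cost < best_cost:             # keep the perturbation only if it improves
        best, best_cost = cand, cost
print(best, best_cost)
```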
4 Table for the PID Control Parameters After Minimizing the Quality Criteria IAE, ITAE and ISE

Table 2 below contains the PID controller parameters calculated with Matlab/Simulink and the hill climbing method, based on the minimized quality criteria IAE, ITAE and ISE according to Fig. 3. The block diagram according to Fig. 2 serves as a basis. The output limitation is implemented both after the controller and after the integrator (anti-windup) and is assumed to be ±2, ±3, ±5, ±10. It is calculated as (maximum controller output − controller output before the step) divided by (controller output for the stationary end value − controller output before the step). In these calculations the anti-windup after the integrator is never active, but it is included anyway, because in practice it can happen for various reasons that the controlled variable does not reach the desired static end value. It is particularly noteworthy that both the static gain Ks and the time constant T1 are part of the controller parameter tables. This makes the table universally applicable for a very wide range of applications.

Table 2. Table values of the PID parameters for the minimum IAE, ITAE and ISE criteria of controlled PTn or time-delayed systems.

PT1         | ±2      | ±3      | ±5      | ±10
IAE:  Kp·Ks | 10      | 10      | 10      | 10
      Ti    | 3.1·T1  | 2·T1    | 1.3·T1  | 1·T1
      Td    | 0 (PI)  | 0 (PI)  | 0 (PI)  | 0 (PI)
ITAE: Kp·Ks | 9.3     | 9.5     | 9.1     | 10
      Ti    | 2.9·T1  | 1.9·T1  | 1.2·T1  | 1·T1
      Td    | 0 (PI)  | 0 (PI)  | 0 (PI)  | 0 (PI)
ISE:  Kp·Ks | 10      | 10      | 9.8     | 10
      Ti    | 2.7·T1  | 1.6·T1  | 1.5·T1  | 0.2·T1
      Td    | 0 (PI)  | 0 (PI)  | 0 (PI)  | 0 (PI)

PT2         | ±2      | ±3      | ±5      | ±10
IAE:  Kp·Ks | 10      | 10      | 10      | 10
      Ti    | 9.6·T1  | 7.3·T1  | 5.6·T1  | 3.7·T1
      Td    | 0.3·T1  | 0.3·T1  | 0.3·T1  | 0.2·T1
ITAE: Kp·Ks | 10      | 10      | 9.6     | 9.8
      Ti    | 9.6·T1  | 7.3·T1  | 5.4·T1  | 4.7·T1
      Td    | 0.3·T1  | 0.3·T1  | 0.3·T1  | 0.3·T1
ISE:  Kp·Ks | 10      | 10      | 10      | 10
      Ti    | 9.7·T1  | 7.3·T1  | 5.1·T1  | 4.6·T1
      Td    | 0.2·T1  | 0.2·T1  | 0.2·T1  | 0.1·T1

PT3         | ±2      | ±3      | ±5      | ±10
IAE:  Kp·Ks | 5.4     | 7       | 8.4     | 10
      Ti    | 9.4·T1  | 10·T1   | 9.8·T1  | 9.7·T1
      Td    | 0.7·T1  | 0.7·T1  | 0.7·T1  | 0.7·T1
ITAE: Kp·Ks | 5.4     | 7       | 8.2     | 10
      Ti    | 9.4·T1  | 10·T1   | 9.6·T1  | 9.7·T1
      Td    | 0.7·T1  | 0.7·T1  | 0.7·T1  | 0.7·T1
ISE:  Kp·Ks | 6.1     | 8.1     | 10      | 10
      Ti    | 10·T1   | 9.8·T1  | 10·T1   | 7.8·T1
      Td    | 0.6·T1  | 0.6·T1  | 0.6·T1  | 0.6·T1

PT4         | ±2      | ±3      | ±5      | ±10
IAE:  Kp·Ks | 2       | 2.9     | 3.3     | 3.3
      Ti    | 5.2·T1  | 6.5·T1  | 7.1·T1  | 6.9·T1
      Td    | 1.1·T1  | 1.2·T1  | 1.3·T1  | 1.3·T1
ITAE: Kp·Ks | 1.9     | 2.4     | 2.3     | 2.1
      Ti    | 5·T1    | 5.9·T1  | 5.7·T1  | 5·T1
      Td    | 1.1·T1  | 1.2·T1  | 1.2·T1  | 1.1·T1
ISE:  Kp·Ks | 2.8     | 3.6     | 4.9     | 5.2
      Ti    | 6.6·T1  | 7·T1    | 7.1·T1  | 7·T1
      Td    | 1.2·T1  | 1.2·T1  | 1.4·T1  | 1.4·T1

PT5         | ±2      | ±3      | ±5      | ±10
IAE:  Kp·Ks | 1.7     | 1.8     | 1.8     | 1.7
      Ti    | 5.8·T1  | 5.9·T1  | 5.8·T1  | 5.5·T1
      Td    | 1.6·T1  | 1.6·T1  | 1.6·T1  | 1.6·T1
ITAE: Kp·Ks | 1.4     | 1.4     | 1.4     | 1.4
      Ti    | 5.3·T1  | 5.2·T1  | 5.2·T1  | 5.0·T1
      Td    | 1.4·T1  | 1.4·T1  | 1.4·T1  | 1.4·T1
ISE:  Kp·Ks | 1.9     | 2.6     | 2.5     | 2.5
      Ti    | 5.9·T1  | 6.5·T1  | 6.3·T1  | 6.1·T1
      Td    | 1.7·T1  | 1.8·T1  | 1.8·T1  | 1.8·T1

PT6         | ±2      | ±3      | ±5      | ±10
IAE:  Kp·Ks | 1.3     | 1.3     | 1.3     | 1.3
      Ti    | 5.9·T1  | 5.8·T1  | 5.8·T1  | 5.6·T1
      Td    | 1.9·T1  | 1.9·T1  | 1.9·T1  | 1.9·T1
ITAE: Kp·Ks | 1.1     | 1.1     | 1.1     | 1.1
      Ti    | 5.5·T1  | 5.5·T1  | 5.4·T1  | 5.3·T1
      Td    | 1.7·T1  | 1.7·T1  | 1.7·T1  | 1.7·T1
ISE:  Kp·Ks | 1.8     | 1.8     | 1.8     | 1.8
      Ti    | 6.8·T1  | 6.5·T1  | 6.5·T1  | 6.3·T1
      Td    | 2.1·T1  | 2.1·T1  | 2.1·T1  | 2.1·T1

PT8         | ±2      | ±3      | ±5      | ±10
IAE:  Kp·Ks | 0.9     | 0.9     | 0.9     | 0.9
      Ti    | 6.3·T1  | 6.3·T1  | 6.2·T1  | 6.1·T1
      Td    | 2.3·T1  | 2.3·T1  | 2.3·T1  | 2.3·T1
ITAE: Kp·Ks | 0.8     | 0.8     | 0.8     | 0.8
      Ti    | 6·T1    | 5.9·T1  | 5.9·T1  | 5.8·T1
      Td    | 2.0·T1  | 2.0·T1  | 2.0·T1  | 2.0·T1
ISE:  Kp·Ks | 1.2     | 1.2     | 1.2     | 1.2
      Ti    | 7·T1    | 7·T1    | 6.9·T1  | 6.8·T1
      Td    | 2.7·T1  | 2.7·T1  | 2.7·T1  | 2.7·T1
5 Application of the PID Parameter Table: Control of a PT2 System

The control of the filling level of a water reservoir is implemented as an application example of the parameter table. Figure 5 shows the system. The pump is the actuator and shows PT1 behavior between the input voltage of the converter and the volume flow of the water. The water tank is filled with the hose on the left; the hose on the right serves as a drain. Depending on the level, the pressure and thus the flow rate of the drain rises. Each delivery rate of the pump thus results in a different stationary end value of the fill level. Thus, the water tank itself also exhibits PT1 behavior between the inflow and the filling level of the water. The entire system cascade, with the input voltage of the speed setpoint of the pump as the input and the filling level of the water tank as the output, therefore has PT2 behavior.
Fig. 5. Application of the parameter table: control of filling level of water
Figure 6 shows a step response of the system from xe = 3 V to 4 V at the input. Identification with the time percentage value method shows a system order n of 2, according to formula (2) and Table 1. Ks = xa/xe = 0.8 V/1.0 V = 0.8 according to formula (3). Finally, T1 results in 16 s according to formula (4). Since one cannot generally assume in a practical system that the time constants are all the same, the time percentage value method presented above is only an approximation procedure.
Fig. 6. Step response of the 2nd order filling level of water according to Fig. 5.
In order to read the correct table values for the PID controller, the controller output limitation factor is also required. To show how this is done in the general case, a general jump in the filling level is chosen that does not start from 0 cm. In this case, the jump from 15 cm to 21 cm is selected, which requires the controller output to jump
from a little more than 3 V to around 4 V. The maximum controller output is 8 V. So the calculation becomes:

Control output limitation = (maximum controller output − controller output before the step) / (controller output for the stationary end value − controller output before the step) = (8 V − 3 V) / (4 V − 3 V) = 5    (6)
However, the table values for the different controller output limitations are similar for the respective PTn systems. Nevertheless, one can take this into account, provided one knows in which operating points the control loop should work. Reading out, for example, the table for the ITAE criterion with these values gives, for PT2 and control output limitation 5:

Kp = 9.6/Ks = 12.0;   Ti = 5.4·T1 = 86.4 s;   Td = 0.3·T1 = 4.8 s
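The lookup can be mechanized in a few lines. The dictionary below holds only the ITAE row of the PT2 block of Table 2, keyed by the output limitation factor, and reproduces the water-level numbers above.

```python
# (Kp*Ks, Ti/T1, Td/T1) factors from the ITAE row of the PT2 block of Table 2
ITAE_PT2 = {2: (10, 9.6, 0.3), 3: (10, 7.3, 0.3), 5: (9.6, 5.4, 0.3), 10: (9.8, 4.7, 0.3)}

def pid_from_table(Ks, T1, limitation, row=ITAE_PT2):
    kp_ks, ti_f, td_f = row[limitation]
    return kp_ks / Ks, ti_f * T1, td_f * T1   # Kp, Ti [s], Td [s]

print(pid_from_table(Ks=0.8, T1=16.0, limitation=5))  # -> (12.0, 86.4, 4.8)
```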
Fig. 7. Step response of the closed-loop system, step from 15 cm (= 150 mm/(130 mm/V) = 1.15 V) to 21 cm (= 210 mm/(130 mm/V) = 1.61 V).
If the system is controlled with the general block diagram according to Fig. 2, the behavior shown in Fig. 7 results. The PTn here is the step response of the real system that was shown in Fig. 5. For a further comparison, a didactic example is used to show the general suitability of the parameter table. The response of a time-delayed system to a unit step shows a static final value of 1, and the result is PT2 behavior with Ks = 1 and T1 = 1 s.
Fig. 8. Step response of the closed-loop system according to Fig. 2 for a PT2 system (T1 = 1, Ks = 1); the PID parameters were found with the table values for the ITAE criterion
For the ITAE criterion, the table values of the PID parameters for the PT2 system are read off. The simulation of the step responses of the closed-loop system according to Fig. 2 is shown in Fig. 8. They show a very good transient response. The different dynamics and rise times can be explained well by the different controller output limitations. The comparison with the real system according to Fig. 7, which also represents a controlled PT2 system, shows a very slightly larger overshoot there. This can be explained by the fact that the time constants of the pump and the water tank are not exactly the same; therefore, the identification with the time percentage value method above is only an approximation.
6 Discussion and Outlook

The examples discussed show excellent transient response. The table can be used generally, especially because the parameters Ks and T1 are included. It cannot be compared with heuristic methods, because the parameters in the table are hard-calculated values that really minimize the quality criteria. It is also open to discussion what would happen if one performed different jumps and therefore had to choose the parameters according to different factors of the controller output limitation. It turns out, however, that the parameters are very similar, and values should be selected for the jumps that are most likely to occur in the specific system. In the case of non-linear systems, the step responses differ between operating points, so Ks and T1 are also different. In this case one would have to carry out more detailed analyses of the meaningful parameters. The parameters Ks and T1 directly affect the parameters of the PID controller according to the table values. Thus, in this case, the possible ranges of Ks and T1 and the resulting ranges of the
controller parameters should be examined. Then the best values for the occurring setpoint changes should be used. Incidentally, this also applies to the water filling level system, whose drainage behavior is based on Torricelli's law and therefore depends on the fill level in a non-linear manner. But even for a general setpoint jump by any value, the parameters still result in good closed-loop behavior. It turns out that the PTn systems that occur very frequently in practice can be regulated very well with the available table values according to the minimized IAE, ITAE and ISE criteria. In practice, one can often do without a simulation and only measure the step response of the system; then the order n and the parameters for the PID controller can be read from the table and the controller can be implemented directly on the system. May this document contribute to greatly simplifying the controller design process for such frequently occurring systems in the future.
References
1. Ziegler, J.B., Nichols, N.B.: Optimum settings for automatic controllers. ASME Trans. 64, 759–768 (1942)
2. Chien, K.L., Hrones, J.A., Reswick, J.B.: On the automatic control of generalized passive systems. Transactions of the American Society of Mechanical Engineers 74, 175–185, Cambridge (Mass.), USA (1952)
3. Qi, Z., Shi, Q., Zhang, H.: Tuning of digital PID controllers using particle swarm optimization algorithm for a CAN-based DC motor subject to stochastic delays. IEEE Trans. Industr. Electron. 67(7), 5637–5646 (2019)
4. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edn, pp. 111–114. Prentice Hall, Upper Saddle River, New Jersey (2003). ISBN 0-13-790395-2
5. Büchi, R.: Machine learning for optimal ITAE controller parameters for thermal PTn actuators. In: Arai, K. (ed.) IntelliSys 2021. LNNS, vol. 295, pp. 137–145. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-82196-8_11
6. Joseph, E.A., Olaiya, O.O.: Cohen-Coon PID tuning method, a better option to Ziegler Nichols-PID tuning method. Comput. Eng. Intell. Syst. 9(5) (2018). ISSN 2222-1719
7. Stephan, O., Docekal, T.: PID controller design based on global optimization technique with additional constraints. J. Electr. Eng. 67(3), 160–168 (2016)
8. Hussain, K.M., et al.: Comparison of PID controller tuning methods with genetic algorithm for FOPTD system. Int. J. Eng. Res. Appl. 4(2), 308–314 (2014). ISSN 2248-9622
9. Büchi, R.: Modellierung und Regelung von Impact Drives für Positionierungen im Nanometerbereich. Doctoral dissertation, ETH Zurich (1996)
10. da Silva, L.R., Flesch, R.C., Normey-Rico, J.E.: Controlling industrial dead-time systems: when to use a PID or an advanced controller. ISA Trans. 99, 339–350 (2020)
11. Silva, G.J., Datta, A., Bhattacharyya, S.P.: PID Controllers for Time-Delay Systems. Birkhäuser, Boston (2005). ISBN 0-8176-4266-8
12. Unbehauen, H.: Regelungstechnik. Vieweg, Braunschweig (1992)
13. Zacher, S., Reuter, M.: Regelungstechnik für Ingenieure, 15. Auflage. Springer Vieweg Verlag (2017)
14. Schwarze, G.: Bestimmung der regelungstechnischen Kennwerte von P-Gliedern aus der Übergangsfunktion ohne Wendetangentenkonstruktion. messen-steuern-regeln, Heft 5, S. 447–449 (1962)
15. Martins, F.G.: Tuning PID controllers using the ITAE criterion. Int. J. Eng. Ed. 21(5), 867–873 (2005)
Principles of Solving the Symbol Grounding Problem in the Development of the General Artificial Cognitive Agents

Roman V. Dushkin1 and Vladimir Y. Stepankov2(B)

1 Artificial Intelligence Agency, Volkonsky 1st Lane, 15, Moscow 127473, Russia
2 National Research Nuclear University MEPhI, Moscow 115409, Russia
Abstract. The article describes the authors' approach to solving the problem of symbol grounding, which can be used in the development of general-level artificial cognitive agents. When implementing this approach, such agents can acquire the function of understanding the sense and context of the situations in which they find themselves. The article gives a brief description of the problem of understanding meaning and sense. In addition, the authors' vision is given of how symbol grounding should occur when an artificial cognitive agent uses sensory information flows of various modalities. Symbol grounding is carried out by building an associative-heterarchical network of concepts, with the help of which the hybrid architecture of an artificial cognitive agent is extended. The novelty of the article lies in the authors' approach to solving the problem, which rests on several important principles: multisensory integration, the use of an associative-heterarchical network of concepts, and a hybrid paradigm of artificial intelligence. The relevance of the work is based on the fact that today the problem of constructing general-level artificial cognitive agents is becoming more and more important, including within the framework of national strategies for the development of artificial intelligence in various countries of the world. The article is of a theoretical nature and will be of interest to specialists in the field of artificial intelligence, as well as to all those who want to keep up with modern trends in the field. Keywords: Meaning · Sense · Frege's triangle · Symbol grounding · Semantics · Understanding · Artificial intelligence · Multisensory integration · Associative-heterarchical network · Hybrid cognitive architecture
1 Introduction

Stevan Harnad first wrote about the Symbol Grounding Problem in [1]. This article raises the question of how symbols within a certain syntactic symbolic system get their meanings. In fact, S. Harnad described the general problem of the cognitive sciences: the problem of what meaning is, what its nature is, and how it relates to behavior, cognition, intelligence and reason. His attempt to solve this problem
using some hybrid model, in which syntax and semantics "meet somewhere in the middle", does not look satisfactory, as S. Harnad himself notes in the conclusion of the article. At the same time, the problem of how symbols obtain meaning has worried researchers for a long time. Gottlob Frege introduced the concept of the semantic triangle (today often called "Frege's triangle"), in which he defined and delineated the concepts of "sign", "meaning" and "referent" [2]. A certain sign has both a meaning or concept (denotation) and a reference to an object (designatum); all these concepts are closely related. Nevertheless, the answer to the question of how exactly the sign gets its meaning and its referent remains open. In addition, Ludwig Wittgenstein pointed out that a context is required to define both the meaning and the referent of a sign [3]. Figure 1 shows the classic view of Gottlob Frege's semantic triangle [4].
Fig. 1. The classic view of the semantic triangle
In this work, the components of the semantic triangle will be understood as:

1. Sign: a material or informational entity to which, by explicit or implicit agreement, a certain value is attributed, usually understood in the same way by the agents of the community in which the sign is used. In the context of this work, the terms "sign" and "symbol" (when addressing the Symbol Grounding Problem) will be considered interchangeable.
2. Meaning (denotation): the object (class of objects) or phenomenon (class of phenomena) of the surrounding reality, or an abstract concept (class of concepts), which is indicated by a sign. It is generally accepted that the sign has a meaning.
3. Sense (designatum): information about the object indicated by a sign, some generally significant information.

At the same time, in the works of G.S. Osipov, an extension of the Frege semantic triangle to a "tetrahedron" is given, which includes the personal meaning of a sign for
a cognitive agent, obtained on the basis of personal experience [5, 6]. Similar conclusions are confirmed by the author’s studies [7]. Thus, an expanded view of the relationship of the basic concepts of this work is the semantic tetrahedron, which is shown in Fig. 2.
Fig. 2. Extension of the Frege triangle: the semantic tetrahedron
However, according to the views of some modern researchers, symbols by themselves have no sense [8]. Sense appears only when symbols are connected in a sequence, and it is the sequence of symbols that makes sense. Thus, individual signs can only be assigned meanings, while sequences of symbols have no meanings but make sense. Based on this statement, we can say that, firstly, the cognitive agent knows the meaning of a symbol and, secondly, understands the sense of a sequence of symbols.

An additional nuance that introduces certain violations into the harmonious concept of the Frege triangle is the context, as L. Wittgenstein first pointed out [3]. Oddly enough, both the meaning of a symbol and the sense of a sequence of symbols depend heavily on the context: in different contexts, the same symbol can have different meanings, and the same sequence of symbols can make different sense. It is the context, which depends on the agent's personal memory, that makes serious adjustments to the sequence, to meanings and to sense. Taking into account what was described earlier, agents possess knowledge of the meanings of symbols and an understanding of the sense of sequences of symbols, taking into account their own characteristics, namely personal experience and memory. In other words, agents may have personal understandings of meanings and personal understandings of sense. Thus, both Frege's triangle and Osipov's tetrahedron are transformed into an allegorical figure that could be called a "comb" immersed in the syrup of context. A diagram that describes these provisions is shown in Fig. 3.

Thus, in the further presentation of this work, the following provisions are used:

1. Symbols have meaning.
2. Sequences of symbols make sense.
3. The agent knows the meanings of the symbols.
4. The agent understands the sense of the sequence of symbols.
5. The meanings of symbols and the sense of sequences of symbols depend both on the agent's personal experience and memory, and on the context in which the agent perceives them.
Fig. 3. Semantic “comb”
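A toy sketch of these five provisions, purely illustrative (the paper itself defines no data structures): the same symbol carries a context-dependent meaning drawn from the agent's personal memory, while sense belongs to the sequence.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    memory: dict = field(default_factory=dict)  # personal experience: (symbol, context) -> meaning

    def meaning_of(self, symbol: str, context: str) -> str:
        # provision 5: the same symbol may mean different things in different contexts
        return self.memory.get((symbol, context), "unknown")

    def sense_of(self, symbols: list, context: str) -> str:
        # provisions 2 and 4: sense belongs to the sequence, not to any single symbol
        return " + ".join(self.meaning_of(s, context) for s in symbols)

a = Agent({("table", "furniture-talk"): "a piece of furniture",
           ("table", "data-talk"): "a grid of values"})
print(a.sense_of(["table"], "data-talk"))
```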
2 Multisensory Integration

The problem of symbol grounding is one of the cornerstones of the theory of artificial intelligence [7]. While its solution is not particularly required for building weak artificial intelligence systems, when considering approaches to building strong artificial intelligence there is no way around solving the problem of symbol grounding. This and the next section attempt to propose an approach to solving this problem.

It can be assumed that the grounding of symbols in people is carried out in the process of their learning, first of all in the process of learning natural language. Such grounding of symbols is carried out, firstly, through multisensory integration [9] and, secondly, through building a branched associative-heterarchical network of concepts in the associative cortex [10]. It goes without saying that the learning process of a person as an intellectual agent is an individual process [11]. It is within the framework of such an individual process that a person's personal experience is formed, which determines the knowledge of the personal meanings of the symbols that a person comprehends, including the understanding of the personal sense of sequences of symbols, as indicated in the previous section.
Thus, a process similar to the one we observe in humans can be adopted for solving the problem of symbol grounding in artificial intelligent agents. In other words, the grounding of symbols for general-level artificial intelligent agents should be carried out using:

1. Multisensory integration.
2. Building an associative-heterarchical network of concepts.

It makes sense to consider both of these processes in detail in order to propose a final solution to the problem of symbol grounding in artificial intelligent agents. In accordance with the provisions set forth in [7], the sensory perception of an intelligent agent is arranged as follows. There should be at least several different sensory modalities, within which there can be one or more channels of information perception. For example, if we consider such a sensory modality as vision, then within its framework the channels of information perception can be color vision (possibly with different channels for different colors), perception of brightness, and some other parameters of visual perception. Another sensory modality could be the auditory one, in which the channels are the perception of sounds of various sampled frequencies. The third important modality used in humans is the tactile one. In the case of a person, different parts of the body can serve as different channels, used to feel objects in the outside world, for example, the palms of the two hands.

In the case of an artificial intelligent agent embodied in the physical world, completely different sensors, and their mutual influences, can serve as sensory modalities [12]. Nevertheless, within the framework of studying the problem of symbol grounding, it will further be assumed that an artificial intelligent agent has at least three sensory modalities: computer vision, sound analysis and tactile perception analysis.

At the very beginning of training, an intelligent agent forms a library of primary symbols tied to direct perception [13]. Thus, the agent forms visual, auditory and tactile symbols, and the creation of these sets is based primarily on the statistical processing of the large amount of sensory information that comes to the agent. For example, visual symbols represent various objects that the agent observes from different angles, from different sides, in different rotations, etc. Such objects can be simple geometric bodies or complex objects of the surrounding reality; as a result, a large number of primary visual symbols of a somewhat abstract nature are formed, which correspond to the observed images arriving at the agent through visual sensors [14]. The simplest example is the symbol of a specific table, which the agent can observe from different angles. This symbol will reflect the very table that the agent is watching, but it will capture exactly the abstract, invariant properties of this particular table. That is, through such a symbol this table will be recognizable no matter how the agent looks at it: from whatever angle, from whichever side, from afar or up close, etc. The same symbol libraries are generated for the auditory and tactile modalities. If we consider the same example of a table, then its sound symbols will be a whole set of "symbols", sequences of sounds that the table can produce or that can be produced by manipulating the table. Such symbols of the auditory modality will be the sounds
produced when the table is moved across the floor, when something is scraped along it, or when something is thrown at it. This whole library of sound symbols is also tied to the table. The tactile modality yields one large symbol, which arises from feeling the table with the manipulators of the artificial intelligent agent. Using the mechanism of multisensory integration, all three sets of symbols (visual, auditory and tactile) are combined into one set, creating a single meta-symbol, which is a multifactorial implementation of the table object in the internal perception of an agent with three sensory modalities. Through this meta-symbol, the agent can perceive the table in all its diversity: with the help of the visual, auditory and tactile modalities, either all of them simultaneously or each separately. This process can be illustrated using the diagram shown in Fig. 4. The diagram shows that the "table" meta-symbol is formed by integrating symbol sets from three different sensory modalities available to the intelligent agent. In fact, such a meta-symbol with its specific symbols from various modalities is isomorphic to the process of perceiving table images through different modalities and the multisensory integration of these perceptions into a single meta-symbol.
Fig. 4. Basic touch symbols and the meta-symbol “table”
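The integration step described above can be made concrete with a short sketch. This is purely illustrative: the paper does not define any data structures, so the ModalSymbol/MetaSymbol classes and the simple set-union integration rule below are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ModalSymbol:
    modality: str              # e.g. "visual", "auditory", "tactile"
    label: str                 # internal identifier of the percept
    features: dict = field(default_factory=dict)

@dataclass
class MetaSymbol:
    name: str
    modal_symbols: list

    def modalities(self):
        return {s.modality for s in self.modal_symbols}

# Per-modality symbol sets for one perceived object
visual = [ModalSymbol("visual", "table@front"), ModalSymbol("visual", "table@side")]
auditory = [ModalSymbol("auditory", "table-scrape"), ModalSymbol("auditory", "table-knock")]
tactile = [ModalSymbol("tactile", "table-shape")]

# Multisensory integration: the three sets are combined into one meta-symbol
table_meta = MetaSymbol("table#1", visual + auditory + tactile)
print(table_meta.modalities())   # {'visual', 'auditory', 'tactile'}
```

The point is only that the meta-symbol aggregates per-modality symbols while remaining addressable as a single internal object, exactly as in Fig. 4.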
But how does the abstract symbol "table" appear, which then becomes a single name for all tables that an intelligent agent may encounter? It is precisely the process of forming an associative-heterarchical network that is responsible for this.
3 Construction of an Associative-Heterarchical Network of Concepts to Solve the Problem of Symbol Grounding and Its Possible Solution

However, the sets of basic symbols in each of the sensory modalities, as well as the meta-symbol obtained as a result of multisensory integration, do not by themselves lead to the appearance of abstract signs. Such abstract signs, the symbols of natural language, come from the outside, when someone starts to name the objects the agent perceives by voice, or shows the agent the names of these objects written in the symbols of natural language. Leaving aside the question of where natural language came from as a symbolic system [15], it should be noted that all signs of a natural language come to an intellectual agent from the outside. Therefore, the table begins to be called a "table" only after the agent, who has already received an internal representation of the table, is told that this particular object, for which it has visual, auditory and tactile symbols as well as a meta-symbol, is called a "table". At this moment that very associative-heterarchical network of concepts begins to form. This happens for the simple reason that the sign "table" is presented to the agent when it sees different tables: a writing desk, a dining table, a secretary desk, etc. The whole set of specific meta-symbols that the agent received from its sensory systems begins to be associated with one abstract symbol "table". In turn, the abstract sign "table" itself is associated with all the meta-symbols corresponding to the specific tables in the immediate environment of the agent. In fact, the abstract sign "table" becomes a leaf vertex of the heterarchical model of the agent's personal experience, attached to the specific meta-symbols obtained from the agent's sensory systems of different modalities. In other words, this is how abstract symbols are grounded through specific sensory symbols and their meta-symbols.

What does the notion of an "associative-heterarchical network" itself mean? With the subsequent development of the intellectual agent and the expansion of its knowledge and understanding of the surrounding reality, signs such as "table", "chair", "sofa", etc. are united by a sign of a higher level of abstraction: "furniture". This is exactly how the next level is linked: the sign "furniture", as an abstract sign, is linked to other abstract signs, namely "chair", "table", "sofa", etc. Thus, the abstract sign "furniture" binds to modal symbols and the multisensory-integration meta-symbols only indirectly. An associative relationship arises between the abstract sign "furniture" and abstract signs of a lower level: "table", "chair", "sofa", etc. This associative relationship is in fact a link in a hierarchy, because a table is furniture; a relationship of type IsA is created. A chair is a piece of furniture, so the same connection arises. Nevertheless, associative connections also arise between the sign "furniture" and specific sensory symbols. They are also of type IsA, meaning the sensory symbols are the basic representation of furniture. It goes without saying, however, that associative connections of a different nature can arise between abstract signs. For example, between the abstract sign "furniture" and the abstract sign "room" there may be the relationship "is in", although the sign "furniture" is not part of the sign "room". That is, the furniture is in the room. And thus, a highly
intertwined associative network appears among all the many signs that an intelligent agent comprehends in the process of its learning. At the same time, various types of hierarchies arise in this associative network. The table is furniture; furniture is a household item; household items are things; things are objects in general. This is one of the natural hierarchies. A table can also be the top of a lower-order hierarchy: it consists of a tabletop and legs. In this case the table is the top of a hierarchy that includes the tabletop and the legs. But the tabletop and the legs are not furniture and do not belong to the previous hierarchy. The sign "table" is thus included in two different hierarchies, which is why it is said that the signs form a heterarchy, and the network itself becomes associative-heterarchical.

The situation would be simple if an intelligent agent needed to ground only symbols that correspond to objects of the surrounding reality, such as the table considered in the example above. However, an intelligent agent in its environment is surrounded not only by objects, but also by the properties of such objects, the actions of objects in relation to the agent and to each other, the actions of the agent itself in relation to objects, and even the properties of actions. All of the above can be characterized from the linguistic point of view, since language, as a symbolic system, gives names to the signs that designate and are attached to all of the listed categories [19]. Thus, the intelligent agent is surrounded in its environment by objects, attributes of objects, predicates and attributes of predicates. Moreover, predicates can be further divided into three categories. First, there are predicates in which the agent itself is the actor and the complement is an object outside the agent. Second, there are predicates in which an external object is the actor and the agent itself is the complement of the predicate. Third, there are predicates in which both the actor and the complement of the action are objects of the surrounding reality.

In addition, all the listed categories of signs must be considered on a certain scale of degrees of abstractness since, as was shown earlier, signs have different levels of abstractness. Such a scale can contain five levels, namely: 1. Specific signs given to the agent in its sensory modalities. 2. The first level of abstractness of signs. At this level there is, for example, the sign "table", which unites all specific sensory signs of tables given to the agent in its sensory modalities. 3. The next level of the hierarchy of signs, uniting abstract concepts that are still given to the intelligent agent in its sensory sensations. This is, for example, a sign such as "furniture", which unites the signs "table", "chair", "sofa", etc., where these signs can still be observed. 4. Directly abstract concepts that are not associated with any objects of the surrounding reality. Examples of such signs are "conscience", "honor", "love", etc. 5. Signs that are higher in the hierarchy than the signs of the previous level; these signs combine the previous ones into abstract groups. Thus, each sign in the symbolic system of a language must be considered in a two-dimensional space whose axes are, on the one hand, the class of abstraction and, on the other hand, the type of sign: an object, a predicate, an attribute of
an object or an attribute of a predicate. It is in this space that the procedure for grounding a sign must be considered, so as to solve the problem of symbol grounding for each specific class of symbols. This paper attempts to show such a procedure and to outline a framework for solving the problem of symbol grounding through multisensory integration and the definition of basic sensory signs. A deep consideration of the signs of all twenty classes of the designated two-dimensional space is beyond the scope of this work and should be systematically considered in additional research. However, the following will show how to solve the problem of symbol grounding for each of the four linguistic types of signs.

The first type of linguistic signs is things, or objects. This type names anything that can act in a sentence as the subject or object of an action. Simple signs correspond to objects and phenomena of the surrounding reality; more complex signs correspond to abstract concepts, which are also somehow connected with objects or phenomena of the surrounding reality. In the case of signs corresponding to objects, the binding occurs as shown in the example of the abstract sign "table": through a set of sensory signs corresponding to three sensory modalities and the meta-sign of multisensory integration. In fact, the abstract sign "table" is tied to the entire set of perceptions of tables carried out by an intelligent agent.

A more complex process is the grounding of signs that correspond to the linguistic type of attributes of objects, i.e., various qualities. Examples of such signs are designations of colours, qualities of tactile perception of objects, external forms, or the substances that objects are made of. In other words, such signs are most often adjectives in languages, and their grounding is carried out through a more abstract process than the binding of objects. The point is that if we consider a quality such as "red", then the linguistic sign "red" will correspond to a huge number of objects that an intelligent agent can perceive through its sensory systems. These will be completely different objects, but they will all have one common quality: they are of one or another shade of red. Thus, the sign "red" is at a higher level of the abstraction hierarchy than the more specific signs denoting objects, since different objects can be red, and the attribute "red" is associated associatively with the various signs that designate objects. Here the internal qualitative states of the intelligent agent's perception of various qualities come to the fore. In the case of red, the intelligent agent perceives red through its visual sensory modality, and in fact the redness of objects is one of the basic symbols formed in the visual modality. The linguistic sign "red" is tied to the specific symbols of the visual modality that correspond to the perception of the qualitative state of redness, and this happens as shown in Fig. 5.
Fig. 5. Symbol grounding for linguistic signs that are attributes of objects
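As a rough illustration of this attribute-grounding process, the toy sketch below extracts the quality shared by otherwise different object symbols; the feature representation and the RGB threshold are invented for the example and are not part of the described method.

```python
# Toy grounding of the attribute sign "red": the sign's extension is the set of
# object symbols whose visual features exhibit the common quality.
objects = {
    "apple#1": {"shape": "round", "rgb": (200, 30, 25)},
    "ball#7":  {"shape": "round", "rgb": (210, 15, 40)},
    "table#1": {"shape": "flat",  "rgb": (120, 80, 50)},
}

def is_reddish(rgb):
    r, g, b = rgb
    return r > 150 and r > 2 * max(g, b)   # crude stand-in "redness" detector

red_extension = {name for name, f in objects.items() if is_reddish(f["rgb"])}
print(red_extension)  # {'apple#1', 'ball#7'}
```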
Thus, the grounding of attributive linguistic signs is carried out through the process of extracting common properties, and these common properties correspond to the specific sensory signs that represent the various qualitative states of sensory perception of an intelligent agent.

The situation with predicates denoting a state or an action is even more complicated. In order to ground predicate linguistic symbols, an intelligent agent needs to observe with its sensory systems the development of objects and phenomena of the surrounding reality in dynamics, and at the same time compare historical information about its observations. Indeed, if we take a sign such as "move", then the grounding of this sign to the results of sensory observations will be carried out through a large number of observations of objects moving in space from one point to another. At a lower level of abstraction this sign will be associated with other predicate signs such as "walk", "run", "crawl", "fly", etc., which are observed by the intelligent agent in the course of its functioning for certain classes of observed objects. Thus, predicate signs also create different hierarchies [16], which, when connected to each other, form heterarchical networks.

Finally, the most difficult situation is the grounding of signs for attributes that denote properties of actions, that is, properties of predicates. In natural languages such signs usually correspond to adverbs. In order to ground such signs to the results of sensory observations, an intelligent agent needs not only to observe the states or actions of objects, or phenomena of the surrounding reality in dynamics, but also to be able to compare different "instances" of observations with each other.
For example, what does the sign "fast" mean? Fast is an attribute of an action: one can move fast, go fast, run fast. Accordingly, an intelligent agent in the process of its observations must be able to compare fast movement with movement that is not fast, in order to differentiate the variants of changes in the characteristics of the action carried out by the observed object from previous variants, and thus extract the values of the characteristic "fast" attributed to the predicate symbol. So the sign "fast" is tied precisely to predicate signs of a lower level of abstraction and denotes their characteristics, in the same way that the attributive signs corresponding to adjectives designate the characteristics of observed objects while also being at a higher level of abstraction than they are.

All of the above allows us to say that through the recognition of signs of various classes an associative-heterarchical network is formed. The signs, or symbols, in this network are connected with each other by associative links and form natural hierarchies. The grounding of symbols in such a network is carried out as follows: specific symbols are attached to the results of the sensory observations of an intelligent agent through a set of specific sensory symbols and a meta-symbol of multisensory integration; symbols of a more abstract level bind to symbols of a less abstract level. For example, the symbol "furniture" is attached to the symbols "chair", "table", "sofa"; the symbol "move" is tied to the symbols "run", "crawl", "walk", "fly", etc. The higher a symbol is on the scale of abstraction levels, the more difficult it is to ground it to sensory symbols and the direct sensory experience of an intelligent agent. Nevertheless, all such symbols are grounded to symbols of a lower level of the hierarchy and, through them, indirectly to the sensory experience of an intelligent agent.

In the described associative-heterarchical networks of symbols, so-called circular definitions may arise, when some symbols are tied to symbols of a lower level of abstraction, which are in turn tied to symbols of a higher level of abstraction, and through them again to the original symbols themselves. That is, cycles appear in the graph of the associative-heterarchical network, and the grounding of symbols is carried out recursively. The question of how to resolve circular definitions is beyond the scope of this work, but there are approaches and methods for such resolution that should be considered in future studies.
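Since the resolution of circular definitions is left to future work, the sketch below only shows how such cycles could be detected in the sign graph; the dictionary-of-lists representation is an assumption made for illustration, not the authors' data structure.

```python
def find_cycle(graph):
    """Iterative DFS over a dict {sign: [linked signs]}; returns one cycle or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        stack = [(start, iter(graph.get(start, ())))]
        color[start] = GRAY
        path = [start]                      # gray nodes on the current DFS path
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if color.get(nxt, WHITE) == GRAY:       # back edge: cycle found
                    return path[path.index(nxt):] + [nxt]
                if color.get(nxt, WHITE) == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    path.append(nxt)
                    advanced = True
                    break
            if not advanced:                # subtree finished, backtrack
                color[node] = BLACK
                path.pop()
                stack.pop()
    return None

signs = {"furniture": ["table"], "table": ["tabletop"], "tabletop": ["furniture"]}
print(find_cycle(signs))  # ['furniture', 'table', 'tabletop', 'furniture']
```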
4 Expansion of the Hybrid Architecture of an Artificial Cognitive Agent, Taking into Account the Problem of Symbol Grounding

The theses presented in this work, namely the need for an intelligent agent to form a personal understanding that takes into account the context and its personal memory, the use of multisensory integration for grounding symbols, and the construction of an associative-heterarchical network, require changes to the hybrid architecture of an artificial intelligent agent described in [17]. These changes are described in this section. The changes that need to be made to the hybrid artificial intelligent agent architecture come down to the following points: 1. An artificial intelligent agent must have several sensory modalities, or at least several channels of perception within one sensory modality.
2. Information flows from all sensory modalities and their channels should arrive at the Multisensory Integration Center, whose task is to build a holistic description of the perceived situation in the environment. 3. Information from the Multisensory Integration Center should go to the module responsible for maintaining the personal memory of the intelligent agent based on the use of an associative-heterarchical network.

The three listed components cover the affector chain within the hybrid architecture of an intelligent agent and fit naturally into the cybernetic scheme of such an architecture. Fig. 6 shows the updated architecture with the new components built into it.

Fig. 6. General architecture of a general-level artificial intelligent agent with the ability to ground symbols

The scheme shown in Fig. 6 is an augmented hybrid artificial cognitive agent architecture with symbol grounding. The hybrid architecture itself is described in [17], and the reader is referred to that work for detailed information on how it operates. The rest of this section describes only the changes to the architecture associated with the use of symbol grounding technology. As stated earlier, an artificial intelligent agent that has the ability to ground symbols must have multiple sensory modalities. The diagram in Fig. 6 presents N sensory modalities, each of which sends a stream of filtered sensory information both to the reflex circuit of fast decision making, as described in [10], and to the Multisensory Integration Center. It is in the Multisensory Integration Center that the process of symbol grounding is carried out, according to the procedure described earlier in this work. The constructed fragments of the associative-heterarchical network are sent to a Universal Inference Engine, which "puts" them into the base of the Personal Experience of an
artificial intelligent agent, or binds these fragments to the existing General Knowledge Base. Later, the grounded symbols can be used by the artificial intelligent agent in the process of universal knowledge-based inference, obtaining them either from the general knowledge base or from personal experience. At the same time, both general and personal knowledge are used to resolve the context. Personal experience has the higher priority, since the context is resolved precisely with the help of the personal experience of the artificial intelligent agent. All other processes occurring within the presented architecture are carried out in full accordance with the hybrid architecture of an artificial intelligent agent presented in [17].
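The priority of personal experience over the general knowledge base during context resolution can be summarized in a few lines. This is only a schematic reading of the description above, with hypothetical names; it is not the actual implementation from [17].

```python
def resolve(symbol, personal_experience, general_knowledge):
    """Context resolution: personal experience is consulted first."""
    if symbol in personal_experience:          # personal sense wins
        return personal_experience[symbol]
    return general_knowledge.get(symbol)       # fall back to the shared meaning

general = {"table": "furniture with a flat top and legs"}
personal = {"table": "the table observed in the agent's own environment"}
print(resolve("table", personal, general))
```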
5 Conclusion

Thus, the paper has described the general principles of solving the problem of symbol grounding for artificial intelligent agents of the general level. Sketches of the procedure for grounding symbols during the construction of an associative-heterarchical network of concepts have been given. An updated general architecture of a general-level artificial intelligent agent, based on hybrid principles, has also been presented; it makes it possible to ground symbols and save them in two knowledge bases: the base of general knowledge and the base of personal experience. Thereby a direction has been determined whose development will make it possible to solve the grounding of symbols in artificial cognitive agents when implementing for them the functionality of understanding the meaning and context of the situations in which they find themselves. Understanding meaning, both the sense of natural language phrases and the situations in which the agent is located, will allow the agent to act more effectively in accordance with its personal experience and the goals set from the outside. Moreover, the procedure of continuous training and replenishment of the general knowledge base and the base of personal experience, in accordance with the technology described in [18], will allow such an artificial cognitive agent to constantly update its knowledge, replace the meanings of various symbols if necessary, and thereby adapt to a changing environment. The implementation of the described principles will make it possible to approach both the construction of artificial cognitive agents of a general level and a general understanding of how the grounding of symbols is arranged in cognitive agents of an arbitrary nature. The authors will continue research on the presented topic and in the described direction. In the future, experiments will be carried out on the construction of associative-heterarchical networks, dividing them into general knowledge and personal experience; the presented scheme of the general cognitive architecture will also be implemented, taking into account the mechanisms of multisensory integration from sensors of various modalities together with the symbol grounding mechanism presented in this work.
References
1. Harnad, S.: The symbol grounding problem. Phys. D: Nonlinear Phenom. 42(1–3), 335–346 (1990). https://doi.org/10.1016/0167-2789(90)90087-6
2. Frege, F.L.G.: Über Sinn und Bedeutung. Zeitschrift für Philosophie und Philosophische Kritik, 25–50 (1892)
3. Wittgenstein, L.: Logical and Philosophical Treatise. Translated from German by Dobronravova and Lakhuti, p. 133. Common ed. and foreword by V.F. Asmus. Nauka, Moscow (1958)
4. Zalta, E.N.: Gottlob Frege. In: Zalta, E.N. (ed.) Stanford Encyclopedia of Philosophy (Fall 2014) (2014)
5. Osipov, G.S.: Signs-based vs. symbolic models. In: Advances in Artificial Intelligence and Soft Computing (2015)
6. Panov, A.I., Petrov, A.V.: Hierarchical temporary memory as a model of perception and its automatic representation. In: Sixth International Conference "System Analysis and Information Technologies" SAIT-2015 (June 15–20, 2015, Svetlogorsk, Russia): Proceedings of the Conference, vol. 2 (2015)
7. Dushkin, R.V.: On J. Searle's "Chinese room" from the hybrid model of the artificial cognitive agents design. Sib. J. Philos. 18(2), 30–47 (2020). https://doi.org/10.25205/2541-7517-2020-18-2-30-47
8. Masse, A., Chicoisne, G., Gargouri, Y., Harnad, S., Picard, O., Marcotte, O.: How is meaning grounded in dictionary definitions? (2008). https://doi.org/10.3115/1627328.1627331
9. Dushkin, R.V.: Is it possible to recognize a philosophical zombie and how to do it. In: Arai, K. (ed.) Intelligent Systems and Applications. IntelliSys 2021. LNNS, vol. 294, pp. 778–790. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-82193-7_52
10. Stepankov, V.Y., Dushkin, R.V.: Hierarchical associative memory model for artificial general-purpose cognitive agents. Procedia Comput. Sci. 190, 723–727 (2021). https://doi.org/10.1016/j.procs.2021.06.084
11. Stout, D., Khreisheh, N.: Skill learning and human brain evolution: an experimental approach. Camb. Archaeol. J. 25(4), 867–875 (2015). https://doi.org/10.1017/S0959774315000359
12. Leshchev, S.V.: From artificial intelligence to dissipative sociotechnical rationality: cyberphysical and sociocultural matrices of the digital age. In: Popkova, E.G., Ostrovskaya, V.N., Bogoviz, A.V. (eds.) Socio-economic Systems: Paradigms for the Future. SSDC, vol. 314, pp. 65–72. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-56433-9_8
13. Shumsky, S.A.: Machine Intelligence. Essays on the Theory of Machine Learning and Artificial Intelligence, p. 340. RIOR Publ., Moscow (2020). ISBN 978-5-369-01832-3
14. Sundas, A., Bhatia, A., Saggi, M., Ashta, J.: Reinforcement learning. In: Machine Learning and Big Data: Concepts, Algorithms, Tools, and Applications. John Wiley & Sons (2020)
15. LeDoux, J.E.: How does the non-conscious become conscious? Curr. Biol. 30(5), R196–R199 (2020). https://doi.org/10.1016/j.cub.2020.01.033
16. Harnad, S.: To cognize is to categorize: cognition is categorization. In: Handbook of Categorization in Cognitive Science, pp. 19–43. Elsevier (2005). https://doi.org/10.1016/B978-008044612-7/50056-1
17. Dushkin, R.V., Stepankov, V.Y.: Hybrid bionic cognitive architecture for artificial general intelligence agents. Procedia Comput. Sci. 190, 226–230 (2021). https://doi.org/10.1016/j.procs.2021.06.028
18. Dushkin, R.V., Stepankov, V.Y.: Semantic supervised training for general artificial cognitive agents. In: Tallón-Ballesteros, A.J. (ed.) Fuzzy Systems and Data Mining VII: Proceedings of FSDM 2021. IOS Press (2021). https://doi.org/10.3233/FAIA210215
19. Žáček, M., Telnarová, Z.: Language networks and semantic networks. In: Central European Symposium on Thermophysics 2019 (CEST), AIP Conference Proceedings 2116(1), 060007 (2019). https://doi.org/10.1063/1.5114042
Enabling Maritime Digitalization by Extreme-Scale Analytics, AI and Digital Twins: The VesselAI Architecture
Spiros Mouzakitis, Christos Kontzinos(B), John Tsapelas, Ioanna Kanellou, Georgios Kormpakis, Panagiotis Kapsalis, and Dimitris Askounis
Decision Support Systems Lab of the Electrical and Computer Engineering School, National Technical University of Athens, Athens, Greece
[email protected]
Abstract. The beginning of this decade finds artificial intelligence, high performance computing (HPC), and big data analytics at the forefront of the digital transformation that is projected to heavily impact various industries and domains. Among those, the maritime industry has the potential to overcome many shortcomings and challenges through innovative technical solutions that combine the aforementioned technologies. Naval vessels, and shipping in general, generate extremely large amounts of data, the potential of which remains largely untapped due to the limitations of current systems. Simultaneously, digital twins can be used for conducting complex simulations of vessels and their systems to improve efficiency, automate, and evaluate current and future performance. However, they require large amounts of real-time and historical data to simulate efficiently, as well as AI models and high-performance computing that will help the entire system run smoothly and scale to higher volumes of data and computation requirements. Integrating these technologies and tools in a unified system poses various challenges. In this context, the current publication presents the high-level conceptual architecture of VesselAI, an EU-funded project that aims to develop, validate and demonstrate a novel holistic framework based on a combination of state-of-the-art HPC, Big Data and AI technologies, capable of performing extreme-scale and distributed analytics for fuelling the next-generation digital twins in maritime applications and beyond, including vessel motion and behaviour modelling, analysis and prediction, ship energy system design and optimisation, unmanned vessels, route optimisation and fleet intelligence.
Keywords: Maritime · Artificial Intelligence · Data analytics · Digital twins · High-performance computing · High-level architecture
1 Introduction

The beginning of this decade finds Artificial Intelligence, HPC, and Big Data in the forefront of digital transformation. The maritime industry is embracing them wholeheartedly due to their true potential to meet the ever-increasing and complex traffic and
shipping demand for safety, performance, energy efficiency, automation, and environmental impact [1]. According to a report by EMSA [2] covering the period 2011–2018, 23,000 casualties or incidents involving a ship were recorded in European territorial seas, representing an average of 3,239 marine casualties or incidents per year, 65.8% of which were attributed to human error and 20% to system/equipment failures. In the meantime, according to the 3rd IMO GHG study [3], shipping emits around 940 million tons of CO2 annually and is responsible for about 2.5% of global greenhouse gas (GHG) emissions. Even worse, according to the same study, shipping emissions are projected to increase by between 50% and 250% by 2050 if business continues as usual, undermining the objectives of the Paris Agreement [4]. Meanwhile, shipping generates extremely large amounts of data every minute, the potential of which still remains untapped due to the enormous number of stakeholders involved and the sophistication of modern vessel design and operation. The key to addressing these challenges lies in research and innovation, especially in novel algorithms, tools, and platforms within the areas of AI, HPC, and Big Data, a combination of which could unlock new possibilities for a diverse range of current maritime applications: vessel traffic monitoring and management, ship energy system design and operation, autonomous shipping, fleet intelligence and route optimisation, and so on. While digital transformation is progressing rapidly in all aspects of society, it is now the right time to unlock the potential of extreme-scale data and advanced technologies to address the high computational, modelling and data processing demands required for accurate modelling, estimation and optimisation of the design and operation of ships and fleets under various dynamic conditions in a timely manner.

Digital twins, i.e., replicas of physical assets in the digital world, can be used for conducting complex simulations of vessels and their systems to improve efficiency, automate, and evaluate current and future performance. Most current digital twin models rely on physics-based modelling methods (e.g., mechanical modelling), consisting of various modelled systems and physical components where ideally a 1-to-1 cardinality between physical asset and virtual twin exists. In reality, however, a 1-to-N cardinality is usually found, with several more-or-less well-connected twins, each covering a subset of relevant physics dimensions, IP scope, stakeholder perspectives, sub-systems or processes [5]. Often, due to the lack of model availability and performance restrictions (e.g., the need for complex simulations to be completed in a timely manner), oversimplified vessel models are adopted, which presents big challenges, for example, in accurately predicting the manoeuvring performance of a vessel in confined or congested waters. Recent advances in Artificial Intelligence and machine learning make it possible to derive digital twin models from historical data (e.g., using variables like sensor, weather, performance, and fuel consumption data). Such data-driven models can be tuned with massive streams of data from the operating assets to act as a digital twin, enabling modelling and simulations far beyond the current range of available applications, such as modelling the human element in the system (e.g., crew response to specific conditions at sea).
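As a toy illustration of the coupling summarized in Fig. 1 below, the following sketch corrects a crude physics-based estimate with a data-driven residual model; the quadratic fuel-rate formula, the sample data, and the nearest-neighbour corrector are all invented for the example and are not VesselAI code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def physics_model(speed_knots):
    return 0.9 * speed_knots ** 2 / 100.0      # crude fuel-rate estimate (t/h)

# Historical operating data: the observed fuel rate deviates from the model
speeds = np.array([[8.0], [10.0], [12.0], [14.0], [16.0]])
observed = np.array([0.62, 0.98, 1.42, 1.95, 2.60])

# Learn the residual (model bias) from the data
residuals = observed - physics_model(speeds).ravel()
corrector = KNeighborsRegressor(n_neighbors=2).fit(speeds, residuals)

def digital_twin(speed_knots):
    """Physics-based prediction plus a data-driven bias correction."""
    x = np.array([[speed_knots]])
    return physics_model(speed_knots) + corrector.predict(x)[0]

print(round(digital_twin(13.0), 3))
```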
Figure 1 below provides a summary of the coupling between physics-based and data-driven digital twins. Although the data may often be only partial, contain errors and exhibit a degree of uncertainty, coupling physics-based models with data-derived models can lead to increased accuracy of the digital twins, i.e., using data to correct model bias and using the model to correct measurement bias.

Fig. 1. Data-driven digital twins

However, this requires huge amounts of data and computing power, making forecasting with digital twins a highly data- and compute-intensive challenge which goes beyond the capacity of current systems. These complex and computationally intensive challenges hinder the design and development of efficient digital twins that would provide new insights into and perspectives on vessels and their systems during both the design and operation phases. In digital twin environments, data acquisition and analysis involve the entire life cycle (i.e., design, operation, and maintenance) to generate data-driven models and simulations which make it possible to identify corrective measures and recommend preventive actions. To date, a trade-off is still required between model accuracy and speed. Recent advances in deep learning (DL) [6] have been largely driven by the ability to train large models on vast amounts of data, though limited by the computational complexity of the task. The requirements of vessel simulations call for a synergy of stream-processing systems and much better neural network models powered by HPC platforms. Under these circumstances, this paper presents VesselAI, an EU-funded project aiming to develop, validate and demonstrate a novel holistic framework based on a combination of state-of-the-art HPC, Big Data and AI technologies, capable of performing extreme-scale and distributed analytics for fuelling the next-generation digital twins in maritime applications and beyond, including vessel motion and behaviour modelling, analysis and prediction, ship energy system design and optimisation, unmanned vessels, route optimisation and fleet intelligence. Section 1 introduces the scope of the present paper by presenting the current situation and challenges in maritime digitalisation and explaining how AI and other innovative technologies can lead to more effective solutions. Section 2 presents the background of the paper's subject and technological advancements in AI, as well as related work from the research bibliography. Section 3 introduces the VesselAI architecture in detail. Finally, Sect. 4 concludes the paper and describes the future actions that will be taken to realise the project goals.
2 Background

Vessels' design, manoeuvring and behaviour modelling and prediction is one of the most important aspects of maritime research, with countless critical applications, including ship simulations [7, 8], vessel traffic management applications [9, 10], route planning and optimization [11], ship systems design [12] and autonomous ships [13].
On the other hand, research on machine learning models has produced impressive achievements in the past few years. Supervised deep learning in particular, but also reinforcement learning, have taken major steps. TensorFlow [14] and PyTorch [15] are currently leading the scene in deep learning, followed by projects like Sonnet, Horovod and MXNet [16], while ML tools include libraries such as SciPy, scikit-learn, Spark MLlib, OpenCV, and scikit-image for image processing [17]. These tools support out-of-the-box capabilities for a vast number of ML techniques, including supervised, semi-, self- and unsupervised learning, deep reinforcement learning, and inductive/deductive/transductive learning, for different types of neural networks including CNN, DCN, MLP, RNN, BNN, GNN, LSTM and GAN [18], depending on the usage, such as recommendation systems, classification, automation, time-series prediction, image classification, anomaly detection in data, and natural language understanding. The latest versions of these libraries provide advanced features such as tools for easier model building with premade estimators, transfer learning, federated learning, and automatic differentiation for optimizing models. Lastly, there are efforts to automate the machine learning pipeline (AutoML) with services like CloudML [19].

Despite the significant improvement of ML and DL tools, it is not easy to bring the core technology into practice. According to a Gartner study in 2018, only 4% of companies have deployed AI-based solutions. We also know from software engineering studies [20] that testing and maintenance consume the majority of development costs, with 15% and 67% shares, respectively. Therefore, the key challenge is how to integrate ML model development into the software systems development process. Machine learning models have characteristics that make them different from regular software modules. First, their results are statistical, and it is not straightforward how to deal with cases when they contain errors. Second, their training and operation are highly dependent on the quality of data. In an extreme-scale system it is almost certain that there will be occasional problems in data collection, e.g., because of faulty sensors or because of disruptions in communication. Therefore, ensuring the robustness of the system under faulty data is essential. Moreover, ensuring data integrity and representativeness, as well as designing appropriate data de-biasing techniques, detecting anomalies and increasing the predictive quality of the model, is still a major challenge.

Today's software development is increasingly moving in the direction of continuous integration and continuous delivery (CI/CD), and software development pipelines supporting this approach are widely used. However, the regular approach is not suitable for the introduction of machine learning modules. The ML modules could retrain themselves automatically but, often, the retraining is computationally very heavy. Therefore, judicious decision making is needed to detect when and how an ML module needs to be retrained. It is also important to understand how the interdependencies between the modules influence the retraining needs. It is therefore not enough to focus on a single ML module at a time; the whole system, consisting of many classical and ML modules, must also be kept in mind.
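The "when to retrain" decision discussed above could be approached, for instance, with a simple drift monitor that compares a rolling error against the post-training baseline; the window size, tolerance and the whole class below are illustrative assumptions, not part of VesselAI.

```python
from collections import deque

class RetrainMonitor:
    """Signal retraining when the rolling error drifts past a tolerance."""
    def __init__(self, window=100, tolerance=1.5):
        self.errors = deque(maxlen=window)
        self.baseline = None            # error level right after last training
        self.tolerance = tolerance

    def observe(self, error):
        self.errors.append(error)
        mean = sum(self.errors) / len(self.errors)
        if self.baseline is None and len(self.errors) == self.errors.maxlen:
            self.baseline = mean        # freeze the post-training error level
        return self.baseline is not None and mean > self.tolerance * self.baseline

monitor = RetrainMonitor(window=5)
stream = [0.10, 0.11, 0.09, 0.10, 0.10,    # stable after training
          0.30, 0.35, 0.40, 0.45, 0.50]    # drift: should trigger retraining
for e in stream:
    if monitor.observe(e):
        print("retrain triggered at error", e)
        break
```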
3 VesselAI High-Level Architecture

The current chapter describes in detail the high-level architecture of the VesselAI solution. VesselAI will set up the necessary tools, services and workflows in alignment
and synergy with the AI4EU platform [21], by designing and deploying the data, HPC and AI services as microservices interoperable by the AcumosAI framework [22] currently planned by the AI4EU platform. Figure 2 provides a high-level architecture that will be followed in VesselAI.
Fig. 2. VesselAI high-level architecture
The following paragraphs describe in detail the high-level architecture of the VesselAI technical solution, which is divided into 5 layers and 12 components. The Data Services and Semantic Enrichment layer is at the bottom of the VesselAI platform and is responsible for feeding both private and public data into the platform. This layer will be able to handle data of different formats and structures (CSV, Apache Parquet, ORC, images, JSON, AIS, netCDF, relational, satellite and more) with varying data life cycles: persistent, infrequently changed data workflow rules and information, sensitive transactions, and transient streaming data typically containing sensor measurements. Five different components can be distinguished. The Data Ingestion module is responsible for importing structured, semi-structured and unstructured data of different formats from its point of origin into the VesselAI system, where it can be processed, stored, and analysed in the upper layers. Data ingestion in VesselAI will support input from Vessel Traffic Service (VTS) systems and maritime applications, simulators, sensors and smart IoT devices, either continuously
or asynchronously, through two different ingestion workflows for batch and streaming data respectively, depending on the characteristics of the source and the needs of the upper-layer applications. Batch data import will be performed by uploading large amounts of historical data to the HDFS Data Lake, performing all the needed cleansing and semantic enrichment actions and importing them into a distributed data warehouse solution (e.g., Apache Hive). Streaming data ingestion will follow a different approach by utilising a stream-processing platform, such as Apache Kafka or Apache Storm, to act as a publisher/subscriber system, which connects to multiple streaming data sources, gathers the information through isolated data channels, and either routes it to the cleansing and semantic enrichment modules or sends it directly to the services layer to be consumed by a running analytical model. The data ingestion module will be designed and implemented in such a way that it is easily adaptable and extendable to handle changing input sources over time.

The Data Cleansing, Curation and Anonymisation module is responsible for data pre-processing, from restructuring, predefined value substitutions and reformatting of fields (e.g., dates) to more advanced processes such as detecting outliers and eliminating them from a data set, handling data inconsistencies, and noise reduction. Particularly in VesselAI's context of extreme data processing and analysis, data cleansing and curation will require big data processing methods. Moreover, in order to provide the ability, if desired and applicable, to share data sets (e.g., with the community or hackathons) without private information (e.g., vessel names, routes, operational information, identifiers), this module will also perform anonymisation during the data ingestion process to protect such information by complete data removal (suppression), generalization or pseudonymisation. No personal data will be processed in the platform. Implementation tools to be utilised include the ARX data anonymization tool [23], the UTD Anonymization Toolbox [24], and the CloudTeams Anonymization Tool [25], which provides a data-structure-agnostic and flexible tool for real-time anonymization of data.

The Semantic Enrichment module is the point of the architecture where semantics and metadata are added to the original data sets. Persistence and data preparation are also considered part of this layer, as the choice of internal data representation formats is closely connected to the nature of the enriched data, as well as the various operations that will be executed in the upper layers. The process of data semantic enrichment requires two distinct yet equally important factors: domain-related vocabularies and schemas, as well as input from knowledge experts. Both are required to conduct high-quality and accurate semantic enrichment. In this process, knowledge experts or users with at least some experience with the schema of the underlying data should be asked to help the system map the raw data to predetermined, widely accepted vocabularies and ontologies related to the appropriate subdomain. These will form the Common Data Model and VesselAI Ontology, which vastly serves data interoperability by ensuring that all data processed by the system adheres to the same standards of semantics based on a common set of terms, concepts and relations across different data sources.
The target of this module is to maximise the additional semantics while using techniques to minimise the overhead in data volumes. The Triplestore and Reasoning Engine will leverage a graph database to be used as a triplestore to persist the data set semantics and any RDF information produced by the
Semantic Enrichment Module. On top of that, a Semantic Reasoning Engine is going to enable the application of semantic queries on the triplestore to retrieve the semantic information, and the performance of reasoning operations to extract new insights, with a view to supporting semantic queries and AI reasoning over the available knowledge for the digital twin and maritime applications of the project pilot use cases. This includes the creation of interactive knowledge graphs, advanced visualisations and dashboards, and intelligent querying and search capabilities exposed as APIs to the VesselAI applications or directly to UIs and recommender engines. In this way, machine learning algorithms will be enhanced for content classification in order to produce higher-precision models. The Distributed Query Execution Engine is responsible for handling data retrieval from the distributed data warehouse. In addition, a powerful analytical column store, such as MonetDB [26], will be used to handle complex analytics workloads. Such engines provide the ability to perform complex queries on a distributed data lake in highly efficient and scalable ways. The distributed query execution over a pure memory-based architecture allows fast generation of the result sets required by the analytical processes.

The AI Models Development and Serving layer, supporting federated and continuous learning, is the middle tier of the VesselAI architecture and the core of the Artificial Intelligence capabilities of the overall system. It includes a coordinated set of methods and tools that will enable the creation of multi-fidelity analytical models and the serving of the generated models to the different applications. It will have access to the semantically enriched data already ingested into the VesselAI platform and will allow a customisable orchestration of the different algorithms and technologies that will be used by the resulting models. The AI services will be integrated with and extended by the AcumosAI platform technology and will include the following components: The Data Feed module acts as the intermediary that retrieves the data from the underlying storage layer, performs the necessary data and metadata preparation and validation using descriptive statistics, and carries out the necessary transformations before passing the retrieved data to the training models. This stage is necessary because the models cannot operate on the raw data as-is: each model requires the data to be formed in a specific way, and the validity of the data and metadata must be checked to ensure tainted data does not accidentally sneak into the training pipeline. In more detail, this module will interact with the query execution engines to perform the proper queries to retrieve the data, perform a set of transformations, such as handling missing values, normalising them if needed, updating the schema and selecting the right features, and pass the final data to the AI models. The ML Suite is a library of state-of-the-art AI data-driven tools and methods that will be used to develop the AI models of VesselAI. Different technologies and software will be exploited for machine learning and image processing. In this context, these tools will expose a rich software library for defining, training, and deploying machine learning models, including artificial neural networks, classifiers, and knowledge representation and reasoning, in order to attach new knowledge and predictions to the existing extreme-scale streams of data.
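A minimal sketch of the Data Feed preparation step might look as follows, assuming pandas and scikit-learn; the column names and the mean-imputation choice are invented for illustration and are not the module's actual interface.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Data retrieved from the storage layer (hypothetical columns)
df = pd.DataFrame({
    "speed":      [11.2, None, 13.5, 12.1],
    "wind_speed": [4.0, 6.5, None, 5.2],
    "fuel_rate":  [1.1, 1.4, 1.8, 1.3],
})

# 1. Handle missing values (simple mean imputation as an example)
df = df.fillna(df.mean(numeric_only=True))

# 2. Select the features the model expects
features, target = df[["speed", "wind_speed"]], df["fuel_rate"]

# 3. Normalise before passing the data on to the training models
X = MinMaxScaler().fit_transform(features)
print(X.shape, target.shape)  # (4, 2) (4,)
```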
The Model Development module will exploit the ML Suite and use the available tools to create and train models based on data assimilation between data-driven and physics-based models. By interacting with the HPC components, it will allow the massive-scale
training, evaluation and validation of numerous analytical models, fed with large amounts of data, in a way that ensures the ever-growing increase of the models' accuracy and efficiency throughout VesselAI's operation by following federated and continuous learning. The Model Serving module will include the set of developed and trained models produced by the VesselAI Model Development module and will constitute the building blocks of the upper layer. These models will be fed with both batch and streaming data coming from the Query Engine and the Stream Manager respectively, and the module will submit execution jobs to the HPC cluster to get the results and pass them to the Application Layer. Furthermore, the pre-trained models will be shared in the AI4EU repository while being continuously updated through the lifelong learning process from observational data.

The HPC Services Layer will be responsible for the execution of any computation planned within the VesselAI models, utilising HPC resources and state-of-the-art techniques to address the extreme-scale needs of the project. Computation jobs will include all the jobs for model training, evaluation, validation and serving, and for these actions suitable endpoints will be exposed to the AI Services Layer for job submission. Job execution on the HPC cluster will utilise containerisation technologies to exploit the ability of containers to package entire scientific workflows, software, libraries and even data. For this, Singularity containers are going to be used, as they allow containerised machine learning and deep learning tasks to fully benefit from the high-bandwidth and low-latency characteristics of HPC and to execute containers as if they were native programmes or scripts on a host computer. Orchestration and workflow controllers will be responsible for supervising and monitoring the job executions, and for batching multiple requests into a single request to significantly reduce the cost of performing inference, especially in the presence of hardware accelerators. The VesselAI HPC layer will consist of not only CPU and GPU but also novel TPU clusters, utilising neuromorphic processors and Message Passing Interfaces (MPIs) to ensure high performance.

The Application Layer is the highest level of the VesselAI platform and allows the execution of user applications on top of the VesselAI models and infrastructure. The VesselAI application layer will constitute the foundation of the VesselAI integrated solution and will include both a web interface exposed to users and external APIs for third-party applications. More specifically, it will include the following services: The Visualisations and Reports Engine will be responsible for the visual representation of the stored data and the results produced by the analytical components, including the models' efficiency and explainability. It will offer a variety of visual representations, including charts and map visualisations, which can be customised to satisfy different user needs. The produced visualisations will be available to the end-user applications but will also be used to create custom dynamic dashboards and reports consisting of multiple visualisations of different types, configured by the user. The Data Analytics Environment will allow users to perform freeform queries and data analytics on the assets which are accessible to the platform. Moreover, the environment will enable users to translate queries into data processing actions and then express
the responses graphically. The environment will have the ability to conduct ad-hoc analysis on billions of rows of data thanks to cutting-edge parallel processing techniques that achieve high performance for query preparation and execution. The VesselAI Digital Twin Applications will include all the new end-user applications that will be developed, or existing ones that will be extended, by exploiting the developed and trained models served by the AI Services Layer. Based on the concept of digital twins, the developed applications will create digital representations of physical systems such as ports, vessels, vessels analysed per subsystem, energy systems and more, to perform analytics and simulations, examine multiple scenarios, run tests, make predictions and estimations in the digital space, and generate actionable insights and optimisation recommendations for a number of use cases in the physical world based on custom criteria. Such applications will use vessel manoeuvring and behavioural models to predict the forces and moments acting on a vessel depending on its characteristics, optimise the vessel route and its voyage planning, simulate and analyse port operations for better management, optimise the operation of energy systems on vessels, and examine the navigation decisions of autonomous ships under specific environmental and economic conditions. The End-to-End Security Framework will ensure security, fine-grained access control, and encryption across all the architectural components and the communication between them, especially the federated operations, which are more vulnerable on individual nodes. It will guarantee that the platform keeps all information safe, prevents any unauthorised actions, and enhances the trustworthiness of the system.
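The request batching performed by the orchestration and workflow controllers can be illustrated with a small sketch; the queue-based loop, batch size and wait time below are assumptions for the example, and the model call is a placeholder rather than an actual HPC invocation.

```python
import time
from queue import Queue, Empty

request_queue: Queue = Queue()

def model_predict(batch):
    # Placeholder for a real (e.g., GPU- or HPC-backed) model invocation
    return [x * 2 for x in batch]

def serve_batches(max_batch=32, max_wait_s=0.05):
    while True:
        batch = [request_queue.get()]           # block until a request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch and time.monotonic() < deadline:
            try:
                remaining = max(deadline - time.monotonic(), 1e-6)
                batch.append(request_queue.get(timeout=remaining))
            except Empty:
                break
        yield model_predict(batch)              # one call serves many requests

for x in (1, 2, 3):
    request_queue.put(x)
print(next(serve_batches()))  # [2, 4, 6]
```

Grouping requests this way amortises the per-call overhead of the accelerator, which is the motivation stated above.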
4 Conclusions and Next Steps

In this era of rapid digitization and unprecedented data availability, the maritime industry is trying to leverage the new generation of innovative tools and technologies to build the next generation of maritime services, including vessel motion and behaviour modelling, analysis and prediction, ship energy system design and optimisation, unmanned vessels, route optimisation and fleet intelligence. Artificial intelligence is at the forefront of maritime developments, as it provides the ability to create maritime-specific models that can provide the computational intelligence needed for various maritime and shipping processes. Current challenges that hinder the massive application of such solutions include the huge amounts of data and computational capabilities required to efficiently train the AI models. Apart from that, an accurate maritime model requires not only historical but also real-time data, generated by a vessel's sensors and other sources. VesselAI, the project presented in this publication, recognizes these issues and plans to address them by developing an innovative technical solution that combines AI, big data analytics, and HPC, among others. Furthermore, the VesselAI solution will be applied in four distinct pilot use cases that will validate the system, covering vessel traffic monitoring and management, optimal design of ship energy systems, autonomous ships in short sea transports, and weather routing. At its current stage, the project has completed the initial tasks and requirement elicitation stages and has started setting out the technical specifications of its various components. Future steps of the project include the finalization of technical requirements,
the gathering of the necessary datasets that will be used to train the models, and the development and training of the AI models. Many of these tasks are already underway. The main challenge so far lies not in developing and validating the individual components but rather in integrating them into a unified system that operates smoothly. The role of the HPC infrastructure is central, as it is necessary if the system is to handle the intensive data and computational requirements of the project's use cases and the maritime industry at large.
References
1. McKinsey: How container shipping could reinvent itself for the digital age. https://www.mckinsey.com/industries/travel-transport-and-logistics/our-insights/how-container-shipping-could-reinvent-itself-for-the-digital-age. Last Accessed 10 Jan 2022
2. EMSA: Annual Overview of Marine Casualties and Incidents. http://www.emsa.europa.eu/newsroom/latest-news/item/3734-annual-overview-of-marine-casualties-and-incidents-2019.html. Last Accessed 10 Jan 2022
3. Third IMO GHG Study 2014. https://www.imo.org/en/OurWork/Environment/Pages/Air-Pollution.aspx. Last Accessed 10 Jan 2022
4. United Nations Climate Change: The Paris Agreement. https://unfccc.int/process-and-meetings/the-paris-agreement/the-paris-agreement. Last Accessed 10 Jan 2022
5. Erikstad, S.O.: Merging physics, big data analytics and simulation for the next-generation digital twin. In: High-Performance Marine Vehicles, pp. 141–151 (2017)
6. Alvarellos, A., Figuero, A., Sande, J., Peña, E., Rabuñal, J.: Deep learning based ship movement prediction system architecture. In: Rojas, I., Joya, G., Catala, A. (eds.) IWANN 2019. LNCS, vol. 11506, pp. 844–855. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20521-8_69
7. Xue, Y., Clelland, D., Lee, B.S., Han, D.: Automatic simulation of ship navigation. Ocean Eng. 38(17–18), 2290–2305 (2011)
8. Ni, S., Liu, Z., Cai, Y.: Ship manoeuvrability-based simulation for ship navigation in collision situations. J. Mar. Sci. Eng. 7(4), 90 (2019)
9. Skvarnik, I.S., Sovkova, O.I., Statsenko, L.G.: Wireless broadband access technology for building of communication and data transfer networks of vessel traffic management system. In: 2019 International Multi-Conference on Industrial Engineering and Modern Technologies (FarEastCon), pp. 1–5. IEEE (2019)
10. Allersma, S.: Digitalization of Vessel Traffic Management in Port Areas: Gaining Insight into VHF-Communication and Research into Solutions for Further Reduction. Delft University of Technology (2021)
11. Zaccone, R., Figari, M., Martelli, M.: An optimization tool for ship route planning in real weather scenarios. In: The 28th International Ocean and Polar Engineering Conference, pp. 738–744. OnePetro (2018)
12. Trivyza, N.L., Rentizelas, A., Theotokatos, G., Boulougouris, E.: Decision support methods for sustainable ship energy systems: a state-of-the-art review. Energy 239, 122288 (2022)
13. Munim, Z.H.: Autonomous ships: a review, innovative applications and future maritime business models. Supply Chain Forum: An International Journal 20(4), 266–279. Taylor & Francis (2019)
14. TensorFlow 2.0 overview. https://www.tensorflow.org/guide/effective_tf2. Last Accessed 10 Jan 2022
15. PyTorch homepage. https://pytorch.org/. Last Accessed 10 Jan 2022
16. Apache MXNet. https://mxnet.apache.org/. Last Accessed 10 Jan 2022
256 17. 18. 19. 20. 21. 22. 23. 24. 25. 26.
S. Mouzakitis et al. Scikits: https://www.scipy.org/scikits.html. Last Accessed 10 Jan 2022 Haykin, S.: Neural Networks and Learning Machines. 3/E. Pearson Education India (2010) AutoML. https://cloud.google.com/automl. Last Accessed 10 Jan 2022 Schach, S.R.: Object-Oriented and Classical Software Engineering, vol. 6. McGraw-Hill, New York (2007) AI4Europe. https://www.ai4europe.eu/. Last Accessed 10 Jan 2022 AcumosAI. https://www.acumos.org/. Last Accessed 10 Jan 2022 ARX Anonymisation tool. https://arx.deidentifier.org/anonymization-tool/. Last Accessed 10 Jan 2022 UTD Anonymisation tool. http://www.cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php. Last Accessed 10 Jan 2022 NTUA Anonymisation tool. https://github.com/epu-ntua/anonymizer2. Last Accessed 10 Jan 2022 MonetDB. https://www.monetdb.org/. Last Accessed 10 Jan 2022
Computing with Words for Industrial Applications

Aisultan Kali, Pakizar Shamoi(B), Yessenaly Zhangbyrbayev, and Aiganym Zhandaulet

Faculty of Information Technology, Kazakh-British Technical University, Almaty, Kazakhstan
[email protected]
Abstract. More and more industrial applications are facing the problem of handling imprecise information. Computing with Words (CW) is a mathematical model for approximate knowledge representation, reasoning, and processing of natural language. The basic idea of CW is to use words instead of numbers for computing and reasoning by means of fuzzy sets and logic. However, implementing this approach in a project requires certain knowledge. This paper presents our initial efforts towards building a methodology and library, CWiPy, based on an extended version of CW. CWiPy can be used to apply CW techniques easily, without any prior knowledge in this field, so developers can add it to an existing system and use it as a black box (plug and play). In CWiPy, traditional CW was extended to process a bigger variety of linguistic hedges, enhancing the system's expressiveness. CWiPy provides an API that allows handling of fuzzy variables, sets, hedges, and quantifiers. Results show that CWiPy can be easily applied in real-life industrial applications to deal with imprecise information and provide help for experts. Two usage scenarios of the library are presented as a proof of concept: natural language query processing and database summarization.

Keywords: Fuzzy sets · Fuzzy logic · Library · Computing with Words · Summarization · Querying

1 Introduction
Nowadays the problem of computer-assisted decision making in environments of imprecision is becoming more and more relevant. One possible way to deal with this class of problems lies in the application of fuzzy sets theory [17]. However, implementing this approach in a project requires certain knowledge, and for industrial developers the learning curve can be quite steep, impeding the software engineering process. Our aim is to make Computing with Words (CW) usable for developers who have no background knowledge in fuzzy sets theory. The main idea behind fuzzy sets is that in real life it is often too difficult to determine whether some object belongs to some set or not, e.g. can jeans that
I bought half a year ago be called "new"? The fuzzy logic approach allows defining the degree of membership of an object in a set. Many properties of traditional sets, such as union, intersection, and associativity, carry over to fuzzy sets [17,18]. In CW, words are used instead of numbers for computing and reasoning by means of fuzzy sets and logic. The basic idea of CW is to describe natural language in a numerical form that is understandable for a machine. Fuzzy sets and logic play a pivotal role in CW [15]. In CW, a word is represented by a fuzzy set of points drawn together by similarity, with the fuzzy set serving as a constraint on a variable [15]. We use the following basic concepts from fuzzy set theory under the frame of CW:

1. Fuzzy Set Definition. A fuzzy subset A of a universe of discourse U is characterized by a membership function $\mu_A : U \to [0, 1]$ which associates with each element y of U a number $\mu_A(y)$ in the interval [0, 1] representing the "grade of membership" of y in A [17]. For example, on the interval [0, 100] with y a temperature, the fuzzy set hot can be represented as:

\mu_{hot}(y) = \begin{cases} 0 & \text{if } 0 \le y \le 30 \\ 0.025y - 0.75 & \text{if } 30 < y < 70 \\ 1 & \text{if } 70 \le y \le 100 \end{cases}   (1)

In this case temperature is a linguistic variable [18], i.e. a fuzzy variable whose values are fuzzy sets labeled by words (e.g., hot).

2. Operations on Fuzzy Sets. Fuzzy logic provides operators on fuzzy sets that are counterparts to the Boolean logic operators such as equality, complement, union and intersection. For example, intersection and union are defined through the min and max operators, which are analogous to product and sum in algebra. They satisfy the associative property, the distributive laws, and De Morgan's laws; the proofs of these properties can be found in [17]. The complement of a fuzzy set A is denoted by A' and is defined by [17]:

\mu_{A'}(y) = 1 - \mu_A(y)   (2)

The α-cut is a crisp set that includes all members of a given fuzzy subset f whose membership values are not less than α, for α in the range [0, 1]:

f_\alpha = \{x : \mu_f(x) \ge \alpha\}   (3)

Normalization can be defined as division by the supremum of the membership function [14]:

\mu_{NORM(x)}(u) = \Big(\sup_{u \in U} \mu_x(u)\Big)^{-1} \mu_x(u)   (4)

3. Fuzzy Relation. A fuzzy relation A from a set X to a set Y is a fuzzy subset of the Cartesian product X × Y (the collection of ordered pairs (x, y) with x ∈ X, y ∈ Y) [19]. The composition of two relations is given by:

f_{B \circ A}(x, y) = \sup_{\upsilon} \min[f_A(x, \upsilon), f_B(\upsilon, y)]   (5)
4. Linguistic Hedges, or Modifiers. Hedges are linguistic expressions by which other expressions are modified. Generally, a linguistic term is modeled by a fuzzy set and a modifier, that is, by an operation that transforms one fuzzy set into another [18]. For example, the values of temperature may be hot, not hot, very hot, not very hot, cold, etc. In general, a value of a linguistic variable is a composite term x = x_1 x_2 ... x_n, a concatenation of atomic terms x_1, x_2, ..., x_n [18,19].

5. Fuzzy Quantifier. A fuzzy quantifier Q is represented as a function whose domain depends on whether it is absolute or relative:

Q_{abs} : \mathbb{R} \to [0, 1], \quad Q_{rel} : [0, 1] \to [0, 1]   (6)

The domain of Q_rel is [0, 1] because the ratio a/b ∈ [0, 1], where a is the number of elements fulfilling a certain condition and b is the total number of existing elements.

Consider an example: most students are very young. Here the age of a student is a linguistic variable, young is a label of a fuzzy set applied to age, very is a linguistic hedge, and most is a quantifier.

In this paper, we propose an extended CW model and a corresponding Python library supporting its basic operations. The decision to write the library in Python was made because it is the most popular language for artificial intelligence and data analysis, often called "the language of science". As for related work, there are numerous wide-ranging ramifications and applications of CW [5,9] and existing fuzzy sets and logic tools in general, such as Fril++ [1] and FuzzyJess [10]. Our contribution primarily lies in providing a plug-and-play library for Python that brings the practical applicability of CW to a new level. Moreover, the traditional CW methodology [15] was extended with a bigger knowledge base of modifiers, enhancing the model's expressiveness.

The structure of the paper is as follows. Section 1 is this introduction. An overview of Zadeh's Computing with Words is presented in Sect. 2. In Sect. 3 we introduce CWiPy, an extended model based on CW, along with the software library. Two example prototype CWiPy-based applications, with results and examples, are shown in Sect. 4. Finally, concluding remarks are given in Sect. 5.
2 Computing with Words
CW brings a better rapport with reality by using words instead of numbers for reasoning [15]. The main principle underlying the CW methodology is the Generalized Constraint Language (GCL). A generalized constraint is represented as X isr R, where isr (pronounced ezar) is a variable copula which defines how R constrains X. More specifically, the role of R in relation to X is defined by the value of the discrete variable r. Some of the values of r and their interpretations are listed below (for the full list, refer to [15]):
– e: equal (abbreviated to =)
– d: disjunctive (possibilistic)
– c: conjunctive
– p: probabilistic
– ps: rough set (Pawlak set)
Example of a generalized constraint: An apartment price is very high. Here price is the linguistic variable, apartment is the object, the linguistic value is high, the hedge is very, and the semantic modality is possibilistic. In this paper we consider the case r = d. The fuzzy set plays the role of a fuzzy constraint on a variable.

Let us consider one way of representing fuzzy sets. Let U = {y_1, y_2, ..., y_n} be a universe of discourse. From here on we represent it as [19]:

U = y_1 + y_2 + \dots + y_n = \sum_{k=1}^{n} y_k   (7)
Let A be a fuzzy set of U with membership function \mu_A(y). We use the following representation of A [19]:

A = \mu_A(y_1)/y_1 + \dots + \mu_A(y_n)/y_n = \sum_{k=1}^{n} \mu_A(y_k)/y_k   (8)
Note that in the case when U is a continuous set, A takes the form [19]:

A = \int_U \mu_A(y)/y   (9)
Returning to the example in Eq. (1), where the value of the linguistic variable temperature is defined by the membership function hot, and combining it with Eq. (9), we get:

\mu_{hot}(y) = \int_0^{30} 0/y + \int_{30}^{70} (0.025y - 0.75)/y + \int_{70}^{100} 1/y   (10)
Now let us look at how linguistic hedges are applied using fuzzy sets theory. Linguistic hedges play the role of modifiers that can be applied to fuzzy sets. Here we demonstrate the meaning of the phrase not very hot temperature. The word very does not have an exact meaning in natural language. CW suggests that very x, where x is a term, is defined as the square of x, that is [19]:

\mu_{very\,x}(y) = (\mu_x(y))^2   (11)

Similarly, the word not is defined as follows [19]:

\mu_{not\,x}(y) = 1 - \mu_x(y)   (12)
Thus, the fuzzy set labeled very hot becomes:

\mu_{very\,hot}(y) = \int_0^{30} 0/y + \int_{30}^{70} (0.025y - 0.75)^2/y + \int_{70}^{100} 1/y   (13)
Then, the value of the composite term not very hot becomes:

\mu_{not\,very\,hot}(y) = \int_0^{30} 1/y + \int_{30}^{70} \big(1 - (0.025y - 0.75)^2\big)/y + \int_{70}^{100} 0/y   (14)
Another fundamental part of CW is deduction rules; for now, CWiPy supports GCL only.
3 CWiPy Methodology and Library
The goal of this study is the development of an extended CW methodology and a library in the Python programming language (CWiPy). Knowledge in CWiPy is represented using a standard generalized constraint of the form X isr R, where X is a linguistic variable constrained by the linguistic term (fuzzy set) R, and r is the semantic modality of the constraint. The modality can be either probabilistic or possibilistic; so far we have implemented only the possibilistic modality. For encoding words, we use fuzzy sets expressed via either trapezoidal or triangular membership functions (or crisp intervals, e.g., married/not married). To ease computations, we use a fuzzy partition. As a simple illustration, consider the proposition "Price is high": high is the fuzzy set that serves as a constraint on the price of a house.

We extend Zadeh's basic CW. CWiPy supports fuzzy linguistic variables, linguistic terms, an extended set of modifiers (hedges), quantifiers, and connectives. Installation and usage of the library are easy; all a user needs to do is "plug and play", with no need to dive into the details of fuzzy logic. Python's syntax rules allow developers to focus on the logic itself instead of worrying about additional code.

3.1 Hedges in CWiPy
CWiPy provides an extended set of fuzzy hedges, allowing more language expressiveness. It supports the following set of linguistic modifiers, or hedges (see Fig. 2) [2,14], and also supports the processing of combinations of hedges, like not very cheap and not very expensive. In Fig. 1 we take the point 180 cm and calculate its grade of membership in each hedged fuzzy set. For example, given a threshold value of 0.5, the height 180 cm is considered tall, but not very tall and not extremely tall. As can be observed, the very and extremely hedges are intensifiers, so they steepen the curve.
Fig. 1. Calculating the membership of 180 cm to hedged fuzzy sets: tall, very tall, extremely tall. µtall (180) = 0.66, µvery tall (180) = 0.44, µextremely tall (180) = 0.29.
Fig. 2. Illustration of some linguistic hedges supported by CWiPy - very, not, extremely, somewhat, sort of, more-or-less, indeed, quite, slightly, highly.
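To make Eqs. (11)-(12) and Fig. 1 concrete, the following minimal sketch applies hedges to a membership function for tall. This is our own illustration, not CWiPy's actual API: the linear ramp from 170 cm to 185 cm is an assumed parameterization chosen to reproduce the figure's values (up to rounding), and modelling extremely as cubing is a commonly used convention we adopt here.

# Minimal hedge sketch (assumed parameterization, not the CWiPy API).
def mu_tall(height_cm):
    # Assumed membership function for 'tall': linear ramp from 170 cm to 185 cm.
    return min(max((height_cm - 170) / 15.0, 0.0), 1.0)

# Hedges as fuzzy-set transformations: very squares (Eq. (11)), not complements
# (Eq. (12)); 'extremely' as cubing is an assumed, commonly used convention.
very = lambda mu: lambda x: mu(x) ** 2
extremely = lambda mu: lambda x: mu(x) ** 3
not_ = lambda mu: lambda x: 1.0 - mu(x)

# Hedges compose, e.g. "not very tall":
not_very_tall = not_(very(mu_tall))

print(round(mu_tall(180), 2))             # 0.67, cf. 0.66 in Fig. 1
print(round(very(mu_tall)(180), 2))       # 0.44, as in Fig. 1
print(round(extremely(mu_tall)(180), 2))  # 0.3, cf. 0.29 in Fig. 1
print(round(not_very_tall(180), 2))       # 0.56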
3.2 Quantifiers in CWiPy
As defined in Eq. (6), fuzzy quantifiers [3,6–8,20] are expressions that allow us to express fuzzy proportions, providing an approximate idea of the number of elements of a subset fulfilling a certain condition, or of the proportion of this number in relation to the total number of possible elements. Quantifiers can be either absolute or relative. Absolute quantifiers express quantities directly over the elements of a given set, e.g., much more than 50, close to 100, a significant number of. Relative quantifiers express the proportion of elements that meet a specific condition relative to the total number of elements, e.g., the minority, most, little of, about the third part of, and so on. Quantifiers (QGC) are also represented via fuzzy sets (see Fig. 3); e.g., in the proposition "Most prices are cheap", most is a quantifier.

3.3 Architecture of CWiPy
The Modifier class implements modifier (hedge) behaviour. Each modifier has its own modifying function and name, as stated in the modifiers table. All listed modifiers exist in list_modifiers and dict_modifiers as static methods. As mentioned, modifiers can be applied together to form a composite modifier.
Fig. 3. Illustration of some quantifiers supported by CWiPy - almost none, few, about half, many, most.
Such composite modifiers are formed by calling one modifier on another, e.g., very_very = very(very); modifiers also support and and or operations. Next, the MembershipFunction abstract class has two child classes: TriangularMembershipFunction and TrapezoidMembershipFunction. Each of them implements the abstract call and extract_range methods: extract_range accepts an alpha_cut argument and returns the range where the function value is greater than or equal to the given alpha-cut, while the call method accepts a point x and returns the value of the function at x. A Modifier can be applied to each of these membership function classes via the modifier argument. The TriangularMembershipFunction class has three parameters, a, b, c: the function starts at point a, grows linearly until point b, and then decreases until point c. The TrapezoidMembershipFunction class behaves the same way with arguments a, b, c, and d: the function starts at a, grows linearly until b, holds its value until c, and decreases until d. The QuantifierSet class implements quantifiers. It has static fields for each quantifier: almost none, few, some, many, most (see Fig. 3), each returning a (triangular or trapezoidal) membership function with appropriate arguments. All of these classes are listed in Fig. 4 with their corresponding fields and methods; documentation is provided in Fig. 5.

In CW, there are initial and terminal data sets (IDS and TDS), consisting of propositions expressed in natural language. CWiPy can accept as IDS not only natural language propositions but also natural queries and database dumps, and it can produce a TDS in the form of either linguistic expressions (derived constraints translated into natural language) or database records. The developed library enables reasoning in the framework of CW and can be highly efficient for expert systems requiring expressiveness and natural language power. Along with the library, we developed two prototype example applications that make use of CWiPy.
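The class structure just described can be condensed into the following sketch. Names and signatures follow the class diagram in Fig. 4; the method bodies are our assumptions, not CWiPy's actual source.

from abc import ABC, abstractmethod

class Modifier:
    # A named hedge wrapping a modifying function, e.g. very: x -> x**2.
    def __init__(self, name, func):
        self.name, self.func = name, func

    def __call__(self, other):
        # Calling one modifier on another yields a composite modifier,
        # e.g. very_very = very(very).
        return Modifier(f"{self.name} {other.name}",
                        lambda x: self.func(other.func(x)))

class MembershipFunction(ABC):
    def __init__(self, modifier=None):
        self.modifier = modifier

    @abstractmethod
    def __call__(self, x): ...

    @abstractmethod
    def extract_range(self, alpha_cut): ...

class TrapezoidMembershipFunction(MembershipFunction):
    # Grows linearly from a to b, holds 1 from b to c, decreases to 0 at d.
    def __init__(self, a, b, c, d, modifier=None):
        super().__init__(modifier)
        self.a, self.b, self.c, self.d = a, b, c, d

    def __call__(self, x):
        if x <= self.a or x >= self.d:
            y = 0.0
        elif self.b <= x <= self.c:
            y = 1.0
        elif x < self.b:
            y = (x - self.a) / (self.b - self.a)
        else:
            y = (self.d - x) / (self.d - self.c)
        return self.modifier.func(y) if self.modifier else y

    def extract_range(self, alpha_cut):
        # Interval where the (unmodified) trapezoid is >= alpha_cut.
        return (self.a + alpha_cut * (self.b - self.a),
                self.d - alpha_cut * (self.d - self.c))

# TriangularMembershipFunction(a, b, c) is the special case b == c.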
Fig. 4. CWiPy library class diagram
Fig. 5. Documentation of CWiPy
4 Application and Results
CWiPy is a flexible Python library that can handle various scenarios and can be useful for any task that benefits from CW. As its creator Zadeh specified in [16], there are two imperatives for CW:
– when the available information is imprecise;
– when there is a tolerance for imprecision that can be exploited to achieve usability, robustness, low cost of the solution, and a better rapport with reality.
With this in view, target applications of CWiPy include any that require the manipulation of perceptions: for example, summarizing stories or data (e.g., from social networks), processing natural queries in marketing (e.g., where an expert has an idea about potential customers [12]) or in e-commerce applications [11], project evaluation and selection, and human resources management [9]. Figure 6 illustrates the general usage of CWiPy in an industrial application. We implemented our methodology in the corresponding library and used it in two prototype applications as proof of concept:
– Intelligent retrieval of rental apartment data. The application helps experts perform basic data querying using natural language. The IDS here is a proposition (query) expressed in natural language together with a database, while the TDS is a subset of database records.
– Generation of linguistic database summaries. The application provides a reporting tool, allowing users to obtain a useful linguistic summary from the provided data. The IDS is a set of database records, the TDS a summary in natural language.
Fig. 6. General architecture of CWiPy industrial usage
By default, given X numerical attributes in a provided table, X linguistic variables are formed, each with three linguistic terms. The fuzzy sets can subsequently be redesigned in range (by drag-and-drop operations) and in membership function type (see Fig. 7). As seen in the figure, the fuzzy partition is preserved.
Fig. 7. Illustration of the “fuzzy sets generation” module for some linguistic variable
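As an illustration of how such a default partition could be derived for a numeric column, here is a sketch under an assumed scheme (equal-width overlapping trapezoids whose memberships sum to 1); the paper does not detail CWiPy's actual generation logic, so the parameterization is ours.

def trapezoid(a, b, c, d):
    # Returns a trapezoidal membership function over [a, d] with plateau [b, c].
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)
    return mu

def default_partition(values, labels=("low", "middle", "high")):
    # Three terms per numeric attribute; adjacent memberships sum to 1,
    # so the partition property visible in Fig. 7 is preserved.
    lo, hi = min(values), max(values)
    q = (hi - lo) / 4.0
    return {
        labels[0]: trapezoid(lo - 1, lo, lo + q, lo + 2 * q),
        labels[1]: trapezoid(lo + q, lo + 2 * q, lo + 2 * q, lo + 3 * q),
        labels[2]: trapezoid(lo + 2 * q, lo + 3 * q, hi, hi + 1),
    }

terms = default_partition([35, 48, 60, 72, 95])   # e.g. a price column
print({name: round(mu(60), 2) for name, mu in terms.items()})
# -> {'low': 0.33, 'middle': 0.67, 'high': 0.0}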
4.1 Intelligent Retrieval of Rental Apartments Data
The data handled by today's information systems are precise in nature. However, queries to a database made by humans often tend to be vague and carry some degree of fuzziness. Consider the query "find the housing proposal which is not very expensive and is close to downtown". The statements "not very expensive" and "close" are unspecific and imprecise, even though the rent price is completely determined and the apartment's distance from the centre is known to within a kilometre. CWiPy can easily be used to process natural queries in such scenarios. We chose rental apartment data as an example because it contains a sufficient number of numerical values, which can be described using linguistic values and modified by linguistic hedges. The data was retrieved and parsed from a popular Kazakhstani website of ads for the sale, purchase, and rental of apartments, as well as information on all types of commercial real estate.

Figure 8 illustrates the workflow of the first application, showing the steps of data retrieval. Once users provide a database file, they can access it by writing queries in natural language (IDS), for instance, "show apartments that are very cheap and not old". The query is then transformed into an SQL query using the CWiPy fuzzy library, so that the server can execute the request on the SQLite database. The retrieved data is sent back to the web application, which displays the resultant database records to the user (TDS).
Fig. 8. Intelligent retrieval of rental apartments data: the main workflow of the application. CWiPy is used to convert the natural query formed by the domain expert into a crisp SQL query.
Let us look at an example of a natural language query and how it is processed in CWiPy. Suppose we need to process the query "not very cheap and not very expensive more-or-less big apartments" with a threshold value of 0.7. Here we need the conjunction of two constraints on one fuzzy variable, price, plus a further fuzzy criterion: the area is more-or-less big. We obtain the following:

CHEAP[Price; not, very; μ_total = 0.7] ∩ EXPENSIVE[Price; not, very; μ_total = 0.7] ∩ BIG[Area; more-or-less; μ_total = 0.7] =
CHEAP[Price; very; μ_total = 0.3] ∩ EXPENSIVE[Price; very; μ_total = 0.3] ∩ BIG[Area; ; μ_total = 0.49] =
CHEAP[Price; ; μ_total ≈ 0.55] ∩ EXPENSIVE[Price; ; μ_total ≈ 0.55] ∩ BIG[Area; ; μ_total = 0.49]

In much the same way we can process even more complicated queries, such as "very very cheap apartments not far from center with the average or big area" or "not very cheap is untrusted, and not very expensive is not affordable".
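Under the hood, each hedged term together with a satisfaction threshold reduces to an alpha-cut (Eq. (3)), i.e. a crisp interval, which is what makes the translation to crisp SQL possible. The sketch below is our illustration of that last step, not CWiPy's actual query translator; the trapezoid parameters for big are assumed.

import math

def trapezoid_alpha_cut(a, b, c, d, alpha):
    # Crisp interval on which the trapezoid (a, b, c, d) is >= alpha.
    return a + alpha * (b - a), d - alpha * (d - c)

def hedged_threshold(hedges, threshold):
    # Invert intensifier hedges on the threshold instead of transforming the
    # fuzzy set: mu**2 >= t <=> mu >= sqrt(t) for 'very'; sqrt(mu) >= t <=>
    # mu >= t**2 for 'more-or-less'. ('not' is handled by the caller as SQL
    # NOT over the resulting condition, with threshold 1 - t.)
    for h in hedges:
        if h == "very":
            threshold = math.sqrt(threshold)
        elif h == "more-or-less":
            threshold = threshold ** 2
    return threshold

# "more-or-less big" with overall threshold 0.7, cf. BIG[Area; ; 0.49] above;
# the 'big' trapezoid (in square metres) is an assumed example.
alpha = hedged_threshold(["more-or-less"], 0.7)        # -> 0.49
lo, hi = trapezoid_alpha_cut(60, 90, 200, 210, alpha)
print(f"SELECT * FROM apartments WHERE area BETWEEN {lo:.0f} AND {hi:.0f}")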
Generation of Linguistic Database Summaries
In a modern world the amount of data generated each day is enormous in size and keeps increasing, but fortunately storing and processing that data is affordable as the cost of hardware gets cheaper and efficiency of software rises. Yet plain data is useless for business unless it provides some valuable knowledge [4]. This application solves the problem by deriving linguistic summaries (TDS) as linguistically quantified propositions from the provided database file (IDS) using CWiPy.
According to the linguistic summary approach proposed by Yager [13], we have:
– V, a quality (attribute) of interest, with numeric and non-numeric (e.g. linguistic) values, e.g. salary in a database of workers;
– Y = {y_1, ..., y_n}, a set of objects (records) that manifest quality V, e.g. the set of workers; hence V(y_i) is the value of quality V for object y_i, i = 1, ..., n;
– D = {V(y_1), ..., V(y_n)}, the collection of observations of the property V for the elements of Y (the "database").

A summary of the data set D then consists of:
– a summarizer S (e.g. young), a fuzzy set characterized by its membership function \mu_S(y), ∀y ∈ Y;
– a quantity in agreement Q (e.g. most), a fuzzy linguistic quantifier;
– a truth (validity) T (e.g. 0.7),
and may be exemplified by "T(most of employees are young) = 0.7". Zadeh's [20] fuzzy-logic-based calculus of linguistically quantified propositions is a relevant technique for calculating the truth (validity) of a summary, and it is fully supported by CWiPy. Consider first the case when Q is a relative quantifier. Following [20], the procedure for obtaining T is:

1. For each d_i ∈ D calculate S(d_i), the degree to which d_i satisfies the summarizer S.
2. Let r = (1/n) \sum_{i=1}^{n} S(d_i), the proportion of D which satisfies S.
3. Then T = Q(r), the grade of membership of r in the proposed quantity in agreement.

If the quantity in agreement Q is absolute, the procedure remains the same except that in step 2 we have r = \sum_{i=1}^{n} S(d_i), the total amount of satisfaction of S. The CWiPy library frees developers from understanding such details and allows them to concentrate on application logic.

As an example for our second application, let us consider a table provided by the world's biggest data science community¹ and generate summaries using CWiPy. The data set records the historical sales of a supermarket company, collected from 3 different branches over 3 months. The data structure is described in Table 2, where all bold attributes are fuzzy. As a result, we obtain summaries like those in Table 1. As can be seen, CWiPy makes it possible to use fuzzy terms in queries and summaries and serves as a powerful tool for gaining insight into relations within database data.
¹ Dataset "Supermarket sales. Historical record of sales data in 3 different supermarkets" (https://www.kaggle.com/aungpyaeap/supermarket-sales).
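Steps 1-3 of the procedure above can be sketched directly; the membership functions for young and most below are assumed examples for illustration (cf. Fig. 3), not the parameters used in the experiments.

def young(age):
    # Assumed summarizer S: full membership up to 25, fading to 0 at 35.
    if age <= 25:
        return 1.0
    if age >= 35:
        return 0.0
    return (35 - age) / 10.0

def most(r):
    # Assumed relative quantifier Q over proportions r in [0, 1].
    if r <= 0.5:
        return 0.0
    if r >= 0.8:
        return 1.0
    return (r - 0.5) / 0.3

def summary_truth(data, summarizer, quantifier):
    # Zadeh's calculus for "Q of D are S" with a relative quantifier:
    # r = (1/n) * sum S(d_i), then T = Q(r).
    r = sum(summarizer(d) for d in data) / len(data)
    return quantifier(r)

ages = [22, 24, 27, 31, 45, 52, 23, 29]
print(round(summary_truth(ages, young, most), 2))  # T("most ... are young") = 0.33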
Table 1. Linguistic summaries and their degree of validity

Summary | Truth value
Few of the "Electronic accessories" products have somewhat middle unit_price | 0.98
Many of the "Food and beverages" products have indeed low total (price) | 0.95
Almost none of the "Health and beauty" products bring extremely high gross_income | 0.88
Almost none of the records have very high gross_income | 0.77
Most of the products in the "Sports and travel" category were purchased in middle quantity | 0.70
About half of the purchases in the "Home and lifestyle" category were given a fairly low rating | 0.66
Many sales are with not low rating | 0.63

Table 2. Database structure (see footnote 1). All numeric attributes are fuzzy.

Attribute name | Description
Invoice ID | Computer-generated sales slip invoice ID
Branch | Branch of supercenter
City | Location of supercenters
Customer type | Type of customer (Member or Normal)
Gender | Gender of customer
Product line | General item categorization groups
Unit price | Price of each product in $
Quantity | Number of products purchased
Tax 5% | 5% tax fee for customer buying
Total | Total price including tax
Date | Date of purchase
Time | Purchase time
Payment | Payment method used by customer for purchase
Cogs | Cost of goods sold
Gross margin % | Gross margin percentage
Gross income | Gross income
Rating | Customer satisfaction rating
5 Conclusion
CW is a computational theory of perceptions; its basic idea is to use words instead of numbers for computing and reasoning. The implementation of this approach, however, is not straightforward and requires certain knowledge and a deep understanding of fuzzy sets theory. In this paper, we presented how the CW methodology can be applied in real-life industrial applications. By extending Zadeh's traditional CW we developed our own methodology and implemented a corresponding library, CWiPy. CWiPy allows us to deal effectively with imprecise information and to apply CW techniques easily, without any prior knowledge in this field. It is primarily oriented towards developers, because using fuzzy sets theory concepts in an application requires certain knowledge and can noticeably impede the process of system development. CWiPy serves as an adapter between CW and real-world industrial applications, hiding the complexity of the underlying fuzzy mechanisms and eliminating the difficulty of handling vague information; developers can add it to an existing system and use it as a black box (plug and play). CWiPy not only handles fuzzy variables, hedges, quantifiers and connectives, but also provides a wider set of linguistic modifiers, which makes the system's output capability more expressive, and exposes an API to handle fuzzy variables, sets, hedges, and quantifiers. We also presented two separate implementation scenarios as a proof of concept: natural query processing (a flexible querying interface) for rental apartment data, and database summarization. In general, CWiPy mechanisms can provide great help for experts in human-consistent decision support systems, and the presented approach can be generalized to other industrial areas.

Though CWiPy provides many useful features, it still lacks some functionality that can be implemented in future work. Specifically, we aim to implement modules that deal with fuzzy relations and the handling of deduction rules. Besides this, we plan to perform a system evaluation involving industrial experts. We are also working on making CWiPy available for download.
References

1. Baldwin, J.F., Martin, T.P., Vargas-Vera, M.: FRIL++ a language for object-oriented programming with uncertainty. In: Ralescu, A.L., Shanahan, J.G. (eds.) FLAI 1997. LNCS, vol. 1566, pp. 62–78. Springer, Heidelberg (1999). https://doi.org/10.1007/BFb0095071
2. Chandramohan, A., Rao, M.V.C.: Novel, useful, and effective definitions for fuzzy linguistic hedges. Discrete Dyn. Nat. Soc. 2006, 1–13 (2006)
3. Galindo, J., Medina, J.M., Cubero, J.C., Garcia, M.T.: Relaxing the universal quantifier of the division in fuzzy relational databases. Int. J. Intell. Syst. 16(6), 713–742 (2001)
4. Kacprzyk, J., Yager, R.R., Zadrozny, S.: Fuzzy linguistic summaries of databases for an efficient business data analysis and decision support. In: Abramowicz, W., Zurada, J. (eds.) Knowledge Discovery for Business Information Systems. The International Series in Engineering and Computer Science, vol. 600. Springer, Boston (2002). https://doi.org/10.1007/0-306-46991-X_6
5. Khorasani, E.S., Rahimi, S., Patel, P., Houle, D.: CWJess: implementation of an expert system shell for computing with words. In: 2011 Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 33–39 (2011)
6. Lietard, L., Rocacher, D.: Evaluation of quantified statements using gradual numbers. In: Handbook of Research on Fuzzy Information Processing in Databases (2008)
7. Liu, Y., Kerre, E.E.: An overview of fuzzy quantifiers. (I). Interpretations. Fuzzy Sets Syst. 95(1), 1–21 (1998)
8. Liu, Y., Kerre, E.E.: An overview of fuzzy quantifiers. (II). Reasoning and applications. Fuzzy Sets Syst. 95(2), 135–146 (1998)
9. Martinez, L., Ruan, D., Herrera, F.: Computing with words in decision support systems: an overview on models and applications. Int. J. Comput. Intell. Syst. 3(4), 382–395 (2010)
10. Orchard, R.: Fuzzy reasoning in JESS: the FuzzyJ toolkit and FuzzyJess, pp. 533–542 (2001)
11. Shamoi, P., Inoue, A., Kawanaka, H.: Fuzzy model for human color perception and its application in e-commerce. Int. J. Uncertain. Fuzz. Knowl.-Based Syst. 24, 47–70 (2016)
12. Shamoi, P., Inoue, A.: Computing with words for direct marketing support system. In: MAICS (2012)
13. Yager, R.R.: A new approach to the summarization of data. Inf. Sci. 28(1), 69–86 (1982)
14. Zadeh, L.A.: A fuzzy-set-theoretic interpretation of linguistic hedges. J. Cybern. 2(3), 4–34 (1972)
15. Zadeh, L.A.: Fuzzy logic = computing with words. IEEE Trans. Fuzzy Syst. 4(2), 103–111 (1996)
16. Zadeh, L.A.: From computing with numbers to computing with words - from manipulation of measurements to manipulation of perceptions. IEEE Trans. Circ. Syst. I Fundam. Theory Appl. 46(1), 105–119 (1999)
17. Zadeh, L.A.: Fuzzy sets. Inf. Control 8(3), 338–353 (1965)
18. Zadeh, L.A.: The concept of a linguistic variable and its application to approximate reasoning I. Inf. Sci. 8(3), 199–249 (1975)
19. Zadeh, L.A.: Outline of a new approach to the analysis of complex systems and decision processes. IEEE Trans. Syst. Man Cybern. SMC-3(1), 28–44 (1973)
20. Zadeh, L.A.: A computational approach to fuzzy quantifiers in natural languages. Comput. Math. Appl. 9(1), 149–184 (1983)
Improving Public Services Accessibility Through Natural Language Processing: Challenges, Opportunities and Obstacles

Ilaria Mariani1(B), Maryam Karimi2, Grazia Concilio2, Giuseppe Rizzo3, and Alberto Benincasa3

1 Department of Design, Politecnico di Milano, 20158 Milano, Italy
[email protected]
2 Department of Architecture and Urban Studies, Politecnico di Milano, 20133 Milano, Italy
{maryam.karimi,grazia.concilio}@polimi.it
3 LINKS Foundation, 10138 Torino, Italy
{giuseppe.rizzo,alberto.benincasa}@linksfoundation.com
Abstract. The adoption of AI in public sector processes and operations is already showing significant benefits, for example by improving the efficiency and effectiveness of service delivery. In this direction, the EU-funded easyRights project explores the application of NLU techniques to improve service accessibility, and in particular to extract from administrative documents an effective, step-wise description of the user experience. The project is devoted to easing the understanding of service procedures and improving the experience of service users. easyRights especially aims at easing access to public services for immigrants, thus targeting a special user category whose social fragility is further challenged by bureaucratic complexity. The present paper delineates the work done by applying NLU techniques to service-related administrative documents in four European cities. The first part describes the NLU system intended to play the role of a pathway generator and outlines the basic architecture of the pathway, composed of four key descriptors for each step, namely the "what", "when", "where", and "how". In the second part, the article discusses the initial outputs of the experiments, related to a total of eight services developed in four pilot projects, and discusses evidence of the key obstacles encountered. Finally, the paper critically reflects on the general value of the achievements and findings and opens up some crucial questions related to the lessons learnt on the side of the public authorities and their preparedness to fully exploit the potential benefits of adopting AI solutions.

Keywords: Natural language understanding · Public sector innovation · Service accessibility
1 Artificial Intelligence, Public Administrations and Natural Language Processing

Artificial intelligence (AI) is gaining momentum in the public sector. Nevertheless, government and Public Administration lag behind the rapid development of AI in their effort to provide adequate governance [1]. AI encompasses different technologies and approaches, with different degrees of maturity, for stimulating intelligent behaviours through technical systems, providing broad potential for application within Public Administrations. Among scientists, AI has been categorised into three levels: weak, or narrow, AIs, such as the ones widely used in everyday life; strong AIs, systems able to think, plan, learn and make logical decisions independently; and superintelligent AIs, which aim to emulate humans in every aspect of intelligence [2]. Narrow AIs are usually developed and employed for specific tasks such as expert systems, speech recognition, navigation systems, and translation services [3], which Public Administrations often take advantage of. The practical approach to this level of AI primarily includes the simulation of human intelligence and the replication of linguistic and logical-mathematical intelligence [4]. Adopting the remaining two categories encounters relevant resistance due to the complexity and uncertainty of real life. Relying on rule-based algorithms and AI technologies in Public Administrations would require producers to be controlled democratically and transparently, and verifiability should be at the top of any decision-making process [5].

The highly complex information-processing practices of Public Administration have recently found in AI a means to reduce complexity, with a broad range of beneficial opportunities in the public sector [6]. For instance, several countries, in particular the United States and China, have recognized the great value of AI for public usage and have launched various cost-intensive AI initiatives [7, 8]. However, implementing artificial intelligence and automation processes has been challenging for Public Administrations, where several limits in terms of technical transformation have persisted for quite some time. Some studies highlight the crucial role of early assessment and prioritisation of the procedures suitable for automation in the first place [4]. Others highlight the lack of sufficient awareness among governments and public authorities of the full range of AI application opportunities or of the related challenges [9].

While many governments and researchers struggle to formulate a long-term regulatory perspective on the interaction with the AI market in both public and private fields [10], and on the role that AI should have in a mature information society [11], Natural Language Processing (NLP) could make its way into the public sector. Within the AI landscape, the development of deep neural networks and NLP has made governments increasingly interested in exploiting AI technologies to enhance their performance at multiple levels. Indeed, after their successful implementation in the private sector, these technologies are now progressively being taken up by the public sector [9, 12]. Their adoption in public sector processes and operations is already showing significant benefits, improving, for example, the efficiency and effectiveness of service delivery [13] while favouring the automation of numerous activities [4]. In this framework, NLP
is increasingly applied to several processes, aiming to make communication between Public Administrations and citizens more transparent [14], improving the availability, accessibility, and clarity of information for every citizen, and avoiding and overcoming disparity.

Natural Language Understanding (NLU), a subtopic of NLP, is an interdisciplinary domain dealing with the comprehension of the structure and meaning of human language to allow better interaction between users and computers using natural sentences. Although natural languages are intrinsically intricate, the complexity of the problem domain has been addressed and managed through data-driven approaches [15, 16]. Among its major application fields, NLU has been applied to improve citizen-Public Administration communication, giving access to structured and non-structured information in a more friendly and intuitive manner [17]. Moreover, NLU is currently recognized as especially valuable for supporting and orienting the design and re-design of services and their delivery, sustaining policymaking mechanisms, and improving the possibilities and quality of citizen engagement [17]. However, when applied to improve the accessibility of public services, NLU must address some relevant challenges in order for AI-based systems to interact with the data sources in the possession of Public Administrations, which can differ largely in format, content, coding, language and register, and so on. Therefore, going beyond a purely technological standpoint, it is necessary to draft directions and requirements to mitigate or solve the challenges that NLU has to address when applied to complex administrative procedures related to service provision, and vice versa.

2 The easyRights Pathway Generator

2.1 Context of Application: The easyRights Project
2 The easyRights Pathway Generator 2.1 Context of Application: The EasyRights Project The continuous rise of migrant flows across Europe calls for larger attention on the role of digital information on Public Administration, migration management, and society as a whole. This situation challenges national and local governments, especially Public Administrations, at several institutional levels and in the different spheres of policymaking and service provision. Barriers to accessing information and exercising rights fuel discrimination in receiving equitable and fair services, contributing to social exclusion and widening the gap between citizens and migrants. Among many obstacles to migrants’ integration, one of the most critical ones refers to service accessibility [18] which due to two major problems migrants experience every time they enter in contact with public services: the complexity of the procedures to perform, and the language comprehension within the framework of the related administrative procedures and documents. Given this premise, language comprehension plays a significant role in increasing the difficulty in integration. Although various EU countries often offer introductory local language courses, administrative documents are formulated using a level of complexity that is hardly mastered by migrants that do not already know the language or at least have a good grounding. The access to the information contained in such documents is limited or even prevented, making language proficiency the main barrier to accessing information and services. In [19], the authors identify “language proficiency as the main barrier (96%) to accessing available information and services, as well as negatively
impacting service utilisation and overall settlement experience", hence an obstacle to integration which must undoubtedly be considered and tackled. Adding to the language barrier, the complexity of rules and administrative procedures makes it even more difficult for migrants to follow the paths required to access services and fulfil their rights; a complexity that further increases for those services having a nested structure (services within services). Services often consist of several steps and activities and require the provision of relevant documents [20], which may entail interacting with other services and institutional offices. This makes procedural complexity another significant obstacle for migrants in exercising their rights. Moreover, the language of administrative procedures is very often designed to guide the work of public servants in service provision rather than being user-oriented.

Within this landscape, it is worth acknowledging that the European Commission's interest in promoting the sound use of ICTs in the public sector has led to several funding streams, especially combining ICTs (including AI) with, on the one hand, co-creation practices involving relevant stakeholders as well as final users and, on the other, the co-development of joint solutions with other Public Administrations or authorities, sharing successful practices [21]. The easyRights project has been funded under one of the above-mentioned calls launched by the European Commission, expressly devoted to ICT solutions for migrants' integration. It uses AI technology to make it easier for migrants to access the services they are entitled to and exercise their rights. Acknowledging that public services play a crucial role in the integration of migrants, refugees, and asylum seekers, the project addresses the necessity of providing more accessible services that support, rather than hinder, such a complex and demanding process. It focuses on tools and processes for easing the bureaucracy of local services to increase autonomy and thereby support integration at different levels. It is essential to improve communication effectiveness, impacting migrants' understanding of their fundamental rights, as a prerequisite for their inclusion in the host society, and the possibility of accessing public services. Among the utmost challenges to be addressed is access to knowledge in an understandable and actionable manner; knowledge often enclosed in lengthy documents that use complicated language (with a heavy legal and bureaucratic tone), often without a simplified guide to the service. This motivated the creation of an AI-based tool that uses NLU to automatically generate easy instructions starting from a bulk of textual information describing a given service, such as the asylum request procedure. The application of NLU techniques is intended to improve service accessibility, and in particular to extract a step-wise description of the user experience from administrative documents, with the aim of easing the understanding of service procedures and improving the experience of service users. The easyRights project especially aims at easing access to public services for migrants, thus targeting a special user category whose social fragility is further challenged by bureaucratic complexity.
To generate easy instructions and lower the bureaucracy level for migrants, NLU has been applied to build a fully autonomous tool called the Pathway Generator, which exploits the latest AI advancements to translate verbose and lengthy documents describing processes, such as asylum seeking, into a set of actionable instructions in a format named the pathway.
A pathway is made up of a list of sequential steps, each describing which activities need to be implemented, and how, in order to accomplish the service-related procedure: in particular, what to do, when to do it, and how to attain the expected output [22]. To facilitate the retrieval of information, an AI-based system that automatically generates easy instructions for migrants starting from a bulk of textual information is considered a supportive tool. The language barrier constitutes the leading motivation behind the development of the easyRights NLU system. It is designed as an intelligent agent able to digest lengthy and verbose documents through an autonomous AI-driven process that converts administrative procedural documents into textual objects [23]. These are easier to understand and serve as actionable information in a format easily and intuitively accessible to migrants. Furthermore, the agent is intrinsically conceived for replication (scaling-out mechanism) and scalability (scaling-up mechanism) [24, 25]. The reasoning behind it is intended to potentially impact laws and policies, since it transcends the specificities and context-dependencies of the sites where, and for which, the solution has been developed. Within the scale of easyRights, the solution has already been tested and scaled across different services within the four pilot sites where the project activities took place: Birmingham in the UK, Larissa in Greece, Málaga in Spain, and Palermo in Italy. The easyRights NLU system has been tested in the pilots as enclosed and controlled areas for experimentation [26], supporting the provision of eight services:

• (1) the asylum request, and (2) the acquisition of the work permit in Málaga;
• (3) the birth certificate, and (4) the certification of nationality in Larissa;
• (5) the registration at the registry office, and (6) job seeking in Palermo;
• (7) the access to the Clean Air Zone, and (8) access to learning English as a second language through the BAES ESOL courses in Birmingham.
The easyRights NLU system has been developed to "speak" four languages: English, Italian, Spanish, and Greek. It is designed to provide outputs adhering to the language of the pilot site in which it is applied, translating verbose and lengthy documents describing bureaucratic procedures into a simpler, actionable set of instructions (the pathway) for fulfilling the services.

2.2 Natural Language Understanding Tasks for Administrative Procedural Documents

The work presented and discussed here considers four NLU tasks applied to administrative procedural documents: document conversion, document segmentation, document annotation, and text generation (generation of a textual synthesis). In the world of NLU, the input processed is text. However, this input is often supplied in other formats (such as docx or pdf), making it necessary to implement a document conversion task to obtain the input as raw text. For this purpose, textract was utilised, an open-source tool that extracts content from different formats, among them images, and generates a textual file as output. Then, an algorithm was developed to chunk the output file into sentences, to facilitate the task described in the following step.
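The conversion step can be reproduced in a few lines with the textract library named above (the file name is illustrative):

import textract

# doc2txt: convert an administrative document (pdf, docx, images, ...) to text.
raw = textract.process("asylum_procedure.pdf")   # returns bytes
text = raw.decode("utf-8")

# Chunk the output into one sentence per line, using the dot as a separator,
# as done before segmentation.
sentences = [s.strip() for s in text.split(".") if s.strip()]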
The concept of document segmentation is based on a fundamental principle, namely the contextual and semantic relationality of two subsequent sentences. This is relevant because it enables the segmentation of a very long text (such as those officially released by Public Administrations) into different sections: these sections can be interpreted as relating to different topics within the document, and they have been assumed to be segments of a procedure, sub-procedures called "steps" in the development of the intelligent tool described here. This segmentation produces several advantages. For instance, it makes the analysed procedure easier and more manageable, and it allows users to proceed step by step, without needing to know the entire procedure in advance (an advantage from a user experience point of view). Moreover, it can be used to structure a sequence of operations in the form of a checklist. In the literature on document segmentation, the task of finding two semantically similar sentences is called Semantic Textual Similarity [27]. In implementing the segmentation pipe, the inspiration came from the work of Nils Reimers, who presented a variant of BERT (a pre-trained transformer network; see [28]) capable of finding sentences that are semantically similar. In that work, two Siamese neural networks [29] were deployed to fine-tune BERT, and the produced sentence embeddings could be compared with a metric called cosine similarity.

Document annotation relies on a task that Grishman and Sundheim [30] defined as Named Entity Recognition (NER), also known as entity extraction or identification. NER is an information extraction technique that aims to locate and classify information units such as the names of a Person or an Organisation, a Location, a Brand, a Product, or a numerical expression including Time, Date, Money, and Percent found in a textual document. NER has been developed and studied with knowledge-based systems, unsupervised systems, and feature-engineered supervised or semi-supervised learning algorithms. In the latest years, the achievements obtained in NER, as in many other NLP sub-tasks, have increased considerably thanks to supervised learning techniques that use deep learning and take as input word vector representations such as embeddings. As of today, NER is a foundational step in all approaches that aim to extract relevant information with semantics from text. State-of-the-art approaches rely on Transformers [31], proposed for the first time by Vaswani at Google Brain in 2017 to completely replace the problematic recurrent mechanism of Long Short-Term Memory [32]. The original Transformer is an encoder-decoder architecture for sequence-to-sequence learning [33] based entirely on attention mechanisms [34] for neural machine translation. The concept of attention refers to focusing on a few relevant words in a sentence rather than on the entire sentence. The core component of a Transformer is the multi-head attention mechanism, which computes different kinds of attention over the input sentence. This technology has been applied to label the categories linked to each token of the text. In total, five categories were utilised, namely: Location, Organisation, Time, Document, Procedure. The technology has been trained on the WikiNER corpus [35].
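Returning to the segmentation step, the Sentence-BERT approach can be sketched with the sentence-transformers library by Reimers: consecutive sentences whose cosine similarity drops below a threshold open a new segment. The model name and the threshold below are illustrative choices, not the project's actual configuration.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative

def segment(sentences, threshold=0.45):
    # Split a document into steps where consecutive sentences stop being
    # semantically similar (Semantic Textual Similarity via cosine similarity).
    embeddings = model.encode(sentences, convert_to_tensor=True)
    segments, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = util.cos_sim(embeddings[i - 1], embeddings[i]).item()
        if similarity < threshold:   # topic shift: open a new step
            segments.append(current)
            current = []
        current.append(sentences[i])
    segments.append(current)
    return segments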
Text generation is still a recent task in NLU, and a number of approaches have been developed relying on neural language modelling [36]. The work addressed in easyRights for the generation of the pathway is similar to the task of slot filling [37], a derivation of text generation holding the same ultimate goal, i.e. generating text,
but filled into a predefined schema. In the easyRights project, the latter methodology was adopted, defining a hand-crafted set of rules (ten, mapping entity types into the slots what-where-when-how of the conceptual framework described below) that the model must follow during the generation process. This approach is effective as it automates operations that are usually done by a domain expert when highlighting the necessary steps to take to access a service.

2.3 The easyRights Conceptual Framework in the Pathway Generator

The Pathway Generator is built on the easyRights conceptual framework, which operationalises the Triple Loop Learning mechanisms in the NLU system. In the context of the project, Triple Loop Learning is meant to trigger the pilots' ecosystems to account for the problems addressed in their overall complexity, towards transformational learning [38]. Within this framework, the learning mechanisms are reflected in the Pathway Generator through the lenses through which the information is provided. In particular, they become the key descriptors of a multi-step process, where each step describes the actions to take place in terms of: the "what", associated with relevant information such as "where" to go, "when" to complete the step, and "how" to perform the step. The what-where-when-how conceptual framework and its foundational components, the four key descriptors, guide the NLU system in identifying the information necessary for autonomously deriving the pathway of a service. The Pathway Generator is an NLU system with three interfaces:

1. a backend dedicated to information providers, where experts on the services can revise and extend the system output, in a feedback loop meant to provide a clean and correct version of the pathway to migrants;
2. a frontend allowing municipalities and welcome organisations to adjust, complement, and validate the output generated by the Pathway Generator before it is published to migrants;
3. a frontend as the access point to the system for its target users, the migrants.

The NLU system autonomously elaborates the documents to extract and structure data into information according to the following elements, corresponding to the four foundational components of the framework, also presented in Fig. 1:

• What: specifies the topic as the key action of the captured step.
• When: encapsulates temporal information about the service. It presents two sub-categories of information: (1) the response time, an estimate of how long the step takes to complete; and, in the case of physical places, (2) the opening hours of the local authorities (reported in the "where"), specifying the access times of offices.
• Where: informs about where the step has to be performed. It presents two typologies of information: (1) if the step can be carried out online, it indicates the websites to access; (2) otherwise, it indicates the physical address of the office. In the case of services provided in a mixed form, online and in presence, these two pieces of information can be complementary and coexist within a pathway.
• How: reports on the modalities for completing the key actions described in the "what" section. It can have two sub-categories of information: (1) the procedure to follow (e.g., an interview, a form to be filled in, etc.); and (2) the documents that the service step may require to be formally presented (e.g., the passport or the residence permit).

Consistent with the inherent complexity of bureaucratic processes, some services require more operations (steps) to be completed, and such operations often depend on other requirements being fulfilled before the following step of the process can be accessed. As a consequence, the multiple steps of certain services necessitate multiple what-where-when-how blocks to describe the pathway. An excessive simplification reducing the information to a single block cannot be considered; it would be counterproductive with respect to the purpose of the Pathway Generator. Therefore, the system relies on the what-where-when-how conceptual framework as a collection of steps (blocks) that can occur multiple times to describe the operations required to complete the procedures of the service (Fig. 1). The sum of the steps for carrying out a service composes the pathway; consequently, only very simple services may generate a one-step pathway with a unique what-where-when-how block.
Fig. 1. The what-where-when-how block and its sub-categories compose each step for carrying out a service. The pathway generator autonomously extracts information from the documents and displays them in one or more blocks.
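In code, one step of a pathway can be represented by a structure mirroring Fig. 1. The field names follow the framework; the dataclass itself is our sketch of the output format, not the project's exact schema.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PathwayStep:
    what: str                              # key action of the step
    when: Dict[str, str] = field(
        default_factory=lambda: {"response_time": "", "opening_hours": ""})
    where: Dict[str, List[str]] = field(
        default_factory=lambda: {"websites": [], "addresses": []})
    how: Dict[str, List[str]] = field(
        default_factory=lambda: {"procedures": [], "documents": []})

# A pathway is an ordered collection of such blocks; very simple services
# yield a single-step pathway.
Pathway = List[PathwayStep]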
The figure below shows the pipeline of pathway generation, composed of four tasks leading to the generation of the pathway (Fig. 2).
Fig. 2. Four-task pipeline of the pathway generation starting from one or more documents.
Task 1: Document conversion (doc2txt in Fig. 2). The first task is dedicated to the conversion of documents into text.

1. doc2txt is used to convert documents (in a range of different formats) into the textual, structured format later used for NLP. The activity uses the textract library, which bundles further libraries for the conversion of each specific format. The supported document formats are: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx.
2. The resulting text is cleaned of URLs. URLs are replaced by an identifier based on an incremental counter, in order to avoid conflicts with the text processing model. The information is not lost but stored in a temporary file with the notation URL_number (Fig. 3), and is restored after the annotation process.
3. The text is cleaned of any irrelevant information: page numbers, repeated headings, tables of contents, empty sentences, or symbols resulting from previous operations.
4. The text is divided so as to obtain one complete sentence per line, using the dot as a separator. The obtained elements are processed by checking whether each element: a. is composed of a single word, b. starts with a symbol, or c. starts with a number. If the element falls into one of these cases, it is considered part of the previous sentence; otherwise, it corresponds to a new one. The resulting text is saved, ready to be analysed by the subsequent tasks.
Fig. 3. Replacement of URLs with symbols saved for restoration after the annotation process.
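As a rough sketch of Task 1, the snippet below combines textract-based conversion with the URL-replacement step described above. The regular expression, file names, and bookkeeping are our own simplifications, not the project's code.

# Minimal sketch of doc2txt: convert a document to text with textract,
# then replace each URL with a numbered placeholder (URL_number) so it
# does not interfere with the text processing model.
import json
import re
import textract

def doc2txt(path):
    text = textract.process(path).decode('utf-8')  # textract returns bytes
    urls = {}
    def replace(match, counter=[0]):
        counter[0] += 1
        key = 'URL_{}'.format(counter[0])
        urls[key] = match.group(0)       # keep the URL for later restoration
        return key
    text = re.sub(r'https?://\S+', replace, text)
    with open('urls.json', 'w') as f:    # temporary file mapping URL_number -> URL
        json.dump(urls, f)
    return text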
Task 2: Document segmentation. The converted document obtained with doc2txt is the input to the second task, in charge of segmentation. The text is divided into semantic sections defining the different steps of the procedure. The output of the operation is one or more sub-documents, each relating to a single step. Procedures composed of several steps are processed sequentially, maintaining their order. The logical separation helps the annotation model to better process the textual information, since it describes one specific task at a time.

Task 3: Document annotation. Through a process called Named Entity Recognition, the third task detects and classifies words into categories, named entities. The entities observed are (i) Organization, (ii) Person, (iii) Document, (iv) Procedure, (v) Location, (vi) Time. The output is a Python structure that carries information about the text, the entities found, their position in the text, and the category they belong to.

Task 4: Text generation (Step generation in Fig. 2). The last task generates the pathway. Taking the references to the annotated entities from the previous task as inputs, it outputs the structure defined in Fig. 1. To develop the easyRights pathways for the eight services, each pilot provided an overview of the processes underpinning the two selected services and identified the documents and information relevant for describing the service journey. As a result, the Pathway Generator is fed with a bulk of documents related to each of the eight services, which are processed to output a step-wise description of the service procedures (the pathway). The Pathway Generator has been tested and validated twice for each pilot: first during the redesign phase of the eight services, conducted as hackathon events, and second when it was included in the frontend, showing the pathway of the eight services to the final users. Figure 4 shows the workflow of the NLU system. The steps of the process are presented in sequential order, highlighting when an action is automatically computed.
Fig. 4. Flow chart of the NLU system
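Returning to Task 3, the snippet below illustrates the kind of Named Entity Recognition output described above, using spaCy purely as a stand-in: the paper does not name the annotation model, and the custom entity types (Document, Procedure, etc.) would require a model trained on that label set.

# Illustrative NER pass producing a Python structure with the text, the
# entities found, their position in the text, and their category.
import spacy

nlp = spacy.load('en_core_web_sm')  # assumed stand-in model
doc = nlp('Bring your passport to the registration office within 30 days.')

annotations = {
    'text': doc.text,
    'entities': [{'text': ent.text,
                  'start': ent.start_char,  # position in the text
                  'end': ent.end_char,
                  'label': ent.label_}      # entity category
                 for ent in doc.ents],
}
print(annotations)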
3 Early Testing and Results

3.1 Overview of the Underpinning Process

Each time the documents collected by the pilots were processed by the Pathway Generator, it returned the main steps and procedures defined by the public authority to access and use the service. Following the four-task pipeline described in Sect. 2.3, the Pathway Generator, which is at a prototypal stage, autonomously produces one or more steps shaped as a preliminary what-where-when-how form (Fig. 5). The resulting document is then given to administrative experts in the domain of the services for early validation of the document segmentation [26]. The experts, in charge of supervising the results, are asked to verify and, if needed, complement the generated output with information contained in the original documents that was not identified by the Pathway Generator. This process also elicits what is missing both from the Pathway Generator output and from the documents themselves, stimulating the experts to add such information to the final output.

This process verified the correctness of the number of blocks identified and of the labels assigned to the descriptor "what". A second iteration concerned the completeness and correctness of the identified blocks in terms of steps and their contents. This phase showed that some steps had not been identified and were not recognised as "what". Likewise, the experts supervised the when-where-how slots of each step. From a qualitative standpoint, the integration was performed to add information – for instance, the location where a service step is performed, missing from some of the documents provided by the pilots. These changes were necessary mainly in two cases. In the first case, the information needed by the pathway was not present in the selected documentation, so the entity was added manually. In the second case, the information was present but not correctly tagged; here, the change points to a problem in the Pathway Generator, which required adding the correct entity before the generation phase so that, with the same input documents, the next generation iteration would output the correct entities. Meanwhile, it was observed that the pathway coherently reported the key information present in the original documents. Altogether, the output of the Pathway Generator and the intervention of the experts as editors led to the results reported in the following.

During the hackathons held in the four pilots, the pathways generated and supervised by experts were made available through API calls. In these events, numerous API calls were made to the service, with the resulting pathway output displayed or used in distinct ways by the different solutions developed. Qualitatively, this is additional evidence that the output of the Pathway Generator was not just functional but beneficial. It provided accessible and understandable pathways – Fig. 5 shows the pathways of the two Malaga services – interweaving information derived from multiple sources with multi-language translation, while giving access to knowledge often hidden behind administrative and legal jargon. In doing so, the NLU system demonstrates high potential for further and broader scalability and transferability.
However, the application of the NLU system to the eight bulks of documents provided by the pilots made evident a persistent and significant issue: the typology of the documents, how they are written, and their involute contents and structure hamper the potential and effectiveness of the AI-based solution, limiting the benefits for the public sector.
Fig. 5. Pathways generated for the Malaga pilot. On the right, the procedure for the asylum request; on the left, the process to obtain the work permit.
This issue features a high level of complexity, which is explored and unpacked in the following paragraphs. While acknowledging that the concurrence of these factors limits the potential of applying NLU systems to support the public sector, it is still fundamental to recognise that a change in the production of documents may broadly and positively impact their applicability.

3.2 Obstacles and Barriers

The manner in which administrative procedural documents are written, structured, and made available provides several points for reflection in relation to the outputs produced by the easyRights NLU system and to the obstacles such documents create. The general premise is that there is no single or typical administrative document.

There are normative documents written in heavy legal jargon. They are rich in text describing the normative and legal landscape (the normative conditions at the basis of the administrative act they represent) while providing no real content on the procedure the document is meant to activate; the text targeting the procedure may be very short (even negligible relative to the length of the entire document), generic (rich in basic principles and strategic statements), and provide no, or very poor, description of or reference to the procedure.

There are administrative and organisational documents with an operational value: they set the service provision environment and identify the responsible persons in the offices and the tasks they have to perform to guarantee effectiveness on the service supply side. They may describe the service supply chain in terms of the flow of information and documents that infrastructure the procedure, and they can be read in terms of the subsequent steps the procedure is made of. However, they do not provide information useful to users, as they mainly target the procedure from the point of view of the back office.

There is a diversified set of supportive documents, namely templates (possibly distributed to collect information and data from the users) or draft declarations to be filled in, signed, and returned. These documents are relevant for one step of the target procedure. They may possibly have a larger value across the
entire procedure. Nevertheless, they provide no insight into the entire service experience, and they are quite misleading as a description of the complete pathway.

Finally, but very rarely, there are service guidelines for the users. These are the best documents for NLU systems to process: they are written in simple language, they usually adopt a stepwise structure, and they are very targeted (no irrelevant information is provided). Generally, they are not administrative products; they are rather produced by external organisations and are very hard to find (they are rarely produced). They may contain some imprecise information.

Not all the documents described above are available for each service. In some cases (see the registration service in Palermo) only supportive documents are available, whereas normative or legal documents target fundamental rights and supply no details on the procedures or on the specificities of the local provision of the service itself. The situation is characterised, on the one side, by the lack of documents, both procedural and general, and on the other, when documents are available, by their inherently complicated nature due to the frequent presence of different linguistic codes. Beyond an imbalanced coexistence of normative and legal information intertwined with the substance of procedures, such documents often display specialised language, syntactic ambiguities, unclear actors, acronyms, and grammatical errors. To summarise, the key obstacles embedded in the nature of the documents are: excessive bureaucratic and legal jargon, a prevailing administrative target, and a lack of procedural contents.

The scarcity of documents and the inadequacy of the collected ones made testing the NLU system a very complex and challenging task. Yet, the generated pathway proved to be an effective device to improve service accessibility: users familiarise themselves with it easily and gain a rapid understanding of the entire service process. Still, the unavailability of documents with a clear descriptive intent, the variable complexity characterising each service step, and the variable granularity of the steps themselves make the adoption of NLU systems for the generation of service pathways a "still far in the future" solution.
4 Conclusions and Open Questions: Challenges for Public Administrations

4.1 Public Administration's Readiness to Exploit AI Potentials

The two-year-long experimentation conducted within the framework of the easyRights project confirms that the readiness of Public Administrations to embed AI potentials is a crucial factor. They need to make sure they include space and opportunities for flexibility and experimentation, in order to encourage fast learning within the public sector. Moreover, according to Corvalán [39], while it is not easy for bureaucracies to adapt to new AI-based technologies and push towards digital administration, the key lies in a reconfiguration of Public Administration around the concept of inclusive innovation and the promotion of new technologies from the perspective of people and their rights. At different scales, it also becomes crucial to engage with pressing issues, considering how AI can be smartly, wisely, and trustworthily included in the identification of solutions
[1], also considering that it might reshape priorities and orient future commitments. Such a mindset is a precondition to foster innovation and the diffusion of interdisciplinary knowledge and know-how, while creating a favourable space to go beyond merely learning new approaches and how to handle them. This reasoning, however, has to be situated in a broader landscape in which many Public Administrations suffer from a lack of ICT skills and digital literacy [40, 41] as well as from underdeveloped infrastructure for digital public services [42]. Although current trajectories and investments are trying to mend this condition, it remains rooted, feeding the risk of low acceptance of digital transformation.

Indeed, the application of AI technologies undoubtedly requires rethinking the current notion of how to design products and services, but also interactions and experiences – a revision that also affects praxes and procedures within the public sector. Beyond the technical standpoint, which can be summarised as the need to build internal know-how and capacity, there is also the necessity to develop resources that are inherently AI-oriented. As a consequence, in parallel with the reflection on developing critical approaches for designing and implementing fair, trustworthy, and accountable AI-based solutions, the need also surfaces to discuss how to produce and design administrative documents that are more amenable to AI-based understanding. To support this and similar innovative solutions, several steps are needed: from the reconsideration of structures and processes to the review of workflows and rules, guided by a clear idea of the specific requirements needed to make AI-based solutions applicable in the first place, and more effective in making public service provision more efficient. High-level benefits can derive especially from developing innovative solutions that work on the concepts of interrelatedness and interdependence among different services and across service domains.

Finally, beyond bringing benefits in terms of effectiveness and efficiency and enabling new forms of public service delivery, a better and more strategic inclusion of AI in the public sector can contribute to generating relevant data for sustaining evidence-based or data-informed policymaking. To follow this direction, a transformation in and of the public sector [43] is required, subverting existing, rooted paradigms. In addition, the progress of AI in the re-development and provision of public services requires significant and increased attention to the needs of the end users. Such a direction would also contribute to addressing a further, fundamental challenge to AI adoption in public administration services. At the current state of the art of service provision, information and service literacy still depends upon migrants' ability to understand and communicate effectively [19]. This demonstrates the public sector's current inability to appropriately and responsively consider and incorporate the needs of a vulnerable target group, whose presence is likely to increase over time. The current situation contributes to the growth of inequalities and social exclusion by limiting or even precluding migrants' access to certain services – a situation that could benefit from the intervention of NLP and NLU technologies supporting, among other things, processes of information simplification.
4.2 Requirements for Administrative Procedural Documents

The Pathway Generator and its process of development and application, as an experiment conducted within the easyRights project, led us to recognise the presence and persistence of multi-level challenges, paving the way for a wider discourse on what needs to be revised in the public sector in order to better exploit the potential of available technologies. The challenges lie at the levels of institutional capacity and organisational learning, and require a shift of mindset. To effectively apply and exploit the potential of AI technology – NLP and NLU in particular – to extract information from textual documents and derive structured pathways, documents should be developed from a perspective different from the traditional one.

Throughout the three years of the project, a strong and shared praxis emerged of relying on public servants to convey information about procedures. Rather than producing procedural documents, the information on the service pathway relies on people, their individual knowledge of the process, and their verbal communication. As a result, there is a general lack of documents and documentation of the procedures that undermines the effectiveness of the NLU system in contributing to the production of pathways. In easyRights, the lack of documents was partly addressed by relying on guidelines developed to train and orient civil servants and Public Administration personnel on the procedures. These documents, developed to clearly explain procedures to a target audience within the public sector, are characterised by a low level of complexity that favoured successful processing through the NLU system.

Since the production of documents is per se a recommendation, a further input for Public Administrations is to design and develop such documents considering the requirements of technology-oriented protocols and processes. Moreover, such documents should also be developed with a clear distinction between potential users, and therefore respond differently to specific requirements. This is not only necessary to ensure the provision of services to migrants in a more responsive and integrated way; it also reflects a broader condition for procedures that are easy to understand by all entitled citizens. The need is to review and adopt appropriate procedures to better target the production of administrative documentation, in order to meet the requirements that would improve AI-based processing and understanding. To foster innovations that facilitate public access to public services, it is essential to activate a multi-level dialogue and collaboration among administrative actors. In order to systematically address the identified barriers to service access, it is crucial to pay particular attention to complex service-related procedures at the organisational, service provision, and policymaking levels.

Funding. The work presented in this document was funded through the easyRights project. This project has received funding from the European Union's Horizon 2020 Programme under Grant Agreement No. 870980. However, the opinions expressed herewith are solely those of the authors and do not necessarily reflect the point of view of any EU institution.
Conflicts of Interest. The authors declare no conflict of interest.
References

1. Wirtz, B.W., Weyerer, J.C., Sturm, B.J.: The dark sides of artificial intelligence: an integrated AI governance framework for public administration. Int. J. Public Adm. 43, 818–829 (2020). https://doi.org/10.1080/01900692.2020.1749851
2. Wang, W., Siau, K.: Artificial intelligence, machine learning, automation, robotics, future of work and future of humanity: a review and research agenda. J. Database Manag. (JDM) 30, 61–79 (2019). https://doi.org/10.4018/JDM.2019010104
3. Mainzer, K.: Künstliche Intelligenz – wann übernehmen die Maschinen? Springer, Berlin, Heidelberg (2016)
4. Etscheid, J.: Artificial intelligence in public administration. In: Lindgren, I., et al. (eds.) EGOV 2019. LNCS, vol. 11685, pp. 248–261. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27325-5_19
5. Holzinger, A.: Explainable AI (ex-AI). Informatik-Spektrum 41(2), 138–143 (2018). https://doi.org/10.1007/s00287-018-1102-5
6. Boyd, M., Wilson, N.: Rapid developments in artificial intelligence: how might the New Zealand government respond? Policy Q. 13, 36–43 (2017). https://doi.org/10.26686/pq.v13i4.4619
7. Knight, W.: China's AI awakening: the west shouldn't fear China's artificial-intelligence revolution. It should copy it. MIT Technol. Rev. 120, 66–72 (2017)
8. Knight, W.: The dark secret at the heart of AI. MIT Technol. Rev. (2017). https://www.technologyreview.com/2017/04/11/5113/the-dark-secret-at-the-heart-of-ai/
9. Wirtz, B.W., Weyerer, J.C., Geyer, C.: Artificial intelligence and the public sector—applications and challenges. Int. J. Public Adm. 42, 596–615 (2019). https://doi.org/10.1080/01900692.2018.1498103
10. Cath, C., Wachter, S., Mittelstadt, B., Taddeo, M., Floridi, L.: Artificial Intelligence and the 'Good Society': the US, EU, and UK approach. Sci. Eng. Ethics 24(2), 505–528 (2017). https://doi.org/10.1007/s11948-017-9901-7
11. Floridi, L.: Mature information societies—a matter of expectations. Philos. Technol. 29(1), 1–4 (2016). https://doi.org/10.1007/s13347-016-0214-6
12. Desouza, K.C., Dawson, G.S., Chenok, D.: Designing, developing, and deploying artificial intelligence systems: lessons from and for the public sector. Bus. Horiz. 63, 205–213 (2020). https://doi.org/10.1016/j.bushor.2019.11.004
13. de Sousa, W.G., de Melo, E.R.P., Bermejo, P.H.D.S., Farias, R.A.S., Gomes, A.O.: How and where is artificial intelligence in the public sector going? A literature review and research agenda. Gov. Inf. Q. 36, 101392 (2019). https://doi.org/10.1016/j.giq.2019.07.004
14. Magnini, B., Not, E., Stock, O., Strapparava, C.: Natural language processing for transparent communication between public administration and citizens. Artif. Intell. Law 8, 1–34 (2000). https://doi.org/10.1023/A:1008394902165
15. Ferrari, A., Dell'Orletta, F., Esuli, A., Gervasi, V., Gnesi, S.: Natural language requirements processing: a 4D vision. IEEE Softw. 34, 28–35 (2017). https://doi.org/10.1109/MS.2017.4121207
16. Gudivada, V.N., Arbabifard, K.: Chapter 3 – open-source libraries, application frameworks, and workflow systems for NLP. In: Gudivada, V.N., Rao, C.R. (eds.) Computational Analysis and Understanding of Natural Languages: Principles, Methods and Applications, pp. 31–50. Elsevier (2018)
17. Carenini, M., Whyte, A., Bertorello, L., Vanocchi, M.: Improving communication in e-democracy using natural language processing. IEEE Intell. Syst. 22, 20–27 (2007). https://doi.org/10.1109/MIS.2007.11
18. Concilio, G., Costa, G., Karimi, M., del Vitaller Olmo, M., Kehagia, O.: Co-designing with migrants' easier access to public services: a technological perspective. Soc. Sci. 11(2), 54 (2022). https://doi.org/10.3390/socsci11020054
19. Abood, J., Woodward, K., Polonsky, M., Green, J., Tadjoeddin, Z., Renzaho, A.: Understanding immigrant settlement services literacy in the context of settlement service utilisation, settlement outcomes and wellbeing among new migrants: a mixed methods systematic review. Wellbeing, Space Soc. 2, 100057 (2021). https://doi.org/10.1016/j.wss.2021.100057
20. Ponce, J.: Good administration and administrative procedures. Indiana J. Global Legal Stud. 12, 551–588 (2005)
21. Akhgar, B., Hough, K.L., Samad, Y.A., Bayerl, P.S., Karakostas, A. (eds.): Information and Communications Technology in Support of Migration. Springer, Cham, Switzerland (2022)
22. easyRights: NLU System (2021)
23. Agarwal, S., Atreja, S., Agarwal, V.: Extracting procedural knowledge from technical documents. arXiv preprint arXiv:2010.10156, pp. 1–7 (2020)
24. Omann, I., Kammerlander, M., Jäger, J., Bisaro, A., Tàbara, J.D.: Assessing opportunities for scaling out, up and deep of win-win solutions for a sustainable world. Clim. Change 160(4), 753–767 (2019). https://doi.org/10.1007/s10584-019-02503-9
25. Riddell, D., Moore, M.-L.: Scaling out, scaling up, scaling deep. McConnell Foundation. JW McConnell Family Foundation & Tamarack Institute (2015)
26. easyRights: Pilot Agendas (2021)
27. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14. Association for Computational Linguistics, Vancouver, Canada (2017). https://doi.org/10.18653/v1/S17-2001
28. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, pp. 1–16 (2019)
29. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, pp. 1–11 (2019)
30. Grishman, R., Sundheim, B.: Message Understanding Conference-6: a brief history. In: COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics (1996)
31. Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762, pp. 1–15 (2017)
32. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
33. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 2, pp. 3104–3112. MIT Press, Cambridge, MA, USA (2014)
34. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, pp. 1–15 (2016)
35. Ghaddar, A., Langlais, P.: WiNER: a Wikipedia annotated corpus for named entity recognition. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 413–422. Asian Federation of Natural Language Processing, Taipei, Taiwan (2017)
36. Rizzo, G., Van, T.H.M.: Adversarial text generation with context adapted global knowledge and a self-attentive discriminator. Inf. Process. Manage. 57, 102217 (2020). https://doi.org/10.1016/j.ipm.2020.102217
37. Bin, Y., Ding, Y., Peng, B., Peng, L., Yang, Y., Chua, T.-S.: Entity slot filling for visual captioning. IEEE Trans. Circuits Syst. Video Technol. 32, 52–62 (2022). https://doi.org/10.1109/TCSVT.2021.3063297
38. Bateson, G.: Steps to an Ecology of Mind. University of Chicago Press, Chicago, IL (1972)
39. Corvalán, J.G.: Digital and intelligent public administration: transformations in the era of artificial intelligence. A&C – Revista de Direito Administrativo & Constitucional 18, 55–87 (2018). https://doi.org/10.21056/aec.v18i71.857
40. D'Ambrosio, I.: The digital culture within enterprises and public administration: legal aspects and repercussions on the country's socioeconomic fabric. In: Comite, U. (ed.) Public Management and Administration. InTech (2018). https://doi.org/10.5772/intechopen.77606
41. Datta, P., Walker, L., Amarilli, F.: Digital transformation: learning from Italy's public administration. J. Inform. Technol. Teach. Cases 10, 54–71 (2020). https://doi.org/10.1177/2043886920910437
42. Lincaru, C., Pîrciog, S., Grigorescu, A., Tudose, G.: Low-Low (LL) high human capital clusters in public administration employment – predictor for digital infrastructure public investment priority – Romania case study. Entrepreneurship Sustain. Issues 6, 729 (2018). https://doi.org/10.9770/jesi.2018.6.2(18)
43. Berryhill, J., et al.: Hello, world: artificial intelligence and its use in the public sector. OECD Working Papers on Public Governance, vol. 36. OECD Publishing, Paris (2019). https://doi.org/10.1787/726fd39d-en
Neural Machine Translation for Aymara to Spanish

Honorio Apaza Alanoca1(B), Brisayda Aruhuanca Chahuares2, Kewin Aroquipa Caceres3, and Josimar Chire Saire4

1 Universidad Nacional de San Agustin, Universidad Nacional de Moquegua, Arequipa, Peru
[email protected], [email protected]
2 Universidad Cayetano Heredia, Lima, Peru
[email protected]
3 Universidad Nacional Mayor de San Marcos, Lima, Peru
[email protected]
4 Universidade Sao Paulo, Sao Carlos, Brazil
[email protected]
Abstract. There are many native languages in Latin America; over the decades, the number of speakers has been reduced by the strong influence of the Spanish language. There is a continuous concern for the preservation of languages such as Aymara, Quechua, and Guaraní. To create Neural Machine Translation (NMT) models, no data set of translations from the native Aymara language to Spanish exists. Therefore, this paper presents a data set of conversations in the native Aymara language with their respective translations into Spanish. The first translation tests with a seq2seq model are also carried out. The initial results are promising, considering that this is the first application of Natural Language Processing (NLP) and machine translation for the native Aymara language.
Keywords: Neural machine translator · Natural language processing · Native language · Aymara · Spanish

1 Introduction
Many pre-Hispanic cultures were present in Latin America [2], the most relevant being the Mayas, Aztecs, and Incas, located in North, Central, and South America respectively. After colonization, culture and language changed, following the Spanish language and religion. Many centuries later, the preservation of these languages has become an important concern to keep this variety and richness of culture and language. The Inca empire was a union of many cultures, each with its own language, i.e. Quechua and Aymara. On the other hand, the Machine Translation area focuses on translating a sentence from one language to another, through two different approaches [4]: statistical and neural based. Many projects
were conducted for Machine Translation of different languages around the world, e.g. Papua New Guinea (15 languages) [3] and Quechua [1]; one limitation is the scarce availability of data sets. After searching the literature on the Aymara language, there is no evidence of Machine Translation work for this specific language, and no structured data set is available to conduct this kind of study. For these reasons, this paper has two main objectives: to gather, process, and structure data to build an Aymara data set, and to explore Machine Translation between the Aymara and Spanish languages.
2 Native Language Aymara
The Aymara language is traditionally spoken in the departments of Puno, Moquegua and Tacna, although as a result of migration, large Aymara-speaking groups now also live in Lima, Arequipa and Madre de Dios. It belongs to the Aru linguistic family. The Aymara language is also spoken in Bolivia and northern Argentina and Chile. In the language itself, the correct writing is Aymara.

2.1 Population that Has the Language as Their Mother Tongue
According to the 2017 National Censuses of the National Institute of Statistics and Informatics [6], 450,010 people learned to speak in the Aymara language, and according to the Ministry of Education (2018), the Aymara language is in a critical situation.

2.2 Aymara Writing
The Aymara language has an official alphabet established by Ministerial Resolution No. 1218-85-ED [7], of November 18, 1985, with 32 spellings (a, ä, ch, chh, ch', i, ï, j, k, kh, k', l, ll, m, n, ñ, p, ph, p', q, qh, q', r, s, t, th, t', u, ü, w, x, y).

2.3 Registered Interpreters and Translators
Currently, within the framework of the implementation of Law No. 29735 (Law of Languages), the Ministry of Culture has registered twenty-nine (29) interpreters and/or translators.
3 The Aymara Sentence and Its Elements
A sentence (amayu) is made up of a word or set of words expressing a complete grammatical meaning [8]; a sentence may even be made up of a single word. Sentences are composed of syntactic units, called phrases, and the verb. In its written representation, a sentence begins with a capital letter and ends with a period [8].
Sentence one: Maya ayllunxa qatiqati jamach'ixa sapa arumawa jalnaqatayna.
[Maya ayllun] (F. Locative) [qatiqati jamach'i] (Subject) [sapa aruma] (F. Ubic. Temp.) [jalnaqatayna] (Verb)

Sentence two: Ukatha maya arumaxa akathjamatha qatiqatixa jach'a quqaruwa achuntayasitayna.
[Ukatha] (Connector) [maya aruma] (F. Ubic. Temp.) [akathjamatha] (F. Adv.) [qatiqati] (Subject) [jach'a quqaru] (Compl.) [achuntayasitayna] (Verb)

Sentence three: Qatiqatixa jani jaltaña yatisaxa qatiqirïtaynawa.
[Qatiqati] (Subject) [jani jaltaña yatisa] (Claús. Subord.) [qatiqirïtaynaw] (Verb)
4 Methodology
To achieve the objectives of this research, the following steps have been proposed (see Fig. 1), summarised in three main stages: the first consists of the collection and construction of Aymara texts translated into Spanish; the second consists of NLP pre-processing of the data, understood as the cleaning of characters that are not ASCII, the homogenisation of characters, and the formatting of the input to the model; finally, the third stage develops neural machine translation for the Peruvian native language Aymara to Spanish.
Fig. 1. Pipeline of the research: collection of translated documents (Aymara/Spanish) → preprocessing → machine translation modeling.
4.1 Data Collection
A data set of translations of the native Aymara language into Spanish, suitable for artificial intelligence research, was not found in the scientific community nor in public artificial intelligence forums. The idea of the research was presented at various academic events, and texts written in Aymara – such as books, stories, government materials, the Bible, and religious hymnals – were requested from the communities that promote the revaluation of the Aymara culture and the native language. Finally, the material AYMARA ARUSKIPAWINAKA (Conversations in Aymara) was found. This text contains 1914 conversations written in Aymara and accompanied by their interpretation in the Spanish language; each is developed or written in a different everyday situation, allowing the reader to understand and learn the Aymara language not only in terms of its writing but also in terms of its meaning and interpretation. The reader is invited to study this bibliographic material carefully, as it is very helpful for learning this wonderful language [9]. An example of the structure is shown in Table 1.

Table 1. Example of the data set structure of the Spanish–Aymara translations.

Aymara | Spanish
Aski urukïpan kullaka | Buen día hermana
Aski urukïpanay kullaka | Buen día hermana
Kamisaki? | ¿Cómo estás?
Waliki | Bien
Juman sutimax kunasa? | ¿Cuál es tu nombre?
Nayan sutijax Yanina Gisella Olarte Tarifawa | Mi nombre es Yanina Gisella Olarte Tarifa
Jumansti? | ¿Y de ti?
Nayan sutijax María Alejandra Mamani Corderuwa | Mi nombre es María Alejandra Mamani Corderu
Jumax qawqha maranitasa? | ¿Tú cuántos años tienes?
Nayax pä tunka pusini maranitwa | Yo tengo 24 años
The data set has been built by extracting texts from the book [9]; in total, 1914 translations of conversations were obtained. In the book, the text is written in the original Aymara language and translated into Spanish. The data set will be available at https://github.com/Honorio-apz/AYMARA_ARUSKIPAWINAKA.git.

4.2 Preprocessing
The pre-processing of the data (texts written in Aymara and Spanish) has been carried out manually. In principle, the data were collected in an Excel file; for this first experiment, the text has been taken from the aforementioned book [9]. The translated texts have been extracted into a two-column Excel file: the first column contains the list of texts in Aymara and the second column the list of texts translated into Spanish. Later, with the Python programming language, the input format for decoding and the rest of the training process was prepared. Input texts are multilingual texts with limited vocabulary, so it is important for the model to standardize the input text. The first step is Unicode normalization, which breaks up the accented characters and replaces the compatibility characters with their ASCII equivalents; for this purpose we use the tensorflow-text package, which contains a Unicode normalization operation. Afterwards, we vectorize the texts using the preprocessing.TextVectorization() function of TensorFlow, which creates a vocabulary and converts the texts into token sequences for training input.
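As a minimal sketch of this preprocessing, assuming the corpus sits in a two-column Excel file (the file name, column name, and vocabulary size are illustrative assumptions):

# Unicode normalization and vectorization of the Aymara texts.
import pandas as pd
import tensorflow as tf
import tensorflow_text as tf_text

def normalize(text):
    # NFKD normalization breaks up accented characters, as described above
    text = tf_text.normalize_utf8(text, 'NFKD')
    return tf.strings.lower(text)

df = pd.read_excel('aymara_spanish.xlsx')                # hypothetical file name
aymara = tf.constant(df['aymara'].astype(str).tolist())  # assumed column name

vectorizer = tf.keras.layers.TextVectorization(
    standardize=normalize,
    max_tokens=5000)         # illustrative vocabulary size
vectorizer.adapt(aymara)     # build the vocabulary from the corpus
tokens = vectorizer(aymara)  # integer token sequences for training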
4.3 Modeling
– Training Architecture. This research trains a sequence-to-sequence (seq2seq) model for Aymara-to-Spanish translation based on Effective Approaches to Attention-based Neural Machine Translation [5]. Figure 2 shows the training architecture in two stages: the first corresponds to the encoding and the second to the decoding. The encoder stage takes as input each character of the Aymara writing; these characters are converted into a hidden representative sequence, which then passes through the function F, and the final encoder result is an encoded vector used in the next stage. The decoder stage takes the encoded vector as input; using this vector, the neural network is trained (via teacher forcing) to predict the Spanish characters, since the next inputs are the Spanish writing characters. The final result of this stage is the targets conditioned by the input sequence. However, instead of taking characters as input to the neural network, the present model takes whole words.
Fig. 2. Sequence to sequence model architecture for training.
– Testing Architecture. Like the training architecture, the testing architecture consists of two stages, as seen in Fig. 3. The encoder functions as an encoding layer for each of the Aymara characters. The decoder is the previously trained neural network: it takes the encoded sequence as input to produce the output, and in each subsequent step the predicted character and the encoder state are taken as input, finally producing a translation from Aymara to Spanish. Again, instead of taking characters as input to the neural network, the present model takes whole words.
Fig. 3. Sequence to sequence model architecture for testing.
4.4 Results
The encoder grabs a list of token IDs from the text input process, finds an embedding vector for each token using the embedding layers, and processes the embeddings into a new sequence using GRU (gated recurrent unit) layers. It returns the processed sequence, which will be passed to the attention mechanism, and the internal state, which will be used to initialize the decoder. The characteristics of the input texts and encoder are presented below:

Input batch, shape (batch): (64,)
Input batch tokens, shape (batch, s): (64, 5)
Encoder output, shape (batch, s, units): (64, 5, 1024)
Encoder state, shape (batch, units): (64, 1024)

Listing 1.1. Python code.
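A minimal encoder sketch consistent with these shapes (1024 GRU units; the embedding size is an illustrative assumption, following the general structure of the TensorFlow tutorial [10]):

import tensorflow as tf

class Encoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim=256, units=1024):
        super().__init__()
        # embedding layer: token IDs -> embedding vectors
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # GRU returns both the full sequence and the final state
        self.gru = tf.keras.layers.GRU(units,
                                       return_sequences=True,
                                       return_state=True)

    def call(self, tokens):
        vectors = self.embedding(tokens)   # (batch, s, embedding_dim)
        output, state = self.gru(vectors)  # (batch, s, units), (batch, units)
        return output, state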
The decoder is in charge of generating predictions for the next output token. The decoder receives as input the full output of the encoder; it uses a recurrent neural network (RNN) to keep track of what has been generated so far; it uses its RNN output as the query to attend over the encoder's output, producing the context vector; it combines the RNN output and the context vector to generate the "attention vector"; and it generates logit predictions for the next token based on the "attention vector". The configuration follows the TensorFlow tutorial on neural machine translation with an attention-based sequence-to-sequence (seq2seq) model for Spanish-to-English translation [10]. We start with a test of how a new model adapts to a batch of text input from the data set; as Fig. 4 shows, the loss quickly goes towards zero.
Fig. 4. Data set batch loss.
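To make the decoder's attention step concrete, here is a hedged sketch of a Luong-style ("general") attention layer as described above; it is a simplification under our own assumptions, not the authors' exact implementation.

import tensorflow as tf

class LuongAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.wa = tf.keras.layers.Dense(units)                    # score weights
        self.wc = tf.keras.layers.Dense(units, activation='tanh')

    def call(self, query, values):
        # query: decoder RNN output (batch, t_dec, units)
        # values: encoder output   (batch, t_enc, units)
        scores = tf.matmul(query, self.wa(values), transpose_b=True)
        weights = tf.nn.softmax(scores, axis=-1)  # attention weights
        context = tf.matmul(weights, values)      # context vector
        # combine context vector and RNN output into the "attention vector"
        attention_vector = self.wc(tf.concat([context, query], axis=-1))
        return attention_vector, weights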
After having trained the model, some translation tests were carried out; an example in the Python programming language is shown below. Three Aymara phrases from Table 2 were entered (the table also reports the translations in Spanish and English), and the model evidently achieved a fairly clear translation, validated against the texts written in the book [9].

Table 2. Example of translations.

Example number | Aymara | Spanish | English
1 | Jumax uywanitati | ¿Tienes animales? | Do you have animals?
2 | Jikisiñkamay kullaka | Hasta luego hermana | See you later sister
3 | Juman Sutimax Kunasa? | ¿Cuál es tu nombre? | What's your name?
three_input_text = tf.constant([
    'Jumax uywanitati?',
    'Jikisiñkamay kullaka',
    'Juman Sutimax Kunasa?',
])
result = translator.tf_translate(three_input_text)
for tr in result['text']:
    print(tr.numpy().decode())
print()

output:
¿tu tienes animales?
hasta luego hermana
¿cual es tu nombre?

Listing 1.2. Python code for translation test.
Since there is a rough alignment between the input and output words, we expect the attention to be close to the diagonal. In Fig. 5a and 5b this is not very clear because the training texts have a maximum of two compound words; with more words, the trend would be much clearer, which is expected for the next experiment. The example translations have been evaluated against the original text and by a language specialist: the model apparently works for short texts, but it has difficulties with much longer ones. We believe this is because 1914 examples are too few for the model to learn well.
Fig. 5. Detail of the results for example translations one and three.
5 Conclusion
For machine translation, it is vital to have a data source with text in the languages involved. In spite of the existence of many documents in Aymara, there are fewer documents with a translation to Spanish, and these documents are available in different kinds of files (e.g. pdf, txt) and encoding standards (e.g. Unicode). For these reasons, a pre-processing step was necessary to unify the text from the different file sources. On the other hand, after data preparation, the automatic translation task was explored using the recurrent attention architecture proposed by Luong; promising results are obtained considering the metrics, and considering that this is one of the first works applying NLP and machine translation to the native Aymara language.
6 Future Work
An extension will be explored to obtain automatic translation from these languages, Aymara and Spanish, towards English. The translations can be considerably improved by increasing the number of translations in the data set; in the present case we only worked with 1914 translations, and the results have nonetheless been better than expected. Other recurrent network architectures will also be tested for comparison; apparently the native Aymara language adapts very well to recurrent models.
References

1. Llitjós, A.F., Levin, L., Aranovich, R.: Building machine translation systems for indigenous languages (2005)
2. Adelaar, W.F.H.: Endangered languages with millions of speakers: focus on Quechua in Peru (2014)
3. Bird, S., Chiang, D.: Machine translation for language preservation (2012)
4. Oladosu, J., Esan, A., Adeyanju, I., Adegoke, B., Olaniyan, O., Omodunbi, B.: Approaches to machine translation: a review (2016). https://doi.org/10.46792/fuoyejet.v1i1.26
5. Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015)
6. Instituto Nacional de Estadística e Informática: XII Censo Nacional de Población y VII de Vivienda (2017)
7. Ministerio de Educación: Lenguas originarias del Perú (2018)
8. Segura, G., Ricardo, R.: Aymara arutha chiqapa qillqañataki panka – Manual de escritura Aimara (2021)
9. Pairumani Ajacopa, R., Carrasco Lima, A.B.: AYMARA ARUSKIPAWINAKA: Conversaciones en aimara. Centro de Apoyo en Investigación y Educación Multidisciplinaria – CAIEM (2022)
10. Abadi, M., et al.: Large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/
Hand Gesture and Human-Drone Interaction

Bilawal Latif(B), Neil Buckley, and Emanuele Lindo Secco

Robotics Laboratory, School of Mathematics, Computer Science and Engineering, Liverpool Hope University, Hope Park, Liverpool L16 9JD, UK
{20210503,bucklen,seccoe}@hope.ac.uk
Abstract. Human-computer interaction is a wide domain which includes different modes of interaction, i.e., hand gestures and body postures. Gesture detection refers to non-verbal ways of delivering control information to a system. The aim of gesture recognition is to record gestures, which are then read and interpreted through a camera. Gesture recognition has a wide range of applications; it can be used by disabled persons to communicate, for instance. This paper focuses on controlling drones with hand gestures. The presented system is made of three main blocks: (1) the detection of gestures, (2) the translation of the gestures, and (3) the control of the drone. A deep learning algorithm is used in the first module to detect real-time hand gestures. The gesture translator then uses image processing techniques to identify gestures, and control signals are generated for controlling the drone. Preliminary validation results are promising.

Keywords: Hand gesture · Human robot interaction · Human drone interaction
1 Introduction

Humans use different ways to communicate with machines, and some common forms of communication are body postures. Drones are widely used in many applications, e.g., coverage of sports events, aerial photography, and the fast transport of equipment to emergency areas. Nowadays, researchers are even planning to use them as a mode of transportation. Much research has been brought forward in finding the most precise way to interact with drones, such as, for example, hand gesture control [1].

A Human-Computer Interface (HCI) can be referred to as a method to interact with machines. The basic example of interacting with a machine is a keyboard and a mouse used to give input to a personal computer. Advancements in HCI have created interest among researchers. The most significant concepts in HCI are usability and functionality [2]: functions are services or tasks offered by a system, while usability means using those functions appropriately. The increasing growth of drones has prompted researchers to establish a new field of study known as Human-Drone Interaction (HDI). Without a remote controller, it was previously difficult to interact with drones [3–6]. Many studies suggest that drones can be controlled by gestures and postures. In some of the experiments that controlled drones by gesture, a front camera attached to the drone was utilized, while in others a camera was used to control drones from a ground station.
1.1 Problem Formulation

Drones' fast growth and expansion make them an attractive topic for researchers in a variety of fields, whether for commercial or personal use. Most of the current study has been conducted utilizing third-party data, and many research groups have also taken an interest in controlling drones by different and efficient means. The author in [2], for example, uses a Kinect camera to detect the body and gestures, translating the gestures according to the requirements and sending them to a drone connected to a microprocessor via a Wi-Fi communication protocol, which finally connects to a Leap Motion controller. The Leap Motion controller tracks and recognizes gestures using its cameras and infrared LEDs. On the other hand, the author in [3] uses a different technology, namely a front camera, processing the images taken from the camera's video stream. Other contributions in the literature, such as [4], use a Kinect camera to detect predefined gestures. For hand gesture detection, a lot of work has been performed by Nvidia as well: they use depth, color, and IR sensors to gather the data [5].

1.2 Gesture Recognition

Usually, one human gesture has many meanings: when we raise a hand above the head or wave it, some people are unclear about what we mean and take it as "stop". This problem is not limited to gestures; it also affects written and spoken languages. Focusing on the human upper limb and, in particular, on the human hand, we could provide the following definitions: a movement of the hand is defined as a hand motion, whereas a single pose of the hand is called a hand posture. We will focus on the latter and try to provide an overview of how hand postures can be detected and monitored in terms of the available technologies and software.

Earlier, many researchers were attracted towards sensor-based hand gloves. Such a glove consists of sensors used to detect the motion of the fingers and hand. Nowadays, vision-based recognition is the newer concept; it is the focus of many researchers and is simply defined as detecting motion using a camera. There are two main vision-based techniques, i.e., model based and image based. The model-based technique builds a model of the human hand and uses it for recognition, while the image-based approach uses a camera to capture the image and then recognizes the human gesture with certain algorithms.

Using human hand gesture recognition for controlling an external device, such as a drone, is clearly an important aspect of applications where intuitive human-robot interactions are required. This aspect should give non-expert end-users access to new technologies and increase the inclusivity of their use. Clearly, such an approach also introduces some concerns, such as the need to consider which risks are implicitly inherited within it. At the stage of this work, we looked at how we can integrate human-robot interaction for controlling drones and how to preliminarily test the efficiency of such a solution. In this context, the following sections overview other solutions available in the literature (Sect. 2) and then present our approach (Sect. 3).
2 Background

2.1 Devices for Gesture Recognition

Hand gestures are key for the interaction between human and computer and are very convenient. The first part of this section gives a detailed discussion of different types of data gloves and their functionalities; the second touches on image processing techniques.

Wearable Glove. Wearable gloves have been designed since the 1970s, each with its own special capabilities and functionality. For example, the Sayre Glove was the first hand glove invented for the detection of hand gestures. The Data Entry Glove was then introduced back in 1980 and was used to enter data into computers. The data glove has given birth to hand sensor recognition techniques. Many researchers think that sign language is inspired by gestures and can be used to interact with computers. A data glove consists of sensors and wires attached to it. The position of the hand (open or closed) is determined by resistive sensors at the joints, which detect whether the joints are straight or bent. This data is conveyed to a computer and then interpreted into information. The advantage of these devices is that they do not need many resources and require limited processing power. They are a bit difficult to manage because of the huge number of wires; however, they were a worthy invention back in the 1990s [7]. With advances in technology, wireless sensors that transmit the data to the computer wirelessly have also been introduced. We may classify two main types of data glove, i.e., active and passive. A glove with many sensors to monitor finger movement and an accelerometer that sends data to the computer is an active glove; a glove with colored markers on it and no sensors is a passive glove [7].

MIT AcceleGlove. In the glove context, the MIT AcceleGlove represents an interesting solution, since it has wide capabilities compared to other systems. It was developed by AnthroTronix, an MIT company. It is a reprogrammable glove: users can reprogram it according to their needs and usage. It is widely used in sports, video games, etc. An accelerometer sensor is placed under each fingertip and at the back of the hand; it detects the position of the finger in 3D space and then predicts the motion with reference to a default position. It is a very user-friendly device where users can do their tasks after wearing it on their hands.

Other Devices. There are also many other gloves that are widely used, such as, for example, the Cyber Glove III (and Cyber Glove II), the 5D Sensor Glove, the X-IST Data Glove and the P5 Glove. These devices well represent the mix of methods for gesture recognition that were widely used before the introduction of vision-based methods.
2.2 Algorithms for Gesture Recognition

Hand gesture recognition is a demanding task and – from an image processing viewpoint – a crucial computer vision task. Detecting a hand in a congested scene is moreover challenging, as every human skin has its own color tone and many variations may occur in the scene. Therefore, to detect the hands present in each frame, we can use various methods with extreme precision in identifying the hands. It is also important to mention that it is challenging to achieve high accuracy in real time. Some methods of hand detection using a camera interface are based on the use of:

• Artificial Neural Networks
• Fuzzy Logic
• Genetic Algorithms

These methods can be combined with the following approaches.

Sensor-glove-based hand detection. Gloves consist of multiple sensors that detect the position and motion of the fingers and the palm of the hand. These techniques are very accurate and easy to use; however, the connection of the sensors to the computer is complex and sluggish, and the system itself is not cost effective.

Color-marker-based gloves. This technique uses color markers on the glove, which give separate colors to the palm and fingers. Extracting the geometric features gives the actual shape of the hand. This approach is not costly and is very straightforward, but it is not the most natural interaction with the machine.

Appearance based. The fingertip technique is widely used in image-based recognition. Here, for example, Nolker [8, 9] proposed a system called GREFIT which reconstructs an image of the hand using the fingertips. In this study the author shares some important points for locating fingertips, i.e., using different images as prototypes and marking fingertips with color. Most authors have proposed studies that reconstruct the hand with the help of fingertips, contours, and vectors.

Skin color thresholding. The most basic way to detect the hand is color range thresholding: setting the color range of human skin, all the elements outside that range are removed and only objects within the range remain. Some geometric calculation is then applied to the extracted hand to extract the fingers. The method fails for some systems for several reasons: the skin color range differs between humans; objects with a human skin color may also exist in the picture, making it difficult to extract the hand from the image; environmental changes affect the perceived skin color and corrupt the whole perception; and the hand may be placed in front of an object within the same color range. A minimal sketch of this thresholding approach is given below.
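# Minimal OpenCV sketch of skin-colour range thresholding; the HSV
# bounds are illustrative assumptions that would need tuning per user
# and lighting conditions.
import cv2
import numpy as np

frame = cv2.imread('hand.jpg')                    # hypothetical input image
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
lower = np.array([0, 30, 60], dtype=np.uint8)     # assumed lower skin bound
upper = np.array([20, 150, 255], dtype=np.uint8)  # assumed upper skin bound
mask = cv2.inRange(hsv, lower, upper)             # keep pixels in the skin range
hand = cv2.bitwise_and(frame, frame, mask=mask)   # everything else removed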
2.2.1 Hand Detection Using Deep Learning

One of the most accurate ways to detect the hand is applying deep learning techniques. Many researchers have been working on deep learning in the computer vision domain and many studies have emerged. According to some researchers, some architectures give good accuracy in image processing, such as, for example, AlexNet [6], VGG [10] and ResNet [11]. There are network architectures that can work as detectors, but they differ in speed and accuracy. Most neural networks are very accurate but cannot be used in real time. We need high accuracy in real time, and therefore YOLO [12] and SSD [13] can be used as a solution to these problems.

2.2.2 Hand Detection Using TensorFlow

TensorFlow is a framework defined by Google Inc. to ease the implementation of machine learning models and to optimize the training algorithms [14]. TensorFlow offers a wide range of operations, from numerical computations to neural network components. TensorFlow is a backend library which is used as a base for the Keras library. It allows developers to create ML applications by utilizing different tools, libraries, and resources. Keras is basically an API built on top of TensorFlow which eases the complex commands and instructions of TensorFlow, simplifying the testing, training and saving of a CNN model. TensorFlow's design also enables simple computation over a wide range of platforms. It permits defining flow graphs and topologies to indicate the flow of data over a graph, recognizing inputs as multidimensional arrays. It supports designing a flowchart of operations that may be performed on these inputs, which enter at one end and return as output. The TensorFlow workflow is basically organized in three parts, namely:

• Preprocess the data
• Build the model
• Train the model

Graphs contain multiple nodes, and each node acts as a calculator. All the calculators are connected to each other by streams of data packets; the data path is then set by these calculators and streams. MediaPipe is built on three different components:

• Performance evaluation
• A framework for gathering sensor data
• A collection of components

MediaPipe has built-in, ready-to-use models, which developers can amend and modify according to their needs. Hand detection is carried out very smoothly and easily without consuming many resources. Previously, real-time object detection with a camera at 30 fps with limited resources was not possible, but MediaPipe achieved this by tracking and detecting in parallel. MediaPipe detects the hand and its key points. Based on this background, we selected TensorFlow as our tool to detect human hand gestures and apply them to our system. Figure 1 shows the detected hand with all its key points. To detect the hand in real time, MediaPipe uses a single shot detector: the module is first trained with a palm detector model, as it is easier to train on palms; the fingers and joints are then detected.
Fig. 1. The hand posture detection using the MediaPipe library and the 20 key points of the hand, on panels 1 and 2, respectively.
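Before moving to the implementation, the skin color thresholding baseline discussed in Sect. 2.2 can be sketched in a few lines of OpenCV. This is a minimal illustration, not the system's actual code; the HSV bounds are illustrative assumptions, and their fragility to lighting and skin-tone variation is exactly the weakness noted earlier.

```python
import cv2
import numpy as np

def skin_mask(frame_bgr):
    # Threshold the frame in HSV space: pixels inside the assumed
    # skin color range are kept, everything else is removed.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)    # illustrative bounds
    upper = np.array([25, 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    # Morphological opening removes small speckles before any
    # geometric finger extraction is applied to the mask.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```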
Based on this background, we present the implementation and integration of a hand-gesture system for controlling a drone. The following text reports on the development of the visual gesture recognition and its code implementation (Sect. 3), followed by a conclusion (Sect. 4) reporting the main outcomes of the project.
3 Implementation
This section provides an overview of HCI and the background of gesture detection and its types. Here we give an overview of hand gesture detection and its many forms, as well as the various cameras used for 2D and 3D pictures. By one basic definition, a gesture is a nonverbal means of communication used in HCI systems. The primary goal of a gesture recognition system is to recognize human motion and utilize it to transmit information and establish an interface between user and machine. HCI has recently grown in importance as its use expands over a variety of applications, including human movement tracking. We must first establish the concept of human motion acquisition, which is the recording of a human's or an object's motions and the transmission of those movements as 2D or 3D information. Developing a 3D digital image requires the use of software and technologies that are deemed proprietary to certain organizations [9, 15]. One of the key aspects is the synchronization between the technology and the actual world, which guarantees that the system uses the human body motion while adhering to real-world standards and presenting information in a simple and reasonable sequence. The techniques used and some of the vision-based gesture detections are shown below in Fig. 2.
Fig. 2. Fist gesture (1-flying drone), peace gesture (2-moving forward), opposite thumb (3-turning right), open hand (4-landing drone), thumb and index (5-turning left), rock and roll (6-moving backwards)
3.1 Image Processing
TensorFlow provides many libraries that already include trained models. MediaPipe has a trained model of the hand, and some gestures are recognized by it as well. Therefore, we integrate all the libraries according to our requirements and determine how to use and access the functions of TensorFlow and MediaPipe. The library already includes a TensorFlow pre-trained model, which we simply load. The OpenCV module allows us to read frames from the camera, over which we perform landmark estimation. This functionality is offered by the MediaPipe module; we must then convert the image into RGB, since the function takes its input in RGB format. After that we predict the gesture by calling a function from the Keras library. MediaPipe performs the SSD at the backend. Landmarks are the key points on the hand that indicate its actual posture and allow it to be tracked. We then read the output mapping from a file and, on obtaining the output, send the corresponding signal to the drone, i.e., move forward, move backward, take off, land, etc.
3.2 Mapping Hand Gesture with the Drone Behavior
In order to have the drone flying, some conditions have to be met, and if any one of those conditions fails, the drone will not accept the arm command. If the drone is ready and the user does not show any gesture, it will enter guided mode and fly at an altitude of 1 m. A fist gesture is used to take off the drone.
• Moving Forward - To move the drone in the forward direction, use the peace gesture. The gesture is parsed by the camera and then passed to the drone controller module.
• Turn Right - To move the drone in the right direction, use the gesture shown in Fig. 2. The gesture is passed to the drone controller.
• Landing - The gesture shown in the same figure is used to land the drone. The gesture is parsed and then forwarded to the drone controller.
• Turn Left - The gesture shown in panel 5 (Fig. 2) is used to move the drone in the left direction.
• Moving Backward - The gesture shown in panel 6 (Fig. 2) is used to move the drone in the backward direction.
3.3 Code
The code starts by importing the necessary packages – OpenCV, NumPy, MediaPipe, TensorFlow – and loading the model from Keras. These libraries are commonly used for this kind of analysis: OpenCV is used for computer vision, while MediaPipe is the aforementioned library that runs over TensorFlow. mp.solutions.hands is the module that performs the hand detection algorithm, so an object is created from it. mp.solutions.hands.Hands is the function used for the configuration of the model; max_num_hands is the number of hands we want to be detected, and we set it to 1, whereas mp.solutions.drawing_utils draws and connects the key points. We then initialize TensorFlow, load the pre-trained models, and open the file that contains the names of the gestures we will perform; classNames reads and splits all the gesture names in numerical order. In the first line we create the VideoCapture object and pass 0 as an argument; if we have more than one camera we must pass a different value, otherwise we leave it at the default. Inside the loop we read every frame, then flip it and show it in a new window. The basic technique in image processing starts with drawing the landmarks of the object to be recognized; those landmarks are then passed to the predict function, which returns an array as shown in Fig. 3 (panel 6). The classID displayed below the predicted classes is the index of the gesture, and the predicted gesture label is finally drawn onto the frame.
Fig. 3. Code implementation: (1) Importing packages, (2) mp.solutions.hands, (3) TensorFlow initialization and loading of the pre-trained models, (4) Definition of the VideoCapture object, (5) Landmarks, (6) Output array of the prediction function
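A minimal sketch of the loop described above is given below. It assumes a pre-trained Keras classifier saved as gesture_model.h5 and a gesture_names.txt file with one gesture label per line (both file names are hypothetical, as is the landmark-array input shape of the classifier); it mirrors the steps of Fig. 3 but is not the authors' verbatim code.

```python
import cv2
import numpy as np
import mediapipe as mp
from tensorflow.keras.models import load_model

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils
hands = mp_hands.Hands(max_num_hands=1)          # detect a single hand

model = load_model('gesture_model.h5')           # hypothetical model file
class_names = open('gesture_names.txt').read().splitlines()

cap = cv2.VideoCapture(0)                        # 0 selects the default camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.flip(frame, 1)                   # mirror for natural interaction
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) # MediaPipe expects RGB input
    result = hands.process(rgb)
    if result.multi_hand_landmarks:
        h, w, _ = frame.shape
        for hand_lms in result.multi_hand_landmarks:
            mp_draw.draw_landmarks(frame, hand_lms, mp_hands.HAND_CONNECTIONS)
            landmarks = [[lm.x * w, lm.y * h] for lm in hand_lms.landmark]
            pred = model.predict(np.array([landmarks]))  # assumed shape (1, 21, 2)
            class_id = int(np.argmax(pred))
            cv2.putText(frame, class_names[class_id], (10, 30),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
            # the predicted label would be forwarded to the drone controller here
    cv2.imshow('gesture', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
```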
4 Conclusion
Gesture recognition is widely used across many applications, from smart home automation systems to the medical field; mostly it deals with the interaction between humans and machines. We have discussed the evolution of hand detection and reviewed the studies of many researchers. The system follows a block diagram of three main modules, i.e., hand detector, gesture detector and drone controller, where the image is passed as input to the system via a camera connected to the device. The main objective of this report is to propose a robust system that can work with high accuracy in real time. The system consists of three main modules:
1. Hand detection
2. Gesture recognition system
3. Drone controller
The first module uses deep learning models: a dataset is gathered and the model is trained, and the MediaPipe Python library, which uses SSD, is used to detect the hand. The second module uses the TensorFlow library to detect the hand gestures. It is a dynamic system, which means that if we want to add more gestures, we can add them without retraining the model. The last module, the drone controller, then takes the resulting signals and parses them accordingly. These three modules interact together seamlessly. The deep learning method of hand detection is the simplest solution and can replace any other method of gesture recognition. Clearly, in this context, other technologies and approaches may be considered where the end-user interacts with external devices in an intuitive way [16–19].
Acknowledgments. This work was presented in dissertation form in fulfilment of the requirements for the MSc in Robotics Engineering for the student Bilawal Latif under the supervision of N. Buckley and E.L. Secco from the Robotics Lab, School of Mathematics, Computer Science and Engineering, Liverpool Hope University.
References
1. Faa.gov: FAA Releases Aerospace Forecast—Federal Aviation Administration (2018). https://www.faa.gov/news/updates/?newsId=89870. Accessed 27 Sept 2021
2. Bin Abdul Mutalib, M.K.Z.: Flying drone controller by hand gesture using leap motion. Int. J. Adv. Trends Comput. Sci. Eng. 9(1.4), 111–116 (2020)
3. Brown-Syed, C.: Library and information studies and open-source intelligence. Libr. Arch. Secur. 24(1), 1–8 (2011)
4. Cheng, X., Ge, Q., Xie, S., Tang, G., Li, H.: UAV gesture interaction design for volumetric surveillance. Proc. Manuf. 3, 6639–6643 (2015)
5. Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., Kautz, J.: Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
7. Palacios, J., Sagüés, C., Montijano, E., Llorente, S.: Human-computer interaction based on hand gestures using RGB-D sensors. Sensors 13(9), 11842–11860 (2013)
8. Sacchi, C., Granelli, F., Regazzoni, C.S., Oberti, F.: A real-time algorithm for error recovery in remote video-based surveillance applications. Sig. Process.: Image Commun. 17(2), 165–186 (2002)
9. Kofman, J., Borribanbunpotkat, K.: Hand-held 3D scanner for surface-shape measurement without sensor pose tracking or surface markers. Virtual Phys. Prototyp. 9(2), 81–95 (2014)
10. Savidis, I., Vaisband, B., Friedman, E.G.: Experimental analysis of thermal coupling in 3-D integrated circuits. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 23(10), 2077–2089 (2015)
11. MMBIA IEEE Computer Society Workshop on Mathematical Methods in Biomedical Image Analysis in conjunction with Computer Vision and Pattern Recognition (CVPR), Kauai, Hawaii, USA, 8–9 December 2001 (2001). http://ipagwww.med.yale.edu/mmbia2001. Med. Image Anal. 5(2), 171
12. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016). https://arxiv.org/pdf/1506.02640.pdf
13. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
14. Rampasek, L., Goldenberg, A.: TensorFlow: biology's gateway to deep learning? Cell Syst. 2(1), 12–14 (2016)
15. Ramachandra, P., Shrikhande, N.: Hand gesture recognition by analysis of codons. In: Intelligent Robots and Computer Vision XXV: Algorithms, Techniques, and Active Vision (2007)
16. Buckley, N., Sherrett, L., Secco, E.L.: A CNN sign language recognition system with single & double-handed gestures. In: IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1250–1253 (2021)
17. Secco, E.L., McHugh, D.D., Buckley, N.: A CNN-based computer vision interface for prosthetics' application. In: EAI MobiHealth 2021 - 10th EAI International Conference on Wireless Mobile Communication and Healthcare (2021)
18. McHugh, D., Buckley, N., Secco, E.L.: A low-cost visual sensor for gesture recognition via AI CNNs. In: Intelligent Systems Conference (IntelliSys) 2020, Amsterdam, The Netherlands (2020)
19. Maereg, A.T., Lou, Y., Secco, E.L., King, R.: Hand gesture recognition based on near-infrared sensing wristband. In: Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2020), pp. 110–117 (2020). https://doi.org/10.5220/0008909401100117
Design and Develop Hardware Aware DNN for Faster Inference S. Rajarajeswari1 , Annapurna P. Patil1 , Aditya Madhyastha1 , Akshat Jaitly1 , Himangshu Shekhar Jha1 , Sahil Rajesh Bhave1(B) , Mayukh Das2 , and N. S. Pradeep3 1 Department of Computer Science and Engineering, M.S. Ramaiah Institute of Technology,
Bangalore 560054, India {raji,annapuranp2}@msrit.edu, [email protected] 2 Microsoft Research, Bangalore, India [email protected] 3 Samsung R&D Institute, Bangalore, India [email protected]
Abstract. On many small-scale devices, advanced learning models have become standard. The need of the hour is to reduce the amount of time required for inference. This study describes a pipeline for automating Deep Neural Network customization and reducing neural network inference time. This paper presents a hardware-aware methodology in the form of a sequential pipeline for shrinking the size of deep neural networks. MorphNet is used at the pipeline's core to iteratively shrink and expand a network: a resource-weighted sparsifying regularizer is applied to the activations of layers to identify and prune inefficient neurons, and all layers are then expanded using a uniform multiplicative factor. This is followed by fusion, a technique for combining the frozen batch normalization layer with the preceding convolution layer. Finally, the DNN is retrained after customization using a Knowledge Distillation approach to maintain model accuracy. The approach shows promising initial results on MobileNetv1 and ResNet50 architectures. Keywords: MorphNet · Fusion · Knowledge distillation
1 Introduction
There has been an ever-increasing amount of research done with the goal of developing Deep Neural Networks with lower latency. Deep Neural Networks are generally large and thus less efficient when used on small devices like mobile phones and tablets. The more weights and layers a neural network has, the more floating-point operations are involved. Thus, there is a dire need for a method that can reduce the network size while keeping accuracy in mind. The reduction in size should not hamper the accuracy that the original network was able to achieve and, at the same time, should reduce the latency of the original network. With the growing ubiquity of deep neural networks, automating the process of their design and development, especially on edge devices, has become a field of active research. This poses an interesting
unsolved challenge, where consideration has to be given to energy efficiency as well as real-time inference capability. In this paper, an automated pipeline is proposed to obtain a new model architecture based on the hardware device in order to produce lower inference latency. Given the neural network and the specifications of a hardware device, it produces an optimized structure for the model to reduce inference time on that particular device. First, an optimized map is generated using MorphNet and the network is pruned based on this new structure. This reduces the overall size of the model, leading to a reduction in latency. While retraining this new model, knowledge distillation is used to preserve accuracy. Fusion and quantization are then performed to obtain the final model. The experimentation showed a significant reduction in inference latency on mobile devices for the ResNet and MobileNet models. To summarize, the contributions are as follows:
• An approach is proposed to reduce the inference latency of deep neural networks on a particular edge device by optimizing the network structure for that device without significant loss in accuracy.
• Results on ResNet50 and MobileNetv1 on mobile devices demonstrate a significant reduction in latency.
2 Literature Survey
A survey was first conducted to look into what determines hardware efficiency and to figure out correlations between performance and model parameters/characteristics. The two biggest challenges faced in designing hardware-aware deep learning models are:
Characterizing hardware performance of DL models [1]: Hardware performance with respect to run-time and energy consumption determines the efficiency of DL models. Modeling tools like Eyeriss, Paleo, and Neural Power help to accurately model and predict hardware efficiency.
Designing DL models under hardware constraints [1]: The optimization of hyper-parameters of DL models, i.e., tuning parameters such as the number of layers or the number of filters present in each layer, is a very extensive process.
Performance characteristics of a deep learning model with respect to mobile devices: In mobile devices, latency-throughput increases are not linearly dependent on the batch size. Instead, sometimes increasing the batch size by one image produces higher throughput with a minimal increase in latency, and on other occasions this increase in batch size results in both lower throughput and higher inference latency. Depending on the inference model and the hardware device being used, certain batch sizes are more optimal than others. The optimal batch size is not straightforward to estimate without actual measurements.
The main finding of the paper [2] was that CNNs for mobile computer vision have significant latency–throughput trade-offs. The behavior of this trade-off is very complex: a number of different factors affect the performance, which yields the complex behavior and poses a great challenge for automatic optimization mechanisms. When designing a network from scratch over a large search space of possible architectures, the application becomes unreasonably computationally expensive in terms of both resources and time, which is not feasible for many institutions. Instead of techniques based on Neural Architecture Search (NAS), an existing architecture for a similar problem can be utilized and, in one shot, optimized based on the requirement.
Fig. 1. Targeted Regularization [3]
Targeted Regularization
Targeted regularization targets the reduction of a particular resource. The resource can be FLOPs per inference or model size. Illustration of Fig. 1: The left part of the figure shows a network with the ResNet-101 architecture. The center shows the model after pruning, with 40% fewer FLOPs. The right panel shows the structure after the model size is reduced by 43%. When optimizing for computation cost, higher-resolution neurons in the lower layers of the network are pruned more than lower-resolution neurons in the upper layers. When targeting a smaller model size, the pruning trade-off works in the opposite manner.
Knowledge Distillation
Knowledge distillation is a model compression technique which involves teaching a smaller network, step by step, exactly what to do using a bigger, already trained network. The ability to extract complex features is determined by the depth of a model. Deeper neural networks involve a large number of floating-point operations and require high-speed processors to produce fast inferences; thus, they give higher latency on mobile devices. Larger models have been trained using a large amount of memory and powerful GPUs and have a certain encoding of the data embedded within their layers. Thus, it can be useful to use the outputs from this deeper neural network while training a smaller network, instead of using just the ground truth from the available dataset. The knowledge from the larger model is transferred to the smaller network, which can then be run on edge devices with tighter memory constraints. To achieve this, consider the heavy model as the teacher network and the new small model as the student network. The distilled model, also called the student, is trained to mimic the output of the larger network, the teacher, instead of being trained on the raw data directly.
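As an illustration of this idea, a distillation loss can be written as a weighted sum of the usual hard-label loss and a softened teacher-student loss. The snippet below is a generic TensorFlow sketch, not the pipeline's actual code; the temperature and alpha values are illustrative hyperparameters.

```python
import tensorflow as tf

def distillation_loss(y_true, student_logits, teacher_logits,
                      temperature=4.0, alpha=0.1):
    # Hard loss: student vs. the ground-truth labels from the dataset.
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True)
    # Soft loss: student vs. the teacher's temperature-softened outputs;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft = tf.keras.losses.kl_divergence(
        tf.nn.softmax(teacher_logits / temperature),
        tf.nn.softmax(student_logits / temperature)) * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft
```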
3 Proposed Approach
The proposed pipeline, shown in Fig. 2, is an automated pipeline that receives as input a neural network model and details about the hardware device for which it has to be optimized. The pipeline generates a new architecture that will produce lower inference times on that specific hardware device.
3.1 Hardware Metrics
From Fig. 3, it can be inferred that for the P100 device the number of FLOPs is tightly bound with latency, i.e., on decreasing the FLOPs the latency will also decrease, whereas for the V100 it is loosely bound. In a generalized scenario, it can be assumed that the latency is directly proportional to the number of FLOPs. Peak compute is therefore used as one of the metrics, where peak compute refers to the maximum number of GFLOPs, or giga floating-point operations per second, of the hardware. Inference latency also depends on how fast the processor can load weights from memory; memory bandwidth is the rate at which data can be read from or stored into memory by a processor. These two metrics can be generalized for a wide variety of deep learning models and are used by MorphNet to perform hardware-aware pruning during training time.
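To make this concrete, the open-source morph_net package exposes the resource-weighted regularizer as a term added to the training loss. The sketch below follows the package's documented TF1-style usage on a toy model; the exact signatures may vary between releases, and the toy network and regularization strength are illustrative assumptions.

```python
import tensorflow.compat.v1 as tf
from morph_net.network_regularizers import flop_regularizer

tf.disable_v2_behavior()

inputs = tf.placeholder(tf.float32, [None, 32, 32, 3])
labels = tf.placeholder(tf.int64, [None])

# A toy conv net; MorphNet reads the batch-norm gammas to decide
# which output channels are inefficient and can be zeroed out.
net = tf.layers.conv2d(inputs, 64, 3, use_bias=False)
net = tf.nn.relu(tf.layers.batch_normalization(net, scale=True, training=True))
logits = tf.layers.dense(tf.layers.flatten(net), 10)

reg = flop_regularizer.GammaFlopsRegularizer(
    output_boundary=[logits.op],
    input_boundary=[inputs.op, labels.op],
    gamma_threshold=1e-3)

model_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
# The FLOP-weighted term is added to the loss, so training itself
# learns which neurons to prune.
total_loss = model_loss + 1e-9 * reg.get_regularization_term()
```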
Fig. 2. Proposed pipeline
Fig. 3. Latency vs. FLOPs for InceptionV2 [3]
3.2 Hardware Aware Pruning
MorphNet optimizes the network architecture in two phases: a shrinking phase and an expanding phase. A MorphNet cost metric is calculated for each neuron with respect to the target resource. This cost is added to the overall loss during training time by the pipeline; thus, during training, MorphNet is able to learn which neurons are not very resource-efficient and can be pruned from the architecture. During the expansion phase, MorphNet makes use of a width multiplier to increase the number of neurons. For example, if the number of neurons in a layer shrank from 350 to 100, and expansion is done by 50%, then the layer would now have 150 neurons. These two phases are performed during each training step as gradients are computed. The hardware-aware nature of the MorphNet-based optimization is achieved by taking the peak compute and memory bandwidth of the targeted hardware as input hyperparameters.
3.3 Customization of Model Based on Pruning
A final map is obtained that consists of the total number of neurons/channels that need to be preserved for each layer of the network. This map is stored in JSON format. Open-source tools such as KerasSurgeon are incorporated into the pipeline to actually prune the model based on the results obtained from the previous step in the pipeline. With this, the new model is generated with the new optimized architecture for the input model.
3.4 Retraining Post Customization
For retraining post customization, Knowledge Distillation has been incorporated as part of the pipeline. Here, the pruned network can learn not just from the hard labels present in the dataset but also from the soft labels that are the predictions of the original model. The trained input model acts as a teacher network and the optimized model acts as a
student. This enables the preservation of accuracy despite the reduction in the size of the model.
3.5 Fusion
The next step in the pipeline is fusion. Fusion helps further reduce the inference time by merging the frozen batch norm layer with the previous Conv2D layer. This is done using a Keras inference-time optimizer library that works on a trained model.
3.6 Quantization
The final step in the pipeline is quantization. This involves approximating a neural network that uses floating-point numbers by a network that uses lower bit-width weights in each layer. FP16 or FP32 quantization can be applied based on the requirement. With this, the final trained model is obtained.
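The arithmetic behind fusion is the standard batch-norm folding identity: because a frozen BatchNorm layer is an affine map per channel, it can be absorbed into the preceding convolution's weights and bias. A minimal NumPy sketch of that fold (with an assumed epsilon of 1e-3) is:

```python
import numpy as np

def fuse_conv_bn(conv_w, conv_b, gamma, beta, mean, var, eps=1e-3):
    # conv_w has shape (kh, kw, c_in, c_out); the BN statistics gamma,
    # beta, mean and var are per output channel, with shape (c_out,).
    scale = gamma / np.sqrt(var + eps)
    fused_w = conv_w * scale              # broadcasts over the last axis
    fused_b = beta + (conv_b - mean) * scale
    return fused_w, fused_b
```

After the fold, the BatchNorm layer is dropped, so inference skips one whole set of per-channel operations.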
4 Experimentation and Results
The initial testing was done in the Google Colab cloud environment. The architectures used were ResNet50 and MobileNetv1. The dataset used was CIFAR-10, which contains 60,000 images of dimension 32 × 32 divided into 10 different classes. To measure the inference time, we had the model predict over 10,000 test-set images. Before taking the measurement, it is necessary to perform a GPU warm-up so that the weights are loaded into memory. All of the models had a drop in accuracy of only 0.4%, so the performance was preserved. The results of testing on mobile devices are:
• Samsung M31: Device details: Exynos 9611 CPU and Mali-G72 MP3 GPU. On this device, a reduction in inference latency from 2.55 s to 0.905 s for the ResNet50 model and a drop from 467 to 170 ms for the MobileNetv1 model was observed.
• OnePlus Nord: Device details: SnapDragon 765g CPU and Adreno 620 GPU. On this device, a reduction in inference latency from 1.204 s to 0.375 s for the ResNet50 model and a drop from 202 to 170 ms for the MobileNetv1 model was observed.
Similarly, observations for other mobile devices are tabulated in Tables 1 and 2.
Table 1. Observations on MobileNet architecture

| Device | CPU | GPU | Memory bandwidth | Initial inference time | Inference time after optimization |
|---|---|---|---|---|---|
| Samsung M31 | Exynos 9611 | Mali-G72 MP3 | 11.92 GiB/s | 367 ms | 170 ms |
| OnePlus Nord | SnapDragon 765g | Adreno 620 | 17 GiB/s | 202 ms | 170 ms |
| RealMe 2 Pro | SnapDragon 660 | Adreno 512 | 13.9 GiB/s | 491.01 ms | 407.37 ms |
| OnePlus 6T | SnapDragon 845 | Adreno 630 | 29.87 GiB/s | 140.569 ms | 122.46 ms |
| Redmi Note 7 Pro | SnapDragon 675 | Adreno 612 | 14.6 GiB/s | 475.68 ms | 382.67 ms |
Table 2. Observations on RESNET50 architecture

| Device | CPU | GPU | Memory bandwidth | Initial inference time | Inference time after optimization |
|---|---|---|---|---|---|
| Samsung M31 | Exynos 9611 | Mali-G72 MP3 | 11.92 GiB/s | 2.55 s | 0.905 s |
| OnePlus Nord | SnapDragon 765g | Adreno 620 | 17 GiB/s | 1.204 s | 0.375 s |
| RealMe 2 Pro | SnapDragon 660 | Adreno 512 | 13.9 GiB/s | 1.001 s | 0.309 s |
| OnePlus 6T | SnapDragon 845 | Adreno 630 | 29.87 GiB/s | 0.797 s | 0.252 s |
| Redmi Note 7 Pro | SnapDragon 675 | Adreno 612 | 14.6 GiB/s | 4.07 s | 1.62 s |
These observations from Tables 1 and 2 are plotted in Figs. 4 and 5, respectively:
The plot of initial vs. optimized inference time for ResNet50 (Fig. 4) clearly shows a decrease of almost 50%, whereas for MobileNetV1 (Fig. 5) the trends show a reduction of almost 20% in the inference time after passing through the pipeline.
Fig. 4. Inference times for ResNet50 before/after optimization
Fig. 5. Inference times for MobileNetV1 before/after optimization
5 Conclusions
Inference latency is a crucial factor that needs to be taken into account while designing network architectures in today's world. Many real-world applications, such as face recognition on mobile devices and filters in various social media applications, require very fast inference times. To this end, this paper surveys the latest research in this field and proposes a pipeline to generate an optimized network architecture given an input model
and details of the hardware device for which inference time needs to be reduced. The pipeline extracts hardware metrics, performs hardware-aware pruning by implementing MorphNet, performs ad hoc customization of the model based on the pruned map, fuses the Conv and Batch Norm into a single layer, and finally performs Knowledge Distillation during retraining so as to maintain performance accuracy. This approach performed well both on the cloud platform and on mobile devices, as detailed in the results section. On various mobile devices, the inference time was reduced by around 50% for the ResNet50 model, while for the MobileNetV1 architecture a reduction of around 20% was observed using this technique.
Acknowledgements. We would like to extend our gratitude to the Samsung PRISM program for providing us with the opportunity to work on this research project, and for their guidance and support throughout the duration of the same.
References
1. Marculescu, D., Stamoulis, D., Cai, E.: Hardware-aware machine learning. In: Proceedings of the International Conference on Computer-Aided Design (2018). https://doi.org/10.1145/3240765.3243479
2. Hanhirova, J., et al.: Latency and throughput characterization of convolutional neural networks for mobile computer vision. In: Proceedings of the 9th ACM Multimedia Systems Conference (2018). https://doi.org/10.1145/3204949.3204975
3. Gordon, A., et al.: MorphNet: fast & simple resource-constrained structure learning of deep networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1586–1595 (2018). https://doi.org/10.1109/CVPR.2018.00171
Vision Transformers for Medical Images Classifications Rebekah Leamons, Hong Cheng, and Ahmad Al Shami(B) Department of Computer Science, Southern Arkansas University, Magnolia, AR, USA {hcheng,aalshami}@saumag.edu
Abstract. Image classification methods based on Vision Transformers (VT) are gaining popularity, adapting the standard Transformer models of natural language processing (NLP). The VT does not depend on convolution blocks; rather, it captures the relative relations between image pixels regardless of their distance in the image. In this paper, three different Deep Learning models were developed to detect the presence of Invasive Ductal Carcinoma, the most common form of breast cancer. These models include a convolutional neural network (CNN), used as a baseline, a residual neural network (RNN), and a Vision Transformer (VT). To test these models, we used a dataset of breast cancer tissue images. These images were used to train and validate our three models, which reached different levels of high accuracy. Experimental results demonstrate that the VT model outperforms the CNN and RNN, reaching a classification accuracy of up to 93%, while the other models' highest rate was 87%. Keywords: Deep learning · Classification · Predictive analysis · Computer vision · Convolutional Neural Network · Residual Neural Network · Vision Transformer · Breast cancer
1 Introduction
Breast cancer is the most common form of cancer, and Invasive Ductal Carcinoma (IDC) is the most common form of breast cancer [4]. More than 180,000 women in the United States are diagnosed with invasive breast cancer each year, and about 8 out of 10 of these women are diagnosed with IDC [5]. Automating the diagnostic process could lead to both faster and more accurate diagnoses. The primary goal of our research is to determine which deep learning model is most effective in detecting IDC. This could help make the diagnostic process much more efficient, both for IDC diagnosis and, eventually, for many other medical conditions. The models were trained separately on the same dataset to determine their success rates. The objectives of this study are:
• To create Residual Neural Network and Vision Transformer models to classify an image as IDC negative or IDC positive,
• To test the Residual Neural Network and Vision Transformer's accuracy against a Convolutional Neural Network, and
• To determine which model is most efficient at IDC detection.
2 Background
2.1 Convolutional Neural Network (CNN)
Traditionally, a Convolutional Neural Network would be used for image classification. From here on, a Convolutional Neural Network will be abbreviated as CNN. The primary purpose of convolution in the case of a CNN is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. A CNN contains two phases: feature learning and classification. In feature learning, an input is first given to the program; in this case, the input is an image. Then, convolution layers are used to create a feature map. Convolution is an image processing technique that slides a weighted kernel (square matrix) over the image, multiplying and adding the kernel elements with the image pixels. Figure 1 depicts the full CNN process as follows.
Fig. 1. This diagram shows the phases of a CNN, from taking input, running it through a convolution layer (2D convolution with 1 filter (channel), 3 × 3 height & width, and 0 padding), pooling, and classifying the data for output [8].
The convolution of f and g, for example, written as f ∗ g, is defined as the integral of the product of the two functions after one is reversed and shifted:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \tag{1}$$
The convolution layers apply filters and padding to the input to create a map of any important features in an image. The padding increases the size of the convolved image. The relationship between padding and the output size of the convolutional layer is given by:

$$O = \frac{n_x + 2p - n_h}{S} + 1 \tag{2}$$
where O is the output size and S is the stride parameter, which indicates the jump size the convolution kernel takes as it slides over the image; the default value S = 1 means that the kernel moves one pixel at a time, although S = 2 is also common. p is the padding, i.e., how much zero-padding is placed around the input image, n_x is the length of the input signal and n_h is the length of the filter [9]. For example, a 50 × 50 input with a 3 × 3 kernel, padding p = 1 and stride S = 1 gives O = (50 + 2 − 3)/1 + 1 = 50, so the spatial size is preserved. Consider a CNN trying to detect human faces; it would likely map out eyes, lips, and noses on its feature map. After the convolution layer, a pooling layer is used. The pooling layer is a dimensionality reduction technique used to reduce the size of the layer, which reduces the necessary computational power. Finally, the data is classified into a predetermined number of categories.
2.2 Residual Neural Networks (RNN)
Residual Neural Networks are another common model used for image classification. From here on, a Residual Neural Network will be referred to as an RNN. In a traditional neural network, the input runs straight through weight layers and activation functions, shown in the diagram below as f(x) [6]. An RNN is very similar to a traditional neural network in terms of its layer structure. However, in an RNN, there are connections between layers known as "skip" connections. These skips connect the layers in a loop, which allows the program to return to a previous weight layer or activation function and run it again [7]. This optimizes the program without having to add additional layers, which also improves computational efficiency. The diagram below (see Fig. 2) shows a traditional neural network (left) and an RNN (right).
Fig. 2. This diagram compares a residual neural network to a standard neural network. In this diagram, we can see how a standard neural network only works through the weight layers and activation function, f(x), once, while a residual neural network can "skip" back and run through f(x) multiple times.
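A minimal Keras sketch of such a residual block (layer sizes are illustrative, not the paper's exact architecture) makes the skip connection explicit:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    # f(x): two convolutions, as in the weight layers of Fig. 2.
    # Assumes x already has `filters` channels so the shapes match.
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    # The "skip" connection: the block outputs f(x) + x rather than f(x).
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)
```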
2.3 Vision Transformers (VT)
Transformers have only recently been used in image classification, but they are quickly proving their efficiency. A transformer works by first splitting an input image into patches
and giving these patches a "position," which allows the program to keep track of the original image composition. The patches are then run through special encoders to try to relate one patch of an image to another. Transformers were traditionally used in language processing, so it may be easier to think of relating these patches as relating words in a sentence. As humans, we can comprehend which words in a sentence relate to one another; for instance, in the sentence "Sarah is smart, Ben is creative," we know that smart relates to Sarah and creative relates to Ben. Similarly, the transformer tries to isolate patches of an image and relate them to one another to find "context," which in this case is IDC positive or IDC negative. After the patches have been through the encoder, the image is run through a neural network and classified. Figure 3 below shows this process.
Fig. 3. This diagram shows how a Vision Transformer works, using an example from the dataset. As seen in the diagram, the transformer takes an image, splits it into patches, creates a linear projection of these patches, assigns a location to the patches, runs them through an encoder, then sends the data through a neural network, before finally classifying the image.
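A sketch of the patch-splitting and linear-projection step in Keras is shown below; the 64 × 64 patch size (giving 16 patches of a 256 × 256 input, as in Sect. 4) and the embedding width are illustrative assumptions rather than the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

class PatchEmbedding(layers.Layer):
    def __init__(self, patch_size=64, num_patches=16, dim=128):
        super().__init__()
        self.patch_size = patch_size
        self.proj = layers.Dense(dim)                  # linear projection
        self.pos = layers.Embedding(num_patches, dim)  # learned positions

    def call(self, images):                            # (batch, 256, 256, 3)
        batch = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID")
        # Flatten the 4x4 grid of patches into a sequence of 16 vectors.
        patches = tf.reshape(patches, [batch, -1, self.patch_size ** 2 * 3])
        positions = tf.range(tf.shape(patches)[1])
        # Each projected patch is tagged with its position embedding.
        return self.proj(patches) + self.pos(positions)
```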
3 Methodology
This project used a dataset provided by Paul Mooney at Kaggle.com. The code provided there was also used as a building block to create our own CNN model. The papers An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale [1] and Attention Is All You Need [3], hosted on Cornell University's arXiv, were used as research for our Transformer. We began by downloading the dataset provided by Paul Mooney. This dataset was then run through a preprocessing program that divided it into positive and negative categories and prepared the data for testing. The data was then used to train the CNN, RNN, and Vision Transformer models, which run as described above. Histograms like the ones presented below (see Fig. 4 and Fig. 5) were also used to plot these images and show their pixel intensity. These histograms were useful image filters because, to the untrained eye, the primary difference between IDC (−) and IDC (+) images is their coloration.
Fig. 4. The above image shows a breast tissue sample that tested positive for IDC (left) and a diagram displaying the pixel intensity across the image. This intensity can be a factor in classification, depending on the model.
Fig. 5. The above image shows a breast tissue sample that tested negative for IDC (left) and a diagram displaying the pixel intensity across the image. This intensity can be a factor in classification, depending on the model.
4 Dataset Description and Preprocessing
This dataset contains approximately five thousand 50 × 50 pixel RGB images of H&E-stained breast histopathology samples [2]. These images are labeled either IDC or non-IDC (see Fig. 6). The dataset is contained in NumPy arrays, and these arrays are small patches taken from digital images of breast tissue samples. Breast tissue contains many cells, but only some of them are cancerous. To better process these images, each image was resized to either 50 by 50, in the case of the CNN and RNN, or 256 by 256 in the case of the transformer. Additionally, in the case of the transformer, each image was split into 16 patches as described above. Otherwise, the data was taken as-is into our suggested model.
324
R. Leamons et al.
Fig. 6. The above images are random examples of breast tissue histopathology images taken from the dataset. The three on the left represent images that were classified as IDC (−), while the three on the right represent images that were classified as IDC (+).
5 Results
Each of the models was trained on the full dataset for 100 epochs. The CNN model predicted the presence of breast cancer with 81.4% accuracy, the RNN model with 87.9% accuracy, and the Vision Transformer with 93% accuracy. The model accuracy of the CNN is shown below (see Fig. 7a). One can see that after each iteration, or epoch, the accuracy increased; however, the model's accuracy begins to plateau around the thirtieth epoch. The model accuracy of the RNN is presented below (see Fig. 7b). Much like the CNN, the accuracy increases after each epoch. The RNN also increases in accuracy faster, without plateauing as the CNN did; however, the model showed no improvement above an 87% accuracy rate. The model accuracy of the Vision Transformer (VT) is also shown below (see Fig. 7c). One can see how the accuracy increases after each epoch. The VT also increases in accuracy faster, without plateauing as the other models did. Overall, this implies that the Vision Transformer is a better model.
Fig. 7. a. This image shows the model accuracy over time for the CNN model. Along the X-axis we see the epoch, or iteration, while the Y-axis shows accuracy. From this graph, we can see how the accuracy peaks at a maximum of 84% after 30 epochs, with only a slight increase after this. b. This image shows the model accuracy over time for the RNN model. Along the X-axis we see the epoch, or iteration, while the Y-axis shows accuracy. From this graph, we can see how the accuracy slightly improved without plateauing. c. This image shows the model accuracy over time for the transformer model. Along the X-axis we see the epoch, or iteration, while the Y-axis shows accuracy. From this graph, one can see how the accuracy gradually improves without plateauing. The accuracy peaked at a maximum of 93% after a few epochs.
6 Conclusion
This study confirmed two things. The first is that computer vision can be used to diagnose breast cancer with accuracies up to or over 90%. This means that, with some refinement, these tools could soon be used as a diagnostic tool by doctors. The second is that, of the models tested, Vision Transformers are the most accurate at diagnosing cancer when given photos of histopathology samples, with an accuracy of 93%. In the future, we hope to research Topological Data Analysis, also known as TDA. TDA can be used in computer vision by isolating the topology of an image and using the filtered data as a sort of feature map, similar to the feature maps created by a CNN. These feature maps can then be used by the computer to make assumptions about the data. We hope to study TDA as another resource in image classification.
Acknowledgment and Disclaimer. This material is based upon work supported by the National Science Foundation (NSF) under Award No. OIA-1946391. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References
1. Dosovitskiy, A., et al.: An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. Cornell University, 3 June 2021. https://arxiv.org/abs/2010.11929
2. Mooney, P.: Predict IDC in Breast Cancer Histology Images. Kaggle, 6 March 2018. https://www.kaggle.com/paultimothymooney/predict-idc-in-breast-cancer-histology-images
3. Vaswani, A., et al.: Attention Is All You Need. Cornell University, 6 December 2017. https://arxiv.org/abs/1706.03762
4. Invasive Breast Cancer (IDC/ILC). Cancer.org. American Cancer Society, 19 November 2021. https://www.cancer.org/cancer/breast-cancer/about/types-of-breast-cancer/invasive-breast-cancer.html
5. Invasive Ductal Carcinoma: Diagnosis, Treatment, and More. Breastcancer.org, 13 October 2021. https://www.breastcancer.org/symptoms/types/idc
6. Bhattacharyya, S.: Understand and Implement ResNet-50 with TensorFlow 2.0. Medium, Towards Data Science, 9 September 2021. https://towardsdatascience.com/understand-and-implement-resnet-50-with-tensorflow-2-0-1190b9b52691
7. Module: Tf.keras.applications.RESNET: Tensorflow Core v2.7.0. TensorFlow, 12 August 2021. https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet
8. 2D Convolution Block. Peltarion (2022). https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/blocks/2d-convolution. Accessed 10 Feb 2022
9. The Mathematical Engineering of Deep Learning Home Page. https://deeplearningmath.org/convolutional-neural-networks.html. Accessed 10 Feb 2022
A Survey of Smart Classroom: Concept, Technologies and Facial Emotions Recognition Application Rajae Amimi(B) , Amina Radgui, and Hassane Ibn El Haj El National Institute of Posts and Telecommunications, Rabat, Morocco {amimi.rajae,radgui,ibnelhaj}@inpt.ac.ma
Abstract. Technology has transformed traditional educational systems around the globe; integrating digital learning tools into classrooms offers students better opportunities to learn efficiently and allows the teacher to transfer knowledge more easily. In recent years, there have been many improvements in smart classrooms. For instance, the integration of facial emotion recognition (FER) systems has transformed the classroom into an emotionally aware area using the power of machine intelligence and IoT. This paper provides a consolidated survey of the state-of-the-art in the concept of smart classrooms and presents how the application of FER systems significantly takes this concept to the next level.
Keywords: Smart classrooms · Students' affect states · FER system · Intelligent tutoring · Student expressions database
1 Introduction
The concept of a modern classroom has long attracted the interest of many researchers. It dates back to the 17th century, when the Pilgrim Fathers established the first public school in 1635. Since the 1980s, with the development of information technologies such as networking, multimedia, and computer science, the classrooms of various schools have become more and more information-based at different levels. Generally, a classroom is defined as an educational space where a teacher transfers knowledge to a group of students; this learning environment is one of the basic elements that influence the quality of education. Therefore, researchers suggest smart classrooms as an innovative approach that gives rise to a new intelligent teaching methodology, which has become popular since 2012. The literature presents two visions of this concept. The first approach, which has taken good advantage of the joint growth of the computer science and electronics industries, concentrates on the feasibility of deploying various intelligent devices as replacements for traditional materials, such as replacing books with optical discs or pen drives, or getting rid of the chalkboard in favor of an interactive whiteboard. In his paper
“what is a smart classroom?”, Yi Zhang [45] notes: “The smart classroom can be classified as a classroom with computers, projectors, multimedia devices (video and DVD), network access, loudspeakers, etc. and capable of adjusting lighting and controlling video streams.” According to some authors, this concept of a “technology-rich classroom” has a significant limitation in that it concentrates solely on the design and equipment of the classroom environment, ignoring pedagogy and learning activities [42]. However, propositions from other studies point to a different insight: the authors of [25] envision making classrooms an emotionally aware environment that emphasizes improving teaching methodologies. This second approach focuses on the pedagogical aspect rather than technology and software design. Kim et al. propose integrating machine intelligence and IoT to create a classroom that can listen, see and analyze students’ engagement [25]. In 2013, Derick Leony et al. confirmed that “Many benefits can be obtained if intelligent systems can bring teachers with knowledge about their learner’s emotions” [30]; emotions may be a fundamental factor that influences learning, as well as a driving force that encourages learning and engagement. As a result, researchers suggest a variety of approaches for assessing students’ affect states, including textual, visual, vocal, and multimodal methodologies. According to [43], the most widely utilized measurement channels are textual and visual. In comparison to the visual channel, textual methods (based on questionnaires and text analysis) are less innovative, and Facial Emotion Recognition (FER) systems are classified at the top of the visual channel. In the related state of the art, numerous researchers propose performant FER approaches with high accuracy, up to 80% [21]; this encourages their integration as an efficient method to analyze student affect states. Even though there has been an increasing amount of attention paid to technologies used in smart education, there is no literature that tackles the many elements of employing students’ facial expression recognition systems in smart classrooms. Our article helps in understanding the concept of future classrooms and their technologies, particularly students’ FER. We organize this paper as follows: In Sect. 2, we define smart classrooms and present some of their technologies. In Sect. 3, we investigate the integration of FER systems into intelligent classrooms, enumerating FER databases and approaches. In Sect. 4, we elaborate our insight and future works; then, in Sect. 5, we present a brief conclusion.
2 Smart Classroom: Definition and Technologies
2.1 Definition of Smart Classroom
Smart learning is technology-enhanced learning; it facilitates interaction between students and their instructor and provides learners access to digital resources. It also provides guidance, tips, helpful tools, and recommendations to teach and learn efficiently. This innovative learning system comes with several new ways of learning, classified into two categories:
Learning via technology: for instance, taking interactive courses via massive open online courses (MOOCs), educational games, or intelligent tutoring systems (ITS).
Learning with a teacher: learning in a technology-enhanced physical classroom, which is generally the concept of smart classrooms (SC).
There are many definitions of the smart classroom; indeed, it is hard to agree on a single definition accepted by the scientific community overall. Authors like [45] introduce the intelligent classroom as a learner-centered environment that supports students, adapts to their learning abilities, and helps teachers transfer their knowledge interactively and easily. Meanwhile, the authors of [27,36] define a smart classroom as a physical environment containing digital tools, interactive devices, and various technologies to facilitate the activity of teaching and enhance learning. The authors of [25] come up with a definition that goes beyond considering just the possibility of deploying intelligent materials in a physical place; they envision a smart classroom as an emotionally aware environment using real-time sensing and the power of machine intelligence, and in this context they suggest a new system with advanced technologies and provide directions to deploy it. In the literature, authors present the concept of the future classroom in their own distinctive ways, but the goal remains the same: to enhance education through technology. A pertinent question to ask here is: what elements make up a typical intelligent classroom? In response to this question, [36] proposes a taxonomy of a typical smart classroom, as shown in Fig. 1.
Fig. 1. Taxonomy of a typical smart classroom
A typical smart classroom provides tools such as desktop computers, digital cameras, recording and casting equipment, interactive whiteboards, etc. [32] for effective presentation, better assessment, constructive interaction, and a comfortable physical environment. In presenting this taxonomy, four components are to be considered:
Smart Content and Presentation: technology assists the instructor in preparing the content of his courses and presenting it easily and interactively [32].
Smart Assessment: includes automated evaluation of students' learning capacity; it also consists of managing students' attendance and collecting their feedback to enhance lecture quality, as presented by [5,37].
Smart Physical Environment: smart classrooms offer a healthy climate by controlling factors like air, temperature, humidity, etc., using sensors and actuators [18,34].
Smart Interaction and Engagement: focuses on analyzing the level of students' engagement and consists of providing tools to enhance interaction [19,20,37].
2.2 Smart Classroom Technologies
Education technology (edtech) is a multi-billion-dollar business that is rising every year. Many nations in the Organisation for Economic Co-operation and Development (OECD) spend more than 10% of their budgets on education [24]. Furthermore, as more nations raise their education investment, public spending on edtech is expected to rise. Low- and middle-income nations, for example, expect to boost education expenditure from US$1.2 trillion to US$3 trillion per year [12]. According to the Incheon Declaration, nations must devote at least 4% to 6% of their GDP, or at least 15% to 20% of public spending, to education. Alongside the predicted expansion of the educational sector, the market for smart classroom technologies is rapidly growing and is strongly linked to advances in computer science, robotics, and machine intelligence. As indicated in Table 1, every technology has advantages and disadvantages; the most prevalent constraints of most of these technologies are cost in the first position, followed by technical knowledge concerns (generally, teachers and students do not have the required technical knowledge to use those technologies correctly) in the second place. Below are some of the main SC technologies:
Interactive Whiteboard (IWB): an intelligent tool that allows users to manipulate their presentations and project them on a board's surface using a special pen or simply their hand [28]. An IWB can be used to digitalize operations and tasks or merely as a presentational device [15]. The use of this device has revolutionized the nature of educational activities; it has the power to reduce the complexity of teaching and offers the instructor more flexibility during presentations.
RFID Attendance Management System: RFID stands for Radio Frequency Identification; it is a wireless technology used to track an object and then store and retrieve data using radio tags [11]. An RFID attendance system automatically marks students' presence in the classroom by validating their ID cards on the reader. It is a smart, innovative solution to replace classic attendance registers [22], and helps teachers save the time otherwise wasted on verifying students' presence daily.
Educational Cobot: a new innovative tool used as a co-worker robot to help with teaching tasks. It contains several sensors, cameras, microphones, and motors, so it can listen, see, communicate with students and assist the teacher [4]. Embedded in classrooms, cobots represent the school of the future and have significantly changed the traditional ways of teaching [39].
Sensors and Actuators: consists of installing sensors to collect data from the environment and send it to the cloud to be analyzed; the system may then decide to act using actuators. For example, a temperature sensor detects and measures hotness and coolness, then sends the information to another device to adjust the temperature. This technology provides an adequately healthy climate by controlling air, temperature, humidity, etc. [18].
Augmented Reality (AR): a system that combines real and virtual worlds; it is a real-world interactive experience in which computer-generated perceptual information enhances real objects [2]. It was first used in pilot-training applications in the 1990s [38], and over the years it has demonstrated high efficiency when adopted in educational settings. It has been used to enhance many disciplines, particularly when students learn subjects with complex, abstract concepts or simply things they find difficult to visualize, such as mathematics, geography, anatomy, etc. [8].
A Classroom Response System (CRS): used to collect answers from all students and send them electronically to be analyzed by the teacher, who can graphically display a summary of the gathered data [1]. A CRS is an efficient method to increase classroom interaction [13]. According to Martyn (2007) [31], "One of the best aspects about a CRS is that it encourages students to contribute without fear of being publicly humiliated or of more outspoken students dominating the discussion."
Commercial Off-the-Shelf (COTS) Eye Tracker: a gaze-based model that is used to monitor students' attention and detect their mind wandering in the classroom. It helps the instructor to better understand their interests and evaluate their degree of awareness [16,17].
Wearable Badges: track the wearer's position, detect when other badge wearers are in range, and can predict emotion from the wearer's voice tone. They help the teacher manage the classroom; he can use those badges to detect if a student leaves without permission or when a group of learners makes a loud noise that disturbs the rest of their classmates. MIT has developed a wearable badge through Sandy Pentland's team [24,39].
Learning Management System (LMS): a web-based integrated software package used to create, deliver and track courses and outcomes. It aids educators in developing courses, posting announcements, communicating easily and interactively, grading assignments, and assessing their students; besides, it allows students to submit their work, participate in discussions, and take quizzes [3,9,10].
Student Facial Emotion Recognition (FER): a technology that analyzes expressions using a person's images; it is part of the affective computing
field. Authors use this technology to predict real-time student feedback during lectures [21], which helps the instructor improve the quality of his presentations.

Table 1. Merits and limitations of smart classroom technologies

| Technology | Merits | Limitations |
|---|---|---|
| IWB | The touchscreen makes its use simpler and more effective; it is equipped with smart tools such as a pointer and screen capture; it provides access to the web | Expensive: many schools are not able to afford it. Teacher training: the school must spend time and money teaching its instructors how to use the equipment correctly |
| RFID | Quick: it identifies students in seconds. Accuracy: it provides more accurate identification | Expensive: with a large number of students, purchasing tags for everyone is costly. Not secure: the system is prone to manipulation |
| COBOT | Wide knowledge: it stores a large amount of information | Technical disruptions: it can break down at any time. Human-machine interaction: it may have trouble interacting with the students. Expensive |
| Sensors and actuators | Comfort: it provides a healthy and comfortable environment | Technical support: it needs IT professionals to help set it up and maintain it |
| AR | It provides outstanding visualizations; it increases students' engagement | It may present functionality issues; it is expensive |
| CRS | Rapid assessment: it provides outcomes of formative assessment; it provides immediate feedback for students. Time saving: for instance, fast grading | Expensive: it costs an average of $75/device; it presents technical problems; ineffective for opinion questions |
| COTS | It records real-time eye movements and fixations, which can report student reactivity | It is not able to track all eyes: the eye-tracking camera is affected, for example, by lenses or glasses. It costs money, time, and labor resources. Efficiency: visual attention is not sufficient to interpret students' engagement |
| Wearable badges | It increases the level of security and privacy in classrooms | Privacy concern: pupils may have some misapprehension about their privacy when it comes to wearable devices |
| LMS | It saves time wasted on menial tasks like grading papers; it gives students access to learning material in one place from any device | It requires IT and programming knowledge |

3 The Application of FER System to Smart Classroom

3.1 FER Approaches Used for Smart Classroom
Analyzing students' affective states during a lecture is a pertinent task in the smart classroom [14]. It is crucial to determine student engagement during a lecture in order to measure the effectiveness of the teaching pedagogy and enhance interaction with the instructor. Therefore, to obtain student feedback and achieve student satisfaction, researchers have suggested different methods, such as: body gesture recognition using Electroencephalography (EEG) signals [26],
body posture using either cameras or a sensing chair [20], hand gestures [40], heart rate [33], and so forth. Meanwhile, FER systems appear in the literature as the most widely used solution for recognizing students' affective states in the smart classroom. Recently, FER systems have been used in several domains such as robotics, security, and psychology; thus, over time, many researchers have suggested new, high-performing FER approaches. The framework of FER systems is structured as shown in Fig. 2: the input of the system is images from a FER database; these images are pre-processed to match the input size of the network; then, adequate algorithms are used to detect the face region properly and extract the main features that help the network learn from the training data; the final step is to classify the results according to the database labels [21]. Thousands of articles have been written on this subject, but only a few of them have applied FER systems to smart classrooms.
Fig. 2. Facial expressions recognition system framework
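To make the first two stages of this framework concrete, the following is a minimal sketch of pre-processing and face detection, assuming OpenCV's bundled Haar cascade is available; the file name and target size are illustrative placeholders, not details taken from the surveyed works.

```python
# A minimal sketch of the first stages of the FER framework in Fig. 2
# (pre-processing and face detection); the input file is hypothetical.
import cv2

def preprocess_face(image_path, target_size=(48, 48)):
    """Detect the largest face in an image and return a normalized crop."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep largest detection
    crop = cv2.resize(gray[y:y + h, x:x + w], target_size)
    return crop / 255.0  # scale pixels to [0, 1] to match the network input

face = preprocess_face("student.jpg")  # hypothetical classroom frame
```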
In Table 2, we cite FER approaches used in smart classrooms; we classify these approaches into three categories:

Handcrafted Approaches: traditional machine learning methods in which features are extracted manually. Some examples include edge detection, sharpening, corner detection, and histograms. LBP (Local Binary Pattern), for instance, is a type of image descriptor used to extract the texture of an image. For classification, an adequate classifier is then used, which is also a traditional machine learning algorithm, such as Support Vector Machine (SVM), K-nearest neighbor (KNN), or decision tree.

Deep Learning Approaches: newer methods based on neural networks for both feature extraction and classification. Learned features are extracted automatically using deep learning algorithms; the Convolutional Neural Network (CNN)
is the most commonly used for analyzing visual imagery; it can choose the best features to classify the images.

Hybrid Approaches: combine both machine learning and deep learning algorithms. They utilize traditional machine-learning algorithms to extract features and neural networks to classify images.

Table 2. FER approaches in application to smart classroom

| Approach | Works | Year | Feature extraction | Classification | Accuracy |
|---|---|---|---|---|---|
| Handcrafted | [41] | 2014 | Gabor features | Support Vector Machine (SVM) | 72% |
| Handcrafted | [37] | 2015 | ULGBPHS | K-nearest neighbor classifier (KNN) | 79% |
| Hybrid | [23] | 2019 | LBP-TOP | Deep Neural Network (DNN) | 85% |
| Deep Learning | [29] | 2019 | Convolutional Neural Network (CNN) | CNN | 70% |
| Deep Learning | [35] | 2020 | CNN-1: analyzes a single face expression in a single image; CNN-2: analyzes multiple faces in a single image | CNN | 86% for CNN-1; 70% for CNN-2 |
| Deep Learning | [40] | 2020 | CNN based on the GoogleNet architecture with 3 types of databases | CNN | 88%, 79%, 61% |
Notes and Comments. Early approaches (before 2019) employed classical machine learning methods such as Gabor filters, LBP, KNN, and SVM to extract features and classify emotions; since 2019, authors have gradually started integrating neural networks. As shown in Table 2, [23] utilize a hybrid approach that uses LBP-TOP as a descriptor and a deep neural network (DNN) for classification, which gives a better result (85%) than the methods proposed by [41] and [37] (72% and 79%, respectively) and the newer deep method suggested by [29]; additionally, Ashwin et al. suggest methods whose accuracies improve (from 61% to 88%) as the training data is changed [35,40], which points to the importance of choosing an adequate database.
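To make the handcrafted category concrete, the following is a minimal sketch of an LBP-plus-SVM pipeline of the general kind discussed above, assuming scikit-image and scikit-learn are available; the random arrays stand in for a real FER database and are not data from the cited works.

```python
# A hedged sketch of a handcrafted FER approach: LBP texture features
# classified by an SVM, illustrating the "Handcrafted" rows of Table 2.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def lbp_histogram(gray_face, points=8, radius=1):
    """Describe a face crop by the histogram of its uniform LBP codes."""
    codes = local_binary_pattern(gray_face, points, radius, method="uniform")
    n_bins = points + 2  # uniform patterns plus one catch-all bin
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist

# Stand-ins for a FER database: grayscale face crops and expression labels.
faces = np.random.rand(200, 48, 48)
labels = np.random.randint(0, 5, size=200)  # e.g. 5 expression classes

X = np.stack([lbp_histogram(f) for f in faces])
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```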
3.2 FER Student's Databases
Over the past years, there has been a lack of student facial expression databases [14], and most authors use general FER databases such as FER2013 and CK+ to train their models [6,40]. It is essential to have an adequate database to increase a model's accuracy and obtain better results [16], because the efficiency of FER models depends mainly on the quality of both the database and the FER approach [19]. Each created database has its own characteristics, depending on the classes of expressions, the ethnicity of the participants, the labeling method, the size, and the method and angle of image acquisition. In Table 3, we have gathered the student expression databases used in intelligent classrooms.
Table 3. Student's FER databases

| Works | Database | Expressions classes | Ethnicity & gender of participants | Method or angle of acquisition | Labeling | Database content |
|---|---|---|---|---|---|---|
| [41] | Spontaneous | 4 classes: not engaged, nominally engaged, engaged in task, very engaged | Asian-American, Caucasian-American, African-American; 25 female | Pictures are taken from an iPad camera placed 30 cm in front of the participant's face while playing a cognitive game | Labeled by students from different disciplines: computer science, cognitive and psychological science | N/A; 34 participants |
| [37] | Spontaneous | 5 classes: joviality, surprise, concentration, confusion, fatigue | Asian | Acquired using a full 1080p HD camera configured at the front of the classroom while the student watched a 6-min video | Participants labeled their own pictures | 200 images; 23 participants |
| [44] | Spontaneous | 4 classes: frustration, boredom, engagement, excitement | N/A | Computer webcam takes a photograph every 5 s | Labeled using a mobile electroencephalography (EEG) technology called Emotiv Epoc | 730 images |
| [7] | Spontaneous | 5 classes: confusion, distraction, enjoyment, neutral, fatigue | Chinese (29 male and 53 female) | Acquired using computer cameras | Labeled by the participants and external coders | 1,274 videos; 30,184 images; 82 participants |
| [40] | Posed and spontaneous | 14 classes: 7 classes of Ekman's basic emotions, 3 learning-centered emotions (frustration, confusion and boredom), neutral | Indian | Frontal posed expressions | Labeled using a semi-automatic annotation process and reviewed manually to correct wrong annotations | 4,000 images |

4 Insights and Future Work
In this paper, we have surveyed the concept of the smart classroom and its technologies, especially FER systems. In future work, we plan to consider the evolution of smart classrooms in this critical period of the pandemic. Today, the coronavirus pandemic is causing a global health crisis. During this time, countries around the world have imposed restrictions on social distancing, masking, and other aspects of public life. The lockdowns in response to the spread of the virus are significantly impacting educational systems. As a result, governments are striving to maintain continuity of learning and are proposing distance learning as a suitable interim solution; unfortunately, not all students around the globe have access to digital learning resources. In our future work, we propose a model for an intelligent physical classroom that respects the restrictions of COVID-19. We base our model on two propositions, as shown in Fig. 3:

Proposition 1: We propose a system that automatically detects whether the students in the classroom are wearing their masks or not.

Proposition 2: We detect the distances between students and compare them with the allowed distance.
If the students do not comply with the restrictions, the system sends a warning to the teacher in real time.
Fig. 3. Proposed model for intelligent classroom
Notes and Comments. The proposed intelligent classroom system in Fig. 3 consists of two artificial neural networks (Model 1 and Model 2) that detect students not wearing masks and measure the distances between learners. Wearing face masks strongly confuses facial emotion recognition (FER) systems. In our future work, we will study the limitations of these systems as used in today's smart classrooms, as well as the possibility of predicting emotions from a student's face wearing a mask, since it is possible, for example, to predict students' mind wandering and their level of engagement based only on their gaze [16,17].
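As an illustration of the warning logic behind the two propositions, the following hedged sketch flags unmasked students and pairs of students closer than the allowed distance, given face bounding boxes and the output of a hypothetical mask classifier; the pixel-to-meter scale and the distance threshold are invented placeholders, not values from the proposed model.

```python
# A minimal sketch of the real-time warning step: inputs are assumed to be
# (student_id, bounding_box, mask_on) tuples produced by Models 1 and 2.
from itertools import combinations
import math

ALLOWED_DISTANCE_M = 1.5   # assumed restriction
METERS_PER_PIXEL = 0.01    # assumed camera calibration

def centroid(box):
    x, y, w, h = box
    return (x + w / 2, y + h / 2)

def check_classroom(detections):
    """Return the warnings to be sent to the teacher."""
    warnings = [f"{sid}: no mask" for sid, _, mask_on in detections if not mask_on]
    for (id1, b1, _), (id2, b2, _) in combinations(detections, 2):
        (x1, y1), (x2, y2) = centroid(b1), centroid(b2)
        dist = math.hypot(x2 - x1, y2 - y1) * METERS_PER_PIXEL
        if dist < ALLOWED_DISTANCE_M:
            warnings.append(f"{id1} and {id2}: distance {dist:.2f} m")
    return warnings

print(check_classroom([("s1", (0, 0, 60, 60), True),
                       ("s2", (80, 0, 60, 60), False)]))
```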
5 Conclusion

The smart classroom is not a new concept, but over the years it has undergone many changes through the integration of various technologies. Researchers have transformed the classroom from a simple physical space gathering learners and their instructors into an emotionally aware environment that can interact with students and help them learn efficiently. Numerous technologies have driven the evolution of digital classrooms, and FER systems are considered the most innovative. Authors have adapted FER systems for use in smart classrooms using approaches with high accuracies and personalized databases. The evolution of these interactive classrooms can be considered an aid to teaching during the restrictions of the COVID-19 pandemic. It can also be
adapted for teaching students with special needs or intellectual disabilities to facilitate their interaction with their teachers.

Acknowledgment. The authors would like to thank the National Center for Scientific and Technical Research (CNRST) for supporting and funding this research.
References

1. Yu, S., Niemi, H., Mason, J. (eds.): Shaping Future Schools with Digital Technology. PRRE, Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-9439-3
2. Akçayır, M., Akçayır, G., Pektaş, H.M., Ocak, M.A.: Augmented reality in science laboratories: the effects of augmented reality on university students' laboratory skills and attitudes toward science laboratories. Comput. Hum. Behav. 57, 334–342 (2016)
3. Alhazmi, A.K., Rahman, A.A.: Why LMS failed to support student learning in higher education institutions. In: 2012 IEEE Symposium on E-learning, E-management and E-services, pp. 1–5. IEEE (2012)
4. Zawieska, K., Sprońska, A.: Anthropomorphic robots and human meaning makers in education. In: Alimisis, D., Moro, M., Menegatti, E. (eds.) Edurobotics 2016. AISC, vol. 560, pp. 251–255. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-55553-9_24
5. Ashwin, T.S., Guddeti, R.M.R.: Unobtrusive behavioral analysis of students in classroom environment using non-verbal cues. IEEE Access 7, 150693–150709 (2019)
6. Ashwin, T.S., Guddeti, R.M.R.: Impact of inquiry interventions on students in e-learning and classroom environments using affective computing framework. User Model. User-Adap. Inter. 30(5), 759–801 (2020). https://doi.org/10.1007/s11257-019-09254-3
7. Bian, C., Zhang, Y., Yang, F., Bi, W., Weigang, L.: Spontaneous facial expression database for academic emotion inference in online learning. IET Comput. Vis. 13(3), 329–337 (2019)
8. Chen, P., Liu, X., Cheng, W., Huang, R.: A review of using augmented reality in education from 2011 to 2016. In: Innovations in Smart Learning. LNET, pp. 13–18. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-2419-1_2
9. Conde, M.Á., García-Peñalvo, F.J., Rodríguez-Conde, M.J., Alier, M., Casany, M.J., Piguillem, J.: An evolving learning management system for new educational environments using 2.0 tools. Interact. Learn. Environ. 22(2), 188–204 (2012)
10. Courts, B., Tucker, J.: Using technology to create a dynamic classroom experience. J. Coll. Teach. Learn. (TLC) 9(2), 121–128 (2012)
11. Rjeib, H.D., Ali, N.S., Al Farawn, A., Al-Sadawi, B., Alsharqi, H.: Attendance and information system using RFID and web-based application for academic sector. Int. J. Adv. Comput. Sci. Appl. 9(1), 266–274 (2018)
12. Incheon Declaration: SDG4-Education 2030 framework for action (2016)
13. Fies, C., Marshall, J.: Classroom response systems: a review of the literature. J. Sci. Educ. Technol. 15(1), 101–109 (2006)
14. Gligorić, N., Uzelac, A., Krco, S.: Smart classroom: real-time feedback on lecture quality. In: 2012 IEEE International Conference on Pervasive Computing and Communications Workshops, pp. 391–394. IEEE (2012)
15. Glover, D., Miller, D., Averis, D., Door, V.: The interactive whiteboard: a literature survey. Technol. Pedagog. Educ. 14(2), 155–170 (2005)
16. Hutt, S., et al.: Automated gaze-based mind wandering detection during computerized learning in classrooms. User Model. User-Adap. Inter. 29(4), 821–867 (2019). https://doi.org/10.1007/s11257-019-09228-5
17. Hutt, S., Mills, C., Bosch, N., Krasich, K., Brockmole, J., D'Mello, S.: Out of the Fr-Eye-ing Pan. In: Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization. ACM, July 2017
18. Faritha Banu, J., Revathi, R., Suganya, M., Gladiss Merlin, N.R.: IoT based cloud integrated smart classroom for smart and a sustainable campus. Procedia Comput. Sci. 172, 77–81 (2020)
19. Kapoor, A., Burleson, W., Picard, R.W.: Automatic prediction of frustration. Int. J. Hum. Comput. Stud. 65(8), 724–736 (2007)
20. Kapoor, A., Picard, R.W.: Multimodal affect recognition in learning environments. In: Proceedings of the 13th Annual ACM International Conference on Multimedia - MULTIMEDIA 2005. ACM Press (2005)
21. Kas, M., El Merabet, Y., Ruichek, Y., Messoussi, R.: New framework for person-independent facial expression recognition combining textural and shape analysis through new feature extraction approach. Inf. Sci. 549, 200–220 (2021)
22. Kassim, M., Mazlan, H., Zaini, N., Salleh, M.K.: Web-based student attendance system using RFID technology. In: 2012 IEEE Control and System Graduate Research Colloquium. IEEE, July 2012
23. Kaur, A., Mustafa, A., Mehta, L., Dhall, A.: Prediction and localization of student engagement in the wild. In: 2018 Digital Image Computing: Techniques and Applications (DICTA). IEEE, December 2018
24. Khosravi, S., Bailey, S.G., Parvizi, H., Ghannam, R.: Learning enhancement in higher education with wearable technology. arXiv preprint arXiv:2111.07365 (2021)
25. Kim, Y., Soyata, T., Behnagh, R.F.: Current issues and directions for engineering and education: towards emotionally aware AI smart classroom. IEEE Access 6, 5308–5331 (2018)
26. Kumar, J.: Affective modelling of users in HCI using EEG. Procedia Comput. Sci. 84, 107–114 (2016)
27. Kwet, M., Prinsloo, P.: The 'smart' classroom: a new frontier in the age of the smart university. Teach. High. Educ. 25(4), 510–526 (2020)
28. Le Lant, C., Lawson, M.J.: Interactive whiteboard use and student engagement. In: Publishing Higher Degree Research, pp. 33–42. SensePublishers (2016)
29. He, J., et al. (eds.): ICDS 2019. CCIS, vol. 1179. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-2810-1
30. Leony, D., Muñoz-Merino, P.J., Pardo, A., Kloos, C.D.: Provision of awareness of learners' emotions through visualizations in a computer interaction-based environment. Expert Syst. Appl. 40(13), 5093–5100 (2013)
31. Martyn, M.: Clickers in the classroom: an active learning approach. Educ. Q. 30(2), 71 (2007)
32. Miraoui, M.: A context-aware smart classroom for enhanced learning environment. Int. J. Smart Sens. Intell. Syst. 11(1), 1–8 (2018)
33. Monkaresi, H., Bosch, N., Calvo, R.A., D'Mello, S.K.: Automated detection of engagement using video-based estimation of facial expressions and heart rate. IEEE Trans. Affect. Comput. 8(1), 15–28 (2017)
34. Pacheco, A., Cano, P., Flores, E., Trujillo, E., Marquez, P.: A smart classroom based on deep learning and osmotic IoT computing. In: 2018 Congreso Internacional de Innovación y Tendencias en Ingeniería (CONIITI). IEEE, October 2018
35. Ashwin, T.S., Guddeti, R.M.R.: Automatic detection of students' affective states in classroom environment using hybrid convolutional neural networks. Educ. Inf. Technol. 25(2), 1387–1415 (2019). https://doi.org/10.1007/s10639-019-10004-6
36. Saini, M.K., Goel, N.: How smart are smart classrooms? A review of smart classroom technologies. ACM Comput. Surv. 52(6), 1–28 (2020)
37. Tang, C., Xu, P., Luo, Z., Zhao, G., Zou, T.: Automatic facial expression analysis of students in teaching environments. In: CCBR 2015. LNCS, vol. 9428, pp. 439–447. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25417-3_52
38. Thomas, P.C., David, W.M.: Augmented reality: an application of heads-up display technology to manual manufacturing processes. In: Hawaii International Conference on System Sciences, pp. 659–669 (1992)
39. Timms, M.J.: Letting artificial intelligence in education out of the box: educational cobots and smart classrooms. Int. J. Artif. Intell. Educ. 26(2), 701–712 (2016). https://doi.org/10.1007/s40593-016-0095-y
40. Ashwin, T.S., Guddeti, R.M.R.: Affective database for e-learning and classroom environments using Indian students' faces, hand gestures and body postures. Future Gener. Comput. Syst. 108, 334–348 (2020)
41. Whitehill, J., Serpell, Z., Lin, Y.-C., Foster, A., Movellan, J.R.: The faces of engagement: automatic recognition of student engagement from facial expressions. IEEE Trans. Affect. Comput. 5(1), 86–98 (2014)
42. Williamson, B.: Decoding ClassDojo: psycho-policy, social-emotional learning and persuasive educational technologies. Learn. Media Technol. 42(4), 440–453 (2017)
43. Yadegaridehkordi, E., Noor, N.F.B.M., Ayub, M.N.B., Affal, H.B., Hussin, N.B.: Affective computing in education: a systematic review and future research. Comput. Educ. 142, 103649 (2019)
44. Zatarain-Cabada, R., Barron-Estrada, M.L., Gonzalez-Hernandez, F., Rodriguez-Rangel, H.: Building a face expression recognizer and a face expression database for an intelligent tutoring system. In: 2017 IEEE 17th International Conference on Advanced Learning Technologies (ICALT). IEEE, July 2017
45. Zhang, Y., Li, X., Zhu, L., Dong, X., Hao, Q.: What is a smart classroom? A literature review. In: Yu, S., Niemi, H., Mason, J. (eds.) Shaping Future Schools with Digital Technology. PRRE, pp. 25–40. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-9439-3_2
The Digital Twin for Monitoring of Cargo Deliveries to the Arctic Territories Nikishova Maria Igorevna and Kuznetsov Mikhail Evgenievich Federal Autonomous Scientific Institution Eastern State Planning Center, Moscow, Russia {m.nikishova,m.kuznetsov}@vostokgosplan.ru
Abstract. The report describes key problems in carrying out vital goods deliveries to remote Arctic territories (the so-called "Northern Delivery") and the potential for increasing the efficiency of measures aimed at delivering vital goods, food, and oil products to local residents using system solutions, local measures, and modern digital technologies. The authors propose a new tool, a digital twin of the Northern Delivery, and approaches to increase the efficiency of the range of measures for organizing northern deliveries. It can help make the deliveries more systematic, transparent, and predictable, reducing the risk of failures. The main features of the digital twin database being formed are the low speed of data collection using classical information request procedures, the complex structure of data obtained from the regions despite the existence of standardized forms, discrepancies in statistics, and a different understanding of northern deliveries in different regions. The system was tested in the Sakha (Yakutia) region. As a result of applying the system, routes were changed and, according to a stakeholder survey, the accuracy of supply forecasting increased and there were no delays. The process of distributing the developed digital twin to 25 regions of the country has started. Keywords: The Arctic · Northern delivery · The digital twin · Digitalization of Northern delivery · Hard-to-reach territories · Product approach in a state organization
1 Introduction

Issues of the regular supply of basic vital goods, food, and oil products for residents of the Far North and similar territories are an overriding priority for the regions of the Russian Federation. The quality of life of the population of 25 remote territories of the Russian Federation depends on the solution of these problems. The specific character of such regions is determined by the following factors:

• severe climatic conditions and a lack of year-round transport accessibility;
• remoteness from the main industrial regions, which complicates the independent delivery of goods for individuals and makes it very expensive;
• absent or underdeveloped transport and logistics infrastructure;
• lack of local production facilities for the production of many industrial and agricultural goods;
• lack of integrated statistics on natural volumes and cost performance, transport and other costs of cargo delivery for all regions.

The factors listed above may be treated both as problems of remote territories and as opportunities that can be used to step up efforts to increase the efficiency of northern deliveries and to improve the quality of life of local residents in those regions. There are other countries in the world with hard-to-reach regions comparable to the Russian ones in terms of supply and logistics, despite various geographical and social conditions and the variety of approaches to solving those problems. However, all of them to a certain degree emphasize the problem of power supply (fuel deliveries) and the problems of food and vital goods delivery during periods of seasonal transport inaccessibility of some settlements. In this connection, the Russian experience in terms of deliveries and the creation of a digital twin can be useful for other countries, such as:

• Canada, with Alaska and the northern regions of Canada as classical examples of remoteness and inaccessibility;
• Australia: the Australian continent itself is an example of remote location, and its extensive territories are located in areas unfit for human habitation and require a special approach in terms of regional policy;
• Denmark, with the remote island of Greenland being a part of Denmark;
• The USA, with many examples of hard-to-reach regions apart from Alaska, such as the remote eastern districts of the prosperous State of Washington, separated from the western counties by the Cascade Mountains;
• China, with its own example of a problematic remote region, the Xinjiang Uighur Autonomous Region.

The relevance of this study lies in the fact that deliveries to some Russian regions take up to two years, and the risks of supply failure associated with climatic conditions (fires, ice conditions, etc.), organizational failures, and suboptimal logistics can leave the residents of the regions without essential products. In this connection, the authors propose new digital tools and approaches to increase the efficiency of the range of measures for the organization of northern deliveries. They can help make the deliveries more systematic, transparent, and predictable, reducing the risk of failures. The purpose of the paper is to show how the authors solve a problem of national scale as a result of scientific research using digital technologies. In accordance with the research objectives, the authors first explain the research methods that were used and review the research of other scientists on this topic. Then, the authors describe the results of the study concerning the definition of the problems of delivering goods to hard-to-reach territories (timeliness; volume and range sufficiency; safety, reliability, guaranteed and sustainable food and energy supply; reasonable cost of food and essentials). Further, the authors note how the approach to the work on creating the digital product was arranged, point out the flaws in the system, and explain that
their elimination and development of organizational mechanisms will be further areas of research.
2 Materials and Methods

The study of ways to increase the efficiency of northern deliveries was conducted by the scientists of FASI Vostokgosplan on the basis of an analysis of regulations, strategies, and policies of different regions and industries, a study of statistics and data from tender platforms, and the works of such researchers and experts as Yu.P. Alekseev [1], V.M. Gruzinov [2], Yu.A. Zvorykina [2], G.V. Ivanov [2], A.H. Klepach [4], V.N. Razbegin [4], W.C. Thompson, and others. The position of state regulation of the "northern delivery" and its assessment are presented in the works of I.B. Boger through the prism of historical analysis. According to the researcher, "… direct state management of the process of delivery to the regions of the Far North has no advantages in market conditions." The author proposed an alternative option, in which the state contributes to the creation of corporate groups involved in financial and material flows to the regions, exerting only a minor regulatory impact. V.A. Goman, in opposition to this, considers the risks arising from the lack of state regulation in the organization of the "northern delivery" and the impossibility of a complete transition to market relations with enterprises of housing and communal services and the fuel and energy complex. In turn, the researcher names the underdeveloped transport infrastructure as the reason for the lack of prospects for the development of competitive relations in the field of early "northern delivery". Among the latest studies, it is necessary to single out the work of Academician of the Russian Academy of Natural Sciences E.E. Pil, which presents an analysis of the volume of "northern delivery" carried out by inland water transport for the period from 2015 to 2020. The analysis allowed the author to make a forecast of the volume of cargo transportation carried out by water transport until 2030. The paper also gives proposals for the use of new modes of transport (ekranoplans) within the framework of the "northern delivery". We used deep interviews with representatives of regions and other stakeholders, as well as surveys of concerned agencies, to check the hypotheses formed on the basis of the desk study. We also considered the possibilities of modern digital technologies and used mathematical modeling on the basis of the AnyLogistix and Yandex DataLens systems to see how they can help to increase the efficiency of northern deliveries. In the creation of a digital product based on AnyLogistix and Yandex DataLens, we used a "product-based approach". The purpose of the study is to analyze the existing peculiarities of cargo deliveries to the regions of the Far North and to find new tools and approaches to increase the efficiency of the range of measures for the organization of northern deliveries. The study then describes the principle of operation of the created digital twin, which was tested in one of the regions; images of the interfaces are provided.
3 Results and Their Discussion

Through the "northern delivery", 25 regions of the Far North are supplied annually; this covers more than 3 million people, and more than 3 million tons of cargo is delivered. The most important factors in the organization of food, essentials, and fuel deliveries to remote Arctic regions are:

• timeliness;
• volume and range sufficiency;
• safety, reliability, guaranteed and sustainable food and energy supply; and
• reasonable cost of food and essentials.
Providing the conditions listed above requires significant effort and the coordinated work of a large number of stakeholders at different stages, beginning with forecasting, the collection of orders from local settlements, and their justification and consolidation, followed by the organization of tenders (which may involve working with more than 40 different suppliers at the same time in one region for one period of northern deliveries) and the direct delivery, which can be carried out by various types of transport (sea, river, air, and different types of land transport). Quite often the regions have no understanding of what volume of deliveries is covered by private entrepreneurs and what is covered by regional administrations; supply requests can be insufficiently precise, and very often possible transportation opportunities using alternative routes or other types of transport are not taken into consideration. If unforeseen risks materialize, the regions have to resort to expensive solutions or to transport intended for emergency situations. The practice of issuing requests for tenders shows that in many regions, tenders for the delivery of goods are announced one to two months before the beginning of the process of northern deliveries due to budgetary peculiarities, even though bidders need to prepare for the tenders in advance and make preliminary arrangements with transport and other organizations. This means that the chances for new players trying to enter the market for the organization of northern deliveries are quite slim. Payment for cargo deliveries to remote territories is made post factum, which causes cash flow gaps for suppliers, and the additional crediting drives up the prices of the goods. Recent summary statistics and analyses of possible risks and deviations in this process are often missing or prepared intuitively, for example, on the basis of the experience and opinions of the drivers and captains bringing the goods. The majority of northern delivery routes have not been revised since the last century, and the process is often organized on the basis of historical precedents with customary routes, means of transport, timelines, and parameters. This is only a small fraction of the peculiarities of the process of northern deliveries, and in many respects it is connected with the lack of unified standards, regulations, and "best practices" which could be used by the regions as guidelines. In this regard, the authors propose the following measures:

• adoption of a law on northern delivery with a description of the status of measures, responsible persons, processes, and conditions of financing;
• creation of a fund for financing northern delivery based on the experience of Canada;
• creation of a monitoring and rapid response center;
• optimization of organizational mechanisms in accordance with the new law;
• optimization of logistics flows and a transition to partial self-sufficiency where relevant;
• the use of new modes of transport (heavy drones, airships, snowmobiles); and
• creation of a unified communication platform between the participants of the northern delivery and feedback collection.
Since, in order to manage something, one first needs to measure it, the authors created a digital twin of the "northern delivery" to understand the processes and indicators of the "northern delivery", with the ability to optimize it. This system can become the basis for the monitoring system for the "northern delivery".

3.1 The "Digital Twin"

Relying on scientific research and the best practices of the regions, the authors are developing both large-scale system methods to increase the efficiency of northern deliveries, such as the creation of a unified regulatory framework, radical solutions for the optimization of tender procedures, additional mechanisms of state support, updating of the river fleet, and creation of a "best practices" database that can be used by the regions, and more local solutions that can result in a qualitative improvement of the situation in remote areas. Such local solutions include renewable power generation, modernization of waterways, formation of a network of trade and logistics centers, a regular container line, a partial transition to greenhouse facilities, the use of alternative means of transport (airships, aero boats, unmanned transport), and others. However, one of the main tasks for the authors, in addition to the identification of the best existing variants, is the search for principally new, advanced ways to increase the efficiency of northern deliveries. One such solution is the creation of the "digital twin": a unified information and statistical system for all regions relying on northern deliveries that makes it possible to optimize the routes, to calculate and justify the need to build additional infrastructure objects (warehouses, ports, wholesale and logistics centers, bridges), and to eliminate "bottlenecks", taking into account the climate and other specifics of each area and the respective restrictions. The system is created on the basis of the AnyLogistix and Yandex DataLens tools and makes it possible to answer the following questions by modeling, optimization, and scenario prediction:

• Where should a new warehouse/port/wholesale and logistics center/transshipment point/airfield/winter road/year-round means of communication be opened? What capacity and reserves do new objects need to have?
• What types and models of transport, and in what number, would it be best to use for "northern deliveries"? Where should it be based?
• In what settlements would it be expedient to use renewable power generation? What are the priorities in the modernization of power generation facilities?
• What are the potential reasons for failures in northern deliveries? What is the cost of eliminating arising emergency situations? What is the cost of insurance against potential risks? What are the scenarios for providing the area with goods in case of emergency situations?
• What is the economic effect of the measures taken, and what is the cost of realizing "northern deliveries"? What is the cost of delivering one ton of goods to a given settlement, and how can it affect the consumer price index?

At present, the pilot region for testing the system of northern delivery is the Sakha (Yakutia) Republic, and at the same time our analysts are collecting data on other regions. The main features of the database being formed are the low speed of data collection using classical information request procedures, the complex structure of data obtained from the regions despite the existence of standardized forms, discrepancies in statistics, and a different understanding of northern deliveries in different regions. For some regions belonging to the territories of northern deliveries, there were previously no statistics at all. Another key feature is that the system contains a large amount of data concerning different industries and cities with a great number of correlations, but sometimes without a clear understanding of real delivery terms, the peculiarities of certain very hard-to-reach territories, or the influence of natural factors in a particular area. In this connection, in order to decrease uncertainty and improve the detail and quality of the data, the scientists of Vostokgosplan rely not only on the obtained data, their analysis, and verification, but also on interviews with port managers, sea captains and aircraft pilots, warehouse superintendents, the management of organizations owning vehicles, representatives of ship-building yards, design offices, river shipping and other companies involved in the process, as well as the Ministry of Emergency Situations, which has to intervene in the process in case of unforeseen circumstances, relevant agencies, heads of settlements, and local residents. This makes our digital model more precise. The experts of Vostokgosplan are considering other digital systems that could supplement and improve the created system and in the long term will allow automatic monitoring of remaining goods and the conditions of their storage in warehouses, tracing cargos like parcels sent by mail with DHL, and systems similar to Yandex taxi but on a larger scale, with the possibility of arranging multimodal transportation, consolidating and integrating cargoes, or even carrying out "return export" of products. The "digital twin" of the first region, Yakutia, now includes more than 2,000 routes, more than 1,500 units of transport, 65 ports and transshipment points, more than 600 power generation objects, 631 settlements, and more than 200 airports and airfields. Work on updating and refining the parameters and on modeling optimized routes, taking into account a number of local solutions such as the construction of a bridge in a certain place, continues; the system shows potential bottlenecks and what happens to them in different scenarios. The interface of the system is shown in Fig. 1. The system was tested in the Sakha (Yakutia) region. As a result of applying the system, routes were changed and, according to a stakeholder survey, the accuracy of supply forecasting increased and there were no delays. The process of distributing the developed digital twin to 25 regions of the country has started.
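As an illustration of the kind of route and bottleneck analysis listed above, the following is a hedged sketch that frames one delivery scenario as a minimum-cost flow problem using networkx; it is not Vostokgosplan's actual AnyLogistix model, and all node names, capacities, and costs are invented for illustration.

```python
# A minimal sketch: cheapest routing of cargo from a supply base to two
# settlements, subject to per-leg capacities (tons) and costs (per ton).
import networkx as nx

G = nx.DiGraph()
# demand < 0 marks a supply node; demand > 0 marks a consuming settlement
G.add_node("Supply_base", demand=-100)
G.add_node("Settlement_A", demand=60)
G.add_node("Settlement_B", demand=40)
G.add_node("Transshipment", demand=0)

G.add_edge("Supply_base", "Transshipment", capacity=120, weight=15)
G.add_edge("Transshipment", "Settlement_A", capacity=80, weight=30)  # river leg
G.add_edge("Transshipment", "Settlement_B", capacity=50, weight=45)  # winter road
G.add_edge("Supply_base", "Settlement_B", capacity=20, weight=90)    # air leg

flow = nx.min_cost_flow(G)
print(nx.cost_of_flow(G, flow), flow)
# Re-running with a reduced capacity on one edge simulates a scenario such
# as a failed ice road and exposes the resulting bottleneck.
```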
However, it should be noted that this system has so far been introduced in test mode, and there are a number of problems that cannot be solved by the system. For example, the system cannot reduce delays caused by bureaucratic procedures, it is not always possible to predict when the ice will melt, ships for charter are not always available, and cargo that has arrived at a port sometimes has a long wait in line. In addition, the system connects multimodal transportation, infrastructure with its limitations, traffic flows, etc., but it cannot take into account the human factor. Thus, in addition to this system, parallel work on organizational management mechanisms is important, and our team is engaged in their study.

Fig. 1. Interfaces of the Digital Twin by Vostokgosplan.

3.2 Arranging of Works on Creation of the Digital Twin: A Product-Based Approach

At the summit on innovations in 2010, M. Quinn emphasized "the role of science in realization of innovations consisting in transformation of the results of scientific research into products" [5]. The digital twin became one such product for Vostokgosplan, a scientific institution and an analytical center of the Ministry for the Development of the Russian Far East and the Arctic. It might be helpful to say a few words about the details of the organization of our work on digitalization.
The organization has been developing according to a classical research paradigm since 1992 and has gained considerable scientific expertise. However, having realized that the results of studies should include finished products, not just research work and recommendations, helping to make management decisions on the basis of big data and high-quality predictive modeling and planning, we supplemented our traditional scientific approach with a product-based one involving fast prototyping and testing of a value-adding product. The new approaches required considerable preparation: a unified information system to support project management ensuring high-quality data processing and storage was introduced, elements of Agile culture were implemented, and full-fledged studies of the needs of the target audience in terms of the usefulness of the product and the user experience at different stages of its development were started. The team began to use BI systems and complex modeling systems. These changes reinforced the existing strong scientific base with digital tools. "Thus, a symbiosis of traditional science and new approaches allowed us to reach a new level and to create more relevant tools that can be used for making management decisions and to take on a data-driven approach" [3].
4 Conclusion

It is expected that the creation of the "digital twin" and the realization of a complex of proposed measures on its basis will help to reduce the costs of northern deliveries, increase the food and power security of the population of the territories, decrease prices for goods and expand their range, and reduce the risks of supply disruption. Moreover, the system will make it possible to ensure the transparency of the whole complex of measures aimed at providing the regions of the Far North and remote territories with essentials, and to give exact answers to the questions of what infrastructure facilities must be created and where, and how to optimize logistics flows in various scenarios. It is expected that the digital twin will be extended to all territories of the northern delivery, and on its basis a monitoring center and a rapid response center will be created. The experience of creating the digital twin can be useful for countries facing similar difficulties in northern deliveries, such as Canada, the USA, Denmark, Australia, and China.

Future Research Directions. In continuation of the research and the development of the digital twin, the authors plan to adapt the created model to the other 24 regions and to develop organizational mechanisms for the delivery process. At the moment, logistics data and traffic flows for the remaining regions have already been collected. The next step is modeling and creating the digital twin of the whole system. Further optimization work is expected to be done.
References

1. Alekseev, Yu.P., Alisov, A.N.: Russian North: strategic quality of management. OOO Tydex Co., Moscow, 320 p. (2004)
2. Gruzinov, V.M., Zvorykina, Yu.V., Ivanov, G.V., Sychev, Yu.F., Tarasova, O.V., Filin, B.N.: Arctic transport highways on land, water areas and in the air. Arctic: Ecol. Econ. 1(33), 6–20 (2019). https://doi.org/10.25283/2223-4594-2019-1-6-20
3. Fateeva, A.: How to introduce a product approach to a state organization. Experience of a food case championship. https://vc.ru/life/298080-kak-vnedrit-produktovyy-podhod-v-gosudarstvennuyu-organizaciyu-opyt-produktovogo-keys-chempionata
4. Klepach, A.N., Razbegin, V.N.: The role of transport projects in the development of the Arctic and the Russian North. Gos. Audit. Right. Econ. 1, 121–124 (2017)
5. Konopkin, A.M.: Innovation: history, etymology, complexity of definition. In: Baranets, N.G., Verevkin, A.B. (eds.) Philosophy and Methodology of Science: Proceedings of the Third All-Russian Scientific Conference, 15–17 June 2011, Ulyanovsk, pp. 408–414 (2011)
Intelligent Bixby Recommender Systems S. Rajarajeswari1, Annapurna P. Patil1, Manish Manohar1, Mopuru Vinod Reddy2, Laveesh Gupta1, Muskan Gupta1, and Nikunj Das Kasat1 1 Computer Science and Engineering, Ramaiah Institute of Technology, MSR Nagar,
Bengaluru, India {raji,annapurnap2}@msrit.edu, [email protected] 2 Samsung R&D Institute India - Bangalore Private Limited, Bengaluru, India [email protected]
Abstract. In today's society there is an explosion of data, which brings with it the inherent challenge of dealing with this data. One such issue is analyzing the data and making personalized suggestions of utterances to users before any query is issued to the voice assistant. It should be possible to recommend the most relevant set of queries to the user in advance for quick access. This set of queries should reflect what the user might want to ask the voice assistant at a particular time, based on context such as the location, occasion, and other features. Currently, 'Bixby' does not have a feature to recommend utterances to users based on their demographics and usage patterns. Handling implicit data is problematic, since it is difficult to analyze the user's preferences from it. In this study, we present our strategy of recommending personalized utterances to users based on similar user profiles registered with 'Bixby', taking into account several characteristics of the user, such as the current context (time, place, occasion, etc.), demographics, utterances, and the frequency with which the utterances are recorded. Keywords: Bixby · Cosine similarity · Implicit data · Recommender systems
1 Introduction

A Recommender System is capable of estimating a user's choice from a collection of objects and recommending the ones with the greatest expected preference to the user. It is vital in today's society because, thanks to the development of the internet, people have too many options to choose from. Recommender systems deal with massive amounts of data, and the majority of their procedures involve filtering items based on the preferences and interests of the users. A recommender system attempts to discover links between users and items, which can then be used to recommend items to another user. This strategy has clearly been demonstrated to improve decision making and quality. The collaborative filtering technique and the content-based approach are the two basic approaches to recommender systems. Instead of relying on user interactions and feedback, a content-based method necessitates a substantial amount of knowledge about the objects' own attributes. In content-based filtering, the idea is to identify the important attributes in relation to the problem statement and then rank the content before offering items to the user.
Collaborative Filtering, on the other hand, focuses on user-related data for item recommendation. It relies on historical information about users and assumes that people who have given similar ratings in the past will do so again in the future. Two forms of rating are used to express a user's preference. An explicit rating is given expressly by the user, such as a rating given to a purchased book; this is the most immediate response. Implicit ratings reveal consumers' preferences implicitly, through page views, clicks, record purchases, whether or not a piece of music is listened to, and so on. Cosine similarity is a measure that quantifies the similarity of two or more vectors; it is defined as the cosine of the angle between the two vectors. Mathematically, it is computed by dividing the dot product of the vectors by the product of their Euclidean norms (magnitudes). The metric starts from the dot product of two non-zero vectors, which can be written as:
(1)
Then, given the two vectors and their dot product, the cosine similarity is defined as:
(2)
The output is a value ranging from −1 to 1, indicating similarity, where −1 means dissimilar, 0 means orthogonal (perpendicular), and 1 represents total similarity.

1.1 Embeddings

An embedding is a translation of vectors from a higher-dimensional space to a lower-dimensional space. In the setting of neural networks, embeddings are lower-dimensional representations of data learned as continuous vectors. One of the primary motivations for employing embeddings in neural networks is that they can be used to discover the nearest neighbors to the active user. The concept of embeddings can also be used to determine relationships between categories.

1.2 Alternating Least Squares (ALS)

ALS is a multi-threaded matrix factorization algorithm. It is designed for large-scale collaborative filtering problems and is implemented in Apache Spark ML. ALS does an excellent job of resolving the scalability of rating data; it is easy to use and scales well to very large datasets.

The paper is organised as follows. Section 2 describes the uses of recommender systems and recent developments in recommender systems. Our proposed methodology, as well as the experiments performed, is discussed in Sect. 3, followed by the results and observations in Sect. 4. Section 5 compares our work with the existing methodologies and discusses the scope for future work. We conclude by stating the central idea of the paper and the main results in Sect. 6, followed by the references in Sect. 7.
2 Background

The purpose of recommender systems is to provide users with tailored recommendations on various items. The suggestions fall into two groups: one is an explicit feedback setting, and the other is an implicit feedback setting. In explicit feedback, users rate items explicitly, and these ratings serve as the foundation for the user–item connection. Since users rate items in explicit feedback, the preference relationship for a set of user–item pairs is explicitly known, whereas implicit feedback covers the presence or absence of purchase, click, or search activities [1].

The Alternating Least Squares (ALS) algorithm is used in matrix factorization. ALS is designed for large-scale collaborative filtering problems and is implemented in Apache Spark ML [2]. ALS does a good job of resolving the scalability and sparsity of the ratings data, and it is simple and scales well to very large datasets. The ALS algorithm factorizes a given matrix R into two matrices U and V such that R ≈ UᵀV. The unknown row dimension is passed to the algorithm as a parameter and is referred to as the number of latent factors. Because matrix factorization is used in the context of recommendation, the matrices U and V are frequently referred to as the user and item matrices, respectively. The ith column of the user matrix is denoted by ui and the ith column of the item matrix by vi. The matrix R can be called the ratings matrix with (R)i,j = ri,j. The user and item matrices are found as follows:

\[ \underset{U,V}{\arg\min} \sum_{\{i,j \,\mid\, r_{i,j} \neq 0\}} \left( r_{i,j} - u_i^{T} v_j \right)^2 + \lambda \left( \sum_i n_{u_i} \|u_i\|^2 + \sum_j n_{v_j} \|v_j\|^2 \right) \tag{3} \]

The regularization factor is taken as λ, the number of items user i has rated as nui, and the number of times item j has been rated as nvj. This scheme to avoid overfitting is called weighted-λ-regularization [3]. Fixing one of the matrices U or V yields a quadratic problem in the other. The matrix R is given in its sparse representation as tuples (i, j, r), where i denotes the row index, j the column index, and r the matrix value at position (i, j).

Using implicit feedback gives a better understanding of the user, since it tracks the activity of a particular user without explicitly asking the user any questions. It also helps in preventing data poisoning attacks, which are common when collecting explicit feedback from users: attackers create many fake user profiles and try to manipulate the recommender system into recommending what they want it to recommend. When it comes to platforms that deal with huge amounts of user feedback, it is very difficult to avoid data poisoning. This makes it important to give more weight to implicit feedback, although both kinds of feedback are important in recommender systems [4].

In [5], the authors developed an algorithm for providing personalized recommendations of learning resources to users on an online learning platform. The major idea behind the algorithm is collaborative filtering. This concept of collaborative filtering is applied to three aspects: courses, knowledge, and users. Course-based collaborative filtering aims to recommend courses to the user based on the kinds of courses that the user is taking up on the online learning platform. Knowledge-based collaborative filtering recommends courses to the users based on the knowledge tracks that they are treading
upon. User-based collaborative filtering aims to recommend courses to the user based on similar user profiles on the platform. Together, these approaches give very good recommendations to the users. Providing personalization in e-learning can help boost the productivity of online learners, and this is also true of learning by users of intelligent voice assistants [6].
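As an illustration of Eq. (3), the following is a minimal NumPy sketch of ALS with weighted-λ-regularization: fixing V turns each ui into a ridge-regression solve, and vice versa. The toy ratings matrix is invented, and a production system would use Apache Spark ML's ALS rather than this small dense loop.

```python
# A hedged sketch of ALS for Eq. (3); 0 marks an unobserved rating.
import numpy as np

def als(R, k=2, lam=0.1, iters=20):
    m, n = R.shape
    mask = R != 0                      # sum only over observed entries
    U = np.random.rand(k, m)
    V = np.random.rand(k, n)
    for _ in range(iters):
        for i in range(m):             # solve for each user column u_i
            cols = mask[i]
            A = V[:, cols] @ V[:, cols].T + lam * cols.sum() * np.eye(k)
            U[:, i] = np.linalg.solve(A, V[:, cols] @ R[i, cols])
        for j in range(n):             # solve for each item column v_j
            rows = mask[:, j]
            A = U[:, rows] @ U[:, rows].T + lam * rows.sum() * np.eye(k)
            V[:, j] = np.linalg.solve(A, U[:, rows] @ R[rows, j])
    return U, V

R = np.array([[5.0, 3, 0], [4, 0, 1], [0, 1, 5]])  # toy ratings matrix
U, V = als(R)
print(np.round(U.T @ V, 2))            # reconstructed ratings, R ≈ UᵀV
```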
3 Methodology and Experiments

3.1 Methodology

We used the data obtained from Kaggle [7] for our model in the first version of the implementation. This dataset included explicit ratings of several items for each user. The data included attributes such as music genres (rock, pop, jazz, etc.), movie genres (horror, thriller, romantic, comedy, etc.), subject-wise ratings (mathematics, physics, chemistry, geography, etc.), ratings corresponding to topics such as religion, politics, emotions, and moods, and personal details such as age, gender, location, height, weight, and so on. In the first stage, we read the data, removed any extraneous columns, and turned it into a data frame using the 'Pandas' package. The dataset was normalized by replacing the vacant cells with zeros. Once the dataset was complete, the next step was to further standardize it for use in subsequent processes [8]. Attributes such as gender, left/right handedness, and only child could be converted into binary values because they take only two values. Gender, for example, had the values male/female, handedness had the values left-handed or right-handed, and 'only child' had the values yes/no. To do this, functions were written and applied to the data frame. The 'cosine_similarity' metric [9] (imported from sklearn.metrics.pairwise) was then used to compute the user-to-user similarity matrix. Because each user is similar to himself, the diagonal elements of this matrix are all equal to 1. To read the matrix, move horizontally to the row corresponding to a certain user id; the values in that row represent the similarity scores of all users with respect to the user under observation. Moving vertically accomplishes the same thing. This matrix was transformed into a data frame for use in the following phases. To find the similarity scores with respect to the active user, a function called 'get_similar_users()' was constructed. This function takes the user id of the active user and returns the similarity scores of all the users with respect to the active user in descending order. This ensures that the most similar user is at the top and the least similar user is at the bottom of the list. The next goal was to recommend utterances to users based on their similarity. We had to determine the types of utterances before we could recommend them. The following categories were prioritized for the implementation: news, finance, transportation, and music. To make more relevant recommendations, a subset of the attributes from the dataset was chosen for each of the categories, and similarity scores were calculated category by category. A function called 'get_similar_users_based_on_category()' was created to return the similarity scores of the users with respect to the active user on the basis of the category.
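The following hedged sketch illustrates the similarity step just described, reusing the function name from the text; the toy attribute DataFrame is an invented stand-in for the normalized Kaggle data.

```python
# Build the user-to-user cosine similarity matrix and rank users against
# the active user, as described above.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def get_similar_users(df, active_user_id):
    sim = pd.DataFrame(cosine_similarity(df.values),
                       index=df.index, columns=df.index)
    # drop the active user (the diagonal entry is always 1) and sort descending
    return sim[active_user_id].drop(active_user_id).sort_values(ascending=False)

df = pd.DataFrame({"rock": [5, 4, 1], "horror": [1, 2, 5], "maths": [3, 3, 2]},
                  index=["u1", "u2", "u3"])  # invented attribute ratings
print(get_similar_users(df, "u1"))  # most similar user first
```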
The cosine similarity metric was then applied to each of the dataset’s subgroups containing data relating to news, finance, transportation, and music. Each resulting matrix was transformed into a data frame, with the rows and columns representing the user ids. Using these data frames and the function that was built (‘get_similar_users_based_on_category()’), it is feasible to find the similarity scores of the users with respect to the active user. Following that, a new dataset was generated to store the users’ utterances. Each row contained the user id, the utterance, the time of the utterance (in 24-h format), the location of the utterance, the occasion of the utterance, and the category of the utterance as identified by ‘Bixby.’ This dataset’s data was read and transformed to a data frame. Following that, a function called ‘get_recommended_utterances()’ was written to recommend utterances to the active user. It takes the time, location, occasion, category, and the utterance data frame of the user most similar to the active user. Within this function, the utterances that match the category are chosen first. The result is then subjected to a TPO (Time, Place, and Occasion) ranking procedure:
• Those utterances in which T, P, and O match are stored in a variable called ‘res’.
• Those utterances in which T and P match are stored in a variable called ‘res1’.
• Those utterances in which T and O match are stored in a variable called ‘res2’.
• Those utterances in which P and O match are stored in a variable called ‘res3’.
• Those utterances in which only T matches are stored in a variable called ‘res4’.
• Those utterances in which only P matches are stored in a variable called ‘res5’.
• Those utterances in which only O matches are stored in a variable called ‘res6’.
Now, we have to return the relevant utterances. We first check whether ‘res’ is empty; if it is not, its utterances are returned. Otherwise we check ‘res1’, then ‘res2’, and so on through ‘res6’, returning the first non-empty set. If all of them are empty, we return the first utterance from the list of utterances of the most similar user. This is illustrated in Fig. 1; a compact sketch of the same cascade follows the figure. The following step is locating the most similar user and his corresponding utterances and submitting them to the ‘get_recommended_utterances()’ function to obtain the utterances to be recommended to our active user. This is repeated for each of the four categories, namely news, finance, transportation, and music. We included implicit data in the subsequent version of the implementation, and we produced this dataset ourselves. Each row in the dataset indicates the frequency of utterances related to the previously described four categories, namely news, finance, transportation, and music. When an utterance pertaining to a specific category is made, the count of that category corresponding to the user is increased by one.
Fig. 1. An illustration of the utterance recommender logic
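A minimal pandas sketch of the cascade illustrated in Fig. 1; the column names time, place, occasion, and category are our hypothetical choices.

```python
def get_recommended_utterances(time, place, occasion, category, utterances):
    """TPO ranking: return the best-matching non-empty tier of utterances."""
    df = utterances[utterances["category"] == category]
    tiers = [
        df[(df.time == time) & (df.place == place) & (df.occasion == occasion)],  # res
        df[(df.time == time) & (df.place == place)],                              # res1
        df[(df.time == time) & (df.occasion == occasion)],                        # res2
        df[(df.place == place) & (df.occasion == occasion)],                      # res3
        df[df.time == time],                                                      # res4
        df[df.place == place],                                                    # res5
        df[df.occasion == occasion],                                              # res6
    ]
    for tier in tiers:
        if not tier.empty:
            return tier
    return utterances.head(1)  # fall back to the most similar user's first utterance
```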
The data from the implicit dataset is read in the first phase. ‘Pandas’ is then used to convert it to a data frame, and the missing cells are filled with zeros. The next step was to compute the user similarity matrix using the ‘cosine similarity’ metric; the resulting matrix was then transformed into a data frame. In the following stage, we wrote a function called ‘get_similar_users()’ to calculate the similarity scores with respect to the active user. This function accepts the active user’s user id and returns the similarity scores of all users to the active user in descending order.
This ensures that the most similar user is at the top of the list and the least similar person is at the bottom. To create better recommendations, we used a subset of features from the ‘Kaggle’ dataset (described above) that were relevant to the user’s demography. This dataset included information such as height, weight, age, number of siblings, gender, left/right-handedness, and so on. This dataset’s data was read and then transformed to a data frame using ‘Pandas,’ and the missing cells were replaced with zeros. The data was then read from the utterance dataset and translated to a data frame using ‘Pandas,’ again with the missing cells replaced with zeros. The attributes that take only two values were binarized. The user similarity matrix was then computed using the ‘cosine similarity’ metric and turned into a data frame. In the following stage, we wrote a function called ‘get_similar_users_dem()’ to calculate the similarity scores with respect to the active user. This function accepts the active user’s user id and returns the similarity scores of all users to the active user in descending order, ensuring that the most similar user is at the top of the list and the least similar person is at the bottom. Three functions were developed in order to offer recommendations based on the utterances. The first function is ‘get_recommended_utterances()’, which gives a subset of recommended utterances if none of the most similar user’s utterances have a perfect ‘TPO’ match. It accepts as input parameters the time, place, occasion, category, and utterance data frame of the user most similar to the active user. The initial step within the function is to choose those utterances that match the category. This resulting subset of utterances goes through a TPO ranking process as follows:
• Those utterances in which T and P match are stored in a variable called ‘res1’.
• Those utterances in which T and O match are stored in a variable called ‘res2’.
• Those utterances in which P and O match are stored in a variable called ‘res3’.
• Those utterances in which only T matches are stored in a variable called ‘res4’.
• Those utterances in which only P matches are stored in a variable called ‘res5’.
• Those utterances in which only O matches are stored in a variable called ‘res6’.
Now, we have to return the relevant utterances. We check whether ‘res1’ is empty; if it is not, its utterances are returned. Otherwise we check ‘res2’, then ‘res3’, and so on through ‘res6’, returning the first non-empty set. If all of them are empty, we return the first utterance from the list of utterances of the most similar user.
The second function is called ‘check_if_tpo_matches()’, which returns the subset of utterances that match all three, i.e., time, place, and occasion. It takes the time, place, occasion, and the utterance data frame of the user most similar to the active user as input parameters. In case there are no utterances with a perfect TPO match, it returns ‘None’. The third function, ‘recommender()’, is the main function for recommending the utterances. It takes the time, place, occasion, and the user id of the active user as input parameters. Within this function, the first step is to find the list of similar users on the basis of the implicit data by using the ‘get_similar_users()’ function created previously; the result is stored in a variable called ‘similar_users’. The next step is to find the list of similar users on the basis of demographic data by using the ‘get_similar_users_dem()’ function; the result is stored in ‘similar_users_dem’. In order to make good recommendations, we add the similarity scores from both ‘similar_users’ and ‘similar_users_dem’, store the result in ‘similarity_scores_of_users’, and sort it in descending order to find the most similar user. From this, we have the user id of the most similar user, which we use to find the utterances from the utterance dataset. Now, we check for a perfect ‘TPO’ match by calling the ‘check_if_tpo_matches()’ function. If the result is not ‘None’, we return those utterances. If the result is ‘None’, we find the maximum-frequency category of the most similar user from the implicit dataset and send the time, place, occasion, maximum-frequency category, and the utterance data frame (of the most similar user’s utterances) to the ‘get_recommended_utterances()’ function. The result of this function is recommended to the active user. In a future implementation, we plan to use user embeddings along with a deep learning model to find the similarity between the users before recommending the utterances. The final version of our implementation is depicted as a flowchart in Fig. 2. The first step was to import all of the necessary libraries for our model. In this version, our explicit dataset has only the columns relevant for representing the demographic data of the users, such as age, gender, height, weight, number of siblings, type of location, etc. We then read the data of the users from the explicit dataset and convert it to a data frame. The next step was to binarize the values that can be converted into binary values and fill the empty cells with 0. Our main idea for this version was that the explicit data is less dynamic than the implicit data, which makes it appropriate to use the explicit data to find the similarity between the users and then fine-tune this similarity using the implicit data (the frequency of the users’ utterances per utterance category). The cosine similarity metric was applied to the explicit dataset to find the similarity scores, and a function called ‘get_similar_users()’ was created, just like in the previous versions, to return the similarity scores of all the users with respect to the active user. The next step was to read the data from the implicit dataset and fill the empty cells with 0 (implying zero utterance frequency for that particular category for a user).
We then applied the ‘MinMaxScaler’ transform in order to keep the frequency values within the range of 0 to 1.
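For instance, with scikit-learn (the file and column names below are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

freq = pd.read_csv("implicit_frequencies.csv").fillna(0)  # hypothetical file name
cols = ["news", "finance", "transportation", "music"]
freq[cols] = MinMaxScaler().fit_transform(freq[cols])     # each column scaled to [0, 1]
```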
Fig. 2. An illustration of the proposed methodology for recommending personalised utterances to the user
In the next step, a function called ‘similarity_scores()’ was created. This function takes the active user’s user ID and the user IDs of the top 4 most similar users as input. It then extracts the corresponding frequencies of the active user and the 4 most similar users from the implicit dataset and computes the root-mean-squared difference between the frequencies of each of the 4 users and our active user. The user with the smallest difference is the most similar, so the scores are returned from this function in ascending order. Another function, ‘extract_top_4()’, was created to extract the top 4 most similar users’ data (frequencies) from the implicit dataset. In the next step, the data was read from the utterance dataset and converted to a data frame using the ‘Pandas’ library.
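A minimal sketch of this fine-tuning step; representing each candidate’s category frequencies as a NumPy array keyed by user ID is our assumption.

```python
import numpy as np

def similarity_scores(active_freqs, candidate_freqs):
    """RMS difference between the active user's category frequencies and each
    of the top-4 candidates'; a smaller difference means more similar, so the
    result is sorted in ascending order."""
    scores = {
        user_id: np.sqrt(np.mean((active_freqs - freqs) ** 2))
        for user_id, freqs in candidate_freqs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])
```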
In the next step, a function called ‘get_recommended_utterances()’ was created. It takes the T, P, O, the maximum-frequency category of the most similar user, and that user’s utterances. Within the function, the first step is to extract the utterances that match the maximum-frequency category. The result then goes through a TPO ranking process as in the previous versions; the only difference is that we do not check for a perfect TPO match (an utterance in which the time, place, and occasion all match), because this function is called only when there is no perfect TPO match within the main recommender function. Next, a function called ‘check_if_tpo_matches()’ was created to find whether any utterances have a perfect TPO match. If there is one, it returns that subset of utterances; else it returns ‘None’. We then created the most important function for our model, the ‘recommender()’ function. It takes the T, P, O, and the user id of the active user as input. Within the function, the following steps are followed:
• Use the ‘get_similar_users()’ function to find the similarity scores of all the users with respect to our active user and sort them in descending order.
• Extract the top 5 most similar users from this set, and remove the active user’s own row to get the top 4 most similar users.
• Use the ‘extract_top_4()’ function to extract the implicit data of the top 4 users.
• Use the ‘similarity_scores()’ function to fine-tune the similarity scores of the top 4 users, and take the most similar user’s user ID, which is the first item in the returned list.
• Extract the utterances of the most similar user from the utterance dataset.
• Call the ‘check_if_tpo_matches()’ function to check for a perfect TPO match. If there is one, return that subset of utterances as the recommended utterances for the active user.
• If there is no perfect TPO match, extract from the implicit dataset the category with the maximum frequency for the most similar user, and call the ‘get_recommended_utterances()’ function to get the recommended utterances for our active user after the TPO ranking process.
When the ‘recommender()’ function was called with sample values, the recommended utterances were in accordance with the proposed logic. In the most recent version of our implementation, we have used an embedding-layer approach for finding the similarity between the users. The main idea behind the embedding-layer approach was to tackle the problems of big data associated with Bixby. In this approach, two parameters were taken as input (age and height). ‘Keras’ was used to form the embeddings for the input, and these embeddings were concatenated. This was then passed through dropout and dense layers followed by several ‘ReLU’ activation layers, and the output was limited to the range of the minimum and maximum values of the output parameter used for predictions. We used the ‘Adam’ optimizer and the mean squared error loss function. In the next step, a function was defined in order to
extract the embeddings from the model layer. A function called ‘get_recommendations()’ was created to get the predicted values of the output parameter corresponding to the current user and consequently return the set of user IDs along with the predicted values. The most similar user’s user ID is obtained and used for extracting that user’s utterances, which are passed through the TPO ranking process in order to return the most relevant utterances to the active user.

3.2 Experiments

Pseudo Code for Getting the Recommended Utterances: This function is called when there is no perfect TPO match and the utterances have to be recommended. Figure 3 illustrates the same.
Fig. 3. Pseudo code of the function for getting the recommended utterances when there is no perfect TPO match
Pseudo Code to Check for a Perfect TPO Match: This function is called in order to check for the presence of utterances which have a perfect TPO (Time, Place and Occasion) context match. Figure 4 illustrates the same. Pseudo Code for the Main Recommender: This function is called in order to get the final set of utterances which has to be recommended to the active user. Figure 5 illustrates the same.
Fig. 4. Pseudo code of the function to check for a perfect TPO match from the set of utterances
Fig. 5. Pseudo code for the main recommender
Output of the Recommender (Case 1): Figure 6 shows the output of the recommended set of utterances with respect to the active user when the active user’s current context is Time = 8:00, Place = Bengaluru, Occasion = holiday and User id = 236.
Fig. 6. Output of the set of recommended utterances in the case of a perfect TPO match
We find that the most similar user is the user with the user id 386. This is the case of a perfect TPO match where the time, place and occasion match. Output of the Recommender (Case 2): Figure 7 shows the output of the recommended set of utterances with respect to the active user when the active user’s current context is Time = 8:00, Place = Bengaluru, Occasion = tour, User id = 529.
Fig. 7. Output of the set of recommended utterances in the case where there is no perfect TPO match
We find that the most similar user is the user with the user id 94. This is the case where there is no perfect TPO match. We first find the highest-frequency utterance category of the most similar user (user id: 94); the next step is to pass the subset of utterances matching that category through the TPO ranking process. Recommender Function Using the Embedding Layers Approach: In this approach, two parameters, namely age and height, were taken for training the model, as shown in Fig. 8. The two embedding layers were concatenated and then passed through multiple ReLU activation layers along with other layers such as dropout and dense layers. The Adam optimizer and the mean squared error loss function were used for evaluating the loss.
Fig. 8. Pseudo code for the recommender function in the embedding layers approach
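Since Fig. 8 is reproduced only as an image, the following Keras sketch shows what such a model could look like. The vocabulary sizes, embedding dimension, dropout rate, and layer widths are our assumptions, not values reported by the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_age_bins, n_height_bins, emb_dim = 100, 120, 8  # hypothetical vocabulary sizes

age_in = keras.Input(shape=(1,), dtype="int32", name="age")
height_in = keras.Input(shape=(1,), dtype="int32", name="height")
age_emb = layers.Flatten()(layers.Embedding(n_age_bins, emb_dim)(age_in))
height_emb = layers.Flatten()(layers.Embedding(n_height_bins, emb_dim)(height_in))

x = layers.Concatenate()([age_emb, height_emb])   # concatenated embeddings
x = layers.Dropout(0.2)(x)
x = layers.Dense(32, activation="relu")(x)
x = layers.Dense(16, activation="relu")(x)
out = layers.Dense(1)(x)  # the paper clips this output to the target's min/max range

model = keras.Model([age_in, height_in], out)
model.compile(optimizer="adam", loss="mse")       # Adam + mean squared error
```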
4 Results and Observations

The initial approach used for recommending utterances provided highly accurate recommendations, since it used a combination of cosine similarity and root-mean-squared error to find the similarity between users based on the explicitly and implicitly collected data about them. This data was then compared to find the most similar user and the corresponding relevant utterance to recommend to our active user. While this is a very good methodology, it does not address scalability issues when dealing with big data. In real-time applications such as voice assistants that handle huge amounts of data, it is very important to minimize the time taken for performing
any action. In order to deal with this particular issue, our latest approach is to use the concept of embedding layers to find the similarity between users.
Fig. 9. Model loss for 100 epochs in the embedding layer approach
When using the embedding-layer strategy to detect similar users, it was discovered that a binary-valued output parameter (a parameter with only two outcomes) resulted in a smaller loss than a non-binary-valued one. The model loss in the embedding-layer approach was estimated and plotted across 100 epochs, as shown in Fig. 9. The model’s precision score was 0.3214 and its recall score 0.3103, which was not as high as expected. In future versions of the implementation, these scores will be improved by using neural collaborative filtering and deep context-aware recommender systems. This is required since recommender systems are expected to give very accurate results in order to remain relevant to the user.
5 Comparison of Approaches and Future Work

5.1 Initial Approach

In our method, we started with three datasets: user profile data, utterance data, and metadata, which includes the device usage history. We employed cosine similarity to find the resemblance between our current active user and other users using user profile (demographic) data. Then, using the metadata, we filtered and sliced similar users based on the category of recommendation. To improve the recommendation, utterance data encompassing time, place, and occasion was employed; the utterance of another user with identical parameters and interests is suggested to our current user.

5.2 Latest Approach

Through supervised learning, a neural network with embeddings was developed and trained on a dataset, resulting in comparable users having a closer representation in the
embedding space. The embeddings are the neural network’s parameters; these numbers are modified during training to reduce the prediction problem’s loss. In other words, the network strives to fulfill the task as accurately as possible by modifying the representations of the user and of the metadata that includes the device usage history. Once we have these embeddings, we use the cosine distance as a measure of similarity (another viable option is the Euclidean distance), and we can discover the most similar user to a specific user by computing the distance between the two.

5.3 Future Work

Using encrypted data, the concepts of Neural Collaborative Filtering [10] and Deep Context-Aware Recommender Systems [11] can be applied to construct a recommender system for recommending the most relevant utterances to users. NCF generalizes the matrix factorization problem. To recommend items, it employs an ensemble model of GMF (Generalized Matrix Factorization) and MLP (Multilayer Perceptron) based on the latent vectors. It can generate a score between 0 and 1 for each item, as well as a list of recommended items.
6 Conclusion

With the increase in data volume, there has always been a demand for comprehension and the provision of valuable insights. We discussed our approach to recommender systems in order to provide meaningful predictions of utterances to users, which is currently lacking in ‘Bixby’. Understanding implicit data and making excellent recommendations is a difficult task in and of itself. We will be able to locate similar users by analyzing user embeddings and then use a good ‘TPO’ ranking mechanism to offer the best utterances to the active user.
References
1. Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272. IEEE, December 2008
2. Ravi Kumar, R.R.S., Appa Rao, G., Anuradha, S.: Efficient distributed matrix factorization alternating least squares (EDMFALS) for recommendation systems using spark. J. Inf. Knowl. Manag. 2250012 (2021)
3. https://nightlies.apache.org/flink/flink-docs-release-1.2/dev/libs/ml/als.html. Accessed 10 Dec 2021
4. Huang, H., Mu, J., Gong, N.Z., Li, Q., Liu, B., Xu, M.: Data poisoning attacks to deep learning based recommender systems. arXiv preprint arXiv:2101.02644 (2021)
5. Zhong, M., Ding, R.: Design of a personalized recommendation system for learning resources based on collaborative filtering. Int. J. Circuits Syst. Sig. Process. 122–131 (2022)
6. Kundu, S.S., Sarkar, D., Jana, P., Kole, D.K.: Personalization in education using recommendation system: an overview. In: Deyasi, A., Mukherjee, S., Mukherjee, A., Bhattacharjee, A.K., Mondal, A. (eds.) Computational Intelligence in Digital Pedagogy. ISRL, vol. 197, pp. 85–111. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-8744-3_5
7. https://www.kaggle.com/miroslavsabo/young-people-survey. Accessed 10 Dec 2021
8. VanderPlas, J.: Python Data Science Handbook: Essential Tools for Working with Data. O’Reilly Media, Inc. (2016)
9. https://www.machinelearningplus.com/nlp/cosine-similarity/. Accessed 16 Nov 2020
10. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T.S.: Neural collaborative filtering. In: Proceedings of the 26th International Conference on World Wide Web, pp. 173–182, April 2017
11. Livne, A., Unger, M., Shapira, B., Rokach, L.: Deep context-aware recommender system utilizing sequential latent context. arXiv preprint arXiv:1909.03999 (2019)
A Deep Reinforcement Learning Algorithm Using A New Graph Transformer Model for Routing Problems

Yang Wang and Zhibin Chen(B)

The Department of Mathematics, Kunming University of Science and Technology, 650093 Kunming, People’s Republic of China
[email protected]
Abstract. Routing problems, which belong to a classical kind of problem in combinatorial optimization, have been extensively studied for many decades by researchers from different backgrounds. In recent years, Deep Reinforcement Learning (DRL) has been applied widely in self-driving, robotics, industrial automation, video games, and other fields, showing its strong decision-making and learning ability. In this paper, we propose a new graph transformer model, based on a DRL algorithm, for minimizing the route lengths of a given routing problem. Specifically, the actor-network parameters are trained by an improved REINFORCE algorithm that effectively reduces the variance and adjusts the frequency of the reward values. Further, positional encoding is used in the encoding structure so that the nodes satisfy translation invariance during the embedding process, enhancing the stability of the model. The aggregate operation of the graph neural network is applied to the transformer model’s decoding stage, which effectively captures the topological structure of the graph and the potential relationships between nodes. We have applied our model to two classical routing problems, i.e., the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP). The experimental results show that the optimization effect of our model on small and medium-sized TSP and CVRP surpasses state-of-the-art DRL-based methods and some traditional algorithms. Meanwhile, this model also provides an effective strategy for solving combinatorial optimization problems on graphs.

Keywords: Graph transformer · Graph neural network · Deep Reinforcement Learning · Routing problems · Combinatorial optimization
1 Introduction

Routing problems, such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP), are a classical kind of problem in Combinatorial Optimization Problems (COPs) with real-world applications in many domains [1]. TSP asks the following question: “Given the distances between each pair of cities in a series, what is the shortest possible route that visits each city once and returns to the starting city?”
[2]. Compared with TSP, CVRP has more optimization objectives and constraints [3]. Routing problems are NP-hard in general [4], even in the symmetric 2D Euclidean version, which is this paper’s focus. When it comes to solving routing problems, there are many approaches in the literature, such as small-scale exponential algorithms, approximation algorithms, and heuristic algorithms [1]. However, none of these solves routing problems perfectly when optimality, speed, accuracy, and the ability to generalize are considered as a whole. For instance, traditional exact methods take exponential time even on small instances of the problem [5]. Heuristic algorithms do well on routing problems, but they only compute “good in practice” solutions to COPs without an optimality guarantee [5], and once the problem statement changes slightly, they need to be revised accordingly. In contrast, Deep Learning (DL) methods have the potential to be applied to routing problems by automatically discovering heuristics from data. DL is well suited to natural signals with no closed mathematical formula, and the actual data distribution of routing problems is analytically unknown [6]. DL and routing problems are closely related, particularly in optimizing to minimize the error between prediction and target. In recent years, with the booming development of Deep Reinforcement Learning (DRL) [7] and Graph Neural Networks (GNNs) [8], DRL-based methods have become a promising direction for solving optimization problems. Wang et al. [6] recently surveyed how to combine DRL with NP-hard COPs (including TSP and CVRP), and it is believed to be feasible to apply DRL and GNNs to decision-making or heuristics for solving COPs. The support of high-performance infrastructure has given new research significance to routing optimization problems in the era of big data: the model automatically learns an algorithm in the process of solving routing problems. Following the successful application of the transformer [9] in the field of Natural Language Processing (NLP), such models have obtained more accurate results in solving COPs and may become a significant milestone in this area. Inspired by GNNs [8] and the transformer [9] model, and since DRL can learn decisions for reasoning by interacting with the environment [7], we combine these three methods to make better decisions for solving routing problems. The model policies can be parameterized by a neural network and trained by DRL to obtain more robust algorithms for routing problems. The contribution of this paper is three-fold:
• We propose a new graph transformer model based on the DRL algorithm [10]. It can effectively solve routing problems and has good generalization ability.
• The aggregate operation of GNNs is applied to the transformer decoding stage, so that the vector space of the Graph Embedding (GE) has more flexibility, allowing more information to be characterized and mined.
• Compared with current DRL-based methods and some traditional algorithms, the graph transformer model significantly improves the optimization accuracy on small and medium-sized routing problems.
2 Related Work

2.1 Traditional Algorithms for Solving Routing Problems

Typical exact algorithms for routing problems are branch and bound, dynamic programming, and exponential-time algorithms for small-scale instances. The computation
time is unacceptable as instance sizes increase [11]. Approximation algorithms for routing problems are usually local search, the primal-dual approach based on linear programming, etc. They are polynomial-time algorithms that can output feasible solutions whose objective function value is within a guaranteed factor of the optimum [12]. Heuristic and meta-heuristic algorithms for routing problems include the nearest neighbor algorithm, ant colony optimization, particle swarm optimization, etc. They solve routing problems quickly but easily fall into local optima [13]. Offering a fast and reliable solution thus remains a challenging task.

2.2 DRL-Based Methods for Solving Routing Problems

With traditional heuristic algorithms for routing problems, the entire distance matrix must be recalculated and the system re-optimized from scratch whenever the input changes, which is impractical, especially as the scope of routing problems becomes larger. In contrast, a DRL framework does not require an explicit distance matrix, and a single Feed-Forward (FF) pass of the network updates the routes based on the new data generated by environmental interactions. Vinyals et al. [14] improved the sequence-to-sequence (Seq2Seq) model [15] and proposed Pointer Networks (PN), with Long Short-Term Memory (LSTM) as an encoder and an Attention Mechanism (AM) as a decoder, to effectively solve small-scale TSP. Bello et al. [16] used Reinforcement Learning (RL) to train a PN for solving TSP within 100 nodes. To solve the more complex CVRP, Nazari et al. [17] improved the PN network and added Beam Search (BS) to the inference process. In recent years, GNNs have become a powerful tool for handling non-Euclidean data. Dai et al. [18] proposed a graph embedding model whose network parameters are trained by a deep Q-learning algorithm to solve large-scale TSP. Kool et al. [19] applied the AM model to solve routing problems such as TSP and CVRP for the first time; due to the placement of Multi-Head Attention (MHA) and self-attention, the AM can efficiently capture deep node information. Further, Bo et al. [20] proposed a dynamic AM model to solve CVRP. Chen et al. [21] proposed a NeuRewriter architecture that continuously improves the solution of the CVRP in an iterative manner by training the model with DRL. Kwon et al. [22] proposed the Policy Optimization with Multiple Optima (POMO) model to solve COPs such as TSP and CVRP; its construction method with multiple starting nodes improves training efficiency. Wu et al. [23] proposed a direct policy approach that parameterizes the policy model by the self-attention mechanism to obtain solutions of TSP and CVRP in the model training phase. Xin et al. [24] proposed a Multi-Decoder Attention Model (MDAM) to solve multi-objective routing problems and added an embedding glimpse in the encoding, which improves the overall optimization performance of the model. These models can learn to choose appropriate solutions for routing problems from the vast potential solutions of the combinatorial space. Based on this recent work, we further enhance the model in several ways.
3 The Model Architecture

The selection of a route depends on environmental factors that affect the decision, which is naturally similar to behavior selection in DRL. In this paper, the graph transformer
model we propose consists of an encoder, a decoder, and a training procedure, as shown in Fig. 1. Routing problems satisfy the Markov property, so they are described as a Markov Decision Process (MDP). Hence, DRL algorithms are used to train the parameters θ; for an input instance s, the probability of a solution can be decomposed by the chain rule as:

$$P_\theta(\pi \mid s) = \prod_{t=1}^{N} P_\theta(a_t \mid s, a_{1:t-1}). \qquad (1)$$
Fig. 1. Diagram of graph transformation model frame.
3.1 Encoder

In the encoder structure, we preserve the Positional Encoding (PE) technique of the original transformer, in contrast to the existing DRL models for solving COPs [14–24], so that the initial node coordinates satisfy translation invariance during the embedding process. Meanwhile, the higher-level neural network can learn effective location information. Following the original transformer architecture [9], the initial node embeddings are fed into attention layers that extract deep node information and are updated N = 3 times with N attention layers. Each attention layer consists of two sublayers: an MHA layer and a fully connected FF layer. Here the PE and MHA of each node i are defined as:

$$PE_{t,i} = \begin{cases} \sin(2\pi f_i t), & i \text{ is odd} \\ \cos(2\pi f_i t), & i \text{ is even} \end{cases} \quad \text{with } f_i = \frac{1}{2\pi \cdot 10000^{2i/d}}, \qquad (2)$$
$$Q^i = W^Q h_i, \quad K^i = W^K h_i, \quad V^i = W^V h_i, \qquad (3)$$

$$\mathrm{Head}^i = \mathrm{Attention}(Q^i, K^i, V^i) = \mathrm{softmax}\!\left(\frac{Q^i K^{i\,T}}{\sqrt{d}}\right) V^i, \quad i = 1, 2, \cdots, H, \qquad (4)$$

$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{Head}^1, \mathrm{Head}^2, \cdots, \mathrm{Head}^H)\, W^O, \qquad (5)$$

$$\hat{h}_i = \tanh\big(h_i + \mathrm{MHA}(Q, K, V)\big). \qquad (6)$$
where $PE_t \in \mathbb{R}^d$, t is the location of the node, and d is the dimension; the parameters $W^Q$ and $W^K$ are $(d_k \times d_h)$ matrices and $W^V$ has size $(d_V \times d_h)$. The number of heads is set to H = 8, and $h_i$ denotes the node embedding of node i. After each of the two sublayers, a skip connection [25] and Batch Normalization (BN) [26] are applied:

$$\hat{f}_i = \mathrm{BN}\big(\hat{h}_i + \mathrm{MHA}(Q, K, V)\big), \qquad (7)$$

$$f_i = \mathrm{BN}\big(\hat{f}_i + \mathrm{FF}(\hat{f}_i)\big), \qquad (8)$$
the FF sublayer has one hidden layer with dimension 512 and ReLU activation.

3.2 Decoder

In routing problems, each node is related to its neighbor nodes, so the problem can be abstracted as one over a set of nodes and edges and naturally modeled as a graph structure. Therefore, all feature vectors can be aggregated by the GNN operation in the GE layer. In this way, the network can effectively capture the topological structure of the graph and the potential relationships between nodes, so that more information can be represented and the encoder’s information embedding performs better. The aggregation operation of GNNs is introduced into the decoding operation for the first time; the GE structure is described as:

$$x^{l} = \gamma\, x^{l-1} \varnothing + (1 - \gamma)\, \vartheta_\theta\!\left(\frac{x^{l-1}}{|N(i)|}\right), \qquad (9)$$

where $x^l \in \mathbb{R}^{N \times d_l}$, γ is a trainable parameter that adjusts the weight matrix of the eigenvalues, $\varnothing \in \mathbb{R}^{d_{l-1} \times d_l}$, $\vartheta_\theta : \mathbb{R}^{N \times d_{l-1}} \to \mathbb{R}^{N \times d_l}$ is the aggregation function [27], and N(i) is the adjacency set of node i. We consider routing problems characterized by a complete graph with symmetry. In order to maintain the global properties of the decoding structure and the weight distribution of attention, $x^l$ is homogenized, which makes the aggregated attribute information uniformly embedded in the context vector:

$$\bar{X} = \sum_{l=1}^{L} x^{l}. \qquad (10)$$
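A minimal PyTorch sketch of the aggregation in Eq. (9), specialized to a complete graph where every other node is a neighbor; the choice of a one-layer ReLU network as the aggregation function is our assumption.

```python
import torch
import torch.nn as nn

class GraphEmbeddingLayer(nn.Module):
    """Sketch of Eq. (9): blend a linear transform of the previous layer with a
    learned aggregation of mean-pooled neighbor features (complete graph)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out, bias=False)                 # weight matrix
        self.agg = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())   # aggregation fn
        self.gamma = nn.Parameter(torch.tensor(0.5))                  # trainable blend

    def forward(self, x):                                 # x: (n_nodes, d_in)
        n = x.size(0)
        neigh_mean = (x.sum(0, keepdim=True) - x) / (n - 1)  # mean over |N(i)| neighbors
        return self.gamma * self.lin(x) + (1 - self.gamma) * self.agg(neigh_mean)
```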
Similar to Kool et al. [19], the context vector is computed by self-attention, and the initial node is randomly selected among the first starting points. A masking technique ensures that visited nodes cannot be accessed again, so that the next city node to visit is output with high probability. Figure 2 shows the decoding construction process of the optimal path π = (3, 1, 2, 4). Finally, the probability $P_\theta(a_t \mid s, a_{1:t-1})$ is computed with a single-head attention layer:
$$h_c^i = \begin{cases} \big[\bar{X};\, h^i_{\pi_{t-1}};\, h^i_{\pi_1}\big], & t > 1 \\ \text{none}, & \text{otherwise}, \end{cases} \qquad (11)$$

$$q_c = W^Q h_c, \quad k_i = W^K h_i, \qquad (12)$$

$$u_{cj} = \begin{cases} C \cdot \tanh\!\left(\dfrac{q_c^T k_j}{\sqrt{d}}\right), & \text{if } j \ne \pi_{t'}\ \forall\, t' < t \\ -\infty, & \text{otherwise}, \end{cases} \qquad (13)$$

$$P_\theta(a_t \mid s, a_{1:t-1}) = \frac{e^{u_{cj}}}{\sum_{j} e^{u_{cj}}}, \qquad (14)$$
Fig. 2. Diagram of routing problem decoding for 4 nodes.
4 DRL Training with Shared Baseline In order to measure the difference distribution between different nodes, we add the hyperparameter β = 0.1 to baseline r τ i of the REINFORCE algorithm [10]. It can adjust the frequency of change of the reward value and reduce the variance. Agent is more focused on short-term rewards to prevent premature convergence. We use the shared baseline which is global reward. R τ i is represented by: N (15) R τ i = β × r τ i + (1 − β)1/N r τi , i=1
A Deep Reinforcement Learning Algorithm Using A New Graph Transformer Model
371
During training, route graphs are drawn from a distribution s and the total training objective is defined as: J (θ ) = Eπ ∼Pθ (J (θ |s)).
(16)
In order to maximize the expected return J and circumvent non-differentiability of hard-attention, the model has recourse to the well-known REINFORCE algorithm [10] learning rule. The model’s parameters θ : (17) ∇θ J (θ ) = Eπ ∼Pθ R τ i − R τ i ∇θ logPθ (τ i |s) . We calculate the total return R τ i of each solution τ i , so the gradient of J (θ ) can be expressed by: N R τ i − R τ i ∇θ logPθ (τ i |s). (18) ∇θ J (θ ) ≈ 1/N i=1
The model can learn the parameter θ of the actor network through a random strategy. The gradient of the above formula is calculated and updated, then the optimal strategy Pθ (at |s, a1:t−1 ) is obtained through iterative training. Through the construction of the shared reward baseline, the critic network in the model is replaced, thus the model structure is simplified. It is realized that the accurate mapping of routing problems from point sequence to solution sequence. The training algorithm is described in Algorithm 1. The algorithm terminates when the parameters are converged, or pre-defined maximum number of iterations is reached.
372
Y. Wang and Z. Chen
5 Experiments 5.1 Experimental Environment and Hyperparameters The training data is generated randomly in the unit square [0, 1] × [0, 1], the instants used in our experiments are symmetrical TSP and CVRP with 20, 50 and 100 nodes, respectively. We call them TSP20, CVRP20, etc. The capacity of vehicle D = 30 for VRP20, D = 40 for VRP50, D = 50 for VRP100. For each problem and model are executed on a single GPU Tesla K80. 200 epochs expenditure on average 3 h, 24 h and 136 h for TSP20, TSP50, TSP100; 6 h, 33 h, 168 h for CVRP20, CVRP50 and CVRP100. We have waited as long as 2000 epochs to observe full converge, but as shown in Fig. 5 and Fig. 8, most of the learning is already converged by 200 epochs. The model arrives at the same optimal solution from different starting points. Moreover, we fine-tune the hyperparameters based on [19], which are summarized in Table 1. Table 1. Hyperparameters used for training. Parameter
Value
Parameter
Value
Epoch
200
Optimizer
Adam
Batch size
256
Learning rate
1e−3
GE layers
3
Weight decay
1e−7
10000
Test instances
1000
Training instances
5.2 Experiments for TSP In recent years, DRL-based methods have achieved fantastic results in solving COPs. It can also be seen that the most of these methods still need to combine some traditional algorithms, such as greedy, BS, sampling, etc. Kwon et al. [22] propose a neoteric inference way that is 8-instance augmentation. Because of the coordinates of the simulation experiments have symmetry, we attempt to take 4-instance augmentation. It converts all coordinates (x, y) to (x, 1 − y), (1 − x, y), (1 − x, 1 − y). Figure 3 shows that encoder-decoder architecture and inference can be applied to solve TSP.
Encode
Input
Inference
Decode Feature Representation
Multiple starting nodes
Output
Solution
Fig. 3. Diagram of encoder-decoder architecture and inference for TSP.
A Deep Reinforcement Learning Algorithm Using A New Graph Transformer Model
373
Firstly, the most advanced professional solving tool Concorde [28] and LKH3 [29] are used to calculate the optimal solution of TSP and CVRP. Concorde [28] and LKH3 [29] run on Intel Core i5-9300H CPU. In Table 2 we compare the performance of our model on TSP with other baselines. Baseline includes professional solving tools, traditional algorithms and DRL-based. Secondly, we compare the optimal gap between DRL-based methods and traditional algorithms in Fig. 4, respectively. The 4-distance augmentation inference method results in an optimal gap of 0.00% for TSP20; 0.02% for TSP50; and 0.10% for TSP100, surpassing the current DRL-based methods [16–24]. As for the inference techniques, it is shown that combined use of graph transformer and 4-distance augmentation can reduce the optimal gap even further. Compared with AM model, the performance of our model is notably improved for both TSP20 (0.34%), TSP50 (1.74%) and TSP100 (4.43%). Table 2. Comparison of TSP optimization results of different models. Model
TSP20 Len
TSP50 Gap
Len
TSP100 Gap
Len
Gap
Concorde
3.83
0.00%
5.69
0.00%
7.76
0.00%
LKH3
3.83
0.00%
5.69
0.00%
7.76
0.00%
OR-Tools*
3.86
0.94%
5.85
2.87%
8.06
3.86%
Farthest Insertion
3.89
1.56%
5.97
4.92%
8.34
Nearest Neighbor
4.48
2-opt
3.95
3.13%
6.11
7.38%
8.50
9.53%
AM (sampling)*
3.84
0.08%
5.73
0.52%
7.94
2.26%
AM (greedy)*
3.85
0.34%
5.80
1.76%
8.12
4.53%
16.9%
6.94
21.9%
9.68
7.47% 24.7%
Wu
3.83
0.00%
5.70
0.20%
7.87
1.42%
MDAM (greedy)
3.84
0.05%
5.73
0.62%
7.93
2.19%
POMO (8 augment)
3.83
0.00%
5.69
0.05%
7.77
0.14%
Ours (4 augment)
3.83
0.00%
5.69
0.02%
7.76
0.10%
Note: result with * are reported from other papers.
Learning curves of TSP50 and TSP100 in Fig. 5 show that our model can converge stably to the optimal solution of TSP within 200 batches. Due to the initial solution with random starting points and the processing of PE layer, high-quality solutions can be obtained in the training process. The experimental results in Table 3 illustrate the effectiveness of 4 instance augmentation inferencing method. The inferencing time is shortened by about 50% and the optimal gap of TSP50 and TSP100 is also slightly improved, indicating the rationality of this method. At the same time, it can be seen that the solution time of our model is faster than that of some traditional algorithms [26, 27] in the inferencing stage. Our model similar to the current DRL-based methods [16–24]. Traditional solves like Concorde [28] and LKH3 [29] still outperform DRL-based solvers in terms of performance and
374
Y. Wang and Z. Chen
(a) DRL-based
(b) Traditional
Fig. 4. Comparison of routing optimal gap of different models on TSP.
(a) TSP50
(b)TSP100
Fig. 5. Training convergence of TSP.
generalization. However, they can only provide weaker solutions or would take very long to solve in routing problems. Although we combine graph transformer model and 4-distance augmentation, the method only takes