Lecture Notes in Networks and Systems
750
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Türkiye
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems, and others.

Of particular value to both the contributors and the readership are the short publication timeframe and the worldwide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them.

Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

For proposals from Asia please contact Aninda Bose ([email protected]).
Pablo García Bringas · Hilde Pérez García · Francisco Javier Martínez de Pisón · Francisco Martínez Álvarez · Alicia Troncoso Lora · Álvaro Herrero · José Luis Calvo Rolle · Héctor Quintián · Emilio Corchado Editors
18th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2023) Salamanca, Spain, September 5–7, 2023, Proceedings, Volume 2
Editors Pablo García Bringas Faculty of Engineering University of Deusto Bilbao, Spain
Hilde Pérez García School of Industrial, Computer and Aerospace Engineering University of León León, Spain
Francisco Javier Martínez de Pisón Department of Mechanical Engineering University of La Rioja Logroño, Spain
Francisco Martínez Álvarez Data Science and Big Data Lab Pablo de Olavide University Seville, Spain
Alicia Troncoso Lora Data Science and Big Data Lab Pablo de Olavide University Seville, Spain
Álvaro Herrero Applied Computational Intelligence University of Burgos Burgos, Spain
José Luis Calvo Rolle Department of Industrial Engineering University of A Coruña A Coruña, Spain
Héctor Quintián Department of Industrial Engineering University of A Coruña A Coruña, Spain
Emilio Corchado Faculty of Science University of Salamanca Salamanca, Spain
ISSN 2367-3370   ISSN 2367-3389 (electronic)
Lecture Notes in Networks and Systems
ISBN 978-3-031-42535-6   ISBN 978-3-031-42536-3 (eBook)
https://doi.org/10.1007/978-3-031-42536-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Paper in this product is recyclable.
Preface
This volume of Lecture Notes in Networks and Systems contains accepted papers presented at the 18th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2023). This conference was held in the beautiful city of Salamanca, Spain, in September 2023. Soft computing represents a collection of computational techniques in machine learning, computer science, and some engineering disciplines, which investigate, simulate, and analyze very complex issues and phenomena.

After a peer-review process, the SOCO 2023 International Program Committee selected 61 papers, published in these conference proceedings, representing an acceptance rate of 60%. In this edition, a particular emphasis was put on the organization of special sessions. Seven special sessions were organized on relevant topics: Time Series Forecasting in Industrial and Environmental Applications, Technological Foundations and Advanced Applications of Drone Systems, Soft Computing Methods in Manufacturing and Management Systems, Efficiency and Explainability in Machine Learning and Soft Computing, Machine Learning and Computer Vision in Industry 4.0, Genetic and Evolutionary Computation in Real World and Industry, and Soft Computing and Hard Computing for a Data Science Process Model.

The selection of papers was extremely rigorous to maintain the high quality of the conference. We want to thank the members of the Program Committees for their hard work during the reviewing process. This is a crucial process for creating a high-standard conference; the SOCO conference would not exist without their help.

SOCO enjoyed outstanding keynote speeches by distinguished guest speakers: Prof. Oscar Cordón, University of Granada (Spain); Prof. Paulo Novais, University of Minho (Portugal); Prof. Michał Woźniak (Poland); and Prof. Hujun Yin, University of Manchester (UK).

SOCO 2023 has teamed up with “Neurocomputing” (Elsevier), “Logic Journal of the IGPL” (Oxford University Press), and “Cybernetics and Systems” (Taylor & Francis) for a suite of special issues including selected papers from SOCO 2023.

Particular thanks go as well to the conference’s main sponsors: Startup Olé; the CYLHUB project, financed with NEXT-GENERATION funds from the European Union and channeled by the Junta de Castilla y León through the Regional Ministry of Industry, Trade and Employment; the BISITE research group at the University of Salamanca; the CTC research group at the University of A Coruña; and the University of Salamanca. They jointly contributed in an active and constructive manner to the success of this initiative.

We would like to thank all the special session organizers, contributing authors, as well as the members of the Program Committees and the Local Organizing Committee
for their hard and highly valuable work. Their work contributed to the success of the SOCO 2023 event.

September 2023
José Luis Calvo Rolle
Francisco Javier Martínez de Pisón
Pablo García Bringas
Hilde Pérez García
Francisco Martínez Álvarez
Alicia Troncoso Lora
Álvaro Herrero
Héctor Quintián
Emilio Corchado
Organization
General Chair

Emilio Corchado – University of Salamanca, Spain
International Advisory Committee

Ashraf Saad – Armstrong Atlantic State University, USA
Amy Neustein – Linguistic Technology Systems, USA
Ajith Abraham – Machine Intelligence Research Labs (MIR Labs), Europe
Jon G. Hall – The Open University, UK
Paulo Novais – Universidade do Minho, Portugal
Amparo Alonso Betanzos – President, Spanish Association for Artificial Intelligence (AEPIA), Spain
Michael Gabbay – King's College London, UK
Aditya Ghose – University of Wollongong, Australia
Saeid Nahavandi – Deakin University, Australia
Henri Pierreval – LIMOS UMR CNRS 6158 IFMA, France
Program Committee Chairs

José Luis Calvo-Rolle – University of A Coruña, Spain
Francisco Javier Martínez de Pisón – University of La Rioja, Spain
Pablo García Bringas – University of Deusto, Spain
Hilde Pérez García – University of León, Spain
Francisco Martínez Álvarez – Pablo de Olavide University, Spain
Alicia Troncoso Lora – Pablo de Olavide University, Spain
Álvaro Herrero – University of Burgos, Spain
Héctor Quintián – University of A Coruña, Spain
Emilio Corchado – University of Salamanca, Spain
Program Committee

Agustina Bouchet – University of Oviedo, Spain
Aitor Garrido – University of the Basque Country, Spain
Akemi Galvez-Tomida – University of Cantabria, Spain
Alfredo Jimenez – KEDGE Business School, Spain
Álvaro Michelena Grandío – University of A Coruña, Spain
Andres Fuster-Guillo – University of Alicante, Spain
Andres Iglesias Prieto – University of Cantabria, Spain
Angel Arroyo – University of Burgos, Spain
Anton Koval – Luleå University of Technology, Sweden
Antonio Sala – Polytechnic University of Valencia, Spain
Bartosz Krawczyk – Virginia Commonwealth University, USA
Beatriz De La Iglesia – University of East Anglia, UK
Bogdan Okreša Đurić – University of Zagreb, Croatia
Borja Sanz – University of Deusto, Spain
Bowen Zhou – Northeastern University, China
Carlos Pereira – ISEC, Portugal
Damian Krenczyk – Silesian University of Technology, Poland
Daniel Honc – University of Pardubice, Czechia
Daniel Urda – University of Burgos, Spain
Daniela Perdukova – Technical University of Kosice, Slovakia
David Griol – University of Granada, Spain
Dragan Simić – University of Novi Sad, Serbia
Eleni Mangina – University College Dublin, Ireland
Eloy Irigoyen – University of the Basque Country, Spain
Enrique De La Cal Marín – University of Oviedo, Spain
Enrique Onieva – University of Deusto, Spain
Esteban Jove – University of A Coruña, Spain
Fares M’Zoughi – University of the Basque Country, Spain
Fernando Ribeiro – EST, Portugal
Fernando Sanchez Lasheras – University of Oviedo, Spain
Florentino Fdez-Riverola – University of Vigo, Spain
Francisco Martínez-Álvarez – Pablo de Olavide University, Spain
Francisco Zayas-Gato – University of A Coruña, Spain
Franjo Jovic – University of Osijek, Croatia
Giuseppe Psaila – University of Bergamo, Italy
Grzegorz Ćwikła – Silesian University of Technology, Poland
Guangdi Li – Northeastern University, China
Hector Cogollos Adrián – University of Burgos, Spain
Héctor Alaiz Moretón – University of León, Spain
Héctor Quintián – University of A Coruña, Spain
Humberto Bustince – UPNA, Spain
Isaias Garcia – University of León, Spain
Izaskun Garrido – University of the Basque Country, Spain
J. Enrique Sierra García – University of Burgos, Spain
Jaroslav Marek – University of Pardubice, Czechia
Jaume Jordán – Polytechnic University of Valencia, Spain
Javier Díez-González – University of León, Spain
Javier Sanchis Saez – Polytechnic University of Valencia, Spain
Jiri Pospichal – University of Ss. Cyril and Methodius, Slovakia
Jorge Barbosa – ISEC - Instituto Superior de Engenharia de Coimbra, Portugal
Jorge García-Gutiérrez – University of Seville, Spain
José Dorronsoro – Autonomous University of Madrid, Spain
José Valente de Oliveira – Universidade do Algarve, Portugal
José Antonio Aveleira Mata – University of León, Spain
José Carlos Metrolho – IPCB, Portugal
José Luis Calvo-Rolle – University of A Coruña, Spain
José Luis Casteleiro-Roca – University of A Coruña, Spain
José M. Molina – University Carlos III of Madrid, Spain
José Manuel Lopez-Guede – University of the Basque Country, Spain
José Ramón Villar – University of Oviedo, Spain
Juan Gomez Romero – University of Granada, Spain
Juan Albino Mendez – University of La Laguna, Spain
Juan J. Gude – University of Deusto, Spain
Juan M. Alberola – Polytechnic University of Valencia, Spain
Julio César Puche Regaliza – University of Burgos, Spain
Laura Melgar-García – Pablo de Olavide University, Spain
Lidia Sánchez-González – University of León, Spain
Luis Alfonso Fernández Serantes – FH-Joanneum University of Applied Sciences, Spain
Luis Paulo Reis – APPIA, University of Porto/LIACC, Portugal
Manuel Castejón-Limas – University of León, Spain
Manuel Graña – University of the Basque Country, Spain
Marcin Paprzycki – IBS PAN and WSM, Poland
Maria Fuente – University of Valladolid, Spain
Maria Teresa Godinho – Polytechnic Institute of Beja, Portugal
Matilde Santos – Complutense University of Madrid, Spain
Matilde Santos Peñas – Complutense University of Madrid, Spain
Michał Woźniak – Wroclaw University of Technology, Poland
Miriam Timiraos Díaz – University of A Coruña, Spain
Nashwa El-Bendary – Arab Academy for Science, Technology, and Maritime Transport, Egypt
Noelia Rico – University of Oviedo, Spain
Oscar Castillo – Tijuana Institute of Technology, Mexico
Ovidiu Cosma – Technical University Cluj-Napoca, Romania
Pablo García Bringas – University of Deusto, Spain
Panagiotis Kyratsis – University of Western Macedonia, Greece
Paulo Moura Oliveira – UTAD University, Portugal
Payam Aboutalebi – Norwegian University of Science and Technology, Norway
Petr Dolezel – University of Pardubice, Czechia
Petrica Pop – Technical University of Cluj-Napoca, North University Center at Baia Mare, Romania
Reggie Davidrajuh – University of Stavanger, Norway
Robert Burduk – Wrocław University of Technology, Poland
Rogério Dionísio – Instituto Politécnico de Castelo Branco, Portugal
Ruben Ruiz – University of Burgos, Spain
Santiago Porras Alfonso – University of Burgos, Spain
Sebastian Saniuk – University of Zielona Góra, Poland
Stefano Pizzuti – Energy New Technologies and Sustainable Economic Development Agency (ENEA), Italy
Valeriu Manuel Ionescu – University of Pitesti, Romania
Vladimir Ilin – University of Novi Sad, Serbia
Wei-Chiang Hong – Asia Eastern University of Science and Technology, Taiwan
Wilfried Elmenreich – Alpen-Adria-Universität Klagenfurt, Austria
Zita Vale – GECAD - ISEP/IPP, Portugal
Special Sessions

Time Series Forecasting in Industrial and Environmental Applications

Program Committee

Federico Divina (Organizer) – Pablo de Olavide University, Spain
José F. Torres Maldonado (Organizer) – Pablo de Olavide University, Spain
Julio César Mello Román (Organizer) – Universidad Nacional de Asunción, Paraguay
Mario D. Lucio Giacobini (Organizer) – University of Torino, Italy
Miguel Garcia Torres (Organizer) – Pablo de Olavide University, Spain
Andrés Chacón Maldonado – Pablo de Olavide University, Spain
Ángela del Robledo Troncoso García – Pablo de Olavide University, Spain
Antonio Morales-Esteban – University of Seville, Spain
David Gutiérrez-Avilés – University of Seville, Spain
Diego Pedro Pinto Roa – Universidad Nacional de Asunción, Paraguay
Gualberto Asencio Cortés – Pablo de Olavide University, Spain
Isabel Sofia – Instituto Politécnico de Beja, Portugal
José-Lázaro Amaro-Mellado – University of Seville, Spain
Luís Filipe Domingues – Polytechnic Institute of Beja, Portugal
Time Series Forecasting in Industrial and Environmental Applications

Program Committee

Federico Divina (Organizer) – Pablo de Olavide University, Spain
José F. Torres (Organizer) – Pablo de Olavide University, Spain
José Luis Vázquez Noguera (Organizer) – Universidad Nacional de Asunción, Paraguay
Mario Giacobini (Organizer) – University of Torino, Italy
Miguel García Torres (Organizer) – Pablo de Olavide University, Spain
Antonio Morales-Esteban – University of Seville, Spain
David Gutiérrez-Avilés – University of Seville, Spain
Diego Pedro Pinto Roa – Universidad Nacional de Asunción, Paraguay
Elvira Di Nardo – University of Torino, Italy
Jorge Reyes – NT2 Labs, Chile
José-Lázaro Amaro-Mellado – University of Seville, Spain
Laura Melgar-García – Pablo de Olavide University, Spain
Laura Sacerdote – University of Torino, Italy
Luís Filipe Domingues – Polytechnic Institute of Beja, Portugal
Manuel Jesús Jiménez Navarro – Pablo de Olavide University, Spain
Zeyar Aung – Khalifa University of Science and Technology, United Arab Emirates
Technological Foundations and Advanced Applications of Drone Systems

Program Committee

Ana Bernardos (Organizer) – Polytechnic University of Madrid, Spain
James Llinas (Organizer) – SUNY at Buffalo, USA
Jesús García (Organizer) – University Carlos III of Madrid, Spain
José Manuel Molina (Organizer) – University Carlos III of Madrid, Spain
Juan Besada (Organizer) – Polytechnic University of Madrid, Spain
Antonio Berlanga – University Carlos III of Madrid, Spain
Miguel Angel Patricio – University Carlos III of Madrid, Spain
Soft Computing Methods in Manufacturing and Management Systems

Program Committee

Anna Burduk (Organizer) – Wrocław University of Technology, Poland
Bożena Skołud (Organizer) – Silesian University of Technology, Poland
Damian Krenczyk (Organizer) – Silesian University of Technology, Poland
Marek Placzek (Organizer) – Silesian University of Technology, Poland
Wojciech Bożejko (Organizer) – Wrocław University of Technology, Poland
Dorota Więcek – University of Bielsko-Biala, Poland
Franjo Jovic – University of Osijek, Croatia
Katarzyna Antosz – Rzeszow University of Technology, Poland
Krzysztof Kalinowski – Silesian University of Technology, Poland
Reggie Davidrajuh – University of Stavanger, Norway
Sebastian Saniuk – University of Zielona Góra, Poland
Efficiency and Explainability in Machine Learning and Soft Computing

Program Committee

Belén Vega Márquez (Organizer) – University of Seville, Spain
David Gutiérrez Avilés (Organizer) – University of Seville, Spain
José María Luna Romera (Organizer) – University of Seville, Spain
Manuel Carranza García (Organizer) – University of Seville, Spain
Manuel Jesús Jiménez Navarro (Organizer) – University of Seville, Spain
María del Mar Martínez Ballesteros (Organizer) – University of Seville, Spain
Antonio Gilardi – SLAC—Stanford, USA
Chalachew Muluken Liyew – University of Torino, Italy
Juan Parra – Aston University, UK
Laura Melgar-García – Pablo de Olavide University, Spain
Machine Learning and Computer Vision in Industry 4.0

Program Committee

Enrique Dominguez (Organizer) – University of Málaga, Spain
José Garcia Rodriguez (Organizer) – University of Alicante, Spain
Ramon Moreno Jiménez (Organizer) – Grupo Antolin, Spain
Andres Fuster-Guillo – University of Alicante, Spain
David Tomás – University of Alicante, Spain
Esteban José Palomo – University of Málaga, Spain
Ezequiel López-Rubio – University of Málaga, Spain
Jesus Benito-Picazo – University of Málaga, Spain
Jorge Azorín-López – University of Alicante, Spain
Juan Ortiz-de-Lazcano-Lobato – University of Málaga, Spain
Karl Thurnhofer-Hemsi – University of Málaga, Spain
Marcelo Saval-Calvo – University of Alicante, Spain
Miguel Molina-Cabello – University of Málaga, Spain
Rafael M. Luque-Baena – University of Extremadura, Spain
Genetic and Evolutionary Computation in Real World and Industry

Program Committee

Enrique Antonio De La Cal Marin (Organizer) – University of Oviedo, Spain
Luciano Sánchez Ramos (Organizer) – University of Oviedo, Spain
Camelia Chira – Babes-Bolyai University, Romania
Cosmin Sabo – Technical University of Cluj-Napoca, Romania
Dominik Olszewski – Warsaw University of Technology, Poland
Enol García González – University of Oviedo, Spain
Fernando Martins – University of Oviedo, Spain
Mario Villar – i4life, Spain
Nahuel Costa Cortez – University of Oviedo, Spain
Noelia Rico – University of Oviedo, Spain
Petrica Pop – Technical University of Cluj-Napoca, North University Center at Baia Mare, Romania
Sezin Afsar – University of Oviedo, Spain
Víctor Gonzalez – University of Oviedo, Spain
Soft Computing and Hard Computing for a Data Science Process Model

Program Committee

Antonio Tallón (Organizer) – University of Huelva, Spain
Ireneusz Czarnowski (Organizer) – Gdynia Maritime University, Poland
Akash Punhani – SRM Institute of Science and Technology, India
David H. Glass – Ulster University, UK
Elsa da Piedade Chinita Soares Rodrigues – Instituto Politécnico de Beja, Portugal
Mª José Ginzo Villamayor – University of Santiago de Compostela, Spain
SOCO 2023 Organizing Committee Chairs

Emilio Corchado – University of Salamanca, Spain
Héctor Quintián – University of A Coruña, Spain
SOCO 2023 Organizing Committee

Álvaro Herrero Cosio – University of Burgos, Spain
José Luis Calvo-Rolle – University of A Coruña, Spain
Ángel Arroyo – University of Burgos, Spain
Daniel Urda – University of Burgos, Spain
Nuño Basurto – University of Burgos, Spain
Carlos Cambra – University of Burgos, Spain
Leticia Curiel – University of Burgos, Spain
Beatriz Gil – University of Burgos, Spain
Raquel Redondo – University of Burgos, Spain
Esteban Jove – University of A Coruña, Spain
José Luis Casteleiro-Roca – University of A Coruña, Spain
Francisco Zayas-Gato – University of A Coruña, Spain
Álvaro Michelena – University of A Coruña, Spain
Míriam Timiraos Díaz – University of A Coruña, Spain
Antonio Javier Díaz Longueira – University of A Coruña, Spain
Contents
Special Session 2: Technological Foundations and Advanced Applications of Drone Systems

Level 3 Data Fusion . . . . . . . . . . 3
Alan N. Steinberg

Image Classification Using Contrastive Language-Image Pre-training: Application to Aerial Views of Power Line Infrastructures . . . . . . . . . . 13
Adrián Losada, Ana M. Bernardos, and Juan Besada

A Realistic UAS Traffic Generation Tool to Evaluate and Optimize U-Space Airspace Capacity . . . . . . . . . . 24
Daniel Raposo, David Carramiñana, Juan Besada, and Ana Bernardos

UAV Airframe Classification Using Acceleration Spectrograms . . . . . . . . . . 34
David Sánchez Pedroche, Francisco Fariña Salguero, Daniel Amigo Herrero, Jesús García, and José M. Molina

Tuning Process Noise in INS/GNSS Fusion for Drone Navigation Based on Evolutionary Algorithms . . . . . . . . . . 44
Juan Pedro Llerena, Jesús García, José Manuel Molina, and Daniel Arias

Special Session 3: Soft Computing Methods in Manufacturing and Management Systems

Digital Twins of Production Systems Based on Discrete Simulation and Machine Learning Algorithms . . . . . . . . . . 57
Damian Krenczyk

Edge Architecture for the Integration of Soft Models Based Industrial AI Control into Industry 4.0 Cyber-Physical Systems . . . . . . . . . . 67
Ander Garcia, Telmo Fernández de Barreana, Juan Luis Ferrando Chacón, Xabier Oregui, and Zelmar Etxegoin

The Use of Line Simplification and Vibration Suppression Algorithms to Improve the Quality of Determining the Indoor Location in RTLSs . . . . . . . . . . 77
Grzegorz Ćwikła and Tomasz Lorenz

Possibilities of Decision Support in Organizing Production Processes . . . . . . . . . . 88
Małgorzata Olender-Skóra and Aleksander Gwiazda
Special Session 4: Efficiency and Explainability in Machine Learning and Soft Computing

Efficient Short-Term Time Series Forecasting with Regression Trees . . . . . . . . . . 101
Pablo Reina-Jiménez, Manuel Carranza-García, Jose María Luna-Romera, and José C. Riquelme

Generating Synthetic Fetal Cardiotocography Data with Conditional Generative Adversarial Networks . . . . . . . . . . 111
Halal Abdulrahman Ahmed, Juan A. Nepomuceno, Belén Vega-Márquez, and Isabel A. Nepomuceno-Chamorro

Olive Oil Fly Population Pest Forecasting Using Explainable Deep Learning . . . . . . . . . . 121
A. M. Chacón-Maldonado, A. R. Troncoso-García, F. Martínez-Álvarez, G. Asencio-Cortés, and A. Troncoso

Explaining Learned Patterns in Deep Learning by Association Rules Mining . . . . . . . . . . 132
M. J. Jiménez-Navarro, M. Martínez-Ballesteros, F. Martínez-Álvarez, and G. Asencio-Cortés

Special Session 5: Machine Learning and Computer Vision in Industry 4.0

A Deep Learning Ensemble for Ultrasonic Weld Quality Control . . . . . . . . . . 145
Ramón Moreno, José María Sanjuán, Miguel Del Río Cristóbal, Revanth Shankar Muthuselvam, and Ting Wang

Indoor Scenes Video Captioning . . . . . . . . . . 153
Javier Rodríguez-Juan, David Ortiz-Perez, Jose Garcia-Rodriguez, David Tomás, and Grzegorz J. Nalepa

A Multimodal Dataset to Create Manufacturing Digital Twins . . . . . . . . . . 163
David Alfaro-Viquez, Mauricio-Andres Zamora-Hernandez, Hanzel Grillo, Jose Garcia-Rodriguez, and Jorge Azorín-López

A Modified Loss Function Approach for Instance Segmentation Improvement and Application in Fish Markets . . . . . . . . . . 173
Alejandro Galán-Cuenca, Nahuel García-d’Urso, Pau Climent-Pérez, Andres Fuster-Guillo, and Jorge Azorin-Lopez

Parallel Processing Applied to Object Detection with a Jetson TX2 Embedded System . . . . . . . . . . 184
Jesús Benito-Picazo, Jose David Fernández-Rodríguez, Enrique Domínguez, Esteban J. Palomo, and Ezequiel López-Rubio
Deep Learning-Based Emotion Detection in Aphasia Patients . . . . . . . . . . 195
David Ortiz-Perez, Pablo Ruiz-Ponce, Javier Rodríguez-Juan, David Tomás, Jose Garcia-Rodriguez, and Grzegorz J. Nalepa

Defect Detection in Batavia Woven Fabrics by Means of Convolutional Neural Networks . . . . . . . . . . 205
Nuria Velasco-Pérez, Samuel Lozano-Juárez, Beatriz Gil-Arroyo, Juan Marcos Sanz, Nuño Basurto, Daniel Urda, and Álvaro Herrero

An Image Mosaicing-Based Method for Bird Identification on Edge Computing Devices . . . . . . . . . . 216
Dmitrij Teterja, Jose Garcia-Rodriguez, Jorge Azorin-Lopez, Esther Sebastian-Gonzalez, Rita Elise van der Walt, and M. J. Booysen

HoloDemtect: A Mixed Reality Framework for Cognitive Stimulation Through Interaction with Objects . . . . . . . . . . 226
David Mulero-Pérez, Manuel Benavent-Lledo, Jose Garcia-Rodriguez, Jorge Azorin-Lopez, and Flores Vizcaya-Moreno

Accurate Estimation of Parametric Models of the Human Body from 3D Point Clouds . . . . . . . . . . 236
Nahuel E. Garcia-D’Urso, Jorge Azorin-Lopez, and Andres Fuster-Guillo

Lightweight Cosmetic Contact Lens Detection System for Iris Recognition at a Distance . . . . . . . . . . 246
Adrián Romero-Garcés, Camilo Ruiz-Beltrán, Rebeca Marfil, and Antonio Bandera

Vehicle Warning System Based on Road Curvature Effect Using CNN and LSTM Neural Networks . . . . . . . . . . 256
F. Barreno, Matilde Santos, and M. Romana

Special Session 6: Genetic and Evolutionary Computation in Real World and Industry

Enhancing Time Series Anomaly Detection Using Discretization and Word Embeddings . . . . . . . . . . 269
Lucas Pérez, Nahuel Costa, and Luciano Sánchez

Multi-objective Optimization for Multi-Robot Path Planning on Warehouse Environments . . . . . . . . . . 279
Enol García González, José R. Villar, Camelia Chira, Enrique de la Cal, Luciano Sánchez, and Javier Sedano
On the Prediction of Anomalous Contaminant Diffusion . . . . . . . . . . 290
Douglas F. Corrêa, Guido F. M. G. Carvalho, David A. Pelta, Claudio F. M. Toledo, and Antônio J. Silva Neto

Keeping Safe Distance from Obstacles for Autonomous Vehicles by Genetic Algorithms . . . . . . . . . . 300
Eduardo Bayona, Jesús-Enrique Sierra-García, and Matilde Santos

An Approach of Optimisation in Last Mile Delivery . . . . . . . . . . 311
Dragan Simić, José Luis Calvo-Rolle, José R. Villar, Vladimir Ilin, Svetislav D. Simić, and Svetlana Simić

Special Session 7: Soft Computing and Hard Computing for a Data Science Process Model

A Preliminary Study of MLSE/ACE-III Stages for Primary Progressive Aphasia Automatic Identification Using Speech Features . . . . . . . . . . 323
Amable J. Valdés Cuervo, Elena Herrera, and Enrique A. de la Cal

Comparison of LSTM, GRU and Transformer Neural Network Architecture for Prediction of Wind Turbine Variables . . . . . . . . . . 334
Pablo-Andrés Buestán-Andrade, Matilde Santos, Jesús-Enrique Sierra-García, and Juan-Pablo Pazmiño-Piedra

The Impact of Data Normalization on the Accuracy of Machine Learning Algorithms: A Comparative Analysis . . . . . . . . . . 344
Kelsy Cabello-Solorzano, Isabela Ortigosa de Araujo, Marco Peña, Luís Correia, and Antonio J. Tallón-Ballesteros

Adaptive Optics Correction Using Recurrent Neural Networks for Wavefront Prediction . . . . . . . . . . 354
Saúl Pérez Fernández, Alejandro Buendía Roca, Carlos González Gutiérrez, Javier Rodríguez Rodríguez, Santiago Iglesias Álvarez, Ronny Anangonó Tutasig, Fernando Sánchez Lasheras, and Francisco Javier de Cos Juez

Author Index . . . . . . . . . . 365
Special Session 2: Technological Foundations and Advanced Applications of Drone Systems
Level 3 Data Fusion: Course of Action and Scenario Estimation

Alan N. Steinberg

Data Fusion & Neural Networks, LLC, Alexandria, VA 22309, USA
[email protected]
Abstract. In [1], we discussed the role and implementation of level 2 data fusion – estimation of relationships and situations – in a tactical application. The present paper extends this discussion to the role and implementation of level 3 data fusion – estimation and prediction of courses of action, interactions, outcomes, and mission impacts. Level 3 data fusion estimates scenarios: the evolution over time of known or predicted individual courses of action, interactions, and outcomes. When hosted on a responsive system (a “tactical” system), such scenario estimation can be used to predict the consequences and utility of the host system’s current or candidate courses of action plans. Course of action (CoA) projection characteristically involves analysis of various actors’ capability, opportunity, and intent to perform various actions affecting entities of concern, and the susceptibility of the latter to such actions. CoA and outcome projection in tactical operations is mainly in the form of prediction of future activities. Retroactive projection can also be performed to evaluate the characteristics, outcomes, and response effectiveness in historical or counterfactual scenarios. Keywords: High-level data fusion · prediction · capability · opportunity · intent
Adversarial use of small drone aircraft challenges conventional approaches to air defense: they are prolific, ambiguous in signature and behavior, opportunistic, expendable, and asymmetric in the cost of response. Conventional detection, tracking, and identification of individual drones will not suffice to recognize or predict novel threat situations or to manage cost-effective responses against this very asymmetric threat. We are developing methods to infer adversary courses of action from attributive and relational data and to plan cost-effective, goal-driven responses.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 P. García Bringas et al. (Eds.): SOCO 2023, LNNS 750, pp. 3–12, 2023. https://doi.org/10.1007/978-3-031-42536-3_1
1 Ontological Foundations

Data fusion systems are concerned with estimating various entity and situational states. The JDL Data Fusion model originally defined level 3 data fusion as Threat Assessment [2]. This was subsequently generalized to Scenario/Outcome Assessment, defined as predictive or forensic estimation of courses of action, scenarios, and outcomes [3, 4].1

When concerned with recognizing or predicting human actions, we need to consider not only actors’ physical states but also their cognitive or perceptual states:

• In assessing or predicting an actor’s potential course of action, we should consider the utility, or value, to him of possible outcomes;2
• In assessing the effectiveness of our own current or previous actions, we need a model of our own value system;
• In planning future courses of action, we would like to consider the utility as well as the probability of possible outcomes as our actions interact with the environment, including effects of other actors’ actions.

We have proposed an ontology for representing and relating entity classes of concern of multi-level data fusion [5]. That ontology defines a situation as a set of relationships, each of which involves a relation and a vector of arguments. A scenario is defined as a temporally evolving situation, involving the evolution of constituent entities and their interactions.3 A course of action (CoA) is defined as a sequence of actions conducted by an entity (an “actor”) in an actual or postulated scenario. An actor may be an individual (a “level 1 entity”) or a system of individuals (a “level 2 entity”), and an action may be intentional or unintentional. Entities of concern in CoA analysis can include:

• resources under “our” control or influence;4
• entities that may be controlled or influenced intentionally by other actors (e.g., adversaries or independent neutral actors);
• entities affected only by unintentional factors.

As depicted in Fig. 1, the ontology distinguishes

a. actual physical world states over time, including physical states of entities x and their action potentials; i.e., their capabilities and opportunities to carry out actions A(x);

1 We partition data fusion levels according to classes of variables to be evaluated: Level 0 Feature/Signal Assessment, Level 1 Individual Entity Assessment, Level 2 Relationship/Situation Assessment, Level 3 Scenario/Outcome Assessment, Level 4 System Assessment [4].
2 We conflate value and utility (pace Wilde: “all art is quite useless”), as we shall be reasoning in terms of value relative to entities’ “mission” objectives.
3 A scenario can be represented as a temporal relationship among situations [6, 7]. As formulated in [5], an individual is ontologically a degenerate relationship, which is a degenerate situation, which is a degenerate scenario [8]. Therefore, similar methods may be used for describing, recognizing, and projecting changes in individuals, relationships, situations, and scenarios; viz., an opportunity for software pattern reuse across fusion levels.
4 Let’s use ‘us’ to refer to the fusion process’ host system; i.e., the entity that estimates and evaluates “our” place in the world, determines outcome utility and action costs to “us”, and manages “our” responses.
b. contingent world states, including outcome states of entities at various times, given actions by various entities;
c. utility valuation by an entity x of world states, to include states of x itself and of other entities (x may be an “actor” that can perform acts A that can affect outcomes, in which case utility valuation can include the cost of one’s act);
d. an entity x’s estimation of world state, including i) his estimation of his own and others’ capabilities and opportunities to carry out actions, ii) his estimation of others’ intentions to carry out actions, and iii) his estimation of probabilities of various outcomes, including estimated utility and costs;
e. an entity x’s estimation of an entity y’s state valuation and estimation.

State estimation under c, d, and e can be reflexive: the entity x may be “us”.
Fig. 1. High-level ontology for multi-level data fusion [5]
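The entity classes enumerated in items a–e lend themselves to an explicit data model. The sketch below is our own minimal illustration of one possible representation; all class and field names are hypothetical and are not defined in [5]:

```python
# Hypothetical data-model sketch of ontology items (a)-(e);
# class and field names are illustrative only, not from [5].
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class PhysicalState:                       # (a) actual world state of entity x
    kinematics: Tuple[float, float, float]
    capability: Dict[str, float]           # p(Cap(A(x))) per action type A
    opportunity: Dict[Tuple[str, str], float]  # p(Opp(A(x, y))) per (action, target)

@dataclass
class Valuation:                           # (c) utility valuation by an entity
    state_utility: Dict[str, float]        # U(S, x) per labeled world state S
    action_cost: Dict[str, float]          # cost to the actor of each act

@dataclass
class PerceptualState:                     # (d)/(e) an entity's estimate of the world
    world_estimate: Dict[str, PhysicalState]       # estimated states of entities
    intent_estimate: Dict[Tuple[str, str], float]  # estimated p(Int(y, z)) of others
    outcome_probability: Dict[str, float]          # (b) estimated p(S) of contingent states
```

Reflexive estimation (items c–e applied to “us”) then amounts to instantiating these structures with the host system itself as the estimating entity.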
2 Level 3 Data Fusion

Level 3 fusion processing estimates and projects courses of action of entities, their interactions, and outcomes; i.e., resultant states of entities in the scenario, to include “us”. Figure 2 depicts interactions (“attacks”) of threat entities and their targets (red arrows), CoAs by targets (green arrows), intercession by defensive forces to prevent or mitigate these attacks (blue arrows), and effects on and by collateral entities.

Event impacts can include the net utility of an outcome situational state to various entities, and particularly to “us” (the fusion system’s host). Utility can be viewed as a relationship U(S, x) between an entity or situation state S and an impacted entity x. Utility may be conditioned on x’s higher-level goals or “mission”. Similarly, cost is a relationship between an act and an affected entity (generally, the actor). The impact of
an event on an entity can be defined as the utility of the outcome over time, given that entity’s system of values. CoA planning by an actor x (to include “us”) can be modeled in terms of x’s estimation of probabilities of possible outcomes, the utility to x’s mission of each outcome given x’s CoA and the mission cost of the CoA.
Fig. 2. CoA interactions and outcomes
The level 3 fusion process can be used tactically in evaluating the effectiveness of our CoAs in managing outcomes: estimating CoAs – intentional and unintentional – of various actors, interactions, outcomes, and utility/cost to our mission. Most often, level 3 fusion is used to support forecasting or planning, by predicting CoAs, interactions, and outcomes of multiple actors. However, similar processes can be used retrospectively, in assessing past CoAs, events, and outcomes. Retrospective forensic analysis is typically performed for post-mission assessment of CoA effectiveness, and of supporting information acquisition, inferencing, planning, and response (e.g., battle damage assessment). Retrospective inferencing can also involve counterfactual estimation: “What would have been the outcome if we or our adversaries had chosen different CoAs?”.

Level 3 fusion can involve three distinct estimation processes. These are described in Sect. 3 and shown in Fig. 3 in the context of a dual-node data fusion/resource management architecture [1]:

• F3a: CoA Recognition - recognizing types of actions and courses of action: actors, their activity histories, intended and actual outcome states of the actors and affected entities. CoA recognition is generally a model-based process, much like level 1 target recognition and level 2 relationship/situation recognition.
• F3b: Event Prediction - projecting CoAs over time, estimating interactions and outcomes, including mission impact. Event prediction is analogous to individual state prediction in level 1 target tracking and to level 2 relationship prediction.
• F3c: Forensic Assessment - reconstructing CoAs from available evidence, for battle damage assessment, response assessment, process assessment, model assessment, etc. Forensic reconstruction of past scenarios can be used to explain current evidence by projecting causality backwards to past causes.
CoA estimation and projection can involve kinematic projection, conditioned by operational constraints, together with projected entity capabilities, vulnerabilities, and inter-relationships (e.g., opportunities for various types of interactions given entities’ capabilities and vulnerabilities), and inferred intent (goal structure and plan). Opportunity can include access to potential targets and their vulnerabilities, given protective and response capabilities.

Outputs of level 3 fusion are estimates of situational states over time, projected via models of entity and interactive dynamics. Impact estimation includes situational state predictions conditioned on our actual, planned, or provisional CoAs (e.g., cost/utility impacts on our mission or impacts on the effectiveness and costs of possible responses).
Fig. 3. Dual-node network realization of level 3 data fusion and management
Sophisticated CoA estimation can involve inferences concerning entities’ perceptual states: their state of awareness and belief concerning their operational environment, their goal structure, and their expectations concerning possible outcomes of alternative courses of action (Fig. 4). In other words, level 3 processing can involve estimating other sentient entities’ level 3 products. These estimated perceptual states can be used to infer probabilities that actors will attempt various actions.
3 DNN Implementation

An adaptive information exploitation system can be implemented as a dual-node network (DNN), in which each data fusion node (‘F’ in Fig. 3) is paired with a resource management node (‘M’) [9]. The DNN paradigm enables effective response to new and rapidly evolving tactical mission applications such as counter-drone:

• Agile response: each node pair can act as a quasi-autonomous agent that acquires and exploits data as needed to meet the pair’s evolving mission objectives;
• Synergistic information exploitation: a common ontology across the fusion levels (Fig. 1) enables synergistic, consistent inferencing of all information [1, 5];
Fig. 4. Physical and perceptual states
• Cost-effective development: standardized node structures across all fusion and management levels facilitate common design and software reuse [9].

The DNN architecture employs three types of level 3 fusion nodes, as shown in Fig. 3, for a) recognition, b) prediction, and c) reconstruction of courses of action, interactions, and outcomes.

3.1 CoA Recognition

Node F3a performs model-based recognition of actions and courses of action, combining level 1 and level 2 fusion products concerning

• individual types, attributes, and behaviors;
• group/coordination patterns;
• command/control/communications (C3), affiliations, and force structure relationships;
• capabilities, opportunities, and intentions of individuals and groups to conduct various kinds of actions against specific entities in the operational environment.

Recognition techniques similar to those used in level 1 target classification and in level 2 relationship and situation recognition may be used; e.g., syntactic graph matching, constructive belief nets, pattern recognition, reinforcement learning. Predictive CoA models, including our repertoire of responses, are usually developed pre-mission but may allow adaptive refinement in-mission, e.g., by machine learning, case-based reasoning, or other knowledge-based methods.

3.2 Event Prediction

Node F3b projects CoAs over time, estimates interactions and outcomes, including mission impact.
Event/outcome assessment involves inferring contingent states of entities for various time windows. Outcome estimation may include not only L0-3 physical states (signal, attributive, or relational states), but also the utility/cost of such states to actor entities [7]. Contingencies evaluated can include interaction with or influence by other L0-3 entities, to include conditioning by the ambient situation or scenario (a level 2 or level 3 entity) [8]. An important class of such outcomes is the effect on our state, on the state of our assets, or on our mission. In its most important role, level 3 data fusion estimates the mission impacts of the current and projected situation given a response plan.

Event prediction assesses the impacts of implemented or alternative response plans for various projection time windows. The level 3 projections are used to evaluate our response plans by estimating or predicting the effectiveness and cost of actions in affecting the distribution of outcomes. Bayesian cost analysis may be used in estimating outcomes and utility impacts of predicted, conditional, or counterfactual scenarios. Analysis can involve wargaming, to assess outcomes of our candidate courses of action as they interact with alternative courses of action of other actors (a simple illustration follows Sect. 3.3 below). The output is a distribution of probabilities and mission utility across outcomes over time and the cost-effectiveness of our CoAs in managing outcomes.

Multiple Hypothesis Testing (MHT) methods can be used for predicting scenarios conditioned on alternative CoAs, including our own. Evaluation is in terms of a) probability and utility distributions across possible world states at selected times during and following the CoA enactment and b) the estimated cost of the CoA (resource expenditure, opportunity cost, costs of collateral effects). Plan evaluation may predict the contribution of, and synergy among, individual resources (the response system’s platforms, sensors, weapons, and countermeasures) in executing the plan.

3.3 Forensic Assessment

Node F3c reconstructs CoAs from available evidence. Such etiologic analysis – i.e., reasoning from effect to cause – can be used to

• explain outcomes; e.g., assessing the effectiveness of our and others’ actions and of the contributing performance of sensing, modeling, processing, and responses;
• infer latent effects: estimated past causes can provide expectations for outcome factors that might not be observed directly (this is common practice in medical diagnosis, in which a diagnosis can prompt the physician to look for and treat hidden effects of a disease); and
• validate and refine predictive models for evaluating future events; e.g., providing truth data for supervised learning.

In tactical applications, forensic assessment can include evaluating the effectiveness of historical responses (e.g., battle damage assessment) and post-engagement intelligence analysis of actors’ actual capabilities, intentions, and opportunities (including the vulnerabilities of potential “targets” and expectations for mitigating and responsive actions), as well as their and our perceptions of these capabilities, intentions, and opportunities at the time. Counterfactual assessment may be performed similarly.
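To make the Bayesian cost analysis and wargaming of Sect. 3.2 concrete, the following minimal sketch (our own illustration; the plans, outcome hypotheses, probabilities, utilities, and costs are hypothetical, not the author’s) scores candidate response plans by expected mission utility minus plan cost:

```python
# Illustrative sketch (not from the paper): scoring candidate response
# plans by Bayesian expected utility over enumerated outcome hypotheses.
from dataclasses import dataclass

@dataclass
class OutcomeHypothesis:
    label: str          # e.g. "raid aborted", "raid succeeds"
    probability: float  # p(S | our CoA), from scenario projection
    utility: float      # mission utility U_m(S) of outcome state S

def expected_utility(hypotheses, coa_cost):
    """Expected mission utility of a response plan, minus its cost."""
    return sum(h.probability * h.utility for h in hypotheses) - coa_cost

# Hypothetical wargaming result for two candidate responses to a drone raid.
plans = {
    "jam_uplink": ([OutcomeHypothesis("raid aborted", 0.6, 100.0),
                    OutcomeHypothesis("raid degraded", 0.3, 40.0),
                    OutcomeHypothesis("raid succeeds", 0.1, -200.0)], 15.0),
    "kinetic_intercept": ([OutcomeHypothesis("raid aborted", 0.8, 100.0),
                           OutcomeHypothesis("raid succeeds", 0.2, -200.0)], 60.0),
}
best = max(plans, key=lambda name: expected_utility(*plans[name]))
print(best)  # plan with the highest expected net mission utility
```

In an MHT setting, each outcome hypothesis would correspond to a maintained scenario hypothesis conditioned on the candidate CoA, with probabilities updated as evidence accrues.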
4 Modeling Courses of Action

Data fusion systems are concerned with estimating various world states, often by assigning probabilities across alternative states. These can be states of patterns, individuals, relationships, situations, courses of action (CoA), or events, as they relate to state variables of concern to the various data fusion levels. Estimation can be concerned with current or past states or prediction of future states. Such state variables generally include objective states but, as noted above, can also include subjective states.

Of particular interest in recognizing and predicting CoAs is estimating the utility to various actors of various world states:

• In assessing an actor’s actions and in predicting his potential CoA, we would like to consider the utility to him of current and possible future states that may be affected by his actions;
• In assessing the effectiveness of our own current or previous actions, we need a model of mission utility;
• In assessing the utility (value) of possible future courses of action, we would like to consider the utility to our mission of possible future states as well as the probability of such outcomes given candidate courses of action.

Such evaluation will typically involve game-theoretic analysis, considering the interplay of our candidate CoA with those of other actors.

4.1 Probability of Action

It has been argued that Capability, Opportunity, and (for intentional actions) Intent – or the jurisprudent equivalents Means, Opportunity, and Motive – are mutually necessary and jointly sufficient conditions for an entity to perform an action [11]. If so, and if we are careful to define them so as to be mutually independent,

$$p(A(x, y)) = p(\mathrm{Cap}(A(x)))\, p(\mathrm{Opp}(A(x, y)))\, p(\mathrm{Int}(x, y)) \qquad (1)$$

where

$\mathrm{Cap}(A(x))$: x is capable of performing action A (i.e., having sufficient physical and cognitive attributes);
$\mathrm{Opp}(A(x, y))$: x has an opportunity to conduct A against y (i.e., within range, no impediments);
$\mathrm{Int}(x, y)$: x has an intent to conduct A against y (i.e., will attempt to, if capable and has opportunity).5
5 This probabilistic model is idealistic in that it assumes an actor to have a consistent system of values. Prospect theory, developed by behavioral economists Kahneman and Tversky, finds that people systematically value losses and gains differently (loss aversion), choose in favor of certainty, and exhibit systematic biases and inconsistencies in weighing evidence [10].
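For illustration (hypothetical numbers, not from the paper): if an actor is judged capable of an action with probability 0.8, to have the opportunity with probability 0.6, and to intend it with probability 0.5, Eq. (1) gives

$$p(A(x, y)) = 0.8 \times 0.6 \times 0.5 = 0.24.$$

The product form means that a near-zero assessment of any one factor drives the overall action probability toward zero, however high the other two factors are.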
4.2 CoA Utility Modeling The utility to entity z’s mission m of action A(x, y) can be stated in terms of Bayesian cost (the usual concern is the perspective of the actor himself; i.e., z = x): Umz (A(x, y)) = Umz (S)p(S|A(x, y)) − Cmz (A(x, y)); (2) S
where Umz (S) mission utility to z of world state S; p(S|A(x, y)) probability of world state S if x conducts action A against y; Cmz (A(x, y)) mission cost to z of action A(x, y). An actor x will perform action A(x, y) based on his estimate of the utility and cost to his mission, Uˆ mx (S), Cˆ mx (A(x, y)), and his estimation of outcome probabilities pˆ x (S|A(x, y)). This formulation can be used in predicting actions of others and in planning one’s own actions.6 In an idealized (i.e., impossible) fusion system, utility is evaluated in terms of outcome states of all entities (at any level, L0-L4) in the universe of discourse: (3) Uˆ x (S) = Uˆ mx (y = Y |S); y∈
where ‘(y = Y | S)’ means ‘entity y is in state Y in situation S’. In a practical system, evaluation can be restricted to entities that may be relevant to the actor’s mission; i.e., entities whose states may have some effect on $\hat{U}^x(S)$ and may be significantly affected by the action.

4.3 Evaluating Response Effectiveness

Level 3 data fusion involves estimating and projecting

• the likelihood of actors’ (e.g., adversaries’) CoAs using Eq. (1), with estimated intent using Eqs. (2)–(3), where x is the actor and $U_m^x$, $C_m^x$ derive from his value scheme in the given mission;
• the utility of responsive CoAs using Eqs. (2), (3), employing our own mission value scheme.

Tactical responses often have the purpose of reducing the effectiveness of adversary actions; i.e., of reducing the degradation of our mission utility:

$$U_m(B(z, x)) = \sum_{S} \Bigl[\, U_m(S) \sum_{y} \sum_{A(x, y)} p(S \mid A(x, y), B(z, x))\, p(A(x, y)) \Bigr] - C_m(B(z, x)).$$

Such responses may be planned with the intent of inducing changes in adversaries’ action capabilities, opportunities, or intentions, per (2) and (3).

6 Mission cost can include such factors as resource depletion, missed opportunities, susceptibility to countermeasures, financial or diplomatic consequences of collateral effects, etc.
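As a concrete illustration of the bookkeeping in Eqs. (1)–(2) and the response-utility expression above, the sketch below evaluates a hypothetical two-state, two-action engagement; all function names, state labels, and numbers are our own assumptions, not the author’s implementation:

```python
# Illustrative sketch of Eqs. (1)-(2) and the response-utility expression;
# all names and numbers are hypothetical, not from the paper.

def p_action(p_cap, p_opp, p_int):
    # Eq. (1): capability, opportunity, and intent treated as independent.
    return p_cap * p_opp * p_int

def action_utility(outcomes, cost):
    # Eq. (2): expected utility over outcome states S, minus action cost.
    # outcomes: list of (U_m(S), p(S | A)) pairs.
    return sum(u * p for u, p in outcomes) - cost

def response_utility(states, actions, utility, p_state, cost_b):
    # Sect. 4.3: for response B, sum over states S of U_m(S) times the sum
    # over adversary actions A of p(S | A, B) p(A), minus the cost of B.
    total = 0.0
    for s in states:
        total += utility[s] * sum(p_state[(s, a)] * p_a
                                  for a, p_a in actions.items())
    return total - cost_b

states = ["asset_intact", "asset_damaged"]
utility = {"asset_intact": 100.0, "asset_damaged": -150.0}
actions = {"strike": p_action(0.9, 0.7, 0.5), "feint": 0.2}   # p(A(x, y))
p_state = {("asset_intact", "strike"): 0.7, ("asset_damaged", "strike"): 0.3,
           ("asset_intact", "feint"): 0.95, ("asset_damaged", "feint"): 0.05}
print(response_utility(states, actions, utility, p_state, cost_b=20.0))
```

A response B that lowers the adversary’s p(Cap), p(Opp), or p(Int) shows up here as a reduced p(A), which in turn shifts probability mass toward higher-utility outcome states.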
References

1. Steinberg, A., Bowman, C.: Synergy in multi-level reasoning. In: Proceedings of the 12th IEEE Conference on Cognitive and Computational Aspects of Situation Management (CogSIMA), Salerno (2022)
2. Kessler, O.: Functional description of the data fusion process. Technical report for the U.S. Office of Naval Technology, Data Fusion Development Strategy, NADC (1991)
3. Steinberg, A., Bowman, C., White, F.: Revisions to the JDL model. In: Proceedings of the Joint NATO/IRIS Conference, Quebec (1998)
4. Steinberg, A., Bowman, C., Blasch, E., White, F., Llinas, J.: Evolution of the JDL model. Perspect. Mag. 8(1) (2023, forthcoming)
5. Steinberg, A.: An ontology for multi-level data fusion. In: Proceedings of the 25th International Conference on Information Fusion, Linköping (2022)
6. Lambert, D.: A unification of sensor and higher-level fusion. In: Proceedings of the Ninth International Conference on Information Fusion, Florence (2006)
7. Steinberg, A., Snidaro, L.: Levels? In: Proceedings of the 18th International Conference on Information Fusion, Washington (2015)
8. Steinberg, A.: Situations and contexts. Perspect. Mag. 1(1), 16–24 (2016)
9. Bowman, C.: Process assessment and process management for intelligent data fusion & resource management systems. In: Proceedings of AIAA Space 2012, Pasadena, CA (2012)
10. Kahneman, D.: Thinking, Fast and Slow. Farrar, Straus and Giroux, New York (2011)
11. Little, E., Rogova, G.: An ontological analysis of threat and vulnerability. In: Proceedings of the Ninth International Conference on Information Fusion, Florence (2006)
Image Classification Using Contrastive Language-Image Pre-training: Application to Aerial Views of Power Line Infrastructures

Adrián Losada, Ana M. Bernardos, and Juan Besada

Information Processing and Telecommunications Center, Universidad Politécnica de Madrid, ETSI Telecomunicación, 28040 Madrid, Spain
[email protected]
Abstract. This article evaluates the use of CLIP, a contrastive language-image pre-training methodology, for analyzing aerial images of power line infrastructures. Companies record videos using drones and helicopters to assess the health status of the infrastructures, resulting in hours of unlabeled video. This study proposes a semi-supervised approach that combines natural language processing and image understanding to learn a common representation of images and text. A small set of images, labeled based on criteria such as transmission tower type, camera angle view, and background, was used to fine-tune CLIP for generating domain-specific embeddings. Results show that this approach achieved an F1 score of over 96% for detecting transmission towers, which could be used to automatically classify unlabeled aerial images as the first step in maintenance data pipelines for predictive detection of anomalies in components, presence of nests or plants, etc.

Keywords: contrastive language-image pre-training · image classification · aerial images · power line infrastructures · predictive maintenance
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
P. García Bringas et al. (Eds.): SOCO 2023, LNNS 750, pp. 13–23, 2023. https://doi.org/10.1007/978-3-031-42536-3_2

1 Introduction

Predictive maintenance of power line infrastructure is a key task for electricity suppliers. Any kind of power outage could cause huge losses, so electricity companies must prevent problems even before they occur [1]. Traditionally, these inspections have been done directly by human visual surveys, but aerial means (such as helicopters or unmanned aerial vehicles) are more effective methods to retrieve significant data. But this means processing tons of raw video. Manual labeling is time-demanding and expensive, thus it is essential to find a way of identifying the image contents using little information. In this direction, this paper provides some insights with respect to the application of Contrastive Language-Image Pre-training (CLIP) for the purpose. CLIP is a method proposed by OpenAI in 2021 designed for zero-shot learning, an approach which intends to produce a model that does not need task-specific training to yield sound image-text pairs [2]. CLIP simultaneously trains both an image encoder and a text encoder to predict the correct pairings of an image and its caption, using the product of the image and text encoded matrices as the target value for prediction. This way, it is possible to insert a new image
to the trained model together with a set of possible captions and, after tokenizing them, the system returns the most appropriate text for the image. Thus, the main idea behind CLIP is to learn visual and textual representations that can be used for a wide range of tasks without the need for fine-tuning. OpenAI has not disclosed the training data for its models, so the models may underperform on domain-specific problems. Our purpose is to apply this model to the specific domain of power line infrastructure classification, to automatically label aerial images according to the type of transmission tower (TT), camera angle view and background. We will first evaluate the performance of off-the-shelf CLIP and then fine-tune the model with selected domain-specific knowledge to improve results. The paper is structured as follows. Section 2 summarizes the state of the art with respect to the use of CLIP in different domains. Section 3 describes the methodology to apply CLIP to a specific dataset. Results and conclusions for power line infrastructures are given in Sects. 4 and 5.
2 Related Work

Contrastive Language-Image Pre-training (CLIP) is an image label predictor which "jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples" [2]. While other methods train an image feature extractor to predict some label, this model focuses on the relation between image and text by training the encoders, which are later able to perform a zero-shot classification by embedding the target image and the set of possible labels (Figs. 1 and 2).
Fig. 1. CLIP training and testing method [2]
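To make the zero-shot procedure of Fig. 1 concrete, the following minimal sketch uses the open-source CLIP package released with [2]; the file name and caption strings are illustrative assumptions, not part of the original work:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate captions play the role of class labels (illustrative strings)
captions = ["a power line picture of a lattice tower",
            "a power line picture of a tubular pole",
            "a power line picture of a wooden pole"]

image = preprocess(Image.open("tower.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)   # image-text similarity scores
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

print(captions[probs.argmax()])  # most appropriate caption for the image
```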
CLIP focuses on performing visual classification by simply providing the possible categories for the input images. However, the model is not prepared for the detection of multiple classes in a single image, and images must have low or zero background noise. Furthermore, CLIP was trained using 400M image-text pairs collected from the internet, covering general domains such as animals or daily items [3]. The variety of classes we are considering in this domain is not massive, but they may be present over different backgrounds, as TTs can be deployed over urban or rural areas. This technique is compared to more traditional algorithms in [4], where the authors test CLIP and YOLO on fire and gun detection in real-time video for security purposes.
Although CLIP is not designed for object detection, it is able to outperform YOLO in both tasks, achieving an F-measure of 99.7% versus 85.5% for the former and 97.3% versus 96.5% for the latter. Since CLIP does not require the extensive labelling work required by traditional object detection algorithms, it enables rapid technology adaptation simply by defining the relevant semantics of interest. The literature on applying CLIP to real-world problems is not extensive. To the best of our knowledge, there is no documented previous experience for power line infrastructures. Nevertheless, CLIP has been used, e.g., for the diagnosis of radiological images [3]. The authors adapt CLIP pre-training with new datasets that include unpaired image and text collections. The loss function is also updated: instead of using the cosine similarity between image and text, a soft semantic matching loss is used, which adds human knowledge about the problem domain to avoid false-negative results. In [5], the authors create clinical reports from radiology images using two CLIP-based models. These two models predict a single report or a combination of report sentences, providing one image encoder and two text encoders, depending on the input type. To evaluate the model, the resulting reports are passed through the original dataset labeler, and these predictions are compared against the original labels, obtaining an F1 score of 0.310 versus 0.294 for the single reports. Vehicle identification is addressed in [6] using natural language queries. Unlike other problems, there are three instances for every entry: three frames of the same item and three queries describing those frames. Information from all frames and queries is extracted and averaged, resulting in a single feature matrix describing the three frames and another one combining the information of the three queries. The model is fine-tuned using the feature extractor and evaluated with the cosine similarity. Results are evaluated using the mean reciprocal rank (MRR, a measure to evaluate systems that return a ranked list of answers to queries), obtaining the top-8 position in the Nvidia AI City Challenge.
3 Methodology

3.1 Data Selection and Processing

In this study, we will address the classification of different types of transmission towers (TT) using aerial images taken from different camera angle views over different backgrounds. For this purpose, we will choose TTPLA [1], a dataset which classifies three types of TT according to their shape, i.e.: a) lattice towers: big assembled steel structures used for high-voltage energy transmission; b) tubular poles: cylindrical structures made of steel or concrete for medium voltage; and c) wooden poles: wooden structures similar to the tubular poles but used for low-voltage energy transmission. As this dataset focuses not only on TT but also on power lines, we will simplify it by removing non-relevant pictures. We aim to work with a small dataset, as we will have to label it manually. Thus, all power line pictures that do not contain a TT will be discarded, along with the images containing two or more different types of TT and those that cannot be well identified, such as images with distant towers or tiny portions of them. A final dataset of 440 curated images is used for our purposes.
Fig. 2. From left to right: a lattice tower, 4 tubular poles and 2 types of wooden poles [7]
3.2 Data Labeling and Caption Generation

We have relabeled our curated dataset according to three different classes. The first one, camera angle view (AV), refers to the camera angle with respect to the TT, and its values can be a) high, b) top (a high-angle view taken exactly above the item), c) ground, d) frontal or e) lateral angles. Secondly, the transmission tower type (TT) may be i) tubular pole, ii) wooden pole or iii) lattice tower. Lastly, the image background (BG) can involve 1) road, 2) village, 3) natural, 4) clear sky or 5) cloudy sky. After labeling the dataset, we will generate image captions for CLIP model training. To simplify the process, we will combine classes to create different sets of captions. The first set will only include the object to classify, the TT; the second will include the TT with the angle view; and the last will include the TT, AV, and background. Additionally, we will use adjectives to describe the AV and BG classes, followed by the nouns "view" and "background", respectively. To enhance sentence meaning, we will start each caption with "a power line picture of", as suggested in [7]. By doing so, we expect to improve accuracy even when using the same key information.

Table 1. Possible class combinations for the generation of image captions.

Set  | Classes      | Caption format
Item | TT           | A power line picture of a wooden pole
View | TT & AV      | A power line picture of a lattice tower from a high angle view
Full | TT & AV & BG | A power line picture of a tubular pole from a top view with a road background
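As an illustration, the three caption sets in Table 1 can be produced with a small helper function (the name and signature are hypothetical):

```python
def make_caption(tt, av=None, bg=None):
    """Builds an Item, View or Full caption depending on the labels supplied."""
    caption = f"a power line picture of a {tt}"
    if av is not None:
        caption += f" from a {av} view"        # e.g., "high angle", "top"
    if bg is not None:
        caption += f" with a {bg} background"  # e.g., "road", "clear sky"
    return caption

# make_caption("tubular pole", "top", "road")
# -> "a power line picture of a tubular pole from a top view with a road background"
```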
3.3 CLIP Classifier Selection Methodology

Original CLIP models are trained using ImageNet, a generic dataset organized according to the WordNet lexical database, which groups different English words into sets of cognitive synonyms. This means that, even if the existing models have been trained using more than 80,000 of these sets, they are not specialized in the domain words our
dataset could use. Thus, we will carry out the following testing steps: 0) compare a zero-shot classifier (using one of the original CLIP models) against a few-shot trained classifier; if the comparison shows that the few-shot classifier gets better results than the zero-shot one, we will proceed with the fine-tuning, considering 1) selection of the image caption format, 2) selection of the image color space among RGB, HSV and black and white, 3) selection of the CLIP base model among different configurations of Vision Transformer and ResNet, and 4) adjustment of the learning rate, weight decay and batch size parameters. Some detail on these steps is given below: 0) Zero-shot vs. few-shot classifier. We will first test the performance of the zero-shot model against a fine-tuned (but not optimized) version. For this, we will choose a model and compare its results with and without fine-tuning. This process will only use the full caption set, which will be divided into three subsets: one for testing, used for both models, and two for training and validation, used only for the fine-tuned one. 1) Selection of caption set format. First, we aim to determine which set of captions is most suitable for addressing the current problem. To achieve this, we will fine-tune an existing model using each set of captions along with their corresponding images. To evaluate whether training a model with additional information is beneficial, we will test each resulting model against its own set of captions as well as against those contained in it; e.g., the model fine-tuned with the TT & AV captions will be tested against the caption sets TT and TT & AV. The tests will be organized according to Tables 1 and 2.
Table 2. Test distribution between the sets of captions used for fine-tuning (FT) and testing

Model / Test      | TT | TT & AV | TT & AV & BG
FT - TT           | X  |         |
FT - TT & AV      | X  | X       |
FT - TT & AV & BG | X  | X       | X
2) Selection of color spaces. The Python Image Library (PIL) provides different ways of representing an image according to the number of bands of data and the depth and type of the image pixels. Depending on the mode, the information received by the model may emphasize different features, so we will need to test which one performs better when fine-tuning the CLIP models. We will test three different color modes, whose representations differ in the number and purpose of the data bands. Pixels will always be 8-bit, with values ranging from 0 to 255. RGB represents true color and has three bands indicating the values of red, green and blue. HSV also has three bands, indicating hue, saturation and value in color space. L represents black and white, with a single data band [8]. 3) Base model to fine-tune. After choosing the most appropriate information for the current problem domain, we will need to select the most suitable base model for CLIP fine-tuning. For this, we will create a new set of tests which will try the different base models using the already selected data, that is, the caption set and the color space.
The CLIP GitHub repository [9] offers nine base models, which can be grouped into two variants. The first one is based on the Vision Transformer (ViT) architecture, whose size depends on the type: ViT-B(ase) with 12 layers and ViT-L(arge) with 24. The number next to the model's name indicates the input patch size and, unless specified, the images have a size of 224x224 pixels [10]. The second group consists of five ResNet-based models and, as [2] explains, "the first two models follow ResNet-50 and ResNet-101 and we use EfficientNet-style scaling for the next three models which simultaneously scale the model width, the number of layers and the input resolution to obtain models with roughly 4x, 16x and 64x computation". 4) Parameters adjustment. The final step of the CLIP fine-tuning will be the adjustment of the training hyperparameters: learning rate (1e−7, 1e−6, 1e−5, 1e−4, 1e−3), weight decay (0, 1e−3, 1e−2, 1e−1, 1.5e−1, 2e−1) and batch size (4, 8, 16). Besides, as we are modifying the weight decay, which directly affects the learning speed, we will increase the maximum number of epochs, adjusting it later to the best hyperparameter values.
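For reference, a minimal sketch of such a fine-tuning loop, based on the open-source CLIP package [9]; the symmetric cross-entropy loss follows the training objective of [2], while the data pairs, color-mode handling and hyperparameter values shown here are illustrative assumptions:

```python
import clip
import torch
import torch.nn.functional as F
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
model.float()  # full precision is safer when fine-tuning

# pairs: hypothetical list of (image_path, caption) built as in Table 1
def batches(pairs, batch_size=16):
    for i in range(0, len(pairs), batch_size):
        chunk = pairs[i:i + batch_size]
        # Color-mode conversion as in step 2; note CLIP's standard preprocess
        # casts images back to RGB, so it may need adjusting for HSV to take effect.
        imgs = torch.stack([preprocess(Image.open(p).convert("HSV")) for p, _ in chunk])
        txts = clip.tokenize([c for _, c in chunk])
        yield imgs.to(device), txts.to(device)

opt = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=0.1)
for images, texts in batches(pairs):
    logits_per_image, logits_per_text = model(images, texts)
    labels = torch.arange(len(images), device=device)  # matching pairs on the diagonal
    loss = (F.cross_entropy(logits_per_image, labels)
            + F.cross_entropy(logits_per_text, labels)) / 2
    opt.zero_grad(); loss.backward(); opt.step()
```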
4 Results and Discussion

The steps in the previous methodology should result in a custom CLIP model specialized in the problem domain. To obtain the best possible evaluation, we have divided our dataset into three subsets: train (70%), validation (10%) and test (20%), generated with a fixed seed so that all executed tests use the same data. Furthermore, the test results are discussed by focusing mainly on the transmission tower classes separately; that is, we prioritize TT results over the AV and BG classes, and we only consider the classes independently: wooden pole, tubular pole and lattice tower. We do not vary all parameters simultaneously; Table 3 shows the default values, which are updated with the best results after every test.

Table 3. Base parameter values during fine-tuning

Base model | Color space | Batch size | Epochs | Learning rate | Weight decay
ViT-B/32   | RGB         | 8          | 30     | 1e-5          | 0
The CLIP method calculates the cosine similarity between the target image and all the possible captions, normalizing these values and returning the top n most similar queries. To evaluate these results, we obtain, for every class, the occurrences (the number of times it appears among the top queries) and the highest cosine similarity among those occurrences. The class with the highest value for both similarity and occurrences is added to the confusion matrix as a hit; if either of them is tied between two or more classes, the confusion matrix is filled in one of two ways. First, if the real class is one of the tied ones, it is the one added to the matrix; if none of the tied classes is the real one, the miss is assigned to the class with the highest similarity if the occurrences were tied, or vice versa. To add extra information, the confusion matrix has one extra row per tied case, providing more context about the cases where the model hit the target in a single shot or hesitated between two or more classes. For example, as stated
in Table 4, the wooden pole class was hit 23 times, of which 19 were obtained from a tie between tubular and wooden pole, 1 from a tie between lattice tower and wooden pole, and the remaining 3 were directly predicted. We then export two possible results: the first one, standard, counting all kinds of hits, and the second one counting only direct predictions, without ties (e.g., Table 5). We then consider the top 3 F1 scores, prioritizing the standard results (standard F1) over the direct ones (direct F1) and following this class order: transmission tower, angle view and background.

Table 4. Confusion matrix for TT results with tied classes

                             | Wooden pole | Tubular pole | Lattice tower
wooden pole                  | 23.0        | 0.0          | 0.0
tubular pole                 | 2.0         | 22.0         | 0.0
lattice tower                | 1.0         | 0.0          | 40.0
tubular pole - wooden pole   | 19.0        | 12.0         | 0.0
lattice tower - tubular pole | 0.0         | 2.0          | 1.0
lattice tower - wooden pole  | 1.0         | 0.0          | 0.0
Table 5. Metrics for the previous TT results

              | Precision | Recall | F1 Score | Accuracy
wooden pole   | 0.884     | 1.0    | 0.939    | 0.966
tubular pole  | 1.0       | 0.917  | 0.956    | 0.977
lattice tower | 1.0       | 0.976  | 0.988    | 0.989
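For clarity, the tie-aware decision rule described above can be sketched as follows; variable names are hypothetical, and the final tie resolution (favoring the real class, or similarity vs. occurrences) is simplified:

```python
import numpy as np
from collections import defaultdict

def predict_with_ties(similarities, caption_classes, top_n=5):
    """similarities: normalized cosine similarities between one image and every caption.
    caption_classes: the TT class each caption refers to. Returns the winning
    class(es); more than one element means a tie, handled as described in the text."""
    top = np.argsort(similarities)[::-1][:top_n]
    occurrences, best_sim = defaultdict(int), defaultdict(float)
    for i in top:
        c = caption_classes[i]
        occurrences[c] += 1                              # count appearances in top n
        best_sim[c] = max(best_sim[c], similarities[i])  # highest similarity per class
    max_occ = max(occurrences.values())
    max_sim = max(best_sim.values())
    winners = [c for c in occurrences
               if occurrences[c] == max_occ and best_sim[c] == max_sim]
    # If no class maximizes both criteria, fall back to the highest similarity
    return winners or [max(best_sim, key=best_sim.get)]
```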
4.1 Zero-Shot vs. Few-Shot Classifier

A standard CLIP model is compared with its fine-tuned version to check whether the few-shot classifier gives better results thanks to its additional domain knowledge. The model is fine-tuned using the full caption set with all possible classes, which is also used to test both versions. Table 6 shows that the fine-tuned model performs roughly twice as well as the base model, so we continue with the remaining fine-tuning steps.

4.2 CLIP Fine-Tuning

As the few-shot classifier performs better than the zero-shot one, we proceed with the remaining fine-tuning steps presented in Sect. 3.3. In the following, we determine the best possible configuration for the CLIP model.
Table 6. Standard F1 score for TT in classifier comparison

Model      | Wooden pole | Tubular pole | Lattice tower
Base       | 0.409       | 0.491        | 0.861
Fine-Tuned | 0.906       | 0.900        | 0.964
Captions Set Selection. We test different fine-tuned (FT) models using the caption sets described in Sect. 3.2, varying also the caption set used during testing, so we can check whether training the model with more information than needed for the test is worthwhile. Table 7 shows that the best F1 score is obtained by the FT model using the view captions (transmission tower + angle view) and tested on this same caption set.

Table 7. Standard F1 score for TT in caption set testing

Model   | Caption set test | Wooden pole | Tubular pole | Lattice tower
FT-Item | Item             | 0.920       | 0.955        | 0.976
FT-View | Item             | 0.898       | 0.933        | 0.976
FT-View | View             | 0.960       | 0.978        | 0.988
FT-Full | Item             | 0.920       | 0.927        | 0.941
FT-Full | View             | 0.906       | 0.872        | 0.952
FT-Full | Full             | 0.906       | 0.900        | 0.964
Color Space Selection. The next test step checks which is the best color space for fine-tuning the model among RGB, HSV and black and white (L). For this, we use the default configuration together with the results from the previous test; that is, the model is fine-tuned and tested using the view caption set and the parameters obtained in Table 7. The standard prediction for RGB and HSV provides the same result for the TT classes, with averages of 0.961 ± 0.025 and 0.961 ± 0.024 respectively, so results are evaluated on the TT direct predictions, where the HSV score is clearly superior to the RGB one (Table 8).

Table 8. Direct F1 score for TT in color space testing

Model   | Color space | Wooden pole | Tubular pole | Lattice tower
FT-View | RGB         | 0.667       | 0.889        | 0.987
FT-View | L           | 0.882       | 0.875        | 0.975
FT-View | HSV         | 0.979       | 1.000        | 0.988

CLIP Model Base. CLIP provides different base models, based on ResNet or Vision Transformers. We have tested which model provides better results after fine-tuning, except for RN50x64 and ViT-L/14@336px, which could not be tested due to hardware
limitations. The calculated predictions for TT yielded similar results across models, obtaining for standard predictions a maximum average F1 score of 0.976 ± 0.013 and a minimum average of 0.948 ± 0.048, and for direct ones a maximum of 0.976 ± 0.002 and a minimum of 0.928 ± 0.054. Table 9 compiles the top 3 results for the AV classes, where every ViT model obtains better results than the ResNet ones. Despite similar results within the Vision Transformer architecture, ViT-B/32 achieves better performance, so we establish it as the baseline model.

Table 9. Standard F1 score for AV in CLIP base model testing

Model       | Base CLIP | Top   | High angle | Frontal | Ground | Lateral
FT-View-HSV | ViT-B/16  | 0.800 | 0.917      | 0.600   | 0.923  | 0.857
FT-View-HSV | ViT-B/32  | 0.923 | 0.974      | 0.667   | 0.800  | 0.857
FT-View-HSV | ViT-L/14  | 0.880 | 0.911      | 0.667   | 0.750  | 0.706
Parameters Adjustment. The hyperparameters affecting model training are the learning rate, weight decay and batch size. We have tested them sequentially, using the best result for one parameter to test the following one. Besides, we have increased the number of epochs to 200 due to the slower learning caused by the changes in learning rate and weight decay. Table 10 gathers the top 3 F1 scores for the learning rate, weight decay and batch size tests. For the learning rate and weight decay, the best results are obtained with the values 1e−5 and 0.1, respectively. However, for batch size testing, the F1 scores for the TT classes are similar for standard predictions with batch sizes 8 and 16, with average F1 scores of, respectively, 0.952 ± 0.013 and 0.961 ± 0.025, so results have finally been evaluated using the direct prediction for TT, the best score being obtained with a batch size of 16.

Table 10. Standard F1 score for TT in learning rate, weight decay and batch size testing

              | Model             | Hyperparameter | Wooden pole | Tubular pole | Lattice tower
Learning rate | FT-V-HSV-ViT32    | 1.00E-03       | 0.754       | 1.000        | 0.730
              | FT-V-HSV-ViT32    | 1.00E-05       | 0.917       | 0.936        | 0.988
              | FT-V-HSV-ViT32    | 1.00E-06       | 0.870       | 0.863        | 0.937
Weight decay  | FT-V-HSV-ViT32    | 0.01           | 0.960       | 0.978        | 0.988
              | FT-V-HSV-ViT32    | 0.1            | 0.980       | 1.000        | 0.988
              | FT-V-HSV-ViT32    | 0.15           | 0.962       | 0.977        | 0.988
Batch size    | FT-View-HSV-ViT32 | 4              | 0.939       | 0.919        | 0.952
              | FT-View-HSV-ViT32 | 8              | 0.930       | 0.955        | 0.964
              | FT-View-HSV-ViT32 | 16             | 0.941       | 0.971        | 0.987
4.3 Final Model

The CLIP fine-tuning resulted in a final model that uses HSV on ViT-B/32 with a learning rate of 1e-5, a weight decay of 0.1, and a batch size of 16, using the View caption set for training and testing. The system achieves the best results for the transmission tower classes, with an average F1 score above 0.96. However, for the angle view and combined caption classes, the standard prediction yields higher values than the direct prediction. This could be due to the difficulty of differentiating between the different angle views of the images. Table 11 summarizes the F1 scores for the different fields studied.

Table 11. Average F1 score for TT and AV classes and the view caption set

                    | Standard prediction | Direct prediction
Transmission Towers | 0.961 ± 0.025       | 0.966 ± 0.023
Angle View          | 0.859 ± 0.076       | 0.660 ± 0.383
View Caption set    | 0.424 ± 0.392       | 0.341 ± 0.441
5 Conclusions

The CLIP fine-tuning resulted in a model capable of detecting transmission towers and angle views with average F1 scores of 0.961 and 0.859, respectively. While the results may not be as good as those of standard computer vision models, the model can be used for massive image labeling on videos and as part of a larger detection system. This would allow the system to easily discard images without relevant data, reducing computation and time resources. Future work will involve adding other infrastructure components, such as dampers and insulators, and integrating the embedding strategy into real-time video labelling using UAVs.

Acknowledgements. The authors acknowledge the funding under grants AI4TES and PDC2021-121567-C21 funded by the Spanish Ministry of Economic Affairs and Digital Transformation and MCIN/AEI/10.13039/501100011033/, respectively, and by EU Next GenerationEU.
References
1. Abdelfattah, R., Wang, X., Wang, S.: TTPLA: an aerial-image dataset for detection and segmentation of transmission towers and power lines. In: Proceedings of the Asian Conference on Computer Vision (2020)
2. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of the International Conference on ML, pp. 8748–8763 (2021)
3. Wang, Z., Wu, Z., Agarwal, D., Sun, J.: MedCLIP: contrastive learning from unpaired medical images and text. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3876–3887 (2022)
4. Deng, Y., Campbell, R., Kumar, P.: Fire and gun detection based on semantic embeddings. In: IEEE International Conference on Multimedia (2022)
5. Endo, M., Krishnan, R., Krishna, V., Ng, A.Y., Rajpurkar, P.: Retrieval-based chest x-ray report generation using a pre-trained contrastive language-image model. In: Proceedings of ML Research, vol. 158, pp. 209–219. PMLR (2021)
6. Khorramshahi, P., Rambhatla, S.S., Chellappa, R.: Towards accurate visual and natural language-based vehicle retrieval systems. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 4183–4192 (2021)
7. Different types of transmission towers, Electrical Engineering Pics (2014)
8. Pillow (PIL Fork), PIL Documentation - Concepts, Pillow (PIL Fork) 9.4.0. https://pillow.readthedocs.io/en/stable/handbook/concepts.html#concept-modes. Accessed Mar 2023
9. OpenAI, CLIP (Contrastive Language-Image Pretraining), GitHub, Jan. 05, 2021. https://github.com/openai/CLIP. Accessed Mar 2023
10. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. https://github.com/. Accessed Mar 2023
A Realistic UAS Traffic Generation Tool to Evaluate and Optimize U-Space Airspace Capacity

Daniel Raposo, David Carramiñana(B), Juan Besada, and Ana Bernardos

Information Processing and Telecommunications Center, Universidad Politécnica de Madrid, Madrid, Spain
[email protected]
Abstract. This paper presents a comprehensive Unmanned Aircraft System (UAS) traffic generation tool designed to create realistic and complex drone traffic scenarios for various operational use cases. The tool's primary features include the ability to simulate a diverse range of flight patterns, such as surveillance, cargo transportation, emergency response, and personal aerial mobility. The tool can accurately model UAS traffic with detailed flight plan information, including waypoints, polygonal areas, and temporal restrictions. Additionally, the paper demonstrates the potential for integrating the tool with UAS Traffic Management (UTM) systems, providing a testbed for assessing strategic and tactical conflict detection and resolution. By using the tool as a ground truth reference, the performance of the UTM system can be evaluated, ensuring its effectiveness in maintaining safe and efficient airspace operations. Overall, this UAS traffic generation tool offers valuable insights for researchers, industry stakeholders, and policymakers, contributing to the development of robust and efficient UTM systems and promoting the safe integration of drones into shared airspace.

Keywords: UAS · flights · drone traffic · simulation · U-Space · UTM
1 Introduction

Unmanned Aircraft System (UAS) traffic is anticipated to increase significantly in the coming years. In particular, a high density of operations is foreseen in urban environments to provide a myriad of services (e.g., law enforcement, parcel delivery, people mobility) [1]. Consequently, regulations and technological advancements are required to ensure the safe operation of drones, considering both air-to-air and air-to-ground risks. In this regard, Unmanned Traffic Management (UTM) systems (i.e., U-Space in Europe) are already providing tools to guarantee safety by enforcing and monitoring UAS separation prior to (i.e., strategic phase) and during (i.e., tactical phase) flights [2]. However, the success of UTM systems relies on the appropriate characterization of a set of parameters related to the minimum separation between aircraft and the maximum capacity of a given airspace. These parameters must be estimated in advance through a context-dependent, comprehensive analysis that considers the desired safety level, the
expected types of operations, and airspace restrictions. In this context, simulation-aided statistical analysis can be performed to derive the necessary parameters. The validity of this simulation-driven approach depends on the existence of a realistic UAS operations simulation framework. The cornerstone of such a framework is the ability to generate drone traffic (simultaneously, easily, and randomly generated) that mimics the typical types of operations and missions that occur in real life. Therefore, this paper paves the way towards a reliable and holistic simulation framework by proposing a traffic generation tool for UAS operations. In doing so, the paper explores current UAS missions and proposes a set of well-defined, statistically parametrized traffic patterns. Thereby, complex traffic scenarios can be easily defined as an overlay of randomly generated flights derived from a set of user-selected flight patterns (Sect. 2). Additionally, the paper discusses how the proposed generation tool can be integrated into an airspace capacity analysis and optimization process, which may also include integration with a UTM system (Sect. 4). To facilitate this integration, the proposed tool already considers the definition of flights with varying levels of detail. Currently, numerous UAS flight simulation examples can be found in the literature. Among the various drone simulation tools developed recently [3], commercial tools such as X-plane, Flight Gear, JMavSim, or Microsoft AirSim enable the simulation of a single drone flight in a three-dimensional environment. Other tools also include models of the communication links needed for drone flight [4] or consider traffic distribution [5]. Of particular interest are those tools that interact with UTM systems (rather than being self-contained), such as the one that emerged from NASA's UTM systems research [6]. However, it is uncommon to find projects that concentrate on generating realistic drone traffic with the expected requirements for airspace capacity analysis (e.g., mission-dependent definition, random generation). In the main UTM system simulators, such as the proposal in [7], traffic generation is performed manually by a user selecting waypoints on a map to form the trajectory. Others, such as the aforementioned NASA development, also allow manual control in real time to introduce contingencies in the trajectories. Furthermore, the few existing developments focused on traffic generation are primarily based on random generation, considering only parameters such as aircraft density or flight area, or on a very specific generation for a very detailed scenario, as in [8]. Overall, existing proposals are not able to easily generate flights fulfilling different tasks and therefore to evaluate problems among them. As a result, parameters such as the type of mission (which greatly determines the flight path of a drone) are not taken into account [9]. Additionally, the generation of predefined flights by well-known services/applications such as DJI or PIX4D is quite limited; essentially, they offer the ability to define waypoints (by generating polygons), circular missions, or "free" missions. Summing up, the literature reveals that no tool exists that allows the easy definition of drone traffic based on a set of mission-related traffic patterns, such as the one proposed in this paper.
2 Realistic UAS Traffic Generation and Specification

As previously mentioned, the proposed tool, based on a set of user-provided input parameters, allows the generation and statistical specification of a very high volume of realistic UAS traffic in a very simple way. This is clearly distinguishable from the approach found in the literature, which consists of the laborious task of specifying each of the waypoints for every flight. When referring to UAS traffic generation and specification, two distinct sequential processes are implied:

1. Traffic generation refers to a high-level description of the flight from a strategic perspective, denoted as the flight plan. This is a definition of the airspace expected to be utilized by the operation; it is determined by the geometric primitives supported by the UTM system carrying out the strategic evaluation process. Typically, this high-level description consists of a succession of waypoints and polygons, each with a temporal window of operation. The waypoints are geographic points that mark the specific trajectory of the drone, while the polygons demarcate a geographic area in which the drone can operate freely during a specific time interval.
2. Traffic specification involves transforming the high-level flight plan obtained in the previous process into a more detailed definition (denoted as the flight script) that contains the necessary level of detail to unequivocally simulate the drone trajectory, such as a list of waypoints. For instance, if there were polygons in the flight plan, they should be replaced by the exact trajectory that the drone will follow. Additionally, other constraints related to ground avoidance or drone dynamic restrictions (e.g., maximum climb rate, maximum speed) can result in additional waypoints to produce a flyable flight script. These restrictions are typically addressed during a flight-planning phase aided by UTM systems.

The proposed tool performs both processes sequentially. First, the user provides a high-level parametrization of the desired realistic traffic, made possible by the mission-related traffic patterns identified in Sect. 2.1. Then, using the user inputs, a set of flight plans is randomly generated (respecting user constraints and patterns) in the traffic generation stage described in Sect. 2.2. Finally, each flight plan is converted into a flight script using the method detailed in Sect. 2.3.

2.1 Identification of Common UAS Patterns

Drones are currently being used in a wide variety of applications, each requiring a distinct type of drone trajectory to fulfill its mission [10]. The most prevalent applications are analyzed below, the objective being to identify common patterns that allow for the easy characterization of different types of drone flights:

• Industry and construction: two main types of flights can be considered depending on the type of infrastructure to be monitored/inspected. In the case of roads or power lines, flights follow linear trajectories along the infrastructure. However, for mines or buildings, flights follow irregular trajectories (from a UTM perspective) along the infrastructure that can be modeled as delimited flights within an area where the drone follows variable trajectories. Multirotor drones are used in shorter missions where maneuverability is key, whereas fixed-wing drones are utilized for linear inspections.
• Logistics: in the case of inter-warehouse transport tasks, the trajectories followed by these drones tend to be a series of linear trajectories among static positions (warehouses). Another possibility consists of last-mile delivery, where variable points are brought in (delivery locations). Multirotor or eVTOL drones may be used in this application depending on the required endurance.
• Mobility and Transportation: flights describe longer trajectories, ideally following a straight line between two points (vertiports). However, due to conflicts with other flights or restricted areas, they may need to change their trajectory. For this reason, fixed-wing or eVTOL drones are used for these types of applications, as they can cover longer flights.
• Agriculture: drones dedicated to precision agriculture perform flights inside a delimited area (crop fields) in which they usually follow very regular trajectories, generally sweeping over it so they can perform their monitoring, spraying, or sowing tasks in a homogeneous way across the entire terrain. Multirotor drones are typically employed in this task.
• Security and Defense: as in the previous case, these flights occur in a delimited area, following regular trajectories to cover an area of interest (for searching/exploring the area). However, in this case, flights may start outside the interest area, as it may be quicker or the area may be unreachable by ground. Additionally, another modality involves a so-called "captive" drone that remains practically static with the mission of monitoring a specific area.
Table 1. Flight patterns available in the proposed tool

Pattern                                                                     | Applicable missions
Linear flight between unknown waypoints                                     | Linear inspection of industry/construction infrastructure
Linear flight between known waypoints                                       | Inter-warehouse logistics. Mobility between vertiports
Linear flight between known and unknown waypoints                           | Last-mile logistics
Static / near-static flight (captive drone)                                 | Security and defense. Advertising
Polygon-bounded flight                                                      | Industry/construction non-linear inspection. Agriculture. Security and defense
Linear flight to arrive at the interest area followed by polygon-bounded flight | Security and defense
From the previous mission-driven analysis, the multi-mission flight patterns shown in Table 1 can be extracted according to the use of polygons, static or variable waypoints, or both to define the flight plan. More patterns may be identified for other
applications after the corresponding analysis. In any case, the proposed staged approach (i.e., traffic generation followed by a specification stage) is general and can be applied to any mission type.

2.2 Traffic Generation Module

Considering the previously identified flight patterns, six different traffic generators have been implemented, which, based on input parameters, generate traffic for each of the applications presented in the previous section:

1. Flights delimited by a polygonal area.
2. Static flights around a point.
3. Flights between nodes of a network.
4. Linear trajectory flights.
5. Flights from network nodes to a polygon.
6. Flights from network nodes to one or several random positions.
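To make the generators' output concrete, the sketch below shows one possible flight-plan structure together with the uniform draw of the starting time; all field names are hypothetical and simply mirror the parameters described next:

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class FlightPhase:
    waypoints: List[Tuple[float, float, float]] = field(default_factory=list)  # lat, lon, alt
    polygon: Optional[List[Tuple[float, float]]] = None  # free-flight area, if any
    max_speed: Optional[float] = None                    # m/s
    max_altitude: Optional[float] = None                 # m
    time_window: Optional[Tuple[float, float]] = None    # start/end timestamps

@dataclass
class FlightPlan:
    operator_id: str
    pilot_id: str
    drone_id: str
    takeoff: Tuple[float, float]   # geographic point
    takeoff_radius: float          # safety radius, m
    landing: Tuple[float, float]
    landing_radius: float
    start_time: float              # drawn uniformly within the user time window
    phases: List[FlightPhase] = field(default_factory=list)

def draw_start_time(window_start: float, window_end: float) -> float:
    # Uniform distribution over the user-provided time window (see the text below)
    return random.uniform(window_start, window_end)
```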
Each generator provides the minimum parameters needed for a strategic evaluation of the mission. The most important are:

• Operator, pilot, and drone identifiers: these are obtained randomly from lists of operators, pilots, and drones provided by the user. The drone identifier must be selected from the drone types suitable for the operation, as already discussed.
• Takeoff and landing area: defined as a geographic point with a safety radius.
• Flight phases: these consist of a succession of waypoints and/or polygons with their geographic coordinates and properties such as maximum speed, altitude limitation, and, if necessary, temporal restrictions for each phase.

One input parameter common to all generators is a time window from which the flight's starting time is drawn using a uniform distribution. This prevents multiple flights generated at the same moment from taking off simultaneously; instead, takeoffs are spread within the specified time range. Additionally, temporal overlaps between flights to be performed by the same drone are avoided. Regarding flight phases, polygons are generated based on an area, a maximum number of vertices and a radial length defined by the user. For waypoints, the following cases are distinguished:

• Random points: generated randomly within an area provided by the user.
• Network nodes: when a list of positions is provided by the user, as in flights between network nodes, circular polygons are generated for take-off and landing.
• Linear trajectories: in this case, quasi-aligned points are generated with a maximum rotation angle between two consecutive segments, defined by the user as an input parameter.

2.3 Traffic Specification Module

To simulate flights, it is necessary to generate a fully detailed flight plan. This process is only necessary for traffic that includes polygons as part of the flight phases; in that case, the exact trajectory to be followed by the drone within the polygon must be defined. Generating the trajectory allows for better adaptation to the type of flight we want
to simulate. Therefore, different trajectory generators have been designed: parallel line sweeps, spiral sweeps, and random trajectories. These generators work only on convex polygons; otherwise, the generated trajectory would go beyond the polygon's perimeter. Hence, the following process is carried out on the input polygon (Fig. 1; a code sketch of steps 1-2 is given after the figure):

1. The initial polygon is triangulated.
2. Triangles that form convex polygons are joined to divide the initial polygon into the smallest number of convex sub-polygons.
3. The optimal path to "jump" from one sub-polygon to another is sought to generate a smooth trajectory (linking adjacent polygons to avoid jumping between distant ones).
4. The trajectory is generated for each of the convex sub-polygons.
Fig. 1. Concave polygon division into convex sub-polygons.
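A rough sketch of steps 1 and 2 using the Shapely library is shown below; the greedy pairwise merge is a simplification for illustration, not the exact decomposition algorithm of the tool:

```python
from shapely.geometry import Polygon
from shapely.ops import triangulate, unary_union

def convex_decompose(poly: Polygon, tol: float = 1e-12):
    # Step 1: triangulate, keeping only triangles inside the (possibly concave) polygon
    parts = [t for t in triangulate(poly) if t.representative_point().within(poly)]
    # Step 2: greedily merge neighbouring pieces while the union stays convex
    merged = True
    while merged:
        merged = False
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                union = unary_union([parts[i], parts[j]])
                # the union is convex if it (almost) equals its own convex hull
                if union.geom_type == "Polygon" and \
                        union.convex_hull.difference(union).area < tol:
                    parts[i] = union
                    del parts[j]
                    merged = True
                    break
            if merged:
                break
    return parts
```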
The last step differs for each generator. As an example, the steps followed in the case of parallel line sweeps are detailed below (Fig. 2):
Fig. 2. Trajectory generation with each generator on the same polygon.
1. The vertex of the polygon with the lowest longitude coordinate (furthest west) is found and taken as the starting point of the trajectory.
2. A segment is created with its center at the previous vertex, a length of twice the height of the polygon (to ensure that it intersects the perimeter), and the inclination provided in the input parameters.
3. Successively, until the segment no longer intersects the polygon's perimeter:
a. The previous segment is moved by the user-provided distance, and the intersections of the new segment with the polygon's perimeter are obtained.
b. The intersection points are added to the trajectory in the correct order: if the previous point added was from the upper half of the polygon, the point from the upper half is added first and then the one from the lower half; and conversely if the previous point was from the lower half.
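The steps above can be condensed into a short Shapely sketch; here the sweep is expressed as offsets along the normal of the sweep direction, which is equivalent to repeatedly translating the segment of step 2:

```python
import math
from shapely.geometry import Polygon, LineString

def parallel_sweep(poly: Polygon, spacing: float, angle_deg: float = 0.0):
    """Waypoints of a parallel-line sweep over a convex polygon."""
    dx, dy = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    nx, ny = -dy, dx                                # unit normal to the sweep lines
    minx, miny, maxx, maxy = poly.bounds
    half = math.hypot(maxx - minx, maxy - miny)     # long enough to cross the polygon
    cx, cy = poly.centroid.x, poly.centroid.y
    pc = nx * cx + ny * cy                          # centroid projection on the normal
    offsets = [nx * x + ny * y for x, y in poly.exterior.coords]
    waypoints, flip = [], False
    o = min(offsets) + spacing / 2
    while o < max(offsets):
        px, py = cx + (o - pc) * nx, cy + (o - pc) * ny  # a point on the sweep line
        line = LineString([(px - half * dx, py - half * dy),
                           (px + half * dx, py + half * dy)])
        cut = line.intersection(poly)
        if not cut.is_empty and cut.geom_type == "LineString":
            pts = list(cut.coords)
            waypoints.extend(reversed(pts) if flip else pts)  # alternate ends, as in 3b
            flip = not flip
        o += spacing
    return waypoints
```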
3 Validation of the Traffic Generation and Specification Modules

This section shows the result of the implementation of the proposed system, along with a validation scenario demonstrating its performance. Recall that the initial goal was to generate flight plans representing real operations. The system provides an interface where users can select the type of flight they want to generate. It then requests the necessary input parameters for each type of flight (Fig. 3).
Fig. 3. System user interface for emergency flight creation.
To test the system and verify that it meets expectations, we created different types of traffic in the city of Madrid, where different types of flights are expected to coexist. For example:

• Aerial surveillance flights near sensitive buildings: flights have been created around the Royal Palace and Campo del Moro.
• Cargo transportation flights between centers with logistical needs: we generated cargo transportation flights between the following hospitals: Clínico San Carlos, San José y Santa Adela, Hospital Madrid, and Hospital Universitario Moncloa. Each flight makes between 2 and 4 stops, requiring a circular polygon of 30 m radius for each stop.
• Emergency flights: we simulated two fires in the Casa de Campo area. The fire stations are located: one inside Casa de Campo, another in Puerta del Ángel, and the last one next to Palacio de la Moncloa.
• Personal aerial mobility flights: the drones depart from base stations, located in this case at Ópera, Callao, Banco de España, Tirso de Molina, Atocha, and Retiro, and pick up and transport passengers who request the service within the entire central area of the city.

The generation result is shown below, with each flight drawn in a different color (Fig. 4).
Fig. 4. Generated flight plans.
In the following images, the generated flights can be seen in greater detail, where we can notice the trajectory generated over the polygonal areas of emergency flights (random trajectory) and surveillance flights (spiral sweep) (Fig. 5).
Fig. 5. Detailed view of emergency and surveillance flights generated.
4 Next Steps: Towards a Separation Optimization and Evaluation Framework The proposed traffic generation and specification tool can be integrated with additional modules to establish a Capacity Evaluation and Separation Optimization framework (Fig. 6) that computes the minimal separation, thus maximizing airspace utilization while meeting safety requirements. In this novel framework, the traffic supplied by the Traffic Generation module functions as a test scenario to evaluate potential tactical and strategic conflicts with a specified separation. Initially, a Strategic Evaluation module eliminates flight plans that do not adhere to the required minimum separation or airspace restrictions. Subsequently, compliant flights are simulated using a Flight Simulator based on the detailed description generated in the Traffic Specification module. At this stage, conflicts between drones should be absent, as they have been addressed during the strategic phase. Nevertheless, flight contingencies may arise, causing drones to deviate from anticipated trajectories. Drone separation during the strategic phase must accommodate these deviations by employing a larger separation than necessary under ideal conditions. Tactical conflicts in non-ideal situations are identified and serve as a metric for iteratively optimizing the required drone separation to achieve a specified safety target.
Fig. 6. Proposed framework for airspace capacity evaluation and separation optimization. Elements described in this paper are shown in blue (see Sects. 2.2 and 2.3).
Once a drone separation has been established, the aforementioned framework can be adapted to evaluate the operation of a UAS Traffic Management (UTM) system, as depicted in Fig. 7. In this context, strategic (i.e., flight plan conflict detection with a given separation) and tactical (i.e., tactical conflict detection in simulated flights) processes are executed by the UTM system itself. Our proposed modules serve once again as traffic input to simulate the flight plans and trajectories that the UTM system processes. However, the previous Strategic Evaluation module and Tactical Conflict Detection module function as a ground truth reference for expected strategic and tactical conflicts. By comparing these outputs with those obtained from the operational UTM system, it becomes possible to assess whether the UTM system is functioning as anticipated.
Fig. 7. Proposed integration of the traffic generation tool with a UTM system. Elements described in this paper are shown in blue (see Sects. 2.2 and 2.3).
5 Conclusions

In conclusion, this paper has introduced a UAS traffic generation platform capable of simulating diverse drone flight scenarios for a range of operations. The tool's flexibility, stemming from its integration of multiple traffic generators and trajectory algorithms, enables it to create realistic UAS traffic patterns. While the paper has demonstrated potential applications, such as integration with UTM systems, further research and development are required to fully capitalize on the tool's capabilities. In that sense, Sect. 4 has detailed the next steps to use the tool to evaluate metrics such as the minimum separation between aircraft or the maximum allowed traffic density in each scenario. Ultimately,
this traffic generation tool contributes to the ongoing efforts to enhance UTM systems and promote safe drone operations in shared airspace.

Acknowledgements. The authors acknowledge the funding received under grants PID2020-118249RB-C21 and PDC2021-121567-C21 funded by MCIN/AEI/10.13039/501100011033 and by EU Next GenerationEU/PRTR. Daniel Raposo acknowledges funding for his scholarship from UPM project RP180022025.
References
1. European Drones Outlook Study 2016. https://www.sesarju.eu/sites/default/files/documents/reports/European_Drones_Outlook_Study_2016.pdf
2. U-space Blueprint brochure final. https://www.sesarju.eu/sites/default/files/documents/reports/U-space%20Blueprint%20brochure%20final.PDF
3. Carramiñana, D., Campaña, I., Bergesio, L., Bernardos, A.M., Besada, J.A.: Sensors and communication simulation for unmanned traffic management. Sensors 21, 927 (2021). https://doi.org/10.3390/s21030927
4. Al-Mousa, A., Sababha, B.H., Al-Madi, N., Barghouthi, A., Younisse, R.: UTSim: a framework and simulator for UAV air traffic integration, control, and communication. Int. J. Adv. Robot. Syst. 16, 1729881419870937 (2019). https://doi.org/10.1177/1729881419870937
5. Rodríguez-Fernández, V., Menéndez, H.D., Camacho, D.: Design and development of a lightweight multi-UAV simulator. In: 2015 IEEE 2nd International Conference on Cybernetics (CYBCONF), pp. 255–260 (2015). https://doi.org/10.1109/CYBConf.2015.7175942
6. Homola, J., Prevot, T., Mercer, J., Bienert, N., Gabriel, C.: UAS traffic management (UTM) simulation capabilities and laboratory environment. In: 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), pp. 1–7 (2016). https://doi.org/10.1109/DASC.2016.7778078
7. Yoon, S., Shin, D., Choi, Y., Park, K.: Development of a flexible and expandable UTM simulator based on open sources and platforms. Aerospace 8, 133 (2021). https://doi.org/10.3390/aerospace8050133
8. Oosedo, A., Hattori, H., Yasui, I., Harada, K.: Unmanned Aircraft System Traffic Management (UTM) simulation of drone delivery models in 2030 Japan. J. Robot. Mechatron. 33, 348–362 (2021). https://doi.org/10.20965/jrm.2021.p0348
9. Bareas y Llorens - Automatic drone demand generation and evaluation. https://upcommons.upc.edu/bitstream/handle/2117/108168/memoria.pdf?sequence=1&isAllowed=y
10. Shakhatreh, H., et al.: Unmanned Aerial Vehicles (UAVs): a survey on civil applications and key research challenges. IEEE Access 7, 48572–48634 (2019). https://doi.org/10.1109/ACCESS.2019.2909530
UAV Airframe Classification Using Acceleration Spectrograms

David Sánchez Pedroche1(B), Francisco Fariña Salguero2, Daniel Amigo Herrero1, Jesús García1, and José M. Molina1

1 Group GIAA, University Carlos III of Madrid, Madrid, Spain
{davsanch,damigo,jgherrer}@inf.uc3m.es, [email protected]
2 University Carlos III of Madrid, Madrid, Spain
Abstract. Unmanned Aerial Vehicles (UAVs) are a highly innovative technology that is subject to strict regulations due to their potentially hazardous characteristics and the lack of a legislative framework for their safe operation. To overcome these challenges, Unmanned Air System Traffic Management (UTM) initiatives aim to establish validation and monitoring techniques for drone trajectories both prior to and during flight. In the UTM framework, drones will collaborate through systems similar to those used for ships and aircraft, such as the Automatic Identification System (AIS) and Automatic Dependent Surveillance-Broadcast (ADS-B). This paper presents an approach that uses machine learning to gain insights into the kinematic behavior of UAVs, with the objective of detecting the drone airframe and classifying drones according to their characteristics.

Keywords: UAV classification · spectrogram classification · trajectory data analysis · real-world data modelling
1 Introduction

Unmanned Aerial Vehicles (UAVs), also known as drones, have seen rapid advancement in recent years and have the potential to revolutionise many industries. However, due to the lack of adequate regulations and the possibility of malicious uses, the use of UAVs in our airspace has been limited to a few specific applications, such as remote sensing, military operations, and hobbyist applications, carried out in highly controlled environments. The lack of adequate laws and regulations to ensure the safe and secure use of drones has contributed to their restricted implementation. Moreover, the potential for drones to be used maliciously, such as for military or terrorist attacks, cyberattacks, and unauthorised data collection, has highlighted the need for mitigation measures. To address these challenges, unmanned aircraft regulations are being developed globally, defining the UAV airspace and integrating drones into the current traditional airspace systems. In Europe, the U-SPACE initiative aims to create a unified system for monitoring and controlling the use of drones, enabling a secure and efficient drone
ecosystem. This U-SPACE implementation requires the development of advanced technologies, such as rapid and automated Air Traffic Management (ATM) systems adapted to UAV flights. An important factor for U-SPACE is the heterogeneity of drones in terms of their characteristics and capabilities, ranging from small nano-UAVs with limited autonomy to large fixed-wing military drones that can fly long distances autonomously. The management of these diverse drones in their respective airspace application areas is critical to ensuring their safe and secure operation. This means that early UAV identification could help in different U-SPACE applications; for example, it can enable threat identification that allows the deployment of counter-UAS systems. This line of investigation presents an approach that uses machine learning techniques over drone trajectory data in order to identify the drone airframe and classify drones according to their characteristics. In previous studies, a similar approach was developed with ship trajectory data [1–3]. In the future, these types of techniques can be used within the U-SPACE domain, taking advantage of the collaborative data available in it. This approach can be used in several applications within U-SPACE, making it possible to find potential inconsistencies in the dynamic models of the UAVs, to fine-tune the trajectory prediction systems within U-Space services, or to find users cheating the system (i.e., using non-permitted drones for some applications). However, the current lack of datasets of sufficient quality is a significant obstacle to be considered [4, 5]. To solve this, a dataset was generated using collaborative flight logs provided by the PX4 flight control system community [4]. The dataset consists of trajectories of drones from five different classes based on the drone's airframe and number of rotors. This paper presents an analysis of drone trajectories and proposes the use of spectrograms to represent UAV kinematics with the aim of classifying the different airframes present in the dataset. To achieve this, it is necessary to apply algorithms that transform the data into spectrograms useful for machine learning algorithms. In addition, machine learning techniques are required to classify the generated spectrograms. Specifically, this paper aims to demonstrate the feasibility of this proposal using convolutional neural networks for airframe classification. Overall, this research aims to contribute to the development of a safe and efficient drone ecosystem that can realise the full potential of drones in various industries while ensuring public safety and security. In summary, this paper presents a novel approach for the classification of UAV trajectories, using spectrograms of the acceleration and machine learning algorithms to identify the UAV airframe. The proposed system was implemented and evaluated using the acceleration computed from the positional data present in a collaborative dataset provided by the PX4 flight control system community. Acceleration is a kinematic feature that has shown good results in similar trajectory identification problems [5]. The results show promising accuracy rates and demonstrate the potential of this approach for future applications in the airspace management of drones. The paper is structured as follows: in Sect. 2, the state of the art of UAV trajectory classification methods is presented.
Section 3 presents the system implementation details, including data preprocessing, spectrogram generation, and machine learning algorithm
configuration. In Sect. 4, the results of the performed experiments are presented and analyzed. Finally, Sect. 5 concludes the paper with a summary of the contributions and future directions for research in this field.
2 State of the Art

UAVs come in many different sizes and have diverse characteristics that make their identification challenging. Various studies propose the use of different sensors to detect and identify drones, including radar measurements, sound sensors, radio frequency identification, and image sensors [6–10]. To obtain a system robust against all types of threats, sensor fusion is often employed [11]. Most studies aim to detect the presence of drones compared to other elements such as birds, rather than identifying specific drone models. Two possible classification approaches exist: real-time systems that use the latest sensor observations, and offline systems that analyze the entire situation a posteriori, allowing for a more exhaustive analysis [12]. Among offline systems, a common approach is the use of feature extraction techniques to identify patterns in the trajectory [13]. Trajectory segmentation is often necessary to obtain comparable trajectory segments for feature extraction [12, 14, 15]. It is also possible to study UAV behaviour to identify anomalies and react promptly to dangerous behaviours [16, 17]. Spectrograms are image representations of time-series data that include both frequency and temporal information. This type of data has been utilized in various projects for signal classification [18, 19], making it possible to apply it to time series produced by the different motions of a target, such as identifying a person's movement type from accelerometer data [5]. Anomaly detection and the identification and classification of UAVs are promising fields of study that will be necessary for the development of a secure U-SPACE environment, but these approaches face challenging problems due to the different characteristics and sizes of UAVs. Sensor fusion and trajectory-based approaches are common solutions to obtain a robust system in similar areas. However, it is important to take into account that, due to current legal limitations, research and data collection from real-life scenarios are limited. In this paper, the problem is approached from the perspective of spectrograms applied to acceleration data.
3 Proposed System

The primary objective of this line of investigation is to develop novel classification techniques for Unmanned Aerial Vehicles (UAVs) utilizing the kinematic information that can be obtained from their movement trajectories. Specifically, the proposed method aims to classify UAVs based solely on the analysis of their acceleration frequencies. The approach uses kinematic data extracted exclusively from the geospatial position data available in the PX4 repositories, which is transformed into a single acceleration value for the UAV by combining the three spatial dimensions.
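A minimal sketch of this kinematic extraction, assuming the positions have already been converted to local Cartesian coordinates and resampled on a common time base (the function and variable names are illustrative, not from the paper):

```python
# Sketch: combine the three position dimensions into one acceleration signal.
# Inputs: t (s) and local Cartesian positions x, y, z (m) as 1-D NumPy arrays.
import numpy as np

def acceleration_modulus(t, x, y, z):
    """Differentiate position twice and return the acceleration modulus."""
    vx, vy, vz = (np.gradient(c, t) for c in (x, y, z))      # velocity (m/s)
    ax, ay, az = (np.gradient(v, t) for v in (vx, vy, vz))   # acceleration
    return np.sqrt(ax**2 + ay**2 + az**2)                    # single modulus
```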
The proposal is to use a classification system over spectrograms, approaching the problem in a way similar to what is commonly done for audio processing and other frequency-signal classification tasks, using a signal processing and classification pipeline to achieve the multi-class classification of five types of UAVs: Quadrotor, Hexarotor, Octorotor, Plane, and VTOL. Figure 1 presents the proposed system: extracting the trajectory information from the PX4 log files, generating the acceleration signal, and generating spectrogram representations of labelled flight segments of the signal data, which are used to train a Convolutional Neural Network (CNN) for feature extraction and classification.
Fig. 1. Proposed system (feature generation: total duration, total number of points, timestamp, distance XY, distance Z, speed XY, speed Z)
3.1 UAV Dataset and Spectrogram Generation

In general, there is a lack of public datasets of real drone trajectories [12, 13]. The proposal of this work is to produce one using UAV log data available on open-source platforms. Specifically, the proposal is to use the data provided by the PX4 autopilot. This platform is used as a flight control system for multiple types of UAVs, including different types of multicopters, fixed-wing aircraft, VTOLs, zeppelin aircraft and even non-airborne vehicles such as rovers or submarines. It provides a logging system that allows the generation of ULOG files, which can store all sensor information captured during the flight along with additional internal information of the vehicle. In addition, there is a strong community that shares its flight logs on the PX4 web platform [4], making it
possible to use those logs for the generation of a real drone trajectory dataset. From the PX4 platform a total of 55354 logs were acquired to create a useful data mining dataset. Unfortunately, these files are configured by the user and depend on the PX4 software version, so not all flights contain the same information, making it necessary to use data mining techniques to clean the data and make it useful for the intended applications. For this study the proposal is to extract from the dataset the drone airframe, as the class to be used in the analysis, and the kinematic information, as the main input of the classification algorithm. After an analysis of the dataset, some trajectories were identified as not suitable for the desired data mining process due to erroneous data, noisy information or undesirable characteristics. To avoid those trajectories, a cleaning step was performed when reading the logs with the pyulog library, discarding the 34361 trajectories that match one or more of the following criteria (a filtering sketch in this spirit is shown after Table 1):

• All unknown types of airframes and all trajectories from non-UAV sources (rovers).
• All trajectories of simulated UAVs, avoiding the software-in-the-loop and hardware-in-the-loop simulations provided by the PX4 framework.
• All trajectories with fewer than 30 measurements.
• All trajectories under a minimum duration of 30 s.
• All trajectories under a minimum movement of 10 m.
• All trajectories outside normal UAV physical movement limits, avoiding speed jumps over 45 m/s.

After this preparation and cleaning process, 20993 trajectories remained useful for the subsequent data mining process. The largest discarded groups were the simulated trajectories and the unknown airframes. The results of the data preparation stage can be checked in Table 1.

Table 1. Dataset distribution per class

| Class | Number of trajectories | Average points per track | Duration avg (min) | Distance avg (m) |
|---|---|---|---|---|
| Hexarotor | 456 | 4201.67 | 7.16 | 1669.86 |
| Octorotor | 106 | 2125.73 | 3.64 | 320.71 |
| Plane | 585 | 5609.95 | 9.47 | 6391.50 |
| Quadrotor | 19564 | 3869.84 | 6.61 | 1784.69 |
| VTOL | 282 | 4193.16 | 6.80 | 4872.76 |
| Total | 20993 | 3921.07 | 6.69 | 1944.66 |
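A hedged sketch of such a per-log filter, using the pyulog library mentioned above; the topic and field names ('vehicle_local_position', 'x', 'y', 'z') are assumptions based on typical PX4 logs, and the airframe/simulation checks are omitted:

```python
# Sketch: apply the duration / length / movement / speed criteria to one ULog.
import numpy as np
from pyulog import ULog

def keep_trajectory(path, min_points=30, min_dur_s=30.0,
                    min_move_m=10.0, max_speed=45.0):
    pos = ULog(path).get_dataset('vehicle_local_position')   # assumed topic
    t = pos.data['timestamp'] * 1e-6                 # microseconds -> seconds
    xyz = np.column_stack([pos.data[k] for k in ('x', 'y', 'z')])
    if len(t) < min_points or (t[-1] - t[0]) < min_dur_s:
        return False                                 # too short in time/points
    if np.linalg.norm(xyz.max(0) - xyz.min(0)) < min_move_m:
        return False                                 # barely moved
    dt = np.clip(np.diff(t), 1e-3, None)             # guard duplicate stamps
    speed = np.linalg.norm(np.diff(xyz, axis=0), axis=1) / dt
    return not np.any(speed > max_speed)             # reject physical outliers
```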
In a U-SPACE scenario, the detection of UAVs will include detecting those vehicles across several timestamps. To standardise the dataset, positional data is transformed into Cartesian coordinates using the starting point as the origin, and the timestamps are
stored in seconds starting from zero. This positional data, along with the timestamps, can be used to calculate the acceleration and generate spectrograms. The acceleration is calculated by combining the positional data of the horizontal and vertical planes, computing for each trajectory point a single modulus value that can be used for the spectrograms. In order to effectively identify UAVs in a real U-SPACE scenario, an algorithm capable of rapidly identifying targets must be developed, allowing for real-time detection. To use the proposed acceleration spectrograms, a fixed input size is necessary. To train the classification model, the extracted trajectories must therefore be segmented into fragments of a fixed, short duration. During experimentation, various segment durations were tested. From the different experiments, the best configuration was the segmentation of each trajectory into 90 s sections, maintaining a 10 s overlap between adjacent segments. Each generated segment is transformed into a spectrogram of the acceleration signal using a short-time Fourier transform that computes the acceleration variation over the 90 s. To perform the transformation it is necessary to configure and adjust three main parameters (a generation sketch follows this list):

• Sampling rate: the rate at which the signal was originally recorded or sampled. This parameter is necessary to properly interpret the frequencies in the spectrogram, as the frequency axis of the spectrogram is proportional to the sampling rate.
• Number of FFT points: the number of samples used for each frequency analysis. This parameter determines the frequency resolution of the spectrogram. A larger number of FFT points results in a higher frequency resolution, but also increases the computational complexity of generating the spectrogram.
• Hop length: the number of samples between each frame of the spectrogram. This parameter determines the time resolution of the spectrogram. A smaller hop length results in a higher time resolution, but also increases the number of frames and the computational complexity of generating the spectrogram (Fig. 2).
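As an illustration, the segment-to-spectrogram step with the best-performing parameters reported in Sect. 4 (sampling rate 10, 128 FFT points, hop length 1) could be implemented with SciPy as follows; the log scaling is an assumption, not stated in the paper:

```python
# Sketch: STFT-based spectrogram of one 90 s acceleration segment.
import numpy as np
from scipy import signal

def segment_to_spectrogram(acc_segment, fs=10.0, n_fft=128, hop=1):
    # noverlap = nperseg - hop gives the requested hop length of 1 sample.
    f, t, sxx = signal.spectrogram(acc_segment, fs=fs,
                                   nperseg=n_fft, noverlap=n_fft - hop)
    return np.log1p(sxx)   # log compression (an assumption) for the CNN input
```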
3.2 Classification Algorithm

To perform the airframe classification, the proposal is to feed the extracted spectrograms into an image classification algorithm trained to identify the airframe from the detected acceleration. The proposed classifier consists of a convolutional neural network architecture that extracts features from the acceleration spectrogram images. The architecture, shown in Fig. 3, consists of 3 convolutional layers with their respective max pooling layers. After the convolutional network, a fully connected block performs the final airframe classification. ReLU activation functions are used for the non-linear activations, and dropout layers are used in the fully connected block to diversify learning and mitigate overfitting.
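A sketch of such an architecture in PyTorch is shown below; the channel widths, dropout rate and adaptive pooling are assumptions, since the paper does not report these hyperparameters:

```python
# Sketch: three conv + max-pool blocks, then a fully connected classifier
# with ReLU activations and dropout, as described in the text.
import torch.nn as nn

class AirframeCNN(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),   # fixed-size features (assumption)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.5),
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):   # x: (batch, 1, freq_bins, time_frames)
        return self.classifier(self.features(x))
```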
Fig. 2. Sequence of 16 consecutive flight segments corresponding to the Quadrotor class
Fig. 3. Classification algorithm architecture
4 Experiments and Results

During the experimentation for this study, several models have been trained and tested, comparing the approach at different segment sizes for the spectrogram generation and configuring different classification architectures and models. In addition, multiple spectrogram configurations have been tested, tuning the hyperparameters of the spectrogram generation algorithm to achieve the best possible classification results. The best results are presented in the confusion matrix of Fig. 4, obtained with spectrograms generated over a duration of 90 s, using a sampling rate of 10, 128 FFT points and a hop length of 1. With these parameters the trained model achieves an accuracy of 81%, although it is important to consider the results for each individual class. As expected, the best results are for the quadrotor, as it is the class with the most training samples. The class most often confused with it is the Hexarotor, which has more incorrectly classified instances; this is expected, as it is the most similar airframe and the one with the most similar movement. Besides this, Plane and VTOL are the two classes least differentiable from each other, with a misclassification rate of 0.31 for planes. This is expected because those are the airframes whose movement is least similar to the others, but being the classes with the fewest instances can also affect the results. An important future line of research is to tune the algorithm with more examples of these minority classes.
Fig. 4. Confusion matrix for the 90 s spectrogram configuration
5 Conclusions

In conclusion, our study has demonstrated that it is possible to use spectrograms extracted from UAV movement information to perform an airframe identification useful for the future U-SPACE operational environment. However, the limited availability of data samples for certain classes can affect the balanced learning of models, so one main priority in the future is to acquire a more extensive dataset with sufficient instances to facilitate appropriate learning of each UAV class's movement patterns. Since well-structured datasets are difficult to obtain, a possible future line of investigation is the study of simulated trajectory data, assessing whether flight simulations are reliable enough to provide realistic data for this problem. The current analysis presents good performance using only a single feature (the acceleration spectrogram) with reduced-duration flights. Our findings show that different flight segment durations require specific spectrogram configurations. Therefore, another interesting future line of investigation will be the refinement of the current approach by incorporating additional features and tuning the segment duration to values below the 90 s used in this paper. Overall, our study highlights the importance of dataset size and quality, spectrogram configuration, and feature selection for the development of accurate UAV detection models. Acknowledgements. This work was funded by CDTI (Centro para el Desarrollo Tecnológico Industrial E.P.E.), CNU/1308/2018, 28 November; the public research projects of the Spanish Ministry of Science and Innovation PID2020-118249RB-C22 and PDC2021-121567-C22 - AEI/10.13039/501100011033; the project under the call PEICTI 2021-2023 with identifier TED2021-131520B-C22; and the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with UC3M in the line of Excellence of University Professors (EPUC3M17) and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).
References

1. Amigo, D., Sánchez, D., García, J., Molina, J.M.: Segmentation optimization in trajectory-based ship classification. In: Herrero, Á., Cambra, C., Urda, D., Sedano, J., Quintián, H., Corchado, E. (eds.) SOCO 2020. AISC, vol. 1268, pp. 540–549. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-57802-2_52
2. Sánchez Pedroche, D., Amigo, D., García, J., Molina, J.M.: Architecture for trajectory-based fishing ship classification with AIS data. Sensors 20(13), Article no. 13 (2020). https://doi.org/10.3390/s20133782
3. Pedroche, D.S., Herrero, D.A., Herrero, J.G., López, J.M.M.: Clustering of maritime trajectories with AIS features for context learning. In: 2021 IEEE 24th International Conference on Information Fusion (FUSION), pp. 1–8 (2021). https://doi.org/10.23919/FUSION49465.2021.9626956
4. PX4 public data. https://review.px4.io/browse
5. Noh, B., Cha, K., Chang, S.: Movement classification based on acceleration spectrogram with dynamic time warping method. In: 2017 18th IEEE International Conference on Mobile Data Management (MDM), Daejeon, South Korea, pp. 397–400. IEEE, May 2017. https://doi.org/10.1109/MDM.2017.72
6. Taha, B., Shoufan, A.: Machine learning-based drone detection and classification: state-of-the-art in research. IEEE Access 7, 138669–138682 (2019). https://doi.org/10.1109/ACCESS.2019.2942944
7. Dale, H., Baker, C., Antoniou, M., Jahangir, M., Atkinson, G., Harman, S.: SNR-dependent drone classification using convolutional neural networks. IET Radar Sonar Navig. 16(1), 22–33 (2022). https://doi.org/10.1049/rsn2.12161
8. Chen, W.S., Chen, X.L., Liu, J., Wang, Q.B., Lu, X.F., Huang, Y.F.: Detection and recognition of UA targets with multiple sensors. Aeronaut. J. 127(1308), 167–192 (2023). https://doi.org/10.1017/aer.2022.50
9. Leonardi, M., Ligresti, G., Piracci, E.: Drones classification by the use of a multifunctional radar and micro-Doppler analysis. Drones 6(5), 124 (2022). https://doi.org/10.3390/drones6050124
10. Chan, J.J.X., et al.: Small flying object classifications based on trajectories and support vector machines. J. Robot. Mechatron. 33(2), 329–338 (2021). https://doi.org/10.20965/jrm.2021.p0329
11. Crow: Indra's Counter UAS Solution. https://counteruas.indracompany.com/en/product. Accessed 19 Feb 2023
12. Zhang, X., Mehta, V., Bolic, M., Mantegh, I.: Hybrid AI-enabled method for UAS and bird detection and classification. In: 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, pp. 2803–2807. IEEE, October 2020. https://doi.org/10.1109/SMC42975.2020.9282965
13. Mohajerin, N., Histon, J., Dizaji, R., Waslander, S.L.: Feature extraction and radar track classification for detecting UAVs in civilian airspace. In: 2014 IEEE Radar Conference, Cincinnati, OH, USA, pp. 0674–0679. IEEE, May 2014. https://doi.org/10.1109/RADAR.2014.6875676
14. Chen, W.S., Liu, J., Li, J.: Classification of UAV and bird target in low-altitude airspace with surveillance radar data. Aeronaut. J. 123(1260), 191–211 (2019). https://doi.org/10.1017/aer.2018.158
15. Doumard, T., Riesco, F.G., Petrunin, I., Panagiotakopoulos, D., Bennett, C., Harman, S.: Radar discrimination of small airborne targets through kinematic features and machine learning. In: 2022 IEEE/AIAA 41st Digital Avionics Systems Conference (DASC), Portsmouth, VA, USA, pp. 1–10. IEEE, September 2022. https://doi.org/10.1109/DASC55683.2022.9925778
16. Pan, X., Desbarats, P., Chaumette, S.: A deep learning based framework for UAV trajectory pattern recognition. In: 2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA), Istanbul, Turkey, pp. 1–6. IEEE, November 2019. https://doi.org/10.1109/IPTA.2019.8936099
17. Karakostas, B.: An approach to UAV behaviour classification based on spatial analysis of ADS-B flight data. Procedia Comput. Sci. 201, 338–342 (2022). https://doi.org/10.1016/j.procs.2022.03.045
18. Hershey, S., et al.: CNN architectures for large-scale audio classification (2016). https://doi.org/10.48550/ARXIV.1609.09430
19. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). https://doi.org/10.48550/ARXIV.1409.1556
Tuning Process Noise in INS/GNSS Fusion for Drone Navigation Based on Evolutionary Algorithms Juan Pedro Llerena1(B) , Jesús García1 , José Manuel Molina1 , and Daniel Arias2 1 Applied Artificial Intelligence Group, University Carlos III of Madrid, Madrid, Spain
{jllerena,jgherrer}@inf.uc3m.es, [email protected]
2 Institute of Communications and Navigation, German Aerospace Center (DLR), Neustrelitz,
Germany [email protected]
Abstract. Tuning navigation systems is a complex engineering task due to the sensitivity of their parameters, the specific requirements to be met, and the maneuvering context that the system is designed to address. This problem is still a challenge for navigation system engineers, who spend a lot of time on tests and simulations to achieve a good tuning. Tuning methodologies aim to find the best filter configurations considering a set of design requirements. It is important to note that the navigation system is a critical system within flight controllers, since it is responsible for providing service to other subsystems such as control or guidance. The aim of this work is to fine-tune an INS/GNSS navigation system to ensure maximum position and/or orientation accuracy for specific missions with several maneuvers. We propose single- and multiple-objective evolutionary heuristic optimization strategies to tackle this problem. The final results improve the navigation system and show outstanding results compared to a commercial tool method. Keywords: Tuning · navigation · INS/GNSS · Evolutionary algorithms
1 Introduction

From the point of view of drone operations, a mission is understood as the execution of a pre-planned trajectory for a specific goal. These goals can be very varied, such as reconnaissance, surveillance, tracking, etc. The planning of these missions can be performed with ground stations, whereby the desired mission can be defined through waypoints and maneuvers. Once the missions are planned, they are sent to the flight controller, which manages them. Flight controllers such as Pixhawk [1] are composed of a set of subsystems, such as the guidance system or propeller control, which depend on the perception of the environment. The system performing the perception in the flight controller is the navigation system. It is responsible for estimating the vehicle states necessary to perform any mission or simply stabilize the
vehicle. The navigation system can be considered a critical system for any autonomous vehicle. The extended Kalman filter (EKF) [2] is one of the most widespread estimation systems to fuse inertial navigation systems (INS) and global navigation satellite systems (GNSS). The EKF/KF takes measurements and control inputs and returns the estimated dynamic states of the system. This requires a system model composed of a prediction function, a measurement function, and their corresponding noise terms. Usually, the prediction and measurement functions are relatively easy to model; however, the noise terms are complicated to model, since they are responsible for capturing what deterministic models cannot: mis-modeled system and measurement dynamics, hidden states of the system model, approximations due to the linearization of functions, or errors due to time discretization. Tuning strategies can be divided according to the type of evaluation metric. Thus, we find metrics based on ground truth (GT) and metrics without GT. Works such as J. García et al. [3] propose a methodology to tune the navigation system of the Pixhawk flight controller using four steps: design test, execute flight, save real-time data, and analyze. To do this, the authors perform a systematic review of quality metrics with and without GT. That work uses real flight data and evaluates system performance with GT-free metrics: the averaged innovations and the fusion break. P. Abbeel et al. [4] use machine learning strategies to identify the parameters of an EKF. Their evaluation functions employ GT-based metrics. The work is developed on a real platform, and they start from the assumption that the platform can be instrumented to capture additional states that allow very accurate measurements to be considered as GT. They propose a learning system based on process and measurement noise optimization; the algorithm used for the optimization is coordinate ascent. Throughout this paper, we propose the fine-tuning of an asynchronous INS/GNSS navigation system by means of evolutionary optimization with one and multiple objectives. It is illustrated with a particular case by means of a mission with different types of maneuvers, sensor noise, measurement rates, etc. Finally, the performance of the navigation system is compared applying Genetic Algorithms (GA), with the default configuration and with the setting of a commercial tool. The main novelty focuses on the comparison of single-objective, multitask and multi-objective optimization. This paper is organized as follows: Sect. 2 describes the INS/GNSS navigation system. Section 3 formulates the tuning methodology. Then, Sect. 4 shows the case study. Experimental results are shown in Sect. 5. Finally, the conclusions are presented in Sect. 6.
2 INS/GNSS

Inertial navigation systems (INS) and global navigation satellite systems (GNSS) are two of the main techniques used in real environments for UAS navigation. Each of these systems aims to estimate different states of a vehicle, such as position, velocity and attitude, among others. INS can obtain states relative to attitude and position in local coordinates, while GNSS systems such as the Global Positioning System (GPS), Glonass, or Galileo allow obtaining the global position of a receiver by triangulating the signal
from different satellites of the same GNSS family, or even combining several systems [5]. Although INS can estimate local position and velocities, they present biases in their estimation that, depending on the quality of the system, propagate and diverge to a lesser or greater extent over time [6]. Typical INS/GNSS fusion solutions [7, 8] are based on a loosely coupled architecture (GNSS is an independent information source that provides a single position measurement), which uses GNSS position and velocity measurements to aid the INS. In this way, the IMU sensors are used to extrapolate position, velocity, and attitude at high frequency (100 to 200 Hz), while updates from GNSS measurements at a lower frequency (1 Hz) allow the update of dynamic estimates and inertial sensor biases. A high-level description of the INS/GNSS fusion system is depicted in Fig. 1. The input/output data of the system are shown horizontally with black arrows, while the filter parameters are shown with red vertical arrows. The R parameters are given by the quality of the sensors, and the Q parameters are the subject of this study.
Fig. 1. High-level INS/GNSS fusion system diagram
The internal fusion process is performed by an EKF, which can be described by (1) and follows the same two prediction/update steps as the Kalman filter [2]:

$$x_k = f(x_{k-1}, u_k) + \varepsilon_k, \qquad z_k = g(x_k) + \delta_k \tag{1}$$

where f and g are nonlinear functions. Here u_k is a control signal, mapped together with the earlier state to time k by f and the process noise ε_k. The process noise ε_k is Gaussian with zero mean and covariance Q. On the other hand, the measurement z_k is modeled by a nonlinear law composed of the nonlinear function g of the state x_k and the observation noise δ_k. The observation noise is Gaussian with zero mean and covariance R. The process noise covariance Q gives the KF some flexibility in the face of uncertainty in the behavior of the dynamics of the state vector to be estimated. A usual example appears with the simple kinematics of a vehicle under uniformly accelerated motion, where the estimation model of the fusion system corresponds to uniform rectilinear motion and the process noise models the uncertainty in acceleration, giving the fusion system enough flexibility to follow the target successfully.
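To make the roles of Q and R concrete, the following toy sketch shows a single linear Kalman predict/update step in NumPy; the actual filter used in this paper is the 28-state asynchronous EKF of [9], so this is only an illustration of how Q inflates the predicted covariance and thus the weight given to new measurements:

```python
# Minimal linear Kalman step (predict + update) in NumPy.
import numpy as np

def kf_step(x, P, z, F, H, Q, R):
    # Predict: propagate the state and inflate covariance by process noise Q.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: the gain K weights the measurement z against the prediction;
    # it grows with P_pred (i.e. with Q) and shrinks with R.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```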
In the case of an INS/GNSS system the concept is the same. In a way, the process noise can be interpreted as a modulator of the weight of the measurement z_k versus the prediction x_{k|k−1} in the filtering (update) step producing x_{k|k}. Thus, if the filter is required to match a measurement more closely than the prediction because the system is in a transient state, such as a change of maneuver, the process covariance has to be high. On the other hand, in a stationary state where the measurement is not relevant, the process noise should be low. Finding a balance between these terms also depends on the specific mission to be performed. Consider the asynchronous filter [9]: its state vector x, described in (2), has 28 states: attitude q ∈ H, angular velocity ω ∈ R³ relative to the body frame, position p ∈ R³ in the local NED frame, velocity v ∈ R³ in the local NED frame, acceleration a ∈ R³ of the platform in the local NED frame, geomagnetic field vector M ∈ R³ at the reference location, accelerometer bias b_a ∈ R³, gyroscope bias b_ω ∈ R³ and magnetic bias b_M ∈ R³.

$$x = [q, \omega, p, v, a, M, b_a, b_\omega, b_M]^T \tag{2}$$

The process noise is composed of the covariances of each of the 28 states in (2). However, the bias states (accelerometer, gyroscope and magnetometer) correspond to states with a highly nonlinear relation to the observations, so they must initially be set extremely low relative to the observable states. The remaining five variance triads identify the prediction uncertainty in the observed states: accelerometer noise σ_a², gyroscope noise σ_ω², magnetometer noise σ_M², GPS position noise σ_p² and GPS velocity noise σ_v².

$$\theta = [\theta_1, \ldots, \theta_5]^T = \left[\sigma_\omega^2, \sigma_p^2, \sigma_v^2, \sigma_a^2, \sigma_M^2\right]^T \tag{3}$$
3 Tuning Process

Identifying the process noise covariance Q can be considered a learning problem, modeled as an optimization process that improves one or more aspects of the system. Doing this requires an evaluation metric, called a fitness function or cost function f_j(e_j), where e_j is the error function and j denotes the objective/state. In this work, the aim is to improve the accuracy of the position and orientation estimates of the navigation system throughout a mission. For this purpose, the root mean square error (RMSE) is used as the cost function (4). Position accuracy is the aggregated RMSE of the three position components; orientation accuracy is the RMSE of the aggregated error of the three Euler angles {roll, pitch, yaw}. The error function e_{j,k}(θ, GT_{j,k}) at time step k is the l2-norm between the estimated j-state, denoted x_{j,k|k}(θ), and the ground truth x_{j,k}^{GT}, as in (5):

$$f_j = \mathrm{RMSE}_j\!\left(x_{j,k|k}(\theta),\, x_{j,k}^{GT}\right) = \sqrt{\frac{1}{N}\sum_{k=1}^{N} e_{j,k}(\theta, GT_{j,k})^2} \tag{4}$$

$$e_{j,k}(\theta, GT_{j,k}) = \left\lVert x_{j,k|k}(\theta) - x_{j,k}^{GT} \right\rVert_2 \tag{5}$$
where k is the time step; when accompanied by a pipe, as in x_{A|B}, the Kalman notation is meant, where A is the estimation time {current = k, previous = k − 1} and B is the observation time, so k|k is the update at the k-th observation and k|k−1 is the prediction to time k from the observation at time k − 1. The subindex j = {p_ENU, q} denotes the objectives: p_ENU is the position in the ENU (East North Up) reference frame and q is the attitude in the Euler angle representation {roll, pitch, yaw}. We also combine these two objectives into a single one (8) using the normalized weighted sum method [10], with a weight equal to one for each cost function; to guarantee that the weights act on a similar scale, the RMSE of p_ENU and q is normalized by the error variance. In addition, a multi-objective (MO) method is used, specifically multi-objective evolutionary algorithms (MOEAs). It is important to note that, in contrast to the cases of one objective or a combination of several objectives, the MO solution generates a space of solutions with the same dimension as the number of objectives. This requires a decision criterion to select the most appropriate solution for the specific case. In this work we use as a criterion the solution of the optimal Pareto front [11] with the smallest l2-norm to the origin of the solution space.

Single objective:

$$\min_{\theta \in \Omega} f_{p_{ENU}}(\cdot) \tag{6}$$

$$\min_{\theta \in \Omega} f_{q}(\cdot) \tag{7}$$

$$\min_{\theta \in \Omega} \left\{ f_{p_{ENU}}(\cdot) + f_{q}(\cdot) \right\} \tag{8}$$

Multiple objectives:

$$\operatorname{MO-}\min_{\theta \in \Omega} \left\{ f_{p_{ENU}}(\cdot),\ f_{q}(\cdot) \right\} \tag{9}$$

where θ, in the optimization context, is the vector of decision variables shown in (3), and Ω is the search space.
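As an illustration, the cost functions (4)-(5) and the variance-normalized sum used in (8) can be written compactly; the array shapes are assumptions (N time steps by d state components per objective):

```python
# Sketch of the cost functions: est and gt are (N, d) arrays of estimated
# and ground-truth states for one objective (ENU position, or Euler angles).
import numpy as np

def rmse(est, gt):
    e = np.linalg.norm(est - gt, axis=1)   # (5): per-step l2 error
    return np.sqrt(np.mean(e ** 2))        # (4): aggregated RMSE

def combined_objective(est_p, gt_p, est_q, gt_q):
    # (8): equal-weight sum, each term normalized by its error variance so
    # that the position and orientation terms contribute on a similar scale.
    ep = np.linalg.norm(est_p - gt_p, axis=1)
    eq = np.linalg.norm(est_q - gt_q, axis=1)
    return rmse(est_p, gt_p) / ep.var() + rmse(est_q, gt_q) / eq.var()
```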
4 Case Study

4.1 Mission Problem and Simulation Configuration

Consider the mission described in Fig. 2(a) and the dynamic model of guidance and control of a quadrotor following the trajectory shown in Fig. 2(b). The representative waypoints are shown as green circles. Using the waypoints, as well as the velocities and orientations during the flight simulation illustrated in Fig. 2(b), the positions, orientations, velocities, accelerations and angular velocities along the trajectory are calculated by means of interpolators using the UAV Toolbox of Matlab [12]. The trajectory and sensor measurements are simulated under the conditions of Table 1. Finally, the GT is obtained considering the simulated asynchronous measurement conditions (Fig. 3).
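The trajectory generation itself relies on the MATLAB UAV Toolbox waypoint trajectory generator [12]; purely as an illustration of the interpolation idea, an analogous spline-based generator could be sketched as follows (the 10 s waypoint spacing and 160 Hz IMU rate come from Table 1; everything else is an assumption):

```python
# Sketch: interpolate positions, velocities and accelerations between
# mission waypoints with a cubic spline.
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_mission(waypoints, dt_wp=10.0, fs=160.0):
    """waypoints: (n, 3) array; one waypoint every dt_wp seconds."""
    t_wp = np.arange(len(waypoints)) * dt_wp
    spline = CubicSpline(t_wp, waypoints, axis=0)
    t = np.arange(t_wp[0], t_wp[-1], 1.0 / fs)   # IMU-rate sampling grid
    return t, spline(t), spline(t, 1), spline(t, 2)   # pos, vel, acc
```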
Fig. 2. Mission problem. (a) Mission plan, (b) Quadrotor trajectory simulation.
Fig. 3. Final ground truth trajectory and GPS simulation data.
Table 1. Sensor and trajectory simulation parameters

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| GPS rate | 1 (Hz) | Velocity Acc | 0.1 (m/s) |
| GPS decay factor | 0.25 | IMU rate | 160 (Hz) |
| H. Position Acc | 1.6 (m) | Time between wp | 10 (s) |
| V. Position Acc | 1.6 (m) | Ref. location | [40.544288, −4.012095, 900] |
4.2 Filter Parameters

The starting filter conditions are shown in Table 2, where the subindex of each variance σ² denotes the state.
Table 2. Additive process noise variances

| State | Symbol | Value | Units |
|---|---|---|---|
| Quaternion noise | σ_q² | [1, 1, 1, 1] · 10⁻² | – |
| Angular velocity noise | σ_ω² | [1, 1, 1] · 10² | (rad/s)² |
| Position noise | σ_p² | [1, 1, 1] · 10⁻⁶ | m² |
| Velocity noise | σ_v² | [1, 1, 1] · 10⁻⁶ | (m/s)² |
| Acceleration noise | σ_a² | [1, 1, 1] · 10² | (m/s²)² |
| Geomagnetic vector noise | σ_M² | [1, 1, 1] · 10⁻⁶ | (μT)² |
| Gyroscope bias noise | σ_bω² | [1, 1, 1] · 10⁻⁷ | (rad/s)² |
| Accelerometer bias noise | σ_ba² | [1, 1, 1] · 10⁻⁷ | (m/s²)² |
| Magnetometer bias noise | σ_bM² | [1, 1, 1] · 10⁻⁷ | (μT)² |
4.3 Optimization Algorithms

Based on the work of J. García et al. [10], genetic algorithms with real coding are used. The NSGA-III algorithm [13] is used for the MO cases. The acronym corresponds to Nondominated Sorting Genetic Algorithm, which successfully addresses real-world problems with multiple objectives [11]. The main difference with respect to the NSGA-II algorithm lies in the way it addresses the dominance resistance problem, by focusing on population members that are not dominated but are close to a set of supplied reference points. Although the case study does not present constraints outside the domain of the decision variables, this algorithm can be used in future case studies with constraints (spatial, convergence, nonlinear, etc.). The search space is defined in (10); its upper bounds are 30% above the values proposed in [9], with the lower bounds as indicated. It is important to note that the lower bound must always be greater than 0 to prevent Q from becoming singular.

$$\Omega = \begin{cases} \theta_1 \in [0.0765,\; 0.9945] & (\mathrm{rad/s})^2 \\ \theta_2 \in [0.0001,\; 1.6445] & \mathrm{m}^2 \\ \theta_3 \in [0.0001,\; 0.9945] & (\mathrm{m/s})^2 \\ \theta_4 \in [0.0001,\; 1.1700] & (\mathrm{m/s^2})^2 \\ \theta_5 \in [0.0001,\; 1.6445] & (\mu \mathrm{T})^2 \end{cases} \tag{10}$$

We use a real-coded GA with a population of 40 over 100 iterations. The parent selection method is a roulette wheel with probability parameter β = 1, and the offspring population parameter is pC = 1. The crossover method is uniform crossover with γ = 0.1. The mutation parameters are (μ, σ) = (0.5, 0.1). In the MO case, the number of divisions is 5 and the non-dominated crossover parameter is 0.8.
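The paper's GA implementation is not published; purely as an illustration, the MO problem (9) over the bounds (10) could be wired into an off-the-shelf NSGA-III such as pymoo's, with the minimum-norm Pareto solution selected afterwards. The function evaluate_filter is a hypothetical stand-in for running the INS/GNSS simulation and returning the two RMSE objectives:

```python
# Sketch: tuning problem (9)-(10) with pymoo's NSGA-III.
import numpy as np
from pymoo.core.problem import ElementwiseProblem
from pymoo.algorithms.moo.nsga3 import NSGA3
from pymoo.util.ref_dirs import get_reference_directions
from pymoo.optimize import minimize

def evaluate_filter(theta):
    # Hypothetical stand-in: run the filter with process noise theta and
    # return (RMSE position, RMSE orientation). Stubbed with a toy trade-off.
    return float(np.sum(theta)), float(np.sum(1.0 / (theta + 1e-3)))

class TuneQ(ElementwiseProblem):
    def __init__(self):
        xl = np.array([0.0765, 1e-4, 1e-4, 1e-4, 1e-4])          # (10) lower
        xu = np.array([0.9945, 1.6445, 0.9945, 1.1700, 1.6445])  # (10) upper
        super().__init__(n_var=5, n_obj=2, xl=xl, xu=xu)

    def _evaluate(self, theta, out, *args, **kwargs):
        out["F"] = evaluate_filter(theta)

ref_dirs = get_reference_directions("das-dennis", 2, n_partitions=5)
res = minimize(TuneQ(), NSGA3(pop_size=40, ref_dirs=ref_dirs),
               ("n_gen", 100), verbose=False)
knee = res.F[np.argmin(np.linalg.norm(res.F, axis=1))]  # min l2-norm solution
```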
5 Results

Given the optimization problems described in Sect. 3, using the same trajectory, we first check the convergence of the algorithms. Then, the results obtained are compared with the tool integrated into the Matlab Navigation Toolbox [4, 14]. Finally, the quantitative results are shown in Table 3.
Fig. 4. Single Objective fitness.
In the single-objective cases (Fig. 4), the algorithm converges to a stable solution within the first 20–40 iterations. In all cases the fitness function decreases.
Fig. 5. Final results, Pareto front.
Figure 5 shows the solution space. The MO optimization yields a region coherent with the solutions found by the single-objective strategies. The region described by the MO
optimization is shown in the zoomed region. The set of solutions describes the border of solutions called the Pareto front. All solutions improve the initial configuration of the "Untuned" filter.

5.1 Results Comparison

The first configuration uses the default values of the "Untuned" filter; the second is the coordinate ascent method "CC. Ascent" integrated in [14]. In the multi-objective case (9), we use the solution with the smallest l2-norm to the origin of the solution space. Table 3 shows the results of the different configurations together with the RMSE in position p and orientation q. The configuration with the best RMSE in position corresponds to problem (6), which improves the starting position accuracy by 57% and the orientation accuracy by 11%. The configuration with the highest orientation improvement corresponds to problem (7), which improves the RMSE in position by 47% and the RMSE in orientation by 49%.

Table 3. Quantitative results comparison

| Method | RMSE p_ENU | RMSE q | θ1 | θ2 | θ3 | θ4 | θ5 |
|---|---|---|---|---|---|---|---|
| Untuned | 2.2237 | 2.9182 | 1 | 1 | 1 | 1 | 1 |
| CC. Ascent | 0.9830 | 3.3388 | 7.0434 | 11.7490 | 0.0467 | 1.6007 | 4.3843 |
| min f_p (6) | 0.9588 | 2.5934 | 0.9945 | 0.6289 | 0.0017 | 1.1700 | 0.7264 |
| min f_q (7) | 1.1813 | 1.4949 | 0.0765 | 1.6445 | 0.0237 | 0.0959 | 1.6445 |
| min f_p + f_q (8) | 1.0150 | 1.5687 | 0.3485 | 1.6445 | 0.0100 | 0.3434 | 1.6407 |
| MO-min {f_p, f_q} (9) | 1.0369 | 1.5923 | 0.6730 | 1.5369 | 0.0116 | 0.3180 | 1.5617 |
6 Conclusions

The variability of the results under small changes in the filter parameters shows the sensitivity of the filter to these parameters. This sensitivity provides a great opportunity for study, as shown in this work. The results show that improving position also improves the RMSE in orientation and vice versa; this is a consequence of moving the filter configuration towards the performance border. When the problem is approached in an MO way (9), improving one objective worsens the other, creating a Pareto front. The Pareto front shows a limit of filter performance in position and orientation. Table 3 shows that the best configuration for a single objective is obtained with single-objective optimization, while multiple objectives require either a method that aggregates them or MO optimization (9). The solution generated with the sum of objectives (8) is close to the solution provided by MO with minimum l2-norm from the origin. This suggests that the weighted-sum method can be shifted along the border according to the weights associated with each function, with single-objective optimization at the ends of the Pareto front. The results found with the five proposals improve the starting configuration of the filter. In addition, the configuration associated with (6) improves on the CC. Ascent method, whose orientation performance worsens by 14%. However, the computational cost of the GA is high. It is important to mention that this work has been done on a single trajectory and the results shown correspond to a single test iteration.

Funding. This research was partially funded by public research projects of the Spanish Ministry of Science and Innovation, references PID2020-118249RB-C22 and PDC2021-121567-C22 AEI/10.13039/501100011033, and by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with UC3M in the line of Excellence of University Professors, reference EPUC3M17.
References

1. Pixhawk: The hardware standard for open-source autopilots. https://pixhawk.org/. Accessed 25 Apr 2023
2. Groves, P.D.: Principles of GNSS, Inertial, and Multisensor Navigation Systems. Artech House, Boston, MA, USA, London, UK (2013)
3. García, J., Molina, J.M., Trincado, J.: Real evaluation for designing sensor fusion in UAV platforms. Inf. Fus. 63, 136–152 (2020)
4. Abbeel, P., Coates, A., Montemerlo, M., Ng, A.Y., Thrun, S.: Discriminative training of Kalman filters. Robot. Sci. Syst. 2, 1 (2005)
5. Cai, C., Gao, Y.: Modeling and assessment of combined GPS/GLONASS precise point positioning. GPS Solut. 17(2), 223–236 (2013)
6. Hasan, A., Samsudin, K., Rahman bin Ramli, A., Ismaeel, S.: A review of navigation systems (integration and algorithms). Aust. J. Basic Appl. Sci. 3(2), 943–959 (2009)
7. Sun, R., Cheng, Q., Wang, G., Ochieng, W.Y.: A novel online data-driven algorithm for detecting UAV navigation sensor faults. Sensors 17(10) (2017)
8. Khaleghi, B., Khamis, A., Karray, F.O., Razavi, S.N.: Multisensor data fusion: a review of the state-of-the-art. Inf. Fus. 14(1), 28–44 (2013)
9. MATLAB: Asynchronous sensor fusion of magnetic, angular rate, gravity and GPS - MathWorks. https://es.mathworks.com/help/nav/ref/insfilterasync.html?searchHighlight=insfilterAsync&s_tid=srchtitle_insfilterAsync_1. Accessed 25 Apr 2023
10. García, J., Berlanga, A., Molina López, J.M.: Effective evolutionary algorithms for many-specifications attainment: application to air traffic control tracking filters. IEEE Trans. Evol. Comput. 13(1), 151–168 (2009)
11. Hua, Y., Liu, Q., Hao, K., Jin, Y.: A survey of evolutionary algorithms for multi-objective optimization problems with irregular Pareto fronts. IEEE/CAA J. Autom. Sin. 8(2), 303–318 (2021)
12. MATLAB: Waypoint trajectory generator - MathWorks. https://es.mathworks.com/help/fusion/ref/waypointtrajectory-system-object.html. Accessed 25 Apr 2023
13. Deb, K., Jain, H.: An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, Part I: solving problems with box constraints. IEEE Trans. Evol. Comput. 18(4), 577–601 (2014)
14. MATLAB: Navigation Toolbox Documentation - MathWorks. https://es.mathworks.com/help/nav/index.html?s_tid=CRUX_lftnav. Accessed 25 Apr 2023
Special Session 3: Soft Computing Methods in Manufacturing and Management Systems
Digital Twins of Production Systems Based on Discrete Simulation and Machine Learning Algorithms Damian Krenczyk(B) Faculty of Mechanical Engineering, Silesian University of Technology, Gliwice, Poland [email protected]
Abstract. Implementing the concept of a digital twin of a production system based on discrete-event simulation models, beyond simulating the features, behavior, functioning and performance of its real counterpart, requires supplementing the model with analytical solutions. Such a solution allows the digital twin to be included in the decision support process in the area of production management, and the obtained analysis results can be sent back to the executive elements of the production system or to planning systems. The paper presents the concept and practical verification of the possibility of creating digital twins of production systems based on the integration of modern discrete simulation software with machine learning modules - reinforcement learning. This combination also provides the opportunity to effectively conduct optimization and adaptation activities and to increase the level of autonomy and flexibility of production systems. Keywords: machine learning · reinforcement learning · digital twin · discrete-event simulation
1 Introduction

The increasing level of production automation, allowing for increased efficiency along with an increasing level of flexibility of manufacturing systems, is a response to the requirements of an increasingly competitive market and the dynamic, complex environment in which today's organizations operate. This is a global trend, particularly visible in the most dynamic industries, such as automotive and electronics. It is possible thanks to the development and implementation of modern technologies, as well as management concepts focused on quality and continuous improvement of processes. Manufacturers are constantly modernizing their systems in order to produce many models and their variants (versions) in different volumes on the same production and assembly lines, which allows the product to be adapted to the customer's needs, increasingly often under the conditions of serial and mass production (mass customization). The next step towards increasing the flexibility and reconfigurability of manufacturing systems, and as a result gaining a competitive advantage and larger market shares, is the modern concept of production implementation and management oriented
towards cyber-physical systems, called Industry 4.0 [1, 2]. Its aim is to integrate an entire value chain, e.g. production and logistics, but also across organizations, suppliers and customers, through the implementation of new methods of production and operation of enterprises characterized by three key features: horizontal integration, vertical integration, and so-called end-to-end engineering integration [3–6]. As a result, efficiently and profitably designing and manufacturing customized large- and small-lot products will become possible. Production systems and management systems are constructed on the basis of cyber-physical solutions (Cyber-Physical Systems, CPS) by enriching them with new network systems (IoT), wireless sensor networks (WSN), systems for processing large amounts of data (big data), cloud computing solutions (anything as a service, XaaS) and the mobile internet [3, 4]. Recently, owing to lower delays and transmission costs in the application of real-time algorithms, IoT and XaaS solutions have been expanded with computing methods supported by artificial intelligence [7–9]. In cyber-physical systems, the processed data is obtained, among others, from networks of sensors, actuators, robots, machines, devices, business processes and personnel. With regard to production systems, it is necessary to develop data processing and decision support solutions capable of supporting the planning and control process in real time. The responses obtained from the "cyber" layer must allow a quick reaction to changing environmental factors and enable, to a large extent autonomously, the reconfiguration of the production system and its adaptation to these changes. The level of this autonomy will largely depend on the efficiency and effectiveness of the algorithms and their ability to adapt to new conditions.
Fig. 1. Cyber-Physical System and a Digital Twin in the concept of Industry 4.0.
In the smart factory concept based on cyber-physical systems, one of the basic elements of the cyber layer is the digital twin, a virtual equivalent of the real system [3, 10, 11]. The digital twin is therefore a virtual, dynamic digital model that should map and simulate the features, behavior, functioning and efficiency of its real counterpart. The digital twin should also adapt to changes in the physical objects of the analyzed system [10]. Its role can be played
by discrete computer simulation models, which in today's development phase, using data exchange interfaces and emulation modules of industrial automation components, can automatically update the mapping of the real equivalent through data exchange [5, 12]. However, in order to meet the requirements that the digital layer should satisfy in relation to data processing, analytics and inference, it becomes necessary to search for and develop new methods and algorithms supporting autonomous production planning and control processes - it is necessary to enrich computer simulation models with solutions implementing advanced methods of artificial intelligence and machine learning (Fig. 1). The purpose of the research work whose results are presented in this paper is the practical verification of the possibility of creating digital twins of production systems based on the integration of modern discrete-event simulation systems with Reinforcement Learning (RL) modules. The results of the first stage of the research work, including a proof of concept and verification of the effectiveness of a recent RL algorithm in solving a selected practical problem, are shown. Section 2 briefly discusses the concept and selected reinforcement learning algorithms. The next section is devoted to presenting the process of creating a digital twin using discrete-event computer simulation and showing the results of computational analyses. Finally, Sect. 4 concludes the research with areas and directions for future research work.
2 Reinforcement Learning

In recent years, there has been a significant increase in interest in the use of reinforcement learning algorithms, also in the areas of production management [13]. The concept of deep reinforcement learning has the potential to revolutionize the field of artificial intelligence. Recent developments in this area suggest that it is a technology with vast potential, especially for solving problems where dynamic decision making is required based on variable data about the state of the system. It was selected by the MIT Technology Review as one of the breakthrough technologies that will play a key role in the future achievements of artificial intelligence [14, 15]. RL is a sub-category of machine learning that optimizes agent behavior in an environment. The agent has specific knowledge about the environment and takes actions; after their assessment it is rewarded or punished, depending on the impact of the actions taken on the degree of achievement of the optimization goal. The agent's decision-making logic, defined by its policy, leads to the selection of an action depending on the current state of the system. Although RL is becoming more and more popular in many fields and areas, there is still a lot of untapped potential in its application to production planning [16]. In reinforcement learning, unlike standard supervised learning methods, training data in the form of input/output pairs is not required; instead, the agent receives input data describing the current state of the system, which arose as a result of the last action. The agent then chooses the next action as the result of applying its policy (Fig. 2). By applying the current policy, the agent selects actions that tend to increase the long-term sum of reward values. Also, compared to supervised learning, the learning stage does not need to be separated from the operational use stage. Both stages can be
implemented in parallel, which is of great importance in systems operating in the variable and dynamic environments mentioned in the previous section. Formally, reinforcement learning is defined as a Markov decision process and requires defining a discrete set of system (environment) states S = {s1, s2, s3, …, sn} and a finite set of actions A = {a1, a2, a3, …, an}. The agent chooses an action according to the adopted strategy h: S → A. After performing action a in state s, the agent obtains information about the state of the environment in the next step based on the state transition function f(s, a) and a real-valued reward determined by the so-called gain function ρ(s, a).
Fig. 2. Reinforcement learning (RL) process (the agent sends actions to the environment - the digital twin - and receives back the state and reward).
If the process driven by the agent is available in the learning phase, or we are able to build its digital twin, information about the system state is collected from sensors during the direct interaction of the agent with the environment, which enables real-time reactions to changes in the system. Research on reinforcement learning methods conducted in recent years has led to the creation of various groups of algorithms, which differ, among other things, in the way policies are constructed. Among them, model-free and model-based algorithms can be distinguished. Model-free algorithms do not require learning a model of the environment or its dynamics, nor do they require a state transition function. One of the most famous is Q-learning, a model-free algorithm that can be used to solve problems with stochastic transitions and rewards without the need for adaptation. Further advances in RL resulted in the Trust Region Policy Optimization (TRPO) algorithm. It uses a policy gradient approach to update the applied policy, taking the largest possible step to improve performance while reducing discrepancies in the policy update process at each iteration. OpenAI proposed the Proximal Policy Optimization (PPO) algorithm [17], which is basically a simplification of TRPO that computes a policy update at each step minimizing the cost function while ensuring that the deviation from the previous policy is relatively small. This algorithm introduces an adaptive penalty, using the discrepancy as a hyperparameter to determine the penalty coefficients [15]. The PPO algorithm, due to its much simpler implementation and tuning, has become a very popular RL algorithm. It is available in most open-source libraries containing implementations of RL algorithms, e.g. in Stable-Baselines3, a popular Python library [18]. For more information and an overview of RL methods and algorithms, see e.g. [13].
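For illustration, the classic tabular Q-learning update mentioned above can be written in a few lines; the learning rate α and discount factor γ are generic hyperparameters, not values from this paper:

```python
# Tabular Q-learning update: Q is an (n_states, n_actions) array.
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap from next state
    Q[s, a] += alpha * (td_target - Q[s, a])    # move estimate toward target
    return Q
```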
3 A Digital Twin Based on Discrete-Event Simulation as a Reinforcement Learning Agent Environment

This section presents the process of creating a digital twin based on a digital model of a selected process of the manufacturing system, made in the FlexSim discrete-event simulation software. The simulation model, in turn, is the environment that the learning RL agent influences and communicates with (Fig. 2). The PPO algorithm is used in the learning and control process. In the described case, the digital model, reflecting the behavior of the real production system, was supplied with data generated from selected probability distributions. In practical implementations, however, the digital twin can use data from the manufacturing system and compare the results of commands received from the RL agent with results based on control signals from conventional real-time control systems. It can then take over control of the system while continuing to learn and adapt to changes in the real system.

RL Agent
The RL agent was implemented in the Python high-level programming language with the open-source Stable-Baselines3 library [18], containing a set of learning algorithm implementations, and Gym [19], providing a standard API for communication between learning algorithms and environments along with a standard set of environments compliant with that API [20]. The designed RL agent used the previously mentioned PPO algorithm. In this case, the action space selected by the agent and the observation space received from the simulation model (environment) can be defined using a wide range of types, such as Discrete, MultiDiscrete, Box and MultiBinary. This gives the possibility of broadly shaping the scope of information about the state of the environment (manufacturing system) transferred to the agent, as well as shaping the RL agent's responses, which in this case are decisions controlling the flow of semi-finished products.

Communication of the RL Agent with the Simulation Model
The communication of the RL agent with the digital simulation model in the FlexSim software was implemented using the Reinforcement Learning tool. This tool provides interface configurations and the necessary scripts used both in the process of training the algorithm and in using the model in a real production environment. In addition, it provides a user interface that allows configuring the observation and action spaces and defining the reward function and decision events (see Fig. 4). It also performs automatic normalization of all variables to the range required by the RL algorithm libraries [21].

Simulation Model
The simulation model constituting the environment observed and communicated with by the RL agent was made in the FlexSim software. The model includes a selected part of a multi-assortment (mixed-model) production system, containing a digital representation of a manufacturing workstation and a section of an overhead conveyor with a loop acting as a buffer between workstations. The agent decides in what order the semi-finished products awaiting execution are redirected from the conveyor loop to the selected production workstation at a given moment (Fig. 3).
Fig. 3. FlexSim - Simulation model.
Fig. 4. FlexSim – Agent RL interface parameters.
The selection of which semi-finished product version is routed from the conveyor buffer to the workstation allows load balancing of the cells and maximizes productivity. The impact of the version sequence on efficiency is related to the fact that the total process times depend on the previous version processed on a given workstation, due to the need for changeover and the so-called weight of the version, which affects the required resources. The total process times are known and determined by the current-previous pair. In the test case, processing times range from 90 to 180 s. For the purposes of the simulation experiments, 10 product versions were selected, with differing processing times. Two scenarios of the
experiments were also assumed, differing in the probability distribution of versions in the arrival sequence. The first scenario assumes an inflow according to a uniform distribution over all product versions; in the second, a triangular distribution was assumed, reflecting disruptions in supplies. The agent's observation space was defined as MultiDiscrete, including the values of discrete variables describing the version of the last semi-finished product and the workstation on which it was made. The agent's action space is defined as Discrete, and its value corresponds to the version that will next be redirected from the overhead conveyor loop to the workstation. The function determining the reward for a given step takes into account the current momentary performance of the workstation, while the duration of a training episode was set to one hour of work (Fig. 4).

The Results of the Experiments
Using the RL module described in the previous sections and the simulation model, experiments were performed for two scenarios: (1) a uniform distribution of the inflow of orders (semi-finished products), and (2) a triangular distribution of the inflow of orders. The productivity indicator of the production system was hourly productivity, and it corresponded to a single training episode in the learning phase, consisting of steps each corresponding to the production of one piece. The training for each scenario was 50,000 steps.
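A hedged sketch of the agent side is shown below: a Gym environment mirroring the spaces described above, trained with Stable-Baselines3 PPO for the 50,000 steps used in the experiments. The flexsim_step function is a hypothetical stand-in for the FlexSim Reinforcement Learning tool's socket bridge, stubbed here with random data so the sketch runs on its own:

```python
import gym
import numpy as np
from gym import spaces
from stable_baselines3 import PPO

N_VERSIONS = 10   # product versions, as in the test case
N_STATIONS = 1    # single workstation in the modelled section

def flexsim_step(action):
    # Hypothetical bridge: would forward the chosen version to FlexSim and
    # return (last version, workstation), a throughput reward and a done flag.
    obs = np.array([np.random.randint(N_VERSIONS), 0])
    return obs, float(np.random.rand()), bool(np.random.rand() < 0.01)

class ConveyorEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.MultiDiscrete([N_VERSIONS, N_STATIONS])
        self.action_space = spaces.Discrete(N_VERSIONS)

    def reset(self):
        return np.array([0, 0])

    def step(self, action):
        obs, reward, done = flexsim_step(action)
        return obs, reward, done, {}

model = PPO("MlpPolicy", ConveyorEnv(), verbose=0)
model.learn(total_timesteps=50_000)   # 50,000 steps, as in the experiments
```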
Fig. 5. Comparison of experimental results (scenario 1).
Figure 5 shows the results obtained for the first scenario. On the left are the values obtained when the selection of the semi-finished product to be directed from the conveyor loop to the production cell was carried out without the participation of the agent (the next version was selected at random). On the right are the results obtained when the version selection
is controlled by the trained agent. As can be seen in Fig. 5, an increase in performance of 10% was obtained after training. Figure 6 shows the results for the second scenario. In this case, as might be expected, the sudden change in the version distribution of the inflow stream resulted in a significant drop in cell performance under agentless control (results shown on the left side of the figure). The agent training phase in this case was carried out with the same parameters as before. Over successive episodes, there was a rapid increase in performance, from an initial value of 20 (at 2,000 steps) to 26.2 (at 50,000 steps). The agent quickly adapted to the new situation (the computation time of the training phase was 900 s), and a significant increase in efficiency of 23% was obtained after training.
Fig. 6. Comparison of experimental results (scenario 2).
4 Summary
The paper presents an example of creating digital twins of production systems based on the integration of modern discrete simulation systems with reinforcement learning modules. The combination of artificial intelligence technology with a digital representation of a real production system in the form of a simulation model opens up new possibilities in intelligent management and production control, and supports the implementation of cyber-physical systems with self-adaptation and self-reconfiguration capabilities. The presented preliminary results of the implementation of reinforcement learning algorithms indicate their great potential and adaptability in supporting decision-making in production management. However, further work is necessary to find the best ways to define the agent's observation space, so that it reflects the current state of the manufacturing system as faithfully and in as much detail as possible.
Based on the obtained results and conclusions, further work will also be carried out on tuning algorithm parameters and on their impact on the learning process and effectiveness in typical production planning and control problems, so that the next step, practical verification at the level of demonstrative industrial implementations, can be taken. Acknowledgements. Publication supported as part of the Excellence Initiative—Research University program implemented at the Silesian University of Technology, the year 2023.
References
1. Morgan, J., Halton, M., Qiao, Y.S., Breslin, J.G.: Industry 4.0 smart reconfigurable manufacturing machines. J. Manuf. Syst. 59, 481–506 (2021)
2. Zhu, Q., Huang, S., Wang, G., Moghaddam, S.K., Lu, Y., Yan, Y.: Dynamic reconfiguration optimization of intelligent manufacturing system with human-robot collaboration based on digital twin. J. Manuf. Syst. 65, 330–338 (2022)
3. Wang, S., Wan, J., Li, D., Zhang, C.: Implementing smart factory of industrie 4.0: an outlook. Int. J. Distrib. Sens. Netw. 12(1), 3159805 (2016)
4. Chen, G., Wang, P., Feng, B., Li, Y., Liu, D.: The framework design of smart factory in discrete manufacturing industry based on cyber-physical system. Int. J. Comput. Integr. Manuf. 33(1), 79–101 (2020)
5. Krenczyk, D.: Dynamic simulation models as digital twins of logistics systems driven by data from multiple sources. J. Phys. Conf. Ser. 2198, 012059 (2022)
6. Kohnová, L., Salajová, N.: Impact of industry 4.0 on companies: value chain model analysis. Adm. Sci. 13(2), 35 (2023)
7. Sittón-Candanedo, I., Alonso, R.S., Rodríguez-González, S., García Coria, J.A., De La Prieta, F.: Edge computing architectures in industry 4.0: a general survey and comparison. In: Martínez Álvarez, F., Troncoso Lora, A., Sáez Muñoz, J., Quintián, H., Corchado, E. (eds.) SOCO 2019. AISC, vol. 950, pp. 121–131. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-20055-8_12
8. Aazam, M., Zeadally, S., Harras, K.A.: Deploying fog computing in industrial Internet of Things and industry 4.0. IEEE Trans. Ind. Inform. 14(10), 4674–4682 (2018)
9. Kubiak, K., Dec, G., Stadnicka, D.: Possible applications of edge computing in the manufacturing industry-systematic literature review. Sensors 22(7), 2445 (2022)
10. Zhuang, C., Liu, J., Xiong, H.: Digital twin-based smart production management and control framework for the complex product assembly shop-floor. Int. J. Adv. Manuf. Technol. 96, 1149–1163 (2018)
11. Jwo, J.-S., Lee, C.-H., Lin, C.-S.: Data twin-driven cyber-physical factory for smart manufacturing. Sensors 22(8), 2821 (2022)
12. Rosen, R., von Wichert, G., Lo, G., Bettenhausen, K.D.: About the importance of autonomy and digital twins for the future of manufacturing. IFAC-PapersOnLine 48(3), 567–572 (2015)
13. Wang, Hn., et al.: Deep reinforcement learning: a survey. Front. Inf. Technol. Electron. Eng. 21, 1726–1744 (2020)
14. Cunha, B., Madureira, A.M., Fonseca, B., Coelho, D.: Deep reinforcement learning as a job shop scheduling solver: a literature review. Adv. Intell. Syst. Comput. 923, 350–359 (2020)
15. Poppera, J., Yfantis, V., Ruskowski, M.: Simultaneous production and AGV scheduling using multi-agent deep reinforcement learning. Procedia CIRP 104, 1523–1528 (2021)
16. Halbwidl, H., Sobottka, T., Gaal, A., Sihn, W.: Deep reinforcement learning as an optimization method for the configuration of adaptable, cell-oriented assembly systems. Procedia CIRP 104, 1221–1226 (2021)
17. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 (2017)
18. Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., Dormann, N.: Stable-Baselines3: reliable reinforcement learning implementations. J. Mach. Learn. Res. 22(268), 1–8 (2021)
19. Brockman, G., et al.: OpenAI Gym. arXiv:1606.01540 (2016)
20. Gym - an open-source Python library. https://github.com/openai/gym. Accessed 01 May 2023
21. FlexSim. The reinforcement learning tool. https://docs.flexsim.com/en/22.0/ModelLogic/ReinforcementLearning/KeyConcepts/KeyConcepts.html. Accessed 01 May 2023
Edge Architecture for the Integration of Soft Models Based Industrial AI Control into Industry 4.0 Cyber-Physical Systems
Ander Garcia1,2(B), Telmo Fernández de Barreana1, Juan Luis Ferrando Chacón1, Xabier Oregui1, and Zelmar Etxegoin3
1 Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi 57,
20009 Donostia-San Sebastián, Spain
[email protected], [email protected]
2 Faculty of Engineering, University of Deusto, Mundaitz Kalea, 50, 20012 Donostia-San Sebastián, Spain
3 Gaindu, Inzu Group, Olaso Kalea 45, 20870 Elgoibar, Spain
Abstract. Traditionally, PLC and SCADA systems programmed by automation engineers have been responsible for the control of industrial machines and processes. The Industry 4.0 paradigm has merged the OT and IT domains, proposing new alternatives for this task. Industry 4.0 approaches start by capturing OT industrial data and making it available to the IT domain. Then, this data is visualized, stored and/or analyzed to gain insights into the industrial processes. As a final step, AI models access real-time data to generate predictions and/or control industrial processes. However, this process requires OT and IT knowledge not present in many industrial companies, mainly SMEs. This paper proposes a micro-service edge architecture based on the MING (Mosquitto, InfluxDB, Node-RED and Grafana) stack to ease the integration of soft AI models to control a cyber-physical industrial system. The architecture has been successfully validated controlling the vacuum generation process of an industrial machine. Soft AI models applied to real-time data of the machine analyze the vacuum value to decide the most suitable time (i) to start the second pump of the machine, (ii) to finish the process, and (iii) to stop the process due to the detection of humidity. Keywords: Edge Computing · Industry 4.0 · Cyber-Physical Systems · Control · Artificial Intelligence
1 Introduction
Within traditional industrial automation engineering workflows, the control of processes and machines has been the responsibility of PLC (Programmable Logic Controller) or SCADA (Supervisory Control And Data Acquisition) systems. Most of these PLC and SCADA systems are commercial products programmed with proprietary tools from their provider (Siemens, Beckhoff, GE Digital…). Moreover, the control functionalities they provide are related to classical control engineering. These systems tend to work as isolated silos where orders are received from the outside, but little internal data is shared.
With the introduction of the Industry 4.0 paradigm, these systems have faced new requirements to share and exploit data. The volume of data being captured from manufacturing lines is continuously increasing. To get a deeper insight into manufacturing processes, more data variables are being monitored and data is captured at a higher frequency: from one value of a few key variables for a whole batch, to time series of several variables captured at frequencies of seconds or even faster. In order to face these new requirements, new architectures are required to integrate the Information Technology (IT) and Operations Technology (OT) fields. This implies a myriad of IT and OT technologies, standards and specifications related to Industry 4.0. The complexity of this integration generates a knowledge barrier, as these IT technologies follow a completely different philosophy from the regular tools used by OT engineers and maintenance teams. Small and Medium-sized Enterprises (SMEs), which generally lack multidisciplinary teams with the required IT and OT knowledge and experience, face great difficulties in capturing, monitoring and visualizing data from manufacturing processes. Moreover, this is only the first step of the Industry 4.0 paradigm. Once data is actionable, Artificial Intelligence (AI) models are trained and applied to generate information to create predictions of the manufacturing lines, automatically detect manufacturing problems, and even control the industrial process itself. The integration of this second step increases the difficulties for SMEs, as AI engineers must be integrated into the teams, and the IT architecture has to integrate AI algorithms and feed them with real-time data. Furthermore, the integration of Industrial AI functionalities into traditional industrial buses and devices is a cumbersome task. This paper proposes a micro-service based edge architecture to face these difficulties and to ease the integration of soft AI models into cyber-physical systems to control industrial processes and machines. The architecture is based on the MING software stack, which comprises several Open Source products to ease the deployment of Industry 4.0 solutions. The stack covers the main functionalities of these solutions:
• A MQTT broker that manages the communication based on messages with a topic and a payload.
• A time-series database to manage captured industrial data.
• A no-code tool to ease the programming of the required logic.
• An easy-to-use tool to visualize data, generate dashboards and configure alarms.
Although each functionality can be fulfilled by several tools, the MING stack proposes Mosquitto as the MQTT broker, InfluxDB as the time-series database, Node-RED as the no-code tool, and Grafana as the analytics and interactive visualization web application. Starting from the MING stack, InfluxDB has been replaced with the TimescaleDB extension to the PostgreSQL database. Moreover, this paper integrates soft AI models fed with real-time data and sending commands to control the industrial process through the MQTT broker. The cyber-physical system includes a gateway communicating with the MQTT broker and with the industrial PLC. The architecture has been validated controlling a vacuum generation process. Before the integration of the architecture, the PLC was in charge of the control of the valves and the pumps of the vacuum machine. After the integration, the control has been performed by soft AI models analyzing real-time data from the vacuum machine.
The structure of the paper is as follows. Section 2 reviews the related work. Section 3 presents the proposed architecture, while Sect. 4 focuses on its validation. Finally, Sect. 5 presents the main conclusions and future work.
2 Related Work
The Industrie 4.0 term, also known as the Industrial Internet of Things (IIoT), was presented by the German government in 2011. The objective of the fourth industrial revolution is to work with a higher level of operational productivity and efficiency, connecting the physical to the virtual world. Industry 4.0 is related to several technologies such as the Internet of Things (IoT), Industrial Automation, Cybersecurity, Intelligent Robotics, or Augmented Reality [1]. The term cyber-physical system (CPS) [2] merges the "cyber", as electric and electronic systems, with "physical" things. The "cyber component" allows the "physical component" (such as mechanical systems) to interact with the physical world by creating a virtual copy of it. This virtual copy will include the "physical component" of the CPS (i.e., a cyber-representation) through the digitalization of data and information [1]. In general, a CPS consists of two main functional components: (1) the advanced connectivity that ensures real-time data acquisition from the physical world and information feedback from the cyber space; and (2) intelligent data management, analytics and computational capability that constructs the cyber space [3]. Edge computing, which is crucial for a new type of intelligent industrial devices and services based on AI, is a paradigm where data are analyzed and stored close to the devices generating and consuming them, addressing previous disadvantages and making these services attractive for manufacturing scenarios [4, 5]. Edge computing devices have increasingly powerful computation functionalities, able to run Industrial AI applications with high computation requirements [6]. However, these solutions present a relevant knowledge barrier, as they require OT, IT and AI expertise. Although some architectures have been proposed to ease the integration of Industry 4.0 data capturing and monitoring solutions, the main challenges for both I4.0 and the IIoT (Industrial Internet of Things), including security and the standard exchange of data and information between devices, machines and services (not only within an industry but also between different industries), still remain open issues [7]. In [5], a review of the application of the edge computing paradigm in manufacturing scenarios is provided, identifying architectures, advances and open challenges. For example, the authors of [8] have recently proposed an architecture to capture and monitor time-series data. Previously, [9] presented an MQTT-based IoT Cloud Platform with Node-RED that could be ported to edge scenarios, and which shares various elements with the MING stack. In [10], a low-cost, highly customizable modular SCADA approach based on Node-RED was presented. Recently, a system for PEM hydrogen generators based on Grafana and Node-RED was proposed in [11]. However, as far as the authors know, there are no examples of architectures integrating Industrial AI models into machines and processes focused on lowering the knowledge barrier required to deploy them. Thus, this paper focuses on proposing an easy to use and deploy architecture, validating it by controlling an industrial vacuum machine with AI models.
3 Architecture
The proposed architecture (Fig. 1) is based on micro-services deployed as docker containers. Apart from the industrial machine/process, all components are docker containers, which eases deployment and improves scalability. The main components of the architecture are the following.
Fig. 1. Architecture
• Machine. The machine is the industrial machine or process to be controlled. It is programmed by automation engineers. Its main responsibilities are safety and low-level cycle control. Higher-level control of the machine/process, however, is modelled as a black-box with defined inputs and outputs. Inputs and outputs are mapped to variables and/or methods of the automation program and are available through the OT communication protocol compatible with the machine, such as Siemens S7, Modbus or OPC UA.
• Gateway. The gateway interacts with the machine, translating OT communication protocols to MQTT. MQTT is a robust and trustworthy protocol, with implementations requiring very little computation and available for most current hardware and software platforms. MQTT is based on a queue manager (broker), to which different clients send (publish) messages. Each message is sent with a certain subject (topic) and may contain data (payload). Other clients can register their interest in certain topics with the queue manager (subscribe). When the queue manager receives a message with one of these topics, it sends the message to the subscribed clients. Each modelled input and output of the machine is mapped to MQTT topics and message payloads by the gateway. Moreover, the gateway may include some basic logic to group some inputs and outputs as new input/output variables or methods. Although the gateway can be programmed with several programming languages such
as Python, Java or NodeJS, Node-RED is proposed due to its no-code approach, which lowers the knowledge barrier required to program and maintain the gateway.
• Broker. The MQTT broker is the key element connecting the different components of the architecture. It receives messages with a topic and a payload, and resends them to the components interested in those topics.
• Writer. The writer subscribes to MQTT topics and updates the database accordingly (see the sketch below). As previously noted, although Node-RED is proposed to implement it, any programming language may be used.
• Database. The proposed database is TimescaleDB, a time-series database that is an extension of the popular Open Source engine PostgreSQL. Although the functionalities of TimescaleDB are equivalent to those offered by InfluxDB, with TimescaleDB the same database can store regular relational, JSON and time-series data.
• Remote HMI. This is a simplified version of the HMI of the machine deployed as a Web application. Node-RED is the proposed tool for this HMI, but any Web technology stack could be used.
• Visualization and monitoring. These components are based on Grafana, which connects to the database to generate (i) personalized dashboards to visualize information, and (ii) customized alarms to monitor the machine. Grafana is a popular multi-platform Open Source analytics and interactive visualization web application. It is agnostic of the underlying database and has an intuitive user interface both to customize charts and dashboards, and to generate alerts and notifications based on advanced rules and notification channels.
• Industrial AI algorithms. These algorithms are deployed as containers subscribed to the MQTT broker to be fed with real-time data. The algorithms are based on soft AI models that analyze data in real time and generate the commands required to control the machine. These commands are sent as messages to the MQTT broker; the gateway receives the messages and translates them to the OT protocol used by the machine, which reacts to the command sent by the soft AI model. These algorithms can be programmed with several tools, but Python is proposed due to the rich AI ecosystem it offers.
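The Writer itself is not listed in the paper; the following minimal sketch, assuming a hypothetical `vacuum_data` table, a Mosquitto broker reachable under the host name `mosquitto` and illustrative credentials and topic names, shows how it could be coded in Python with paho-mqtt (1.x-style API) and psycopg2 instead of Node-RED.

```python
import json
import paho.mqtt.client as mqtt
import psycopg2

# Hypothetical connection settings and table; adjust for a real deployment.
conn = psycopg2.connect(host="timescaledb", dbname="plant",
                        user="writer", password="secret")
conn.autocommit = True

def on_message(client, userdata, msg):
    # Each payload is assumed to be a JSON object {"ts": ..., "value": ...}.
    sample = json.loads(msg.payload)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO vacuum_data (time, value) VALUES (to_timestamp(%s), %s)",
            (sample["ts"], sample["value"]),
        )

client = mqtt.Client()            # paho-mqtt 1.x-style constructor
client.on_message = on_message
client.connect("mosquitto", 1883)
client.subscribe("machine/data")  # illustrative topic name
client.loop_forever()
```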
These components collaborate to create an advanced cyber-physical system with the following workflow to control the machine using Industrial AI algorithms.
• Step 1: After the process is started, data is generated by the machine, and the gateway uses OT protocols to continuously read machine data and send it to the MQTT broker in real time.
• Step 2: Machine data is received and published by the MQTT broker. Then, these actions happen in parallel:
o The writer receives data and stores it in the database.
o The HMI is updated in real time.
o Grafana visualizes and monitors data in real time.
o Data is received by the AI algorithms to control the process. When these algorithms identify that some action must be taken, they generate a command and send it to the MQTT broker.
• Step 3: The gateway receives the commands and translates them into the corresponding OT protocol to be sent to the machine.
• Step 4: The machine reacts to the received control order.
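As an illustration of the AI-algorithm side of this workflow, the sketch below subscribes to real-time data and publishes a control command when its model fires. The broker address, topic names and the trivial decision rule are assumptions for illustration; the actual soft models used in the validation are described in Sect. 4.

```python
import json
import paho.mqtt.client as mqtt

DATA_TOPIC = "machine/data"        # hypothetical topic names
COMMAND_TOPIC = "machine/command"

def decide(vacuum_value):
    # Placeholder for a soft AI model: return a command string or None.
    return "stop" if vacuum_value < 1e-3 else None

def on_message(client, userdata, msg):
    value = json.loads(msg.payload)["value"]
    command = decide(value)
    if command is not None:
        # The gateway subscribes to this topic and translates the command
        # into the OT protocol understood by the PLC (Step 3 above).
        client.publish(COMMAND_TOPIC, json.dumps({"command": command}))

client = mqtt.Client()             # paho-mqtt 1.x-style constructor
client.on_message = on_message
client.connect("localhost", 1883)  # assumed broker address
client.subscribe(DATA_TOPIC)
client.loop_forever()
```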
4 Validation
The architecture has been validated controlling a vacuum generation process. This process is composed of a vacuum chamber, three valves, one regular pump, one special pump, and vacuum measurement equipment (Fig. 2a). A Siemens 1500 PLC controls the process, connects to a Pfeiffer ASM 340 vacuum generator and measurement equipment, and presents a traditional HMI offering actions to start and stop the vacuum process, open/close the valves and the chamber, start/stop the pumps, and read the vacuum level. The vacuum generation process is the first step of a leak test detection machine under development. In a traditional setup, the PLC has to be configured in advance with the total time of the process and the time at which the special pump is to be started. These times are configured manually by the operator following a trial-and-error methodology until the desired vacuum level (10^-1, 10^-2, 10^-3, …) is reached. In order to control these times by soft AI models, the process was first modelled as a black-box with the inputs and outputs listed in Table 1.

Table 1. Inputs and outputs of the process

Inputs: Commands to open/close the valves; Commands to start/stop the pumps
Outputs: State of the valves; State of the pumps; Real-time vacuum level
Instead of modifying the original PLC program, a new data-block has been created with variables related to these inputs and outputs. Then, using the SIOME software, a new OPC UA data structure has been created, and each variable of the data-block has been assigned to one variable of the OPC UA structure. Finally, this mapping has been imported into TIA Portal to generate an OPC UA server. Then, in order to simplify the control of the process, two new inputs have been added to start and stop the vacuum process. The gateway transforms these commands into the proper open/close commands of the pumps and the valves expected by the PLC. The different inputs and outputs have been mapped to these topics:
• /machine_name/start: Input command to start a vacuum generation process.
• /machine_name/created: Output acknowledging that a new process has been created in the database and the vacuum process has been started. At this step only the regular pump is running. The payload includes the identifier of the new process.
• /machine_name/superpump: Input command to start the special pump with the objective of reaching lower vacuum levels.
• /machine_name/data: Output sharing real-time data about the vacuum level.
• /machine_name/stop: Input command to stop the vacuum generation process.
• /machine_name/humidity: Input command to notify that humidity has been detected during the process. In this case the user is notified and the process is stopped.
The HMI (Fig. 2b) has been generated as a Node-RED user interface reacting to these messages. The HMI presents several controls. Three gauges represent the measured vacuum level in real time. Each of them has a different range to represent different vacuum levels: high (0–1000 mbar), medium (0–10 mbar) and low (0–1 mbar). Several data fields show the identifier of the current process, the real-time vacuum value, the status of the process (starting/started/stopped) and the elapsed time. Finally, three buttons allow the user to start a vacuum generation process, manually stop it, or switch on the special pump, and two switches represent the state of the pumps.
Fig. 2. User-interface of the HMI
Grafana (Fig. 3a) has been integrated to generate a dashboard where current and historical data can be visualized. The user can select one or more vacuum generation processes, and the dashboard shows them. The visualized vacuum generation processes were run with the same part in the chamber.
Fig. 3. (a) Dashboard to show vacuum measurements at Grafana. (b) Humidity effect
Finally, the Industrial AI based on soft models analyses the vacuum measurements from real-time data. The output of these models identifies the state of the process under control and
decides whether the special pump must be switched on, the process has finished, or humidity has been detected and thus the process has to be stopped. These models have been trained with a dataset generated with the vacuum machine running in the traditional PLC-controlled mode, but storing vacuum measurements in the database. These manual processes have been repeated with two types of parts: one aluminium part and one 3D-printed part. Several running times and special pump switching-on times have been mixed in these training processes. Each process lasts between 40 and 180 s. Data has been captured at 3 Hz in the database, but it has been downsampled with the average value for each second to train the models. Some parts have also been sunk into water before entering the chamber in some of the processes. The final training dataset had more than 120 labelled individual vacuum processes. This dataset has been used to generate three different models. After analysing several approaches, from traditional digital signal processing techniques to neural networks, these models have been generated and deployed inside docker containers:
• Switch-on special pump. When the vacuum generation process starts, the first pump is also started. This pump only reaches a vacuum level close to 1, and the help of the second pump is required to reach lower vacuum levels. The special pump should be activated when a certain vacuum level has been reached and the vacuum decay over time starts to flatten. The developed model takes a 20-point window, first fits the window to an exponential function, and computes the gradient of the fitted curve. High negative gradient values indicate that the slope of the window is high, meaning that the vacuum is decaying. In contrast, small negative values indicate that the slope of the window is small, meaning that the vacuum decay curve is becoming flat. Thus, by comparing the maximum value of the gradient against a threshold, it is possible to detect when the pressure decay starts to become flat. This moment is the optimum time to start the special pump to continue decreasing the vacuum level (a sketch of this detector follows the list).
• Stop the process. Once a certain level of vacuum has been reached, the two pumps are not able to decrease it further. In order to detect whether the time to stop the process has been reached, the first step is to set the objective vacuum value (precision). Afterwards, on the one hand, if the objective vacuum value is achieved, the process is ended. On the other hand, it is possible that the desired vacuum level is not reachable for different reasons, such as pollution of the chamber or mechanical defects. In this second case, waiting more time is a waste of resources. The model that identifies this case takes 15-point windows and calculates upper and lower bounds. Afterwards, the algorithm evaluates whether the remaining points of the window lie inside the calculated upper and lower bounds. If so, it is considered that the vacuum value of the process will not keep decreasing and the decision to stop the process is taken.
• Humidity detection. The presence of humidity affects the way the vacuum value decays over time, causing a small increment in the evolution of the pressure. This phenomenon can be seen in Fig. 3b. To detect humidity defects during the process, the humidity detection algorithm takes a 20-point window and computes the differences of adjacent points over the window.
Then, if any of the differences is greater than a previously set threshold, it is considered that humidity is present. In this case, a command is sent to notify the situation and stop the process.
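The detectors are described in prose only; the following sketch shows one possible NumPy/SciPy implementation of the special-pump trigger. The threshold value and the initial-guess heuristics are hypothetical tuning choices, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(t, a, b, c):
    # Exponential model fitted to the 20-point vacuum window.
    return a * np.exp(-b * t) + c

def special_pump_trigger(window, threshold=-1e-3):
    """Return True when the fitted vacuum decay has flattened.

    Gradients are strongly negative while the vacuum is still decaying;
    a maximum gradient above a small negative `threshold` means the decay
    curve has become flat, i.e. the optimum time to start the special pump.
    """
    w = np.asarray(window, dtype=float)
    t = np.arange(len(w), dtype=float)
    try:
        params, _ = curve_fit(exp_decay, t, w,
                              p0=(w[0] - w[-1], 0.1, w[-1]), maxfev=5000)
    except RuntimeError:
        return False  # the fit did not converge; keep waiting
    gradient = np.gradient(exp_decay(t, *params))
    return gradient.max() > threshold
```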
5 Conclusions
The first steps of Industry 4.0 approaches are focused on capturing OT industrial data and making it available to the IT domain. Then, this data is visualized, stored and/or analyzed to gain insights into the industrial processes. In further steps, AI models access real-time data to generate predictions and control industrial processes. However, this requires OT and IT knowledge not present in many industrial companies, mainly SMEs. The proposed architecture is based on several micro-services deployed as docker containers and the MING (Mosquitto, InfluxDB, Node-RED and Grafana) stack to ease the integration of soft AI models to control a cyber-physical industrial system. The architecture has been successfully validated controlling the vacuum generation process of an industrial machine. Soft AI models applied to real-time data of the machine analyze the vacuum value to decide the most suitable time (i) to start the second pump of the machine, (ii) to finish the process, and (iii) to stop the process due to the detection of humidity. The validation shows that full control of the vacuum generation process can be performed without requiring the use of the HMI of the PLC. Moreover, the previous manual configuration step of the PLC-controlled process, where the operators had to manually identify the optimum duration of the process and the starting time of the special pump, has been greatly simplified. The soft models only require a precision parameter, which depends on the vacuum level required by the final application. With this parameter, the AI models automatically control the process, deciding when to switch on the special pump, when to finish the process, and when to stop due to humidity detection. Future work starts with further industrial validations deploying the architecture and new control models on other machines or processes. These new scenarios will test the resilience, scalability and performance of the architecture and each of its components. For example, scenarios requiring a very high throughput of real-time data may need to include Kafka instead of MQTT. The same applies to scenarios where a Node-RED based gateway cannot cope with the complexity of the required logic, and the gateway has to be manually coded using Python or another programming language. A second line of future work tackles the performance of the real-time capabilities of the infrastructure, which are relevant to define the black-box of the machine or process that is going to be controlled by soft models. Below the achieved real-time capabilities, the PLC and SCADA systems must still be in charge of the control functionalities. Acknowledgement. Research was partially supported by the Centre for the Development of Industrial Technology (CDTI) and the Spanish Minister of Science and Innovation (IDI-20210506) and by the Economic Development, Sustainability and Environment Department of the Basque Government (KK-2022/00119).
References
1. Alcácer, V., Cruz-Machado, V.: Scanning the industry 4.0: a literature review on technologies for manufacturing systems. Eng. Sci. Technol. an Int. J. 22(3), 899–919 (2019). https://doi.org/10.1016/j.jestch.2019.01.006
2. Fei, X., et al.: CPS data streams analytics based on machine learning for cloud and fog computing: a survey. Futur. Gener. Comput. Syst. 90, 435–450 (2019). https://doi.org/10.1016/j.future.2018.06.042
3. Lee, J., Bagheri, B., Kao, H.A.: A cyber-physical systems architecture for industry 4.0-based manufacturing systems. Manuf. Lett. 3, 18–23 (2015). https://doi.org/10.1016/j.mfglet.2014.12.001
4. Alam, M., Rufino, J., Ferreira, J., Ahmed, S.H., Shah, N., Chen, Y.: Orchestration of microservices for IoT using docker and edge computing. IEEE Commun. Mag. 56(9), 118–123 (2018). https://doi.org/10.1109/MCOM.2018.1701233
5. Qiu, T., Chi, J., Zhou, X., Ning, Z., Atiquzzaman, M., Wu, D.O.: Edge computing in industrial internet of things: architecture, advances and challenges. IEEE Commun. Surv. Tutorials 22(4), 2462–2488 (2020). https://doi.org/10.1109/COMST.2020.3009103
6. Singh, R., Gill, S.S.: Edge AI: a survey. Internet Things Cyber-Phys. Syst. 3, 71–92 (2023). https://doi.org/10.1016/j.iotcps.2023.02.004
7. Lu, Y.: Industry 4.0: a survey on technologies, applications and open research issues. J. Ind. Inf. Integr. 6, 1 (2017). https://doi.org/10.1016/j.jii.2017.04.005
8. Garcia, A., Oregui, X., Franco, J., Arrieta, U.: Edge containerized architecture for manufacturing process time series data monitoring and visualization. In: IN4PL 2022 - Proceedings of the 3rd International Conference on Innovative Intelligent Industrial Production and Logistics, pp. 145–152 (2022). https://doi.org/10.5220/0011574500003329
9. Rattanapoka, C., Chanthakit, S., Chimchai, A., Sookkeaw, A.: An MQTT-based IoT cloud platform with flow design by Node-RED. In: RI2C 2019 - 2019 Research, Invention, and Innovation Congress, December 2019 (2019). https://doi.org/10.1109/RI2C48728.2019.8999942
10. Nițulescu, I.-V., Korodi, A.: Supervisory control and data acquisition approach in Node-RED: application and discussions. IoT 1(1), 76–91 (2020). https://doi.org/10.3390/iot1010005
11. Folgado, F.J., González, I., Calderón, A.J.: Data acquisition and monitoring system framed in industrial internet of things for PEM hydrogen generators. Internet Things 22, 100795 (2023). https://doi.org/10.1016/j.iot.2023.100795
The Use of Line Simplification and Vibration Suppression Algorithms to Improve the Quality of Determining the Indoor Location in RTLSs
Grzegorz Ćwikła(B) and Tomasz Lorenz
Faculty of Mechanical Engineering, Silesian University of Technology, Konarskiego 18a, 44-100, Gliwice, Poland [email protected]
Abstract. The article presents an attempt to develop algorithms to simplify the object movement trajectory obtained from a real-time locating system (RTLS) in order to reduce the amount of data while maintaining sufficient compliance with the original trajectory. Software was developed to collect and process data from the RTLS, allowing the application of the Douglas-Peucker, Visvalingam-Whyatt and location instability suppression algorithms. The RTLS and the algorithms were tested on example trajectories whose location data was collected using the Decawave RTLS. As a result of the research, the usefulness of the algorithms was assessed and recommended parameters were determined, allowing correct location results to be obtained after simplifying the trajectory and suppressing random vibrations, while reducing the amount of processed data. Keywords: Real-time Locating System · RTLS · accuracy · Douglas-Peucker algorithm · Visvalingam-Whyatt algorithm · location instability suppression algorithm
1 Introduction
Determining location with the use of the Global Positioning System is now a commonly used technology that has found applications in many areas of life. GPS (and its equivalents, e.g. Galileo) allows the position of a given object to be determined with high accuracy in a short time in global geographic coordinates, but it also has some limitations [12]. The main limitation results from the use of radio technology to determine the location via satellites - these signals have a relatively low power and are often unable to penetrate building roofs and other obstacles, which makes it difficult or even impossible to use GPS to determine indoor location. This excludes the possibility of using this technology in many branches of industry. Therefore, for indoor applications, locating systems are used that rely on technologies specifically designed for indoor work, and the results are presented in local coordinates [7]. Systems for locating objects indoors (RTLSs) are now increasingly used in industry, mainly to support management and optimize the use of production resources [15, 18]. For this type of application, 0.1 to 1 m is considered to be
sufficiently accurate, and in some less demanding cases an accuracy of up to several meters can be good enough [6, 16]. They are therefore by definition not systems for very high accuracy applications, for example the control of automated guided vehicles (AGVs). RTLSs are based on various phenomena, technologies and algorithms that enable the determination of the object's location [17]. There is no established standard in indoor localisation, thus the selection of an existing system needs to be based on the type of environment being tracked, the speed, the cost and the accuracy required [1]. The majority of locating techniques can be classified as active systems, due to the necessity of electronic devices (tags) carried by the person/vehicle/object being tracked in order to estimate their position, whilst passive localization stands for the locating of objects without additional devices [17]. Indoor RTLSs almost always have to deal with a basic problem - a typical operating environment contains many obstacles and objects that make it difficult to accurately determine the location: walls, machines, building structure elements, tanks, storage racks, vehicles and any other objects that are in the way between the RTLS's elements: the transmitters and receivers of the signals used to determine the location [17]. Although most of the RTLS technologies used today do not require a complete absence of obstacles between the transmitter and receiver (LoS - Line of Sight), the quality of the results obtained is noticeably better in an environment where the number of obstacles is smaller. As a result, the coordinates of objects obtained from the RTLS are often characterized by a high level of noise or disruptions that result from incorrect determination of coordinates [5], especially when the object whose location is being determined moves (usually some time after the object has stopped moving, the results of determining the location stabilize). There is therefore a need to develop methods to improve the quality of determining a location by detecting and removing incorrect coordinates. The first step in assessing the quality of the data obtained is determining whether the subsequent coordinates can be reliable, taking into account the possible velocity, accelerations or other possible constraints, e.g. walls between the previous and current indicated location. This allows for the removal of the most obvious disturbances, but in order to smooth the obtained location path more accurately, while still not introducing oversimplification, more sophisticated methods are needed. This article presents the basics of indoor RTLS technology and preliminary research on the proposed algorithms for the improvement of location data.
1.1 Methods and Technologies for Determining Indoor Location
Determination of the location of objects is one of the key features in context-aware computing [17]. A Real-time Locating System can be defined as a combination of software and hardware used to automatically determine the coordinates of an object (e.g. a person, WIP or equipment) within an instrumented area, allowing tracking and management [2]. Technologies allowing for locating can be classified according to the positioning algorithm and the physical layer or sensor infrastructure. Location sensing approaches can be classified into the following main categories: triangulation, trilateration, hyperbolic lateration, proximity, location fingerprinting (scene analysis), and dead reckoning [7].
The metrics typically used in most of these approaches are: Received Signal Strength Indicator (RSSI), Time of Arrival (TOA), Time Difference
of Arrival (TDOA), Angle of Arrival (AOA) or Direction of Arrival (DOA). Most locating technologies are based on radio frequency (RF) communication, e.g. WiFi, Bluetooth, Ultra-wideband (UWB) or Radio Frequency Identification (RFID). A few methods are based on other technologies and phenomena, in most cases ultrasonic or infrared [3, 4]. Because of the problems with obtaining a precise indoor location in complicated environments, with objects causing signal interference or obscuring the Line-of-Sight required by some methods, and with a high number of located objects, hybrid systems combining multiple techniques or algorithms in order to improve the accuracy of the location estimation are becoming popular [10].
2 Decawave as an Example of an RTLS Based on UWB Technology
The Decawave DWM1001 devices used in the RTLS research allow a locating system to be built with a range of ~60 m and an accuracy of up to 0.1 m [8]. The system is based on hardware modules that can be configured as a tag (the location of which is determined), an anchor (fixed reference point), or a listener (listening for tag location data within Bluetooth range). Location data is calculated by the tags themselves. A tag receives UWB signals from anchors with system-defined positions (reference points). RF UWB signals are used to determine the location, while Bluetooth (BLE) is used for communication with data collecting devices. The tag which is localized can use information about its location (e.g. for navigation), store it or send it to other system elements via Bluetooth (directly or via a listener), where it can be used for further analyses [11]. The research used the MDEK1001 development kit, consisting of 12 devices that can be configured as a tag, anchor or listener. The minimum configuration of the system providing location data should be based on a minimum of 4 anchors arranged in a shape similar to a rectangle. Their coordinates, hereinafter referred to as reference coordinates, are saved in the memory of all system elements (via BLE using an application for a smartphone or tablet). The number of anchors can be increased as needed to extend the RTLS range. One of the anchors must be configured as the initiator, which is responsible for starting and controlling the entire system. Communication between the devices included in the system takes place using the TDMA technique; each of the system elements receives time slots allowing for data transmission. The UWB RTLS uses the TWR (Two-Way Ranging) technique, which, based on the measured time of data transmission between the tag and a single anchor and the speed of light, determines the distance between the two devices. This operation is repeated at least 3 times, using 3 different anchors. Based on the obtained distances from the anchors, the tag determines its current position [8]. Location data can be read from the tag via USB or indirectly via the listener that can be connected to the computer via USB (Fig. 1). The application for a tablet or smartphone provided by Decawave is used for the configuration of system components and initial tests; however, it only provides basic functionality. New software has been developed for the research, allowing for the reception of data from a tag or listener via USB, visualization, and the use of the investigated data processing algorithms.
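The DWM1001 tags compute their position internally with a proprietary solver; the least-squares sketch below only illustrates the underlying principle of TWR-based position estimation from known anchor coordinates, with made-up example ranges.

```python
import numpy as np
from scipy.optimize import least_squares

def trilaterate(anchor_positions, ranges):
    """Estimate a tag position from >= 3 anchors and TWR distances.

    anchor_positions: (n, 2) or (n, 3) array of reference coordinates.
    ranges: n measured tag-anchor distances (time of flight times the
    speed of light, as delivered by the TWR exchange).
    """
    anchors = np.asarray(anchor_positions, dtype=float)
    ranges = np.asarray(ranges, dtype=float)
    residuals = lambda p: np.linalg.norm(anchors - p, axis=1) - ranges
    guess = anchors.mean(axis=0)   # start from the centroid of the anchors
    return least_squares(residuals, guess).x

# Example: four anchors in a rectangle, noisy ranges to a tag inside it.
anchors = [(0.0, 0.0), (10.0, 0.0), (10.0, 8.0), (0.0, 8.0)]
print(trilaterate(anchors, [5.0, 6.7, 7.8, 6.4]))
```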
Fig. 1. Example RTLS configuration with 6 anchors and listener, computer receives data from listener via USB; tablet or smartphone used to configure system components and collect data via BLE (only some connections are shown)
3 Algorithms for Improving RTLS Data
Tests of available RTLSs (Decawave, Ubisense) have shown that most of the recorded trajectories contain points that limit the transparency of the data, distort the shape, and are sometimes clearly incorrect, because the observed object certainly could not have been in some locations [5, 14]. Usually these are points created as a result of unidentified disturbances or loss of system range occurring during the movement of the object with the tag. It can be observed that the coordinates obtained in this way differ noticeably from the preceding or following values. Analyzing possible solutions, it was found that a promising approach to this problem may be the use of algorithms simplifying the path that arises from the combination of points recognized by the RTLS. The term "simplify" in reference to geometry means minimizing the number of points while keeping in line with the original shape of the path. Based on the analysis of the available sources, it was concluded that the currently most popular polyline simplification algorithms, used e.g. for border contours when scaling maps for display, which seem suitable for adaptation to smooth the path obtained from the RTLS, are Douglas-Peucker [13] and Visvalingam-Whyatt [9].
3.1 Polyline Simplification Algorithms
The algorithms used in this chapter were created as a result of the need to simplify excessively complex shapes, e.g. when displaying maps at different scales. An exact shape consisting of a high number of points, when displayed at a smaller scale, may contain many details that would not be significant in the current view, but increase the processing power requirement, and therefore extend the loading and display time of such images. The task of such algorithms is to simplify a polyline consisting of many points, maintaining the highest possible compliance with the original shape (Fig. 2). Such algorithms can therefore also be used in RTLS-related applications.
Fig. 2. Example of simplifying the trajectory of a tag movement: 12 points to 6 points
The first of the discussed algorithms is the Douglas-Peucker (D-P) algorithm [13]. Its main purpose is to identify points that are less important to the overall shape of the line and remove them; no new points are generated. The algorithm usually requires one parameter, the tolerance (usually referred to as Epsilon), to be specified. D-P is an iterative algorithm - it removes a point, divides the line and starts again, until the list of points to be analyzed is exhausted. In the first step, the algorithm creates a line between the first and last point of the analyzed line segment. Then, it identifies the point that is farthest from the line joining the endpoints. If the distance between the line and the point is less than Epsilon, the point is discarded and the algorithm starts all over again until there is no point between the endpoints. If the distance between the point and the line is greater than Epsilon, the first and the farthest point are connected by a different line, and any point which is closer to this new line than Epsilon is discarded. Whenever a new farthest point is identified, the original line splits into two and the algorithm continues on each part separately. The Visvalingam-Whyatt (V-W) algorithm is also used to identify points that can be removed from the line [9]. Unlike D-P, the tolerance factor is defined as the area of a triangle, not the distance between points. The V-W algorithm in the first step generates triangles between points and calculates their surface areas. Then the algorithm identifies the smallest of these triangles and checks whether its area is smaller or larger than the epsilon. If it is smaller, the point associated with the triangle is discarded and the algorithm repeats the previous steps - generating new triangles, identifying the smallest one, checking the condition and repeating. The algorithm stops when all generated triangles are larger than epsilon. The results of applying these algorithms may be similar; the differences appear at different epsilon values and shapes of the input trajectories. V-W tends to create smoother geometry and is often preferred for simplifying images of natural objects. D-P is faster, but tends to produce spiky lines with certain input configurations and epsilon values [9].
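Neither implementation is printed in the article; the following naive reference sketches follow the two descriptions above and operate on an (n, 2) NumPy array of trajectory points (Epsilon and the triangle-area limit are passed in the units of the trajectory coordinates).

```python
import numpy as np

def _dist_to_line(pt, a, b):
    # Perpendicular distance from pt to the line through a and b.
    if np.allclose(a, b):
        return float(np.linalg.norm(pt - a))
    num = abs((b[0] - a[0]) * (a[1] - pt[1]) - (a[0] - pt[0]) * (b[1] - a[1]))
    return num / float(np.linalg.norm(b - a))

def douglas_peucker(points, epsilon):
    """Keep the farthest point if it exceeds epsilon, then recurse on both halves."""
    pts = np.asarray(points, dtype=float)
    if len(pts) < 3:
        return pts
    dists = [_dist_to_line(p, pts[0], pts[-1]) for p in pts[1:-1]]
    i = int(np.argmax(dists)) + 1
    if dists[i - 1] <= epsilon:
        return np.array([pts[0], pts[-1]])      # discard all interior points
    left = douglas_peucker(pts[: i + 1], epsilon)
    right = douglas_peucker(pts[i:], epsilon)
    return np.vstack([left[:-1], right])        # do not duplicate the split point

def visvalingam_whyatt(points, min_area):
    """Repeatedly remove the interior point whose triangle has the smallest area."""
    pts = [np.asarray(p, dtype=float) for p in points]
    area = lambda a, b, c: 0.5 * abs((b[0] - a[0]) * (c[1] - a[1])
                                     - (c[0] - a[0]) * (b[1] - a[1]))
    while len(pts) > 2:
        areas = [area(pts[k - 1], pts[k], pts[k + 1]) for k in range(1, len(pts) - 1)]
        k = int(np.argmin(areas))
        if areas[k] >= min_area:
            break                               # all remaining triangles are large enough
        del pts[k + 1]                          # drop the point of the smallest triangle
    return np.array(pts)
```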
3.2 Location Instability Suppression Algorithm (LISA)
The second type of problem with RTLS data is a certain number of coordinate readings of objects that are not moving, leading to an unnecessary increase in the amount of analyzed location data. Some RTLS manufacturers try to remedy this problem by detecting the movement of the object with, e.g., an accelerometer, and stopping or significantly reducing the frequency of locating if the object is not moving according to the indications of this sensor. Despite this, the number of disturbances is still large, and redundant points with coordinates similar to the location of the object at a given moment can be observed. These points are generated due to fluctuations in the level of the signals used to determine a location. It is therefore justified to propose an algorithm allowing the number of points generated when the object is immobile to be reduced. The last algorithm presented was proposed specifically for the detection and removal of excess points obtained despite the fact that the tag was not moving. An accumulation of an excessive number of points in a small area, resulting from minor differences in the signals received from a stationary object, is detected. The basis of the LISA operation is to detect the accumulation of points in a circle of radius R, allowing the sensitivity to be adjusted. When more points are detected in such a circle, the first one is left and the rest are removed from the trajectory (Fig. 3).
Fig. 3. Visualization of location instability suppression algorithm
Before proceeding to the execution of individual steps of the procedure, it is necessary to determine the value of the parameter being the radius of the area around the given point. Its value has the greatest impact on the operation of the entire algorithm, therefore it can significantly change the shape of the obtained trajectory. This algorithm can be compared to scanning an area with a certain radius, centered on a given point. For a trajectory consisting of points A, B, C and D, the algorithm is carried out as follows: the circular area around the first point (A) is checked, if the next point (B) is inside this area, it is deleted, and the coordinates of point A are compared with the next point (C). This process is repeated for point A until none of the following trajectory points are found in a given area within radius R. These steps are repeated for the next point - assuming that point B was deleted, the next point to be checked is C. In this way all points are verified up to the last, which, like the first, does not change.
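A direct transcription of this procedure might look as follows; `radius` corresponds to the R parameter, and, as specified above, the first and last points are never removed.

```python
import numpy as np

def suppress_instability(points, radius):
    """Drop every point that falls within `radius` of the last retained point."""
    pts = np.asarray(points, dtype=float)
    kept = [pts[0]]                    # the first point never changes
    for p in pts[1:-1]:
        if np.linalg.norm(p - kept[-1]) > radius:
            kept.append(p)             # outside the circle: new reference point
    kept.append(pts[-1])               # the last point, like the first, is kept
    return np.array(kept)
```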
4 Research on the Effectiveness of Algorithms
After the initial stages of tests, verifying the correct operation of the application and the RTLS, tests were carried out to determine the effectiveness of the algorithms simplifying the line and damping trajectory vibrations. Two RTLS data sets (previously registered object
movement trajectories with the Decawave tag) were selected, on which the algorithms were tested. The first (A), recorded in an area of 20 × 8 [m], consists of 398 points, while the second (B), recorded in an area of 10 × 10 [m], contains 817 points. Three algorithms implemented in the application were used: Douglas-Peucker (D-P), Visvalingam-Whyatt (V-W) and the location instability suppression algorithm (LISA). Each of them was used for both trajectories. Changes in the sensitivity of the algorithms took place through a specific parameter:
• the Douglas-Peucker algorithm uses the Epsilon parameter, which defines the limit distance of a point from a given segment,
• the Visvalingam-Whyatt algorithm uses the Area parameter, which defines the boundary area of a triangle formed from three consecutive trajectory points,
• the location instability suppression algorithm uses the Radius parameter, which specifies the radius of the area around the point.
The Epsilon and Radius parameters were assigned five consecutive values: 1, 0.5, 0.1, 0.05 and 0.01 [m]. The Area parameter was assigned five consecutive values: 1, 0.5, 0.1, 0.05 and 0.01 [m²]. During the research, the shape of the obtained trajectory was assessed (in comparison with the original trajectory of the localized object), as well as the degree of reduction of the number of points. Figure 4 shows the original A and B trajectories before the simplification tests. The results of testing the effectiveness of all three algorithms at different sensitivity values are presented in Table 1 and Table 2, for areas A and B, respectively. For each case, the degree of similarity of the original trajectory to the line obtained as a result of the simplification algorithm was determined. The evaluation is presented on a scale of 0–3, where 0 means no similarity and 3 high similarity.
Fig. 4. Original trajectories obtained from RTLS in area A and B
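Assuming the algorithm sketches from Sect. 3 and `trajectory_a` as the (398, 2) array of recorded coordinates (both names are illustrative), the parameter sweep reported in Tables 1 and 2 can be scripted along these lines:

```python
for eps in (1, 0.5, 0.1, 0.05, 0.01):
    simplified = douglas_peucker(trajectory_a, eps)
    share = 100 * len(simplified) / len(trajectory_a)
    print(f"D-P, Epsilon={eps}: {len(simplified)} points ({share:.2f}% of original)")
```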
Table 1. The results of the algorithms' operation on trajectory A

Sensitivity parameter value | Number of points | % of the original number of points | Trajectory similarity
Douglas-Peucker algorithm (D-P), sensitivity parameter: Epsilon
1    | 13  | 3.26  | 1
0.5  | 23  | 5.77  | 2
0.1  | 103 | 25.87 | 3
0.05 | 147 | 36.93 | 3
0.01 | 285 | 71.60 | 3
Visvalingam-Whyatt algorithm (V-W), sensitivity parameter: Area
1    | 6   | 1.50  | 0
0.5  | 9   | 2.26  | 1
0.1  | 20  | 5.02  | 2
0.05 | 39  | 9.79  | 2
0.01 | 83  | 20.85 | 3
LISA, sensitivity parameter: Radius
1    | 21  | 5.27  | 2
0.5  | 44  | 11.05 | 2
0.1  | 184 | 46.23 | 3
0.05 | 287 | 72.11 | 3
0.01 | 361 | 90.70 | 3

4.1 Discussion of Results
For the Douglas-Peucker algorithm, setting the Epsilon value to 1 or 0.5 results in a visible loss of important trajectory points. Most of the finer features of the path have been
removed, so that the shape is similar to the original but lacks important detail, which is unacceptable. An Epsilon of 0.1 gives a trajectory that sufficiently retains its original shape. The situation is similar for the successive values of Epsilon, 0.05 and 0.01, but in these cases a satisfactory reduction of the number of trajectory points was not obtained. The best value of Epsilon, which allows the shape of the trajectory to be kept while minimizing the number of points, is 0.05. For the Visvalingam-Whyatt algorithm, the values 1 and 0.5 of the Area parameter result in a visible loss of important points of the trajectory. Most of the bends of the path have been removed, as a result of which its shape largely differs from the original one. By using the values 0.1 and 0.05, it is possible to get a path similar to the original one with only minor details missing. The use of Area = 0.01 resulted in a slight increase in the number of points, but allowed a shape to be obtained that combines appropriate shape similarity with a significant reduction in the data needed for its description. 0.01 is the best Area value tested. For the location instability suppression algorithm, the values 1 and 0.5 of the Radius parameter cause a visible, unacceptable loss of shape. Most of the small trajectory fragments have been removed. The use of Radius 0.1 allows a path to be obtained that is very similar
to the initial one; important details have been retained. For Radius 0.05 and 0.01, the shape is still good, but accompanied by an insufficient reduction in the number of points, which shows that the vibrations are not being suppressed properly. Radius 0.1 is the best value for this parameter.

Table 2. The results of the algorithms' operation on trajectory B

Sensitivity parameter value | Number of points | % of the original number of points | Trajectory similarity
Douglas-Peucker algorithm (D-P), sensitivity parameter: Epsilon
1    | 7   | 0.85  | 0
0.5  | 19  | 2.32  | 1
0.1  | 111 | 13.58 | 2
0.05 | 191 | 23.37 | 3
0.01 | 457 | 55.93 | 3
Visvalingam-Whyatt algorithm (V-W), sensitivity parameter: Area
1    | 8   | 0.97  | 1
0.5  | 10  | 1.22  | 1
0.1  | 31  | 3.79  | 2
0.05 | 56  | 6.85  | 2
0.01 | 113 | 13.83 | 3
LISA, sensitivity parameter: Radius
1    | 21  | 2.57  | 2
0.5  | 63  | 7.71  | 3
0.1  | 250 | 30.59 | 3
0.05 | 398 | 48.71 | 3
0.01 | 529 | 64.74 | 3
5 Summary
As a result of the conducted work, software for RTLS tests was obtained, and algorithms enabling the reduction of the amount of data and the simplification and filtering of the trajectory were examined. The values of the algorithm parameters that allow a satisfactory quality of operation to be obtained have been determined. All the tested algorithms can give good trajectory improvement results; however, the degree of reduction of the number of points varies. In this respect, the V-W algorithm is the best, followed by LISA and finally D-P, but the differences are not significant. However, the analyzed algorithms do not take into account the context and specificity of the area in which the determination of the location takes place, such as the existence of places in which the localized object cannot be located, or the situation when
coordinates are generated for which it would be physically impossible for the object to have moved there in such a short time, even at the maximum possible speed. Therefore, there is a need for further research into methods taking these factors into account, which will allow better and more reliable location determination results to be obtained.
Possibilities of Decision Support in Organizing Production Processes

Małgorzata Olender-Skóra(B) and Aleksander Gwiazda

Department of Engineering Processes Automation and Integrated Manufacturing Systems, Faculty of Mechanical Engineering, Silesian University of Technology, Konarskiego 18A, 44-100 Gliwice, Poland
{malgorzata.olender-skora,aleksander.gwiazda}@polsl.pl
Abstract. Market dynamics, short product lead times, and frequently customized products require many analyses and actions in a short time. The problem that arises is deciding what should be done and how. One possibility is to use selected Industry 4.0 technologies to perform the relevant analyses. This is also important because of the need to change constantly, to become more competitive in the market, and to look for opportunities for improvement without stopping the ongoing production processes. The article therefore focuses on the applicability of one of these technologies, namely simulation, and on the type of results obtained from the analyses performed. An algorithm for creating simulations that supports the implementation of this technology is also described.

Keywords: Industry 4.0 · simulation · production planning · management
1 Introduction

With the rapid changes taking place in the market, manufacturers need to be flexible to provide customers with the right products or services within the timeframe they expect. Many tasks require a large workforce, the right materials, and adequate production resources. Manufacturers are therefore looking for solutions that support them in implementing innovative solutions, but also help them assess the technological progress of the company. The increase or reduction of the complexity of an enterprise operating in a market under different conditions is determined by many factors. The most common external factors conditioning its dynamics are, for example [1, 9]: globalization, the dynamics of technological change, fluctuations in the currency market, the degree of competition, cultural differences, legal regulations (also regarding export and import), industry standards, pressure to introduce new products, the number of customers, the differentiation of customer needs, long delivery times, supply uncertainty, and minimum order lots. In production planning and scheduling, data is the basis, and this basis is also important when data is used in programs of various types, e.g. simulations. In an era of product customization, the need for flexibility in production is increasing, and this increases the
amount of data to be analyzed, particularly as different types of data are read from machines and robots through sensors, often in real time. However, so-called mass customization, the ability to provide individually designed products or services to every customer, is still limited by [14]: relatively low verification of the commercial market, differences in surface smoothness across prints from the same PC file, the unpredictability of production, and a lack of formalized guidelines for most 3D printing processes. With new technologies, which fall within the field of Industry 4.0, manufacturers can handle the challenges of changing production and adapt to current market trends. These technologies make it possible to analyse changes in production in advance by running simulations and various types of analysis, without stopping real production. For this purpose, it is important that data is extracted from real-world processes so that it is up to date and best represents the processes in the simulation. Furthermore, due to the dynamics of the market and the need for change in the short term, appropriate data collection and analysis are also very important. Therefore, elements of Industry 4.0 such as Big Data, the Cloud, and the Internet of Things also matter. The use of these elements and the appropriate analysis of data and simulation results lead to quicker decision-making and process changes, and consequently enable more flexible production [15, 16]. The article describes the definition of Industry 4.0, identifies its technologies, and then briefly characterizes them. The following chapters focus on the application of one of these technologies, namely simulation. The applicability of simulation and the type of results obtained from the analyses carried out are also described, as is an algorithm for the creation of simulations that supports the implementation of this technology in practice.
2 Industry 4.0

The changes required by customers, and the speed of these changes, are important in terms of operational management, but also from the point of view of management strategy, in order to adapt to new challenges. Many authors have characterized Industry 4.0 in their own ways. According to [2], Industry 4.0 is characterized by the use of intelligent products and processes, enabling autonomous data collection and analysis, and interaction between products, processes, suppliers, and customers over the Internet. The authors of [8] define Industry 4.0 as being characterized by cyber-physical systems that allow the merging of the real and virtual worlds in real time. The authors of [3] state that Industry 4.0 connects Information and Communication Technologies (ICT) and production, allowing process and product data to merge with machine data and enabling communication between machines. Each of these definitions has common elements and, as the authors of [10] argue, standardized processes, the elimination of waste, and a constant focus on customer value are fundamental to the introduction of Industry 4.0. Industry 4.0 includes technologies that help an enterprise become competitive and handle the challenges of the market. The role of Cyber-Physical Systems is to monitor physical processes throughout the supply chain through real-time communication and the Internet of Things (IoT) [13, 17].
Also, Table 1 lists six business benefits of Industry 4.0 [7].

Table 1. Six business benefits in Industry 4.0 [7].

Benefits | Interpretation
Efficiency | Fewer people and more automation make the decision-making process faster and maintain high efficiency. Automation also tends to maintain high quality and keep manual production problems low
Agility | With a focus on high standardization and small series, Industry 4.0 brings high flexibility to the production process
Innovation | Industry 4.0 production lines are built to suit a large mix and low volume; they are ideal for launching new products and experimenting with design
Costs reduction | After the initial investment in the transformation, the costs will fall. Fewer quality problems lead to less material waste and lower operating and personnel costs
Revenues | With lower costs, better quality and the ability to serve customers well, Industry 4.0 puts manufacturers on the road to being the preferred supplier for customers. Among other things, it offers ways to offer customized products, to serve larger markets, and higher-margin services to all customers
Customer experience | The depth of information on customer requirements and existing problems, and the ability to respond promptly, can provide customers with the right products and services, sometimes in real time
The fundamental changes due to the Industry 4.0 concept are based on technological developments such as simulation, Big Data and analytics, robotics, cybersecurity, 3D printing, the IoT, and others [5, 11].

2.1 Technologies of Industry 4.0

The Industry 4.0 concept thus concerns a certain group of technologies. Each technology is briefly described below [4, 11]:

1. Horizontal and vertical system integration - develops organization-wide protocols for data sharing, creating the basis for an automated supply and value chain.
2. Virtual Reality - creates a cooperative interactive environment resembling the real world within a virtual computer-based framework. Augmented Reality provides an environment in which users have the perception of being in a physical world that is not the real scene at the time.
3. The Cloud - data collected is analyzed and stored in the cloud to be processed and presented to the user.
4. Autonomous robots - programmed robots that perform repetitive tasks with a high level of quality, promoting productivity and process automation.
5. Internet of Things - connects different physical objects, such as machines or sensors, through the internet; these objects can communicate in a virtual world.
6. Big data and analytics - collecting, storing, and sharing an enormous volume of data, which requires a set of advanced analysis techniques to support decision-making.
7. Additive manufacturing - a kind of production process that allows personalization, flexibility, and fast prototyping of many products from different materials.
8. Simulation - the process of making a model of a real or hypothetical system to describe and analyze the behavior of that system.
9. Cyber security - tools and technologies to protect the integrity of computers and data against damage and attacks.

Each of these technologies is very important for realizing customized production and becoming a competitive enterprise today. They do not all have to be applied at the same time, but the more technologies are applied, the sooner enterprises can adapt flexibly to market needs. The next section describes the applicability of simulation with an example built in the FlexSim simulation software.
3 Methodology

There are many definitions of simulation in the literature. One of them, proposed in [6], defines a simulation as a representation of real production processes executed in simulation software. The simulation must contain actual data so that the results obtained can be analyzed accordingly. When running the model in a simulation, some important points must be considered, namely [6]: developing a computer model illustrating the current state of the production system, formulating assumptions for change scenarios, and developing a concept for evaluating the change scenarios. An example of the simulation is shown in Fig. 1. During the step of preparing the models for further work, the most important task was declaring specific movable elements as rigid bodies. The corresponding joints and constraints could then be assigned to the elements declared this way. The individual movement axes of the model are shown in Fig. 1. In the case of the robots, all joints apart from the base restraint are hinge joints. This is only an exemplary robot available in our laboratory; an additional benefit of such a solution is the possibility of comparison with real objects. The last axle has no gripper, so its movement would not be visible; this is the sole reason why this segment has been omitted. The last segment has been fixed. However, in the case of the milling machine, sliding joints were used for axis control. There is an encoder on each axis of rotation of the robots and a position sensor on each axis of the milling tool. The model shown in Fig. 1 is a representative example of a part of the production line. Correct projection of that fragment allows the entire factory to be controlled after specific settings in the Siemens NX PLM program. In the model, two capacitive displacement sensors are placed at the beginning and the end of the conveyor. A specific signal on the first sensor initiates conveyor movement, while a signal on the last one stops the conveyor. The stopping signal is generated when the box object reaches the area marked with the arrow. In the case of the robots, readings
Fig. 1. Joints and Constraints with sensors in PLM Siemens NX.
of real values are sourced via the encoders. In the milling machine, position is also read using the capacitive displacement sensors. In the program, the sensors are connected, by means of position and velocity sensors, to specific signals that are sent to the OPC UA server, and their values are displayed as real. The information flow between the server, the robots, and the milling machine is characterized in Fig. 2. The sensor added to begin and end the movement is not included in the scheme, as its operation is only temporary. It is worth noting that the position sensor (according to the illustration) refers to both the linear and the angular positions (encoders).
Fig. 2. Information flow diagram in PLM Siemens NX.
The virtual twin model prepared in this way allows for simulations related to the control of bottlenecks in the production system and possible collisions related to the
cooperation of the robots and the transport system. At present, the NX system is limited, in the scope of modeling sensors, to speed and position control sensors. In the real system, we also deal with monitoring the electricity consumption of the individual components of a robotic cell, the consumption of working media such as compressed air (used to control grippers), vibro-diagnostics of the drive units of both the transport system and the main robotic drives, as well as the pressures of the compressed-air supply and the vacuum in the gripper systems. All values are delivered to the OPC UA server by mapping technology. The limitation is thus at the NX system level, which allows only visualizing the above measurements, not connecting them with parts of the VR model. To fill this niche, work was started on the development of detailed models enabling the determination of basic KPIs (key production indicators) and the estimation of unit production costs; this method will be the subject of a separate article. The general idea, also used in this work, is based on an agent system responsible for processing the data from the OPC UA server, calculating the KPIs, and writing the values back to OPC UA. This approach offloads the PLC processor and separates the KPI calculation system from the PLC program code. All of this is in line with the Y-path methodology. The Y-path is a new standard for extracting information from a sensor, which allows this information to be used in processes other than the controlling process. Because the division takes place at the splitter level, this solution lowers the CPU load. Data from the PLC is divided into two paths: process data and data needed for the Industry 4.0 analyses. From the safety point of view, these two paths are separated, and independent IP addresses can be established. Simulated elements in the virtual space were linked with the real controllers; the scheme of connecting the elements is presented in Fig. 3. Setting values of velocity or position were sent from the PC 2100 industrial computer (hardware), on which the data exchange server was hosted, to the computer connected via an Ethernet cable. The signals were received by the model as settings, and the given object pursued the set position with the set velocity. The temporary parameter state, however, was sent back to the industrial computer the same way but using different signals. Given the amount of data and components involved in executing the model in the simulation, Fig. 4 shows the algorithm to be followed to create such a simulation model. In the first step, the data mentioned above are required; if these data are not available, they are collected and then analyzed. Next, a model is built in the simulation software. In the next step, various process execution scenarios are created, depending on the needs of the enterprise, and the results of the analyses are generated. If there is no better solution than the chosen one for a specific data set, this solution is implemented. However, if a better solution is found, the decision-maker returns to the step of preparing the simulation.
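A minimal sketch of such a KPI agent, assuming the open-source python-opcua client library and hypothetical node identifiers, might look as follows; it illustrates the Y-path idea of computing KPIs outside the PLC and writing them back to OPC UA, and is not the code used in the laboratory.

```python
import time
from opcua import Client   # python-opcua client library

# Hypothetical endpoint and node ids of the OPC UA data exchange server
SERVER_URL = "opc.tcp://192.168.0.10:4840"
BUSY_TIME_NODE = "ns=2;s=Robot1.BusyTime"     # assumed node names
TOTAL_TIME_NODE = "ns=2;s=Robot1.TotalTime"
KPI_NODE = "ns=2;s=Robot1.Utilization"

client = Client(SERVER_URL)
client.connect()
try:
    busy_node = client.get_node(BUSY_TIME_NODE)
    total_node = client.get_node(TOTAL_TIME_NODE)
    kpi_node = client.get_node(KPI_NODE)
    while True:
        # Read process data published on the Industry 4.0 path
        busy = busy_node.get_value()
        total = total_node.get_value()
        # Compute the KPI outside the PLC, offloading its processor
        utilization = busy / total if total else 0.0
        # Write the result back to the OPC UA server
        kpi_node.set_value(utilization)
        time.sleep(1.0)   # refresh once per second
finally:
    client.disconnect()
```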
Fig. 3. Connection diagram.
Preparing a model of the production process and running a simulation is only the first step; it is equally important to analyze the model and the results obtained. To this end, it is important to obtain the appropriate type of results, so that process improvements or optimizations can be suggested. As an example, results were obtained for a process in which three "Separators" are used to sort the individual manufactured components into well-made and badly made ones (Fig. 5). In scenario 1, Separator 1 is working 11.41% of the time and is idle or blocked for the rest of the time, being idle most of the time. Separator 2 is working 6.53% of the time and Separator 3 is working 6.39% of the time; both are likewise idle or blocked for the rest of the time, and mostly idle. Results were then generated for the Separators under scenario 2. In this case, Separator 1 is working 62.59% of the time and is idle or blocked for the rest, working most of the time, so scenario 2 achieves the better solution for Separator 1. Separator 2 is still working 6.53% of the time and Separator 3 is still working 6.39% of the time, both mostly idle.
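State percentages of this kind can be derived from an exported simulation state log; the following sketch assumes a hypothetical CSV export with one row per state interval, which is not the actual FlexSim output format.

```python
import pandas as pd

# Hypothetical export of the simulation state log: one row per interval
# columns: machine, state ("working" / "idle" / "blocked"), duration (s)
log = pd.read_csv("separator_states.csv")

# Share of total time spent in each state, per machine
totals = log.groupby("machine")["duration"].transform("sum")
log["share"] = 100 * log["duration"] / totals
summary = (log.groupby(["machine", "state"])["share"]
              .sum()
              .unstack(fill_value=0.0)
              .round(2))
print(summary)   # e.g. Separator1: working 11.41, idle/blocked the rest
```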
Fig. 4. Algorithm for the execution of the simulation in the software [12].
Fig. 5. Example of simulation with “Separators”.
4 Conclusion

The opportunities presented by the development of technology are also noticeable in enterprises. Manufacturers are looking for new technologies and innovative solutions. This is important from a manufacturing point of view, as manufacturers need to invest in new solutions to stay competitive in the market.
Technologies attributed to the concept of Industry 4.0, including simulation and the relevant databases, support enterprises in these changes. In particular, the fact that analyses are executed in a virtual world eliminates the need to test solutions in real time. This allows the correct analyses and optimizations to be made before implementing the best solution, saving a great deal of time as well as money. Based on the above figures, it was demonstrated that the program works correctly. Communication between the OPC server and the program was successful: setting values were sent to the simulation, while information from the model was sent back to the OPC server. A control panel for all machines in the simulation was also successfully designed. Worth emphasizing is the possibility of implementing the created program in another robotic environment; each machine can be controlled simply by connecting its signals to the signals from the OPC server. The article describes the definitions of Industry 4.0 and briefly characterizes the technologies attributed to this concept. It then focuses on simulation, describing the algorithm to be applied for executing a simulation, and presents the possible results obtained during simulation runs and experiments. In addition, the possibility of using VR technology in combination with simulation was pointed out, so that more analyses of various types can be made without stopping real production processes. These solutions allow the elimination of waste in the enterprise, make it more flexible, and support the adaptation of new employees to a new workplace (layout of machines and equipment) and to work at a new workstation.
References
1. Bozarth, C.C., Warsing, D.P., Flynn, B.B., Flynn, E.J.: The impact of supply chain complexity on manufacturing plant performance. J. Oper. Manag. 27, 80–82 (2009)
2. Buer, S.V., Strandhagen, J.O., Chan, F.T.S.: The link between industry 4.0 and lean manufacturing: mapping current research and establishing a research agenda. Int. J. Prod. Res. 56(8), 2924–2940 (2018)
3. Corallo, A., Lazoi, M., Lezzi, M.: Cybersecurity in the context of industry 4.0: a structured classification of critical assets and business impacts. Comput. Ind. 114, 103165 (2020)
4. Ghadge, A., Er Kara, M., Moradlou, H., Goswami, M.: The impact of Industry 4.0 implementation on supply chains. J. Manuf. Technol. Manage. 31(4), 669–686 (2020)
5. Hopkins, J.L.: An investigation into emerging industry 4.0 technologies as drivers of supply chain innovation in Australia. Comput. Ind. 125, 103323 (2021)
6. Jurczyk-Bunkowska, M.: Tactical manufacturing capacity planning based on discrete event simulation and throughput accounting: a case study of medium sized production enterprise. Adv. Prod. Eng. Manage. 16(3), 335–347 (2021)
7. Lachvajderová, L., Kádárová, J.: Industry 4.0 implementation and industry 5.0 readiness in industrial enterprises. Manage. Prod. Eng. Rev. 13(3), 102–109 (2022)
8. Lasi, H., Fettke, P., Kemper, H.G., Feld, T., Homann, M.: Industry 4.0. Bus. Inform. Syst. Eng. 6, 239–242 (2014)
9. Lewandowska-Ciszek, A.: Identifying the phenomenon of complexity in the sector of industrial automation. Manage. Prod. Eng. Rev. 13(2), 3–14 (2022)
10. Mayr, A., et al.: Lean 4.0 - a conceptual conjunction of lean management and industry 4.0. In: 51st CIRP Conference on Manufacturing Systems, pp. 622–628. Procedia CIRP (2018)
11. Olender-Skóra, M., Banaś, W., Gwiazda, A.: Possibilities of industrial utilization of FFF/FDM process for chosen element printing. Int. J. Mod. Manuf. Technol. 9(2), 1–6 (2017)
12. Olender-Skóra, M., Banaś, W.: Application of a digital twin for manufacturing process simulation. In: 15th Global Congress on Manufacturing and Management (GCMM-2020), pp. 1–6, Elsevier Ltd (2019)
13. Pacholski, L.: Managerial recommendations concerning the cybersecurity of information and knowledge resources in production enterprises implementing the industry 4.0 concept. Manage. Prod. Eng. Rev. 13(3), 30–38 (2022)
14. Rojek, I., Mikołajewski, D., Kotlarz, P., Macko, M., Kopowski, J.: Intelligent system supporting technological process planning for machining and 3D printing. Bull. Polish Acad. Sci. Tech. Sci. 69(2), 1–8 (2021)
15. Hryniewicz, P., Banas, W., Foit, K., et al.: Modelling cooperation of industrial robots as multi-agent systems. IOP Conf. Ser. Mater. Sci. Eng. 227, 1–7 (2017)
16. Krenczyk, D., Skołud, B., Olender, M.: The production route selection algorithm in virtual manufacturing networks. IOP Conf. Ser. Mater. Sci. Eng. 227, 1–9 (2017)
17. Tonelli, F., Demartini, M., Pacella, M., Lala, R.: Cyber-physical systems (CPS) in supply chain management: from foundations to practical implementation. In: Procedia CIRP, pp. 598–603 (2021)
Special Session 4: Efficiency and Explainability in Machine Learning and Soft Computing
Efficient Short-Term Time Series Forecasting with Regression Trees

Pablo Reina-Jiménez(B), Manuel Carranza-García, Jose María Luna-Romera, and José C. Riquelme

Department of Computer Languages and Systems, University of Seville, Av. Reina Mercedes s/n, Seville 41012, Spain
[email protected]
Abstract. Time series forecasting is crucial in various domains, including finance, meteorology, economics, and energy management. Regression trees and deep learning models are among the techniques developed to tackle these challenges. This paper presents a comparative analysis of these approaches in terms of efficacy and efficiency, using real-world datasets. Our experimental results indicate that regression trees can provide comparable performance to deep neural networks with significantly lower computational demands for training and hyper-parameter selection. This finding underscores the potential of regression trees as a more sustainable and energy-efficient approach to time series forecasting, aligning with the ongoing efforts in Green AI research. Specifically, our study reveals that regression trees are a promising alternative for short-term forecasting in scenarios where computational efficiency and energy consumption are critical considerations, without the need for costly GPU computation.
Keywords: green AI · time series forecasting · deep learning · regression trees

1 Introduction
Time series forecasting is the task of predicting future values of a time series based on its past values. Accurate time series forecasting can help organizations make better decisions, such as predicting stock prices or managing the distribution of electricity to different areas. There are several approaches to time series forecasting, including classical methods like ARIMA, tree-based algorithms like M5P, and deep learning models like LSTM. Regression trees have been widely used in time series forecasting due to their simplicity and interpretability. In particular, in this paper we focus on the M5P model [1], a regression tree that has regression lines at its leaves, in contrast to the majority of regression trees, which hold fixed values. This aids in modeling more effectively the complex relationships present within the time series under consideration. On the other hand, deep learning models,
such as multilayer perceptron (MLP) [2] and long short-term memory (LSTM) networks [3], have shown superior performance in many domains, including natural language processing, computer vision, and time series forecasting. Moreover, in recent years, transformer models based on attention mechanisms have achieved impressive results in time series prediction [4]. Furthermore, green AI is a growing field that focuses on developing machine learning algorithms and models that are energy-efficient and environmentally sustainable. Models such as regression or decision trees could potentially reduce the energy consumption of machine learning systems [5]. Due to the rapid increase in energy consumption of machine learning systems, research and development in the field of green AI have grown significantly in recent years [6,7]. These algorithms are also being applied to problems typically studied with a deep learning approach, such as natural language processing and time series forecasting [8,9]. In this work, we compare the performance of M5P trees with deep learning models for the time series forecasting task. Our goal is to analyze and discuss the specific situations where the decision tree performs similarly to deep learning, as well as the benefits of using decision trees in terms of speed and energy consumption. This aspect is particularly relevant for time-sensitive projects or resource-constrained environments, where reduced processing time is essential. Our experimental study shows that decision trees can be a viable alternative to deep learning in certain scenarios, highlighting the potential of simpler, more efficient AI approaches that can be used to reduce the environmental impact of AI systems.
2 Materials and Method

2.1 Dataset
For this study we selected three datasets, two of them related to the electric power distribution problem and the third related to electricity demand in Spain. In the first two datasets, the goal is to predict the oil temperature of two Chinese electrical transformers, which can be useful in avoiding transformer faults and has been extensively studied [10, 11]. These datasets consist of hourly measurements of various characteristics of the electrical transformer over two years, including its oil temperature (OT). To obtain a univariate series for prediction purposes, we removed all other variables and retained only the OT variable. The third dataset [12] provides information on the electricity demand in Spain over a period of one year, with a frequency of 10 min. For all three datasets, we utilized a division strategy in which the initial 80% of the dataset was used for training and the remaining 20% was allocated for testing. Table 1 presents relevant details about the time series discussed, including the duration and frequency of each series. Additionally, Figs. 1, 2, and 3 display the complete series.
Table 1. Time Series Datasets

Dataset | Train Instances | Test Instances | Frequency
ETTh1 | 13177 | 4243 | 1 h
ETTh2 | 13177 | 4243 | 1 h
ElectricityDemand | 39313 | 13246 | 10 min
Fig. 1. ETTh1 Time Series Dataset (OT, °C)

Fig. 2. ETTh2 Time Series Dataset (OT, °C)

Fig. 3. ElectricityDemand Time Series Dataset (Demand, MW)
2.2 Experimental Setup
Regression trees are a type of machine learning algorithm that recursively partitions the data into subsets based on the values of predictor variables and then fits a regression model to each subset. In our particular case, we use the M5P model, which fits regression lines to the data. The main advantages of this architecture are its interpretability and computational efficiency. The model can be easily visualized and understood, making it useful for explaining the reasoning behind its predictions. Additionally, it requires fewer computational resources than many other machine learning models, making it more efficient for large datasets or limited computing resources.
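As a hint of how such a model can be trained in Python, the python-weka-wrapper3 library (also used later in this study) exposes Weka's M5P implementation; the dataset file name below is a placeholder, and the sketch is illustrative rather than the exact training script.

```python
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

jvm.start()   # python-weka-wrapper3 runs Weka inside a JVM

# Load a dataset in ARFF format (file name is hypothetical)
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("etth1_window.arff")
data.class_is_last()   # the last attribute is the forecasting target

# M5P regression tree with a minimum of 4 instances per leaf
m5p = Classifier(classname="weka.classifiers.trees.M5P",
                 options=["-M", "4"])
m5p.build_classifier(data)
print(m5p)             # prints the tree with its leaf regression lines

prediction = m5p.classify_instance(data.get_instance(0))
jvm.stop()
```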
On the other hand, deep learning models are the dominant architectures used in prediction problems, including time series forecasting. Due to their ability to capture temporal and more complex dependencies, models such as LSTM and CNN tend to obtain the best results in this field. The M5P model was trained with both pruned and unpruned trees, and we tested different numbers of minimum instances per leaf, ranging from 4 to 36. For the deep learning models, we used the Adam optimizer with a learning rate of 0.01 for 50 epochs, and employed early stopping with a patience of 5. We varied the number of layers and neurons per layer for the three deep learning architectures, resulting in a total of 18 different configurations per architecture. Our choice of configurations was informed by results from previous studies in the literature [13, 14]. Table 2 summarizes the specific configurations used for each architecture, which are the combinations of all selected values.

Table 2. Parameter grid search used in the experimental study

Models | Parameters | Values
M5P | Minimum Instances per Leaf | 4, 8, 12, 16, 20, 24, 28, 32, 36
M5P | Unpruned | True, False
MLP | Hidden Layers | [8, 16, 8], [4, 8, 4], [16, 32, 16], [8, 16, 32, 16, 8], [4, 8, 16, 8, 4], [16, 32, 64, 32, 16], [8, 32, 8], [4, 16, 4], [16, 64, 16]
LSTM | Layers | 1, 2, 3
LSTM | Units | 16, 32, 64
LSTM | Return sequence | True
CNN | Layers | 1, 2, 3
CNN | Filters | 32, 64, 128
CNN | Pool Size | 0
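For illustration, one LSTM configuration from this grid could be built in Keras roughly as follows; the sketch matches the parameters above (Adam, learning rate 0.01, early stopping with patience 5), while the loss function and the output layout are our assumptions.

```python
import tensorflow as tf

def build_lstm(past_history, horizon, layers=2, units=32):
    """One LSTM configuration from the grid (MIMO output)."""
    model = tf.keras.Sequential()
    # All LSTM layers return sequences, as in the grid above
    model.add(tf.keras.layers.LSTM(units, return_sequences=True,
                                   input_shape=(past_history, 1)))
    for _ in range(layers - 1):
        model.add(tf.keras.layers.LSTM(units, return_sequences=True))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(horizon))  # predict the whole horizon
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="mae")                  # loss choice is an assumption
    return model

early_stop = tf.keras.callbacks.EarlyStopping(patience=5,
                                              restore_best_weights=True)
model = build_lstm(past_history=12, horizon=6)
# model.fit(x_train, y_train, epochs=50, validation_split=0.1,
#           callbacks=[early_stop])
```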
2.3 Evaluation Procedure
For the evaluation process, we divided each dataset into training and testing sets. The testing set was taken as the last part of the original dataset, while the remaining 80% was used for training. Once the original series is split, we apply min-max normalization to scale each instance between 0 and 1. We then transform the time series into instances that can be fed to the models. For the deep learning models, we adopted the Multi-Input Multi-Output (MIMO) strategy, where all the instances of the past history (PH) are fed to the model and the entire forecasting horizon (FH) is predicted. Since the M5P algorithm only produces single-output results, this strategy cannot be applied; for the M5P models we use a recursive strategy instead. During training, we predicted only the first instance of the forecasting horizon, while during the prediction process, we performed single-step predictions and fed each result back as the last input for the next prediction. While there are some tools available for building a multi-output tree-based model, for the purposes of this study we
elected not to utilize them in order to establish a baseline for experimentation, with plans to employ these strategies in future work. Regardless of the strategy used, we used a moving-window generator to transform the datasets into instances that can be fed to the supervised models. For the generation of the time series dataset, we used forecasting horizons of 6 and 24, which means we tried to predict the next 6 and 24 h, respectively, for the first two datasets, and the next 1 and 4 h for the last dataset. For the past history, we used a factor of 2 and 3, resulting in a past history of 12 and 18 for a forecasting horizon of 6, and a past history of 48 and 72 for a forecasting horizon of 24. Regarding the metrics used to evaluate the models' error and efficiency, we use the weighted absolute percentage error (WAPE) on unnormalized data. Since the efficiency of the models is essential in time series forecasting, we also measure the average training and inference time of each architecture.
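The moving-window generation, the recursive single-step strategy, and the WAPE metric, WAPE = 100 · Σ|y_t − ŷ_t| / Σ|y_t|, can be sketched as follows; the predictor is assumed to be any single-output model with a scikit-learn-style predict method, and the code is our reading of the procedure rather than the authors' script.

```python
import numpy as np

def moving_window(series, past_history, horizon):
    """Transform a 1-D series into supervised (X, y) instances."""
    X, y = [], []
    for i in range(len(series) - past_history - horizon + 1):
        X.append(series[i:i + past_history])
        y.append(series[i + past_history:i + past_history + horizon])
    return np.array(X), np.array(y)

def recursive_forecast(model, window, horizon):
    """Single-output model: feed each prediction back as the newest input."""
    window = list(window)
    preds = []
    for _ in range(horizon):
        y_hat = float(model.predict(np.array(window)[None, :])[0])
        preds.append(y_hat)
        window = window[1:] + [y_hat]   # slide the window forward
    return np.array(preds)

def wape(y_true, y_pred):
    """Weighted absolute percentage error on unnormalized data."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()

# Example: FH = 6 with a past history factor of 2 gives PH = 12
X, y = moving_window(np.sin(np.arange(200) / 10.0),
                     past_history=12, horizon=6)
```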
3 Results
In this section, we compare the results of the models in terms of performance and computational time. For the experiments, we used a machine with an Intel i7-7700K CPU and two NVIDIA 1080 8 GB GPUs. In our analysis, Tables 3 and 4 compare the performance of the M5P regression tree and the deep learning models in terms of error and computing time, using the average time for training and test, for two different past history factors. Interestingly, our results demonstrate that the model with the larger past history factor consistently achieves better performance. This finding suggests that incorporating a more comprehensive view of the past can significantly improve the predictive capacity of the model. For instance, the M5P model with FH = 24 obtains an improvement of 1% in the Mean WAPE metric. However, it is important to note that while increasing the past history factor may lead to enhanced performance, it can also increase the computational time and complexity of the models. Therefore, practitioners should carefully consider the trade-offs between improved performance and increased computation time when selecting the appropriate past history factor for their time series forecasting tasks. Ultimately, the optimal choice will depend on the specific requirements and constraints of the project at hand, as well as the overall objectives of the analysis. In terms of error, both the M5P regression tree and the deep learning models display similar error rates, as evidenced by their comparable WAPE values. However, the M5P model excels in terms of computational efficiency, despite being trained solely on a CPU: it consistently outperforms the deep learning models, which require a GPU, in terms of training time. Even though the prediction time of the M5P model is slower due to its recursive strategy, the time required is minimal, only a few milliseconds.
Table 3. Error and computing time comparison using a PH factor of 2

Dataset | FH | Model | Best WAPE | Mean WAPE | Std WAPE | Train Time (s) | Test Time (ms)
ETTh1 | 6 | M5P | 12.447 | 12.697 | 0.276 | 1.640 | 0.939
ETTh1 | 6 | MLP | 12.070 | 12.699 | 0.799 | 15.688 | 8.519e-4
ETTh1 | 6 | LSTM | 12.022 | 13.109 | 1.496 | 29.776 | 2.579e-3
ETTh1 | 6 | CNN | 11.998 | 12.381 | 0.2626 | 16.381 | 2.248e-3
ETTh1 | 24 | M5P | 19.036 | 19.868 | 0.890 | 3.421 | 4.102
ETTh1 | 24 | MLP | 19.037 | 19.764 | 0.578 | 12.528 | 8.869e-4
ETTh1 | 24 | LSTM | 18.692 | 19.402 | 0.717 | 39.231 | 5.872e-3
ETTh1 | 24 | CNN | 19.464 | 20.188 | 0.475 | 13.769 | 5.105e-3
ETTh2 | 6 | M5P | 9.106 | 9.865 | 0.337 | 1.566 | 0.954
ETTh2 | 6 | MLP | 8.093 | 9.412 | 1.055 | 17.684 | 8.282e-4
ETTh2 | 6 | LSTM | 7.380 | 8.749 | 1.551 | 31.037 | 2.579e-3
ETTh2 | 6 | CNN | 7.377 | 8.077 | 0.657 | 18.254 | 2.579e-3
ETTh2 | 24 | M5P | 12.232 | 13.924 | 1.400 | 2.912 | 4.094
ETTh2 | 24 | MLP | 11.855 | 13.358 | 1.605 | 21.938 | 9.108e-4
ETTh2 | 24 | LSTM | 11.661 | 12.265 | 0.611 | 48.961 | 5.848e-3
ETTh2 | 24 | CNN | 11.795 | 12.239 | 0.254 | 20.942 | 5.153e-3
ElectricityDemand | 6 | M5P | 1.390 | 1.504 | 0.113 | 4.572 | 0.955
ElectricityDemand | 6 | MLP | 1.481 | 1.672 | 0.220 | 20.545 | 2.808e-4
ElectricityDemand | 6 | LSTM | 1.432 | 1.766 | 0.420 | 43.729 | 1.442e-3
ElectricityDemand | 6 | CNN | 1.375 | 1.477 | 0.111 | 27.360 | 1.571e-3
ElectricityDemand | 24 | M5P | 4.546 | 5.017 | 0.433 | 8.649 | 4.049
ElectricityDemand | 24 | MLP | 5.507 | 6.013 | 0.358 | 24.889 | 3.416e-4
ElectricityDemand | 24 | LSTM | 3.261 | 4.373 | 0.635 | 84.894 | 5.055e-3
ElectricityDemand | 24 | CNN | 3.846 | 4.535 | 0.451 | 32.313 | 2.535e-3
In conjunction with the tables presented earlier, Fig. 4 displays the actual values of the series for a one-week duration, juxtaposed with the optimal configuration of the M5P model and the most effective deep learning model. To determine the value for each instance, we calculated the mean of all forecasting horizons predicting that particular moment and plotted the resulting value. Several conclusions can be drawn from the experiments conducted on the ETTh1 dataset. When using an FH of 6 and PH of 18, the M5P model outperforms the CNN and LSTM models in terms of training time, with speedups of 9.91 and 17.33 times, respectively. Regarding the Mean WAPE metric, M5P performs 2.7% worse than CNN and 2.7% better than LSTM. When the FH is increased to 24, the M5P model remains significantly faster, with a training time of only 5 s versus 57 s for LSTM. However, the Mean WAPE and Best WAPE show a slight increase of 2.7% and 1.6%, respectively, compared to LSTM, which is the best-performing model. When compared to both CNN and MLP in terms of the Best WAPE metric, the M5P model shows a small increase of less than 1%. On the ETTh2 dataset with an FH of 6 and PH of 18, the M5P model demonstrates an impressive training time of only 1.566 s, significantly faster than both CNN (18.254 s) and LSTM (31.037 s). However, in terms of error metrics,
Table 4. Error and computing time comparison using a PH factor of 3

Dataset | FH | Model | Best WAPE | Mean WAPE | Std WAPE | Train Time (s) | Test Time (ms)
ETTh1 | 6 | M5P | 11.982 | 12.352 | 0.387 | 1.850 | 0.957
ETTh1 | 6 | MLP | 11.904 | 12.719 | 0.834 | 17.412 | 8.057e-4
ETTh1 | 6 | LSTM | 11.696 | 12.695 | 1.295 | 32.064 | 3.009e-3
ETTh1 | 6 | CNN | 11.718 | 12.020 | 0.257 | 18.345 | 4.075e-3
ETTh1 | 24 | M5P | 18.775 | 19.654 | 0.968 | 5.513 | 4.193
ETTh1 | 24 | MLP | 18.764 | 19.503 | 0.579 | 14.430 | 8.920e-4
ETTh1 | 24 | LSTM | 18.462 | 19.126 | 0.548 | 57.356 | 8.148e-3
ETTh1 | 24 | CNN | 18.700 | 19.525 | 0.475 | 15.937 | 4.170e-3
ETTh2 | 6 | M5P | 7.111 | 7.568 | 0.373 | 1.712 | 0.962
ETTh2 | 6 | MLP | 6.916 | 8.018 | 1.120 | 23.485 | 8.531e-4
ETTh2 | 6 | LSTM | 6.376 | 7.480 | 1.432 | 40.064 | 3.009e-3
ETTh2 | 6 | CNN | 6.488 | 6.843 | 0.316 | 23.244 | 4.075e-3
ETTh2 | 24 | M5P | 12.264 | 13.836 | 1.424 | 3.922 | 4.281
ETTh2 | 24 | MLP | 11.863 | 13.163 | 1.425 | 19.096 | 9.161e-4
ETTh2 | 24 | LSTM | 11.614 | 12.221 | 0.548 | 58.780 | 8.196e-3
ETTh2 | 24 | CNN | 11.853 | 12.233 | 0.240 | 19.888 | 4.170e-3
ElectricityDemand | 6 | M5P | 1.391 | 1.489 | 0.097 | 5.131 | 0.968
ElectricityDemand | 6 | MLP | 1.512 | 1.825 | 0.345 | 25.359 | 2.874e-4
ElectricityDemand | 6 | LSTM | 1.425 | 1.663 | 0.303 | 54.828 | 1.996e-3
ElectricityDemand | 6 | CNN | 1.345 | 1.452 | 0.121 | 28.764 | 1.731e-3
ElectricityDemand | 24 | M5P | 4.561 | 4.968 | 0.364 | 13.269 | 4.179
ElectricityDemand | 24 | MLP | 5.538 | 5.936 | 0.395 | 26.430 | 3.346e-4
ElectricityDemand | 24 | LSTM | 2.938 | 3.650 | 0.553 | 91.179 | 7.277e-3
ElectricityDemand | 24 | CNN | 3.205 | 4.346 | 0.688 | 32.118 | 3.687e-3
the Mean WAPE and Best WAPE are worse than those of the deep learning models by 1.116% and 0.884%, and by 1.729% and 1.274%, respectively. Increasing the FH to 24 retains the training time advantage of the M5P but exhibits a higher Mean WAPE (13.924 vs. 12.239 for CNN and 12.265 for LSTM) and Best WAPE (4.06% and 3.71% worse than LSTM and CNN, respectively). In the ElectricityDemand dataset with an FH of 6 and PH of 18, the M5P model demonstrates faster training time compared to CNN and LSTM, at 4.572 s versus 27.360 and 43.729 s. Additionally, it achieves a slightly better Mean and Best WAPE than LSTM, but slightly worse than CNN. When FH is increased to 24, M5P maintains its training time advantage, but its Mean WAPE falls behind both CNN and LSTM. Moreover, its Best WAPE indicates a 36.1% higher error rate than LSTM. The principal advantage of the M5P model lies in its substantially reduced training times, which are consistently shorter than those of the other models across all datasets. In contrast, the test times for the M5P model are slower than those of the other models under comparison. However, the discrepancies in error rates between M5P and the other models remain relatively small, particularly for lower FH values, which emphasizes the competitive performance of the M5P model. In summary, although the M5P model exhibits longer test times, it provides competitive performance at lower forecasting horizons and markedly faster training times in comparison to the other models.
Fig. 4. Week prediction for FH = 6 (panels: ETTh1, ETTh2, and ElectricityDemand, each with FH: 6 and PH: 18, comparing the real values with the forecasts of the best M5P model and the best deep learning model)
Consequently, the M5P model emerges as an attractive choice when training time is a crucial consideration. In the case of the electricity demand dataset, we also experimented with a forecasting horizon of 24 × 6 to predict the subsequent full day. However, the outcomes in terms of training time and error were unsatisfactory. Consequently, given the autoregressive nature of the employed M5P model, we shifted our focus to short-term forecasting rather than pursuing long-term forecasting. The M5P model used in the study was implemented using the Python-Weka-Wrapper3 library, which utilizes the pre-existing M5P algorithm implementation within Weka. While this approach incurs some computational overhead, a more optimized implementation could further improve the computational efficiency of the M5P model in these experiments. Finally, we found that the M5P regression tree consistently produced similar error metrics across the parameter grid, making it easier to identify optimal settings compared to deep learning models. With the M5P model maintaining similar error rates within the grid, researchers can more easily fine-tune parameters and make decisions about model selection. Additionally, the simpler parameterization of M5P models results in a lower overall training time, which offers significant advantages in terms of computational efficiency when using a
grid-based approach to parameter selection. This advantage is particularly relevant since the training time is multiplied by the number of models in the grid.
4 Conclusions

Based on the results presented in this study, we can conclude that M5P regression trees can provide accurate time series forecasting results while requiring significantly less computational time than deep learning models such as CNN and LSTM. Specifically, the M5P model demonstrates a computational time improvement of 382.76% compared to MLP, 1065.08% compared to LSTM, and 417.41% compared to CNN. This is an important finding that supports the concept of green AI, as it highlights the potential of using more computationally efficient methods for time series forecasting tasks. The results also suggest that the choice of modelling approach should depend on the specific requirements and constraints of the application. If interpretability and computational efficiency are the primary concerns, regression trees may be the preferred modelling approach, as they have an advantage in terms of training time and are easier to interpret. On the other hand, if minimizing error is crucial and computational resources are not limited, deep learning models may be a better choice. Furthermore, as the forecasting horizon increases, the error of the trees tends to worsen; therefore, for scenarios where predictions far into the future are desired, deep learning models are a more suitable choice. Additionally, by employing state-of-the-art techniques for measuring CO2 emissions [8], we determined that the theoretical maximum power consumption of the employed GPU was 180 W, while the CPU consumption was 91 W. Therefore, by exclusively using a CPU, we produce roughly half the carbon emissions. In future work, we aim to build a non-autoregressive M5P tree, i.e. a MIMO model based on an M5P architecture. This can be achieved by training as many single-output models as the length of the forecasting horizon, each specialized in forecasting a specific moment of the horizon. This would allow the study of long-term forecasting with regression trees and a better comparison with the deep learning models. Furthermore, it would also be important to test these architectures with more univariate series and with multivariate series, for instance, by using all variables from the ETTh1 and ETTh2 datasets rather than relying solely on one variable.
Based on the results presented in this study, we can conclude that M5P regression trees can provide accurate time series forecasting results while requiring significantly less computational time compared to deep learning models such as CNN and LSTM. Specifically, the M5P model demonstrates a computational time improvement of 382.76% compared to MLP, 1065.08% compared to LSTM, and 417.41% compared to CNN. This is an important finding that supports the concept of green AI, as it highlights the potential of using more computationally efficient methods for time series forecasting tasks. The results also suggest that the choice of modelling approach should depend on the specific requirements and constraints of the application. If interpretability and computational efficiency are the primary concerns, regression trees may be the preferred modelling approach, as they have an advantage in terms of training time and are easier to interpret. On the other hand, if minimizing error is crucial and computational resources are not limited, deep learning models may be a better choice. Furthermore, as the forecasting horizon increases, the error of the trees tends to worsen. Therefore, for scenarios where predictions far into the future are desired, deep learning models are a more suitable choice. Additionally, by employing state-of-the-art techniques for measuring CO2 emissions [8], we determined that the theoretical maximum power consumption of the employed GPU was 180 Wh, while the CPU consumption was 91 Wh. Therefore, by exclusively using a CPU, we are producing half the carbon emissions. In future works, we aim to build a non-autoregressive M5P tree, i.e. build a MIMO model based on a M5P architecture. This can be achieved by training as many different Single-Output models as the length of the forecasting horizon, being each of them specialized in forecasting a specific moment of the forecasting horizon. This would allow the study of long-term forecasting with regression trees and make a better comparison with the deep learning models. Furthermore, it would also be important to test these architectures with more univariate series and try to use multivariate series, for instance, by using all variables from the ETTh1 and ETTh2 datasets rather than relying solely on one variable. Acknowledgments. This research has been funded by Ministerio de Ciencia e Innovaci´ on under the projects: Aprendizaje Profundo y Transferencia de Aprendizaje Eficientes para Salud y Movilidad (PID2020-117954RB-C22), and Soluciones Digitales para Mantenimiento Predictivo de Plantas E´ olicas (TED2021-131311B); and by the Andalusian Regional Government under the project Modelos de Deep Learning para Sistemas de Energ´ıa Renovable: Predicci´ on de Generaci´ on y Mantenimiento Preventivo y Predictivo (PYC20 RE 078 USE).
References
1. Quinlan, J.R., et al.: Learning with continuous classes. In: 5th Australian Joint Conference on Artificial Intelligence, vol. 92, pp. 343–348. World Scientific (1992)
2. Gardner, M.W., Dorling, S.R.: Artificial neural networks (the multilayer perceptron) - a review of applications in the atmospheric sciences. Atmosp. Environ. 32(14-15), 2627–2636 (1998)
3. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
4. Lim, B., Arık, S.Ö., Loeff, N., Pfister, T.: Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 37(4), 1748–1764 (2021)
5. Silva, P., et al.: Approximation workflow for energy-efficient comparators in decision tree applications. In: 2022 IFIP/IEEE 30th International Conference on Very Large Scale Integration (VLSI-SoC), pp. 1–2. IEEE (2022)
6. Mohammed, A., et al.: ANN, M5P-tree and nonlinear regression approaches with statistical evaluations to predict the compressive strength of cement-based mortar modified with fly ash. J. Market. Res. 9(6), 12416–12427 (2020)
7. Kuo, J.C.-C., Madni, A.M.: Green learning: introduction, examples and outlook. J. Vis. Commun. Image Represent. 90, 103685 (2022)
8. Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243 (2019)
9. Gocheva-Ilieva, S.G., et al.: Regression trees modeling of time series for air pollution analysis and forecasting. Neural Comput. Appl. 31, 9023–9039 (2019)
10. Zhou, H., et al.: Informer: beyond efficient transformer for long sequence time-series forecasting. In: The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, vol. 35, pp. 11106–11115. AAAI Press (2021)
11. Jiménez-Navarro, M.J., Martínez-Ballesteros, M., Martínez-Álvarez, F., Asencio-Cortés, G.: A new deep learning architecture with inductive bias balance for transformer oil temperature forecasting. J. Big Data 10(1), 80 (2023)
12. REE: Red Eléctrica de España (2015)
13. Lara-Benítez, P., Carranza-García, M., Riquelme, J.C.: An experimental review on deep learning architectures for time series forecasting. Int. J. Neural Syst. 31(03), 2130001 (2021)
14. Torres, J.F., Hadjout, D., Sebaa, A., Martínez-Álvarez, F., Troncoso, A.: Deep learning for time series forecasting: a survey. Big Data 9(1), 3–21 (2021)
Generating Synthetic Fetal Cardiotocography Data with Conditional Generative Adversarial Networks

Halal Abdulrahman Ahmed1(B), Juan A. Nepomuceno2, Belén Vega-Márquez2, and Isabel A. Nepomuceno-Chamorro2

1 Minerva Lab, University of Seville, Seville, Spain
[email protected], [email protected]
2 Department of Computer Languages and Systems, University of Sevilla, Sevilla, Spain
Abstract. In recent years, the use of machine learning models has become increasingly common, and the availability of large datasets is essential for achieving good predictive model performance. However, acquiring medical datasets can be a challenging and expensive task. To address this issue, generating synthetic data has emerged as a viable alternative. This paper proposes using a Conditional Generative Adversarial Network (CGAN) to generate synthetic data for predicting fetal health diagnosis from a publicly available Fetal Cardiotocography (CTG) dataset. The study also evaluates the efficacy of the Generative Adversarial Network (GAN), specifically the Conditional GAN, on this clinical problem. We analyzed 2126 fetal cardiotocogram samples labeled by medical doctors. We used the CGAN-generated data together with Support Vector Machines (SVM) and Extreme Gradient Boosting (XGBoost) classifiers to compare the performance of classifiers trained on the real and the synthetic dataset. The experimental results revealed that the synthetic dataset behaves similarly to the real data in terms of classifier performance.

Keywords: Synthetic data · Conditional Generative Adversarial Network (CGAN) · Fetal Cardiotocography (CTG) dataset
1 Introduction

Data sharing is a critical aspect of healthcare applications and life sciences research. The biggest challenge, and opportunity, for the life sciences community is learning how to share patient health data safely while at the same time protecting patient privacy. This complicates the process of data collection and presents a major challenge for researchers, as they need access to such personal data to develop algorithms or test their applicability on different datasets. For this reason, many researchers rely on publicly available medical data and generate synthetic data to perform analyses and develop innovative tools and methods. Synthetic data is widely used in the medical, manufacturing, agriculture, and eCommerce sectors [1].
Generative Adversarial Networks (GANs) [2] have achieved massive success in artificial intelligence and deep learning in terms of generating synthetic data from real data. A GAN consists of two neural networks, known as the generator and the discriminator, trained against each other in a zero-sum game framework. Several kinds of GANs have recently been developed, including the original GAN, the Conditional GAN (CGAN), the Deep Convolutional GAN (DCGAN), CycleGAN, Generative Adversarial Text-to-Image Synthesis, StyleGAN, and the Super Resolution GAN (SRGAN) [3]. The aim of this work is to generate synthetic data by applying a GAN to a biomedical problem. In biomedical applications, GANs can be trained to understand the underlying patterns and structures, generating new data that attempt to mimic the original dataset [4]. In this work, the biological problem is a set of fetal cardiotocogram samples, i.e., a dataset describing the morphologic patterns and fetal states that represent the fetal heart rate and uterine contractions during pregnancy for 2126 women. To evaluate the generated data, they are used to train classifiers and to test whether a predictive model achieves similar performance using the synthetic dataset instead of the real dataset for fetal health diagnosis and monitoring purposes. This evaluation helps determine the efficacy of the GAN in producing synthetic data that closely resembles real data and can be used in subsequent applications. This paper is divided into several sections: Sect. 2 provides the methodology used in the study; Sect. 3 details the experiments conducted to evaluate the efficacy of the CGAN in generating a synthetic dataset from the real dataset; finally, Sect. 4 summarizes the findings of the study and provides future research lines.
2 Methodology

2.1 Conditional Generative Adversarial Networks

GANs have recently been utilized to generate synthetic data, especially tabular data, due to their performance and flexibility in representing data, as demonstrated by their effectiveness in generating and manipulating high-quality data, images, and natural language [5]. GANs are trained on a real dataset before being used to generate synthetic data resembling the distribution of the original dataset [6, 7]; the architecture of a GAN is depicted in Fig. 1. In this research, we focus on a selected CGAN model, which provides the knowledge required to produce synthesized data for solving data-related issues. Conditional Generative Adversarial Networks (CGANs) are generative models that combine the concepts of Generative Adversarial Networks (GANs) and conditional modeling. In a CGAN, the generator and discriminator are trained in an adversarial setting to generate realistic samples based on specific input data. Additional parameters are used: both the generator and discriminator networks of the original GAN model are conditioned during training on external information. This information could be a class label or data from other modalities, helping the discriminator classify the input correctly so that it is not easily fooled by the generator [8]. In traditional GANs, the generator network takes random noise as input and generates synthetic samples, while the discriminator network tries to distinguish between real and generated samples. In CGANs, however, additional input information, known as conditioning variables, is provided to both the generator and discriminator networks.
CGANs utilise conditioning variables, such as class labels, text descriptions, or images, to generate samples that satisfy conditions or possess desired characteristics. During training, the generator’s objectives are twofold: to deceive the discriminator into perceiving its generated samples as authentic and to ensure that the samples correspond to the provided conditioning variables. In the meantime, the discriminator’s mission of distinguishing between real and generated samples remains unchanged. This adversarial interaction between the generator and discriminator drives the continual improvement of the quality of generated samples throughout the training procedure [9].
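For concreteness, the conditioning mechanism described above can be sketched in Keras as follows. This is a minimal illustration rather than the exact network used in this work: the noise dimension of 264 follows the tuned value reported in Sect. 3.2, while the hidden-layer widths and activations are assumptions.

```python
# Minimal CGAN sketch: both sub-networks receive the class label as an extra
# input, as described above. Hidden-layer sizes are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

NOISE_DIM, N_FEATURES, N_CLASSES = 264, 23, 3  # noise dimension from Sect. 3.2

def build_generator():
    noise = keras.Input(shape=(NOISE_DIM,))
    label = keras.Input(shape=(N_CLASSES,))              # one-hot fetal state
    x = layers.Concatenate()([noise, label])             # condition on the label
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    sample = layers.Dense(N_FEATURES, activation="tanh")(x)  # scaled CTG features
    return keras.Model([noise, label], sample)

def build_discriminator():
    sample = keras.Input(shape=(N_FEATURES,))
    label = keras.Input(shape=(N_CLASSES,))
    x = layers.Concatenate()([sample, label])            # the label helps spot fakes
    x = layers.Dense(128, activation="relu")(x)
    validity = layers.Dense(1, activation="sigmoid")(x)  # real vs. generated
    return keras.Model([sample, label], validity)
```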
Fig. 1. Architecture of the GAN Model
2.2 Classifiers

• Support Vector Machine Classifier: The Support Vector Machine (SVM) [10] classifier is one of the most widely used supervised machine learning algorithms for imbalanced data. It is mainly used in classification and regression problems, and it has been reported to be less affected by dataset class imbalance than other algorithms. The general idea of the SVM approach is to find an optimal separation, or 'hyperplane', with the help of a kernel, to separate two or more classes and create a rigid boundary between the samples, which helps in classification, regression, and outlier detection.
• Extreme Gradient Boosting Classifier: Extreme Gradient Boosting (XGBoost) [11] implements the extreme gradient boosting algorithm. It is a powerful machine learning technique with numerous components, implementing gradient-boosted decision trees. It applies regularized boosting, handles missing values automatically, and is designed to be highly efficient, flexible, portable, and applicable to tabular and structured data. Similarly to SVM, it is used for classification and regression purposes. XGBoost was created with deep consideration of both machine learning techniques and system optimization.
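As a sketch, the two classifiers can be instantiated as follows with scikit-learn and the xgboost package; the default settings mentioned in Sect. 3.3 are assumed.

```python
# Classifier setup sketch (default hyperparameters, as used in Sect. 3.3).
from sklearn.svm import SVC
from xgboost import XGBClassifier  # assumes the xgboost package is installed

svm_clf = SVC()                                  # kernel-based max-margin classifier
xgb_clf = XGBClassifier(eval_metric="mlogloss")  # regularized gradient-boosted trees
```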
3 Experiments

3.1 Dataset

The Cardiotocography (CTG) dataset [12] consists of Fetal Heart Rate (FHR) and Uterine Contraction (UC) signals classified by doctors. The dataset contains 2126 fetal cardiotocogram samples, divided according to morphologic patterns and fetal states into three imbalanced classes: 1655 normal samples (N), 295 suspicious samples (S), and 176 pathological samples (P). The dataset can be treated as either a 10-class or a 3-class problem, and each sample is characterized by 23 attributes. In our study, we used three datasets: the original dataset, the CGAN-generated synthetic dataset, and a mixed dataset combining the two. The original dataset contains the real-world data used as the basis for training the machine learning models. The synthetic dataset was generated by a CGAN trained on the original dataset, so that its characteristics are similar to those of the original data. Finally, the mixed dataset was created by combining the original and synthetic datasets into a larger, more diverse dataset for training the models.
Fig. 2. Comparison of Class Distribution in Original, Synthetic, and Mixed Datasets
Figure 2 shows the total number of samples for each class in the three datasets: original, synthetic, and mixed. The x-axis displays the classes, and the y-axis represents the total number of samples. Using these three datasets, we were able to evaluate the impact of the synthetic dataset on the performance of the model compared to using only the original dataset. We analyzed the performance of the model on each dataset separately, as well as on the mixed dataset, to determine the effectiveness of the synthetic data in improving the accuracy of the model. Each dataset is split into two subsets, a training set and a testing set, to evaluate the performance of the CGAN model: 80% of the data is used to train the model and 20% to test it. The generator sub-model of the selected CGAN then generates synthetic data instances based on the training instances.

3.2 CGAN Parameter Tuning

Several parameters can be tuned when generating synthetic data with a CGAN. In this study, we fine-tuned the batch size [13], noise dimension [14], number of epochs [15], learning rate [16], and stratified K-fold setting [17], which can help improve the performance of the CGAN model and generate more realistic synthetic data. The specific values for each parameter depend on the problem being solved and the nature of the data being generated. We conducted extensive parameter tuning to determine the optimal batch size, noise dimension, number of epochs, and learning rate for our CGAN model. First, we evaluated batch sizes of 16, 128, and 264, and found that a batch size of 128 was optimal, balancing training speed and stability. Next, we varied the noise dimension over the values 50, 100, 264, 300, and 350: a noise dimension of 264 yielded the best results, with higher values leading to overfitting and lower values to underfitting. We then experimented with epoch numbers ranging from 100 to 1000 and found that training for 100 epochs was enough to attain acceptable performance without overfitting. We also used stratified K-fold cross-validation, which splits the dataset into k folds while preserving the percentage of samples of each class in each fold; we used 5 and 10 folds, meaning the dataset is divided into five and ten sections, respectively [18]. In each iteration, one part becomes the validation set. This technique is especially useful when dealing with imbalanced datasets, where some classes may have significantly fewer samples than others [19]. Finally, we tested learning rates of 5e-6, 1e-5, and 1e-6, and determined that 1e-5 was optimal for our model and dataset. Table 1 summarizes the parameter tuning process and shows the best value of each tuned parameter. These results can serve as a reference for future work and can help guide the development of similar models or datasets.

3.3 Classifiers Hyperparameter and Parameter Tuning

Our study investigated the effectiveness of the default parameter settings for the XGBoost and SVM classifiers. However, we explored different random_state values for the stratified K-fold cross-validation, including None, 0, 5, 10, and 42, to examine their impact on the results. After performing the experiments, we observed that stratified K-fold with 10 folds and a random_state value of 10 yielded the most favorable outcomes. We therefore chose this configuration for all subsequent experiments in our study.
We also adjusted the random state of the XGBoost and SVM classifiers to examine its effect on model performance.

Table 1. CGAN parameter tuning results.

Parameter        | Best value
-----------------|-----------
Batch size       | 128
Noise dimension  | 264
Number of epochs | 100
Learning rate    | 1e-5
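The evaluation protocol above can be sketched as follows; X and y are placeholders standing in for the 23 CTG attributes and the three-class labels, and random_state=10 reflects the value the study found most favorable.

```python
# Sketch of the 80/20 split and stratified 10-fold protocol described above.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2126, 23))    # placeholder for the 23 CTG attributes
y = rng.integers(0, 3, size=2126)  # placeholder for the N/S/P class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=10)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=10)
for train_idx, val_idx in skf.split(X_train, y_train):
    pass  # each fold preserves the class proportions of the imbalanced dataset
```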
4 Results and Discussion

This section outlines the experimental setup and the outcomes of the CGAN experiments on the original, synthetic, and mixed datasets. The reported results were obtained from training and testing with the parameters that provided the best outcomes, as detailed in the CGAN parameter tuning section (Sect. 3.2). The first experiment compared the performance of two classifiers, XGBoost and SVM, on the original dataset. The XGBoost classifier achieved a significantly higher accuracy of 97.18% compared to the SVM classifier, which reached 90.85%, as shown in Table 2. In the second experiment, we evaluated the performance using the synthetic data. The XGBoost classifier achieved an accuracy of 93.28%, as shown in Table 2, while the SVM classifier achieved 93.93%. This indicates that both classifiers performed relatively well on the synthetic dataset, with SVM performing slightly better than XGBoost. Finally, the third experiment evaluated the performance of the XGBoost and SVM classifiers on the mixed dataset. As shown in Table 2, XGBoost outperformed SVM in terms of accuracy, achieving a maximum accuracy of 93.57% compared to 90.97% for SVM. The F1 score was also employed to evaluate the performance of the synthetic data generated with the CGAN when applied to the XGBoost and SVM classifiers. The results show distinct patterns in the F1 scores across classifiers and epochs, as reported in Table 2. The XGBoost classifier achieved an F1 score of 0.98 on the synthetic dataset, and the SVM classifier likewise performed strongly on the same dataset, reaching an F1 score of 0.97. These results indicate that both classifiers performed well on the synthetic dataset, with XGBoost showing slightly better performance in terms of the F1 score. When evaluated on the mixed dataset, both classifiers continued to perform competently.
Table 2. Experimental results on the original, synthetic, and mixed datasets.

Dataset           | Classifier | Epochs | Accuracy rate | F1-score
------------------|------------|--------|---------------|---------
Original dataset  | XGBoost    | -      | 97.18%        | 0.97
Original dataset  | SVM        | -      | 90.85%        | 0.90
Synthetic dataset | XGBoost    | 100    | 93.28%        | 0.93
Synthetic dataset | XGBoost    | 200    | 89.15%        | 0.89
Synthetic dataset | XGBoost    | 300    | 84.82%        | 0.96
Synthetic dataset | XGBoost    | 400    | 72.23%        | 0.68
Synthetic dataset | XGBoost    | 500    | 74.19%        | 0.85
Synthetic dataset | XGBoost    | 600    | 83.51%        | 0.87
Synthetic dataset | XGBoost    | 700    | 89.37%        | 0.98
Synthetic dataset | XGBoost    | 800    | 88.72%        | 0.88
Synthetic dataset | XGBoost    | 900    | 86.12%        | 0.65
Synthetic dataset | XGBoost    | 1000   | 87.85%        | 0.85
Synthetic dataset | SVM        | 100    | 93.93%        | 0.93
Synthetic dataset | SVM        | 200    | 90.67%        | 0.90
Synthetic dataset | SVM        | 300    | 86.55%        | 0.95
Synthetic dataset | SVM        | 400    | 76.36%        | 0.71
Synthetic dataset | SVM        | 500    | 78.96%        | 0.81
Synthetic dataset | SVM        | 600    | 85.68%        | 0.87
Synthetic dataset | SVM        | 700    | 90.89%        | 0.97
Synthetic dataset | SVM        | 800    | 90.02%        | 0.87
Synthetic dataset | SVM        | 900    | 89.80%        | 0.64
Synthetic dataset | SVM        | 1000   | 89.59%        | 0.84
Mixed dataset     | XGBoost    | 100    | 93.57%        | 0.93
Mixed dataset     | XGBoost    | 200    | 91.76%        | 0.92
Mixed dataset     | XGBoost    | 300    | 88.83%        | 0.95
Mixed dataset     | XGBoost    | 400    | 82.17%        | 0.83
Mixed dataset     | XGBoost    | 500    | 83.86%        | 0.88
Mixed dataset     | XGBoost    | 600    | 89.28%        | 0.91
Mixed dataset     | XGBoost    | 700    | 92.10%        | 0.97
Mixed dataset     | XGBoost    | 800    | 91.53%        | 0.91
Mixed dataset     | XGBoost    | 900    | 91.31%        | 0.83
Mixed dataset     | XGBoost    | 1000   | 89.39%        | 0.91
Mixed dataset     | SVM        | 100    | 90.97%        | 0.90
Mixed dataset     | SVM        | 200    | 90.07%        | 0.87
Mixed dataset     | SVM        | 300    | 87.13%        | 0.91
Mixed dataset     | SVM        | 400    | 79.68%        | 0.79
Mixed dataset     | SVM        | 500    | 82.28%        | 0.83
Mixed dataset     | SVM        | 600    | 87.36%        | 0.88
Mixed dataset     | SVM        | 700    | 89.50%        | 0.94
Mixed dataset     | SVM        | 800    | 89.84%        | 0.89
Mixed dataset     | SVM        | 900    | 89.62%        | 0.79
Mixed dataset     | SVM        | 1000   | 89.50%        | 0.88
Fig. 3. Performance of CGAN on Synthetic and Mixed Datasets
When processed with the XGBoost classifier, the mixed dataset resulted in an F1 score of 0.97, while the SVM classifier achieved a still respectable F1 score of 0.94. The highest F1 scores were achieved at epoch 700, as shown in Table 2. Notably, as shown in Fig. 3, the maximum accuracy rates on the synthetic and mixed datasets were achieved at 100 epochs. These results suggest that the best performance can be achieved with a relatively low number of epochs, which could save computational resources and time. When comparing the average total training times of the classifiers on the various datasets, notable differences emerge. For the original dataset, the XGBoost classifier took approximately 3.98 s to train, while the SVM classifier completed training in just 0.11 s.
These results suggest that SVM has a significant advantage in terms of training speed on the original dataset. On the synthetic dataset, the training time of the XGBoost classifier was reduced to around 1.65 s, while the SVM classifier finished training in approximately 0.12 s, so XGBoost still required more time than SVM. Similarly, for the mixed dataset, the training time of XGBoost was approximately 2.92 s, while the SVM classifier required about 0.35 s. Once again, SVM exhibited faster training than XGBoost. These results highlight the efficiency of SVM in terms of training speed across the different datasets.
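The reported timings were presumably gathered along the following lines; this is a sketch of the measurement, not the authors' actual instrumentation.

```python
# Sketch of measuring average training time per classifier.
import time

def average_fit_time(clf, X_train, y_train, repeats=5):
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        clf.fit(X_train, y_train)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)
```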
5 Conclusions

Our work highlights that, with an optimal set of parameters, the synthetic data generated after training the model can be used for fetal health diagnosis and monitoring purposes, improving the performance of predictive models and reducing the need to collect additional data. However, additional research is required to determine the efficacy of CGANs in other clinical domains. In our study, we evaluated the performance of the XGBoost and SVM classifiers on the original dataset, the CGAN-generated synthetic dataset, and the mixed dataset. Our results demonstrated that XGBoost outperformed SVM on the original dataset. Nevertheless, both classifiers performed comparatively well on the synthetic dataset, with SVM performing marginally better than XGBoost, while on the mixed dataset the XGBoost classifier was the more accurate of the two. These results demonstrate the potential of CGAN-generated synthetic datasets to enhance the efficacy of machine learning classifiers on clinical data. Additionally, our findings indicate that a comparatively small number of epochs can achieve the best performance on the synthetic and mixed datasets, conserving computational resources and time. For future work, further investigation could address the impact of different network architectures, such as DCGAN or WGAN, on the quality of the generated synthetic data. It would also be interesting to examine the effect of different loss functions and machine learning classifiers on the generated synthetic data, and further tuning and optimising of parameters could potentially improve performance. Finally, we analysed the dataset as a 3-class problem; our findings sparked our interest in extending this research to the 10-class problem.
References

1. Turing: Synthetic data generation: definition, types, techniques, and tools (2022). https://www.turing.com/kb/synthetic-data-generation-techniques. Accessed 14 June 2023
2. Saxena, D., Cao, J.: Generative adversarial networks (GANs): challenges, solutions, and future directions. ACM Comput. Surv. (CSUR) 54(3), 1–42 (2021)
3. Tewari, A.: Types of generative adversarial networks (GANs). OpenGenus IQ: Computing Expertise & Legacy (2022). https://iq.opengenus.org/types-of-gans/
4. Vega-Márquez, B., Rubio-Escudero, C., Riquelme, J.C., Nepomuceno-Chamorro, I.: Creation of synthetic data with conditional generative adversarial networks. In: Martínez Álvarez, F., Troncoso Lora, A., Sáez Muñoz, J.A., Quintián, H., Corchado, E. (eds.) SOCO 2019. AISC, vol. 950, pp. 231–240. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-20055-8_22
5. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning (2017)
6. Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11(10), 1071–1083 (2018). https://doi.org/10.14778/3231751.3231757
7. Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: Advances in Neural Information Processing Systems, vol. 32, pp. 7335–7345. Curran Associates, Inc. (2019)
8. Vega-Márquez, B., Rubio-Escudero, C., Nepomuceno-Chamorro, I.: Generation of synthetic data with conditional generative adversarial networks. Logic J. IGPL 30(2), 252–262 (2022)
9. Figueira, A., Vaz, B.: Survey on synthetic data generation, evaluation methods and GANs. Mathematics 10(15), 2733 (2022)
10. Pedregosa, F., et al.: Scikit-learn: machine learning in Python (2011). https://www.jmlr.org/papers/v12/pedregosa11a.html. Accessed Nov 2022
11. Guest Blog: Introduction to XGBoost algorithm in machine learning. Analytics Vidhya (2023). https://www.analyticsvidhya.com/blog/2018/09/an-end-to-end-guide-to-understand-the-math-behind-xgboost/. Accessed Nov 2022
12. Campos, D., Bernardes, J.: Cardiotocography. UCI Machine Learning Repository (2010). https://doi.org/10.24432/C51S4N. Accessed 20 Apr 2023
13. Sinha, S., Zhang, H., Goyal, A., Bengio, Y., Larochelle, H., Odena, A.: Small-GAN: speeding up GAN training using core-sets. In: International Conference on Machine Learning, pp. 9005–9015. PMLR (2020)
14. Padala, M., Das, D., Gujar, S.: Effect of input noise dimension in GANs. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds.) ICONIP 2021. LNCS, vol. 13110, pp. 558–569. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-92238-2_46
15. Sharma, S.: Epoch vs batch size vs iterations. Towards Data Science (2019). https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9. Accessed Mar 2023
16. Brownlee, J.: How to configure the learning rate when training deep learning neural networks. MachineLearningMastery.com (2019). https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/. Accessed Feb 2023
17. Widodo, S., Brawijaya, H., Samudi, S.: Stratified K-fold cross validation optimization on machine learning for prediction. Sinkron: Jurnal dan Penelitian Teknik Informatika 7(4), 2407–2414 (2022)
18. Explain stratified K fold cross validation in ML in Python. ProjectPro (2022). https://www.projectpro.io/recipes/explain-stratified-k-fold-cross-validation
19. Szeghalmy, S., Fazekas, A.: A comparative study of the use of stratified cross-validation and distribution-balanced stratified cross-validation in imbalanced learning. Sensors 23, 2333 (2023). https://doi.org/10.3390/s23042333
Olive Oil Fly Population Pest Forecasting Using Explainable Deep Learning

A. M. Chacón-Maldonado, A. R. Troncoso-García, F. Martínez-Álvarez, G. Asencio-Cortés, and A. Troncoso

Data Science and Big Data Lab, Pablo de Olavide University, 41013 Seville, Spain
{amchamal,artrogar,fmaralv,guaasecor,atrolor}@upo.es
Abstract. This paper proposes an application of an Automated Deep Learning model to predict the presence of olive flies in crops. Compared to baseline algorithms such as Random Forest or K-Nearest Neighbors, our Automated Deep Learning model demonstrates superior performance. Explainable Artificial Intelligence techniques, such as Local Interpretable Model-Agnostic Explanations and SHapley Additive exPlanations, are applied to interpret the results, revealing solar radiation as a key predictor of the presence of the olive fly. This study advances deep learning for agriculture, showcasing the superiority of Automated Deep Learning and providing interpretable insights for effective pest management.

Keywords: deep learning · explainable artificial intelligence · agriculture

1 Introduction
Agriculture constitutes one of the main industries in the world, as a primary source of food and raw materials. Nevertheless, the presence of pests in crops can result in significant economic losses and, in certain cases, jeopardize food security. Uncontrolled pest infestation reduces crop efficiency and destroys the harvest, resulting in substantial losses for farmers in terms of production and management costs. Pests are among the most critical limiting factors for agricultural productivity, responsible for global losses ranging from 37% to 50% [8]. Early detection of pests and timely implementation of prevention and control plans would allow farmers to cultivate healthy crops without worrying about pest invasions that diminish their quality. Thus, implementing maintenance and control measures in agricultural areas ensures the best sanitary conditions for crop environments [1]. Predictive systems able to detect pest events early are crucial for evaluating management strategies and alternatives, and are a valuable tool for agricultural producers. The objective of this study is to predict the quantity of olive flies that will fall into traps, treating this problem as a regression task. To achieve this, the Automated Deep Learning (ADL) algorithm [3] has been applied. The ADL model includes an internal validation for hyperparameter optimization to obtain the best configuration of the neural network (NN).
This algorithm has previously been applied to an agriculture classification problem, specifically the phenology of olives in Andalusia, Spain, obtaining promising results [3]. To validate the proposal, a fair comparison with other algorithms has been performed. The obtained results have been subjected to explainability analysis in an attempt to address the 'black-box' nature of current deep learning algorithms. In recent years, there has been a growing interest in explainability techniques for artificial intelligence models (XAI) [15]. Although users have knowledge of the input data, outputs, and model architecture, there is limited understanding of the internal processes that drive the results. XAI refers to a set of processes and methods that allow human users to understand and trust the results generated by machine learning algorithms. There are various techniques for adding explainability, ranging from starting with the output data of the NN and examining how different attributes interact to produce a specific result through force-directed graphs, to association rules [4]. In this study, Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) have been used to obtain valuable information about the importance of the attributes in the final output. The remainder of the paper is structured as follows: Sect. 2 reviews recent advances in the field; Sect. 3 describes the methodology; Sect. 4 details the experiments carried out and presents the results that have been obtained; and Sect. 5 concludes the paper.
2 Related Work
Deep learning (DL) and machine learning (ML) are becoming widely used to solve problems in different fields [6], and many applications can be found in the literature [17]. These techniques are not yet widespread in the agriculture sector, although some works introducing DL techniques have been published, such as [11], where basic algorithms are used to predict diseases and crop quality. The present study is a continuation of the experiments started in [3], where the same techniques were applied but without adding interpretability to the results obtained. That work focused on improving the accuracy of olive phenology forecasting using DL models with hyperparameter optimization to optimize the architecture, together with class-balancing preprocessing and the introduction of new variables. It improved on the phenological forecast classification results in the literature for 16 different plots from 4 distinct areas of Spain, with accuracy improvements of around 4% and, in some cases, above 20%, showing the good performance of the ADL algorithm on agricultural data. Concerning the field of pest prediction, the article [9] applied techniques based on models such as SVM, NN, and Bayesian networks for crop pest prediction. In [5], ML algorithms were utilized to develop models for climate-based disease and pest alert systems to improve the efficiency of chemical pest control for coffee trees.
Climate variables such as air temperature and relative humidity showed significant correlations with coffee rust disease. In [2], some of the most relevant research on automatic pest detection through proximal digital images and ML is described, and in [20] an implementation of wireless sensor networks for pest prediction was presented. In [12], different ML models that can predict the daily appearance of insects during a certain season are presented; the results showed high accuracy levels for predicting insect occurrence, up to 76.5%. However, DL predictions need to be explained, and XAI has become a promising research field. Several articles address the problem of obtaining explanations for regression tasks or time series forecasting. For example, the work in [7] developed an XAI model for short-term electricity consumption forecasting in electric vehicle load demand. Another example is found in [18], where XAI techniques were utilized to explain health data models in the field of sleep apnea detection.
3 Methodology
This section describes the proposed methodology, which uses an optimized NN to generate daily predictions of the number of flies that fall into traps, using weekly data on trapped flies together with other meteorological attributes collected daily. Four distinct weekly prediction horizons are considered, each representing a different future week in which the samples are taken (horizon 1, the number of flies fallen into traps within one week; horizon 2, within two weeks; and so on). The problem is thus treated as a regression problem, since known data are used to predict an unknown continuous value. The model is based on a fully connected feed-forward NN [16], whose architecture is optimized by the Hyperband algorithm [10]. The remainder of this section is structured as follows: Sect. 3.1 describes the ADL algorithm; Sect. 3.2 describes the algorithms used for comparing the obtained results; and Sect. 3.3 presents the explainability tools mentioned in Sect. 1.

3.1 ADL
The purpose of ADL is to optimize the configuration of the parameters of a NN. The Python library KerasTuner, an easy-to-use and scalable hyperparameter optimization framework, has been used for model optimization; in particular, its Hyperband tuner, whose configuration is given in Table 1. In Table 1, NN denotes parameters of the neural network, and the remaining entries correspond to the model configuration. The NN is structured with between one and five layers, and the output shape of each layer lies between 10 and 100 in steps of 10. In general, rather than setting an exhaustive parameterization, a grid is defined over which the tuner searches for the best configuration for the data. Concerning the model, the remaining hyperparameters were not optimized: the number of epochs is set to 100 and the activation function is ReLU. Finally, the best individual, corresponding to a particular network architecture, is selected.
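A sketch of the Hyperband search with KerasTuner, following the ranges in Table 1, is shown below; the commented training call uses placeholder data names.

```python
# Hyperband search sketch matching the ranges in Table 1 (assumes keras_tuner).
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    model = keras.Sequential()
    for i in range(hp.Int("num_layers", min_value=1, max_value=5)):
        model.add(layers.Dense(
            hp.Int(f"units_{i}", min_value=10, max_value=100, step=10),
            activation="relu"))                  # ReLU in all layers
    model.add(layers.Dense(1))                   # single regression output
    model.compile(optimizer="adam", loss="mae", metrics=["mae"])
    return model

tuner = kt.Hyperband(build_model, objective="val_mae", max_epochs=100)
# tuner.search(X_train, y_train, validation_data=(X_val, y_val))  # placeholders
```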
Table 1. Settings for the Hyperband optimization algorithm.

Parameter                   | Description
----------------------------|------------
Number of layers (NN)       | The range of layers is from 1 to 5, not counting the output layer
Output shape (NN)           | Output space of each layer on a grid from 10 to 100 with steps of 10
Optimizer (NN)              | Adaptive Moment Estimation (ADAM)
Loss function (NN)          | Mean Absolute Error (MAE)
Optimization metric (NN)    | Mean Absolute Error (MAE)
Epochs (model)              | One epoch is one full forward and backward pass of the dataset through the NN; 100 epochs were used
Activation function (model) | ReLU in all layers
3.2 Benchmark Algorithms
The algorithms used for comparison with the obtained results are the following:

1. Bayesian Ridge. This algorithm combines elements of linear regression and Bayesian inference to make predictions. It uses a Bayesian approach to estimate the weights, which means it also takes into account uncertainty in the estimates. It introduces a penalty on the weights of the input variables, which helps prevent overfitting and improves the model's generalization ability.
2. Gaussian Process. In a Gaussian process, the output of a model is assumed to be distributed according to a multivariate normal distribution, specified by a covariance function that describes the correlation between any pair of data points. The algorithm is unique in that, in addition to making predictions, it also provides information about their uncertainty.
3. Lasso. A form of linear regularization that adds a penalty term to the traditional loss function. This term penalizes the magnitude of the regression coefficients, which helps reduce the number of features considered in the model and prevents overfitting. The penalty term in Lasso is a linear function of the absolute value of the regression coefficients, known as the L1 term.
4. Ridge. Similar to Lasso, it adds a penalty to the traditional loss function to prevent overfitting. The penalty term in Ridge is a quadratic function of the regression coefficients, known as the L2 term, and is likewise proportional to the magnitude of the coefficients.
5. Multilayer Perceptron (MLP). It models the relationship between the input variables and the desired output variable through a sequence of hidden layers. Each hidden layer is made up of a set of nodes representing mathematical functions that combine input signals and produce an output signal. The final layer is an output layer containing a single node that produces the prediction for the desired output variable.

3.3 Explainability
Existing post-hoc explainability techniques have been applied to the results obtained by the ADL model to interpret its predictions, namely LIME and SHAP. LIME is a popular and simple tool that allows users to understand how models make decisions in terms of the individual features of the input data. Essentially, LIME explains a single prediction in terms of the features of the input data that are relevant for that prediction, which helps to better understand how the model works and which features are the most important for each prediction [14]. SHAP is a game-theoretic approach to explain the output of any ML model, based on the concept of Shapley values, which represent the contribution of each feature to the prediction while taking into account the interactions between features. Depending on the algorithm used, SHAP employs different explainer models, including models for tree-based algorithms, models for NNs (the case in this paper), and models for ML algorithms in general. SHAP generates both local and global explanations, together with a wide range of graphical representations that are very useful for understanding the importance of the features in making predictions [13].
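As a sketch, both libraries can be applied as follows; the data and the prediction function are synthetic stand-ins for the trained ADL network, and the model-agnostic KernelExplainer is shown here although an NN-specific SHAP explainer would be analogous.

```python
# Post-hoc explanation sketch with LIME and SHAP (synthetic stand-in model).
import numpy as np
import lime.lime_tabular
import shap

rng = np.random.default_rng(1)
X_train, X_test = rng.normal(size=(200, 10)), rng.normal(size=(20, 10))
feature_names = [f"f{i}" for i in range(10)]
w = rng.normal(size=10)

def predict(x):                      # stand-in for the trained ADL network
    return x @ w

lime_exp = lime.lime_tabular.LimeTabularExplainer(
    X_train, feature_names=feature_names, mode="regression")
explanation = lime_exp.explain_instance(X_test[0], predict, num_features=6)

shap_exp = shap.KernelExplainer(predict, shap.sample(X_train, 50))
shap_values = shap_exp.shap_values(X_test)
shap.force_plot(shap_exp.expected_value, shap_values, X_test)  # force plot
```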
4 Experimentation and Results
This section presents the results of the experiments carried out to predict the number of flies fallen into traps one week ahead. Section 4.1 describes the input data used; Sect. 4.2 details the quality measures used to compare the ADL model with the baseline algorithms; and Sect. 4.3 introduces the explanations of the results.

4.1 Input Data
The data used have been collected in Cádiz, Andalusia, Spain. The dataset consists of 322 attributes: 321 of them fuse information from different sources, such as satellite image band data, vegetative indices, temperature, humidity, and wind, recorded on a daily basis. The remaining attribute represents the quantity of olive flies fallen into traps, which is recorded weekly; its value is therefore repeated daily during that week.
Furthermore, this dataset is divided into four distinct weekly horizons, each containing 670 instances (2680 instances in total) and 322 attributes. The weekly prediction horizons refer to the time when the samples of the quantity of flies fallen into traps were taken. For example, the first weekly prediction horizon gives the quantity of flies fallen into traps within one week (recall that this value is recorded weekly, so it is repeated in the data seven times, once for each day); the second gives the quantity fallen within two weeks; and so on, up to four weeks (the fourth horizon). The problem is treated as a regression problem, predicting the quantity of flies fallen into traps from the corresponding attribute in the dataset.
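A sketch of how an h-week-ahead target could be derived from such a daily frame is shown below; df and the column name flies are hypothetical placeholders for the actual dataset layout.

```python
# Sketch: build the h-week-ahead target from daily rows whose weekly trap
# count is repeated over the seven days of each week (placeholder layout).
import pandas as pd

def add_horizon_target(df: pd.DataFrame, horizon_weeks: int) -> pd.DataFrame:
    out = df.copy()
    out["target"] = out["flies"].shift(-7 * horizon_weeks)  # value h weeks ahead
    return out.dropna(subset=["target"])
```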
4.2 Results and Discussion
The ADL model does not have fixed parameter values but rather optimal parameter ranges, from which the best model for each case is obtained using KerasTuner; we therefore have four different models with different hyperparameter values, whose ranges are presented in Sect. 3.1. The ADL model is compared in this section with the benchmark algorithms introduced in Sect. 3.2. The benchmark algorithms use the default configuration given in their documentation, except for some specific parameters, as follows:

1. Bayesian Ridge: default configuration.
2. Gaussian Process: default configuration.
3. Lasso: default configuration except alpha = 0.1 (this parameter controls the strength of the regularization).
4. Ridge: default configuration except alpha = 0.5 (this parameter controls the strength of the regularization).
5. Multilayer Perceptron: default configuration except random_state = 1 and max_iter = 500.
6. ADL: configuration described in Sect. 3.1.

The quality metric used to compare the algorithms with ADL is the MAE, which measures the average difference between the actual values and the values predicted by the model; it is computed with the Python scikit-learn library from the actual and predicted values. Table 2 presents a comparison of the MAE obtained by ADL and the benchmark algorithms. The results show that the ADL model outperforms the rest of the algorithms across the four prediction horizons, with its highest performance in the predictions for horizons 1 and 2. Exceptions occur in horizon 3, where the Gaussian Process obtains a better result than ADL, and in horizon 4, where Lasso improves slightly on ADL; however, the differences there are not as remarkable as those observed for horizons 1 and 2.
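For reference, the benchmark configurations above map to scikit-learn as in the following sketch; the fit/score loop is left commented, with placeholder array names.

```python
# Benchmark models with the settings listed above (scikit-learn defaults
# except where stated).
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.linear_model import BayesianRidge, Lasso, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.neural_network import MLPRegressor

models = {
    "Bayesian Ridge": BayesianRidge(),
    "Gaussian Process": GaussianProcessRegressor(),
    "Lasso": Lasso(alpha=0.1),
    "Ridge": Ridge(alpha=0.5),
    "Multilayer Perceptron": MLPRegressor(random_state=1, max_iter=500),
}
# for name, model in models.items():            # X_train etc. are placeholders
#     model.fit(X_train, y_train)
#     print(name, mean_absolute_error(y_test, model.predict(X_test)))
```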
Table 2. Comparison of the MAE obtained by the algorithms for the four prediction horizons.

Model                 | Horizon 1 | Horizon 2 | Horizon 3 | Horizon 4
----------------------|-----------|-----------|-----------|----------
ADL                   | 0.296     | 0.406     | 0.476     | 0.487
Bayesian Ridge        | 0.437     | 0.555     | 0.553     | 0.496
Gaussian Process      | 0.597     | 0.961     | 0.430     | 0.569
Lasso                 | 0.370     | 0.508     | 0.581     | 0.428
Ridge                 | 0.609     | 1.035     | 0.580     | 0.561
Multilayer Perceptron | 1.119     | 0.528     | 0.632     | 0.487
ADL has proved highly effective in optimizing the parameters of the NN. Furthermore, when applied to meteorological and agricultural data, ADL has proven to be a powerful tool for accurately predicting the number of flies caught in traps compared to the baseline algorithms.

4.3 Explainability
In this section, XAI techniques are applied to the results obtained by the ADL model. The objective is to explain how the NN makes its predictions by identifying the most relevant variables; as the input data have more than 300 attributes, highlighting the most important ones could be useful for optimizing the analysis of this kind of data. Specifically, the Python libraries for LIME and SHAP have been applied, as introduced in Sect. 3.3. Both techniques add post-hoc explainability, meaning that they are independent of the model.
Fig. 1. LIME 6 top features. Prediction output: no flies (0).
LIME only provides local explanations for single instances of the input data. Experiments have been carried out on random instances, summarizing the most important features for the global predictions. LIME can be configured to plot a certain number of top features for each prediction: too many attributes lead to confusion, while too few are not representative. After an experimentation process, the six top attributes of each predicted instance have been considered in this study. Two example instances are illustrated in Figs. 1 and 2.
Fig. 2. LIME 6 top features. Prediction output: a certain number of flies (2.08).
Figure 1 represents an instance with no flies in the trap, which is considered the 'negative class' according to LIME, while Fig. 2 depicts an instance with a predicted value of 2.08 flies, i.e., 'positive'. In both cases, it can be observed that the features with the greatest influence on the prediction are SOLR SUM and DIRW, representing solar radiation and wind direction from different weeks, respectively. SHAP creates two distinct graphs, called force plots: a local plot showing the most influential attributes for the result obtained on a particular instance, and a plot displaying the same information for a set of instances (50 instances in this case). Both types of graphs have been obtained for individual instances and for sets of 50 instances for the four horizons during the experimentation; the most representative are shown in Figs. 3 and 4. As can be seen, the most relevant attributes are SOLR SUM bk w-1 sXX (where XX is the number of the week in which the sample was taken), which refer to the solar radiation on the indicated date.
Fig. 3. Force plot for a given instance corresponding to horizon 2.
The explanations provided by LIME and SHAP make clear that the level of solar radiation is one of the most influential attributes associated with the presence of olive flies in the crops. This finding provides valuable information on the relationship between environmental factors and the presence of the pest, which can inform effective pest management strategies in olive groves and other agricultural settings.
Fig. 4. Force plot for a set of instances corresponding to horizon 4.
5 Conclusions and Future Work
This study makes a significant contribution to the field of deep learning applied to the agricultural domain. The research showcases the superiority of the ADL model, which automatically optimizes hyperparameters, over conventional ML approaches, underscoring its effectiveness in predicting the number of flies caught in traps. The two XAI techniques used in this study, LIME and SHAP, provide interpretable explanations for the predictions of the ADL model and identify its most important attributes, improving its transparency and usability in real-world agricultural applications. The strong significance of solar radiation as a predictor of olive fly presence is one of the main conclusions highlighted. As future lines of work, the objectives are to enhance the ADL model and to explore further XAI techniques. Concerning deep learning, the parameter range configuration of the ADL algorithm will be extended, and the model could be applied to other agricultural data. Regarding explainability, other existing tools such as tensorLy, or techniques based on quantitative association rules [19], could be tested in this specific scope.

Acknowledgments. The authors would like to thank the Spanish Ministry of Science and Innovation for the support under the project PID2020-117954RB-C21 and the European Regional Development Fund and Junta de Andalucía for projects PY2000870 and UPO-138516.
References

1. Albanese, A., Nardello, M., Brunelli, D.: Automated pest detection with DNN on the edge for precision agriculture. IEEE J. Emerg. Select. Top. Circuit. Syst. 11(3), 458–467 (2021)
2. Barbedo, J.G.A.: Detecting and classifying pests in crops using proximal images and machine learning: a review. AI 1(2), 312–328 (2020)
3. Chacón-Maldonado, A.M., Molina-Cabanillas, M.A., Troncoso, A., Martínez-Álvarez, F., Asencio-Cortés, G.: Olive phenology forecasting using information fusion-based imbalanced preprocessing and automated deep learning. In: Bringas, P.G., et al. (eds.) Hybrid Artificial Intelligent Systems. HAIS 2022. LNCS, vol. 13469. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-15471-3_24
4. Das, A., Rad, P.: Opportunities and challenges in explainable artificial intelligence (XAI): a survey. arXiv preprint arXiv:2006.11371 (2020)
5. de Oliveira Aparecido, L.E., de Souza Rolim, G., da Silva Cabral De Moraes, J.R., Costa, C.T.S., de Souza, P.S.: Machine learning algorithms for forecasting the incidence of Coffea arabica pests and diseases. Int. J. Biometeorol. 64, 671–688 (2020)
6. Dong, S., Wang, P., Abbas, K.: A survey on deep learning and its applications. Comput. Sci. Rev. 40, 100379 (2021)
7. Gallardo-Gómez, J.A., Divina, F., Troncoso, A., Martínez-Álvarez, F.: Explainable artificial intelligence for the electric vehicle load demand forecasting problem. In: Bringas, P.G., et al. (eds.) 17th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2022). LNCS, vol. 531. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-18050-7_40
8. He, H., et al.: Crop diversity and pest management in sustainable agriculture. J. Integr. Agric. 18(9), 1945–1952 (2019)
9. Kim, Y.H., Yoo, S.J., Gu, Y.H., Lim, J.H., Han, D., Baik, S.W.: Crop pests prediction method using regression and machine learning technology: survey. IERI Procedia 6, 52–56 (2014)
10. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: a novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18(1), 6765–6816 (2017)
11. Liakos, K.G., Busato, P., Moshou, D., Pearson, S., Bochtis, D.: Machine learning in agriculture: a review. Sensors 18(8), 2674 (2018)
12. Marković, D., Vujičić, D., Tanasković, S., Dordević, B., Randić, S., Stamenković, Z.: Prediction of pest insect appearance using sensors and machine learning. Sensors 21(14), 4846 (2021)
13. Nohara, Y., Matsumoto, K., Soejima, H., Nakashima, N.: Explanation of machine learning models using improved Shapley additive explanation. In: Proceedings of the ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, p. 546 (2019)
14. de Sousa, I.P., Vellasco, M.R., da Silva, E.C.: Local interpretable model-agnostic explanations for classification of lymph node metastases. Sensors 19(13), 2969 (2019)
15. Rojat, T., Puget, R., Filliat, D., Del Ser, J., Gelin, R., Díaz-Rodríguez, N.: Explainable artificial intelligence (XAI) on time series data: a survey. arXiv preprint arXiv:2104.00950 (2021)
16. Torres, J.F., Hadjout, D., Sebaa, A., Martínez-Álvarez, F., Troncoso, A.: Deep learning for time series forecasting: a survey. Big Data 9(1), 3–21 (2021)
17. Torres, J.F., Troncoso, A., Koprinska, I., Wang, Z., Martínez-Álvarez, F.: Deep learning for big data time series forecasting applied to solar power. In: International Joint Conference SOCO2018-CISIS2018-ICEUTE2018, pp. 123–133 (2018)
18. Troncoso-García, A., Martínez-Ballesteros, M., Martínez-Álvarez, F., Troncoso, A.: Explainable machine learning for sleep apnea prediction. Procedia Comput. Sci. 207, 2930–2939 (2022)
19. Troncoso-García, A., Martínez-Ballesteros, M., Martínez-Álvarez, F., Troncoso, A.: A new approach based on association rules to add explainability to time series forecasting models. Inf. Fusion 94, 169–180 (2023)
20. Wani, H., Ashtankar, N.: An appropriate model predicting pest/diseases of crops using machine learning algorithms. In: 2017 4th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 1–4. IEEE (2017)
Explaining Learned Patterns in Deep Learning by Association Rules Mining

M. J. Jiménez-Navarro, M. Martínez-Ballesteros, F. Martínez-Álvarez, and G. Asencio-Cortés
Department of Computer Languages and Systems, University of Seville, Seville, Spain
{mjimenez3,mariamartinez}@us.es
Data Science and Big Data Lab, Pablo de Olavide University, 41013 Seville, Spain
{fmaralv,guaasecor}@upo.es
Abstract. This paper proposes a novel approach that combines an association rule algorithm with a deep learning model to enhance the interpretability of prediction outcomes. The study aims to gain insights into the patterns that were learned correctly or incorrectly by the model. To identify these scenarios, an association rule algorithm is applied to extract the patterns learned by the deep learning model. The rules are then analyzed and classified based on specific metrics to draw conclusions about the behavior of the model. We applied this approach to a well-known dataset in various scenarios, such as underfitting and overfitting. The results demonstrate that the combination of the two techniques is highly effective in identifying the patterns learned by the model and analyzing its performance in different scenarios through error analysis. We suggest that this methodology can enhance the transparency and interpretability of black-box models, thus improving their reliability for real-world applications.

Keywords: association rules · Apriori · deep learning · interpretability · explainable AI

1 Introduction
Deep learning has become a popular tool for solving complex problems in various domains, such as finance, healthcare, and engineering. However, the lack of interpretability and black-box nature of deep models pose significant challenges to trusting and understanding their predictions. For example, in medical diagnosis, a model that predicts a certain disease without providing a clear explanation for its prediction may not be trusted by physicians or patients, even if the model has high accuracy. To address this issue, explainability approaches have been developed, such as global/local model-agnostic methods [18] and example-based methods. However, these approaches can be challenging to understand for non-experts in the field [9]
and may be too general or too specific to identify particular scenarios where the model performed poorly or behaved like an outlier. Association rules (AR) [2] are a powerful tool for enhancing interpretability in machine learning by identifying meaningful relationships between variables [16]. In this paper, we propose a model-agnostic approach that combines AR mining with a deep learning model to enhance the interpretability of its predictions. By using an AR algorithm to extract the patterns learned by the deep learning model, the behavior of the model can be better understood through an intuitive cause-and-effect structure similar to a decision tree. Additionally, the rules identify generalizable scenarios without relying on global explanations. This approach can enhance trust in, and the reliability of, model predictions. To demonstrate the effectiveness of the proposed methodology, we applied it to various scenarios on a well-known dataset, using the Apriori [2] algorithm to discover ARs. The results show that the AR algorithm is an effective and simple approach to identifying the learned patterns and analyzing the performance of the model; it therefore has the potential to enhance the transparency and interpretability of black-box models across various domains, making them more reliable for real-world applications. The main contributions of this paper can be summarized as follows:

– Development of a methodology to evaluate the behavior of black-box models.
– A simple and easy-to-understand methodology that can identify the strengths and weaknesses of a model.
– Analysis of several models in various scenarios, including overfitting and underfitting, on a well-known dataset.

The remainder of this paper is structured as follows. Section 2 discusses related work on interpretability approaches. Section 3 provides a detailed overview of the proposed methodology. Section 4 presents the experimental setting for the different scenarios, together with the results and analysis of the proposed methodology applied to a public dataset. Finally, Sect. 5 summarizes the main conclusions of this work.
2 Related Works
Numerous studies have been conducted to understand the behavior learned by models, particularly in the context of deep learning due to its importance and black-box nature. Some studies use a heatmap approach [10,12,17] to explain the behavior learned by the model in image classification by illustrating the areas where the model focused to make its prediction. However, these methods require expert knowledge to understand the learned behavior and are not applicable to tabular data, which is the scope of this work. Other approaches use feature attribution methodologies, such as SHAP [14], to assign a numerical value that represents the importance of a feature [1,3,8].
However, these methods rely on expert knowledge to analyze feature attribution and provide overly general explanations. Finally, rule-based methods have been proposed for various applications due to their simplicity [15,19]. Bernardi et al. [5] use a non-parametric method to determine the range for out-of-distribution samples; Ferreira et al. [7] apply a genetic algorithm to build a tree representation of the operations needed to obtain the outputs for a local example; Lal et al. [13] develop a methodology to extract rules from an ensemble of tree models; and Barbiero et al. [4] use an entropy-based model to generate first-order logic explanations in a deep learning model. Although these approaches are similar to our methodology, most of them do not have the ability to adapt to local/global explanations or to be applied to any model.
3 Methodology
In this section, we present the model-agnostic method used to extract the patterns learned by a model and the subsequent analysis. In our methodology, we assume a traditional modeling approach in which the dataset is split into at least a training and a testing set. An arbitrary model is then trained on the training set and evaluated on the testing set. The input to our methodology is a dataset consisting of the features X, the true target variable y, and the predictions ŷ made by the model. Note that the dataset with the predictions corresponds to the training set, as our goal is to obtain the learned behavior during the training process using only the training set.
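A minimal sketch of assembling that input is shown below, assuming a fitted scikit-learn-style regressor; the names are illustrative, not part of the original method's code.

```python
# Sketch: assemble the methodology's input from the training split only.
import pandas as pd

def build_methodology_input(model, X_train: pd.DataFrame,
                            y_train: pd.Series) -> pd.DataFrame:
    data = X_train.copy()
    data["y_true"] = y_train.values
    data["y_pred"] = model.predict(X_train)  # learned behavior on training data
    return data
```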
Fig. 1. Workflow representation of the methodology; dashed lines represent optional steps. D_train and D_test denote the training and test datasets, while D̂_train and D̂_test refer to their discretized versions. R_train^real and R_test^real denote the rules obtained for the training and test datasets from the real target values, while R_train^pred and R_test^pred are the rules for the predictions made by the model.
3.1 Preprocessing
This step includes dropping correlated features and discretizing the remaining features, as shown in Fig. 1. Optionally, the features X can be preprocessed to remove highly correlated features. This is necessary because the Apriori algorithm may obtain similar rules for two correlated features, using them interchangeably because of their similar frequency. Once the training dataset D_train contains its predictions, all continuous columns are discretized, because the Apriori algorithm cannot use continuous values and the explanations are more generalizable using ranges instead of specific values. Discretization requires a number of bins, which is specified by the user. A K-means [6] method is used to determine the bin ranges for each feature, using the cluster centroids to determine the range of each bin. To discretize the target, the predictions ŷ are used to determine the bins, which are then applied to y. Note that the goal of the methodology is to extract the patterns learned by the model, so the discretization must use the predictions as a reference for the target discretization. The result of the discretization process is the discretized dataset D̂_train.
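A sketch of this discretization with scikit-learn's KBinsDiscretizer and its k-means strategy follows; the arrays are synthetic placeholders.

```python
# K-means discretization sketch; the target bins are fitted on the model's
# predictions and then applied to the true target, as described above.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                      # placeholder features
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.1, size=100)  # placeholder predictions

feat_disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="kmeans")
X_binned = feat_disc.fit_transform(X)

tgt_disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="kmeans")
y_pred_binned = tgt_disc.fit_transform(y_pred.reshape(-1, 1))  # fit on predictions
y_true_binned = tgt_disc.transform(y_true.reshape(-1, 1))      # same bins for y
```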
3.2 Rules Mining
This step involves both the transactionalization and AR algorithm phases shown in Fig. 1. Once the dataset has been properly discretized, the next step is to apply the AR algorithm, which in this study is the well-known Apriori algorithm, to obtain the rules. Before that, however, the discretized dataset D̂_train needs to be transformed into a set of transactions. These transactions contain the items present in each instance, where an item refers to the bin containing the value of each feature and of the target. From the transactions, the Apriori algorithm obtains the rules that satisfy the minimum confidence and support thresholds. Typically, we are interested in rules with high confidence, which indicates the probability that the pattern represented by the rule has been learned by the model. The support represents the generality/specificity of the pattern: a low minimum support allows specific patterns to be represented, while a high support only admits general rules with a high frequency. The result of the Apriori algorithm is thus a set of rules R_train. As we are only interested in the patterns that have an impact on the target, we select only those rules from R_train that contain one of the target items/bins. Additionally, the Apriori algorithm may generate redundant rules; a rule is redundant when a subset of its antecedents obtains a greater confidence. Such rules do not provide useful information and are therefore removed, keeping the more general rules. Note that there are two targets in our context, the real target y and the predicted target ŷ; therefore, we are interested in two sets of rules, R_train^true and R_train^pred.
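A sketch of the transactionalization and mining step with mlxtend's Apriori implementation is given below; df_binned is a hypothetical discretized frame with a y_pred column, and the thresholds follow Sect. 4.3.

```python
# Rule-mining sketch with mlxtend's Apriori (df_binned is a tiny placeholder).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

df_binned = pd.DataFrame({"f1": [0, 1, 0, 1, 0], "y_pred": [0, 1, 0, 1, 1]})
transactions = [[f"{col}={val}" for col, val in row.items()]
                for _, row in df_binned.iterrows()]          # one item per bin

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

itemsets = apriori(onehot, min_support=0.01, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.75)
# keep only rules whose consequent contains a target bin
rules = rules[rules["consequents"].apply(
    lambda items: any(i.startswith("y_pred=") for i in items))]
```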
3.3 Calculate Metrics
To obtain meaningful insights from the analysis, it is necessary to calculate a set of metrics for each rule obtained from the training set D̂_train. Two types of metrics are typically used: AR metrics and performance metrics. The most commonly used AR metrics are confidence and support, while the performance metrics can include any metric useful for the analysis; in our case we have chosen the mean squared error (MSE), since the error grows sharply when there are large differences between the true values and the predictions. It is important to note that we calculate the AR metrics for both R_train^pred and R_train^true, as this helps to identify whether a learned pattern represents an actual pattern or not. These metrics are detailed in Sect. 4.2. Optionally, metrics can also be calculated for other datasets, such as the test dataset D_test. To calculate them, the entire process is repeated starting from the discretization step, but using the bins obtained from the training set D_train. Specifically, we are interested in calculating the performance metrics of the predicted rules R_test^pred, which can help to evaluate the generalization of the potential rules learned by the model.
3.4 Classify
Finally, to facilitate their comprehension, the rules are classified into four categories according to whether they represent a real pattern or not, and whether they were correctly learned (CL) by the model or not. The rules that represent a real pattern (RP) are identified based on the confidence difference between R_train^pred and R_train^true. A high difference between real and predicted rules may indicate that the pattern learned by the model (assuming high confidence) does not correspond to a real pattern (an Unreal Pattern, UP) that should have been learned if the predictions were closer to the actual values. On the contrary, a low difference may indicate that the model has learned a pattern that exists in reality. To separate rules with large and small differences, the z-score is used: those above 3 are considered unreal patterns and those below 3 real patterns. The rules that were incorrectly learned (IL) by the model are those with high errors calculated from y and ŷ for each rule in R_train^pred. Rules with high errors are considered poorly learned patterns, whereas those with low errors are considered correctly learned rules. Again, the z-score is used to separate high and low errors: rules above a z-score of 1 are considered poorly learned and those below are considered correctly learned.
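The four-way classification can be sketched as follows; the inputs are the per-rule confidences and errors computed in Sect. 3.3, and the thresholds follow the z-score cut-offs stated above.

```python
# Four-way rule classification sketch using the z-score thresholds above.
import numpy as np
from scipy.stats import zscore

def classify_rules(conf_pred, conf_true, errors):
    gap_z = zscore(np.abs(np.asarray(conf_pred) - np.asarray(conf_true)))
    err_z = zscore(np.asarray(errors))
    labels = []
    for g, e in zip(gap_z, err_z):
        pattern = "RP" if g < 3 else "UP"   # real vs. unreal pattern
        learned = "CL" if e < 1 else "IL"   # correctly vs. incorrectly learned
        labels.append(f"{pattern}/{learned}")
    return labels
```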
4 Results and Discussion
In this section, we present the results obtained for the studied dataset. Sect. 4.1 describes the experimental setting, Sect. 4.2 presents the metrics used for the evaluation and analysis, and Sect. 4.3 reports the results obtained after applying our methodology.
4.1 Experimental Setting
To carry out our experiment, we selected a well-known dataset to test our methodology and analyzed two different scenarios. In particular, the dataset used in our experiment is the California Housing dataset [11]. It contains information from the 1990 California census and includes eight real-valued features, such as median income (mi), longitude (l), and total rooms (tr). The target is to predict the mean house value (mhv) using these features. The dataset also includes categorical data, which was removed in order to use only real-valued features. To test the proposed methodology, we considered two different scenarios: underfitting and overfitting the model on the data. To underfit, we selected a fully connected model with only 1 neuron and 1 hidden layer, trained for 5 epochs, which is not enough to fit the data. For overfitting, we selected a model with 512 neurons in 10 hidden layers, trained for 100 epochs. We identify the patterns obtained after applying our methodology and evaluate the strengths and weaknesses of the model.
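A minimal sketch of the two scenarios, assuming a Keras implementation and the scikit-learn copy of the dataset; only the layer counts, widths, and epoch numbers come from the text, everything else (optimizer, activation, data split) is an assumption.

```python
# Sketch of the underfitting and overfitting configurations (assumes TensorFlow/Keras).
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from tensorflow import keras

X, y = fetch_california_housing(return_X_y=True)   # 8 real-valued features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def build(width: int, depth: int) -> keras.Model:
    layers = [keras.layers.Input(shape=(8,))]
    layers += [keras.layers.Dense(width, activation="relu") for _ in range(depth)]
    layers += [keras.layers.Dense(1)]
    model = keras.Sequential(layers)
    model.compile(optimizer="adam", loss="mse")
    return model

underfit = build(width=1, depth=1)     # 1 neuron, 1 hidden layer
overfit = build(width=512, depth=10)   # 512 neurons, 10 hidden layers
underfit.fit(X_tr, y_tr, epochs=5, verbose=0)
overfit.fit(X_tr, y_tr, epochs=100, verbose=0)
```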
4.2 Metrics
In this section, we describe the main metrics used in our methodology to evaluate both the ARs obtained and the performance of the model.

The support of a rule $(A \Rightarrow C)$, where $A$ and $C$ denote the antecedents and consequents respectively, is the percentage of instances in the dataset that satisfy both the antecedent and consequent conditions. Here, $frq(A, C)$ represents the number of instances that satisfy both the antecedent and consequent conditions, while $N$ represents the total number of instances in the dataset. Support values range from 0 to 1.

$$\mathrm{Support}(A \Rightarrow C) = \frac{frq(A, C)}{N} \qquad (1)$$

The confidence is the probability that instances containing $A$ also contain $C$; it also ranges from 0 to 1.
$$\mathrm{Confidence}(A \Rightarrow C) = \frac{frq(A, C)}{frq(A)} \qquad (2)$$
The MSE is a commonly used metric in regression problems; it measures the mean squared difference between the predicted and true values.
4.3 Results
The main results are presented in tables, where the antecedents and consequents follow the same structure: $F[lower, upper]$, where $F$ is the feature abbreviation and $lower$ and $upper$ are the minimum and maximum bin values, respectively. For the hyperparameters of the methodology, we set the number of bins for each feature to five, the minimum support to 1%, and the minimum confidence to 75%.
Fig. 2. Histogram of the target variable (blue) in Dtrain and ranges covered by the rules obtained by the model in the overfitting (red) and underfitting (green) scenarios.
Overfitting Scenario. Table 1 displays the top five rules obtained by the Apriori algorithm, sorted by prediction confidence. It can be observed that the pattern learned by the model with the highest confidence suggests that when longitude (l) falls between −118.954 and −118.147 and the median income (mi) ranges between 4.749 and 5.798, the mean house value is between 3.27e5 and 3.84e5. This indicates that houses in that area of the dataset are relatively expensive, since this range is above the mean value of 2.07e5.

Table 1. Top five association rules discovered by the Apriori algorithm sorted by confidence in the overfitting scenario. Note that S_pred, S_true, C_pred and C_true denote the support (S) and confidence (C) for the predicted (pred) and true (true) values of the target. In addition, MSE_train and MSE_test denote the error on the training and test dataset, respectively. Finally, the type column corresponds to the assigned rule class.

| Antecedents | Consequent | S_pred | S_true | C_pred | C_true | MSE_train | MSE_test | Type |
|---|---|---|---|---|---|---|---|---|
| l ∈ [−118.95, −118.15] ∧ mi ∈ [4.8, 5.8] | ⇒ [3.27e5, 3.84e5] | 0.02 | 0 | 0.85 | 0.16 | 1.06e5 | 1.03e5 | IP CL |
| l ∈ [−118.95, −118.15] ∧ mi ∈ [5.8, 7.2] | ⇒ [3.84e5, 4.55e5] | 0.01 | 0 | 0.82 | 0.16 | 9.18e4 | 9.84e4 | IP CL |
| l ∈ [−118.954, −118.147] ∧ tr ∈ [1406, 2396] ∧ mi ∈ [3.8, 4.8] | ⇒ [2.78e5, 3.27e5] | 0.01 | 0 | 0.80 | 0.11 | 9.49e4 | 8.47e4 | IP CL |
| l ∈ [−117.55, −116.60] ∧ mi ∈ [4.749, 5.798] | ⇒ [3.84e5, 4.55e5] | 0.01 | 0 | 0.78 | 0.04 | 1.96e5 | 2.00e5 | IP IL |
| l ∈ [−118.95, −118.15] ∧ mi ∈ [3.8, 4.8] | ⇒ [2.78e5, 3.27e5] | 0.03 | 0 | 0.76 | 0.13 | 9.52e4 | 9.01e4 | IP CL |
When analyzing the antecedents of the top five rules in Table 1, it becomes apparent that only three features were used: longitude (l), median income (mi), and total rooms (tr). This suggests that these features provide more information to the model compared to others, as the patterns obtained using other features lacked sufficient confidence. Furthermore, the consequents of the rules only show patterns for house values above the mean (2.07e5), as shown in Fig. 2, which implies that the model could not learn patterns for cheaper houses with sufficient confidence.
Looking at the AR metrics, we can see that the support of the rules using the predictions is higher than when using the true target values. The confidence of the predictions ranges from 0.76 to 0.85, while the true confidence obtained from the target values ranges from 0.04 to 0.16. This suggests that the rules obtained may not represent real patterns, which could be present if the model performed better. In terms of error, the model had an average error of 1.02e5 in both the training and test sets. The model does not seem to have any generalization problems for these specific patterns, even though it was configured to overfit the data, except for the third rule. Additionally, the error of the patterns is around the average, except for the fourth rule, which represents a badly learned pattern. In general, the methodology found four correctly learned patterns (CL) and one rule with an incorrectly learned pattern (IL) whose error was above the mean.

Underfitting Scenario. Table 2 presents the top five rules obtained using our methodology, sorted by confidence. The first pattern indicates that if the housing median age falls within the range of 17 to 21 years and the median income is between 0.5 and 2.1, then the median house value falls between 1.92e3 and 1.24e5, which is a price range below the average.

Table 2. Top five association rules found by the Apriori algorithm sorted by confidence in the underfitting scenario. Note that S_pred, S_true, C_pred and C_true denote the support (S) and confidence (C) for the predicted (pred) and true (true) values of the target. In addition, MSE_train and MSE_test denote the error on the training and test dataset, respectively. Finally, the type column corresponds to the assigned rule class.

| Antecedents | Consequent | S_pred | S_true | C_pred | C_true | MSE_train | MSE_test | Type |
|---|---|---|---|---|---|---|---|---|
| hma ∈ [17, 21] ∧ mi ∈ [0.5, 2.1] | ⇒ [1.92e3, 1.24e5] | 0.02 | 0.01 | 0.99 | 0.73 | 4.71e4 | 3.93e5 | WFR CL |
| hma ∈ [21, 27] ∧ mi ∈ [0.5, 2.1] | ⇒ [1.92e3, 1.24e5] | 0.01 | 0.01 | 0.97 | 0.76 | 4.89e4 | 5.28e4 | WFR CL |
| hma ∈ [48, 52] ∧ mi ∈ [3.8, 4.8] | ⇒ [2.45e5, 2.89e5] | 0.01 | 0 | 0.95 | 0.17 | 9.27e4 | 7.96e4 | BFR IL |
| hma ∈ [27, 32] ∧ mi ∈ [3.0, 3.8] | ⇒ [1.66e5, 2.05e5] | 0.03 | 0.01 | 0.92 | 0.22 | 6.41e4 | 6.60e4 | BFR CL |
| hma ∈ [27, 32] ∧ mi ∈ [2.1, 3.0] | ⇒ [1.24e5, 1.66e5] | 0.03 | 0.01 | 0.92 | 0.21 | 5.43e4 | 5.07e4 | BFR CL |
As in the previous section, the relevant features for this model are housing median age (hma) and median income (mi), as shown in the antecedents. However, in contrast to the overfitted model, the consequents mostly cover values below the mean (2.07e5), as shown in Fig. 2. Analyzing the AR metrics, we observe that the support and confidence of the predictions are mostly greater than the real ones. However, in the first two rules the difference is not considerably large compared to the other rules, and we consider them well-formed rules.
In terms of error metrics, the error is considerably better than for the overfitted model, with an average of 6.01e4 in train and 5.98e4 in test. Additionally, generalization does not seem to be a problem for these patterns, as the test error is similar to or lower than the training error. In summary, the methodology obtained two patterns that represent real patterns (WFR) with remarkable performance (CL), one rule that does not represent a real pattern (BFR) and has poor performance (IL), and two rules that do not represent a real pattern (BFR) but have good performance (CL).
5 Conclusions
In this work, we have developed a novel model-agnostic explainability methodology and applied it to a deep learning model. The method internally uses the well-known Apriori algorithm to obtain the patterns learned by the model in a simple format, which are then analyzed to draw conclusions about the learned behavior of the model. The results provide a taxonomy of rules that can be classified based on the association and error metrics obtained.

In the future, several issues need to be addressed. First, the method does not consider the full coverage of the dataset by the rules. Second, the discretization process is critical, and it could be improved by using a metaheuristic. Third, rules with significant overlap should be removed. Finally, the method should be studied with a large attribute space to evaluate its scalability.

Acknowledgements. The authors would like to thank the Spanish Ministry of Science and Innovation for the support under the projects PID2020-117954RB and TED2021-131311B, and the European Regional Development Fund and Junta de Andalucía for the projects PY20-00870, PYC20 RE 078 USE and UPO-138516.
References
1. Afchar, D., Guigue, V., Hennequin, R.: Towards rigorous interpretations: a formalisation of feature attribution. In: Proceedings of the 38th International Conference on Machine Learning, vol. 139, pp. 76–86. PMLR (2021)
2. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207–216, May 1993
3. Albini, E., Long, J., Dervovic, D., Magazzeni, D.: Counterfactual Shapley additive explanations. In: 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2022, pp. 1054–1070 (2022)
4. Barbiero, P., Ciravegna, G., Giannini, F., Liò, P., Gori, M., Melacci, S.: Entropy-based logic explanations of neural networks. arXiv (2021)
5. De Bernardi, G., Narteni, S., Cambiaso, E., Mongelli, M.: Rule-based out-of-distribution detection (2023)
6. Dash, R., Paramguru, R., Dash, R.: Comparative analysis of supervised and unsupervised discretization techniques. Int. J. Adv. Sci. Technol. 2, 29–37 (2011)
7. Ferreira, L., Guimarães, F., Pedrosa-Silva, R.: Applying genetic programming to improve interpretability in machine learning models. In: Proceedings of the Congress on Evolutionary Computation, pp. 1–8 (2020)
8. Fumagalli, F., Muschalik, M., Kolpaczki, P., Hüllermeier, E., Hammer, B.: SHAP-IQ: unified approximation of any-order Shapley interactions (2023)
9. Gallardo-Gómez, J.A., Divina, F., Troncoso, A., Martínez-Álvarez, F.: Explainable artificial intelligence for the electric vehicle load demand forecasting problem. In: Proceedings of the 17th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2022), pp. 413–422 (2023)
10. Hou, Y., Zheng, L., Gould, S.: Learning to structure an image with few colors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10116–10125 (2020)
11. Kelley Pace, R., Barry, R.: Sparse spatial autoregressions. Stat. Probab. Lett. 33(3), 291–297 (1997)
12. Kohlbrenner, M., Bauer, A., Nakajima, S., Binder, A., Samek, W., Lapuschkin, S.: Towards best practice in explaining neural network decisions with LRP. In: Proceedings of the International Joint Conference on Neural Networks, pp. 1–7 (2020)
13. Lal, G.R., Chen, X., Mithal, V.: TE2Rules: extracting rule lists from tree ensembles (2022)
14. Lundberg, S.M., Lee, S.: A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 2017, pp. 4768–4777. Curran Associates Inc. (2017)
15. Martín, D., Martínez-Ballesteros, M., García-Gil, D., Alcalá-Fdez, J., Herrera, F., Riquelme-Santos, J.C.: MRQAR: a generic MapReduce framework to discover quantitative association rules in big data problems. Knowl.-Based Syst. 153, 176–192 (2018)
16. Martínez-Ballesteros, M., Troncoso, A., Martínez-Álvarez, F., Riquelme, J.C.: Improving a multi-objective evolutionary algorithm to discover quantitative association rules. Knowl. Inf. Syst. 49, 481–509 (2016)
17. Tjoa, E., Guan, C.: Quantifying explainability of saliency methods in deep neural networks with a synthetic dataset. IEEE Trans. Artif. Intell., 1–15 (2022)
18. Troncoso-García, A.R., Martínez-Ballesteros, M., Martínez-Álvarez, F., Troncoso, A.: Explainable machine learning for sleep apnea prediction. Procedia Comput. Sci. 207, 2930–2939 (2022)
19. Troncoso-García, A.R., Martínez-Ballesteros, M., Martínez-Álvarez, F., Troncoso, A.: A new approach based on association rules to add explainability to time series forecasting models. Inf. Fusion 94, 169–180 (2023)
Special Session 5: Machine Learning and Computer Vision in Industry 4.0
A Deep Learning Ensemble for Ultrasonic Weld Quality Control

Ramón Moreno1(B), José María Sanjuán1, Miguel Del Río Cristóbal1, Revanth Shankar Muthuselvam1, and Ting Wang2

1 Department of Advanced Manufacturing 4.0, Antolin, Burgos, Spain
[email protected]
2 College of Electrical Engineering and Control Science, Nanjing Tech University and Southeast University, Nanjing, China
Abstract. This paper introduces a computer vision engine for weld quality control in door panel manufacturing. The main algorithm carries out three tasks: object detection, image segmentation and image classification. These tasks are implemented by an ensemble of deep learning models, where each model fulfils one task. The overall performance of the computer vision system reaches an accuracy higher than 0.99, with a computing time well suited to the industrial process.
1 Introduction
Computer vision is one of the most important tools for product quality control in industrial manufacturing processes. It provides adaptability to many use cases, high detection accuracy, and the digitalization of expert knowledge. It is well known that the disruptive technology underneath is the Convolutional Neural Network (CNN) [1] and its main branches: Generative Adversarial Networks (GANs) [2], Deep Reinforcement Learning (DRL) [3], etc. Deep learning models achieve outstanding performance in applications such as object detection [4,5], semantic segmentation [6–8] and classification [9,10].

This work is grounded in an automotive industrial process for door panel manufacturing [11]. This process is highly automated: the machine comprises four stations with several robots, each equipped with an ultrasonic tool to perform the welding. A door has close to 100 welds, and their quality is supervised by validation experts. Although the defect rate is very low, the digitalization of this quality control allows the manufacturing process to move towards fully automated manufacturing through quality prediction and real-time process optimisation.

This paper presents a deep learning ensemble to accomplish weld quality control in door panel manufacturing. The remainder of the paper is outlined as follows: Sect. 2 describes the technologies on which this work is based (the manufacturing process and the computational technologies), Sect. 3 presents the ensemble deep learning approach, Sect. 4 shows the experimental results and metrics, and finally Sect. 5 wraps up the paper with the conclusions.
2 Technology
There are two main technologies involved in this work: on the one hand the manufacturing process, and on the other hand the computational technologies.
2.1 Manufacturing Process
Door manufacturing is composed of two main steps: injection and assembly.

The injection process builds the structure of the panel. This is accomplished in fully automated injection molding machines. In short, these machines inject blended plastic into molds, where the molds have been designed to build specific parts. To do this properly, the machines work with clamping forces ranging from 280 kN to 55,000 kN. Several technologies are used, such as hydraulic, hybrid and electric, or horizontal and vertical injection moulding machines. Moreover, these machines can be customized into high-precision systems with very short cycle times.

The assembly process is in charge of the union of several parts with the panel. This union can be accomplished in several ways depending on the ancillary part and the joining technology; in this work the union is carried out by an ultrasonic welding process. Figure 1 illustrates a door panel with the welds. The picture shows the back side of the door, where all the components are already assembled and the union is carried out by ultrasonic welding.
Fig. 1. Example of a welded door: on the top left (a) an example door. On the top right (b) the welds highlighted by different colors. On the bottom, four crops of a panel.
Most of the joints are welded properly; however, in some cases the weld can present defects. Broadly, the quality of the welds can be grouped
into four main classes: well-fused welds (OK), badly fused welds (BW), lack of fusion (LoF) and overlapped welds (OL). Figure 2 shows examples of these four classes.
Fig. 2. Weld categories. From left to right, four examples per class are illustrated: OK (a), OL (b), BW (c), LoF (d).
The objective of this work is to detect all the welds in a panel and classify them into the aforementioned classes.
2.2 Deep Learning Models
This work explores three computer vision topics: object detection, semantic segmentation and image classification, using three deep learning models.

First, object detection has long been a research topic in the scientific community; nonetheless, it was with the deep learning outbreak that this task reached high accuracy. Object detection focuses on finding the regions of an image that contain the objects of interest. Among the several methods in the state of the art [5], perhaps the most disruptive one is the well-known YOLO model [4]. This model has very interesting advantages for industrial applications: it can work with multiple classes, it has very low inference time, and it can be trained easily given enough samples.

Second, image segmentation [6] focuses on identifying the connected pixels of an image which belong to an object. Notice that, in contrast to object detection, the accuracy of a segmentation model is measured in terms of pixels, whereas an object detection model is measured in terms of detected regions. There are many deep learning methods for image segmentation in the state of the art; the underlying idea, however, is inspired by the U-Net network [8]. This network is underpinned by two complementary parts, a compressing (encoding) set of layers and a decompressing (decoding) set of layers, interconnected with the layers at the same level. The U-Net family has many implementations; for this work we opted for the MobileNet implementation for two main reasons: on the one hand, it is very fast and can process an image in a few milliseconds; on the other hand, it provides enough accuracy for weld segmentation.

Eventually, image classification is also a well-known topic in computer vision. Here again, deep learning has been the cutting-edge technology for the last ten years, and this is perhaps the most active research topic in the computer vision community.
Algorithm 1. Computer vision approach
I ← image
crops ← Yolo(I)
for crop in crops do
    weld ← Mobilenet(crop)
    quality ← GoogleNet(weld)
end for
There are many methods and models in the state of the art [9]. Nonetheless, for the objective of this work we have used the GoogleNet network [10] for the same reasons as the aforementioned method: fast computing speed and good accuracy.
3 Deep Learning Ensemble
In order to fulfill the objective of this project, it is necessary to implement three steps: first, find out where the welds are; second, crop the smallest window which contains each weld; finally, classify the weld into the quality classes. Each of the aforementioned deep learning models fits perfectly into one of these steps: YOLO for weld detection, MobileNet for weld segmentation and GoogleNet for weld classification. The three models are assembled to implement a computer vision engine that, given an input image with welds, can analyze them and output the weld quality control. Figure 3 illustrates the idea.
Fig. 3. Ensemble model. The sketch illustrates the overall idea: the image and the data flow from the input to the output. In the middle, the different deep learning models that compose the ensemble are represented.
From another point of view, Algorithm 1 describes the procedure in which the models are assembled. Given an image with welds, the YOLO model detects all the welds in the image, producing crops of a fixed size. After that, each crop is segmented with the U-Net version of MobileNet in order to find the tightest region which contains the weld. Then, the weld is classified by GoogleNet, returning the quality of the weld.
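A minimal sketch of this inference loop is given below; the three model objects and their call signatures are placeholders for the trained networks, and only the data flow follows Algorithm 1.

```python
# Sketch of the ensemble inference loop: detect -> segment -> classify.
import numpy as np

def bounding_box(mask: np.ndarray):
    """Tightest window around a non-empty binary mask."""
    ys, xs = np.where(mask)
    return slice(ys.min(), ys.max() + 1), slice(xs.min(), xs.max() + 1)

def inspect_panel(image, yolo, unet, googlenet):
    results = {}
    for crop in yolo.detect(image):             # fixed-size crops, one per weld
        mask = unet.segment(crop.pixels)        # binary mask of the weld region
        weld = crop.pixels[bounding_box(mask)]  # tightest region containing the weld
        results[crop.weld_id] = googlenet.classify(weld)  # OK / OL / BW / LoF
    return results
```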
4 Experimental Results
This section explains the developments and the industrial validation.
Development. The welding machine processes a part per minute, and a camera captures a picture that is then stored in a database. From a source of several thousand pictures, we randomly selected 600 to create a main dataset for all the developments. Figure 1(a) shows an example capture.

The first step is to train a YOLO model for weld detection and identification. To do this, we labeled a set of images where every weld has a name. Afterwards, an oversampling process was carried out in order to obtain a higher number of images with labeled welds. Oversampling consists of creating random crops from the original image (as illustrated in Fig. 1(b)) and transforming them by rotation. Notice that a weld by itself cannot be distinguished from the others except by its location in the door; hence, in order to label the welds, a window of twice the weld size has been used, as illustrated in Fig. 4(a). After oversampling, a dataset of 60,000 images was created, with which a YOLO model was trained. An example result of the YOLO model prediction is shown in Fig. 4(b). The detection accuracy is 100%.
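A minimal sketch of the oversampling step follows; the crop size and the restriction to right-angle rotations are assumptions, since the text only states that random crops are created and rotated (labels would have to be transformed accordingly).

```python
# Sketch of the oversampling: random crops from the panel image plus rotations.
import random
import numpy as np

def oversample(image: np.ndarray, n_crops: int = 100, size: int = 640):
    h, w = image.shape[:2]
    samples = []
    for _ in range(n_crops):
        y = random.randint(0, h - size)
        x = random.randint(0, w - size)
        crop = image[y:y + size, x:x + size]
        k = random.choice([0, 1, 2, 3])         # rotate by 0/90/180/270 degrees
        samples.append(np.rot90(crop, k))
    return samples
```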
Fig. 4. Crops for YOLO training. On the left (a), crop samples. On the right (b), an example of YOLO detection.
Once the weld has been detected and identified, the image segmentation process takes place. To train a segmentation model, a dataset with 6,000 samples was created. These images were labeled using the YOLO outputs, drawing a circle as the closest approximation of the weld. This task was
accomplished with the labelme software [12]. After an oversampling process, we created a dataset with 60,000 samples. Figure 5 illustrates the overall procedure. The validation accuracy of the model is 0.99 according to the Jaccard index metric [13].
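For reference, the Jaccard index (intersection over union) of a predicted and an annotated binary mask can be computed as follows.

```python
# Jaccard index for two binary segmentation masks.
import numpy as np

def jaccard(pred: np.ndarray, target: np.ndarray) -> float:
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return float(np.logical_and(pred, target).sum() / union)
```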
Fig. 5. Semantic segmentation. On the top left (a), four samples output by YOLO, ready to be labeled. On the top right (b), the same samples labeled with the region containing the weld. On the bottom (c, d), two examples of segmentation: on the left, the input image with the segmentation; beside it, the cropped region.
Once the weld images are properly detected, the classification model can be trained. To train the GoogleNet model, we created a random dataset with 50,000 samples, 12,500 per class (BW, LoF, OK, OL). The model was trained with an initial learning rate of 0.1 and a weight decay of 0.001. After 70 training epochs, the model reached the following metrics: accuracy 0.9900, weighted F1 score 0.9900 and precision 0.9901. Figure 6 shows the confusion matrix.

Industrial Validation. The ensemble of models has been deployed in a virtual machine in the plant, integrating IT-OT networks. For every part produced on the machine, a picture is captured and processed in order to predict the quality of the welds. The predictions are shown to the operator through an HMI, where the quality labels of the welds can be corrected. The expert annotations are stored in a database for further model re-training. Figure 7 shows the operator interface. After a period with the application running in the plant, we validated the model with the following accuracy per class: OK: 0.9781, OL: 0.9998, DP: 0.98552, LoF: 0.9956.

The prediction engine has been deployed in a Windows virtual machine with an Intel(R) Xeon(R) Gold 6134 CPU @ 3.20 GHz, 2 cores, and 8 GB of memory. Processing an image takes 10 s on average, whereas the machine cycle takes 50 s.
Fig. 6. GoogleNet confusion matrix
Fig. 7. On-premise validation interface
5 Conclusion
In this work, we have created a computer vision engine for weld quality control grounded on well-known deep learning models, which are assembled into a single algorithm. The full application has been deployed in an industrial environment with several components: IT-OT networks, a virtual machine for inference, a panel PC for operator validation, and the welding machine. The accuracy of the system is already high, although it still needs a couple of update cycles with expert validation. Nonetheless, the first release provides very good metrics, and the performance of the full application suits the industrial process well.
As further steps, we are implementing an active learning approach by collecting feedback from the operator and re-training the classification model until the predictions match the process expert's conclusions.

Acknowledgements. This work has been funded by grant 10/18/BU/0030. Línea: planes estratégicos I+D. Título del proyecto: Despliegue de estrategia digital e industria 4.0 en grupo Antolin. Título del plan estratégico: Plan estratégico de innovación de Grupo Antolin 2022-24, expedientes contenidos en el plan estratégico.
References
1. Gu, J., et al.: Recent advances in convolutional neural networks. Pattern Recogn. 77, 354–377 (2018)
2. Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., Bharath, A.A.: Generative adversarial networks: an overview. IEEE Sig. Proc. Mag. 35, 53–65 (2018)
3. Li, Y.: Deep reinforcement learning: an overview (2018). arXiv:1701.07274 [cs]
4. Jiang, P., Ergu, D., Liu, F., Cai, Y., Ma, B.: A review of YOLO algorithm developments. Procedia Comput. Sci. 199, 1066–1073 (2022)
5. Zou, Z., Chen, K., Shi, Z., Guo, Y., Ye, J.: Object detection in 20 years: a survey. Proc. IEEE 111, 257–276 (2023)
6. Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., Terzopoulos, D.: Image segmentation using deep learning: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44, 3523–3542 (2022)
7. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications (2017). arXiv:1704.04861 [cs]
8. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
9. Lu, D., Weng, Q.: A survey of image classification methods and techniques for improving classification performance. Int. J. Remote Sens. 28(5), 823–870 (2007)
10. Szegedy, C., et al.: Going deeper with convolutions (2014). arXiv:1409.4842 [cs]
11. Prasad, S.V.N.B., Akhil Kumar, G., Yaswanth Sai, P., Basha, S.V.: Design and fabrication of car door panel using natural fiber-reinforced polymer composites. In: Vijayan, S., Subramanian, N., Sankaranarayanasamy, K. (eds.) Trends in Manufacturing and Engineering Management. LNME, pp. 331–343. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-4745-4_30
12. Wada, K.: LabelMe: image polygonal annotation with Python (2023)
13. Costa, L.D.F.: Further generalizations of the Jaccard index (2021). arXiv:2110.09619 [cs]
Indoor Scenes Video Captioning

Javier Rodríguez-Juan1, David Ortiz-Perez1, Jose Garcia-Rodriguez1(B), David Tomás1, and Grzegorz J. Nalepa2

1 Universidad de Alicante, Alicante, Spain
{jrodriguez,dortiz,jgarcia}@dtic.ua.es, [email protected]
2 Jagiellonian University, Kraków, Poland
[email protected]
Abstract. The progress of automatic scene analysis techniques for homes and the development of assisted living systems is vital to help different kinds of people, such as the elderly or visually impaired individuals, who require special care in their daily lives. In this work, we introduce the use of SwinBERT, a powerful and recent video captioning model, for home scenes, applying various lexical transformations to maximize the semantic similarity of captions and annotations and thereby obtain more accurate matches from a human perspective. In the experiments we utilize the large-scale Charades dataset, which was created with the goal of producing a vast dataset while preserving the naturalness and spontaneity of household and daily activities.
Keywords: video captioning · lexical transformations · indoor environment

1 Introduction
Nowadays, we are witnessing the development of automatic systems and Deep Learning (DL) techniques that are making breathtaking progress in a wide range of disciplines. One technology that stands out is the Transformer, a neural network architecture that has significantly advanced almost every field within the DL domain. Video captioning is one of the DL tasks where Transformers represented an important leap in development. Video captioning [2,19] is a task where, given an input video, the system generates a natural language description of the contents appearing in the input. As stated in [9], indoor video captioning still represents a significant challenge due to the lack of detailed descriptions that capture all the semantics and objects appearing in the scenes. Video captioning models usually consist of an encoder-decoder architecture, where the encoder is typically a Convolutional Neural Network (CNN) that visually analyzes the content. This information then serves as input for a Recurrent Neural Network (RNN) decoder that generates brief video descriptions. To maximize the semantics captured in the descriptions, Krishna et al. [8] introduced Dense Video Captioning (DVC), a
variation of classical video captioning focused on generating longer descriptions by introducing a new captioning module that uses contextual information from past and future events to jointly describe all events. This variant was further improved in works such as [3], where the usual "detect-then-describe" scheme of DVC was reversed, proposing a top-down approach that first generates paragraphs from a global view and then grounds each event description to a video segment for detailed refinement. We will also demonstrate how a pure video captioning model can be modified to output longer descriptions using training data with denser captions.

Video captioning frameworks have the potential to be implemented in various fields, including industry, robotics, and ambient assisted living. In the context of industry, researchers such as Zamora-Hernández et al. [16] have explored methods to improve product quality by analyzing product assembly through object detection and action recognition models. Moreover, the field of robotics is often seen in conjunction with the ambient assisted living field, as observed in Gomez-Donoso et al. [6]. In that work, a robotic platform is proposed to assist dependent individuals, such as those suffering from Alzheimer's disease [4]. Video captioning frameworks can be applied to the previously described fields to enhance the comprehension of details.

In this work, we propose a home video captioning framework built on one of the most recent and promising video captioning models: SwinBERT. The descriptions generated by this model are enhanced using textual postprocessing techniques to maximize the semantic information present in the outputs. In summary, our main contributions are:

1. To introduce a video captioning framework for home scenarios.
2. To demonstrate how lexical postprocessing of descriptions helps in semantic understanding.
3. To provide an intuition of the performance of one of the most powerful video captioning models currently available in indoor environments.

The rest of this article comprises four additional sections. In Sect. 2, we present works related to the main topics covered in our contribution. In Sect. 3, we introduce the method used to obtain the evaluation data for this study. Then, in Sect. 4, we describe the different experiments that were conducted. Finally, in Sect. 5, we summarize all the information and draw conclusions based on the study.
2 Related Works
In this section, we review the state-of-the-art frameworks and methodologies used in the fields of video captioning and indoor scene captioning, which served as the basis for our experimentation.
2.1 Video Captioning
As previously stated, video captioning is the task of generating a natural language description that expresses what is happening in a given input video. This
task is characterized by a two-stage approach. Early works were based on the use of Subject-Verb-Object (SVO) techniques to visually detect contents as a first stage, followed by a template-based model for text generation [2]. With the advent of neural networks, this task has evolved to incorporate a CNN-RNN approach, with many different variations and architectures. Some notable examples include the use of bounding box detection [18] to recognize items in the CNN stage, or the addition of an RNN inside the CNN stage to improve it by incorporating temporal context features [7]. The most commonly used CNN architectures for this task are ResNet, Inception Net, Xception Net, and R-CNN. For the decoding RNN stage, the most frequently used structures are LSTM or GRU. Nowadays, it is common to see attention models used in both stages of these architectures [13], with the aim of highlighting the most relevant parts of the input. It is worth emphasizing that attention models were a fundamental pillar in the creation of Transformers.

Transformers have become the most trending neural network structure due to their impressive performance in many fields. For this reason, researchers have begun to introduce these structures into the video captioning task. This has led to popular video captioning works, such as the one by Luowei Zhou et al. [19], which focuses on obtaining denser event descriptions by proposing a single encoder-decoder model capable of performing this task more accurately than other proposals requiring two separately trained models.
2.2 Indoor Scene Captioning
Indoor scene captioning is a task in which an input image or video showing an indoor environment is provided, and a natural language description is obtained as output. This is a highly studied field within the scope of Machine Learning (ML), especially in the context of robotics and of works aimed at creating helpful systems for visually impaired people. Classical scene classification was based on hand-crafted features and mainly focused on object detection to determine the type of scene. With the use of neural networks, this task has evolved to incorporate these architectures. This evolution can be observed in recent works [1], in which the authors rely on model scaling to improve the performance and capacity of CNNs. As the need for extracting more useful information about scenes arose, authors began to focus not only on classifying the scenes but also on obtaining a more detailed description that represents in natural language what can be seen in the scene. Some authors, such as Fudholi et al. [5], base their approach on customizing known datasets like MSCOCO by introducing new ground-truth features for the provided captions. Others, like Lin et al. [10], focus their efforts on creating richer indoor multi-sentence captions by using a 3D visual parsing system and a text generation algorithm that takes into account coherence between sentences.
3 Methodology
In this section, we detail the video captioning model proposed to obtain natural language descriptions from input videos. We used the SwinBERT captioning model [11] to compose this module. This model was developed in 2022 by Microsoft with the aim of creating an end-to-end solution for the video captioning task. SwinBERT is based on Transformers, and its main contributions are an adaptable spatial-temporal feature encoding for variable-length frame sequences and the introduction of a trainable sparse attention mask that allows the model to focus on frames with more spatial-temporal movement. This model is currently one of the most outstanding video captioning models, as it achieves significant performance improvements in comparison to previous methods on the most used video captioning datasets. As we will see in the experimentation section, we use this model with different pre-trained checkpoints provided in its official publication. Figure 1 shows the difference between the outputs of SwinBERT trained on various datasets.

The SwinBERT model follows the video captioning trend of using an encoder-decoder, two-stage approach, with the Video Swin Transformer [12] as the encoder and a multimodal transformer as the decoder. We can also highlight the use of a sparse attention mask that acts as a regularizer for the decoder and reduces redundancy among the different video tokens.
GT: A person wrapped in a blanket is sitting at a table and eating something from a bowl. Lastly the person takes a drink from a cup.
MSVD: a man is drinking a drink from a cup
MSR-VTT: a man is eating something in his kitchen
VATEX: a young man is sitting at a table and drinking from a cup.
Fig. 1. Charades’ frame example with its ground-truth (GT) annotation and three different captions obtained from SwinBERT (MSVD, MSR-VTT and VATEX).
4 Experiments
The experimentation focused on the Charades dataset [15]. This dataset was selected due to the limited number of correctly annotated indoor video datasets, and because it is challenging for current computer vision models. It therefore provides an interesting opportunity to investigate the limitations of the SwinBERT model, which has demonstrated high performance in general contexts.
4.1 Charades Dataset
Charades is a large-scale dataset comprising 9,848 videos of daily home activities, each with an average duration of 30 s. The dataset was created using crowdsourcing, with the entire process of script writing, video recording, and annotation carried out in a distributed manner by 267 people from three different countries through the Amazon Mechanical Turk service. The authors of Charades aimed to create a dataset of videos that are as casual and realistic as possible. For this reason, they used crowdsourcing for the entire construction process, preserving the bias of each person involved towards the activities, and thus the essence and nature of each person's daily tasks. This process resulted in a dataset comprising 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes, 41,104 labels for 46 object classes, and 15 types of indoor scenes.
4.2 Postprocessing
In captioning tasks, we are interested not only in comparing individual words, n-grams, or subsequences of tokens, but also in capturing the underlying semantics of the compared sentences. Therefore, we conducted experiments applying various postprocessing techniques to the sentences before comparison, in order to maximize the semantic information that can be extracted. Different combinations of postprocessing approaches were tested to compare their lexical results. The individual operations applied are summarized in the following list:

– Lemmatization: it consists of obtaining the root lemma of each token. This mainly helps to abstract away rich lexical diversity, such as verb tenses.
– POS filtering: part-of-speech (POS) tagging consists of assigning to each token a tag that represents its type of word. Once the tokens are tagged, we filter them to keep only verbs, nouns, proper nouns, and adjectives, because these types of words constitute the semantic meaning of a sentence.
– Punctuation removal: as we detected that punctuation signs have a considerable impact on the results of some samples, we tested the application of a punctuation removal step.
To obtain all the lexical information about each token of the analysed sentences, the spaCy1 library was used.
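A minimal sketch of the three operations with spaCy is shown below; the model name en_core_web_sm is an assumption, as the text only states that spaCy was used.

```python
# Sketch of the textual postprocessing operations (assumes spaCy and an
# installed English pipeline such as en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
KEEP_POS = {"NOUN", "PROPN", "ADJ", "VERB"}

def postprocess(text: str, pos_filter=False, lemmatize=False, drop_punct=False) -> str:
    tokens = []
    for tok in nlp(text):
        if drop_punct and tok.is_punct:
            continue
        if pos_filter and tok.pos_ not in KEEP_POS:
            continue
        tokens.append(tok.lemma_ if lemmatize else tok.text)
    return " ".join(tokens)

# Example: the "Punct+Lemma" configuration from the experiments.
print(postprocess("A man is drinking a drink from a cup.",
                  lemmatize=True, drop_punct=True))
```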
Fig. 2. Structure of the pipeline used during the experimentation: SwinBERT captions and Charades annotations are postprocessed (part-of-speech filtering, lemmatization, punctuation removal), and the processed texts are compared using BERT Score (precision/recall) and SacreBLEU (BLEU-1).
4.3 Setup
The results are presented for three different pre-trained checkpoints of the SwinBERT model. These pre-training datasets were MSVD, MSR-VTT, and VATEX, which are commonly used sources of data in the video captioning domain. The VideoSwin model is initialized with Kinetics-600 pre-trained weights.
1 https://spacy.io/.
For the evaluation stage, we chose the SacreBLEU implementation from TorchMetrics2 to compute the BLEU-1 score. To compute the BERT Score [17] metric, we used the implementation available in its official GitHub repository,3 with the setting roberta-large.L17.noidf.version=0.3.12(hug.trans=4.28.0.dev0)-rescaled.fast-tokenizer. The complete pipeline used in this experimentation is shown in Fig. 2. The code developed in the present work is available at GitHub.4
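A minimal sketch of the metric computation, assuming the TorchMetrics and bert-score packages named above; the example sentences are illustrative only.

```python
# Sketch of the evaluation: BLEU-1 via SacreBLEU and BERT Score precision/recall.
from torchmetrics.text import SacreBLEUScore
from bert_score import score

candidates = ["a man is drinking a drink from a cup"]
references = [["A person sitting at a table takes a drink from a cup."]]

bleu1 = SacreBLEUScore(n_gram=1)
print("BLEU-1:", bleu1(candidates, references).item())

# rescale_with_baseline mirrors the "rescaled" setting reported above.
P, R, F1 = score(candidates, [r[0] for r in references], lang="en",
                 rescale_with_baseline=True)
print("BERTScore P/R:", P.item(), R.item())
```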
4.4 Results
We conducted tests on a randomly selected set of videos from the Charades dataset. This video sample was tested with the three different SwinBERT pre-trainings (MSVD, MSR-VTT, and VATEX), and for each of them, different experiments were performed by applying textual postprocessing to maximize semantic similarity. The individual textual postprocessings tested were: Raw (no transformation applied), POS (part-of-speech filtering, where only nouns, proper nouns, adjectives, and verbs were kept from the complete descriptions), Lemma (lemmatization of tokens), and Punct (punctuation removal). The results were evaluated using two metrics: BLEU-1 and BERTScore. BLEU-1 is based on matching n-grams between the description and the target annotation, while BERTScore obtains contextual embeddings of each token of the descriptions using a BERT model and compares these embeddings to obtain three values: precision, recall, and F1. Unlike BLEU, which is restricted to a strict lexical comparison, BERTScore is capable of comparing sentences at a semantic level, which means that two different words that are synonyms obtain a high similarity score.

In Table 1, we can see the results of the experiments carried out over a randomly selected sample of 3,000 videos from the entire dataset, in terms of the BLEU-1 metric. The highest score in each table is highlighted in bold. In this table, we can observe the differences obtained by using one pre-training or another, with the best results obtained using the VATEX checkpoint. VATEX reached higher BLEU-1 scores because this SwinBERT checkpoint is capable of producing more detailed and longer descriptions compared to MSVD and MSR-VTT. This contributes to more n-gram matches, as the descriptions have a length more similar to that of the annotated descriptions. Moreover, we observed that removing tokens from the descriptions decreased the scores, which can be explained by the nature of BLEU: with fewer n-grams to compare, and SwinBERT not being able to output the exact annotated words, scores decrease. We also observed the capability of SwinBERT to understand punctuation in sentences, as we get slightly better scores if we do not remove the punctuation signs. In conclusion, the best combination in terms of BLEU-1 is the one that maintains longer descriptions with a unified lexicon.
2 https://torchmetrics.readthedocs.io/en/latest/.
3 https://github.com/Tiiiger/bert_score.
4 https://github.com/javirodrigueez/swinbert-evaluation.
Table 1. BLEU-1 score results of testing a sample of 3,000 videos with different textual postprocessing methods.

| Post-processing | MSVD | MSR-VTT | VATEX |
|---|---|---|---|
| Raw | 0.0932 | 0.1207 | 0.2864 |
| POS | 0.0349 | 0.0339 | 0.1828 |
| Lemma | 0.1110 | 0.1268 | **0.3314** |
| Punct | 0.1146 | 0.1307 | 0.2921 |
| Punct+POS | 0.0418 | 0.0408 | 0.1635 |
| Punct+Lemma | 0.1211 | 0.1374 | 0.3070 |
| Lemma+POS | 0.0436 | 0.0423 | 0.2057 |
| Punct+Lemma+POS | 0.0525 | 0.0511 | 0.1892 |
It can be observed that the results for BLEU-1 are not as high as expected, given the performance of SwinBERT on other mainstream datasets. This can be explained by two factors: the goal of the Charades dataset, which is designed for visual analysis, and the nature of BLEU as a metric that was originally conceived for translation tasks and does not take into account the semantic similarity of words, but only exact matches.

Table 2 shows the previously described results in terms of precision and recall obtained from BERT Score. Here, we can see slightly higher results compared to Table 1 due to BERT's ability to capture semantic similarity. In this table, we can observe more homogeneous values between the different pre-trainings, because sentence length is not as important as it was for the BLEU computation.

Table 2. BERT Score results (precision and recall) of testing a sample of 3,000 videos with different textual postprocessings.
| Post-processing | MSVD Precision | MSVD Recall | MSR-VTT Precision | MSR-VTT Recall | VATEX Precision | VATEX Recall |
|---|---|---|---|---|---|---|
| Raw | 0.454 | 0.198 | 0.434 | 0.199 | 0.454 | 0.310 |
| POS | 0.258 | 0.104 | 0.225 | 0.088 | 0.326 | 0.179 |
| Lemma | 0.328 | 0.128 | 0.313 | 0.137 | 0.366 | 0.266 |
| Punct | 0.467 | 0.166 | 0.447 | 0.166 | 0.424 | 0.270 |
| Punct+POS | 0.244 | −0.018 | 0.214 | −0.034 | 0.217 | 0.071 |
| Punct+Lemma | 0.329 | 0.085 | 0.315 | 0.095 | 0.322 | 0.217 |
| Lemma+POS | 0.223 | 0.090 | 0.176 | 0.068 | 0.307 | 0.171 |
| Punct+Lemma+POS | 0.208 | −0.029 | 0.171 | −0.045 | 0.217 | 0.087 |
5 Conclusion
In this study, the use of the SwinBERT video captioning model in indoor environments was introduced by testing it under MSVD, VATEX, and MSR-VTT pre-trainings and applying different lexical transformations to the outputs. The differences between these pre-trainings were demonstrated, with VATEX showing the best performance due to its capability to produce longer and more detailed descriptions. The lexical transformations that unify the lexical structure of sentences without removing semantic information (lemmatization, lemmatization with punctuation removal, or no transformation) were found to achieve the highest scores. Additionally, the challenge posed by the Charades dataset for computer vision description generation was highlighted, and a sample of 3,000 videos was evaluated using the BLEU-1 and BERT Score metrics. BERT Score was found to be more suitable for evaluation, as it is capable of capturing semantic similarity beyond lexical information. Future studies in dense video captioning, object recognition and fine-tuning of SwinBERT will be considered to improve the results obtained. Furthermore, by incorporating various representations of objects [14] into SwinBERT, we can enhance its understanding of objects, thereby improving the quality of the generated descriptions.

Acknowledgment. We would like to thank "A way of making Europe" European Regional Development Fund (ERDF) and MCIN/AEI/10.13039/501100011033 for supporting this work under the TED2021-130890B (CHAN-TWIN) research project funded by MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR, and the AICARE project (grant SPID202200X139779IV0). Also the HORIZON-MSCA-2021-SE-0 action number: 101086387, REMARKABLE, Rural Environmental Monitoring via ultra wide-ARea networKs And distriButed federated Learning. Furthermore, we would like to thank Nvidia for their generous hardware donation that made these experiments possible.
References
1. Afif, M., Ayachi, R., Said, Y., Atri, M.: Deep learning based application for indoor scene recognition. Neural Process. Lett. 51(3), 2827–2837 (2020)
2. Barbu, A.: Video in sentences out (2012). https://arxiv.org/abs/1204.2742
3. Deng, C., Chen, S., Chen, D., He, Y., Wu, Q.: Sketch, ground, and refine: top-down dense video captioning. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 234–243 (2021)
4. Fernández Montenegro, J.M., Villarini, B., Angelopoulou, A., Kapetanios, E., Garcia-Rodriguez, J., Argyriou, V.: A survey of Alzheimer's disease early diagnosis methods for cognitive assessment. Sensors 20(24) (2020)
5. Fudholi, D.H., Nayoan, R.A.: The role of transformer-based image captioning for indoor environment visual understanding. Int. J. Comput. Digit. Syst. 12(1), 479–488 (2022)
6. Gomez-Donoso, F.: A robotic platform for customized and interactive rehabilitation of persons with disabilities. Pattern Recogn. Lett. 99, 105–113 (2017)
7. Jin, T., Li, Y., Zhang, Z.: Recurrent convolutional video captioning with global and local attention. Neurocomputing 370, 118–127 (2019)
8. Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C.: Dense-captioning events in videos (2017). https://arxiv.org/abs/1705.00754
9. Li, X., Guo, D., Liu, H., Sun, F.: Robotic indoor scene captioning from streaming video. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 6109–6115 (2021)
10. Lin, D., Kong, C., Fidler, S., Urtasun, R.: Generating multi-sentence lingual descriptions of indoor scenes (2015). https://arxiv.org/abs/1503.00064
11. Lin, K.: SwinBERT: end-to-end transformers with sparse attention for video captioning (2022). https://arxiv.org/abs/2111.13196
12. Liu, Z.: Video Swin Transformer (2021). https://arxiv.org/abs/2106.13230
13. Pei, W., Zhang, J., Wang, X., Ke, L., Shen, X., Tai, Y.-W.: Memory-attended recurrent network for video captioning (2019). https://arxiv.org/abs/1905.03966
14. Revuelta, F.F., Chamizo, J.M.G., Garcia-Rodriguez, J., Sáez, A.H.: Representation of 2D objects with a topology preserving network. In: Quereda, J.M.I., Micó, L. (eds.) Pattern Recognition in Information Systems, pp. 267–276. ICEIS Press (2002)
15. Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding (2016). https://arxiv.org/abs/1604.01753
16. Zamora-Hernández, M.-A., Castro-Vargas, J.A., Azorin-Lopez, J., Garcia-Rodriguez, J.: Deep learning-based visual control assistant for assembly in industry 4.0. Comput. Ind. 131, 103485 (2021)
17. Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: evaluating text generation with BERT (2020). https://arxiv.org/abs/1904.09675
18. Zhou, L., Kalantidis, Y., Chen, X., Corso, J.J., Rohrbach, M.: Grounded video description (2019). https://arxiv.org/abs/1812.06587
19. Zhou, L., Zhou, Y., Corso, J.J., Socher, R., Xiong, C.: End-to-end dense video captioning with masked transformer (2018). https://arxiv.org/abs/1804.00819
A Multimodal Dataset to Create Manufacturing Digital Twins

David Alfaro-Viquez1(B), Mauricio-Andres Zamora-Hernandez1, Hanzel Grillo1, Jose Garcia-Rodriguez2, and Jorge Azorín-López2
1 University of Costa Rica, San José, Costa Rica
{david.alfaro,mauricio.zamorahernandez,hanzel.grillo}@ucr.ac.cr
2 University of Alicante, Alicante, Spain
{jgr,jazorin}@ua.es

Abstract. This paper introduces a multimodal dataset created for research on digital twins in the manufacturing domain. Digital twins are digital representations of physical-world objects, and they require data to be accurately modeled. By incorporating various data modes, the digital twin representations in computational environments can become more complex and precise. To this end, we propose a dataset consisting of videos recorded inside a manufacturing laboratory, featuring different people performing assembly sequences in varying ways. In addition to the videos, we also incorporated facial capture, lateral capture, and top capture to analyze the pose of the subjects, the positions of hands and tools, and the actions performed during product assembly. Our dataset successfully labels 3 different actions (hold, release, screw) for 4 different kinds of tools (ratchet, wrench, allen key, screwdriver), indicating when the subject starts and ends each action for each tool.

Keywords: digital twin · human action recognition · multimodal dataset · pose estimation
1 Introduction
Industry 4.0 represents a significant advancement in manufacturing production and management. Through the integration of digital and physical technologies, it allows for increased efficiency, productivity, and cost reduction. Real-time data capture through the use of sensors is a crucial aspect of Industry 4.0, enabling faster and more accurate decision-making. Data analysis is also essential to identify patterns and trends, optimizing production processes and improving product quality. This research explores the recognition of human actions in manufacturing, examining state-of-the-art approaches, their advantages, and challenges.

Human action recognition (HAR) in manufacturing can improve production processes, worker safety, and product quality. The human-robot collaboration (HRC) approach enables collaboration between human workers and robots in manufacturing tasks. HRC allows robots to assist human workers in physically demanding or
dangerous tasks, while human workers can bring their expertise to complete tasks that are difficult to automate. HAR and HRC enable robots to recognize and understand the actions of human workers, adapting their behavior accordingly.

The creation of a dataset is essential for training intelligent systems, allowing algorithms to learn from the patterns present in the data and improve decision-making capabilities. In manufacturing, there is a great need for datasets specific to HAR, as manufacturing operations are highly complex and involve a wide variety of movements and activities. These datasets enable intelligent systems to understand the different actions performed in a manufacturing environment and to detect deviations or errors. The availability of such datasets is also essential for the development of automation systems that can improve efficiency and reduce errors in manufacturing processes.

The main contribution of this work is a dataset for multimodal Human Action Recognition (HAR) in manufacturing environments. This dataset is significant because it provides labeled data for artificial neural network (ANN) training, fatigue and error detection, and quality monitoring in assembly tasks. These ANNs are trained to recognize mainly manufacturing actions, but they can also detect operator fatigue and ergonomic positions. This can help to make decisions that benefit the operator and improve the performance of the production process.

This paper is organized as follows. The second section analyzes the state of the art in action recognition and pose estimation. The third section describes the creation of the videos for the dataset. The fourth section describes the multimodal data of the dataset. The last section presents conclusions and discusses future lines of research.
2 Related Work
There are few datasets available that are specifically oriented towards Human Action Recognition (HAR) in manufacturing environments. For example, the InHARD dataset (Industrial Human Action Recognition Dataset) [13], which is built from real-world industrial setups, contains 2 million frames, involves 16 people in the recording of the videos, and includes 15 different actions. In the research by Dallel et al. [5], the use of digital twins (DT) for data generation from digitized human models in real industrial environments is explored. The DT simulates assembly actions to generate synthetic self-labeled data and create an action recognition dataset referred to as InHARD-DT. Another example of an industrial dataset is the one developed in the research by Cicirelli et al. [1]. They built a dataset called Human Action Multimodal Monitoring in Manufacturing (HA4M), which provides six types of data: RGB images, depth maps, IR images, depth-aligned RGB images, point clouds, and skeleton data. This work focuses on Human Action Recognition (HAR) in the context of manufacturing, which represents a fundamental basis for optimizing production processes and improving product quality. While HAR has been extensively
researched and used in everyday action recognition [14–16], most existing approaches are based on a single recognition mode, such as skeleton recognition [10,17]. While HAR datasets exist, they typically focus on a single mode of recognition or measure complementary action information, such as 3-axis accelerometer data, electromyography (EMG), or first-person video capture [6,14,19,20]. This proposal focuses on a multimodal approach that incorporates various types of data, including facial expression recognition for fatigue analysis, skeleton position, hand positions, and hand actions on objects in the manufacturing process. One of the challenges in HAR is the difficulty of determining the correct action due to factors such as luminosity or occlusion [3]. Another challenge is measuring the primitive action accurately, as it can be complex to identify the start and finish times for more precise time estimations [4].

2.1 Pose Estimation for Action Recognition
Pose estimation has been widely used in human action recognition (HAR) research, with different features being extracted and used to recognize actions. One example is the use of distance and angle measurements between body parts, such as the nose and other joints, to recognize actions [8]. In this approach, the information is processed using a k-means algorithm, followed by a multimodal latent mapping model (MLDA), to recognize the recorded actions. Another example is the AFE-CNN model, which is based on 3D skeletons and comprises two modules [9]. The first module adapts the model to different skeleton sizes and camera views, while the second module analyzes the time factor in the scene. The multigrain-contextual approach model is another mechanism developed for HAR, which captures information about the relationships between joints and body parts [10]. In this approach, the interrelationships of joints and frames are studied to recognize the sequence of actions using machine learning algorithms. Additionally, in [12], skeleton interrelationships are also used as descriptors for HAR, where single frames depend on others to recognize the sequence of actions. Although synthetic data has been shown to be favorable for training intelligent models, it has been little explored for HAR. In [11], actions are generated from synthetic data by emphasizing unseen viewpoints, but the 3D pose estimation performance may fail in cluttered scenes.
3 Experimental Setup and Data Acquisition
To ensure the quality of the data, we use high-resolution cameras with a frame rate of 30 frames per second. The cameras are calibrated and synchronized using an external trigger signal. The resolution of each camera is 1920 × 1080 pixels, and they are positioned to provide a complete view of the operator during assembly operations (Fig. 1).
The dataset includes recordings of ten different operators performing five different assembly tasks, with each task performed five times. This setup results in a total of 250 videos with an average duration of 2 min each. The operators were instructed to wear the same clothing, and the same tools and parts were used for each task. Before the recordings, the operators were trained to perform the assembly tasks to minimize variability in the actions performed. All data were stored on a secure server in an uncompressed format to ensure the quality of the data during analysis. To ensure privacy and anonymity, all operators' faces were blurred in the recordings. The data are available for research purposes upon request.
Fig. 1. Working station of the research proposal
The top and side views were captured using Logitech C310 cameras, while the front camera was a Microsoft HD camera. All cameras used were high-definition with autozoom capabilities. The video quality was 1280 × 720 pixels with a frame rate of 30 frames per second, and all videos had no audio.
The dataset comprises 116 videos featuring both male and female participants, showcasing the assembly and disassembly operations of the product depicted in Fig. 2. The assembly process involves 3 gears, 3 shafts, 8 sinkers, nuts, and bolts, all of which were designed by an industrial engineer. The participants were given workflow instructions to follow, which enabled the analysis of the learning curve. The assembly design includes four different types of tools and manual operations, as shown in Table 1.

Table 1. Tools used in the experiment

N of Tool  Name
1          Screwdriver
2          Ratchet
3          Wrench
4          Allen Key

4 Dataset Discussion
Our dataset includes recordings from three synchronized points of view, enabling us to analyze a variety of factors related to assembly, such as action recognition, learning curves, tool usage, manufacturing parts, gestures, poses, movements, and fatigue in individuals. Each video contains multimodal information, with different data available in each perspective. The front view captures the face of the individual performing the assembly, providing information about their level of fatigue. The top view records the assembly process and captures the movements of hands, components, and tools, with labels indicating the actions performed and the objects used by the operator. The side view shows the skeleton poses of the operator and provides data for analyzing ergonomic positions and fatigue levels. A table containing labels for the actions captured in the dataset is included as Table 2.

Table 2. Actions labeled in the experiment

N of Action  Name
1            Hold
2            Release
3            Screw
Table 2 displays the actions identified in the videos, while Table 1 lists the tools featured in them. The actions captured represent various stages of the assembly process and involve different operators using the same tools. For instance, the analysis of screwing is performed across all the tools.
4.1 Dataset Top View
In this view, the objects used and the actions performed by the operator during the assembly operation can be observed. Each scene is labelled with the corresponding action being performed at that moment. The tools and hands used in the assembly are also labelled in each scene, from the beginning until the end of the action. Additionally, the entire video is labelled with a fatigue level indicator. The dataset includes both men and women who performed the same assembly and disassembly process. This provides an opportunity to analyze how different people perform the same assembly operations, including their actions, tool usage, hand positions, and physical postures, at different levels of fatigue.
Fig. 2. Dataset Top View
4.2 Dataset Side View
In the side view, the person's skeleton is labeled to assist in assessing their pose and determining if they are upright during assembly operations. Additionally, other factors such as eye marks can be used to assess fatigue. The OpenPose [21] standard was used for skeleton annotation in this view, with labels placed on the neck, right shoulder, left shoulder, right elbow, left elbow, right wrist, left wrist, and right hip.
4.3 Dataset Front View
Frontal capture enables us to analyze a person's face during assemblies. For instance, it is possible to study the opening angle of their eyes and track the number of times they blink or yawn. According to [7], signs of fatigue in a person can be detected through this method. Figure 4 provides an example of facial and eye recognition, where a label is set if the person shows signs of fatigue. The system is capable of recognizing these features in the video.
Fig. 3. Dataset Lateral View
Fig. 4. Dataset Front View
5 Data Records
The dataset is available on GitHub as HAMD-ME (Human Action Monitoring Dataset in Manufacturing Environments) at https://github.com/david-alfarov/-HAMD-ME/blob/main/README.md. Figure 5 illustrates the dataset's structure, wherein the same scene is stored in multiple directories corresponding
to the top, front, and side views. Each folder contains compressed videos along with their respective annotations. Figure 5 also shows a sample video with its numbering. Each directory includes two files. The first file is a video in Matroska format (*.mkv), whose name follows the structure XYYYYMMDD, where X denotes the view (TOP, FRONT, SIDE), YYYY represents the year, MM the month, and DD the day. The second file in each subfolder, named videoY_annotations.zip, where Y denotes the video number, contains the annotations in YOLO format.
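As an illustration, this naming scheme can be parsed programmatically. The following sketch assumes a hypothetical file name that follows the XYYYYMMDD pattern described above; it is not part of the released dataset tooling.

```python
import re
from pathlib import Path

# Pattern for the naming scheme described above: view prefix (TOP, FRONT,
# SIDE) followed by the recording date as YYYYMMDD, with .mkv extension.
NAME_PATTERN = re.compile(r"^(TOP|FRONT|SIDE)(\d{4})(\d{2})(\d{2})\.mkv$")

def parse_video_name(path: Path) -> dict:
    """Extract the view and recording date from a HAMD-ME video file name."""
    match = NAME_PATTERN.match(path.name)
    if match is None:
        raise ValueError(f"Name does not follow the XYYYYMMDD scheme: {path.name}")
    view, year, month, day = match.groups()
    return {"view": view, "year": int(year), "month": int(month), "day": int(day)}

# Hypothetical example file name (not an actual file from the repository).
print(parse_video_name(Path("TOP20230905.mkv")))
# -> {'view': 'TOP', 'year': 2023, 'month': 9, 'day': 5}
```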
Fig. 5. Dataset Structure
6 Conclusions and Future Work
The primary contribution of this work is the creation of a multimodal dataset for human action recognition (HAR) in Industry 4.0 manufacturing environments. The dataset includes multiple views of the same assembly action, as well as information on fatigue level, operator gender, actions, objects, and tools. The inclusion of top, side, and front views per action enhances the accuracy of action recognition, with potential practical applications in industry. By monitoring the actions performed and the status of the operator, action recognition enables smart manufacturing and facilitates decision-making about the work process. It can also enable analysis of an operator's state while performing their work, as well as applications such as human-robot collaboration. This dataset, which is based on action recognition, will contribute to the development of robust recognition systems and will improve the training of intelligent recognition systems. We have successfully labeled three types of actions (hold, release, and screw) and four types of tools (ratchet, wrench, Allen key, and screwdriver). Each action is labeled from the subject's initiation until completion. Future research will focus on the use of virtualization and digital twins to generate synthetic data for HAR, as well as on recognizing actions performed when the person is using both hands.
References

1. Cicirelli, G., et al.: The HA4M dataset: multi-modal monitoring of an assembly task for human action recognition in manufacturing. Sci. Data 9 (2022)
2. Shinde, S., Kothari, A., Gupta, V.: YOLO based human action recognition and localization. Procedia Comput. Sci. 133, 831–838 (2018)
3. Voronin, V., Zhdanova, M., Zelenskii, A., Agaian, S.: Action recognition for the robotics and manufacturing automation using 3-D binary micro-block difference. Int. J. Adv. Manuf. Technol. (2021)
4. Koch, J., Büsch, L., Gomse, M., Schüppstuhl, T.: A methods-time-measurement based approach to enable action recognition for multi-variant assembly in human-robot collaboration. Procedia CIRP 106, 233–238 (2022). https://doi.org/10.1016/j.procir.2022.02.184
5. Dallel, M., Havard, V., Dupuis, Y., Baudry, D.: Digital twin of an industrial workstation: a novel method of an auto-labeled data generator using virtual reality for human action recognition in the context of human-robot collaboration. Eng. Appl. Artif. Intell. 118, 105655 (2023). https://doi.org/10.1016/j.engappai.2022.105655
6. Al-Amin, M., et al.: Action recognition in manufacturing assembly using multimodal sensor fusion. Procedia Manuf. 39, 158–167 (2019). https://doi.org/10.1016/j.promfg.2020.01.288
7. Alfaro-Viquez, D., Zamora-Hernandez, M., Benavent-Lledo, M., Garcia-Rodriguez, J., Azorín-López, J.: Monitoring human performance through deep learning and computer vision in industry 4.0. In: 17th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2022), pp. 309–318 (2023)
8. Rathore, A., Hafi, L., Ricardez, G., Taniguchi, T.: Human action categorization system using body pose estimation for multimodal observations from single camera. In: 2022 IEEE/SICE International Symposium on System Integration (SII) (2022). https://doi.org/10.1109/sii52469.2022.9708816
9. Guan, S., Lu, H., Zhu, L., Fang, G.: AFE-CNN: 3D skeleton-based action recognition with action feature enhancement. Neurocomputing 514, 256–267 (2022)
10. Wu, L., Zhang, C., Zou, Y.: SpatioTemporal focus for skeleton-based action recognition. Pattern Recogn. 136 (2023)
11. Varol, G., Laptev, I., Schmid, C., Zisserman, A.: Synthetic humans for action recognition from unseen viewpoints. Int. J. Comput. Vis. 129, 2264–2287 (2021)
12. Islam, M., Bakhat, K., Khan, R., Iqbal, M., Islam, M., Ye, Z.: Action recognition using interrelationships of 3D joints and frames based on angle sine relation and distance features using interrelationships. Appl. Intell. 51, 6001–6013 (2021). https://link.springer.com/10.1007/s10489-020-02176-3
13. Dallel, M., Havard, V., Baudry, D., Savatier, X.: An industrial human action recognition dataset in the context of industrial collaborative robotics. In: IEEE International Conference on Human-Machine Systems (ICHMS) (2020). https://github.com/vhavard/InHARD
14. Amjad, F., Khan, M., Nisar, M., Farid, M., Grzegorzek, M.: A comparative study of feature selection approaches for human activity recognition using multimodal sensory data. Sensors 21, 2368 (2021). https://doi.org/10.3390/s21072368
15. Núñez-Marcos, A., Azkune, G., Arganda-Carreras, I.: Egocentric vision-based action recognition: a survey. Neurocomputing 472, 175–197 (2022)
16. Lin, J., Mu, Z., Zhao, T., Zhang, H., Yang, X., Zhao, P.: Action density based frame sampling for human action recognition in videos. J. Vis. Commun. Image Represent. 90, 103740 (2023). https://doi.org/10.1016/j.jvcir.2022.103740
17. Patil, A.A., Swaminathan, A., Gayathri, R.: Human action recognition using skeleton features. In: 2022 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct) (2022). https://doi.org/10.1109/ismar-adjunct57072.2022.00066
18. Tasnim, N., Baek, J.: Dynamic edge convolutional neural network for skeleton-based human action recognition. Sensors 23 (2023)
19. Li, R., Wang, H., Liu, Z., Cheng, N., Xie, H.: First-person hand action recognition using multimodal data. IEEE Trans. Cogn. Dev. Syst. 14, 1449–1464 (2022). https://doi.org/10.1109/tcds.2021.3108136
20. Ren, Z., Zhang, Q., Cheng, J., Hao, F., Gao, X.: Segment spatial-temporal representation and cooperative learning of convolution neural networks for multimodal-based action recognition. Neurocomputing 433, 142–153 (2021)
21. Simon, T., Joo, H., Matthews, I., Sheikh, Y.: Hand keypoint detection in single images using multiview bootstrapping. In: CVPR (2017)
A Modified Loss Function Approach for Instance Segmentation Improvement and Application in Fish Markets

Alejandro Galán-Cuenca(B), Nahuel García-d'Urso, Pau Climent-Pérez, Andres Fuster-Guillo, and Jorge Azorin-Lopez

Department of Computer Technology, University of Alicante, Ctra. Sant Vicent del Raspeig s/n, 03690 Alicante, Spain
{a.galan,nahuel.garcia,pau.climent,fuster,jazorin}@ua.es
Abstract. This paper presents an approach to image segmentation and classification in which the dataset intentionally has only a few instances labelled per image. The method tries to classify only the few instances with enough quality in the image, the KeyFish. In order not to be penalized with wrong false positives, it must learn the examples but not the context. The application is intended for wholesale fish markets. Due to the depth and occlusions of the fish tray, the camera can only visualize a small fraction of the total instances. The main goal is to predict, with the best possible quality, the few fish that are visible, regardless of other occurrences. Tests have been carried out on the YOLACT++ architecture and the proposed method, with an increase in precision from 85.97% to 89.69% in bounding box detection and from 74.03% to 81.42% in mask detection for a 50% overlap limit.
Keywords: instance segmentation · deep learning · computer vision · sustainable fisheries

1 Introduction
Proper development of the fishing sector is crucial for the preservation of marine life and the local economy. More precisely, on the Mediterranean coast, most fish markets suffer from a lack of computerization because they do not have sufficient economic resources, especially in terms of management, where they are forced to depend on the human factor, with the consequent errors in precision, fatigue, and slowness. Due to recent problems such as inflation, this sector, and in particular the smaller retail fish markets and small-scale boats, which represent approximately 80% of the fleet in the Mediterranean area [10], is at risk of precariousness.

This work was supported by the Spanish State Research Agency (AEI) under grant PID2020-119144RB-I00 funded by MCIN/AEI/10.13039/501100011033.
Not only is there insufficient traffic of vessels per week at some markets to justify an investment in technological infrastructure; there are also spatial constraints. The DeepFish project [1] covers this particular issue [6], providing an approach without the need for an expensive calibration process. Moreover, to mitigate this same management problem, a solution in the same environment of retail fish markets was developed in [11] within the above-mentioned project, using a fast and efficient segmentation architecture, YOLACT++ [3,4]. Despite this, the domains of a wholesale and a retail fish market are completely different, especially in terms of the overlap and heterogeneity of each tray. This means the previous approach to this problem is not valid, because it expects environments where almost every instance is fully visible and the challenge is to classify each occurrence. Those trays are prepared to be sold to the public, as a touristic attraction. Wholesale markets, on the other hand, are prepared for few but big clients, like supermarket chains. Trays have larger volumes, where only a small fraction of the total, about 10%–25%, can be seen, and only a few instances are not partially overlapped. Some heuristic must be used to achieve an estimate of the total, because this information cannot be known at first glance. As species tend to be uniform in the same tray, it is possible to turn the instance classification problem into a tray classification problem. This problem occurs in the context of instance segmentation applied to homogeneous object clusters (HOC), with the specific condition of not labelling each visible instance but only the best occurrences or key objects in the image, which, as far as we are aware, has hardly been explored. In this paper, a cluster method has not been considered, since the analyzed dataset is not prepared for any such structure; therefore, another approach has to be taken. In the state of the art, some works can be found about machine learning applied to images with incomplete labelling, especially those applied only to classification, which is more commonly studied. However, an incomplete dataset tends to have partial labelling because of a lack of resources or the low quality of the labelling process. Methods with these goals have a different philosophy from the method proposed in this article, where the images have partial labelling intentionally, because the purpose is to learn to recognize only good-quality instances of a class, not every instance of it. Moreover, partial annotations tend to be the omission of classes, not entities. Consider, for example, an image of a deer in a forest with trees, with only "deer" as a label for that image. In our case of study, by contrast, only the deer and some of the sharpest trees would have been labelled. With N > m, it is not usual to find datasets with N instances of a class and only m occurrences labelled, unless this is what is intended. The same reason justifies most of the approaches to datasets composed of incompletely labelled images, which tend to approach a multi-label strategy with single-label annotations, as they are less expensive [7,9,15]. Another common purpose is the use of unsupervised methods, probably related to some self-learning-like approach where the precision and quality of each label on the images is low. Some works address amodal instance segmentation [2,14], an approach to improve recognition in high-overlap scenarios. Certain techniques utilized to
improve the results are the prediction of the complete shape of an overlapped object, the completion of overlapped regions of objects or background, breaking the scene into multiple individual objects, and establishing a known order of perception among the objects in the image. These techniques tend to require learning the structure of objects and need considerable amounts of data. The main difference between this field of study and the problem to be addressed is the heterogeneous class classification and the objective of labelling as many instances as possible even if the percentage seen is minimal, which is the opposite of the main goal treated in this paper: to label a small set of objects maximizing the accuracy of those instances over other occurrences in the image. As mentioned in [12], "Training a CNN with partial labels, hence a small number of images for every label, using the standard cross-entropy loss is prone to overfitting and performance drop." Some modification of the method should therefore be made. In line with other studies, we agree on the crucial relevance of the loss function [7,9,12,15]. Some have explored a regularization of the loss function based on the smoothness of labels [12], training with a CNN and similarities in the mini-batches, updating features in the process. On the one hand, [9] focuses on single-label to multi-label conversion: a modification to the classification loss is applied depending on the proportion of labels known per image. On the other hand, some works focus on attention mechanisms to address the improvement in accuracy. More precisely, [5] applies these techniques over a YOLACT++ architecture [4], the same base architecture we have used as a starting point for the following experiments. In summary, the main objective of this work is to obtain an alternative method for the studied case of large overlap, and to verify through experimentation that it yields better results than the original approach. The structure of the paper is as follows: Sect. 2 sets out the baseline dataset and the required changes; Sect. 3 presents the proposed method; Sect. 4 details the experimentation and results; finally, Sect. 5 summarizes the work carried out.
2 Dataset
To carry out the experiments set out in this paper, the DeepFish dataset has been used. It consists of 876 images of trays of fish (3,154 specimens) from a wholesale fish market in Altea (Alicante, Spain). Images were labelled at the pixel level (per fish instance, i.e. each specimen). There are fish from 27 different species. For further details, please refer to the public dataset repository [8]. The whole dataset has been divided with an 80% ratio of instances for training and the remaining 20% for validation. As can be seen in Fig. 1a, the dataset presents a high imbalance between classes. Because of this imbalance, the division has been made class by class, assigning the same percentage of specimens of each class to the train and validation sets. The final numbers of the distribution are shown in Table 1.
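This per-class division corresponds to a standard stratified split. Below is a minimal sketch with scikit-learn, using toy stand-in labels rather than the actual DeepFish annotations:

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the real annotations: one id and species label per specimen.
annotation_ids = list(range(100))
species_labels = ["Sepia officinalis"] * 40 + ["Phycis blennoides"] * 60

# Stratified 80/20 split: each species contributes the same proportion of
# specimens to the training and validation sets.
train_ids, val_ids = train_test_split(
    annotation_ids,
    test_size=0.20,
    stratify=species_labels,
    random_state=0,
)
print(len(train_ids), len(val_ids))  # 80 20
```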
Table 1. Total instances and images of trays of the dataset used, and distribution between train and validation sets.

Distribution  Trays  Annotations
Train         648    2461
Validation    192    693

Table 2. Minimum, maximum, average and standard deviation of instances of the dataset used in every tray image, between train and validation sets.

Distribution  Minimum  Maximum  Average  Standard deviation
Train         1        22       3.598    2.3289
Validation    1        14       3.6094   2.2322
Fig. 1. Frequency of instance species and trays in the dataset (a, b); average and standard deviation of species per tray (c). Panels: (a) frequency of each species in descending order; (b) quantity of trays for each species, ordered by frequency in the dataset; (c) average and standard deviation of each species per tray.
In this dataset, the number of instances labelled on an image does not correspond to the number of real instances. Indeed, it is usually minuscule compared to the total number. This statement can be counter-intuitive, because some species tend to appear in the tray in similar quantities every time, mostly determined by their size. For example, some big fish like Seriola dumerili appear in trays with 2 instances, whereas other classes such as Phycis blennoides can appear in trays with 50 occurrences or even more due to their smaller size, see Fig. 2. Despite the imbalance in frequency of occurrence, Table 2 demonstrates the mismatch between the number of specimens per tray and the number of specimens
Fig. 2. Example trays with black silhouettes representing labelled instances: (a) Seriola dumerili, (b) Phycis blennoides.
labelled per tray, which stays consistently in the range from 1 to 5 entities labelled per image, while the appearance rate is much greater. Together with the previous statements, the information shown in Fig. 1b and Fig. 1c indicates the expected precision rate of some classes. One example is Sepia officinalis, which has the greatest average per tray of the dataset, 7 instances, meaning that this species has 7 entities of great quality per tray, because instances have been labelled only if they are easily recognizable. The algorithm will need fewer images, which come with great density, and will need fewer instances, which are easily identifiable. This class is the 10th with more instances (143 specimens) and the 14th with respect to the number of trays, with only 20 images. However, as will be shown in Sect. 4.2, and more precisely in Fig. 3b, a greater number of instances does not correlate with the precision a class will obtain. More factors affect the output. That "quality" of a species depends highly on its visually distinguishing features. In this example, Sepia officinalis has a color, shape, or even texture different enough from other species to justify a higher result with fewer images.
3 Proposed Method
Starting from the previous method applied to a retail fish market [11], new problems emerged due to the change to a wholesale fish market. More precisely, a high overlap on the tray, changing positions of fish, heterogeneity of sizes in the same tray, the existence of some multi-label trays, and artifacts downgrading the value of the image, such as ice, water, dirt on the camera, lighting, and lower resolutions than in previous domains, among other issues. A new approach has been developed to sort out these new problems, whereas some of the previous objectives have become trivial in the new environment, like the classification of species inside the same tray. The main goal of this method, named KeyFish, is to detect the segmentation of an instance optimizing the precision, regardless of the quality of other predictions. With that small number of instances with high confidence, size and weight will be calculated, and the total number of specimens can be inferred from the total weight of the tray and the average weight of the KeyFish. In contrast, the previous technique is not valid for this issue, since it is based on counting predicted instances and makes efforts to differentiate every occurrence. If only a fraction of the total number of entities can be seen, the total number of
instances will be fewer. In addition, wrong predictions, or the lack of them, will decrease the precision of the method. As the new domain is noisier, precision will tend to be lower. This new method avoids previous issues like overlap, changing positions of fish, and artifacts, because losing sight of a fragment of the tray will probably output the same result if at least one specimen has been detected with enough quality to be considered a KeyFish. However, if the set of KeyFish is not representative of the total population of the tray, some multi-label or heterogeneous-size trays may have lower results than expected, because the subset used to infer the total is incomplete. As the main objectives have changed significantly, from detecting everything to detecting the best instances as accurately as possible, the architecture should also change to adapt the training to this new perspective and increase the precision of the method. More precisely, the loss function has been modified. The function has three components: a classification loss L_cls, a box regression loss L_box, and a mask loss L_mask. Both L_cls and L_box are defined as done in [13]. To compute the mask loss, a pixel-wise binary cross entropy (BCE) is taken between the set of assembled masks M and the set of ground truth masks M_gt, so that the mask loss is calculated as:

L_mask = BCE(M, M_gt)    (1)
The main contribution relies on the following aspect: for each of the three components, the same loss function is used, but its importance to the algorithm depends on the resemblance between the instance and the most similar ground truth correspondence. In other words, only the M instances with the highest IoU over the N ground truth examples are penalized, as can be seen in Eq. (2). The α parameter is used as a threshold limit to determine whether a prediction is penalized, depending on its IoU with the ground truth. χ represents all the predictions in the image, and is used to express the best overlapping prediction for every ground truth example.

f(x, y) = { x − y   if (x ∩ y) > α ∧ (x ∩ y) = max(χ ∩ y)
          { 0       otherwise                                    (2)
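As an illustration, the selective penalization of Eq. (2) can be sketched as a mask over per-prediction losses. The loss values and IoU matrix below are hypothetical, and the actual integration into the YOLACT++ training code is more involved.

```python
import torch

def keyfish_loss_mask(ious: torch.Tensor, alpha: float) -> torch.Tensor:
    """Return a boolean mask over predictions following Eq. (2): a
    prediction is penalized only if it is the best-overlapping prediction
    for some ground truth instance AND that IoU exceeds alpha."""
    best_pred = ious.argmax(dim=0)  # best prediction index per ground truth
    mask = torch.zeros(ious.shape[0], dtype=torch.bool)
    for gt_idx, pred_idx in enumerate(best_pred):
        if ious[pred_idx, gt_idx] > alpha:
            mask[pred_idx] = True
    return mask

# Hypothetical per-prediction losses (classification, box or mask terms).
losses = torch.tensor([0.9, 0.4, 1.2])
# Rows: 3 predictions; columns: 2 ground truth instances.
ious = torch.tensor([[0.7, 0.0],
                     [0.1, 0.6],
                     [0.2, 0.3]])
masked_losses = losses * keyfish_loss_mask(ious, alpha=0.5).float()
print(masked_losses)  # tensor([0.9000, 0.4000, 0.0000])
```

Only the best-overlapping prediction of each ground truth instance, and only when its IoU exceeds α, keeps its loss; the remaining predictions are ignored rather than punished as false positives.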
2. Obtain the current video frame from the hardware.
3. Apply the forgetting rate α to recompute the a priori probabilities πi for all the elements of the active detection set. If an object is no longer visible because it has gone out of V, then it is erased because it is now inactive.
4. Draw M samples at random according to the multivariate distribution (2). Identify the bounding box associated with each sample. Then modify its size to match the window size required by the image classification deep neural network. Then the window associated with the bounding box is fed to the deep network. In case the network informs of a detection, the corresponding sample is inserted into A. Also, the sample is marked with the reliability of the detection, which is taken as an estimation of the probability that the detection is correct. A sketch of this loop follows the list.
5. Go to step 2.
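In the sketch below, get_frame, sample_mixture, and classify_window are hypothetical placeholders for the camera interface, the mixture sampler of the probability model, and the CNN classifier, so this is an illustration of the control flow rather than the actual implementation.

```python
import numpy as np

def crop_and_resize(frame: np.ndarray, box, size=(224, 224)) -> np.ndarray:
    """Crop an (x, y, w, h) window and resize it to the CNN input size."""
    x, y, w, h = box
    crop = frame[y:y + h, x:x + w]
    # Nearest-neighbour resize to keep the sketch dependency-free.
    ys = np.linspace(0, crop.shape[0] - 1, size[0]).astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, size[1]).astype(int)
    return crop[ys][:, xs]

def detection_loop(get_frame, sample_mixture, classify_window,
                   num_samples: int, forgetting_rate: float):
    """Simplified version of steps 2-5 of the algorithm above."""
    active = []  # active detection set: (box, a priori probability) pairs
    while True:
        frame = get_frame()                        # step 2
        # Step 3: apply the forgetting rate to the a priori probabilities
        # (the visibility test that erases off-screen objects is omitted).
        active = [(b, p * (1.0 - forgetting_rate)) for b, p in active]
        # Step 4: draw M samples from the mixture and classify each window.
        for _ in range(num_samples):
            box = sample_mixture(active)           # position/size sample
            label, reliability = classify_window(crop_and_resize(frame, box))
            if label is not None:                  # confirmed detection
                active.append((box, reliability))
```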
3 System Architecture
Automatic object detection and classification in digital video streams is a very common task nowadays. However, it is also a very complex procedure that requires the use of deep learning techniques in order to be sufficiently robust and accurate. Any system involving deep learning techniques usually requires high amounts of computing power. This is even more remarkable when processing video streams coming from panoramic 360◦ cameras, as they handle even larger frames. On a regular basis, building a deep learning-based object detection system requires the use of expensive and power-hungry hardware that, at the same time, requires a series of external components that hinder its integration into autonomous devices. One solution for building deep learning-based object detection systems for autonomous devices is to deploy them into embedded systems, as they are small, low power consuming, and they barely require external components to work. But, in the case of embedded systems-based video stream processing, computational resources are especially valuable as they tend to be scarce. Thus, it is desirable to utilize system architectures that optimize computation processes so they can be
Fig. 1. Diagram of the software architecture
performed by low-profile pieces of hardware whose computing power is reduced, in order to achieve a low power consumption and a higher level of autonomy and versatility. As mentioned in Sect. 1, this work pursues the development of a stand-alone object detection and classification system for 360◦ camera video streams with low power consumption and reduced size. Consequently, the architecture of the system is divided into two well-differentiated parts, namely the software architecture and the hardware architecture. Both of them are detailed below.

3.1 Software Architecture
The software architecture developed in this work is committed to the objective of optimizing the system's operation so that it can be deployed on low-power-consuming hardware without experiencing a strong loss in performance. With this target in mind, the architecture consists of three different modules that operate concurrently in a producer-consumer configuration (Fig. 1). The first module is a video stream acquisition process that is in charge of receiving the frames from any 360◦ video source, namely a panoramic camera. Frames are supplied to the second module, which implements the potential detection generator this system relies on. Contrary to what systems such as Faster R-CNN do, the potential detection generator takes advantage of the information learnt from past frames by selecting a certain number of areas in the current frame where a Convolutional Neural Network (CNN) is going to check whether there is an identified object or not. The position and size of these areas are selected by using one of the three multivariate homoscedastic probability distributions presented in Sect. 2, over the position and size of the objects the system has already found in the video stream. The third module, the inference module, is the one presenting the highest novelty in this work. The cited module is in charge of performing the identification of the objects appearing in the portions of the frame supplied by the potential detection generator. It consists of an array of several parallel inference processes, each of them performing the inference task which will determine whether there is an identified object in that area according to the accuracy obtained. Every parallel process will add the result of the inference, i.e., the position and category of the identified object, to a common list of confirmed detections. This list of confirmed detections is updated by the main process according to the algorithm proposed in Sect. 2.
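As an illustration of this producer-consumer configuration, the following is a minimal sketch using Python's multiprocessing module. The helper names generate_windows and load_mobilenet, the queue layout, and the worker count are assumptions for illustration, not the exact implementation of the system.

```python
import multiprocessing as mp

def potential_detection_generator(frame_queue, window_queue):
    """Producer: turns incoming frames into candidate windows."""
    while True:
        frame = frame_queue.get()
        if frame is None:               # poison pill: propagate shutdown
            window_queue.put(None)
            break
        # generate_windows is a hypothetical helper that samples window
        # positions and sizes from the multivariate mixture of Sect. 2.
        for window in generate_windows(frame):
            window_queue.put(window)

def inference_worker(window_queue, detection_queue):
    """Consumer: classifies candidate windows with the CNN."""
    model = load_mobilenet()            # hypothetical loader; one copy per process
    while True:
        window = window_queue.get()
        if window is None:
            window_queue.put(None)      # let the remaining workers stop too
            break
        result = model(window)
        if result is not None:          # confirmed detection
            detection_queue.put(result)

if __name__ == "__main__":
    frames, windows, detections = mp.Queue(), mp.Queue(), mp.Queue()
    producer = mp.Process(target=potential_detection_generator,
                          args=(frames, windows))
    workers = [mp.Process(target=inference_worker, args=(windows, detections))
               for _ in range(4)]       # up to 4 workers fit in TX2 memory
    producer.start()
    for w in workers:
        w.start()
```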
Fig. 2. Schematic of the system’s workflow.
3.2 Hardware Architecture
Multiprocess-based implementations of deep learning-based detection systems require larger amounts of memory than monoprocess implementations. The reason is that, since data structure sharing between different processes is quite limited, it is sometimes necessary to store some data structures in the memory of every process involved. Hence, it was critical to choose a piece of hardware that not only was small and low power consuming, but also had enough memory and computing capabilities. Therefore, the device selected to support the system developed in this work was the Jetson TX2 board. This device is a consolidated system for deep learning tasks and it has a 256-core NVIDIA Pascal GPU, 8 GB of RAM/VRAM memory, and a power consumption of 7.5 W.
4 Experimental Results
In order to test the object detection and classification system for panoramic video streams developed in this work, a complete series of tests has been carried out by implementing the software architecture described in Sect. 3.1 and deploying it on the hardware platform described in Sect. 3.2. This implementation consists of a program written in Python that analyses, frame by frame, a 360◦ video simulating the video stream provided by a 360◦ camera. In the opinion of the authors of this work, this method is more convenient for testing the performance of the system since it avoids the intrinsic issues produced by the interaction with the camera, resulting in more reproducible experiments and more accurate measurements. The video used is a 360◦ video supplied by Stanford University's Virtual Human Interaction Lab [11]. The system workflow is shown in Fig. 2. According to this, in the first place, a frame from the 360◦ video indicated above is fed to the system. Next, this frame is processed by the potential detection generation engine, whose operation was described in Sect. 3.1. This module generates a set of potential detections, which in practice is a set of areas or windows of the current frame whose position and size are calculated by using one of the three probability distributions explained in Sect. 2. The program feeds the potential detections to the inference module, where a set of n parallel processes use a CNN to determine whether there are any objects of the category set recognized by the CNN in the areas enclosed by the potential detections. If one process determines that a certain window contains any object,
Fig. 3. Number of detections performed by the system in 612 frames setting up the potential detection generator with the three multivariate distributions (see Sect. 2).
this one is added to the list of confirmed detections and incorporated into the knowledge base of the system. As this system is an improvement over the one published in [4], the experiments section of this work is mostly oriented to illustrating the increase achieved in system speed by introducing parallel computing in the inference module. So, the experiments consisted of feeding the 612 frames of the 360◦ video cited above to the system and checking how many of the objects actually present in the frames it can detect and how fast it can do it. This experiment has been repeated for all three multivariate probability distributions presented in Sect. 2, for a number of potential detections that goes from 1 to 30, and for different numbers of parallel inference processes that go from 1 to 4. The reason for using up to 4 parallel inference processes is that this was the maximum number of processes the Jetson TX2 was capable of managing without running out of memory. The video from [11] which has been used to perform the tests was manually tagged, localizing all the appearances of objects from four categories of the Pascal VOC 2012 dataset. These categories are “person”, “dog”, “car” and “motorcycle”. The program can check whether the position of a detection generated by the system really contains the object identified in this detection. This way the system can count the number of positive detections of objects in each frame. The Convolutional Neural Network used in the inference process is the MobileNet [10] implementation from the PyTorch framework, properly trained with the widely used Pascal VOC 2012 dataset. The reason for using MobileNet in our inference module is its balance between accuracy, inference speed, and memory consumption. In order to ensure the reproducibility of the experiments, it is also important to indicate the values of the different parameters affecting the multivariate probability distributions on which the potential detection generator relies. These values are σ = 0.3 and α = 0.1 for all three distributions. In the case of q, it is
Fig. 4. Performance of the system in fps for all three multivariate probability distributions, from 1 to 4 parallel processes, and the non-multiprocessing version of the system.
q = 0.4 for the Gaussian mixture, q = 0.2 for the Student-t mixture, and q = 0.7 for the Triangular mixture. It is important to remark that the parameter values were selected after a supercomputer-driven optimization process, using as validation dataset two different videos from the [5] dataset. Regarding the results of the experiments, Fig. 3 shows how the number of detections increases as the number of potential detections grows. This is expected, since the higher the number of potential detections generated by the system for each frame, the higher the number of regions of the image in which the system looks for possible objects simultaneously. It can also be observed that the number of correct detections performed by the system does not follow a smooth progression. Instead, multiple oscillations can be observed in the plot. The reason for this is that the probability distributions used by the system to decide which region of the frame to observe in a certain frame have an important random component that affects the initialization of the system and, consequently, its performance through the following frames. The last observation that can be extracted from this plot is that, at first glance, the potential detection engine powered by the Gaussian distribution seems to perform slightly better than the other two, having the highest number of accumulated correct detections in one pass when using 20 potential detections. As has already been anticipated in the paragraphs above, the experimental section is mostly dedicated to analyzing the performance of this system when the inference module is implemented using a multiprocess parallel concurrent architecture. Consequently, a series of experiments has been developed involving the evaluation of the system speed in frames per second when implementing the inference module with one, two, three, and four parallel processes on a Jetson TX2 board, for a number of potential detections spanning from 1 to 30 and for all three probability distributions described in Sect. 2. In order to illustrate the advantages of using multiprocessing, the tests have also
been performed using the single-process implementation described in [4]. Results of this series of experiments are shown in Fig. 4. The most important observation from Fig. 4 is the speed increase in fps when using multiprocessing for almost every number of potential detections. More precisely, the system speed increases from 1 to 5 fps depending on the number of parallel inference processes and the number of potential detections (windows) considered. This represents a system that in the best case can be up to 5 times faster than the non-multiprocessing version. It is remarkable that, for example, for one potential detection, the speed in fps is higher even when just one process is used in the inference module. The reason for this is that, even with only one process in the inference module, this process works concurrently with the potential detection generator module in a producer-consumer architecture, which makes image processing more efficient. The other important observation that can be obtained from these plots is that a high number of parallel processes is not always equivalent to higher fps. This only happens when the number of potential detections is high enough to take advantage of the parallel processing architecture. When the number of potential detections is not high enough, the cost of managing multiple processes overrides the benefit of the multiprocessing implementation of the inference module. So, it can be concluded that the multiprocessing architecture becomes more suitable for this system as the number of potential detections increases. It is also apparent from Fig. 4 that the probability distribution used in the implementation does not seem to introduce a significant variation in the speed performance of the multiprocess implementation. This is also an expected behavior, as the potential detection generator is placed in a monoprocess module.
5 Conclusion
In this paper, a novel anomalous object detection system embedded in a Jetson TX2 board is proposed. In our system, video streams taken from panoramic surveillance cameras feed a potential detection generator module based on a probability mixture model to detect anomalous objects. Then, the detected objects are classified in the inference module using a MobileNet model. The novelty of this proposal with respect to previous state-of-the-art works is the introduction of a multiprocess parallel concurrent architecture in the inference module to increase the processing framerate on a Jetson TX2 board. Experimental results show that parallel processing increases the speed of the system from 1 to 5 fps, depending on the number of parallel inference processes and the number of potential detections considered. In general, the higher the number of potential detections, the higher the number of parallel processes that can be used to increase the speed of the system. Finally, we can observe that the speed performance is not influenced by the probability distribution used in the potential detection generator. Future work involves the improvement of the system's accuracy with the implementation of a new potential detection generator, based on a more specific
probability distribution, and the use of NVIDIA's TensorRT technology in order to improve the system's speed performance.

Acknowledgements. This work is partially supported by the Autonomous Government of Andalusia (Spain) under project UMA20-FEDERJA-108. It is also partially supported by the University of Málaga under grant. It includes funds from the European Regional Development Fund (ERDF). It is also partially supported by the University of Málaga (Spain) under grants B1-2019 01, B1-2019 02, B1-2021 20, B4-2022, and B1-2022 14. They also gratefully acknowledge the support of NVIDIA Corporation with the donation of an RTX A6000 GPU with 48 GB. The authors also thankfully acknowledge the grant of the Instituto de Investigación Biomédica de Málaga y Plataforma en Nanomedicina-IBIMA Plataforma BIONAND.
References

1. Angelov, P., Sadeghi-Tehran, P., Clarke, C.: AURORA: autonomous real-time on-board video analytics. Neural Comput. Appl. 28(5), 855–865 (2017)
2. Bang, S., Park, S., Kim, H., Kim, H.: Encoder-decoder network for pixel-level road crack detection in black-box images. Comput.-Aided Civil Infrastruct. Eng. 34(8), 713–727 (2019)
3. Benito-Picazo, J., Domínguez, E., Palomo, E.J., López-Rubio, E.: Deep learning-based video surveillance system managed by low cost hardware and panoramic cameras. Integr. Comput.-Aided Eng. 27(4), 373–387 (2020)
4. Benito-Picazo, J., Domínguez, E., Palomo, E.J., Ramos-Jiménez, G., López-Rubio, E.: Deep learning-based anomalous object detection system for panoramic cameras managed by a Jetson TX2 board. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–7 (2021). https://doi.org/10.1109/IJCNN52387.2021.9534053
5. Charles, P.L.S.: LITIV (2018). http://www.polymtl.ca/litiv/en/. Accessed 14 Feb 2018
6. Chen, C., Li, S., Qin, H., Hao, A.: Robust salient motion detection in non-stationary videos via novel integrated strategies of spatio-temporal coherency clues and low-rank analysis. Pattern Recogn. 52, 410–432 (2016)
7. Dalwadi, D., Mehta, Y., Macwan, N.: Face recognition-based attendance system using real-time computer vision algorithms. In: Hassanien, A.E., Bhatnagar, R., Darwish, A. (eds.) AMLTA 2020. AISC, vol. 1141, pp. 39–49. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-3383-9_4
8. Dziri, A., Duranton, M., Chapuis, R.: Real-time multiple objects tracking on raspberry-pi-based smart embedded camera. J. Electron. Imaging 25, 041005 (2016)
9. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 809–830 (2000)
10. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications (2017)
11. VHI Lab: 360 video database. https://vhil.stanford.edu/
12. Li, L., Huang, W., Gu, I.Y., Tian, Q.: Statistical modeling of complex backgrounds for foreground object detection. IEEE Trans. Image Process. 13(11), 1459–1472 (2004)
13. Liang, X.: Image-based post-disaster inspection of reinforced concrete bridge systems using deep learning with Bayesian optimization. Comput.-Aided Civil Infrastruct. Eng. 34(5), 415–430 (2019)
14. McCann, M., Jin, K., Unser, M.: Convolutional neural networks for inverse problems in imaging: a review. IEEE Signal Process. Mag. 34, 85–95 (2017)
15. Micheloni, C., Rinner, B., Foresti, G.: Video analysis in pan-tilt-zoom camera networks. IEEE Signal Process. Mag. 27(5), 78–90 (2010)
16. Poppe, R.: A survey on vision-based human action recognition. Image Vis. Comput. 28(6), 976–990 (2010)
17. Sajid, H., Cheung, S.C.S., Jacobs, N.: Appearance based background subtraction for PTZ cameras. Signal Process. Image Commun. 47, 417–425 (2016)
18. Vijayan, M., Mohan, R.: A universal foreground segmentation technique using deep-neural network. Multimedia Tools Appl. 79, 34835–34850 (2020)
Deep Learning-Based Emotion Detection in Aphasia Patients

David Ortiz-Perez1, Pablo Ruiz-Ponce1, Javier Rodríguez-Juan1, David Tomás1, Jose Garcia-Rodriguez1(B), and Grzegorz J. Nalepa2

1 Universidad de Alicante, Alicante, Spain
{dortiz,pruiz,jrodriguez,jgarcia}@dtic.ua.es, [email protected]
2 Jagiellonian University, Kraków, Poland
[email protected]
Abstract. In this paper, we propose a pipeline for analyzing audio recordings of both aphasic and healthy patients. The pipeline can transcribe and distinguish between patients and the interviewer. To evaluate the pipeline's effectiveness, we conducted a manual review of the initial frames of one hundred randomly selected samples and achieved a 94% accuracy in patient differentiation. This evaluation aimed to ensure accurate differentiation when analyzing frames where the clinician interacts with the patient. This differentiation is important, as the primary objective of this project is to examine patients' emotions while they listen to their interviewer and identify patterns between healthy patients and those with aphasia. To achieve this, we used the AphasiaBank dataset, which includes video recordings of interviews with both aphasic and healthy patients. By combining the audio differentiation with the video recordings, we were able to analyze the facial expressions of patients while they listened to the speech of the interviewer. This analysis revealed a negative influence on the mood of aphasic patients. This negative influence stems from aphasic patients' difficulty in correctly understanding and expressing speech.

Keywords: Aphasia · Emotion recognition · Deep Learning · Transformers

1 Introduction
Aphasia is a neurological disorder that occurs due to damage to certain regions of the brain involved in speech and language. This condition can cause significant communication difficulties for patients, who may be unable to express themselves clearly. In the United States, it affects approximately one million people and is commonly associated with middle-aged and older individuals, though it can occur at any age. Symptoms of aphasia primarily involve difficulties with language. There are various types of aphasia, which are determined by the specific location and extent of brain damage [10,13,15]. The most prevalent types are Wernicke, Broca, and
global aphasia. Wernicke aphasia is characterized by the use of nonsensical, long sentences and the invention of new words, making it challenging for patients to comprehend others' speech. In contrast, Broca aphasia results in patients using minimal words and constructing short, direct sentences, frequently omitting common words such as "the", "and", or "is". Global aphasia involves extensive brain damage and is associated with severe communication difficulties that limit patients' ability to both speak and comprehend others' speech. Other, less common types of aphasia affect patients' communication abilities differently. Aphasia can result from various conditions such as strokes, brain tumors, or progressive neurological diseases like Alzheimer's disease, which is often linked to dementia [5,7,18]. This study aims to develop a pipeline that can transcribe and distinguish between patient and clinician recordings for further analysis of patients' facial expressions while they listen to clinicians. The primary objective is to analyze patients' emotions to identify patterns in aphasia disease, particularly regarding how patients feel while listening to others, such as clinicians. Patients with aphasia may have different moods due to difficulty comprehending language. Analyzing their reactions and emotions can help improve communication with them, ultimately enhancing their comfort levels. The remainder of the paper is organized as follows: Sect. 2 introduces the state of the art and related work in this area; Sect. 3 analyzes the data available in our dataset; Sect. 4 describes the work done over the dataset; Sect. 5 analyzes the results; finally, in Sect. 6 we summarize our conclusions and propose further work.
2 Related Work
A research study has been conducted to select the most suitable dataset for our project. The only dataset that provides information regarding aphasia disease is AphasiaBank (https://sla.talkbank.org/TBB/aphasia) [6], which will be explained in detail in Sect. 3 and is provided by TalkBank. TalkBank is primarily dedicated to the research of human communication and offers other similar datasets that have been considered for our project. One such dataset is TBIBank [4], which contains information on patients with traumatic brain injuries. This dataset is similar to the AphasiaBank corpus, as aphasia is often the result of brain damage in certain areas. Another comparable dataset is RHDBank [14], which contains information on patients with right hemisphere damage. Lastly, there is DementiaBank [11], which contains information on patients with dementia. Dementia is another cause of aphasia, as it is a progressive neurological disease. The main factor that drove the selection of the AphasiaBank dataset over the others was the availability of video recordings and a larger number of samples. While DementiaBank only contains audio recordings and TBIBank does not provide a video modality for every sample, both AphasiaBank and RHDBank
provide video recordings for each sample. Moreover, AphasiaBank offers a significantly larger number of samples for our study. In this research, the video modality is essential for emotion recognition, as it is easier to predict emotions when the facial expressions of a person are visible. Regarding the existing work carried out on the selected dataset, AphasiaBank, there are tasks such as automatic speech recognition of aphasic individuals, as well as numerous lexical and semantic analyses, as this dataset includes transcriptions of the recordings. The task of automatic speech recognition, which involves transcribing an audio recording, has shown significant advancements in recent years, particularly with transformer-based architectures such as Whisper [17] or Wav2Vec2 [1]. The significance of this area lies in the added complexity of the task due to the communication difficulties faced by aphasic patients, who may produce incomprehensible speech or sentences during a conversation. Additionally, there is a significant disparity in the availability of transcription data for healthy patients compared to those with the disease. In this regard, we highly appreciate the work done by Iván G. Torres et al. [22], who used the AphasiaBank dataset as well. Regarding other works focused on the semantic and lexical analysis of transcriptions from this dataset, several studies can be found. One such example is the work by Yu-Er Jiang et al. [9], which analyzed the main verbs and nouns used by patients with anomic aphasia and healthy controls. The study compared individuals of similar age and education levels to ensure a more accurate and balanced analysis. Results showed that individuals with anomic aphasia tend to use fewer core verbs and nouns than healthy individuals. Another study in this area that utilized the same dataset was conducted by Ouden Dirk-Bart et al. [16], which analyzed the use of verbs. Results showed that individuals with Broca's aphasia tend to use verbs in less complex and diverse ways than healthy individuals. Emotional expressions and understanding are crucial in human communication. Diseases such as aphasia and dementia can negatively impact interactions and conversations with others. Patients with dementia may find it difficult to identify others' emotions and empathize with them. Thus, investigating the emotions of these patients is an interesting area of study. With advancements in artificial intelligence, tasks such as emotion recognition can be automated. Although there are currently no studies on how aphasia affects patients' emotions, diseases like dementia have been explored in automating emotion recognition for further analysis. Karmele Lopez-de-Ipiña et al. [8] conducted an emotion response analysis aimed at detecting dementia by analyzing audio recordings and using audio features to determine emotions. Parkinson's disease is another illness that can affect patients' emotions, with deficits in emotional speech production. Shunan Zhao et al. [23] have performed a more complex analysis using automatic emotion recognition to investigate this disease. In this context, there can be numerous emotions, with subtle differences between them. Psychologist Paul Ekman differentiates between six basic emotions: anger, disgust, happiness, fear, surprise, and sadness. Ekman proposed this distinction based on an analysis of eye, head, and facial muscle movements.
3 Dataset
As previously indicated, our dataset selection process culminated in the decision to use the AphasiaBank dataset. This dataset is of particular interest due to its inclusion of video recordings, in which patients were recorded during a conversation with a clinician. The videos capture the upper half of the body, including the face, which makes the patients' facial expressions, the most crucial element of our emotion recognition analysis, clearly visible. The dataset includes video recordings of both healthy and aphasic patients, although there are considerably more samples from the latter group: a total of 440 video samples from aphasic patients and 220 samples from healthy patients. The primary focus of the recordings is on the speech behavior of the patient, with the conversation and discourse tasks designed to provide data on how they express themselves. This database is organized into different corpuses (a corpus is a set of data from the dataset, in this case a set of video recordings), and it is important to note that the tasks vary depending on the corpus, some being more varied than others. The main task involves initiating a conversation by inquiring about the patient's perception of their speech, while other tasks include the description of various images. The dataset also includes CHAT transcriptions [12] of the conversations, in line with other similar datasets, such as DementiaBank. In this sense, the information represented in the form of text capturing the speech can be highly valuable for semantic and lexical analysis, as demonstrated in previous studies.
4 Approach
In this project, we developed a pipeline for automatic speech recognition and speaker differentiation on the video recordings of the AphasiaBank dataset. The pipeline consists of several stages and has been applied to each sample of the dataset. First, we extract the audio information from the video and store it. Using the Whisper model developed by OpenAI, we transcribe the recording, resulting in two files: a plain transcription file and a file with the transcription and the time-lapse of each transcribed sentence. The latter is used for further processing. Next, we use the speaker-diarization [2,3] model provided by the HuggingFace library to differentiate between the patient and the clinician. This model enables us to obtain a time-lapse of when each speaker is talking. Both models are transformer-based, which has significantly improved the accuracy of the pipeline. Using the output from both models, we obtain a final transcription of what each speaker says. In order to distinguish between the patient and the clinician, we propose to identify the patient as the person who speaks for a longer duration in the recordings. This approach is based on the fact that the recordings are primarily focused on the speech of the patients, who are expected to speak more than the clinicians. In this scenario, the role of the clinicians is to facilitate the conversation and provide assistance to the patients when necessary. A minimal sketch of these steps is shown below.
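The following sketch illustrates the transcription and diarization stages, assuming the public APIs of the openai-whisper and pyannote.audio packages (the pretrained pyannote pipeline additionally requires a HuggingFace access token); the model names, the file path and the patient-identification heuristic are illustrative choices based on the description above, not the exact configuration used in the paper.

```python
import whisper
from pyannote.audio import Pipeline

audio_path = "sample.wav"  # audio previously extracted from a video sample

# 1) Transcription with time-lapses (Whisper)
asr = whisper.load_model("base")
segments = asr.transcribe(audio_path)["segments"]  # each has 'start', 'end', 'text'

# 2) Speaker diarization (pyannote.audio, via HuggingFace)
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = diarizer(audio_path)

# 3) Identify the patient as the speaker with the longest total speaking time
totals = {}
for turn, _, speaker in diarization.itertracks(yield_label=True):
    totals[speaker] = totals.get(speaker, 0.0) + (turn.end - turn.start)
patient = max(totals, key=totals.get)

# 4) Keep the intervals where the patient is listening (i.e., someone else speaks)
listening = [(turn.start, turn.end)
             for turn, _, spk in diarization.itertracks(yield_label=True)
             if spk != patient]
```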
For emotion recognition, we extract the time-lapse where the patient is listening to the clinician, and only keep the video frames during this period. The pipeline architecture is shown in Fig. 1. Overall, our pipeline provides an efficient and accurate method for processing audio recordings and extracting important information for further analysis.
Fig. 1. Architecture of the proposed pipeline
Once we have identified the video frames where the patient is listening to the clinician, we utilize the DeepFace [19–21] library's model to extract relevant information from facial expressions. While this model can provide information about age, sex, and race, our focus is solely on the emotions conveyed through facial expressions. The model identifies emotions such as anger, disgust, fear, happiness, sadness, surprise, and neutral; these are the emotions previously mentioned in Sect. 2. We use this information to develop a method for analyzing the emotions conveyed in each sample. The other relevant information obtained through transcription and speaker differentiation with time-lapses is not used in this project. However, we will keep this information for future work.
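As an illustration of this stage, the following is a minimal sketch of per-frame emotion extraction with the DeepFace library; the frame-sampling step and the averaging into per-video proportions (matching the 0-to-1 scale used in Sect. 5) are assumptions, and recent DeepFace versions return a list of result dictionaries as used here.

```python
import cv2
import numpy as np
from deepface import DeepFace

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def emotion_profile(video_path, listening_intervals, frame_step=10):
    """Average emotion scores over the frames inside the given intervals (seconds)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    scores, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t = idx / fps
        if idx % frame_step == 0 and any(s <= t <= e for s, e in listening_intervals):
            res = DeepFace.analyze(frame, actions=["emotion"],
                                   enforce_detection=False)
            scores.append([res[0]["emotion"][e] for e in EMOTIONS])
        idx += 1
    cap.release()
    # DeepFace returns percentages; divide by 100 to obtain 0-1 proportions
    return dict(zip(EMOTIONS, np.mean(scores, axis=0) / 100))
```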
5 Evaluation
In order to evaluate this work, the diarization component was tested first. One hundred random samples, in which patients and clinicians had been differentiated by the pipeline, were selected from the dataset and manually reviewed. The pipeline
was tested by analyzing the initial few minutes of the selected samples along with the diarization output, as the entire files were not analyzed due to some samples being up to an hour long. The pipeline correctly distinguished ninety-four out of the one hundred samples. The remaining six samples were incorrectly labeled in terms of differentiating between the clinician and the patient while they were speaking. As a result, the accuracy rate in distinguishing between patients and clinicians was 94%. This differentiation task is important for the ultimate goal of analyzing the facial expressions of patients while they are listening to the clinician. The samples that were incorrectly distinguished were those where the clinician had to speak extensively to maintain the conversation and assist patients who were not able to communicate fluently. In such cases, the pipeline identified the clinician as the patient, since it considers the person who speaks more to be the patient. Additionally, low recording quality was another reason for incorrect labeling. Nonetheless, the pipeline distinguishes the majority of cases correctly. On the other hand, the transcription text has not been evaluated, since it is not ultimately used in this work.
Fig. 2. Mean of emotions represented in the analysis over patients while listening to clinicians’ speech
The other evaluation metric involved comparing the results obtained from both aphasic and healthy patients in terms of emotion recognition. The results are shown in Fig. 2 and in more detail in Table 1, with an individual percentage for each corpus provided in the dataset. These metrics represent the average emotions displayed in the patients' facial expressions during the interview, expressed as a proportion between zero and one (for example, 0.5 means that an emotion accounts for half of all detected emotion). The most notable difference was observed in the mean value of the "angriness" emotion. This finding was not surprising, as patients may experience frustration and anger due to difficulties in understanding the speech of the clinician. Similarly, although it represents a small proportion of the mean of the emotions, the aphasic patients showed double the proportion of the "disgust" emotion compared to the healthy patients. Other significant differences were observed in the proportions of the "fear", "surprise", and "neutral" emotions. The lower proportion of "fear" and "surprise" and the
higher proportion of the "neutral" emotion may be due to the difficulty in understanding the speech. In the case of not understanding the clinician's speech, patients may not show fear or surprise as healthy patients would when they fully comprehend a sentence and are surprised by its content. Additionally, the higher proportion of the "neutral" emotion may result from the lack of expression due to poor speech comprehension.

Table 1. Average emotion detection in the different corpuses of the dataset

Corpus | Angry | Disgust | Fear | Happy | Sad | Surprise | Neutral
Control
Wright | 0.153 | 0.021 | 0.238 | 0.041 | 0.409 | 0.044 | 0.094
Capilouto | 0.053 | 0.003 | 0.121 | 0.001 | 0.816 | 0.000 | 0.006
Kempler | 0.019 | 0.005 | 0.262 | 0.224 | 0.459 | 0.008 | 0.022
Richardson | 0.000 | 0.000 | 0.062 | 0.655 | 0.130 | 0.000 | 0.153
MSU | 0.083 | 0.000 | 0.088 | 0.139 | 0.475 | 0.001 | 0.214
Total | 0.122 | 0.003 | 0.221 | 0.126 | 0.329 | 0.025 | 0.172
Aphasia
Wright | 0.160 | 0.000 | 0.137 | 0.343 | 0.127 | 0.001 | 0.232
Thompson | 0.113 | 0.000 | 0.105 | 0.347 | 0.169 | 0.040 | 0.226
Adler | 0.088 | 0.000 | 0.532 | 0.041 | 0.105 | 0.046 | 0.187
UNH | 0.344 | 0.001 | 0.343 | 0.045 | 0.235 | 0.001 | 0.031
STAR | 0.554 | 0.008 | 0.074 | 0.021 | 0.333 | 0.000 | 0.011
TAP | 0.359 | 0.019 | 0.267 | 0.032 | 0.265 | 0.002 | 0.056
Garrett | 0.051 | 0.000 | 0.220 | 0.212 | 0.068 | 0.000 | 0.449
Whiteside | 0.350 | 0.001 | 0.153 | 0.143 | 0.221 | 0.031 | 0.100
Tucson | 0.114 | 0.000 | 0.066 | 0.101 | 0.481 | 0.001 | 0.238
Fridriksson | 0.040 | 0.000 | 0.109 | 0.050 | 0.450 | 0.007 | 0.343
UCL | 0.296 | 0.000 | 0.139 | 0.024 | 0.198 | 0.029 | 0.313
TCU | 0.137 | 0.001 | 0.073 | 0.025 | 0.742 | 0.000 | 0.022
Elman | 0.256 | 0.000 | 0.089 | 0.085 | 0.549 | 0.000 | 0.021
CMU | 0.231 | 0.000 | 0.132 | 0.477 | 0.059 | 0.002 | 0.098
Kurland | 0.572 | 0.001 | 0.035 | 0.092 | 0.250 | 0.000 | 0.050
TCU-bi | 0.079 | 0.000 | 0.062 | 0.339 | 0.316 | 0.000 | 0.204
Kempler | 0.189 | 0.006 | 0.067 | 0.388 | 0.298 | 0.004 | 0.048
Kansas | 0.132 | 0.000 | 0.254 | 0.113 | 0.373 | 0.000 | 0.127
SCALE | 0.296 | 0.004 | 0.109 | 0.112 | 0.261 | 0.016 | 0.201
ACWT | 0.119 | 0.004 | 0.072 | 0.341 | 0.100 | 0.017 | 0.347
Wozniak | 0.266 | 0.002 | 0.088 | 0.039 | 0.322 | 0.041 | 0.242
MSU | 0.064 | 0.004 | 0.014 | 0.073 | 0.484 | 0.000 | 0.361
Williamson | 0.025 | 0.001 | 0.001 | 0.019 | 0.226 | 0.000 | 0.728
BU | 0.509 | 0.002 | 0.152 | 0.053 | 0.127 | 0.024 | 0.134
Total | 0.201 | 0.006 | 0.150 | 0.109 | 0.32 | 0.013 | 0.198
6 Conclusion
This work proposes a pipeline to analyze video recordings of aphasia patients, with the aim of obtaining time-lapses of the moments where patients are listening to their interviewer. To achieve this goal, research was conducted in the area of Automatic Speech Recognition and speaker differentiation. Based on this research, two models, namely Whisper and speaker-diarization, were selected to develop the pipeline. The effectiveness of the pipeline was evaluated by manually reviewing the beginning of one hundred randomly selected video samples from the dataset used. The pipeline was also used to recognize emotions in both healthy and aphasia patients. The DeepFace library was utilized to detect emotions from the facial expressions of patients. The study found that aphasia patients express different emotions than healthy patients when listening to someone's speech, mainly due to their difficulties in understanding and expressing speech, which negatively influences their mood. This analysis of their emotional state can help improve their interactions by avoiding conversations that may have a negative impact on their mood. Future work in this field includes proposing and deploying a more complex system for analyzing patients' facial expressions. The new system would include additional features beyond emotions to identify other facial expression patterns between healthy and aphasia patients. Another idea is to analyze the transcriptions of the different samples to identify patterns in what patients express and listen to that could lead to further interesting research. With this transcription analysis, a deeper emotional analysis can be implemented to identify the type of speech that has a negative impact on their mood. Finally, the study plans to expand the project to other similar conditions, such as traumatic brain injuries, to explore their effects on patients.

Acknowledgment. We would like to thank the "A way of making Europe" European Regional Development Fund (ERDF) and MCIN/AEI/10.13039/501100011033 for supporting this work under the TED2021-130890B (CHAN-TWIN) research project funded by MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR, the MoDeaAS project (grant PID2019-104818RB-I00) and the AICARE project (grant SPID202200X139779IV0), as well as the HORIZON-MSCA-2021-SE-0 action number 101086387, REMARKABLE (Rural Environmental Monitoring via ultra wide-ARea networKs And distriButed federated Learning). Furthermore, we would like to thank Nvidia for their generous hardware donation that made these experiments possible.
References
1. Baevski, A., Zhou, H., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations (2020)
2. Bredin, H., Laurent, A.: End-to-end speaker segmentation for overlap-aware resegmentation. In: Proceedings of Interspeech 2021, Brno, Czech Republic (2021)
3. Bredin, H., et al.: Pyannote.audio: neural building blocks for speaker diarization. In: ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain (2020)
4. Elbourn, E., Kenny, B., Power, E., Togher, L.: Psychosocial outcomes of severe traumatic brain injury in relation to discourse recovery: a longitudinal study up to 1 year post-injury. Am. J. Speech-Lang. Pathol. 28, 1–16 (2019). https://doi.org/10.1044/2019_AJSLP-18-0204
5. Fernández Montenegro, J.M., Villarini, B., Angelopoulou, A., Kapetanios, E., Garcia-Rodriguez, J., Argyriou, V.: A survey of Alzheimer's disease early diagnosis methods for cognitive assessment. Sensors 20(24) (2020). https://doi.org/10.3390/s20247292
6. Forbes, M., Fromm, D., MacWhinney, B.: AphasiaBank: a resource for clinicians. In: Seminars in Speech and Language, vol. 33, pp. 217–222 (2012). https://doi.org/10.1055/s-0032-1320041
7. Gomez-Donoso, F., et al.: A robotic platform for customized and interactive rehabilitation of persons with disabilities. Pattern Recognit. Lett. 99, 105–113 (2017). https://doi.org/10.1016/j.patrec.2017.05.027
8. López-de-Ipiña, K., et al.: On the selection of non-invasive methods based on speech analysis oriented to automatic Alzheimer disease diagnosis. Sensors 13(5), 6730–6745 (2013). https://doi.org/10.3390/s130506730
9. Jiang, Y.E., Liao, X.Y., Liu, N.: Applying core lexicon analysis in patients with anomic aphasia: based on Mandarin AphasiaBank. Int. J. Lang. Commun. Disord. (2023). https://doi.org/10.1111/1460-6984.12864
10. Johns Hopkins Medicine: Aphasia. https://www.hopkinsmedicine.org/health/conditions-and-diseases/aphasia
11. Lanzi, A., Saylor, A., Fromm, D., Liu, H., MacWhinney, B., Cohen, M.: DementiaBank: theoretical rationale, protocol, and illustrative analyses. Am. J. Speech-Lang. Pathol. 32, 1–13 (2023). https://doi.org/10.1044/2022_AJSLP-22-00281
12. MacWhinney, B.: The CHILDES project: tools for analyzing talk. Child Lang. Teach. Ther. 8 (2000). https://doi.org/10.1177/026565909200800211
13. Mayo Clinic: Aphasia (2022). https://www.mayoclinic.org/diseases-conditions/aphasia/symptoms-causes/syc-20369518
14. Minga, J., Johnson, M., Blake, M., Fromm, D., MacWhinney, B.: Making sense of right hemisphere discourse using RHDBank. Top. Lang. Disord. 41, 99–122 (2021). https://doi.org/10.1097/TLD.0000000000000244
15. National Institute of Mental Health: What is aphasia? - types, causes and treatment. https://www.nidcd.nih.gov/health/aphasia
16. Ouden, D.B., Malyutina, S., Richardson, J.: Verb argument structure in narrative speech: mining the AphasiaBank. Front. Psychol. 6 (2015). https://doi.org/10.3389/conf.fpsyg.2015.65.00085
17. Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision (2022)
18. Revuelta, F.F., Chamizo, J.M.G., Garcia-Rodriguez, J., Sáez, A.H.: Representation of 2D objects with a topology preserving network. In: Quereda, J.M.I., Micó, L. (eds.) Pattern Recognition in Information Systems, Proceedings of the 2nd International Workshop on Pattern Recognition in Information Systems, PRIS 2002, in conjunction with ICEIS 2002, Ciudad Real, Spain, April 2002, pp. 267–276. ICEIS Press (2002)
19. Serengil, S.I., Ozpinar, A.: LightFace: a hybrid deep face recognition framework. In: 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), pp. 23–27. IEEE (2020). https://doi.org/10.1109/ASYU50717.2020.9259802
20. Serengil, S.I., Ozpinar, A.: HyperExtended LightFace: a facial attribute analysis framework. In: 2021 International Conference on Engineering and Emerging Technologies (ICEET), pp. 1–4. IEEE (2021). https://doi.org/10.1109/ICEET53442.2021.9659697
21. Serengil, S.I., Ozpinar, A.: An evaluation of SQL and NoSQL databases for facial recognition pipelines (2023). https://www.cambridge.org/engage/coe/article-details/63f3e5541d2d184063d4f569. https://doi.org/10.33774/coe-2023-18rcn
22. Torre, I.G., Romero, M., Álvarez, A.: Improving aphasic speech recognition by using novel semi-supervised learning methods on AphasiaBank for English and Spanish. Appl. Sci. 11(19) (2021). https://doi.org/10.3390/app11198872
23. Zhao, S., Rudzicz, F., Carvalho, L.G., Marquez-Chin, C., Livingstone, S.: Automatic detection of expressed emotion in Parkinson's disease. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4813–4817 (2014). https://doi.org/10.1109/ICASSP.2014.6854516
Defect Detection in Batavia Woven Fabrics by Means of Convolutional Neural Networks

Nuria Velasco-Pérez1, Samuel Lozano-Juárez1, Beatriz Gil-Arroyo1, Juan Marcos Sanz2, Nuño Basurto1, Daniel Urda1, and Álvaro Herrero1(B)

1 Grupo de Inteligencia Computacional Aplicada (GICAP), Departamento de Digitalización, Escuela Politécnica Superior, Universidad de Burgos, Av. Cantabria s/n, 09006 Burgos, Spain
{nuriavp,sljuarez,bgarroyo,nbasurto,durda,ahcosio}@ubu.es
2 Textil Santanderina, Cabezón de la Sal, Spain
[email protected]
Abstract. Quality control is one of the key stages of any manufacturing process in general and in the textile sector in particular. At present, the inspection process in the textile industry is carried out mainly through visual inspection by qualified workers, as commercial solutions suffer from major shortcomings. Thus, the present study proposes and validates Deep Learning models applied to automatically control the quality of fabrics produced under high-speed, complex, and up-to-date production conditions. More precisely, Convolutional Neural Networks are validated on real-life images gathered from the production line of Batavia yarns. The satisfactory results obtained in experimentation encourage the application of such models to this complex task.

Keywords: Defect detection · Textile · Industry 4.0 · Deep Learning · Convolutional neural networks · Image analysis

1 Introduction and Previous Work
In recent years, many unsolved problems in industrial domains have been approached with soft computing techniques [2]. Among them, there are still some open challenges in the textile industry. It is still a very traditional sector, in which new technologies have barely penetrated. This is mainly because the companies that compose it are usually long-established companies that have survived thanks to knowing how to adapt to new trends and have been able to compete with low-cost companies. Nevertheless, the textile sector is evolving towards greater competitiveness and an increase in the required quality standards. Currently, quality control methods are nearing obsolescence and present innumerable limitations. Consequently, the need arises to evaluate the possible incorporation of new automated and intelligent inspection methods.
Currently, the inspection process in the textile sector is carried out mainly through visual inspection by qualified (specifically trained) workers. It is done at an average speed of about 20 m/min, and random quality checks are also carried out. The manufacturing speed of the different production lines is around 42 m/min, so the quality inspection phase is a bottleneck. Although satisfactory so far, the textile manufacturing industry is evolving, with increasing demand for products and quality and a parallel increase in manufacturing speeds. Hence, traditional visual inspection is becoming obsolete [8], with drawbacks demanding innovative and cutting-edge solutions [10]. To bridge this gap, this study proposes and validates Deep Learning (DL) models applied to automatically control the quality of fabrics produced at the Textil Santanderina (TS) company. The main goal is to identify defects (warp and weft defects, impurities, bow, skew or colour gradients) generated at any stage of the production process (spinning, weaving, coating finishing, and garment manufacturing). More precisely, Convolutional Neural Networks (CNNs) are proposed due to their superior and widely-validated ability to process images for quality control [9]. Different topologies of such architectures will be benchmarked in order to identify the one best fitting the problem under analysis. However, the industrial use of artificial vision systems, especially for inspection, continues to be limited, since there are still open challenges such as the stoppage of production after the identification of waste, delays in production, wasted resources, or excessive reliance on human factors. Current image-processing systems are available in the market to be applied in the textile industry. However, they are not supported by Artificial Intelligence (AI) to allow fast training and functionality with a high variability rate of products (in TS, batches are between 500 m and 50,000 m, with more than 700 new references per year). The rest of this paper is organized as follows: First, work related to the present study is discussed in Sect. 2. The addressed problem and the generated dataset are introduced in Sect. 3, while the applied methods and the experimental design are described in Sect. 4. Section 5 contains a description of the obtained results. Finally, the main conclusions and proposals for future work are presented in Sect. 6.
2 Related Work
Some commercial systems currently implemented are Uster® EVS FABRIQ VISION, Mahlo GmbH, and PEKAT VISION. They all have some limitations, respectively: i) restricted to the end of the inspection (not during the manufacturing process) and to uniform production batches, ii) limited to defects not affecting the fabric's own structure, and iii) only for the detection of easily visible errors in long batches, since it does not have an AI learning system for learning patterns and series. In the scientific literature, DL architectures have been widely applied to the quality control of manufactured products [7,11]. Particularly, in the industrial field CNNs have shown great potential in different fields such as manufacturing
and quality control [3,12,13]. In the textile sector, some authors have approached the defect detection task by applying traditional approaches [15], while others have proposed the application of DL architectures. One of the pioneering works in applying CNNs from a supervised standpoint is [5]. Although interesting results were obtained, the proposed CNNs were validated on open but outdated datasets that do not comply with the present requirements of the textile industry. Other DL models have also been validated [14] to detect many defects in the same fabric picture. However, this is not a realistic situation in the context of our problem, as simultaneous defects are very infrequent. Other authors have proposed an alternative approach based on unsupervised learning. Firstly, an approach was proposed in [6] for reducing the effort and time spent on hyper-parameter initialization and fine-tuning. Later on, a Variational Autoencoder was proposed in [16], trained on unlabelled defect-free pictures to save labelling efforts. In our study this is not a key issue, as defects are currently identified and labelled by human operators. All in all, it can be said that these previous proposals have been applied offline and have not been validated in real settings with high manufacturing speeds. Thus, the present research goes one step further by validating CNNs under high-speed, complex, and up-to-date production conditions.
3 Case Study
As previously stated, in the present work CNNs are applied to images collected from the fabric production at TS. It is a leading European company in developing and manufacturing yarns, fabrics, finishes and garments made from pure or blended materials. The company, founded in December 1960, has a long history in the textile market, currently offering a large variety of products for the fashion/clothing sector, sportswear sector, healthcare and medical sector, military and public sector, and protective industry sectors. The company is one of the few textile companies in Europe having a vertical supply chain that covers the whole manufacturing process of fabrics: spinning, weaving, coating finishing, garment manufacturing and logistics. This superb level of integration and coordination enables the company to supply processed products at any stage of production, with any type of finish. Its total worldwide production is more than 40 million meters of fabrics and more than 7 million units of garments per year. To meet the goals of this research, the company has installed a Basler raL line-scan camera with the Awaiba DR-12k-3.5 CMOS sensor, which delivers an 8 kHz line rate at 12k resolution. It works in the VIS-NIR bandwidth and has Basler proprietary optics that allow a resolution of 10 px/mm. For homogeneous lighting, a LED array that emits at 850 nm with a field of 15 mm × 1510 mm, using 360 infrared (IR) 850 nm LEDs, has been placed 20 cm over the fabric with an angle of incidence of 15°. Illumination should be in the IR range in order not to interfere with the current D65 standard illuminating systems for visual inspection. A picture of the setting in the factory is shown in Fig. 1. The camera is synchronized with the fabric through a 10-bit inductive encoder for high precision.

Each bit of the BSIF code is set to 1 if the corresponding filter response is greater than 0 and to 0 otherwise. To learn the set of filters, a training set including patches of natural images is used. This learning process should maximise the statistical independence of the filter responses [11]. Similar to LBP coding, since the code string is a local descriptor of the image intensity pattern in the vicinity of the pixel, the histogram of code values will encode the texture properties of the image. Once the BSIF code histogram for an iris pattern is obtained, its bins are normalised to have a mean value equal to 0 and a standard deviation of 1 (z-score normalisation). Figure 4 shows the BSIF 8-9 codes and the associated histograms calculated for one filter size for an authentic iris image and an iris image with a cosmetic contact lens.
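To make the encoding concrete, the following is a minimal sketch of computing a BSIF code histogram for a normalised iris image; the ICA-learned filter bank is assumed to be already available (for instance, the filters distributed with the open-source baseline of [12]) and is not derived here.

```python
import numpy as np
from scipy.signal import convolve2d

def bsif_histogram(img, filters):
    """img: 2-D normalised iris image (e.g., 64 x 512); filters: (n, s, s) bank."""
    n = filters.shape[0]
    code = np.zeros(img.shape, dtype=np.int64)
    for i in range(n):
        resp = convolve2d(img, filters[i], mode="same", boundary="symm")
        code += (resp > 0).astype(np.int64) << i  # binarise responses, pack bits
    hist, _ = np.histogram(code, bins=2 ** n, range=(0, 2 ** n))
    hist = hist.astype(float)
    return (hist - hist.mean()) / hist.std()  # z-score normalisation of the bins
```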
4.2 Building the Ensemble of Classifiers
In order to determine the best ensemble of classifiers for our specific case, we follow the proposal by McGrath et al. [12]. Thus, we build all BSIF combinations with the number of bits n ranging from 5 to 12 and the filter size s from 3 to 34. However, contrary to previous approaches, we build the BSIF encoding using the normalised iris image (64 × 512 pixels) as input (see Fig. 3).
Fig. 4. 8-bit BSIF codes and the resulting histograms calculated at one filter size for an iris image without and with a cosmetic contact lens.
Using NDCLD'15 [7] as the training dataset, two classifiers (SVM and MLP) are trained for each BSIF histogram. In total, we have 232 models, which are named by adding the term 'svm' or 'mp' to the BSIF encoding (e.g., BSIF-8-26-svm uses the BSIF-8-26 encoding and an SVM classifier).
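A sketch of this training step is given below, assuming scikit-learn classifiers; the SVM kernel and the MLP hyper-parameters are not specified in the text and are illustrative defaults here.

```python
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def train_model_pair(X_train, y_train):
    """X_train: one z-scored BSIF histogram per image; y_train: lens labels."""
    svm = SVC(kernel="linear").fit(X_train, y_train)
    mlp = MLPClassifier(max_iter=500).fit(X_train, y_train)
    return svm, mlp  # e.g., stored as BSIF-8-26-svm and BSIF-8-26-mp
```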
5 Results
We used the IIITD-CogentScanner dataset [6] for the validation of the ensemble. As described in Sect. 3, before starting the iris segmentation and normalization, a contrast test is used to discard blurred images. The 232 models obtained in the training step are tested against the validation set one by one, each yielding an individual correct classification rate (CCR). These models are then sorted according to their CCR and added one by one (from best to worst) under a majority voting test in order to obtain the optimal CCR; this procedure is sketched below. As shown in Fig. 5, the best CCR value (97.30%) is obtained with the first three SVM models (BSIF-12-18-svm, BSIF-10-24-svm and BSIF-11-26-svm). These three models provide the best performance on the IIITD-CogentScanner dataset, being the best candidates for our current implementation of iPAD in the IAAD system. Using more models also gives good performance (above 97%), although this value drops below 97% when 40 models or more are used.
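The sort-and-add procedure can be sketched as follows, assuming binary predictions from the trained models; the function names are illustrative.

```python
import numpy as np

def majority_ccr(preds, y):
    """preds: (n_models, n_samples) array of 0/1 predictions."""
    votes = (preds.mean(axis=0) > 0.5).astype(int)  # majority voting
    return (votes == y).mean()

def greedy_ensemble(models, X_val, y_val):
    preds = np.array([m.predict(X_val) for m in models])
    order = np.argsort([-(p == y_val).mean() for p in preds])  # best CCR first
    best_ccr, best_k = 0.0, 0
    for k in range(1, len(models) + 1):
        ccr = majority_ccr(preds[order[:k]], y_val)
        if ccr > best_ccr:
            best_ccr, best_k = ccr, k
    return [models[i] for i in order[:best_k]], best_ccr
```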
5.1 Cross-Dataset Testing
Once the best models were obtained from the validation set, we used them to classify two different datasets: the complete IIITD dataset and a dataset from SHS Consultores SL. The SHS dataset is composed of 1838 images with cosmetic contact lenses and 351 with transparent lenses, captured using the AIRIM system.
Fig. 5. Classification on the IIITD-Cogent dataset. The figure provides the results of each individual classifier and of the ensemble versions (see text). For the sake of simplicity we show only the first 40 models.
The three SVM models selected previously using the IIITD-CogentScanner dataset gave us a CCR of 94.65% on IIITD and 84.28% on SHS (without discarding images by contrast). These results are similar to those provided by recently proposed algorithms: evaluated on the IIITD dataset using an intra-database training-testing protocol [19], MVANet [18] provides a CCR of 94.99% and ELF [19] of 96.04%. The results obtained when evaluating the entire IIITD database validate the hypothesis that using the iris pattern yields slightly better descriptors than using the eye region as input. Thus, following the method proposed by McGrath et al. [12] (without using the normalised iris image as input), and using the complete IIITD dataset for validating the ensemble, we obtained a CCR value of 93%.
6 Conclusions and Future Work
This paper describes an iPAD approach to be embedded in an IAAD framework. The approach meets the requirements of relatively low resource consumption and the ability to generate the result quickly. For this purpose, different options have been evaluated, resulting in an ensemble of classifiers that only requires three SVMs. As the main novelty compared to other proposals [12,13], our iPAD approach does not consider the use of non-segmented iris images. The ultimate goal of the complete system is to deploy it at an entrance control, where security personnel are present. In this scenario, the typical attack will be the use of cosmetic contact lenses. As discussed above, iris segmentation is beneficial in this situation. In our case, it is also a necessary step, as the system needs the iris pattern to implement user identification. Experiments have shown that
the proposed set of classifiers gives promising results on the databases studied. Thus, the result of our proposal on the IIITD database is very similar to that obtained by recent methods based on Deep Learning. However, the complexity of the MVANet and ELF approaches is significantly higher than that of our proposal, which in our case is also a relevant factor, given that our objective is to embed it in the MPSoC. Future work focuses on embedding the iPAD approach in the MPSoC framework where the eye detector is currently running. Finally, the size and diversity of the SHS database should also be increased. As it is captured using the AIRIM system itself, it is quite likely that an ensemble of classifiers obtained using this database (and not the IIITD-CogentScanner) will allow better results when implemented in AIRIM.

Acknowledgements. This work has been partly supported by grant CPP2021008931 funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR, and by projects TED2021-131739B-C21 and PDC2022-133597-C42, funded by the Gobierno de España and FEDER funds.
References
1. Mordor Intelligence: Global Iris Recognition Market (2022–2027) (2022)
2. Nguyen, K., Fookes, C., Jillela, R., Sridharan, S., Ross, A.: Long range iris recognition: a survey. Pattern Recogn. 72, 123–143 (2017)
3. Ruiz-Beltrán, C.A., Romero-Garcés, A., González, M., Sánchez-Pedraza, A., Rodríguez-Fernández, J.A., Bandera, A.: Real-time embedded eye detection system. Expert Syst. Appl. 194, 116505 (2022). https://doi.org/10.1016/j.eswa.2022.116505
4. Baker, S.E., Hentz, A., Bowyer, K., Flynn, P.J.: Degradation of iris recognition performance due to non-cosmetic prescription contact lenses. Comput. Vis. Image Understand. 114(9), 1030–1044 (2010)
5. Kohli, N., Yadav, D., Vatsa, M., Singh, R.: Revisiting iris recognition with color cosmetic contact lenses. In: 2013 International Conference on Biometrics (ICB), Madrid, Spain, pp. 1–7 (2013). https://doi.org/10.1109/ICB.2013.6613021
6. Yadav, D., Kohli, N., Doyle, J.S., Singh, R., Vatsa, M., Bowyer, K.W.: Unraveling the effect of textured contact lenses on iris recognition. IEEE Trans. Inf. Forensics Secur. 9(5), 851–862 (2014)
7. Yambay, D., et al.: LivDet iris 2017 - iris liveness detection competition 2017. In: 2017 IEEE International Joint Conference on Biometrics (IJCB), Denver, CO, USA, pp. 733–741 (2017). https://doi.org/10.1109/BTAS.2017.8272763
8. Doyle, J.S., Flynn, P.J., Bowyer, K.W.: Automated classification of contact lens type in iris images. In: 2013 International Conference on Biometrics (ICB), Madrid, Spain, pp. 1–6 (2013). https://doi.org/10.1109/ICB.2013.6612954
9. Sequeira, A., Thavalengal, S., Ferryman, J., Corcoran, P., Cardoso, J.S.: A realistic evaluation of iris presentation attack detection. In: 2016 39th International Conference on Telecommunications and Signal Processing (TSP), Vienna, Austria, pp. 660–664 (2016). https://doi.org/10.1109/TSP.2016.7760965
10. Pala, F., Bhanu, B.: Iris liveness detection by relative distance comparisons. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, pp. 664–671 (2017). https://doi.org/10.1109/CVPRW.2017.95
11. Komulainen, J., Hadid, A., Pietikainen, M.: Contact lens detection in iris images. In: Rathgeb, C., Busch, C. (eds.) Iris and Periocular Biometric Recognition, Chapter 12, pp. 265–290. IET, London, UK (2017)
12. McGrath, J., Bowyer, K.W., Czajka, A.: Open source presentation attack detection baseline for iris recognition. CoRR, abs/1809.10172 (2018). http://arxiv.org/abs/1809.10172
13. Doyle, S., Bowyer, K.W.: Robust detection of textured contact lenses in iris recognition using BSIF. IEEE Access 3, 1672–1683 (2015). https://doi.org/10.1109/ACCESS.2015.2477470
14. Raghavendra, R., Raja, K.B., Busch, C.: ContlensNet: robust iris contact lens detection using deep convolutional neural networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, pp. 1160–1167 (2017). https://doi.org/10.1109/WACV.2017.134
15. Tapia, J.E., Gonzalez, S., Busch, C.: Iris liveness detection using a cascade of dedicated deep learning networks. IEEE Trans. Inf. Forensics Secur. 17, 42–52 (2022). https://doi.org/10.1109/TIFS.2021.3132582
16. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 4510–4520 (2018). https://doi.org/10.1109/CVPR.2018.00474
17. Yadav, D., Kohli, N., Agarwal, A., Vatsa, M., Singh, R., Noore, A.: Fusion of handcrafted and deep learning features for large-scale multiple iris presentation attack detection. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, pp. 685–6857 (2018). https://doi.org/10.1109/CVPRW.2018.00099
18. Gupta, M., Singh, V., Agarwal, A., Vatsa, M., Singh, R.: Generalized iris presentation attack detection algorithm under cross-database settings. In: Proceedings of IEEE ICPR, pp. 5318–5325 (2020)
19. Agarwal, A., Noore, A., Vatsa, M., Singh, R.: Generalized contact lens iris presentation attack detection. IEEE Trans. Biom. Behav. Identity Sci. 4(3), 373–385 (2022). https://doi.org/10.1109/TBIOM.2022.3177669
20. Dronky, M.R., Khalifa, W., Roushdy, M.: Impact of segmentation on iris liveness detection. In: 2019 14th International Conference on Computer Engineering and Systems (ICCES), Cairo, Egypt, pp. 386–392 (2019). https://doi.org/10.1109/ICCES48960.2019.9068147
21. Dronky, M.R., Khalifa, W., Roushdy, M.: Using residual images with BSIF for iris liveness detection. Expert Syst. Appl. 182, 115266 (2021). https://doi.org/10.1016/j.eswa.2021.115266
22. Othman, N., Dorizzi, B., Garcia-Salicetti, S.: OSIRIS: an open source iris recognition software. Pattern Recogn. Lett. 82, 124–131 (2016). https://doi.org/10.1016/j.patrec.2015.09.002
23. Gragnaniello, D., Poggi, G., Sansone, C., Verdoliva, L.: Using iris and sclera for detection and classification of contact lenses. Pattern Recogn. Lett. 82(2), 251–257 (2016). https://doi.org/10.1016/j.patrec.2015.10.009
Vehicle Warning System Based on Road Curvature Effect Using CNN and LSTM Neural Networks

F. Barreno1(B), Matilde Santos2, and M. Romana3

1 Computer Science Faculty, Complutense University of Madrid, 28040 Madrid, Spain
[email protected]
2 Institute of Knowledge Technology, Complutense University of Madrid, 28040 Madrid, Spain
[email protected]
3 Civil Engineering School, Technical University of Madrid, 28040 Madrid, Spain
[email protected]
Abstract. This work proposes an intelligent estimation system to classify dangerous turning maneuvers on roads, taking the dynamics of the vehicle into account. A convolutional neural network (CNN) and a long short-term memory (LSTM) model are applied. The vehicle's dynamic characteristics are measured by inertial sensors incorporated in the vehicle. Actual data gathered from the CAN (Controller Area Network) bus, namely acceleration measures, gyroscope readings, and steering angle, are used to classify the driving maneuvers. Urban, rural and motorway roads located in the south of Germany are used as real scenarios in this investigation. The findings obtained with the proposed neural models are encouraging and indicate that this artificial intelligence approach may indeed be used to alert the user of unsafe or risky behavior in real time, with the aim of making driving more comfortable and safer. Keywords: Convolutional neural networks (CNN) · LSTM · Deep learning · CAN bus sensors · driving · identification · classification · safety · roads
1 Introduction

Driver behavior on the road is currently the subject of special attention in terms of driving safety. Recognizing the many types of driving behavior is essential to ensure safety, especially in automated vehicles [1]. But even the most experienced drivers may be affected by the road network, which deteriorates through wear and tear, natural phenomena, etc., and this could turn an otherwise safe driving maneuver into a dangerous one. While the user experience is what matters in terms of driving, this experience is highly influenced by the road condition. Driver behavior is the most important factor for safety and comfort, but road geometry must also be taken into account as it affects driving style [2, 3]. For example, how the user perceives the different geometric characteristics of the road, and whether he/she is on a rural road or a motorway, may cause different driving behaviors and, therefore, risky maneuvers [4].
The vehicle's dynamics enable measurement-based identification of the many types of maneuvers. These measures consist of inertial data, such as GPS data and accelerometer and gyroscope readings, which have proven useful in this field. It has been shown that CAN bus vehicle sensors can be used to create a driver profile by identifying braking and turning events [5]. These data can be used to characterize driving behavior and then to identify whether a maneuver is safe or risky. The objective of this work is to identify whether a driver's maneuver can be considered abnormal, and therefore dangerous, using measurements from inertial sensors and other measurements obtained from the CAN bus. Real data from urban and rural roads and motorways in southern Germany have been used, in particular from Gaimersheim, Ingolstadt and Munich. To identify dangerous maneuvers, a Soft Computing technique is applied, in particular convolutional neural networks. The intelligent model proposed in this article focuses on the characterization of driving in cornering maneuvers, in such a way that it includes the geometry of the road through characteristics related to it, such as the longitudinal and lateral accelerations of the vehicle perceived by the driver in curves [6, 7]. In other words, the principal contribution of this paper is the identification of driving anomalies based on the driver's perception of the vehicle accelerations related to the road geometry, since the speed in a curve might prove to be too high, and therefore unsafe, for a certain road segment. These encouraging results help define a driving style that is more conducive to safe driving. The use of inertial sensors and information from the CAN bus of the vehicle is also supported by this study; the CAN bus is a system integrated in all vehicles that provides information that can be used to increase road safety and reduce traffic accidents. Analysis of driving maneuvers has been previously addressed with artificial intelligence techniques, which confirms the appropriateness of such approaches. Indeed, one approach to analyzing driver behavior is to study specific maneuvers, since these events provide useful information about driving style. Recently, deep learning approaches have proven to be an effective solution for time series modeling due to their ability to automatically learn the time dependencies present in the series [8]. As a supervised approach, convolutional neural networks (CNNs) are frequently utilized in pattern recognition applications. This kind of network is effective at handling time series data [9] and is frequently utilized in image recognition models [10]. To mention some examples in this particular field, by detecting facial landmarks via a camera and using convolutional neural networks, it is possible to classify driver drowsiness [11]. In [12], an LSTM-based ACC system is proposed that can learn from previous driving experiences and adapt to and predict new situations in real time; the system is evaluated with aggressive lane changes of the preceding vehicle, forcing the driver to reduce speed. In [13], a new hybrid framework of convolutional neural network (CNN) and attention-based long short-term memory (LSTM), called DSDCLA, is proposed with the aim of determining the driving style. The main difference between this work and the previously cited ones lies in the consideration of the effect of the curvature of the road while driving. The rest of the article is organized as follows.
Section 2 introduces the vehicle dynamics approach, which forms the framework of the proposed identification
system. Section 3 details the model that identifies when a maneuver is anomalous using real data. The results are presented in Sect. 4. The conclusions and possible future work bring the article to a close.
2 Road Curvature-Based Dynamics of the Vehicle

While travelling, the vehicle is affected by forces that define its dynamics. Based on Ackermann's simplified model of a vehicle [14], the parameters considered to define it are the radius of curvature of the trajectory, the vehicle wheelbase and the angle of the front wheel. In addition, when a vehicle turns in a curve, the higher the vehicle's speed, the higher the centrifugal force acting on it. On the other hand, the geometry of the road is designed so that a vehicle travels with an appropriate level of comfort and safety at a given speed. If the car does not exceed that speed, driving is expected to be safe; conversely, if the car goes beyond the maximum curve speed, the resulting accelerations become uncomfortable and unsafe. To describe the forces that affect the vehicle, the steering angle is defined as the angle between the front of the vehicle and the steered wheel direction:

δ = tan⁻¹(W/R)    (1)

where δ (arc degree) is the steering angle, W (m) is the wheelbase and R (m) is the radius of curvature. The Ackermann yaw rate (AYR), or idealized geometric yaw rate, is [15]:

AYR = vl · tan(δ) / W    (2)

where AYR (degree/s) is the idealized yaw rate and vl (m/s) is the linear velocity. Understeer (US) is a phenomenon that occurs during driving that causes the actual turning of the vehicle to be less than what should theoretically be induced by the position of the front wheels. As a result, the front of the vehicle tends to run out to the outside of the curve. This is mainly caused by inertia when entering a corner at excessive speed, but also when the front tires are worn or in poor condition, or when the road surface is slippery. That is, the yaw rate is less than the idealized yaw rate, and understeer is defined as follows [16]:

US = AYR / ω    (3)

where ω (degree/s) is the angular velocity. If US > 1, the vehicle understeers. Oversteer is the phenomenon of rear axle slippage that can occur in a car when attempting to corner or when already turning. The car is said to oversteer when the rear wheels do not follow the same path as the front wheels: the rear of the vehicle tries to overtake the front. It is the opposite of understeer, so if US < 1 the vehicle oversteers. The forward acceleration of the vehicle is the longitudinal acceleration, which is parallel to the linear velocity. The lateral acceleration is the acceleration measured in
the vehicle's orthogonal direction. According to the AASHTO Green Book, the road acceleration is computed as [17]:

aroad = g(ρ + ft)    (4)

where g (m/s²) is the acceleration of gravity, aroad (m/s²) is the critical acceleration caused by the effects of the road geometry, ft is the maximum mobilized transversal friction coefficient and ρ is the road cross slope. To keep safety and comfort, the maximum acceleration in a curve is this road acceleration. Since the cross slope and the maximum mobilized friction coefficient are hard to acquire, aroad may instead be calculated from the inertial sensors and the CAN bus as [18]:

aroad = |ω| · vl    (5)

where ω (rad/s) is the angular velocity and vl (m/s) is the linear velocity of the vehicle. According to [15], the lateral acceleration perceived by the driver due to erratic driving behavior is:

ap = |am| − aroad    (6)

where am (m/s²) is the recorded lateral acceleration and ap (m/s²) is the acceleration perceived by the driver. From the driver's perspective, this acceleration represents the sensation of driving as a result of the influence of the road geometry. Figure 1 shows some of these dynamic features of a vehicle on urban and motorway roads (Gaimersheim, Germany) [19]. In the upper image, the steering angle from the CAN bus unit is shown; at the bottom, the angular velocity. The peaks that appear in the graphics correspond to changes of direction during driving.
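As a worked summary of Eqs. (1)-(6), the following sketch computes the idealized yaw rate, the understeer value and the perceived lateral acceleration from synchronized signals; the wheelbase value and the guard against near-zero yaw rates are illustrative assumptions.

```python
import numpy as np

W = 3.0  # wheelbase in metres (illustrative; depends on the vehicle)

def curve_dynamics(v_l, omega_deg, delta_deg, a_lat):
    """v_l: speed (m/s); omega_deg: yaw rate (degree/s);
    delta_deg: steering angle (degree); a_lat: measured lateral acc. (m/s^2)."""
    delta = np.radians(delta_deg)
    ayr = np.degrees(v_l * np.tan(delta) / W)       # Eq. (2), in degree/s
    safe_omega = np.where(np.abs(omega_deg) > 1e-3, omega_deg, np.nan)
    us = ayr / safe_omega                           # Eq. (3): >1 understeer, <1 oversteer
    a_road = np.abs(np.radians(omega_deg)) * v_l    # Eq. (5), m/s^2
    a_p = np.abs(a_lat) - a_road                    # Eq. (6), m/s^2
    return ayr, us, a_road, a_p
```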
Fig. 1. Steering angle and angular velocity recorded in Gaimersheim [19].
3 Risky Maneuvers Identification by CNN and LSTM Models

A2D2 is a public autonomous driving dataset from Audi [19]. This dataset includes data derived from the vehicle CAN bus, recorded on highways, country roads and cities in Munich, Gaimersheim and Ingolstadt (Germany). The dataset provides images and
3D point clouds together with 3D bounding boxes and semantic segmentation. Real-time data from measurements taken by inertial sensors (accelerometers and gyroscopes) are included in the information extracted from the car's CAN bus. The vehicle is an Audi Q7 e-tron. The vehicle bus data are kept in a JSON file, which also includes the timestamps and corresponding units of the signals. The signals include, among others, odometer, acceleration, angular speed, GPS coordinates, braking pressure, and pitch and roll angles.

3.1 Model's Selection of Variables

The variables we use were collected at various sampling rates. The CAN bus sampling rates are 100 Hz, 50 Hz and 25 Hz for accelerations, vehicle speed and steering angle, respectively. To synchronize those measures, the data were resampled at 100 Hz (a sketch of this step is given after Table 1). Longitudinal acceleration, lateral acceleration, yaw rate, vehicle speed and steering angle were chosen as the main characteristics, based on the definition of curve behavior and other studies. Linear velocity, vl (km/h), is given by the CAN bus. Longitudinal acceleration is measured by the accelerometer (x-axis) in the vehicle's forward direction; a positive value means acceleration and a negative one means deceleration (braking). Anomalous accelerations and decelerations that speed up or slow down the vehicle abruptly usually reveal themselves as large values of longitudinal acceleration and deceleration. The accelerometer (y-axis) measures lateral acceleration, and the gyro sensor measures the angular velocity (ω). The steering angle, δ, is given by the CAN bus. These variables help identify the behavior of the vehicle in a curve, as a large increase may be indicative of excessive turning speed. They are related to understeer phenomena, which manifest as large lateral accelerations and a high yaw rate when passing through a curve at excessive speed. As a result, Table 1 contains the list of input variables for the neural network models that will be used to identify unusual turning behavior while driving.

Table 1. Variables of input to CNN and LSTM models.

Features | Descriptions | Units
az | Longitudinal acceleration (from IMU, x-axis) | m/s²
ay | Lateral acceleration (from IMU, y-axis) | m/s²
wz | Angular speed (from gyroscope, z-axis) | degree/s
vl | Linear velocity (from bus CAN) | km/h
δ | Steering angle (from bus CAN) | arc degree
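The synchronization step mentioned above can be sketched as follows, assuming each CAN bus signal is available as a pandas Series indexed by timestamp; the interpolation method is an assumption.

```python
import pandas as pd

def resample_to_100hz(signals):
    """signals: dict of pandas Series indexed by datetime (various rates)."""
    df = pd.DataFrame(signals)            # outer-joins the different timestamps
    df = df.resample("10ms").mean()       # common 100 Hz grid
    return df.interpolate(method="time")  # fill the gaps between slower samples
```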
The stability of drivers’ vehicle control in the longitudinal direction may be seen in the standard deviation of speed and longitudinal acceleration, whereas the stability of
drivers' vehicle control in the lateral direction can be seen in the standard deviation of lateral acceleration [17]. To account for events on the steering angle and the lateral acceleration as perceived by the driver, steering turns were labeled as follows:

x(X) = warning event, if US > 1
x(X) = warning event, if ap > μap    (7)
x(X) = normal event, otherwise

where x is the class assigned to the data sample, US is the understeer value, μap is the mean value of the perceived lateral acceleration, and X is the feature vector made up of the Table 1 variables. The rationale behind this rule describes the typical behavior of a car driver approaching a curve. If the speed when taking the curve is excessive, the lateral acceleration will be high, resulting in faulty vehicle maneuvering, which can lead to understeer while driving. On the other hand, the perceived lateral acceleration, which measures the driver's experience during driving, will be noticeable. The influence of the road geometry is thus obtained indirectly: from the longitudinal and lateral acceleration, yaw rate, steering angle, and vehicle speed, one can determine the vehicle dynamics, which can indicate that the car is moving faster than the posted speed limit. It is therefore feasible to determine the perceived driver acceleration (Eq. (6)) from the chosen vector of vehicle dynamics properties (Table 1), and in this way the effect of the road curvature is included indirectly in the classifier, implicit at each curve transition. Please note that GPS measures are not used in these experiments.

3.2 Deep Convolutional Neural Network Model

In [20] a convolutional neural network is proposed to classify unsafe driving behavior. The inputs are actual measurements obtained from data collected via the CAN bus by sensors mounted inside a moving vehicle. The dataset for the model includes a total of 438129 driving maneuvers, 134881 of which were classified as unsafe. The z-score is used to normalize the features, as the scales of the features vary, and to avoid drastic changes in the network weights:

z = (x − μ) / σ    (8)

where x represents the unstandardized data, μ the mean of the feature vector, σ its standard deviation, and z the standardized data. A sketch of the labelling and normalization steps is given below.
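The following is a compact sketch of the labelling rule (7) and the z-score normalization (8), assuming the understeer values and perceived accelerations computed as in Sect. 2:

```python
import numpy as np

def label_events(us, a_p):
    """Eq. (7): warning if understeering or perceived acc. above its mean."""
    warning = (us > 1) | (a_p > a_p.mean())
    return np.where(warning, "warning", "normal")

def zscore(X):
    """Eq. (8), applied per feature column of the matrix X."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```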
Data from the dataset are selected randomly for training and testing, splitting the set into 50% for training and the rest for testing. After a number of experiments, the best outcomes were attained with the following configuration of the CNN and LSTM parameters. The CNN model consists of an input layer with 5 neurons, one for each feature, and two convolutional layers that apply sliding convolutional filters to the input. Each convolutional layer has a filter of size 3. The first and second convolutional layers have 32 and 64 filters, respectively. For both convolutional layers, the inputs are padded from the left so that the outputs have the same length. The initial learning rate
used in this study was 0.001, and the total number of training epochs is 15. The LSTM model consists of a deep neural network with a hidden layer of 5 neurons, one for each feature, and an LSTM layer with 100 neurons. A fully connected layer is implemented, where a linear transformation is applied to the input vector through a weight matrix. The output values of the fully connected layer are transformed to the range between 0 and 1 in the soft-max layer. Finally, the output is obtained by a classification layer. A learning rate of 0.01 was set, with a batch size of 500 and 15 training epochs. As each simulation lasted about 10 min per experiment, the computing time was not prohibitive. The computer used for both models has an Intel Core i7 1.2 GHz processor with 8 GB RAM. Figure 2 shows both proposed neural network classifiers, and a code sketch of both is given after the figure.
Fig. 2. Proposed CNN and LSTM neural networks.
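The following Keras sketch reflects the configuration described in Sect. 3.2; the window length, the pooling layer before the CNN output and the optimizer choice are assumptions, as the text does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

WINDOW, N_FEATURES = 100, 5  # window length is assumed; 5 features of Table 1

def build_cnn():
    return models.Sequential([
        layers.Input(shape=(WINDOW, N_FEATURES)),
        layers.Conv1D(32, 3, padding="causal", activation="relu"),  # left padding
        layers.Conv1D(64, 3, padding="causal", activation="relu"),
        layers.GlobalAveragePooling1D(),  # assumed reduction before the output
        layers.Dense(2, activation="softmax"),
    ])

def build_lstm():
    return models.Sequential([
        layers.Input(shape=(WINDOW, N_FEATURES)),
        layers.LSTM(100),
        layers.Dense(2, activation="softmax"),  # fully connected + soft-max
    ])

cnn = build_cnn()
cnn.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # initial LR 0.001
            loss="sparse_categorical_crossentropy", metrics=["accuracy"])
lstm = build_lstm()
lstm.compile(optimizer=tf.keras.optimizers.Adam(1e-2),  # LR 0.01
             loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# cnn.fit(X_train, y_train, epochs=15)
# lstm.fit(X_train, y_train, epochs=15, batch_size=500)
```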
4 Results and Discussion

Tables 3 and 5 present the validation results of the CNN and LSTM classifiers, respectively. The F1-score [21] is used to qualify model correctness:

F = 2pr / (p + r)    (9)

The expressions that define accuracy a, precision p, and recall r are the following:

a = (TP + TN) / (TP + FP + TN + FN)    (10)

p = TP / (TP + FP)    (11)

r = TP / (TP + FN)    (12)

where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.
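As a check of Eqs. (9)-(12), the snippet below recomputes the metrics from the Gaimersheim CNN confusion matrix in Table 2; macro-averaging the per-class precision, recall and F1 matches the Gaimersheim row of Table 3 to within rounding.

```python
def prf(tp, fn, fp):
    p = tp / (tp + fp)                 # Eq. (11)
    r = tp / (tp + fn)                 # Eq. (12)
    return p, r, 2 * p * r / (p + r)   # Eq. (9)

cm = [[34544, 209], [329, 17494]]      # rows: true class, columns: predicted
acc = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))   # Eq. (10) -> ~0.9898
p0, r0, f0 = prf(cm[0][0], cm[0][1], cm[1][0])
p1, r1, f1 = prf(cm[1][1], cm[1][0], cm[0][1])
print(acc, (p0 + p1) / 2, (r0 + r1) / 2, (f0 + f1) / 2)
# ~ 0.9898 0.9894 0.9878 0.9886
```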
In Table 2 and Table 4 the confusion matrices for the three locations are shown, providing details on the actual and predicted categories obtained with the CNN and LSTM models. Table 3 and Table 5 show the corresponding validation metrics for the CNN and LSTM models, respectively. Since the majority of the data samples are correctly classified, with high accuracy percentages and large F1-scores, anomalous driving events are adequately detected.

Table 2. Confusion matrices obtained with the CNN classifier (rows: true class, columns: predicted class).

Gaimersheim:
34544 | 209
329 | 17494

Ingolstadt:
92438 | 3937
16180 | 36485

Munich:
112250 | 7233
28614 | 35839
Table 3. Validation results obtained with the CNN classifier.

Driver | Accuracy | Precision | Recall | F1-score
Gaimersheim | 98.97% | 98.93% | 98.77% | 98.85%
Ingolstadt | 86.50% | 87.68% | 82.59% | 84.28%
Munich | 80.51% | 81.44% | 74.77% | 76.74%
Table 4. Confusion matrices obtained with the LSTM classifier (rows: true class, columns: predicted class).

Gaimersheim:
34300 | 453
1238 | 16585

Ingolstadt:
93988 | 2387
14011 | 38654

Munich:
116460 | 3027
30232 | 34221
Table 5. Validation results obtained with the LSTM classifier.

Road          Accuracy   Precision   Recall    F1-score
Gaimersheim   96.78%     95.87%      96.92%    96.37%
Ingolstadt    88.99%     90.60%      85.46%    87.23%
Munich        81.91%     85.63%      75.28%    77.40%
It can be seen that the results for Gaimersheim are better than for the other two cities; this may be due to the fact that the data depend on the vehicle's conditions, such as its suspension. In summary, the CNN and LSTM classifier systems are capable of identifying the majority of on-road driving events and, therefore, of identifying unsafe maneuvers in cornering events. By exploiting the CAN bus measurements already available in all vehicles, this tool can contribute to safer and more comfortable driving on the road. It could help to detect excessive speed when navigating a curve, generating a warning to the driver and thus helping to avoid skidding, drifting off the road or driving into the oncoming lane [22, 23].
5 Conclusions and Future Works

In this article, a deep convolutional neural network has been developed and implemented to classify anomalous driving occurrences in maneuvers on urban, country and motorway roads. An LSTM neural network has also been implemented to compare the performance of both neural-based models. They use as inputs the real measurements of accelerometers, gyroscope, and GPS. Based on those measurements, obtained with the CAN bus incorporated in any current vehicle, it is feasible to determine the perceived driver acceleration as well as the acceleration caused by the road, and to evaluate the understeer phenomenon. This information allows the classification of inefficient or defective driving maneuvers in curve turnings. These CNN- and LSTM-based classifiers produced promising and practical results. Some inferences are possible. Every driving maneuver, including dangerous cornering maneuvers, has been properly classified, and similar results have been obtained for both models. The database contains naturalistic driving information collected in three different areas of southern Germany, on urban, conventional and motorway roads, with the same vehicle. A high success rate has been achieved in the classification. Several intriguing lines of future work are possible. First, additional data could be taken into account if available. The link between driver behavior and road curvature, as well as the perception of the speed and acceleration of other cars on the road, might also be considered, continuing the approach put forward in this article.
References
1. Meiring, G.A.M., Myburgh, H.C.: A review of intelligent driving style analysis systems and related artificial intelligence algorithms. Sensors 15(12), 30653–30682 (2015)
2. Martín, S., Romana, M.G., Santos, M.: Fuzzy model of vehicle delay to determine the level of service of two-lane roads. Expert Syst. Appl. 54, 48–60 (2016)
3. Barreno, F., Romana, M.G., Santos, M.: Fuzzy expert system for road type identification and risk assessment of conventional two-lane roads. Expert Syst. 39(9), e12837 (2022). https://doi.org/10.1111/exsy.12837
4. Wu, C., Yu, D., Doherty, A., Zhang, T., Kust, L., Luo, G.: An investigation of perceived vehicle speed from a driver's perspective. PLoS ONE 12(10), e0185347 (2017)
5. Van Ly, M., Martin, S., Trivedi, M.M.: Driver classification and driving style recognition using inertial sensors. In: 2013 IEEE Intelligent Vehicles Symposium (IV), pp. 1040–1045. IEEE, June 2013
6. Barreno, F., Santos, M., Romana, M.: Abnormal driving behavior identification based on naturalistic driving data using LSTM recurrent neural networks. In: García Bringas, P., et al. (eds.) SOCO 2022, vol. 531, pp. 435–443. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-18050-7_42
7. Barreno, F., Santos, M., Romana, M.G.: A novel adaptive vehicle speed recommender fuzzy system for autonomous vehicles on conventional two-lane roads. Expert Syst., e13046 (2022)
8. Lara-Benítez, P., Carranza-García, M., Riquelme, J.C.: An experimental review on deep learning architectures for time series forecasting. Int. J. Neural Syst. 31(03), 2130001 (2021)
9. Wang, K., et al.: Multiple convolutional neural networks for multivariate time series prediction. Neurocomputing 360, 107–119 (2019)
10. Swaminathan, V., Arora, S., Bansal, R., Rajalakshmi, R.: Autonomous driving system with road sign recognition using convolutional neural networks. In: 2019 International Conference on Computational Intelligence in Data Science (ICCIDS), pp. 1–4. IEEE, February 2019
11. Li, Y., et al.: A CNN-based wearable system for driver drowsiness detection. Sensors 23(7), 3475 (2023)
12. Singh, R., Mozaffari, S., Rezaei, M., Alirezaee, S.: LSTM-based preceding vehicle behaviour prediction during aggressive lane change for ACC application (2023). arXiv preprint arXiv:2305.01095
13. Liu, J., Liu, Y., Li, D., Wang, H., Huang, X., Song, L.: DSDCLA: driving style detection via hybrid CNN-LSTM with multi-level attention fusion. Appl. Intell., 1–18 (2023)
14. Rajamani, R.: Vehicle Dynamics and Control. Springer, New York (2011). https://doi.org/10.1007/978-1-4614-1433-9
15. Renfroe, D.A., Semones, P.T., Roberts, A.: Quantitive measure of transient oversteer of road vehicles (2007)
16. Pacejka, H.: Tire and Vehicle Dynamics. Elsevier, Amsterdam (2005)
17. Transportation Officials: A Policy on Geometric Design of Highways and Streets. AASHTO (2011)
18. Barreno, F., Santos, M., Romana, M.: Fuzzy logic system for risk and energy efficiency estimation of driving maneuvers. In: Gude Prego, J.J., de la Puerta, J.G., García Bringas, P., Quintián, H., Corchado, E. (eds.) CISIS 2021 and ICEUTE 2021, vol. 1400, pp. 94–104. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-87872-6_10
19. Geyer, J., et al.: A2D2: Audi autonomous driving dataset (2020). arXiv preprint arXiv:2004.06320
20. Aboah, A., Adu-Gyamfi, Y., Gursoy, S.V., Merickel, J., Rizzo, M., Sharma, A.: Driver maneuver detection and analysis using time series segmentation and classification. J. Transp. Eng. Part A Syst. 149(3), 04022157 (2023)
21. Wang, X., Xu, R., Zhang, S., Zhuang, Y., Wang, Y.: Driver distraction detection based on vehicle dynamics using naturalistic driving data. Transp. Res. Part C Emerg. Technol. 136, 103561 (2022)
22. Echeto, J., Santos, M., Romana, M.G.: Automated vehicles in swarm configuration: simulation and analysis. Neurocomputing 501, 679–693 (2022)
23. Sánchez, R., Sierra-García, J.E., Santos, M.: Modelado de un AGV híbrido triciclo-diferencial. Revista Iberoamericana de Automática e Informática Industrial 19(1), 84–95 (2022)
Special Session 6: Genetic and Evolutionary Computation in Real World and Industry
Enhancing Time Series Anomaly Detection Using Discretization and Word Embeddings

Lucas Pérez, Nahuel Costa, and Luciano Sánchez(B)

Computer Science Department, Polytechnic School of Engineering, University of Oviedo, Gijón, 33202 Asturias, Spain
{perezlucas,luciano}@uniovi.es
Abstract. Time series anomaly detection plays a pivotal role across diverse fields, including cybersecurity, healthcare and industrial monitoring. While Machine Learning and Deep Learning approaches have shown remarkable performance in these problems, finding a balance between simplicity and accuracy remains a persistent challenge. Also, although the potential of NLP methods is expanding rapidly, their application to time series analysis is still largely unexplored, even though the field could benefit greatly from the properties of latent features. In this paper, we propose WETAD, a novel approach for unsupervised anomaly detection based on the representation of time series data as text, in order to leverage well-established word embeddings. To showcase the performance of the model, a series of experiments were conducted on a diverse set of anomaly detection datasets widely used in the literature. Results demonstrate our approach can compete with and even outperform state-of-the-art approaches with a simple, yet effective model.

Keywords: anomaly detection · word embeddings · time series

1 Introduction
We are currently immersed in a transition to the fourth industrial revolution, also known as Industry 4.0. Some terms that we cannot ignore, considering their great relevance nowadays, are the Internet of Things (IoT) and Big Data. Through the use of a large number of sensors, massive amounts of data can be collected, processed and transmitted in a distributed manner. The failure of one piece of equipment or machine can affect another machine or process dependent on it, causing a shutdown of the production line. Such a failure can usually be detected as an anomaly in the data. Stoppages are often associated with huge costs due to different aspects, such as loss of production, failure to meet delivery deadlines, deterioration of equipment, etc. Therefore, anomaly detection has experienced a great increase in interest for obvious economic reasons. Although it is impossible to eliminate all system failures, it is possible to detect them and, as time permits, be proactive in solving them or minimizing damage.
Anomaly detection techniques have typically used an unsupervised approach in which algorithms learn from a clean dataset and are evaluated over a set with anomalous samples. Traditionally, shallow methods based on classical techniques such as autoregressive models, ARIMA and their variants [27] were employed. Additionally, tree-based models like Isolation Forest [29], and adaptations of support vector machines such as One-Class SVM [1], found utility in this domain. Also, approaches based on dimensionality reduction such as PCA [7] and autoencoders have been effectively applied [12]. With the evolution of Deep Learning, more modern models like Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs) combined with other techniques have been proposed that both improve results and scale to larger problems. For example, OmniAnomaly [24] uses a stochastic RNN and a planar normalizing flow to generate reconstruction probabilities. MERLIN [18] is a parameter-free method capable of finding anomalous discords by iterating over and comparing neighboring subsequences of the time series. MAD-GAN [10] relies on a Long Short-Term Memory (LSTM) based GAN to model the time series distribution. MTAD-GAT [28] applies two graph attention layers in parallel to model both feature and temporal correlations, which are fed to a GRU layer whose outputs are subsequently passed to forecasting and reconstruction models. More recent methods, such as USAD [2], are able to achieve fast training by means of an architecture based on adversely trained autoencoders. GDN [4] introduces an embedding vector for each sensor to learn a graph of relationships between data modes and uses attention-based forecasting and deviation scoring to output anomaly scores. More recently, in TranAD [26] the authors proposed a new architecture based on Transformers, complemented by a two-phase adversarial training scheme and Meta Learning. Although recent works such as [4] or [26] use mechanisms originally applied to NLP problems, like attention, there are still several NLP techniques that could be useful for anomaly detection and have not yet been fully exploited. Applying NLP techniques to time series problem solving is highly interesting, since text and time series share a number of significant similarities. First, both have a sequential nature: in time series the points are ordered by timestamps, while in text the words are ordered to convey meaning. Secondly, both exhibit temporal dependence: in time series a point depends on its antecedents, while in text words also often depend on their context. Furthermore, time series usually exhibit trends and patterns, in the same way that repetitions of words or keywords occur in text. In this regard, there are some promising papers. In [13,21] the authors used fuzzy logic to discretize time series and applied text mining techniques to identify patterns related to the health status of aircraft engines. In [19], the proposed method discretizes the time series into symbols to subsequently learn word embeddings using Skip-gram. The SAFE framework [25] also addresses time series classification tasks by means of a new neural network architecture using word embeddings. In [5], different NLP-based techniques such as SVD, a Transformer model and an LSTM network with embeddings are applied to detect anomalies in categorical time series.
Similarly, in this paper we propose to reformulate the problem of time series anomaly detection in order to benefit from NLP techniques. Contrary to existing methods, we introduce a novel architecture capable of dealing not only with categorical time series but with all types of time series. Moreover, this architecture differs from other approaches in that it exploits two previously untested concepts: discretization by timestamp, and word similarity computed as the scalar product between embeddings. Thus, we achieve a simple but effective model that is at the same time computationally undemanding.
2 Experimental Study

2.1 Problem Formulation
We start from a multivariate time series $X = \{x_t\}_{t \in T}$, an ordered set of k-dimensional vectors, where each observation is collected in a specific time period and consists of k values. It should be noted that a univariate series is the special case where the parameter k is one. The time series is split into a training set $X^{train}$ and a test set $X^{test}$, of which the training set is assumed to be free of anomalies. The task is to predict whether an anomaly occurred at each time step t of the test set $X^{test}$.

2.2 Data Preprocessing
As usual in any Machine Learning problem, the data are normalized to ease model performance and training stability, for which min-max normalization was used:

$\hat{x}_t^k = \dfrac{x_t^k - \min(X_k^{train})}{\max(X_k^{train}) - \min(X_k^{train}) + \varepsilon}$  (1)

where $x_t^k$ is a point in the k-th channel at timestamp t, and $\min(X_k^{train})$ and $\max(X_k^{train})$ are the minimum and maximum values in the k-th channel of the training data. The ranges obtained from the minimum and maximum of the training set are then applied to the test set. ε is a small constant to prevent division by zero. Knowing the ranges a priori, we normalize the data to the range [0, 1). Once the data are normalized, it is necessary to discretize them to convert the time series into a sequence of symbols. The discretization is a simplified version of SAX (Symbolic Aggregate Approximation) [11] in which the previous step of downsampling the data with PAA (Piecewise Aggregate Approximation) [8] is omitted and the breakpoints are equidistant. For each channel of the time series, the normalized range is divided into as many intervals as symbols were selected. Each point is discretized as the symbol of the interval to which its value belongs. The number of symbols is a hyperparameter of the model, which can be tuned depending on the granularity of the discretization pursued (we achieved good results with 7 in our experiments). If the number of symbols is too low the discretization will be too generic, whereas if it is too high the resulting discrete time series will be more complex. Once discretization is complete, the
values of the k channels are concatenated per timestamp. Thus, a single point at a given timestamp is converted into a word consisting of k symbols. Each of these words is treated as the i-th discretized value of the time series, and the vocabulary is formed by the set of words of the training and test sets. In Fig. 1 the discretization and generation of words for a time series with 3 channels is illustrated graphically.
Fig. 1. Illustration of the idea behind converting time series to symbolic data (n of symbols = 4) and extracting words from it.
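As a concrete illustration of this pipeline, the sketch below normalizes each channel with Eq. (1) and maps every timestamp to a k-letter word. The alphabet, the random data and the helper names are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the preprocessing described above: per-channel
# min-max normalization (Eq. (1)) followed by per-timestamp
# discretization into words of k symbols.
import numpy as np

def normalize(train: np.ndarray, test: np.ndarray, eps: float = 1e-8):
    lo, hi = train.min(axis=0), train.max(axis=0)
    scale = hi - lo + eps
    return (train - lo) / scale, (test - lo) / scale

def to_words(series: np.ndarray, n_symbols: int = 7) -> list[str]:
    """Map each timestamp (one k-dimensional point) to a k-letter word."""
    # Clip to [0, 1) so out-of-range test values still get a valid symbol.
    clipped = np.clip(series, 0.0, 1.0 - 1e-12)
    idx = (clipped * n_symbols).astype(int)       # equidistant breakpoints
    letters = np.array(list("ABCDEFGHIJ"[:n_symbols]))
    return ["".join(letters[row]) for row in idx]

rng = np.random.default_rng(0)
train, test = rng.random((100, 3)), rng.random((20, 3))
tr, te = normalize(train, test)
print(to_words(tr)[:5])   # e.g. ['EBF', 'AGD', ...]
```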
2.3 Model Architecture
In recent years there has been an overwhelming increase in the popularity of Deep Learning for anomaly detection, in part guided by the great advances made in NLP [9]. New models based on techniques such as Attention, and specifically Transformers, are beginning to be applied to anomaly detection [2,26]. However, the use of embedding-based techniques such as Word2Vec or GloVe [15,22] is barely explored, even though they are a key part of NLP. In [16], the Skip-Gram methodology was proposed, where, given a target word, the model attempts to predict the context, i.e., the neighboring words within a context window. In most cases this model is simplified by using a single word as the context, so that the skip-grams formed are pairs of words. As explained in the Introduction, there are certain similarities between text and time series. Thus, starting from the discretized time series, we intend to exploit these similarities in order to obtain embeddings that allow us to represent the time series. Subsequently, we aim to detect anomalies, which will correspond to pairs of words that are not usually found together in the same context. The proposed architecture is a variation of Skip-Gram known as Skip-Gram Negative Sampling, also proposed by Mikolov [16]. Negative Sampling emerged as an improvement to the original model, which was computationally very expensive. This was partially solved with the application of Hierarchical Softmax, but definitely improved with Negative Sampling, which simplified the problem while maintaining high-quality embeddings.
In Negative Sampling, a specific number of negative samples are randomly drawn from a noise distribution. On the one hand, the training set contains positive samples, which correspond to pairs for which the context word is within the window with respect to the target word; these pairs are labeled with 1. On the other hand, randomly chosen word pairs constitute the negative samples and are labeled with 0. Instead of ending the model with a Softmax layer computing the probability distribution of observing an output word given an input word, a Sigmoid function is used, whose output differentiates pairs belonging to the context (positive samples) from random pairs (negative samples), which turns it into a binary classification problem. In this way, the model will presumably learn embeddings in which similar words are close and whose scalar product is therefore high. As an example, in a multivariate time series with 3 channels, the event "ABA" may commonly appear before the event "BAB", which would result in a high scalar product between their embeddings, so they can be considered normal events. On the contrary, if the event "ACC" does not commonly appear before "BAB", this would mean a low scalar product and possibly an anomaly, since during training the generated embeddings of the two words did not become similar. Figure 2 illustrates the model architecture. It has two inputs, one for the context word and another for the target. The first layer consists of two embeddings, one for each word, which are in charge of encoding the "information" of the symbols/words fed to the model. There are different methodologies to select the size of the embeddings; in our case, for the datasets used, reasonable results were achieved with a size of 300 dimensions, as recommended by Melamud et al. [14]. For the selection of this value, successive tests were performed by increasing the size of the embeddings by 50. The embeddings are therefore two large matrices of real numbers of size vocab_size × 300. Given a pair of input words, and once their respective embeddings are obtained, the next step is to perform a dot product between the selected embeddings. The result is finally passed to the last layer, where the Sigmoid function is applied. At the output of the model a score in [0, 1] is obtained, which models the probability that the two words appear in the same context (close to each other) or not. A score called perplexity (not to be confused with the classical meaning of the term in NLP), which is the inverse of this estimated probability, is used for detecting anomalies: word/symbol pairs with a low perplexity value are considered normal, while a high perplexity is associated with an anomaly. Once the time series has been discretized, the vocabulary size is obtained to initialize the embeddings of the model. After that, the skip-gram pairs are generated from the training set, producing positive and negative samples, and the model is trained using Negative Sampling. For evaluation, the pairs of symbols are generated using contiguous symbols. Once the scores are obtained, the perplexity score is calculated for each word and POT [23] is applied to detect possible anomalies through the analysis of extreme values. POT is a statistical method that uses extreme value theory to fit the data distribution with a Generalized Pareto
Distribution and to identify appropriate values at risk, dynamically determining threshold values. Regarding the parameters used, as already mentioned, the number of letters used to discretize the time series was 7, the dimensionality of the embeddings was 300 and the window size for pair generation was 7.
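The scoring core described above reduces to two embedding matrices, a dot product and a sigmoid. The following numpy sketch illustrates that core under stated assumptions: only the 300-dimensional embeddings and the perplexity definition come from the text, while the SGD update, learning rate, vocabulary size and initialization are illustrative choices, not the authors' training setup.

```python
# Hedged sketch of the Skip-Gram Negative Sampling core described above.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 50, 300
W_target = rng.normal(0, 0.1, (vocab_size, dim))    # target-word embeddings
W_context = rng.normal(0, 0.1, (vocab_size, dim))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(t, c, label, lr=0.05):
    """One SGD step on a (target, context, label) sample; label is 1 or 0."""
    vt, vc = W_target[t].copy(), W_context[c].copy()
    score = sigmoid(vt @ vc)            # dot product + sigmoid
    grad = score - label                # d(binary cross-entropy)/d(dot product)
    W_target[t] -= lr * grad * vc
    W_context[c] -= lr * grad * vt
    return score

def perplexity(t, c):
    """Anomaly score: inverse of the estimated co-occurrence probability."""
    return 1.0 / sigmoid(W_target[t] @ W_context[c])

# Toy usage: two positive pairs and one negative pair.
pairs = [(1, 2, 1), (3, 4, 1), (1, 4, 0)]
for _ in range(100):
    for t, c, y in pairs:
        train_pair(t, c, y)
print(perplexity(1, 2), perplexity(1, 4))   # low (normal) vs. high (anomalous)
```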
2.4 Datasets
The datasets used for benchmarking are widely used in the anomaly detection literature and are open and publicly available. The set is composed of one univariate time series (UCR) and five multivariate time series (MBA, SMD, MSL, SMAP and MSDS), which contain a very low percentage of anomalous data. SMD, MSL and SMAP are multi-entity datasets, made up of different entities corresponding to different physical units of the same type. For these, a different model per entity has been trained, finally aggregating the results by adding true positives, false positives, etc., and calculating the precision, recall and F1 score.
– Hexagon ML/UCR Time Series [3] is a large collection of univariate time series which has been growing and being updated over the years.
– MIT-BIH Supraventricular Arrhythmia Database (MBA) [17] is a collection of electrocardiogram recordings from four patients, containing multiple instances of two different kinds of anomalies. This dataset has been used for benchmarking purposes in both medical and anomaly detection articles.
– Soil Moisture Active Passive (SMAP) [6] is a compilation of data and telemetry from a NASA space mission that measures and maps Earth's soil moisture and freeze/thaw state from a satellite.
– Mars Science Laboratory (MSL) [6] is similar to the SMAP dataset, but the data and telemetry are collected from the Curiosity rover during its exploration of the planet Mars. Authors such as [18] have analyzed this dataset and the previous one, detecting a large number of trivial sequences, so, as in [26], only the non-trivial sequences have been chosen.
– Server Machine Dataset (SMD) [24] is a dataset collected over 5 weeks from a large Internet company and contains resource utilization traces from 28 different machines in a cluster of computers. Similar to the previous ones, we have chosen to use only the non-trivial traces.
– Multi-Source Distributed System (MSDS) [20] is a recent high-quality multi-source dataset composed of distributed traces, application logs, and metrics from a complex distributed system.
3 Results
Fig. 2. Architecture for the proposed model.

Table 1. Summary of dataset characteristics used in this paper. Obtained from [26].

Dataset   Train size   Test size   N of channels   % of anomalies
UCR       1600         5900        1               1.88
MBA       100000       100000      2               0.14
SMD       135183       427617      38              4.16
MSL       58317        73729       55              10.72
SMAP      135183       427617      25              13.13
MSDS      146430       146430      10              5.37

For the experimentation, the chosen metrics were precision, recall and the F1-score. Since the data in anomaly detection problems are usually imbalanced,
it is recommended to use metrics like ROC. However, since the F1-score is widely used in the literature, we consider that this metric, in combination with precision and recall, allows us to evaluate WETAD properly against other approaches. Table 2 compares the best precision, recall and F1-score obtained with our proposed method, labeled as WETAD (Word Embeddings for Time series Anomaly Detection), with the results obtained by 7 anomaly detection methods (described in the Introduction). These methods are TranAD [26], GDN [4], MTAD-GAT [28], USAD [2], MAD-GAN [10], OmniAnomaly [24] and MERLIN [18]. It should be pointed out that the experiments have been performed using the implementations available in the GitHub repository of [26]. In terms of F1-score our method outperforms the other approaches on the UCR, MBA, MSL, SMAP and MSDS datasets. Only on SMD does TranAD obtain a higher F1-score, and on SMAP it ties with the same method with an F1-score of 0.914. In the average ranking both methods also tie, at position 1.8. The worst method by far seems to be MERLIN which, lacking parameters, seems to have a hard time adapting, especially to high-dimensional multivariate time series. It is noteworthy that on some datasets, such as MSDS and MSL, WETAD reverses the trend of the other models, slightly decreasing precision or recall to improve the complementary metric. This may be due to the fact that certain words considered anomalous in one context may be normal in another, which would decrease precision but increase recall. For example, in collective-type anomalies, where all points in a sub-sequence of the time series are anomalous, a word may appear repeatedly. However, the word may also appear in another, non-anomalous context. In this case, the problem would be
that the word in the non-anomalous context could be detected as a false positive, leading to alterations in the metrics. Our method in general seems to be quite stable in both precision and recall, which leads to good results. The combination of the perplexity score with the dynamic thresholding of POT also seems to have an effect, since adjusting the threshold dynamically helps set more accurate values by also considering the localized peak values in the data sequence. OmniAnomaly and MTAD-GAT seem unable to detect anomalies in the only univariate time series set, perhaps because they focus on the complex dependencies between different channels. Unlike other methods that use computationally expensive techniques such as CNNs or Transformers, WETAD achieves competitive results with a simple model.

Table 2. Precision, Recall and F1-scores of all models with POT dynamical thresholding.
              UCR                     MBA                     SMD
              Prec    Rec     F1      Prec    Rec     F1      Prec    Rec     F1
WETAD         0.634   0.957   0.763   0.981   0.985   0.983   0.979   0.410   0.578
TranAD        0.649   0.472   0.547   0.957   1.000   0.978   0.927   0.646   0.761
GDN           0.049   0.043   0.045   0.844   1.000   0.915   0.717   0.495   0.586
MTAD-GAT      0.000   0.000   0.000   0.901   1.000   0.948   0.805   0.628   0.705
USAD          0.346   0.477   0.401   0.895   1.000   0.945   0.710   0.646   0.676
MAD-GAN       0.595   0.949   0.731   0.940   1.000   0.969   0.520   0.489   0.504
OmniAnomaly   0.000   0.000   0.000   0.859   1.000   0.924   0.775   0.555   0.646
MERLIN        0.374   0.698   0.487   0.985   0.049   0.094   0.132   0.539   0.213

              MSL                     SMAP                    MSDS                    Avg.
              Prec    Rec     F1      Prec    Rec     F1      Prec    Rec     F1      Ranking
WETAD         0.960   0.653   0.777   0.841   1.000   0.914   0.825   1.000   0.904   1.8
TranAD        0.247   1.000   0.396   0.842   1.000   0.914   1.000   0.803   0.890   1.8
GDN           0.241   1.000   0.389   0.848   0.985   0.912   1.000   0.803   0.890   4.2
MTAD-GAT      0.144   1.000   0.252   0.782   1.000   0.878   1.000   0.611   0.758   5.0
USAD          0.239   1.000   0.385   0.819   1.000   0.900   1.000   0.796   0.886   4.2
MAD-GAN       0.232   1.000   0.377   0.821   1.000   0.901   1.000   0.611   0.758   4.3
OmniAnomaly   0.237   1.000   0.383   0.818   1.000   0.900   1.000   0.796   0.887   4.8
MERLIN        0.140   0.372   0.204   0.157   0.999   0.273   0.726   0.311   0.435   6.7
4 Conclusions and Future Work
In this paper we propose WETAD, a new approach based on NLP techniques to identify anomalies in time series. We first apply a discretization on the time series and then train a Skip-Gram Negative Sampling model by generating word
pairs and attempting to learn representations of these using word embeddings. Experiments on a large set of datasets and a selection of state-of-the-art algorithms have shown that our method can compete with current techniques of much higher complexity and even improve results. In future work, we aim to improve the model by adding more advanced NLP techniques such as Attention combined with the discretized time series. It may also be of interest to test the model with SAX discretization, the inclusion of fuzzy logic or the improvement of signal preprocessing as in [21]. Another aspect to consider for future work is the use of other metrics also suited for imbalanced datasets (such as ROC score) in combination with those currently included. Finally, since WETAD is a straightforward model, it is easy to install in real environments, so an analysis on training times and computational complexity could be performed in view of implementations in power-limited industrial IoT devices.
References
1. Arenas-García, J., Gómez-Verdejo, V., Navia-Vazquez, A.: RLS adaptation of one-class SVM for time series novelty detection (2004)
2. Audibert, J., Michiardi, P., Guyard, F., Marti, S., Zuluaga, M.A.: USAD: unsupervised anomaly detection on multivariate time series. In: KDD '20, New York, NY, USA, Association for Computing Machinery, pp. 3395–3404 (2020)
3. Dau, H.A., et al.: The UCR time series archive (2019)
4. Deng, A., Hooi, B.: Graph neural network-based anomaly detection in multivariate time series (2021)
5. Horak, M., Chandrasekaran, S., Tobar, G.: NLP based anomaly detection for categorical time series (2022)
6. Hundman, K., Constantinou, V., Laporte, C., Colwell, I., Soderstrom, T.: Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery, ACM (2018)
7. Jin, Y., Qiu, C., Sun, L., Peng, X., Zhou, J.: Anomaly detection in time series via robust PCA. In: 2017 2nd IEEE International Conference on Intelligent Transportation Engineering (ICITE), pp. 352–355 (2017)
8. Keogh, E.J., Pazzani, M.J.: Scaling up dynamic time warping for datamining applications. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '00, New York, NY, USA, Association for Computing Machinery, pp. 285–289 (2000)
9. Khurana, D., Koli, A., Khatter, K., Singh, S.: Natural language processing: state of the art, current trends and challenges. Multimed. Tools Appl. 82(3), 3713–3744 (2022)
10. Li, D., Chen, D., Shi, L., Jin, B., Goh, J., Ng, S.K.: MAD-GAN: multivariate anomaly detection for time series data with generative adversarial networks (2019)
11. Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15, 107–144 (2007)
12. Malhotra, P., Ramakrishnan, A., Anand, G., Vig, L., Agarwal, P., Shroff, G.: LSTM-based encoder-decoder for multi-sensor anomaly detection (2016)
13. Martínez, A., Sánchez, L., Couso, I.: Engine health monitoring for engine fleets using fuzzy RadViz. In: 2013 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8 (2013)
14. Melamud, O., McClosky, D., Patwardhan, S., Bansal, M.: The role of context types and dimensionality in learning word embeddings. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, Association for Computational Linguistics, pp. 1030–1040 (2016)
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013)
16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality (2013)
17. Moody, G., Mark, R.: The impact of the MIT-BIH arrhythmia database. IEEE Eng. Med. Biol. Mag. 20(3), 45–50 (2001)
18. Nakamura, T., Imamura, M., Mercer, R., Keogh, E.J.: MERLIN: parameter-free discovery of arbitrary length anomalies in massive time series archives. In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 1190–1195 (2020)
19. Nalmpantis, C., Vrakas, D.: Signal2Vec: time series embedding representation, pp. 80–90 (2019)
20. Nedelkoski, S., Bogatinovski, J., Mandapati, A.K., Becker, S., Cardoso, J., Kao, O.: Multi-source distributed system data for AI-powered analytics. In: Brogi, A., Zimmermann, W., Kritikos, K. (eds.) Service-Oriented and Cloud Computing, pp. 161–176. Springer International Publishing, Cham (2020)
21. Palacios, A., Martínez, A., Sánchez, L., Couso, I.: Sequential pattern mining applied to aeroengine condition monitoring with uncertain health data. Eng. Appl. Artif. Intell. 44, 10–24 (2015)
22. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
23. Rosso, G.: Extreme value theory for time series using peak-over-threshold method (2015)
24. Su, Y., Zhao, Y., Niu, C., Liu, R., Sun, W., Pei, D.: Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In: KDD '19, New York, NY, USA, Association for Computing Machinery, pp. 2828–2837 (2019)
25. Tabassum, N., Menon, S., Jastrzębska, A.: Time-series classification with SAFE: simple and fast segmented word embedding-based neural time series classifier. Inform. Process. Manage. 59(5), 103044 (2022)
26. Tuli, S., Casale, G., Jennings, N.R.: TranAD: deep transformer networks for anomaly detection in multivariate time series data (2022)
27. Yu, Q., Jibin, L., Jiang, L.: An improved ARIMA-based traffic anomaly detection algorithm for wireless sensor networks. Int. J. Distrib. Sensor Netw. 2016, 1–9 (2016)
28. Zhao, H., et al.: Multivariate time-series anomaly detection via graph attention network (2020)
29. Zhong, S., Fu, S., Lin, L., Fu, X., Cui, Z., Wang, R.: A novel unsupervised anomaly detection for gas turbine using isolation forest. 06, 1–6 (2019)
Multi-objective Optimization for Multi-Robot Path Planning on Warehouse Environments

Enol García González¹(B), José R. Villar¹, Camelia Chira², Enrique de la Cal¹, Luciano Sánchez¹, and Javier Sedano³

¹ University of Oviedo, Computer Science Department, Oviedo, Spain
{garciaenol,villarjose,delacal,luciano}@uniovi.es
² Babes-Bolyai University, Department of Computer Science, Cluj-Napoca, Romania
[email protected]
³ Instituto Tecnológico de Castilla y León, Burgos, Spain
[email protected]
Abstract. Today, robots can be found in almost any field. Examples include robots for transporting materials in hospitals and warehouses, surveillance, intelligent laboratories and space exploration. Whatever the reason for moving the robot and whatever its location, all robot applications require path calculation. In this paper, we address the problem of collision-free path planning in multi-robot environments, known as Free Multi-Robot Path Planning (MPP). We propose a novel approach to solve the MPP problem using multi-objective optimization, for which we define two functions that have to be minimized. In the experimentation, it is compared with previous approaches to the problem, improving on them in some scenarios. Finally, new lines of research are proposed to improve this path-calculation problem using multi-objective optimization and to address new and more complex problems in warehouse environments.

Keywords: Multi-Robot Path Planning · NSGA · Multi-objective optimization

1 Introduction
Nowadays it is common to use robots for the automation of multiple transport tasks. We can find robots automating transport tasks in different environments, from hospitals [1,13,18] to large logistics centers [3,9,17,19]. In many of these spaces, the transport tasks are not performed by a single robot: multiple robots must collaborate in the same space and move around without colliding with each other. The problem of determining the best path for a set of robots in a shared space without collisions or human intervention is known as Free Multi-Robot Path Planning (MPP). This MPP problem has applications in multiple fields that can be very different from each other, such as surveillance [22], intelligent laboratories [21,23], or space exploration [12]. In the current work, the MPP problem is applied in the field of warehouse logistics [3,9,17,19]. The objective of this study is to provide a method that calculates the route to be followed by robots within a known environment in which the location of obstacles and the start and end points of each robot are known. In the literature, there are two main approaches to solving the MPP problem. The first of them is a heuristic approach, based on A*. A* [10] is a search algorithm that has been widely used for route calculation in single-robot environments, but it does not consider the existence of multiple robots in a shared space. Some authors have developed heuristic algorithms like Windowed HCA* [20], D*Lite [16], Field D* [6] or Theta* [4] that are based on A* and extend it to solve the path-planning problem when there are multiple robots. The other approach that is largely present in the literature is the use of metaheuristics. Within this approach, we find works such as the one presented in [2] that solve the problem using Differential Evolution. Other works of this approach are [8,24], which solve the problem with other bio-inspired metaheuristics such as Ant Colony Optimization and Grey Wolf Optimizer. In this research, we propose a new approach to this problem based on multi-objective optimization. Our goal is to improve on our previous proposals, which solved this problem by employing co-evolutionary algorithms [7,15]. The structure of this document is as follows. Section 2 explains the proposed approach for solving MPP. Next, Sect. 3 gives details of the experimentation, while Sect. 4 shows the results and discussion. Finally, Sect. 5 draws the conclusions extracted from this study and presents future research lines on this topic.

* This research has been funded by the Spanish Ministry of Economics and Industry, grant PID2020-112726RB-I00, by CDTI projects CER-20211003 and CER-20211022, by Missions Science and Innovation project MIG-20211008 (INMERBOT), and by ICE (Junta de Castilla y León) under project CCTT3/20/BU/0003. Also, by Principado de Asturias, grant SV-PA-21-AYUD/2021/50994.
2 Non-Dominated Genetic Algorithm Approach
This proposal focuses on solving the MPP problem with a multi-objective optimization approach. To do so, it makes use of a metaheuristic well known in the literature: NSGA-III [5,14]. This metaheuristic is based on genetic algorithms, since their behavior is very similar. Like genetic algorithms, it starts from an initial population and performs an iterative process in which, in each iteration, new solutions are generated using crossover and mutation operators. The difference between NSGA-III and classical genetic algorithms is the number of criteria: classical genetic algorithms minimize a single criterion called fitness, while NSGA-III has many objectives to minimize. Having many criteria changes how we determine whether one solution is better than another. For a solution to dominate another one, its value must be strictly lower in all the criteria except
one, in which it can be lower or equal. For this work, two objectives have been defined to be minimized:
– Length of the paths, i.e., the number of movements on the path.
– Number of collisions. Let $X_i^t$ be the position of robot i at time t. We define that a collision occurs if:
  • two robots i, j share a tile at the same moment of time, $X_i^t = X_j^t$; or
  • two robots i, j exchange positions at two consecutive instants of time, $X_i^t = X_j^{t-1}$ and $X_i^{t-1} = X_j^t$ (a sketch of this collision count is given below).
To minimize those two objectives using NSGA-III, it is necessary to define three behaviors: how to initialize the first population and how to perform the crossover and mutation operators. To generate the initial population, it is proposed to start from a base of k pre-generated routes (k being a parameter of the algorithm), generated in a pseudo-random way, with solutions formed by indicating which of the already calculated routes is used for each robot. The following sections indicate how these routes are generated and how the behaviors required by NSGA-III are implemented.
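The collision objective translates directly into code. The following is a minimal sketch of that count; the grid representation and the padding of shorter paths by repeating the final position are assumptions, not taken from the paper.

```python
# Direct transcription of the second objective: count tile-sharing and
# position-swap collisions between robot paths (paths given as lists of
# (row, col) grid cells; shorter paths padded with their final position).
def count_collisions(paths: list[list[tuple[int, int]]]) -> int:
    horizon = max(len(p) for p in paths)
    pad = [p + [p[-1]] * (horizon - len(p)) for p in paths]
    collisions = 0
    for i in range(len(pad)):
        for j in range(i + 1, len(pad)):
            for t in range(horizon):
                if pad[i][t] == pad[j][t]:                     # shared tile
                    collisions += 1
                elif t > 0 and pad[i][t] == pad[j][t - 1] \
                        and pad[i][t - 1] == pad[j][t]:        # position swap
                    collisions += 1
    return collisions

print(count_collisions([[(0, 0), (0, 1)], [(0, 1), (0, 0)]]))  # 1 swap collision
```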
2.1 Route Generation
The generation of routes is performed randomly by means of an iterative algorithm in which, in each iteration, the movement to be performed is selected pseudo-randomly (Alg. 1). This algorithm is separated into two parts: i) generation of the initial routes, and ii) loop elimination (Alg. 2). In the generation of routes, the movements are chosen randomly from all possible movements at that moment, taking into account two restrictions:
– If it is possible to redo the last move, it will have a higher probability of being chosen, to encourage routes that incorporate straight-line sections.
– The movement opposite to the previous one can only be chosen if there is no other possible movement, to prevent the algorithm from getting stuck between two points.
Once the initial path is generated, it is analyzed, searching for two time instants at which the robot passes through the same coordinates. If the robot passes through the same coordinate at two different time instants, it is considered that there is a loop, and all the movements between the two appearances of those coordinates in the path are eliminated. To analyze the feasibility of this route-generation algorithm, it was run 30 times, generating 1000 routes in each run. During this test, the algorithm was able to generate an average of 930.92 solutions per second, with a standard deviation of 30.31.
Algorithm 1. Random search O(n)
Input: origin, dest points
Output: path: sequence of movements to reach the point dest from the point origin
current ← origin
path ← ∅
while current ≠ dest do
    if path is not empty & is_valid_movement(path[-1]) & random < 0.7 then
        Add path[-1] to path again    ▷ Copy the last movement
    else
        Calculate available_movs
        if length(available_movs) == 1 then
            Add available_movs[0] to path
        else
            Remove opposite(path[-1]) from available_movs
            next_movement ← random(available_movs)
            Add next_movement to path
        end if
    end if
    Update current position
end while
Look for loops on path and remove them using Alg. 2
return path
2.2 Initial Population
The representation of the solutions is defined as a vector of n positions, where n is the number of robots in the problem. Each position in the vector will contain an integer representing which index of the pre-generated paths will be used for that robot. Figure 1 shows an example of this representation, where robot number 1 will use the path at position 33 in the list of pre-generated paths, the second robot will use path 15 and the last robot will use the path at position 24.
Fig. 1. Example of the representation of a single solution
The solutions will be generated randomly so that each position of the vector will contain a random integer generated between 0 and the number of precomputed routes.

2.3 Crossover
For the crossover operator we have chosen to use the One-Point Crossover [11]. This operator selects a random cut point in the two parent solutions and
generates two daughter solutions in which each part will come from one of the parents. Figure 2 illustrates this crossover operator.

Algorithm 2. Remove loops O(n²)
Input: path: sequence of movements and points for a robot that could contain loops
Output: path: sequence of movements and points for a robot without loops
for i ← 0 to length(path) do
    for j ← length(path) to i step −1 do
        if coords_at(i) == coords_at(j) then    ▷ A coordinate appears twice in the route
            new_path ← path[0 : i] + path[j : length(path)]
            return remove_loops(new_path)
        end if
    end for
end for
return path
Fig. 2. Example of crossover operator
2.4 Mutation
The mutation operator acts on each of the elements of a solution: each of the numbers present in the vector forming an individual is evaluated and modified with a probability of $p_m$. The modification consists of changing the value to a random one. Figure 3 illustrates this mutation operator; a code sketch of both variation operators follows below.
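As a compact illustration of the two variation operators over the index-vector representation, consider the sketch below; the example vectors and the route count are illustrative, and the default mutation probability follows Table 1.

```python
# Minimal sketch of one-point crossover [11] and per-gene mutation over
# the index-vector encoding (one pre-generated route index per robot).
import random

def one_point_crossover(p1: list[int], p2: list[int]) -> tuple[list[int], list[int]]:
    cut = random.randrange(1, len(p1))          # random cut point
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(ind: list[int], n_routes: int, pm: float = 0.3) -> list[int]:
    # Each gene is resampled to a random route index with probability pm.
    return [random.randrange(n_routes) if random.random() < pm else g
            for g in ind]

parent_a, parent_b = [33, 15, 24], [7, 2, 41]
child_a, child_b = one_point_crossover(parent_a, parent_b)
print(mutate(child_a, n_routes=100))
```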
3 Experimentation Setup
The objective of the experimentation is to compare with other methods already published in previous works to solve the MPP problem [7,15]. The comparison replicates the three scenarios proposed in [7]: the first one with a single room and fixed obstacles; the second one with 4 connected rooms; and the third one that replicates a warehouse.
Fig. 3. Example of mutation operator
Fig. 4. Graphical representation of the scenarios used in the experimentation: (a) Scenario 1, (b) Scenario 2, (c) Scenario 3.
Figure 4a shows the first scenario, consisting of a closed room of size 10 × 30 meters. In this first scenario, there are only a few fixed objects inside the room. The objective is for each robot to go from a random starting point to a random ending point. The number of robots used in this experiment varies between 3 and 15. The second scenario, shown in Fig. 4b, consists of 4 rooms of the same size as in the first scenario. In this case, the rooms are free of obstacles but connected by narrow corridors of length 4, in which only one robot fits. The complication of this scenario is that the robots are required to end up in a different room from the starting room, in order to analyze the behavior of the algorithm when crossing corridors. For this experiment, the number of robots varied between 3 and 6 robots per room. Figure 4c shows the last scenario, which replicates a warehouse with two clearly differentiated zones. The large area at the top, with a pattern of aisles and square obstacles, corresponds to the storage area; the square obstacles would be the shelves where the products are stored. Analogous to the second scenario, only one robot is considered to fit in each aisle. The smaller area at the bottom of the map, with no obstacles inside, is considered a working area in which the robots can move around as needed. The goal of this experiment is for the robots to move from a random point in the work zone to a random point in the shelving area. In this scenario, the experimentation started with 10 robots, increasing in steps of 5 up to 30 robots. Up to 10 runs of each algorithm and scenario were executed to estimate the statistics of the performances. During the experimentation, the parameters in Table 1 were used; these parameters were chosen as the best among several runs with different settings.

Table 1. Parameters used during the experimentation.

Crossover probability       0.7
Mutation probability        0.3
Max number of iterations    200
Population size             50
Pre-generated routes (k)    100

Fig. 5. Comparative graphs of the evolution of the execution time with respect to the number of robots for each method for scenario 1.
4 Results and Discussion
Three metrics were used during the experimentation: the execution time, the length of the longest path and the sum of the lengths of all paths. Table 2 shows the runtime results compared to the two previous works: [15] and [7]. The new method obtains better results in the first scenario, since for 15 robots it produces results up to ten times better. Figure 5 graphically represents the evolution in execution time as a function of the number of robots for these three methods. In the case of the last two scenarios, it is observed that the proposed method does not perform as well, since it has much longer execution times than [7]. However, it does improve on [15], as it is able to solve these scenarios with almost any number of robots.
Table 2. Mean (MN) and standard deviation (SD) of the path-planning time in seconds for each method and scenario. The gray cells of the original (marked '–' here) represent the experiments for which the method could not find a collision-free path.

No. robots    Morteza2022 [15]          Garcia2023 [7]          NSGA-III
              MN          SD            MN         SD           MN        SD
Scenario 1
6             3.8313      1.4861        4.7479     3.5449       4.9712    0.3330
7             11.2127     5.4879        9.6416     4.3064       5.3837    0.4022
8             75.8119     41.5637       14.3116    8.0995       5.7193    0.3323
9             303.3389    203.8299      32.8010    12.7689      8.3554    4.6203
10            1255.6156   459.5965      59.7161    24.0424      12.0660   6.6266
11            3958.5166   1954.4560     95.2412    46.1279      13.7042   8.0799
12            –           –             121.2675   72.5481      23.9644   0.8502
13            –           –             198.7366   101.0648     25.1462   0.2680
14            –           –             271.5449   124.6219     27.6918   1.0077
15            –           –             571.3999   585.3738     53.2227   0.5877
Scenario 2
3 per room    41384.1716  14851.4483    0.3939     0.0663       5.4058    0.5484
4 per room    –           –             0.2327     0.0536       14.7859   1.5257
5 per room    –           –             0.4368     0.0162       60.5659   1.2013
6 per room    –           –             1.2399     0.0183       –         –
Scenario 3
10            –           –             0.1558     0.0106       13.5677   2.2207
15            –           –             0.1999     0.0524       23.9452   3.1946
20            –           –             0.1730     0.0121       36.2394   4.1399
25            –           –             0.1810     0.0160       52.8438   4.9216
30            –           –             0.2105     0.0081       69.6164   5.5152
The other two metrics, corresponding to the length of the longest path and the aggregated length, are shown in Table 3. The results from the experimentation show that the new method gives better results in both time and path length in a simple environment with few robots, as in scenario 1. In more complex scenarios (scenario 2 and scenario 3), with a larger number of robots, the new approach takes longer to find results, reaching paths that are longer than those provided by the method with which we compare.
Table 3. Max and aggregated length (in meters) of the paths obtained from each method for each scenario according to the number of robots. The gray cells of the original (marked '–' here) represent the experiments for which the method could not find a collision-free path.

No. robots      Garcia2023 [7]           NSGA-III
                Max      Aggregated      Max      Aggregated
Scenario 1
6               55.0     363.9           32.0     120.0
7               56.4     377.1           32.0     146.2
8               57.2     438.5           32.0     174.6
9               60.5     389.1           36.0     195.2
10              59.2     444.5           50.2     248.3
11              68.9     414.7           50.6     300.7
12              63.1     426.1           52.1     338.8
13              79.0     437.8           54.3     387.6
14              78.4     469.8           62.0     412.2
15              90.8     486.0           70.1     489.0
Scenario 2
3 per room      77.0     736.0           76.0     788.3
4 per room      79.0     951.0           82.2     920.2
5 per room      79.0     1206.0          88.0     1336.9
6 per room      79.0     1458.0          –        –
Scenario 3
10              62.0     399.0           70.0     415.1
15              62.0     536.0           140.2    740.9
20              62.0     639.0           236.7    1339.5
25              62.0     808.0           248.6    1662.3
30              62.0     946.0           253.3    1810.8
Conclusion and Future Work
This paper proposes a new approach to the MPP problem. In the experimentation, we compare with another work that tries to solve the same problem using co-evolutionary algorithms. The newly proposed method outperforms the previous one in simple scenarios with few robots, but it is not able to beat the benchmark method in more complicated environments with more robots. Future work will try to improve this approach in more complex scenarios, and we will combine the MPP and the TSP problem. To improve this approach, we will consider two possible lines: i) the improvement of the path generation algorithm, and ii) the development of a local search algorithm that obtains new solutions from those with few collisions. Regarding the combination of the MPP with other known problems in the literature, we will try to solve the MPP problem in environments where there is not only one starting point and one destination point; the main idea is that the robot must visit multiple waypoints before reaching the destination.
References
1. Causse, O., Pampagnin, L.: Management of a multi-robot system in a public environment. In: Proceedings 1995 IEEE/RSJ International Conference on Intelligent Robots and Systems. Human Robot Interaction and Cooperative Robots, vol. 2, pp. 246–252 (1995). https://doi.org/10.1109/IROS.1995.526168
2. Chakraborty, J., Konar, A., Jain, L.C., Chakraborty, U.K.: Cooperative multi-robot path planning using differential evolution. J. Intell. Fuzzy Syst. 20, 13–27 (2009). https://doi.org/10.3233/IFS-2009-0412
3. Chen, X., Li, Y., Liu, L.: A coordinated path planning algorithm for multi-robot in intelligent warehouse. In: 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 2945–2950 (2019). https://doi.org/10.1109/ROBIO49542.2019.8961586
4. Daniel, K., Nash, A., Koenig, S., Felner, A.: Theta*: any-angle path planning on grids. J. Artif. Intell. Res. 39, 533–579 (2010). https://doi.org/10.1613/jair.2994
5. Deb, K., Jain, H.: An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints. IEEE Trans. Evol. Comput. 18, 577–601 (2014). https://doi.org/10.1109/TEVC.2013.2281535
6. Ferguson, D., Stentz, A.: Using interpolation to improve path planning: the Field D* algorithm. J. Field Robot. 23, 79–101 (2006). https://doi.org/10.1002/rob.20109
7. García, E., Villar, J.R., Tan, Q., Sedano, J., Chira, C.: An efficient multi-robot path planning solution using A* and coevolutionary algorithms. Integr. Comput. Aided Eng. 30, 41–52 (2023). https://doi.org/10.3233/ICA-220695
8. Gul, F., Rahiman, W., Alhady, S.S.N., Ali, A., Mir, I., Jalil, A.: Meta-heuristic approach for solving multi-objective path planning for autonomous guided robot using PSO-GWO optimization algorithm with evolutionary programming. J. Ambient Intell. Humaniz. Comput. 12, 7873–7890 (2021). https://doi.org/10.1007/s12652-020-02514-w
9. Han, S.D., Yu, J.: Effective heuristics for multi-robot path planning in warehouse environments. In: 2019 International Symposium on Multi-Robot and Multi-Agent Systems (MRS), pp. 10–12 (2019). https://doi.org/10.1109/MRS.2019.8901065
10. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4, 100–107 (1968). https://doi.org/10.1109/TSSC.1968.300136
11. Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. MIT Press, Cambridge, MA, USA (1992)
12. Huang, D., Jiang, H., Yu, Z., Kang, C., Hu, C.: Leader-following cluster consensus in multi-agent systems with intermittence. Int. J. Control Autom. Syst. 16, 437–451 (2018). https://doi.org/10.1007/s12555-017-0345-2
13. Huang, X., Cao, Q., Zhu, X.: Mixed path planning for multi-robots in structured hospital environment. J. Eng. 2019(14), 512–516 (2019). https://doi.org/10.1049/joe.2018.9409
14. Jain, H., Deb, K.: An evolutionary many-objective optimization algorithm using reference-point based nondominated sorting approach, part II: handling constraints and extending to an adaptive approach. IEEE Trans. Evol. Comput. 18, 602–622 (2014). https://doi.org/10.1109/TEVC.2013.2281534
15. Kiadi, M., García, E., Villar, J.R., Tan, Q.: A*-based co-evolutionary approach for multi-robot path planning with collision avoidance. Cybern. Syst., 1–16 (2022). https://doi.org/10.1080/01969722.2022.2030009
16. Koenig, S., Likhachev, M.: Fast replanning for navigation in unknown terrain. IEEE Trans. Rob. 21, 354–363 (2005). https://doi.org/10.1109/TRO.2004.838026
17. Kumar, N.V., Kumar, C.S.: Development of collision free path planning algorithm for warehouse mobile robot. Proc. Comput. Sci. 133, 456–463 (2018). https://doi.org/10.1016/j.procs.2018.07.056
18. Ortiz, E.G., Andres, B., Fraile, F., Poler, R., Ortiz Bas, Á.: Fleet management system for mobile robots in healthcare environments. J. Indust. Eng. Manage. 14(1), 55–71 (2021). https://doi.org/10.3926/jiem.3284
19. Sharma, K., Doriya, R.: Coordination of multi-robot path planning for warehouse application using smart approach for identifying destinations. Intel. Serv. Robot. 14, 313–325 (2021). https://doi.org/10.1007/s11370-021-00363-w
20. Silver, D.: Cooperative pathfinding. In: Proceedings of the First AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE'05), pp. 117–122 (2005). https://doi.org/10.1609/aiide.v1i1.18726
21. Solak, S., Yakut, Ö., Dogru Bolat, E.: Design and implementation of web-based virtual mobile robot laboratory for engineering education. Symmetry 12 (2020). https://doi.org/10.3390/sym12060906
22. Stump, E., Michael, N.: Multi-robot persistent surveillance planning as a vehicle routing problem. In: 2011 IEEE International Conference on Automation Science and Engineering, pp. 569–575 (2011). https://doi.org/10.1109/CASE.2011.6042503
23. Tan, Q., Denojean-Mairet, M., Wang, H., Zhang, X., Pivot, F.C., Treu, R.: Toward a telepresence robot empowered smart lab. Smart Learn. Environ. 6, 5 (2019). https://doi.org/10.1186/s40561-019-0084-3
24. Zheng, Y., Luo, Q., Wang, H., Wang, C., Chen, X.: Path planning of mobile robot based on adaptive ant colony algorithm. J. Intell. Fuzzy Syst. 39, 5329–5338 (2020). https://doi.org/10.3233/JIFS-189018
On the Prediction of Anomalous Contaminant Diffusion

Douglas F. Corrêa1(B), Guido F.M.G. Carvalho1,2, David A. Pelta2, Claudio F.M. Toledo3, and Antônio J. Silva Neto1

1 Rio de Janeiro State University, Rio de Janeiro, Brazil
[email protected]
2 Department of Computer Science & A.I., University of Granada, Granada 18014, Spain
3 University of São Paulo, São Paulo, Brazil

Abstract. The present work aims to estimate the parameters necessary to predict contaminant diffusion in a medium where a biflux anomalous diffusion phenomenon occurs. To accomplish this, an inverse problem approach is adopted, and the parameters are estimated with the aid of the Differential Evolution algorithm. Synthetic experimental data generated with the Bevilacqua-Galeão Model were used to simulate a real case scenario. The outcome of the proposed method accurately predicted the dispersion of contamination with a tolerable margin of error.

Keywords: Biflux Diffusion Equation · Fourth Order Diffusion Model · BG Model · Inverse Problem · Differential Evolution

1 Introduction
Contaminants may enter the water on a daily basis, either at its source or during transportation through the distribution system (see https://www.cdc.gov/healthywater/drinking/contamination.html), whether accidentally or intentionally. A small part of the seawater may become contaminated as a result of misfortunes like offshore drilling unit accidents or oil spills from marine shipping. Even a river can become contaminated by waste residues. In such scenarios it is crucial to minimize the damage, and in order to do so it is essential to anticipate the spread of contamination so that appropriate measures can be taken. The phenomenon of diffusion, which refers to the scattering of particles in a medium, is actively present in contamination processes. The diffusion equation is commonly utilized to model several phenomena such as disease transmission [8], population dynamics [6], heat transfer [2], financial markets [4] and, among others, fluid and particle transport [6]. In the conventional diffusion model, particles move randomly and uniformly within the medium, resulting in a linear increase of the mean square displacement with time, displaying a Gaussian behavior [1]. In a nonequilibrium system,
such as when oil spills into the ocean, a type of diffusion that deviates from the usual behavior may occur; in that case the mean square displacement increases at a non-linear rate, and the process is usually called anomalous diffusion. This deviation can be caused by several factors such as medium heterogeneity, particle interactions or complex physical processes [2]. Bevilacqua, Galeão and coworkers [2,3] have developed an analytical formulation for anomalous diffusion in which temporary retention is taken into consideration, resulting in a fourth-order partial differential equation. The Bevilacqua-Galeão (BG) Model is utilized in the present study, as we focus on the effects of diffusion in a contaminant spreading process. In a direct problem, the required variables are known, and the goal is to obtain the effects of those variables on the phenomenon given an initial state. However, in order to predict the dispersion of contaminants, we must first determine the governing variables of the phenomenon from observed effects, which in this study is the contamination concentration profile at a time t. To accomplish this, an inverse problem approach is utilized. A schematic representation of an inverse problem can be seen in Fig. 1. According to Ref. [5], an inverse problem arises whenever the effects are known and one needs to infer either the process or the cause. In the current study, the cause is the initial concentration and the process is partially known: we consider the mathematical model to be the anomalous biflux diffusion model, but without any previous knowledge about the fraction of particles allowed to spread in the BG Model (β(x)), Subsect. 2.1, which we aim to find; the process is therefore only partially known.
Fig. 1. Schematic representation of the Inverse Problem considered in this work. Adapted from [5]
The authors state in [2] that, because the BG Model is a new model, it is necessary to develop methods to estimate its parameters. In our approach no knowledge about the function β(x) is previously available, which leads us to assume it to be a fifth-degree polynomial; in the inverse problem we aim to estimate its coefficients so as to represent the behavior of β(x) predicted by the BG Model in an anomalous diffusion process, using the Differential Evolution (DE) algorithm. We chose a population-based technique because it allows obtaining a set of potentially good solutions when used as a solution generator [9], which is our case here; our final goal is then to predict the contamination spread after obtaining these coefficients. Based on the research done in [10] on the Macaé river around a thermoelectric power plant in Brazil, we make use of a one-dimensional differential equation to represent a scenario where a river with similar characteristics gets contaminated.
In this context, the aim of this paper is to estimate the coefficients of a fifth-degree polynomial that approximates the behavior of the function β(x) using the Differential Evolution method. With our results we hope that this work will provide a path for more experimental analysis of the BG Model in cases where no previous knowledge of β(x) is available. The paper is organized as follows. Section 2 is divided into three subsections of preliminary content reviewing the topics necessary to understand this paper: in Subsect. 2.1 we briefly review the anomalous biflux diffusion model utilized; in 2.2 the numerical method used to solve the partial differential equation is presented; in 2.3 the Differential Evolution metaheuristic and the adopted parameters are briefly described. In Sect. 3 the inverse problem is formulated, then in Sect. 4 the results are presented and discussed. Finally, in Sect. 5 the conclusions of the present work are presented.
2 Bevilacqua-Galeão (BG) Model and Numerical Solution

2.1 BG Model
Here we make a brief description of the BG Model for biflux anomalous diffusion, which was first presented in Ref. [2]. For the sake of simplicity, in this section we shall assume the discrete and symmetric case of redistribution. Let’s consider a region in space that contains a high concentration of particles, represented by cell i in Fig. 2. In a particle spreading process with retention, a portion (α) of the particles is retained, and the non-retained portion (β) is redistributed into neighboring cells. This process is represented in Fig. 2, where β = 1 − α.
Fig. 2. The discrete distribution as a function of time in the BG model for biflux anomalous diffusion with constant β.
This phenomenon in a continuous medium, with the possibility that the redistribution has a spatial dependence, is governed by the following partial differential equation [2]:

\frac{\partial p(x,t)}{\partial t} = K_2 \frac{\partial}{\partial x}\left(\beta(x)\frac{\partial p(x,t)}{\partial x}\right) - K_4 \frac{\partial}{\partial x}\left(\beta(x)(1-\beta(x))\frac{\partial^3 p(x,t)}{\partial x^3}\right)    (1)
where p(x,t) represents the concentration, K_2 is the diffusion coefficient and K_4 is the reactivity coefficient. Equation (1) is called the Biflux Anomalous Diffusion Equation, or BG Model for anomalous diffusion. For a review of how Eq. (1) was obtained we recommend reading the analytical development of this model in Refs. [2,3]. With the product rule, Eq. (1) can be written as

\frac{\partial p(x,t)}{\partial t} = A_1 \frac{\partial p(x,t)}{\partial x} + A_2 \frac{\partial^2 p(x,t)}{\partial x^2} - A_3 \frac{\partial^3 p(x,t)}{\partial x^3} - A_4 \frac{\partial^4 p(x,t)}{\partial x^4}    (2)

where

A_1(x) = K_2 \frac{d\beta(x)}{dx}    (3)

A_2(x) = K_2 \beta(x)    (4)

A_3(x) = K_4 (1 - 2\beta(x)) \frac{d\beta(x)}{dx}    (5)

A_4(x) = K_4 \beta(x)(1 - \beta(x))    (6)
When K_2, K_4, β(x) and its derivative, along with the initial condition and four boundary conditions, are known, the problem can be solved numerically using, for example, the Finite Difference Method.
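As an illustration of Eqs. (3)-(6), the following minimal Python sketch (our addition, not the authors' code) evaluates the variable coefficients for the β(x) = log(x + 1) profile used later in the paper; the grid bounds mirror the case study:

import numpy as np

def bg_coefficients(x, K2=0.1, K4=1e-5):
    """Evaluate A1..A4 of Eqs. (3)-(6) for beta(x) = log(x + 1).

    The analytic derivative d(beta)/dx = 1/(x + 1) is used here; a
    different beta(x) would require its own derivative.
    """
    beta = np.log(x + 1.0)
    dbeta = 1.0 / (x + 1.0)
    A1 = K2 * dbeta                       # Eq. (3)
    A2 = K2 * beta                        # Eq. (4)
    A3 = K4 * (1.0 - 2.0 * beta) * dbeta  # Eq. (5)
    A4 = K4 * beta * (1.0 - beta)         # Eq. (6)
    return A1, A2, A3, A4

x = np.linspace(0.0, 2.0, 501)  # spatial grid matching Nx = 501
A1, A2, A3, A4 = bg_coefficients(x)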
2.2 Numerical Solution
Prior to tackling the inverse problem, we provide a brief description of the numerical method utilized to estimate the derivatives in the BG Model. For this purpose, the well-established Finite Difference Method was employed. The resulting algebraic linear equations are solved with the Gauss Elimination Method followed by the Backward Substitution approach. The following discretizations were used in a Forward-Time Centered-Space implicit approach, first for the time derivative

\frac{\partial p(x,t)}{\partial t} \approx \frac{\phi_i^{n+1} - \phi_i^n}{\Delta t}    (7)

where \phi_i^n is the concentration at the i-th spatial node and the n-th time node. For the spatial derivatives, the scheme used is given as follows:

\frac{\partial p(x,t)}{\partial x} \approx \frac{\phi_{i-2}^{n+1} - 8\phi_{i-1}^{n+1} + 8\phi_{i+1}^{n+1} - \phi_{i+2}^{n+1}}{12\Delta x}    (8)

\frac{\partial^2 p(x,t)}{\partial x^2} \approx \frac{-\phi_{i-2}^{n+1} + 16\phi_{i-1}^{n+1} - 30\phi_i^{n+1} + 16\phi_{i+1}^{n+1} - \phi_{i+2}^{n+1}}{12\Delta x^2}    (9)

\frac{\partial^3 p(x,t)}{\partial x^3} \approx \frac{-\phi_{i-2}^{n+1} + 2\phi_{i-1}^{n+1} - 2\phi_{i+1}^{n+1} + \phi_{i+2}^{n+1}}{2\Delta x^3}    (10)

\frac{\partial^4 p(x,t)}{\partial x^4} \approx \frac{\phi_{i-2}^{n+1} - 4\phi_{i-1}^{n+1} + 6\phi_i^{n+1} - 4\phi_{i+1}^{n+1} + \phi_{i+2}^{n+1}}{\Delta x^4}    (11)
With a substitution of the above equations in Eq. (1), we get the following algebraic linear system of equations

[-\Gamma_{1,i} + \Gamma_{2,i} - \Gamma_{3,i} + \Gamma_{4,i}]\phi_{i-2}^{n+1} + [8\Gamma_{1,i} - 16\Gamma_{2,i} + 2\Gamma_{3,i} - 4\Gamma_{4,i}]\phi_{i-1}^{n+1} + [1 + 30\Gamma_{2,i} + 6\Gamma_{4,i}]\phi_i^{n+1} + [-8\Gamma_{1,i} - 16\Gamma_{2,i} - 2\Gamma_{3,i} - 4\Gamma_{4,i}]\phi_{i+1}^{n+1} + [\Gamma_{1,i} + \Gamma_{2,i} + \Gamma_{3,i} + \Gamma_{4,i}]\phi_{i+2}^{n+1} = \phi_i^n,    i = 1, 2, ..., N_x and n = 1, 2, ..., N_t    (12)

where N_x represents the total number of spatial nodes, and N_t denotes the total number of nodes in time. \Gamma_{1,i}, \Gamma_{2,i}, \Gamma_{3,i} and \Gamma_{4,i} are defined as

\Gamma_{1,i} = \frac{A_{1,i}\,\Delta t}{12\,\Delta x}    (13)

\Gamma_{2,i} = \frac{A_{2,i}\,\Delta t}{12\,\Delta x^2}    (14)

\Gamma_{3,i} = \frac{A_{3,i}\,\Delta t}{2\,\Delta x^3}    (15)

\Gamma_{4,i} = \frac{A_{4,i}\,\Delta t}{\Delta x^4}    (16)

In our simulations, since this scheme in this particular case is unconditionally stable due to the implicit formulation, we arbitrarily defined N_t = 10,001 and N_x = 501.
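A minimal Python sketch of one implicit time step of Eq. (12), assembling the pentadiagonal system as a dense matrix and solving it with Gaussian elimination via numpy; the boundary rows here use a simplified treatment that fixes the end values, which is an assumption of this sketch rather than the authors' exact implementation:

import numpy as np

def bg_time_step(phi, A1, A2, A3, A4, dt, dx):
    """Advance the BG model one implicit step by solving Eq. (12)."""
    nx = phi.size
    g1 = A1 * dt / (12 * dx)      # Eq. (13)
    g2 = A2 * dt / (12 * dx**2)   # Eq. (14)
    g3 = A3 * dt / (2 * dx**3)    # Eq. (15)
    g4 = A4 * dt / dx**4          # Eq. (16)

    M = np.eye(nx)
    rhs = phi.copy()
    for i in range(2, nx - 2):    # interior stencil of Eq. (12)
        M[i, i-2] = -g1[i] + g2[i] - g3[i] + g4[i]
        M[i, i-1] = 8*g1[i] - 16*g2[i] + 2*g3[i] - 4*g4[i]
        M[i, i]   = 1 + 30*g2[i] + 6*g4[i]
        M[i, i+1] = -8*g1[i] - 16*g2[i] - 2*g3[i] - 4*g4[i]
        M[i, i+2] = g1[i] + g2[i] + g3[i] + g4[i]
    # Identity boundary rows keep phi fixed at the ends, mimicking the
    # homogeneous Dirichlet conditions (21); the zero-slope conditions
    # (22) would need ghost nodes in a full implementation.
    return np.linalg.solve(M, rhs)  # LU factorization + back substitution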
2.3 Differential Evolution (DE) Method
Differential Evolution is a heuristic optimization method proposed by Storn and Price [7]. We can summarize our implemented version of this method as follows.

1. Generation of an initial random population:

x_{k,j}^{t=0} = x_{L,k} + \mathrm{rand}_{k,j}\,(x_{U,k} - x_{L,k}),    j = 1, 2, ..., N_{pop}

where x_{U,k} and x_{L,k} are the upper and lower bounds of the k-th variable, \mathrm{rand}_{k,j} is a random number between 0 and 1, and N_{pop} is the size of the population.

2. Mutation operation for generating a candidate:

\vec{v}_l^{\,t} = \vec{x}_{r0}^{\,t} + \alpha(\vec{x}_{r1}^{\,t} - \vec{x}_{r2}^{\,t})

where \alpha is a perturbation factor, here arbitrarily defined as 0.7, and the vectors \vec{x}_{r0}^{\,t}, \vec{x}_{r1}^{\,t} and \vec{x}_{r2}^{\,t} are randomly chosen from within the population and must be distinct from each other.

3. The next step is the crossover operation, where the generated vector can be accepted or not depending on the criterion

\vec{x}_l^{\,t+1} = \begin{cases} \vec{v}_l^{\,t}, & \text{if } \mathrm{rand}_{k,l} \le p_c \\ \vec{x}_l^{\,t}, & \text{otherwise} \end{cases}

where p_c is the crossover probability, defined for the present study as 0.7.
4. Finally, if the new vector \vec{v}_l^{\,t} provides a better value for the objective function than vector \vec{x}_l^{\,t}, the latter is replaced by the former in the next generation; otherwise \vec{x}_l^{\,t} remains in the population for one more generation.
5. Repeat steps 2–4 until a predefined maximum number of generations is reached.

Several tests with different parameters were made, but as a parameter comparison is not the focus of the current study, we only present the parameter set that, in our tests, best explored the space of feasible solutions.
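The following Python sketch (our illustration, assuming a generic objective function; not the authors' code) implements the DE variant described in steps 1-5, with α = 0.7 and p_c = 0.7:

import numpy as np

def differential_evolution(f, lower, upper, n_pop=30, n_gen=30,
                           alpha=0.7, pc=0.7, seed=None):
    """Minimize f over box bounds with the DE scheme of steps 1-5."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    # Step 1: random initial population inside the bounds.
    pop = lower + rng.random((n_pop, dim)) * (upper - lower)
    cost = np.array([f(x) for x in pop])
    for _ in range(n_gen):
        for l in range(n_pop):
            # Step 2: mutation with three distinct random members.
            r0, r1, r2 = rng.choice(n_pop, size=3, replace=False)
            v = pop[r0] + alpha * (pop[r1] - pop[r2])
            # Step 3: crossover, component by component.
            mask = rng.random(dim) <= pc
            trial = np.clip(np.where(mask, v, pop[l]), lower, upper)
            # Step 4: greedy selection against the current member.
            c = f(trial)
            if c < cost[l]:
                pop[l], cost[l] = trial, c
    best = np.argmin(cost)
    return pop[best], cost[best]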
3 Inverse Problem Formulation
As the aim of this paper is first to estimate the coefficients of the polynomial used to represent the function β(x), and as a consequence be able to predict the spreading of the contaminant, our first goal is to solve the inverse problem, which in this case is formulated as an optimization problem of minimizing a cost function in order to estimate the coefficients. In this scenario β(x) is given by

\beta(x) = Ax^5 + Bx^4 + Cx^3 + Dx^2 + Ex + F    (17)

and the objective function to be minimized is defined as

F_{obj}(\phi) = \sum_{i=0}^{N_{meas}} \left[(\phi_i^*(Z) - \Phi_i)^2 + G_i\right]    (18)

where

Z = [A^*, B^*, C^*, D^*, E^*, F^*]

is a vector of candidate coefficients, \phi_i^*(Z) is the solution of the BG Model with Z, and \Phi_i is the experimental data, both at the same location and time instant. In Eq. (18), G_i is a penalization for when β(x), with Z, generates values bigger than 1 or smaller than 0. That is necessary because, according to the biflux theory, β(x) must be constrained to the interval [0,1]. G_i is defined as

G_i = \begin{cases} 0, & \text{if } 0 \le \beta_i^Z \le 1 \\ (\beta_i^Z)^2, & \text{otherwise} \end{cases}    (19)

The inverse problem solution is then the vector Z, namely the coefficients of β(x) given by Eq. (17), that minimizes the cost function of Eq. (18), in which Φ represents the experimental values that have been observed. In this study, we have generated synthetic data using β_exact(x) = log(x + 1) and adding random noise from a Gaussian distribution with zero mean and 0.01 standard deviation, in order to simulate real experimental data, which always contain measurement errors. The polynomial we are searching for is intended to approximate the observed data generated with β_exact(x) in the BG Model. In order to find the vector Z, in the present work we make use of a Differential Evolution algorithm as described in the previous section.
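A Python sketch of Eqs. (17)-(19); solve_bg_model is a hypothetical stand-in for the finite-difference solver of Sect. 2.2, and the measurement grid is an assumption of this sketch:

import numpy as np

def beta_poly(x, Z):
    """Fifth-degree polynomial of Eq. (17); Z = [A, B, C, D, E, F]."""
    A, B, C, D, E, F = Z
    return A*x**5 + B*x**4 + C*x**3 + D*x**2 + E*x + F

def f_obj(Z, x_meas, Phi, solve_bg_model):
    """Cost function of Eq. (18) with the penalty of Eq. (19)."""
    beta = beta_poly(x_meas, Z)
    # Eq. (19): penalize beta values that leave the interval [0, 1].
    G = np.where((beta >= 0.0) & (beta <= 1.0), 0.0, beta**2)
    phi = solve_bg_model(Z, x_meas)  # model prediction at the data points
    return np.sum((phi - Phi)**2 + G)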
4 Results

4.1 Direct Problem and Case of Study
In this study, the direct problem was solved with K_2 = 0.1 and K_4 = 10^{-5}. The initial condition is given by

p(x, t = 0) = 0.5(1 + \cos(\pi(x - 1))),    0 \le x \le 2    (20)

and the boundary conditions are defined as

p(0, t) = p(2, t) = 0    (21)

\left.\frac{\partial p(x,t)}{\partial x}\right|_{x=0} = \left.\frac{\partial p(x,t)}{\partial x}\right|_{x=2} = 0    (22)
Figure 3a shows the β_exact(x) function and Fig. 3b its derivative.
Fig. 3. (a) Function β(x) = βexact (x) = log(x + 1), and the product β(x)[1 − β(x)]. (b) The derivative of β(x) with respect to x.
In this particular case, the solution of the biflux anomalous diffusion model without noise is presented in Fig. 4a. With the output of the direct problem, the synthetic data were generated by adding random noise from a Gaussian distribution; the noisy data are shown in Fig. 4b, and these are the data considered as our case of study.
4.2 Parameters Estimation
In the process of estimating the coefficients it is important to define a search interval for each element. In the current study we defined the lower limit as -1.0 and the upper limit as 1.0, except for coefficient F, which is contained in the interval 0.0 to 1.0. After 30 runs of the Differential Evolution algorithm with 30 particles each, the 10 best solutions in terms of the cost function value are ranked in Table 1. Figure 5 shows Solutions 1, 2, 5, 7 and 8 in comparison with the exact values of β(x) along x.
Fig. 4. (a) The solution of the BG Model with constant parameters: β(x) = log(x + 1), K_2 = 0.1 and K_4 = 10^{-5}, without noise. (b) The solution of the BG Model at t = 10 with β(x) = log(x + 1), K_2 = 0.1, K_4 = 10^{-5} and with noise from a Gaussian distribution N(μ = 0, σ = 0.01).
Fig. 5. A few selected solutions from Table 1 for the estimated polynomial β(x)
4.3 Prediction of Concentration
With the estimated coefficients we can then try to predict the contamination concentration at different instants of time. Figure 6a shows the results predicted with Solution #1 (Estimated 1) and Solution #10 (Estimated 2) from Table 1 at t = 10, for comparison with the experimental data. That is, from the 10 best solutions, the closest to zero and the farthest, in terms of the cost function value, are plotted.
Fig. 6. (a) The solution of the BG Model with two candidate solutions for Z: Result #1 and Result #10. (b) Estimated result at t = 20 for the concentration profile of the contaminant along axis x.
Table 1. The 10 best solutions obtained with the Differential Evolution algorithm after 30 runs, each with 30 as the maximum generation number and a population size of 30.

Solution  | A        | B         | C        | D         | E        | F
#1        | 0.000000 | -0.087332 | 0.283434 | -0.265461 | 0.337245 | 0.000000
#2        | 0.000000 | -0.064951 | 0.222200 | -0.157164 | 0.235142 | 0.000000
#3        | 0.023557 | -0.079058 | 0.187456 | -0.192952 | 0.267132 | 0.000884
#4        | 0.058743 | -0.205408 | 0.285988 | -0.146824 | 0.223975 | 0.000000
#5        | 0.000156 | -0.031347 | 0.094223 | 0.000000  | 0.161489 | 0.006469
#6        | 0.029741 | -0.080288 | 0.165702 | -0.202598 | 0.278812 | 0.013596
#7        | 0.009255 | -0.086152 | 0.164124 | -0.000173 | 0.137480 | 0.017369
#8        | 0.010510 | -0.087223 | 0.165826 | 0.000436  | 0.137618 | 0.018525
#9        | 0.046135 | -0.144138 | 0.172362 | -0.052807 | 0.182588 | 0.018140
#10       | 0.055398 | -0.095613 | 0.077877 | -0.117823 | 0.253057 | 0.021231
Std. Dev. | 0.023171 | 0.047393  | 0.068503 | 0.095899  | 0.065965 | 0.008984
Finally, as our main goal was to predict the contamination concentration profile at a time after the data were collected, the concentration profile obtained with the estimated β(x) and with β_exact(x) in the BG Model is shown in Fig. 6b. On average, the difference between the estimated curve and the exact curve did not grow with time, which indicates good accuracy in predicting the spreading pattern. The root mean square error (RMSE) between the predicted concentration and the exact data at t = 10 is 0.01, and it is approximately the same at t = 20.
5 Conclusions and Future Work
Our inverse problem approach of coefficient estimation with the aid of the Differential Evolution algorithm has proved successful: the final result was a very close prediction, with an error of magnitude of order 10^{-2} with respect to that predicted by the model with the exact parameters. This indicates that the approach can be a solution for those who want to try out the BG Model but have no available knowledge about β(x). For future work, firstly, β(x) can be treated as a multivariate polynomial, a function not only of spatial position but also of time and the concentration itself. Secondly, two- and three-dimensional cases should be explored with this polynomial approach.

Acknowledgments. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Fundação
Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ). D. Corrêa also acknowledges CAPES for the scholarship for his stay at the University of Granada (Grant CAPES/PrInt No. 88887.717186/2022-00). D. Pelta acknowledges support from projects PID2020-112754GB-I00, MCIN/AEI/10.13039/501100011033 and FEDER/Junta de Andalucía-Consejería de Transformación Económica, Industria, Conocimiento y Universidades/Proyecto (BTIC-640-UGR20).
References
1. Cleland, J.D., Williams, M.A.: Analytical investigations into anomalous diffusion driven by stress redistribution events: consequences of Lévy flights. Mathematics, MDPI AG (2022). https://doi.org/10.3390/math10183235
2. Bevilacqua, L., Galeão, A.C.N.R., Costa, F.P.: A new analytical formulation of retention effects on particle diffusion process. An Acad. Bras. Cienc. 83, 1443–1464 (2011)
3. Bevilacqua, L., Galeão, A.C.N.R., Simas, J.G., Doce, A.P.R.: A new theory for anomalous diffusion with a bimodal flux distribution. J. Brazilian Soc. Mech. Sci. Eng. 35(4), 1–10 (2013)
4. Blackledge, J.: Application of the fractional diffusion equation for predicting market behavior. Int. J. Appl. Math. 40(3), 130–158 (2010)
5. Silva Neto, A.J., Becceneri, J.C., Campos Velho, H.F. (eds.): Computational Intelligence Applied to Inverse Radiative Transfer Problems. EdUERJ (2016). ISBN 978-85-7511-368-4 (in Portuguese)
6. Bevilacqua, L., Jiang, M., Silva Neto, A., Galeão, A.C.R.N.: An evolutionary model of bi-flux diffusion processes. J. Brazilian Soc. Mech. Sci. Eng. 38(5), 1421–1432 (2015). https://doi.org/10.1007/s40430-015-0475-5
7. Storn, R., Price, K.: Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11, 341–359 (1997)
8. Jiang, M., Bevilacqua, L., Silva Neto, A.J., Galeão, A.C.N.R., Zhu, J.: Bi-flux theory applied to the dispersion of particles in anisotropic substratum. Appl. Math. Model. 64, 121–134 (2018). https://doi.org/10.1016/j.apm.2018.07.022
9. Raoui, H.E., Cabrera-Cuevas, M., Pelta, D.A.: The role of metaheuristics as solutions generators. Symmetry 13, 2034 (2021). https://doi.org/10.3390/sym13112034
10. Lugon Junior, J., Silva Neto, A.J., Rodrigues, P.P.G.W.: Assessment of dispersion mechanisms in rivers by means of an inverse problem approach. Inverse Prob. Sci. Eng. 16(8), 967–979 (2008). https://doi.org/10.1080/17415970802082864
Keeping Safe Distance from Obstacles for Autonomous Vehicles by Genetic Algorithms

Eduardo Bayona1,2, Jesús-Enrique Sierra-García1(B), and Matilde Santos3

1 Department of Electromechanical Engineering, University of Burgos, Burgos, Spain
[email protected], [email protected]
2 Laboratorio de Innovacion Michelin, Aranda de Duero, Spain
3 Institute of Knowledge Technology, Complutense University of Madrid, Madrid, Spain
[email protected]
Abstract. Industrial automation and autonomous mobile robots have become increasingly popular in warehouses and factories worldwide. However, some of these industrial robots, such as Autonomous Guided Vehicles (AGVs), usually operate in a workspace with humans and other machines, thus the risk of collisions and accidents cannot be ignored. In this context, safety measures for AGVs are key for their correct performance. This paper proposes a novel approach for optimizing trajectory design in autonomous vehicles, with a focus on ensuring safe navigation through occupancy maps. A mathematical tool that uses genetic algorithms to design trajectories that maintain a safe distance from obstacles is developed. The occupancy map is defined using geometric shapes formed by polylines, which enables more efficient calculations during the genetic algorithm search process.

Keywords: Soft computing · Automatic Guided Vehicle · Genetic Algorithms · Industry 4.0 · Trajectories

1 Introduction
The optimization of trajectory design is crucial for autonomous mobile robots, and different approaches have been proposed to address this problem [1]. One of the main objectives of trajectory planning is to design a safe and efficient path while avoiding obstacles present in the workspace. To achieve this objective, different strategies have been used, including heuristic-based methods, sampling-based techniques, and optimization-based methods [2]. Heuristic-based methods, such as potential-field-based ones, generate attractive and repulsive forces between the robot and obstacles to obtain a collision-free path. These methods are computationally efficient but lack the ability to guarantee global optimality [3]. Sampling-based methods, such as Rapidly Exploring Random Trees (RRT) and Probabilistic Roadmaps (PRM), generate paths by
randomly sampling the configuration space of the robot. These methods can guarantee collision-free paths but can be computationally expensive in high-dimensional configuration spaces [4]. Optimization-based methods, such as Genetic Algorithms (GA), Particle Swarm Optimization (PSO) or A* search, formulate trajectory planning as an optimization problem. This approach involves defining the objective function, decision variables and constraints in mathematical terms, to represent the problem goals and limitations. This is a critical step, as it establishes the mathematical basis that will be used to derive the optimal solution. In recent years, GA-based trajectory planning has gained significant attention due to its effectiveness in generating optimal trajectories in high-dimensional spaces [5]. The use of GA in trajectory planning has been studied lately, and numerous methodologies have been proposed in the literature. For instance, in [6] a GA-based approach was proposed to generate optimized trajectories for mobile robots in dynamic environments. The authors applied a modified fitness function that incorporates various variable information to optimize the trajectory. Another GA-based method proposed to generate trajectories for AGVs in industrial environments was a hybrid approach that combines GA with binary grid occupancy maps to generate the shortest possible collision-free paths between three waypoints [7]. Occupancy maps may play an important role in trajectory planning, as they provide information about the workspace and the obstacles present in it. Various approaches have been used to generate occupancy maps, including geometric shape-based maps [8], binary occupancy maps [7] and signed distance fields [9]. In this work, a novel approach for generating an occupancy map is presented, which uses closed geometric shapes defined by polylines. Differently from [8], this work builds an occupancy map from an already defined performance site, without any feedback from the environment. This approach facilitates the calculation of the distance between the vehicle and obstacles, and it enables the use of the vehicle-to-obstacle distance as a trajectory optimization parameter. Additionally, this research introduces a method for defining the occupancy map using geometric shapes formed by polylines, which improves computational efficiency and allows the import of CAD maps. This work utilizes Frenet curves for trajectory planning in a geometric occupancy map, distinguishing it from other articles that employ grid-based occupancy maps and straight lines for path definition [10,11], resulting in more natural and smooth trajectories. The use of Frenet curves enables efficient and adaptable path planning in complex environments, enhancing path smoothness, obstacle avoidance, and adaptability, overcoming the limitations of grid-based approaches. The methodology proposed in this work contributes to the development of more efficient and safe autonomous mobile robots. When collisions with obstacles are defined as fixed penalties in the cost function or as constraints to discard solutions, the GA may become stuck in a local minimum, as many solutions may produce the same fitness function value. Thus, in this work the fitness function considers the distances both with and without collisions. Then, a piece-wise
fitness function is defined to maintain the continuity of its value throughout its range and to aid in the optimization process. In contrast, alternative approaches focus on different factors, such as considering trajectory length after ensuring collision-free paths [12] or penalizing the fitness function based on the number of collision events [13]. By incorporating obstacle proximity into the optimization process, our method strikes a balance between collision avoidance and efficient path planning, offering enhanced safety as a distinct advantage. The rest of the paper is structured as follows. Section 2 describes the methodology used to develop the path planning, including the genetic algorithm iteration process and the definition of the occupancy map. Section 3 presents simulation experiments whose results demonstrate the effectiveness of the proposed solution. Finally, the paper concludes with conclusions and future work.
2 Optimization Methodology
In order to automate the process of designing trajectories for AGVs, different methodologies can be applied that allow optimal trajectory planning while enabling users to easily comprehend and interpret the outcomes [14]. These trajectories must effectively avoid the obstacles of the occupancy map. To do so, genetic algorithms are used, as they can quickly search the space of possible solutions and find an optimal path [7], minimizing collisions with obstacles. This paper focuses on trajectory optimization for Automated Guided Vehicles (AGVs) by maximizing the average distance from obstacles in an occupancy map. The goal is to ensure the highest level of safety and avoid collisions. The study involves defining an occupancy map for the vehicle’s operation, and a genetic algorithm is employed to compute the optimized trajectory using parameters such as initial and final waypoint coordinates, departure angles, intermediate waypoints, and vehicle dimensions. Subsequently, we analyze the optimized trajectory solution obtained by the algorithm using the following metrics to assess each vehicle’s capability to traverse it: curvature, minimum and average distance to obstacles and trajectory length.
Fig. 1. Optimized trajectory generation process diagram
The diagram in Fig. 1 illustrates the iterative process of trajectory optimization. The genetic algorithm provides the departure angles of the intermediate waypoints. The trajectory generator computes the Frenet curve considering the waypoints and the departure angles provided by the GA. The suitability of each calculated trajectory is evaluated by studying the collisions and the distances to the obstacles; to do this, the dimensions of the AGV are considered. Collisions and distances are fed into the fitness function until the optimal solution for the initially proposed problem is found.

2.1 Occupancy Map
Real-world testing for trajectory optimization is often impractical, leading to the use of simulated environments. However, simulations have limitations and may not fully replicate real-world complexities. To address this, occupancy maps and visualization tools are advantageous for validating solutions, enhancing understanding, and identifying areas for improvement. These tools bridge the gap between simulations and real-world scenarios, providing valuable insights. The environment and obstacle characteristics, such as shape, size, and position, should be described in an occupancy map, which serves as the main input for trajectory planning. A binary occupancy map discretely represents obstacle locations in a grid of cells, with each cell indicating its state as occupied or unoccupied. The resolution of the occupancy map determines the cell size and collision detection detail. The formal representation of the occupancy grid is expressed as

a_{GRID} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{pmatrix}    (1)

where (n, m) are integers representing the dimensions of the workspace. If a location is occupied, it is represented by a 1, otherwise by a 0, as given in

a_{nm} \in \{0, 1\}: a_{nm} = \begin{cases} 1, & \text{occupied location} \\ 0, & \text{free location} \end{cases}    (2)

The use of binary occupancy maps is a suitable way to represent obstacles and perform collision-free trajectory optimization, but computational efficiency decreases as the need for precision and the map size increase. This research proposes a new approach to defining an occupancy map using closed polylines formed by the vertices of obstacles in the workspace. This method results in more efficient calculations and allows the use of the distance to obstacles as a continuous, rather than discrete, trajectory optimization parameter, which enhances the optimization process through a better definition of the fitness function.
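As an illustration (our sketch, not the authors' tool), the same obstacle can be stored either as the binary grid of Eqs. (1)-(2) or as the closed polyline of its vertices, the representation this paper advocates:

import numpy as np

# Binary occupancy grid, Eqs. (1)-(2): one cell per workspace location,
# 1 = occupied, 0 = free. Memory and lookup cost grow with resolution.
grid = np.zeros((100, 100), dtype=np.uint8)
grid[20:40, 30:60] = 1  # a rectangular obstacle rasterized into cells

# Polyline representation: the same obstacle stored only by its
# vertices (closed polygon), independent of the map resolution.
obstacle = np.array([(30.0, 20.0), (60.0, 20.0),
                     (60.0, 40.0), (30.0, 40.0)])
segments = list(zip(obstacle, np.roll(obstacle, -1, axis=0)))  # edges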
During trajectory generation, the minimum distance from each point to the obstacle set is calculated to avoid collision. The accuracy of distance calculations is improved by importing CAD maps made up of polylines, which also reduces computational load when scaling up the map's precision or dimension. The reduced computational load is achieved by exploiting the ability of the collision distance calculation method to operate on each polyline that forms the obstacles; in contrast, when using a binary map, the calculation has to be performed for each cell representing the points inside the obstacles.

2.2 Distance to Obstacles Calculation
This approach considers the Euclidean distance between the vehicle and the occupancy map obstacles, aiming to maximize the average distance value while the vehicle is in motion. To optimize the trajectory based on this metric, a genetic algorithm is used. The key to obtaining good performance is a suitable definition of the fitness function and the configuration of the parameters. Once the occupancy map is generated, the algorithm requires the specification of the initial conditions before starting the trajectory optimization. These initial conditions include the definition of the start and end points, as well as intermediate waypoints that the vehicle must pass through. Based on these conditions, the tool determines the optimal trajectory that passes through those points in accordance with the designated optimization parameter. The generated trajectory is represented by the Frenet equations, which define a flat curve in R². The tangent vector T and the normal vector N satisfy Eq. (3), known as the Frenet equations of a trajectory [15]:

T'(s) = k(s)N(s),    N'(s) = -k(s)T(s)    (3)

Solutions for this optimization problem are defined as a matrix of n elements, θ(n), where n is the number of intermediate points and θ is the output angle of the trajectory at those intermediate points. Therefore, there will be as many solution components as intermediate points. The angle values lie within the limits of 0 and 2π. These conditions are formally expressed in (4), where N_w denotes the number of intermediate waypoints:

\theta_i \in [0, 2\pi],    i \in \mathbb{N},\ i \le N_w    (4)
The calculation of the distance between obstacles and the vehicle uses the singular points of the vehicle, i.e., the vertices of the vehicle, and the segments that delineate the map obstacles. To calculate the distance between obstacles and each point of the trajectory, first the projection of each trajectory point on the lines containing the segments that define the obstacles in the map is obtained. If the projection falls on the obstacle segment, the distance value is calculated as the norm of the vector that connects that point and its projection. Otherwise, the distance is the norm of the vector that connects the point and its nearest vertex. This process is expressed mathematically in Eq. (5):

Q = \mathrm{proj}_{\vec{AB}}(P) = \frac{\vec{AB} \cdot \vec{AP}}{\|\vec{AB}\|^2}\,\vec{AB}    (5)

Therefore, let D(P, \vec{AB}) be the distance of the point P to the segment \vec{AB}, that is, the distance from the vehicle, point P, to the closest segment of the occupancy map, \vec{AB}. If the point Q belongs to the segment \vec{AB}, the distance is the norm of \vec{QP}, as in (6). In case Q is outside the \vec{AB} segment, the distance is the norm of \vec{AP}, with A the vertex nearest to P:

D(P, \vec{AB}) = \begin{cases} \|\vec{QP}\| = \sqrt{(Q_x - P_x)^2 + (Q_y - P_y)^2}, & \text{if } Q \in \vec{AB} \\ \|\vec{AP}\| = \sqrt{(A_x - P_x)^2 + (A_y - P_y)^2}, & \text{if } Q \notin \vec{AB} \end{cases}    (6)
where (Q_x, Q_y) are the coordinates of point Q, (P_x, P_y) are the coordinates of point P, and (A_x, A_y) are the coordinates of point A.
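A minimal Python sketch of Eqs. (5)-(6) (our illustration under the paper's definitions, not the authors' code):

import numpy as np

def point_segment_distance(P, A, B):
    """Distance from point P to segment AB, following Eqs. (5)-(6)."""
    P, A, B = map(np.asarray, (P, A, B))
    AB, AP = B - A, P - A
    t = np.dot(AB, AP) / np.dot(AB, AB)  # scalar projection, Eq. (5)
    if 0.0 <= t <= 1.0:                  # projection falls on the segment
        Q = A + t * AB
        return np.linalg.norm(P - Q)     # ||QP||, first case of Eq. (6)
    # Otherwise use the nearest endpoint (vertex) of the segment.
    return min(np.linalg.norm(P - A), np.linalg.norm(P - B))

d = point_segment_distance((1.0, 2.0), (0.0, 0.0), (3.0, 0.0))  # -> 2.0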
2.3 Genetic Algorithm Fitness Function
In order to determine the optimal solution for each given scenario, the optimization method undergoes an iterative calculation process that explores possible trajectories within a predefined set of initial parameters, such as the start and end waypoint coordinates and their angles of departure, the intermediate waypoint coordinates and the automated guided vehicle dimensions. During each iteration, the algorithm utilizes the aforementioned coordinates and departure angles to calculate possible trajectories for the AGV. Based on the vehicle's dimensions and the obstacle locations, the algorithm calculates the average distance between the vehicle and its closest obstacle for each point along the trajectory. The initial solutions are generated randomly following the constraints defined in Eq. (4). The "intermediate" crossover operator is employed in this work. In this operator, two parent solutions are randomly selected from the population, and a new child solution is created by averaging the corresponding variables of the parents. The averaging process incorporates a random weight factor for each variable, enhancing the diversity of the offspring. For the mutation operator, the "boundary" mutation approach is adopted. This operator introduces random modifications to selected variables of a solution within a range established inside the bounds of the variable; specifically, a range of 20% of the variable's total range is utilized for the boundary mutation. To form the new population, individuals are selected using a combination of fitness-proportionate selection and elitism. The selection probability for each individual is determined by normalizing their fitness values. In each generation, two elite individuals, representing the best solutions from the current generation, are directly included in the next generation.
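The following Python sketch (our reading of the operators described above; names and details are assumptions) illustrates the intermediate crossover, the boundary mutation over 20% of the variable range, and fitness-proportionate selection; elitism would simply copy the two best individuals into the next generation:

import numpy as np

rng = np.random.default_rng()

def intermediate_crossover(p1, p2):
    """Child = weighted average of parents, random weight per variable."""
    w = rng.random(p1.size)
    return w * p1 + (1.0 - w) * p2

def boundary_mutation(x, lower, upper, frac=0.2):
    """Perturb one random variable within 20% of its total range."""
    k = rng.integers(x.size)
    span = frac * (upper[k] - lower[k])
    y = x.copy()
    y[k] = np.clip(y[k] + rng.uniform(-span, span), lower[k], upper[k])
    return y

def select_parents(pop, fitness):
    """Fitness-proportionate selection for a minimization problem."""
    weights = fitness.max() - fitness + 1e-12  # lower fitness, higher weight
    prob = weights / weights.sum()
    i, j = rng.choice(len(pop), size=2, p=prob)
    return pop[i], pop[j]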
The fitness function of the optimization process plays a critical role in ensuring that the algorithm finds optimal solutions while avoiding being stuck in sub-optimal ones. For instance, when optimizing trajectory length and collision avoidance, the existence of collisions should be penalized in the fitness function. However, poorly designed penalty coefficients may cause the fitness function to remain constant throughout the iterations, leading to premature termination of the algorithm. To address this, treating collision avoidance as a continuous parameter allows for a more accurate assessment of collision risk and evaluation of solution suitability at each iteration. This approach reduces computational load and enables the exploration of a broader range of solutions, preventing the algorithm from being trapped in sub-optimal outcomes. In this work, a dual fitness function is developed to evaluate the best solutions both in the presence and in the absence of collisions. Both fitness subfunctions are normalized to ensure continuity in the values obtained by the algorithm during the iterations. The formal expression of the fitness function is as follows:

f = \begin{cases} 1 + \sum_{i=1}^{n} \min_{j \in N_o} Dc_{ij}, & \text{if } Dc \ne 0 \\ 1 - \frac{1}{n}\sum_{i=1}^{n} \frac{\min_{j \in N_o} D_{ij}}{\max_{j \in N_o} D_{ij}}, & \text{if } Dc = 0 \end{cases}    (7)

where n is the number of trajectory points, N_o is the number of obstacles in the occupancy map, D_{ij} denotes the distance from point i to obstacle j when there is no collision, and Dc_{ij} when a collision occurs. Dc is the sum of distances to all colliding obstacles. Thus, when there is no collision (Dc = 0), the fitness function aims to maximize the distance to obstacles; when collisions occur (Dc ≠ 0), the fitness function is designed to minimize them.
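A Python sketch of the dual fitness of Eq. (7), to be minimized; the distance matrix and the collision mask are assumed inputs coming from the distance computation of Sect. 2.2:

import numpy as np

def fitness(D, colliding):
    """Dual fitness of Eq. (7).

    D         -- (n_points, n_obstacles) distance matrix
    colliding -- boolean mask of the same shape, True where a
                 trajectory point collides with an obstacle
    """
    if colliding.any():
        # Collision branch: sum, over colliding points, of the smallest
        # distance among their colliding obstacles.
        Dc = np.where(colliding, D, np.inf)
        return 1.0 + Dc.min(axis=1)[colliding.any(axis=1)].sum()
    # Collision-free branch: reward a large normalized clearance.
    ratio = D.min(axis=1) / D.max(axis=1)
    return 1.0 - ratio.mean()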
3 Simulation Results
To validate this approach to designing trajectories for the AGV, simulation experiments have been carried out. The experiment involves the generation of an industrial operating area with several obstacles for the AGV using a CAD tool. The occupancy map includes two layers, each one represented by a different color: the blue layer denotes fixed obstacles and the green layer represents mobile obstacles. The experiment in Fig. 2 starts with manual selection of initial and final waypoints (red dots) and intermediate waypoints (blue dots) on a given occupancy map. These points are chosen with human intervention based on the desired start and end positions of the vehicle and the required waypoints during its movement. The algorithm then utilizes the angles of the intermediate points as optimization variables. The configuration parameters of the genetic algorithm, shown in Table 1, play an important role in guiding the optimization process.
Fig. 2. Waypoint positions within the occupancy map (Color figure online)
During the development of our research, we carefully evaluated different parameter configurations for the genetic algorithm. After an extensive analysis, we made the decision to employ the parameters outlined in Table 1. These selected parameters consistently exhibited good performance in terms of convergence time, accuracy, and the algorithm's ability to explore the solution space effectively.

Table 1. Genetic algorithm parameters

Population Size | Elite Count (%) | Crossover Fraction (%) | No. Generations
100             | 2               | 80                     | 10
As shown in Fig. 2, the waypoints were selected manually, and the algorithm operates using the output angles of the intermediate points as optimization variables, with the GA parameters of Table 1. The resulting trajectory is visualized in red, and the profile of the AGV vehicle at each point of the trajectory in blue (Fig. 3). The dimensions of the AGV used to execute this experiment are 2104 × 500 mm. The red points on the trajectory correspond to the input waypoints, which are highlighted in Fig. 3. The yellow lines indicate the minimum distance between the vehicle vertices and the obstacles in the occupancy map. Despite not following the shortest path to its
destination, the trajectory generated by the algorithm maintains the maximum possible average distance to obstacles in the occupancy map (yellow lines) as desired.
Fig. 3. Optimized trajectory representation in the occupancy map (Color figure online)
The proposed method is able to adapt to the problem based on the feedback provided by the fitness function, which guides the algorithm towards more optimal solutions over time. The algorithm prioritizes trajectories that prevent collisions, followed by those that have the maximum distance from the vehicle vertices to environment obstacles, making navigation safer. Table 2 contains analytical results of the metrics that define the trajectory optimized by the genetic algorithm. These data show the feasibility of the solution found by the GA in real cases like this one.

Table 2. Experiment metrics analytics results (mm)

Min. Dist. to Obstacles | Average Distance | Trajectory Length | Minimum Curvature | Fitness function Value
735.15                  | 1086.2           | 12566             | 1610.2            | 0.1399

4 Conclusions and Future Works
This research proposes a novel approach for optimizing the design of safe trajectories for AGVs by using genetic algorithms. The main goal is to ensure safe vehicle operation by maintaining the maximum distance to obstacles. The algorithm
considers various factors that influence the motion and safety of autonomous vehicles, including vehicle size and obstacle position. A method that defines the occupancy map using geometric shapes formed by polylines is presented, which reduces computational time. This approach facilitates the calculation of the vehicle-to-obstacle distance and enables the importation of CAD maps. Future work could involve implementing diverse fitness functions that consider additional factors like energy efficiency or minimizing trajectory completion time. Additionally, exploring multi-objective optimization techniques could offer comprehensive solutions by considering trade-offs and real-world constraints in trajectory optimization.
References
1. Abajo, M.R., Sierra-García, J.E., Santos, M.: Evolutive tuning optimization of a PID controller for autonomous path-following robot. In: Sanjurjo González, H., Pastor López, I., García Bringas, P., Quintián, H., Corchado, E. (eds.) SOCO 2021. AISC, vol. 1401, pp. 451–460. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-87869-6_43
2. Sánchez-Ibáñez, J.R., Pérez-del-Pulgar, C.J., García-Cerezo, A.: Path planning for autonomous mobile robots: a review. Sensors 21(23), 7898 (2021)
3. Fang, Y., Yao, Y., Zhu, F., Chen, K.: Piecewise-potential-field-based path planning method for fixed-wing UAV formation. Sci. Rep. 13(1), 2234 (2023)
4. Ma, H., Meng, F., Ye, C., Wang, J., Meng, M.Q.-H.: Bi-Risk-RRT based efficient motion planning for autonomous ground vehicles. IEEE Trans. Intell. Veh. 7(3), 722–733 (2022)
5. Liu, T., Liang, Z.: Design of multi-AGV intelligent collision avoidance system based on dynamic priority strategy (2022)
6. Lamini, C., Benhlima, S., Elbekri, A.: Genetic algorithm based approach for autonomous mobile robot path planning. Procedia Comput. Sci. 127, 180–189 (2018)
7. Bayona, E., Sierra-García, J.E., Santos, M.: Optimization of trajectory generation for automatic guided vehicles by genetic algorithms. In: García Bringas, P., et al. (eds.) SOCO 2022. LNNS, vol. 531, pp. 484–492. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-18050-7_47
8. Wolter, D., Latecki, L.J., Lakämper, R., Sun, X.: Shape-based robot mapping. In: Biundo, S., Frühwirth, T., Palm, G. (eds.) KI 2004. LNCS (LNAI), vol. 3238, pp. 439–452. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30221-6_33
9. Liu, H., Zhang, X., Gao, H., Yuan, J., Chen, X.: Robust localization and map updating based on Euclidean signed distance field map in dynamic environments (2021)
10. Yang, S.X., Hu, Y., Meng, M.Q.-H.: A knowledge based GA for path planning of multiple mobile robots in dynamic environments. In: 2006 IEEE Conference on Robotics, Automation and Mechatronics, pp. 1–6 (2006)
11. Hu, Y., Yang, S.X.: A knowledge based genetic algorithm for path planning of a mobile robot. In: IEEE International Conference on Robotics and Automation 2004, Proceedings, ICRA 2004, vol. 5, pp. 4350–4355 (2004)
12. Samadi, M., Othman, M.F.: Global path planning for autonomous mobile robot using genetic algorithm. In: 2013 International Conference on Signal-Image Technology & Internet-Based Systems, pp. 726–730 (2013)
13. Griffiths, I.J., Mehdi, Q.H., Wang, T., Gough, N.E.: A genetic algorithm for path planning. IFAC Proc. Vol. 30(7), 485–490 (1997). 3rd IFAC Symposium on Intelligent Components and Instruments for Control Applications (SICICA 1997), Annecy, France, 9–11 June 1997
14. Espinosa, F., Santos, C., Sierra-García, J.E.: Transporte multi-AGV de una carga: estado del arte y propuesta centralizada. Revista Iberoamericana de Automática e Informática Industrial 18(1), 82–91 (2020)
15. Alencar, H., Santos, W., Neto, G.S.: Differential Geometry of Plane Curves. Student Mathematical Library. American Mathematical Society (2022)
An Approach of Optimisation in Last Mile Delivery

Dragan Simić1(B), José Luis Calvo-Rolle2, José R. Villar3, Vladimir Ilin1, Svetislav D. Simić1, and Svetlana Simić4

1 Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia
[email protected], {dsimic,v.ilin,simicsvetislav}@uns.ac.rs
2 Department of Industrial Engineering, University of A Coruña, Avda. 19 de Febrero S/N, 15405 Ferrol, A Coruña, Spain
[email protected]
3 University of Oviedo, Campus de Llamaquique, 33005 Oviedo, Spain
[email protected]
4 Faculty of Medicine, University of Novi Sad, Hajduk Veljkova 1–9, 21000 Novi Sad, Serbia
[email protected]
Abstract. Transport is the backbone of the economy, aiming to move people and goods efficiently. Part of the gross domestic product is collected from vehicle taxes, energy taxes, and taxes on fuel. On the other side, 25% of the CO2 emissions of the whole transport sector come from urban transport. Simultaneously, new technologies are developed and applied in real life, and e-commerce represents approximately 10% of the global retail landscape. Although home delivery is convenient for the customer, last mile delivery (LMD) poses significant logistical challenges for companies. The aim of this paper is to propose an approach to cost-optimal routing of a truck-and-drone system for LMD. The applied solution is a combinatorial optimisation algorithm based on a genetic algorithm. The experimental results demonstrate that it is possible to optimise last mile delivery and significantly reduce the total distance of the truck route and the drone route.

Keywords: Last mile delivery · parcel delivery · e-commerce · genetic algorithm · urban transportation
1 Introduction

Transportation has huge economic, social and environmental impacts. Transport is the backbone of the European economy, accounting for about 7% of the gross domestic product (GDP), while the transport industry offers over 5% of total employment in the European Union (EU). As a network industry, transport requires elements such as infrastructures, vehicles, equipment, information and communication technology (ICT) applications and operational procedures to interact smoothly in order to move people and goods efficiently. Public revenues also benefit from transportation: 0.6% of the GDP
is collected from vehicle taxes, and the greatest part of energy taxes, amounting to 1.9% of the GDP, comes from taxes on fuel [1]. Regarding environmental aspects, transportation has been the sector with the largest growth rate of greenhouse gas (GHG) emissions compared to 1990. Moreover, road transport caused 39,000 deaths in the EU in 2008. 60% of global oil consumption and 25% of energy consumption are due to transportation [1]. 69% of road accidents occur in cities, and 25% of the CO2 emissions of the whole transport sector come from urban transport [2]. Therefore, one can easily understand the importance of transportation in urban areas for private individuals, public authorities and enterprises. E-commerce represents approximately 10% of the global retail landscape, while driving most of the growth in this sector. E-retailing continues to grow at a 20% rate, becoming a US$ 4 trillion market in 2020 [3]. E-retailing goes hand-in-hand with last mile delivery (LMD) services, which include parcel and grocery deliveries. Although home delivery is convenient for the customer, LMD poses significant logistical challenges for companies. The most recent innovation in urban transportation and LMD is autonomous drones. The first official tests with drones started in 2013, performed by Amazon. The drones could deliver packages weighing up to 2.3 kg to customers within 30 min of them placing an order [4]. In the following year, DHL launched its initial drone operation for research purposes, focused on remote areas with restricted access [5]. The importance and complexity of this problem can be represented by the following example: on March 15, 2021, Amazon.com launched the 2021 Amazon Last Mile Routing Research Challenge with scientific support from a team of researchers at the Massachusetts Institute of Technology's Center for Transportation and Logistics. The research challenge encouraged participants worldwide to develop innovative approaches and optimisation methods to produce delivery routes [6]. This paper presents an approach to cost-optimal routing of a truck-and-drone system for LMD, showing how to minimize the total cost of a delivery tour. The proposed solution is a combinatorial optimisation algorithm that includes a genetic algorithm (GA). The search is divided into two sub-problems: (1) the first is the optimisation of the truck routing, while (2) the second is the optimisation of the drone routing. The drone is characterised by its range (flight distance) and speed. This approach minimizes total LMD cost, and it is tested with different artificial data sets, whose numerical experimental results are presented. This research directly continues and expands the authors' previous research on the vehicle routing problem, supply chain management, and detection of anomalies in industrial systems presented in [7–11]. The rest of the paper is organized in the following way: Sect. 2 overviews the related work. Section 3 presents the modelling of the last mile delivery approach and the proposed combinatorial optimisation GA implemented in it; this section also describes the used dataset. Experimental results and discussion are presented in Sect. 4 and, finally, Sect. 5 provides concluding remarks.
2 Related Work

The vehicle routing problem (VRP) has been an important research topic in operations research for decades [12]. The VRP defines a class of combinatorial optimisation problems for optimising the itineraries of a fleet of vehicles, when these vehicles operate round trips and have multiple stops along their itinerary. This situation represents a large part of the flow of vehicles for goods distribution in cities. The major applications of the VRP arise in the field of transportation, especially LMD. In recent years, a growing number of logistics companies have introduced drones or unmanned aerial vehicles in their delivery operations [13]. The truck–drone routing problem (TDRP), where trucks and drones are scheduled and coordinated to serve customers, is presented in [14]. In that paper, several truck–drone routing models are presented: first, two basic models for the travelling salesman problem with drones and the vehicle routing problem with drones; second, research devoted to the TDRP, classified according to the addressed constraints and features. One of the most challenging problems in last mile logistics (LML) has been strategic delivery, due to various market risks and opportunities. The research paper [15] provides a systematic review of LML-related studies to find current issues and future opportunities for the LML service industry. To that end, 169 papers were selected as target studies for in-depth analysis of recent LML advances. First, text mining analysis was performed to effectively understand the underlying LML themes in the target studies. Then, a novel definition and typology of LML delivery services were suggested. Finally, that paper proposed the next generation of LML research through advanced delivery technique-based LML services, environmentally sustainable LML systems, improvement of LML operations in real industries, effective management of uncertainties in LML, and LML delivery services for decentralized manufacturing services. In the wake of e-commerce and its successful diffusion in most commercial activities, last mile distribution causes more and more trouble in urban areas all around the globe. Growing parcel volumes to be delivered to customer homes increase the number of delivery vans entering city centres and thus add to congestion, pollution, and negative health impacts. Among the most prominent novel concepts are unmanned aerial vehicles, drones and autonomous delivery robots taking over parcel delivery. The paper [16] surveys the established and novel last mile concepts and puts special emphasis on the decision problems to be solved when setting up and operating each concept, discussing the most important decision problems and surveying the existing operations research methods for solving them. During recent years, advances in drone technologies have made them applicable in various fields of industry, and their popularity continues to grow. In the research paper [17], the academic contributions on drone routing problems between 2005 and 2019 are analysed to identify the main characteristics of these types of problems, as well as the research trends and recent improvements.
3 Modelling the Last Mile Delivery

Last mile delivery includes the activities necessary for the customer to obtain the goods purchased; these activities are typical for urban areas, city centres, home deliveries, and rural logistics regions [18]. Customers demand more flexibility, timeliness of delivery, reliability and customer service through varying delivery models, such as climate control, same-day location changes, white-glove home delivery or seamless returns.

3.1 Parcel Delivery Models

A freight carrier usually operates about three delivery vans to deliver products to their customers in Torino's city centre, with a vehicle fill-in rate of 50–75% reported [19]. Also, according to the research paper [20] from 2018, a parcel delivery van usually performs on average 37 stops to deliver 118 items to 72 customers per delivery round in central London. Moreover, the rate of successful deliveries on first attempt for business-to-business (B2B) deliveries is usually higher compared to business-to-consumer (B2C) deliveries, as home deliveries often require multiple attempts due to the unavailability of consumers at home when the courier reaches their residential address. Based on the literature review, it was possible to identify five aspects of interest from the e-commerce consumer perspective: (1) delivery point; (2) delivery time and speed; (3) track and trace; (4) value-added services; and (5) delivery price [21]. On the other hand, many local governments have applied various restriction policies limiting large freight vehicles from entering the city centre based on weight, class or time [22]. This contributed to freight carriers depending more on light commercial vehicles (LCVs) for deliveries and pick-up services in the inner-city area, compared to primarily trucks in suburban parts. Therefore, all shipping, distribution, and supply chain management companies (United Parcel Service Inc., FedEx, DHL, Amazon.com) research, improve and apply their distribution activities in city centres and urban areas. As a result, there has been a growth in the need for LMD services globally, driven by the development and growth of the e-commerce sector and the rise in trading activities. Current sustainable transportation modes are: (1) electric vans: the standard delivery vehicle is the diesel van because it is the cheapest source of energy for such vehicles; however, it is not sustainable in the future; (2) cargo bikes or the Cubicycle, appropriate for delivering small packages; and (3) electric scooters, which are nowadays being used for the distribution of goods in last mile delivery. On the other side, future sustainable transportation modes include: (1) drones, which refer to aircraft that fly without a crew and perform their function remotely; (2) unmanned aerial vehicles (UAVs), capable of autonomously maintaining a controlled and sustained flight level powered by electric batteries; (3) autonomous vehicles, capable of sensing their environment and navigating without any human input; and (4) subway automation systems, i.e., tube transport or underground systems for sustainable last mile delivery [23].
3.2 Mainframe of the Last Mile Delivery Algorithm

The proposed solution is an algorithm based on combinatorial optimisation, implemented with a GA.
Algorithm 1. The genetic algorithm applied for last mile delivery

Begin
Step 1: Initialization. Number of iterations numIter = 500; population size popSize = 100; speed of drone = 2; drone delivery range = 2; set the distance matrix for the delivery places.
Step 2: The cost function = Total Distance (TotalDist).
Step 3: for i = 1 : numIter do (calculate Total Distance)
          for p = 1 : popSize do
            drone_dist = (launch, deliver) + (deliver, rendezvous)
            truck_dist = (launch, rendezvous)
            if drone_dist < range   % check case for drone delivery range
              case = 1   % truck delivers and carries the drone
              case = 2   % truck and drone deliver to next places
            end if
            BestDist = max((truck_dist / truck_speed), (drone_dist / drone_speed))   % distance converted to time
          end for p
Step 4:   % Genetic algorithm operators
          Randomly sample the population matrix: find the minimum distances and the best route.
          Randomly select two route insertion points and sort them.
          Mutate the best row (BestDist) to get three new routes and, with the original, form a small matrix of 4 rows of best times.
          for k = 1 : 4 do
            case = 1   % flip two of the cities and the cities between them
            case = 2   % swap two cities
            case = 3   % slide a segment
            case = 4   % increment the sequence one space
            Using the original population, create a new population
          end for k
          Update the entire population with the mutations
        end for i
Step 5: The cost function is min BestDist => Total Distance.
        Post-processing of the results and visualization.
End.
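The four route-mutation operators of Step 4 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation; in particular, reading case 4 ("increment the sequence one space") as a cyclic shift is our assumption.

```python
import random

def mutate_route(route, i, j, op):
    """Apply one of the four mutation operators of Algorithm 1 to a route.
    route: list of node indices; i < j are the two selected insertion points."""
    new = route[:]
    if op == 1:          # flip: reverse the two cities and the cities between
        new[i:j + 1] = new[i:j + 1][::-1]
    elif op == 2:        # swap two cities
        new[i], new[j] = new[j], new[i]
    elif op == 3:        # slide segment: move city i next to position j
        new.insert(j, new.pop(i))
    elif op == 4:        # increment sequence one space (interpreted as a cyclic shift)
        new = new[-1:] + new[:-1]
    return new

# Mutate the best route with all four operators, as in Step 4, keeping the
# original row as well to form the small 4-plus-original candidate matrix.
best = [20, 9, 13, 6, 19, 14, 11, 2, 15, 8, 7]   # example route (nodes)
i, j = sorted(random.sample(range(len(best)), 2))
candidates = [best] + [mutate_route(best, i, j, op) for op in (1, 2, 3, 4)]
```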
This GA solves the truck–drone tandem, or last mile, variant of the travelling salesman problem for parcel delivery operations. The aim of this approach is to minimise the total last mile delivery cost. Each truck carries a drone which is launched from a stop to deliver a parcel, within range, to a nearby stop while operating in parallel with the truck.
The truck and drone work in parallel to deliver packages. Therefore, the search optimisation problem is divided into two sub-problems: (1) first, the truck routing is optimised; (2) second, the drone routing is optimised. The drone is constrained by its range (flight distance) and speed. As such, it has to operate in close proximity to the truck, within an operation. An operation is when a truck launches the drone, truck and drone deliver to separate locations, and the truck then recovers the drone at a rendezvous location for battery swaps and loading. The main idea is to determine a route for the truck and the drone, as well as the operations, whose cost function minimises the total time, computed from the minimum total last mile delivery cost. Total time is based on the times of the operations (launch – deliver – recover); the maximum time of an operation, for truck or drone, is used to calculate the total time of the route. The basic steps of the genetic algorithm applied to LMD are summarised in the pseudo code of Algorithm 1.

3.3 Dataset

Fifty artificial datasets were generated; one of them is presented as an example in Table 1. The nodes are given by their coordinates in km, so the distances between nodes are all well below one mile. The experimental results presented below use this dataset.

Table 1. Coordinates of the nodes (in km)
Node   |  1    2    3    4    5    6    7    8    9    10
x (km) | 0.02 0.87 0.24 0.97 0.52 0.39 0.80 0.89 0.06 0.01
y (km) | 0.04 0.62 0.67 0.17 0.61 0.82 0.29 0.21 0.74 0.98

Node   | 11   12   13   14   15   16   17   18   19   20
x (km) | 0.74 0.47 0.22 0.68 0.99 0.92 0.65 0.76 0.53 0.34
y (km) | 0.69 0.06 0.80 0.81 0.41 0.81 0.92 0.45 0.65 0.28
4 Experimental Results

The cost function minimises the total last mile delivery cost, with the best distance converted into time. The cost function of the combinatorial optimisation GA implemented for LMD optimisation is:

$BestDist = \min \sum_{i=1}^{20} \max\left(\frac{TruckDist_i}{TruckSpeed}, \frac{DroneDist_i}{DroneSpeed}\right)$   (1)
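As a hedged illustration, Eq. (1) can be evaluated directly from the per-operation leg lengths; the truck speed of 1 is our assumption, while the drone speed of 2 follows the initialization in Algorithm 1.

```python
def best_dist(truck_legs, drone_legs, truck_speed=1.0, drone_speed=2.0):
    """Evaluate Eq. (1): for each operation, take the slower of the truck
    and drone legs (distance converted to time) and sum over the route."""
    return sum(max(t / truck_speed, d / drone_speed)
               for t, d in zip(truck_legs, drone_legs))
```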
Table 2, Fig. 1 and Fig. 2 refer to the same dataset presented in Table 1. The behaviour of the cost function calculated by Eq. (1) is presented in Fig. 1. In the iteration process, the total distance started at 5.86 km, while at the end of the 500 iterations the minimum total distance is 2.86 km.
Fig. 1. Experimental results of the cost function (Total Distance) for the combinatorial optimisation genetic algorithm implemented for last mile delivery optimisation
Table 2. Truck route and drone route for last mile delivery for twenty places (nodes)

Truck route: 20 – 9 – 13 – 6 – 19 – 14 – 11 – 14 – 11 – 2 – 15 – 8 – 7 – 20
Drone route: 20 – 1 – 9 – 10 – 13 – 3 – 6 – 5 – 19 – 17 – 16 – 2 – 18 – 15 – 8 – 4 – 7 – 12 – 20
In Table 2, the truck delivery route and the drone delivery route are presented. As shown, with delivery range = 2 the drone can deliver two parcels to two different places (nodes) when the flight distance is within the optimal distance range. The truck route starts at delivery node 20; the drone route goes from node 20 to node 1 and finishes at node 9, where the drone is recovered, for example to charge batteries and undergo maintenance. The truck route and drone route continue in the same manner until truck node 7, after which the drone delivers at nodes 7 and 12 and then returns to the starting position, node 20, where the truck also finishes its delivery route. It should be mentioned again that the minimal total distance for the truck route and drone route is 2.86 km, as presented in Fig. 1. Some parts of the iteration process for the truck route and drone route are shown in the route diagrams of Fig. 2(a) to Fig. 2(d). Finally, in order to make this contribution more solid and add further value to this research, a novel dataset is presented in Table 3, with the coordinates of its delivery nodes.
Fig. 2. Experimental results for six steps of the Truck route and Drone route iteration process (in km) with the applied combinatorial optimisation genetic algorithm
Table 3. Coordinates of the nodes in km – the novel dataset

Node   |  1    2    3    4    5    6    7    8    9    10
x (km) | 0.38 0.32 0.99 0.72 0.41 0.10 0.73 0.64 0.07 0.12
y (km) | 0.89 0.29 0.27 0.59 0.48 0.37 0.65 0.94 0.62 0.28

Node   | 11   12   13   14   15   16   17   18   19   20
x (km) | 0.98 0.50 0.02 0.05 0.14 0.89 0.47 0.56 0.49 0.07
y (km) | 0.20 0.44 0.03 0.88 0.61 0.20 0.52 0.05 0.86 0.44
The experimental results, truck route and drone route, are presented in Table 4. In the iteration process, the total distance started at 5.82 km, while at the end of the 500 iterations the minimum total distance is 2.71 km.
Table 4. Truck route and drone route for last mile delivery – the novel dataset

Truck route: 3 – 20 – 2 – 5 – 4 – 12 – 10 – 11 – 1 – 6 – 3
Drone route: 3 – 15 – 20 – 9 – 2 – 7 – 5 – 19 – 4 – 14 – 12 – 13 – 10 – 8 – 11 – 16 – 1 – 17 – 6 – 18 – 3
5 Conclusion and Future Work

In recent years, a growing number of logistics companies have introduced drones or unmanned aerial vehicles into their delivery operations. The aim of this research is to propose a combinatorial optimisation GA for LMD optimisation. The cost function represents the minimum total last mile delivery cost, expressed as the best last mile delivery distance converted to time. For this research, artificial datasets were used, and only a selection of them is presented in the experimental results. The experimental results encourage the authors' further research.

The proposed optimisation method could be improved in future work by extending the research, applying other evolutionary algorithms and assessing their behaviour and efficacy. The model could then be tested with an original, very large real-world dataset obtained from existing logistics companies.

Acknowledgment. This research has been supported by the Ministry of Science, Technological Development and Innovation through project no. 451-03-47/2023-01/200156 "Innovative scientific and artistic research from the FTS (activity) domain".
References
1. Commission of the European Communities: A sustainable future for transport: Towards an integrated, technology-led and user friendly system. Technical Report COM (2009) 279 (2009). http://ec.europa.eu
2. Rodrigue, J.P.: The Geography of Transport Systems, 5th edn. Taylor & Francis Group, Abingdon (2020)
3. The Nielsen Company: Future opportunities in FMCG e-commerce: Market drivers and five-year forecast (2018). https://www.nielsen.com/wp-content/uploads/sites/2/2019/04/fmcg-eCommerce-report.pdf. Accessed 6 May 2023
4. BBC: Amazon testing drones for deliveries. https://www.bbc.com/news/technology-25180906. Accessed 6 May 2023
5. Aircargo News: DHL parcelcopter launches initial operations for research purpose. https://www.aircargonews.net/sectors/express/dhl-parcelcopter-launches-initial-operations-for-research-purpose/. Accessed 6 May 2023
6. Merchan, D., et al.: Amazon last mile routing research challenge: data set. Transp. Sci., 1–4 (2022). https://doi.org/10.1287/trsc.2022.1173
7. Simić, D., Kovačević, I., Svirčević, V., Simić, S.: Hybrid firefly model in routing heterogeneous fleet of vehicles in logistics distribution. Logic J. IGPL 23(3), 521–532 (2015)
8. Simić, D., Simić, S.: Hybrid artificial intelligence approaches on vehicle routing problem in logistics distribution. In: Corchado, E., Snášel, V., Abraham, A., Woźniak, M., Graña, M., Cho, S.-B. (eds.) HAIS 2012. LNCS (LNAI), vol. 7208, pp. 208–220. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28942-2_19
9. Simić, D., Simić, S.: Evolutionary approach in inventory routing problem. In: Rojas, I., Joya, G., Cabestany, J. (eds.) Advances in Computational Intelligence, pp. 395–403. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38682-4_42
10. Ilin, V., Simić, D., Tepić, J., Stojić, G., Saulić, N.: A survey of hybrid artificial intelligence algorithms for dynamic vehicle routing problem. In: Onieva, E., Santos, I., Osaba, E., Quintián, H., Corchado, E. (eds.) Hybrid Artificial Intelligent Systems: 10th International Conference, HAIS 2015, Bilbao, Spain, June 22–24, 2015, Proceedings, pp. 644–655. Springer International Publishing, Cham (2015). https://doi.org/10.1007/978-3-319-19644-2_53
11. Zayas-Gato, F., et al.: A hybrid one-class approach for detecting anomalies in industrial systems. Expert Syst. 39, e12990 (2022). https://doi.org/10.1111/exsy.12990
12. Cattaruzza, D., Absi, N., Feillet, D., González-Feliu, J.: Vehicle routing problems for city logistics. EURO J. Transp. Logist. 6(1), 51–79 (2017). https://doi.org/10.1007/s13676-014-0074-0
13. Otto, A., Agatz, N., Campbell, J., Golden, B., Pesch, E.: Optimization approaches for civil applications of unmanned aerial vehicles (UAVs) or aerial drones: a survey. Networks 72, 411–458 (2018). https://doi.org/10.1002/net.21818
14. Liang, Y.J., Luo, Z.X.: A survey of truck–drone routing problem: literature review and research prospects. J. Oper. Res. Soc. China 10, 343–377 (2022). https://doi.org/10.1007/s40305-021-00383-4
15. Na, H.S., Kweon, S.J., Park, K.: Characterization and design for last mile logistics: a review of the state of the art and future directions. Appl. Sci. 12(1), 118 (2022). https://doi.org/10.3390/app12010118
16. Boysen, N., Fedtke, S., Schwerdfeger, S.: Last-mile delivery concepts: a survey from an operational research perspective. OR Spect. 43, 1–58 (2021). https://doi.org/10.1007/s00291-020-00607-8
17. Thibbotuwawa, A., Bocewicz, G., Nielsen, P., Banaszak, Z.: Unmanned aerial vehicle routing problems: a literature review. Appl. Sci. 10, 4504 (2020). https://doi.org/10.3390/app10134504
18. Markowska, M., Marcinkowski, J., Kiba-Janiak, M., Strahl, D.: Rural e-customers' preferences for last mile delivery and products purchased via the Internet before and after the COVID-19 pandemic. J. Theor. Appl. Electron. Commer. Res. 18, 597–614 (2023). https://doi.org/10.3390/jtaer18010030
19. Pronello, C., Camusso, C., Valentina, R.: Last mile freight distribution and transport operators' needs: which targets and challenges? Transp. Res. Procedia 25, 888–899 (2017)
20. Allen, J., et al.: Understanding the impact of e-commerce on last-mile light goods vehicle activity in urban areas: the case of London. Transp. Res. Part D Transp. Environ. 61(Part B), 325–338 (2018). https://doi.org/10.1016/j.trd.2017.07.020
21. Araújo, F., Reis, J., Cruz Correia, P.: The role of last-mile delivery in the future of e-commerce. In: IFIP International Conference on Advances in Production Management Systems, Novi Sad, Serbia, pp. 307–314 (2020). https://doi.org/10.1007/978-3-030-57993-7_35
22. Dablanc, L., Giuliano, G., Holliday, K., O'Brien, T.: Best practices in urban freight management: lessons from an international survey. Transp. Res. Rec.: J. Transp. Res. Board 2379(1), 29–38 (2013). https://doi.org/10.3141/2379-04
23. Vidal, À.M.: Sustainable Solutions in Last Mile Logistics. School of Industrial and Information Engineering, Master Thesis, Milano, Italy (2021)
Special Session 7: Soft Computing and Hard Computing for a Data Science Process Model
A Preliminary Study of MLSE/ACE-III Stages for Primary Progressive Aphasia Automatic Identification Using Speech Features

Amable J. Valdés Cuervo1(B), Elena Herrera2, and Enrique A. de la Cal1(B)

1 Computer Science Department, Faculty of Geology, University of Oviedo, Oviedo, Spain {UO232486,delacal}@uniovi.es
2 Psychology Department, Faculty of Psychology, University of Oviedo, Oviedo, Spain [email protected]
Abstract. Primary Progressive Aphasia (PPA) is a syndrome causing progressive deterioration of language and speech due to brain degeneration. Three variants exist: the non-fluent variant (nfvPPA), the semantic variant (svPPA) and the logopenic variant (lvPPA). While fMRI, together with neurological examination, is the most accepted diagnostic tool, it is expensive and can take months to deliver results. Cheaper and faster tools are needed for earlier diagnosis and treatment initiation. Some studies have attempted automatic diagnosis using acoustic and linguistic features with ML and DL techniques. However, none have included Latin-language patients or analyzed the effect of the cognitive tests. This work proposes a methodology based on three main steps: i) a new assessment tool (PPA-Tool) combining ACE-III and MLSE with three language tasks: verbal fluency, repetition and naming; ii) an IDA process to obtain an ML model trained with our own two-class (PPA/Healthy) dataset; and iii) a ranking of the relevance of the tasks in the PPA-Tool based on the models' performance. The results obtained after deploying the IDA process on the dataset from an early-stage clinical trial show that the verbal fluency data outperforms the rest of the tasks. Keywords: Primary Progressive Aphasia · ACE-III · MLSE · Voice silence removal · Machine Learning Classification · voice features · MFCC · Imbalanced datasets
1 Introduction
Primary progressive aphasia (PPA) is a syndrome characterized by a progressive deterioration of language and speech due to the degeneration of language-related brain systems. Three variants of PPA have been identified: non-fluent
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
P. García Bringas et al. (Eds.): SOCO 2023, LNNS 750, pp. 323–333, 2023. https://doi.org/10.1007/978-3-031-42536-3_31
variant PPA (nfvPPA), semantic variant PPA (svPPA), and logopenic variant PPA (lvPPA). Diagnosis is typically made by a specialized neurologist using subjective complaints, observations during the examination, clinical criteria, and cognitive/neuroimaging tests. However, cheaper and faster tools to support clinical diagnosis are needed for earlier disease diagnosis and treatment initiation.

There are already some studies that have worked on the automatic diagnosis of PPA: Fraser et al., 2014 [6], Hoffman et al., 2017 [8], Cho et al., 2020 [4] and Themistocleous et al., 2021 [15]. Most of them include extracting acoustic and/or linguistic features, different classical Machine Learning (ML) and Deep Learning (DL) techniques, and datasets with up to 100 English speakers, and tackle two-class (PPA-Healthy) and three-class (lvPPA-svPPA-nfvPPA) problems. However, none included Latin-language patients in the study nor analysed the effect of the typology of cognitive tests on the performance of the ML technique.

This work proposes a methodology based on three main pillars: 1) design and deployment of a new assessment tool specific for PPA (PPA-Tool) combining the screening test ACE-III and the cognitive test MLSE, 2) development of an Intelligent Data Analysis (IDA) process trained with a set of prosodic and spectral transformations of the recorded tasks to obtain an ML model capable of classifying between PPA and Healthy patients, and 3) ranking the relevance of the different tasks defined in the PPA-Tool driven by the performance of the models obtained in the previous pillar. The proposed methodology aims to provide a cheaper and faster tool for supporting clinical diagnosis and treatment initiation, ultimately improving and accelerating the management of PPA patients.

This work is arranged in the following sections: the next section describes the proposed methodology, including the design of the proposed examination tests and the IDA process to obtain the classification models. The experimental setup and discussion of the obtained results can be found in Sect. 3. Finally, conclusions and future work are included.
2 Proposed Methodology

The proposed methodology is composed of three main steps (see Fig. 1): i) design of the assessment tool (PPA-Tool), ii) intelligent data analysis, and iii) ranking of the PPA-Tool tasks.

2.1 Design of the Assessment Tool
Two tests were selected for assessing the participants: the ACE-III, a cognitive screening test widely used for the neuropsychological evaluation of patients, and the Mini Linguistic State Examination (MLSE), a PPA-specific test. The ACE-III is a test in which all cognitive domains (orientation, attention, memory, visuospatial skills, executive functions and language) are assessed. The ACE-III helps to rule out the presence of dementia and also offers a fairly comprehensive cognitive profile.
Fig. 1. The overall process of this proposal.
The MLSE is a recently developed test specific for classifying the different variants of PPA [11], which evaluates the key linguistic domains affected by PPA according to diagnostic criteria [7]. It consists of eleven subtests that assess the following language skills: naming, word and sentence repetition, word and sentence comprehension, semantic association, reading, writing and a connected speech task. The MLSE is the only PPA-specific test with a recently adapted Spanish version [9]. For the purpose of this report, some of the tasks of the ACE-III and the MLSE were pooled for joint analysis, keeping the following tasks (from now on, the PPA-Tool):

– Fluency: the verbal fluency tasks of the ACE-III.
– Repetition: the repetition of words and sentences of both tests.
– Naming: the picture naming of the MLSE.

The verbal fluency task involves three subtests: phonological, semantic and actions. In these tasks, participants have to generate as many words of each category as they can in one minute (words beginning with "p", animals and actions, respectively). The picture naming test is a standard task commonly used to assess language disorders. It is based on the presentation of 20 images, and the participant must produce the name of the picture being shown. Finally, the repetition task involves both word and sentence repetition. In this test, the examiner states each item and the patient is required to repeat it as accurately as possible.

2.2 Intelligent Data Analysis
After carrying out the PPA-Tool tasks, one voice file per patient/participant is obtained. To obtain an optimal classification model that automatically identifies PPA recordings, the following steps are proposed: i) data preparation, ii) preprocessing, iii) voice features, and iv) classification model training.
Data Preparation: Task Segmentation and Labelling. The first step is to split the recordings into shorter audio segments corresponding to the three tasks chosen in the PPA-Tool: fluency, naming and repetition. Although the proposed PPA-Tool considers just these 3 tasks, this splitting step is valid for any number of tasks. At this stage of our research, this process is carried out manually, but in the future it could be run through an automatic segmentation technique. As the number of patients at this stage of the PPA clinical trial is just six (four PPAs and two controls), a multi-class problem (the three variants of PPA) is discarded, and only a two-class problem is addressed, with PPA and Healthy classes.

Preprocessing: Silence Removal. This study aims to use prosodic and spectral features of voice waveforms to distinguish between patients with PPA impairment and healthy individuals. In long recordings, it is common to encounter sound fragments other than voice, such as unvoiced segments and silence. Since the recordings were conducted in a controlled environment, there were few or no unvoiced events present, necessitating only a silence removal algorithm. The most widely accepted algorithms for silence removal rely on thresholding the Short Time Energy (STE) and Zero Crossing Rate (ZCR) features, as well as the statistical behaviour of background noise, as reported in the literature [12] (Fig. 2).
Fig. 2. left) RMS for each kind of sound event, right) Silence threshold computing.
This work proposes a simplified approach based on thresholding the Root Mean Square (RMS) feature. A representative silence reference fragment is manually selected from the beginning of each patient recording, and the mean and standard deviation of the RMS in dBFS are calculated for this segment. Consequently, the silence threshold is determined for each patient recording using Eq. 1:

$SilenceTH_p = mean(RMS_p) - std(RMS_p)$   (1)

Finally, the calculated SilenceTH is applied using an overlapped window of size Silence-WD, removing the sample windows classified as silence by this threshold. Figure 3 illustrates an example of silence removal, where the original recording (ORIGINAL) is processed with the proposed algorithm with Silence-WD set to 3 s and to 30 ms, obtaining the two subfigures Silence-WD = 3 s and Silence-WD = 30 ms. It can be observed that in this case Silence-WD = 30 ms yields a much cleaner recording than Silence-WD = 3 s.
Fig. 3. Example of silence removal: Silence-WD = 30 ms (top), Silence-WD = 3 s (middle), ORIGINAL (bottom)
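A minimal NumPy sketch of the proposed RMS-threshold silence removal follows; the length of the silence reference fragment and the rule of keeping windows whose RMS exceeds the threshold are our assumptions, since the paper selects the reference manually.

```python
import numpy as np

def remove_silence(x, sr, ref_secs=1.0, win_ms=30, hop_ms=30):
    """Drop the windows classified as silence by the Eq. (1) threshold:
    mean - std of the RMS (in dBFS) over a silence reference fragment
    taken from the start of the recording (assumed normalized to [-1, 1])."""
    def rms_dbfs(frame):
        return 20 * np.log10(np.sqrt(np.mean(frame ** 2)) + 1e-12)

    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    ref_end = int(sr * ref_secs)
    ref = [rms_dbfs(x[k:k + win]) for k in range(0, ref_end - win, hop)]
    silence_th = np.mean(ref) - np.std(ref)

    kept = [x[k:k + win] for k in range(0, len(x) - win + 1, hop)
            if rms_dbfs(x[k:k + win]) > silence_th]
    return np.concatenate(kept) if kept else np.array([])
```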
It is important to note that PPA patients may exhibit mumbling in certain parts of the recording, which could be mistaken for silence. However, the applied silence-removal algorithm does not eliminate these mumbled fragments.

Preprocessing: Framing. Window framing in voice applications has three relevant fields: Automatic Speech Recognition (ASR), Automatic Disease Identification (ADI), and Speech Emotion Recognition (SER) [1,5,10]. ASR typically utilizes a narrow window length of approximately 25 ms, prioritizing changes over time and achieving a higher time resolution. In contrast, for SER, a wider window length of around 65 ms and up to 2–3 s [2] can be used, resulting in frames with greater frequency information and higher frequency resolution. Concerning ADI applications, a crucial problem in the acoustical analysis of pathological voices is the correct evaluation of the pitch period (To), since most of the voice parameters are computed using the already determined values of To. As stated in [1], a window of two or even three To is suitable to segment the signal (typically 2 To correspond to 30 ms). In this case, in order to reduce the computational cost of the training algorithms, two conservative window sizes taken from the SER field have been selected: 100 ms and 3 s, with sliding sizes of 50 ms and 1 s, respectively.

Voice Features. This study focuses on using features common in Speech Emotion Recognition (SER) to characterize voice events, drawing from existing literature [3]. SER commonly utilizes prosodic and spectral features, which are combined for improved performance. Prosodic features, such as intonation and rhythm, are perceptible to humans, while spectral features capture vocal tract characteristics. Spectral features are obtained by transforming the time domain signal into the frequency domain using the Fourier transform. Among the spectral
features, the Mel Frequency Cepstral Coefficients (MFCC) are particularly useful for SER. Segmental transformations, rather than spectrographic ones, were considered in this research. Two groups of segmental features were chosen: ProSodic Features (PSF) to capture the rhythm and SpecTral Features (SPF) to capture frequency. The selected features include Root-mean-square (PSF), MFCC (SPF), Chroma stft (SPF), Spectral centroid (SPF), Spectral bandwidth (SPF), Spectral rolloff (SPF), and Zero crossing rate (SPF). To create the dataset, each feature is computed for each frame. For MFCC, specific parameters are required:

– n_mfcc: number of MFCCs (mel coefficients) to be returned.
– n_fft: length of the FFT window, in samples or milliseconds.
– hop_length: number of samples between successive frames.

Typical values of MFCC for Automatic Speech Recognition (ASR) are n_mfcc = 13, n_fft = 12 ms, and hop_length = 12 ms (non-overlapping frames). Additionally, this proposal considers the mean and standard deviation of the MFCC coefficients for each frame, resulting in a total of 7 + n_mfcc × 2 = 33 features.

ML Classification Algorithms. This work focuses on a good PPA screening test design, using the results of ML techniques to rank the tasks. Thus, eleven representative Classification Machine Learning (CML) techniques, belonging to well-known classification typologies such as linear (L), Tree-based (T), Probabilistic (P), Nearest neighBor (NB), Embeddings (EB) and Neural Networks (NN), have been deployed on the PPA/Healthy dataset. The set of selected algorithms was: BernoulliNB (P-Ber), DecisionTree (T-DT), RandomForestClassifier (T-RF), ExtraTrees (T-XT), KNeighbors (NB-KN), RidgeClassifierCV (L-RC), SVC (L-SVC), AdaBoost (EB-AB), GradientBoosting (T-GB), Multi-Layer Perceptron (NN-MLP) and XGB.
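A sketch of the frame-level feature extraction described above, using the librosa library, could look as follows; treating the global MFCC mean as the seventh scalar feature (to reach the stated 7 + n_mfcc × 2 = 33 count) and the file name are our assumptions.

```python
import numpy as np
import librosa

def frame_features(y, sr, n_mfcc=13):
    """Return the 33-dimensional feature vector of one frame: seven PSF/SPF
    scalars plus the per-coefficient MFCC means and standard deviations."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    scalars = [
        np.mean(librosa.feature.rms(y=y)),                        # PSF
        np.mean(librosa.feature.chroma_stft(y=y, sr=sr)),         # SPF
        np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)),   # SPF
        np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr)),  # SPF
        np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr)),    # SPF
        np.mean(librosa.feature.zero_crossing_rate(y)),           # SPF
        np.mean(mfcc),                                            # SPF (aggregate MFCC, our interpretation)
    ]
    return np.array(scalars + list(mfcc.mean(axis=1)) + list(mfcc.std(axis=1)))

# Example: one 100 ms frame of a recording sampled at 44100 Hz
y, sr = librosa.load("task_fluency_pac1.wav", sr=44100)  # hypothetical file name
frame = y[:int(0.1 * sr)]
print(frame_features(frame, sr).shape)   # (33,)
```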
3 Numerical Results

3.1 Material and Methods
Inclusion and Exclusion Criteria. As this is a preliminary study for a running clinical trial, just six participants have been selected for this project. Four were diagnosed with PPA, covering two variants, and the other two were healthy controls. The patients with PPA ranged in age from 65 to 79 years and were diagnosed by a specialized neurologist. The two control subjects were aged 67 and 73 years, with no history of neurological pathology and a normal neuropsychological profile. All participants gave informed consent before participating in the study. The ethics committee of the Principality of Asturias has approved this research.
Audio Capture Issues. A Yotto YDM-20 USB microphone connected to a MacBook Pro was used to collect the participants' recordings. All recordings were taken at a sampling frequency of 44100 Hz with a mono microphone.

Validation Strategy and Scoring Metric. All the experiments have been deployed using a repeated 5 × 2CV validation strategy, using as scoring metric the geometric mean of the sensitivity and specificity of the two-class problem with PPA and Healthy classes (see Eq. 2):

$GeometricMean(SE, SP) = \sqrt{Sensitivity \cdot Specificity}$   (2)
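A minimal sklearn sketch of this validation scheme, assuming synthetic stand-in data, is the following:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def geometric_mean(y_true, y_pred):
    """Eq. (2): sqrt(sensitivity * specificity) for the two-class problem."""
    sensitivity = recall_score(y_true, y_pred, pos_label=1)  # PPA class
    specificity = recall_score(y_true, y_pred, pos_label=0)  # Healthy class
    return np.sqrt(sensitivity * specificity)

X, y = make_classification(n_samples=500, n_features=33, random_state=0)  # stand-in data
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)     # repeated 5x2 CV
scores = cross_val_score(RandomForestClassifier(), X, y,
                         scoring=make_scorer(geometric_mean), cv=cv)
print(scores.mean())
```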
3.2 Results
Task Splitting. The voice recording of each patient has been split manually, by an experimental psychologist, into the three tasks stated in the PPA-Tool (see Sect. 2.1). Most of the interviewer fragments were removed, but some short parts contain a mixture of both voices, interviewer and patient; it is assumed that these parts do not affect the results. Finally, 2405 s (around 40 min) of audio have been obtained, with minima of 136, 69 and 28 s for the corresponding tasks (see Table 1).

Table 1. Recorded time in secs per task (Fluency, Repetition and Naming) and patient (pac1, pac2, pac3, pac4, cont1 and cont2)
Task       | pac1 | pac2 | pac3 | pac4 | cont1 | cont2 | Subtotal
Fluency    | 191  | 182  | 181  | 136  | 181   | 181   | 1046
Repetition | 238  | 209  | 98   | 112  | 79    | 69    | 791
Naming     | 176  | 75   | 217  | 45   | 31    | 28    | 568
Subtotal   | 605  | 466  | 496  | 293  | 291   | 278   | 2405
Silence Removal. After splitting the recordings, the silence was removed using the method proposed in Subsect. 2.2. Four window and sliding size configurations were deployed: i) 3 s with an overlapping size of 1 s, ii) 3 s without overlapping, iii) 30 ms with an overlapping size of 10 ms, and iv) 30 ms without overlapping. For the sake of space, only the results of the best window configuration (30 ms without overlapping) are included in Table 2. On average, 32% of the samples were removed as silence; note the Fluency task, which included 42% of silence, since it is the task where the patients have more freedom to talk, making silent fragments more likely. On the other hand, it can be remarked that the recordings of control #1 were reduced by 61%.
Table 2. Tasks time in secs per task and patient before and after silence removal deployment

Task                 | pac1  | pac2  | pac3  | pac4  | cont1 | cont2 | Subtotal
D1.Fluency before    | 191   | 182   | 181   | 136   | 181   | 181   | 1052
D1.Fluency after     | 116.4 | 137.1 | 89.9  | 74.3  | 38.3  | 151.4 | 607.5
% Removed            | 39    | 25    | 50    | 45    | 79    | 16    | 42
D2.Repetition before | 238.0 | 209.0 | 98.0  | 112.0 | 79.0  | 69.0  | 805.0
D2.Repetition after  | 175.1 | 151.5 | 86.7  | 101.7 | 57.9  | 65.9  | 638.8
% Removed            | 26    | 28    | 12    | 9     | 27    | 5     | 21
D3.Naming before     | 176.0 | 75.0  | 217.0 | 45.0  | 31.0  | 28.0  | 572.0
D3.Naming after      | 103.8 | 45.9  | 161.9 | 38.4  | 18.7  | 26.3  | 395.0
% Removed            | 41    | 39    | 25    | 15    | 40    | 6     | 31
Subtotal before      | 605.0 | 466.0 | 496.0 | 293.0 | 291.0 | 278.0 | 2429.0
Subtotal after       | 395.3 | 334.5 | 338.5 | 214.4 | 114.9 | 243.6 | 1641.3
% Removed            | 35    | 28    | 32    | 27    | 61    | 12    | 32
3.3 Features Computing, Framing and Datasets Naming
All the selected features (see Sect. 2.2) were computed for the two proposed framing configurations, obtaining two datasets that each include the three tasks Fluency, Repetition and Naming (from now on D1, D2 and D3, respectively):

– DatasetA: with a window size of 3 s and an overlapping size of 1 s.
– DatasetB: with a window size of 100 ms and an overlapping size of 50 ms.

Note that DatasetB involves some oversampling, yielding a factor of 20 more frames than DatasetA (1 s divided by 50 ms). Synthetic new data generated with classical oversampling techniques like SMOTE or ADASYN has not been included in this work.

3.4 Classification Models Training, Results Discussion and Tasks Ranking
The selected CML techniques (see Sect. 2.2) have been run with the default hyperparameters provided by sklearn (https://scikit-learn.org/) and the XGB library (https://xgboost.readthedocs.io), on the data of each task separately, obtaining three boxplots per CML algorithm and dataset (see Fig. 4). On one side, the main finding is that D1.Fluency is the task that allows a good classification performance for most ML techniques in both datasets, A and B. In addition, it can be observed that the dispersion of the results of the Tree-based techniques is quite reduced for DatasetB, improving the robustness of the winner models (D1-RF, D1-XT and D1-XGB).
Fig. 4. left) Boxplot representing geometric mean for DatasetA, right) Boxplot representing geometric mean for DatasetB
On the other side, since DatasetA (window/sliding size: 3 s/1 s) is considerably smaller than DatasetB (window/sliding size: 100 ms/50 ms), it can be observed that for DatasetA the Tree-based techniques outperform all the remaining ones except the linear model RidgeClassifierCV (L-RC). This issue reveals the presence of multicollinearity in the data, since no feature selection was carried out. Concerning DatasetB, the Tree-based and Embedding models outperform the rest of the models; in this case L-RC obtains a worse performance, while the MLP performance improves since the size of the task datasets is bigger. Thus, it can be stated that after increasing the sampling frequency (window reduced to 100 ms/50 ms) there is more variability in each task/class dataset, so a model capable of extracting these non-linear relations (a Tree-based model) is needed.
4 Conclusions and Future Work
Regarding the PPA-Tool tasks, the chosen verbal fluency task demands high cognitive effort and engages extensive brain regions [13], involving executive functions, semantic memory, and language processes. Participants rely on their own lexical-semantic system in this semi-directed task with a significant spontaneous language component. In contrast, the picture naming task limits spontaneity by providing specific pictures, while the repetition task is entirely guided, requiring patient comprehension and phonological production. In summary, the verbal fluency task yields the best results as it assesses semi-spontaneous language, allowing participants to utilize their cognitive resources and generate fewer psycholinguistically distinct words [14].

Analyzing patients' voices offers new prospects for evaluating and diagnosing PPA, comparable to psycholinguistic analysis. Additional language tasks from the tests used can be included to improve classification and develop more specific language assessment tests for accurate speech classification. From a computational perspective, addressing the following is necessary: i) conducting an in-depth study on new voice features specific to PPA, ii) exploring
silence removal and diarization techniques to obtain a clean signal for training, and iii) advancing CML and DL algorithms, including AutoML. Finally, completing the clinical trial with more patients, including a balanced representation of the different PPA variants, is essential to obtain a three-class dataset.

Acknowledgement. The research has been funded by the Spanish Ministry of Economics and Industry, grant PID2020-112726RB-I00, by the Spanish Research Agency (AEI, Spain) under grant agreement RED2018-102312-T (IA-Biomed), and by the Ministry of Science and Innovation under CERVERA Excellence Network project CER-20211003 (IBERUS) and Missions Science and Innovation project MIG-20211008 (INMERBOT). Also, by Principado de Asturias, grant SV-PA-21-AYUD/2021/50994. By European Union's Horizon 2020 research and innovation programme (project DIH4CPS) under Grant Agreement no 872548. And by CDTI (Centro para el Desarrollo Tecnológico Industrial) under projects CER-20211003 and CER-20211022 and by ICE (Junta de Castilla y León) under project CCTT3/20/BU/0002.
References
1. Boyanov, B., Hadjitodorov, S.: Acoustic analysis of pathological voices: a voice analysis system for the screening of laryngeal diseases. IEEE Eng. Med. Biol. Mag. 16(4), 74–82 (1997)
2. de la Cal, E., Gallucci, A., Villar, J.R., Yoshida, K., Koeppen, M.: Simple meta-optimization of the feature MFCC for public emotional datasets classification. In: Sanjurjo González, H., Pastor López, I., García Bringas, P., Quintián, H., Corchado, E. (eds.) HAIS 2021. LNCS (LNAI), vol. 12886, pp. 659–670. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86271-8_55
3. de la Cal, E., Gallucci, A., Villar, J.R., Yoshida, K., Koeppen, M.: A first prototype of an emotional smart speaker, pp. 304–313 (2022)
4. Cho, S., Nevler, N., Shellikeri, S., Ash, S., Liberman, M.: Automatic classification of primary progressive aphasia patients using lexical and acoustic features. In: Proceedings of the Language Resources and Evaluation Conference 2020 Workshop on Resources and Processing of Linguistic, Para-linguistic and Extra-linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, June, pp. 60–65 (2020)
5. Fan, W., Xu, X., Xing, X., Chen, W., Huang, D.: LSSED: a large-scale dataset and benchmark for speech emotion recognition. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, June 2021, pp. 641–645 (2021)
6. Fraser, K.C., et al.: Automated classification of primary progressive aphasia subtypes from narrative speech transcripts. Cortex 55(1), 43–60 (2014)
7. Gorno-Tempini, M.L., et al.: Classification of primary progressive aphasia and its variants. Neurology 76(11), 1006–1014 (2011)
8. Hoffman, P., Sajjadi, S.A., Patterson, K., Nestor, P.J.: Data-driven classification of patients with primary progressive aphasia. Brain Lang. 174(July), 86–93 (2017)
9. Matias-Guiu, J.A., et al.: Spanish version of the mini-linguistic state examination for the diagnosis of primary progressive aphasia. J. Alzheimer's Dis. 83(2), 771–778 (2021)
10. Orozco-Arroyave, J.R., Arias-Londoño, J.D., Vargas-Bonilla, J.F., González-Rátiva, M.C., Nöth, E.: New Spanish speech corpus database for the analysis of people suffering from Parkinson's disease. In: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014, December 2014, pp. 342–347 (2014)
11. Patel, N., et al.: A 'Mini Linguistic State Examination' to classify primary progressive aphasia (2022)
12. Ranjan Sahoo, T., Patra, S.: Silence removal and endpoint detection of speech signal for text independent speaker identification. Image Graph. Signal Process. 6, 27–35 (2014)
13. Riello, M., et al.: Neural correlates of letter and semantic fluency in primary progressive aphasia. Brain Sci. 12(1), 1 (2021)
14. Rofes, A., De Aguiar, V., Ficek, B., Wendt, H., Webster, K., Tsapkini, K.: The role of word properties in performance on fluency tasks in people with primary progressive aphasia. J. Alzheimer's Dis. 68(4), 1521–1534 (2019)
15. Themistocleous, C., Webster, K., Afthinos, A., Tsapkini, K.: Part of speech production in patients with primary progressive aphasia: an analysis based on natural language processing. Am. J. Speech-Lang. Pathol. 30(1s), 466–480 (2021)
Comparison of LSTM, GRU and Transformer Neural Network Architecture for Prediction of Wind Turbine Variables

Pablo-Andrés Buestán-Andrade1,4(B), Matilde Santos2, Jesús-Enrique Sierra-García3, and Juan-Pablo Pazmiño-Piedra4

1 Computer Sciences Faculty, Complutense University of Madrid, 28040 Madrid, Spain
[email protected]
2 Institute of Knowledge Technology, Complutense University of Madrid, 28040 Madrid, Spain
[email protected]
3 University of Burgos, 09006 Burgos, Spain
[email protected] 4 Universidad Católica de Cuenca, Cuenca 010101, Ecuador
Abstract. To deal with climate change and global warming, several countries have taken steps to reduce greenhouse gas emissions and have begun to switch to renewable energy sources. Wind energy is one of the most profitable and accessible technologies in this sense, and among the alternatives to manage its variability is prediction, which has become an increasingly popular topic in academia. Currently, there are several methods used for prediction; those based on artificial neural networks (ANN) are the ones that have aroused the greatest research interest. In this sense, this study presents the training of different ML algorithms, using deep learning models, for the prediction of the power generated by a wind turbine. In addition, a new filtered signal is generated from the wind signal and integrated into the set of input signals, obtaining better training performance. These models include long short-term memory (LSTM) cells, gated recurrent unit (GRU) cells, and a Transformer-based model. The results of the trained models show that the technique that best fits time series I and time series II is the GRU. This study also allows different AI techniques to be analyzed to improve performance in wind farms. Keywords: Wind Turbine · O&M · LSTM · GRU · Multi-head Attention · Transformers
1 Introduction

Due to the intense burning of fossil fuels, considerable amounts of gases such as carbon dioxide (CO2) and carbon monoxide (CO) are being released into the atmosphere, which increases global warming [1]. Thus, fossil fuels are being replaced by renewable resources [2]. According to the latest report from the US Energy Information Administration, renewable energy generation in the US reached 11.6 trillion BTUs in 2019. Moreover, multiple nations signed the 2015 Paris Agreement, committing to the adoption of renewable energy [3].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
P. García Bringas et al. (Eds.): SOCO 2023, LNNS 750, pp. 334–343, 2023. https://doi.org/10.1007/978-3-031-42536-3_32
Renewable energy sources present variability due to weather conditions, which can affect operation and maintenance efficiency [4]. Prediction methods, including physical, statistical, and artificial neural network (ANN) methods, can help manage this variability in wind power [5]. Physical methods are advantageous for long-term prediction but rely on weather forecast accuracy. Statistical models are effective for short-term wind speeds but struggle with non-linear relationships between variables [6]. ANN models have been used successfully, and deep neural networks like RNN and LSTM have also been considered for wind energy prediction [7].

Although good results are obtained using LSTM for wind energy prediction, the vanishing gradient problem presented by RNNs remains a major issue. Gated Recurrent Units (GRU) introduce a gating mechanism to alleviate this problem and model the correlation of the time series as the time interval increases [8]. However, spatiotemporal wind energy prediction remains a challenging issue, mainly for two reasons: the insufficient capacity to model dependencies in time sequences and the fact that the reliability of the results may be affected by uncorrelated data noise [9]. To address these problems, a novel method called transformers has recently been introduced. Transformers are encoder-decoder models that are able to model dependencies and long-range interactions in sequential data and are therefore attractive for time series prediction [10].

Based on all the previous research, a model is proposed here for wind energy prediction using transformers. The main contributions of this study are as follows:

• Analysis of the configuration of the applied intelligent techniques, depending on the selected data window and the number of input features.
• Understanding the effects on the training of the AI algorithms when a new filtered signal generated from the wind is included in the time series.
• Comparison of the machine learning (ML) models using the root mean square error (RMSE) and mean absolute error (MAE).

This article is structured as follows: Sect. 2 describes the techniques used and the case study. Section 3 presents the training and validation of the models and shows the performance obtained with each one. Section 4 presents the conclusions.

2 LSTM, GRU and Transformers Models and Case Study

An LSTM is a type of RNN that acts as a means of transportation, transferring relevant information along the sequence chain. Theoretically, it can transport relevant information throughout the process, adding or deleting information over time, allowing relevant information to be learned or forgotten during training [11]. A GRU is a variation of an LSTM. To address the vanishing gradient issue, a GRU uses the update gate and reset gate, two vectors that decide what information should be passed to the output [11].

2.1 Transformer Model
2 LSTM, GRU and Transformers Models and Case Study An LSTM is a type of RNN that acts as a means of transportation, transferring relevant information along the sequence chain. Theoretically, it can transport relevant information throughout the process, adding or deleting information over time, allowing for learning information that is relevant or forgetting it during training [11]. GRU is a variation of an LSTM. To address the issues of gradient vanishing, a GRU uses the update gate and reset gate, which are two vectors that decide what information should be passed to the output [11]. 2.1 Transformer Model The model used is a transformer encoder-decoder, where the encoder part takes the time series information as input, while the decoder predicts future values in an autoregressive manner [12].
• The encoder takes as input a vector of N dimensions. It has an input layer and a positional encoding layer. Then, five layers are connected, which include a self-attention layer, two addition-and-normalization layers, and two fully connected layers. The final result is a vector of N dimensions that feeds the decoder [13].
• The decoder takes the output of the encoder as input. It consists of several layers similar to those of the encoder: a positional encoding layer, a masked self-attention layer, three addition-and-normalization layers, a decoder-encoder attention layer, and a fully connected layer. The final result is the prediction of the time series [13].

The model employs an attention mechanism and positional encoding, with sine and cosine functions, to encode the sequential information of the time series [14]. The architecture implemented is shown in Fig. 1.
Fig. 1. Architecture of the transformer model. Adapted from [12].
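A sketch of an encoder-only variant of this architecture, with the layer sizes later listed in Table 1 (head size 256, 4 heads, 4 encoder blocks, 128 MLP units, dropout 0.25), could be written in Keras as follows; it is an illustration in the spirit of [13], not the exact implementation used here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder(x, head_size=256, num_heads=4, ff_dim=128, dropout=0.25):
    # Self-attention sub-block with residual connection and normalization
    attn = layers.MultiHeadAttention(key_dim=head_size, num_heads=num_heads,
                                     dropout=dropout)(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)
    # Position-wise feed-forward sub-block
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dropout(dropout)(ff)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + ff)

inputs = layers.Input(shape=(24, 3))          # 24-h window, 3 input features
x = inputs
for _ in range(4):                            # 4 encoder blocks (Table 1)
    x = transformer_encoder(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(128, activation="relu")(x)   # MLP units
x = layers.Dropout(0.25)(x)
outputs = layers.Dense(1)(x)                  # next-hour power prediction
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```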
2.2 Case Study

In this paper, two time series were selected, corresponding to a wind turbine located in Turkey [15] and another simulated using the OpenFAST software [16]; they will henceforth be referred to as "time series I" and "time series II", respectively. Time series I contains four measurements taken at 10-min intervals during the year 2018: active power, wind speed, wind direction, and theoretical power. Time series II presents five measurements recorded at one-hour intervals: active power, wind speed, wind direction, temperature, and ambient humidity. Time series I has been adapted to match the format of time series II. Figure 2 shows the generated power (kW) of the two time series at one-hour intervals. To obtain a more global view of the role of each feature of the time series, the correlation matrix was calculated, obtaining the results shown in Fig. 3.
Fig. 2. Power values generated by the two wind turbines recorded every 1 h.
Fig. 3. Correlation matrix between variables: (left) time series I; (right) time series II.
In this particular case, the maximum correlation values of time series I are found between the power generated (PG), the wind speed (WS), and the theoretical power (TP), while for time series II the highest correlation is between generated power (PG) and wind speed (WS). Therefore, to give both models the same input characteristics, the theoretical power, wind direction, temperature, and ambient humidity are discarded, working only with the previously generated power and the wind speed as input variables.
3 Results

3.1 Data Pre-processing

For the training, the dataset has been divided as follows: 70% for training, 20% for validation, and 10% for testing. The normalization of the values has been carried out by subtracting the mean and dividing by the standard deviation of each feature. Additionally, the wind speed has been filtered (FWS) to remove noise, and this new signal has been used as an additional input for training. One of the alternatives is
to use the Fast Fourier Transform (FFT), a mathematical operation that calculates the frequency components of a signal in the time domain. First, the Fourier coefficients are obtained. A parameter called "cut-off frequency" is established, which determines the frequency above which the Fourier coefficients are set to zero; that is, it establishes the range of frequencies that are preserved in the filtered signal. To calculate the FFT of the wind speed in a time series, the FFT function of the NumPy library was used. Various frequency values were tried, finding an appropriate value of 0.02. Figure 4 shows the wind speed before and after filtering in time series II.
Fig. 4. Original and denoised data of the wind speed in time series II.
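A minimal sketch of this FFT-based low-pass filter with NumPy is shown below; the frequency units assumed for the 0.02 cut-off (cycles per sample, with hourly samples) are our interpretation.

```python
import numpy as np

def fft_lowpass(signal, cutoff=0.02, dt=1.0):
    """Zero every Fourier coefficient above the cut-off frequency and
    transform back; dt is the sampling interval (1 h in time series II)."""
    coeffs = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=dt)
    coeffs[freqs > cutoff] = 0.0        # keep only the low-frequency content
    return np.fft.irfft(coeffs, n=len(signal))

# fws = fft_lowpass(wind_speed)  # filtered wind speed used as extra input
```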
In [17] it is suggested to use a window of consecutive samples for the prediction of time series. In this case, a data window of 12 and 24 samples (twelve hours and one day) is used to predict the next hour.

3.2 Training

Table 1 gives a summary of all the hyperparameters applied to the models.

Table 1. Hyperparameters for the prediction of the wind turbine generated power.

Hyperparameters | LSTM | GRU  | Transformer
Head size       | –    | –    | 256
Num heads       | –    | –    | 4
Cells           | 96   | 256  | 4 (only encoder)
MLP units       | 128  | 128  | 128
Dropout         | 0.25 | 0.25 | 0.25
Epochs          | 100  | 100  | 100
Batch size      | 64   | 64   | 64
Loss function   | MSE  | MSE  | MSE
Optimizer       | Adam | Adam | Adam
Input shape     | Possible combinations (24,3), (12,3), (24,2), (12,2), (24,1) and (12,1): data window and number of features
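A sketch of the windowing and of a GRU model with the hyperparameters of Table 1 is shown below; the column order of the input array (PG, WS, FWS) and the placeholder data are assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def make_windows(data, window=12, horizon=1):
    """Slide a window of past samples over the series; the target is the
    generated power (assumed to be column 0) one step ahead."""
    X = np.stack([data[i:i + window]
                  for i in range(len(data) - window - horizon + 1)])
    y = data[window + horizon - 1:, 0]
    return X, y

# data: normalized array of shape (n_samples, 3) -> PG, WS, FWS (assumed order)
data = np.random.rand(1000, 3)          # placeholder for the real series
X, y = make_windows(data, window=12)

model = tf.keras.Sequential([
    layers.GRU(256, input_shape=(12, 3)),   # 256 GRU cells (Table 1)
    layers.Dense(128, activation="relu"),   # MLP units
    layers.Dropout(0.25),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=100, batch_size=64, validation_split=0.2)
```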
3.3 Prediction and Performance

In this section, the results of the training carried out on the different models with each of the time series are shown. The MSE has been used as the loss function of the models; they have been compared using the RMSE and the MAE.

The RMSE measures how close a regression line is to a set of data points. It is calculated as the square root of the mean of the squared errors:

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \tilde{y}_i)^2}$   (1)

The MAE refers to the difference between the prediction of an observation and the real value. The MAE uses the average of the absolute errors of a group of predictions and observations as a measure for the whole group:

$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \tilde{y}_i|$   (2)

where $y_i$ represents the actual values, $\tilde{y}_i$ the predicted values, and $n$ the number of samples.
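Both metrics are straightforward to compute, e.g. with NumPy:

```python
import numpy as np

def rmse(y, y_pred):
    return np.sqrt(np.mean((y - y_pred) ** 2))   # Eq. (1)

def mae(y, y_pred):
    return np.mean(np.abs(y - y_pred))           # Eq. (2)
```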
Figure 5 presents the calculated RMSE and MAE. Based on the results in this figure, the GRU model performs slightly better than the Transformer and the LSTM for both time series I and II with the tested configurations. Specifically, for time series I (a, b, c, d), the model with a 12-h data window that includes the filtered wind speed obtains the lowest RMSE and MAE for the prediction of the generated power, 0.0742 and 0.0512, respectively. In the same way, for the prediction of the wind speed, the 24-h data window with the filtered wind speed gets an RMSE of 0.0420 and a MAE of 0.0319. For time series II, the best model is obtained with a data window of 12 h, considering the filtered wind speed, with an RMSE of 0.1052 and a MAE of 0.0665 in the prediction of the power generated. Regarding the transformer models, the error is larger compared to the GRU and LSTM models for the same configurations.

Figure 6 shows the prediction of the PG obtained with the best trained models. Figure 6a) and Fig. 6b) show the prediction of 140 samples (almost six days). It is possible to see that there is no major difference between the predictions of one model and the other. If Fig. 5 is analyzed, LSTM, GRU, and Transformers have obtained similar RMSE and MAE errors for both time series; however, GRU has been the one with the least error in training.

To further compare the forecast power and the real power, the power curve models have been obtained. For instance, for time series I, the power curve corresponding to the 3.5 MW wind turbine is shown in Fig. 7. The cut-in, rated, and cut-off wind speeds are 2.5 m/s, 14.5 m/s and 25 m/s, respectively. In this figure, we can see that the model fits the power curve corresponding to the wind speed values. This suggests that the models are effectively learning the underlying relationships of the data and are able to make consistent predictions over time.
Fig. 5. Error in time series I: a) RMSE of PG prediction, b) MAE of PG prediction, c) RMSE of WS prediction, d) MAE of WS prediction. Error in time series II: e) RMSE of PG prediction, f) MAE of PG prediction, g) RMSE of WS prediction, h) MAE of WS prediction.
Fig. 6. Predicting the generated power of a wind turbine: a) time series I; b) time series II.
Fig. 7. Real and forecasting power curve for time series I.
4 Conclusions and Future Works

With the development of this research, it has been possible to verify that adding a new feature to ML models benefits the prediction results. This happens because the filtered signal helps the model to better understand the input features and their correlation, and to better select the model weights. In this particular case, the GRU is the model that has best adapted to time series I and time series II, obtaining the lowest RMSE and MAE errors.

Another area currently under constant investigation is the so-called Physics-Informed Neural Networks (PINNs), a type of NN that integrates the governing equations of the physics of the problem so that more realistic results are obtained; applying this type of machine learning can be considered for future work, as a continuation of this research.

Acknowledgments. This work has been partially supported by the Spanish Ministry of Science and Innovation under the project MCI/AEI/FEDER number PID2021-123543OB-C21.
References
1. Bashir, M.B.A.: Principle parameters and environmental impacts that affect the performance of wind turbine: an overview. Arab J. Sci. Eng. 47, 7891–7909 (2022)
2. Roga, S., Bardhan, S., Kumar, Y., Dubey, S.K.: Recent technology and challenges of wind energy generation: a review. Sustain. Energy Technol. Assess. 52, 102239 (2022). https://doi.org/10.1016/j.seta.2022.102239
3. Sotiropoulou, K.F., Vavatsikos, A.P.: Onshore wind farms GIS-assisted suitability analysis using PROMETHEE II. Energy Policy 158, 112531 (2021). https://doi.org/10.1016/j.enpol.2021.112531
4. Ko, M.S., Lee, K., Kim, J.K., et al.: Deep concatenated residual network with bidirectional LSTM for one-hour-ahead wind power forecasting. IEEE Trans. Sustain. Energy 12, 1321–1335 (2021). https://doi.org/10.1109/TSTE.2020.3043884
5. Liu, B., Zhao, S., Yu, X., et al.: A novel deep learning approach for wind power forecasting based on WD-LSTM model. Energies (Basel) 13, 4964 (2020). https://doi.org/10.3390/en13184964
6. Wu, Q., Guan, F., Lv, C., Huang, Y.: Ultra-short-term multi-step wind power forecasting based on CNN-LSTM. IET Renew. Power Gener. 15, 1019–1029 (2021). https://doi.org/10.1049/rpg2.12085
7. Shahid, F., Zameer, A., Muneeb, M.: A novel genetic LSTM model for wind power forecast. Energy 223, 120069 (2021). https://doi.org/10.1016/j.energy.2021.120069
8. Li, C., Tang, G., Xue, X., et al.: Short-term wind speed interval prediction based on ensemble GRU model. IEEE Trans. Sustain. Energy 11, 1370–1380 (2020). https://doi.org/10.1109/TSTE.2019.2926147
9. Sun, S., Liu, Y., Li, Q., et al.: Short-term multi-step wind power forecasting based on spatiotemporal correlations and transformer neural networks. Energy Convers. Manag. 283, 116916 (2023). https://doi.org/10.1016/j.enconman.2023.116916
10. Wen, Q., Zhou, T., Zhang, C., et al.: Transformers in time series: a survey (2022). arXiv preprint arXiv:2202.07125
11. Phi, M.: Illustrated guide to LSTM's and GRU's: a step by step explanation. In: Towards Data Science (2018). https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21. Accessed 1 Apr 2023
12. Youness, M.: How to use transformer networks to build a forecasting model. In: Towards Data Science (2021). https://towardsdatascience.com/how-to-use-transformer-networks-to-build-a-forecasting-model-297f9270e630. Accessed 31 Mar 2023
13. Albin-Ludvigsen, K.-G.: How to make a Transformer for time series forecasting with PyTorch. In: Towards Data Science (2022). https://towardsdatascience.com/how-to-make-a-pytorch-transformer-for-time-series-forecasting-69e073d4061e. Accessed 31 Mar 2023
14. Wu, N., Green, B., Ben, X., O'Banion, S.: Deep transformer models for time series forecasting: the influenza prevalence case (2020). arXiv preprint arXiv:2001.08317
15. Erisen, B.: Wind turbine SCADA dataset. In: Kaggle (2018). https://www.kaggle.com/datasets/berkerisen/wind-turbine-scada-dataset/code. Accessed 3 Apr 2023
16. Dobrev, P.: Texas wind turbine dataset - simulated. In: Kaggle (2022). https://www.kaggle.com/datasets/pravdomirdobrev/texas-wind-turbine-dataset-simulated
17. Tensorflow: Time series forecasting. In: tensorflow.org (2022). https://www.tensorflow.org/tutorials/structured_data/time_series?hl=en. Accessed 13 Nov 2022
The Impact of Data Normalization on the Accuracy of Machine Learning Algorithms: A Comparative Analysis

Kelsy Cabello-Solorzano1, Isabela Ortigosa de Araujo2, Marco Peña1, Luís Correia3, and Antonio J. Tallón-Ballesteros4(B)

1 International University of Andalusia, Huelva, Spain
{kelsy.cabellosolorzano,marco.pena}@estudiante.unia.es
2 University of Huelva, Huelva, Spain
[email protected]
3 LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
[email protected]
4 Department of Electronic, Computer Systems and Automation Engineering, University of Huelva, Huelva, Spain
[email protected]
Abstract. In Machine Learning (ML) algorithms, data normalization plays a fundamental role. This research focuses on analyzing and comparing the impact of various normalization techniques. Three normalization techniques, namely Min-Max, Z-Score, and Unit Normalization, were applied as a preliminary step before using various ML algorithms. In the case of Min-Max we used two variants, one normalizing feature values in the interval [0, 1] and the other normalizing them in the interval [−1, 1]. The objective of this study is to determine, in a precise and informed manner, the most appropriate normalization technique for each algorithm, aiming to enhance accuracy in problem-solving. Through this comparative analysis, we aim to provide reliable recommendations for improving the performance of ML algorithms through proper data normalization. The results reveal that a few algorithms are virtually unaffected by whether normalization is used or not, regardless of the applied normalization technique. These findings contribute to the understanding of the relationship between data normalization and algorithm performance, allowing practitioners to make informed decisions regarding normalization techniques when using ML algorithms.

Keywords: Data mining · Standardization techniques · Machine Learning · Min-Max · Z-Score · Unit Normalization

1 Introduction
Data normalization is an important stage of data pre-processing before applying machine learning (ML) algorithms [7], and its correct implementation can
significantly affect the results and effectiveness of prediction models. Data normalization plays a fundamental role in the research and application of predictive models in various fields, and particularly in financial markets [2,8]. Accurate stock price forecasts are very important to investors and market regulators. The dynamics and complexity of stock markets pose significant challenges when it comes to modeling and predicting price trends. Normalization provides an opportunity to improve the accuracy of forecasts by properly preparing the data. Using normalization brings data features to the same scale and can also reduce the effect of outliers that may exist in the original data set. This improves the quality and consistency of the data, which in turn improves the ability of predictive models to detect patterns and make accurate predictions. In this context, data normalization stands out as an important aspect of machine learning. The results of this study provide valuable guidance for researchers and practitioners interested in using data normalization as an effective strategy for machine learning in classification problems. Next, Sect. 2 presents the data normalization methods used in the study and emphasizes their applicability and benefits in machine learning. In Sect. 3 we present the experiments, where the normalization methods are applied to the data, and the evaluation metrics used. In Sect. 4 the results of the application of the selected algorithms are analyzed and compared using ROC graphs to determine the effectiveness of the normalization methods. Finally, Sect. 5 discusses the impact of data normalization on the research environment and suggests possible methods and approaches to improve the use of normalization in future research.
2 Background
This section reviews the most relevant normalization approaches used in ML.

Min-Max Normalization: The Min-Max normalization technique, also known as scaling, is a data normalization method widely used in machine learning. Its main objective is to transform data values to fall within a certain range, usually between 0 and 1. The Min-Max normalization process involves subtracting the minimum value of the data set and then dividing by the difference between the maximum and minimum values. This is done for each value in the data set. For each attribute, the Min-Max of an entry is calculated by:

$$x_{norm}^{(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}, \qquad (1)$$
where $x^{(i)}$ is the original attribute value in dataset entry $i$, and $x_{min}$ and $x_{max}$ are the minimum and the maximum values of the attribute in the data set. The Min-Max normalization is useful when you want to retain information about your data and fit it to a specific range. By scaling data within a narrow range, comparison and analysis of variables becomes easier, especially when using algorithms that are sensitive to feature scale, such as neural networks. In addition to the range 0 to 1, the Min-Max technique can fit the data to other user-defined ranges required by the problem. For example, the Min-Max formula can
be further adjusted if a range between −1 and 1 is required. It is important to note that Min-Max normalization transforms the data to a certain scale, but does not change the shape of the distribution of the original values. However, this method is sensitive to outliers, as they affect the scale range and compress most of the values into a very narrow range.

Z-Score Normalization: The Z-Score technique, also known as standardization, is a data normalization method widely used in machine learning. Its main goal is to transform data to have a mean of zero and a standard deviation of one. The Z-Score normalization process involves subtracting the mean from the data and dividing by the standard deviation. This is done for each value in the data set. For each attribute, the Z-score of an entry is calculated by:

$$z^{(i)} = \frac{x^{(i)} - \mu}{\sigma}, \qquad (2)$$
where μ is the mean and σ the standard deviation of the data set. Z-Score normalization is useful when you want to compare and use variables with different scales and distributions. By normalizing the data, all variables have a comparable size and distribution, which facilitates comparison and analysis. With Z-Score normalization, values above the mean have a positive Z-score, while values below the mean have a negative one. This helps to identify outliers, because they tend to have a very high or very low Z-score compared to the rest of the data. Z-Score normalization does not transform the data into a given range, but rather adjusts it to have zero mean and a standard deviation of 1. In addition, the technique assumes that the data follow a normal distribution and therefore works well for continuous variables.

Unit Normalization: The unit normalization technique, also known as L2 normalization, is a common technique used in machine learning to normalize data. Its purpose is to ensure that the feature vectors have unit Euclidean length. The unit normalization process is performed by dividing each component by the Euclidean norm of the original vector,

$$x_{un}^{(i)} = \frac{x^{(i)}}{\sqrt{\sum_{j} \left(x^{(j)}\right)^2}}. \qquad (3)$$
By dividing each value by the Euclidean norm, the vector norm becomes 1. This technique is especially useful with algorithms that are sensitive to feature magnitude, such as distance-based models, where the magnitude of a feature affects the similarity or distance between instances. Normalizing data to a single scale prevents features with large values from dominating the model contribution. It is important to note that this technique does not map the data into a specific range, but rescales each vector while preserving the relative proportions of its components.
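To make these techniques concrete, the snippet below sketches the four variants used in this study with scikit-learn's preprocessing utilities (a minimal sketch of our own; the toy matrix is illustrative and not taken from the paper):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])  # toy feature matrix (rows = entries, columns = attributes)

mm01 = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)   # Min-Max, Eq. (1)
mm11 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)  # Min-Max rescaled to [-1, 1]
zs = StandardScaler().fit_transform(X)                       # Z-Score, Eq. (2)
un = Normalizer(norm="l2").fit_transform(X)                  # Unit (L2) norm per row, Eq. (3)

print(zs.mean(axis=0), zs.std(axis=0))  # approximately 0 and 1 per attribute
print(np.linalg.norm(un, axis=1))       # every row has norm 1 after unit normalization
```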
3 Experiments
This paper evaluates different normalization approaches applied to the data before partitioning. Once normalized, the preprocessed data set is divided into training and testing sets following a stratified approach. The next step is the training of the different classifiers, and finally their predictions are produced. The performance of the models is evaluated on the test sets using a three-fold cross-validation in order to get a more reliable set of results. We tested several normalization approaches, on a benchmark dataset, with different machine learning classification algorithms. A sketch of this protocol is given after the following descriptions.

Dataset: Madelon [4]. It is an artificial dataset used in the NIPS 2003 feature selection challenge. It consists of 20 redundant features and 480 distractor features called probes. The dataset was designed to evaluate the performance of algorithms in a highly nonlinear classification task. This, together with the high number and variety of features, makes it adequate for this experiment.

Machine Learning Algorithms: Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Gradient Boosting (GB), Support Vector Machine (SVM), Neural Networks (NN), Naïve Bayes (NB), and k-Nearest Neighbors (k-NN). These algorithms represent a comprehensive coverage of classifiers, ranging from simple models such as NB or k-NN, to more complex non-linear ones such as RF or NN. This allows us to assess whether the influence of attribute normalization varies among ML models.

Normalization Approaches: the data set was normalized using the Min-Max [0, 1], Min-Max [−1, 1], Z-Score and Unit Normalization techniques.
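The protocol might be implemented along the following lines (our sketch, not the authors' code; make_classification merely mimics Madelon's shape, and default hyperparameters are assumed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with Madelon-like shape (500 features, 20 relevant)
X, y = make_classification(n_samples=600, n_features=500,
                           n_informative=20, random_state=0)

scalers = {"no Norm": None,
           "Min-Max [0, 1]": MinMaxScaler(feature_range=(0, 1)),
           "Min-Max [-1, 1]": MinMaxScaler(feature_range=(-1, 1)),
           "Z-Score": StandardScaler(),
           "Unit": Normalizer(norm="l2")}
models = {"LR": LogisticRegression(max_iter=1000), "DT": DecisionTreeClassifier(),
          "RF": RandomForestClassifier(), "GB": GradientBoostingClassifier(),
          "SVM": SVC(), "NN": MLPClassifier(max_iter=500),
          "NB": GaussianNB(), "k-NN": KNeighborsClassifier()}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # three folds, stratified
for s_name, scaler in scalers.items():
    Xs = X if scaler is None else scaler.fit_transform(X)       # normalize before partitioning
    for m_name, model in models.items():
        acc = cross_val_score(model, Xs, y, cv=cv, scoring="accuracy").mean()
        print(f"{m_name} / {s_name}: {acc:.2%}")
```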
4 Results
Analyzing the results, in Table 1, it is evident that the impact of the different techniques does not differ much, if at all, according to the values in the last column, representing the relative standard deviation of the accuracy of each algorithm over the four normalization approaches. However, in several applications an improvement of a few percentage points in accuracy can have a huge impact, due to consequences for humans, such as in healthcare cases, or due to the non-linear characteristics of the context, as in financial markets. In a detailed analysis, we observe that some algorithms obtain more consistent results in terms of accuracy, regardless of the normalization method used. For example, Naïve Bayes does not change at all, and the Gradient Boosting algorithm [1] shows relatively high accuracy across all normalization methods, with values ranging from 76.92% to 77.31%. On the other hand, there are algorithms that seem to be more sensitive to the normalization method applied.
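As a reading aid, the last column of Table 1 can be reproduced as the sample standard deviation divided by the mean of the four normalized accuracies (our assumption about the exact estimator, which matches the 1.69% reported for LR):

```python
import numpy as np

# LR accuracies under the four normalization techniques (from Table 1)
acc = np.array([54.04, 54.42, 52.69, 52.69])
rel_stdev = acc.std(ddof=1) / acc.mean() * 100  # sample std over mean, in percent
print(f"{rel_stdev:.2f}%")                      # 1.69%
```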
Table 1. Accuracy of ML algorithms with no normalization (no Norm) and with different normalization techniques. Last column presents the relative standard deviation of each ML model across the four types of normalization.
ML alg | no Norm | Min-Max [0, 1] | Min-Max [−1, 1] | Zscore | Unit Stdd | Rel. Stdev
LR     | 52.31   | 54.04          | 54.42           | 52.69  | 52.69     | 1.69%
DT     | 76.73   | 76.15          | 75.58           | 76.54  | 76.92     | 0.75%
RF     | 70.96   | 70.77          | 73.27           | 69.81  | 72.12     | 2.12%
GB     | 77.12   | 77.31          | 76.92           | 77.12  | 77.31     | 0.24%
SVM    | 55.57   | 54.23          | 55.77           | 55.58  | 55.58     | 1.29%
NN     | 49.42   | 57.12          | 57.50           | 55.19  | 55.58     | 2.01%
NB     | 59.62   | 59.62          | 59.62           | 59.62  | 59.62     | 0.00%
k-NN   | 69.62   | 58.08          | 58.08           | 57.50  | 57.50     | 0.58%
For example, Logistic Regression [6] shows a lower accuracy compared to other algorithms, with values ranging from 52.69% to 54.42%. As for the decision trees, a considerably higher accuracy is observed compared to Logistic Regression. However, no significant improvement in accuracy is observed when using different normalization methods. The results range from 75.58% to 76.92%, suggesting that decision trees are more robust to variations in normalization methods. The Random Forest algorithm [9] shows relatively stable accuracy across all normalization methods, with values ranging from 69.81% to 73.27%. While not obtaining the best results compared to other algorithms, the variation in accuracy between normalization methods is not significant. In the case of SVM, Neural Networks and k-NN, it is observed that these algorithms obtain similar accuracy with all normalization methods. However, the overall accuracy is not very high, varying between 54.23% and 57.50%. This suggests that these algorithms may be less suitable for the Madelon dataset or that a more sophisticated approach would be needed to improve their performance. Finally, Naïve Bayes shows a consistent accuracy of 59.62% across all normalization methods. While this value is not the highest compared to other algorithms, we recall that Naïve Bayes is a simple and fast classification algorithm, and its performance remains stable across different normalization methods. The "no Norm" column (in Table 1) shows the accuracy of the ML algorithms when applied directly to the data set without normalization. Algorithms such as DT, RF and GB obtain relatively high results, while LR, SVM, NN, NB and k-NN underperform. On the other hand, NN strongly improves with normalization while k-NN degrades with it. This indicates that robust algorithms can perform well on non-normalized data, while sensitive algorithms can be negatively affected by the lack of normalization.

4.1 ML Algorithms and Normalization: A Graphical Approach
The ROC (Receiver Operating Characteristic) curve [5] is a graph showing the relationship between the true positive rate (TPR), also called sensitivity, and
the false positive rate (FPR), also known as 1 − specificity, at different classification thresholds. It is generally used for binary classification problems, where the objective is to evaluate the performance of a model in distinguishing between two classes. The X-axis shows the FPR, which indicates the proportion of negative cases incorrectly classified as positive. The Y-axis represents the TPR, the percentage of positive cases correctly classified as positive. The purpose of the ROC curve is to evaluate how well a classification model distinguishes between the positive and negative classes. A ROC curve is considered ideal when it is as close as possible to the upper left corner of the graph, which indicates a high rate of true positives and a low rate of false positives, regardless of the classification threshold. The closer the curve is to the upper left corner, the better the model performs. AUC (Area Under the Curve) is a numerical metric that summarizes the ROC curve information into a single number. AUC measures the ability of the model to correctly classify instances of both classes. AUC values range from 0 to 1, where 1 indicates a perfect model that discriminates between the classes and 0.5 indicates that the model classifies at random, with no ability to distinguish between the classes.
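For illustration, ROC curves and AUC values like those in Figs. 1, 2 and 3 can be produced as follows (a generic sketch with stand-in data, not the authors' script):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier().fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]        # positive-class probabilities as thresholds
fpr, tpr, _ = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))

plt.plot(fpr, tpr)                            # TPR (Y-axis) against FPR (X-axis)
plt.plot([0, 1], [0, 1], linestyle="--")      # random classifier, AUC = 0.5
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()
```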
Fig. 1. Decision Tree Model Performance
The ROC plots of the algorithms that achieved the best results are shown below. Only 3 (DT, RF, GB) of the 8 algorithms showed a significant performance, with accuracy values above 70%. We analyse them in detail next, with the ROC curves. It should be emphasized that the data set was normalized using the Min-Max [0, 1], Min-Max [−1, 1], Z-Score and Unit Normalization techniques.

Decision Trees. In the first plot (A) of Fig. 1, a decision tree applied to a data set normalized with Min-Max [0, 1] performed well in binary classification, with an AUC of 0.84. The model demonstrated the ability to distinguish between the two classes and achieved accurate classification. In the second plot (B), Min-Max [−1, 1] normalization of the data set resulted in an AUC of 0.73. Although the model performed acceptably, a significant difference was observed compared to Min-Max [0, 1] normalization. More effective training is suggested to improve classification accuracy. In the third plot (C), the decision tree model applied to a data set normalized with the Z-Score technique achieved an AUC of 0.84, which indicates a good classification ability of the model.
Fig. 2. Random Forest Performance
The decision tree model applied to a data set normalized with the unit normalization technique (plot D) obtained an AUC of 0.83. Overall, the decision tree model shows good performance in binary classification, although differences were identified in relation to the normalization techniques used. It is suggested to evaluate which normalization technique best fits each dataset.

Random Forest. In the first plot (A) of Fig. 2, the Random Forest model applied to a data set normalized using the Min-Max [0, 1] technique obtained an AUC of 0.79. This indicates that the model has an acceptable ability to distinguish between the classes in the data set. In the second plot (B), the Min-Max [−1, 1] normalization, with an AUC of 0.81, suggests that the model performed better than in the other three plots. Plot (C) represents the Random Forest model applied to a data set normalized using the Z-Score technique, which yields an AUC of 0.79; this normalization approach shows results similar to those of the Min-Max [0, 1] normalization. In the last plot (D), the Random Forest model applied to
Fig. 3. Gradient Boosting Performance
the data set with unit normalization achieved an AUC of 0.78, which indicates an acceptable performance of the model. The results of the four Random Forest models with the four normalization techniques demonstrate good classification performance, despite the minimal differences. These results will be useful in selecting appropriate normalization techniques for similar types of problems.

Gradient Boosting. The four plots (A, B, C, D) in Fig. 3 correspond to the Gradient Boosting model trained on data normalized with the Min-Max [0, 1], Min-Max [−1, 1], Z-Score and Unit Normalization techniques; all four yielded an AUC of 0.85. This result indicates that all four models performed well and are able to correctly classify the classes.
5 Conclusions
In conclusion, the present comparative study on the impact of data normalization on the accuracy of machine learning algorithms has provided valuable information on the performance of various algorithms under different normalization methods. The results obtained allow drawing relevant conclusions for the selection and optimization of algorithms in the field of data classification. It has been observed that some algorithms, such as Gradient Boosting, exhibit relatively high and consistent accuracy across all the normalization methods considered, similar to that on the non-normalized dataset. On the other hand, it has been identified that LR and NN are significantly sensitive to the normalization method applied, while RF and SVM particularly suffer with specific normalizations, Z-Score and Min-Max [0, 1] respectively. Since most of the algorithms produced results that are only minor improvements over the baseline model (random choice), we cannot rely much on the differences between techniques observed in poorly performing ML algorithms. The significant conclusions should be drawn from the algorithms whose performance is well above baseline, in this case above 70% in terms of accuracy. The DT, RF and GB models fall into that category and have exhibited a high ability to distinguish between classes, demonstrating efficient performance in this difficult data classification problem. Therefore, the analysis of the ROC curve and AUC was made with these models, to evaluate in detail the behavior of the high-performing algorithms that obtained the best results in terms of accuracy in this study. GB is remarkably insensitive to the normalization technique, with a constant AUC and similar accuracy results, irrespective of whether normalization is used or not. DT is not sensitive to normalization, except with Min-Max [−1, 1], which produces worse results both in accuracy and in AUC. RF presents a stable behavior under the four normalization methods, but in accuracy there is a variation, with a clear advantage of Min-Max [−1, 1], closely followed by unit standardization. In summary, the results of this study highlight the importance of data normalization in machine learning algorithms and its impact on classification accuracy
[3], and demonstrate that there is no clearly superior normalization method for all algorithms. The appropriate choice of normalization method can contribute significantly to improving the performance and accuracy of algorithms. These findings provide a valuable framework for future research and practical applications, facilitating the selection of appropriate normalization techniques based on the specific characteristics and requirements of machine learning algorithms and datasets. In any case, it is important to keep in mind that the accuracy achieved in this study is only one aspect of the evaluation, and other factors such as computational efficiency and interpretability should also be considered when selecting the most appropriate algorithm for a specific problem. As for the normalization techniques applied, they do not compromise interpretability, since they are very easy to revert when analyzing the results.
References
1. Bentéjac, C., Csörgő, A., Martínez-Muñoz, G.: A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 54, 1937–1967 (2021)
2. Bhanja, S., Das, A.: Impact of data normalization on deep neural network for time series forecasting (2018). arXiv preprint arXiv:1812.05519
3. Foody, G.M.: Status of land cover classification accuracy assessment. Remote Sens. Environ. 80(1), 185–201 (2002)
4. Guyon, I.: Madelon data set. In: UCI (2003)
5. Hoo, Z.H., Candlish, J., Teare, D.: What is an ROC curve? (2017)
6. Kleinbaum, D.G., Dietz, K., Gail, M., Klein, M., Klein, M.: Logistic Regression. Springer, Heidelberg (2002)
7. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press (2012)
8. Nayak, S.C., Misra, B.B., Behera, H.S.: Impact of data normalization on stock index forecasting. Int. J. Comput. Inf. Syst. Ind. Manag. Appl. 6(2014), 257–269 (2014)
9. Schonlau, M., Zou, R.Y.: The random forest algorithm for statistical learning. Stata J. 20(1), 3–29 (2020)
Adaptive Optics Correction Using Recurrent Neural Networks for Wavefront Prediction

Saúl Pérez Fernández1(B), Alejandro Buendía Roca1, Carlos González Gutiérrez1,2, Javier Rodríguez Rodríguez1, Santiago Iglesias Álvarez1, Ronny Anangonó Tutasig1, Fernando Sánchez Lasheras1,3, and Francisco Javier de Cos Juez1,4

1 Instituto de Ciencias y Tecnologías Espaciales de Asturias (ICTEA), University of Oviedo, Oviedo, Spain
[email protected]
2 Department of Computer Science, University of Oviedo, Oviedo, Spain
3 Department of Mathematics, University of Oviedo, Oviedo, Spain
4 Department of Exploitation and Exploration of Mines, University of Oviedo, Oviedo, Spain
Abstract. Adaptive Optics is a field aimed at improving the quality of the images received by terrestrial telescopes through the use of optical instrumentation, although it heavily relies on different techniques to control it. Neural networks have proven to be versatile in many different situations; therefore, applying them to Adaptive Optics is the next reasonable step. In a way similar to weather forecasting, the present paper focuses on the use of neural networks to predict the next states of the atmosphere. The results presented are comparable with those of traditional systems, with neural networks being cheaper and easier to implement.

Keywords: Adaptive Optics · Machine Learning · Prediction

1 Introduction
Adaptive optics (AO) is essential for ground-based telescopes, as the atmosphere is an inevitable source of uncertainties. Nevertheless, AO techniques and systems are not exempt from problems [5]. The discrete parts of the components or the time those components need to communicate with each other are examples of difficulties that AO needs to overcome. Currently, the use of machine learning techniques is one of the strengths of computing, allowing enormous progress to be made in the field of prediction [1]. In particular, recurrent networks are among the most promising, as they are capable of learning complex patterns from sequential data [9]. In the context of telescope image correction, recurrent neural networks can be used to model the time evolution of atmospheric aberrations. In this work
the focus is on the delay caused by the different computations performed by the components [4] and how it can be partially overcome by predicting the following state of the atmospheric turbulence, so that detectors can be arranged beforehand to correct those aberrations and the distortion obtained in the telescope images can be corrected without delay. Section 2 reviews the background of Adaptive Optics, on which the experiments are based. Section 3 describes the tools used to acquire the results. Section 4 summarizes where the data come from and the different network architectures employed. Section 5 details how the experiments are done, and the results are explained in Sect. 6. Finally, concluding remarks and future lines of work are given in Sect. 7.
2 Adaptive Optics
As photons go through the atmosphere, they suffer scattering and diffraction on account of air turbulence. These effects translate into low-quality, blurred images. In astronomy, where the number of photons received is low, the ability to correct the aberrations is indispensable to make useful measurements. Adaptive optics combines different techniques that measure the distortion of wavefronts coming from celestial bodies and correct it in real time. These corrections are even more crucial in modern telescopes as they grow in size [6]. The fundamental instruments used in astronomical adaptive optics are wavefront sensors (WFS), which measure the wavefront distortions, and deformable mirrors (DM), which correct them. These instruments communicate through a real-time control system that tells the DM how to bend given a wavefront measurement [2]. Nevertheless, this information exchange can introduce a temporal mismatch between both instruments. Wavefront sensors measure the shape of the incoming wavefront to determine its aberrations, given by the deviations from a plane wavefront. Even though there are many types of sensors, in this experiment one of the most common WFS in astronomy is used, the Shack-Hartmann WFS (SH-WFS) (Fig. 1). These sensors consist of a lenslet array that replaces the pupil and divides the wavefront, along with a position-sensing detector. The deviations of the focal spots are directly related to the average wavefront slope across each lenslet or subaperture, which can be analyzed to perform the later reconstruction of the wavefront. Deformable mirrors (Fig. 1) are flexible reflective surfaces mounted over a collection of pistons or actuators. As an actuator is activated, it moves the surface of the mirror, changing the local curvature. DMs use WFS information to reconstruct the wavefront; the calculations needed to go from one to the other are called the reconstruction algorithm. There are many different algorithms to modify the shape of the mirror, although in this paper the main focus will be the use of artificial neural networks for this task [13]. Along with the information from the WFS and a tomographic reconstructor, DMs can correct the aberrations of the wavefronts.
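As a rough sketch of this centre-of-gravity slope measurement (a simplified illustration of our own; real pipelines add thresholding, windowing and calibration):

```python
import numpy as np

def sh_slopes(spots, ref_centroids):
    """Estimate per-subaperture slopes from Shack-Hartmann focal-spot images.

    spots: (n_sub, h, w) array, one focal-spot image per lenslet.
    ref_centroids: (n_sub, 2) spot positions for a flat reference wavefront.
    Returns the (x, y) centroid displacements, which are proportional to the
    average wavefront slope over each subaperture.
    """
    n_sub, h, w = spots.shape
    ys, xs = np.mgrid[0:h, 0:w]
    total = spots.sum(axis=(1, 2))
    cx = (spots * xs).sum(axis=(1, 2)) / total  # intensity-weighted centroids
    cy = (spots * ys).sum(axis=(1, 2)) / total
    return np.stack([cx, cy], axis=1) - ref_centroids
```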
Fig. 1. Open-loop optical system. Shack-Hartmann WFS measures the distortion on the wavefront and tells the DM how to bend to correct it. [1]
3 Neural Networks
The origin of artificial neural networks arises from the intention of building an intelligent machine based on the architecture of the human brain, which is why they are inspired by biological neural connections [11]. They have now become one of the most relevant tools in the field of prediction, as they are highly versatile as well as powerful and scalable. This makes them a great resource when dealing with large amounts of data, whether arrays of numbers, images, text, etc. [8]. Neural networks are composed of multiple layers of interconnected neurons, where each neuron is responsible for performing a mathematical operation on the provided input data and then transferring the results to the next layer. The transformed information is propagated through all the layers and, once it reaches the final layer, the data is transformed to a shape determined by the type of network used: numerical values if it is a regression network or categories if it is a classification network. For the training process of a neural network, large data sets containing both the input variables and the corresponding output labels are used. In this process, the weights and internal biases are adjusted to progressively minimize the discrepancy between the predictions and the output labels by backpropagating [12] the accumulated error at the output. This is done iteratively in order to converge to a minimum error that provides the best possible results.

3.1 Long Short-Term Memory Neural Networks
Recurrent neural networks can be distinguished from traditional feed-forward neural networks because they have connections not only to the next layers, but also to the previous ones. In the analysis of sequential data, these networks can give us information about how the studied variables evolve. Within this group, there is a type of sequential neural network structure which is the main one used in the experiments carried out, the Long Short-Term
Memory (LSTM) networks [7]. These cells, which can be used in a similar way to a simple neural cell, perform better than other predictive structures: their training converges faster and the detection of long-term data dependencies is more efficient. Furthermore, they are implemented in the deep learning library Keras, so the joint use of GPUs in training greatly speeds up the learning process of the network [3]. The general idea of LSTM is that the network should be able to learn what information to store in the long term, instead of storing everything or discarding everything beyond a certain point, and it should know what to read at any given moment from what has been stored in its memory. In short, the information passes through the forget gate, which deletes the irrelevant data, and new memories are then collected through an addition operation.
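A minimal Keras sketch in the spirit of the models trained later (two LSTM layers of 200 units, 'Adam' optimiser and 'MSE' loss, as stated in Sect. 5.1; the frame and slope dimensions below are placeholders, not the actual CANARY geometry):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_in, n_out, n_slopes = 20, 3, 72   # input frames, predicted frames, slopes per frame

model = keras.Sequential([
    layers.Input(shape=(n_in, n_slopes)),
    layers.LSTM(200, return_sequences=True),  # pass the whole sequence to the next LSTM
    layers.LSTM(200),
    layers.Dense(n_out * n_slopes),
    layers.Reshape((n_out, n_slopes)),        # one slope vector per predicted frame
])
model.compile(optimizer="adam", loss="mse")

X = np.random.randn(256, n_in, n_slopes)      # toy stand-in for WFS slope sequences
y = np.random.randn(256, n_out, n_slopes)
model.fit(X, y, epochs=2, batch_size=64, verbose=0)
```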
4 Experiment Description

4.1 Training Data
In order to implement the results in real telescopes, realistic data that can represent real atmospheric conditions are needed to test the model. Nevertheless, such data can be hard to measure, as the components of an optical system are complex and expensive. Simulations come out as a way to test our systems without the need of actually having them. For this experiment, the Simulation 'Optique Adaptative' with Python (SOAPY) module [10] has been used to test our models. Starting with the creation of atmospheric phase screens, the SOAPY module recreates the propagation of a signal through those phases up to the behaviour and measurements of the optical system. Many parameters are customizable, such as the type of WFS or the properties of the atmosphere and the telescope, which allows us to recreate the specific characteristics we need for our experiment. The specific parameters given to the simulator are now described:

– System: Simulations run at a frequency of 500 Hz in open-loop configuration, using one guide star.
– Atmosphere: Two turbulence screens are simulated with the same strength, one at ground level and another at a random height between 1 and 11 km. At each screen, the wind speed is chosen randomly between 10 and 15 m/s with a random direction. The Fried coherence length of the screens is r0 = 16 cm at 500 nm and their outer scale is L0 = 25 m.
– Telescope: The characteristics of the telescope are based on CANARY, a demonstrator for Extremely Large Telescope (ELT) instruments, hosted by the William Herschel Telescope (WHT). Therefore, the telescope has a circular pupil of 4.2 m with a central obscuration of 1.2 m.

The data used for training are the wavefront slopes measured by the WFS for a collection of frames, which can be converted to Zernike coefficients and to phase screens (Fig. 2). The number of training frames is summarized in Sect. 4.2.
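The slope-to-Zernike conversion mentioned above is typically a linear least-squares fit; a sketch follows, where the random matrix merely stands in for a calibrated interaction matrix (our illustration, not the SOAPY API):

```python
import numpy as np

rng = np.random.default_rng(0)
n_slopes, n_modes = 72, 20                  # illustrative dimensions
D = rng.normal(size=(n_slopes, n_modes))    # stand-in for the Zernike-to-slope interaction matrix
s = rng.normal(size=n_slopes)               # one measured WFS slope vector

a, *_ = np.linalg.lstsq(D, s, rcond=None)   # estimated Zernike coefficients
s_fit = D @ a                               # slopes reproduced by the modal fit
```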
Fig. 2. Example of atmospheric screen reconstruction from simulation (left) and from prediction (right).
4.2 Network Architecture
For the experiments performed, different forms of training have been carried out in order to find the most efficient solution for the present case, given the nature of the data available. In terms of recurrent neural network structure, the above-mentioned LSTMs have been used. The reduced version of these networks, the GRU cell, has also been tried; although it offers a slightly faster training process, it makes more errors and the reduction in time is not noticeable. In the case of combining sequential and convolutional networks, different layers have been combined, which will be discussed in the respective section. The different training methods are described below:

Direct Forecast. This is the simplest case. Data is given directly to the network to predict the slopes at the next frames. Generally, a sequence of three following frames is predicted.

Multi-target Forecast. Considering the same number of frames to predict, a different model can be trained for each of them. An independent prediction is made for each frame starting from the same input data. The first frame has an advantage in the prediction, since the frames containing the most information are those closest to the value to be predicted.

Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) Forecast. Since a recurrent neural network has problems when it comes to parallelizing, adding convolutional layers to the network architecture allows any time step to be calculated based on a reduced set of input values
without the need to know past states. Additionally, this approach can combat one of the main issues with recurrent networks, unstable gradients. In the present case, the network structure is composed of four 1D convolutional layers, with pooling layers after every two, followed by two fully connected layers and two LSTM layers. However, instead of choosing a fully connected layer for the output, a TimeDistributed layer is configured, which is used to produce an output at each time step. This layer is different from a fully connected output layer, as it keeps information about each time step that a single fully connected layer may lose.

Gradient Boosting Forecast. In terms of model structure, this is the most complex training mode, but also one of the most promising. In this case, the input matrix has the same shape as in the previous cases, but, for the target outputs, a matrix is built with as many rows and columns as frames to be predicted, i.e., if we want to predict n frames, the target output matrix has dimensions n × n. This matrix is filled with sequences of n frames taken from both training and target data: the first row uses n−1 training values and one target value, the second row uses n−2 training values and two target values, and so on, up to the last row, which uses the whole target sequence. Once training is done, only the last row is kept. Thanks to this, the interaction of the data when backpropagating the gradient is intensified, which allows the model to have more information from which to obtain patterns of turbulence behavior.
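A small NumPy sketch of our reading of this target construction, per sample and per slope (illustrative only; the paper gives no code):

```python
import numpy as np

def boosted_targets(x_in, y_out):
    """Row k mixes the last n-k input frames with the first k target frames;
    the final row is the full target sequence, the only one kept at inference."""
    n = len(y_out)
    rows = [np.concatenate([x_in[len(x_in) - (n - k):], y_out[:k]])
            for k in range(1, n + 1)]
    return np.stack(rows)

print(boosted_targets(np.arange(10.0), np.array([10.0, 11.0, 12.0])))
# [[ 8.  9. 10.]
#  [ 9. 10. 11.]
#  [10. 11. 12.]]
```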
5 Experiments
Several experiments have been carried out in order to improve real-time adaptive optics predictions by means of neural networks. The procedure followed is to search for the possible variables, and relationships between them, that could lead to such advances.

5.1 Training Performance by Varying the Number of Input Frames
The models used are based on the network architectures mentioned before. In all cases the training has been carried out with a variable number of input frames, from 5 to 80 in steps of 5, which allows observing the relationship between the length of the input sequence and the performance. As for the training parameters, with the exception of the CNN-RNN (described in Sect. 4.2), 200 units have been used for the LSTM layers, with the 'Adam' optimiser, the 'MSE' cost function, 40 training epochs and a batch size of 64. No further parameters are specified beyond the composition of the layers, as the Keras defaults are used.

5.2 Training Performance by Varying the Frequency of Slopes Attainment
The update frequency of the frames can be specified in the simulator, which allows obtaining the slopes of the corresponding turbulence for two instants
360
S. P. Fern´ andez et al.
separated by a time equivalent to the period. In the previous experiment, a frequency of 500 Hz is used, which means that the turbulence is updated every 2 ms. For this experiment, only the training method with the best results in the previous experiment has been considered, to reduce the amount of redundant information and to make it easier to obtain the results.
6 Results
Based on the results obtained in the experiments introduced above, we proceed to describe the following points.

6.1 Variable Input Frames
Predicted values have a stronger dependence on the last input frames than on the first ones, although not much information is gained when the number is increased considerably above 20 frames. Models lose performance when the size is reduced below that range, dropping sharply when a large part of the input sequence is disregarded. Therefore, the importance of the closest values for the prediction is clear, as well as the little relevance of the distant ones. This makes sense in a case of sequential prediction such as ours, in which we are trying to predict numerical values whose nature is practically random, but governed by general turbulence equations. However, if the aim is to obtain the best possible model while prioritizing the accuracy of the predictions, taking into account a high number of input frames is still a good option, although it does not seem necessary for this value to be too high. Figure 3 shows this evolution as a function of the different architectures: Direct prediction, Gradient Boosting (GB), the mixture of CNN & RNN (CNN), and finally the Multi-Target prediction (MT).

6.2 Variable Frequency
The following experiment seeks to demonstrate the relationship between the update frequency of the data generated by the simulator and the performance of the models. In this case, the best-performing network, the Gradient Boosting network, is considered. As for the frequency ranges, given that the data were initially generated at 500 Hz to give a closer approximation to the real case, relative reductions have been made. The frequencies considered are 500, 250, 166.67, 125, 100 and 83.33 Hz. The best results are obtained at the initial frequency of 500 Hz, although the predictions of the first frame hold up quite well for each of the frequencies considered, as can be seen in Fig. 4, with the accuracy dropping in the following frames. Interestingly, the accuracy drops as the frequency decreases, but at a certain point it rises again. One possible explanation is that, as the frequency is reduced, more and more relevant information is disregarded, but at the same time the model takes less account of the noise present in the data.
Fig. 3. Prediction performance depending on the type of network and length of the input sequence.
Fig. 4. Prediction performance depending on the frame frequency
7 Conclusion
Based on the results obtained, we can conclude that the most promising model so far is Gradient Boosting, for which certain modifications are planned for future experiments. In relation to the dependence on the input sequence, most of the information from which the model is able to obtain turbulence behaviour patterns comes from the frames closest to the prediction. In the future, this will be analysed at a lower level, with even shorter input data or by modifying the number of frames to be predicted, given that the model may be able to focus its efforts on fewer frames to obtain better results. Another possible issue to analyse in the future is the relationship between frequency and noise in the data during training. Networks are great tools for noise filtering, but some kind of noise filtering or principal component analysis could be added to prioritise the most relevant information and improve prediction performance. Further experiments have been considered for the future that may provide a more global view of this process and facilitate the interpretation of the results from an optical-error-based point of view.
References
1. de Cos Juez, F.J., Lasheras, F.S., Roqueñí, N., Osborn, J.: An ANN-based smart tomographic reconstructor in a dynamic environment. Sensors (Basel, Switzerland) 12, 8895–8911 (2012)
2. Basden, A.G., et al.: Experience with wavefront sensor and deformable mirror interfaces for wide-field adaptive optics systems. Monthly Not. Royal Astron. Soc. 459, 1350–1359 (2016)
3. González-Gutiérrez, C., Sanchez-Rodríguez, M.L., Calvo-Rolle, J.L., de Cos Juez, F.J.: Multi-GPU development of a neural networks based reconstructor for adaptive optics. In: Complexity, 2018 (Intelligent Control Approaches for Modeling and Control of Complex Systems) (2018)
4. González-Gutiérrez, C., et al.: Comparative study of neural network frameworks for the next generation of adaptive optics systems. Sensors 17, 1263 (2017)
5. Guo, Y., et al.: Adaptive optics based on machine learning: a review (2022)
6. Hippler, S.: Adaptive optics for extremely large telescopes. J. Astron. Instrument. 08, 1950001 (2019)
7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
8. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
9. Liu, X., Morris, T., Saunter, C., de Cos Juez, F.J., González-Gutiérrez, C., Bardou, L.: Wavefront prediction using artificial neural networks for open-loop adaptive optics. Monthly Not. Royal Astron. Soc. 496(1), 456–464 (2020)
10. Reeves, A.: Soapy: an adaptive optics simulation written purely in Python for rapid concept development. In: Marchetti, E., Close, L.M., Véran, J.P. (eds.) Adaptive Optics Systems V, vol. 9909 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, p. 99097F (2016)
11. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 386–408 (1958)
12. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
13. Suárez Gómez, S.L., et al.: Experience with artificial neural networks applied in multi-object adaptive optics. Publ. Astron. Soc. Pac. 131(1004), 108012 (2019)
Author Index
A
A. de la Cal, Enrique 323
Ahmed, Halal Abdulrahman 111
Alfaro-Viquez, David 163
Álvarez, Santiago Iglesias 354
Arias, Daniel 44
Asencio-Cortés, G. 121, 132
Azorín-López, Jorge 163, 173, 216, 226, 236

B
Bandera, Antonio 246
Barreno, F. 256
Basurto, Nuño 205
Bayona, Eduardo 300
Benavent-Lledo, Manuel 226
Benito-Picazo, Jesús 184
Bernardos, Ana M. 13, 24
Besada, Juan 13, 24
Booysen, M. J. 216
Buestán-Andrade, Pablo-Andrés 334
C C. Riquelme, José 101 Cabello-Solorzano, Kelsy 344 Calvo-Rolle, José Luis 311 Carramiñana, David 24 Carranza-García, Manuel 101 Chacón, Juan Luis Ferrando 67 Chacón-Maldonado, A. M. 121 Chira, Camelia 279 Climent-Pérez, Pau 173 Correia, Luís 344 Costa, Nahuel 269 ´ Cwikła, Grzegorz 77 D de Barreana, Telmo Fernández 67 de Cos Juez, Francisco Javier 354 de la Cal, Enrique 279 Del Río Cristóbal, Miguel 145 Domínguez, Enrique 184
E
Etxegoin, Zelmar 67
F
F. Corrêa, Douglas 290
F.M.G. Carvalho, Guido 290
Fernández, Saúl Pérez 354
Fernández-Rodríguez, Jose David 184
G
Galán-Cuenca, Alejandro 173
García González, Enol 279
Garcia, Ander 67
García, Jesús 34, 44
García-d'Urso, Nahuel 173
Garcia-D'Urso, Nahuel E. 236
Garcia-Rodriguez, Jose 153, 163, 195, 216, 226
Gil-Arroyo, Beatriz 205
González Gutiérrez, Carlos 354
Grillo, Hanzel 163
Gwiazda, Aleksander 88

H
Herrera, Elena 323
Herrero, Álvaro 205
Herrero, Daniel Amigo 34

I
Ilin, Vladimir 311
J
J. Tallón-Ballesteros, Antonio 344
J. Valdés Cuervo, Amable 323
Jiménez-Navarro, M. J. 132

K
Krenczyk, Damian 57
L
Llerena, Juan Pedro 44
López-Rubio, Ezequiel 184
Lorenz, Tomasz 77
Losada, Adrián 13
Lozano-Juárez, Samuel 205

M
Marfil, Rebeca 246
María Luna-Romera, Jose 101
María Sanjuán, José 145
Martínez-Álvarez, F. 121, 132
Martínez-Ballesteros, M. 132
Molina, José Manuel 34, 44
Moreno, Ramón 145
Mulero-Pérez, David 226

N
Nalepa, Grzegorz J. 153, 195
Nepomuceno, Juan A. 111
Nepomuceno-Chamorro, Isabel A. 111

O
Olender-Skóra, Małgorzata 88
Oregui, Xabier 67
Ortigosa de Araujo, Isabela 344
Ortiz-Perez, David 153, 195

P
Palomo, Esteban J. 184
Pazmiño-Piedra, Juan-Pablo 334
Pedroche, David Sánchez 34
Pelta, David A. 290
Peña, Marco 344
Pérez, Lucas 269

R
R. Villar, José 279
Raposo, Daniel 24
Reina-Jiménez, Pablo 101
Roca, Alejandro Buendía 354
Rodríguez, Javier Rodríguez 354
Rodríguez-Juan, Javier 153, 195
Romana, M. 256
Romero-Garcés, Adrián 246
Ruiz-Beltrán, Camilo 246
Ruiz-Ponce, Pablo 195

S
Salguero, Francisco Fariña 34
Sánchez Lasheras, Fernando 354
Sánchez, Luciano 269, 279
Santos, Matilde 256, 300, 334
Sanz, Juan Marcos 205
Sebastian-Gonzalez, Esther 216
Sedano, Javier 279
Shankar Muthuselvam, Revanth 145
Sierra-García, Jesús-Enrique 300, 334
Silva Neto, Antônio J. 290
Simić, Dragan 311
Simić, Svetlana 311
Simić, Svetislav D. 311
Steinberg, Alan N. 3

T
Teterja, Dmitrij 216
Toledo, Claudio F. M. 290
Tomás, David 153, 195
Troncoso, A. 121
Troncoso-García, A. R. 121
Tutasig, Ronny Anangonó 354

U
Urda, Daniel 205

V
van der Walt, Rita Elise 216
Vega-Márquez, Belén 111
Velasco-Pérez, Nuria 205
Villar, José R. 311
Vizcaya-Moreno, Flores 226

W
Wang, Ting 145

Z
Zamora-Hernandez, Mauricio-Andres 163