Lecture Notes in Networks and Systems 295
Kohei Arai Editor
Intelligent Systems and Applications Proceedings of the 2021 Intelligent Systems Conference (IntelliSys) Volume 2
Lecture Notes in Networks and Systems Volume 295
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation (DCA), School of Electrical and Computer Engineering (FEEC), University of Campinas (UNICAMP), São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/15179
Editor Kohei Arai Faculty of Science and Engineering Saga University Saga, Japan
ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-030-82195-1 ISBN 978-3-030-82196-8 (eBook) https://doi.org/10.1007/978-3-030-82196-8 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022, corrected publication 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Editor’s Preface
We are very pleased to introduce the Proceedings of Intelligent Systems Conference (IntelliSys) 2021 which was held on September 2 and 3, 2021. The entire world was affected by COVID-19 and our conference was not an exception. To provide a safe conference environment, IntelliSys 2021, which was planned to be held in Amsterdam, Netherlands, was changed to be held fully online. The Intelligent Systems Conference is a prestigious annual conference on areas of intelligent systems and artificial intelligence and their applications to the real world. This conference not only presented the state-of-the-art methods and valuable experience, but also provided the audience with a vision of further development in the fields. One of the meaningful and valuable dimensions of this conference is the way it brings together researchers, scientists, academics, and engineers in the field from different countries. The aim was to further increase the body of knowledge in this specific area by providing a forum to exchange ideas and discuss results, and to build international links. The Program Committee of IntelliSys 2021 represented 25 countries, and authors from 50+ countries submitted a total of 496 papers. This certainly attests to the widespread, international importance of the theme of the conference. Each paper was reviewed on the basis of originality, novelty, and rigorousness. After the reviews, 195 were accepted for presentation, out of which 180 (including 7 posters) papers are finally being published in the proceedings. These papers provide good examples of current research on relevant topics, covering deep learning, data mining, data processing, human–computer interactions, natural language processing, expert systems, robotics, ambient intelligence to name a few. 
The conference would truly not function without the contributions and support received from authors, participants, keynote speakers, program committee members, session chairs, organizing committee members, steering committee members, and others in their various roles. Their valuable support, suggestions, dedicated commitment, and hard work have made IntelliSys 2021 successful. We warmly thank and greatly appreciate the contributions, and we kindly invite all to continue to contribute to future IntelliSys.

We believe this event will certainly help further disseminate new ideas and inspire more international collaborations.

Kind Regards,
Kohei Arai
Contents
Zero-Touch Customer Order Fulfillment to Support the New Normal of Retail in the 21st Century
Stavros Ponis, Eleni Aretoulaki, George Plakas, Kostas Agalianos, and Theodoros Nikolaos Maroutas

VitrAI: Applying Explainable AI in the Real World
Marc Hanussek, Falko Kötter, Maximilien Kintz, and Jens Drawehn

Contactless Interface for Navigation in Medical Imaging Systems
Martin Žagar, Ivica Klapan, Alan Mutka, and Zlatko Majhen

Mobile Apps for 3D Face Scanning
Laura Dzelzkalēja, Jēkabs Kārlis Knēts, Normens Rozenovskis, and Armands Sīlītis

Tabu Search for Locating-Routing in the Goods Delivery and Waste Pickup in Trujillo-Peru
Edwar Lujan Segura, José Rodríguez Melquiades, and Flabio Gutiérrez Segura

The Emergence of Hybrid Edge-Cloud Computing for Energy Efficiency in Buildings
Yassine Himeur, Abdullah Alsalemi, Faycal Bensaali, and Abbes Amira

Particle Swarm Model for Predicting Student Performance in Computing Programs
Mirna Nachouki and Riyadh A. K. Mehdi

A Genetic Algorithm for Quantum Circuit Generation in OpenQASM
Teong Joo Ong and Chye Cheah Tan

An Improved Clustering-Based Harmony Search Algorithm (IC-HS)
Yang Zhang, Jiacheng Li, and Lei Li
Role of Artificial Intelligence in Software Quality Assurance
Sonam Ramchand, Sarang Shaikh, and Irtija Alam

Machine Learning for Optimal ITAE Controller Parameters for Thermal PTn Actuators
Roland Büchi

Evaluation of Transformation Tools in the Context of NoSQL Databases
Sarah Myriam Lydia Hahn, Ionela Chereja, and Oliviu Matei

Network Classification with Missing Information
Ruriko Yoshida and Carolyne Vu

Topic Modeling Based on ICD Codes for Clinical Documents
Yijun Shao, Rebecca S. Morris, Bruce E. Bray, and Qing Zeng-Treitler

Imbalanced Dataset Optimization with New Resampling Techniques
Ivan Letteri, Antonio Di Cecco, Abeer Dyoub, and Giuseppe Della Penna

Supporting Financial Inclusion with Graph Machine Learning and Super-App Alternative Data
Luisa Roa, Andrés Rodríguez-Rey, Alejandro Correa-Bahnsen, and Carlos Valencia Arboleda

The Data Mining Dataset Characterization Ontology
Man Tianxing and Nataly Zhukova

Electromagnetism-Like Algorithm and Harmony Search for Chemical Kinetics Problem
E. N. Shvareva and L. V. Enikeeva

Multi-Level Visualization with the MLV-Viewer Prototype
Carlos Manuel Oliveira Alves, Manuel Pérez Cota, and Miguel Ramón González Castro

One-Class Self-Attention Model for Anomaly Detection in Manufacturing Lines
Linh Le, Srivatsa Mallapragada, Shashank Hebbar, and David Guerra-Zubiaga

Customer Churn Prediction and Promotion Models in the Telecom Sector: A Case Study
Ulku F. Gursoy, Enes M. Yildiz, M. Ergun Okay, and Mehmet S. Aktas

Learning Incorrect Verdict Patterns of the Established Face Recognizing CNN Models Using Meta-Learning Supervisor ANN
Stanislav Selitskiy, Nikolaos Christou, and Natalya Selitskaya
Formation Method of Delaunay Lattice for Describing and Identifying Objects in Fuzzy Images
S. O. Kramarov, V. V. Khramov, O. Y. Mityasova, E. V. Grebenyuk, A. A. Bocharov, and D. V. Chebotkov

Analysis of Electricity Customer Clusters Using Self-organizing Maps
Gabriel Augusto Rosa, Daniel de Oliveira Ferreira, Alan Petrônio Pinheiro, and Keiji Yamanaka

SmartData: An Intelligent Decision Support System to Predict the Readers Permanence in News
Jessie Caridad Martín Sujo, Elisabet Golobardes i Ribé, Xavier Vilasís Cardona, Virginia Jiménez Ruano, and Javier Villasmil López

Tropical Data Science over the Space of Phylogenetic Trees
Ruriko Yoshida

A Study of Big Data Analytics in Internal Auditing
Neda Shabani, Arslan Munir, and Saraju P. Mohanty

An Automated Visualization Feature-Based Analysis Tool
Rabiah Abdul Kadir, Shaidah Jusoh, and Joshua Faburada

High Capacity Data Hiding for AMBTC Decompressed Images Using Pixel Modification and Difference Expansion
Lee-Jang Yang, Fang-Ping Pai, Ying-Hsuan Huang, and Ching-Ya Tseng

SIFCM-Shape: State-of-the-Art Algorithm for Clustering Correlated Time Series
Chen Avni, Maya Herman, and Ofer Levi

FRvarPSO as an Alternative to Measure Credit Risk in Financial Institutions
Patricia Jimbo Santana, Laura Lanzarini, and Aurelio F. Bariviera

A Multimodal Digital Humanities Study of Terrorism in Swedish Politics: An Interdisciplinary Mixed Methods Project on the Configuration of Terrorism in Parliamentary Debates, Legislation, and Policy Networks 1968–2018
Jens Edlund, Daniel Brodén, Mats Fridlund, Cecilia Lindhé, Leif-Jöran Olsson, Magnus P. Ängsal, and Patrik Öhberg

Fraud Detection in Online Market Research
Vera Kalinichenko, Gasia Atashian, Davit Abgaryan, and Natasha Wijaya

Deep Neural-Network Prediction for Study of Informational Efficiency
Rejwan Bin Sulaiman and Vitaly Schetinin
A Novel Method to Estimate Parents and Children for Local Bayesian Network Learning
Sergio del Río and Edwin Villanueva

A Complete Index Base for Querying Data Cube
Viet Phan-Luong

Multi-resolution SVD, Linear Regression, and Extreme Learning Machine for Traffic Accidents Forecasting with Climatic Variable
Lida Barba, Nibaldo Rodríguez, Ana Congacha, and Lady Espinoza

Identifying Leading Indicators for Tactical Truck Parts’ Sales Predictions Using LASSO
Dylan Gerritsen and Vahideh Reshadat

Detecting Number of Passengers in a Moving Vehicle with Publicly Available Data
Luciano Branco, Fengxiang Qiao, and Yunpeng Zhang

Towards Context-Awareness for Enhanced Safety of Autonomous Vehicles
Nikita Bhardwaj Haupt and Peter Liggesmeyer

Hybrid Recurrent Traffic Flow Model (URTFM-RNN)
Elena Sofronova

A Multimodal Approach to Psycho-Emotional State Detection of a Vehicle Driver
Igor Lashkov and Alexey Kashevnik

Optimizing the Belfast Bike Sharing Scheme
Nadezda Demidova, Aleksandar Novakovic, and Adele H. Marshall

Vehicle-to-Grid Based Microgrid Modeling and Control
Samith Chowdhury, Hessam Keshtkar, and Farideh Doost Mohammadi

Intelligent Time Synchronization Protocol for Energy Efficient Sensor Systems
Jalil Boudjadar and Mads Mørk Beck

Intelligent Sensors for Intelligent Systems: Fault Tolerant Measurement Methods for Intelligent Strain Gauge Pressure Sensors
Thomas Barker, Giles Tewkesbury, David Sanders, and Ian Rogers

IoT Computing for Monitoring NFT-I Cultivation Technique in Vegetable Production
Manuel J. Ibarra, Edgar W. Alcarraz, Olivia Tapia, Aydeé Kari, Yalmar Ponce, and Rosmery S. Pozo
Selective Windows Autoregressive Model for Temporal IoT Forecasting
Samer Sawalha, Ghazi Al-Naymat, and Arafat Awajan

Distance Estimation Methods for Smartphone-Based Navigation Support Systems
Bineeth Kuriakose, Raju Shrestha, and Frode Eika Sandnes

Adversarial Domain Adaptation for Medieval Instrument Recognition
Imad Eddine Ibrahim Bekkouch, Nicolae Dragoş Constantin, Victoria Eyharabide, and Frederic Billiet

Transfer Learning Methods for Training Person Detector in Drone Imagery
Saša Sambolek and Marina Ivašić-Kos

Video Processing Algorithm in Foggy Environment for Intelligent Video Surveillance
Alexey Nikolayevich Subbotin, Nataly Alexandrovna Zhukova, and Tianxing Man

Document Digitization Technology and Its Application in Tanzania
Mbonimpaye John, Beatus Mbunda, Victor Willa, Neema Mduma, Dina Machuve, and Shubi Kaijage

Risk and Returns Around FOMC Press Conferences: A Novel Perspective from Computer Vision
Alexis Marchal

Attention-Enabled Object Detection to Improve One-Stage Tracker
Neelu Madan, Kamal Nasrollahi, and Thomas B. Moeslund

Development of a Human Identification Software System in Real Time
Askar Boranbayev, Seilkhan Boranbayev, Mukhamedzhan Amirtaev, Malik Baimukhamedov, and Askar Nurbekov

A Survey on the Semi Supervised Learning Paradigm in the Context of Speech Emotion Recognition
Guilherme Andrade, Manuel Rodrigues, and Paulo Novais

CESAR: A New Metric to Measure the Level of Code-Switching in Corpora - Application to Maghrebian Dialects
Karima Abidi and Kamel Smaïli

Finding Trustworthy Users: Twitter Sentiment Towards US Presidential Candidates in 2016 and 2020
Teng-Chieh Huang, Razieh Nokhbeh Zaeem, and K. Suzanne Barber
Neural Abstractive Unsupervised Summarization of Online News Discussions
Ignacio Tampe, Marcelo Mendoza, and Evangelos Milios

Correction to: SmartData: An Intelligent Decision Support System to Predict the Readers Permanence in News
Jessie Caridad Martín Sujo, Elisabet Golobardes i Ribé, Xavier Vilasís Cardona, Virginia Jiménez Ruano, and Javier Villasmil López

Author Index
Zero-Touch Customer Order Fulfillment to Support the New Normal of Retail in the 21st Century

Stavros Ponis(B), Eleni Aretoulaki, George Plakas, Kostas Agalianos, and Theodoros Nikolaos Maroutas

School of Mechanical Engineering, National Technical University Athens, Ir. Politechniou 9, Zografou, 157 73 Athens, Greece
[email protected]
Abstract. After the first lifting of the COVID-19 lockdown, with consumers reluctant to shop in physical stores and online deliveries taking twice as long to arrive, in-store pickup, commonly referred to as “Click & Collect”, has become a pragmatic alternative. Still, most retailers require their in-store pickup customers to wait in lengthy lines before being served and then wait again for their orders to be collected. It is, hence, imperative for the “Click & Collect” business model to be adjusted to minimize physical contact, so as to prevent future government-enforced store closures, guard against a heavy downturn in retail sales and support the ailing economy during a potential recurring COVID-19 related crisis. This is where the proposed solution sets its vision: to adapt “Click & Collect” to the new reality by introducing an innovative, completely contactless customer order delivery system. The proposed innovation is based on the rapidly expanding “curbside pickup” model and is enhanced with two technologies, i.e. a Wi-Fi Positioning System and Augmented Reality. It is intended to be highly scalable and to support the retail industry in its battle against the pandemic by arming retailers with a system that is easy to implement and use, helping them overcome the challenges created by COVID-19.

Keywords: COVID-19 · Click & Collect · Curbside pickup · Augmented Reality
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 1–10, 2022. https://doi.org/10.1007/978-3-030-82196-8_1

1 Introduction

The successful proliferation of business to consumer (B2C) e-commerce has undoubtedly revolutionized retail transactions and considerably benefited businesses, by providing them with the opportunity to complement their ‘bricks and mortar’ physical retail network, while at the same time enhancing their omnichannel presence. This multichannel approach has enabled consumers to combine multiple forms of purchases to accommodate their needs, ranging from traditional physical shopping to online shopping with home delivery or in-store pickup. However, online shopping, and especially home delivery, has led to new challenges for supply chain management, a great number of which manifest in last-mile logistics, i.e. the last part of a B2C delivery process, from the moment a parcel is shipped from a distribution center to the moment it is received by the end user. In particular, B2C last-mile delivery is currently perceived as one of the least efficient and most expensive operations throughout the retail supply chain [1], accounting for up to 53% of total transportation costs [2]. The challenges associated with last-mile delivery, such as extremely short delivery schedules, road infrastructure limitations and customer availability, have always been present. Nevertheless, since the COVID-19 outbreak, with more consumers shopping online, the need to meet the growing demand for last-mile delivery while implementing health and safety protocols and effective social distancing measures designed to protect employees and customers indisputably increases last-mile expenses and further degrades the flexibility and efficiency of the delivery process. As a matter of fact, according to the “Global E-commerce Update 2021: Worldwide Ecommerce Will Approach $5 Trillion This Year”, the novel coronavirus pandemic has led to sudden growth of the global e-commerce market, which amounted to a total of $4.280 trillion in 2020, representing an increase of 27.6% compared to the previous year’s pre-COVID figures [3]. Therefore, this unprecedented surge in e-commerce [4], which is expected to outlive COVID-19 [5], dictates a re-evaluation of current e-commerce fulfilment processes, in order to ensure positive customer experiences through timely, safe and contactless deliveries, while keeping additional costs to a minimum. The only way for businesses to flourish in this perforce contactless era lies in understanding consumers’ new needs and reshaping the current business models accordingly, to minimize COVID-related disruptions.
2 Proposal Objectives and Challenges

The worldwide spread of COVID-19 has led to a paradigm shift in consumer behavior, transactions and the economy. The measures applied to contain the pandemic have triggered an economic downturn of great magnitude, by forcing retail businesses to shut down for a long period of time and consumers to turn to online shopping in order to meet their daily needs. However, this abrupt increase in e-commerce demand has obliged a significant portion of retailers to shut down both their offline and online channels, due to their lack of resources and inexperience in shipping, driving their turnover rates to unsustainable levels. On the other hand, those equipped to handle the new reality were faced with an unforeseen excessive workload, which degraded the performance of last-mile delivery networks and resulted in customer dissatisfaction. After the first lifting of the lockdown, with consumers being reluctant to shop in physical stores and online deliveries still taking weeks to arrive instead of days, in-store pickup, most commonly referred to as ‘Click & Collect’, has become a pragmatic alternative. ‘Click & Collect’ has become the preferred starting point for retailers too, because it utilizes existing stores and workers, and it is not as expensive or logistically complicated as delivering orders to consumers’ residences. Still, even this ostensibly satisfactory choice is not optimal, considering that most retailers require their in-store pickup customers to wait in lengthy lines before being served and then wait again for their orders to be collected by a store associate, hence jeopardizing both consumers’ and employees’ health. In its current form, the ‘Click & Collect’ business model is, therefore, inadequate and should urgently be adjusted to minimize physical contact, with a view to preventing future government-enforced store closures and, eventually, guarding against a heavy downturn in retail sales and supporting the ailing economy during a potential recurring COVID-19 related crisis.

That is exactly where the proposed solution sets its vision, objectives and aspirations: it aims to examine the ‘Click & Collect’ business model under the lens of social distancing norms and contribute to its adaptation to the new reality, by introducing an innovative, completely contactless customer order delivery system. The proposed innovation is based on the well-known and rapidly expanding ‘curbside pickup’ model. According to the Managing Director of Cowen Inc., ‘curbside pickup’ in the USA will grow to become a $35 billion channel by 2020 [6], a statement that not only confirms its preeminence over traditional channels, even after the pandemic ends, but also alerts the retail market to the need for its universal adoption. By addressing the above-mentioned challenges, the proposed solution aims to achieve a set of ambitious, yet feasible, objectives. Specifically:

• Enable completely contactless order fulfillment and thus help vulnerable groups to safely accommodate their needs and turn reluctant potential customers into buyers.
• Support customers’ active participation in the inherent challenges of the last-mile delivery process, thus achieving a win-win equilibrium between reduced prices (no delivery cost), enhanced safety (contactless service) and increased customer service (no waiting in queues) on the one hand, and decreased customer-incurred costs of visiting the store (time and effort) on the other.
• Build on the perceived win-win situation described above: en-masse adoption of the proposed solution can potentially eliminate pertinent supply chain disruptions, especially during the pandemic.
• Contribute to the continuation and growth of economic activities, which are currently stifled by the regime of uncertainty caused by the imposed safety measures and customers’ reluctance to buy in store.
• Establish curbside pickup as one of the primary channels for purchasing goods, for both retailers and customers, even in the absence of the pandemic or another health-related crisis.
3 Background

The proposed solution puts special emphasis on bridging technological advance and market application, while attuned to balancing and maximizing economic and societal value, in the context of addressing the existing challenges and achieving the aforementioned objectives. To that end, it aims to support economic growth and provide safety during crises, such as the COVID-19 pandemic, but also offer a successful channel for picking up purchased products in other contexts as well. The proposed solution will introduce an innovative system based on two technologies, i.e. Augmented Reality (AR) and a Wi-Fi Positioning System (WFPS), which seem a perfect fit for conducting ‘Click & Collect’ fulfillment in a completely contactless way.
AR is a technology which essentially enables the addition of virtual elements to the physical world, with the use of wearable devices, thus providing its users with the ability to combine stimuli from their visual and/or auditory senses with digital perceptions [7]. The integration of such information in the real environment allows the coexistence of both physical and virtual elements in real-time settings and is considered to improve users’ perception of their interaction with the real world [8]. Mobile devices are commonly used in AR applications, especially for displaying contextual information in buildings or sites. A prerequisite of such displays is the localization of objects and of the device in the real world [9]. Literature findings show that various AR localization efforts have been carried out in both outdoor and indoor settings. On the one hand, the authors in [10] presented a self-contained localization device, which connects wirelessly to any AR device, with high geospatial accuracy, whereas the authors in [11] developed a handheld AR system, based on a tablet PC equipped with a camera, an orientation sensor and a real-time kinematic receiver, superimposing virtual objects on the real world with centimeter-level accuracy. On the other hand, the authors in [12] designed and implemented an AR system for indoor environments based on an external ultrasound localization system and inertial sensors embedded in the device, to estimate the position and orientation of the user. Wi-Fi-based schemes have started to be considered more appropriate on account of the omnipresence of Wi-Fi infrastructures. The authors in [13] proposed an indoor positioning system which achieves centimeter accuracy and maintains its performance under non-line-of-sight scenarios using a single pair of off-the-shelf Wi-Fi devices.

Nevertheless, an exhaustive review of the current ‘Click & Collect’ state of the art indicates that, to the best of our knowledge, similar efforts in the framework of ‘Click & Collect’ curbside pickup do not exist, which underlines the inherently innovative character of our proposal.
4 The Proposed Solution

The proposed solution is intended to be highly scalable, offering a generic solution for all available parking areas, which can be transformed quickly and at low cost into smart contactless order picking areas. It will provide an AR-enhanced mobile application (henceforth the ‘App’) and its cloud-based back-end. Together they enable accurate localization of the user, indoors or outdoors, through installed Wi-Fi networks, alert the retailer when a customer is parked inside the dedicated curbside pickup area, and guide the retailer’s employee to the customer’s car via the AR-enhanced User Interface (UI) of the application. The App offers two discrete UIs, i.e. a Customer UI and a Retailer UI. The retailer uses the Retailer UI to register a parking lot in the App and to landmark the area by registering the Wi-Fi routers demarcating the curbside parking area. A parking area delimited in the App can potentially be used by multiple retailers, each of whom can register multiple ‘Click and Collect’ areas to support their store retail network. At the same time, every smartphone user can register as a customer in the App and immediately gain access, with a unique ID, to all participating ‘Click and Collect’ areas.
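The "customer is parked inside the pickup area" alert can be illustrated with a simple geofence test. The sketch below is only one possible reading of the text: it assumes that the registered routers sit at the corners of the pickup area and that positions are expressed as (x, y) coordinates in metres in a local frame. The function name `inside_pickup_area` and the sample coordinates are hypothetical and do not come from the paper.

```python
def inside_pickup_area(point, corners):
    """Ray-casting point-in-polygon test.

    point   -- estimated (x, y) position of the customer's device
    corners -- (x, y) positions of the pickup-area corners, in order
               (assumed here to coincide with the registered routers)
    """
    x, y = point
    inside = False
    n = len(corners)
    for i in range(n):
        x1, y1 = corners[i]
        x2, y2 = corners[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical 20 m x 10 m pickup area.
AREA = [(0.0, 0.0), (20.0, 0.0), (20.0, 10.0), (0.0, 10.0)]
customer_has_arrived = inside_pickup_area((5.0, 5.0), AREA)
```

In a deployment along these lines, the back-end would presumably run such a check on each new position estimate and notify the Retailer UI on the first transition from outside to inside.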
Zero-Touch Customer Order Fulfillment
For localization purposes, location-specific Wi-Fi signal features, such as Received Signal Strength Indication (RSSI), Time Difference of Arrival (TDoA), Angle of Arrival (AoA) and Channel State Information (CSI), can be acquired in any environment and used to achieve high accuracy. In the proposed solution, an offline training phase (mapping) will first generate the fingerprints through a site survey, followed by an online positioning phase that matches the measurement point to the generated map. Wi-Fi fingerprinting requires a robust Received Signal Strength (RSS) database, against which the actual measurements of the device to be located are compared. RSS measurements in dBm are collected from a large number of locations distributed across the area of interest and stored alongside the corresponding MAC address and SSID. Then, in the localization phase, the location of the device is determined by comparing the device fingerprint to those in the database with a fingerprint matching algorithm, such as nearest neighbor or a support vector machine. According to the authors in [14], the time-reversal focusing effect can provide a high-resolution fingerprint for localization using the Wi-Fi protocol. To achieve this, a large bandwidth is created through diversity exploitation, so no extra devices are needed: the customer’s portable Wi-Fi-enabled device can be used to locate them across the parking area. The authors in [13] used Wi-Fi technology and achieved an accuracy of 1–2 cm, which exceeds the proposed system’s requirements. In this system, high-bandwidth Wi-Fi routers will be installed around the parking areas of interest; they do not add significant costs and can also be used as internet access points. Finally, the Wi-Fi positioning method is completely anonymous and will not cause GDPR implications during the system’s implementation.
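The two-phase fingerprinting scheme described above can be illustrated with a minimal nearest-neighbor sketch. It is not the project's localization solver: the fingerprint database, the MAC addresses, the RSS values and the `MISSING_RSS` floor are all hypothetical, and a real site survey would hold many more points.

```python
import math

# Hypothetical fingerprint database from the offline phase:
# surveyed (x, y) position in metres -> {router MAC address: RSS in dBm}.
FINGERPRINTS = {
    (0.0, 0.0): {"aa:aa": -40, "bb:bb": -70, "cc:cc": -75},
    (10.0, 0.0): {"aa:aa": -70, "bb:bb": -40, "cc:cc": -75},
    (0.0, 10.0): {"aa:aa": -70, "bb:bb": -75, "cc:cc": -40},
    (10.0, 10.0): {"aa:aa": -80, "bb:bb": -60, "cc:cc": -60},
}

MISSING_RSS = -100  # assumed value for routers that a scan did not hear


def rss_distance(a, b):
    """Euclidean distance between two RSS vectors keyed by MAC address."""
    macs = set(a) | set(b)
    return math.sqrt(
        sum((a.get(m, MISSING_RSS) - b.get(m, MISSING_RSS)) ** 2 for m in macs)
    )


def locate(scan, k=1):
    """Online phase: average the positions of the k closest fingerprints."""
    ranked = sorted(FINGERPRINTS, key=lambda pos: rss_distance(scan, FINGERPRINTS[pos]))
    nearest = ranked[:k]
    x = sum(p[0] for p in nearest) / len(nearest)
    y = sum(p[1] for p in nearest) / len(nearest)
    return (x, y)
```

With `k=1` the device is snapped to the closest surveyed point; a larger `k` interpolates between neighboring survey points, at the cost of blurring positions near the edge of the surveyed area.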
Augmented Reality will be used to provide quick visualizations of the localization results in the Retailer UI. AR is a good fit, as it removes the need for extra identification codes and searching, since all crucial information will be displayed graphically on the user’s screen. The proposed solution will thus offer retailers an easy-to-use application that guides the employee to the customer’s car, where he/she can safely place the purchased goods directly in the customer’s trunk in a completely contactless way. The App will combine the localization results with AR features to display, directly in the Retailer UI, a 2D map of the parking space that superimposes the positions of both the customer and the retailer’s employee and provides live updates on their relative distance. Moreover, if additional information is available (e.g. walkable paths), the fastest possible route will be recommended and drawn on the 2D map. The App can also be extended with audio navigation and a live timer widget informing the customer of the remaining time until delivery, i.e. until the employee reaches the spot where the customer is parked. The back-end of the application will be hosted on the App’s cloud server, which keeps capital expenses low through a pay-on-demand model. The App’s server will be responsible for user authentication, security, data storage, real-time data synchronization, capacity scaling, and push messages and notifications. It will also host the localization solver and implement the back-end logic of the system, using all the necessary APIs (Application Programming Interfaces) to connect with the retailers’ proprietary information systems.
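The relative distance and the recommended route on the 2D map could be computed along the following lines. This is a sketch under stated assumptions: the parking lot is modeled as a hypothetical grid of walkable and blocked cells, walking speed is uniform (so the shortest route is also the fastest), and breadth-first search stands in for whatever routing method the system eventually adopts.

```python
import math
from collections import deque

# Hypothetical map: '.' is walkable, '#' is blocked (e.g. an occupied bay).
LOT = [
    "....",
    ".##.",
    "....",
]


def straight_line_distance(a, b):
    """Relative distance between two (row, col) positions, in grid cells."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def shortest_route(grid, start, goal):
    """Breadth-first search over walkable cells.

    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    previous = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            route = []
            while cell is not None:  # walk back to the start
                route.append(cell)
                cell = previous[cell]
            return route[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == "." and (nr, nc) not in previous:
                previous[(nr, nc)] = cell
                queue.append((nr, nc))
    return None
```

Calling `shortest_route(LOT, employee_cell, customer_cell)` yields the cells to draw on the map, while `straight_line_distance` gives the live relative distance; both would be recomputed whenever the localization solver reports a new position.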
S. Ponis et al.
In the remainder of this section, we provide an initial use case scenario describing a walk-through of the business logic behind the aforementioned technology systems:
1. The customer places an order to the retailer’s e-shop. During the ordering process the customer provides his/her Customer ID and his/her preferred pickup spot, among the designated options. The order QR code is generated in the App.
2. The customer receives an e-mail from the retailer that his/her order is ready for pickup and a push notification from the App, with the order ID and its respective QR code.
3. The customer enters the Wi-Fi demarcated area and parks wherever he/she finds an open parking space. The localization solver determines the customer’s location (see Fig. 1a).
4. The system retrieves pending orders for the Customer’s ID, from the specific collection area. The retailer receives a push notification from the App with the order QR code.
5. The retailer acknowledges the customer’s arrival by pressing a touch button on his/her screen and he/she collects the order, beginning the delivery process. The timer widget is displayed on the customer’s smartphone indicating time left until the arrival of the employee delivering the order (see Fig. 1b).
6. The employee is guided to the customer (see Fig. 2a) through the AR visualization of the localization solver result in Step 3 (see Fig. 2b).
7. The employee verifies the correct order delivery by scanning the QR code on the customer’s phone and creating a match with the QR code sent by the App in Step 4. Then, he/she loads the order to the customer’s trunk.
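The walk-through above can be read as a small state machine over an order, ending with the QR match in Step 7. The sketch below is illustrative only: the class, the status names and the function names are hypothetical, and a random token stands in for the QR code content, which the paper does not specify.

```python
import hmac
import secrets
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    READY = "ready for pickup"              # Step 2
    CUSTOMER_ARRIVED = "customer arrived"   # Steps 3-4
    IN_DELIVERY = "in delivery"             # Steps 5-6
    DELIVERED = "delivered"                 # Step 7


@dataclass
class Order:
    order_id: str
    customer_id: str
    status: Status = Status.READY
    # Assumed stand-in for the payload encoded in the order QR code.
    qr_payload: str = field(default_factory=lambda: secrets.token_hex(8))


def _advance(order, expected, new):
    """Move the order to a new status, rejecting out-of-sequence steps."""
    if order.status is not expected:
        raise ValueError(
            f"order {order.order_id} is {order.status.name}, expected {expected.name}"
        )
    order.status = new


def customer_localized(order):
    """Steps 3-4: the solver has placed the customer inside the pickup area."""
    _advance(order, Status.READY, Status.CUSTOMER_ARRIVED)


def retailer_acknowledged(order):
    """Step 5: the retailer confirms the arrival and starts the delivery."""
    _advance(order, Status.CUSTOMER_ARRIVED, Status.IN_DELIVERY)


def verify_and_deliver(order, scanned_payload):
    """Step 7: deliver only if the scanned code matches the one on record."""
    if order.status is not Status.IN_DELIVERY:
        raise ValueError(f"order {order.order_id} is not in delivery")
    if not hmac.compare_digest(order.qr_payload, scanned_payload):
        return False
    order.status = Status.DELIVERED
    return True
```

Rejecting out-of-sequence transitions keeps a mismatched scan from closing the order: after a failed match the order stays in delivery and the scan can simply be repeated.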
Fig. 1. User experience from customer’s perspective: a) Customer’s localization inside the dedicated curbside pickup area; b) Customer UI
Fig. 2. User experience from retailer’s perspective: a) Retailer’s localization inside the dedicated curbside pickup area; b) Retailer UI
5 Research Methodology
The research methodology of the proposed project consists of four discrete steps.
Step 1 – Scientific and Technological State of the Art: This step will provide the necessary research background of the project. Initially, a thorough literature review of the COVID-19 impact on the supply chain, with emphasis on last-mile delivery services, will take place. Special focus will be placed on documenting the current state of the art of ‘Click & Collect’ and ‘Curbside Pickup’ models worldwide, including traditional business models, interesting case studies, best practices and failure cases, and on how these models could perform during a pandemic. Technologies supporting these business models will be studied, with an emphasis on wireless localization technologies and augmented reality, which are at the core of this project, thus providing the technological background for the next methodological steps.
Step 2 – Requirements Analysis: This step includes all the necessary research activities for determining the requirements for developing the proposed system. Based on best practices and case studies identified in the previous step, the system’s user groups and core use cases will be determined. Finally, the decisions on the logic of the back-end and the App, together with the engineering requirements for their development, will be documented.
Step 3 – System Design: This step includes all the necessary research activities for the design of the proposed system, starting with a market assessment of the hardware and software solutions available at the time of the study. Having assessed the technological tools available to the project team, the functional blocks and component diagrams of the proposed system will be prepared based on the results of Step 2. The localization algorithms will be implemented and tested in the lab. Physical and other constraints (e.g. installation points and yard layout) of the organizations where the pilot testing of the system will take place will be evaluated in the design of the system.
Step 4 – System Development and Pilot Testing: This step includes all the necessary work for developing the proposed system. Initially, the localization algorithms will be developed, integrated and tested for their accurate functionality in vitro (lab). Upon
successful testing, integration with AR enhancements will take place in parallel with preparations and planning for testing the integrated system in vivo (store location). All the use cases prepared in Step 2 will be tested for workflow and functionality approval. Successful system and integration testing, including the APIs of the system, will initiate a Quality Assurance (QA) cycle for bug fixing and fine-tuning. Once QA has finished and the system is ready for User Acceptance Testing (UAT), the planning and design of the pilot implementation in an actual store environment will commence. The roll-out of the system will initiate UAT, providing results for a final round of QA leading to the final version of the proposed system. The proposed methodology steps and the suggested research methods and tools used in each step are summarized in Fig. 3.
Fig. 3. The proposed research methodology.
6 Added Value and Impact
Although assessing the impact of a research project before its start is far from trivial and is inherently biased when it comes from the proposing team, we believe that the proposed system can have a strong socio-economic impact. It is fully aligned with the European Union’s new “Digital Strategy”, whose goals have been reshaped towards the use of disruptive digital technologies as tools for restoration and adaptation to the new reality dictated by the coronavirus pandemic. The proposed innovation is fit to address the new challenges and prepare us for the post-COVID-19 world by contributing to the emergence of a permanent curbside pickup service, actively promoting business continuity, ensuring customer safety and increasing resilience towards crises. With a capable curbside pickup model in place, the retail industry will be able to continue its commercial activities during the COVID-19 pandemic and improve its financial position. The proposed solution could positively affect pandemic economics and boost the sales of non-essential products, since government-enforced store closures could be avoided. Retailers lacking delivery services will be able to maintain their sales performance regardless of potential future social distancing measures. Indeed, the proposed system can be of great importance and create significant benefits for smaller
urban retailers that have limited designated customer parking areas or lack one completely, giving them the opportunity, alone or collectively under a cost-sharing agreement with other retailers, to operate a ‘Click and Collect’ area using the proposed system’s application. It is also expected to ease the pressure on courier services, since most of the last-mile process will be assumed by the customers, relieving businesses of significant scheduling and delivery effort. The benefits of this system will be even more significant in the case of Small and Medium-sized Enterprises (SMEs), which, across the Organization for Economic Co-operation and Development (OECD), account for between 50% and 60% [15] of the total value added. Preserving the livelihood of these companies during crises, such as the one caused by COVID-19, and supporting them with easy-to-use and affordable systems and applications can have a significant impact on the economy, which is currently struggling to deal with the financial aftershocks of the pandemic. This need is even more evident in the case of the retail industry, probably one of the industries hardest hit by COVID-19, which suffered an average decline of 18.2% globally in 2020, corresponding to approximately $4.5 trillion [16, 17].
7 Conclusion
To sum up, the innovation proposed in this paper aims to introduce a novel, entirely contactless customer order delivery system that adjusts the retail ‘Click and Collect’ business model to the COVID-19-stricken reality. It builds on the growing “curbside pickup” model and is enhanced with a Wi-Fi Positioning System and Augmented Reality features. It is designed to minimize physical contact and enhance safety, with a view to helping susceptible individuals cover their needs and motivating reluctant customers to make purchases. Moreover, by eliminating waiting in queues, delivery costs and the costs customers incur for store visits, and by alleviating the burden on courier services, the proposed system promotes business continuity, improves retailers’ financial position and encourages resilience towards crises. At the same time, retailers lacking delivery services will be capable of maintaining their sales performance irrespective of potential future government-enforced social distancing measures. The authors strongly believe that rapidly transforming traditional retail processes to conform to this new reality by leveraging the power of digital technologies can be a decisive mitigating factor against the financial aftershocks of the pandemic. The proposed solution turns this belief into research actions, using digital economy principles and Industry 4.0 technologies to support the retail industry in its uneven battle against the pandemic and to arm retailers with low-cost technology applications that are easy to implement and use, helping them overcome the unforeseen challenges of the new retail reality imposed by the COVID-19 pandemic.
Acknowledgments. The present work is co-funded by the European Union and Greek national funds through the Operational Program “Competitiveness, Entrepreneurship and Innovation” (EPAnEK), under the call “RESEARCH-CREATE-INNOVATE” (project code: T1EDK-01168 & acronym: SMARTFLEX Warehouse).
References
1. Gevaers, R., Van de Voorde, E., Vanelslander, T.: Cost modelling and simulation of last-mile characteristics in an innovative B2C supply chain environment with implications on urban areas and cities. Procedia Soc. Behav. Sci. 125, 398–411 (2014)
2. Business Insider. https://www.businessinsider.com/last-mile-delivery-shipping-explained. Accessed 13 Jan 2021
3. eMarketer. https://www.emarketer.com/content/global-ecommerce-update-2021. Accessed 13 Jan 2021
4. Inside Retail. https://insideretail.asia/2020/06/22/unprecedented-growth-ahead-for-e-commerce-and-contactless-payments/. Accessed 06 July 2020
5. Internet Retailing. https://internetretailing.net/covid-19/covid-19/surge-in-ecommerce-willoutlive-corona-across-europe-consumer-research-suggest-21231. Accessed 05 July 2020
6. Forbes. https://www.forbes.com/sites/pamdanziger/2019/04/07/walmart-is-in-the-lead-inthe-soon-to-be-35-billion-curbside-pickup-market/#69597c44199e. Accessed 07 June 2020
7. Plakas, G., Ponis, S.T., Agalianos, K., Aretoulaki, E., Gayialis, S.P.: Augmented reality in manufacturing and logistics: lessons learnt from a real-life industrial application. Procedia Manufact. 51, 1629–1635 (2020)
8. Ponis, S.T., Plakas, G., Agalianos, K., Aretoulaki, E., Gayialis, S.P., Andrianopoulos, A.: Augmented reality and gamification to increase productivity and job satisfaction in the warehouse of the future. Procedia Manufact. 51, 1621–1628 (2020)
9. Herbers, P., König, M.: Indoor localization for augmented reality devices using BIM, point clouds, and template matching. Appl. Sci. 9(20), 4260 (2019)
10. Stranner, M., Arth, C., Schmalstieg, D., Fleck, P.: A high-precision localization device for outdoor augmented reality. In: 2019 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pp. 37–41. IEEE (2019)
11. Schall, G., Zollmann, S., Reitmayr, G.: Smart Vidente: advances in mobile augmented reality for interactive visualization of underground infrastructure. Pers. Ubiquit. Comput. 17(7), 1533–1549 (2013)
12. Gómez, D., Tarrío, P., Li, J., Bernardos, A.M., Casar, J.R.: Indoor augmented reality based on ultrasound localization systems. In: Corchado, J.M., et al. (eds.) PAAMS 2013. CCIS, vol. 365, pp. 202–212. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38061-7_20
13. Wu, Z.H., Han, Y., Chen, Y., Liu, K.R.: A time-reversal paradigm for indoor positioning system. IEEE Trans. Veh. Technol. 64(4), 1331–1339 (2015)
14. Chen, C., Han, Y., Chen, Y., Liu, K.R.: Indoor GPS with centimeter accuracy using WiFi. In: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1–4. IEEE (2016)
15. OECD. http://www.oecd.org/industry/smes/SME-Outlook-Highlights-FINAL.pdf. Accessed 06 July 2020
16. Business Insider. https://www.businessinsider.com/worldwide-retail-sales-will-drop-due-topandemic-2020-6. Accessed 13 Jan 2021
17. https://www.wfmj.com/story/43158535/retail-global-market-report-2021-2030-by-the-business-research-company. Accessed 13 Jan 2021
VitrAI: Applying Explainable AI in the Real World
Marc Hanussek1(B), Falko Kötter2, Maximilien Kintz2, and Jens Drawehn2
1 University of Stuttgart IAT, Institute of Human Factors and Technology Management, Stuttgart, Germany
[email protected]
2 Fraunhofer IAO, Fraunhofer Institute for Industrial Engineering IAO, Stuttgart, Germany
https://www.iat.uni-stuttgart.de/en/
Abstract. With recent progress in the field of Explainable Artificial Intelligence (XAI) and increasing use in practice, the need for an evaluation of different XAI methods and their explanation quality in practical usage scenarios arises. For this purpose, we present VitrAI, which is a web-based service with the goal of uniformly demonstrating four different XAI algorithms in the context of three real-life scenarios and evaluating their performance and comprehensibility for humans. This work highlights practical obstacles to the use of XAI methods, and also shows that various XAI algorithms are only partially consistent with each other and unsystematically meet human expectations.
Keywords: Explainable Artificial Intelligence · XAI prototype · Evaluation of explanations
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 11–23, 2022. https://doi.org/10.1007/978-3-030-82196-8_2
1 Introduction
The successful adoption of Artificial Intelligence (AI) relies heavily on how well decision makers can understand and trust it [6]. In the near future, AI will make far reaching decisions about humans, for example in self-driving cars, for loan applications or in the criminal justice system. Examples like COMPAS, a racially biased algorithm for parole decisions show the need for reviewing AI decisions [7]. But complex AI models like Deep Neural Networks (DNNs) are hard to understand for humans and act as de-facto black boxes [16]. In comparison, simpler models like decision trees are more understandable for humans but lack prediction accuracy [9], though some argue this is not necessarily true [21]. The field of Explainable Artificial Intelligence aims to create more explainable models while still achieving high predictive power. Current discussions in research suggest that explainability will lead to increased trust, realistic expectations and fairer decisions [6]. Post-hoc explanation techniques work by creating a simplified, interpretable explanation model that approximates the behavior of the black box [24]. However, this simplification is not without risk, as by definition the precise workings
of the black box are not shown [24]. In contrast, interpretable models aim to make the actual model used for predictions more transparent [21]. Basic examples of interpretable models are decision trees and Bayesian rule lists [13]. Neural networks can be made interpretable by tracking the degree to which each part of the input data contributed to the decision [26]. Though the term model interpretability is widely used in the literature, it lacks an agreed-upon definition [14]. The same is true for the term explanation, in particular for what constitutes a valid explanation and how explanation quality can be measured. This stems from the difference between causality and correlation. In the aforementioned COMPAS example, it is unclear whether race is used as a decision criterion, yet it correlates strongly with the decision [7,21]. In another example, a model differentiated between pictures of huskies and wolves by the presence of snow, as wild animals tend to be photographed in the wilderness. To a human, this may be a bad explanation, though it correctly reveals the inner workings of the model [20]. Therefore, it is necessary to distinguish the quality of explanations according to the purposes pursued. For data scientists, explanation quality hinges on the degree to which the actual decision criteria are shown. A high-quality explanation of a wrong decision would immediately show the human that the solution is of low quality, ideally giving an indication of a systematic error (e.g. due to bias in the input data). For end users, explanation quality hinges on the degree to which an AI decision can be assessed based on the explanation. For example, if an AI identifies business cases from letters, the most relevant keywords could be highlighted so the human can decide whether the decision is correct or a manual reclassification is necessary. An explanation of the model’s workings is secondary in comparison to a decision basis for human validation.
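The keyword-highlighting idea for end users can be sketched with a simple leave-one-out (occlusion) scheme, in which a word's relevance is the drop in the predicted class score when that word is removed. This is a generic illustration rather than one of the XAI methods evaluated in this paper, and the keyword-weight "model", its labels and its weights are invented for the example.

```python
# Toy stand-in for a trained text classifier: per-class keyword weights.
# Labels and weights are hypothetical.
WEIGHTS = {
    "invoice": {"invoice": 2.0, "payment": 1.5, "due": 1.0},
    "complaint": {"late": 2.0, "unfriendly": 1.5, "refund": 1.0},
}


def score(words, label):
    """Class score of a tokenised text under the toy model."""
    return sum(WEIGHTS[label].get(w, 0.0) for w in words)


def classify(words):
    """Predicted label: the class with the highest score."""
    return max(WEIGHTS, key=lambda label: score(words, label))


def word_relevance(text):
    """Leave-one-out relevance of each word for the predicted class.

    Returns the predicted label and a list of (word, score drop) pairs.
    """
    words = text.lower().split()
    label = classify(words)
    base = score(words, label)
    relevance = []
    for i, word in enumerate(words):
        reduced = words[:i] + words[i + 1:]  # text without word i
        relevance.append((word, base - score(reduced, label)))
    return label, relevance
```

A user interface could then highlight the words with the largest drops, giving the end user a basis for accepting the label or reclassifying the letter manually.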
Ideally, AI reasons in the same way as humans, so that the explanation can be understood by humans. But this is not necessarily the case. There are many problems that are hard for AI and easy for humans (and vice versa). One example is general image classification, e.g. of animals. To humans, differentiating between cats and dogs is trivial, and an explanation provides little value to end users. Thus, evaluating explanations must also take into account tasks that are non-trivial for humans, for example identifying brands of vehicles. For the use of AI in practice, it is necessary to provide high-quality explanations for end users. Thus, developers of AI-based solutions need to evaluate explanation quality from an end-user point of view. Further research is necessary into what constitutes explanation quality in practical use and how that quality can be systematically evaluated and measured. In this work we present VitrAI, a platform for demonstrating and evaluating XAI. In Sect. 2, we investigate related work in the areas of (a) defining explainability and related terms and (b) benchmarking explainability. In Sect. 3, we outline the main sections of the platform and XAI methods. In Sect. 4, we present a preliminary evaluation of explanation quality for these tasks and describe how we plan to systematically evaluate explanation quality and compare human and AI
explanations. Section 5 summarizes the main insights, suggests further research questions and outlines our future work in the field.
2 Related Work
In [14], the author defines desiderata for interpretability: trust, causality, transferability, informativeness, and fair and ethical decision making. Interpretability can be achieved by transparency or post-hoc explanations. Model properties that confer transparency are simulability, decomposability and algorithmic transparency. Results may be explained post-hoc by examples, verbal reasoning, visualizations, or local explanations. The author stresses the importance of a specific definition and goal when measuring interpretability and cautions against limiting the predictive power of AI for the sake of interpretability. In [28], it is argued that interpretability is not an intrinsic property of a model, but a property relative to a specific audience: engineers, users or affectees, each of whom has different goals and requirements regarding interpretability, and interpretability needs to be measured with regard to these goals. In [18], the authors detail current efforts to measure intelligibility, for example introspective self-reports from people and questionnaires, and propose quantitative measurements for evaluating explanation quality, in particular continuity and selectivity. In [4], an explanation is defined in the context of XAI as “a way to verify the output decision made by an AI agent or algorithm”, which corresponds to the explanation quality for end users we have outlined. In addition, this work gives an overview of desired qualities promoted by explanations (trust, transparency, fairness) and explains different techniques for evaluating XAI algorithms. The authors of [28] outline possibilities for evaluating explainability, noting that human understanding of systems is implicit and that a benchmark would need to encompass enough experience for a human to build such an implicit mental model. They note, however, a lack of explicit measures of explainability.
Interpretability is different from explanation accuracy, as falsehoods such as simplifications can aid overall understanding. In [27], it is proposed to leverage existing work in the field of learning science by interpreting an AI’s user as a learner and an explanation as learning content. They compare pairs of explanations in an educational context with user trials and semi-structured interviews. Using interview transcripts, they compared the user experience of the explanations. As postulated in [28], it was shown that helpfulness of different explanations depends on users’ educational background. The authors of [8] investigate the capabilities of Deep Neural Networks to generalize by comparing DNN performance on distorted images with human performance. The results show that DNNs decline rapidly when images are distorted compared to humans, unless DNNs were trained for a specific kind of distortion. In [24], the authors investigate the reliability and robustness of the post-hoc explanation techniques LIME and SHAP. They devise a scaffolding technique
to attack these techniques, allowing them to hide biases of models and give arbitrary explanations. In [22], the trade-off between accuracy and interpretability in clustering is investigated. They define a measurement for the interpretability of a cluster, by finding the feature value that most of the cluster nodes share. Thus, interpretability is defined as measurement of similarity. With this measurement, the authors show experimentally how to trade-off interpretability and clustering value in a clustering. However, there aren’t any human trials to benchmark this definition of interpretability yet. In [11], the authors detail an automated benchmark for interpretability methods for neural networks using removal of input dimensions and subsequent retraining. While this approach can measure the correlation between explained and actual importance of features, it does not take into account human perception and biases. The authors of [17] propose a human-grounded benchmark for evaluating explanations. Human annotators create attention masks that are compared with saliency maps generated by XAI algorithms. A trial showed differences in human and AI explanations, for example a human focus on facial features when recognizing animals, while saliency maps had no focus on specific body parts. In addition, they showed human bias towards different explanation errors. In [23], trust in explanations is investigated. They show that AI outperforms humans at detecting altered, deceptive explanations. The authors of [20] evaluate LIME with both expert and non-expert users, making them choose between different models based on explanation quality and measuring understanding of the underlying model. These experiments show the usefulness of explanations for machine learning-related tasks and show first indications of aspects of explanation quality. In [12], the authors investigate the use of interpretability by data scientists. 
A survey showed that data scientists overestimate the quality of XAI methods and misinterpret the explanations provided. While this work does not focus on end users, it highlights the importance of clearly communicating the limits of XAI and investigating how explanations are used and interpreted by humans in real-life scenarios.
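The cluster interpretability measure summarized above for [22], i.e. finding the feature value that most cluster nodes share, can be made concrete with a short sketch. This is one possible reading of that description, not the cited authors' formulation; the function name and the normalization to a share between 0 and 1 are assumptions.

```python
from collections import Counter


def cluster_interpretability(feature_values):
    """Most common feature value in a cluster and the share of members having it.

    `feature_values` holds one value of a single categorical feature per
    cluster member.
    """
    if not feature_values:
        raise ValueError("cluster must not be empty")
    value, count = Counter(feature_values).most_common(1)[0]
    return value, count / len(feature_values)
```

Under this reading, a cluster in which every member shares one value scores 1.0 and is trivially described by that value, whereas a cluster with evenly spread values scores low.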
3 VitrAI Prototype for XAI
VitrAI is a web-based service with the goal of demonstrating XAI algorithms in the context of real-life scenarios and evaluating performance and comprehensibility of XAI methods by non-specialists. The platform consists of two core sections. The first one is the demo section, in which XAI explanations are exhibited in three scenarios. These are described in Sect. 3.2. The purpose of the demo section is to provide a user-friendly introduction to XAI, requiring no setup or data input. It is intended as an interactive demo for talks, exhibitions, and lectures.
The other one is the user-controlled section, where users can choose from several pre-trained machine learning models and provide their own input data, which is then explained by XAI methods. This section allows for deeper examination and experimentation with XAI methods, for example during user testing. For each section, different XAI methods are implemented; they are described in Sect. 3.3. VitrAI’s name is a compound of the Latin word vitrum (glass) and Artificial Intelligence.
3.1 Machine Learning Tasks and Data Types
Across the platform, a mix of general and domain-specific tasks and datasets is encountered. On the one hand, there is unstructured data like images and texts. On the other hand, classical tabular data is present. Each dataset belongs to a machine learning task. Presently, XAI is closely related to supervised learning tasks [19], hence all considered tasks are supervised in nature. The following machine learning tasks are considered: supervised text classification, supervised image classification and supervised binary classification of tabular data. For such tasks, many machine learning models have been trained and deployed in the last few years, in science as well as in the industry. Hence, these tasks are suited for the evaluation of XAI methods.
3.2 Main Sections
Demo Section. The foundation of VitrAI’s demo functionality consists of three real-life scenarios. The first one is called Public Transport and is about supervised text classification with five classes. It contains six user complaints about public transport in German language. These complaints are based on true events and revolve around buses leaving too late or too early, unfriendly bus drivers or service issues related to public transport. Each text or complaint is assigned a label, e.g. “ride not on time” or “wrong bus stop”, by a machine learning model. The used machine learning algorithm is a Convolutional Neural Network built and trained with spaCy (https://spacy.io/) and achieves an accuracy score of 92%. The second scenario is called Car Brands and is a supervised image classification task. This scenario deals with car images from various angles and distances. The seven images are gathered from the CompCars dataset (http://mmlab.ie.cuhk.edu.hk/datasets/comp_cars/index.html) and are assigned one of seven make labels, e.g. “Volkswagen” or “Skoda”. The machine learning model is a fine-tuned transfer learning model and achieves an accuracy score of 87%. The third setting, Weather Forecast, is a supervised binary classification task. It deals with five samples from the “Rain in Australia” dataset (https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) that displays 10 years of weather observations from locations across Australia. From a data focused point of view, this data is an example of tabular data, i.e. numerical
and categorical features like wind direction, wind speed in the afternoon or atmospheric pressure in the morning of the day before. The machine learning model predicts rainfall for the next day in a binary fashion (“Yes” or “No”). We used a tree-based model trained with scikit-learn4 that shows an accuracy score of 84%. Next to the three scenarios there is a Dataset Information functionality that, based on Pandas Profiling5 , gathers dataset facts and statistics like variable types, warnings about missing features, distributions of features and correlation matrices. With this built-in functionality, users can get an overview of tabular data before diving deeper by using XAI algorithms. Even with well-functioning XAI methods a sound understanding of the underlying data is obligatory which is why we decided to include the Dataset Information functionality. Note that in the demo section, the training of custom machine learning models is not possible. Instead, several pre-trained models exist for each scenario. These existing models were trained with the goal of achieving a decent accuracy score in order to subsequently obtain reasonable explanation quality. At the same time, the models should not be overly complex, so that an average specialist can build them. For example, in the text scenario the machine learning model was built and trained with spaCy in a straightforward manner and achieves an accuracy score of 92%. These requirements make for a setting in which XAI methods are used in practice and can therefore be realistically evaluated. An overview of the demo section is depicted in Fig. 1.
Fig. 1. Overview of the demo section with selected Car Brands scenario.
⁴ https://scikit-learn.org/stable/
⁵ https://github.com/pandas-profiling/pandas-profiling
VitrAI: Applying Explainable AI in the Real World
User-Controlled Section. In contrast to the demo section, in the user-controlled section users can create custom machine learning models and obtain predictions with subsequent explanations for samples they provide themselves. Regarding training, only text models can be created. For this purpose, an earlier software development project at Fraunhofer IAO is used, in which a custom text classification model can be trained on a user-defined dataset with common machine learning libraries. Concerning prediction, users can select pre-trained machine learning models (and thus a task and data type). In the case of a text classification model, the user can enter a text sample in the user interface (see Fig. 2). Similarly, in the case of an image classification model, a sample image can be uploaded, and when choosing a binary classification model, a single-line CSV file containing the features as columns should be uploaded.
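As an illustration of the tabular upload format, the following minimal sketch parses such a file with Python's standard csv module. It assumes that the file consists of a header row with the feature names followed by exactly one data row; this layout, the function name and the feature names are assumptions made for illustration and are not taken from VitrAI.

```python
import csv
import io


def read_single_sample(csv_text):
    """Parse CSV text with a header row and exactly one data row into a dict."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) != 1:
        raise ValueError(f"expected exactly one sample, found {len(rows)}")
    return rows[0]


# Hypothetical weather features; the real column names depend on the trained model.
example = "WindGustDir,WindSpeed3pm,Pressure9am\nNW,24,1007.7\n"
sample = read_single_sample(example)
print(sample)
```

All values are returned as strings, so a real pipeline would still have to convert numerical features before passing them to a model.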
Fig. 2. Users first choose from a list of pre-trained models and, in the case of a text classification model, can provide a text sample which will be classified and explained later.
3.3 Supported XAI Methods
In total, the following four XAI methods are implemented: Layer-Wise Relevance Propagation (LRP) [3], LIME [20], SHAP [15], and scikit-learn’s permutation importance⁶. Each scenario features at least two different XAI approaches. Table 1 shows the implemented XAI approaches for each scenario.
Table 1. Considered scenarios with implemented XAI methods. Columns: LRP, LIME, SHAP, Permutation importances. Rows: Public transport, Car brands, Weather forecast.
⁶ https://scikit-learn.org/stable/modules/permutation_importance.html
The XAI algorithms used are well-studied and established in the field of XAI [25]. With permutation importance, VitrAI supports not only instance-based explanations but also an explanation approach that relates to whole machine learning models. Permutation importance admittedly provides a rather rudimentary form of model insight, but due to its low complexity and ease of use, it can readily be used by practitioners. Figures 3, 4, 5 and 6 show the XAI algorithms used in the Text and Image Classification scenarios.
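To make the idea behind permutation importance concrete, the following minimal sketch computes it in plain Python: each feature column is shuffled in turn, and the resulting drop in accuracy is averaged over several repetitions. This is an illustrative re-implementation of the general technique, not scikit-learn's implementation and not VitrAI's code; the toy model, data and function names are hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of samples for which model(row) equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the dataset with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy model: predicts rain when feature 0 exceeds 60
# and ignores feature 1 entirely.
def toy_model(row):
    return row[0] > 60


X = [[30, 5], [45, 9], [55, 2], [65, 7], [80, 1], [90, 4]]
y = [False, False, False, True, True, True]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

A feature the model never reads receives an importance of zero, whereas shuffling a feature the model relies on lowers the accuracy. The result describes the model as a whole rather than a single prediction, which is the distinction from instance-based methods drawn above.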
Fig. 3. SHAP explanation for the true prediction (nearly 100% class probability) of the car make BMW.
3.4 Architecture and Technologies Used
VitrAI is built according to the microservice pattern. This means that functionally independent components are encapsulated as stand-alone applications in Docker⁷ containers. The containers are orchestrated with Docker Compose. The core functionality of VitrAI is in the modules, which are divided according to the type of data processed. For the frontend, Angular⁸ is used in order to apply the component libraries Nebular⁹ and PrimeNG¹⁰. For the backend, technologies used are Django¹¹, Flask¹², CouchDB¹³ and the machine learning libraries TensorFlow¹⁴, Keras¹⁵, scikit-learn and spaCy.
⁷ https://www.docker.com/
⁸ https://angular.io/
⁹ https://akveo.github.io/nebular/
¹⁰ https://www.primefaces.org/primeng/
¹¹ https://www.djangoproject.com/
¹² https://flask.palletsprojects.com/en/1.1.x/
¹³ https://couchdb.apache.org/
¹⁴ https://www.tensorflow.org/
¹⁵ https://keras.io/
Fig. 4. LIME explanation for the true prediction (nearly 100% class probability) of the car make BMW.
4 Preliminary Findings
To begin with, we experienced challenges regarding uniform implementation and representation of different XAI methods. Although there are frameworks that orchestrate different methods, they either lack XAI approaches, present explanations heterogeneously or show dependency shortcomings (for example, iNNvestigate [1] and AI Explainability 360 [2] lack support of TensorFlow 2.0/tf.keras models). The last issue also applies to different XAI implementations and overall results in challenging usability (e.g. existing machine learning models may need to be retrained for compatibility reasons). We preliminarily evaluated the image and text classification tasks. In the Car Brands scenario, LIME highlights background regions and therefore irrelevant regions in four out of seven samples (see Fig. 4 for an example). This applies to positive as well as negative influences. On the other hand, in every sample at least parts of the highlighted regions are plausible, e.g. the brand logo or characteristic grilles are marked as positive influences (see part of the grille in Fig. 4). SHAP highlights incomprehensible image regions in every sample (see again Fig. 3). In five out of seven images, at least parts of the proclaimed influences are plausible. In summary, SHAP explanations are incomprehensible more frequently than LIME’s, and overall explanation quality is in need of improvement. Also, explanations match human explanations at most partly (among others, humans probably would mark both grilles in Fig. 4 as positive influences).
Fig. 5. LIME explanation for the prediction of a complaint type.
Fig. 6. LRP explanation for the prediction of a complaint type.
In the Public Transport scenario, LRP produced helpful explanations in three out of six cases. A positive example can be seen in Fig. 5, where “minutes”, “4” and “late” are assigned positive influences for the sample’s association with the class “ride not on time”. Less plausible is LRP’s marking of the word “stop” as a negative influence when dealing with the class “stop missed by bus”. LIME shows slightly better explanations inasmuch as we find four out of six explanations to be acceptable. In four out of six cases, we assess the explanations of both algorithms as similar, while at least partial conformity with human explanation is given in three out of six samples.
Summing up for all three cases, different XAI algorithms show only partial agreement among themselves and match human expectations unsystematically. For the unsatisfactory explanations, it is not clear whether they are due to poor performance of the machine learning model, poor performance of the XAI model, or some other reason.
5 Conclusion and Outlook
In this paper we presented VitrAI, a platform for demonstrating and evaluating XAI. We have outlined the main modules and XAI approaches used and have given preliminary evaluation results, which show a need for further improvement and investigation. In the future, we intend to add more XAI approaches, e.g. CEM Explainer [5], Lucid¹⁶ or ProtoDash [10]. Since different methods pursue diverse explanation strategies (post-hoc/ante-hoc explanations, local/global direct explanations), a thorough evaluation requires coverage of the different approaches that is as complete as possible. More importantly, we will use VitrAI to further research in the area of explanation quality as it regards end users. This evaluation shall be conducted by means of the three aforementioned scenarios with the involvement of humans. The focus will be on the following research questions:
1. How do non-specialists assess the comprehensibility of different XAI explanations?
2. What constitutes explanation quality for end users? Can it be measured, compared and quantified?
3. To what extent do XAI explanations conform to human explanations? What are possible reasons for discrepancies?
4. How can explanation quality in XAI be improved for end users?
An experimental evaluation with human involvement promises a better understanding of current practical shortcomings of XAI approaches and can therefore initiate future research leading to increased acceptance of XAI explanations in practice.
Acknowledgment. This work was conducted together with students from the University of Stuttgart.
References
1. Alber, M., et al.: iNNvestigate neural networks! J. Mach. Learn. Res. 20(93), 1–8 (2019) 2. Arya, V., et al.: One explanation does not fit all: a toolkit and taxonomy of AI explainability techniques (2019)
¹⁶ https://github.com/tensorflow/lucid
3. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10, e0130140 (2015) 4. Das, A., Rad, P.: Opportunities and challenges in explainable artificial intelligence (XAI): a survey. arXiv e-prints arXiv:2006.11371, June 2020 5. Dhurandhar, A., et al.: Explanations based on the missing: towards contrastive explanations with pertinent negatives. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 2018, pp. 590–601. Curran Associates Inc., Red Hook (2018) 6. Doshi-Velez, F., Kim, B.: Towards a rigorous science of interpretable machine learning (2017) 7. Dressel, J., Farid, H.: The accuracy, fairness, and limits of predicting recidivism. Sci. Adv. 4(1), eaao5580 (2018) 8. Geirhos, R., Temme, C.R.M., Rauber, J., Schütt, H.H., Bethge, M., Wichmann, F.A.: Generalisation in humans and deep neural networks. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 2018, pp. 7549–7561. Curran Associates Inc., Red Hook (2018) 9. Gunning, D.: Explainable artificial intelligence (XAI) (2017) 10. Gurumoorthy, K.S., Dhurandhar, A., Cecchi, G.A., Aggarwal, C.C.: Efficient data representation by selecting prototypes with importance weights. In: Wang, J., Shim, K., Wu, X. (eds.) 2019 IEEE International Conference on Data Mining, ICDM 2019, Beijing, China, 8–11 November 2019, pp. 260–269. IEEE (2019) 11. Hooker, S., Erhan, D., Kindermans, P.J., Kim, B.: A benchmark for interpretability methods in deep neural networks. In: Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F.D., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 9737–9748. Curran Associates Inc. (2019) 12.
Kaur, H., Nori, H., Jenkins, S., Caruana, R., Wallach, H., Wortman Vaughan, J.: Interpreting interpretability: understanding data scientists’ use of interpretability tools for machine learning, pp. 1–14. Association for Computing Machinery, New York (2020) 13. Letham, B., Rudin, C., McCormick, T.H., Madigan, D., et al.: Interpretable classifiers using rules and Bayesian analysis: building a better stroke prediction model. Ann. Appl. Stat. 9(3), 1350–1371 (2015) 14. Lipton, Z.C.: The mythos of model interpretability. Queue 16(3), 31–57 (2018) 15. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30, pp. 4765–4774. Curran Associates Inc. (2017) 16. Mittelstadt, B., Russell, C., Wachter, S.: Explaining explanations in AI. FAT* 2019, pp. 279–288. Association for Computing Machinery, New York (2019) 17. Mohseni, S., Block, J.E., Ragan, E.D.: A human-grounded evaluation benchmark for local explanations of machine learning. arXiv e-prints arXiv:1801.05075, January 2018 18. Montavon, G., Samek, W., Muller, K.-R.: Methods for interpreting and understanding deep neural networks. Digital Sig. Process. 73, 1–15 (2018) 19. Morichetta, A., Casas, P., Mellia, M.: Explain-it: towards explainable AI for unsupervised network traffic analysis. In: Proceedings of the 3rd ACM CoNEXT Workshop on Big DAta, Machine Learning and Artificial Intelligence for Data Communication Networks, Big-DAMA 2019, pp. 22–28. Association for Computing Machinery, New York (2019)
20. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should i trust you?”: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016, pp. 1135–1144. Association for Computing Machinery, New York (2016) 21. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1(5), 206–215 (2019) 22. Saisubramanian, S., Galhotra, S., Zilberstein, S.: Balancing the tradeoff between clustering value and interpretability. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES 2020, pp. 351–357. Association for Computing Machinery, New York, New York (2020) 23. Schneider, J., Handali, J., Vlachos, M., Meske, C.: Deceptive AI explanations: creation and detection. arXiv e-prints arXiv:2001.07641, January 2020 24. Slack, D., Hilgard, S., Jia, E., Singh, S., Lakkaraju, H.: Fooling LIME and SHAP: adversarial attacks on post hoc explanation methods. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES 2020, pp. 180–186. Association for Computing Machinery, New York (2020) 25. Tjoa, E., Guan, C.: A survey on explainable artificial intelligence (XAI): towards medical XAI. IEEE Trans. Neural Netw. Learn. Syst. (2020) 26. Zhang, Q., Wu, Y., Zhu, S.: Interpretable convolutional neural networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8827–8836. IEEE Computer Society, Los Alamitos, June 2018 27. Zhou, T., Sheng, H., Howley, I.: Assessing post-hoc explainability of the BKT algorithm. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES 2020, pp. 407–413. Association for Computing Machinery, New York (2020) 28. Zhou, Y., Danks, D.: Different “intelligibility” for different folks. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES 2020, pp. 194–199. 
Association for Computing Machinery, New York (2020)
Contactless Interface for Navigation in Medical Imaging Systems
Martin Žagar1(B), Ivica Klapan2,3,4,5, Alan Mutka1, and Zlatko Majhen6
1 RIT Croatia, Don Frana Bulića 16, 20000 Dubrovnik, EU, Croatia
[email protected]
2 Ivica Klapan, Klapan Medical Group Polyclinic, Ilica 191A, 10000 Zagreb, EU, Croatia
3 School of Medicine, Josip Juraj Strossmayer University of Osijek, Trg Svetog Trojstva 3, 31000 Osijek, EU, Croatia
4 School of Dental Medicine and Health, Josip Juraj Strossmayer University of Osijek, Trg Svetog Trojstva 3, 31000 Osijek, EU, Croatia
5 School of Medicine, University of Zagreb, Šalata 2, 10000 Zagreb, EU, Croatia
6 Bitmedix, Kuševačka ulica 86, 10000 Zagreb, EU, Croatia
Abstract. Medical informatics for surgical planning has been changing rapidly with new systems and approaches, one of the latest being a contactless approach to the visualization systems used during surgeries. This approach combines contactless control with algorithms adopted from augmented and virtual reality, giving surgeons in the operating room more freedom through touchless human-computer interaction when analyzing and visualizing complex medical data. We propose a novel contactless plug-in interface for the DICOM viewer platform that uses a camera controller to track hand/finger motions, with no hand contact, touching, or voice navigation. Our proposal addresses several issues we faced during the proof of concept of our previous contactless surgery approach, which used the Leap Motion as a tracking camera, both with software and hardware for spatial interaction. In this paper, we focus on improving the user experience in 3D virtually rendered space and propose a solution that could be used as a benchmark for 3D virtual navigation, integrating a high-resolution stereo depth camera with medical imaging systems in order to obtain contactless ‘in the air’ control of medical imaging systems with the surgeon’s hands.
Keywords: 3D virtual navigation · Contactless control of medical imaging systems · Motion tracking
1 Introduction
3D imaging technology in the operating room is not new and has been used for decades. For example, the first application of a stereoscopic camera in laparoscopic gynecology dates back to 1993 [1, 2]. 3D imaging technology provides surgeons with additional information about the spatial depth and offers an improved and more vivid view of anatomical structures, especially the finer ones, which significantly raises the level of safety and simplifies surgical procedures such as suturing [3].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 24–33, 2022. https://doi.org/10.1007/978-3-030-82196-8_3
As a result of a clearer spatial representation, the performance of laparoscopy increasingly resembles operative conditions in an open procedure. In addition to improving the precision of the procedure, this more realistic anatomical view and the segmentation options also reduce the surgeon’s level of stress [4]. With segmentation algorithms, we try to assign each pixel (or voxel in the 3D case) to specific tissue types and classes. The main aim of the segmentation of medical images is to define the Region of Interest (ROI) that will be used in surgical procedures. Following the segmentation, we perform data analysis in two steps [5]. In the first step, we identify the features of the 2D or 3D data distribution of joint feature vectors, depending on the segmentation classes. The second step, which involves feature vectors as part of the learning algorithm, requires both the feature vector itself and a target function defining its interpretation with regard to segmentation class membership. Both of these steps are candidates to be performed in a contactless way.
Alignment of the data sets acquired within different image acquisition procedures is important in the context of pixel or voxel segmentation of medical data, as is the acquisition procedure itself, in our case by stabilizing the surgeon’s hand and fingers for controlling and applying a constant field of view in different medical dataset sequences while controlling in a contactless way. For contactless control of the visualization systems used in surgeries, our innovative approach considers additional inputs to blend interactive digital elements, such as selecting regions of interest, motion capture through different sensors (such as stereo-depth cameras), and other sensory projections, into real-world environments where VR tools and environments are extended to AR overlapping in real time.
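The per-pixel class assignment and the ROI definition mentioned above can be illustrated with a deliberately simple sketch. It uses plain intensity thresholding, which is swapped in here only as the simplest possible stand-in and is not the segmentation algorithm of the authors; the image values, thresholds and function names are hypothetical.

```python
def segment_by_thresholds(image, thresholds):
    """Label each pixel with the number of thresholds its intensity reaches."""
    return [[sum(value >= t for t in thresholds) for value in row] for row in image]


def bounding_roi(labels, target):
    """Smallest (row_min, row_max, col_min, col_max) box around class `target`."""
    hits = [
        (r, c)
        for r, row in enumerate(labels)
        for c, label in enumerate(row)
        if label == target
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return min(rows), max(rows), min(cols), max(cols)


# Hypothetical 4x4 intensity slice; two thresholds give classes 0, 1 and 2.
slice_2d = [
    [10, 12, 11, 9],
    [10, 80, 85, 11],
    [12, 82, 200, 10],
    [9, 11, 10, 12],
]
labels = segment_by_thresholds(slice_2d, thresholds=[50, 150])
roi = bounding_roi(labels, target=1)
print(labels)
print(roi)
```

For volumetric data the same idea applies per voxel, with one more level of nesting and a bounding box in three dimensions.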
With the ongoing research presented in this paper, we are trying to resolve several issues we faced during the proof of concept of contactless surgery in our previous approach, which used the Leap Motion as a tracking camera, both with SW and HW for spatial interaction, and to scale it up to larger distances. This solution will provide medical image visualization in all medical surroundings (not just in the operating room), so we will adapt our system to support visualization of medical data in any telemedicine solution. The predicted global telemedicine market size for 2020 is 58.3 billion $ [6], with steady growth in the digital health market [7], and telemedicine is one of the top-funded health IT technologies [8]. The paper is divided into several sections: in Sect. 2 we describe the methods we used in defining and designing our system, with details about motion tracking and the parameters we considered; the gesture recognition framework and API, together with the results, are described in Sect. 3; and we provide conclusions and future work in Sect. 4.
2 Materials and Methods
In our previous research [5, 9, 10] we demonstrated how to design augmented reality combined with the use of 3D medical images and 4D medical videos, together with touchless navigation in space through a remote controller, which depended on several relevant factors, such as the position of the controller, the position of the monitor and image quality. We also developed our plug-in application for the OsiriX platform,
M. Žagar et al.
enabling users to use the Leap Motion sensor as an interface for camera positioning in 3D Virtual Endoscopy views. In this paper, we propose a change in the type and the features of the controller to enable higher accuracy of the system, truly remote operation at up to 2 m from the visualization systems, and increased operating safety. An important issue we considered when designing our approach was the fact that in the data acquired before surgery and in surgery preparation, there might be such a thing as too much information; an overreliance on augmented reality could mean that people miss out on what is right in front of them. In our case of contactless surgery, visualization tools must enable interactive visualization in real time. The response time to an interaction (a motion captured by the depth camera) is sufficiently short that displaying the data can be immersive, and medical specialists can enter “into” the data, manipulate and analyze the displayed information, and take up any viewpoint. This enables the viewing of dynamic functional processes as well as of anatomical details, and on-line measurements in virtual reality. In contactless sinus surgery, it is very important to obtain a display of high quality and dimensionality and interactive access to the data represented, such as quantitative information about the properties of tissues; in our innovative approach, these functions can be operated in a contactless way by motion tracking and voice commands.
2.1 Motion Tracking in Contactless Surgery
The basis of contactless data management of the visualization system is the capturing system that tracks hands and fingers. For this reason, we use a stereo-depth camera that enables Full HD capture remotely at distances ranging from 0.5 to 2.5 m, with a wide field of view in all three dimensions. The stereo depth module has two identical camera sensors (imagers) that are configured with identical settings.
The infrared projector improves the ability of the stereo camera system to determine depth. The captured color sensor data is sent to a discrete image signal processor for image adjustments, image scaling, and other functions that compensate for inherent inaccuracies in the lens and sensor and thus provide better image quality. We have developed a system that is operation-agnostic, with an independent contactless interface as a plug-in application for the DICOM-viewer platform, using a stereo-depth camera controller that supports hand/finger motions as input, with no hand contact. Motion tracking of the hand and fingers enables more precise virtual movement, rotation, cutting, spatial locking, and measuring. For depth and motion tracking, we use a camera that has active stereo depth resolution with precise shutter sensors for depth streaming over our predicted range of up to 2.5 m, which is important for remote control in the operating room and gives the surgeon a sense of freedom during the surgery, providing the most immersive experience. With this system we offer an alternative to closed SW systems. We found that, with a camera of higher resolution, it is possible to significantly simplify movement gestures in the virtual space of Virtual Endoscopy compared with the initial design. This approach would ensure at least doubled precision (compared to Leap Motion) [11], while moving through the Virtual Endoscopy space at least 30% faster, compared with human assistance during the surgery.
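How two identical imagers yield depth can be sketched with the standard pinhole stereo relation Z = f · B / d, where f is the focal length in pixels, B is the baseline between the two imagers and d is the disparity of a point between the left and right images. The sketch below is a textbook illustration under this idealized model; the numerical values are hypothetical and are not the specifications of the camera used in this work.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in metres of a point with the given disparity (pinhole stereo model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


# Illustrative values only, not the specifications of any particular camera.
focal_length_px = 900.0
baseline_m = 0.05
for disparity in (90.0, 45.0, 18.0):
    depth = depth_from_disparity(disparity, focal_length_px, baseline_m)
    print(f"disparity {disparity:5.1f} px -> depth {depth:.2f} m")
```

Because depth is inversely proportional to disparity, a fixed disparity error of a fraction of a pixel matters more at larger distances, which is one reason why sensor resolution and a projected infrared pattern help at the far end of the working range.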
2.2 Parameters for ‘In the Air’ Commands
Interaction with DICOM data and 3D volume rendering in the operating room, i.e. in a sterile environment, is a challenging task. The surgeon cannot leave their place to use the mouse, joystick, or keyboard and to look at the ROI and anatomical structures on the monitor, but delegates this to the administrative and support staff, who present it on the display, which leads to interruption and distraction from the course of the surgery. The new modality of working with the surgeon’s ‘in the air’ commands, or “touchless human-computer interaction”, aims to overcome these limitations and replace the classic two-dimensional interfaces based on Windows/Icons/Menu/Pointers (WIMP) with adequate natural gestures. The hand tracking consists of two basic algorithms:
• Hand skeleton tracking
• Gesture tracking
Fig. 1. Hand segmentation on 22 hand joints for the skeletal tracking algorithm
The hand skeletal tracking algorithm uses 2D/3D scene data to segment the hand from the environment and extracts the position and orientation of the 22 hand joints presented in Fig. 1. The algorithm often detects the 22 joints on both hands at the same time, which allows two-handed interactions. The obtained joint positions are used in the gesture tracking algorithms. Motions can be static, as in a pose (e.g., palm), or dynamic, such as a movement between poses. In the latter case we describe the movement as a gesture (a typical example is the wave gesture). It is also important to mention that robust real-time hand or finger tracking is a complex task in machine learning and computer vision. Gesture recognition methodologies are usually divided into two categories: static or dynamic [12]. Static gestures only require the processing of a single image at the
classifier; the advantage of this approach is the lower computational cost. Dynamic gestures require the processing of image sequences and more complex gesture recognition approaches. There are many challenges associated with the accuracy and usefulness of user gesture algorithms, such as image noise, camera calibration parameters, background noise, occlusion, distance from the camera, camera resolution, etc. The social aspect also has a significant influence on gesture usability in different cultures. In calibrating the system, the first step is to define gestures that will be used as control inputs and to train the surgeon (user) in order to customize the quality of depth settings that will be used in the contactless surgery. Figure 2 and Fig. 3 represent basic actions that can be used in our application:
Fig. 2. Actions Implemented in our application: on the left-hand side engage; in the middle mouse move, scroll; on the right-hand side left mouse click, grab and release
Fig. 3. Actions Implemented in our application: on the left-hand side the neutral position; on the right-hand side push to select, mouse click
The poses and gestures from Fig. 2 and Fig. 3 are often used together to perform a particular action. Below are descriptions of the actions:
• Activate: hold the palm towards the camera in a natural pose at a predefined distance for 5 s. At the beginning, we need to inform the software that we want to start interaction/control. This gesture triggers the software to start listening.
• Mouse move, scroll: after the system is engaged and the hand is fully open (Big 5), movement of the hand moves the mouse cursor.
• Left mouse click, grab (left button pushed trigger), and release (left button pushed trigger): if the system is engaged, the change from Big 5 to the pinch gesture means grabbing an object (Left Mouse Button Down event). Separating the thumb and index finger again means releasing the object (Left Mouse Button Up event). In our application, we use this action, for example, to rotate a model in a 3D scene.
• The neutral position: if we are positioned at a particular place on the screen and our hand is already extended so that we cannot move it further, closing the fist means that the cursor no longer tracks our hand. We can return the hand to a more comfortable central position, open it, and continue the interaction.
• Push to select, mouse click: 1) wake the system from the idle mode; 2) wave with the palm (Big 5) parallel to the camera to get the cursor over the desired item; 3) push towards the screen to select the item (Left Mouse Button Click event).
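The grab, release and neutral-position actions described above can be sketched as a small state machine over per-frame poses. The sketch assumes that an upstream classifier already labels each frame as "BIG5", "PINCH" or "FIST" and supplies a 2D hand position; these labels, the pinch threshold, the event names and all function names are hypothetical and do not reproduce the authors' implementation.

```python
import math

PINCH_THRESHOLD_M = 0.02  # hypothetical: fingertips closer than 2 cm count as a pinch


def is_pinch(thumb_tip, index_tip, threshold=PINCH_THRESHOLD_M):
    """True when the thumb and index fingertips (3D points in metres) nearly touch."""
    return math.dist(thumb_tip, index_tip) < threshold


def run_pointer(frames, cursor=(0, 0)):
    """Turn (pose, hand_xy) frames into mouse-style events with relative movement."""
    tracking_poses = ("BIG5", "PINCH")
    events = []
    prev_pose, prev_hand = None, None
    x, y = cursor
    for pose, hand in frames:
        # The cursor follows the hand only while it is open or pinching.
        if pose in tracking_poses and prev_pose in tracking_poses:
            x += hand[0] - prev_hand[0]
            y += hand[1] - prev_hand[1]
            if hand != prev_hand:
                events.append(("MOVE", (x, y)))
        if pose == "PINCH" and prev_pose == "BIG5":
            events.append(("LEFT_DOWN", (x, y)))  # grab
        elif pose == "BIG5" and prev_pose == "PINCH":
            events.append(("LEFT_UP", (x, y)))    # release
        # A fist is the neutral position: nothing is emitted and the cursor stays put.
        prev_pose, prev_hand = pose, hand
    return events, (x, y)


frames = [
    ("BIG5", (0, 0)), ("BIG5", (10, 0)),    # move the cursor
    ("PINCH", (10, 0)), ("PINCH", (15, 5)), # grab and drag
    ("BIG5", (15, 5)),                      # release
    ("FIST", (15, 5)), ("FIST", (0, 0)),    # reposition the hand without moving the cursor
    ("BIG5", (0, 0)), ("BIG5", (5, 0)),     # continue from the old cursor position
]
events, final_cursor = run_pointer(frames, cursor=(100, 100))
print(events)
print(final_cursor)
```

Using relative movement is what makes the neutral position work in this sketch: while the fist is closed the hand can travel back to a comfortable spot, and the cursor resumes from where it was left.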
3 Results and Discussion
Our main result is the design and production of the Hand and Gesture Module API. The functionality of this API is shown in Figs. 4 and 5. The system’s input is a 2D RGB image and a 3D point cloud (XYZ cloud points) provided by an Intel RealSense D415 camera. The data are processed by the Hand Tracking module, which generates the hand’s center in 2D image pixel data and in 3D world data. Once the hand position is detected, the Gesture Recognition module estimates the hand’s gesture state. The information generated within the Hand and Gesture module is stored in the Hand and Gesture Interface Class. This class interface represents a package that is sent to different interactors. Within this paper, we describe two primary interactors. The first, the System Mouse interactor, provides basic interaction with the mouse (mouse movement, grabbing, and selecting). The second, the DICOM Viewer interactor, provides “in the air” control of the DICOM Viewer application (2D and 3D scene manipulation, changing DICOM parameters, 2D and 3D measuring, etc.). The Hand and Gesture Module consists of two basic tasks:
• Hand tracking – provides the hand’s 2D image pixel coordinates and 3D world coordinates
• Gesture recognition – recognizes gesture states and hand gestures.
Within each detection frame, the Hand and Gesture module algorithm generates data stored in the Hand and Gesture Interface Class package. This package is transferred (through a memory reference, serialized over a socket, or via a Web API) to the interactor responsible for converting its data to a mouse motion or program commands. The package is implemented as a C++ header file and contains the following data:
• Gesture State – NO HAND, MOVING, GRABBING, SELECTING. This state is defined based on the currently detected hand gesture state.
• Hand Gesture – NONE, ONE, TWO, THREE, FOUR, FIVE, YEAH, ROCK, SPIDERMAN, FIST, OK. These gestures are recognized by the algorithm.
• Finger State – states of all fingers (opened or closed).
• Center of the hand – 2D Image Pixel coordinates.
• Center of the hand – 3D World/Camera coordinates.
• Control region (start2D, current2D, start3D, current3D, regionPositionPrevious, regionPositionCurrent) – data which contains information about the initial (starting)
Fig. 4. Hand and gesture module API
and the current hand position in 2D and 3D. This is important for the RELATIVE and ABSOLUTE interaction control implemented within the interactors.
• Frame and DeviceOs timestamp – times in [ms] from the frame (RealSense) and the device operating system (the controlling computer).
The Hand and Gesture API consists of different window areas in which it is possible to train a neural network in the background and enable tracking in the process part, and to define filter settings for the spatial and temporal threshold on the left and right-hand side. The central part is the visualization toolkit, where processed data is visualized, and a window for visual feedback. Visual feedback is of great importance when using a novel input such as an “in the air” hand gesture control system. We designed the system to enable the user (medical specialist) to easily understand how to control an application. We implemented several requirements/program features to ensure a responsive, accurate, and satisfying user experience:
• A UI that considers human ergonomics – we created an arc-based menu that can be controlled while resting the elbow on the desk
• Fast feedback response, within 100 ms
• Informative visual feedback – the application can show what happened and inform the user what the next step is
• Animation with natural physics to instruct the user
• Intuitive and clear visual designs and text feedback
• Visual feedback for the optimal user distance from the camera (“Optimal distance”, “Move Closer”, “Move Back”)
• A “View of User” – a small viewport showing what the camera sees.
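The Hand and Gesture Interface Class package described earlier in this section is implemented by the authors as a C++ header file. As a rough Python analogue, the following sketch models a subset of its fields as a dataclass and serializes it to JSON, as would be needed for the socket or Web API transfer. The class name, field names and JSON layout are assumptions for illustration; only the gesture state values are taken from the text.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class GestureState(str, Enum):
    NO_HAND = "NO_HAND"
    MOVING = "MOVING"
    GRABBING = "GRABBING"
    SELECTING = "SELECTING"


@dataclass
class HandGesturePacket:
    """Hypothetical subset of the per-frame package sent to an interactor."""
    gesture_state: GestureState
    hand_gesture: str            # e.g. "FIVE", "FIST", "OK"
    finger_open: list            # five booleans, thumb to little finger
    center_2d: tuple             # image pixel coordinates (u, v)
    center_3d: tuple             # camera coordinates (x, y, z) in metres
    frame_timestamp_ms: float
    device_timestamp_ms: float

    def to_json(self):
        payload = asdict(self)
        payload["gesture_state"] = self.gesture_state.value
        return json.dumps(payload)

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        data["gesture_state"] = GestureState(data["gesture_state"])
        # JSON has no tuple type, so coordinates come back as lists.
        data["center_2d"] = tuple(data["center_2d"])
        data["center_3d"] = tuple(data["center_3d"])
        return cls(**data)


packet = HandGesturePacket(
    gesture_state=GestureState.GRABBING,
    hand_gesture="OK",
    finger_open=[False, False, True, True, True],
    center_2d=(640, 360),
    center_3d=(0.02, -0.05, 1.2),
    frame_timestamp_ms=1000.0,
    device_timestamp_ms=1003.5,
)
restored = HandGesturePacket.from_json(packet.to_json())
print(restored)
```

Carrying both the frame timestamp and the device timestamp in every packet lets a receiving interactor estimate how stale a frame is, which is relevant to the 100 ms feedback target listed above.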
Contactless Interface for Navigation in Medical Imaging Systems
Fig. 5. Screenshot of hand and gesture module API
As a result, we can compare our previous solution, based on Leap Motion, with the solution proposed in this paper. Table 1 summarizes the important features.
Table 1. Comparison of the Leap Motion based solution and our proposed solution.
Leap Motion based solution
Proposed solution
Depth range
’ section), they are expressed via various basis gates defined by OpenQASM [25]. For instance, some of the gates that can be used by an OpenQASM program are the Hadamard, rz (which implements $e^{i\theta Z}$) and swap gates. These gates are made up of the following components: 1) the gate name, such as ‘h’ for Hadamard, ‘rz’ for $e^{i\theta Z}$ and ‘swap’ for the swap gate; 2) θ, which is applicable to a limited set of gates, such as the ‘rz’ or ‘p’ gates, which apply a phase or a rotation by θ to a qubit’s state; and 3) the
T. J. Ong and C. C. Tan
corresponding qubits or quantum registers (we will refer to them as registers in the rest of the paper) where the gates are applied, such as ‘swap q[0], q[1]’ (as shown in Listing 1).
Listing 1. An example of the gates that can be used in an OpenQASM program. This list is a small excerpt of the gates supported by OpenQASM; the reader may refer to [25] for more details. Although OpenQASM 3.0 [26] has been proposed recently, this research is still based on OpenQASM 2.0, since the new standard is still evolving and is updated constantly. We plan to extend our research to OpenQASM 3.0 once the new standard is more stable.
5 Methods
5.1 Encoding and Decoding
In terms of the flow of the algorithm, our GA adheres to the classic GA model [10] in how the various operations, such as epochs, selection and genetic operators, are performed on the candidate solutions. Our GA is capable of generating complete OpenQASM 2.0 programs, to be evaluated by Qiskit’s Aer, based on the full set of gates of OpenQASM 2.0 [25]; it cannot, however, automatically generate procedures, such as automatically defined functions (ADF) [13]. As mentioned in Sect. 4.5, a gate is made up of three components, so the encoding process groups each type of component separately instead of storing them in a flat array as in a classic GA. In turn, the genes (gates) which form a chromosome (a complete quantum program) are stored in a list. The list is chosen as the data structure because it makes it relatively straightforward to manage the size of the program and to ensure termination by providing a threshold (a limit on the size of the list) that prevents the GA from generating an infinitely large program. Thus, one may visualize the genes in a chromosome as a tree-like structure in which each gate is mapped onto a gene, whereas θ (a float value in the range [0, 2π]) and the corresponding qubits (integer offsets) are part of the genetic makeup of the gene. We do not directly encode the gate names into the genes; instead, each is kept as an offset to the corresponding gate name in a lookup table based on [25]. The decoding process is the exact opposite of the encoding process: the decoder looks up the offset encoded in a gene in the lookup table to derive the gate name, and associates the required θ and qubits with it to create a valid OpenQASM instruction. Having completed the decoding of all the genes in an individual, the GA combines this information with the corresponding OpenQASM
A Genetic Algorithm for Quantum Circuit Generation in OpenQASM
headers, includes and measurement instructions (as discussed in Sect. 4.1) and sends them to Qiskit for execution. Finally, the fitness evaluation of the individuals is performed based on the outputs from Qiskit.
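The encoding and decoding described in this subsection can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three-entry `GATE_TABLE`, the `Gene` class and the helper names are hypothetical, and only the program header is emitted (measurement instructions are omitted for brevity).

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical excerpt of the gate lookup table: (name, needs_theta, arity).
# The paper's table covers the OpenQASM 2.0 gate set; only three gates are shown.
GATE_TABLE = [
    ("h", False, 1),
    ("rz", True, 1),
    ("swap", False, 2),
]


@dataclass
class Gene:
    gate: int               # offset into GATE_TABLE, not the gate name itself
    theta: Optional[float]  # angle in [0, 2*pi], or None if the gate takes no angle
    qubits: List[int]       # integer offsets into the quantum register q


def encode_gene(name: str, qubits: List[int], theta: Optional[float] = None) -> Gene:
    """Store the gate as a lookup-table offset together with theta and qubits."""
    offset = [entry[0] for entry in GATE_TABLE].index(name)
    return Gene(gate=offset, theta=theta, qubits=list(qubits))


def decode_gene(gene: Gene) -> str:
    """Look up the gate name and rebuild one OpenQASM 2.0 instruction."""
    name, needs_theta, arity = GATE_TABLE[gene.gate]
    args = ", ".join(f"q[{i}]" for i in gene.qubits[:arity])
    prefix = f"{name}({gene.theta})" if needs_theta else name
    return f"{prefix} {args};"


def decode_chromosome(genes: List[Gene], n_qubits: int) -> str:
    """Prepend the OpenQASM 2.0 header and decode every gene in order."""
    header = ["OPENQASM 2.0;", 'include "qelib1.inc";', f"qreg q[{n_qubits}];"]
    return "\n".join(header + [decode_gene(g) for g in genes])
```

Because a gene carries the gate offset, θ and qubits together, whole genes can be moved between chromosomes without re-checking their arguments.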
Fig. 2. The encoding and decoding of the sample program from Listing 1 in GA.
This encoding and decoding scheme (Fig. 2) has the advantage that when the various operations (discussed in Sect. 5.2) are performed on these instructions, the genes can be exchanged easily between different chromosomes without worrying about the type of the operators and their effects on the values of θ or the registers. Furthermore, the representation makes it easy to apply different genetic operations to the types of gates in the future.
5.2 Operations
Crossover. We have implemented single-point and multi-point crossover operators for the GA. They resemble those of the classic GA, with the exception that they operate on the genes (gates) as a whole, as shown in Figs. 3 and 4. Both operators swap genes between parent chromosomes without affecting θ and the registers, which ensures that the resultant chromosomes are always valid OpenQASM programs. In short, the crossover operators change the ordering and size of the individual chromosomes (programs).
Mutation. The mutation operator, on the other hand, has more freedom in what it can do to the genes. In a classic GA the mutation operator randomly flips the value of a particular argument corresponding to a chosen encoding; in this GA the mutation operator may randomly alter the encoding corresponding to a gate, θ or a register, as shown in Fig. 5. The mutation operator operates rather “freely”, as it can alter the gate, θ, or registers to an arbitrary value or type. This operator may break some programs because it may
Fig. 3. The single point crossover operator in GA.
introduce or remove certain requirement(s) of the program. For instance, when the ‘h’ gate is mutated into an ‘rz’ gate, a new argument (θ) is required for Aer to interpret the program properly, as depicted by the case “A. To A Gate” in Fig. 5. As a result, the gene requires a correction after the mutation operator has completed, because the resultant program is no longer valid. A mutation operation is therefore frequently, but not always, accompanied by a correction algorithm that converts the new program into a valid one. The correction processes are summarized in the following subsections.
Mutation on a Gate. When a mutation happens on a gate, it may alter the number of registers and whether θ is required. The correction algorithm refers to a lookup table, based on [25], to determine whether the resultant gate requires a θ as well as the number of registers it operates on.
• If θ already exists in the old gate and the new gate requires a θ, keep everything unchanged; if the new gate does not require a θ, remove θ from the gene.
• If θ does not exist in the old gate but the new gate requires a θ, generate a random θ in the range [0, 2π].
• If the new gate does not have sufficient registers to operate on (for instance, a ‘u’ gate, which accepts a single register as input, is mutated into an ‘rzz’ gate, which accepts two registers as inputs), generate the new register(s) from the remaining set of registers (without replacement).
Fig. 4. The multi-point crossover operator in GA.
Fig. 5. The effects of the mutation operator on a gene.
• If the resultant gate has too many registers, perform truncation on the list of registers. For instance, if a ‘rzz(θ) q[0], q[1]’ gate is mutated into a ‘ry’ gate (which operates on one register only), the list of registers will be truncated into ‘ry(θ) q[0].’
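The four correction rules for a mutated gate can be sketched as follows. This is a minimal illustration under stated assumptions: `GATE_SPEC` is a hypothetical stand-in for the paper's lookup table, and the function name and signature are invented for this example.

```python
import math
import random

# Hypothetical stand-in for the lookup table: gate name -> (needs_theta, arity).
GATE_SPEC = {
    "h": (False, 1),
    "ry": (True, 1),
    "rz": (True, 1),
    "rzz": (True, 2),
    "swap": (False, 2),
}


def correct_gate_mutation(new_gate, theta, qubits, n_qubits, rng):
    """Repair theta and the register list after a gene's gate has been mutated."""
    needs_theta, arity = GATE_SPEC[new_gate]

    if not needs_theta:
        theta = None                                # the new gate takes no angle
    elif theta is None:
        theta = rng.uniform(0.0, 2.0 * math.pi)     # the new gate needs a fresh angle
    # otherwise the existing theta is kept unchanged

    qubits = list(qubits[:arity])                   # truncate surplus registers
    if len(qubits) < arity:
        # draw the missing registers from the unused ones, without replacement
        remaining = [q for q in range(n_qubits) if q not in qubits]
        qubits += rng.sample(remaining, arity - len(qubits))
    return new_gate, theta, qubits
```

For example, `correct_gate_mutation("ry", 1.0, [0, 1], 5, random.Random(1))` keeps θ and truncates the register list to `[0]`, matching the ‘rzz’ to ‘ry’ case in the text.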
Mutation on θ. Currently, we do not perform any lookup for a “better value” of θ to guide the mutation process, nor do we impose restrictions on the range of the change (such as a value); therefore, θ may take on any random floating-point value in the range [0, 2π].
Mutation on a Register. When the mutation operator works on a register, it changes the qubit that the register maps to, and since the operation is random, this may create duplicate registers for a gate. For instance, assuming that the mutation operator mutates the 2nd register (‘q[1]’) of the ‘rzz(θ) q[0], q[1]’ gate, it may generate ‘rzz(θ) q[0], q[0]’, which is an invalid gate in OpenQASM. The correction algorithm then assigns a different register (from the remaining set of available registers) to the first register, potentially resulting in ‘rzz(θ) q[1], q[0]’, ‘rzz(θ) q[2], q[0]’, ‘rzz(θ) q[3], q[0]’ or ‘rzz(θ) q[4], q[0]’ for a 5-qubit OpenQASM program.
In summary, the crossover operator alters the ordering and structure of the gates in a program at a high level, whereas the mutation operator works on the low-level details of the genes (such as the affected registers and the θ passed into the gates).
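As an illustration of crossover acting on whole genes, the sketch below implements a single-point variant in Python. It assumes that the cut point is drawn independently in each parent, which is one way for crossover to change program size as described above; the paper's Fig. 3 may differ, and the chromosome size limits are not enforced here.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two gene lists; each gene (gate) moves as a whole.

    Assumption: independent cut points per parent, so the children may differ
    in length from the parents. Both parents need at least two genes.
    """
    cut_a = rng.randint(1, len(parent_a) - 1)
    cut_b = rng.randint(1, len(parent_b) - 1)
    child_a = parent_a[:cut_a] + parent_b[cut_b:]
    child_b = parent_b[:cut_b] + parent_a[cut_a:]
    return child_a, child_b


# Genes are shown as already-decoded instructions purely for readability.
parent_a = ["h q[0];", "h q[1];", "rz(0.5) q[0];"]
parent_b = ["swap q[0], q[1];", "h q[2];", "ry(1.0) q[3];", "h q[4];"]
child_a, child_b = single_point_crossover(parent_a, parent_b, random.Random(7))
```

Since no gene is altered internally, every child is a reordering of valid instructions and therefore remains a valid program body.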
5.3 Selection and Fitness Evaluation
The roulette wheel selection mechanism employed in the GA closely resembles the one used in the classic GA [10]: the fitness values of the chromosomes are evaluated first, then the selection procedure picks two parent chromosomes that undergo crossover and mutation to produce two offspring for the new population. For the experiments, we have implemented the quantum equivalent of the classic GA MaxOne binary string problem described in various texts, such as [27] (i.e. maximize the number of 1s in a collection of binary digits). In contrast to the classic GA, where the digits have a predefined value of ‘0’ or ‘1’, the quantum variant of this problem initially has all of the registers set in superposition via the ‘h’ (Hadamard) gate. The GA needs to find a collection of gates (in the form of a proper OpenQASM program) that maximizes the probability of the state “11111” (where all bits are turned on in a classic GA) for the 5-qubit problem. In terms of evaluating the fitness of an individual, although one could measure all of the qubits when the program terminates (an irreversible operation, performed via the ‘measure q[i] -> c[i]’ instruction in OpenQASM), this is less practical since the measurement only yields one final state. A single measurement does not take into account the other states which may also appear as the final state (albeit with lower, but possibly non-zero, probability) when additional measurements are performed on the same program. For this reason, a measurement cannot be performed directly on the registers during any of the evolution cycles, since it would collapse the quantum state and eliminate all of the information (such as probabilities) associated with the other states.
Therefore, the state vector simulation done by Qiskit Aer [28] is consulted by GA to obtain the associated probabilities of all of the qubit (or register) states from the
simulator’s state vectors, so that this process does not require any direct measurements on the registers. The output from Qiskit Aer consists of a list which maps the possible qubit states to their associated probabilities, such as the following for a two-qubit setup, where the key ‘00’ corresponds to one of the possible two-qubit states and 0.1 is the corresponding probability (10%): {‘00’: 0.1, ‘01’: 0.1, ‘10’: 0.1, ‘11’: 0.7}
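To make the relationship between a state vector and such a probability map concrete, here is a small plain-Python sketch. It does not call Qiskit: the function name is hypothetical, the amplitudes are assumed to be given as a list of complex numbers, and the bit-string labels are simply the binary index of each amplitude (the simulator's own qubit ordering convention may differ).

```python
import math


def state_probabilities(amplitudes):
    """Map each basis state (as a bit string) to |amplitude|^2.

    Assumes len(amplitudes) == 2**n for some n >= 1.
    """
    n_qubits = (len(amplitudes) - 1).bit_length()
    if len(amplitudes) != 2 ** n_qubits:
        raise ValueError("state vector length must be a power of two")
    return {
        format(index, f"0{n_qubits}b"): abs(amp) ** 2
        for index, amp in enumerate(amplitudes)
    }


# Two qubits after a Hadamard on each: all four states are equally likely.
uniform = state_probabilities([0.5, 0.5, 0.5, 0.5])

# A Bell-like state: only '00' and '11' have non-zero probability.
bell = state_probabilities([1 / math.sqrt(2), 0, 0, 1 / math.sqrt(2)])
```

Reading probabilities this way leaves the simulated state untouched, which is why no measurement instruction is needed during evolution.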
Subsequently, the probabilities computed from the state vectors can be used in our fitness evaluation functions in various ways. We have implemented two approaches (referred to as FE1 and FE2) in this paper to evaluate the fitness value of an individual in a GA population:
1. $FE_1(\mathrm{Chromosome}_i) = P^m \cdot S^m$, where $P^m$ represents the probability of the qubit state $S^m$, and $S^m$ represents the integer value of the qubit state that is closest to the maximum value of the MaxOne problem; for instance, a 5-digit MaxOne problem has the maximum value $F(5) = 2^5 - 1$.
2. $FE_2(\mathrm{Chromosome}_i) = \sum_{k=0}^{n-1} P^k \cdot S^k$, where $n$ represents the total number of qubit states that can be represented by the number of qubits (so that $n - 1$ is the maximum value of the MaxOne problem), and $P^k$ and $S^k$ are defined for the k-th state as in the first approach.
The first approach (FE1) is similar to an “all or nothing” approach, since it only takes into account the probability of the qubit state that is closest to the maximum value of the MaxOne problem and disregards the probabilities associated with all of the other intermediary states. The method exerts more selective pressure on less fit individuals, since it does not differentiate between individuals that have the same fitness value but exhibit different probabilities in the other intermediary states. For instance, consider the probabilities of the following state vectors, derived from different chromosomes during an evolution cycle for a 2-qubit program:
1. {‘00’: 0.1, ‘01’: 0.1, ‘10’: 0.7, ‘11’: 0.1}
2. {‘00’: 0.5, ‘01’: 0.2, ‘10’: 0.2, ‘11’: 0.1}
3. {‘00’: 0.1, ‘01’: 0.1, ‘10’: 0.1, ‘11’: 0.7}
The first two lists have the same fitness value (namely, 0.1 × 3 = 0.3). Although the first individual might have some interesting gate combinations that may be useful for future generations, both individuals are evaluated as less favorable by the first approach and have a higher likelihood of being discarded from the new population, since the third list (with a fitness value of 0.7 × 3 = 2.1) dominates this small population of three chromosomes. In contrast, the second approach (FE2) takes a “weighted average” or “expected value” approach: the first individual may not be as desirable as the third, but it has a relatively better chance of surviving the selection process than the second individual, since it is ranked higher when its fitness value is computed. Under FE2, the three lists take on the fitness values {1.8, 0.9, 2.4}, respectively.
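The two fitness functions and the worked 2-qubit example can be reproduced with a few lines of Python. This is a sketch rather than the authors' code; the function names are hypothetical, and the probability maps are the three example lists from the text.

```python
def fe1(probs):
    """FE1: probability of the all-ones state times its integer value."""
    n_qubits = len(next(iter(probs)))
    target = "1" * n_qubits                 # e.g. '11' for two qubits
    return probs.get(target, 0.0) * (2 ** n_qubits - 1)


def fe2(probs):
    """FE2: expected integer value of the state, sum of P^k * S^k over all states."""
    return sum(p * int(state, 2) for state, p in probs.items())


examples = [
    {"00": 0.1, "01": 0.1, "10": 0.7, "11": 0.1},
    {"00": 0.5, "01": 0.2, "10": 0.2, "11": 0.1},
    {"00": 0.1, "01": 0.1, "10": 0.1, "11": 0.7},
]

fe1_values = [round(fe1(p), 6) for p in examples]   # [0.3, 0.3, 2.1]
fe2_values = [round(fe2(p), 6) for p in examples]   # [1.8, 0.9, 2.4]
```

FE1 cannot tell the first two individuals apart, while FE2 ranks the first above the second because of its 0.7 probability in state ‘10’.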
Our current experiments examine the behavior and results of these two evaluation schemes in order to observe their impact on the resultant population and to guide our future research in quantum program generation.
5.4 Experiments
We have set up two sets of parameters (see Table 1 below) for the GA to evolve quantum programs that solve the 5-qubit MaxOne problem based on FE1 and FE2:
Table 1. Parameters used by the experiments, where Pc and Pm represent the crossover and mutation probabilities, Cs represents the min and max size of the chromosome and Ps represents the population size.
Set   Pc     Pm      Cs     Ps
A     0.65   0.001   5–25   50
B     0.85   0.01    5–25   50
Lastly, the crossover and mutation probabilities are unchanged throughout the evolutionary cycles, and the selection process always introduces 20% new individuals into the new population to maintain genetic diversity.
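A minimal sketch of roulette wheel selection combined with the 20% injection of new individuals is given below. It assumes that “new individuals” means freshly generated random ones; `make_offspring` and `new_individual` are hypothetical callbacks standing in for the crossover/mutation pipeline and the random program generator.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)       # degenerate case: no fitness signal
    pick = rng.random() * total
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick < running:
            return individual
    return population[-1]                   # guard against rounding at the upper end


def next_population(population, fitnesses, make_offspring, new_individual, rng,
                    fresh_fraction=0.2):
    """Build a new population: offspring of selected parents plus fresh individuals."""
    size = len(population)
    n_fresh = int(round(fresh_fraction * size))
    children = []
    while len(children) < size - n_fresh:
        parent_a = roulette_select(population, fitnesses, rng)
        parent_b = roulette_select(population, fitnesses, rng)
        children.extend(make_offspring(parent_a, parent_b))   # two offspring per pair
    return children[: size - n_fresh] + [new_individual() for _ in range(n_fresh)]
```

With a population of 50, this keeps 40 offspring and adds 10 fresh individuals in every generation.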
6 Results and Analysis
The GA is written in Python [23], and the results are collected over 100 epochs so that we obtain a good overview of the changes in the minimum, average and maximum fitness values of the chromosomes across the four sets of experiments (two sets of parameters for each of FE1 and FE2). The results of the runs are plotted with Matplotlib [29]. We refer to the experiments as FE1A, FE1B, FE2A and FE2B in Figs. 6, 7, 8, 9 and 10 below (for instance, FE1A corresponds to the results of the experiment using the first fitness evaluation function (FE1) with parameter set A). The results show that FE1 is, relatively speaking, better than FE2 on both sets of parameters. This is in part due to the selection pressure exerted by FE1, since it only takes into account the probability of the qubit state that is closest to the goal. However, the best fitness value for FE1A quickly plateaus around epoch 78, when the best individual (as shown in Figs. 6 and 7) dominates the entire population. The average fitness of the population under FE1A increases gradually throughout all of the epochs, whereas FE1B shows a slight dip in the average fitness value after epoch 82. This may have been caused by the higher mutation and crossover rates used in FE1B. FE1B managed to stumble upon a solution that has an 86.17% probability in the ideal state, namely ‘11111’, whereas FE1A was not able to yield a good solution in this particular run, as the closest it could get is ‘11110’ with a 75.22% probability.
On the other hand, FE2A and FE2B (Figs. 8, 9 and 10) exhibited less desirable characteristics in terms of the fitness values. The average fitness values of the entire population oscillate within an interval (“ranging”) throughout the entire run and never quite leave that range by the end of epoch 100. This indicates that FE2 does not provide sufficient guidance to the GA in improving the average fitness of the population: when the probabilities and the corresponding states in the state vectors are ‘averaged’, an individual with a higher probability in only one of the latter states but low probabilities in all other states may have a similar fitness value to an individual (a “well-rounder”) with fairly homogeneous probabilities in all of the states. Such characteristics of FE2 may have misguided the GA search process, thus resulting in
HMS) individuals are generated as the data set, and a sample is randomly selected from the data set as the initial clustering center $c_1$.
Step 2. Calculate the shortest distance $d(x)$ between each remaining sample and the existing cluster centers, and calculate the probability $\frac{d(x)^2}{\sum_{x \in X} d(x)^2}$ of each sample being selected as the next cluster center. Then select the next cluster center according to the roulette method.
Step 3. Repeat Step 2 until the HMS initial cluster centers $C = \{c_1, c_2, \ldots, c_{HMS}\}$ are selected.
Step 4. Calculate the distance between the remaining samples of the data set and the HMS initial cluster centers, and assign each sample to the nearest center to form HMS clusters.
Step 5. Recalculate the center points of the HMS clusters and update the positions of the center points.
Step 6. Repeat Steps 4 and 5 until the positions of the cluster centers no longer change. The resulting centers are output as the HMS harmony vectors that form the harmony memory.
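Steps 2 to 6 above can be sketched in plain Python as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, the samples are assumed to be lists of floats drawn beforehand, distances are squared Euclidean, and a `max_iter` cap is added as a safeguard.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_initial_centers(samples, hms, rng):
    """Steps 1-3: seed HMS centers, weighting each sample by d(x)^2 (roulette)."""
    centers = [rng.choice(samples)]
    while len(centers) < hms:
        weights = [min(sq_dist(s, c) for c in centers) for s in samples]
        centers.append(rng.choices(samples, weights=weights, k=1)[0])
    return centers


def init_harmony_memory(samples, hms, rng, max_iter=100):
    """Steps 4-6: refine the centers by clustering; the centers form the memory."""
    centers = [list(c) for c in select_initial_centers(samples, hms, rng)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(hms)]
        for s in samples:
            nearest = min(range(hms), key=lambda j: sq_dist(s, centers[j]))
            clusters[nearest].append(s)
        # Recompute each center as the mean of its cluster (keep it if empty).
        updated = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if updated == centers:
            break                               # centers no longer move
        centers = updated
    return centers
```

A call such as `init_harmony_memory(samples, 5, random.Random(0))` with 50 random 5-dimensional samples returns five harmony vectors that lie inside the sampled bounds.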
Our method modifies only the second step of the original algorithm; all other steps are the same as in the original algorithm. The main process is shown in Fig. 1. The purpose of this change is to make the distribution of initial values more uniform through clustering, so that the algorithm avoids premature convergence as far as possible while achieving faster computation and higher efficiency.
An Improved Clustering-Based Harmony Search Algorithm (IC-HS)
Fig. 1. The main process of IC-HS algorithm
4 Experiment Results and Analysis
To test the performance of the IC-HS algorithm, we consider five test functions in the experiment, as follows:
A. Sphere function, defined as
$$\min f(x) = \sum_{i=1}^{n} x_i^2,$$
where the global optimum is $x^* = (0, 0, \ldots, 0)$ and $f(x^*) = 0$ for $-100 \le x_i \le 100$ $(i = 1, 2, \ldots, n)$.
B. Schwefel’s problem 2.22, defined as
$$\min f(x) = \sum_{i=1}^{n} |x_i| + \prod_{i=1}^{n} |x_i|,$$
where the global optimum is $x^* = (0, 0, \ldots, 0)$ and $f(x^*) = 0$ for $-10 \le x_i \le 10$ $(i = 1, 2, \ldots, n)$.
C. Schwefel’s problem 2.26, defined as
$$\min f(x) = 418.9829\,n - \sum_{i=1}^{n} x_i \sin\left(\sqrt{|x_i|}\right),$$
where the global optimum is $x^* = (420.9687, 420.9687, \ldots, 420.9687)$ and $f(x^*) = 0$ for $-500 \le x_i \le 500$ $(i = 1, 2, \ldots, n)$.
Y. Zhang et al.
D. Rastrigin function, defined as
$$\min f(x) = \sum_{i=1}^{n} \left[ x_i^2 - 10\cos(2\pi x_i) + 10 \right],$$
where the global optimum is $x^* = (0, 0, \ldots, 0)$ and $f(x^*) = 0$ for $-5.12 \le x_i \le 5.12$ $(i = 1, 2, \ldots, n)$.
E. Schaffer function, defined as
$$\min f(x, y) = 0.5 + \frac{\sin^2\sqrt{x^2 + y^2} - 0.5}{\left[1 + 0.001\left(x^2 + y^2\right)\right]^2},$$
where the global optimum is $x^* = (0, 0, \ldots, 0)$ and $f(x^*) = 0$ for $-100 \le x_i \le 100$ $(i = 1, 2, \ldots, n)$.
In the experiments, the parameter settings of the two algorithms are as follows: HMS = 5, HMCR = 0.99, PAR = 0.3, BW = 0.01. We use the best, worst, average and standard deviation (SD) of the results to represent the performance of each algorithm. The data are obtained from more than 30 independent simulations. Each simulation is run on a computer with an Intel Core i9 CPU at 2.3 GHz, and the numerical results of solving the standard problems in four different dimensions (5, 10, 20 and 30) with the two algorithms are recorded in Tables 1, 2, 3 and 4. From the comparison results in Tables 1, 2, 3 and 4, we can see that, compared with the traditional harmony search algorithm, our method performs better on both unimodal and multimodal function optimization problems, with higher accuracy and better avoidance of local optima and premature convergence. It can also be seen from Fig. 2 that the convergence speed is improved. Compared with the traditional harmony search algorithm, we obtain better initial values, which shows that the clustering method plays a role in increasing individual diversity and optimizing the initial values.
Table 1. The performance of the benchmark function optimization results (n = 5).
Function  Global optimum  Algorithm  Best      Worst     Mean      SD
A         0               HS         4.76E−09  5.73E−07  1.02E−07  1.31E−07
A         0               IC-HS      2.70E−09  1.76E−07  3.58E−08  4.31E−08
B         0               HS         6.62E−05  6.21E−04  2.70E−04  1.35E−04
B         0               IC-HS      5.41E−05  6.17E−04  8.06E−05  1.11E−04
C         0               HS         5.16E+00  4.85E+01  2.00E+01  1.08E+01
C         0               IC-HS      2.83E−01  9.90E+00  4.39E+00  2.83E+00
D         0               HS         9.95E−01  6.96E+00  2.37E+00  1.02E+00
D         0               IC-HS      1.64E−06  1.99E+00  2.65E−01  5.10E−01
E         0               HS         1.02E−01  3.60E−01  2.33E−01  6.71E−02
E         0               IC-HS      2.15E−02  2.03E−01  8.09E−02  4.30E−02
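The five benchmark functions used in this section can be written compactly in Python. The sketch below assumes the standard textbook forms: Schwefel 2.22 as a sum plus a product of absolute values, and the Schaffer function in its common two-variable form with minimum 0 at the origin. The function names are chosen for this example only.

```python
import math


def sphere(x):
    """A. Sphere: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def schwefel_2_22(x):
    """B. Schwefel 2.22: sum of |x_i| plus product of |x_i|."""
    abs_x = [abs(v) for v in x]
    return sum(abs_x) + math.prod(abs_x)


def schwefel_2_26(x):
    """C. Schwefel 2.26: minimum ~0 at x_i = 420.9687."""
    return 418.9829 * len(x) - sum(v * math.sin(math.sqrt(abs(v))) for v in x)


def rastrigin(x):
    """D. Rastrigin: highly multimodal, minimum 0 at the origin."""
    return sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) + 10.0 for v in x)


def schaffer(x, y):
    """E. Schaffer (two-variable form): minimum 0 at (0, 0)."""
    r2 = x * x + y * y
    return 0.5 + (math.sin(math.sqrt(r2)) ** 2 - 0.5) / (1.0 + 0.001 * r2) ** 2
```

Each function returns 0 (or approximately 0 for Schwefel 2.26) at its stated global optimum, which is the reference value reported in the "Global optimum" column of the tables.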
Table 2. The performance of the benchmark function optimization results (n = 10).
Function  Global optimum  Algorithm  Best      Worst     Mean      SD
A         0               HS         1.47E−06  2.02E−01  1.17E−02  3.86E−02
A         0               IC-HS      8.70E−07  3.45E−05  6.72E−06  8.36E−06
B         0               HS         1.60E−03  1.58E−01  1.28E−02  3.48E−02
B         0               IC-HS      1.00E−03  5.50E−03  2.80E−03  9.24E−04
C         0               HS         1.75E+01  1.54E+02  5.27E+01  2.99E+01
C         0               IC-HS      3.73E+00  1.67E+01  1.18E+01  2.58E+00
D         0               HS         9.95E−01  4.97E+00  2.75E+00  9.14E−01
D         0               IC-HS      6.11E−05  1.99E+00  4.31E−01  6.43E−01
E         0               HS         2.03E−01  7.82E−01  3.63E−01  1.33E−01
E         0               IC-HS      1.53E−02  3.30E−01  2.21E−01  4.90E−02
Table 3. The performance of the benchmark function optimization results (n = 20).
Function  Global optimum  Algorithm  Best      Worst     Mean      SD
A         0               HS         1.14E−04  8.39E−01  2.14E−01  2.51E−01
A         0               IC-HS      5.87E−05  1.28E−01  1.43E−02  3.12E−02
B         0               HS         2.73E−01  2.24E+00  1.47E+00  5.88E−01
B         0               IC-HS      1.38E−02  9.91E−01  3.01E−01  3.22E−01
C         0               HS         3.08E+01  1.52E+02  8.57E+01  2.89E+01
C         0               IC-HS      1.88E+01  9.48E+01  3.65E+01  1.46E+01
D         0               HS         4.23E−01  3.00E+00  1.94E+00  7.38E−01
D         0               IC-HS      2.10E−03  1.02E+00  3.10E−01  4.53E−01
E         0               HS         1.54E−01  2.53E+00  1.96E−01  3.85E−01
E         0               IC-HS      2.15E−02  1.02E−01  5.24E−02  2.71E−01
Table 4. The performance of the benchmark function optimization results (n = 30).
Function  Global optimum  Algorithm  Best      Worst     Mean      SD
A         0               HS         2.29E−02  3.24E+00  9.94E−01  7.83E−01
A         0               IC-HS      4.20E−03  8.79E−01  3.30E−01  2.34E−01
B         0               HS         1.52E−01  2.58E+00  1.12E+00  7.23E−01
B         0               IC-HS      9.69E−02  1.88E+00  8.03E−01  4.92E−01
C         0               HS         9.02E+01  2.60E+02  1.57E+02  4.33E+01
C         0               IC-HS      2.18E+01  9.94E+01  6.94E+01  1.97E+01
D         0               HS         9.24E−01  7.00E+00  3.76E+00  1.43E+00
D         0               IC-HS      1.78E−02  3.03E+00  1.07E+00  9.58E−01
E         0               HS         4.63E−01  1.97E+00  1.19E+00  5.18E−01
E         0               IC-HS      2.06E−01  5.21E−01  4.12E−01  9.22E−02
Fig. 2. Examples of convergence behavior of two algorithms for solving different functions. (Plots omitted: best solution history for Functions A, B, C, D and E; x-axis: number of iterations; y-axis: objective function value; curves: IC-HS and HS.)
5 Conclusions
To improve the performance of the HS algorithm, we propose an improved harmony search algorithm based on clustering, called IC-HS. By using a clustering algorithm in the initialization phase of the harmony memory, initial values with a more uniform distribution are obtained, which improves the overall performance of the algorithm. Five test functions are used for comparison and analysis. The experimental results show that the IC-HS algorithm performs better than the traditional HS algorithm: the proposed method has higher accuracy, faster convergence and higher efficiency, and it avoids premature convergence. The experimental results also support the effectiveness and robustness of the algorithm. In short, the IC-HS algorithm is a promising optimization algorithm. In the next stage, we hope to continue improving on this basis, especially with respect to accuracy.
References
1. Blum, C., Puchinger, J., Raidl, G.R., Roli, A.: Hybrid metaheuristics in combinatorial optimization: a survey. Appl. Soft Comput. 11(6), 4135–4151 (2011)
2. Gogna, A., Akash, T.: Metaheuristics: review and application. J. Exp. Theor. Artif. Intell. 25(4), 503–526 (2013)
3. Holland, J.H.: Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor (1975)
4. Chen, G.C., Yu, J.S.: Particle swarm optimization algorithm. Inf. Control-Shenyang 34(3), 318 (2005)
5. Karaboga, D., Basturk, B.: A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm. J. Glob. Optim. 39(3), 459–471 (2007). https://doi.org/10.1007/s10898-007-9149-x
6. Geem, Z.W., Kim, J.H., Loganathan, G.V.: A new heuristic optimization algorithm: harmony search. SIMULATION 76(2), 60–68 (2001)
7. Osama, M.A., Mandava, R.: The variants of the harmony search algorithm: an overview. Artif. Intell. Rev. 36(1), 49–68 (2011)
8. Mahdavi, M., Fesanghary, M., Damangir, E.: An improved harmony search algorithm for solving optimization problems. Appl. Math. Comput. 188(2), 1567–1579 (2007)
9. Omran, M.G.H., Mahdavi, M.: Global-best harmony search. Appl. Math. Comput. 198(2), 643–656 (2008)
10. Lee, K.S., Geem, Z.W.: A new meta-heuristic algorithm for continuous engineering optimization: harmony search theory and practice. Comput. Methods Appl. Mech. Eng. 194(36–38), 3902–3933 (2005)
Role of Artificial Intelligence in Software Quality Assurance
Sonam Ramchand1, Sarang Shaikh2(B), and Irtija Alam1
1 Department of Computer Science, Institute of Business Administration (IBA), Karachi, Pakistan
{s.ramchand 19841,i.alam 19826}@khi.iba.edu.pk
2 Department of Information Security and Communication Technology, Norwegian University of Science and Technology (NTNU), Gjovik, Norway
[email protected]
Abstract. Artificial intelligence has taken its place in almost every industry in which individuals operate, and it has become an integral part of the applications and systems that surround us. The World Quality Report estimates that 64% of companies will implement Artificial Intelligence (AI) in their Software Quality Assurance (SQA) processes. It is predicted that in the very near future SQA engineers will no longer test manually; instead, they will acquire the skills to use AI-enabled tools and techniques for software quality assurance in order to contribute to business growth. AI plays a crucial role in software testing, as it makes processes leaner and yields more accurate results. This paper discusses how Artificial Intelligence impacts the software testing industry. The new era of quality assurance will be dominated by the power of Artificial Intelligence, as it significantly reduces time and increases the efficiency with which a firm can develop more sophisticated software. This study focuses on artificial intelligence applications in software testing and on which AI algorithms are popularly adopted by the QA industry. Furthermore, this paper discusses real issues that exist in the industry, for instance, why young testers are more flexible towards adopting the latest technological changes.
Keywords: Artificial intelligence · White-box testing · Bug reporting · Black-box testing · Regression testing · Software development life cycle · Software quality assurance · SQA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 125–136, 2022. https://doi.org/10.1007/978-3-030-82196-8_10
1 Introduction
Computational Intelligence is a way to leverage computer power to translate human intelligence capacity into machines [1]. Computational Intelligence in Software Quality Assurance refers to the use of advanced computational intelligence to deliver high-quality software [2]. Software systems are embedded in everything around us, from a tiny smartwatch to large, critical military systems. Software has become an integral part of people's livelihoods, and as a result the quality of software systems has become critical [3]. With the increasing usage of
S. Ramchand et al.
software, more data is being produced and more complex systems are being developed [4]. Achieving 100% test coverage with manual procedures has become almost impossible [5]. Artificial Intelligence, together with sophisticated data mining tools, can enable smart automated testing, which ultimately results in efficient testing in the least possible time with nearly 100% test coverage [6]. There are several approaches to implementing artificial intelligence for testing. Broadly, AI can be split into two categories: a software application that uses a knowledge base and a set of pre-defined rules follows the rule-based AI approach, and an application that leverages machine learning follows the learning-based approach [7]. This paper analyses the advantages and limitations of each type of implementation in quality assurance processes. Furthermore, this paper also covers how advanced AI processes like Robotic Process Automation (RPA) can prove to be a powerful tool for testing [8]. It also compares the capabilities of young software quality assurance engineers with those of engineers who have been working in the industry for more than 10 years, to see who performs better and adopts the latest technologies faster in order to contribute to business growth. The rest of the paper is organized as follows: Sect. 2 presents background studies, Sect. 3 explains the methodology, Sect. 4 presents the literature surveys conducted as part of the methodology, Sect. 5 provides further discussion of the topic, Sect. 6 explains the results, Sect. 7 presents the conclusion, and finally Sect. 8 outlines future work.
2 Background
Although researchers and experts have worked for years on different software attributes, testing remains one of the most refined and established methodologies for assessing and enhancing software quality. It is now unavoidable, however, to step beyond the continuous testing strategy [9]. As the world shifts towards digital transformation, the need to foresee consumer needs in advance and to build a framework that is flexible and predictive enough to cater to future developments is at its height [10]. In the present situation, testing requires assistance to accelerate delivery, and artificial intelligence in QA can help achieve that. Although AI keeps maturing in the field of software quality assurance, quality assurance engineers still do not have enough skills to meet the competencies that AI demands. Therefore, organizations should train their QA teams in competencies such as AI testing, math optimization, business intelligence, algorithmic analysis and neuro-linguistics. The recent World Quality Report suggests that in the very near future there will be three emerging roles for quality assurance experts: AI Testing Experts, AI QA Strategists, and QAOps. Automation in QA has existed for years. Its advantages, however, were not impactful enough for organisations to finally wake up and take notice [11]. The way quality assurance works in organisations needs reform [12]. In general, there are two driving forces: agility in the manner in which testing is carried out, and quick & efficient
Role of Artificial Intelligence in Software Quality Assurance
127
market penetration. Conventional test automation is no longer enough for QA teams to keep pace with the agile mode of production, making AI inescapable in test automation. AI does not replace manual work or be an all-in-one alternative to the testing tools previously available. Right now, improving automated testing methods is the main application of AI in software testing [13]. It has several limits, though. A long configuration is needed for automated tools: testers and architects need to develop the functionality of the tool, manually view the required scenarios, and monitor its performance. Artificial Intelligence can understand and accept the task of configuring, tracking and ensuring accurate results for automated tools. Intelligent Automation i.e. AI-led cognitive automation solutions integrate the best approaches to automation with Artificial intelligence and help bring higher performance [14].
3
Methodology
For this research paper, we will use both quantitative and qualitative approaches. For the qualitative data, we will survey different SQA and AI experts to learn about their practices and their knowledge of each other's domain. The sample size for the qualitative approach will be 7. For the quantitative data, we will use a questionnaire. The sample size will be 80. The questions used in the questionnaire are listed at the end of the paper (Sect. 9). The purpose of this paper was to study the AI algorithms and methodologies that have been proposed and are being used for software testing. Research papers were searched on the most commonly used platforms for computer science, which include IEEE, Elsevier, Springer, Taylor and Francis, Wiley, ACM, arXiv.org and Google Scholar, using the queries below:
– Applications of AI in software quality assurance
– Application of AI in software testing
– Role of AI in software testing
– Future of software AI based software testing
– Computational intelligence in software quality assurance
After the papers had been retrieved from the mentioned platforms, a filtration technique was applied to extract the most relevant ones. Figure 1 shows the statistics of each screening stage.
4
Literature Survey
In recent years, artificial intelligence has become an integral part of the software applications used in health, e-commerce and many other digital services [15]. Even the garment industry uses AI to improve its SQA and product quality, by means of an RFID-based recursive process mining system [16]. However, the application of AI in software testing still remains largely confined to academic papers because of the complex procedures that software testing follows. Different researchers are trying to introduce approaches that improve SQA using AI. For test management, [17] presented an approach that enhances the test process by using ML as a support method. The domain is the semi-automation of regression test collection. In the suggested lean testing process, machine learning is used as a supporting tool, whereas choosing the relevant test cases remains the responsibility of the manual test manager [18].
Fig. 1. Relevant articles screening process
Artificial intelligence algorithms used in software testing are categorized by test type. The paper additionally attempts to relate the principal AI approaches to the types of tests they are applied to, specifically the gray-box, black-box and white-box software testing types [19]. When it comes to black-box testing, all three types of machine learning algorithms, i.e. supervised learning, semi-supervised learning and unsupervised learning, are used as clustering techniques. In addition, artificial intelligence together with neural networks plays a wide role in regression testing [20]. The most common algorithms used for software testing are Support Vector Machines (SVM), genetic algorithms, K-clustering, artificial neural networks, decision trees and naïve Bayes. Applying these algorithms makes the testing procedure leaner and more efficient [21]. Artificial intelligence also helps to cover the maximum amount of code with the fewest possible test cases [22]. Specifically, artificial intelligence is used:
– To find optimized ways to reach maximum code coverage
– To reduce the total number of test cases
– To filter test cases that cover more code
– To filter test cases that are less time-consuming but cover most of the code [23]
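The goals in the list above can be illustrated with a small sketch. It uses a plain greedy set-cover heuristic, which is a stand-in chosen for brevity and not one of the AI techniques surveyed here; the function name `greedy_minimize` and the input format (a mapping from test name to covered line numbers) are assumptions made for this example. The heuristic does not guarantee the smallest possible suite.

```python
def greedy_minimize(coverage):
    """Pick test cases greedily until every coverable line is covered.

    coverage: dict mapping a test name to the set of code lines it covers
    (hypothetical input format, assumed for this sketch).
    """
    remaining = set().union(*coverage.values()) if coverage else set()
    selected = []
    while remaining:
        # Choose the test that covers the most still-uncovered lines;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(coverage), key=lambda t: len(coverage[t] & remaining))
        gained = coverage[best] & remaining
        if not gained:
            break
        selected.append(best)
        remaining -= gained
    return selected


suite = {
    "test_login": {1, 2, 3, 4},
    "test_logout": {3, 4},
    "test_profile": {5, 6},
    "test_smoke": {1, 5},
}
print(greedy_minimize(suite))
```

In this toy suite, two of the four tests already cover all six lines, which is the kind of reduction the list describes.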
In the software development life cycle (SDLC), the software testing phase is very time-intensive. Therefore, it is necessary to cover as much of the project as possible in the least possible time with the minimum number of test cases [24]. This is where artificial intelligence comes to the rescue by identifying the pertinent tests. The Bug Reporting System (BRS) plays a vital role during the SDLC in monitoring all critical bug reports. The author in [25] proposed an improvement to the current BRS that uses intelligent techniques based on artificial intelligence to detect duplicate bugs. The author in [26] also presents an approach that uses NLP and machine learning to improve bug reporting. Bug reporting has an adverse effect on the performance of SQA. The author in [27] discusses how artificial intelligence can help in testing graphical user interfaces.
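To make the idea of duplicate bug detection concrete, the following minimal sketch compares report summaries by bag-of-words cosine similarity. This is a deliberately simple stand-in and not the method of [25]; the function names and the threshold of 0.6 are assumptions chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case word counts of a bug report summary."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_duplicates(new_report, existing, threshold=0.6):
    """Return (score, report) pairs at or above the threshold, best first."""
    new_vec = tokens(new_report)
    scored = [(cosine(new_vec, tokens(r)), r) for r in existing]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


known = [
    "App crashes when saving a file",
    "Login button is not visible on mobile",
]
print(find_duplicates("App crashes when saving file", known))
```

A production system would more likely rely on learned text representations, but the flow is the same: score a new report against existing ones and flag candidates above a threshold.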
5
Discussion and Analysis
The use of AI in software development is still in its infancy, although it continues to move in the direction of standalone testing. The application of AI in software testing tools is intended to facilitate the software development lifecycle. Through the application of reasoning, problem solving and, in some cases, machine learning, AI can be used to automate and reduce the number of mundane and tedious tasks in development and application testing. AI shines in software development when it is used to remove these limitations so that software test automation tools can add even more value for developers and testers. The value of AI comes from reducing the direct involvement of the developer or tester in the most everyday tasks. Therefore, more and more developers are switching to automation. Besides automation testing, robotic process automation also seems to be a handy technology for the quality assurance industry.
6
Results
The questionnaire was shared via Google Forms with Information Technology (IT) professionals working in different domains and using different technology stacks. The questionnaire was divided into three sections. The first section consisted of questions about general information on the organization and the individual's working experience in the software quality assurance field. The second section addresses whether or not SQA processes leverage AI to become leaner. The third and last section highlights which AI algorithms are popularly used for SQA engineering.
Fig. 2. Current designation of the respondents
Results are calculated based on the responses of 104 respondents. Figure 2 shows the current designation of the individuals who filled in the questionnaire. The results show that 30% of the respondents are software quality assurance experts, 15% are DevOps experts, 25% are data scientists and the remaining 30% are software engineers. Another question was related to the number of years of work experience the individuals had. Figure 3 shows how experienced the respondents are: 50% of them had about 2–4 years of experience, 15% were at senior level, 15% were at junior level and the remaining 20% were trainee engineers. The third question asked about the gender of the respondent. The next question in the survey was “Do you think automated testing is cheaper than Manual testing in terms of time and cost?” Figure 4 shows the percentages of the different answers to this question. The results show that 90% of the individuals agree that automated testing is cheaper than manual testing procedures. Another important question from this section is “Does AI help in reducing number of test cases required?” The distribution of responses is shown in Fig. 5. 85% of the respondents agreed that AI reduces the number of test cases written for testing, 10% disagreed and the remaining 5% were not sure about the answer.
Fig. 3. Work experience of the respondents
Fig. 4. Automated testing vs manual testing
Fig. 5. Does AI reduce test cases
The last section of the survey was intended to identify which of the prominent AI algorithms are being used for testing. Figure 6 shows that 42.1% of the individuals say that decision trees are being used, and 26.3% responded that ANNs are widely used for testing. Although RPA has only recently come onto the market, 21.1% of the individuals say that RPA is widely used for SQA processes, and the remaining share of about 10% says that the clustering techniques of machine learning are widely used. The qualitative method was also used here. Three interviews were conducted in person and the other five by phone. We interviewed these eight software quality assurance engineers for almost 25 min. Through these interviews, we learned that SQA engineers who are below 30 years of age and have recently graduated know more about how artificial intelligence can help to improve software quality assurance processes and techniques. They are more open to change than the SQA engineers who have been using traditional SQA methods for a long time.
Fig. 6. AI Algorithms being used in SQA
7
Conclusion
Artificial intelligence has become an integral part of every industry that operates in today's world. It has already indicated that it can achieve better outcomes in the software testing industry as well. AI-driven testing will lead to a new era of QA work sooner rather than later. It will oversee and control the vast majority of testing areas, enhance the testing results and produce more precise outcomes in a competitive time frame. There is no doubt that AI will impact the QA and testing industry and will lead it going forward. From our findings, we conclude that artificial intelligence is not being used to its full potential in the SQA industry. SQA engineers working with automation tools do not know which AI algorithms are used internally. Rather, they only know how to operate those automated tools. A further finding is that organizations established more than 11 years ago hired new people for their software quality assurance teams in the past 4 years, and these were the same organizations that provide high client satisfaction. This suggests that the new generation of SQA engineers have a better idea of artificial intelligence and know well how to use AI in software quality assurance processes. The SQA engineers with more than 4 years of experience are still attached to manual testing and hold the misunderstanding that manual testing cannot be fully replaced by automation testing. In addition, our findings illustrate that young graduates know more about the latest technology trends within the domain of artificial intelligence and think outside the box to bring creativity to the organization they work for.
8
Future Work
In the future, it is intended to look into the more emerging and advanced fields of artificial intelligence, such as deep learning, and into the deep learning algorithms that have significance in the software testing industry. It will be really interesting to explore and compare how deep learning is better than traditional artificial intelligence algorithms and what the challenges of applying deep learning to Software Quality Assurance (SQA) are. In addition, it is to be explored how neural networks will result in cost reduction, speed-up of quality assurance processes and efficiency gains. It is expected that AI will take a vital role in software testing in the long run. The new job role for the tester will be to gain expertise in refining the AI models and computation strategies so that they become smarter. Artificial intelligence methodologies will likewise interface with new technologies in the future (such as cloud technologies, the Internet of Things, Big Data and others) and will extract the best-practice strategies that suit the customer application in order to obtain more exact and smarter test cases and produce good results. Deep learning, along with NLP and other techniques, will play a significant role in software testing and will have specific tools (software and hardware) for use throughout the software testing life cycle (STLC). With the increasing amount of data that individuals produce each day, the concept of Big Data has emerged. It will therefore be worthwhile and interesting to explore how artificial intelligence in software quality assurance processes will handle the implications of Big Data. The structured analysis of Big Data tools to discover patterns and information clusters is yet another interesting and challenging topic to explore and analyse. Furthermore, Robotic Process Automation (RPA) is an emerging field in the software testing industry. It will be really compelling to analyse its performance in the SQA process: in which processes RPA performs well and where it fails, what the causes of the failures are, and how to overcome them. Another motive is to cover more studies on testing areas that have not been covered in this research paper. The questionnaire used for this survey has been developed by the authors. However, it may be adopted by anyone who wishes to work on the topic in the future.
9
Questionnaire: Computational Intelligence in Software Quality Assurance
1. Current Designation *
– Software Engineer
– SQA Engineer
– DevOps Engineer
– Data Scientist
– Other
2. Work Experience *
– Trainee (0–6 months)
– Junior (6 months–2 years)
– Intermediate (2 years–4 years)
– Senior (4 years+)
3. Gender *
– Female
– Male
– Prefer not to say
4. Do you think automated testing is cheaper than Manual testing in terms of time and cost? *
– Yes
– No
– Maybe
5. Does AI help in reducing number of test cases required? *
– Yes
– No
– Maybe
6. If you are a QA engineer, which AI algorithms you use in Software Testing?
– ANN
– GA
– RPA
– Decision Trees
– Clustering
7. Please tell us technologies you work on? *
– Frontend
– Backend
– Full Stack
– SQA
– AI
References
1. Hourani, H., Hammad, A., Lafi, M.: The impact of artificial intelligence on software testing. In: 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), pp. 565–570. IEEE, April 2019
2. Sharma, D., Chandra, P.: Applicability of soft computing and optimization algorithms in software testing and metrics–a brief review. In: International Conference on Soft Computing and Pattern Recognition, pp. 535–546. Springer, Cham, December 2016
3. Mera, E., López-García, P., Hermenegildo, M.: Integrating software testing and run-time checking in an assertion verification framework. In: International Conference on Logic Programming, pp. 281–295. Springer, Berlin, July 2009
4. Kanstrén, T.: Experiences in testing and analysing data intensive systems. In: 2017 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C), pp. 589–590. IEEE, July 2017
5. Karpov, Y.L., Karpov, L.E., Smetanin, Y.G.: Adaptation of general concepts of software testing to neural networks. Program. Comput. Softw. 44(5), 324–334 (2018)
6. Li, B., Vendome, C., Linares-Vásquez, M., Poshyvanyk, D., Kraft, N.A.: Automatically documenting unit test cases. In: 2016 IEEE International Conference on Software Testing, Verification and Validation (ICST), pp. 341–352. IEEE, April 2016
7. Tan, T.B., Cheng, W.K.: Software testing levels in internet of things (IoT) architecture. In: International Computer Symposium, pp. 385–390. Springer, Singapore, December 2018
8. Yang, S., Man, T., Xu, J., Zeng, F., Li, K.: RGA: a lightweight and effective regeneration genetic algorithm for coverage-oriented software test data generation. Inf. Softw. Technol. 76, 19–30 (2016)
9. Grano, G., Titov, T.V., Panichella, S., Gall, H.C.: How high will it be? Using machine learning models to predict branch coverage in automated testing. In: 2018 IEEE Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), pp. 19–24. IEEE, March 2018
10. Sarah, C., Jane, B., Rónán, O.B., Ben, R.: Quality assurance for digital learning object repositories: issues for the metadata creation process. ALT-J 12(1), 5–20 (2004)
11. Malviya, R.: Revolutionizing Quality Assurance with AI and Automation, Infosys (2020)
12. Poth, A., Heimann, C.: How to innovate software quality assurance and testing in large enterprises? In: European Conference on Software Process Improvement, pp. 437–442. Springer, Cham, September 2018
13. Gabor, T., et al.: The scenario coevolution paradigm: adaptive quality assurance for adaptive systems. Int. J. Softw. Tools Technol. Transfer 22(4), 457–476 (2020). https://doi.org/10.1007/s10009-020-00560-5
14. Dao-Phan, V., Huynh-Quyet, T., Le-Quoc, V.: Developing method for optimizing cost of software quality assurance based on regression-based model. In: International Conference on Nature of Computation and Communication, Cham (2016)
15. Crews, B.O., Drees, J.C., Greene, D.N.: Data-driven quality assurance to prevent erroneous test results. Crit. Rev. Clin. Lab. Sci. 57(3), 146–160 (2020)
16. Lee, C., Ho, G., Choy, K., Pang, G.: A RFID-based recursive process mining system for quality assurance in the garment industry. Int. J. Prod. Res. 52(14), 4216–4238 (2017)
17. Poth, A., Beck, Q., Riel, A.: Artificial intelligence helps making quality assurance processes leaner. In: Walker, A., O’Connor, R.V., Messnarz, R. (eds.) EuroSPI 2019. CCIS, vol. 1060, pp. 722–730. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28005-5_56
18. Mahmoud, T., Ahmed, B.S.: An efficient strategy for covering array construction with fuzzy logic-based adaptive swarm optimization for software testing use. Expert Syst. Appl. 42(22), 8753–8765 (2017)
19. Li, Z., Li, M., Liu, Y., Geng, J.: Identify coincidental correct test cases based on fuzzy classification. In: International Conference on Software Analysis, Testing and Evolution (SATE), Kunming, China (2016)
20. Khuranaa, N., Chillar, R.S.: Test case generation and optimization using UML models and genetic algorithm. Procedia Comput. Sci. 57, 996–1004 (2016)
21. Ansari, A., Shagufta, M.B., Fatima, A.S., Tehreem, S.: Constructing test cases using natural language processing. In: Third International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), Chennai, India (2017)
22. Shehab, M., Abualigah, L., Jarrah, M.I., Alomari, O.A.: Artificial intelligence in software engineering and inverse: review. Int. J. Comput. Integr. Manuf. 33, 1129–1144 (2020)
23. Lachmann, R., Schulze, S., Nieke, M., Seidl, C., Schaefer, I.: System-level test case prioritization using machine learning. In: 15th IEEE International Conference on Machine Learning and Applications (ICMLA), Anaheim, CA, USA (2017)
24. AlShathry, O.: Operational profile modeling as a risk assessment tool for software quality techniques. In: International Conference on Computational Science and Computational Intelligence, Las Vegas, NV, USA (2016)
25. Saad, A., Saad, M., Emaduddin, S.M., Ullah, R.: Optimization of Bug Reporting System (BRS): artificial intelligence based method to handle duplicate bug report. In: International Conference on Intelligent Technologies and Applications, Singapore (2020)
26. Umer, Q., Liu, H., Sultan, Y.: Emotion based automated priority prediction for bug reports. IEEE Access 6(10), 35743–35752 (2018)
27. Rauf, A., Alanazi, M.N.: Using artificial intelligence to automatically test GUI. In: 9th International Conference on Computer Science & Education, Vancouver, BC, Canada (2016)
Machine Learning for Optimal ITAE Controller Parameters for Thermal PTn Actuators
Roland Büchi
School of Engineering, Zurich University of Applied Sciences, Winterthur, Switzerland
Abstract. In control theory, the ITAE criterion (integral of time-multiplied absolute value of error) is very well suited for setting the parameters of controllers, as it uses a step response and integrates the difference between the desired and the actual value, weighted over time. This criterion is to be minimized when setting the controller parameters. In the state of the art, parameters, for example for PID controllers, are found by hand or with the help of calculations or Matlab toolboxes in order to minimize the ITAE or other criteria. The method presented here uses a machine learning algorithm to search automatically for the optimal controller parameters that minimize the ITAE criterion. It can be used both in simulation and directly on the real system. Since PTn systems have to be controlled in many cases, they are used here as an example. The method was applied to a specific system, the temperature control of a thermal actuator with a small temperature chamber. With thermal actuators in particular, it is often difficult or even impossible to place the sensor directly next to the heat source. This leads to PTn plants. The method works for this specific example and, due to its flexibility, can be extended to a large number of applications in control theory.
Keywords: PID controller · Hill Climbing · Machine learning
1 Introduction
Figure 1 shows the block diagram of a PID-controlled PT3 system. A PTn system consists of n PT1 (first-order) elements connected in series. Such systems are found very often, especially in process engineering and in general mechanical engineering. Figure 5 further below shows their step response when they are not controlled. The time lag, which is a characteristic property of PTn systems, is clearly visible there. Figure 2 shows how the ITAE criterion is to be understood. It is the integral of the absolute deviation of the step response from the step (desired value), weighted with time: if the error persists, the criterion grows faster over time.
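As a small illustration of the time weighting, the sketch below evaluates the ITAE numerically for a sampled step response. The function name `itae`, the trapezoidal rule and the first-order test signal are assumptions made for this example and are not taken from the paper.

```python
import math


def itae(times, outputs, setpoint=1.0):
    """Approximate the integral of t * |e(t)| with the trapezoidal rule."""
    total = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        f_prev = times[k - 1] * abs(setpoint - outputs[k - 1])
        f_curr = times[k] * abs(setpoint - outputs[k])
        total += 0.5 * (f_prev + f_curr) * dt
    return total


# Test signal: y(t) = 1 - exp(-t) has the error e(t) = exp(-t); the integral
# of t * exp(-t) over [0, infinity) is exactly 1, so the result should be close to 1.
ts = [0.001 * k for k in range(30001)]  # 0 s to 30 s
ys = [1.0 - math.exp(-t) for t in ts]
print(itae(ts, ys))
```

Because the integrand is multiplied by t, an error that is still present late in the response contributes far more than the same error right after the step.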
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 137–145, 2022. https://doi.org/10.1007/978-3-030-82196-8_11
Fig. 1. Complete block diagram of a PID controlled PTn (as example PT3) system
Fig. 2. ITAE criterion of a closed-loop system as shown in Fig. 1.
At first glance, it appears to be relatively easy to find control parameters for PTn control systems. What is explained and applied in the following, however, is an automated machine process that calculates and finds the optimal parameters for any control system occurring in theory and practice, generally regardless of which optimization criterion is used, for example minimization of the IAE (integral of absolute error), the ISE (integral of squared error) or the ITAE. Furthermore, any control output limitation occurring in theory and practice can also be taken into account. The process is tested in practice using a thermal actuator as an example.
2 Model of the Plant
The present thermal actuator is used for a small temperature chamber, in which a specific temperature profile must be passed through as fast as possible. It is heated by resistance heating. Water cooling is used so that it can be cooled down quickly to the ambient temperature. As with many thermal actuators, the temperature cannot be measured directly at the resistance heating. Therefore the model does not have a PT1 characteristic, but is of the PTn type. Figure 3 shows a photograph of the system and Fig. 4 shows a physical model of it.
Fig. 3. Thermal system
Fig. 4. Heat transfer model for 2nd order model
i.) $\dot{Q}_1 = \alpha \cdot A \cdot (T - T_1)$
ii.) $\dot{Q}_2 = \alpha \cdot A \cdot (T_1 - T_2)$
iii.) $m \cdot c \cdot \dot{T}_1 = \dot{Q}_1 - \dot{Q}_2$
iv.) $m \cdot c \cdot \dot{T}_2 = \dot{Q}_2$
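As a plausibility check on the model order, equations i.) to iv.) can be combined into a transfer function from the temperature $T$ on the heating side to the measured temperature $T_2$. The abbreviation $\tau = m c / (\alpha A)$ is introduced only for this sketch, and both thermal masses are assumed to share the same $\alpha$, $A$, $m$ and $c$, as the notation of the equations suggests.

```latex
\begin{align}
\tau\,\dot{T}_1 &= (T - T_1) - (T_1 - T_2), &
\tau\,\dot{T}_2 &= T_1 - T_2, \qquad \tau = \frac{m c}{\alpha A} \\
(\tau s + 2)\,T_1(s) &= T(s) + T_2(s), &
(\tau s + 1)\,T_2(s) &= T_1(s) \\
\frac{T_2(s)}{T(s)} &= \frac{1}{\tau^2 s^2 + 3\tau s + 1}, &
s_{1,2} &= \frac{-3 \pm \sqrt{5}}{2\tau}
\end{align}
```

Both poles are real and negative, which is consistent with a PT2 behavior, although under these assumptions the two time constants are not identical.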
The heat transfer model leads to a 2nd or 3rd order system. Since the poles of the model are real, this results in PT2 or PT3 behavior. Strictly speaking, this is a system with distributed parameters, and modeling it as PT2 or PT3 is only an approximation. If the temperature sensor is far away from the heating coil, systems of higher order will also result. The modeling or identification is not discussed further here because it is a solved problem [6]. For the sake of completeness, however, it should be explained how the model can be derived directly from the measured step response. If the parameters from Fig. 5 are identified in a measured system, the order n and the time constants of the n series-connected PT1 elements can be determined using the known table in Fig. 5 [6]. The higher the ratio Tu/Tg, the higher the order of the system.
Fig. 5. Step response of a PTn-system
It is also well known from practice and theory [5, 8] that systems with a higher order, and therefore also a greater dead time Tu, are more difficult to control. PID controllers in particular are only suitable for controlling such systems up to a certain order, for example n < 5. A block diagram of the simulated control loop for a PT3 system is shown in Fig. 1. Finding the optimal parameters for a PID controller is difficult for the control systems that occur in practice, since the optimization must be carried out over three parameters for P, I and D. This optimization is treated in various sources, either with the help of Matlab toolboxes or with other calculations [1–4, 7, 9]. In the following, the calculation of the parameters with the ITAE criterion will be examined in more detail, first with a state-of-the-art approach and then with one from the field of artificial intelligence.
3 Control Criterion and State of the Art Calculation of the Control Parameters
When controlling with PID, it would make sense for higher orders to be able to calculate the optimal parameters of the proportional, integral and differential components. However, because there is an infinitely large set of stable control parameter tuples and no analytical calculation exists, this turns out to be difficult. Various criteria are used today to determine the best behavior in the closed loop. These are the IAE criterion (integral of absolute error), the ITAE criterion (integral of time-multiplied absolute value of error) and the ISE criterion (integral of squared error). In the IAE criterion, the controlled variable is subtracted from the reference variable (= error e) and its absolute value is integrated over time, so the area shown in Fig. 2 is minimized. With the ISE criterion, it is not the absolute value but the square of the error that is integrated, so large errors are penalized more heavily.

$$I = \int_0^{\infty} |e(t)| \cdot t \, dt$$

The ITAE criterion (formula above) multiplies the absolute value of the error by the time and integrates this over time. Thus, deviations are penalized more the more time has passed. This criterion is also called the L1 norm and is often used to set the control parameters optimally, in particular for the step response that is frequently used in control engineering. In this section, the calculation of the parameters with this criterion is examined in more detail. In theory, the control parameters can be calculated by simulating all parameter combinations P, I, D (Kp, Ti, Td) with Simulink and calculating the ITAE criterion in each case. At the end, the parameters with the smallest criterion are used as the result. The parameters found in this way are then implemented and tested directly on the real system. Figure 6 shows an area with the control parameters found for a PT3 system with three series-connected PT1 elements with a time constant T = 1 s, corresponding to Fig. 1 (where T = 8 s). The D component is fixed at its calculated optimum, Td = 0.7. In the simulation, the D component (Td) was increased in steps of 0.1, and the P (Kp) and I (Ti) components in steps of 0.5 each. The lowest value of the ITAE criterion results with Kp = 8.5, Ti = 8.5 and Td = 0.7. A sufficiently large control output limitation of a factor of 10, i.e. ±10 in the positive and negative direction with a stationary final value of 1, was assumed as a condition.
Fig. 6. Calculated parameters for the control of a PT3 Plant according to ITAE.
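The exhaustive procedure described above can be sketched in a few lines. The code below is a minimal illustration and not the author's Simulink model: it assumes an ideal PID law u = Kp·(e + (1/Ti)·∫e dt + Td·de/dt), explicit Euler integration, a controller output clamped to ±10, no anti-windup and no derivative filter. The names `simulate_itae` and `grid_search` and the coarse grid are hypothetical, so the result is not expected to reproduce the values in Fig. 6.

```python
def simulate_itae(kp, ti, td, T=1.0, u_max=10.0, t_end=20.0, dt=0.01):
    """Unit-step response of a PID-controlled PT3 plant; returns the ITAE value."""
    x1 = x2 = x3 = 0.0          # outputs of the three PT1 elements
    integral = 0.0
    prev_error = 1.0            # error right after the step, avoids a derivative kick
    cost = 0.0
    for k in range(int(round(t_end / dt))):
        t = k * dt
        error = 1.0 - x3
        integral += error * dt
        derivative = (error - prev_error) / dt
        prev_error = error
        u = kp * (error + integral / ti + td * derivative)
        u = max(-u_max, min(u_max, u))      # control output limitation
        # Derivatives of three identical first-order lags in series (PT3).
        d1 = (u - x1) / T
        d2 = (x1 - x2) / T
        d3 = (x2 - x3) / T
        x1 += dt * d1
        x2 += dt * d2
        x3 += dt * d3
        cost += t * abs(error) * dt         # rectangle-rule ITAE
    return cost


def grid_search(kp_values, ti_values, td_values):
    """Evaluate every combination and return (best_cost, (kp, ti, td))."""
    results = [
        (simulate_itae(kp, ti, td), (kp, ti, td))
        for kp in kp_values for ti in ti_values for td in td_values
    ]
    return min(results)


best_cost, best_params = grid_search([2.0, 5.0, 8.0], [3.0, 6.0, 9.0], [0.3, 0.6, 0.9])
print(best_cost, best_params)
```

Even this coarse 3 × 3 × 3 grid needs 27 closed-loop simulations, which hints at how quickly a fine grid becomes expensive.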
The problem with setting the control parameters of PID controllers is generally that this method can only be used to a limited extent, in theory or in simulation. The three parameters span a 3-dimensional space. If each parameter is incremented from 1 to 10 in steps of 0.1, the result is on the order of (10 · 10)^3 = 1 000 000 simulations. Since thermal systems in practice usually have time constants in the range of seconds, in contrast to electrical components, it is not possible to apply this calculation method directly to the real system, and even the Simulink simulation requires several days with today's computer systems. This is how systems are handled with state-of-the-art control technology: they are first modeled physically and then simulated in the above way. Therefore, a new approach is required in which the parameters of the PID controller can be found faster and directly on the real system, without simulation.
4 Calculation of the Control Parameters with ‘Hill Climbing’, an Approach from the Field of Artificial Intelligence
This approach can be used in simulation as well as in the direct calculation on the real system. The parameters of the controller are changed according to the following rule: with each new calculation or measurement, the minimum step size of Kp, Ti and Td is multiplied by a random factor (+1, −1 or 0) and added to Kp, Ti and Td. Then the ITAE criterion is recalculated. If the criterion has improved, the new parameters are kept; otherwise the original state is restored. The core of the algorithm was implemented as a Matlab sequence with a Simulink simulation of a controlled PT3 system.
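The update rule can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the author's Matlab code: the function name `hill_climb`, the search interval [0, 10], the fixed random seed and the stand-in criterion `toy_cost` are all hypothetical. On a real system, `cost` would return the ITAE of a simulated or measured closed-loop step response.

```python
import random


def hill_climb(cost, start, step=0.1, iterations=500, lower=0.0, upper=10.0, seed=0):
    """Random-sign hill climbing over a parameter tuple such as (Kp, Ti, Td)."""
    rng = random.Random(seed)
    params = list(start)
    best = cost(params)
    for _ in range(iterations):
        # Add step * (+1, -1 or 0) to every parameter, clamped to the interval.
        candidate = [
            min(upper, max(lower, p + step * rng.choice((1, -1, 0))))
            for p in params
        ]
        value = cost(candidate)
        if value < best:
            params, best = candidate, value   # keep the improvement
        # otherwise the previous parameters remain in place (state restored)
    return params, best


# Stand-in criterion with a single minimum at (5, 7, 0.6), used only so that
# the sketch runs without a plant model.
def toy_cost(p):
    return (p[0] - 5.0) ** 2 + (p[1] - 7.0) ** 2 + (p[2] - 0.6) ** 2


params, value = hill_climb(toy_cost, start=(1.0, 1.0, 1.0))
print(params, value)
```

Each iteration costs exactly one evaluation of the criterion, so the number of step responses that have to be simulated or measured equals the chosen number of iterations.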
Compared to a complete calculation over the entire three-dimensional space of the parameters Kp, Ti and Td, this method has one advantage and one disadvantage, with the advantage strongly predominating from the author's point of view.
Advantage: Without Modeling, Directly at the Real System
The advantage is that the local minima of the ITAE criterion can be reached with a very limited number of calculation steps. In the specific case of a three-dimensional space with values in the interval [0 … 10] and a step size of 0.1, which is sufficient for the application, only a few hundred simulations or calculations on the real system are required. Therefore, in contrast to the complete calculation of the state-of-the-art method in Sect. 3, the method is very well suited to finding the parameters directly on the real system without modeling.
Disadvantage: Local Minima
The disadvantage arises from the term ‘local minima’. A property of the ‘hill climbing’ method is that, although it converges quickly, it only finds local maxima or minima. For the present problem, however, this can be handled relatively well by means of suitable starting values and with the aid of a relatively coarse-meshed calculation as in Fig. 6. It would certainly be possible to do more research on this. In practice, however, the absolute minima can be found very easily. In principle, it would be conceivable to calculate optimal PID parameter sets in this way for PTn systems with different values of n, as long as the systems can be sensibly controlled.
5 Optimal PID Parameters According to ITAE in Simulation and Tested on the Real System

The theory is applied to a concrete system with a PTn plant that occurs in practice. For this purpose, the thermal system of a small thermal chamber described above is used. According to the identification in Sect. 2, the step response of the plant to a jump from 0 to 5 V at the input shows PT3 behavior with three identical PT1 elements with time constants of 8 s. A maximum of 10 V is possible at the input, which corresponds to a control output limitation of a factor of two at the operating point. The PID controller parameters according to the ITAE criterion were first calculated with the Hill Climbing method discussed above, using a simulation of the PT3 system. The optimal parameters for the ITAE criterion are Kp = 5, Ti = 70, Td = 6. These values found in the simulation were then tested on the real system and result in the closed-loop step response shown in Fig. 7. The step response appears slow, but this is because the system only has a relatively low control output limitation. The rise in temperature corresponds to the maximum possible in practice.

Replacing the preceding identification and simulation with an automated ‘hardware in the loop’ configuration was also tested. In this variant, the controller parameters are calculated directly on the system. The controller runs in Matlab and acts on the real thermal actuator. The optimal controller parameters are calculated using the Hill Climbing method, and the ITAE criterion is likewise summed up numerically. This method yields the same controller parameters. The method can therefore be applied directly to the system without having to rely on a simulation.
R. Büchi
Fig. 7. Step response of the closed loop system, real thermal actuator
6 Discussion and Outlook

With this method it is not necessary to first identify a PTn system and calculate the parameters in simulation. The parameters can be calculated directly on the real system, entirely without simulation. This also has the advantage that no simulation environment is required in practice. Rather, the same hardware and software with which the system is ultimately controlled can be used for the calculation. Since many applications in this area have a control system available but neither a simulation environment nor Matlab, the method can be used very well in practice. The only requirement is that the Hill Climbing method as shown above can be programmed. It is a prerequisite that the overall system remains stable or is not damaged in the event of possible instability.

It is also conceivable that corresponding controller parameter sets are calculated in advance and made available. A variety of different cases must be taken into account, for example different limits of the control output as well as different PTn plants. In addition, sets for PI controllers are also required in practice, as a PID controller cannot be programmed on all systems. This would have the advantage that, in the future, these parameters could be read directly from tables once the order n of the controlled plant has been identified.
Evaluation of Transformation Tools in the Context of NoSQL Databases

Sarah Myriam Lydia Hahn, Ionela Chereja, and Oliviu Matei

Technical University Cluj Napoca, Cluj-Napoca, Romania
[email protected]
Abstract. With the rise of big data and enhancements in cloud technology, a new kind of database arose: NoSQL databases. Along with these new databases, load processes changed from Extract-Transform-Load (ETL) to Extract-Load-Transform (ELT). Therefore, new transformation tools are needed that meet the general requirements of a transformation tool and support NoSQL databases. Until now, no evaluation is available that reviews and analyses the existing transformation tools. In this context, not only relevant work is presented but also the most popular NoSQL databases and the requirements of transformation tools. Furthermore, the best-known transformation tools fitting the requirements are introduced and evaluated. This work is not only a market overview of a new kind of transformation tool but also a decision guide for choosing the right tool.

Keywords: ELT · NoSQL · Transformation · Cloud technology · BASE model · Scalability · Consistency

1 Introduction
Over the past few years, more and more data has been collected. Estimates from 2018 put the data volume in 2020 at 35 zettabytes [20]. More recent estimates predict a volume of 59 zettabytes in 2020. Due to exponential growth, the data volume is expected to increase further to 149 zettabytes in 2024 [31]. The publicly known term for this kind of data is big data. Following the definition of Gartner, big data can be described by the 3Vs: volume, velocity and variety [9]. The increase in volume is caused not only by a growing number of data sources but also by growing file sizes. Velocity specifies the speed of data generation. Continuous data generation leads to a state where processing in batches is no longer feasible. Instead, data should be processed in real time using real-time data streams. Examples of real-time data generation are social media and sensor data. Last but not least, variety characterizes the different types of data structure: structured, semi-structured and unstructured. Flat files are structured data. In contrast, books and movies are unstructured documents. Over time the definition was expanded to the 6Vs. The added attributes are veracity, viability, and value [15]. Veracity means that the data is truthful and reliable. Data of poor quality is not useful for further processing, as it can result in false predictions. The fifth V stands for viability: the attributes used have to be relevant for the use case. Finally, the usage of the data should generate value. Otherwise the data is not worth collecting.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 146–165, 2022. https://doi.org/10.1007/978-3-030-82196-8_12

The big challenge is to generate new insights from the newly accessible data. Such insights can provide an advantage over competitors [57]. In the past, relational database management systems (RDBMS) were used for data preparation and analysis. RDBMS have their origin in the 1970s [23]. Today the volume of big data is too large to handle within RDBMS [51]. Instead, there are new technologies in the field of cloud technology such as NoSQL databases. Big data itself partly has its origin in the internet and in cloud technology. The requirements of big data are accompanied by a new kind of database. With new possibilities in cloud computing such as resource pooling and rapid elasticity [45], various new database management systems were developed. So-called NoSQL databases (“not only SQL”) try to eliminate the disadvantages of RDBMS by leveraging cloud technology. They offer fast I/O, support for big data, scalability and low cost [39]. Furthermore, they are schema-free in order to handle a variety of data [17].

Just as the mode of operation of the databases changed, the way of data load processing also changed. Extract-Transform-Load (ETL) processes were typical for RDBMS. With NoSQL databases they became Extract-Load-Transform (ELT) processes. The difference is that data is first loaded into the target database without any transformation. The data from the source is not filtered. This means that the extract and load steps do not have to be modified when a new transformation is introduced. Only the transformation step has to be adapted [14]. In this load process it is possible to precompute transformations or to calculate them in real time.
New database systems and a new logic of building data load processes go along with new transformation tools. First of all, the tools have to fit the databases. Compared to RDBMS, NoSQL databases come with different properties, for example regarding consistency. These properties also have an impact on the transformation tools and on the transformations themselves. Furthermore, the transformation tools have to meet new requirements such as processing in real time. This paper gives an overview of the most important transformation tools in the field of NoSQL databases. Until now there are several evaluations regarding NoSQL databases [19,25], but there is no evaluation concerning transformation tools in the field of NoSQL databases. By analysing the most important properties, the advantages and disadvantages of each tool are shown, and a comparison is provided in table form. All in all, this should support decision-making in finding the best database system and transformation tool for each process. It also provides a market overview of the most popular tools, which does not exist yet.

The structure is the following: the next section, “Related Work and Explanations”, provides concepts and further definitions as the basis for the evaluation. In the third section, “Methodology”, the approach of the analysis itself is described. The “Evaluation” of the transformation tools follows in section four. Last but not least, the “Conclusion” of the evaluation is presented.
2 Related Work and Explanations

In this chapter the related work and definitions are presented. They are the basis for the following survey.

2.1 NoSQL
Cloud computing comes with new use cases and requirements towards DBMS. NoSQL databases were developed for this purpose because RDBMS cannot meet these demands. The name NoSQL refers to the attributes and SQL features that are missing compared to RDBMS [39]. The main focus is on data partitioning and duplication. Often this goes along with distributed servers in the cloud [41].

The first requirement is high performance. To achieve it, NoSQL databases have a cache layer and work in-memory. Nevertheless, persistence is important: loss of data caused by a power failure is unacceptable. To meet this requirement, the data is copied from an in-memory database to a disk database. In view of big data, mass data storage is needed. This can be implemented with a distributed server architecture in the cloud. Last but not least, another requirement is high availability. With replication to a second disk, a fast recovery of the database is possible. Another concept for availability is one master and several slave nodes in a distributed system [28].

There are many different NoSQL databases with different data models and implementations to meet these demands [27]. The following subsections focus on the different data models. The data model should be chosen depending on the application and its queries and modifications [71].

Column-Oriented Database. In a column-oriented data model the data is stored in tables, but the tables cannot be linked. This results from the way the data is stored: the data of each column is stored separately and has a separate index. I/O can be reduced by querying only the columns needed. Queries can also be executed in parallel if they access different columns. If there are many data sets with similar values, the compression rate is very high. This is especially useful in a data warehouse [27]. Also very useful for a data warehouse is the fast aggregation per column [35].

If columns in data sets belong together, like city and country, they can be defined as a column family. Columns inside a column family are physically stored together. When the columns of a column family are queried together, performance increases [26]. Further advantages are acceptance of short variation, versioning with timestamps on data modification, fractional data access and fast data operations [71].
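As a minimal illustration of the column-oriented layout, the following Python sketch stores the same records once row by row and once column by column, aggregates a single column, and run-length encodes a column with repeated values. The sample records and the helper names `to_columns` and `run_length_encode` are assumptions made for this example. Real column stores use far more elaborate storage formats.

```python
from itertools import groupby

# Row-oriented view: one dictionary per data set (sample data, assumed)
rows = [
    {"city": "Cluj-Napoca", "country": "Romania", "amount": 10},
    {"city": "Cluj-Napoca", "country": "Romania", "amount": 5},
    {"city": "Berlin", "country": "Germany", "amount": 7},
]


def to_columns(records):
    """Turn a list of row dictionaries into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Compress consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
total_amount = sum(columns["amount"])       # reads only the one column needed
encoded_country = run_length_encode(columns["country"])
```

The aggregation touches only the `amount` list, and the repeated country values collapse into two pairs, which mirrors the reduced I/O and the high compression described above.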
Document Store. Document databases also store data in a key-value structure. In contrast to the key-value data model, the value is stored in JSON or XML format. Such a value does not hold a single value only but can contain a lot of information. Furthermore, secondary indexes are supported [27]. A document store can also be used as a kind of content management system (CMS) by storing a document or text as a value. The content of the value does not have to be structured. The attributes in the JSON or XML file can vary completely [71]. A potential use case is receipts from a point of sale. Document databases often integrate a full-text database [26].

Graph Database. In the graph data model the focus is on the relations between the entities instead of the entities themselves. A relationship can be dynamic or static. Use cases are social media, airlines or access control for a large number of users [26,71]. As in graph theory, there are vertices and edges. A vertex is an entity instance with an arbitrary number of attributes, e.g. destinations with their time zone and the letter code of the airport. An edge represents the relationship between two vertices, for example the flight duration [12].

Key-Value Store. In the key-value data model a value always refers to a key. Depending on the database system, the keys are ordered or not. There is also a difference in the storage medium: in-memory or hard disk [26]. This results in faster query execution and other operations compared to RDBMS. The structure is very simple, and not all data can be stored in it. But it allows mass data to be stored [27]. The response time allows data processing in real time. Distributed servers can also be used for scalability and high availability. Applications like session management or online bidding are potentially good use cases for a key-value database [71].

Multivalue Store. As the name indicates, multivalue databases can store multivalued or composite attributes. A composite attribute is the combination of two attributes, e.g. first name and last name. A multivalued attribute can have more than one value in a column, e.g. the hobbies of a person. A person can have several hobbies and thus several values for one attribute [12].

Object Database. The object data model uses capabilities of object-oriented programming languages. Objects can be created and modified in the database itself. Depending on the database, an object-oriented or a database-specific programming language is supported. The advantage is that there is only one data model, in the database and in the program code. Use cases are computer-aided design or multimedia applications [26].

2.2 ELT
The change from RDBMS to NoSQL databases also has an impact on the data integration processes. There is a shift from ETL processes to ELT processes. As the name suggests, there are still three stages, but their order changes from extract, transform and load to extract, load and finally transform [14].

The extract stage is the same as in the ETL process: the data from different data sources is extracted to a working area. The difference is that not only a required subset of data is loaded but all the data from the source [14]. An exception is the filtering of unneeded information [64]. In the second stage the extracted data is loaded without any transformation into the target database, e.g. a data warehouse, which then contains all the raw data from the sources [14]. Last but not least, transformations are made and business logic is applied inside the target database. For these transformations the resources, processing tools and local drivers of the target database can be used. By using cloud infrastructure, entry costs are low. The cost of processing big data is also lower [64].

This approach has advantages and disadvantages. The process only needs the data sources and the target environment. No extra server, technology or other skills are needed. In the target database, scalability as well as the management of the infrastructure by a cloud vendor can be used for optimal performance [64]. Another advantage is that new requirements can be implemented at low cost because the extract and load stages do not need any adaptation. Only new transformations have to be implemented [14]. The network is only used for loading the data from the working area to the target. All subsequent computations are executed on the target system. This decreases network congestion [53].

But of course there are also disadvantages. Because the ELT approach is much newer than ETL, there are not as many tools on the market [14]. In addition, there are not only fewer tools but also fewer developers who are familiar with this new strategy of data integration.
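The three ELT stages described in this section can be sketched in a few lines of Python. The sketch uses the `sqlite3` module from the standard library purely as a stand-in for the target database, because it runs without any setup. The sample rows, the table names `raw_sales` and `sales_by_city`, and the function name `run_elt` are assumptions made for illustration.

```python
import sqlite3

# Extract: rows as they arrive from a (hypothetical) source system
source_rows = [
    ("2021-01-01", "berlin", 12.5),
    ("2021-01-01", "Berlin", 7.5),
    ("2021-01-02", "cluj", 3.0),
]


def run_elt(rows):
    """Load the raw rows unchanged, then transform inside the target database."""
    db = sqlite3.connect(":memory:")
    # Load: no filtering and no transformation on the way in
    db.execute("CREATE TABLE raw_sales (day TEXT, city TEXT, amount REAL)")
    db.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)
    # Transform: executed by the target system on the already loaded raw data
    db.execute(
        "CREATE TABLE sales_by_city AS "
        "SELECT lower(city) AS city, SUM(amount) AS total "
        "FROM raw_sales GROUP BY lower(city)"
    )
    return db


db = run_elt(source_rows)
raw_count = db.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
result = db.execute(
    "SELECT city, total FROM sales_by_city ORDER BY city"
).fetchall()
```

A new requirement, for example a daily total, would only add another transformation statement. The extract and load steps stay untouched because the raw table already holds all source data.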
Furthermore, this approach was developed for large amounts of data. Like cloud technology as a whole, it is not useful for small amounts of data [53]. It must also be taken into account that the loading processes of the data sources cause a higher system load.

2.3 BASE Model
In the literature NoSQL databases are often described as BASE systems. It has to be mentioned that not every NoSQL database is a BASE system; this depends on the properties of each database. The BASE model is a popular counterpart to the ACID model. BASE stands for basically available, soft state and eventual consistency. Basically available means that whenever the database is accessible, read and write operations are available, but they do not come with consistency guarantees. Soft state describes that for some time the state may be unclear because a transaction may not yet have converged. This implies the possibility of inconsistency. After a certain time period the data will be consistent, so operations are eventually consistent depending on the time and the state [52].
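A toy simulation can make soft state and eventual consistency more tangible. In the following Python sketch a write is acknowledged by the primary node immediately and reaches the replica only when `sync()` is called. The class names `Replica` and `EventuallyConsistentStore` are assumptions for this illustration, which does not model the behaviour of any particular database.

```python
class Replica:
    """A node that simply holds key-value data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """The primary applies writes at once; the replica receives them later."""

    def __init__(self):
        self.primary = Replica()
        self.replica = Replica()
        self.pending = []   # replication log not yet applied to the replica

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read(self, key, from_replica=True):
        node = self.replica if from_replica else self.primary
        return node.data.get(key)

    def sync(self):
        """Apply all outstanding writes to the replica (convergence)."""
        for key, value in self.pending:
            self.replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("stock", 5)
stale = store.read("stock")     # soft state: the replica has not converged yet
store.sync()
fresh = store.read("stock")     # eventually consistent after replication
```

Between the write and the call to `sync()` the store is available but may answer with outdated data, which is the trade-off the BASE model accepts.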
3 Methodology

3.1 Databases and Transformation Tools
There are many surveys which compare the existing NoSQL databases and their properties [19,25]. But they do not focus on how data can be transformed. In this survey selected transformation tools are evaluated. Since different databases come with different transformation tools, the most popular databases are the basis. In total there are more than 120 different databases available [69]. The following passages describe how the most popular databases were identified.

A popular ranking of databases is published as the DB-Engines Ranking. In this ranking the databases are clustered by their type. Additionally, an overall ranking is available. The ranking is based on the following six parameters. The first is the number of mentions on websites, measured by the number of search results in Google and Bing. Another parameter is the search frequency derived from Google Trends. Also important is the frequency of technical discussions, calculated from relevant questions on Stack Overflow and DBA Stack Exchange. Besides these parameters, the number of job offers on Indeed and Simply Hired is considered. Moreover, the number of LinkedIn profiles in which the database system is mentioned is counted. Last but not least, the social network relevance is measured by the number of Twitter tweets [60].

In the category of key-value stores, Redis and Amazon DynamoDB were rated highest. Graph databases are led by Neo4j and again by Amazon DynamoDB. Amazon DynamoDB also ranks second among document stores, with MongoDB ranking first. In the case of multivalue databases, Adabas, UniData and UniVerse are leading. But in the overall ranking they rank very low, so they are not considered in this evaluation. Object-oriented databases are also ranked very low and are likewise not part of the further survey. The best-ranked column store databases are Cassandra and HBase [60].
The objective of this article is to give an overview of the most important transformation tools in the context of NoSQL databases. Because there is no ranking yet, three different approaches were used to identify them. A literature search was made for scientific articles about big data or transformation tools. Furthermore, Gartner's Magic Quadrant for data integration was used. For each type of technology there is a separate Magic Quadrant. It contains the most popular players divided into four quadrants: leaders, visionaries, niche players and challengers. Leaders have a leading position in the market and also a good position for the future. Visionaries have a vision for the upcoming future but they do not have a good position yet. Niche players are not that popular in the market because of a small target group or missing innovation and focus. Challengers are doing well at the moment but are not prepared for the future [21]. Overall, for each database the focus is on the most popular transformation tools. The ranking is additionally based on the number of mentions in Google Scholar and Web of Science. Generally, third-party components or internal developments are not mentioned.
The focus is only on standard connectors. Coding interfaces like SQL are also not considered.

Databases. In the following passages the chosen databases are briefly introduced, together with the transformation tools that are part of the following evaluation. A transformation tool is only mentioned if the transformation can be made within the database.

Amazon DynamoDB is a key-value and document store developed by Amazon Web Services (AWS) and is part of the cloud platform of the same name [1]. Several transformation tools have a connector to Amazon DynamoDB, such as Domo, Fivetran, Informatica, KNIME, Safe Software, SnapLogic, Talend and TIBCO [3,72]. The three most popular of them based on the search results of Google Scholar and Web of Science are Informatica, KNIME and Talend.

Cassandra is a partitioned wide column store. The database combines the replication technique of Amazon’s DynamoDB and Google’s Bigtable data model [65]. The transformation tools HVR, IBM DataStage, Informatica, Pentaho, RapidMiner, SnapLogic, Talend and TIBCO support Cassandra [3,72]. The articles focus on Informatica, Pentaho and HVR.

HBase is a column store based on Google’s Bigtable model. It was created to host big data with billions of rows and millions of columns [66]. IBM DataStage, Informatica, Pentaho, Talend and TIBCO have a standard connector to HBase [72]. The most popular of them are Pentaho, Informatica and Talend.

MongoDB is a very popular document store. Like object-oriented programming, it is based on objects. The objects themselves are saved as JSON. The cloud vendors AWS, Google Cloud and Microsoft Azure are supported [47]. Most transformation tools in this research have a connector to MongoDB: Azure Data Factory, Denodo, Domo, Fivetran, HVR, IBM DataStage, Informatica, Information Builders, Jedox, KNIME, Pentaho, RapidMiner, Safe Software, SAS Data Management, SnapLogic, Talend and TIBCO [3,72]. The focus is on Azure Data Factory, SAS Data Management and KNIME.

Neo4j is one of the leading graph databases for representing data and its relationships. The graph can also be viewed as a whiteboard for user-friendly access to the data model [50]. Neo4j is supported by the transformation tools KNIME, Talend and TIBCO [3,50,72].

Redis is an in-memory key-value store. It supports different data types such as strings, lists, sets, bitmaps, hyperloglogs and streams. It is not only a database but can also be used as a cache and message broker [56]. The transformation tool TIBCO supports the database [72].

Transformation Tools. In the next passages the chosen transformation tools are briefly presented.

Azure Data Factory is a managed, serverless data integration tool from Microsoft. It is part of the Azure cloud; therefore an Azure subscription is needed. A Data Factory can be created within the Azure Portal [46].

HVR is a data integration software. It runs in a separate distributed environment and uses log-based change data capture (CDC) technology. There are over 1500 deployments and clients in over 30 countries [32].

IBM DataStage is the main ETL tool from IBM. IBM DataStage is available for on-premises use as
well as for the cloud [33]. More than 11,000 organizations are using this tool [72].

Informatica provides several data integration products for cloud as well as on-premises solutions. These products include solutions for advanced data transformation, B2B data exchange, ultra messaging and an enterprise data catalog [34]. The following evaluation focuses on Informatica PowerCenter.

KNIME offers a platform especially for data science, visualization and a graphical workbench. Owing to its open source philosophy, there are many extensions and integrations which can be used without cost [43].

Pentaho is a data integration tool by Hitachi Vantara. There are two versions: Pentaho Community Project and Pentaho Enterprise Edition. With the free version, codeless pipeline development is possible. All other features like load balancing, scheduling, streaming, Spark or the data science tool kit require the enterprise version [30].

RapidMiner is an environment especially for data science and machine learning. It is mentioned in leading tool comparisons from Forrester, Gartner and G2 [54].

SAS Data Management is the data integration tool from SAS. SAS is a leading company in machine learning and artificial intelligence (AI), recognized by Forrester, Gartner and IDC MarketScape [58].

Talend offers not only a tool for ETL processes but also tools for data integrity, governance, and application and API integration. In the scope of data integration it offers Pipeline Designer, Data Inventory, Data Preparation and Stitch [63].

TIBCO has a product portfolio for data integration, unifying data and data prediction. Data transformation can be done with TIBCO Jaspersoft, cloud connection with TIBCO Cloud Integration. The different products can also be combined [68].

Table 1 gives a short overview of which transformation tool can be used with which database.

Table 1. Combination possibilities of NoSQL databases and transformation tools

Database        | Azure Data Factory | HVR | IBM DataStage | Informatica | KNIME | Pentaho | RapidMiner | SAS Data Management | Talend | TIBCO
Amazon DynamoDB | no                 | no  | no            | yes         | yes   | no      | no         | no                  | yes    | yes
Cassandra       | no                 | yes | yes           | yes         | no    | yes     | yes        | no                  | yes    | yes
HBase           | no                 | no  | yes           | yes         | no    | yes     | no         | no                  | yes    | yes
MongoDB         | yes                | yes | no            | yes         | yes   | yes     | yes        | yes                 | yes    | yes
Neo4j           | no                 | no  | no            | no          | yes   | no      | no         | no                  | yes    | yes
Redis           | no                 | no  | no            | no          | no    | no      | no         | no                  | no     | yes
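For tool selection, the content of Table 1 can be kept in a small lookup structure. The following Python sketch encodes the yes/no values of Table 1 as strings and answers two typical questions. The encoding and the function names `tools_for` and `tools_for_all` are assumptions chosen for brevity, not part of the evaluation itself.

```python
TOOLS = ["Azure Data Factory", "HVR", "IBM DataStage", "Informatica", "KNIME",
         "Pentaho", "RapidMiner", "SAS Data Management", "Talend", "TIBCO"]

# One flag per tool in the order of TOOLS: 'y' = yes, 'n' = no (from Table 1)
SUPPORT = {
    "Amazon DynamoDB": "nnnyynnnyy",
    "Cassandra":       "nyyynyynyy",
    "HBase":           "nnyynynnyy",
    "MongoDB":         "yynyyyyyyy",
    "Neo4j":           "nnnnynnnyy",
    "Redis":           "nnnnnnnnny",
}


def tools_for(database):
    """Return the tools that Table 1 marks as usable with the given database."""
    return [tool for tool, flag in zip(TOOLS, SUPPORT[database]) if flag == "y"]


def tools_for_all(databases):
    """Return the tools usable with every database in the given collection."""
    return [tool for tool in TOOLS
            if all(tool in tools_for(db) for db in databases)]


neo4j_tools = tools_for("Neo4j")
common_tools = tools_for_all(["Cassandra", "HBase"])
```

Such a lookup is only as current as the table it is copied from, so the flags would have to be maintained when connectors are added or removed.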
4 Evaluation

The presented transformation tools are reviewed in relation to various criteria. In the subsections these criteria are explained and applied to the databases and tools. A summary of the criteria on database level can be found in Table 2. The properties of the transformation tools are briefly summarized in Table 3.

4.1 Consistency
Data quality is, amongst others, defined as consistency and comprehensiveness of data [22]. Depending on the use cases to be implemented, data quality is very important. Over 50% of business intelligence (BI) projects are not successful due to data quality reasons [24]. Transformation tools cannot improve the data quality of the source system. Depending on the database used, ACID transactions are supported or not. Whether this is a relevant property has to be decided for each use case.

Amazon DynamoDB supports two different types of consistency: eventually consistent reads and strongly consistent reads. The property can be set depending on the use case. An eventually consistent read might not reflect the result of a recently completed write operation; only after some time will a read return the new data. This property reflects the BASE model. Strongly consistent reads reflect the ACID model: a read always receives the latest data. This consistency model comes with several disadvantages. In case of a network delay or outage, an error is returned. The latency is higher and more throughput capacity is used. Furthermore, global secondary indexes cannot be used [2].

In Apache Cassandra there are several consistency levels to choose from. A consistency level defines how many replicas have to answer. The minimum is one replica, the maximum is an answer from all replicas. The fewer replicas have to answer, the lower the latency and the higher the throughput and availability. This also reflects the BASE model [65].

Apache HBase comes with two different consistency levels: strong and timeline. The replication is based on the master-slave model. All write operations first have to be made at the master and are then replicated to the other nodes. This implies that the master always has the latest data. At the strong consistency level, which is the default in HBase, read queries always get their data from the master. This level reflects the ACID model. Conversely, the database is not highly available with this property. At the timeline consistency level, read queries are first sent to the master. If the master does not respond within a given time, the query is sent to the replicas. If a replica answers instead of the master, the data can be stale. In this case a flag indicating that the data may be stale is transmitted. This property differs from the BASE model in several ways. Writes are always made by the master first, so there are no conflicts caused by write operations. The replicas are snapshots of the master and apply the write operations in the same order. Possibly stale data is flagged. Last but not least, by reading from replicas the client can go back in time [66].

MongoDB uses the BASE model: the data is not always consistent, for example in the case of a failure. With the causal consistency property, operations have a logical sequence. They are committed in the correct sequence to increase consistency. This is only supported within one thread [47].

In contrast to most NoSQL databases, Neo4j supports ACID transactions and is strongly consistent. Furthermore, the consistency can be checked when the database is stopped [50]. Redis does not support strong consistency. One reason is the
asynchronous replication. If needed, synchronous replication can be forced with the WAIT command. This results in stronger consistency, but ACID transactions are still not supported because of possible failures [56].
4.2 Scalability
Scalability of a system is relevant if the amount of data in a data flow grows or shrinks considerably, e.g. because of new data sources, or if the transformations become considerably more or less complex. In the field of cloud computing, horizontal scalability is especially relevant [62]. Whether a system is scalable depends on the underlying database and not on the transformation tool. The evaluation therefore shows what kind of scalability the databases support. The following kinds of scaling are of note:
Partitioning. There are two ways of partitioning: horizontal and vertical. Horizontal partitioning means that the data sets of one entity are stored in different partitions on different nodes. Vertical partitioning describes the storage of one or more columns in one partition. This way of partitioning is very common for column-oriented data stores. The two most common techniques for horizontal partitioning are range partitioning and consistent hashing. Range partitioning creates the partitions based on a partition key. Each partition on each server covers a defined range of the partition key. Queries that use the partition key are very fast because only a few partitions have to be accessed. On the other hand, data hot spots and load-balancing issues can occur when data is loaded from only one partition. Furthermore, a routing server is needed. Apache HBase uses this partitioning method [66]. Consistent hashing builds the partitions based on the hash of each data set. The hash values are approximately uniformly distributed. The advantage is that the partitions have nearly the same size and the partition borders do not need to be restructured. Even if a new partition is added, moving the data is easy because only the partitions next to the new one have to be split up. Queries, however, can be slow when access to every partition is needed [25]. Consistent hashing is used by Amazon DynamoDB.
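The two horizontal partitioning techniques described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used by HBase or DynamoDB: the function `range_partition`, the class `HashRing`, the node names and the use of MD5 as ring hash are all hypothetical choices, and virtual nodes are left out for brevity.

```python
import bisect
import hashlib


def range_partition(key, upper_bounds):
    """Return the index of the range partition that holds `key`.

    `upper_bounds` is a sorted list of exclusive upper bounds; keys at or
    beyond the last bound fall into one extra, final partition.
    """
    return bisect.bisect_right(upper_bounds, key)


def ring_position(value):
    """Map a string to a position on the hash ring (stable across runs)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hashing ring without virtual nodes (illustrative only)."""

    def __init__(self, nodes):
        self._ring = sorted((ring_position(node), node) for node in nodes)

    def add_node(self, node):
        bisect.insort(self._ring, (ring_position(node), node))

    def owner(self, key):
        # The owner is the first node clockwise from the key's position;
        # the modulo wraps around to the first node at the end of the ring.
        positions = [position for position, _ in self._ring]
        index = bisect.bisect_right(positions, ring_position(key)) % len(self._ring)
        return self._ring[index][1]


# Range partitioning: three partitions split at "h" and "p".
bounds = ["h", "p"]
partitions = [range_partition(k, bounds) for k in ("apple", "kiwi", "zebra")]

# Consistent hashing: adding a node only takes keys from its neighbour.
keys = [f"user{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` ends up on the newly added node, which mirrors the statement that only the partitions next to the new one have to be split up. With range partitioning, in contrast, a lookup by partition key touches exactly one partition, but skewed keys can concentrate load on it.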
The partitioning is performed automatically by the system [2]. Apache Cassandra also uses consistent hashing [65]. MongoDB supports both range partitioning and consistent hashing [47]. In a graph database like Neo4j the presented partitioning concepts do not work because of the primary key [50]. Redis does not support assisted partitioning. Instead, client-side partitioning can be implemented if needed [56].
Replication. Replicating data improves not only the performance of read and write operations but also system reliability, fault tolerance, and durability. In the following, a distinction is made between master-slave and multi-master replication. In the master-slave model, exactly one node is the master and the other nodes are so-called slaves. All writes are committed at the master node and afterwards replicated to the slaves. The master node is consistent. All nodes can be used for read requests to increase performance. When reading from a slave node, the data can be stale. In MongoDB, reads are answered by the primary node if possible. There can also be a state with more than one primary in
the case of a failure [47]. Neo4j provides master-slave replication, but not in the classic way. Write operations can be made not only at the master node but also at the slaves. A write operation using a slave node is slower because it is instantly replicated to the master node to guarantee synchronous processing [50]. Redis supports master-slave replication [56]. In contrast to other databases, Apache HBase uses a master-slave replication in which reads are not sent to all nodes but to the master node first [66]. Multi-master replication, in contrast, means that all nodes are masters and can receive write operations. The replication happens between all nodes [25]. There is no single leading node that is consistent. Amazon DynamoDB has a multi-master replication model. There can be a replica in each region, but not more than one per region. If there are conflicts, the data set with the latest timestamp wins [2]. Apache Cassandra also has a multi-master replication model. It can be defined how many replicas should be created per data center [65].

Table 2. Database properties
| Database | Consistency | Scalability | Licence and costs |
|---|---|---|---|
| Amazon DynamoDB | BASE (ACID available with disadvantages) | Consistent hashing, multi-master replication per region | Proprietary |
| Cassandra | BASE (ACID available with disadvantages) | Consistent hashing, multi-master replication | Open source |
| HBase | (almost) ACID | Range partitioning, master-slave replication | Open source |
| MongoDB | BASE | Partitioning (range, consistent hashing), master-slave replication | Proprietary |
| Neo4j | ACID | Master-slave replication | Open source (Proprietary) |
| Redis | BASE | Master-slave replication | Open source (Proprietary) |
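The conflict rule mentioned for multi-master replication, where the data set with the latest timestamp wins, can be sketched as follows. This is a minimal illustration and not DynamoDB's actual replication protocol; the names `Version`, `MultiMasterNode`, `resolve_conflict` and `synchronise`, the region names and the timestamps are hypothetical, and the sketch assumes that clocks are comparable across nodes and that timestamps never tie.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float  # time of the write; assumed comparable across nodes


def resolve_conflict(versions):
    """Return the version with the latest timestamp (last writer wins)."""
    return max(versions, key=lambda version: version.timestamp)


class MultiMasterNode:
    """Toy replica that accepts writes locally (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        self.data[key] = Version(value, timestamp)

    def merge_from(self, other):
        # For every key known to the other node, keep the newer version.
        for key, incoming in other.data.items():
            current = self.data.get(key)
            if current is None:
                self.data[key] = incoming
            else:
                self.data[key] = resolve_conflict([current, incoming])


def synchronise(nodes):
    """Exchange data between all nodes so that every node holds the winners."""
    for target in nodes:
        for source in nodes:
            if source is not target:
                target.merge_from(source)


eu = MultiMasterNode("eu-west")
us = MultiMasterNode("us-east")
eu.write("cart:7", "2 items", timestamp=100.0)
us.write("cart:7", "3 items", timestamp=105.0)   # concurrent, later write
synchronise([eu, us])
```

After synchronisation both nodes hold the later write. The earlier write is silently discarded, which is the trade-off of this rule: it is simple and always converges, but it can lose updates.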
4.3 Licence and Costs
Many NoSQL databases are open source [44], such as Apache Cassandra, Apache HBase and Redis [56,65,66]. There is also proprietary software that involves licence costs, such as Amazon DynamoDB. It has no fixed costs; instead, the costs depend on the usage of the database. The billed units are read and write operations, storage, back-up, replication, in-memory cache, streams and data transfer. A certain amount of several units is free each month [1]. In this evaluation the use cases are not further specified. Hence a fixed price cannot be assumed. MongoDB has a similar price model. The price depends on the cloud, region and resources chosen [47]. In its basic version, called community server, Neo4j is an open source database.
The enterprise version comes with a fee and extra modules such as online back-up [50]. Not only the cost of the underlying database is relevant but also the licence of the transformation tool. This information is needed to calculate the whole budget required for the development of a use case. Azure Data Factory is part of the Azure cloud and has the same price model as the other cloud services mentioned: there is no fixed licence fee but a pay-per-use price model [46]. HVR, Informatica, IBM DataStage, SAS Data Management and TIBCO are proprietary software, but their pricing is not disclosed [32–34,58,68]. The basic version of KNIME, the KNIME Analytics Platform, is free. The enterprise version, KNIME Server, requires a licence fee. This version is needed for collaboration, workflow automation and monitoring [43]. Pentaho also has two versions: the open source Pentaho Community Project and the proprietary Pentaho Enterprise Edition [30]. Likewise, RapidMiner offers two versions: RapidMiner Community and RapidMiner Enterprise [54]. Talend has a basic open source software called Talend Open Source. For the extended version, which is needed e.g. for cloud integration, there is a licence per user [63].
4.4 Transformation Types and Standard Components
A data flow contains one or more data sources, one or more transformation steps and at least one data target for the data output [6]. In the ELT approach, data source and data target are the same database instance. The focus lies on the transformation tool, thus the data source and target will not be considered. There are many common transformation types. To implement known transformation types in the environment of NoSQL databases, the transformation tool has to provide at least these common functionalities:
– Aggregation: Defined attributes are aggregated by a defined aggregation rule such as maximum, average or sum [7,16,61,74].
– Clear: False and void data is cleaned or rejected [10,61].
– Conversion: The data type or format is changed [7,16,61].
– Additional columns/Expression/Project: The data is transformed row by row with functions, e.g. with if clauses. The result is stored in new columns. Furthermore, columns can be created by splitting, e.g. by length or at a specific character. Columns can also be merged into one column [7,10,74].
– Filter: Data is filtered by defined criteria. Each criterion has to be a boolean condition. A filter can also be used to verify data against given rules [7,16,61,74].
– Join/Lookup: At least two data sources are joined by at least one attribute. The result set has attributes from several data sources [7,16,61,74].
– Merge/Union: At least two data sources with the same structure are merged [7,16,61].
– Sequence/Surrogate Key: A unique key is created. It is not a natural key. Normally it is built from a UUID or an ascending row number [16,74].
– Update: With this row-by-row transformation a row can be inserted, deleted, updated or overridden [74].
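Several of the transformation types listed above can be illustrated on plain Python records. The following is a minimal sketch of what such steps do, independent of any of the evaluated tools; the sample records, the field names and the threshold of 10.0 are made up for illustration. Merge/Union and Update are omitted for brevity.

```python
from collections import defaultdict
from itertools import count

# Hypothetical source data: order documents and a customer lookup table.
orders = [
    {"order_id": "A1", "customer": "c1", "amount": "19.90"},
    {"order_id": "A2", "customer": "c2", "amount": "5.00"},
    {"order_id": "A3", "customer": "c1", "amount": None},
]
customers = {"c1": "Berlin", "c2": "Hamburg"}

# Clear: reject rows with void data.
cleaned = [row for row in orders if row["amount"] is not None]

# Conversion: change the data type of a column (string to float).
converted = [{**row, "amount": float(row["amount"])} for row in cleaned]

# Additional column/Expression: derive a new column row by row.
flagged = [{**row, "is_large": row["amount"] >= 10.0} for row in converted]

# Filter: keep only rows that satisfy a boolean condition.
large = [row for row in flagged if row["is_large"]]

# Join/Lookup: enrich each row with an attribute from a second source.
joined = [{**row, "city": customers.get(row["customer"])} for row in flagged]

# Aggregation: sum of the amount per city.
totals = defaultdict(float)
for row in joined:
    totals[row["city"]] += row["amount"]

# Sequence/Surrogate key: ascending row number that is not a natural key.
sequence = count(start=1)
keyed = [{"sk": next(sequence), **row} for row in joined]
```

A transformation tool typically offers each of these steps as a standard component; where a component is missing, the step has to be written in code along these lines.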
In Azure Data Factory there are many different transformation types. No coding is necessary, although it is possible. Data transformations can be built visually with an interface called Mapping Data Flows [46]. In HVR there are so-called actions for data transformation. Actions in HVR are based on labels and text boxes. An aggregation cannot be made out of the box [32]. IBM DataStage provides a stage editor with different stage types. With the standard stages, all the mentioned transformation types can be implemented [33]. With PowerCenter and Data Quality, both part of Informatica’s product range, all the relevant transformation types are supported [34]. KNIME implements the mentioned transformation types in so-called nodes. Creating surrogate keys is not supported directly, but there are workarounds [43]. In Pentaho the transformation types are called Data Integration Steps. These steps can be used to implement the different transformation types [30]. RapidMiner comes with standard operators for the basic transformation types [54]. In SAS, code is often used for data transformation. There are functions for aggregating data such as “SUM”. SAS Data Management transformations can be used instead of code, but there are not many of them [58]. Talend has several components for transformations in its Talend Open Studio and in expanded modules. Code can also be used, e.g. for querying the database. The derivation of columns or expressions should be done in code [63]. TIBCO Jaspersoft is based on Talend and has the same functions [68]. In many cases more than one standard component is available for a certain transformation type. Table 3 summarizes how many of the transformation types can be implemented with each tool and for how many of them standard components are offered.
4.5 Usability, Popularity and Stability
Ease of use is very important for the popularity of software. Usability is part of software ergonomics. Ergonomics focuses not only on the technical system but on the whole system consisting of the interaction between human, tool, task and environment [38]. Good software ergonomics results in software that is easy to understand and quick to use. ISO 9241 was published in 1999 and contains seventeen parts with requirements for office work with visual display terminals. The requirements range from menu dialogues and colours to usability [37]. Part eleven of the standard defines the term usability: it is the extent to which a software can be used by specified users to fulfil given tasks with effectiveness, efficiency and satisfaction in a specified context of use [36]. Effectiveness is the capability of the user to complete a task successfully without an error. Efficiency can be equated with the time needed. Satisfaction is reported by the user. Good usability results not only in benefits for the user, such as higher productivity, lower costs and user satisfaction, but also in benefits for the software supplier. Users generally demand high usability, so usability itself can be a competitive advantage [8]. Within the scope of this article the usability of each tool is not measured.
According to Gartner, Azure Data Factory is valued by clients for its usability [73]. No research was found on the usability of the transformation tools HVR, Informatica and TIBCO Data Virtualization. Using several functions as a node in KNIME increases the usability [5,18]. Evaluations of the transformation tools IBM DataStage, Pentaho, RapidMiner, Talend and TIBCO Jaspersoft show that Pentaho, RapidMiner, Talend and TIBCO Jaspersoft are easy to use [11,42]. Overall, there is no comparison that covers all of the transformation tools in focus. Based on other use cases, it can be concluded that there is a correlation between usability and popularity [59]. In any case, the popularity of a software goes along with the usage of the tool. The usage can be seen indirectly in content about the tool such as websites and books, results in search engines or technical discussions [60]. The popularity of each tool is measured with metrics based on the DB-Engines Ranking. When choosing a transformation tool, popularity indicates how common the usage of the tool is and how big its community is. Depending on the project this can be very relevant. SAS Data Management has the greatest popularity. Informatica follows with almost 40% less popularity. Talend and Azure Data Factory trail Informatica. Most of the tools, namely KNIME, Pentaho, RapidMiner and TIBCO, have 20% of the popularity of SAS Data Management. The tools with the lowest popularity are IBM DataStage and HVR. It is important not only how popular a tool is but also how long it has been on the market. This is also an indicator of how stable a software is. The stability of a software that has been available for years can be rated higher than that of a new one. Azure Data Factory was published in 2015 [46]. After first explorations, HVR was deployed in 2012 [32]. IBM DataStage was developed in 1996 [33].
Informatica, as a software company, has been on the market since 1993 [34]. KNIME was developed at the University of Konstanz and published in 2006 [43]. Pentaho was developed in 2004. In 2015 the software was bought by Hitachi Vantara [30]. RapidMiner has existed since 2001; it was developed at the University of Dortmund and has been refined by Rapid-I GmbH since 2007 [54]. SAS has its beginnings in 1966. The first SAS software was published in 1972 [58]. Talend was founded in 2005 and has expanded its software solution since then [63]. TIBCO was founded in 1997 by Vivek Ranadivé. The motivation was to make real-time technology suitable for the mass market [68].
4.6 Performance
Performance is also an important factor for data integration processes. Factors for performance are time efficiency, resource utilization, capacity and modes of data processing [67]. Within the scope of this article the performance of each tool is not measured. There is no study available on the performance of Azure Data Factory. Instead there are several hints on how to improve the performance, e.g. with new indexes, a reduction of dimensions, the number of workers or the file format [13,55]. HVR does not support complex ETL processes very well [73]. A comparison between Informatica and
IBM DataStage shows that both support parallelism but IBM DataStage is more scalable [48]. There is research on the performance of KNIME and RapidMiner, but it refers to data science use cases and is not representative for transformations [4,29,49]. Talend improves its performance by supporting parallel processing [40,63]. A comparison of Pentaho and TIBCO Jaspersoft shows that Pentaho performs better [70]. Because there is no study comparing the performance of all tools, the tools cannot be ranked by performance. Furthermore, information was not found for every transformation tool.
4.7 Competitive Advantages
Competitive advantages of the tools, where they exist, are presented in the following. Depending on the use case these features can be helpful for the implementation. It has to be decided whether a certain competitive advantage is useful in a project and whether the additional value is high enough to choose the tool over another transformation tool. Azure Data Factory not only supports over 90 standard connectors for files, databases, SaaS applications and big data services but also offers some further advantages. On premises, Microsoft offers SQL Server Integration Services (SSIS) to build ETL processes containing transformation steps. SSIS can easily be rehosted in Data Factory. Furthermore, versioning with Git as well as CI/CD integration is provided [46]. HVR supports many different topologies such as uni-directional, bi-directional, multi-directional, broadcast, integration and cascading [32]. A strength of IBM is its support, which is easy to access, very responsive and operated by experts [73]. Informatica provides several products which support the data integration process, for example ultra messaging, building a data hub or data exchange between companies [34]. KNIME offers many extensions, some of them free, especially for data science use cases like text processing, network relations and orchestration. It also supports R and Python scripting as well as H2O machine learning and deep learning [43]. With the Lumada data services, Hitachi Vantara offers a wide product range for analytics, governance and ops agility, containing multi-cloud solutions such as a data catalog, a data optimizer for Hadoop and a data lake [30]. RapidMiner is focused on data science use cases. For that reason it offers over 1500 native algorithms, data preparation and data science functions. Furthermore, it provides over 250 learning sessions [54]. As part of the SAS company, SAS Data Management is integrated into the SAS environment.
Especially for analytical use cases SAS has many products such as SAS Visual Data Mining and Machine Learning and SAS Model Manager [58]. Talend has a strong focus on data integration processes. This results in over 900 free components and connectors [63]. TIBCO has a strength in emerging data integration scenarios like streaming, messaging and Internet of Things (IoT) devices. It also offers separate tools for these use cases [73].
Table 3. Transformation tool properties
| Tool | Licence and costs | Transformation types | Standard components | Popularity | Founding year | Competitive advantages |
|---|---|---|---|---|---|---|
| Azure Data Factory | Proprietary | 9/9 | 9/9 | Popular | 2015 | Over 90 connectors, SSIS rehosting, Git support, CI/CD |
| HVR | Proprietary | 8/9 | 7/9 | Not popular | 2012 | Support for many topologies |
| IBM DataStage | Proprietary | 9/9 | 9/9 | Not popular | 1996 | Support |
| Informatica | Proprietary | 9/9 | 8/9 | Popular | 1993 | Further data integration products |
| KNIME | Open source (KNIME Analytics Platform)/Proprietary (KNIME Server) | 9/9 | 8/9 | Less popular | 2006 | Extensions for data science use cases |
| Pentaho | Open source (Pentaho Community Project)/Proprietary (Pentaho Enterprise Edition) | 9/9 | 9/9 | Less popular | 2004 | Lumada data services |
| RapidMiner | Open source (RapidMiner Community)/Proprietary (RapidMiner Enterprise) | 9/9 | 9/9 | Less popular | 2001 | Standard algorithms and functions, online courses |
| SAS Data Management | Proprietary | 9/9 | 4/9 | Very popular | 1972 | Further analytic products |
| Talend | Open source (Talend Open Source)/Proprietary | 9/9 | 8/9 | Popular | 2005 | Over 900 components and connectors |
| TIBCO | Proprietary | 9/9 | 8/9 | Less popular | 1997 | Tools for streaming, messaging and IoT |
5 Conclusion and Future Research
Due to the growing data volume, NoSQL databases are becoming very popular. The change in infrastructure also results in a change of data integration processes. This evaluation gives an overview of the most popular transformation tools on the market for NoSQL databases, including their advantages and disadvantages. With the presented tools, not only ETL but also ELT processes can be built. In fact, the tools are not new on the market. They are existing ETL tools whose connectors have been extended over time. Many of them nowadays have connectors for NoSQL databases. This has the advantage that already known ETL tools can also be used for NoSQL databases. In this study, fundamental concepts in the field of databases and data integration processes were presented. Using a self-developed selection method, the most popular NoSQL databases and transformation tools were chosen. The most important properties of a transformation tool were introduced and discussed for each database and tool. In the overview tables the advantages and disadvantages can be seen quickly and clearly. The evaluation is not only a market overview
but also a decision guide for the implementation of specific use cases. It is shown that the choice of a transformation tool always includes the choice of a database system. A free choice is not possible if a database system has already been decided on. Instead, code or a user-specific connector has to be used. The research is an overview of different transformation tools in the context of NoSQL databases. As a next step, specific use cases can be implemented with the different tools and then measured and analyzed. Within the scope of the study not every property could be analyzed in detail, so further studies are necessary. The performance of standard components has to be analyzed in relation to the performance of custom code. Furthermore, the usability of each tool has to be measured with a representative group of people. Besides that, it is partly possible to connect a database with a transformation tool without a standard connector, e.g. via JDBC. This has to be verified for each combination of database and transformation tool.
References
1. Amazon Web Services Inc.: Amazon dynamodb (2020)
2. Amazon Web Services Inc.: What is amazon dynamodb? (2020)
3. Atriwal, L., Nagar, P., Tayal, S., Gupta, V.: Business intelligence tools for big data. J. Basic Appl. Eng. Res. 3(6), 505–509 (2016)
4. Basha, S.M., Bagyalakshmi, K., Ramesh, C., Rahim, R., Manikandan, R., Kumar, A.: Comparative study on performance of document classification using supervised machine learning algorithms: Knime. Int. J. Emerg. Technol. 10(1), 148–153 (2019)
5. Beisken, S., Meinl, T., Wiswedel, B., de Figueiredo, L.F., Berthold, M., Steinbeck, C.: KNIME-CDK: workflow-driven cheminformatics. BMC Bioinform. 14(1), 257 (2013)
6. Belyy, A., Xu, F., Herdan, T., He, M., Syed, A., Cao, W., Yee, M.: Dataset previews for ETL transforms, 28 May 2013. US Patent 8,452,723
7. Bergamaschi, S., Guerra, F., Orsini, M., Sartori, C., Vincini, M.: A semantic approach to ETL technologies. Data Knowl. Eng. 70(8), 717–731 (2011)
8. Bevan, N., Macleod, M.: Usability measurement in context. Behav. Inf. Technol. 13(1–2), 132–145 (1994)
9. Beyer, M.A., Laney, D.: The importance of ‘big data’: a definition, pp. 2014–2018. Gartner, Stamford, CT (2012)
10. Bhide, M.A., Bonagiri, K.K., Mittapalli, S.K.: Column based data transfer in extract transform and load (ETL) systems, 20 August 2013. US Patent 8,515,898
11. Badiuzzaman Biplob, Md., Sheraji, G.A., Khan, S.I.: Comparison of different extraction transformation and loading tools for data warehousing. In: 2018 International Conference on Innovations in Science, Engineering and Technology (ICISET), pp. 262–267. IEEE (2018)
12. Chen, J.-K., Lee, W.-Z.: An introduction of NoSQL databases based on their categories and application industries. Algorithms 12(5), 106 (2019)
13. Coté, C., Gutzait, M.K., Ciaburro, G.: Hands-On Data Warehousing with Azure Data Factory: ETL Techniques to Load and Transform Data from Various Sources, Both On-premises and on Cloud. Packt Publishing Ltd., Birmingham (2018)
14. Davenport, R.J.: ETL vs ELT: a subjective view. Insource Commercial aspects of BI whitepaper (2008)
15. Ding, G., Wu, Q., Wang, J., Yao, Y.-D.: Big spectrum data: the new resource for cognitive wireless networking (2014)
16. El Akkaoui, Z., Zimanyi, E., Mazón, J.-N., Trujillo, J.: A model-driven framework for ETL process development. In: DOLAP 2011: Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP, pp. 45–52, October 2011
17. Fatima, H., Wasnik, K.: Comparison of SQL, NoSQL and NewSQL databases for internet of things. In: 2016 IEEE Bombay Section Symposium (IBSS), pp. 1–6. IEEE (2016)
18. Fillbrunn, A., Dietz, C., Pfeuffer, J., Rahn, R., Landrum, G.A., Berthold, M.R.: Knime for reproducible cross-domain analysis of life science data. J. Biotechnol. 261, 149–156 (2017)
19. Gajendran, S.K.: A survey on NoSQL databases. University of Illinois (2012)
20. Gantz, J., Reinsel, D.: The 2011 digital universe study: Extracting value from chaos. IDC: Sponsored by EMC Corporation (2011)
21. Gartner Inc.: Magic quadrant research methodology (2020)
22. Giovinazzo, W.: Bi: Only as good as its data quality. Information Management Special Reports (2009)
23. Grad, B.: Relational database management systems: the formative years [guest editor’s introduction]. IEEE Ann. Hist. Comput. 34(4), 7–8 (2012)
24. Graham, P.: Data quality: you don’t just need a dashboard! strategy execution. DM Rev. Mag., 10001727–1 (2008)
25. Grolinger, K., Higashino, W.A., Tiwari, A., Capretz, M.A.M.: Data management in cloud environments: NoSQL and NewSQL data stores. J. Cloud Comput. Adv. Syst. Appl. 2(1), 22 (2013)
26. Gudivada, V.N., Rao, D., Raghavan, V.V.: NoSQL systems for big data management. In: 2014 IEEE World Congress on Services, pp. 190–197. IEEE (2014)
27. Han, J., Haihong, E., Le, G., Du, J.: Survey on NoSQL database. In: 2011 6th International Conference on Pervasive Computing and Applications, pp. 363–366. IEEE (2011)
28. Han, J., Song, M., Song, J.: A novel solution of distributed memory NoSQL database for cloud computing. In: 2011 10th IEEE/ACIS International Conference on Computer and Information Science, pp. 351–355. IEEE (2011)
29. Hanif, M.H.M., Adewole, K.S., Anuar, N.B., Kamsin, A.: Performance evaluation of machine learning algorithms for spam profile detection on twitter using WEKA and RapidMiner. Adv. Sci. Lett. 24(2), 1043–1046 (2018)
30. Hitachi Vantara LLC: Pentaho enterprise edition - hitachi vantara (2020)
31. Holst, A.: Volume of data/information created worldwide from 2010 to 2024 (2020)
32. HVR Software Inc.: Enterprise data integration software - hvr (2020)
33. IBM United Kingdom Limited: IBM - United Kingdom (2020)
34. Informatica: Enterprise cloud data management - informatica deutschland (2020)
35. Ismail, R., Syed, T.A., Musa, S.: Design and implementation of an efficient framework for behaviour attestation using n-call slides. In: Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication, pp. 1–8 (2014)
36. ISO: Ergonomische Anforderungen für Bürotätigkeiten mit Bildschirmgeräten Teil 11: Anforderungen an die Gebrauchstauglichkeit - Leitsätze. Beuth Verlag, Berlin (1999)
37. ISO: Ergonomische Anforderungen für Bürotätigkeiten mit Bildschirmgeräten Teil 1: Allgemeine Einführung (ISO 9241–1:1997) (enthält Änderung AMD 1:2001); Deutsche Fassung EN ISO 9241–1:1997 + A1:2001. Beuth Verlag, Berlin (2002)
38. ISO: Deutsche Norm DIN EN ISO 6385: Grundsätze der Ergonomie für die Gestaltung von Arbeitssystemen (ISO 6385:2004); deutsche Fassung EN ISO 6385:2004. Beuth Verlag, Berlin (2004)
39. Jing Han, Haihong, E., Le, G., Du, J.: Survey on NoSQL database. In: 2011 6th International Conference on Pervasive Computing and Applications, pp. 363–366 (2011)
40. Katragadda, R., Tirumala, S.S., Nandigam, D.: ETL tools for data warehousing: an empirical study of open source Talend Studio versus Microsoft SSIS. In: Computing Conference Papers [147] (2015)
41. Kelly, A., Kelly, A.M., McCreary, D.: Making Sense of NoSQL: a Guide for Managers and the Rest of Us (2013)
42. Kherdekar, V.A., Metkewar, P.S.: A technical comprehensive survey of ETL tools. In: Advanced Engineering Research and Applications, p. 20 (2016)
43. KNIME AG: Knime - open for innovation (2020)
44. Leavitt, N.: Will NoSQL databases live up to their promise? Computer 43(2), 12–14 (2010)
45. Mell, P., Grance, T., et al.: The NIST definition of cloud computing. NIST special publication, 800–145 (2011)
46. Microsoft: Data factory - datenintegrationsdienst - microsoft azure (2020)
47. MongoDB Inc.: The most popular database for modern apps - mongodb (2020)
48. Mukherjee, R., Kar, P.: A comparative review of data warehousing ETL tools with new trends and industry insight. In: 2017 IEEE 7th International Advance Computing Conference (IACC), pp. 943–948. IEEE (2017)
49. Naik, A., Samant, L.: Correlation review of classification algorithm using data mining tool: WEKA, Rapidminer, Tanagra, Orange and Knime. Procedia Comput. Sci. 85, 662–668 (2016)
50. Neo4j Inc.: Neo4j graph platform - the leader in graph databases (2020)
51. Ohlhorst, F.J.: Big Data Analytics: Turning Big Data Into Big Money, vol. 65. Wiley, Hoboken (2012)
52. Pritchett, D.: Base: an acid alternative. Queue 6(3), 48–55 (2008)
53. Ranjan, V.: A comparative study between ETL (extract, transform, load) and ELT (extract, load and transform) approach for loading data into data warehouse. Technical report (2009). http://www.ecst.csuchico.edu/~juliano/csci693. Viewed 05 Mar 2010
54. RapidMiner Inc.: Rapidminer - best data science & machine learning platform (2020)
55. Rawat, S., Narain, A.: Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions. Apress, New York (2018)
56. Redis Labs Ltd.: Redis (2020)
57. Sagiroglu, S., Sinanc, D.: Big data: a review. In: 2013 International Conference on Collaboration Technologies and Systems (CTS), pp. 42–47 (2013)
58. SAS Institute Inc.: Analytics & AI software-lösungen für unternehmen - sas (2020)
59. Scowen, G., Regenbrecht, H.: Increased popularity through compliance with usability guidelines in e-learning web sites. Int. J. Inf. Technol. Web Eng. (IJITWE) 4(3), 38–57 (2009)
60. solidIT consulting & software development GmbH: Db-engines (2020)
61. Song, X., Yan, X., Yang, L.: Design ETL metamodel based on UML profile. In: 2009 Second International Symposium on Knowledge Acquisition and Modeling, vol. 3, pp. 69–72. IEEE (2009)
62. Stonebraker, M., Madden, S., Abadi, D.J., Harizopoulos, S., Hachem, N., Helland, P.: The end of an architectural era: It’s time for a complete rewrite. In: Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker, pp. 463–489. Morgan & Claypool (2018)
63. Talend: Talend - a cloud data integration leader (modern ETL) (2020)
64. Tayade, D.M.: Comparative study of ETL and E-LT in data warehousing (2019)
65. The Apache Software Foundation: Apache cassandra documentation v4.0-beta3 (2020)
66. The Apache Software Foundation: Welcome to apache hbase™ (2020)
67. Theodorou, V., Abelló, A., Lehner, W.: Quality measures for ETL processes. In: International Conference on Data Warehousing and Knowledge Discovery, pp. 9–22. Springer (2014)
68. TIBCO Software Inc.: Reporting- und analysesoftware (2020)
69. Tudorica, B.G., Bucur, C.: A comparison between several NoSQL databases with comments and notes. In: 2011 RoEduNet International Conference 10th Edition: Networking in Education and Research, pp. 1–5. IEEE (2011)
70. Vargas, V., Syed, A., Mohammad, A., Halgamuge, M.N.: Pentaho and Jaspersoft: a comparative study of business intelligence open source tools processing big data to evaluate performances. Int. J. Adv. Comput. Sci. Appl. 7(10), 20–29 (2016)
71. Zafar, R., Yafi, E., Zuhairi, M.F., Dao, H.: Big data: the NoSQL and RDBMS review. In: 2016 International Conference on Information and Communication Technology (ICICTM), pp. 120–126. IEEE (2016)
72. Zaidi, E., Thoo, E., Heudecker, N., Menon, S., Thanaraj, R.: Gartner magic quadrant for data integration tools. Gartner Group (2020)
73. Zaidi, E., Thoo, E., Heudecker, N., Menon, S., Thanaraj, R.: Magic quadrant for data integration tools, 2020 (2020)
74. Zamanian, K., Nesamoney, D.: Apparatus and method for performing data transformations in data warehousing, 15 January 2002. US Patent 6,339,775
Network Classification with Missing Information

Ruriko Yoshida(B) and Carolyne Vu
Naval Postgraduate School, Monterey, CA, USA
[email protected]
http://polytopes.net

Abstract. Demand for effective methods of analyzing networks has emerged with the growth of accessible data, particularly for incomplete networks. Even as means for data collection advance, incomplete information remains a reality for numerous reasons. Data can be obscured by excessive noise. Surveys for information typically contain some nonrespondents. In other cases, simple inaccessibility restricts observation. Also, for illicit groups, we are confronted with attempts to conceal important elements or to propagate false information. In the real world, it is difficult to determine when the observed network is both accurate and complete. In this paper, we consider a method for classification of incomplete networks. We classify real-world networks into technological, social, information, and biological categories by their structural features using supervised learning techniques. In contrast to the current method of training models with only complete information, we examine the effects of training our classification model with both complete and incomplete network information. This technique enables our model to learn how to recognize and classify other incomplete networks. The representation of incomplete networks at various stages of completeness allows the machine to examine the nuances of incomplete networks. By allowing the machine to study incomplete networks, its ability to recognize and classify other incomplete networks improves drastically. Our method requires minimal computational effort and can accomplish an efficient classification. The results strongly confirm the effectiveness of training a classification model with incomplete network information.

Keywords: Classifications · Graph statistics · Network analysis · Random forests

1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 166–183, 2022. https://doi.org/10.1007/978-3-030-82196-8_13

Numerous real-world problems and systems can be represented by networks. Varying from social relationships to biochemistry, networks exist in many different forms. Personal or organizational interactions are captured in network representations [1, sec. 1.5]. Communication devices, connected through wired or wireless means, can be mapped by technological networks [1, sec. 1.5]. Networks can also capture military coordination necessary to an operation to assist in decision-making [1, sec. 1.5]. Network analysis provides a discipline shared by many different professional fields. Biologists and intelligence analysts alike often characterize their system, extract information from potentially incomplete data, and develop an understanding of their system through analysis [1, sec. 1.4]. The initial stages of any network analysis include categorization by class according to shared characteristics [13, p. 13]. This is known as classification. From an accurate classification, similar methods of analysis can be applied to networks belonging to the same class. The rise of accessible real-world data creates a growing interest in effective methods for accurate network classification, especially for networks with incomplete information.

The principal challenge of analyzing real-world observed networks is their propensity to contain dubious or incomplete data. For example, criminal networks are “inevitably incomplete” given their elusive and dynamic operational nature [17]. Illicit groups might propagate false information to conceal true intentions. Even naturally occurring networks could have elements that are simply unobserved. Also, for some networks, due to their nature or size, it can be difficult to ascertain when the observed network is considered complete. Unfortunately, very little research has been completed to study the effects of incomplete data on network structure [17]. Most techniques for handling incomplete networks involve data imputation, the process of estimating unknown data from the observed data, which might incur unknown consequences to a network’s true structure and ultimately affect classification [11, p. 20]. An intelligence community’s assessment of enemy organizations requires accurate classification of the observed network before the intelligence team can develop a strategy for combating the adversary.
Problems are typically time-sensitive; however, gathering complete and actionable intelligence is a challenging mission that could span years. An adversary’s actions are secretive in nature, making it extremely difficult to collect a complete observation of the network. Crucial information is deliberately concealed. Intentionally dubious information might create problematic noise or false imputations. Thus, if an observed incomplete network can be classified as-is without delay, the network can be properly analyzed so that a strategy can be devised and acted upon earlier. With a method to accurately classify an incomplete network, techniques of imputation can be reserved for post-classification. This allows the estimation to be tailored by network class in an effort to maintain the network’s true structure. These techniques could provide the intelligence team with a reasonable evaluation of an enemy’s prospective associations or activities. A method for classifying incomplete networks has a wide range of potential applications, from social network analysis to epidemiology and political campaigning. Incomplete network classification without imputation creates the possibility for new approaches to network analysis. In this paper, we consider a method for classification of incomplete networks. We examine the effects of training the classification model with complete and incomplete information. Observed network data and their network features are classified into technological, social, information, and biological categories using
supervised learning methods. This comparative analysis contributes to a better understanding of network characteristics for classification. We then consider creating a robust method for rebuilding a network with missing information. We propose a method for classifying the percentage of information missing in networks based on feature characteristics, which will determine how much information needs to be rebuilt. This paper is organized as follows: Sect. 2 reviews the literature and recalls definitions. In Sect. 3, we show asymptotic results for graph statistics as the number of nodes in a graph goes to infinity. In Sect. 4, we present our method for classifying an observed graph, and in Sect. 5 we show computational results and an analysis of our method on empirical datasets. We end this paper with a discussion and future work.
2 Background

This section begins by describing frequently studied networks, separating them into four common classes: technological, social, information, and biological networks. Next, Sect. 2.2 reviews current literature for network classification and incomplete network studies that we build upon in our own study.

2.1 General Network Classes
Networks can be used to model a variety of systems. Classifying them into distinct categories allows networks in a category to be treated with common methods of analysis. In this study, we follow the categorization of networks by [13, p. 13] into four general classes.

Technological Networks. Technological networks are used to model physical infrastructure systems fundamental to modern society [13, p. 13]. The Internet, as a global network of connections between devices, transportation networks, power grids, and telephone and delivery networks are included in this category, though not all of them are examined in this study. Technological networks, for example, can have nodes representing airports and edges representing connections between those airports.

Social Networks. Social networks model people or groups connected by some form of social interaction. In popular terms, social networks commonly refer to online network systems such as Facebook or LinkedIn [1, sec. 1.3]. However, the study of social networks also includes email interaction, professional collaboration, and familial ancestry [13, p. 30]. For social networks, nodes represent individuals, and edges are their connections.
Information Networks. Information networks represent shared data connections. Closely resembling social networks and some technological networks, information networks actually represent the content occurring over those other networks. Examples include information flow over the World Wide Web, citations, recommendations, and information distribution in the form of sharing others’ posts [13, p. 51]. Information networks, for example, can have nodes representing movies and edges representing related recommendations.

Biological Networks. Biological networks are interactions between biological elements [13, p. 64]. Common types of biological networks are used to represent biochemical interactions, neurological systems, and relationships in an ecosystem. Biological networks, for example, can have nodes representing genes and edges representing their interactions.

2.2 Literature Review

Our current “era of big data” is emerging from an increased ability to collect and share data [12, p. 1]. The sharp growth in data size and accessibility requires appropriate methods for processing and analyzing that big data. A machine learning (ML) approach to analysis is leveraged in our study.

Study of Network Classification. The initial phase of any network analysis begins with classifying the network. Geng et al. discuss the following kernel methods popular for network classification [8].
– Random Walk – similarity measured by common random walks
– Shortest Path – similarity measured by common shortest paths
– Cyclic Pattern – similarity measured by common cycles
– Subtree – similarity measured by common subtrees
– Graphlet and Subgraph – similarity measured by similar subgraphs or graphlets
Geng et al. proposed an alternative approach to kernel methods. It is commonly understood that networks of a class will have similar characteristics in their structure. Under this assumption, unique network features should be leveraged to classify an unknown network [8]. Geng et al. conducted a study of biological network classification based on attribute vectors generated from global topological and label features [8]. They discovered that networks from similar classes have similar characteristics, and that network characteristics carry distinctions that can be leveraged in classification algorithms. Geng et al. found that their feature-based classification models produced similar accuracy rates with lower computational requirements than conventional kernel methods, which measure similarity between networks based on shared patterns [8]. Canning et al. investigated the use of network features for classification of complete real-world observed networks. Their research found that networks from
differing classes do contain distinguishing structural features useful in network classification. Research prior to this study was mainly focused on classification of only synthetic networks or on distinguishing networks within one specific class type [3]. Canning et al. included synthetically generated networks among the real-world networks and discovered that their classification model could distinguish the synthetic networks from real networks with great confidence [3]. Their multiclass classification model using random forests (RF) was successful in classifying both real-world and synthetic complete networks using only their network features. These studies of feature-based classification presume complete network information in their methods. In contrast, we seek to examine an RF model that classifies a network as it is observed, even while incomplete.

Study of Incomplete Networks. Incomplete data is a reality of analyzing real-world networks. Portions of the observed data may remain unknown for different reasons such as data obstruction by excessive noise, non-respondent survey answers, deliberate concealment, or inaccessibility for observation [7]. The proper handling of incomplete data is a critical requirement for accurate classification. An inapt approach can cause significant errors in classification results. Garcia-Laencina et al. [7] discuss the following common techniques for analyzing incomplete data:
– Exclusion – deletion of incomplete datasets to analyze only completely observed data
– Weighting – modifying design weights to adjust for non-respondent data
– Imputation – an estimation of unobserved data is generated from known data features
– Model-Based – broad methods for modeling and making inferences based on data distribution or likelihood
Other emerging approaches for handling incomplete data include the use of ML techniques such as support vector machines (SVM), decision trees, and neural networks (NN) [7].
However, when using any of these methods, we must be attentive to potential incidents of significant bias, added variance, or risks of generalizing estimated data [7]. Thus, we seek to develop a method for classifying an incomplete network without estimations to complete the network. Once classified, the methods of predicting unknown data can be customized to consider that network class’s known properties, not just its observed features.
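The idea of training on networks "as observed" can be sketched in code. The following is a minimal illustration, not the authors' pipeline: it assumes an undirected simple graph given as an edge list, uses a hypothetical feature vector (node count, edge count, density, average degree, transitivity), and builds training rows at several hypothetical deletion fractions by removing nodes uniformly at random. The random forest step itself is omitted.

```python
import random
from itertools import combinations


def graph_features(nodes, edges):
    """Return [n, m, density, average degree, transitivity] for a simple graph."""
    nodes = set(nodes)
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes and u != v:
            adj[u].add(v)
            adj[v].add(u)
    n = len(nodes)
    m = sum(len(nb) for nb in adj.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    avg_degree = 2 * m / n if n else 0.0
    closed = 0   # closed triples; each triangle is counted once per corner
    triples = 0  # connected triples centred at each node
    for nb in adj.values():
        k = len(nb)
        triples += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(nb, 2) if b in adj[a])
    transitivity = closed / triples if triples else 0.0
    return [n, m, density, avg_degree, transitivity]


def delete_nodes(nodes, edges, fraction, rng):
    """Remove a uniformly random fraction of nodes and their incident edges."""
    nodes = list(nodes)
    n_keep = len(nodes) - int(round(fraction * len(nodes)))
    keep = set(rng.sample(nodes, n_keep))
    return keep, [(u, v) for u, v in edges if u in keep and v in keep]


def training_rows(nodes, edges, label, fractions=(0.0, 0.1, 0.2, 0.3), seed=0):
    """One (features, label) row per completeness level of the same network."""
    rng = random.Random(seed)
    rows = []
    for fraction in fractions:
        sub_nodes, sub_edges = delete_nodes(nodes, edges, fraction, rng)
        rows.append((graph_features(sub_nodes, sub_edges), label))
    return rows
```

Rows produced this way for many labeled networks could then be passed to any supervised classifier; which features and deletion levels work best is an empirical question that this sketch does not address.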
3 Asymptotic Results on Some Graphical Features

In this section we show some theoretical results on the asymptotic behavior of a graphical feature, namely the average distance between nodes in a random graph. Suppose we have a random graph $G_0 = (N_0, E_0)$, where $N_0 = \{1, \ldots, n\}$ is the set of nodes and $E_0$ is the set of edges. We show that if we randomly delete a node from $G_0$, then the average distance between a pair of nodes stays asymptotically the same as the average distance between a pair of nodes in $G_0$.

Let $G_i^E$ be the graph with node set $N_0$ and edge set $E_i \subset E_0$ obtained by deleting $i$ edges uniformly at random. Let $G_i^N$ be the graph with node set $N_i \subset N_0$ and edge set $E_i \subset E_0$ obtained by deleting $i$ nodes uniformly at random, together with the edges adjacent to the deleted nodes. Without loss of generality, let $N_i = \{1, \ldots, n - i\}$. In order to simplify the problem, we assume that $G_0$ is generated by the Erdős–Rényi model of a random graph [6].

The main ingredient of the proof of our theorem is from [5]. Suppose we have an expected degree sequence $w = (w_1, \ldots, w_n)$, where $w_i$ is the expected degree of node $i$. Let
$$d = \frac{\sum_i w_i^2}{\sum_i w_i},$$
that is, the second-order average degree of the nodes.

Definition 1. The volume of a subset of nodes $S \subset N$ in a graph $G = (N, E)$ is defined as
$$\mathrm{Vol}(S) = \sum_{v \in S} \deg(v),$$
where $\deg(v)$ is the degree of a node $v$. Let $\mathrm{Vol}_k(S) = \sum_{i \in S} w_i^k$ and $\mathrm{Vol}_k(G) = \sum_{i \in N} w_i^k$.

Definition 2. The expected degree sequence $w$ for a graph $G$ is called strongly sparse if $G$ satisfies the following: 1. The second-order average degree $d$ satisfies the condition $0 < \log(d) > (\mathrm{Vol}_3(U) \log(d) \log\log(n) / (d \log(n)))$.

Note that if $G$ is generated under the Erdős–Rényi model with $p < 1$, then it is admissible.

Theorem 1 (Theorem 1 in [5]). For a random graph $G$ with admissible expected degree sequence $(w_1, \ldots, w_n)$, the average distance is almost surely $(1 + o(1))(\log(n) / \log(d))$.
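As a small numerical companion to these definitions, the sketch below computes $\mathrm{Vol}_k$, the second-order average degree $d$, and the leading term $\log(n)/\log(d)$ of Theorem 1 for a given expected degree sequence. The function names are hypothetical, nodes are indexed from 0 rather than 1, and the $(1 + o(1))$ factor is dropped, so the result is only a leading-order estimate that is meaningful when the sequence is admissible and $d > 1$.

```python
import math


def volume(weights, subset, k=1):
    """Vol_k(S): sum of w_i**k over the node indices in `subset`."""
    return sum(weights[i] ** k for i in subset)


def second_order_average_degree(weights):
    """d = (sum of w_i^2) / (sum of w_i)."""
    return sum(w * w for w in weights) / sum(weights)


def leading_order_average_distance(weights):
    """log(n) / log(d); ignores the (1 + o(1)) factor of Theorem 1."""
    n = len(weights)
    d = second_order_average_degree(weights)
    return math.log(n) / math.log(d)
```

For example, a constant sequence with $w_i = 10$ on $n = 1000$ nodes gives $d = 10$ and a leading-order average distance of about 3.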
Proposition 1. The expected degree of each node of a graph $G = (N, E)$ with $n$ nodes generated under the Erdős–Rényi model with $p \in [0, 1]$ is $p \cdot (n - 1)$.

Proof. If $p = 1$, then $G$ is the complete graph with $n$ nodes, so the degree of each node is $n - 1$. If $p < 1$, then the probability that there is an edge between a node $i \in N$ and another node $j \neq i$ is $p$. Therefore, since there are $n - 1$ possible nodes $j \neq i$, the expected degree of node $i$ is $p \cdot (n - 1)$.

Theorem 2. If a graph $G = (N, E)$ with $n$ nodes is generated under the Erdős–Rényi model with $p \in (0, 1)$, then the average distance is almost surely $(1 + o(1))(\log(n) / \log(p \cdot (n - 1)))$.

Proof. By Proposition 1, the expected degree sequence has $w_i = p \cdot (n - 1)$ for $i \in N$, so $d = p \cdot (n - 1)$. Since $w$ is admissible, the claim follows from Theorem 1.

Corollary 1. Suppose a graph $G_0 = (N_0, E_0)$ with $n$ nodes is generated under the Erdős–Rényi model with $p \in (0, 1)$. If i

the principle of superposition of these forces: the resultant force vector $\vec{F}_i$ acting on the particle $s_i$ is the sum of the vectors $\vec{F}_{i,j}$:
$$\vec{F}_i = \sum_{j=1,\, j \neq i}^{|S|} \vec{F}_{i,j} = \sum_{j=1,\, j \neq i}^{|S|} \frac{q_i q_j}{4\pi\varepsilon r_{i,j}} \vec{e}_{i,j},$$
where $\vec{e}_{i,j}$ is the unit direction vector of $\vec{F}_{i,j}$.

At the current iteration, we set the charge of the particle $s_i$ to
$$q_i = \exp\left(-|X| \, \frac{\varphi_i - \varphi_{i_b}}{\sum_{j \in [1:|S|],\, j \neq i} (\varphi_j - \varphi_{i_b})}\right), \quad i \in [1:|S|],$$
where the fraction is the normalized value of the fitness function at the current position $X_i$ of this particle, and $\varphi_{i_b}$ is the minimum value of the fitness function reached by the population at this iteration. The coefficient $|X|$ is needed to prevent the argument of the exponential function from becoming too small in absolute value when the dimension of the search space is high. The argument of this function is never positive, so the charge $q_i$ is always positive and belongs to the interval $(0; 1]$.

At the same iteration $t$ we calculate the vector $\vec{F}_i$ of dimension $(|X| \times 1)$ using the following formula:
$$\vec{F}_i = \sum_{j=1,\, j \neq i}^{|S|} \vec{F}_{i,j}, \qquad \vec{F}_{i,j} = \begin{cases} (X_j - X_i) \dfrac{q_i q_j}{\|X_j - X_i\|^2}, & \varphi_j < \varphi_i, \\[2mm] (X_i - X_j) \dfrac{q_i q_j}{\|X_j - X_i\|^2}, & \varphi_i \le \varphi_j. \end{cases} \tag{3}$$
• the particle with the better value of the fitness function attracts the particle with the worse value of this function, and the second particle repels the first;
• the particle $s_{i_b}$ with the best value of the fitness function, equal to $\varphi_{i_b}$, attracts all other particles of the population.

4. Execute the Movement of Particles. Rule for moving particles:
$$X_i' = X_i + \lambda \, U_1(0;1) \, \frac{\vec{F}_i}{\|\vec{F}_i\|} \otimes V_i, \quad i \in [1:|S|],\ i \neq i_b; \qquad X_{i_b}' = X_{i_b}, \tag{4}$$
where $\lambda$ is the move step (a free parameter).
E. N. Shvareva and L. V. Enikeeva
The components of the vector $V_i$ have the values
$$v_{i,j} = \begin{cases} x_j^+ - x_{i,j}, & F_{i,j} > 0, \\ x_{i,j} - x_j^-, & F_{i,j} \le 0, \end{cases} \quad j \in [1:|X|],\ i \neq i_b. \tag{5}$$
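The charge, force, and movement rules of formulas (3), (4), and (5) can be put together in a short sketch. This is an illustrative reading of the formulas rather than the authors' implementation: the function name `em_step`, its argument order, and the guard against a zero denominator in the charge formula are assumptions, and the fitness function `phi` is any callable to be minimized.

```python
import math
import random


def em_step(X, phi, lower, upper, step, rng):
    """Move every particle except the best one by formulas (3)-(5)."""
    S, D = len(X), len(X[0])
    f = [phi(x) for x in X]
    best = min(range(S), key=f.__getitem__)

    # Charges: exp(-|X| * (phi_i - phi_best) / sum_{j != i} (phi_j - phi_best)).
    q = []
    for i in range(S):
        denom = sum(f[j] - f[best] for j in range(S) if j != i)
        denom = max(denom, 1e-300)  # assumption: avoid division by zero
        q.append(math.exp(-D * (f[i] - f[best]) / denom))

    new_X = []
    for i in range(S):
        if i == best:
            new_X.append(list(X[i]))  # the best particle is not moved
            continue
        # Formula (3): attraction towards better particles, repulsion from worse.
        F = [0.0] * D
        for j in range(S):
            if j == i:
                continue
            diff = [X[j][k] - X[i][k] for k in range(D)]
            dist2 = sum(c * c for c in diff)
            if dist2 == 0.0:
                continue
            sign = 1.0 if f[j] < f[i] else -1.0
            for k in range(D):
                F[k] += sign * diff[k] * q[i] * q[j] / dist2
        norm = math.sqrt(sum(c * c for c in F))
        u = rng.random()  # U_1(0;1), one draw per particle
        moved = []
        for k in range(D):
            if norm == 0.0:
                moved.append(X[i][k])
                continue
            # Formula (5): distance to the border the force points at.
            room = upper[k] - X[i][k] if F[k] > 0 else X[i][k] - lower[k]
            # Formula (4): normalized force scaled by the available room.
            moved.append(X[i][k] + step * u * (F[k] / norm) * room)
        new_X.append(moved)
    return new_X
```

With a move step of at most 1, each coordinate moves only part of the way towards the border it is pushed at, so particles that start inside the parallelepiped stay inside it.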
From formulas (4) and (5) it follows that when moving the particle $s_i$ from position $X_i$ to position $X_i'$ we use the normalized force (3). For each dimension of the vector $X$, we move with a step of random size in the direction of the corresponding upper or lower border of the parallelepiped $P$. The particle $s_{i_b}$ is not moved at this iteration.

5. Check Whether the Iteration End Condition is Met. Depending on the result of this check, we either complete the calculations or proceed to step 2.

Harmony Search (HS). The harmony search algorithm was inspired by the improvisation process of jazz musicians and was introduced by Zong Woo Geem, Joong Hoon Kim, and G. V. Loganathan in 2001. It is an efficient evolutionary algorithm that has shown good results in many problems in various fields [15–19]. The harmony search algorithm scheme includes the following basic steps.

1. Initialize the algorithm.
2. Form the harmony vector.
3. Execute step-by-step adjustment of the harmony vector.
4. Update the harmony memory matrix.
5. End the iterations if the iteration end condition is met; otherwise go back to step 2.
We identify the individuals $s_i$ of the population $S = \{s_i, i \in [1:|S|]\}$ with musicians. We associate the value of the vector of variable parameters $X_i$ with the chord that the musician $s_i$ plays at a given moment in time. The harmony of sounds formalizes the global minimum of the fitness function $\varphi(X)$ [14]. The set of current coordinates $X_i$, $i \in [1:|S|]$, forms the $(|S| \times |X|)$ harmony memory matrix HM.

1. Initialization. The initial values of the components of the vectors are assumed to be uniformly distributed in the hyperparallelepiped:
$$x_{i,j} = x_j^- + U_1(0;1)\,(x_j^+ - x_j^-), \quad i \in [1:|S|],\ j \in [1:|X|], \tag{6}$$
where $[x_j^-; x_j^+]$ is the valid interval and $U_1(0;1)$ denotes a random variable uniformly distributed between 0 and 1.
Electromagnetism-Like Algorithm and Harmony Search
The set of vectors constructed in this way constitutes the initial harmony memory matrix
$$HM = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,|X|} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,|X|} \\ \cdots & \cdots & \cdots & \cdots \\ x_{|S|,1} & x_{|S|,2} & \cdots & x_{|S|,|X|} \end{pmatrix}.$$
2. The Formation of the Harmony Vector $X_i'$ is performed according to the following rules. With probability $\xi_h$, as the component $x_{i,j}'$ of the vector $X_i'$ we use the corresponding component of a random vector $X_i$ from the current harmony memory matrix HM:
$$x_{i,j}' = HM_{i,j}, \quad i = U_1(1:|S|),\ j \in [1:|X|].$$
This operation simulates the playback of a chord from the musician’s harmony memory and is called the random selection operation. With probability $1 - \xi_h$, as the component $x_{i,j}'$ we take a value generated by the initialization formula (6). This operation corresponds to forming a completely random chord from the range available to the musician. The free parameter $\xi_h$ is the relative frequency of the random selection operation, that is, the probability of using the harmony memory (HMCR).

3. Step-by-Step Adjustment of the Harmony Vector. If the component $x_{i,j}'$ of the vector $X_i'$ is selected from the harmony memory, then we perform the following actions. With probability $\xi_p$, we change this component by the formula
$$x_{i,j}' = x_{i,j}' + u_{sign}^{\pm 1} \, bw \, N_1(0;1),$$
and with probability $1 - \xi_p$ we leave it unchanged. Here $u_{sign}^{\pm 1}$ is a random value equal to $1$ or $-1$. The formula means that the component is equally likely to receive a random positive or negative increment of $bw \, N_1(0;1)$. The parameter $\xi_p$ defines the relative frequency of step-by-step tuning, i.e. the probability of a step change, and the parameter $bw$ is the value of the step, which is interpreted as the instrument’s bandwidth.

4. Updating the Harmony Memory Matrix. We calculate the value of the fitness function $\varphi(X_i')$ corresponding to the formed vector $X_i'$. If $\varphi(X_i') < \varphi(X_{i_w})$, then we replace the worst vector $X_{i_w}$ of the harmony memory with the vector $X_i'$. We find the vector $X_{i_w}$ from the condition $\max_{i \in [1:|S|]} \varphi(X_i) = \varphi(X_{i_w})$.

5. Completion of the Iteration. If the condition for the end of iterations is met, then we use the best harmony memory vector $X_{i_b}$, which delivers the smallest value of the fitness function: $\min_{i \in [1:|S|]} \varphi(X_i) = \varphi(X_{i_b})$.
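The five steps above can be sketched as a compact routine. This is a minimal sketch under stated assumptions, not the authors' code: the function name and default parameter values are hypothetical, `hmcr` and `par` stand for $\xi_h$ and $\xi_p$, $N_1(0;1)$ is taken to be a standard normal draw, the stopping rule is a fixed iteration count, and new components are clamped to the valid interval, which the text does not specify.

```python
import random


def harmony_search(phi, lower, upper, size=10, hmcr=0.9, par=0.3, bw=0.1,
                   iterations=2000, seed=0):
    """Minimize phi over the box [lower, upper] with a basic harmony search."""
    rng = random.Random(seed)
    D = len(lower)

    def random_component(j):
        # Formula (6): uniform draw from the valid interval of dimension j.
        return lower[j] + rng.random() * (upper[j] - lower[j])

    # Step 1: initialize the harmony memory matrix HM and its fitness values.
    HM = [[random_component(j) for j in range(D)] for _ in range(size)]
    fit = [phi(x) for x in HM]

    for _ in range(iterations):
        new = []
        for j in range(D):
            if rng.random() < hmcr:
                # Step 2: random selection from the harmony memory.
                x = HM[rng.randrange(size)][j]
                # Step 3: step-by-step adjustment with probability par.
                if rng.random() < par:
                    x += rng.choice((-1.0, 1.0)) * bw * rng.gauss(0.0, 1.0)
            else:
                x = random_component(j)
            # Assumption: clamp to the valid interval after adjustment.
            new.append(min(max(x, lower[j]), upper[j]))
        # Step 4: replace the worst harmony if the new one is better.
        f_new = phi(new)
        worst = max(range(size), key=fit.__getitem__)
        if f_new < fit[worst]:
            HM[worst], fit[worst] = new, f_new

    # Step 5: return the best harmony in memory.
    best = min(range(size), key=fit.__getitem__)
    return HM[best], fit[best]
```

Because only the worst harmony is ever replaced, and only by a better one, the best fitness in memory never gets worse from one iteration to the next.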
3 Results

3.1 Comparison of Electromagnetic Algorithm and Harmony Search with State-of-the-Art Algorithms

We used classical test functions to assess the proposed electromagnetic algorithm and harmony search. All test functions are listed in Table 1, where D is the dimension of the function, Range is the boundary of the function search space, and Opt is the global minimum. In addition, F1–F3 are unimodal functions, while F4–F5 are multimodal functions.

Table 1. Benchmark functions.

No | Formulation | D | Range | Opt
F1 | $f(x) = \sum_{i=1}^{D} x_i^2$ | 30 | [−100, 100] | 0
F2 | $f(x) = \max_i \{\lvert x_i \rvert,\ 1 \le i \le D\}$ | 30 | [−100, 100] | 0
F3 | $f(x) = \sum_{i=1}^{D} \lvert x_i \rvert + \prod_{i=1}^{D} \lvert x_i \rvert$ | 30 | [−10, 10] | 0
F4 | $f(x) = \frac{1}{4000} \sum_{i=1}^{D} x_i^2 - \prod_{i=1}^{D} \cos\left(\frac{x_i}{\sqrt{i}}\right) + 1$ | 30 | [−600, 600] | 0
F5 | $f(x) = -20 \exp\left(-0.2 \sqrt{\frac{1}{D} \sum_{i=1}^{D} x_i^2}\right) - \exp\left(\frac{1}{D} \sum_{i=1}^{D} \cos(2\pi x_i)\right) + 20 + e$ | 30 | [−100, 100] | 0
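For readers who want to reproduce the benchmark, the five functions of Table 1 can be written down directly. The sketch below assumes the standard forms these formulas appear to correspond to (sphere, maximum absolute value, sum plus product of absolute values, Griewank, and Ackley); in particular, reading F3 as a sum plus a product is an assumption based on its range and its unimodal label. The function names are illustrative.

```python
import math


def f1(x):
    """Sphere: sum of squares."""
    return sum(v * v for v in x)


def f2(x):
    """Largest absolute coordinate."""
    return max(abs(v) for v in x)


def f3(x):
    """Sum of absolute values plus their product (assumed form)."""
    return sum(abs(v) for v in x) + math.prod(abs(v) for v in x)


def f4(x):
    """Griewank function."""
    s = sum(v * v for v in x) / 4000.0
    p = math.prod(math.cos(v / math.sqrt(i)) for i, v in enumerate(x, start=1))
    return s - p + 1.0


def f5(x):
    """Ackley function."""
    d = len(x)
    a = -20.0 * math.exp(-0.2 * math.sqrt(sum(v * v for v in x) / d))
    b = -math.exp(sum(math.cos(2.0 * math.pi * v) for v in x) / d)
    return a + b + 20.0 + math.e
```

All five functions attain their global minimum of 0 at the origin, which gives a quick sanity check for any optimizer run against them.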
To compare the optimization efficiency of the algorithms, we took the average value, which characterizes an algorithm's capability for global optimization. The proposed algorithms were compared with the following algorithms: Cuckoo Search (CS), Grey Wolf Optimizer (GWO), Whale Optimization Algorithm (WOA), Particle Swarm Optimization (PSO), and Salp Swarm Algorithm (SSA). The results are shown in Table 2. In this table, “mean” and “Std” mean “mean” and “standard deviation”, respectively. In addition, the best results are shown in bold.

3.2 Simulation Results

The proposed electromagnetic algorithm and harmony search algorithm were evaluated on the inverse problem of chemical kinetics. By solving this inverse problem with the electromagnetic algorithm and harmony search, the values of $E_{ref}$, $E_{met}$, $k_{ref}$, $k_{met}$, $B$, and $m$, which enter the expressions for the reaction rates $W_{ref}$ and $W_{met}$, were optimized. The values obtained are shown in Table 3.
Table 2. Algorithm comparison.

Algorithm | F1 | F2 | F3 | F4 | F5
CS | 0 | 16.6 | 1.76·10^-15 | 7.45·10^-3 | 2
GWO | 0 | 0 | 0 | 0 | 8.23·10^-15
WOA | 0 | 1.28·10^1 | 0 | 4.14·10^-4 | 3.02·10^-15
PSO | 0 | 3.24·10^-5 | 0 | 1.88·10^-2 | 8.88·10^-1
SSA | 4.81·10^-9 | 1.27 | 5.24·10^-1 | 6.73·10^-3 | 2.18
HS | −7.58·10^-5 | −6.06·10^-11 | 4.02·10^-4 | 1.11·10^-1 | 4.75·10^-3
EM | 3.02·10^-12 | 0 | 1.85·10^-8 | 2.43·10^-3 | 1.53·10^-6

Table 3. The results of the optimization.

Algorithm | E ref, kJ/mol | k ref | E met | k met | m | B | Best fitness
EM | 115.3 | 1.8·10^11 | 39.6 | 1.0·10^5 | 1.0 | 50.58 | 0.027
HS | 116.4 | 7.6·10^10 | 44.9 | 6.2·10^5 | 0.7 | 3.64 | 0.022
The optimal values obtained were used to solve the direct problem of chemical kinetics for experiments on propane pre-reforming. The model correctly describes the available experimental data. The comparison of the experimental data with the theoretical data obtained using the electromagnetic and harmony search algorithms is shown in Fig. 1.

[Figure 1: two panels plotting concentration (vol. %) against temperature (°C) from 220 to 320 °C. Series: HS_C3H8, C3H8_ex, EM_C3H8; HS_CH4, CH4_ex, EM_CH4; HS_CO2, CO2_ex, EM_CO2; HS_H2, H2_ex, EM_H2.]

Fig. 1. Comparing experimental data and theoretical data obtained using electromagnetic and harmonic algorithms.
4 Conclusion

In this work, an electromagnetic algorithm and a harmony search algorithm were developed for the problem of chemical kinetics. The algorithms were tested on five benchmark functions with known minima, and the results were compared with those of other algorithms. The proposed algorithms showed good performance. The inverse problem of chemical kinetics was then solved using the developed electromagnetic algorithm and harmony search; the results were checked against the direct problem of this process and compared with experimental data. The research goal has been achieved: the results were adequate and the computed data corresponded to the experimental data. A comparison of the two algorithms showed that harmony search gives a better optimization, that is, smaller deviations from the experimental data, but on the whole the algorithms give almost the same results. It may be possible to choose even better parameters of the considered algorithms for these problems, so in the future the authors plan to tune the algorithm parameters using a genetic algorithm and to parallelize the computations for a more efficient search for the constants.

Acknowledgments. The reported study was funded by RFBR, project number 19-37-60014.
References
1. Li, J., Pan, Q., Duan, P., Sang, H., Gao, K.: Solving multi-area environmental/economic dispatch by Pareto-based chemical-reaction optimization algorithm. IEEE/CAA J. Automat. Sin. 6(5), 1240–1250 (2017)
2. Li, J., Pan, Q., Wang, F.: A hybrid variable neighborhood search for solving the hybrid flow shop scheduling problem. Appl. Soft Comput. 24, 63–77 (2014)
3. Zarei, K., Atabati, M., Moghaddary, S.: Predicting the heats of combustion of polynitro arene, polynitro heteroarene, acyclic and cyclic nitramine, nitrate ester and nitroaliphatic compounds using bee algorithm and adaptive neuro-fuzzy inference system. Chemometr. Intell. Lab. Syst. 128, 37–48 (2013)
4. Salehi, M.: Maximum probability reaction sequences in stochastic chemical kinetic systems. Front. Physiol. 1, 170 (2010)
5. Aamir, E., Nagy, Z.K., Rielly, C.D., Kleinert, T., Judat, B.: Combined quadrature method of moments and method of characteristics approach for efficient solution of population balance models for dynamic modeling and crystal size distribution control of crystallization processes. Ind. Eng. Chem. Res. 48, 8575–8584 (2009)
6. Sheth, P.N., Babu, B.V.: Differential evolution approach for obtaining kinetic parameters in nonisothermal pyrolysis of biomass. Mater. Manuf. Process. 24, 47–52 (2008)
7. Sinha, S., Praveen, C.: Optimization of industrial fluid catalytic cracking unit having five lump kinetic scheme using genetic algorithm. Comput. Model. Eng. Sci. 32, 85–101 (2008)
8. Chainikova, E.M., et al.: Interplay of conformational and chemical transformations of ortho-substituted aromatic nitroso oxides: experimental and theoretical study. J. Org. Chem. 82, 7750–7763 (2017)
9. Uskov, S.I., et al.: Kinetics of low-temperature steam reforming of propane in a methane excess on a Ni-based catalyst. Catal. Ind. 9, 104–109 (2017). https://doi.org/10.1134/S2070050417020118
10. Akhmadullina, L.F., Enikeeva, L.V., Gubaydullin, I.M.: Numerical methods for reaction kinetics parameters: identification of low-temperature propane conversion in the presence of methane. Procedia Eng. 201, 612–616 (2017)
11. Zyryanova, M.M., Snytnikov, P.V., Shigarov, A.B., Belyaev, V.D., Kirillov, V.A., Sobyanin, V.A.: Low temperature catalytic steam reforming of propane–methane mixture into methane-rich gas: experiment and macrokinetic modelling. Fuel 135, 76–82 (2014)
12. Uskov, S.I., et al.: Fibrous alumina-based Ni-MOx (M = Mg, Cr, Ce) catalysts for propane pre-reforming. Mater. Lett. 257, 126741 (2019)
13. Uskov, S.I., Potemkin, D.I., Enikeeva, L.V., Snytnikov, P.V., Gubaydullin, I.M., Sobyanin, V.A.: Propane pre-reforming into methane-rich gas over Ni catalyst: experiment and kinetics elucidation via genetic algorithm. Energies 13, 3393 (2020)
14. Karpenko, A.P.: Modern search engine optimization algorithms. Nature-inspired algorithms, p. 446. BMSTU, Moscow (2014)
15. Manjarres, D., et al.: A survey on applications of the harmony search algorithm. Eng. Appl. Artif. Intell. 26, 1818–1831 (2013)
16. Geem, Z.W.: Optimal cost design of water distribution networks using harmony search. Eng. Optim. 38, 259–277 (2006)
17. Wang, L., Li, L.: An effective differential harmony search algorithm for the solving nonconvex economic load dispatch problems. Int. J. Electr. Power Energy Syst. 32, 832–843 (2013)
18. Assad, A., Deep, K.: Applications of harmony search algorithm in data mining: a survey. In: Proceedings of Fifth International Conference on Soft Computing for Problem Solving, pp. 863–874 (2016). https://doi.org/10.1007/978-981-10-0451-3_77
19. Ma, S., Dong, Y., Sang, Z., Li, S.: An improved AEA algorithm with Harmony Search (HSAEA) and its application in reaction kinetic parameter estimation. Appl. Soft Comput. 13, 3505–3514 (2013)
Multi-Level Visualization with the MLV-Viewer Prototype

Carlos Manuel Oliveira Alves1(B), Manuel Pérez Cota2, and Miguel Ramón González Castro2
1 Instituto Politécnico de Castelo Branco, Av. D. Pedro Álvares Cabral 12, 6000-Castelo Branco, Portugal
[email protected]
2 Universidade de Vigo, Rúa Torrecedeira 86, 36208-Vigo, Spain
[email protected], [email protected]

Abstract. Data, especially in large volumes, can and should be presented as an interactive, computer-supported graphical representation, which supports decision makers and enhances their cognition. Appropriate tools, methods, and techniques can increase the understanding of data, all the more so when the data are large in volume and multidimensional. These visual and interactive representations, combined with analysis methods, enable decision makers to combine flexibility, creativity, and human knowledge with computer storage and processing resources to obtain a more effective view of complex problems. Decision makers should also be able to interact directly with the data analysis, adapting it to their preferences and needs. This article presents the MLV-Viewer prototype, a universal decision support system that allows multilevel data visualization by associating a set of data with a symbol.

Keywords: Data analytics in DSS · DSS · Big data visualization
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 250–258, 2022. https://doi.org/10.1007/978-3-030-82196-8_19
1 Introduction
Model-oriented information systems that support the user in decision making emerged in the 1960s under the name Decision Support Systems (DSS) [1, 2]. Data visualization must be understood as dynamic, so that it allows the user to react in real time to new developments by using virtual environments or network technology, as well as computer graphics algorithms. The user, that is, the decision maker, thus obtains a global idea of the problem and can then focus on the details using an appropriate graphic model adjusted to his reality and needs. Data visualization can therefore be described as a “graphical representation of data with the main objective of providing the user with a qualitative and easy understanding of the information content (which can be data, relationships, perceptions or processes)” [3]. It transforms data, objects, numbers and concepts, among others, into a representation easily interpreted by the human eye, allowing the user to select, filter and transform that data and to choose the correct type of visualization (such as bar, line or bubble charts, among others). Visualization techniques simplify the data and its representations, making the relevant information easier to perceive through the associated models and highlighting data that would otherwise be hidden. The integration of visualization techniques in dynamic [4, 5] and complex systems is therefore highly recommended for data abstraction, allowing the user to navigate quickly through a wide range of data. According to [5–8], visual data analysis is important for analysts and decision makers because it reduces and synthesizes the search through large volumes of data and facilitates the task of exploring the data and obtaining relevant information. This article presents a prototype of a universal DSS, called MLV-Viewer, developed for a web environment, whose main characteristics are that it is free and adaptable to the needs of the decision maker. Section 1 (Introduction) presents the concept, Sect. 2 (Literature Review) reviews the literature, Sect. 3 (Multi-Level Visualization with the MLV-Viewer) describes the prototype and, finally, Sect. 4 (Conclusion and Future Work) presents conclusions and future work.
2 Literature Review
A DSS is a tool, usually interactive and computer-based, that supports decision makers in solving problems (unstructured, semi-structured or structured) in the most diverse areas of business/activity. Its main components are the decision maker, the database, the models and procedures, and the program [4–8, 12, 13]. The system presented in [9], called “Web-based multi-level job assignment DSS” (WMJADSS), continues previous work (WJADSS, “Web Based Task Assignment Decision Support System”). It complements the multi-level job assignment process usually found in a project’s workflow, adjusting existing models to support both single-level and multi-level job assignment. The effectiveness and efficiency of the work done throughout the project workflow are increased, and the data stored in the repositories are used to produce descriptive and procedural knowledge for the project’s stakeholders. The study in [10] analyzes the architecture of a decision support system applied to tariffs and trade. The system facilitates data analysis and trade forecasting, integrates reporting and information retrieval tools, and supports decision making by providing quantitative economic analysis and system modeling tools. It uses multidimensional data warehouse modeling techniques as well as artificial intelligence. According to Edoardo L’Astorina [11], the 20 best tools for data visualization fall into two groups: those aimed at the end user and those aimed at programmers/developers.
The first group includes:
• Tableau: a Big Data visualization tool for companies that allows the creation of tables, graphs and maps, among others. It can work as a desktop application or as a service hosted on a cloud server;
• Infogram: associates visualizations and infographics with Big Data in real time, allowing the user (media editors, journalists, designers and educators) to choose among and share different models of graphs, maps, images and videos;
• ChartBlocks: an online tool that creates visualizations from spreadsheets, databases and live feeds. A chart-building wizard, based on HTML5 and the JavaScript library D3.js, creates responsive visualizations for different devices and also allows the resulting graphics to be embedded in web pages or shared on Twitter and Facebook;
• Datawrapper: aimed at editors and journalists, it allows creating, uploading and publishing graphics and maps, with customized layouts, on web sites;
• Plotly: an easy-to-use web tool for creating graphics; users with programming skills can also use an API that includes the JavaScript and Python languages;
• RAW: starting from data in Microsoft Excel, Google Docs, Apple Numbers or a comma-separated list, RAW exports the visualization to Adobe Illustrator, Sketch or Inkscape, thus bridging spreadsheets and vector graphics;
• Visual.ly: a visual content service in which a team provides support for the entire duration of the project.
3 Multi-Level Visualization with the MLV-Viewer
The present work consists of a universal DSS prototype, called MLV-Viewer, built on a relational database (currently the MySQL DBMS), the PHP and JavaScript languages, and the graphics libraries CanvasJS, Chart.js, JpGraph, Highcharts and three.js. Through a web environment, the user (who does not need to be a computer expert) selects and interacts with the data to be viewed and chooses the graphical representation for that visualization. The result is an interactive, real-time, multiplatform and freely accessible universal DSS for use in a web environment, adaptable to the needs of the decision maker through several user profiles. The user can choose several ways to visualize the same data (dynamically selected by the user), opting for views 1C, 2C, 3C, 4C, 5C, 6C or 7C. The MLV-Viewer uses the concept of visualization in layers (also called levels) associated with a symbol. The views 1C and 2C can be compared to 1D and 2D views, in which the XX and YY axes are used to represent data. The 3C view is similar to a 3D view, with the caveat that it is actually a 2.5D view: it emulates 3D on a display surface that is a plane and therefore has only two dimensions. In the MLV-Viewer prototype, the 4C view (see Fig. 1) adds to the 3C view (emulated in 2.5D) a layer on the symbol used to represent the data. This layer is the size of the symbol: the symbol is a circle whose diameter is proportional to the value of the data to be visualized. The same principle applies to views 5C, 6C and 7C, each of which adds a new layer. The 5C view (see Fig. 2) is built from the 4C view by adding color as a new layer; the same color is always associated with the same type/value of data. The 6C view (see Fig. 3) is built from the 5C view by adding shape as a new layer; the same shape is always associated with the same type/value of data. The 7C view (see Fig. 4) is obtained from the 6C view by adding a new outer layer to the symbol (like the layers of an onion). This new layer is represented by a variable thickness.
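The layering of views 4C to 7C can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the prototype itself is built with PHP and JavaScript, and it is not the MLV-Viewer code. The record field names, the color and shape tables, and the `px_per_unit` scaling factor are hypothetical choices made only for this illustration.

```python
from dataclasses import dataclass

# Hypothetical lookup tables: the text only states that the same type/value
# of data is always associated with the same color and the same shape.
COLORS = ["red", "green", "blue", "orange", "purple"]
SHAPES = ["circle", "square", "triangle", "diamond"]


@dataclass
class Symbol:
    x: float         # layer 1: XX axis
    y: float         # layer 2: YY axis
    z: float         # layer 3: ZZ axis (emulated, 2.5D)
    diameter: float  # layer 4: proportional to a data value
    color: str       # layer 5: fixed per data type/value
    shape: str       # layer 6: fixed per data type/value
    outline: float   # layer 7: thickness of the outer layer


def encode(record, levels=7, px_per_unit=2.0):
    """Map one record (a dict with hypothetical keys) to a symbol with 4 to 7 layers."""
    if not 4 <= levels <= 7:
        raise ValueError("this sketch covers views 4C to 7C only")
    return Symbol(
        x=record["x"],
        y=record["y"],
        z=record["z"],
        diameter=px_per_unit * record["size_value"],
        # Layers above the requested level fall back to neutral defaults.
        color=COLORS[record["color_class"] % len(COLORS)] if levels >= 5 else "black",
        shape=SHAPES[record["shape_class"] % len(SHAPES)] if levels >= 6 else "circle",
        outline=px_per_unit * record["outline_value"] if levels >= 7 else 0.0,
    )
```

Calling `encode(record, levels=4)` keeps only position and diameter, while `levels=7` also fills in color, shape and outline thickness, mirroring how each view adds one layer to the previous one.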
Fig. 1. View 4C “radius points”
In the 4C view (see Fig. 1), the XX, YY and ZZ axes and the radius provide the four levels/layers. The 5C view (see Fig. 2) adds color to the 4C view. The 6C view (see Fig. 3) adds the shape of the symbol to the 5C view. The 7C view (see Fig. 4) adds to the 6C view an outline around the symbol whose thickness depends on the amount of data to be represented.
Fig. 2. View 5C “radius points”
Fig. 3. View 6C “radius points”
Critical Analysis
For this comparison, the tools described in Sect. 2 were analyzed (see Table 1): Tableau, Infogram, ChartBlocks, Datawrapper, Plotly, RAW and Visual.ly. Table 1 shows that practically all the tools can be used in a web environment, are free or have free versions, are of universal use, allow the import and export of data, and offer a variety of graphs for data visualization. However, for the parameters “Data Mining”, “Data Prediction” and “data visualization/view in 4C, 5C, 6C, 7C” the scenario is completely different. Regarding the “Data Mining” parameter, only Tableau and Plotly (with some limitations) offer this functionality. For the “Data Prediction” parameter, only Plotly (with some limitations) offers this functionality. Finally, for the parameter “data visualization/view in 4C, 5C, 6C, 7C”, none of the tools except the MLV-Viewer offers this functionality.
Fig. 4. View 7C “radius points”
Table 1. Comparison between MLV-Viewer and other tools
Visualization tool | Web | Free | Universal | Data import | Data export | Data visualization | Visualization 4C, 5C, 6C, 7C | Data mining | Data prediction
Tableau | X | X | V | V | V | V | X | V | X
Infogram | V | V¹ | V² | V | V³ | V | X | X | X
ChartBlocks | V | V¹ | V | V | V⁴ | V | X | X | X
Datawrapper | V | V¹ | V | V | X⁵ | V | X | X | X
Plotly | V | V⁶ | V | V | X | V | X | X⁷ | X⁸
RAW | V | V | V | V | V | V | X | X | X
Visual.ly | V | X | V | V | X | V | X | X | X
MLV-Viewer | V | V | V | V | V | V | V | V | V
The following SWOT analysis summarizes the MLV-Viewer prototype:
• Strengths:
  • Ability to innovate: the present team of researchers has the skills and abilities needed to design innovative products;
  • Free technologies and tools: the development uses free or mostly free technologies and tools, so the end user has no acquisition or usage costs;
• Weaknesses:
  • Development team: because the team is small, it takes time to produce a product or a new feature;
  • Disclosure capacity: because the team is not a company, it has no marketing resources, which makes promoting the product slower and more difficult;
• Opportunities:
  • Innovative product: there is nothing similar on the market (as far as the authors are aware);
  • Attracting a large and wide target audience: because the product is not focused on a single area of activity or business, its reach is more global;
• Threats:
  • Ease of reproduction: the ideas that make the product innovative are easily copied/reproduced.
4 Conclusion and Future Work
It is important that users can choose and decide what data they want to view, when and how [7]. The system/tool should also recommend the visualization best suited to the business/activity area, because a bad choice not only makes an analysis difficult but can even make it impossible to obtain the desired information. A decision support visualization tool must meet the specifics of each use and situation without overloading the user with data and information that are of no interest for decision making; at the same time, a good system/tool is expected to help the user obtain meaningful information in a simple, friendly and customizable way. Visualization and computer graphics must be understood as dynamic, virtual and networked, enabling the user of these tools and systems to obtain a complete idea of a scenario before focusing on the details. In other words, visualization is more than display: it makes visible what cannot be seen in a natural and immediate way, with the main aim of providing a qualitative, quantitative, easy and quick understanding of the information content (which can be data, relationships, perceptions or processes) by transforming objects, numbers and concepts into a form that can be easily captured by the human eye and interpreted by the brain. Users should be able to see what they want, when they want and how they want it, and it is even better when the visualizations are combined so that the data become easier to interpret. An incorrect choice of display type can prevent the user from obtaining the desired information. Adams R. says that decision support systems (DSS) can be seen as an “extension of the idea of management information systems, providing a wider range of information in a more flexible and interactive way” [7].
The development of a universal DSS (UDSS) is not easy. The system must be configurable in order to meet the specificities and needs of each decision maker because, although the data/information may be displayed graphically, users may not understand everything that is displayed or may be overloaded with too much data at the same time. A good DSS tool should help decision makers find meaningful information in a quick, simple, friendly and customizable way, and should suggest the best view to obtain that information. The visualization of information contained in data using graphs or images is a technology in continuous development, transforming the traditional methods of representation and analysis of relational data into structures and relationships more easily interpreted by humans. These structures and relationships facilitate dialogue, communication, analysis and interpretation in dynamic, complex and real-time environments. The complexity normally associated with a large volume of data can be mitigated through mining algorithms and data visualization to reveal hidden knowledge or patterns. As far as possible, a UDSS should be easy, simple, effective and customizable, playing an important role in decision making. The UDSS prototype presented here, called MLV-Viewer, intends to implement a dynamic and interactive system, in a web environment, using a diverse range of data visualizations and data mining to convert data into meaningful information. Its advantages are that it is multiplatform, free, customizable by the user, real-time and universal; it imports and exports data; it offers different types of visualizations with multiple dimensions; and it supports data analysis and forecasting. The innovative concept of 3C to 7C visualizations can be easily reproduced. Research is not easy, and wanting to do it is not enough. It is a laborious process that requires methodology, skill, expertise, dexterity, proficiency, creativity and luck, which, most likely, are not available to everyone. Research is not just about having the luck of an apple falling on your head; it is also necessary to know where to look for innovation. In the same way that a computer application is never finished, the MLV-Viewer prototype can be improved in many ways. Among others, the following stand out:
• Implement new features for different user profiles, such as “create”, “insert”, “change”, “delete” and “view”. These features would be associated with the database, tables, data or types of visualization. In this way, a user could be limited to viewing data (possibly only some data, e.g., without being able to perform data prediction), without being able to export data, create new views on existing data, etc.;
• Recommend the set of visualizations best suited to the activity/business area and the respective data;
• Allow the user to create and customize a visualization (which would also meet the needs of users with a certain disability);
• Explore other forms of visualization, e.g., using virtual/augmented reality;
• 4C, 5C, 6C and, mainly, 7C views can require a lot of calculation time, so parallel computing is a promising option;
• Develop visualizations beyond 7C (8C, 9C, etc.);
• Explore video animation associated with data or a symbol (whenever the cursor is placed over that symbol).
References
1. Morton, M.S.S.: Management Decision Systems. Graduate School of Business Admin., Harvard Univ, Division of Research (1971)
2. Power, D.J.: Decision support systems: a historical overview. In: Handbook on Decision Support Systems 1, pp. 121–140. International Handbooks Information System. Springer, Berlin, Heidelberg (2008). https://doi.org/10.1007/978-3-540-48713-5_7
3. Kumar, S.M., Belwal, M.: Performance dashboard: cutting-edge business intelligence and data visualization. In: Proceedings of the 2017 International Conference On Smart Technology for Smart Nation, SmartTechCon 2017, pp. 1201–1207 (2018)
4. Grignard, A., Drogoul, A., Zucker, J.D.: A model-view/controller approach to support visualization and online data analysis of agent-based simulations. In: The 2013 RIVF International Conference on Computing & Communication Technologies - Research, Innovation, and Vision for Future (RIVF), pp. 233–236 (2013)
5. Cota, M.P., Castro, M.R.G., Dominguez, J.A.: Importance of visualization usage in enterprise decision making environments. In: 2014 9th Iberian Conference on Information Systems and Technologies (CISTI), pp. 1–7 (2014)
6. Ellouzi, H., Ltifi, H., Ben Ayed, M.: New multi-agent architecture of visual intelligent decision support systems application in the medical field. In: 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA) (2015)
7. Yan, X., Qiao, M., Li, J., Simpson, T.W., Stump, G.M., Zhang, X.: A work-centered visual analytics model to support engineering design with interactive visualization and data-mining. In: 2012 45th Hawaii International Conference on System Sciences, pp. 1845–1854 (2012)
8. Jorgensen, M., Spohn, J., Bunn, C., Dong, S., Li, X., Kaeli, D.: An interactive big data processing/visualization framework. In: 2017 IEEE MIT Undergraduate Research Technology Conference, URTC 2017, vol. 2018-Jan, pp. 1–4 (2018)
9. Vongsumedh, P.: A framework for building a decision support system for multi-level job assignment. In: 2009 Fourth International Multi-Conference on Computing in the Global Information Technology (2009)
10. Yu, C.: Architecture research of decision support system for tariff and trade based on the multi-dimensional modeling techniques. In: 2013 IEEE Third International Conference on Information Science and Technology (ICIST) (2013)
11. L’Astorina, E.: Review of 20 best big data visualization tools. https://bigdata-madesimple.com/review-of-20-best-big-data-visualization-tools/. Accessed Dec 2020
12. Bencsik, G., Bacsárdi, L.: Towards to decision support generalization: the universal decision support system concept. In: 2015 IEEE 19th International Conference on Intelligent Engineering Systems (INES), pp. 277–282 (2015)
13. Kozielski, M., Sikora, M., Wróbel, Ł.: DISESOR - decision support system for mining industry. Proc. Federated Conf. Comput. Sci. Inf. Syst. 5, 67–74 (2015)
One-Class Self-Attention Model for Anomaly Detection in Manufacturing Lines
Linh Le1(B), Srivatsa Mallapragada2, Shashank Hebbar2, and David Guerra-Zubiaga3
1 Department of Information Technology, Kennesaw State University, 1000 Chastain Road Kennesaw, Kennesaw, GA 30044, Georgia [email protected]
2 School of Analytics and Data Science, Kennesaw State University, 1000 Chastain Road Kennesaw, Kennesaw, GA 30044, Georgia {smallapr,shebbar}@students.kennesaw.edu
3 Department of Robotics and Mechatronics Engineering, 1000 Chastain Road Kennesaw, Kennesaw, GA 30044, Georgia [email protected]
Abstract. In this paper, we present our case study on anomaly detection in manufacturing lines. More specifically, our goal is to explore machine learning and deep learning models, including our own architectures, for detecting different types of irregularities in data collected from a manufacturing system in operation. We focus on four types of sensors, which measure the air pressures and water flows of a liquid injection process, and the positions and torques of a transportation motor. The system operates cyclically: the collected data can be divided into cycles with similar patterns, each of which forms a data instance in this anomaly detection task. Since procuring labeled data with actual anomalies is costly, we simulate four types of abnormal patterns to examine the models’ behaviors in each case. The tested models include One-Class Support Vector Machine, Isolation Forest, Deep Auto-Encoders, and Deep One-Class Classifier. We further empirically design two deep learning architectures called One-Class Self-Attention (OCSA) models. OCSA integrates self-attention mechanisms with the one-class classifier training objective to combine the representation capacity of the former with the modeling capability of the latter. Our experimental study shows that our proposed designs consistently achieve the highest or competitive performance in both detection rates and running times in a majority of tests.
Keywords: Anomaly detection · Manufacturing automation line · One-class classifier · Self-attention · Deep learning · Machine learning
L. Le and S. Mallapragada: Equal contributions.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 259–275, 2022. https://doi.org/10.1007/978-3-030-82196-8_20
1 Introduction
Predictive Maintenance (PdM) [23] is an important function in any manufacturing industry. In brief, PdM means performing maintenance before the system’s performance degrades, within a certain threshold, and when the maintenance activity is the most cost-effective. In [28], Selcuk shows that an industry that implements PdM techniques can obtain a tenfold return on investment, a 25–30% reduction in maintenance costs, and the elimination of 70–75% of breakdowns in the manufacturing line. In such practices, monitoring the operation parameters to analyze the health of the systems is an important step. The emergence of the Industrial Internet of Things (IIoT) has provided more effective approaches to tracking the operation parameters of a manufacturing system. More specifically, we now have access to IIoT devices and sensors that can be installed in a system and log its operation measurements, which can be used to analyze the system’s health and optimize its industrial processes and maintenance. One application made possible by this development is the early detection of abnormal behaviors in the system. Serradilla et al. [29] review the process of anomaly detection in detail and show that the analysis can help identify future system failures in manufacturing and production lines. The authors then present a road-map for a data-driven PdM approach whose first stage is anomaly detection. Consequently, in this paper, we study the task of anomaly detection in data collected from IIoT sensors in a manufacturing system. Our goal is to develop an in-depth analysis of how different technologies perform on different types of anomalous patterns in such data. While utilizing machine learning and deep learning in manufacturing has been an ever-growing research area [29], studies that explicitly apply these advanced techniques to anomaly detection are not as common. This in turn may limit the potential applications and benefits that machine learning and deep learning may bring to manufacturing plants [5, 29]. With that motivation, in this paper, we present a case study on utilizing machine learning and deep learning techniques to detect faults that occur during the operation of a manufacturing line. More specifically, we collect four different types of data generated by sensors in a working line. The four sensors measure the air pressures and water flows of a liquid injection process, and the positions and torques of a motor in the system. The unit operates in an iterative manner, that is, the process consists of multiple, repeated cycles of similar measurements. Our task is to identify the cycles that have abnormal patterns compared to the rest of the cycles; such patterns may emerge from any of the four sensors mentioned previously. To generalize, our task is to identify anomalies in a set of time series data. As collecting labeled data with actual anomalies in practice is costly, we opt for a simulation study instead. To be precise, we select a period of data in which the system operates normally. Then, four types of anomalies are generated and injected into the normal data, specifically, when a part of a cycle 1) becomes flat, 2) changes in scale, 3) changes in location, or 4) is noisier than usual. The models that we use include One-Class Support Vector Machine (OCSVM) [3], Isolation Forest [18], Deep Auto-Encoder (DAE) [2] with error model output, DAE with OCSVM output, and Deep One-Class Classifier [24]. We further present our One-Class Self-Attention (OCSA) model, a deep architecture design that integrates the self-attention mechanism [31] and the one-class classifier training objective [24]. As shown in [24], one-class models outperform the more commonly used DAEs, whereas attention mechanisms have been shown to be highly effective in handling sequential data [31]. The purpose of this design is to incorporate the modeling capability of one-class models and the representation capacity of attention models in a single architecture. Our experimental study shows that our OCSA designs consistently achieve higher or competitive performance compared to other models. To sum up, our goals in this paper are twofold: 1) conduct a thorough study of the performance of different anomaly detection models on industrial sensor data, and 2) introduce and showcase our design of deep architectures for the task. Our contributions are as follows:
1. We present a case study on anomaly detection in data collected from a manufacturing line. We provide a detailed comparison of the performance of common and state-of-the-art models on different types of sensor data with different types of anomalies. To our knowledge, such research is not currently available.
2. We present our deep neural network design that combines the advantages of self-attention mechanisms and one-class classifier training for anomaly detection in manufacturing data. Although one-class classifier architectures show high performance in anomaly detection [24], we find that they are not utilized as commonly as auto-encoder architectures and have not been integrated with attention mechanisms. We show that this combination is able to outperform other models in a majority of tests.
The rest of the paper is organized as follows. In Sect. 2, we discuss the current research related to our paper. In Sect. 3, we formally describe our task. We present our One-Class Self-Attention architecture in Sect. 4 and discuss the experimental study in Sect. 5.
2 Literature Review
In this section, we provide a general overview of current anomaly detection techniques that are related to our work. Perhaps the best-known machine learning technique for anomaly detection is the One-Class Support Vector Machine (OCSVM) [27]. OCSVM seeks a hyper-plane that best separates the instances from the origin. Anomalies are then labeled by their positions relative to the hyper-plane. The work in [30] introduces an improved version of OCSVM, Support Vector Data Description (SVDD). SVDD seeks the smallest hyper-sphere that encloses the majority of instances. Both models handle non-linearities by applying the kernel trick, that is, by using a kernel function to map data to an implicit feature space. While powerful, the performance of kernel methods like these two models depends heavily on the hyper-parameters, including those of the selected kernel function. Another disadvantage of these kernel methods is that they have high complexity ($O(n^3)$) and scale poorly to big datasets. Isolation Forest (IF) [18] is another common machine learning model for anomaly detection that we use in this paper. An IF uses a set of trees that isolate instances by randomly splitting their features, with each split represented by a node in the tree. Instances with shorter distances to the trees’ roots are then labeled as anomalies. In our study, we observe that the performance of forest ensemble models may be unstable at times due to the randomness in their construction. More recently, deep neural networks (DNN) [16] have been rapidly developed and used in a variety of domains, including unsupervised anomaly detection. In short, a DNN is a model that consists of a large number of parameters that are partitioned into layers. Except for the input layer, each layer in a DNN takes as input a non-linear transformation of the output of its previous layer. Different DNN architectures are designed for different data types, for example, fully-connected DNN for tabular data, Convolutional Neural Networks (CNN) [15] for images, and Recurrent Neural Network (RNN) types [20], e.g., vanilla RNN [10], Long Short-Term Memory (LSTM) [12], and Gated Recurrent Unit (GRU) [8], for sequential data. DNNs are increasingly applied to anomaly detection. For example, in [19], Malhotra et al. use stacked LSTM to model input sequences by predicting their next values. The prediction errors are then modeled as a multivariate Gaussian distribution, based on which anomalies are identified. Zheng et al. [33] use a multi-channel CNN on multivariate time series data (MC-DCNN), in which each channel handles a single feature of the multivariate data, and show that their model outperforms other baseline methods. Another popular deep architecture used for anomaly detection is the Deep Auto-Encoder (DAE) [22]. A DAE consists of two components, an encoder and a decoder. The encoder transforms input data into embedding vectors (usually smaller than the original inputs), which the decoder uses to reconstruct the inputs with minimized errors. The work in [2] obtains embedding vectors of inputs with a DAE and then applies density-based clustering to them to identify anomalous groups based on their low densities. Additionally, a common way of adapting DAE to anomaly detection is to use the reconstruction errors, since anomalous instances tend to generate higher reconstruction errors. Examples of such works are [4, 7, 25, 32]. A problem with these models is that they are not explicitly trained to solve an anomaly detection problem, and thus their performance may not be optimal. Furthermore, the decoder is usually mirrored from the encoder, which in turn increases the complexity and training time of a DAE. A different method of detecting anomalies with deep learning is the Deep One-Class Classifier (DOCC) [24]. DOCC utilizes a deep architecture to map data to a feature space in which regular instances are contained in a hyper-sphere of minimized radius. As shown in [24], one-class classifiers yield better results than other types such as DAEs or generative models like AnoGAN [26]. As DOCC closely relates to our proposed methods, we discuss DOCC in more detail in Sect. 4.
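The hyper-sphere idea shared by SVDD and DOCC can be illustrated with a small sketch. This is not the authors’ model and involves no neural network: it assumes the embeddings are already given, fixes the center at their mean rather than optimizing it, and takes the radius as a quantile of the distances. The function names and the `quantile` parameter are hypothetical.

```python
import math


def fit_hypersphere(embeddings, quantile=0.95):
    """Return (center, radius) of a sphere that encloses `quantile` of the points.

    Simplification: the center is the mean of the embeddings, whereas SVDD and
    DOCC optimize the sphere (and, for DOCC, the feature map) during training.
    """
    n, dim = len(embeddings), len(embeddings[0])
    center = [sum(p[j] for p in embeddings) / n for j in range(dim)]
    dists = sorted(math.dist(p, center) for p in embeddings)
    # Index of the smallest distance that covers the requested fraction.
    k = min(n - 1, max(0, math.ceil(quantile * n) - 1))
    return center, dists[k]


def is_anomaly(point, center, radius):
    """Flag a point that falls outside the fitted sphere."""
    return math.dist(point, center) > radius
```

In this simplified view, the distance of a new instance to the center acts as its anomaly score, and the radius plays the role of the decision threshold.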
A newer type of deep learning architecture is the attention-based neural network [31]. In short, an attention block takes three inputs, a Query matrix, a Key matrix, and a Value matrix, and outputs a score matrix that represents the “context” of each row in the Query matrix with respect to the rows in the Key and Value matrices. Attention-based architectures have been adapted to anomaly detection problems in different areas such as operating systems [9], computer networks [17, 35], system logs [6], surveillance video [34], traffic monitoring [14], etc. These studies, however, focus on areas that are relatively different from anomaly detection in manufacturing systems. To our knowledge, works that integrate attention mechanisms in our area are relatively limited at the moment. Overall, we find that traditional anomaly detection models like OCSVM and DAE suffer from a complexity problem and are thus not scalable to the large amount of data collected from manufacturing sensors. Moreover, newer technologies like DOCC and attention mechanisms have yet to be applied to industrial problems. Finally, the literature lacks a formal study that applies and compares this variety of technologies on actual data collected from manufacturing sensors. Therefore, in this paper, we provide an empirical comparison of different machine learning and deep learning techniques for identifying abnormalities in industrial data. We believe this work to be a meaningful case study to which manufacturers can refer when utilizing deep learning technologies in their systems. Furthermore, we present our deep architectures, which integrate the capabilities of both the one-class training mechanism and the self-attention mechanism to obtain better modeling capacity. Our techniques address the problems of scalability, complexity, and training goal in models like OCSVM, SVDD, and DAE, and empirically outperform them in terms of accuracy and running time.
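The attention block described at the start of the paragraph above can be sketched as follows. The sketch assumes the common scaled dot-product form, in which each Query row is compared with every Key row and the resulting softmax weights average the Value rows; the text does not state which variant the OCSA models use, and learned projection matrices are omitted here. All function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, key, value):
    """Scaled dot-product attention on matrices given as lists of row vectors."""
    d = len(key[0])
    out = []
    for q in query:
        # Similarity of this query row to every key row, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key]
        weights = softmax(scores)
        # Context vector: weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, value))
                    for j in range(len(value[0]))])
    return out


def self_attention(x):
    """Self-attention without learned projections: Q = K = V = x (a simplification)."""
    return attention(x, x, x)
```

For a cycle represented as a sequence of time-step vectors, `self_attention` would return one context vector per time step, each a weighted mix of all time steps in the same cycle.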
3 Task Definition
In this section, we describe the task of this case study in detail. We focus on four of the sensors installed in the production system: air pressure, water flow, motor position, and motor torque. The sensors record their measurements every 5 ms. The system operates cyclically; we consider each period from when the motor starts moving until it returns to its original position as one cycle. The sensors output approximately 950 values during a cycle, which means each cycle can be considered a time series or signal of 950 time points. We show a sample cycle from each sensor in Fig. 1. The x-axis represents the time steps in the cycle, and the y-axis represents the values recorded by the sensors after standardization. Our goal is to detect anomalous cycles that may occur while the system is operating. We consider the following behaviors as anomalies:
– A part of the cycle is flat. We refer to this type as “flattened anomaly”.
– A part of the cycle is scaled up or down. We refer to this type as “scaled anomaly”.
Fig. 1. Examples of cycles from four sensors in a manufacturing line: (a) an air pressure cycle, (b) a water flow cycle, (c) a motor position cycle, (d) a motor torque cycle
– A part of the cycle is shifted up or down. We refer to this type as “shifted anomaly”.
– A part of the cycle is noisier. We refer to this type as “denoised anomaly”.
We show examples of a sample signal, simulated from a combination of sine and cosine functions, and the four types of anomalies in Fig. 2. As each cycle can be considered a time series or a signal, our task is equivalent to detecting anomalies in a dataset in which each instance is a time series or signal. As mentioned, since labeling actual anomalies of each type is costly, we adopt a simulation approach instead. In short, we randomly select a small number of cycles from a set of regular ones and convert them to anomalies for use in our experimental study. The simulation method and parameters are discussed in more detail in Sect. 5.
Fig. 2. Examples of a signal and the four types of anomalies: (a) the original signal, (b) flattened anomaly, (c) scaled anomaly, (d) shifted anomaly, (e) denoised anomaly
4 One-Class Self-Attention Architecture
In this section, we describe the models that we develop for the problem of anomaly detection in manufacturing lines. In short, our architectures consist of the following components:
– Embedding component: takes raw data as input and transforms it into higher-level representations.
– Self-Attention component: generates context vectors for the embedding output.
– Output component: the decision-making layer.
4.1 Embedding Component
The more common architectures for signals are probably RNN types such as the vanilla RNN, GRU, and LSTM. These architectures, however, suffer from two issues when applied to our problem. First, RNN-type DNNs unfold the computational graph through time, which in effect simulates a deeper network as the number of time steps in the input sequence grows. Since we are working with long sequential data, i.e. approximately 1000 time steps, training recurrent networks becomes highly computationally expensive. Second, recurrent architectures model the hidden state at the current time point as a function of the current input and past hidden states. This means the hidden state at a time point does not contain information about its future. As we do not consider causality within a cycle, we want to exploit the correlations of a time point with both its past and its future, which makes plain recurrent networks insufficient. While this issue can be solved with a bi-directional RNN architecture, e.g. a bi-directional LSTM, doing so further increases the already high complexity of these models. Our initial experiments show that training RNN-type models on our data is overly slow; accordingly, we decide not to use them. Instead, we choose a computationally cheaper architecture, the 1D-CNN. In general, a convolutional block consists of a convolutional layer and a sub-sampling layer. The convolutional layer uses a set of kernels (filters) that slide along the time dimension of the input signals and capture both past and future correlations of a time step. As a result, choosing a good filter size, i.e. how many neighboring time steps to consider at the current time point, is important. In our problem, the signals are relatively stable within small windows; therefore, we choose larger kernels, e.g. 32 to 128, than other applications, which typically use sizes from 2 to 32.
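To make the "past and future" property of a convolutional kernel concrete, the following minimal sketch slides a single kernel along a 1D signal. It is an illustration only: the weights are fixed here, whereas a CNN learns them, and bias terms, non-linearities, and multiple channels are omitted. The function name and the toy inputs are assumptions.

```python
def conv1d_same(signal, kernel):
    """Slide `kernel` along `signal` with zero padding ("same" output length).

    The kernel is centred on each time step, so every output value mixes
    neighbours from both the past and the future of that step.
    """
    width = len(kernel)
    left = (width - 1) // 2
    right = width - 1 - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


# A single spike at step 5 spreads to steps 4, 5, and 6 under a width-3 kernel.
spike = [0.0] * 11
spike[5] = 1.0
response = conv1d_same(spike, [1.0, 1.0, 1.0])

# A cycle-length input with a wide kernel, in the size range quoted above.
cycle = [float(i % 50) for i in range(950)]
smoothed = conv1d_same(cycle, [1.0 / 32] * 32)
```

The spike example shows why no bi-directional machinery is needed: the output at step 4 already responds to an event at step 5. A wider kernel simply enlarges that neighbourhood.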
The outputs of the convolutional layers are then sub-sampled to reduce their dimensionality while emphasizing any patterns they may contain. The common sub-sampling methods are max-pooling and average-pooling. However, we believe that neither method alone can effectively capture all the anomalous patterns in 1D sequences, since anomalies can have parts that are higher or lower, or flatter or more fluctuating, than regular ones while not exceeding the normal ranges. We show two examples of this issue in Fig. 3. In Fig. 3a, with a window of four, the two sequences are the same after max-pooling. Similarly, the two sequences in Fig. 3b are very similar after average-pooling regardless of window size. Consequently, we integrate min-pooling, average-pooling, and max-pooling in a single sub-sampling layer. The outputs of the pooling methods are concatenated and fed into the next layer. To reduce the complexity of the other components in the model, we aim for the final output of the embedding component to have a low dimensionality in both the number of time steps and the size of each time step. Let the input cycle be $x = \{x_0, x_1, \ldots, x_{T_X}\}$; our goal is to have the output of the embedding component, $U = \{U_0, U_1, \ldots, U_{T_U}\}$, be a sequence of much shorter length than the original sequence, i.e. $T_U$ [...] $0$, $t + k < T$, and $k \geq 0.2 \times T$. For the three anomaly types scaled, shifted, and denoised, we further generate a smooth factor $\alpha = \{\alpha_0, \alpha_1, \ldots, \alpha_k\}$ from a sine function to simulate an impact on the signal that gradually increases to a peak and then gradually disappears. We illustrate these effects on a flat signal (constant value of 1) in Fig. 5. In both examples, the curved segments of the signals are anomalies that are gradually introduced and then fade.
Fig. 5. Examples of smoothly transitional anomalies: (a) increasing patterns, (b) decreasing patterns
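The smooth factor, and the way it enters the anomaly formulas (5)–(8) listed next, can be sketched as follows. This is a hedged illustration rather than the authors' code: the half-period sine window is an assumption (the text only says "a sine function"), the mapping of $\alpha_0, \ldots, \alpha_k$ onto positions $t, \ldots, t+k$ is assumed, Eq. (6) is implemented exactly as printed, and all function names are hypothetical.

```python
import math
import random


def smooth_factor(k):
    """Half-period sine window alpha_0..alpha_k: 0 at both ends, 1 in the middle."""
    return [math.sin(math.pi * j / k) for j in range(k + 1)]


def flattened(x, t, k):
    """Eq. (5): hold the value x_t over the whole segment [t, t + k]."""
    return [x[t] if t <= i <= t + k else v for i, v in enumerate(x)]


def scaled(x, t, k, gamma):
    """Eq. (6) as printed: x_i <- gamma * x_i * (alpha_i + 1)."""
    alpha = smooth_factor(k)
    return [gamma * v * (alpha[i - t] + 1) if t <= i <= t + k else v
            for i, v in enumerate(x)]


def shifted(x, t, k, mu):
    """Eq. (7): x_i <- x_i + mu * alpha_i."""
    alpha = smooth_factor(k)
    return [v + mu * alpha[i - t] if t <= i <= t + k else v
            for i, v in enumerate(x)]


def noisy(x, t, k, sigma_c, rng):
    """Eq. (8), the paper's "denoised" type: x_i <- x_i + eps_i * alpha_i."""
    alpha = smooth_factor(k)
    return [v + rng.gauss(0.0, sigma_c) * alpha[i - t] if t <= i <= t + k else v
            for i, v in enumerate(x)]


# A flat signal of ones, in the spirit of Fig. 5 (segment position is illustrative).
flat = [1.0] * 11
bump_up = shifted(flat, t=2, k=6, mu=0.5)     # rises smoothly to 1.5 at step 5
bump_down = shifted(flat, t=2, k=6, mu=-0.5)  # dips smoothly to 0.5 at step 5
jitter = noisy(flat, t=2, k=6, sigma_c=0.1, rng=random.Random(0))
```

Because the window is zero at both ends of the segment, the shifted and noisy variants join the untouched part of the cycle without a visible step.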
Then, each type of anomaly is simulated as follows:
– Flattened anomalies:
$$x_i = x_t \quad \forall i \in [t, t+k] \tag{5}$$
– Scaled anomalies: randomize a scaling factor $\gamma$ with $1.5 \leq \gamma \leq 3$ or $0 \leq \gamma \leq 0.66$, then set
$$x_i = \gamma \times x_i \times (\alpha_i + 1) \quad \forall i \in [t, t+k] \tag{6}$$
– Shifted anomalies: randomize a shifting factor $\mu$ with $\sigma_c/2 \leq \mu \leq 3\sigma_c$ or $-3\sigma_c \leq \mu \leq -\sigma_c/2$, then set
$$x_i = x_i + \mu \times \alpha_i \quad \forall i \in [t, t+k] \tag{7}$$
where $\sigma_c$ is the standard deviation of the cycle $c$.
– Denoised anomalies:
$$x_i = x_i + \epsilon_i \times \alpha_i \quad \forall i \in [t, t+k] \tag{8}$$
where $\epsilon_i$ is sampled from a normal distribution with mean 0 and standard deviation $\sigma_c$.
5.2 Modeling
All datasets are split into 80% training and 20% testing. The training set is further split into 80% training and 20% validation for fine-tuning the models. We then generate anomalies in each set with β = 0.02. The models that we test include One-Class Support Vector Machine (OCSVM), Isolation Forest (IF), Deep Auto-Encoder (DAE), OCSVM + DAE, Deep One-Class Classifier (DOCC), One-Class Self-Attention with center (OCSAv1), and One-Class Self-Attention without center (OCSAv2). The traditional machine learning models that we use and their settings are as follows:
– One-Class Support Vector Machine (OCSVM) with a Radial Basis Function (RBF) kernel. We fine-tune the hyper-parameters γ and ν over the value sets {0.001, 0.01, 0.1, 1, 10, 100} and {0.01, 0.05, 0.1, 0.15}, respectively.
– Isolation Forest (IF). We fine-tune the number of trees in {25, 50, 100, 150, 200}.
All deep learning models use the same embedding/encoding architecture. We explore a wide range of embedding/encoding architectures for one that balances performance and training time, and select a four-layer 1D-CNN architecture whose (number of kernels, kernel size) pairs per layer are (32, 128), (16, 64), (8, 32), (4, 32). The AE models use a decoder architecture of (4, 32), (8, 32), (16, 64), (32, 128), i.e. the reverse of the encoder architecture. In the OCSVM + AE model, we apply OCSVM to the encoded vectors output by the trained DAE; γ and ν are fine-tuned as for the standalone OCSVM. DOCC, OCSAv1, and OCSAv2 use an output layer of size 64. We train our models with an initial learning rate of 0.1, which decays by a factor of 10 when the training cost fluctuates within ten consecutive epochs. We also use an early-stopping condition: training stops when the validation loss does not improve by more than 1% over ten consecutive epochs. The experiment is repeated with ten different seeds for each combination of dataset and anomaly type. We report the average F1 scores of the models in Table 1.
Table 1. Average F1 scores of models in testing data
| Data | Anomaly type | OCSVM | IF | DAE | AE+OCSVM | DOCC | OCSAv1 | OCSAv2 |
|---|---|---|---|---|---|---|---|---|
| Air | Flattened | 0.309 | 0.302 | 0.392 | 0.462 | 0.632 | 0.914 | 0.934 |
| Air | Shifted | 0.602 | 0.617 | 0.51 | 0.682 | 0.490 | 0.740 | 0.680 |
| Air | Scaled | 0.354 | 0.334 | 0.270 | 0.403 | 0.510 | 0.730 | 0.690 |
| Air | Denoised | 0.623 | 0.389 | 0.378 | 0.416 | 0.197 | 0.539 | 0.398 |
| Water | Flattened | 0.400 | 0.548 | 0.582 | 0.290 | 0.443 | 0.462 | 0.564 |
| Water | Shifted | 0.555 | 0.429 | 0.520 | 0.369 | 0.410 | 0.370 | 0.340 |
| Water | Scaled | 0.477 | 0.330 | 0.440 | 0.377 | 0.400 | 0.410 | 0.410 |
| Water | Denoised | 0.520 | 0.193 | 0.150 | 0.160 | 0.140 | 0.280 | 0.200 |
| Position | Flattened | 0.395 | 0.563 | 0.344 | 0.480 | 0.646 | 0.787 | 0.797 |
| Position | Shifted | 0.487 | 0.728 | 0.970 | 0.539 | 0.510 | 0.720 | 0.810 |
| Position | Scaled | 0.232 | 0.195 | 0.342 | 0.362 | 0.310 | 0.521 | 0.486 |
| Position | Denoised | 0.436 | 0.592 | 0.918 | 0.322 | 0.203 | 0.474 | 0.555 |
| Torque | Flattened | 0.439 | 0.710 | 0.666 | 0.435 | 0.706 | 0.736 | 0.777 |
| Torque | Shifted | 0.503 | 0.722 | 0.613 | 0.548 | 0.562 | 0.803 | 0.813 |
| Torque | Scaled | 0.317 | 0.240 | 0.251 | 0.372 | 0.361 | 0.562 | 0.532 |
| Torque | Denoised | 0.573 | 0.539 | 0.376 | 0.451 | 0.252 | 0.659 | 0.687 |
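As a reminder of what the F1 scores in Table 1 measure, the sketch below computes F1 with anomalous cycles as the positive class. That choice of positive class is an assumption, since the text does not state it, and the labels are made up to mirror an anomaly rate of 0.02.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score for the `positive` label (here: anomalous cycles)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no anomaly found: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 100 cycles with 2 anomalies, i.e. an anomaly rate of 0.02.
y_true = [1, 1] + [0] * 98
flags_three = [1, 1, 1] + [0] * 97  # both anomalies found, one false alarm
flags_none = [0] * 100              # 98% accurate, yet useless as a detector

print(round(f1_score(y_true, flags_three), 3))  # 0.8
print(f1_score(y_true, flags_none))             # 0.0
```

With anomalies this rare, plain accuracy would rate the "flag nothing" detector at 98%, which is why a score built on precision and recall is the more informative choice.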
5.3 Discussion
As shown in Table 1, our two models achieve the highest or second-highest F1 scores in a majority of the tests. The results further show that the self-attention mechanism significantly boosts the performance of the one-class classifier: OCSAv1 and OCSAv2 outperform DOCC in all but one experiment (shifted anomalies in the Water data). We also observe that traditional models like OCSVM and DAE still outperform the other models in several tests. This shows that they are still useful, specifically on smaller manufacturing sensor datasets. As the data size scales up, however, these two models suffer from a complexity problem. OCSVM, with a complexity of $O(n^3)$ in the data size, has two hyper-parameters that must be fine-tuned carefully, and DAE is typically the most complex deep model in these experiments (as the decoder architecture is often the reversed encoder architecture). During the experiments, we find that fine-tuning OCSVM may take as long as training a DAE, which is much slower than training DOCC, OCSAv1, and OCSAv2. A DAE may require up to 1000 epochs to reach convergence, whereas the other three deep models converge in under 50 epochs, not to mention their shorter epoch time. As the production version of the models may be trained on a very large amount of data, training OCSVM would become infeasible, while training DAE would be highly computationally expensive. Finally, we see that the performance of the models depends more on the characteristics of the data than on the type of anomalies in the data. More specifically, OCSAv1 and OCSAv2 perform well on the Air and Torque data, whereas OCSVM performs best on the Water data, and DAE on the Position data. Overall, we conclude that our proposed models achieve consistently good performance throughout the experiments in terms of both accuracy and efficiency. Therefore, they can be the universal “safe choice” for anomaly detection tasks in manufacturing data.
6 Conclusion
In this paper, we present our study on anomaly detection in manufacturing data. More specifically, we provide a detailed comparison of the performance of anomaly detection models, including our proposed designs, on data collected from different types of industrial sensors and with different types of anomalies. In more detail, the data is collected from four sensors that measure air pressure, water flow, motor position, and motor torque in a working manufacturing line. The system works cyclically, and our task is to identify working cycles with abnormal behaviors. We adopt a simulation approach in which four types of anomalies are artificially injected into the data: cycles that have a part 1) flattened, 2) scaled from regular ranges, 3) shifted from regular locations, and 4) noisier than regular. We test a mixture of traditional and state-of-the-art machine learning and deep learning models to study their performance in each situation. Additionally, we empirically design our own deep architectures for the task that integrate the one-class learning mechanism with the self-attention
mechanism. The designs aim to combine the detection performance of one-class models with the representation capability of the self-attention mechanism on sequential data. Our experimental study shows that the models' performance seems to depend more on the type of data than on the type of anomalies in the data. We also argue that while traditional models like One-Class Support Vector Machine and Deep Auto-Encoder may still be useful in this experiment, they become prohibitively expensive when implemented and trained on large-scale data. On the other hand, our designs consistently achieve high or competitive performance with reasonable time complexity, and thus can be used as a safe solution in a majority of cases. One limitation of our study lies in the simulation of anomalies. While we try to cover common patterns of abnormalities, the simulated data may not capture all faulty patterns that can occur during manufacturing processes. We will keep collaborating with our partner to obtain labeled data and extend our research. Finally, we will work on further improving our model designs and on expanding the experiments to more types of sensor data. Additionally, we plan to model each cycle as a collection of sensor data instead of analyzing the sensors individually.
References
1. Abadi, M., et al.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015). Software available from tensorflow.org
2. Amarbayasgalan, T., Jargalsaikhan, B., Ryu, K.H.: Unsupervised novelty detection using deep autoencoders with density based clustering. Appl. Sci. 8(9), 1468 (2018)
3. Amer, M., Goldstein, M., Abdennadher, S.: Enhancing one-class support vector machines for unsupervised anomaly detection. In: Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, pp. 8–15 (2013)
4. Andrews, J.T.A., Morton, E.J., Griffin, L.D.: Detecting anomalous data using autoencoders. Int. J. Mach. Learn. Comput. 6(1), 21 (2016)
5. Blázquez-García, A., Conde, A., Mori, U., Lozano, J.A.: A review on outlier/anomaly detection in time series data. arXiv preprint arXiv:2002.04236 (2020)
6. Brown, A., Tuor, A., Hutchinson, B., Nichols, N.: Recurrent neural network attention mechanisms for interpretable system log anomaly detection. In: Proceedings of the First Workshop on Machine Learning for Computing Systems, pp. 1–8 (2018)
7. Chen, J., Sathe, S., Aggarwal, C., Turaga, D.: Outlier detection with autoencoder ensembles. In: Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 90–98. SIAM (2017)
8. Chung, J., Gulcehre, C., Cho, K.H., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
9. Ezeme, M.O., Mahmoud, Q.H., Azim, A.: Hierarchical attention-based anomaly detection model for embedded operating systems. In: 2018 IEEE 24th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), pp. 225–231. IEEE (2018)
10. Funahashi, K., Nakamura, Y.: Approximation of dynamical systems by continuous time recurrent neural networks. Neural Netw. 6(6), 801–806 (1993)
11. Harris, C.R., et al.: Array programming with NumPy. Nature 585(7825), 357–362 (2020)
12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
13. Hunter, J.D.: Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9(3), 90–95 (2007)
14. Khorramshahi, P., Peri, N., Kumar, A., Shah, A., Chellappa, R.: Attention driven vehicle re-identification and unsupervised anomaly detection for traffic understanding. In: CVPR Workshops, pp. 239–246 (2019)
15. LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. Handb. Brain Theory Neural Netw. 3361(10), 1995 (1995)
16. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
17. Lin, P., Ye, K., Xu, C.-Z.: Dynamic network anomaly detection system by using deep learning techniques. In: International Conference on Cloud Computing, pp. 161–176. Springer (2019)
18. Liu, F.T., Ting, K.M., Zhou, Z.-H.: Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. IEEE (2008)
19. Malhotra, P., Vig, L., Shroff, G., Agarwal, P.: Long short term memory networks for anomaly detection in time series. In: Proceedings, vol. 89, pp. 89–94. Presses universitaires de Louvain (2015)
20. Medsker, L.R., Jain, L.C.: Recurrent neural networks. Des. Appl. 5 (2001)
21. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
22. Pouyanfar, S., et al.: A survey on deep learning: algorithms, techniques, and applications. ACM Comput. Surv. (CSUR) (2018)
23. Prajapati, A., Bechtel, J., Ganesan, S.: Condition based maintenance: a survey. J. Qual. Maintenance Eng. (2012)
24. Ruff, L., et al.: Deep one-class classification. In: International Conference on Machine Learning, pp. 4393–4402 (2018)
25. Sakurada, M., Yairi, T.: Anomaly detection using autoencoders with nonlinear dimensionality reduction. In: Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, pp. 4–11 (2014)
26. Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U.: f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Med. Image Anal. 54, 30–44 (2019)
27. Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C.: Estimating the support of a high-dimensional distribution. Neural Comput. 13(7), 1443–1471 (2001)
28. Selcuk, S.: Predictive maintenance, its implementation and latest trends. Proc. Inst. Mech. Eng. Part B: J. Eng. Manufact. 231(9), 1670–1679 (2017)
29. Serradilla, O., Zugasti, E., Zurutuza, U.: Deep learning models for predictive maintenance: a survey, comparison, challenges and prospect. arXiv preprint arXiv:2010.03207 (2020)
30. Tax, D.M.J., Duin, R.P.W.: Support vector data description. Mach. Learn. 54(1), 45–66 (2004)
31. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
32. Xu, D., Ricci, E., Yan, Y., Song, J., Sebe, N.: Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553 (2015)
33. Zheng, Y., Liu, Q., Chen, E., Ge, Y., Zhao, J.L.: Time series classification using multi-channels deep convolutional neural networks. In: International Conference on Web-Age Information Management, pp. 298–310. Springer (2014)
34. Zhou, J.T., Zhang, L., Fang, Z., Du, J., Peng, X., Yang, X.: Attention-driven loss for anomaly detection in video surveillance. IEEE Trans. Circ. Syst. Video Technol. (2019)
35. Zhu, M., Ye, K., Wang, Y., Xu, C.-Z.: A deep learning approach for network anomaly detection based on AMF-LSTM. In: IFIP International Conference on Network and Parallel Computing, pp. 137–141. Springer (2018)
Customer Churn Prediction and Promotion Models in the Telecom Sector: A Case Study
Ulku F. Gursoy1,2, Enes M. Yildiz2, M. Ergun Okay2(B), and Mehmet S. Aktas1
1 Yildiz Technical University, Istanbul, Turkey
2 Intellica Business Intelligence Consultancy, Istanbul, Turkey
{ulku.gursoy,enes.yildiz,ergun.okay}@intellica.net
Abstract. Predicting customer churn behavior and creating customer retention models are very important research topics for telecommunications companies. Within the scope of this research, we propose a data analysis business process software platform architecture that provides solutions to these problems. In this context, we also recommend attributes that can be used to predict customer churn in the telecom industry. A prototype of the proposed business process software platform architecture has been developed. With this prototype, the performance of customer churn behavior prediction and promotion model prediction was examined based on accuracy metrics. The results show that the proposed business process software platform is usable.
Keywords: Churn prediction · Data mining · Association rule mining · Sequential rule mining · XGBoost algorithm
1 Introduction
In the modern era, telecommunication has become one of the key sectors for information dissemination all over the world [1]. The growth of digital services is inevitable, and telecom operators (Communication Service Providers, CSPs) are among the most important participants in providing connectivity to customers in this era. The number of CSP subscribers and total revenue increased significantly in the 2007–2017 period [2]. The greatest concern of telecommunications companies is that their customers switch to a competitor. This situation is called “churn” [3]. Given that the telecommunications industry has an average annual loss rate of 30 to 35% and that acquiring a new customer costs 5 to 10 times more than retaining an existing one, customer retention becomes more important than customer acquisition. To retain existing customers, organizations must improve customer service and product quality, and anticipate which customers are likely to leave.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 276–286, 2022. https://doi.org/10.1007/978-3-030-82196-8_21
Telecom companies are trying to find ways to identify customers who are likely to churn. With a churn forecast, such customers can be identified and appropriate marketing strategies can be applied to keep current subscribers [4]. In such an industry, it is important to be able to detect customer churn behavior using big data platforms and to take the most appropriate actions for these customers based on promotion models. Therefore, there is a need for a data analysis platform to analyze churn behavior and create promotion models. Feature selection is very important for achieving the highest accuracy in model prediction. Its main purpose is to discover the most important, decisive features for the classifiers used for churn behavior prediction [5]. Too many features can lead to overfitting, with the model memorizing situations that occur in the training data set. This causes poor results in experimental studies carried out on the test data set. In this respect, it is also important to reduce the number of features used. Creating models from data sets with too many features is also computationally very demanding. Therefore, many methods are used for feature set reduction. The goal here is to maximize relevance and minimize redundancy. Feature selection aims to find a subset of only relevant features [6]. Association rule mining aims to find repeated transactions and related actions in recorded transaction data. Support and confidence are two important statistical metrics for association rule mining. A rule is considered important if it meets the minimum support threshold and the minimum confidence threshold [7]. In line with marketing strategies, association rule mining algorithms are used in promotion models to analyze which campaigns should be offered to customers who show churn behavior.
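The two rule metrics mentioned above, support and confidence, can be illustrated with a few lines of code. This is a minimal sketch on made-up data: the product names, baskets, and threshold values are hypothetical and do not come from the study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical product baskets, one per customer.
baskets = [
    {"data_5gb", "sms_pack"},
    {"data_5gb", "sms_pack", "roaming"},
    {"data_5gb"},
    {"voice_100"},
]

rule_support = support(baskets, {"data_5gb", "sms_pack"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"data_5gb"}, {"sms_pack"})  # 2 of 3 data buyers

# Keep the rule only if it clears both thresholds (values are illustrative).
keep = rule_support >= 0.4 and rule_confidence >= 0.6
```

Support measures how common the combination is overall, while confidence measures how often the consequent accompanies the antecedent; a rule can be strong on one and weak on the other, which is why both thresholds are applied.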
Sequential pattern mining is used to find the time order of the components of rules while detecting associations [8]. The PrefixSpan (Prefix-Projected Sequential Pattern Mining) algorithm is one of the sequential pattern mining algorithms. With this algorithm, it is possible to find the sequential actions of customers and to determine which chain of actions to follow when presenting a campaign. In this study, we develop a data analysis business process method for the growing telecom industry. We analyze the churn behavior of customers using churn prediction, which is very important for the telecom industry. We also identify products that can be recommended to customers for cross-selling by using association rule mining algorithms. We anticipate that the results obtained here can be used in marketing strategies for customers who display churn behavior. In addition, with sequential pattern mining methods, we detect frequently repeated sequential purchases in historical purchase and sales data in order to keep customers in the system and increase their satisfaction. Thus, we determine the order in which products should be recommended to the customer. We propose a data analytics business process approach to predict customer churn behavior and to recommend products. We describe the prototype of the proposed methodology in detail. To demonstrate the usability of the proposed methodology, we evaluate the performance of the prototype in terms of the accuracy metric. Our results show that the proposed methodology is successful.
The remainder of this report is organized as follows. Section 2 contains the research questions we seek to answer within the scope of this study. Section 3 deals with the literature review. Section 4 contains the methodology we propose within the scope of the project. Section 5 evaluates the prototype details of the proposed methodology and the results of its implementation. Section 6 describes the result of this research and possible future studies.
2 Research Questions
Telecom companies need to predict churn behavior in order not to lose their customers. Developing recommendations for customer churn behavior analysis with machine learning-based prediction methods has become a primary goal in the telecom field. In this context, there is a need for data analysis business processes based on machine learning methods to predict customers' churn behavior trends. The features used are very important for measuring customers' churn tendency with high performance. It is also of great importance to analyze what kind of services should be provided to keep customers who are likely to leave. To respond to these needs, we list the research questions identified in this study below:
1) Which features will we use to determine customer churn behavior?
2) What kind of data analysis business process structure should we have to detect customer churn behavior?
3) To what extent do we make successful predictions in our study using these features?
4) What campaign suggestions should we offer to keep customers who have the potential to leave?
5) Which action chain should we follow to present the campaign to the customers?
3 Literature Review
Various classification algorithms are used in customer churn prediction. Among these, Random Forest and KNN have been applied successfully to customer churn prediction in telecommunications [9]. In other studies, predictions are made using Support Vector Machine, Random Forest, XGBoost, and Logistic Regression algorithms [10, 11]. In this study, the XGBoost algorithm is used as the classifier. Different studies have observed that XGBoost performs well in churn prediction [12, 13]. FP-Growth is an algorithm used for association rule mining in market basket analysis [14]. It has also been used in cancer studies to support medical decision making and facilitate its interpretation [15]. Within the scope of this study, cross-selling analysis was performed on telecom data using the FP-Growth algorithm. By creating interesting association rules, we determined which products should be recommended to customers who tend to churn.
The PrefixSpan algorithm has been used to detect fraud related to insurance claims in the healthcare system [16]. It has also been used to analyze massive news events and to explore customer behavior features in the retail industry [17, 18]. In this study, we use the PrefixSpan algorithm to discover the package purchase behavior of telecom customers and to find sequential patterns. Some studies investigate the performance of association rule mining and pattern detection algorithms on big data processing platforms [19–22]. These studies focus on the scalability of association-based algorithms. In this study, however, we mainly focus on how association rule mining can be used to create models for cross-sell and up-sell purposes using telecom data.
4 Methodology
Within the scope of this research, a study was conducted to analyze the behaviors of customers in the telecom industry and determine the churn behaviors of newcomers. We tried to determine which campaigns should be offered to customers who tend to leave and in what order they should be offered. The operations performed in the modules of the software architecture shown in Fig. 1 are summarized as follows:
– Feature Vector Generation Module: among the features obtained from the telecom data, the features with the highest Gini score were selected.
– Data Pre-Processing Module: separate data cleaning, data integration, data transformation, data reduction, and data normalization processes were performed for each model.
– Model Building Module: this is the module where the churn estimation model is created. Using XGBoost, one of the machine learning algorithms, predictions were made on the test data set.
– Relationship-Based Association Module: relationship-based rules are generated from the data set within the scope of this module. It consists of two sub-modules.
  – Association Rule Mining Module (FP-Growth): using the FP-Growth algorithm, we tried to obtain meaningful rules from customers' product purchases. Based on these rules, campaigns were proposed to customers who showed churn behavior.
  – Sequential Rules Extraction Module (PrefixSpan): using the PrefixSpan algorithm, the packages purchased by the customers were examined in order, and we tried to determine the sequential rules of behavior. In this way, we analyzed the order in which campaigns should be presented to customers who are likely to churn.
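The idea behind the PrefixSpan sub-module can be sketched with a compact prefix-projection routine. This is a simplified variant for sequences of single items and is not the prototype's implementation: sequences whose elements are itemsets are not handled, and the package names and support threshold are hypothetical.

```python
def prefixspan(sequences, min_support, prefix=()):
    """Return (pattern, support) pairs for all frequent sequential patterns.

    Each sequence is a list of single items in time order; support counts the
    sequences that contain the pattern as a (not necessarily contiguous)
    subsequence.
    """
    counts = {}
    for seq in sequences:
        for item in set(seq):  # count each item once per sequence
            counts[item] = counts.get(item, 0) + 1

    patterns = []
    for item in sorted(counts):
        if counts[item] < min_support:
            continue
        pattern = prefix + (item,)
        patterns.append((pattern, counts[item]))
        # Projected database: what remains after the first occurrence of item.
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        patterns.extend(prefixspan(projected, min_support, pattern))
    return patterns


# Hypothetical package purchases per customer, oldest first.
purchases = [
    ["data", "sms", "roaming"],
    ["data", "roaming"],
    ["sms", "data", "roaming"],
]
frequent = prefixspan(purchases, min_support=2)
```

On this toy data the pattern ("data", "roaming") is found in all three sequences, which is the kind of ordered regularity that could inform the order in which campaigns are offered.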
Fig. 1. Churn prediction business process software architecture
4.1 Creating Feature Vectors
We contributed to this study by choosing the 296 metrics that give the best results among 600 features belonging to the telecom data. Definitions of these metrics are given below, and their categorized versions are given in Fig. 2.
– Features Containing SMS Information: features of the customers relating to the message function.
  – Features Containing Outgoing SMS Information: information on the messages sent by customers defined on the operator.
    – On-Net SMS Features: information on messages sent by customers defined on the operator to another customer defined on the same operator.
    – Off-Net SMS Features: information on messages sent by customers defined on the operator to users defined on a different operator.
  – Features Containing Incoming SMS Information: information on the messages received by customers defined on the operator.
    – On-Net SMS Features: information on messages received from another customer defined on the same operator.
– Features Containing Call Information: features containing call information of customers.
  – Features Containing Outgoing Call Information: information on calls initiated by customers defined on the operator.
    – On-Net Call Features: information on calls to another customer defined on the same operator.
    – Off-Net Call Features: information on calls from a customer defined on the operator to a customer defined on a different operator.
  – Features Containing Incoming Call Information: information on calls received by customers defined on the operator.
    – On-Net Call Features: information on calls received from another customer defined on the same operator.
– Features Containing Billing Information: information on customers' bills.
Features Containing Event Information: It consists of features that contain information on the activities performed by the customers.
Customer Churn Prediction and Promotion Models
Features Containing Duration Information: It consists of features that contain information on the talk times of the customers.
Features Containing Data Usage Information: It contains the Internet usage information of the customers.
Features Containing Value Added Service (VAS) Information: It contains information on various value-added services received by customers (voicemail, ring back tone, balance checks, top-up, SMS voting, SMS lotteries, etc.).
Features Containing Revenue Statistics Information: It refers to the income information obtained from the customers.
Features Containing Package Statistics: It contains information about the campaigns that customers preferred among those offered to them.
Features Containing Roaming Information: It includes information on the customer's ability to automatically make and receive voice calls, send and receive data, or access other services, including home data services, while traveling outside the geographic coverage of the home network.
Features Containing Demographic Data: It refers to features that contain personal information of customers.
4.2 Data Preprocessing
Data preprocessing steps were performed for all three data mining models. For the customer churn model, records in which 40% or more of the features are null were removed from the dataset. Features of character and date types that were not considered beneficial for the model were also removed. The null values in the numeric features remaining in the dataset were filled with zero. As the last step, the pairwise Pearson correlation coefficients of the features were computed; for each pair with a coefficient above 0.9, one of the two features was eliminated at random, so that features with a very strong linear relationship were cleared from the dataset. The dataset stored in the table for the FP-Growth model contains customer id, transaction date, and received product information.
For the dataset to be used in this model, the customer id and the products taken in a specified timeframe have been converted to a comma-separated format. The dataset for the PrefixSpan model has been converted to the same format as for the FP-Growth model, with the products received listed in the order in which they were received.
4.3 Feature Selection
Feature selection is applied only to the dataset used in the customer churn model. The “RFE” class of the “scikit-learn” library was used to select the 296 most useful features for the model from the 600 features. The final dataset to be used for the model was determined by selecting the 296 features with the highest Gini scores.
4.4 Training-Test Dataset Selection
The dataset for the customer churn model was split into a 70% training dataset and a 30% test dataset.
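The numeric preprocessing steps of Sect. 4.2 for the churn model (dropping records with many nulls, zero-filling, and removing one feature of each highly correlated pair) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the authors' code: the function names, the toy feature names, and the use of the absolute value of the correlation coefficient are assumptions, and a real pipeline would more likely rely on the Pandas library mentioned in Sect. 5.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)

def preprocess(rows, features, null_ratio=0.4, corr_threshold=0.9, seed=0):
    """rows: list of dicts mapping feature name -> number or None.

    Returns (cleaned_rows, kept_features).
    """
    rng = random.Random(seed)
    # 1. Remove records in which 40% or more of the features are null.
    kept_rows = [r for r in rows
                 if sum(r[f] is None for f in features) / len(features) < null_ratio]
    # 2. Fill the remaining nulls with zero.
    filled = [{f: 0.0 if r[f] is None else float(r[f]) for f in features}
              for r in kept_rows]
    # 3. For every highly correlated pair, drop one of the two at random.
    #    Assumption: the absolute value of the coefficient is compared.
    kept = list(features)
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if a in kept and b in kept:
                col_a = [r[a] for r in filled]
                col_b = [r[b] for r in filled]
                if abs(pearson(col_a, col_b)) > corr_threshold:
                    kept.remove(rng.choice([a, b]))
    return [{f: r[f] for f in kept} for r in filled], kept

# Hypothetical toy data: "sms_out_copy" is an exact multiple of "sms_out".
rows = [
    {"sms_out": 1, "sms_out_copy": 2, "call_min": 5},
    {"sms_out": 2, "sms_out_copy": 4, "call_min": 1},
    {"sms_out": 3, "sms_out_copy": 6, "call_min": 4},
    {"sms_out": None, "sms_out_copy": None, "call_min": 2},  # 2 of 3 values null
]
clean_rows, kept_features = preprocess(rows, ["sms_out", "sms_out_copy", "call_min"])
print(len(clean_rows), kept_features)
```

In the toy data the last record is removed by the null filter, and exactly one of the two perfectly correlated SMS features survives.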
U. F. Gursoy et al.
Fig. 2. Telecom feature categories selected by model
For the FP-Growth and PrefixSpan models, the dataset is divided into training and test data on a time basis. Two-sevenths of the time frame determined according to the transaction date is reserved for the test set and the remainder for the training set.
4.5 Creating the Churn Prediction Model
For the customer churn model, it was decided to use the XGBoost model, which has gained popularity in recent years. The Python version of the “XGBoost” library, which is frequently used in the literature and in commercial applications, was chosen for the development of this model. Because XGBoost has a larger number and variety of hyperparameters than other models, parameter optimization started with the “scale_pos_weight” and “max_depth” parameters, which are thought to have the greatest effect on the performance of this model. During parameter optimization, the success of the model in predicting customer churn on the test dataset was taken into account. After the values of these two parameters were determined, the other parameters were optimized to maximize performance.
4.6 Creating FP-Growth and PrefixSpan Models
The FP-Growth and PrefixSpan algorithms were developed in the Python environment. For the FP-Growth model, the 14 rules with the highest support in the training set were determined, and for the PrefixSpan model, 10 rules. To measure the success of an FP-Growth rule, each case in the test set in which one of the transactions belonging to the rule was observed was examined: the rule was deemed valid or invalid for that case depending on whether any of the other transactions of the rule was also observed. The PrefixSpan model was tested in a similar way, with the additional requirement that, for the rule to be valid, the transactions in the test set must occur in the order given by the rule. The success metric of a rule was taken as the ratio of valid cases to the total of valid and invalid cases in the test set.
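The rule success metric described above can be sketched as follows. This is one possible reading of the validation procedure, not the authors' implementation: a rule is assumed to be a pair (first product, second product), each test-set customer is assumed to be represented by a list of products in purchase order, and customers who bought nothing besides the first product are assumed not to be counted. The function name `rule_success` and the toy baskets are hypothetical.

```python
def rule_success(rule, baskets, ordered=False):
    """Count valid and invalid cases of a two-product rule on test baskets.

    rule    -- (first_product, second_product)
    baskets -- one list of products per customer, in purchase order
    ordered -- False for an FP-Growth-style rule (order ignored),
               True for a PrefixSpan-style rule (second must follow first)
    Returns (valid, invalid, success_ratio or None if no case applies).
    """
    first, second = rule
    valid = invalid = 0
    for basket in baskets:
        if first not in basket:
            continue  # the rule does not apply to this customer
        if ordered:
            # Only purchases made after the first product are considered.
            others = basket[basket.index(first) + 1:]
        else:
            others = [p for p in basket if p != first]
        if not others:
            continue  # no second product was bought at all
        if second in others:
            valid += 1
        else:
            invalid += 1
    total = valid + invalid
    return valid, invalid, (valid / total if total else None)

# Hypothetical test-set baskets.
test_baskets = [["A", "B"], ["B", "A"], ["A", "C"], ["A"], ["D", "B"]]
print(rule_success(("A", "B"), test_baskets))                # order ignored
print(rule_success(("A", "B"), test_baskets, ordered=True))  # order enforced
```

With the order ignored, the basket `["B", "A"]` counts as a valid case; with the order enforced it is skipped, because nothing was bought after product A.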
5 Prototype and Application Evaluation
Version 3.7 of the Python programming language was used in all end-to-end development processes, such as data preprocessing, model development, and testing, for the three models. The customer churn model was run in a single-node Spark 3.0.1 environment. The Spark environment is installed in a virtual machine with the Ubuntu 18.04 operating system, 6 GB of RAM, and 4 CPUs. Default settings are used in the Spark node where the XGBoost model was running. In the customer churn model, the “Pandas” and “Sklearn” libraries were used for the data preprocessing and feature selection steps. In the modeling part, the Python interface of the “XGBoost” library was used. The “Pandas” library is used in the data processing of the FP-Growth and PrefixSpan models. The customer churn prediction model was run with the XGBoost algorithm, and the results obtained are given in Table 1. The overall accuracy of the model on the test set was found to be approximately 0.84, that is, the model has an 84% success rate in churn estimation. The other metrics in Table 1 are consistent with this result: recall for the churn class (1) is 0.84, although its precision (0.64) is lower than that of the non-churn class (0.94).
Table 1. Churn prediction test set.
              Precision  Recall  F1-Score  Support
0             0.94       0.83    0.88      41921
1             0.64       0.84    0.72      14521
Accuracy                         0.84      56442
Macro Avg     0.79       0.84    0.80      56442
Weighted Avg  0.86       0.84    0.84      56442
Accuracy: 0.835512561567627
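The averaged rows of Table 1 can be cross-checked from the per-class rows. The sketch below assumes the usual definitions (macro average as the unweighted mean over classes, weighted average as the support-weighted mean, F1 as the harmonic mean of precision and recall) and uses the rounded values printed in the table, so its results agree with the reported ones only up to rounding. The helper names are illustrative.

```python
# Per-class rows of Table 1 (rounded values as printed).
table = {
    "0": {"precision": 0.94, "recall": 0.83, "f1": 0.88, "support": 41921},
    "1": {"precision": 0.64, "recall": 0.84, "f1": 0.72, "support": 14521},
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def macro_avg(rows, metric):
    """Unweighted mean of a metric over the classes."""
    return sum(r[metric] for r in rows.values()) / len(rows)

def weighted_avg(rows, metric):
    """Mean of a metric over the classes, weighted by class support."""
    total = sum(r["support"] for r in rows.values())
    return sum(r[metric] * r["support"] for r in rows.values()) / total

for metric in ("precision", "recall", "f1"):
    print(metric, macro_avg(table, metric), weighted_avg(table, metric))

# Share of churners (class 1) in the test set, which explains why the
# weighted averages lie closer to the class 0 values.
print(table["1"]["support"] / (table["0"]["support"] + table["1"]["support"]))
```

Because roughly a quarter of the test records belong to the churn class, the weighted precision (0.86) is pulled toward the non-churn value, while the macro precision (0.79) treats both classes equally.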
The rules obtained by running the FP-Growth algorithm are shown in Fig. 3. Each rule obtained here contains at least two products. The Correct Prediction field is the number of customers in the test dataset who took the second product predicted by the rule. Wrong Prediction is the number of customers in the test dataset who bought a second product, but one different from the product predicted by the rule. The Success field was obtained by dividing Correct Prediction by the sum of Correct Prediction and Wrong Prediction. According to the results we have obtained, the first rule gives the most successful prediction result, at a rate of 83%. As shown in Fig. 3, our model can generate rules which give high success rates.
Fig. 3. FP-Growth algorithm rules
The ordered rule list obtained with the PrefixSpan algorithm is given in Fig. 4. Likewise, Correct Prediction is the number of customers who, among those buying a second product, bought the product predicted by the rule. Wrong Prediction is the number of customers who bought a second product, but one different from the product predicted by the rule. Success refers to the ratio of correct predictions. The performance results given in Fig. 4 show that the first rule has the highest success rate, 92%. These results indicate that the rules found by the PrefixSpan model are successful on the test data.
Fig. 4. PrefixSpan algorithm rules
6 Conclusion and Future Work
Within the scope of this research, a data analysis business process method is proposed to create customer churn behavior prediction and promotion models using traditional telecom features. Within the scope of the promotion model, the best-selling products are determined by cross-selling analysis, which is how it is decided which campaigns should be offered to customers with churn behavior. In addition, a method is proposed in which sequential patterns are determined for the order in which products should be presented to customers. The modules in the business process software architecture proposed within the scope of the study are discussed in detail. A prototype of this software architecture has been created. Feature selection was performed among the telecom features, and the features included in the model were categorized. The prediction success of the three models that were run has been examined. The results obtained show that the proposed architecture can be used. In the next steps of this research, we plan to increase the success of customer churn behavior prediction. We aim to improve model performance by adding various features to the traditional telecom features. The rule set will be enriched by increasing the number of relational rules and by determining more complex rules. To analyze the performance of the model, the results will be compared with those of different algorithms.
Acknowledgment. We thank Intellica Business Intelligence Consultancy for providing us the telecom data set and for their continuous support in this case study. This study is supported by TUBITAK TEYDEB under the project ID 3170866.
Learning Incorrect Verdict Patterns of the Established Face Recognizing CNN Models Using Meta-Learning Supervisor ANN
Stanislav Selitskiy, Nikolaos Christou, and Natalya Selitskaya
School of Computer Science, University of Bedfordshire, LU1 3JU Luton, UK
https://www.beds.ac.uk/computing
Abstract. We explore the performance of the established state-of-the-art Convolutional Neural Network (CNN) models on a novel face recognition data set with artistic facial makeup and occlusions. The strengths and weaknesses of different CNN architectures are probed on particular types of makeup and occlusions of the benchmark data. Apart from the practical value of knowing the effectiveness of face camouflaging techniques, such a data set magnifies the reliability and robustness problem of the established CNN models in real-life settings. A flexible and lightweight approach to isolating the uncertainty in the trustworthiness of the CNN models' verdicts is investigated, aiming to increase the trusted recognition accuracy. A separate supervising Artificial Neural Network (ANN) is attached to the established CNNs and is trained to learn patterns of the erroneous classifications of the underlying CNN models.
Keywords: Face recognition · Makeup · Occlusion · Spoofing · Meta-learning · Paper with code
1 Introduction
Artificial Neural Networks (ANNs), especially those of advanced deep and parallel architectures, are a proven Machine Learning (ML) tool for effective learning of complex local and global patterns. ANNs are successfully applied to practical problems in various domains. In biometrics, state-of-the-art (SOTA) Convolutional Neural Network (CNN) models passed the milestone of human-level face recognition accuracy a number of years ago, provided that the training sets are representative of the test images given to the algorithms. However, if the test set does not have the same feature representation and distribution, for example, due to different lighting conditions, facial expressions, head rotations, or visual obstacles of various kinds, the face recognition performance of the ML models drops significantly. In particular, here we research the disruptive and spoofing effects on face recognition of artistic makeup of various degrees of lightness or heaviness and realistic touches of absurdity, face masks, wigs, and eyeglasses. BookClub, a data set of face disguise techniques that is novel for face recognition research under visual occlusions, is introduced; it covers apparent gaps in the established publicly available benchmark data sets.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 287–304, 2022. https://doi.org/10.1007/978-3-030-82196-8_22
Research on the influence of makeup and occlusions on face recognition ML algorithms, as well as the collection of benchmark data sets, has been conducted since the early development of face recognition algorithms. The AR data set was collected in the late 1990s specifically with the face recognition task in mind, under conditions aggravated by variability in lighting, facial expression, head rotation, and occlusions. Unlike many contemporary data sets scraped from the Internet, the AR data set's acquisition was meticulously designed for controlled laboratory conditions, featuring 4000 face images of 126 subjects (70 men and 56 women) [28]. The data set still has relevance in contemporary research; unfortunately, in our experience, it is not currently maintained. Still, it continued to be used in research in the 2000s, for example, on the influence of occlusions on face recognition [29]. Before CNN architectures made ANNs, especially Deep Learning (DL) architectures, feasible on commonly available hardware, algorithms based on engineered visual features were popular and achieved results of decent accuracy. The study mentioned above used Local Binary Patterns (LBP) for face detection and Principal Component Analysis (PCA) for global features and dimensionality reduction. Gabor wavelets were used for feature extraction and a Support Vector Machine (SVM) for clustering and classification. The approach to dealing with occlusions proposed by the authors concentrated primarily on detecting non-face-like features and excluding such areas of the image from the face recognition process. In the case of makeup, the authors suggested focusing on makeup-free regions such as the eyes. Algorithms based on such techniques were reported to significantly boost face recognition performance.
Another engineered-features study, on the impact of everyday makeup of various intensity, used LBP as well [10]. Unfortunately, a proprietary data set was used for the research. The study confirms that even low-intensity makeup introduced into the test set is capable of disrupting facial recognition. Similarly to [40], it was noted that enrichment of the training set with makeup examples might have a dual effect, leading to either improvement or worsening of the face recognition performance, depending on the image distribution of the test and training data sets and the content of the infused occluded images. In both studies, it was noticed that makeup of intermediate complexity infused into the training data set proportionally to the image variety distribution of the test set, even in small amounts, increased face recognition accuracy. The CV Dazzle project has similarities to the presented research in terms of emphasizing artistic-grade makeup and hairstyle modifications as a means of disrupting face recognition algorithms. The idea behind the ‘artistic’ or ‘theatrical’ approach to facial camouflage is to pretend that the extravagant changes to the person's appearance do not pursue the goal of preventing the subject's recognition. The project aims at generating individual recommendations for artistic makeup that would ‘inconspicuously’ disrupt face recognition. The workflow of the appearance change recommendation is based on detecting the most prominent facial landmarks and then selecting makeup and hairstyle features from a template library that would efficiently masquerade those easily recognizable landmarks. Engineered-features algorithms such as FisherFace, EigenFaces, Gabor wavelets, LBP, and Elastic Bunch Graph Matching (EBGM) were used to extract visual features. A proprietary collection of celebrity images scraped from the Internet was used for computational experiments to identify characteristic landmark features and camouflage them with digitally applied hairstyle and makeup templates. Commercial face recognition systems such as PicTriev, Google+, and Baidu were used to test the effectiveness of the counter-recognition measures. It was reported that these countermeasures demonstrated high success rates [11]. Experiments with ANN models have also shown that such architectures are vulnerable to the addition of makeup and occlusions to test sets, and their accuracy drops significantly in such cases. Targeted spoofing of another person by means of makeup was even noted. In [14], two CNN architectures were experimented with: VGG-Face and an unspecified commercial face recognition system. Computational experiments were conducted on the Labeled Faces in the Wild (LFW) data set that features spoofed images and on Makeup Induced Face Spoofing (MIFS), containing non-makeup and makeup images. Both CNN models displayed a drop in performance in the presence of makeup that was not sampled in the training images. In study [7], it was observed that the ability to spoof a particular person is highly dependent on the target person's face and the patient person's image. Some combinations of target and patient pairs were easy to spoof using makeup, and some were difficult.
It was suggested that when makeup is detected, face recognition has to switch into multi-modal functioning, for example, by exclusively using or prioritizing thermal image recognition or iris recognition. The Internet is a rich, valuable, and easy-to-access source of visual data that can be used for building experimental data sets for face recognition in general or, in particular, in the presence of occlusions and makeup. Data sets scraped from the Internet can be huge, significantly larger than those collected in laboratory settings with the help of volunteers or paid subjects. However, these Internet-sourced data sets usually lack rigorously controlled variability, in which just one parameter is changed, or changed in the desired direction with the expected step width and number of steps. The Internet-sourced data sets frequently used in face recognition research in the presence of makeup or occlusions include MIFS, YMU, and MIW. The Makeup Induced Face Spoofing (MIFS) data set is built upon YouTube makeup lessons and features the original subjects' images and subsequent stages of the makeup application, 107 images overall [7]. Due to the nature of the original source material, the majority of the subjects are young Caucasian females. Another, larger data set built in the same manner and by the same team from makeup workshops, the YouTube Makeup Database (YMU), also features a variety of expressions, head rotations, and resolutions. It features 151 Caucasian females and consists of 600 images taken before and after the makeup addition. Yet another extension data set from the same researchers, the Additional Makeup in the Wild Database (MIW), features 125 subjects and 154 images of the original faces and of the faces after makeup application [8]. One more data set created by the same team, the Virtual Makeup (VMU) database, uses another approach, similar to that of the CV Dazzle project: digital application of the makeup. Images of the Face Recognition Grand Challenge (FRGC) database are digitally remastered to add makeup to 51 Caucasian female subjects. The formerly publicly open CyberExtruder data set is sourced from the Internet and features a significantly larger set of subjects and images: 1000 and 10205, respectively. Unfortunately for the scientific community, public access to the data set was ended. The data set features a large variety of occlusions and makeup, facial expressions and lighting, and a wide race and age spectrum [1]. Another large and publicly available data set, Disguised Faces in the Wild (DFW), is sourced from the Internet as well. Similarly to CyberExtruder, it features a variety of occlusions such as makeup, glasses, moustaches, beards, hairstyles, head dressings, veils, ball masks, and masquerades, for 11157 images and 1000 subjects. For each subject, a few non-disguised and a few occluded or impersonator images are collected. The overall collection of images varies in clothing, posing, age, gender, race, expression, lighting, and background [23]. Controlled-environment data set collection continues even in the age of Internet image sourcing. The Database of Partially Occluded 3D Faces (UMB-DB) can be considered a partial replacement for the AR data set. It features 1473 2D and 3D images of 143 subjects in the 19–50 age range, with three emotion expressions and at least nine images per session [9].
The laboratory-style controlled-environment image acquisition used to collect the proposed BookClub data set, which allows one parameter to be varied while the others are fixed, is not its only advantage. A high number of images per session and of subjects, and multiple non-makeup and makeup sessions per subject, make the data set suitable for more advanced partitions than just training and test sets [42]. Research in adjacent areas, presentation attack detection [20] and deep fake detection [21,27,46], has parallels to real face variation detection and is an active area of research using modern ANN architectures. As most of the cited work on face recognition with makeup concentrated on the effects of makeup and occlusions on face recognition using engineered-features algorithms, here we also investigate the behaviour of modern SOTA CNN architectures applied to the problem. Observations that ML algorithms demonstrate high variance of recognition accuracy on some types of makeup and occluded face data called for an inquiry into increasing the robustness and reliability of a given algorithm, even in exchange for a decreased number of verdicts that may be trusted. Concerns about the handling of real-life uncertainties by AI systems, initially in medical and military applications [22] and later in autonomous driving and drone and robot operation [45], have attracted the attention of researchers and society as a whole [4]. Biometrics, and particularly face recognition (FR), may seem not so safety-critical. However, in legal and security applications, and as part of robot operation or even virtual reality simulations, their reliable and robust handling of real-life uncertainties is vital for saving human lives and maintaining societal trust. The task of learning, estimating, and minimizing or compartmentalizing uncertainty in ANN models naturally invokes the idea of learning about learning, a flavour of the general meta-learning concept. The concept was introduced in the 1990s [43] and has recently gained traction in different variations [44], either as an extension of transfer learning [5,12,30], as model hyper-parameter optimization [6,33], or in conjunction with explainable learning [24,26]. Particularly related to the presented work is learning using external parallel, slave, or master resources, such as memory, knowledge bases, or other ANN models [34]. In this paper, the question we would like to answer is whether it is possible to “learn to learn” in the practical sense and to “learn about learning”, at least about erroneous patterns in learning, and to detect them when they occur. An approach of a meta-learning supervisor ANN over the target CNN models has been investigated to compartmentalize the non-trusted verdicts into an “unreliable classification” class in order to increase the precision of the “reliable classification” classes. The paper is organized as follows. Section 2 introduces the ML algorithms used in experiments on the BookClub benchmark data. Section 3 provides details of the BookClub data set. Section 4 describes how the data set is used in the experiments. Section 5 presents the results of the experiments, and Sect. 6 draws conclusions from the results and states the directions in which questions are not yet answered.
2 Machine Learning Concepts
Machine Learning concepts have been efficiently used for detection of abnormal patterns [31,32] and estimation of brain development [19,38], trauma severity estimation and survival prediction [18,35,36], collision avoidance at Heathrow [37], and early detection of bone pathologies [3,16].
2.1 Artificial Neural Networks
An ANN model can be viewed as a multivariate piece-wise non-linear regression over high-dimensional data that aims to map an input-output relationship. Such a model is fitted to the given data, represented by a set of features whose dimensionality is typically smaller than the original one, to minimize a given error function. Feature selection and multiple models have efficiently increased the performance, as discussed in [17–19]. Each layer of neurons can be perceived as a transformation (regression) from a space of inputs X with one dimensionality m into another space of outputs Y with another dimensionality n:
$$f : X \subset \mathbb{R}^m \to Y \subset \mathbb{R}^n \tag{1}$$
If the transformations from one space into another performed by the neuron layers are linear, they can be represented in matrix form:
$$y = f(x) = Wx, \quad \forall x \in X \subset \mathbb{R}^m, \; \forall y \in Y \subset \mathbb{R}^n, \tag{2}$$
where $W \in \mathcal{W} \subset \mathbb{R}^n \times \mathbb{R}^m$ is the adjustable coefficient matrix. However, such a model is limited by the ability to learn only linear relations between input signals. Because a composition of linear transformations is also a linear transformation, merely adding more layers will not add representational complexity to the model. To create truly multiple hidden layers capable of learning non-linear relationships, one needs to add “activation functions”, elements of non-linearity between the layers of neurons.
2.2 Non-linearity and Activation Functions
One of the commonly used families of activation functions is the sigmoids, such as the logistic function:
$$z = g(y) = \frac{1}{1 + e^{-y}} = \frac{e^{y}}{e^{y} + 1}, \quad y = w^T x. \tag{3}$$
The output of this function can be interpreted as the conditional probability of observations over classes, which is related to a Gaussian probability function with an equal covariance [15]; this makes it a very convenient output for the final ANN layer. Rectified Linear Unit (ReLU) is another popular family of activation functions:
$$z = g(y) = y^{+} = \max(0, y). \tag{4}$$
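Equations (3) and (4) and their derivatives can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names are hypothetical, and the two branches of `logistic` are simply the two equivalent forms of Eq. (3), each used where it avoids overflow of the exponential.

```python
import math

def logistic(y):
    """Logistic sigmoid, Eq. (3), evaluated in a numerically stable way."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))   # form 1 / (1 + e^-y)
    e = math.exp(y)
    return e / (e + 1.0)                    # form e^y / (e^y + 1)

def logistic_grad(y):
    """Derivative of the logistic function: g(y) * (1 - g(y))."""
    s = logistic(y)
    return s * (1.0 - s)

def relu(y):
    """Rectified Linear Unit, Eq. (4)."""
    return max(0.0, y)

def relu_grad(y):
    """Derivative of ReLU (taken as 0 at y = 0 by convention here)."""
    return 1.0 if y > 0 else 0.0

for y in (-10.0, 0.0, 10.0):
    print(y, logistic(y), logistic_grad(y), relu(y), relu_grad(y))
```

For large positive or negative inputs the derivative of the logistic function becomes very small, whereas the derivative of ReLU stays at 1 for any positive input.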
ReLU functions address the problem of vanishing gradients of sigmoid activation functions in ANNs [2].
2.3 Learning Algorithms and Back-Propagation
The way to fit an ANN model to a real-world problem is to adjust the ANN parameters, the weights $W_{ij}$ at each layer k. To find out how close the ANN transformations fall into the expected neighbourhood of the training data, a metric or distance function is needed, which in ML is usually called a cost or objective function. The most popular family of learning algorithms is Gradient Descent (GD). The simplest GD algorithm may be presented as follows:
$$W_{t+1} = W_t - \eta \nabla^{l}_{W_t}, \tag{5}$$
where t is the sequence member or iteration number, $0 < \eta < 1$ is the learning rate, and $\nabla^{l}_{W_t}$ is the gradient of the cost function $l : Z \subset \mathbb{R}^k \to L \subset \mathbb{R}$ in respect to the weight matrix W.
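The update rule of Eq. (5) can be illustrated with a minimal sketch for a single linear neuron $y = w^T x$. The setting is an assumption made for the example, not the configuration used in the experiments: the cost is taken to be half the mean squared error, the toy data are generated by $y = 2x_1 - x_2$, and the function names and the values of `eta` and `steps` are hypothetical.

```python
def gradient(w, xs, targets):
    """Gradient of the cost 0.5 * mean((w.x - target)^2) in respect to w."""
    grad = [0.0] * len(w)
    for x, t in zip(xs, targets):
        err = sum(wi * xi for wi, xi in zip(w, x)) - t
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(xs)
    return grad

def gradient_descent(xs, targets, eta=0.1, steps=1000):
    """Repeat the update W_{t+1} = W_t - eta * gradient, starting from zeros."""
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        g = gradient(w, xs, targets)
        w = [wi - eta * gi for wi, gi in zip(w, g)]  # Eq. (5)
    return w

# Toy data generated by y = 2*x1 - x2 (hypothetical).
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
targets = [2.0, -1.0, 1.0, 3.0]
weights = gradient_descent(xs, targets)
print(weights)
```

With this learning rate the iteration converges to weights close to (2, -1); a learning rate that is too large for the data would make the same loop diverge.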
Similarly, back-propagation algorithms define a cost function $l(z, \hat{z})$, where $\hat{z}$ is a training observation vector, an activation function $z = g(y)$, and a neuron layer summation function $y = Wx$. The partial derivatives of the cost function in respect to the activation function results, $\frac{\partial l}{\partial z_j}$, are readily available, where j is the index of a neuron in a layer. Using the chain rule for partial derivatives, it is easy to find out the cost function derivative in respect to the summation function for the given j-th neuron:
$$\nabla^{l}_{x} = J\left(\frac{\partial y}{\partial x}\right)^{T} \nabla^{l}_{y}, \tag{6}$$
where $J\left(\frac{\partial y}{\partial x}\right)^{T}$ is the transposed Jacobian matrix of the partial derivatives of the vector of the neuron summation function results $y_j$ in respect to the vector of inputs x, and $\nabla^{l}_{y} = \left(\frac{\partial l}{\partial y_1}, \ldots, \frac{\partial l}{\partial y_j}, \ldots, \frac{\partial l}{\partial y_k}\right)^{T}$ is the gradient of the cost function l in respect to the vector y. Similarly, one can express the cost function derivative needed for the learning algorithm, in respect to the matrix or tensor of learning parameters flattened to a vector W:
$$\nabla^{l}_{W} = J\left(\frac{\partial y}{\partial W}\right)^{T} \nabla^{l}_{y}. \tag{7}$$
2.4 Cost Functions
A natural and straightforward cost function based on the Euclidean distance, the Sum of Squared Errors (SSE), is convenient to use with linear transformations. However, if the logistic sigmoid activation function is used, SSE causes problems. The same holds for the “softmax” generalisation of the logistic activation function applied to the multi-class problem:

$$z_j = g(y_j) = \frac{e^{y_j}}{\sum_{i} e^{y_i}}, \qquad y = Wx. \tag{8}$$
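A minimal sketch of the softmax function of Eq. (8) is given below. Subtracting the maximum before exponentiation is a common numerical safeguard added here as an assumption; it does not change the result.

```python
import math

def softmax(y):
    """Softmax of Eq. (8): z_j = exp(y_j) / sum_i exp(y_i)."""
    # Shifting by max(y) leaves the ratios unchanged but avoids overflow.
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical summation outputs of a three-class final layer.
z = softmax([1.0, 2.0, 3.0])
```

For two classes with summation outputs (y, 0), the first softmax component coincides with the logistic function of y, which is the sense in which softmax generalises the logistic activation.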
The partial derivatives of the SSE cost function with respect to $y_j$, when the logistic sigmoid activation function is applied to it, are third-degree polynomials, which have three roots. A gradient $\nabla^{l}_{y}$ with several roots means that the cost function has several stationary points, which is inconvenient even for GD algorithms. To obtain partial derivatives with a single root, the ‘cross-entropy’ function, which is positive and becomes zero when $z_j = \hat{z}_j$, is a suitable cost function for logistic-type activation functions:

$$l(z) = -\left(\hat{z}_j \ln z_j + (1 - \hat{z}_j) \ln (1 - z_j)\right). \tag{9}$$

2.5 Convolutional Neural Networks
When using general-purpose DNNs for image recognition, the necessity to address problems of input’s high dimensionality, massive training sets, and weak
S. Selitskiy et al.
control over the feature selection and processing algorithms led to the development of specifically structured DNNs that drive the training process in the desired direction. One of the popular ANN architectures for image and signal recognition is the Convolutional Neural Network (CNN) [13,25]. A CNN uses local receptive fields, that is, neighbouring pixel patches that are connected to a few neurons in the next layer. Such an architecture encourages the CNN to extract locally concentrated features and can be implemented using the Hadamard product $y = (M \odot W)Kx$ of the shared weight matrix $W$ and its sliding receptive field binary mask $M \in \mathcal{M} \subset \mathbb{B}^{n} \times \mathbb{B}^{k}$, with the corresponding kernel mask $K \in \mathcal{K} \subset \mathbb{R}^{k} \times \mathbb{R}^{m}$, where $k$ is the length of the input vector combined from the receptive fields and flattened. The shared-row weight matrix $W$ can be viewed as a shift- and distortion-invariant feature extractor synthesized by the CNN, and $y$ as the generated feature map. Multiple parallel feature masks and maps ensure that multiple features are learned.
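The idea of shared weights applied to local receptive fields can be illustrated with a one-dimensional sliding-window sketch. It is a simplification for illustration only and does not reproduce the masked-matrix formulation above; the kernel and signal values are hypothetical.

```python
def conv1d_valid(x, kernel):
    """Slide one shared kernel over x; each output sees only a local patch.

    As is usual in CNN practice, this is a cross-correlation (no kernel flip).
    """
    k = len(kernel)
    return [
        sum(w * x[i + j] for j, w in enumerate(kernel))
        for i in range(len(x) - k + 1)
    ]

signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
edge_kernel = [-1.0, 1.0]  # hypothetical kernel that responds to rising edges
feature_map = conv1d_valid(signal, edge_kernel)

# Shifting the input by one position shifts the feature map by one position.
shifted_map = conv1d_valid([0.0] + signal[:-1], edge_kernel)
```

Because the same kernel weights are reused at every position, a feature is detected wherever it occurs in the input, which is the shift behaviour attributed to the shared weight matrix in the text.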
3 Data Set
The BookClub artistic makeup data set features 6182 non-makeup and non-occluded images and 11145 makeup or occluded images of 21 subjects (Fig. 1). The images were taken in 37 photo-sessions without makeup or occlusions, 40 sessions with makeup, and 17 sessions with occlusions. The controlled parameters included three exposure times, facial expressions of six ‘basic’ emotions, closed eyes and a neutral expression, and seven head orientations [42]. The data set is practically useful for training and verifying CNNs against face recognition avoidance by makeup and occlusions. In addition, when only the non-makeup photo-sessions compose the training set and the makeup and occlusion sessions are used for testing, the data set is very well suited for benchmarking uncertainty estimation under real-life conditions, in which the test data are not well represented by the training data.
4 Experiments
It was shown in [39], using the AlexNet CNN model as an example, that even state-of-the-art machine learning algorithms are prone to face recognition errors on particular types of makeup, occlusions, and spoofed personalities. To widen the enquiry, other state-of-the-art CNN models of various depths and connection structures, namely VGG19, GoogLeNet, Resnet50, Inception v.3, and InceptionResnet v.2, were also experimented with. The experiments were run on the Linux (Ubuntu 18.04) operating system with three GeForce GTX 1070 Ti GPUs (with 8 GB GDDR5 memory each), an X299 chipset motherboard, 128 GB DDR4 RAM, and an i7-7800X CPU. Experiments were run using MATLAB 2020b. To emulate real-life conditions, in which makeup or other occlusions of the subjects cannot be predicted and included in the training sets beforehand, only the non-makeup and non-occlusion sessions were selected for the training set, which amounted to roughly 6200 images.

Fig. 1. BookClub data set examples.

The high-level recognition accuracy of a photo-session class was calculated as the ratio between the number of correctly identified images ($N_{\text{class correct}}$) and the total number of images ($N_{\text{class all}}$) in the photo-session:

$$\text{Class Accuracy} = \frac{N_{\text{class correct}}}{N_{\text{class all}}} \tag{10}$$
The AlexNet model consists of 25 elementary layers and takes input images scaled to 227 × 227. The VGG19 model has 47 elementary layers, GoogLeNet has 144, and Resnet50 has 177; all three take 224 × 224 scaled images as input. Inception v.3 contains 315 elementary layers and InceptionResnet v.2 contains 824, both taking 299 × 299 scaled images as input. All the models experimented with, except AlexNet and VGG19, have a Directed Acyclic Graph (DAG) architecture. The “Adam” learning algorithm with a learning coefficient of 0.001 and a mini-batch size of 64 was used for training. Depending on the models’ convergence rate, 10, 20, or 30 epochs were used.

For the precision improvement experiments, it was suggested in [40] to use the distribution of the highest softmax activations of the incorrectly issued verdicts on the training set to find the A/B test threshold for the test set that would provide the desired confidence level. That proposition was made because the softmax distributions for wrong verdicts on the test and training sets looked visually similar. In contrast, the softmax distributions for correct verdicts differed visually from the incorrect verdict distributions, which allows discriminating between the two and calculating trusted accuracy, precision, and other metrics. However, rigorous hypothesis testing algorithms for distribution shape similarity, such as Kolmogorov-Smirnov, have not shown high confidence in the distribution similarity for all CNN models. Therefore, the posterior trusted accuracy, though increased, has not strictly satisfied the desired confidence level.

To create a more flexible model with a higher input dimensionality than just the single highest softmax activation, it is proposed to use a simple meta-learning ANN. The ANN takes all softmax activations of the supervised target CNN model and is trained on two verdicts: wrong or correct classification by the target CNN, see Listing 1.1. Complete code and detailed results are available at [41].

Listing 1.1. Meta-learning ANN model and parameters
    % Supervisor ANN: the input is the softmax vector of the target CNN,
    % the output is one of two verdicts (wrong or correct classification).
    nClasses = 21;
    nVerdicts = 2;
    nLayer1 = nClasses;
    nLayer2 = nClasses;
    nLayer3 = nClasses;

    % Three fully connected hidden layers with ReLU activations.
    sLayers = [
        featureInputLayer(nClasses)
        fullyConnectedLayer(nLayer1)
        reluLayer
        fullyConnectedLayer(nLayer2)
        reluLayer
        fullyConnectedLayer(nLayer3)
        reluLayer
        fullyConnectedLayer(nVerdicts)
        softmaxLayer
        classificationLayer
    ];

    % Training options for the supervisor ANN.
    sOptions = trainingOptions('adam', ...
        'ExecutionEnvironment', 'auto', ...
        'MiniBatchSize', 64, ...
        'InitialLearnRate', 0.01, ...
        'MaxEpochs', 300, ...
        'Verbose', true, ...
        'Plots', 'training-progress');

To create a training set for the supervisor ANN, for each subject with more than one non-makeup session, one non-makeup session was set aside and not used for training the target CNN. Images of the test set were then run through the target CNN, and the activations of the target CNN were used to obtain the trusted or non-trusted verdict from the meta-learning supervisor ANN.
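How the supervisor's training pairs could be assembled from the held-out session is sketched below in Python. This is an assumption about the labelling step, namely that a verdict is "correct" when the class with the highest softmax activation equals the true label; the authoritative implementation is the code referenced in [41]. The function name and the three-class data are hypothetical.

```python
def supervisor_training_pairs(softmax_outputs, true_labels):
    """Pair each softmax vector with a verdict label: 1 = correct, 0 = wrong."""
    pairs = []
    for activations, truth in zip(softmax_outputs, true_labels):
        # The target CNN's verdict is the class with the highest activation.
        predicted = max(range(len(activations)), key=activations.__getitem__)
        pairs.append((activations, 1 if predicted == truth else 0))
    return pairs

# Hypothetical three-class outputs of a target CNN for three held-out images.
outputs = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.3, 0.3, 0.4]]
labels = [0, 0, 2]
pairs = supervisor_training_pairs(outputs, labels)
verdicts = [verdict for _, verdict in pairs]
```

The full softmax vector, rather than only its largest component, is kept as the input so that the supervisor can learn patterns across all class activations.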
The unconstrained accuracy of the target CNN model is calculated as the ratio of the number of correctly identified test images to the number of all test images:

$$\text{Model Accuracy} = \frac{N_{correct}}{N_{all}} \tag{11}$$

The trusted accuracy is calculated as the ratio of the sum of the number of correctly identified test images with the trusted flag and the number of incorrectly identified test images with the non-trusted flag to the number of all test images:

$$\text{Accuracy}_t = \frac{N_{correct:F=T} + N_{wrong:F \neq T}}{N_{all}} \tag{12}$$

The trusted precision is calculated, similarly to the trusted accuracy, as the ratio of the number of correctly identified test images with the trusted flag to the number of all test images identified as a particular class with the trusted flag:

$$\text{Precision}_t = \frac{N_{correct:F=T}}{N_{all:F=T}} \tag{13}$$

The trusted recall is calculated as the ratio of the number of correctly identified test images with the trusted flag to the number of all correctly identified test images:

$$\text{Recall}_t = \frac{N_{correct:F=T}}{N_{correct}} \tag{14}$$

5 Results
Introducing more complex and deeper models into the experiments has helped to increase the accuracy of the hypothetical ensemble. However, a sizable number of sessions still remained misidentified by either all or the majority of the tested models, see Table 1 and Fig. 2. The majority of the realistic human faces painted over subjects’ faces were correctly identified by almost all CNN models, including the simplest ones. Heavy, high-contrast makeup, whether non-anthropomorphic and abstract or anthropomorphic but less realistic and brightly coloured, as well as wigs and dark glasses, posed the most difficulties for all CNN models (see Fig. 2 and Fig. 3).

Another set of ‘problematic’ sessions, see Table 2, exposes ‘blind spots’ of particular CNN architectures. Some CNN models demonstrate very high, almost ideal accuracy for such sessions, while others show negligibly low or effectively zero accuracy. In particular, the VGG19 model had problems with the artificial white wig that other models easily recognized. Inception v.3, which was the most accurate and reliable model overall, had a few ‘blind spots’ on simple makeup that even simpler models recognized easily. Resnet50 failed to recognize painted realistic human faces. Furthermore, GoogLeNet failed on the face masks for which other DAG models solved the recognition task.
Table 1. Image classes misidentified or identified with low accuracy by all CNN models

Session  AlexNet  VGG19   GoogLeNet  Resnet50  Inception3  InceptRes2
S1HD1    0.1325   0.6747  0.5783     0.0000    0.0000      0.0000
S1MK3    0.0000   0.6989  0.1477     0.0000    0.0000      0.0739
S7MK2    0.3046   0.0000  0.0000     0.0000    0.0402      0.1149
S7FM1    0.0000   0.0000  0.0000     0.0000    0.4568      0.7531
S10MK2   0.0000   0.1018  0.1437     0.0298    0.0120      0.7066
S10MK3   0.0000   0.0000  0.0000     0.0178    0.0000      0.0000
S10MK4   0.1258   0.0000  0.0000     0.0000    0.0000      0.0599
S14HD1   0.0368   0.0000  0.1963     0.2025    0.3436      0.0000
S20MK1   0.0000   0.7048  0.0000     0.0000    0.0060      0.0060
S21MK1   0.0000   0.0000  0.0000     0.0000    0.0000      0.0000
Fig. 2. Image class examples misidentified or identified with low accuracy by all CNN models. Panels: Subj. 7, Sess. MK2; Subj. 10, Sess. MK4; Subj. 14, Sess. HD1; Subj. 21, Sess. MK1.
Inception v.3 and InceptionResnet v.2 solved the recognition of light theatrical-type makeup, which was problematic for the other models. VGG19, while failing on many easy cases, uniquely recognized faces heavily painted over with contrasting pigments (see Fig. 3). GoogLeNet and Resnet50 were particularly successful at recognition in the presence of wigs and dark glasses, and Inception v.3 at recognition of faces with face masks (see Fig. 4).
Table 2. Image classes identified by some of the CNN models with high accuracy, and by some with low accuracy

Session  AlexNet  VGG19   GoogLeNet  Resnet50  Inception3  InceptRes2
S1HD2    0.9939   0.0061  1.0000     0.8598    1.0000      0.0000
S1MK1    0.9583   0.9941  1.0000     0.8869    0.0059      0.0000
S1MK2    0.3869   0.9524  0.8512     0.9941    0.0000      0.0000
S1MK7    0.7371   0.9600  0.5429     0.0343    0.0000      0.0000
S1MK8    0.2470   0.1988  0.9880     0.8675    0.0000      0.0000
S2HD1    0.3742   0.0000  0.0204     0.9932    0.2109      0.0000
S2GL1    0.3151   0.0479  0.9658     1.0000    1.0000      0.0000
S4GL1    0.0201   0.0000  0.1042     0.0000    0.8472      0.0069
S4MK1    0.0000   0.1043  0.0123     0.0368    1.0000      0.0000
S5MK1    0.0000   0.0538  0.0000     0.4012    1.0000      0.9761
S5MK2    0.0000   0.0114  0.0000     0.0178    0.8639      1.0000
S5MK3    0.0484   0.0364  0.2424     0.1455    0.1515      0.9939
S6MK1    0.0000   0.2036  0.0059     0.0000    0.8743      0.6407
S7MK1    0.0000   0.0000  0.5950     0.0000    0.7857      0.9762
S7FM2    0.0000   0.1065  0.0000     0.0000    0.8521      0.8994
S7FM3    0.1428   0.0179  0.0000     0.1191    0.9048      1.0000
S10MK1   0.9702   0.0000  0.6429     0.9583    1.0000      1.0000
S12MK1   0.8935   0.4464  0.0417     1.0000    0.8810      0.0000
S12MK2   0.8494   0.0000  0.6084     0.0000    0.4699      0.0000
S14GL1   0.7725   1.0000  0.9940     0.9940    1.0000      0.0000
S14MK1   0.8795   0.8253  0.7349     0.8795    1.0000      0.0000
S15MK1   0.9091   0.0000  0.0061     0.0101    1.0000      0.1515
S16MK1   0.7425   0.1258  0.9641     0.0359    0.9701      0.0000
S17MK1   0.3455   0.0484  0.8849     0.7091    1.0000      0.0000
Using the simple meta-learning supervisor ANN has helped to increase the trusted accuracy and other metrics only for the deeper CNN architectures, namely Resnet50, Inception v.3, and InceptionResnet v.2, see Table 3. For the shallower models, AlexNet, VGG19, and GoogLeNet, the supervisor ANN failed to learn the trusted verdict state.
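The trusted metrics of Eqs. (11)-(14) can be computed from per-image flags as in the following sketch. The function name, the dictionary keys, and the ten-image example are hypothetical and serve only to illustrate the definitions.

```python
def trusted_metrics(correct, trusted):
    """Compute Eqs. (11)-(14) from per-image booleans.

    correct[i]: the target CNN classified image i correctly.
    trusted[i]: the supervisor ANN flagged the verdict for image i as trusted.
    """
    n_all = len(correct)
    n_correct = sum(correct)
    n_correct_t = sum(c and t for c, t in zip(correct, trusted))
    n_wrong_nt = sum((not c) and (not t) for c, t in zip(correct, trusted))
    n_trusted = sum(trusted)
    return {
        "model_accuracy": n_correct / n_all,
        "trusted_accuracy": (n_correct_t + n_wrong_nt) / n_all,
        # Precision is undefined when no verdict is trusted.
        "trusted_precision": n_correct_t / n_trusted if n_trusted else None,
        "trusted_recall": n_correct_t / n_correct if n_correct else None,
    }

# Hypothetical flags for ten test images.
correct = [True, True, True, True, True, True, False, False, False, False]
trusted = [True, True, True, True, True, False, True, False, False, False]
m = trusted_metrics(correct, trusted)

# A supervisor that trusts nothing: every wrong verdict counts as a hit.
m_none = trusted_metrics(correct, [False] * 10)
```

When the supervisor trusts no verdict at all, the trusted recall is zero, the trusted precision is undefined, and the trusted accuracy equals one minus the model accuracy. This is the pattern reported for AlexNet, VGG19, and GoogLeNet (for example, 0.6381 = 1 − 0.3619 for VGG19).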
Fig. 3. Makeup image class examples correctly identified by only one or few CNN models. Panels: Subj. 5, Sess. MK2; Subj. 6, Sess. MK1; Subj. 7, Sess. MK1; Subj. 20, Sess. MK1; Subj. 12, Sess. MK2; Subj. 1, Sess. MK3.
Fig. 4. Occlusion class examples correctly identified by one or few CNN models. Panels: Subj. 1, Sess. HD1; Subj. 2, Sess. GL1; Subj. 4, Sess. GL1; Subj. 7, Sess. FM3.
Table 3. Unconstrained accuracy, trusted accuracy, precision, and recall for the test images, considering the ‘trusted’ flag assigned to the images by the meta-learning ANN

Metric             AlexNet  VGG19   GoogLeNet  Resnet50  Inception3  InceptRes2
Model Accuracy     0.4754   0.3619  0.4570     0.5451    0.6404      0.6369
Trusted Accuracy   0.5245   0.6381  0.5430     0.7186    0.6646      0.8519
Trusted Precision  –        –       –          0.7569    0.7176      0.8641
Trusted Recall     0.0000   0.0000  0.0000     0.7126    0.7853      0.9107

6 Conclusions and Future Work
Testing a diverse set of state-of-the-art CNN models of different complexity levels still demonstrates their shortcomings in real-life training and testing on images with makeup and occlusions. There is a subset of problematic types of photo-sessions in the benchmark data set on which all tested models failed. The tested models have demonstrated specialization in resolving difficult recognition tasks, solving such problems as dark glasses, masks, or wigs, while failing on other tasks that simpler models solve. Therefore, future work on explaining which parts of particular architectures are responsible for their strong and weak behaviour would be beneficial. Overall, the most accurate and robust model for the makeup and occlusions challenge of the BookClub data set was Inception v.3.

The presented meta-learning technique, with the supervisor ANN attached to the underlying CNN models, does work: the supervisor is capable of training on the erroneous verdicts produced by the observed CNNs and of making “trustworthiness” verdicts that help to reduce the erroneous classifications of the whole system. However, while it noticeably improves the trusted accuracy and the false positive and false negative error metrics for the underlying CNN models, it needs improvement. The dimensionality of the input, learning the uncertainty of the image recognition inside the model, and the uncertainty of the model about its parameters would be avenues to explore, for example by training the supervising ANN on a homogeneous ensemble of the underlying CNN networks.
References
1. Face Matching Data Set | Biometric Data | CyberExtruder, December 2019
2. Agarap, A.F.: Deep learning using rectified linear units (ReLU) (2018)
3. Akter, M., Jakaite, L.: Extraction of texture features from x-ray images: case of osteoarthritis detection. In: Yang, X.S., Sherratt, S., Dey, N., Joshi, A. (eds.) Third International Congress on Information and Communication Technology, pp. 143–150. Springer (2019)
4. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in AI safety (2016)
5. Andrychowicz, M.: Learning to learn by gradient descent by gradient descent (2016)
6. Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24. Curran Associates Inc. (2011)
7. Chen, C., Dantcheva, A., Swearingen, T., Ross, A.: Spoofing faces using makeup: an investigative study. In: 2017 IEEE International Conference on Identity, Security and Behavior Analysis, pp. 1–8, February 2017
8. Chen, C., Dantcheva, A., Ross, A.: Automatic facial makeup detection with application in face recognition. In: 2013 International Conference on Biometrics (ICB), pp. 1–8 (2013)
9. Colombo, A., Cusano, C., Schettini, R.: UMB-DB: a database of partially occluded 3D faces. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 2113–2119, November 2011
10. Eckert, M., Kose, N., Dugelay, J.: Facial cosmetics database and impact analysis on automatic face recognition. In: 2013 IEEE 15th International Workshop on Multimedia Signal Processing (MMSP), pp. 434–439, September 2013
11. Feng, R., Prabhakaran, B.: Facilitating fashion camouflage art. In: Proceedings of the 21st ACM International Conference on Multimedia, MM 2013, pp. 793–802, New York, NY, USA. ACM (2013)
12. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks (2017)
13. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
14. Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E., Ferencz, A., Jurie, F.: Labeled faces in the wild: a database for studying face recognition in unconstrained environments. In: ’Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France (2008)
15. Izenman, A.J.: Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer Publishing Company, Incorporated (2008)
16. Jakaite, L., Schetinin, V., Hladuvka, J., Minaev, S., Ambia, A., Krzanowski, W.: Deep learning for early detection of pathological changes in x-ray bone microstructures: case of osteoarthritis. Sci. Rep. 11 (2021)
17. Jakaite, L., Schetinin, V., Maple, C.: Bayesian assessment of newborn brain maturity from two-channel sleep electroencephalograms. Comput. Math. Meth. Med. 1–7 (2012)
18. Jakaite, L., Schetinin, V., Maple, C., Schult, J.: Bayesian decision trees for EEG assessment of newborn brain maturity. In: The 10th Annual Workshop on Computational Intelligence (2010)
19. Jakaite, L., Schetinin, V., Schult, J.: Feature extraction from electroencephalograms for Bayesian assessment of newborn brain maturity. In: 24th International Symposium on Computer-Based Medical Systems (CBMS), pp. 1–6, Bristol (2011)
20. Jia, S., Li, X., Hu, C., Guo, G., Xu, Z.: 3D face anti-spoofing with factorized bilinear coding (2020)
21. Khodabakhsh, A., Busch, C.: A generalizable deepfake detector based on neural conditional distribution modelling. In: 2020 International Conference of the Biometrics Special Interest Group (BIOSIG), pp. 1–5 (2020)
22. Kurd, Z., Kelly, T.: Establishing safety criteria for artificial neural networks. In: Palade, V., Howlett, R.J., Jain, L. (eds.) Knowledge-Based Intelligent Information and Engineering Systems, vol. 2773. Springer, Berlin, Heidelberg (2003)
23. Kushwaha, V., Singh, M., Singh, R., Vatsa, M., Ratha, N., Chellappa, R.: Disguised faces in the wild. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1–18 (2018)
24. Lake, B.M., Ullman, T.D., Tenenbaum, J.B., Gershman, S.J.: Building machines that learn and think like people (2016)
25. Lecun, Y., Bengio, Y.: Convolutional networks for images, speech, and time-series. In: The Handbook of Brain Theory and Neural Networks. MIT Press (1995)
26. Liu, X., Wang, X., Matwin, S.: Interpretable deep convolutional neural networks via meta-learning. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–9 (2018)
27. Mansourifar, H., Shi, W.: One-shot GAN generated fake face detection (2020)
28. Martinez, A., Benavente, R.: The AR face database. Technical Report 24, Computer Vision Center, Bellaterra, June 1998
29. Min, R., Hadid, A., Dugelay, J.-L.: Improving the recognition of faces occluded by facial accessories. In: FG 2011, 9th IEEE Conference on Automatic Face and Gesture Recognition, 21–25 March 2011, Santa Barbara, CA, USA (2011)
30. Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms (2018)
31. Nyah, N., Jakaite, L., Schetinin, V., Sant, P., Aggoun, A.: Evolving polynomial neural networks for detecting abnormal patterns. In: 2016 IEEE 8th International Conference on Intelligent Systems, pp. 74–80 (2016)
32. Nyah, N., Jakaite, L., Schetinin, V., Sant, P., Aggoun, A.: Learning polynomial neural networks of a near-optimal connectivity for detecting abnormal patterns in biometric data. In: 2016 SAI Computing Conference, pp. 409–413 (2016)
33. Ram, R., Müller, S., Pfreundt, F.-J., Gauger, N.R., Keuper, J.: Scalable hyperparameter optimization with lazy Gaussian processes (2020)
34. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: Meta-learning with memory-augmented neural networks. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 48, pp. 1842–1850. PMLR, New York, New York, USA, 20–22 June 2016
35. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian averaging over decision tree models: an application for estimating uncertainty in trauma severity scoring. Int. J. Med. Inform. 112, 6–14 (2018)
36. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian averaging over decision tree models for trauma severity scoring. Artif. Intell. Med. 84, 139–145 (2018)
37. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian learning of models for estimating uncertainty in alert systems: application to air traffic conflict avoidance. Integr. Comput.-Aid. Eng. 26, 1–17 (2018)
38. Schetinin, V., Jakaite, L., Schult, J.: Informativeness of sleep cycle features in Bayesian assessment of newborn electroencephalographic maturation. In: 24th International Symposium on Computer-Based Medical Systems, pp. 1–6 (2011)
39. Selitskaya, N., Sielicki, S., Christou, N.: Challenges in face recognition using machine learning algorithms: case of makeup and occlusions. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing. Springer (2020)
40. Christou, N., Selitskaya, N., Sielicki, S.: Challenges in real-life face recognition with heavy makeup and occlusions using deep learning algorithms. In: Nicosia, G., et al. (eds.) Machine Learning, Optimization, and Data Science, 6th International Conference, LOD 2020, Siena, Italy, 19–23 July 2020, Revised Selected Papers, Part II, vol. 12566. Springer International Publishing (2020)
41. Selitskiy, S.: Code for paper ‘Learning Incorrect Verdict Patterns of the Established Face Recognising CNN Models using Meta-learning Supervisor ANN’, January 2021
42. Selitskiy, S., Selitskaya, N., Marina, K.: BookClub artistic makeup and occlusions face data. Mendeley Data, 2 September 2020
43. Pratt, L., Thrun, S.: Learning To Learn. Springer, Boston, MA (1998)
44. Vanschoren, J.: Meta-learning: A survey (2018)
45. Willers, O., Sudholt, S., Raafatnia, S., Abrecht, S.: Safety concerns and mitigation approaches regarding the use of deep learning in safety-critical perception tasks (2020)
46. Zhao, T., Xu, X., Xu, M., Ding, H., Xiong, Y., Xia, W.: Learning to recognize patch-wise consistency for deepfake detection (2020)
Formation Method of Delaunay Lattice for Describing and Identifying Objects in Fuzzy Images
S. O. Kramarov1(B), V. V. Khramov3, O. Y. Mityasova2, E. V. Grebenyuk2, A. A. Bocharov3, and D. V. Chebotkov2
1 MIREA – Russian Technological University, Rostov-on-Don, Russian Federation
2 Surgut State University, Surgut, Russian Federation
3 Southern University (IMBL), Rostov-on-Don, Russian Federation
Abstract. The article considers a method for forming a lattice based on the fuzzy Delaunay triangulation by refining (grinding) a rough triangular grid through repeated half-division of the sides of the original triangles. This approach to refinement makes it possible to build regular grids and the corresponding triangulations of the midpoints, consisting of either equilateral or arbitrary triangles, depending on how individual information objects are distributed in the space of the territory under study. The resulting problem of refining and rearranging the border triangles is simplified by the fact that they are split by analogy with the adjacent source triangles. A distinguishing feature of the proposed method, in contrast to earlier studies, is that it can be modified to refine a ready-made grid either as a whole or in its individual zones. In this case, the stage of constructing the rough triangulation is skipped, and the half-division is carried out on the already finished grid. Thus, there is no need to check the Delaunay condition at each step. With such refinement, only local rearrangement of cells can occur. The advantage of this approach is that there is no need to generate the entire grid again. It should be noted that the considered procedure for bringing the triangulation into accordance with the Delaunay condition can significantly reduce the machine time spent on rebuilding the grid, since the condition is not checked for all grid elements. In the process of describing the context of the image under study, the centers of gravity of flat information objects are determined, which serve as the initial points for the rough division into triangles according to Delaunay’s rules. The refined triangular grid is used to record the contours of information objects and the minimal polygons with the centers of gravity of these objects, which, in turn, act as information features for identifying both individual objects and scenes in the field of technical vision. The proposed approach makes it possible to significantly simplify the computational procedures for identifying elements of fuzzy images.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 305–311, 2022. https://doi.org/10.1007/978-3-030-82196-8_23
Keywords: Delaunay triangulation · Identification · Information objects · Fuzziness · Triangular grids
1 Introduction
In many problems of mathematical modeling of real objects and phenomena, researchers have to deal with a large number of NON-factors [1] inherent in measurements, data, and the methods and algorithms for their processing: underdetermination of general and specific knowledge about the image model in general and about a given image object in particular; ambiguity and incorrectness of the image formation model; inaccuracy (finite accuracy) of real values, associated with both approximation errors and measurement errors; and actual fuzziness (blurring). If it is necessary, or at least useful, to visualize the modeling process or the interpretation of its results, an important step is to build a computational grid that takes the initial fuzziness into account [1, 2]. There are various approaches to the construction of grids, including the advancing front method and Delaunay triangulation [3]; there are also methods that combine different geometric approaches [4, 5]. Usually [6], Delaunay triangulation is performed in two stages. At the first stage, the initial grid is formed from the available source points. For example, for images, including those obtained through technical vision systems, these can be any “reference points”, such as landmarks for a robot’s navigation system. At the second stage, usually [3, 6], the resulting coarse cells are subdivided in some way. The works [4–6] propose methods for constructing two-dimensional regular and irregular triangular grids whose cells satisfy the Delaunay condition. For example, “based on the initial points that approximate the boundary of the region, the initial (rough) triangular grid is constructed. The original area is contracted to a triangle by sequentially cutting off rough triangular elements from it. The remaining triangle represents the last element of the rough grid” [4]. The following approach is often used to “grind a coarse triangular mesh. Cells are split using a set of throw-in points. In addition, at each step of grinding, the region is rearranged so that it satisfies the Delaunay condition” [5]. However, all known approaches, due to the presence of NON-factors, have rather limited identification accuracy and do not provide for combination with other methods; for example, they do not fully take into account the morphology of the objects under study [7–10]. This paper makes an attempt to partially fill this gap, which made it possible to improve the quality of identification of the object under study, of the assessment of its current state and the stability of its boundaries, and of the prediction of the dynamics of processes occurring at this object, without increasing the time of its study.
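The Delaunay condition mentioned above is commonly stated as the empty-circumcircle criterion: no grid point may lie strictly inside the circumcircle of any triangle. A minimal sketch of the standard determinant test is given below; the function name is introduced here for illustration, and the sketch assumes exact-enough floating-point arithmetic and counter-clockwise vertex order.

```python
def in_circumcircle(a, b, c, p):
    """Return True if p lies strictly inside the circumcircle of triangle abc.

    The vertices a, b, c must be given in counter-clockwise order.
    """
    # Translate so that p becomes the origin.
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    # 3x3 determinant with rows (x, y, x^2 + y^2), expanded along the last column.
    det = (
        (ax * ax + ay * ay) * (bx * cy - cx * by)
        - (bx * bx + by * by) * (ax * cy - cx * ay)
        + (cx * cx + cy * cy) * (ax * by - bx * ay)
    )
    return det > 0

# Hypothetical right triangle in counter-clockwise order.
tri = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
inside = in_circumcircle(*tri, (0.25, 0.25))
outside = in_circumcircle(*tri, (2.0, 2.0))
```

A pair of triangles sharing a side violates the condition when the opposite vertex of one triangle passes this test for the other, which is the situation in which a local rearrangement of cells is required.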
2 Methods and Models
The use of triangulation grids is associated with several significant “advantages of triangulation. Firstly, any surface can be approximated with the necessary accuracy by a grid of triangles. Secondly, the computational complexity of the triangle partitioning algorithms is significantly less than when using other polygons. Third, there is now a tendency to define objects by triangulation everywhere” [7].
2.1 Basic Definitions
To model processes on a triangulation, we introduce several concepts. $M$ is the set of cell names, for example $M = \{m_i : i = 1, \dots, n\}$. A cell is a pair $(a, m)$, where $a \in A$ is called the state of the cell (denoted $a(m)$) and $m \in M$ is its name. Each triangle of the triangulation corresponds to a cell; thus, the entire triangulation corresponds to a cell array. A neighborhood pattern for a cell $(a, m)$ is a set of cell names, usually of the cells adjacent to the given cell. For a triangulation grid, two cells are considered adjacent if the corresponding triangles share a common side. As Fig. 1 shows, each triangle has no more than three neighbors, so the neighborhood pattern for a triangulation grid is $T = \{t_0, t_1, t_2, t_3\}$. Such a triangulation can perform the functions of a cellular automaton [7, 8], whose transition rule is a function that determines the new state of a cell from its current state and the states of the cells with names from $T$; this function is the same for all cells.
Fig. 1. Triangulation neighborhood pattern
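The neighborhood pattern and the transition rule can be sketched as follows. Triangles are given as triples of vertex indices; the mesh, the binary states, and the `spread` rule are hypothetical examples, not taken from the paper.

```python
from collections import defaultdict

def triangle_neighbors(triangles):
    """Map each triangle index to the indices of triangles sharing a side with it."""
    edge_owners = defaultdict(list)
    for idx, tri in enumerate(triangles):
        for k in range(3):
            edge = frozenset((tri[k], tri[(k + 1) % 3]))
            edge_owners[edge].append(idx)
    neighbors = {idx: set() for idx in range(len(triangles))}
    for owners in edge_owners.values():
        for i in owners:
            for j in owners:
                if i != j:
                    neighbors[i].add(j)
    return neighbors

def automaton_step(states, neighbors, rule):
    """Apply the same transition rule to every cell synchronously."""
    return [
        rule(states[i], [states[j] for j in sorted(neighbors[i])])
        for i in range(len(states))
    ]

# Hypothetical mesh: a central triangle with one triangle on each of its sides.
mesh = [(0, 1, 2), (0, 1, 3), (1, 2, 4), (2, 0, 5)]
nbrs = triangle_neighbors(mesh)

def spread(own, around):
    """Hypothetical rule: a cell becomes 1 if it or any neighbor is 1."""
    return 1 if own == 1 or 1 in around else 0

states = automaton_step([0, 1, 0, 0], nbrs, spread)
```

Storing adjacency by shared sides keeps the neighborhood of every cell at three names or fewer, regardless of the shape of the triangles.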
In general, the triangles in the Delaunay triangulation will not be equilateral, yet the requirements for neighboring cells are the same: a common side. So far, there are no universal methods for automatically identifying objects in images. The necessary amount of information for detecting objects in images and identifying them is contained in their form. Psychologically, the perception of images by a human operator occurs at the level of the outline, i.e., the contour of the object’s shape. Under the contour, in this case, we will understand the set of pixels of the objects under study that have at least one neighboring pixel that does not belong to this object. For a multigradation image, a contour is the edge (of some area of the image) where the gradient of the signal function changes most rapidly. The contour, in contrast the image texture and background, has a certain independence from the weather, lighting, and other factors. In a binary, two-level digital image, each pixel also either uniquely belongs to or does not belong to the object under study. Therefore, we will consider the contour as a characteristic (mathematical) object. At the same time, characteristic points can be selected
for the contour, for example, the top of a mountain, a small lake, etc., or the center of gravity of a flat shape bounded by this contour can be calculated; such points have relative spatial stability (compared to individual points of the contour itself). The procedure for constructing the initial triangulation involves constructing the initial (“rough” [6, 9]) triangles directly around the researched spatial object (hereafter RSO). It is quite well researched [3–10] and, as a rule, does not cause difficulties in implementation. The purpose of shredding (refining) the coarse grid is to improve the quality of RSO identification, the assessment of its current state and of the stability of its boundaries, and the prediction of the dynamics of processes occurring at this object. Shredding has been studied in a number of works on modeling physical processes [3, 5], for example, using “thrown points”. We considered shredding options that allow building both regular and irregular grids, depending on how the thrown points are distributed. The method can be modified to refine either the finished mesh as a whole or its individual zones. Let us consider our proposed new approach to shredding the original Delaunay triangulation. Whereas the known approaches produce either completely regular [2, 3] or irregular [4, 5] lattices, here we are dealing with locally regular lattices that reconcile the Delaunay approach with the previously developed heuristic methods of identification by the shape and topology of information objects [12–14].
First stage. Preparation for the original fuzzy Delaunay triangulation. At this stage, standard image preprocessing is performed: normalization, centering relative to the study area, binarization (object/background), and filtering. The sequence of these operations may vary.
Second stage. Primary segmentation of the image, including, where possible, preliminary determination of the boundaries of reference objects (for the technical vision system) and of their characteristic points, for example, the centers of gravity of flat shapes bounded by the selected contours.
Third stage. The actual Delaunay triangulation using the characteristic points. The Delaunay conditions are met [6], and auxiliary points are added if necessary.
Fourth stage. The triangle with the largest area that has the center of gravity of the researched object as one of its vertices is selected. This triangle is divided proportionally, for example, using an iterated function system (IFS) [11], forming displaced Sierpinski triangles, or, in a computational algorithm, by connecting the midpoints of the sides of the original triangle. The division continues until the required cell area of the resulting triangles is reached. The resulting grid triangles, as shown in [6], also satisfy the Delaunay conditions.
Fifth stage. The remaining triangles of the original Delaunay triangulation are split. The number of splitting steps must match the number used when splitting the first triangle. The minimal triangles then conjugate completely along the boundaries of the original ones, which ensures the local regularity of the adaptive refinement. An example of such a division for images of a mining area is shown in Figs. 2 and 3.
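The midpoint variant of the fourth and fifth stages can be sketched as follows. This is a minimal illustration under stated assumptions: triangles are tuples of three (x, y) points, all four sub-triangles are kept (the Sierpinski construction would discard the central one), and the function names are hypothetical.

```python
def midpoint(p, q):
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)


def area(tri):
    (x1, y1), (x2, y2), (x3, y3) = tri
    return abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)) / 2.0


def subdivide(tri):
    """Split a triangle into four similar triangles by joining the side midpoints."""
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]


def refine(tri, max_area):
    """Fourth stage: subdivide until every cell area is at most max_area.

    Returns the cells and the number of splitting steps that were needed.
    """
    if max_area <= 0:
        raise ValueError("max_area must be positive")
    cells, steps = [tri], 0
    # All cells of one level have equal area, so checking one of them is enough.
    while area(cells[0]) > max_area:
        cells = [child for t in cells for child in subdivide(t)]
        steps += 1
    return cells, steps


def refine_steps(tri, steps):
    """Fifth stage: split another triangle with the same number of steps."""
    cells = [tri]
    for _ in range(steps):
        cells = [child for t in cells for child in subdivide(t)]
    return cells


largest = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
cells, steps = refine(largest, max_area=0.5)
other = refine_steps(((0.0, 0.0), (2.0, 0.0), (0.0, 2.0)), steps)
```

Using the same step count for every original triangle places the same number of nodes on each shared side, which is what makes the small triangles match along the original boundaries.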
Fig. 2. Segmentation of an object on a triangular grid
Fig. 3. Initial Delaunay triangulation of the mining enterprise territory (the researched object is highlighted).
A heuristic algorithm for selecting the contour of a two-dimensional object on a triangular raster has been implemented. The features of image representation and processing are shown in Fig. 3a. As shown by the authors’ earlier research [12–14], if the image is represented by a binary lattice of dimensions n × n and any two adjacent contour points (and the corresponding Freeman code value γ(i)) are known on a smooth, contrast image (i.e., one that has previously undergone anisotropic filtering), then to select the next EV, defined by connectivity 6, it is necessary to know the number k of pixels adjacent to this cell C and belonging to the image. In this case, the EV has six possible directions (Fig. 3a):
$$\gamma(i+1) = \left(\gamma(i) + k + \alpha\right) \bmod 6, \qquad (1)$$
where α = 2 is a base-six constant and the summation is modulo six. This method is described in more detail in [12, 13]. Figure 3b shows how the contour selection is carried out in practice on the Delaunay triangulation. A computer experiment has shown that this approach increases the reliability of identification of extended objects by 12–15% in spring and autumn bad-weather conditions [14]. The theoretical and practical results obtained in this study make it possible to apply, in a broad context, the well-known methods of the theory of cellular automata, the corresponding software, and parallel data processing, as well as to design specialized integrated circuits as a means of accelerating the computations. This ultimately allows for high-quality monitoring of extended objects in real time.
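As a worked instance of Eq. (1), the direction update can be written in a few lines. The function name is hypothetical, and the sketch assumes that directions are numbered 0 to 5; how k is counted on the raster is described in [12, 13] and is not reproduced here.

```python
ALPHA = 2  # the constant α from Eq. (1)


def next_direction(gamma, k, alpha=ALPHA):
    """Next Freeman code value on a 6-connected raster: (γ(i) + k + α) mod 6."""
    return (gamma + k + alpha) % 6


# Current direction 5 with k = 3 adjacent image pixels: (5 + 3 + 2) mod 6 = 4.
example = next_direction(5, 3)
```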
3 Conclusion In this research, a new approach based on the fuzzy Delaunay triangulation is proposed in order to increase the reliability and accuracy of identifying a spatial scene in a robot’s technical vision system and of determining the location and position of the target object in the local coordinate system. This result is achieved [6] as follows. During pre-processing, the original image, treated as a connected set of objects, is brought by a computing device to the scale that is standard for the method, centered, and fitted into a rectangle of the required size. It is then compared in turn with templates stored in the computer’s memory in the form of fuzzy Delaunay triangulations; these are compared with the fuzzy Delaunay triangulation of the input scene image by means of a neural network. The comparison is based on an analysis of the features of the triangulation obtained from images of scenes of the earth’s surface. It is made for each feature, and a decision is then made on whether the Delaunay triangulation of the vector model of the scene in the resulting image coincides with the reference fuzzy Delaunay triangulations. When shredding, only a local rearrangement of the cells occurs. The advantage of this approach is that there is no need to generate the entire grid again. It should be noted that the considered procedure for bringing the triangulation into accordance with the Delaunay condition significantly reduces the machine time spent on rebuilding the grid, since the condition is not checked for all the elements of the grid.
References
1. Narinyani, A.S.: Non-Factors: brief introduction. News Artif. Intell. 2, 52–63 (2004)
2. Khramov, V.V.: Generation of model objects in intelligent environments. Theory and use for managing complex systems. In: Management in Social, Economic and Technical Systems. Proceedings of the Inter-Republican Scientific Conference, pp. 67–68. Kislovodsk University of the Academy of Defense Industries of the Russian Federation, Kislovodsk (2002)
3. Kruglyakova, L.V., Neledova, A.V., Tishkin, V.F., Filatov, A.: Unstructured adaptive grids for mathematical physics problems (review). Math. Model. 10(3), 93–116 (1998)
4. Popov, I.V., Polyakov, S.V.: Construction of adaptive irregular triangular grids for two-dimensional multi-connected non-convex domains. Math. Model. 14(6), 25–35 (2002)
5. Popov, I.V., Vikhrov, E.V.: Method for constructing unstructured grids. Preprints IPM named after M.V. Keldysh, 237, 15 (2018). https://doi.org/10.20948/prepr-2018-237
6. Skvorcov, A.V.: Delaunay Triangulation and its Application. Tomsk University Publishing, Tomsk (2002)
7. Bandman, O.L.: Cellular automatic models of spatial dynamics. Syst. Dyn. 10, 59–113 (2006)
8. Wolfram, S.: A New Kind of Science. Wolfram Media Inc., Champaign, IL, USA (2002)
9. Evseev, A.A., Nechaeva, O.I.: Cellular automatic modeling of diffuse processes on triangulation meshes. Appl. Discrete Math. 4(6), 72–83 (2009)
10. Kramarov, S.O., Khramov, V.V., Povkh, V.I., Groshev, A.R., Karataev, A.S.: A method for identifying objects on digital images of the underlying surface by the Delaunay fuzzy triangulation method. Patent RU 2729557
11. Crownover, R.: Introduction to Fractals and Chaos. Jones and Bartlett Publishers Inc., Sudbury (1995)
12. Mayorov, V.D., Khramov, V.V.: Heuristic methods of contour coding of information object models in the robot’s technical vision system. Bull. Rostov State Trans. Univ. 1(53), 62–69 (2014)
13. Khramov, V.V., Goncharov, V.V.: A device for tracking the contours of two-dimensional objects. Patent RU 2050594
14. Khramov, V.V., Grozdev, D.S.: Intelligent Information Systems: Data Mining. Rostov State Transport University, Rostov-on-Don (2012)
Analysis of Electricity Customer Clusters Using Self-organizing Maps Gabriel Augusto Rosa(B) , Daniel de Oliveira Ferreira, Alan Petrônio Pinheiro, and Keiji Yamanaka Federal University of Uberlândia, CEP 38408-100, Santa Mônica, Uberlândia, Brazil {gabriel.rosa1,danieldeoliveira}@ufu.br
Abstract. The electricity sector is essential for the economy. It is also important to improve the quality of life for the population. Evaluating the electricity consumption data helps in the management of crises in the sector and provides useful information for implementing pricing policies. Also, it allows identifying places that most need infrastructure improvement or expansion. As regards the demand side, this data analysis can also indicate how to promote more efficient and conscious consumption. In this way, this study aims to analyze data from units that consume electricity using the Kohonen self-organizing maps (SOM) technique. This technique of artificial neural networks was able to reveal patterns and behaviors in groups of customers of electric power companies. The results show relationships between quality indexes and economic activities, revealing an important space for improvements. The relationship between seasons and the energy consumption of some groups can also assist in making decisions related to energy sources and managing the resources of the electric power network. Keywords: Self-organizing maps · Energy consumption · Cluster analysis · Electricity sector
1 Introduction The analysis of clusters of energy customers advances the understanding of the electricity sector’s challenges and of the behavior and needs of these customers, and it assists public authorities and regulatory entities in the efficient management of laws [1]. Providing reliable electricity at reasonable costs avoids economic losses, increases productivity, and enables industry and agriculture to become competitive [2–4]. Also, establishing this supply universally is a pillar for the reduction of social inequality. Conducting research related to the generation, transmission, and distribution of electricity, as well as implementing innovations in the electricity sector, are important actions to face challenges such as consumer payment default, energy theft, the maintenance of quality indexes, and the guarantee of energy supply to essential activities [5–7]. Therefore, the general objective of this research is to mine the Geographic Databases of Distributors (BDGD) of Brazilian electricity distribution companies to find patterns in customer consumption profiles, using Kohonen’s self-organizing map (SOM) algorithm [8]. To reach this objective, the following specific objectives are pursued: 1) achieving the best topologies and efficient neural network training; 2) illustrating the influence of the SOM’s parameters and how to optimize them; 3) finding ways to make the result of the neural network understandable for professionals working in electric companies; 4) revealing data related to problems of concessionaires/government concerning the profile of consumer units. SOMs have been successfully applied to extract useful information from electrical system databases. Examples are the use of self-organizing maps to estimate the daily load curve [9, 10]. Both papers use hourly energy consumption data, which is not available in many regions. Also, they do not present a qualitative analysis of the samples classified in each cluster. In this paper, the SOM is used to exploit attributes commonly present in energy utility databases. Furthermore, a deeper analysis of the samples present in some clusters is shown.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 312–325, 2022. https://doi.org/10.1007/978-3-030-82196-8_24
2 Kohonen’s Self-organizing Map There are several techniques for grouping data, such as K-Means, DBSCAN, hierarchical clustering, and grid-based clustering. For such an application, neural networks also prove to be powerful for recognizing patterns and relationships between variables [11]. Kohonen’s self-organizing maps are basically formed by two layers of neurons: an input layer and a unit layer, called U. The neurons of the U layer are arranged in an architecture that provides a notion of neighborhood between the neurons [8]. In a two-dimensional architecture, the neurons are distributed on a two-dimensional grid, and every input neuron is connected to every U layer neuron [12]. When an n-dimensional input vector of real numbers is presented to the network, all neurons in the U layer compare their weight vectors with it, and the neuron with the greatest similarity receives the correspondence [13]. All neurons in the neighborhood of this neuron then learn about the input vector and update their weights to become more similar to it [14]. This process is compared to a competition among the neurons, in which the most similar neuron, the winner, and its neighborhood are adjusted to become closer to the input pattern. The inputs are presented in a random sequence, and as this process is repeated over several epochs, the range of the neighborhood is reduced, so that a neuron initially has many other neurons in its neighborhood but at the end of this stage has few or no neighbors [12]. After some training epochs, the neurons can represent relations among the input samples in a one- or two-dimensional space. Therefore, the SOM can reduce dimensionality, allowing an easier way to analyze data. Also, it is possible to delimit clusters by observing the distance between the neurons in the SOM, or the data attributes, depending on the application. A deeper analysis of the SOM’s parameters is presented in the results section, which illustrates how to use the SOM for clustering data.
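The training procedure described in this section can be sketched in plain Python. This is a minimal illustration and not the implementation used in the study: the linear decay schedules, the Gaussian neighborhood, the squared Euclidean distance, and all function and parameter names are assumptions.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position (row, col) of the neuron whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, epochs, lr0=0.5, sigma0=None, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # weights[r][c] is the weight vector of the U-layer neuron at grid cell (r, c).
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / float(epochs)
        lr = lr0 * (1.0 - frac)                   # assumed linear decay
        sigma = max(sigma0 * (1.0 - frac), 0.01)  # neighborhood shrinks over time
        samples = list(data)
        rng.shuffle(samples)                      # inputs are presented in random order
        for x in samples:
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])      # move toward the input
    return weights


# Two well-separated groups of two-dimensional samples.
data = ([(0.1 + 0.01 * i, 0.1) for i in range(5)]
        + [(0.9 - 0.01 * i, 0.9) for i in range(5)])
som = train_som(data, rows=4, cols=4, epochs=30)
```

After training, `best_matching_unit(som, sample)` gives the neuron to which a sample is assigned, which is the basis for the cluster analysis in the results section.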
3 Methodology The main steps in the implementation of the solution were: preliminary data analysis with descriptive statistics, data normalization or standardization, removal of outliers, application and adaptation of the SOM algorithm, delimitation of the clusters to analyze, and evaluation of the centroids and the data samples in the clusters. Additionally, to achieve the objectives satisfactorily, the parameters were adjusted and analyzed, allowing the results to be refined. Data processing was a challenging step because the database has a high number of records, which required the use of parallel processing techniques, as well as an adequate matrix data structure to compute centroids efficiently. The algorithm was implemented in the Python language. The benefits that justify this choice include the availability of a wide range of libraries, ease of testing, and the parallel processing that some of its libraries offer. Among the tools used, we highlight Orange Data Mining, used only for initial visualizations, and PySpark, used to read and manipulate the database. As this study is aimed at electric utilities, care was taken to interpret the information so that managers and other employees of the companies could understand the technical results obtained. The data used include installed load, monthly energy consumption values, and the DIC indicator (Individual Interruption Duration per Consumer Unit). Qualitative data, such as the main activity of business customers, the type of area (urban or rural), and other attributes, were also used to understand the groupings formed, by counting the individuals in each category.
3.1 Implementation Strategies The selection of variables and proper processing of information are among the main challenges faced during the algorithm implementation. To achieve the proposed objectives, some strategies were adopted, including the creation of a prototype using only the database tables of consumer units. In addition, only quantitative variables were inserted into the model. Several implementations of the SOM were tested, some written by the authors and others adapted from available implementations. Several distance functions were tested, such as Manhattan, Euclidean, and cosine similarity. Some neighborhood functions, such as step and Gaussian, were also evaluated. Furthermore, parameters such as the number of neurons, the number of epochs, the learning rate, and the neighborhood radius/standard deviation, represented by sigma (σ), were carefully adjusted for each analysis scenario. To understand how to tune the SOM, a simple empirical strategy with synthetic data was used, allowing a better understanding of the SOM and its parameters. By analyzing the SOM map colored by the average distance of each neuron to its neighbors, together with the component maps and the map of variance among the data samples classified in each neuron, some groups of neurons were established for examination. These groups are called superclusters here.
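The distance and neighborhood functions named above have standard definitions, sketched below. The function names are hypothetical, cosine similarity is turned into a distance as 1 minus the similarity (an assumption about how it was used), and the value returned for a zero vector is an arbitrary choice.

```python
import math


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def step_neighborhood(grid_dist, radius):
    """Full influence inside the radius, none outside."""
    return 1.0 if grid_dist <= radius else 0.0


def gaussian_neighborhood(grid_dist, sigma):
    """Influence decays smoothly with the grid distance to the winning neuron."""
    return math.exp(-grid_dist ** 2 / (2.0 * sigma ** 2))
```

The step function updates all neurons within the radius equally, while the Gaussian function weights the update by σ, which is the parameter examined with synthetic data in the results.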
To evaluate the results of the self-organizing maps, a distance matrix was obtained from the distances between the averages of the centroids selected in each supercluster. Thus, it can be verified how different or similar the superclusters are to each other. To understand the clusters, the data were de-standardized and the superclusters were analyzed through the most frequent classes of the categorical attributes and the histograms of the quantitative attributes of each supercluster. Some attributes were interpreted using other databases, such as those related to the National Classification of Economic Activities (CNAE) and the municipality codes of the Brazilian Institute for Geography and Statistics (IBGE) [15].
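The evaluation step can be sketched as follows, assuming that each supercluster is a list of neuron centroid vectors, that the distance is Euclidean, and that standardization was the usual (x - mean) / std transform. These choices and all names are assumptions made for illustration.

```python
import math


def mean_vector(vectors):
    n = float(len(vectors))
    return [sum(col) / n for col in zip(*vectors)]


def supercluster_distance_matrix(superclusters):
    """Pairwise distances between the average centroids of the superclusters."""
    means = [mean_vector(s) for s in superclusters]
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for q in means]
            for p in means]


def destandardize(z, mean, std):
    """Invert z = (x - mean) / std so that values are back in their original units."""
    return [zi * s + m for zi, m, s in zip(z, mean, std)]


# Two hypothetical superclusters with two centroids each.
superclusters = [[[0.0, 0.0], [2.0, 0.0]], [[1.0, 4.0], [1.0, 4.0]]]
matrix = supercluster_distance_matrix(superclusters)
original_units = destandardize([1.0, -1.0], mean=[10.0, 5.0], std=[2.0, 3.0])
```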
4 Results Among the possible analyses of the available databases, which contain tables for various elements of the electrical network, the SOM was used to group data from medium-voltage consumer units in two scenarios: 1) using the average consumption in one year (ENE_M), the installed load (CAR_INST), and the duration of individual interruption per consumer unit (DIC); 2) using the monthly energy consumption values over twelve months (ENE_01 to ENE_12), normalized by the average consumption in the year.
4.1 Synthetic Data In order to understand the effect of each SOM parameter, synthetic data with two numerical attributes were used so that they could be observed visually, as shown in Fig. 1.
Fig. 1. Synthetic data.
It is expected that a tuned SOM will be able to separate such data, since the sets can be separated visually. Among the parameters analyzed, the following stand out: the number of training epochs, the size of the neighborhood, and the dimensions of the SOM.
As shown in Fig. 2, where blue indicates neurons with a smaller average distance to their neighbors and red indicates distant neurons, the number of epochs should not be too small, as the data are then condensed in only one part of the SOM. On the other hand, it cannot be too large, as the data then spread in such a way that the clusters mix visually, in addition to increasing the computational cost.
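The average distance map used in these figures can be computed from the trained weights. The sketch below assumes a rectangular grid stored as `weights[row][col]`, a 4-neighborhood, and the Euclidean distance; the study does not state which grid topology or neighborhood was used, and the function names are hypothetical.

```python
import math


def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_distance_map(weights):
    """For every neuron, the mean distance between its weights and its grid neighbors'."""
    rows, cols = len(weights), len(weights[0])
    umatrix = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(euclid(weights[r][c], weights[nr][nc]))
            umatrix[r][c] = sum(dists) / len(dists) if dists else 0.0
    return umatrix


# A 1 x 3 map: two identical neurons followed by a distant one.
umatrix = average_distance_map([[[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]])
```

Low values (blue in the figures) mark neurons that sit inside a cluster; high values (red) mark borders between clusters.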
Fig. 2. Average distance map for different numbers of epochs: (a) 10 epochs, (b) 150 epochs, (c) 10000 epochs.
Considering a Gaussian neighborhood function, the effect of a neuron on its neighborhood can be adjusted with the standard deviation parameter σ. As shown in Fig. 3, σ must not be too small, since the data are then condensed in only one part of the SOM. On the other hand, it cannot be too large, as the data then spread in such a way that the clusters are visually confused with a single large cluster. Similarly, there is also an ideal range for the number of neurons, as shown in Fig. 4. Also, a larger SOM means a higher computational cost and execution time.
Fig. 3. Average distance map for different neighborhood sizes (panels a, b, and c).
Fig. 4. Average distance map for different SOM dimensions: (a) 5x5, (b) 15x15, (c) 50x50.
4.2 Clusters Based on ENE_M, CAR_INST and DIC With around 14000 samples, and with the attributes of average consumption (ENE_M), installed load (CAR_INST) and DIC, the clusters that are shown in Fig. 5 were obtained. Superclusters are seen in blue on the map of average distances, indicating neurons close together.
Fig. 5. (a) Average distance map. (b) The variance between samples classified in each neuron.
To choose the superclusters, that is, the groups of neurons for analysis, it is useful to examine the maps of the components used by the SOM to group the data, in order to understand what led to the formation of each cluster and its characteristics. Figure 6 shows the maps of the average monthly consumption (ENE_M) and DIC components, which reveal another aspect: the formation of smaller clusters with common component values within the larger clusters.
Fig. 6. (a) ENE_M and (b) DIC components map.
To verify the difference between the chosen superclusters, a matrix, shown in Fig. 7, was used, in which each entry shows how different two groups are. Group 1 is the most different, as the first row and the first column of the matrix have more red tones, indicating a greater difference from the other superclusters. Cluster 4 is less distant from cluster 1, as it has a lighter shade of red. This confirms what is seen in the SOM’s topology. Comparing clusters 2 and 3, we can see from the map of the DIC component (Fig. 6) that, despite being close in the SOM topology, these two groups have different DICs: group 2 has a good index, while group 3 has a higher, that is, worse index, with more power outages.
Fig. 7. Difference matrix among the delimited clusters.
Regarding the classes of consumers present in each supercluster, group 2 presents a predominance of industrial and commercial units, while group 3 presents a predominance of rural and industrial units installed in non-urban areas, as shown in Fig. 8. This figure also shows the most frequent categories in the entire database, to enable the understanding of the patterns of the groups analyzed in relation to the whole.
Fig. 8. Types of customers, urban UB and non-urban NU, with higher frequencies in the whole database and clusters 2 and 3.
Figure 9 shows the most frequent activities performed by these groups, obtained using CNAE data from IBGE [15].
Fig. 9. Most frequent activities in the whole database and superclusters 2 and 3.
To analyze the clusters in terms of the cities with the largest number of consumer units in each cluster, a description of each municipality is useful. These descriptions are presented below.
• Municipality A is an extremely urbanized municipality with a mild climate and is an economic and political center.
• Municipality B is an extremely urbanized municipality.
• Municipality C is an urbanized municipality.
• Municipality D is a medium-sized municipality with a hot climate and an economy based on mining and agriculture.
• Municipality E is a small rural municipality with great economic importance.
• Municipality F is an urbanized municipality.
• Municipality G is a medium-sized municipality with a mild climate and an economy based on agriculture.
• Municipality H is an urbanized municipality.
• Municipality I is a small rural municipality.
• Municipality J is a medium-sized municipality with an economy based on mining, industry, and agriculture.
• Municipality K is a medium-sized municipality with a hot climate and an economy based on agriculture.
• Municipality L is a small rural municipality.
• Municipality M is a small rural municipality.
• Municipality N is a small rural municipality.
• Municipality O is a medium-sized municipality with a hot climate and an economy based on agriculture and mining.
As shown in Fig. 10, cluster 2 follows the trend of the whole database, with the large urban municipalities A, B, and C being the most frequent. Cluster 3, on the other hand, more frequently presents small towns, D and E, whose economy is based on agriculture or mining. Municipality E, the second most frequent in group 3, is a big exporter of citrus products. This points to a situation in need of improvement, since agricultural activities need good electricity infrastructure, for example, to avoid production losses due to lack of refrigeration or irrigation.
Fig. 10. Masked municipalities with higher frequencies in the database and clusters 2 and 3.
To analyze subgroups 1, 4, 5, and 6, it is useful to check the relationship between the DIC and the average consumption (ENE_M) variables in these clusters. A high DIC means more time without power, which is a situation of operational difficulty for the concessionaire and poor service for the consumer unit. If the DIC indicator does not comply with the levels established by law, the concessionaires pay fines. Higher consumption is related to the greater economic importance of the consumer unit for the concessionaire. These relationships are illustrated in Fig. 11. Thus, groups 4 and 6 have a worse DIC indicator, which shows scope for improvement. On the other hand, groups 1 and 4 represent economically important consumers.
Fig. 11. Relationship between DIC and ENE_M. [Diagram with axes DIC and ENE_M; region labels: satisfied customer, unsatisfied customer, economically less significant, economically important, fines; groups 1, 4, 5, and 6 are marked.]
Figure 12 shows the most frequent groups of consumer units in each cluster. Again, the clusters with the worst indicators have a rural predominance, shown in the graphs on the right. Group 5 has a commercial predominance and presents the best DIC indicators.
Fig. 12. Types of customers, urban UB and non-urban NU, with higher frequencies in superclusters 1, 4, 5, and 6.
Regarding the municipalities present in each group, smaller municipalities appear in the two clusters with the worst DIC indicators, clusters 4 and 6, as shown in Fig. 13. However, there are also large municipalities with consumer units with a bad DIC, such as municipalities B and H, the most frequent municipalities in cluster 4, the second cluster in Fig. 13. As regards the activities carried out by the clusters’ units, agricultural activities predominate in cluster 6, as shown in Fig. 14. Also, there are water treatment companies in clusters 1, 4, and 5. It is worth highlighting that this type of service is essential and that these units are in urban areas, indicating another scope for improvement.
Fig. 13. Masked municipalities with higher frequencies in superclusters 1, 4, 5, and 6.
Fig. 14. Most frequent activities in superclusters 1, 4, 5, and 6.
4.3 Clusters Based on Monthly Normalized Energy Consumption With about 12000 samples, the clusters shown in Fig. 15 were obtained by using, as input to the SOM, the twelve monthly consumption values of each unit divided by its average consumption over the one-year period. Thus, the SOM grouped the data according to changes in consumption over the year, and not according to the intensity of absolute consumption.
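The normalization described above amounts to dividing each unit’s monthly values by that unit’s annual mean. A minimal sketch follows; the function name and the handling of units with zero consumption are assumptions.

```python
def normalize_by_annual_mean(monthly):
    """Divide each monthly value by the unit's mean consumption over the period."""
    mean = sum(monthly) / float(len(monthly))
    if mean == 0:
        return None  # assumption: units with no consumption are skipped
    return [v / mean for v in monthly]


# Two hypothetical units with the same seasonal shape but different magnitudes.
small_unit = normalize_by_annual_mean([100.0] * 6 + [300.0] * 6)
large_unit = normalize_by_annual_mean([1000.0] * 6 + [3000.0] * 6)
```

Both units map to the same profile, which is why the SOM groups them by seasonal pattern rather than by absolute consumption.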
Fig. 15. Component maps of January and June, which are, respectively, summer and winter months in the location (groups 2 and 3 are marked on each map).
This figure shows the groups that were chosen for analysis based on consumption in January and June, in order to analyze clusters that consume more in summer and in winter. Labels 2 and 3 were adopted for the groups so that the whole database, used for comparison, is the first group. Analyzing the most frequent cities in each group, shown in Fig. 16, it is possible to note that cluster 2 contains cities with a colder climate, like municipalities A and G, while cluster 3 contains cities with a warmer climate, like D, K, and O. Thermal discomfort would be more intense in cluster 3 in the summer, as it contains cities with a warmer climate, which might suggest that this group has higher consumption in this period. However, the component maps (Fig. 15) show that this group has its highest consumption in winter.
Fig. 16. Most frequent municipalities in clusters 2 and 3.
The explanation for the profile of cluster 3 is in Fig. 17 and Fig. 18. Figure 17 shows that the most frequent groups in this cluster are rural customers, which probably use more energy for irrigation in the dry season. In addition, Fig. 18 shows that the most frequent activities in cluster 3 are agricultural.
Fig. 17. Most frequent consumption classes, urban UB and non-urban NU, in clusters 2 and 3.
Fig. 18. Most frequent activities in clusters 2 and 3.
5 Conclusions and Future Work Pre-processing and post-processing techniques were applied to a real database. The detailed examples with synthetic data contribute to a better understanding of how to tune the SOM parameters, and this analysis helped to improve the results. The presented SOM was shown to be capable of grouping the electricity consumers, reflecting the relations between variables such as DIC and consumption category (rural, industrial, commercial). Patterns were found in line with the objectives of the work, for example, in the energy consumption of some clusters throughout the year. Future work includes implementing improved SOMs that handle categorical variables. In addition, it is possible to estimate the loads installed in residential consumer units using a research database on appliance possession and consumption habits (PPH), based on electrical equipment data from consumer units. Acknowledgments. Research financed by the ANEEL R&D project Nº 05160–1805 / 2018, between CEB and UFU, and with partial support from CNPq, process number 135168/2019–8.
References
1. Ruas, G.I.S. et al.: Previsão de Demanda de energia elétrica utilizando redes neurais artificiais e support vector regression. In: VI Encontro Nacional de Inteligência Artificial, pp. 1262–1271. Rio de Janeiro (2007)
2. Oseni, M.O., Pollitt, M.G.: The economic costs of unsupplied electricity evidence from backup generation among African firms. Energy Policy Research Group Working Paper 1326, Cambridge Working Papers in Economics 1351, Cambridge University Press, Cambridge (2013)
3. Bose, R.K., Shukla, M., Srivastava, M., Yaron, G.: Cost of unserved power in Karnataka, India. Energy Policy 34(12), 1434–1447 (2006)
4. Horváth, K.H.: The effect of energy prices on competitiveness of energy-intensive industries in the EU. In: International Entrepreneurship and Corporate Growth in Visegrad Countries, pp. 129–146. University of Miskolc, Miskolc (2014)
5. Griebenow, C., Ohara, A.: Report on the Brazilian Power System. Agora Energiewende and Instituto E+ Diálogos (2019)
6. Fillho, G.F., Cassula, A.M., Roberts, J.J.: Non-technical losses in Brazil: subsidies for implementation of smart-grid. J. Energy Power Eng. 8(7), 1301–1308 (2014)
7. Ceaki, O., Seritan, G., Vatu, R., Mancasi, M.: Analysis of power quality improvement in smart grids. In: 10th International Symposium on Advanced Topics in Electrical Engineering (ATEE), pp. 797–801. IEEE, Bucharest (2017)
8. Kohonen, T.: The self-organizing map. Proc. IEEE 78(9), 1464–1480 (1990)
9. Cerchiari, S.C., Teurya, A., Pinto, J.O.P., Lambert-Torres, G., Sauer, L., Zorzate, E.H.: Data mining in distribution consumer database using rough sets and self-organizing maps. In: Power Systems Conference and Exposition, pp. 38–43. IEEE, Atlanta (2006)
10. Oprea, S., Bâra, A.: Electricity load profile calculation using self-organizing maps. In: 20th International Conference on System Theory, Control and Computing (ICSTCC), pp. 860–865. IEEE, Sinaia (2016)
11. Mingoti, S.A., Lima, J.O.: Comparing SOM neural network with Fuzzy c-means, K-means and traditional hierarchical clustering algorithms. Eur. J. Oper. Res. 174(3), 1742–1759 (2006)
12. Ultsch, A.: Self-organizing neural networks for visualization and classification. In: Information and Classification, pp. 307–313. Springer, Berlin, Heidelberg (1993). https://doi.org/10.1007/978-3-642-50974-2_31
13. Ghaseminezhad, M.H., Karami, A.: A novel self-organizing map (SOM) neural network for discrete groups of data clustering. Appl. Soft Comput. 11(4), 3771–3778 (2011)
14. Strecker, U., Uden, R.: Data mining of 3D poststack seismic attribute volumes using Kohonen self-organizing maps. Lead. Edge 21(10), 1032–1037 (2002)
15. IBGE, API CNAE - Cadastro Nacional de Atividades Econômicas, https://servicodados.ibge.gov.br/api/docs/CNAE?versao=2. Accessed 04 Jan 2021
SmartData: An Intelligent Decision Support System to Predict the Readers Permanence in News
Jessie Caridad Martín Sujo, Elisabet Golobardes i Ribé, Xavier Vilasís Cardona, Virginia Jiménez Ruano, and Javier Villasmil López
Research Group on Data Science for the Digital Society (DS4DS), La Salle - Universitat Ramon Llull, 08024 Barcelona, Spain
{jessiecaridad.martin,elisabet.golobardes,xavier.vilasis,virginia.jimenez,javier.villasmil}@salle.url.edu
https://www.salleurl.edu
Abstract. This article proposes a hybrid intelligent system, based on the application and combination of Artificial Intelligence methods, as a decision support tool. The objective of this study is to exploit the advantages of the constituent algorithms in order to predict the permanence rates of readers on news items from a digital medium. With this prediction, the editor will be able to decide whether or not to publish a news item. To evaluate the effectiveness of the hybrid intelligent system, data from a reference digital medium are used. In addition, a series of performance metrics is calculated, showing that the predicted results are 88% accurate.
Keywords: Intelligent systems · Artificial intelligence · Machine learning · Decision making · Digital media

1 Introduction
Since the inception of the press, attracting and engaging loyal readers has been a challenging task, which has been accentuated with the advent of the Internet. Therefore, it is not surprising that digital media seek to make an emotional impact on the reader in order to maintain readership figures and advertising revenues. According to studies carried out by the Association for Media Research (AIMC) [1], 47.8% of readers only read the digital edition of the press. There is an increasing demand for high-quality content that is accessible anytime and anywhere. DIGINATIVEMEDIA¹ also confirms that 45% of users prioritize television to find out what is going on in the world, while 40% opt for internet media (such as magazines and newspaper, TV and radio websites and apps). This is confirmed by the research of Parratt [2], which claims that, despite the new technologies, the young generations choose television as a means of finding out about current affairs, with the Internet following at just over half an hour a day. This does not rule out the findings of Casero [3]: at present, the consumption of news among young people is decreasing, and this trend could continue if appropriate strategies are not adopted to solve the problem.

Therefore, it is vitally important for this sector to have a support system that enables the prediction of the impact that a news item will generate on the reader, based on the length of time the reader spends on the page. As reflected in [4], the permanence of readers on a news item is a factor in the generation of personalised recommendations and marketing, which is the basis for the social system.

The article is organized as follows. In Sect. 2 we discuss related work and the related problem of keyword extraction. In Sect. 3 we present the proposed system for the problem we are addressing, as well as the technologies used. In Sect. 4 we collect and display the information from the sources and compare the algorithms used during the development of this project. In Sect. 5 we offer some criteria for interpreting and evaluating the results. Finally, in Sect. 6 we conclude the article with some final observations.

¹ Published in Digital NewsReport. Available in .

The original version of this chapter was revised: the chapter authors' given and family names have been correctly identified and updated. The correction to this chapter is available at https://doi.org/10.1007/978-3-030-82196-8_61
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022, corrected publication 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 326–339, 2022. https://doi.org/10.1007/978-3-030-82196-8_25
2 State of the Art
Similar proposals addressing the problem of engaging readers in news and current affairs have been found in recent literature. On the one hand, analyzing the impact of news has been linked to analyzing its volatility: moderately good (intraday) news reduces volatility (the next day), while very good news (unusually high positive intraday returns) and bad news (negative returns) increase volatility, with the latter having a more severe impact [5]. On the other hand, previous studies focus on predicting the impact of the news (hereafter referred to as a news item) by identifying the temporal evolution of news behavior patterns [6], and on the user's intentions during the news browsing session [7]. The work in [8] treats the impact of a news item as a classification problem, analyzing each news item before its publication. Finally, a more extensive search found two articles [9,10] that apply techniques such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) to the titles, text and publication date to determine the impact of the news. A gap was detected between the related works and the data analysed, which motivates new approaches such as the one proposed in this study.
3 SmartData: Intelligent System
The present proposal consists of a model to predict the impact of a news item on the reader through its permanence, which may depend on multiple factors: the news writers, the quality of the content, the format, etc. The development of this intelligent system consisted of the following phases: the first and main phase, data collection; the second phase, creation of the intelligent system, which is composed of several stages; and finally the third phase, the use of the tool by customers for decision-making. All these phases can be observed in Fig. 1. The proposed system works with a newly implemented methodology that is explained in detail in Sect. 4; during the experimentation, the algorithms used are compared and the selection made is justified.

Fig. 1. SmartData: phases to develop. The figure shows the input data that the model receives, the processing phases applied to these data, and the output data presented through dashboards so that the end user can understand them more easily.

3.1 Data Collection
The system will be fed by the content of the news written in natural language from a digital newspaper. For each news item, we will collect both the variables that characterize it and the variables that describe the user interaction.

3.2 SmartData
SmartData is the system we have created to predict how long a reader is going to be interested in a news item. It is composed of four phases, in which we apply both statistical and unsupervised/supervised learning techniques. This procedure allows us to obtain the results and display them to end users in a clear and coherent format. The four phases are detailed as follows:

Exploratory Data Analysis. We start with an exploratory analysis of the data, in which an essential technique is applied: feature selection ("selection of characteristics"). By reducing the number of variables, it reduces the complexity of the model, helps to avoid overfitting, and makes the model easier to interpret. Two techniques have been applied: Pearson's correlation [11], a statistic that measures the linear association between two continuous variables; and Extreme Gradient Boosting (XGBoost Regressor) [12], an algorithm built on the principles of the gradient boosting framework, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. Both techniques also make it possible to reduce the weight of the model and improve the precision of the prediction.
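As an illustration of the first of these two techniques, the following is a minimal sketch of Pearson's correlation coefficient in plain Python. The toy values for text length, time on page and page views are invented for the example and are not taken from the study; only the formula itself is standard.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data (NOT from the paper): five news items.
text_length = [500, 1000, 1500, 2000, 2500]   # characters
avg_time_on_page = [30, 55, 70, 110, 120]     # seconds
pageviews = [900, 700, 650, 300, 250]

r_length = pearson_r(text_length, avg_time_on_page)
r_views = pearson_r(pageviews, avg_time_on_page)
print(f"text length vs. time on page: {r_length:+.3f}")
print(f"page views  vs. time on page: {r_views:+.3f}")
```

A coefficient close to +1 indicates that the two variables grow together, a coefficient close to -1 that one decreases as the other increases, and a coefficient near 0 that there is little linear association.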
Pattern Matching. Once the input data is ready, the patterns found within the extracted data are identified and analyzed. For this, a comparison of clustering methods is carried out: one selected from the references in Sect. 2, KMeans [13], and two others, the Self-Organizing Map (SOM) [14] and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [15]. Before implementing these algorithms, it is necessary to establish the optimal number of groups k. This provides us with a stopping criterion, given that different groupings are possible. To determine it, we check where the largest jump between adjacent groups occurs. For this, the following validation methods are used:
– Elbow curve: This method looks at the percentage of variance explained as a function of the number of clusters. The number of clusters is chosen at the point where adding another cluster no longer improves the explained variance appreciably, hence the elbow criterion. This elbow cannot always be unambiguously identified [16].
– Silhouette: This method measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). This technique provides a succinct graphical representation of how well each object has been classified. The silhouette can be calculated with any distance metric, such as the Euclidean or Manhattan metric [17].
Both methods use here the Euclidean distance, which is the ordinary distance between two points in a Euclidean space, to determine the optimal k. These methods help us to avoid a bad decision, such as placing similar data into different groups. In Sect. 4 we will delve into the selection criteria.

Prediction Model. Once permanence has been segmented into groups, a second algorithm is applied to classify future news items into these groups and complete our model. The classification algorithms have been selected according to a number of criteria.
This section summarizes the functionality of each of them:
– Decision Tree (DT): A representation in the form of a tree whose branches split according to the values taken by the variables and end in a specific action. It is usually employed when the number of conditions is not very large, as in this study. See [18] for a detailed description of this algorithm.
– Gradient Boosting (GB): An algorithm that trains many models gradually, additively and sequentially. At insertion time, each new tree is fitted to a modified version of the original dataset. See [19] for a detailed description of this algorithm.
– K-Nearest Neighbors (KNN): This algorithm is based on the fact that two instances far enough apart in space, according to a distance function, are less likely to belong to the same class than two instances located close together. See [20] for a detailed description of this algorithm.
– Random Forest (RF): It aims to create independent and uncorrelated trees based on different and random input vectors following the same distribution. The result is the average of the trees generated during the process. See [21] for a detailed description of this algorithm.
– Support Vector Machine (SVM): It takes a set of input data and predicts, for each given input, the output class to which it belongs, which makes it a non-probabilistic classifier. See [22] for a detailed description of this algorithm.
– Artificial Neural Network (ANN): It consists of a set of units, called artificial neurons, connected together to transmit signals. The input information passes through the neural network, where it undergoes various operations, producing output values. See [23] for a detailed description of this algorithm.
The hyperparameters of these models are tuned ad hoc to the data of the digital medium by means of cross-validation.

Dashboards. Once the predictive model is completed, the most immediate way to show the results is through key performance indicators (KPIs), because they allow the digital newspaper to determine whether or not to publish a news item.

3.3 End Users
Finally, the result is a decision support tool: SmartData allows end users (editors) to improve the quality of their decisions on whether or not to publish a news item, based on the impact it has on readers as measured by their permanence.
4 Experimentation
This section shows how the system is fitted to the specific data of a digital newspaper, hereafter called "The Editorial Office" (to respect the anonymity of the medium), and to a specific country, in this case Spain. In addition, the comparison between the different algorithms used to evaluate the effectiveness of the results of our final model is explained. All calculations were performed on a PC with an Intel(R) Core i5-9400F CPU @ 2.90 GHz.

4.1 Data Collection
In order to test the proposed system, around 8 500 news items published during 2019 are used. The Matomo tool is used for the collection procedure, as it allows data to be extracted from the website of a digital medium. This tool allowed the extraction of two important datasets: one called News, which contains all the characteristics that define each news item, and another called Matomo, which contains the behavior of each news item every hour. In the end, this generates a total of approximately 2 million records. They have been stored in an unstructured database, MongoDB [24], given the high availability that it offers with automatic, fast recovery. Detailed information for the datasets is listed in Table 1. Subsequently, these data go through preprocessing filters in order to eliminate all possible errors that they may contain. In addition, the news dataset is cut to a total of 110 timestamps (or, equivalently, a period of 6 days), since the exploratory analysis showed that after this point the news did not generate any interest in the reader. In total, 1 557 933 records are used.

Table 1. List of features of the datasets.
Dataset   Characteristics
News      newsId, url, newsTitle, newsText, newsSubtitle, newsCategory, numberOfImages, numberOfVideos, numberOfOthers, publicationDate, newsTextLengthChar, multimediaOtherAssets, numberOfMultimediaFiles
Matomo    url, timestamp, pageviews, exits, avgTimeOnPage, uniquePageviews, exitRate, bounces, bouncesRate

4.2 SmartData
For the specific case of "The Editorial Office", the parameters of each phase of the system are adjusted based on experimentation, using the Python [25] programming language, as it is a high-level, general-purpose interpreted language. The four adjusted phases are defined as follows:

Exploratory Data Analysis. The Pearson's correlation coefficient test allows us to identify, in the first instance, whether there is a relationship between the variables to be analyzed. As can be seen in Fig. 2, there is a positive relationship between the reader permanence variable (avgTimeOnPage) and the length of the text (newsTextLengthChar), indicating a direct relationship between these variables. Another type of connection is observed with the variables pageviews, numberOfOthers and numberOfImages, whose values are negative, indicating that these variables act inversely with respect to the target variable.
Fig. 2. Pearson's correlation matrix. Characteristics with a coefficient greater than 0.05 are treated as significantly correlated. Positive values indicate a direct relationship between two variables: if one of them increases, the other increases. Negative values indicate the opposite case.
As we can see, the Pearson correlation is considered significant because it is greater than 0.05, which indicates a relationship between the variables and gives us confidence that we can work with them. This leads us to use the XGBoost Regressor technique to identify irrelevant attributes and filter out the columns that would be redundant in the model. As can be seen in Fig. 3, this technique ranks the variables, allowing the irrelevant ones to be discarded. In our specific case, the two variables with the greatest influence on the permanence of a reader on a news item are the text length (newsTextLengthChar) and the number of audios (numberOfOthers). Once the dataset has been cleaned of noise, we can proceed to pattern matching.
Fig. 3. XGBoost: feature selection. The features with the highest scores are the most important.
Pattern Matching. Before applying unsupervised learning techniques for pattern matching, it is necessary to establish the optimal number of groups k. For this we use the Elbow and Silhouette methods, which both indicate that the optimal value is 3 groups. The rule applied to choose k = 3, as we can see in Table 2, is to determine where the largest jump is found between adjacent groups, in this case between groups 3 and 4, and then to select, of these two, the one whose score is closest to 1, since this indicates that the groups are dense and well separated. Of the three algorithms tested in the first part of the model, two (KMeans and SOM) give a consistent interpretation, while DBSCAN indicates that there are five groups, which is somewhat more complex to interpret. Therefore, we decided to select KMeans, as it is simpler and faster to train. As we can see in Fig. 4, there are three main groups that characterize the permanence of the reader:
• Cluster 1, which we will call Worst permanence: This group is identified by a less extensive text and a greater number of audios.
• Cluster 2, which we will call Medium permanence: This group is characterized by a less extensive text and 0 audios.
• Cluster 3, which we will call High permanence: This group is recognized by a long text and an intermediate number of audios.
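To make the clustering step more concrete, the following is a minimal sketch of the KMeans procedure (Lloyd's algorithm) in plain Python, using the Euclidean distance. It is not the authors' implementation: the six toy points (a scaled text length and a number of audios) and the choice of the first k points as initial centroids are assumptions made only for this illustration.

```python
import math


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with Euclidean distance on a list of tuples."""
    # Assumption: the first k points serve as initial centroids (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical toy data (NOT from the paper): (scaled text length, number of audios).
points = [
    (0.10, 3.0), (0.90, 1.0), (0.20, 0.0),
    (0.15, 2.8), (0.95, 1.2), (0.25, 0.1),
]
labels, centroids = kmeans(points, k=3)
print(labels)
print(centroids)
```

On this toy data the short texts with many audios, the long texts with an intermediate number of audios, and the short texts without audios end up in three separate groups, mirroring the kind of interpretation given to the clusters above.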
Table 2. Performance evaluation to determine the number of groups k. The best number of groups is chosen where the greatest jump is found between adjacent groups.
Number of clusters   Silhouette score
2                    0.608
3                    0.636
4                    0.547
5                    0.529
6                    0.541
7                    0.559
8                    0.563
9                    0.531
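The selection rule described for Table 2 can be sketched in a few lines of Python. The scores are copied from Table 2; the function name `select_k` is hypothetical and the code is only one possible reading of the rule (largest change between adjacent values of k, then the score closest to 1 within that pair).

```python
# Silhouette scores copied from Table 2 (number of clusters -> score).
silhouette_by_k = {
    2: 0.608, 3: 0.636, 4: 0.547, 5: 0.529,
    6: 0.541, 7: 0.559, 8: 0.563, 9: 0.531,
}


def select_k(scores):
    """Pick k using the 'largest jump between adjacent groups' rule."""
    ks = sorted(scores)
    # Adjacent pair of k values with the largest absolute change in score.
    left, right = max(
        zip(ks, ks[1:]),
        key=lambda pair: abs(scores[pair[1]] - scores[pair[0]]),
    )
    # Within that pair, keep the k whose score is closest to 1.
    return max((left, right), key=lambda k: scores[k])


best_k = select_k(silhouette_by_k)
print(best_k)  # 3
```

The largest change (0.636 to 0.547) lies between k = 3 and k = 4, and of these two k = 3 has the score closest to 1, which agrees with the choice made in the text.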
Fig. 4. Clustering results. The three groups into which the data is divided after applying KMeans: the darkest color marks the group with the worst permanence, the intermediate color the group with medium permanence, and the lightest color the group with high reader permanence.

Prediction Model. Once the permanence groups were defined, we noticed that the "Medium permanence" group did not add significant value to the hypothesis investigated here, namely whether the impact of the news determines the decision the reader then takes. Following this conclusion, we split this group between the Worst and High permanence groups, assigning each item to the closer of the two, which leaves two main groups. With these new groups defined, we can move on to the second part of the model: the supervised learning algorithms mentioned in Sect. 3.2. To do this, we explore the space of common parameters and variations for each learning algorithm as exhaustively as is computationally feasible. The hyperparameters tested for each learning algorithm, together with the optimal values found, are summarized below:
• Decision Tree (DT): max_leaf_nodes in [50, 100, 500, 5000] and min_samples_split in [2, 4, 6]. Optimal: max_leaf_nodes = 50, min_samples_split = 2
• Gradient Boosting (GB): loss in ['deviance', 'exponential'], learning_rate in [0.02, 0.03] and n_estimators in [25, 50, 75]. Optimal: loss = 'deviance', learning_rate = 0.02, n_estimators = 25
• K-Nearest Neighbors (KNN): n_neighbors in [3, 12, 20, 22]. Optimal: n_neighbors = 20
• Random Forest (RF): n_estimators in [10, 50, 100, 150], min_samples_split in [2, 4, 6] and criterion in ['gini', 'entropy']. Optimal: n_estimators = 150, min_samples_split = 6, criterion = 'gini'
• Support Vector Machine (SVM): kernel in ['linear', 'poly', 'rbf'] and probability in [True, False]. Optimal: kernel = 'linear', probability = True
• Artificial Neural Network (ANN): optimizers in ['rmsprop', 'adam'], epochs in [500, 600, 700, 800] and batches in [100, 1000, 10000]. Optimal: optimizers = 'rmsprop', epochs = 600, batches = 10000

To apply these algorithms, the data were split using 10-fold stratified cross-validation, in order to avoid overfitting and obtain more reliable results. The following performance metrics are used:
• Receiver Operating Characteristic (AUC ROC): Plots the true positive rate (sensitivity) against the false positive rate (FPR) across decision thresholds; the area under this curve (AUC) summarizes it as a single score.
• Mean Absolute Error (MAE): A measure of the difference between two continuous variables (Eq. 1).
• Root Mean Squared Error (RMSE): The standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line the data points are, and the RMSE is a measure of how spread out these residuals are (Eq. 2).
• Accuracy: The proportion of instances that are classified correctly.

$\mathrm{MAE}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} |y_i - \hat{y}_i|$    (1)

$\mathrm{RMSE} = \sqrt{\frac{\sum_{t=1}^{T} (\hat{y}_t - y_t)^2}{T}}$    (2)
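As a worked illustration of Eqs. 1 and 2 and of the accuracy metric, the following is a minimal sketch in plain Python. The two label vectors are invented for the example (0 for one permanence group, 1 for the other) and do not come from the study.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, as in Eq. 1."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, as in Eq. 2."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def accuracy(y_true, y_pred):
    """Proportion of instances whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels (NOT from the paper): ten items, two of them misclassified.
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

print("accuracy:", accuracy(y_true, y_pred))
print("MAE:     ", mae(y_true, y_pred))
print("RMSE:    ", rmse(y_true, y_pred))
```

For labels coded as 0 and 1, every error contributes exactly 1 to both sums, so MAE equals 1 minus the accuracy and RMSE equals the square root of MAE. The Accuracy, MAE and RMSE columns of Table 3 follow this pattern (for example 0.878, 0.122 and 0.349).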
The results of this comparison are summarized in Table 3. It can be seen that the algorithms that offer the best results are Support Vector Machine, Gradient Boosting and the Neural Network, since they present the lowest error rates (both MAE and RMSE). The closer these values are to 0, the better the model. These three algorithms have high prediction accuracy, which supports the robustness of the models with respect to the input values. At this point we could have chosen any of the three algorithms. Although execution times are higher for SVM and ANN, we also have to consider the numbers of correctly classified (CC) and incorrectly classified (IC) instances. These results lead us to select SVM, since it maximizes the margin of the hyperplane that separates the classes, which helps the accuracy of the classification.

Table 3. Classifier performance comparison. The table shows the evaluation criteria that were taken into account to select the algorithm that best fits our problem.
Alg    Time (s)   CC     IC    Accuracy   MAE     RMSE    AUC ROC
DT     0.012      3135   435   0.878      0.122   0.349   0.905
GB     0.139      3142   428   0.880      0.120   0.346   0.910
KNN    0.045      3141   429   0.880      0.120   0.347   0.909
RF     1.89       2856   714   0.834      0.166   0.407   0.831
SVM    5.29       3143   427   0.880      0.120   0.346   0.910
ANN    8.39       3143   427   0.880      0.120   0.346   0.846
Dashboards. In the specific case of "The Editorial Office", the dashboard shown in Fig. 5 provides a quality report of our intelligent system. In this figure, the end user can observe the percentage of reader permanence by the text length of the news, as well as the global percentage. This information can be used to decide whether or not to publish. The dashboard can also be used to determine which categories are most frequently visited by readers.

Fig. 5. Dashboard of SmartData.

In general, the experimentation allowed us to use data (in the specific case of one digital medium) to identify the two characteristics that influence the permanence of a reader on a news item and, from them, to find three patterns of user behavior associated with the news. Once these complex patterns have been identified and the hyperparameters of the classification algorithms have been tuned ad hoc for the digital medium, we are able to predict future behaviors. Finally, the results are presented to end users in an interactive, attractive and simple way through this decision support tool.
5 Discussion
With the proposed system, the editor is able to enter a news item into the program, see the group to which it belongs (High permanence or Worst permanence), and thus decide whether or not to publish it. The data indicate that the longer the content (length of the text) of the news, together with an intermediate number of audios inserted in it, the greater the probability that a reader will be attracted to it. With this study, the hypothesis that the quality of the content is not the only factor with an impact on the permanence of the reader has been tested. This work has opened the door to new hypotheses: a) the quality of the audio in a digital news item influences the reader's permanence; b) trending topics influence a reader's stay on a specific article; c) certain features can be integrated into the Medium permanence cluster to improve this group and convert it into the High permanence cluster. Therefore, this analysis is a contribution to the continuous improvement of a news item before it is published, based on the dimension of the content. It also encourages new authors to search for new indicators that optimize communication between the journalist and the reader.
6 Conclusions and Future Work
In conclusion, one of the problems that the digital press faces on a daily basis has been addressed: predicting the impact of a news item on a reader before its publication, by analyzing the time users spend reading the news. This has been possible through the application of Machine Learning methods, which yield results that are 88% accurate. The development of this predictive tool will help not only digital press teams such as the editorial team (to determine whether or not to publish a news item), but also the marketing team, which will have the hourly permanence of readers on a news story and will therefore be able to determine which ads are best suited to it. Having completed this study, we would like to investigate other aspects related to the topic: a) work on improving the model, further improving its precision; b) put this study into production; c) extend the study to various digital media.

Acknowledgments. This work has been financed by the Ministry of Economy, Industry and Competitiveness of the Government of Spain and the European Regional Development Fund under grant no. RTC-2016-5503-7 (MINECO / FEDER, EU) for the project Smart Data Discovery and Natural Language Generation for Digital Media Performance. It has also been possible thanks to our partners Agile; Easy at University of Girona and DS4DS research group at La Salle - Ramon Llull University.
References
1. AIMC: Infografía Resumen 22 Navegantes en la Red. Available at https://www.aimc.es/otros-estudios-trabajos/navegantes-la-red/infografia-resumen-22onavegantes-la-red/
2. Parratt, S.: Por qué los jóvenes no leen periódicos. Análisis y propuestas. Libro Nuevos Medios, Nueva Comunicación. Salamanca (España) (2010). Available at http://campus.usal.es/comunicacion3punto0/comunicaciones/080.pdf
3. Casero-Ripollés, A.: Más allá de los diarios: el consumo de noticias de los jóvenes en la era digital. Comunicar 20(39), 151–158 (2012)
4. Epure, E.V., Kille, B., Ingvaldsen, J.E., Deneckere, R., Salinesi, C., Albayrak, S.: Recommending personalized news in short user sessions. In: Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 121–129, August 2017
5. Chen, X., Ghysels, E.: News - good or bad - and its impact on volatility predictions over multiple horizons. Rev. Financ. Stud. 24(1), 46–81 (2011)
6. Ahmed, M., Spagna, S., Huici, F., Niccolini, S.: A peek into the future: predicting the evolution of popularity in user generated content. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 607–616, February 2013
7. Caruccio, L., Deufemia, V., Polese, G.: Understanding user intent on the web through interaction mining. J. Vis. Lang. Comput. 31, 230–236 (2015)
8. Fernandes, K., Vinagre, P., Cortez, P.: A proactive intelligent decision support system for predicting the popularity of online news. In: Portuguese Conference on Artificial Intelligence, pp. 535–546. Springer, Cham, September 2015
9. Stokowiec, W., Trzciński, T., Wolk, K., Marasek, K., Rokita, P.: Shallow reading with deep learning: predicting popularity of online content using only its title. In: International Symposium on Methodologies for Intelligent Systems, pp. 136–145. Springer, Cham, June 2017
10. Kong, J., Wang, B., Liu, C., Wu, G.: An approach for predicting the popularity of online security news articles. In: 2018 IEEE Conference on Communications and Network Security (CNS), pp. 1–6. IEEE, May 2018
11. Pearson, K.: Determination of the coefficient of correlation. Science 30(757), 23–25 (1909)
12. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, August 2016
13. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, pp. 281–297, June 1967
14. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43(1), 59–69 (1982)
15. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, no. 34, pp. 226–231, August 1996
16. Ketchen, D.J., Shook, C.L.: The application of cluster analysis in strategic management research: an analysis and critique. Strat. Manag. J. 17(6), 441–458 (1996)
17. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20(1), 53–65 (1987)
18. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
19. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: ICML, vol. 96, pp. 148–156, July 1996
20. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967)
21. Ho, T.K.: Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE, August 1995
22. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
23. Zell, A.: Simulation neuronaler Netze, vol. 1, no. 5.3. Addison-Wesley, Bonn (1994)
24. MongoDB. https://www.mongodb.com/
25. Python. https://www.python.org/
Tropical Data Science over the Space of Phylogenetic Trees
Ruriko Yoshida
Department of Operations Research, Naval Postgraduate School, Monterey, USA
[email protected]
Abstract. Phylogenomics is a new field which applies tools from phylogenetics to genome data. Due to new technology and the increasing amount of data, we face new challenges in analyzing them over a space of phylogenetic trees. Because a space of phylogenetic trees with a fixed set of labels on leaves is not Euclidean, we cannot simply apply tools from data science. In this paper we first review some new developments of machine learning models using tropical geometry to analyze a set of phylogenetic trees over a tree space. Then we define machine learning models using tropical geometry for phylogenomics. We end this article with open problems.
Keywords: Machine learning models · Max-plus algebra · Tropical geometry

1 Introduction
Owing to the increasing amount of data available today, data science is one of the most exciting fields. It finds applications in statistics, computer science, business, biology, data security, physics, and so on. Most statistical models in data science assume that the data points in an input sample are distributed over a Euclidean space if they have numerical measurements. However, in some cases this assumption fails. For example, a space of phylogenetic trees with a fixed set of leaves is a union of lower-dimensional cones in $\mathbb{R}^e$, where $e = \binom{N}{2}$ and $N$ is the number of leaves [2]. Since the space of phylogenetic trees is a union of lower-dimensional cones, we cannot simply apply statistical models from data science to a set of phylogenetic trees [20].

There has been much work on spaces of phylogenetic trees. In 2001, Billera, Holmes, and Vogtmann (BHV) developed a notion of a space of phylogenetic trees with a fixed set of labels for leaves [4]. This space is the set of all possible phylogenetic trees with the fixed set of leaf labels, and it is a union of orthants, where each orthant contains all possible phylogenetic trees with a fixed tree topology. In this space, two orthants are adjacent if the tree topology for one orthant is one nearest neighbor interchange (NNI) move away from the tree topology for the other orthant. They also showed that this space is a CAT(0) space, so that there is a unique shortest connecting path, or geodesic, between any two points in the space under the CAT(0) metric.

Some machine learning models have been developed with the BHV metric. For example, Nye defined the first order principal component geodesic as the unique geodesic under the BHV metric over the tree space that minimizes the sum of residuals between the geodesic and each data point [14]. However, we cannot use a convex hull under the BHV metric for higher order principal components, because Lin et al. showed that the convex hull of three points under the BHV metric over the tree space can have arbitrarily high dimension [10].

In 2004, Speyer and Sturmfels showed that a space of phylogenetic trees with a given set of leaf labels is a tropical Grassmannian [17], which is a tropicalization of a linear space defined by a set of linear equations [20] with the max-plus algebra. The first attempt to apply tropical geometry to computational biology and statistical models was made by Pachter and Sturmfels [18]. The tropical metric with max-plus algebra on the tree space is known to behave very well [1,5]. For example, in contrast to the BHV metric, the dimension of the convex hull of $s$ tropical points is at most $s - 1$ [10]. Thus, this paper focuses on the tropical metric over tree spaces.

In this paper we review some developments in statistical learning models with the tropical metric with max-plus algebra on tree spaces as well as on the tropical projective space. Then we define some notions of machine learning models using tropical geometry for phylogenomics. We end this article with open problems.

R. Yoshida is supported in part by NSF DMS #1622369 and #1916037. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of any of the funders.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 340–361, 2022. https://doi.org/10.1007/978-3-030-82196-8_26
2 Data Science Overview

In this section, we briefly review statistical models in data science. For more details, we recommend An Introduction to Statistical Learning with Applications in R (http://faculty.marshall.usc.edu/gareth-james/ISL/). Data science has roughly two sub-branches: unsupervised learning and supervised learning (Fig. 1). In unsupervised learning, our goal is to compute descriptive statistics to see how data points are distributed over the sample space or how data points cluster together. In statistics, unsupervised learning corresponds to descriptive statistics. In supervised learning, our goal is to predict or infer the response variable from explanatory variables. In statistics, supervised learning corresponds to inferential statistics. As with unsupervised and supervised learning, several notions go by different names in machine learning and in statistics. We summarize some of these differences in Table 1.

2.1 Basic Definitions
1. Response variable – the variable of interest in a study or experiment. It is also called a dependent variable. In machine learning it is also called a target variable.
Table 1. Several notions have different names in statistics and data science.

Statistics               Data science
Descriptive statistics   Unsupervised learning
Inferential statistics   Supervised learning
Response variable        Target variable
Explanatory variable     Predictor variable, feature
Fig. 1. Overview of data science
2. Explanatory variable – a variable that explains changes in the response variable. It is also called an independent variable. In machine learning it is also called a feature or predictor.

2.2 Unsupervised Learning
Since unsupervised learning is descriptive, there are no response variables. In unsupervised learning, we try to learn how data points are distributed and how they are related to each other. There are two main categories: clustering and dimensionality reduction.
– Clustering – grouping data points into subsets by their “similarity”. The similarity measure is defined by the user. These groups are called clusters.
– Dimensionality reduction – reducing the dimension of the data points while minimizing the loss of information. One of the most commonly used methods is principal component analysis (PCA), a dimension reduction procedure based on linear algebra.
2.3 Supervised Learning
Supervised learning is inferential. Thus, an input data set contains a response variable and explanatory variables. Depending on the scale of the response variable, supervised learning separates into two groups: classification and regression. In classification, the response variable has a categorical scale; in regression, the response variable has a numerical (interval) scale.
– Classification – the response variable is categorical. Algorithms for classification include logistic regression, support vector machines, linear discriminant analysis, classification trees, random forests, and AdaBoost.
– Regression – the response variable is numerical. Algorithms for regression include linear regression, regression trees, lasso, ridge regression, random forests, and AdaBoost.
For more details, see:
– Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. http://faculty.marshall.usc.edu/gareth-james/ISL/.
3 Phylogenetics to Phylogenomics

In this section we review the basics of phylogenetics and the basic problem in phylogenomics.

3.1 Phylogenetic Trees
Evolutionary trees, or phylogenetic trees, are weighted trees that represent the evolutionary relationships among organisms over evolutionary time, together with their mutations. Phylogenetic trees are trees, that is, graphs consisting of vertices (nodes) and edges (branches). Each node in a phylogenetic tree represents a past or present taxon or population: exterior nodes represent taxa or populations at present, and interior nodes represent their ancestors. Therefore, we label only the exterior nodes (leaves) of a phylogenetic tree and not the interior nodes. Each edge in a phylogenetic tree has a weight, which represents the mutation rate multiplied by the evolutionary time from an ancestor to a taxon. Throughout this paper, we denote by $N$ the number of leaves and by $[N] := \{1, 2, \ldots, N\}$ the set of leaf labels of a tree.

There are two basic types of phylogenetic trees with leaf label set $[N]$: rooted and unrooted. As the names suggest, a rooted phylogenetic tree has a root and an unrooted phylogenetic tree does not. One can view a rooted phylogenetic tree as an unrooted phylogenetic tree with $N + 1$ leaves by assigning an extra label to its root.

For example, consider the rooted phylogenetic tree with three leaves shown in Fig. 2. Each leaf from the label set $[N] := \{1, 2, 3\}$ represents a current taxon. An interior node represents an extinct common ancestor at which a lineage split into two subgroups. Throughout this paper we assume that there exists a common ancestor of all leaves. If a phylogenetic tree is rooted, then the root of the tree represents the common ancestor of all leaves. We can view this rooted phylogenetic tree as an unrooted phylogenetic tree with leaf labels $\{0, 1, 2, 3\}$ by assigning the label 0 to the root, which stands for the common ancestor.
Fig. 2. Example of a rooted binary phylogenetic tree. (Figure labels: axis “Evolutionary Time”; root “Common Ancestor”; leaves 1, 2, 3.)
Each edge in a phylogenetic tree has a weight, which represents evolutionary time multiplied by the mutation rate. These weights can be interpreted in different ways depending on the evolutionary model. For more details on these evolutionary models, see [6,16]. Since a phylogenetic tree is a tree, there exists a unique path from a leaf $i \in [N]$ to a leaf $j \in [N]$. If the total weight of all edges in the unique path from the root to a leaf $i \in [N]$ in a rooted phylogenetic tree $T$ is the same for all leaves $i \in [N]$, then we call $T$ an equidistant tree. The height of an equidistant tree is the total weight of all edges in a path from the root to any leaf of the tree.

Phylogenetic Tree Reconstruction. Phylogenetic reconstruction uses genetic data to infer an evolutionary (phylogenetic) tree. Although we do not discuss the details of phylogenetic tree reconstruction in this paper, the reconstruction process involves multiple steps and techniques, and there are several types of tree reconstruction methods:
– Maximum likelihood estimation (MLE) methods – these methods describe evolution in terms of a discrete-state continuous-time Markov process.
– Maximum parsimony – reconstructs the tree with the fewest evolutionary changes that explain the data.
– Bayesian inference for trees – uses Bayes' theorem and Markov chain Monte Carlo (MCMC) to estimate the posterior distribution rather than obtaining a point estimate.
– Distance-based methods – reconstruct a tree from a distance matrix, defined below.
Interested readers may consult [6,16] for more details.

3.2 Space of Phylogenetic Trees
There are several ways to define a space of phylogenetic trees, with different metrics. One well-known tree space is the Billera-Holmes-Vogtmann tree space. In 2001, Billera, Holmes, and Vogtmann (BHV) introduced a continuous space that models the set of rooted phylogenetic trees with edge lengths on a fixed set of leaves. In this space, edge lengths in a tree are continuous and we assign a coordinate to each interior edge. Note that unrooted trees can be accommodated by designating a fixed leaf node as the root. The BHV tree space is not Euclidean, but it is non-positively curved, and thus any two points are connected by a unique shortest path through the space, called a geodesic. The distance between two trees is defined as the length of the geodesic connecting them. We do not consider the BHV tree space in this paper; interested readers may consult [4].

Throughout this paper, we assume that all phylogenetic trees are equidistant trees. An equidistant tree is a rooted phylogenetic tree such that the sum of all branch lengths in the unique path from the root to a leaf, called the height of the tree, is the same for all leaves of the tree. In phylogenetics this assumption is fairly mild, since the multispecies coalescent model assumes that all gene trees have the same height.

Example 1. Suppose N = 4. Consider the two rooted phylogenetic trees with leaf label set {a, b, c, d} in Fig. 3. Note that for each tree, the sum of branch lengths in the unique path from the root to each leaf is 1. Therefore they are equidistant trees of height 1.

For the space of equidistant trees with a fixed set of leaf labels, the BHV tree space might not be appropriate [7]. Therefore, we consider the space of ultrametrics. To define ultrametrics and their relation to equidistant trees, we first need to define dissimilarity maps.
Fig. 3. Examples of equidistant trees with N = 4 leaves, with the set of labels {a, b, c, d} and with height equal to 1.
Definition 1 (Dissimilarity Map). A dissimilarity map $w$ is a function $w : [N] \times [N] \to \mathbb{R}_{\geq 0}$ such that

$$w(i, j) \begin{cases} = 0 & \text{if } i = j, \\ \geq 0 & \text{otherwise,} \end{cases}$$

for all $i, j \in [N]$. If a dissimilarity map $w$ additionally satisfies the triangle inequality, that is, $w(i, j) \leq w(i, k) + w(k, j)$ for all $i, j, k \in [N]$, then $w$ is called a metric. If there exists a phylogenetic tree $T$ such that $w(i, j)$ coincides with the total branch length of the edges in the unique path from leaf $i$ to leaf $j$ for all leaves $i, j \in [N]$, then we call $w$ a tree metric. If a metric $w$ is a tree metric and $w(i, j)$ is the total branch length of all edges in the path from leaf $i$ to leaf $j$ for all leaves $i, j \in [N]$ in a phylogenetic tree $T$, then we say that $w$ realises the phylogenetic tree $T$, or that $w$ is a realisation of $T$.

Since

$$w(i, j) = \begin{cases} 0 & \text{if } i = j, \\ w(j, i) & \text{otherwise,} \end{cases}$$

to simplify notation we write $w = (w(1, 2), w(1, 3), \ldots, w(N - 1, N))$.

Example 2. We consider the equidistant trees in Fig. 3. The dissimilarity map obtained from the left tree in Fig. 3 is (1.2, 1.8, 2, 1.8, 2, 2). Similarly, the dissimilarity map obtained from the right tree in Fig. 3 is (0.2, 2, 2, 2, 2, 1). Since these dissimilarity maps are obtained from phylogenetic trees, they are tree metrics.
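As a rough illustration of Example 2, the following sketch computes the dissimilarity vector of an equidistant tree. The nested-tuple encoding `(height, children)` is a hypothetical representation chosen for this sketch, and the internal node heights (0.6, 0.9, 0.1, 0.5) are assumptions read off from the vectors in Example 2 as half of the corresponding pairwise distances. It uses the fact that in an equidistant tree the distance between two leaves is twice the height of their lowest common ancestor.

```python
from itertools import combinations

def leaves(node):
    """Return the leaf labels below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [leaf for child in children for leaf in leaves(child)]

def dissimilarity(tree):
    """Map each unordered leaf pair to twice the height of its lowest common ancestor."""
    w = {}

    def visit(node):
        if isinstance(node, str):
            return
        height, children = node
        groups = [leaves(child) for child in children]
        # Leaves in different child subtrees have this node as lowest common ancestor.
        for g1, g2 in combinations(groups, 2):
            for i in g1:
                for j in g2:
                    w[frozenset((i, j))] = 2 * height
        for child in children:
            visit(child)

    visit(tree)
    return w

def as_vector(w, labels):
    """List w in the order (1,2), (1,3), ..., (N-1,N) used in the text."""
    return tuple(w[frozenset(pair)] for pair in combinations(labels, 2))

# Hypothetical encodings of the two trees of Fig. 3 (heights are assumptions).
left = (1.0, [(0.9, [(0.6, ["a", "b"]), "c"]), "d"])
right = (1.0, [(0.1, ["a", "b"]), (0.5, ["c", "d"])])
labels = ["a", "b", "c", "d"]

left_w = as_vector(dissimilarity(left), labels)
right_w = as_vector(dissimilarity(right), labels)
print(left_w)
print(right_w)
```

Under these assumed heights the two printed vectors agree with the dissimilarity maps stated in Example 2.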
Definition 2 (Three Point Condition). If a metric $w$ satisfies the following condition: for every three distinct leaves $i, j, k \in [N]$, the maximum $\max\{w(i, j), w(i, k), w(j, k)\}$ is attained at least twice, then we say that $w$ satisfies the three point condition.

Definition 3 (Ultrametrics). If a metric $w$ satisfies the three point condition, then $w$ is called an ultrametric.

Theorem 1 ([8]). A dissimilarity map $w : [N] \times [N] \to \mathbb{R}_{\geq 0}$ is an ultrametric if and only if $w$ realises an equidistant tree with labels $[N]$. In addition, for each equidistant tree there exists a unique ultrametric. Conversely, for each ultrametric, there exists a unique equidistant tree.

Example 3. We again consider the equidistant trees in Fig. 3. The dissimilarity map obtained from the left tree in Fig. 3 is (1.2, 1.8, 2, 1.8, 2, 2). Similarly, the dissimilarity map obtained from the right tree in Fig. 3 is (0.2, 2, 2, 2, 2, 1). Since these phylogenetic trees are equidistant trees, these dissimilarity maps are ultrametrics by Theorem 1.

By Theorem 1, we can regard the space of ultrametrics with labels $[N]$ as the space of all equidistant trees with label set $[N]$. Let $\mathcal{U}_N$ be the space of ultrametrics for equidistant trees with leaf labels $[N]$. In fact we can write $\mathcal{U}_N$ as the tropicalization of a linear space. Let $L_N \subseteq \mathbb{R}^e$ be the linear subspace defined by the linear equations

$$x_{ij} - x_{ik} + x_{jk} = 0 \tag{1}$$

for $1 \leq i < j < k \leq N$. The max-plus tropicalization $\mathrm{Trop}(L_N)$ of the linear space $L_N$ cut out by the equations (1) is the tropical linear space consisting of the points $w \in \mathbb{R}^e$ such that $\max\{w_{ij}, w_{ik}, w_{jk}\}$ is attained at least twice for all $i, j, k \in [N]$. Note that this is exactly the three point condition of Definition 2.

Theorem 2 ([20, Theorem 2.18]). The image of $\mathcal{U}_N$ in the tropical projective torus $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ coincides with $\mathrm{Trop}(L_N)$.

For example, if $N = 4$, the space of ultrametrics $\mathcal{U}_4$ is a two-dimensional fan with 15 maximal cones.

For more details, see the following papers:
– C. Semple and M. Steel. Phylogenetics [16].
– Lin et al. Convexity in Tree Spaces [11].
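The three point condition of Definition 2 can be checked mechanically. The sketch below is one possible way to do so; the function name `is_ultrametric` and the tolerance parameter `tol` are illustrative choices and not part of the text. It assumes the dissimilarity map is given as a vector in the order $(w(1,2), w(1,3), \ldots, w(N-1,N))$.

```python
from itertools import combinations

def is_ultrametric(w, n, tol=1e-9):
    """Check the three point condition for a dissimilarity vector on n leaves.

    w lists the pairwise values in the order (1,2), (1,3), ..., (n-1,n).
    """
    index = {pair: k for k, pair in enumerate(combinations(range(n), 2))}
    for i, j, k in combinations(range(n), 3):
        small, mid, large = sorted((w[index[i, j]], w[index[i, k]], w[index[j, k]]))
        # The maximum is attained at least twice iff the two largest values coincide.
        if abs(large - mid) > tol:
            return False
    return True

print(is_ultrametric((1.2, 1.8, 2, 1.8, 2, 2), 4))    # left tree of Fig. 3
print(is_ultrametric((0.2, 2, 2, 2, 2, 1), 4))        # right tree of Fig. 3
print(is_ultrametric((1.2, 1.8, 2, 1.8, 2, 1.5), 4))  # perturbed vector, not an ultrametric
```

The first two vectors are those of Example 3; the third is a made-up perturbation in which the triple of the first, third, and fourth leaves attains its maximum only once.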
4 Basics in Tropical Geometry
Here we review some basics of tropical arithmetic and geometry and set up the notation used throughout this paper.

Definition 4 (Tropical arithmetic operations). Throughout this paper we perform arithmetic over the max-plus tropical semiring $(\mathbb{R} \cup \{-\infty\}, \oplus, \odot)$. Over this tropical semiring, the basic tropical arithmetic operations of addition and multiplication are defined as

$$a \oplus b := \max\{a, b\}, \qquad a \odot b := a + b, \qquad \text{where } a, b \in \mathbb{R} \cup \{-\infty\}.$$

Over this tropical semiring, $-\infty$ is the identity element under addition and 0 is the identity element under multiplication.

Example 4. Suppose we have $a = 1$, $b = -3$. Then

$$1 \oplus (-3) = \max\{1, -3\} = 1, \qquad 1 \odot (-3) = 1 + (-3) = -2.$$

Definition 5 (Tropical scalar multiplication and vector addition). For any $a, b \in \mathbb{R} \cup \{-\infty\}$ and for any $v = (v_1, \ldots, v_e)$, $w = (w_1, \ldots, w_e) \in (\mathbb{R} \cup \{-\infty\})^e$, tropical scalar multiplication and tropical vector addition are defined as

$$a \odot v = (a + v_1, a + v_2, \ldots, a + v_e),$$
$$a \odot v \oplus b \odot w = (\max\{a + v_1, b + w_1\}, \ldots, \max\{a + v_e, b + w_e\}).$$

Example 5. Suppose we have $v = (1, 2, 3)$, $w = (3, -2, 1)$, and $a = 1$, $b = -3$. Then we have

$$a \odot v = (1 + 1, 1 + 2, 1 + 3) = (2, 3, 4),$$

and

$$a \odot v \oplus b \odot w = (\max\{1 + 1, (-3) + 3\}, \max\{1 + 2, (-3) + (-2)\}, \max\{1 + 3, (-3) + 1\}) = (2, 3, 4).$$
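The operations of Definitions 4 and 5 translate directly into code. The following minimal sketch uses hypothetical helper names (`t_add`, `t_mul`, `t_scale`, `t_vec_add`) and reproduces the numbers of Examples 4 and 5.

```python
NEG_INF = float("-inf")  # identity element for tropical addition

def t_add(a, b):
    """Tropical addition: a (+) b = max(a, b)."""
    return max(a, b)

def t_mul(a, b):
    """Tropical multiplication: a (.) b = a + b."""
    return a + b

def t_scale(a, v):
    """Tropical scalar multiplication: add a to every coordinate of v."""
    return tuple(a + vi for vi in v)

def t_vec_add(v, w):
    """Tropical vector addition: coordinate-wise maximum."""
    return tuple(max(vi, wi) for vi, wi in zip(v, w))

# Example 4
print(t_add(1, -3), t_mul(1, -3))

# Example 5
v, w = (1, 2, 3), (3, -2, 1)
a, b = 1, -3
print(t_scale(a, v))
print(t_vec_add(t_scale(a, v), t_scale(b, w)))
```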
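Throughout this paper we consider the tropical projective torus, that is, the quotient space $\mathbb{R}^e/\mathbb{R}\mathbf{1}$, where $\mathbf{1} := (1, 1, \ldots, 1)$ is the all-ones vector. In this space two vectors are identified if they differ by a scalar multiple of $\mathbf{1}$.

Example 6. Consider $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ with $e = 3$ and let $v = (1, 2, 3)$. Then over $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ we have the following equality:

$$v = (1, 2, 3) = (0, 1, 2).$$

Note that $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ can be identified with $\mathbb{R}^{e-1}$, for instance by normalizing the first coordinate to 0.

Example 7. Consider $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ with $e = 3$ and let $v = (1, 2, 3)$, $w = (1, 1, 1)$. Also let $a = -1$, $b = 3$. Then we have

$$a \odot v \oplus b \odot w = (\max(-1 + 1, 3 + 1), \max(-1 + 2, 3 + 1), \max(-1 + 3, 3 + 1)) = (4, 4, 4) = (0, 0, 0).$$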
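To compare points of the tropical projective torus in code, one convenient convention (an assumption of this sketch, not something prescribed by the text) is to pick the representative whose first coordinate is 0. The helper names `normalize`, `same_point`, and `t_combo` are hypothetical.

```python
def normalize(v):
    """Representative of v in R^e / R1 whose first coordinate is 0."""
    return tuple(vi - v[0] for vi in v)

def same_point(v, w):
    """True if v and w differ by a multiple of the all-ones vector."""
    return normalize(v) == normalize(w)

def t_combo(a, v, b, w):
    """Tropical linear combination a (.) v (+) b (.) w in max-plus arithmetic."""
    return tuple(max(a + vi, b + wi) for vi, wi in zip(v, w))

# Example 6: (1, 2, 3) and (0, 1, 2) are the same point.
print(normalize((1, 2, 3)))

# Example 7: the combination equals (4, 4, 4), which is the point (0, 0, 0).
combo = t_combo(-1, (1, 2, 3), 3, (1, 1, 1))
print(combo, normalize(combo))
```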
In order to conduct a statistical analysis we need a distance measure between two vectors in the space. Thus we discuss a distance between two vectors in the tropical projective space. In fact, the following distance between two vectors in the tropical projective space is a metric.

Definition 6 (Generalized Hilbert projective metric). For any two points $v, w \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, the tropical distance $d_{tr}(v, w)$ between $v$ and $w$ is defined as

$$d_{tr}(v, w) = \max_{i,j} \{\, |v_i - w_i - v_j + w_j| : 1 \leq i < j \leq e \,\} = \max_i \{v_i - w_i\} - \min_i \{v_i - w_i\}, \tag{2}$$

where $v = (v_1, \ldots, v_e)$ and $w = (w_1, \ldots, w_e)$. This distance is a metric on $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. Therefore, we call $d_{tr}$ the tropical metric.

Example 8. Suppose $u_1, u_2 \in \mathbb{R}^3/\mathbb{R}\mathbf{1}$ such that $u_1 = (0, 0, 0)$, $u_2 = (0, 3, 1)$. Then the tropical distance between $u_1$ and $u_2$ is

$$d_{tr}(u_1, u_2) = \max(0, -3, -1) - \min(0, -3, -1) = 0 - (-3) = 3.$$

As with the BHV metric over the BHV tree space, we need to define a geodesic over the space of ultrametrics. In order to define a tropical geodesic we first need to define a tropical polytope.

Definition 7. Suppose we have a finite subset $V = \{v_1, \ldots, v_s\} \subset \mathbb{R}^e$. The tropical convex hull or tropical polytope of $V$ is the smallest tropically convex subset of $\mathbb{R}^e$ containing $V$, written as the set of all tropical linear combinations of $V$:

$$\mathrm{tconv}(V) = \{a_1 \odot v_1 \oplus a_2 \odot v_2 \oplus \cdots \oplus a_s \odot v_s : v_1, \ldots, v_s \in V \text{ and } a_1, \ldots, a_s \in \mathbb{R}\}.$$

A tropical line segment between two points $v_1, v_2$ is the tropical convex hull of the two points $\{v_1, v_2\}$.

Note that the length between two points $u_1, u_2 \in \mathbb{R}^3/\mathbb{R}\mathbf{1}$ along the tropical line segment between $u_1$ and $u_2$ equals the tropical distance $d_{tr}(u_1, u_2)$. In this paper we define the tropical line segment between two points as the tropical geodesic between these points.
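The two expressions for the tropical distance in (2) are easy to compare numerically. The sketch below implements both (the function names `d_tr` and `d_tr_pairwise` are illustrative) and evaluates them on the points of Example 8; it also shows that the value does not change when a multiple of the all-ones vector is added to one argument.

```python
from itertools import combinations

def d_tr(v, w):
    """Tropical distance: max_i (v_i - w_i) - min_i (v_i - w_i)."""
    diff = [vi - wi for vi, wi in zip(v, w)]
    return max(diff) - min(diff)

def d_tr_pairwise(v, w):
    """Same distance via the pairwise form max_{i<j} |v_i - w_i - v_j + w_j|."""
    pairs = combinations(range(len(v)), 2)
    return max(abs(v[i] - w[i] - v[j] + w[j]) for i, j in pairs)

u1, u2 = (0, 0, 0), (0, 3, 1)
print(d_tr(u1, u2), d_tr_pairwise(u1, u2))

# Adding a multiple of (1, 1, 1) to u1 leaves the distance unchanged.
print(d_tr((5, 5, 5), u2))
```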
Example 9. Suppose $u_1, u_2 \in \mathbb{R}^3/\mathbb{R}\mathbf{1}$ such that $u_1 = (0, 0, 0)$, $u_2 = (0, 3, 1)$. From the previous example, the tropical distance between $u_1$ and $u_2$ is $d_{tr}(u_1, u_2) = 3$. The tropical line segment between $u_1$ and $u_2$ consists of the two ordinary line segments joining, in this order, the three points

$$(0, 0, 0), \quad (0, 2, 0), \quad (0, 3, 1).$$

The length of the line segment is

$$d_{tr}((0, 0, 0), (0, 2, 0)) + d_{tr}((0, 2, 0), (0, 3, 1)) = 2 + 1 = 3.$$

Example 10. Suppose we have a set $V = \{v_1, v_2, v_3\} \subset \mathbb{R}^3/\mathbb{R}\mathbf{1}$ where $v_1 = (0, 0, 0)$, $v_2 = (0, 3, 1)$, $v_3 = (0, 2, 5)$. The tropical convex hull $\mathrm{tconv}(V)$ of $V$ is shown in Fig. 4.
Fig. 4. Tropical polytope of three points (0, 0, 0), (0, 3, 1), (0, 2, 5) in $\mathbb{R}^3/\mathbb{R}\mathbf{1}$. (Points labelled in the figure: (0, 0, 0), (0, 2, 0), (0, 3, 1), (0, 0, 3), (0, 3, 5), (0, 2, 5).)
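The breakpoints in Example 9 can be recovered from Definition 7. Working modulo the all-ones vector, every point of the tropical line segment between $u$ and $v$ can be written as $\lambda \odot u \oplus v$ for some $\lambda \in \mathbb{R}$, and the shape of this point only changes when $\lambda$ passes one of the values $v_i - u_i$. The sketch below enumerates these values; the function names are hypothetical, and normalizing the first coordinate to 0 is a convention of this sketch.

```python
def normalize(v):
    """Representative in R^e / R1 with first coordinate 0."""
    return tuple(vi - v[0] for vi in v)

def d_tr(v, w):
    """Tropical distance from equation (2)."""
    diff = [vi - wi for vi, wi in zip(v, w)]
    return max(diff) - min(diff)

def tropical_segment(u, v):
    """Breakpoints of the max-plus tropical line segment, listed from v to u."""
    lambdas = sorted(set(vi - ui for ui, vi in zip(u, v)))
    points = []
    for lam in lambdas:
        # The point lam (.) u (+) v, taken modulo the all-ones vector.
        p = normalize(tuple(max(lam + ui, vi) for ui, vi in zip(u, v)))
        if not points or points[-1] != p:
            points.append(p)
    return points

def path_length(points):
    """Sum of tropical distances between consecutive breakpoints."""
    return sum(d_tr(p, q) for p, q in zip(points, points[1:]))

seg = tropical_segment((0, 0, 0), (0, 3, 1))
print(seg, path_length(seg))

# Two edges of the tropical polytope of Example 10.
print(tropical_segment((0, 0, 0), (0, 2, 5)))
print(tropical_segment((0, 3, 1), (0, 2, 5)))
```

For the points of Example 9 this yields the breakpoint (0, 2, 0) and total length 3; for the other two pairs the intermediate breakpoints are (0, 0, 3) and (0, 3, 5), which are among the points labelled in Fig. 4.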
For more details, see the following papers:
– D. Maclagan and B. Sturmfels. Introduction to Tropical Geometry [9].
5 Tropical Unsupervised Learning
Unsupervised learning is descriptive, and we do not yet know much about descriptive statistics using tropical geometry with max-plus algebra, for example, tropical Fermat-Weber (FW) points and tropical Fréchet means. In this section we discuss tropical FW points and tropical Fréchet means: what they are, what we know, and what we do not know. At the end of this section, we discuss tropical principal component analysis (PCA). Throughout this section we consider the tropical projective torus $\mathbb{R}^e/\mathbb{R}\mathbf{1}$.

5.1 Tropical Fermat-Weber Points
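Suppose we have a sample $\{v_1, \ldots, v_s\}$ in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. A tropical Fermat-Weber point $y$ minimizes the sum of tropical distances to the given points:

$$y := \arg\min_{z \in \mathbb{R}^e/\mathbb{R}\mathbf{1}} \sum_{i=1}^{s} d_{tr}(z, v_i). \tag{3}$$

Tropical Fermat-Weber points of a sample $\{v_1, \ldots, v_s\}$ in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ have the following properties.

Proposition 1. Suppose $M = \mathbb{R}^e/\mathbb{R}\mathbf{1}$. Then the set of tropical Fermat-Weber points of a sample $\{v_1, \ldots, v_s\}$ in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ is a convex polytope. It consists of all optimal solutions $y = (y_1, \ldots, y_e)$ to the following linear program, where $v_i = (v^i_1, \ldots, v^i_e)$:

$$\begin{aligned}
\text{minimize} \quad & d_1 + d_2 + \cdots + d_s \\
\text{subject to} \quad & y_j - y_k - v^i_j + v^i_k \geq -d_i \quad \text{for all } i = 1, \ldots, s \text{ and } 1 \leq j, k \leq e, \\
& y_j - y_k - v^i_j + v^i_k \leq d_i \quad \text{for all } i = 1, \ldots, s \text{ and } 1 \leq j, k \leq e.
\end{aligned} \tag{4}$$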
By Proposition 1, a sample can have infinitely many tropical Fermat-Weber points. If we focus on the space of ultrametrics $\mathcal{U}_N$ for equidistant trees with $N$ leaves, then we have the following proposition:

Proposition 2. If a sample $\{v_1, \ldots, v_s\}$ lies in the space of ultrametrics $\mathcal{U}_N$, then its tropical Fermat-Weber points are in $\mathcal{U}_N$.

In [12], we showed explicitly how to compute the set of all possible Fermat-Weber points in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. However, we do not know the minimal set of inequalities needed to define the set of all tropical Fermat-Weber points of a given sample. Thus here is an open problem:

Problem 1. What is the minimal set of inequalities needed to define the set of all tropical Fermat-Weber points of a given sample? What is the time complexity of computing the set of tropical Fermat-Weber points of a sample of $s$ points in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$? Is there an algorithm, polynomial in $s$ and $e$, to compute the vertices of the polytope of tropical Fermat-Weber points of a sample of $s$ points in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$?

For more details, see the following papers:
– B. Lin and R. Yoshida. Tropical Fermat-Weber Points [12].
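The following sketch does not solve the linear program (4); it only illustrates how (4) relates to the objective (3). For a fixed candidate $y$, the smallest $d_i$ satisfying both families of constraints is $\max_{j,k} |y_j - y_k - v^i_j + v^i_k|$, which equals $d_{tr}(y, v_i)$. The two-point sample is taken from Example 8, the candidate points are chosen for illustration, and the function names are hypothetical.

```python
from itertools import product

def d_tr(v, w):
    """Tropical distance from equation (2)."""
    diff = [vi - wi for vi, wi in zip(v, w)]
    return max(diff) - min(diff)

def smallest_feasible_d(y, v):
    """Smallest d_i allowed by the constraints of (4) for one sample point v."""
    e = len(y)
    return max(abs(y[j] - y[k] - v[j] + v[k]) for j, k in product(range(e), repeat=2))

def fw_objective(y, sample):
    """Sum of tropical distances from y to the sample, as in (3)."""
    return sum(d_tr(y, v) for v in sample)

sample = [(0, 0, 0), (0, 3, 1)]
candidates = [(0, 0, 0), (0, 2, 0), (0, 3, 1), (0, 3, 0)]
for y in candidates:
    lp_value = sum(smallest_feasible_d(y, v) for v in sample)
    print(y, fw_objective(y, sample), lp_value)
```

By the triangle inequality the objective for this sample cannot fall below $d_{tr}(u_1, u_2) = 3$, so the first three candidates, which lie on the tropical line segment of Example 9, are all tropical Fermat-Weber points, while (0, 3, 0) gives the larger value 4. This is a small instance of the non-uniqueness noted after Proposition 1.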
5.2 Tropical Fréchet Means
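Suppose we have a sample $\{v_1, \ldots, v_s\}$ in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. A tropical Fréchet mean $y$ minimizes the sum of squared tropical distances to the given points:

$$y := \arg\min_{z \in \mathbb{R}^e/\mathbb{R}\mathbf{1}} \sum_{i=1}^{s} d_{tr}(z, v_i)^2. \tag{5}$$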
Just as we formulated computing a tropical Fermat-Weber point as a linear programming problem, we can formulate computing a tropical Fréchet mean as a quadratic programming problem, where $v_i = (v^i_1, \ldots, v^i_e)$:

$$\begin{aligned}
\text{minimize} \quad & d_1^2 + d_2^2 + \cdots + d_s^2 \\
\text{subject to} \quad & y_j - y_k - v^i_j + v^i_k \geq -d_i \quad \text{for all } i = 1, \ldots, s \text{ and } 1 \leq j, k \leq e, \\
& y_j - y_k - v^i_j + v^i_k \leq d_i \quad \text{for all } i = 1, \ldots, s \text{ and } 1 \leq j, k \leq e.
\end{aligned} \tag{6}$$

While we know some properties of tropical Fermat-Weber points, we do not know much about tropical Fréchet means. Here are some basics on tropical Fréchet means.

Proposition 3. Suppose $M = \mathbb{R}^e/\mathbb{R}\mathbf{1}$. Then the set of tropical Fréchet means of a sample $\{v_1, \ldots, v_s\}$ in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ is a convex polytope. It consists of all optimal solutions $y = (y_1, \ldots, y_e)$ to the following quadratic program:

$$\begin{aligned}
\text{minimize} \quad & d_1^2 + d_2^2 + \cdots + d_s^2 \\
\text{subject to} \quad & y_j - y_k - v^i_j + v^i_k \geq -d_i \quad \text{for all } i = 1, \ldots, s \text{ and } 1 \leq j, k \leq e, \\
& y_j - y_k - v^i_j + v^i_k \leq d_i \quad \text{for all } i = 1, \ldots, s \text{ and } 1 \leq j, k \leq e.
\end{aligned} \tag{7}$$
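To get a feeling for the objective in (5), the sketch below evaluates it by brute force on a coarse grid for the two-point sample of Example 8. This is only an illustration under stated assumptions: the grid range and step size are arbitrary choices, points are represented with first coordinate 0, and a grid search is not an efficient or general algorithm for tropical Fréchet means.

```python
def d_tr(v, w):
    """Tropical distance from equation (2)."""
    diff = [vi - wi for vi, wi in zip(v, w)]
    return max(diff) - min(diff)

def frechet_objective(y, sample):
    """Sum of squared tropical distances from y to the sample, as in (5)."""
    return sum(d_tr(y, v) ** 2 for v in sample)

sample = [(0, 0, 0), (0, 3, 1)]

# Grid of representatives (0, a, b) with a, b in {-1.0, -0.5, ..., 4.0}.
grid = [k / 2 for k in range(-2, 9)]
candidates = [(0.0, a, b) for a in grid for b in grid]

best = min(frechet_objective(y, sample) for y in candidates)
minimizers = [y for y in candidates if frechet_objective(y, sample) == best]
print(best)
print(minimizers)
```

On this grid the smallest value is 4.5, attained at several grid points with second coordinate 1.5, each at tropical distance 1.5 from both sample points. Since the two distances sum to at least 3, the value 4.5 is also a lower bound for the objective over the whole space, so these grid points are tropical Fréchet means of this particular sample.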
Still, we do not know much about tropical Fréchet means. First, we have the following problem.

Problem 2. If a sample $\{v_1, \ldots, v_s\}$ lies in the space of ultrametrics $\mathcal{U}_N$, are its tropical Fréchet means in $\mathcal{U}_N$?

We also do not yet know how to compute tropical Fréchet means efficiently. So we have the following problem:

Problem 3. Suppose we have a sample $\{v_1, \ldots, v_s\}$ in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. Is there an algorithm to compute all tropical Fréchet means in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$?

5.3 Tropical Principal Component Analysis (PCA)
Principal component analysis (PCA) is one of the most popular methods to reduce the dimensionality of input data and to visualize them. Classical PCA takes data points in a high-dimensional Euclidean space and represents them in a lower-dimensional plane in such a way that the residual sum of squares is minimized. We cannot directly apply classical PCA to a set of phylogenetic trees because the space of phylogenetic trees with a fixed number of leaves is not Euclidean; it is a union of lower-dimensional polyhedral cones in $\mathbb{R}^{\binom{N}{2}}$, where $N$ is the number of leaves.
There is a statistical method similar to PCA over the space of phylogenetic trees with a fixed set of leaves in terms of the Billera-Holmes-Vogtmann (BHV) metric. In 2001, Billera, Holmes, and Vogtmann developed the space of phylogenetic trees with fixed labeled leaves and showed that it is a CAT(0) space [4]. Therefore, a geodesic between any two points in the space of phylogenetic trees is unique. Shortly after that, Nye presented an algorithm in [14] to compute the first order principal component over the space of phylogenetic trees on $N$ leaves with the BHV metric. Nye [14] used the convex hull of two points, i.e., the geodesic, on the tree space as the first order principal component. However, this idea cannot be generalized to higher order principal components with the BHV metric, since the convex hull of three points with the BHV metric over the tree space can have arbitrarily high dimension [11].

On the other hand, the tropical metric on the tree space in terms of the max-plus algebra is well studied and well behaved [13]. For example, the dimension of the convex hull of $s$ points in terms of the tropical metric is at most $s - 1$. Using the tropical metric, Yoshida et al. [20] introduced a statistical method similar to PCA with max-plus tropical arithmetic in two ways: the tropical principal linear space, that is, the best-fit Stiefel tropical linear space of fixed dimension closest to the data points in the tropical projective torus; and the tropical principal polytope, that is, the best-fit tropical polytope with a fixed number of vertices closest to the data points. The authors showed that the latter object can be computed by solving a mixed-integer programming problem, and they applied the second definition to data sets consisting of collections of phylogenetic trees. Nevertheless, exactly computing the best-fit tropical polytope can be expensive due to the high dimensionality of the mixed-integer programming problem.

Definition 8. Let $P = \mathrm{tconv}(D^{(1)}, \ldots, D^{(s)}) \subseteq \mathbb{R}^e/\mathbb{R}\mathbf{1}$ be a tropical polytope with vertices $\{D^{(1)}, \ldots, D^{(s)}\} \subset \mathbb{R}^e/\mathbb{R}\mathbf{1}$ and let $S = \{u_1, \ldots, u_n\}$ be a sample from the space of ultrametrics $\mathcal{U}_N$. Let

$$\Pi_P(S) := \sum_{i=1}^{|S|} d_{tr}(u_i, u_i'),$$

where $u_i'$ is the tropical projection of $u_i$ onto the tropical polytope $P$. Then the vertices $D^{(1)}, \ldots, D^{(s)}$ of the tropical polytope $P$ are called the $(s - 1)$-th order tropical principal polytope of $S$ if $P$ minimizes $\Pi_P(S)$ over all possible tropical polytopes with $s$ vertices.

In [15], Page et al. developed a heuristic method to compute the tropical principal polytope and applied it to empirical data sets: influenza genome data collected in New York City, Apicomplexa, and African coelacanth genome data sets. Page et al. also showed the following theorem and lemma:

Theorem 3 ([15]). Let $P = \mathrm{tconv}(D^{(1)}, \ldots, D^{(s)}) \subseteq \mathbb{R}^e/\mathbb{R}\mathbf{1}$ be a tropical polytope spanned by ultrametrics in $\mathcal{U}_N$. Then $P \subseteq \mathcal{U}_N$, and any two points $x$ and $y$ in the same cell of $P$ are also ultrametrics with the same tree topology.
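Definition 8 relies on the tropical projection onto a tropical polytope, which the text does not spell out. The sketch below assumes the standard max-plus projection formula from the tropical convexity literature: the projection of $u$ is $\lambda_1 \odot D^{(1)} \oplus \cdots \oplus \lambda_s \odot D^{(s)}$ with $\lambda_l = \min_i (u_i - D^{(l)}_i)$. This formula, the function names, and the sample points are assumptions made for illustration; the vertices are those of Example 10.

```python
def normalize(v):
    """Representative in R^e / R1 with first coordinate 0."""
    return tuple(vi - v[0] for vi in v)

def d_tr(v, w):
    """Tropical distance from equation (2)."""
    diff = [vi - wi for vi, wi in zip(v, w)]
    return max(diff) - min(diff)

def project(u, vertices):
    """Tropical projection of u onto tconv(vertices), max-plus convention (assumed formula)."""
    proj = None
    for d in vertices:
        lam = min(ui - di for ui, di in zip(u, d))  # largest shift with lam + d <= u
        shifted = tuple(lam + di for di in d)
        if proj is None:
            proj = shifted
        else:
            proj = tuple(max(p, s) for p, s in zip(proj, shifted))
    return normalize(proj)

def projection_error(sample, vertices):
    """The quantity Pi_P(S) of Definition 8 for the polytope spanned by the vertices."""
    return sum(d_tr(u, project(u, vertices)) for u in sample)

vertices = [(0, 0, 0), (0, 3, 1), (0, 2, 5)]
print(project((0, 2, 0), vertices))   # a point of the polytope is left unchanged
print(project((0, 5, 1), vertices))   # an illustrative point outside the polytope
print(projection_error([(0, 2, 0), (0, 5, 1)], vertices))
```

A search for the tropical principal polytope would minimize `projection_error` over the choice of vertices; the sketch only evaluates it for one fixed polytope.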
Lemma 1 ([15]). Let $P = \mathrm{tconv}(D^{(1)}, \ldots, D^{(s)}) \subseteq \mathbb{R}^e/\mathbb{R}\mathbf{1}$ be a tropical polytope spanned by ultrametrics. The origin $\mathbf{0}$ is contained in $P$ if and only if the path between each pair of leaves $i, j$ passes through the root of some $D^{(l)}$.

There are still some open problems on tropical PCA. Here is one question we can work on:

Conjecture 1. There exists a tropical Fermat-Weber point $x^* \in \mathcal{U}_N$ of a sample $D^{(1)}, \ldots, D^{(n)}$ of ultrametric trees which is contained in the $s$-th order tropical PCA of the data set for $s \geq 1$.

For more details, see the following papers:
– R. Yoshida, L. Zhang, and X. Zhang. Tropical Principal Component Analysis and its Application to Phylogenetics [20].
– R. Page, R. Yoshida, and L. Zhang. Tropical principal component analysis on the space of ultrametrics [15].
6 Tropical Supervised Learning
Not much has been done so far on tropical supervised learning, although there is some work on classification. Recently Tang et al. [19] introduced a notion of tropical support vector machines (SVMs). In this section we discuss tropical SVMs and introduce a notion of tropical linear discriminant analysis (LDA).

6.1 Tropical Classifications
For tropical classification, we consider the binary response variables. Suppose we have a data set given that {(x1 , y1 ), . . . , (xn , yn )} , where x1 , . . . , xn ∈ Re /R1 and y1 , . . . , yn ∈ {0, 1}. Therefore, the response variable yi is binary. Thus, we can partition a sample of data points x1 , . . . , xn ∈ Re /R1 into two sets P and Q such that xi ∈ P if yi = 0, xi ∈ Q if yi = 1. Tropical Support Vector Machine SVMs. A support vector machine (SVM) is a supervised learning model to predict the categorical response variable. For a binary response variable, a classical linear SVM classifies data points by finding a linear hyperplane to separate the data points into two groups. In this paper we refer a classical SVM as a classical linear SVM over an Euclidean space Re with L2 norm. For an Euclidean space Re , there are two types of SVMs: hard margin SVMs and soft margin SVMs. A hard margin SVM is a model with the assumption
Tropical Data Science over the Space of Phylogenetic Trees
355
that all data points can be separated by a linear hyperplane into two groups without errors. A soft margin SVM is a model which maximizes the margin while allowing some data points to lie on the wrong side of the hyperplane. Similar to a classical SVM over a Euclidean space, a tropical SVM is a supervised learning model which classifies data points by finding a tropical hyperplane that separates them. In [19], by analogy with classical SVMs, Tang et al. defined two types of tropical SVMs: hard margin tropical SVMs and soft margin tropical SVMs. A hard margin tropical SVM, introduced in [3], is, like a classical hard margin SVM, a model that finds a tropical hyperplane which maximizes the margin, that is, the minimum tropical distance from the data points to the tropical hyperplane ($z$ in Fig. 5), and which separates these data points into open sectors. Note that an open sector of a tropical hyperplane can be seen as a tropical version of an open half space defined by a hyperplane. A soft margin tropical SVM, introduced in [19], is a model that finds a tropical hyperplane which maximizes the margin but also allows some data points to lie in a wrong open sector. The authors of [3] showed that computing a tropical hyperplane for a hard margin tropical SVM from a given sample in the tropical projective space can be formulated as a linear programming problem. Again, note that, similar to classical hard margin SVMs, hard margin tropical SVMs assume that there exists a tropical hyperplane that separates all data points in the tropical projective space into open sectors (see the left figure in Fig. 5). In order to discuss tropical SVMs in detail, we need to define a tropical hyperplane and its open sectors.

Definition 9. Suppose $\omega := (\omega_1, \ldots, \omega_e) \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$. The tropical hyperplane defined by $\omega$, denoted by $H_\omega$, is the set of all points $x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$ such that $\max\{\omega_1 + x_1, \ldots, \omega_e + x_e\}$ is attained at least twice. $\omega$ is called the normal vector of $H_\omega$.

Definition 10. A tropical hyperplane $H_\omega$ divides the tropical projective space $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ into $e$ components. These $e$ components are called open sectors and are given by
$$S_\omega^i := \{\, x \in \mathbb{R}^e/\mathbb{R}\mathbf{1} \mid \omega_i + x_i > \omega_j + x_j,\ \forall j \neq i \,\}, \quad i = 1, \ldots, e.$$

Example 11. Consider $\mathbb{R}^3/\mathbb{R}\mathbf{1}$. A tropical hyperplane in $\mathbb{R}^3/\mathbb{R}\mathbf{1}$ has three open sectors, as seen in Fig. 5. Note that $\mathbb{R}^3/\mathbb{R}\mathbf{1}$ is isometric to $\mathbb{R}^2$.

Now we define the tropical distance from a point to a tropical hyperplane.

Definition 11. The tropical distance from a point $x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$ to the tropical hyperplane $H_\omega$ is defined as
$$d_{tr}(x, H_\omega) := \min\{\, d_{tr}(x, y) \mid y \in H_\omega \,\}.$$
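Definitions 9 and 10 can be made concrete with a short numerical sketch. The function names below are hypothetical and are not from the paper; the sketch assumes the standard tropical metric $d_{tr}(x, y) = \max_i (x_i - y_i) - \min_i (x_i - y_i)$ and uses 0-based sector indices.

```python
import numpy as np


def tropical_distance(x, y):
    """Tropical metric: max_i(x_i - y_i) - min_i(x_i - y_i).

    The value does not change when a constant is added to every
    coordinate of x or y, so it is well defined on R^e / R1.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff.max() - diff.min())


def open_sector(x, omega, tol=1e-12):
    """Return the 0-based index i of the open sector S^i_omega containing x.

    Returns None when max(omega + x) is attained at least twice,
    i.e. when x lies on the tropical hyperplane H_omega.
    """
    s = np.asarray(x, dtype=float) + np.asarray(omega, dtype=float)
    top_two = np.sort(s)[-2:]           # second largest, largest
    if top_two[1] - top_two[0] <= tol:  # maximum attained twice
        return None
    return int(np.argmax(s))


omega = [0.0, 0.0, 0.0]                   # a hyperplane in R^3 / R1
print(open_sector([0, 3, 1], omega))      # index of the strict maximum
print(open_sector([2, 2, 0], omega))      # None: the point is on H_omega
print(tropical_distance([0, 0, 0], [0, 3, 1]))
```

With `omega = 0` in $\mathbb{R}^3/\mathbb{R}\mathbf{1}$ the three possible return values 0, 1, 2 correspond to the three open sectors of Example 11.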
Fig. 5. A hard margin tropical SVM (left) and a soft margin tropical SVM (right) with a binary response variable. Both panels label the tropical hyperplane $H_0$ and the margin $z$; the right panel also labels $\alpha$ and $\beta$. A hard margin tropical SVM assumes that all data points from the given sample can be separated by a tropical hyperplane. Red square dots are data points from P and blue circular dots are data points from Q. The tropical hyperplane for a hard margin tropical SVM is obtained by maximizing the margin $z$ in the left figure, that is, the distance from the tropical hyperplane to the closest data point (the width of the grey area around the tropical hyperplane in the left figure). The tropical hyperplane for a soft margin tropical SVM is obtained by maximizing a margin, as for a hard margin tropical SVM, while minimizing the sum of $\alpha$ and $\beta$ at the same time. (Color figure online)
A hard margin tropical SVM assumes that all points are separated by a tropical hyperplane and that all data points with the same category of the response variable lie in the same open sector. Here we give an overview of hard margin tropical SVMs with the random variable $X \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$ given the response variable $Y \in \{0, 1\}^e$. Before we formally define hard margin tropical SVMs, we need some notation. Let $S(x)$ be the set of indices of nonzero elements of a vector $x$, i.e., $S(x) \subset \{1, \ldots, e\}$ consists of the indices $i$ with $x_i \neq 0$. For a vector $x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$ and a vector $\omega \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, let $I_\omega(x) \in \{0, 1\}^e$ be the vector of indicator functions over the index set $\{1, \ldots, e\}$ given by
$$I_\omega(x)_i = \begin{cases} 1 & \text{if } x_i + \omega_i = \max(x + \omega), \\ 0 & \text{otherwise.} \end{cases}$$
Similarly, let $J_\omega(x) \in \{0, 1\}^e$ be the vector of indicator functions over the index set $\{1, \ldots, e\}$ given by
$$J_\omega(x)_i = \begin{cases} 1 & \text{if } x_i + \omega_i = \text{second max}(x + \omega), \\ 0 & \text{otherwise.} \end{cases}$$
Here we assume that there are only two classes in the response variable. More precisely, we have a random variable $Y^1 := (Y^1_1, \ldots, Y^1_e)$, $Y^2 := (Y^2_1, \ldots, Y^2_e) \in \{0, 1\}^e$ such that $Y^1_i \cdot Y^2_i = 0$
for $i = 1, \ldots, e$, with discrete probabilities $\pi_1 := P(Y = Y^1)$ and $\pi_2 := P(Y = Y^2)$. Then, suppose we have a multivariate random variable $X \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$ given $Y$, with probability density function $f$ if $Y = Y^1$ and probability density function $g$ if $Y = Y^2$, such that there exists a tropical hyperplane $H_{\omega^*}$ with a normal vector $\omega^* \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$ with the following properties: (i) there exists an index $i \in \{1, \ldots, e\}$ such that for any $j \in \{1, \ldots, e\} \setminus \{i\}$, $\omega^*_i + X_i > \omega^*_j + X_j$, and (ii) $\max\{I_{\omega^*}(X) - Y\} = 0$, with probability 1. Let $D$ be the distribution of the joint random variable $(X, Y)$ and let $S$ be the sample $S := \{(X^1, Y^1), \ldots, (X^n, Y^n)\}$. Then we formulate an optimization problem for the normal vector $\omega$ of an optimal tropical separating hyperplane $H_\omega$ for random variables $X$ given $Y$: for some cost $C \in \mathbb{R}$,
$$\min_{\omega \in \mathbb{R}^e/\mathbb{R}\mathbf{1}} \left( \underbrace{\max_{X \in S,\; i \in S(I_\omega(X)),\; j \in S(J_\omega(X))} \left( X_i + \omega_i - X_j - \omega_j \right)}_{\text{regularizer}} + \underbrace{\frac{C}{n} \sum_{k=1}^{n} \max\left( I_\omega(X^k) - Y^k \right)}_{\text{error}} \right).$$
Here, the expectation of the random variable $\sum_{k=1}^{n} \max\left( I_\omega(X^k) - Y^k \right)$ is the 0–1 loss function. Also note that
$$d_{tr}(X, H_\omega) = X_i + \omega_i - X_j - \omega_j,$$
where $i \in S(I_\omega(X))$, $j \in S(J_\omega(X))$. Thus, this optimization problem can be written explicitly as the linear programming problem (8)–(11) below, where the optimal value of $z$ is the margin of the tropical SVM: for some cost $C \in \mathbb{R}$,
$$\max_{(z, \omega) \in \mathbb{R} \times \mathbb{R}^e/\mathbb{R}\mathbf{1}} \left( z + \frac{C}{n} \sum_{k=1}^{n} \min\left( Y^k - I_\omega(X^k) \right) \right) \tag{8}$$
subject to
$$\forall X \in S,\ \forall i \in S(I_\omega(X)),\ \forall j \in S(J_\omega(X)): \quad z + X_j + \omega_j - X_i - \omega_i \leq 0, \tag{9}$$
$$\forall X \in S,\ \forall i \in S(I_\omega(X)),\ \forall j \in S(J_\omega(X)): \quad \omega_j - \omega_i \leq X_i - X_j, \tag{10}$$
$$\forall X \in S,\ \forall l \notin S(I_\omega(X)) \cup S(J_\omega(X)),\ j \in S(J_\omega(X)): \quad \omega_l - \omega_j \leq X_j - X_l. \tag{11}$$

As we discussed earlier, soft margin tropical SVMs are similar to hard margin tropical SVMs. They try to find a tropical hyperplane which maximizes the margin, but they also allow some points to lie in a wrong open sector by introducing the extra variables $\alpha$, $\beta$ shown in Fig. 5. Tang et al. showed in [19] that a soft margin tropical hyperplane for a tropical SVM is the optimal solution of the following linear programming problem.
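Returning to the hard margin problem (8)–(11), the following sketch evaluates the objective (8) for one fixed candidate $\omega$ rather than solving the linear program. It is an illustration under stated assumptions: the function names and the toy sample are hypothetical, "second max" is taken as the second largest entry of $x + \omega$ counted with multiplicity, and $z$ is set to the largest value allowed by (9), namely the smallest distance from a sample point to $H_\omega$. Constraints (10) and (11) hold automatically here because $I_\omega$ and $J_\omega$ are computed from $\omega$ itself.

```python
import numpy as np


def indicator_max(x, omega):
    """I_omega(x): 1 where x_i + omega_i attains the maximum, else 0."""
    s = np.asarray(x, dtype=float) + np.asarray(omega, dtype=float)
    return (s == s.max()).astype(int)


def distance_to_hyperplane(x, omega):
    """d_tr(x, H_omega): largest minus second largest entry of x + omega."""
    s = np.sort(np.asarray(x, dtype=float) + np.asarray(omega, dtype=float))
    return float(s[-1] - s[-2])


def hard_margin_objective(X, Y, omega, C=1.0):
    """Return (objective, z) for objective (8) at a fixed omega.

    z is the smallest distance from a sample point to H_omega, and the
    second term is (C / n) * sum_k min(Y^k - I_omega(X^k)).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=int)
    z = min(distance_to_hyperplane(x, omega) for x in X)
    penalty = sum(int((y - indicator_max(x, omega)).min())
                  for x, y in zip(X, Y))
    return z + C * penalty / len(X), z


# Hypothetical toy sample in R^3 / R1 with two classes.
X = [[0, 3, 1], [0, 4, 0], [5, 0, 1], [6, 1, 0]]
Y = [[0, 1, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]]
objective, margin = hard_margin_objective(X, Y, omega=[0, 0, 0], C=1.0)
print(objective, margin)
```

In this toy sample every point lies in the sector named by its label, so the error term vanishes and the objective equals the margin; relabelling a point lowers the objective by $C/n$.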
There are still many open questions about tropical SVMs. In general, if we use the methods developed in [19] to find a hard margin or soft margin tropical hyperplane, we have to go through exponentially many linear programming problems. However, we do not know the exact time complexity of finding a hard margin or soft margin tropical hyperplane for a tropical SVM.

Problem 4. What is the time complexity of finding a hard or a soft margin tropical hyperplane for a tropical SVM over the tropical projective torus? Is it NP-hard?

In addition, the authors of [19] focused on tropical hyperplanes for tropical SVMs over the tropical projective torus $\mathbb{R}^{\binom{N}{2}}/\mathbb{R}\mathbf{1}$, not over the space of ultrametrics $U_N$. Again, note that $U_N \subset \mathbb{R}^{\binom{N}{2}}/\mathbb{R}\mathbf{1}$ is a union of $(N-1)$-dimensional cones. Thus we are interested in how $U_N$ and a tropical SVM over $\mathbb{R}^{\binom{N}{2}}/\mathbb{R}\mathbf{1}$ are related to each other. More specifically:

Problem 5. Can we describe geometrically how a hard or soft margin tropical hyperplane for a tropical SVM over the tropical projective torus $\mathbb{R}^{\binom{N}{2}}/\mathbb{R}\mathbf{1}$ separates points in the space of ultrametrics $U_N$?

We are also interested in defining a tropical SVM over $U_N$ and developing algorithms to compute it.

Problem 6. Define hard and soft margin tropical "hyperplanes" for tropical SVMs over $U_N$. To define them, can we use a tropical polytope instead of a tropical hyperplane? How can we compute them? Can we formulate this as an optimization problem?

For more details, see the following paper:
– Tang, Wang, and Yoshida. Tropical Support Vector Machine and its Applications to Phylogenomics [19].

Tropical Linear Discriminant Analysis (LDA). In this section we discuss tropical linear discriminant analysis (LDA). LDA is one of the classical statistical methods for classifying a data set into two or more classes while at the same time reducing its dimensionality. LDA is related to PCA in a Euclidean space, and these relations are shown in Fig. 6.
The difference between PCA and LDA is how the direction of the linear plane is found.
Fig. 6. Original data (left), principal component analysis (middle), and linear discriminant analysis (right). There are two categories in the response variable, red and blue. The middle picture represents PCA and the right picture shows LDA on these points. (Color figure online)
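For two classes of samples $S_1 = \{u_1, \ldots, u_{n_1}\}$, $S_2 = \{v_1, \ldots, v_{n_2}\} \subset \mathbb{R}^d$, the linear space for classical LDA can be found as the optimal solution of the optimization problem
$$\max_{w} \frac{\|\mu_1 - \mu_2\|^2}{s_1^2 + s_2^2} \quad \text{such that} \quad
\begin{aligned}
\mu_1 &= \frac{1}{n_1} \sum_{i=1}^{n_1} \mathrm{Proj}(u_i), \\
\mu_2 &= \frac{1}{n_2} \sum_{i=1}^{n_2} \mathrm{Proj}(v_i), \\
s_1 &= \sum_{i=1}^{n_1} \|\mathrm{Proj}(u_i) - \mu_1\|, \\
s_2 &= \sum_{i=1}^{n_2} \|\mathrm{Proj}(v_i) - \mu_2\|,
\end{aligned} \tag{12}$$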
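where $\mathrm{Proj}(\cdot)$ is a projection onto a linear plane $w$ in $\mathbb{R}^d$. Here we use the max-plus algebra in the tropical setting, and for now we consider the tropical projective space. Let $d_{tr}$ be the tropical distance between two points in the tropical projective space $\mathbb{R}^d/\mathbb{R}\mathbf{1}$. Then, by analogy with Eq. (12), we can formulate the problem of finding the tropical linear space for tropical LDA as
$$\max_{w}\ d_{tr}(\mu_1, \mu_2) - s_1 - s_2 \quad \text{such that} \quad
\begin{aligned}
\mu_1 &= \arg\min_{z \in w} \sum_{i=1}^{n_1} d_{tr}(z, \mathrm{Proj}(u_i)), \\
\mu_2 &= \arg\min_{z \in w} \sum_{i=1}^{n_2} d_{tr}(z, \mathrm{Proj}(v_i)), \\
s_1 &= \min_{z \in w} \sum_{i=1}^{n_1} d_{tr}(z, \mathrm{Proj}(u_i)), \\
s_2 &= \min_{z \in w} \sum_{i=1}^{n_2} d_{tr}(z, \mathrm{Proj}(v_i)),
\end{aligned} \tag{13}$$
where $\mathrm{Proj}(\cdot)$ is a projection onto a tropical polytope $w$ in $\mathbb{R}^d/\mathbb{R}\mathbf{1}$.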
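For intuition about the classical criterion in Eq. (12), which the tropical formulation mirrors, the following sketch evaluates the ratio for one fixed direction. It is only an illustration: the function name and toy data are hypothetical, and the "linear plane" $w$ is assumed to be a line through the origin spanned by a single vector, with $\mathrm{Proj}$ taken as the orthogonal projection onto that line.

```python
import numpy as np


def lda_criterion(U, V, w):
    """Evaluate ||mu1 - mu2||^2 / (s1^2 + s2^2) from Eq. (12).

    U, V : arrays of shape (n1, d) and (n2, d), the two classes.
    w    : direction vector; Proj is the orthogonal projection onto
           the line spanned by w (an assumption of this sketch).
    """
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)

    def proj(P):
        # Each row p is mapped to (p . w) w.
        return np.outer(P @ w, w)

    PU, PV = proj(U), proj(V)
    mu1, mu2 = PU.mean(axis=0), PV.mean(axis=0)
    s1 = np.linalg.norm(PU - mu1, axis=1).sum()
    s2 = np.linalg.norm(PV - mu2, axis=1).sum()
    return float(np.sum((mu1 - mu2) ** 2) / (s1 ** 2 + s2 ** 2))


# Hypothetical toy classes, separated along the first coordinate.
U = [[0, 1], [2, -1]]
V = [[10, 1], [12, -1]]
print(lda_criterion(U, V, [1, 0]))  # direction that separates the classes
print(lda_criterion(U, V, [0, 1]))  # direction that mixes the classes
```

The first direction gives a large ratio because the projected class means are far apart relative to the within-class spread, while the second gives zero because both projected means coincide.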
Problem 7. Can we define a tropical LDA over the tropical projective space? If so, how can we find a tropical linear space (or tropical polytope) for a tropical LDA?

Problem 8. Can we define a tropical LDA over the space of ultrametrics $U_N$?

6.2 Tropical Regression
For classical multiple linear regression with the observed data set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i := (x_i^1, \ldots, x_i^e) \in \mathbb{R}^e$ and $y_i \in \mathbb{R}$, we try to find a vector $(\beta_0, \beta_1, \ldots, \beta_e) \in \mathbb{R}^{e+1}$ such that
$$Y = \beta_e X_e + \ldots + \beta_1 X_1 + \beta_0 + \epsilon,$$
where $\epsilon \sim N(0, \sigma)$, $N(0, \sigma)$ is the Gaussian distribution with mean 0 and standard deviation $\sigma$, $Y$ is the response variable, and $X_1, \ldots, X_e$ are explanatory variables, and such that the following value is as small as possible:
$$\sum_{i=1}^{n} \left( \beta_e x_i^e + \ldots + \beta_1 x_i^1 + \beta_0 - y_i \right)^2. \tag{14}$$
The value in Eq. (14) is called the sum of squared residuals. Thus, for classical multiple linear regression over the Euclidean space $\mathbb{R}^e$, we try to find the linear hyperplane with the smallest sum of squared residuals. For tropical regression over the tropical projective space, one can define a tropical regression "polytope" as the tropical polytope attaining
$$\min \sum_{i=1}^{n} \left( \max\{\beta_e + x_i^e, \ldots, \beta_1 + x_i^1, \beta_0\} - y_i \right)^2.$$
Nothing has yet been done on tropical regression. Thus, it would be interesting to see how one can define it in the tropical projective space as well as in the space of ultrametrics.
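The two objectives above differ only in how the prediction is formed: a sum of products in the classical case, a maximum of sums in the tropical case. The sketch below evaluates both sums of squared residuals for fixed coefficients. The function names and toy data are hypothetical, and the code only evaluates the objectives; it does not minimize them.

```python
import numpy as np


def classical_ssr(beta0, beta, X, y):
    """Sum of squared residuals of Eq. (14) for fixed coefficients."""
    X = np.asarray(X, dtype=float)
    pred = X @ np.asarray(beta, dtype=float) + beta0
    return float(np.sum((pred - np.asarray(y, dtype=float)) ** 2))


def tropical_ssr(beta0, beta, X, y):
    """Tropical analogue: prediction is max{beta_e + x^e, ..., beta_1 + x^1, beta_0}."""
    X = np.asarray(X, dtype=float)
    shifted = X + np.asarray(beta, dtype=float)   # beta_j + x_i^j, row by row
    pred = np.maximum(shifted.max(axis=1), beta0)
    return float(np.sum((pred - np.asarray(y, dtype=float)) ** 2))


# Hypothetical data: two observations with two explanatory variables.
X = [[1, 2], [3, 0]]
y = [2, 5]
beta0, beta = 0.0, [1.0, 0.0]
print(classical_ssr(beta0, beta, X, y))
print(tropical_ssr(beta0, beta, X, y))
```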
References
1. Akian, M., Gaubert, S., Viorel, N., Singer, I.: Best approximation in max-plus semimodules. Linear Algebra Appl. 435, 3261–3296 (2011)
2. Ardila, F., Klivans, C.J.: The Bergman complex of a matroid and phylogenetic trees. J. Comb. Theory Ser. B 96(1), 38–49 (2006)
3. Gärtner, B., Jaggi, M.: Tropical support vector machines (2006)
4. Billera, L.J., Holmes, S.P., Vogtmann, K.: Geometry of the space of phylogenetic trees. Adv. Appl. Math. 27(4), 733–767 (2001)
5. Cohen, G., Gaubert, S., Quadrat, J.P.: Duality and separation theorems in idempotent semimodules. Linear Algebra Appl. 379, 395–422 (2004)
6. Felsenstein, J.: Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17(6), 368–376 (1981)
7. Gavryushkin, A., Drummond, A.J.: The space of ultrametric phylogenetic trees. J. Theor. Biol. 403, 197–208 (2016)
8. Jardine, C.J., Jardine, N., Sibson, R.: The structure and construction of taxonomic hierarchies. Math. Biosci. 1(2), 173–179 (1967)
9. Joswig, M.: Essentials of tropical combinatorics (2017)
10. Lin, B., Sturmfels, B., Tang, X., Yoshida, R.: Convexity in tree spaces. SIAM J. Discret. Math. 31(3), 2015–2038 (2017)
11. Lin, B., Sturmfels, B., Tang, X., Yoshida, R.: Convexity in tree spaces. SIAM J. Discret. Math. 31(3), 2015–2038 (2017)
12. Lin, B., Yoshida, R.: Tropical Fermat-Weber points. SIAM J. Discret. Math. (2018). To appear. arXiv:1604.04674
13. Maclagan, D., Sturmfels, B.: Introduction to Tropical Geometry. Graduate Studies in Mathematics, vol. 161. American Mathematical Society, Providence, RI (2015)
14. Nye, T.M.W.: Principal components analysis in the space of phylogenetic trees. Ann. Stat. 39(5), 2716–2739 (2011)
15. Page, R., Yoshida, R., Zhang, L.: Tropical principal component analysis on the space of ultrametrics (2019)
16. Semple, C., Steel, M.: Phylogenetics. Oxford Lecture Series in Mathematics and its Applications, vol. 24. Oxford University Press (2003)
17. Speyer, D., Sturmfels, B.: Tropical mathematics. Math. Mag. 82, 163–173 (2009)
18. Sturmfels, B., Pachter, L.: Tropical geometry of statistical models. Proc. Natl. Acad. Sci. 101, 16132–16137 (2004)
19. Tang, X., Wang, H., Yoshida, R.: Tropical support vector machines and its applications to phylogenomics (2020)
20. Yoshida, R., Zhang, L., Zhang, X.: Tropical principal component analysis and its application to phylogenetics. Bull. Math. Biol. 81, 568–597 (2019)
A Study of Big Data Analytics in Internal Auditing

Neda Shabani¹, Arslan Munir¹(✉), and Saraju P. Mohanty²

¹ Kansas State University, Manhattan, KS 66506, USA
{nshabani,amunir}@ksu.edu
² University of North Texas, Denton, TX 76203, USA
[email protected]
Abstract. As the world is progressing towards an era of automation and artificial intelligence (AI), the use of data is becoming more valuable than ever before. Many professions and organizations have already incorporated automation and AI into their work to increase their productivity and efficacy. Auditing firms are no exception in this regard, as these firms also use many data analytics processes to plan and perform audits. This paper provides a systematic review of big data analytics applications in auditing, with a primary focus on internal auditing. The paper contemplates the advantages of incorporating big data analytics in internal auditing. The paper further discusses the state-of-the-art and contemporary trends of big data analytics in internal auditing while also summarizing the findings of notable studies in the area. Finally, the paper outlines various challenges in incorporating big data analytics in internal auditing and provides insights into future trends.

Keywords: Accounting · Auditing · Internal auditing · External auditing · Big data · Data analytics

1 Introduction
As businesses become progressively more complex, decision-making for stakeholders becomes increasingly arduous. This complexity makes the role of auditors, both internal and external, crucial to organizations, as stakeholders want to know how efficiently, effectively, and innovatively organizations operate internally and what the financial as well as non-financial outcomes of those operations are. Furthermore, there exists an expectation gap between financial statement users and auditors. Auditors constantly strive to reduce this gap by informing people about the difference between the reasonable assurance of fairness in financial statement presentation that external auditors provide and the absolute assurance of fairness in financial statement presentation that people assume. Auditors also try to reduce this expectation gap by conducting higher-quality and more in-depth audits, which can sometimes be very costly and time consuming, and which are not always fully possible due to the nature of some intangible assets such as goodwill.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 362–374, 2022. https://doi.org/10.1007/978-3-030-82196-8_27
An in-depth and thorough audit means large samples of data and rigorous data analytics (DA) procedures to test both internal controls and different accounts. In external auditing, even with a very large sample size from a variety of accounts, such as accounts payable, accounts receivable, and cash, it is still not possible for an auditor to claim that a financial statement is 100% correct, as the auditor has not tested all the transactions in all the accounts. Similarly, in internal auditing, it is not possible for an auditor to claim that an operation of an organization is absolutely effective, efficient, and in compliance with laws and regulations. In recent years, big data analytics (BDA) has revolutionized many industries, including auditing. Although BDA is fairly new, and hence the infrastructure of many organizations is not yet ready for it, some avant-garde organizations have either already implemented it or are preparing to implement it in the near future. The unique features and characteristics of BDA allow almost all industries that use data to take advantage of it and to enhance their operations and decision-making to become more profitable and customer-friendly. Like many other firms, audit firms rely heavily on data and DA to perform their job, and thus BDA can play a significant role in the performance of these firms. BDA is used in both internal auditing and external auditing; however, the purposes for which it is used differ. Although we will mention the differences between BDA for internal auditing and for external auditing shortly, the focus of this paper is on internal auditing. This paper provides a review of BDA applications in auditing, primarily internal auditing. It provides a rich source of information for both academia and industry to learn at what stage big data in auditing stands, and to help prepare organizations for implementing big data infrastructure.
This paper also contributes to the available literature by filling the gap between academic and industrial practices in the area of BDA in auditing. The main contributions of this work are:
– Providing an overview of DA in auditing.
– Elaborating the advantages of incorporating BDA in internal auditing.
– Discussing the state-of-the-art and trends of BDA in internal auditing.
– Contemplating the challenges involved in incorporating BDA in internal auditing.
The rest of this paper is organized as follows. Section 2 discusses BDA and its usage in business industry. An overview of data analytics in auditing is provided in Sect. 3. Section 4 elaborates the benefits of incorporating BDA in internal auditing. The state-of-the-art and trends of BDA in internal auditing are discussed in Sect. 5. Section 6 deliberates the challenges involved in assimilating BDA in internal auditing. Finally, Sect. 7 concludes this study.
2 Big Data Analytics and its Usage in Business
Big data is a set of data that can be characterized by six V's: Velocity, Volume, Value, Variety, Veracity, and Variability. As the name suggests, the size of big data must be "big", which is measured as Volume. Velocity refers to the rapidly increasing speed at which new data is being generated and the corresponding requirement for this data to be assimilated and analyzed in near real-time [14]. Value measures the worth of the big data, that is, the insights generated from big data must lead to quantifiable improvements. Variety refers to the huge diversity of data types. Veracity refers to the trustworthiness of the data. Variability indicates that the way the data is captured varies from time to time and place to place. Furthermore, variability suggests that the interpretation of data may change depending on the context. Cao et al. [7] have defined DA as "the process of inspecting, cleaning, transforming, and modeling Big Data to discover and communicate useful information and patterns, suggesting conclusions, and support decision making". DA has as many uses as there are businesses that work with and handle data. Different firms and organizations use DA for a variety of purposes, such as predicting and describing new market trends, predicting consumer needs and demands, and anticipating the market's influence on the behaviour and patterns of customers. For example, many companies use data mining and DA to analyze customers' comments and feedback in order to improve their services and products, or to predict which competitors customers would switch to if they were not happy with the services and products they received. Voice pattern recognition is another tool that many companies can take advantage of to enhance their performance and improve customer satisfaction by identifying dissatisfied customers and the reasons for their dissatisfaction [1].
Not only companies but also governments and police forces can benefit from BDA by identifying repeated criminal and fraudulent behaviours, and the regions where they occur, both locally and internationally. As pointed out by Gershkoff [13], BDA is a fast-growing market. Companies, organizations, and firms that do not incorporate BDA in their daily operations are likely to miss many opportunities and to fall far behind those that do.
3 Data Analytics in Auditing
Like many other professionals, auditors, both internal and external, heavily use DA in their operations to obtain results and form their opinion about the firm they are auditing. According to Byrnes et al. [6], auditors use DA in their daily operations to analyze, identify, and extract information from the available data of their clients that can be useful in planning, fieldwork, and issuing the final opinion about the firm they are auditing. Many scholars have given their opinion about the importance and usage of DA in auditing. For example, Brown-Liburd et al. [5] mentioned that DA is able to recognize and identify existing patterns and correlations among data that can be very useful for auditors. According to Byrnes et al. [6], DA in auditing is not only a science but also an art that exposes patterns and anomalies and extracts useful information from data related to the subject when an auditor performs analysis, modeling, and visualization in all phases of planning and performing the audit. Furthermore, DA helps auditors to process a lot of data at a very fast pace. Auditors take advantage of the innovative and competitive insights provided by DA to enhance the efficiency and effectiveness of their audit performance [11]. Besides scholars and researchers, many organizations, such as the International Auditing and Assurance Standards Board (IAASB), have also given opinions about BDA in auditing. The IAASB recognizes BDA as a science and art that discovers and analyzes existing patterns, deviations, and inconsistencies in data sets, and extracts useful information related to the audit being performed [1]. Many auditing firms, large and small, constantly use BDA in their daily operations to reduce the amount of risk involved in both internal and external auditing as well as to offer value to their clients [1]. For instance, BDA can help reduce the expectation gap between financial statement users and auditors, as depicted in Fig. 1. While larger auditing firms have more resources, such as capital, to create customized DA platforms, smaller firms may choose to use available platforms and packages for assistance in planning and performing their audits. We note that there is no single DA tool in auditing, and firms consistently develop and customize a variety of DA tools, depending on their needs and available resources, to help identify patterns, trends, and correlations, and to extract information from data via different visual and descriptive methods.

Fig. 1. Big data analytics in auditing can help reduce the expectation gap. The diagram contrasts what people think auditors do (absolute assurance about fairness of financial statement presentation) with what auditors actually do (reasonable assurance about fairness of financial statement presentation), and shows the expectation gap under traditional auditing and with BDA applied in auditing. (BDA: Big Data Analytics)
4 Advantages of Incorporating BDA in Internal Auditing
According to Cao et al. [7], although many professions, such as consulting and marketing, have already adopted BDA in their daily operations, the use of BDA in accounting is still nascent. Nevertheless, BDA can profoundly benefit both internal and external auditing. For instance, BDA can enhance the efficiency and effectiveness of financial statement audits [7]. Moreover, auditors can take advantage of BDA to analyze and test more transactions [11]. According to the results of a study [16] that used an extensive database to distribute and compile surveys, there are ten areas in both external and internal auditing that can benefit the most from incorporating BDA. These areas are [16]: (i) accounts payable and accounts receivable, (ii) duplicate detection, (iii) sampling, (iv) data imports/extractions and analysis of large data sets, (v) continuous auditing and monitoring, (vi) fraud detection and forensic auditing, (vii) P-Cards analysis, (viii) payroll and time sheets, (ix) joins and comparisons, and (x) inventory audits. According to [8], incorporating BDA can immensely benefit three phases of internal auditing: planning, audit execution (fieldwork), and reporting. Figure 2 depicts the benefits of incorporating BDA in these three phases. In the planning phase, risk profiling, test data stimulation, and statistical sampling are the areas in which auditors benefit the most from applying BDA. In the audit execution phase, continuous controls monitoring, fraud indicators, predictive risk identification, and control simulation are the areas most improved by BDA. In the reporting phase, risk quantification, real-time exception management, and root cause investigation are the areas in which auditors can reap the most advantages from BDA. In the following, we discuss some of the vital areas in internal auditing that benefit the most from BDA.

4.1 Risk Assessment
One of the main tasks of internal auditors is to perform a thorough risk assessment for their organization. With the help of BDA, internal auditors can combine and analyze data from inside or outside the organization in order to gain deeper and more comprehensive insight into their organization. BDA can help internal auditors to better assess existing and potential risks within or outside the organization, and to reach more distinct and precise findings. An accurate and detailed risk assessment and financial analysis help organizations to improve their daily and long-term business processes and internal controls. These findings also help decision-makers and planners within organizations to operate more strategically and to be more efficient, effective, and profitable. Incorporating BDA into the internal audit process also allows transaction risks to be assessed in real time. Real-time access to information means that internal auditors can gain better insight into the organization and the efficiency and effectiveness of its business processes even before the actual fieldwork starts. This also supports continuous auditing for better risk management. Furthermore, since internal auditors test and evaluate non-financial data in addition to financial data to better assess the risks within the organization, it is much easier to do so with BDA tools [11].

Fig. 2. Benefits of incorporating big data analytics in internal auditing. The diagram groups big data analytics applications in internal auditing by phase. Planning: risk assessment and profiling, data stimulation testing, statistical data sampling. Fieldwork: continuous auditing and monitoring, fraud indication and detection, predictive risk identification. Reporting: risk quantification, real-time exception management, root cause investigation.

4.2 Audit Quality
Another advantage of incorporating BDA in internal and external auditing is the enhancement of audit quality. Professional skepticism is a significant requirement for both internal and external auditors and determines the quality of the audit. While in the traditional way of auditing, auditors bring their personal experience and judgement to their audit work, in the new and automated process of BDA, much documentation can be provided by artificial intelligence (AI) and machine learning (ML), which helps auditors to detect potential fraud when reviewing financial statements, business processes, and internal controls [22]. This automated process of BDA allows auditors to maintain their professional skepticism. Since big data works with automated processes, AI, and ML, larger volumes of higher-velocity data can be processed efficiently, and so auditors can gain valuable information and insights in a shorter period of time. With BDA incorporated within organizations, auditors can test all of the transactions, thus narrowing the expectation gap between auditors and financial statement users [6].

4.3 Compliance Assurance
The internal auditor's job demands constant scanning and review of the organization's performance and documentation to ensure that the organization's financial documents are prepared in compliance with standards [15]. Mistakes in this regard, since human errors are common, can put the business out of compliance or waste much of the auditors' time. With the help of BDA and the automation of manual, frustrating, and time-consuming tasks, internal auditors are able to set up as many controls as they want in the auditing system and to monitor those controls to determine whether or not the organization is adhering to the standards and guidelines. According to EY [12], continuous investment in DA is essential, as it helps auditors to provide more assurance and a more relevant audit.

4.4 Fraud Detection
A significant factor in performing a quality audit is effective communication of key issues and findings, which BDA enables auditors to achieve through dynamic dashboards [20]. In this regard, BDA tools give auditors the power to convert raw data into a pre-structured form and presentation format so that anybody, from auditors to clients, is able to understand the presented information. BDA tools can also be adjusted to a specific client's risk so that auditors can plan and perform more efficiently and conclude their findings faster. BDA helps auditors to detect fraudulent actions by interrogating all data and by testing all internal controls, such as separation of duties [1].

4.5 Planning Assistance
With the help of BDA, auditors are able not only to use the descriptive features of the tools to better understand the process, conduct a thorough risk assessment, and obtain detailed findings, but also to use predictive and prescriptive features to better assist decision makers in setting future goals and objectives [3].

4.6 Cost of Operations
Incorporating and implementing BDA in the internal audit process and hiring auditors with DA skills require an initial investment, which may be seen as a costly and negative factor at first. In the long term, however, incorporating BDA and hiring auditors with BDA proficiencies reduce the overall cost of the audit, and further improve the efficiency and effectiveness of business operations through the more accurate recommendations and corrective action plans enabled by BDA [4].
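Two of the tests named in this section, duplicate detection and checking separation of duties, can be illustrated with a minimal sketch. Everything in it is hypothetical: the record fields, the matching rule (same vendor, invoice number, and amount after trimming and lower-casing), and the sample records are assumptions chosen for illustration, not a description of any particular audit tool.

```python
from collections import defaultdict


def find_duplicate_payments(payments):
    """Group payments that share vendor, invoice number, and amount.

    Returns a list of id lists, one per group with more than one payment.
    """
    groups = defaultdict(list)
    for p in payments:
        key = (p["vendor"].strip().lower(),
               p["invoice"].strip().lower(),
               round(p["amount"], 2))
        groups[key].append(p["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def separation_of_duties_violations(payments):
    """Return ids of payments created and approved by the same user."""
    return [p["id"] for p in payments
            if p["created_by"] == p["approved_by"]]


# Hypothetical sample records.
payments = [
    {"id": 1, "vendor": "Acme Ltd", "invoice": "INV-001", "amount": 1200.00,
     "created_by": "ann", "approved_by": "bob"},
    {"id": 2, "vendor": "acme ltd ", "invoice": "inv-001", "amount": 1200.00,
     "created_by": "cat", "approved_by": "bob"},
    {"id": 3, "vendor": "Globex", "invoice": "G-77", "amount": 310.50,
     "created_by": "dan", "approved_by": "dan"},
]

print(find_duplicate_payments(payments))
print(separation_of_duties_violations(payments))
```

Because both checks run over every record rather than a sample, they reflect the point made above that analytics lets auditors test all transactions; in practice the matching rule would need tuning to the client's data.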
5 State-of-the-Art and Trends of BDA in Internal Auditing
BDA is dynamic and continuously improving, so for an organization or firm to stay on the leading edge in its internal auditing processes and technologies, it must keep up with innovations in BDA. This section discusses the state-of-the-art and trends of BDA in internal auditing. Several studies conducted by Protiviti [18] show interesting results. Protiviti is a global consulting firm headquartered in Menlo Park, California, that provides consulting solutions in internal audit, risk and compliance, technology, business processes, DA, and finance. In 2015, Protiviti conducted an internal audit capabilities and needs survey [19], which contained 23 questions and was distributed to a select group of the largest financial institutions in the U.S., including 13 of the top 25 banks and 2 of the top 5 insurers. According to the findings of this survey, 69% of participants said that their internal audit functions have their own data warehouse for accessing data, and 54% of participants reported that there are special requirements for the desktops assigned to internal audit DA professionals. Also, 54% of participants said that specific and defined protocols are used for the extraction of data leveraged during the audit process to validate the data's quality and completeness. Furthermore, 54% of participants indicated that internal audit functions also use business intelligence and related dashboarding tools to support their processes, such as Business Objects, Oracle, QlikView, SAS JMP, SQL (Structured Query Language), and other internal tools. Moreover, 69% of participants indicated that members of their department, including professionals outside of the internal audit analytics team, possess analytics skills that they deploy on audits. Three tools are most commonly used by internal audit analytics groups in their work, viz., Microsoft Excel, SQL, and ACL (Audit Command Language).
Internal audit groups have long used these tools to plan and perform internal audits in a traditional, manual way; however, the trend of learning and using tools such as Tableau and Spotfire shows a transition towards data visualization and other BDA features for analyzing and assessing risk continuously. Another Protiviti study reports the share of internal audit departments that use BDA as part of their audit planning and performance by region: Asia-Pacific 76%, Europe 76%, and North America 63%. Among the organizations that use BDA in their audit work, the share that rated the quality of the data available for analysis as excellent or very good was: Asia-Pacific 59%, Europe 58%, and North America 28% [8]. Another study [17] indicates that only 42% of chief audit executives (CAEs) responding to the Institute of Internal Auditors (IIA) 2017 North American Pulse of Internal Audit frequently or always use DA in their audit planning and performance. The same study was conducted again in 2018 and showed a similar trend, with only 62% of CAEs reporting partial or full use of DA. Internal auditors who incorporate BDA into their audit work should know that these processes involve many risks. To reduce them, internal auditors should take appropriate measures while working with BDA, such as using clean and normalized data, dealing with outliers, accurately reading patterns and removing noise, visualizing the data clearly, understanding correlation versus causation, and recognizing when data should not be used [17].
According to Deloitte, auditors need to update their knowledge and gain new skill sets in order to keep up with new demands, changes, and technologies. These required skill sets can be divided into two categories [10]: (i) technical and analytical, and (ii) business and communication. The technical and analytical skill set includes the following. Testing and validation, which means “defining, developing, and implementing quality assurance practices and procedures for technical solutions and validating hypotheses”. SQL querying, which means “querying and manipulating data to facilitate the solving of more complex problems”. Data modeling, which refers to “structuring data to enable the analysis of information, both internal and external to the business”. Data analytics, which means “valuating data using analytical and logical reasoning for the discovery of insight, e.g., predictive modeling”. Finally, reporting software, which pertains to “understanding of the underlying theory and application of key reporting software”. The business and communication skill set includes the following. Technology alignment, which means “understanding how technology can be leveraged to solve business problems”. Macro-perspective, which refers to “understanding of the company’s business strategy, current business issues and priorities and current industry trends”. Business knowledge, which means “understanding of business measurement of key performance indicators and business frameworks”. Business commentary, which relates to “articulation of insight to explain current and forecasted trends, their impact and opportunities for the business”. Finally, soft skills, which pertains to “communication and interpersonal skills that are necessary to articulate insight gained from analysis”.
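As a hedged illustration of the SQL querying skill listed above, the following sketch runs a duplicate-payment test over a full population of records rather than a sample, using Python's built-in sqlite3 module. The table name `payments`, its columns, and the sample rows are hypothetical; they are not taken from the Deloitte report or from any surveyed firm.

```python
import sqlite3

# Hypothetical ledger rows: (payment_id, vendor, invoice_no, amount).
ROWS = [
    (1, "Acme", "INV-001", 1200.00),
    (2, "Acme", "INV-001", 1200.00),  # same vendor, invoice and amount as row 1
    (3, "Borealis", "INV-104", 310.50),
    (4, "Acme", "INV-002", 1200.00),
]


def find_duplicate_payments(rows):
    """Return (vendor, invoice_no, amount, count) groups that occur more than once."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE payments ("
            "payment_id INTEGER PRIMARY KEY, vendor TEXT, invoice_no TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO payments VALUES (?, ?, ?, ?)", rows)
        # Group on the fields that should be unique together and keep repeated groups.
        cursor = conn.execute(
            "SELECT vendor, invoice_no, amount, COUNT(*) AS n "
            "FROM payments "
            "GROUP BY vendor, invoice_no, amount "
            "HAVING COUNT(*) > 1 "
            "ORDER BY vendor, invoice_no"
        )
        return cursor.fetchall()
    finally:
        conn.close()


duplicates = find_duplicate_payments(ROWS)
```

Because the query groups every row, it tests all transactions; which fields define a duplicate is a judgment the auditor would have to make for the system at hand.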
The growing BDA market is moving towards more user-friendly software so that even auditors who are not highly skilled in BDA can use the tools [2]. With skill development and training, internal auditors and their teams will be able to reach a common understanding of BDA behaviors and tasks. One significant factor that requires special attention is setting and following policies on the collection, storage, and disposal of audit documentation and working papers. Internal auditors should make sure to consider all policies regarding what data is to be requested and stored, how data is accessed, who can access it, where it will be stored, what data can or cannot be distributed and to whom, and what the data retention period is. Failure to follow these policies can have unfavorable outcomes for firms and organizations, which can be sued by their clients.
6 Challenges of Incorporating BDA in Internal Auditing
Incorporating BDA in internal auditing raises various challenges. This section discusses the most salient ones.
6.1 Data-Access
Locating the sought-after data is one of the most arduous tasks in complex enterprise environments: operations are geographically distributed, and each business unit within an organization operates autonomously and uses a different system, which makes data sourcing laborious [19]. Furthermore, since auditors often do not have full access, it is difficult for them to acquire the data in a form that can be readily used.
6.2 Data Compatibility
Data acquired from different business units or client systems may be in different formats, which may render certain analytics techniques unusable. Even if the data acquired from different sources is compatible, it needs to be normalized to common terms to provide a fair basis for comparison. Determining the terms for data normalization presents another challenge because of the variety and variability of big data. Consequently, preparing and preprocessing the data before DA can actually be applied takes substantial time.
6.3 Data Relevance
Since new data is generated rapidly, mined information also becomes outdated quickly [9]. Hence, it is imperative to use the data in a timely manner. BDA tools can help provide timely insights into the data.
6.4 Data Integrity and Veracity
It is difficult to guarantee the completeness and integrity of the extracted client data. Internal audit analytics specialists often need to perform data extraction, which may be limited when the firm lacks either the right tools or an understanding of the client data. This can particularly be the case when the client uses multiple data systems. Furthermore, it is conceivable that clients make only certain data accessible, or manipulate the data that is made accessible for extraction [1]. Consequently, data veracity and ambiguity is another challenge that requires consideration. When the acquired information is vague, auditors may lack the confidence to make careful decisions and are more likely to ignore additional information once an initial solution is obtained [5].
6.5 Data Management
Data management issues pertain to data storage and accessibility for the duration of the required retention period for audit evidence. The acquired data must be stored for several years in a form amenable to retesting. Thus, the firms may need to invest in hardware to store this huge volume of data or outsource data storage. Outsourcing of data storage increases the risk of data loss or privacy violations [1].
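One way to keep a retained extract amenable to retesting is to record a cryptographic digest when the data is extracted and to recompute it before any later retest. The sketch below uses Python's standard hashlib module; the function names and the sample CSV bytes are illustrative assumptions, not a procedure described in [1].

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of an extract's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """Check a stored extract against the digest recorded at extraction time."""
    return fingerprint(data) == recorded_digest


# Hypothetical extract as it was received from the source system.
extract = b"invoice_no,amount\nINV-001,1200.00\nINV-002,310.50\n"
recorded = fingerprint(extract)

# Later in the retention period: an unmodified copy passes, an altered copy does not.
tampered = extract.replace(b"310.50", b"810.50")
```

A digest only shows that the stored bytes have not changed since extraction; it says nothing about whether the extract was complete or accurate in the first place.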
6.6 Data Confidentiality and Privacy
The confidentiality and privacy policies of different business units pose another challenge, as they require internal audit analytics professionals to seek approval before accessing certain systems and data [19]. Copying and storing detailed client data may violate data confidentiality and privacy laws, as the data could be abused by the firms. Furthermore, the stored data is susceptible to security attacks, which may cause grave legal and reputational repercussions. The internal audit analytics team may also need to obtain information technology (IT) certifications for their data warehouses before they can store the retrieved data, which requires addressing questions such as: (i) how will the data be utilized? (ii) what access control procedures will be employed for that data? (iii) what steps will be taken to keep the data secure? and (iv) how are the chain-of-custody stipulations met in data capturing, usage, and storage?
6.7 Regulations
Currently, no regulations or guidelines cover all the uses of DA in audit [2]. Hence, firms with more resources can gain a competitive advantage over smaller firms in developing DA tools, thus reducing competition in the audit industry [1]. Consequently, firms that do not invest in DA might lag behind their competitors in providing better services [2].
6.8 Training
Audit staff may not be qualified to comprehend the true nature of the data and make appropriate inferences [11]. Hence, audit staff need to be trained in DA, which can be expensive. The scarcity of DA specialists and their unfamiliarity with audit, as pointed out by [2], exacerbate the challenge of training audit staff.
6.9 Tools
The datasets are often too complex to be inspected by standard tools [21]. Hence, there is a need to develop new BDA tools for audit. Furthermore, as new ML and analytics methods are continuously being developed, it is imperative to incorporate the latest methods in BDA tools.
7 Conclusion
Big data has changed the landscape, performance, productivity, and profitability of many industries, organizations, and firms. Auditing firms are no exception, as they are using many data analytics processes to plan and perform audits. Big data analytics (BDA) has a plethora of unique features that offer many advantages to auditors, such as enabling them to gain deeper insight into their auditing work and helping them develop a thorough understanding of various aspects of audit, including risk assessment and compliance assurance. BDA helps reduce the expectation gap between auditors and the public by enabling more accurate audits that test all transactions instead of relying on sampling. Nevertheless, there are challenges and issues involved in implementing and using BDA in audit firms, such as data-access, data compatibility, data relevance, data integrity and veracity, data management, data confidentiality and privacy, regulations, training, and tools. It is envisioned that as artificial intelligence techniques become more sophisticated, solutions will be developed to address the standing challenges. Although it will take some time and investment in human resources and capital to fully integrate BDA in internal audit and solve the existing challenges, the benefits of incorporating BDA in audit firms warrant the effort and investment.
References
1. ACCA: Data analytics and the auditor (2020). https://www.accaglobal.com/in/en/student/exam-support-resources/professional-exams-study-resources/p7/technical-articles/data-analytics.html. Accessed 15 Jan 2020
2. Alles, M.G.: Drivers of the use and facilitators and obstacles of the evolution of big data by the audit profession. AAA Account. Horiz. 29(2), 439–449 (2015)
3. Appelbaum, D., Kogan, A., Vasarhelyi, M.: Big data and analytics in the modern audit engagement: research needs. AUDITING: J. Pract. Theory 36(4), 1–27 (2017)
4. Bierwirth, M.: Improving the internal audit function through enhanced data analytics (September 2019). https://www.surgentcpe.com/blog/improving-internalaudit-function-through-enhanced-data-analytics. Accessed 15 Jan 2020
5. Brown-Liburd, H., Issa, H., Lombardi, D.: Behavioral implications of big data’s impact on audit judgment and decision making and future research directions. AAA Account. Horiz. 29(2), 451–468 (2015)
6. Byrnes, P., Criste, T., Stewart, T., Vasarhelyi, M.: Reimagining auditing in a wired world (August 2014). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.646.9343&rep=rep1&type=pdf. Accessed 15 Jan 2020
7. Cao, M., Chychyla, R., Stewart, T.: Big data analytics in financial statement audits. AAA Account. Horiz. 29(2), 423–429 (2015)
8. Consultancy.uk: Data analytics to become a game changer for internal audit. https://www.consultancy.uk/news/16863/data-analytics-to-become-agame-changer-for-internal-audit. Accessed 15 Jan 2020
9. Coyne, E.M., Coyne, J.G., Walker, K.B.: Big data information governance by accountants. Int. J. Account. Inf. Manag. 26(3), 153–170 (2018)
10. Deloitte: Internal audit analytics: the journey to 2020: insights-driven auditing (2016). https://www2.deloitte.com/content/dam/Deloitte/us/Documents/risk/usrisk-internal-audit-analytics-pov.pdf. Accessed 15 Jan 2020
11. Earley, C.E.: Data analytics in auditing: opportunities and challenges. Bus. Horiz. 58(5), 493–500 (2015)
12. EY: How big data and analytics are transforming the audit (April 2015). https://www.ey.com/en_gl/assurance/how-big-data-and-analytics-are-transforming-theaudit#item1. Accessed 15 Jan 2020
13. Gershkoff, A.: How to stem the global shortage of data scientists (December 2015). https://techcrunch.com/2015/12/31/how-to-stem-the-global-shortage-ofdata-scientists/. Accessed 15 Jan 2020
14. Jain, A.: The 5 V’s of big data (September 2016). https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/. Accessed 15 Jan 2020
15. Lynch, K.: The role of big data in auditing and analytics (September 2019). https://www.analyticsinsight.net/the-role-of-big-data-in-auditing-andanalytics/. Accessed 15 Jan 2020
16. Palombo, S.: Top 10 areas where data analysis adds the most value. https://www.audimation.com/top-10-areas-where-data-analysis-adds-the-most-value/. Accessed 15 Jan 2020
17. Pelletier, J.: 6 essentials to jump-start data analytics in internal audit. https://iaonline.theiia.org/blogs/Jim-Pelletier/2018/Pages/6-Essentials-to-Jump-startData-Analytics-in-Internal-Audit.aspx. Accessed 15 Jan 2020
18. Protiviti: A global consulting firm. https://www.protiviti.com. Accessed 15 Jan 2020
19. Protiviti: Changing trends in internal audit and advanced analytics (2015). https://www.protiviti.com/sites/default/files/united_states/internal-audit-dataanalytics-whitepaper-protiviti.pdf. Accessed 15 Jan 2020
20. PwC: Transforming internal audit through data analytics. https://www.pwc.com/us/en/services/risk-assurance/advanced-risk-compliance-analytics/internalaudit-analytics.html. Accessed 15 Jan 2020
21. Thabet, N., Soomro, T.: Big data challenges. J. Comput. Eng. Inf. Technol. 4, 3 (2015)
22. Walsh, K.: Big data in auditing and analytics (March 2019). https://reciprocitylabs.com/big-data-in-auditing-and-analytics/. Accessed 15 Jan 2020
An Automated Visualization Feature-Based Analysis Tool
Rabiah Abdul Kadir1(B), Shaidah Jusoh2, and Joshua Faburada1
1 National University of Malaysia, Kuala Lumpur, Malaysia [email protected]
2 Xiamen University Malaysia, Sepang, Malaysia
Abstract. Data visualization and data analysis tools are among the main elements of data science. Data science involves finding answers to questions asked within any type of domain. It is a multidisciplinary study that draws on many relevant fields, such as data visualization and statistical analytics, to find insights from huge sets of structured or unstructured data. This study aims to develop an automated visualization feature-based analysis tool for unstructured data collected through an online survey for a research project entitled the impacts of social distancing during the Covid-19 pandemic. This paper presents the step-by-step process of design and development, and the prototype of the analysis tool.
Keywords: Data visualization · Data analysis · Data science · Feature based analysis
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 375–387, 2022. https://doi.org/10.1007/978-3-030-82196-8_28
1 Introduction
Data science is a multidisciplinary research field. It aims to find insights from huge raw data sets of structured or unstructured data. Data science experts use different kinds of methods to find answers from such data sets. These include statistical data analytics, predictive analytics, and machine learning to process and analyze the raw data. Data analytics can be considered a technical component for finding answers to the questions raised. Predictive analytics and machine learning techniques require data science experts to be proficient in more advanced computing techniques and tools, while statistical methods have been used for ages to process and analyze huge structured data. Although computing techniques have progressed very well and more intelligent data analytics tools have been produced, statistical data analysis tools such as Microsoft Excel and IBM SPSS Statistics remain the most convenient and widely used tools for analyzing huge data, and one needs good skills in these tools to analyze the data.
Having an automated data analysis tool is essential in data science. With such a tool, data science experts can focus on finding new insights for data-driven decision-making rather than being overwhelmed by the know-how needed to use statistical tools. A data analytics tool is normally used to organize, interpret, and evaluate data, and to make the data presentable. Thus, automated data analytics is useful not only for data science experts but also for enterprise stakeholders and business owners as a feedback mechanism. For example, the tool can be used to run a study on data to obtain results automatically, to improve business processes, or to adjust parameters or study inputs in real time. There are two types of data analysis methods, quantitative and qualitative, and most statistical data analysis tools are developed for quantitative data.
This study aims to design and develop an automated visualization data analysis tool for the data collected in the project entitled “the impact of social distancing during Covid-19”. The data was collected using an online survey over one week in May 2020. The survey questionnaire contains twenty-five items with fifty parameters. The developed tool is used to visualize the collected data and analyze the relationships between parameters. The tool can be used by anyone, including those unfamiliar with statistical tools such as Excel or SPSS. That said, at this stage our tool is used only for quantitative visualization data analysis and for academic purposes. This paper presents the process of design and development of the tool. It is organized as follows. Section 2 presents the related work, Sect. 3 presents the design and development of the tool, and Sect. 4 presents several key snapshots of the prototype tool. The paper is concluded in Sect. 5.
2 Related Work
In recent years, data science has become an important new multidisciplinary field. It is a mixture of several disciplines such as statistics, distributed systems, databases, and data mining. Some researchers [1] claim that the parents of data science are computer science and statistics, while others view statistics as the origin of data science. Nevertheless, the ingredients of data science include statistics, algorithms, data mining, machine learning, visual analytics, and so on, as shown in Fig. 1. Other techniques for data preparation, extraction, exploration, retrieval, computation, and transformation are also essential in data science. Data presentation and data explanation should not be neglected in data science processes [2]. Answering questions and making predictions for unseen situations are the reasons why data science is a booming field these days. Visualization and visual analytics are also seen as key elements of data science. To make use of data, people need to interpret it and have guidance on the data analysis. Appropriate data visualization techniques can support human perception and cognitive capabilities in prediction and interpretation. According to [3,4], there are three major components of data science: description, prediction, and counterfactual prediction. Description is about getting a quantitative summary from data using a simple method or sophisticated techniques such as learning algorithms. Data visualization can also be used to get a quantitative summary. Prediction is about using the available data to map to some real features for a specific domain of problems.
Fig. 1. The ingredients of data science [2]
The author in [2] describes data as a collection of anything at any time and at any place. Organizations digitally create warehouses of unstructured data, and these organizations face challenges in dealing with such huge quantities of data. The main issue is how to extract and derive value from the data. Any kind of data visualization conveys insights. The main duty of statistics is to learn from data in a principled way. Data visualization, often known as data viz, is a way of presenting data so that it can be easily understood. It can be as simple as a bar chart or as complex as interactive multimedia. With the introduction of data science, data visualization has become critically important. Data visualization is a pictorial or graphical representation of structured data generated by a software tool. It lets users interactively explore and analyze structured data and consequently identify interesting patterns. Using a data visualization tool, further analysis, such as inferring correlations and causalities, can be done to support sense-making and decisions [5]. The author in [6] notes that Keim and Kriegel [7] classified data visualization techniques into six categories, namely geometric, icon-based, pixel-oriented, hierarchical, graph-based, and hybrid. Data visualization techniques have been used in many fields. These include transportation systems [8], in which data visualization can be used to enhance transport systems based on traffic data, and bio-science [9], in which data visualization methods can support a better understanding of biological systems, ranging from simple analyses to complicated ones. Interactive data visualization has been introduced, in which human perception and cognition are deployed for better accuracy and effectiveness of data analysis.
The combination of data visualization and human perception and cognition may solve problems that neither can solve in isolation [10]. However, designing and developing proper visualizations for ordinary users is always hard, even with interactive data visualization tools such as Tableau, because users have to understand the data and its visualization well. The authors in [11] presented a visualization tool called DeepEye, which takes a keyword query over the available data set and generates and ranks suitable visualizations. The researchers in [12] introduced the concept of ‘cheat sheets’ for data visualization techniques. It is inspired by infographics, which contain a set of graphical presentations with textual annotations. The goal of cheat sheets is to provide materials that support the understanding of data. According to the article in [13], the best data visualization tools available on the market now include Tableau, Google Charts, Chartist.js, Grafana, Infogram, FusionCharts, Datawrapper, ChartBlocks, and D3.js. These tools offer a variety of graph-based visualization styles and are easy to use. This indicates that graph-based data visualization tools are the most preferred. Our automated analysis tool falls into the graph-based category.
3 Design and Development
The focus of this study is to facilitate statistical analysis of a large dataset and to present the results of the analysis under diverse relationship conditions. The prototype of the system is developed using the open-source programming languages JavaScript and PHP to support specific operations, including statistical analysis, visual graph relationships, and the implementation of three feature-based models. The system has been developed following the Feature-Driven Development (FDD) approach. The five main activities in FDD are performed iteratively, as shown in Fig. 2. The first step in the design process is to develop an overall model. We start by identifying the scope and context of the system. Throughout the project, we flesh this model out to reflect the building phase. The second step is to develop the feature functions and group them into related sets and subject domains. The third step is to plan by feature, which includes identifying the feature functions and the feature set for each identified feature. The majority of the effort in the FDD approach, roughly 75%, goes into the fourth and fifth steps: design the feature and build the feature. The fourth step produces the inspection and finalization details of each feature. In the fifth step, after the design is improved, the completed feature is added to the system for deployment. These two steps include tasks such as detailed modeling, programming, testing, and deploying the system. These five FDD steps were applied in the development of our visualization feature-based analysis tool.
Fig. 2. The FDD project lifecycle [14]
3.1 Develop an Overall Model
This step involves the process of identifying the scope and context of the system. The development of scopes or modules of the system domain area was guided by the system designer. This project consists of three modules: Dashboard of Data Visualization, Upload Dataset, and Data Analysis Tools which are connected to the Analysis Database as shown in Fig. 3. When the details of scope and context are created, these modules are progressively merged into an overall model.
Fig. 3. Three scopes of the feature-based analytical system
3.2 Develop a Feature Function
After the first step has produced an overall model, this step groups the scope and context into related sets and subject areas. The requirements are expressed as features in the form of action, result, and object. The overall model of the project is broken down into small activities within the sets and subject areas, and each individual activity is placed within one of those features. Figure 4 shows the combined activities of the model, which consists of three scopes or modules.
Fig. 4. Activities of feature-based analytical model
3.3 Plan by Feature
This step adjusts the overall sequence of the identified feature functions to take into account technical risk and appropriate dependencies. At the same time, it identifies the class and the specific feature set for each module in the proposed model. As shown in Fig. 3, each scope has been allocated its activities or feature set. This process is important for complex or critical classes. However, it becomes tricky to maintain true collective code as the size increases.
3.4 Design by Feature
In the Design by Feature step, each activity or feature is designed to show the interaction of objects and their arrangement. The objects exchange a sequence of process steps to carry out the functionality of the activities. This sequence is typically illustrated in a logical view, as shown in Fig. 5.
Fig. 5. Logical view of feature-based analytical system
4 Prototype and Discussion
This project constructs a prototype of a Feature-based Analytical System that allows researchers to delve deep into their ideas when reviewing the analysis of relationships between attributes in their dataset. It also conveys the overall design concept of the proposed system to users so that they can test it and give feedback for enhancement. Figures 6 and 7 show the basic interface of the data visualization dashboard.
Fig. 6. Sample of the dashboard interface showing a visualization of respondents’ country and gender
Fig. 7. Sample of available graph for visualization of respondents’ cost-increment and psychological-problem
The dataset is uploaded via the upload interface shown in Fig. 8.
Fig. 8. A Dashboard to upload dataset
The user can upload a new dataset in .csv format. A database of the uploaded data is generated for the purpose of analysis. Once the dataset has been uploaded, the raw data is displayed as shown in Fig. 9. Figure 10 shows how users interact with the analysis tool of the proposed system to choose the appropriate type of graph for their relationship analysis. The Feature-based Analytical System allows users (researchers) to analyse more than a single data relationship, up to three data relationships. The interface of the data analysis tool also provides the related matric table and a guideline for choosing the appropriate type of graph, as shown in Fig. 11 and Fig. 12. Furthermore, Fig. 13 demonstrates a sample data relationship analysis for single data, Fig. 14 illustrates two data relationships, and Fig. 15 displays three data relationships, each with the appropriate type of graph.
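The upload-and-select flow described above can be sketched roughly as follows. This is a minimal Python sketch rather than the authors' JavaScript/PHP implementation; the mapping from the number of variables to a graph type is an assumption read off the captions of Figs. 13–15, and the column names and sample rows are invented for illustration.

```python
import csv
import io

# Assumed mapping, taken from the captions of Figs. 13-15; the real tool may offer more.
GRAPH_BY_VARIABLE_COUNT = {1: "bar", 2: "line", 3: "bubble"}


def load_rows(csv_text):
    """Parse uploaded .csv text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def plan_analysis(rows, variables):
    """Validate the chosen variables and suggest a graph type for them."""
    if len(variables) not in GRAPH_BY_VARIABLE_COUNT:
        raise ValueError("choose between one and three variables")
    missing = [v for v in variables if rows and v not in rows[0]]
    if missing:
        raise KeyError(f"unknown column(s): {missing}")
    return {
        "graph": GRAPH_BY_VARIABLE_COUNT[len(variables)],
        "series": {v: [row[v] for row in rows] for v in variables},
    }


# Hypothetical survey extract; the column names are invented for illustration.
SAMPLE = "country,gender,cost_increment\nMalaysia,F,3\nJordan,M,5\n"
rows = load_rows(SAMPLE)
plan = plan_analysis(rows, ["country", "cost_increment"])
```

Keeping the count-to-graph mapping in a single table means a guideline such as the one in Fig. 12 could be changed without touching the parsing code.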
Fig. 9. Sample raw data
It is worth mentioning that our proposed analytics tool, DATool, focuses only on visualizing relationships within the available statistical data. The tool applies a statistical approach to model the association between a dependent variable and one or more independent variables. By comparison, Microsoft Excel and IBM SPSS Statistics each have their own strengths: they provide calculation features, graphing tools, and pivot tables that simplify large data sets, and they support a hypothesis-testing approach. Table 1 shows the differences between the tools.
Table 1. A comparison between the tools
Type of analytics tool | Description of strength
Microsoft Excel | Provides calculation features, graphing tools, and pivot tables
IBM SPSS Statistics | Supports a hypothesis-testing approach to the data
Our proposed analytics tool (DATool) | Provides correlation analysis to evaluate the strength of the relationship between two and up to three quantitative variables
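Table 1 credits DATool with correlation analysis over two or three quantitative variables. The paper does not say which coefficient is used, so the sketch below assumes the Pearson coefficient and computes it for every pair of variables; the function names, column names, and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(columns):
    """Correlate every pair among two or three named quantitative columns."""
    if not 2 <= len(columns) <= 3:
        raise ValueError("expected two or three variables")
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Hypothetical survey scores; names and values are invented for illustration.
data = {
    "cost_increment": [1, 2, 3, 4, 5],
    "psychological_problem": [2, 4, 6, 8, 10],
    "days_at_home": [5, 4, 3, 2, 1],
}
result = pairwise_correlations(data)
```

A coefficient near +1 or -1 indicates a strong linear relationship and a value near 0 a weak one; it does not by itself establish causation.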
Fig. 10. Snapshot of input for data analysis tool
Fig. 11. Matric table for related data relationship analysis
Fig. 12. User guideline to select the appropriate type of graph
Fig. 13. Single data relationship with bar graph visualization
Fig. 14. Two data relationship with line graph visualization
Fig. 15. Three data relationship with bubble graph visualization
5 Conclusion
We have presented our automated data analysis tool in this paper. Although several data analysis tools are currently available on the market, such as Microsoft Excel and IBM SPSS, users need some skill to use them. Our proposed data analysis tool can be used by users who have no statistical knowledge or skills in using statistical application tools. Future work will integrate data mining techniques to produce intelligent predictions from the stored data.
Acknowledgment. This research is supported by Universiti Kebangsaan Malaysia (National University of Malaysia), Malaysia, under research grant project code ZG2019-003.
References
1. Blei, D.M., Smyth, P.: Science and data science. Proc. Natl. Acad. Sci. 114(33), 8689–8692 (2017)
2. van der Aalst, W.: Data science in action. In: Process Mining, pp. 3–23. Springer, Heidelberg (2016)
3. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (1999)
4. Hernan, M.A., Hsu, J., Healy, B.: A second chance to get causal inference right: a classification of data science tasks. Chance 32(1), 42–49 (2019)
5. Bikakis, N.: Big data visualization tools. arXiv preprint arXiv:1801.08336 (2018)
6. Chan, W.W.-Y.: A survey on multivariate data visualization. Dep. Comput. Sci. Eng., Hong Kong Univ. Sci. Technol. 8(6), 1–29 (2006)
7. Keim, D.A., Kriegel, H.-P.: Visualization techniques for mining large databases: a comparison. IEEE Trans. Knowl. Data Eng. 8(6), 923–938 (1996)
8. Chen, W., Guo, F., Wang, F.-Y.: A survey of traffic data visualization. IEEE Trans. Intell. Transp. Syst. 16(6), 2970–2984 (2015)
9. Kerren, A., Kucher, K., Li, Y.-F., Schreiber, F.: Biovis explorer: a visual guide for biological data visualization techniques. PLoS One 12(11), e0187341 (2017)
10. Steed, C.A.: Interactive data visualization. In: Data Analytics for Intelligent Transportation Systems, pp. 165–190. Elsevier (2017)
11. Luo, Y., Qin, X., Tang, N., Li, G., Wang, X.: DeepEye: creating good data visualizations by keyword search. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1733–1736 (2018)
12. Wang, Z., Sunding, L., Murray-Rust, D., Bach, B.: Cheat sheets for data visualization techniques. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–13 (2020)
13. Chapman, C.: A complete overview of the best data visualization tools (2020). https://www.toptal.com/designers/datavisualization/data-visualization-tools
14. Ambler, S.: Introduction to agile model driven development (2015). https://www.toptal.com/designers/datavisualization/data-visualization-tools
High Capacity Data Hiding for AMBTC Decompressed Images Using Pixel Modification and Difference Expansion
Lee-Jang Yang1, Fang-Ping Pai1, Ying-Hsuan Huang1(B), and Ching-Ya Tseng2
1 Aeronautical Systems Research Division, National Chung-Shan Institute of Science and Technology, Taichung, Taiwan
2 Electronic Systems Research Division, National Chung-Shan Institute of Science and Technology, Taoyuan, Taiwan
Abstract. Data hiding embeds secret data into various multimedia to avoid arousing hackers’ suspicion and attracting attacks. Recently, methods of embedding data into decompressed images have been proposed because they are very suitable for military communications. In this paper, we propose an improved strategy to enhance the hiding capacity of previous AMBTC-based hiding methods, which cannot embed secret data into smooth blocks. In addition, the developers of those methods did not investigate the overflow and underflow of pixels. Consequently, we also propose an effective strategy that avoids overflow and underflow problems without recording extra data. With these advantages, the proposed method achieves a high embedding capacity.
Keywords: Data hiding · Hiding capacity · PSNR value
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 388–403, 2022. https://doi.org/10.1007/978-3-030-82196-8_29
1 Introduction
Data hiding can be used to embed secret data into various multimedia to avoid arousing hackers’ suspicion and attracting attacks, where common multimedia include grayscale images [1], color images [3, 4], medical images [2], videos [5], audio [6], and decompressed images [7–11]. Decompressed images are extensively used as cover images because they are very suitable for military communications [11]. The methods for hiding secret information can be classified mainly into three categories, i.e., difference expansion [12–14], histogram shifting [15–17], and compression methods [18]. In the difference expansion method, the difference between two pixels is expanded to be an even number. During the data embedding phase, the even number is changed to an odd number if the secret value is equal to 1. Otherwise, the even number remains unchanged. This rule makes it easy to extract the secret data: if the difference is an even number, then the extracted secret bit is 0; otherwise, the extracted secret bit is 1. The method is simple, but difference expansion causes serious distortion of images. In order to solve that problem, histogram shifting was proposed, which only shifts the pixels between the peak point and the zero point to create a hidden space for embedding
secret data. The method can effectively control the quality of the stego image, but the hiding capacity is low. To date, many compression-based data hiding methods have been proposed that can decrease the size of a digital file or accelerate its transmission. The classic compression-based data hiding methods include those based on JPEG and on block truncation coding (BTC) [20]; the former can recover the original image but cost more in computation. Different from JPEG [18], BTC only calculates the mean and standard deviation. Given the advantages of absolute moment block truncation coding (AMBTC) [21], this paper focuses on data hiding methods that are based on AMBTC. Hong et al. [22] proposed a bit-plane flipping method to embed secret data into a block, where an embeddable block satisfies the rule $L_m < H_m$. If the secret bit is 1, then the bit-plane is flipped. Otherwise, the bit-plane remains unchanged. The method is simple, but it cannot embed secret data into a smooth block with $L_m = H_m$. Chen et al. (2010) [23] proposed an improved method that embeds secret bits into the bit-plane of the smooth block. Note that the values in the bit-plane of the smooth block are the same. Therefore, in the data extraction phase, the smooth block can be recovered easily by $L_m = H_m$. Since Chen et al. embedded secret data into all values of the bit-plane, the hiding capacity of their method is higher than that of Hong et al.’s method. Different from the methods mentioned above, Li et al. [19] proposed a hiding method for BTC-compressed images that uses two procedures to embed secret data. In the first procedure, the secret data are embedded by the bit-plane flipping technique (also called swapping the high mean and low mean). Afterwards, the histogram-shifting method is used to embed secret data into the two kinds of mean. Lin et al. (2013) [7] proposed a novel AMBTC-compressed image hiding method that consists of four disjoint cases.
After embedding the data, if the number of cases is less than the threshold, the AMBTC block cannot be used to embed secret data. Otherwise, it can be used for that purpose. In 2015, Ou and Sun [8] proposed a data hiding scheme with minimum distortion. Like previous methods, the blocks are classified according to the relationship between the pre-determined threshold $T$ and the difference between $L_m$ and $H_m$. If $H_m - L_m < T$, then the bit-plane is replaced by secret data. Different from the previous methods, the order of $H_m$ and $L_m$ is also used to embed secret data. This technique skillfully increases the hiding capacity without distorting the image. Malik et al. (2017) [9] converted one bit-plane into two bit-planes to embed massive secret data and maintain a satisfactory image quality. However, the method cannot recover the AMBTC decompressed image losslessly. In order to solve this problem, in 2018 they proposed a method for adjusting the pixel value [10]. In that method, all secret bits are converted into base-3 digits, which are embedded into the pixels of the AMBTC decompressed image. Although at most two secret bits are embedded into each decompressed pixel, the visual quality of the stego image becomes worse as the frequency of occurrence of the maximum secret values increases. In order to enhance the hiding capacity and the quality of the stego image, Yeh et al. [11] proposed an effective encoding method that calculates the entropy to determine an appropriate codebook. With this codebook, most values are encoded as the absolute minimum value to reduce the distortion caused by embedding data. On the contrary, the secret data
with the lowest frequency are encoded as the maximum value. These encoded values are embedded into the AMBTC decompressed image. However, some smooth regions of the image cannot be used to embed secret data. In this paper, we expand the pixel difference of the smooth region to embed secret data. Also, we use the first $L_m$ and the first $H_m$ to avoid pixel overflow and underflow problems. The remainder of the paper is organized as follows. Section 2 describes the proposed method, which consists of the AMBTC procedure, secret encoding, data embedding, data extraction, and reconstruction of the AMBTC decompressed image. Section 3 provides the experimental results, and Sect. 4 presents our conclusions and future work.
2 Proposed Method
Since Malik et al.’s method [10] and Yeh et al.’s method [11] cannot embed secret data into the smooth block, we expand the difference between $L_m$ and $H_m$ to embed more secret data. Figure 1 shows the flowchart of embedding data using the proposed method, and the details are described in Subsect. 2.1 through Subsect. 2.3. First, a cover image is compressed and then decompressed by AMBTC. In addition, the difference of the smooth regions of the decompressed image is expanded, and a flag bit is used as reference information for identifying the smooth block. The flag bits and secret data are encoded as smaller digits, thereby reducing the distortion caused by embedding. In the embedding phase, the first $L_m$ and the first $H_m$ are fixed to serve as reference information for the extraction of data and the recovery of the image. In addition, they are also used to identify whether a pixel overflow or underflow problem exists. If either $L_m$ or $H_m$ approaches an extreme value (0 or 255), then all pixels in the block remain unchanged to avoid the overflow or underflow problem. Otherwise, both the flag bits and the secret data are embedded into the image to obtain the stego image.
Fig. 1. Flowchart of embedding data using the proposed method.
2.1 AMBTC and Image Postprocessing
In the AMBTC method, an image $I$ is divided into $(L \times W)/(n \times n)$ non-overlapping blocks, where $L$ and $W$ denote the length of the image and its width, respectively. The notation $n$ is a parameter. In general, the compression ratio becomes higher as $n$ increases, but the quality of the decompressed image decreases. According to previous research [21], an appropriate value for $n$ is 4. After the segmentation, the mean of the $n \times n$ pixels is calculated for each block, i.e.,
$$\bar{P} = \frac{1}{n \times n} \sum_{i=1}^{n \times n} P_i. \tag{1}$$
Based on $\bar{P}$, each pixel in the block can be compressed as one bit, i.e.,
$$P_i' = \begin{cases} 1, & \text{if } P_i > \bar{P}, \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$
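In order to represent the two kinds of compressed values, their low mean and high mean values are calculated, i.e.,
$$L_m = \bar{P} - \frac{n \times n \times \alpha}{2(n \times n - q)}, \tag{3}$$
$$H_m = \bar{P} + \frac{n \times n \times \alpha}{2q}, \tag{4}$$
where $q$ denotes the number of pixels in the block that are greater than $\bar{P}$ and $\alpha$ denotes the absolute moment, i.e.,
$$\alpha = \frac{1}{n \times n} \sum_{i=1}^{n \times n} \left| P_i - \bar{P} \right|. \tag{5}$$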
In order to embed secret data into the smooth region of images, $L_m$ is decreased by $2^{k-1}$, while $H_m$ is increased by $2^{k-1}$, where $2^{k-1}$ is half of the maximum alteration level $2^k$. In other words, the smooth region is changed into a complex region. However, some altered values overlap with part of the unaltered values. In order to discriminate between the altered values and the unaltered values, an identification flag is used, i.e.,
$$f = \begin{cases} 1, & \text{if } H_m - L_m < 2^k \text{ and } 2 \le \{H_m, L_m\} \le 253, \\ 0, & \text{if } 2^k \le H_m - L_m < 2^{k+1}, \\ 0, & \text{if } L_m < 2 \text{ or } H_m > 253, \\ \varphi, & \text{otherwise.} \end{cases} \tag{6}$$
The above flags are prepended to the secret data, and the combined bit stream is embedded into the cover image by the procedures described in Subsects. 2.2 and 2.3.
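The block-level computations of Eqs. (1)–(6) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `ambtc_block` and `expand_smooth_block` are hypothetical, $q$ is assumed to be the number of pixels above the block mean, and rounding the two means to the nearest integer is an assumption, since the rounding rule is not stated in the text.

```python
import numpy as np

def ambtc_block(block):
    """Return (bit-plane, low mean, high mean) of one block, Eqs. (1)-(5)."""
    block = np.asarray(block, dtype=float)
    size = block.size                         # n x n pixels
    mean = block.mean()                       # Eq. (1)
    bitplane = (block > mean).astype(int)     # Eq. (2)
    q = int(bitplane.sum())                   # assumed meaning of q: pixels above the mean
    if q == 0:                                # flat block: both means equal the block mean
        flat = int(np.floor(mean + 0.5))
        return bitplane, flat, flat
    alpha = np.abs(block - mean).mean()       # Eq. (5)
    low = mean - size * alpha / (2 * (size - q))   # Eq. (3)
    high = mean + size * alpha / (2 * q)           # Eq. (4)
    # Rounding to the nearest integer is an assumption of this sketch.
    return bitplane, int(np.floor(low + 0.5)), int(np.floor(high + 0.5))

def expand_smooth_block(low, high, k=2):
    """Return (low', high', flag) following Eq. (6); flag None means no flag is recorded."""
    half = 2 ** (k - 1)
    diff = high - low
    if low < 2 or high > 253:                 # near 0 or 255: leave the block untouched
        return low, high, 0
    if diff < 2 ** k:                         # smooth block: expand the difference
        return low - half, high + half, 1
    if diff < 2 ** (k + 1):                   # overlapping range: unchanged, flag 0
        return low, high, 0
    return low, high, None                    # complex block: no flag needed

# Values taken from the worked example of Fig. 2 (k = 2).
print(expand_smooth_block(98, 100))    # smooth block
print(expand_smooth_block(95, 100))    # overlapping range
print(expand_smooth_block(253, 255))   # near the upper extreme
```

The bounds 2 and 253 are copied from Eq. (6) and correspond to $k = 2$; for other values of $k$ they would presumably need to change with $2^{k-1}$.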
Figure 2 displays an example of the method mentioned above. First, the image is divided into several 4 × 4 blocks. The mean of the first block is calculated, i.e., $\bar{P}$ = (79 + 178 + … + 185)/16 = 157, and it is used as the reference for the bit-plane. For example, the first pixel in the block is 79, which is smaller than $\bar{P}$ = 157, so it is encoded as “0”. After completing the bit-plane, both $L_m$ and $H_m$ are calculated, i.e., $L_m$ = 76 and $H_m$ = 184. The difference between $L_m$ and $H_m$ is far greater than the maximum alteration level $2^k$ and lies outside the overlapping range of Eq. (6), thus both of them remain unchanged. In addition, no flag bit is recorded for this block.
Fig. 2. Example of AMBTC with image pre-processing.
The second and third blocks are processed in the same way. The average value of the pixels of the second block is 99.75. Afterwards, the block is encoded as the bit-plane according to the average value. Then, the mean of the pixels that belong to the lower-value group is calculated, i.e., $L_m$ = 98. Also, the mean of the pixels that belong to the higher-value group is calculated, i.e., $H_m$ = 100. Their difference is smaller than the maximum alteration level $2^k$, and they are not extreme values, so both $L_m$ and $H_m$ are altered, i.e., $L_m' = L_m - 2 = 96$ and $H_m' = H_m + 2 = 102$. In order to mark the altered values, the flag is set to 1. In the third block, the average value of the pixels is 97.5, and it is used to classify the pixels to obtain a bit-plane. According to the bit-plane, the two means are calculated as $L_m$ = 95 and $H_m$ = 100. Since the difference between $L_m$ and $H_m$ is greater than the maximum alteration level $2^k$, the two values remain unchanged. However, the difference belongs to the overlapping range, so the flag is set to 0. Like the procedure mentioned
above, the average value of the pixels of the fourth block is calculated, and the bit-plane and two means are derived, i.e., $L_m$ = 253 and $H_m$ = 255. Obviously, there is a high probability of an overflow problem. In order to avoid this problem, the block remains unchanged, and the flag is set to 0.
2.2 Secret Encoding
A total of $2^k$ dictionaries can be generated according to the number of secret bits $k$. Let $D_i$ be the $i$th dictionary, which includes $2^k + 1$ codewords. Assume that $k = 2$. The codewords of the first dictionary are {000, 001, 01, 10, 11}, the codewords of the second dictionary are {00, 010, 011, 10, 11}, and so on. Different dictionaries yield different encoding effectiveness, thus the proposed method uses the dictionary with the smallest entropy $H$. The entropy is defined as follows:
$$H(S_p) = -\sum_{k=1}^{n} pr(S_{p_k}) \log_2 pr(S_{p_k}). \tag{7}$$
A lower entropy indicates that the encoded values are more concentrated on the same symbols, which yields higher encoding effectiveness. After determining an appropriate dictionary, the sorting index of each symbol is encoded as the embedded value, where the symbol with the highest frequency of appearance is encoded as the absolute minimum value “0” to decrease the distortion of the image introduced in the data embedding phase. By contrast, the symbol with the lowest frequency of appearance is encoded as the absolute maximum value. The equation of data encoding is as follows:
$$e = \begin{cases} -\left\lfloor \dfrac{\text{sort index}}{2} \right\rfloor, & \text{if sort index is an odd number,} \\ \dfrac{\text{sort index}}{2}, & \text{otherwise.} \end{cases} \tag{8}$$
The following example illustrates the steps mentioned above. Assume that the 34 embedded bits, consisting of flag bits and secret data, are {1001010111011110100111010110}. According to the symbols in the first dictionary, $D_1$, the bits are divided as {10, 01, 01, 01, 11, 01, 11, 10, 10, 01, 11, 01, 01, 10}. In addition, the frequencies of appearance of the five symbols are counted, i.e., {0, 0, 7, 4, 3}, as listed in Table 1. According to the frequencies, the entropy $H$ is calculated, i.e., $H(S_p)$ = (14/34) × log2(1/(14/34)) + (8/34) × log2(1/(8/34)) + (6/34) × log2(1/(6/34)) = 1.459. The entropies of the second to the fourth dictionaries are calculated in the same way. They are 1.79, 1.94 and 1.79, as listed in Table 1. After calculating the entropies, the first dictionary is selected because its entropy is smaller than the others. In addition, according to Eq. (8), the symbol $S_p$ with the highest frequency of appearance is transformed into the smallest value, as listed in Table 2. Conversely, the symbol $S_p$ with the lowest frequency of appearance is converted into the largest value.
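The mapping from a bit stream to embedded values can be sketched as follows, using the first dictionary and the bit string of the example above. This is a hedged illustration only: the helper names `split_bits`, `encode_index` and `build_code_table` are hypothetical, ties between equally frequent symbols are assumed to be broken by dictionary order, and the entropy-based choice among the dictionaries is left out.

```python
from collections import Counter

D1 = ["000", "001", "01", "10", "11"]   # first dictionary for k = 2

def split_bits(bits, codewords):
    """Cut a bit string into codewords (the dictionary is assumed prefix-free)."""
    symbols, start = [], 0
    for end in range(1, len(bits) + 1):
        if bits[start:end] in codewords:
            symbols.append(bits[start:end])
            start = end
    return symbols

def encode_index(sort_index):
    """Eq. (8): map a 1-based sorted index to a signed embedded value."""
    if sort_index % 2 == 1:
        return -(sort_index // 2)
    return sort_index // 2

def build_code_table(symbols, codewords):
    """Rank codewords by descending frequency and assign embedded values."""
    freq = Counter(symbols)
    # sorted() is stable, so ties keep dictionary order (an assumption of this sketch).
    ranked = sorted(codewords, key=lambda s: -freq[s])
    return {s: encode_index(rank) for rank, s in enumerate(ranked, start=1)}

bits = "1001010111011110100111010110"
symbols = split_bits(bits, D1)
table = build_code_table(symbols, D1)
encoded = [table[s] for s in symbols]
print(table)
print(encoded)
```

Under these assumptions the resulting table agrees with Table 2: the most frequent symbol “01” maps to 0, while the unused symbols “000” and “001” receive the largest magnitudes.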
Table 1. An example of the calculation of entropy.
ID of dictionary   S_p   Freq. of S_p   Entropy
#1                 000   0              1.459
                   001   0
                   01    7
                   10    4
                   11    3
#2                 00    0              1.79
                   010   3
                   011   1
                   10    4
                   11    4
#3                 00    1              1.94
                   01    3
                   100   1
                   101   3
                   11    3
#4                 00    0              1.79
                   01    4
                   10    4
                   110   3
                   111   1
Table 2. An example of encoding data.
S_p   Freq. of S_p   Sorted index   e
000   0              4              2
001   0              5              −2
01    7              1              0
10    4              2              1
11    3              3              −1
2.3 Data Embedding
After completing the encoding of the data, the encoded values are embedded into the decompressed AMBTC image. Each decompressed pixel $P_i$, except the first $L_m$ and the first $H_m$, is increased by the encoded value $e$, i.e., $P_i' = P_i + e$.
Figure 3 shows an example of embedding data. The first $L_m$ and the first $H_m$, shown with gray backgrounds, remain unchanged. The third decompressed pixel “192” can be used to embed the encoded value “1” and obtain the stego pixel, i.e., $P_3' = 192 + 1 = 193$. The remaining pixels are processed by the same procedure to obtain the stego block.
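The embedding rule can be sketched on a flattened block as follows. The function name `embed_block` and the four-pixel block are hypothetical (the full block of Fig. 3 is not reproduced in the text); the sketch assumes raster-scan order and that both means occur in the block with different values.

```python
def embed_block(decompressed, low, high, values):
    """Add encoded values to every pixel except the first `low` and the first `high`."""
    stego = list(decompressed)
    # The first occurrence of each mean is kept as reference information.
    reference = {stego.index(low), stego.index(high)}
    pending = iter(values)
    for i in range(len(stego)):
        if i in reference:
            continue
        e = next(pending, None)
        if e is None:           # no more encoded values to embed
            break
        stego[i] += e           # P_i' = P_i + e
    return stego

# Hypothetical decompressed block with L_m = 68 and H_m = 192.
print(embed_block([68, 192, 192, 68], 68, 192, [1, -1]))
```

In this toy block the third pixel becomes 193, as in the example above, while the two leading reference pixels are left untouched.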
Fig. 3. Example of embedding data.
2.4 Extraction and Recovery Phase
Figure 4 shows the flowchart of the extraction of the data and the recovery of the image. First, the encoded values are extracted according to the reconstructed dictionary. Then, these values are decoded as the flag bits and secret data, where the former are used to losslessly recover the decompressed AMBTC image. The details of data extraction and image recovery are listed as follows:
Step 1: Scan the blocks of the stego image, where the block size is n × n.
Step 2: Extract the first $H_m'$ and the first $L_m'$, where their difference is higher than or equal to 4.
Step 3: Reveal the embedded values by
$$e = \begin{cases} P_i' - H_m', & \text{if } |P_i' - H_m'| < |P_i' - L_m'|, \\ P_i' - L_m', & \text{otherwise.} \end{cases} \tag{9}$$
Step 4: Map the embedded value $e$ into the selected dictionary to extract the encoded values, including flag bits and secret data.
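Step 5: Re-scan the blocks of the stego image.
Step 6: According to the identification flag $f$ and the difference between $H_m'$ and $L_m'$, the original values of $H_m$ and $L_m$ are recovered losslessly, i.e.,
$$H_m = \begin{cases} H_m' - 2, & \text{if } H_m' - L_m' < 2^{k+1} \text{ and } f = 1, \\ H_m', & \text{if } H_m' - L_m' < 2^{k+1} \text{ and } f = 0, \\ H_m', & \text{otherwise,} \end{cases} \tag{10}$$
$$L_m = \begin{cases} L_m' + 2, & \text{if } H_m' - L_m' < 2^{k+1} \text{ and } f = 1, \\ L_m', & \text{if } H_m' - L_m' < 2^{k+1} \text{ and } f = 0, \\ L_m', & \text{otherwise.} \end{cases} \tag{11}$$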
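Steps 3 and 6 can be sketched as follows. The names `extract_values` and `recover_means` are hypothetical, the two reference means are assumed to have been identified already (Step 2), and the comparison in Eq. (9) is taken on absolute differences. The sketch follows Eq. (6) and the worked example in treating $f = 1$ as the mark of an expanded block, and it assumes the offset $2^{k-1}$, which equals 2 for $k = 2$.

```python
def extract_values(stego, low, high):
    """Eq. (9): each embedded value is the offset from the nearer reference mean."""
    reference = {stego.index(low), stego.index(high)}
    values = []
    for i, pixel in enumerate(stego):
        if i in reference:
            continue
        d_high, d_low = pixel - high, pixel - low
        values.append(d_high if abs(d_high) < abs(d_low) else d_low)
    return values

def recover_means(low, high, flag, k=2):
    """Eqs. (10)-(11): undo the expansion of a block whose flag marks it as altered."""
    if high - low < 2 ** (k + 1) and flag == 1:
        half = 2 ** (k - 1)     # assumed offset; equals 2 for k = 2
        return low + half, high - half
    return low, high

# Hypothetical stego block with reference means 68 and 192.
print(extract_values([68, 192, 193, 67], 68, 192))
# Pairs from the worked recovery example (k = 2).
print(recover_means(96, 102, 1))
print(recover_means(95, 100, 0))
print(recover_means(253, 255, 0))
```

The extracted values would then be mapped back to codewords through the selected dictionary (Step 4), which is omitted here.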
Fig. 4. Flowchart of data extraction and image recovery of the proposed method.
Fig. 5. Example illustrating the proposed extraction stage.
Figure 5 shows an example of extracting and decoding data. First, both the first $L_m'$ and the first $H_m'$ are extracted, i.e., $L_m'$ = 68 and $H_m'$ = 192, where their difference is greater than or equal to $2^{k+1}$. Afterwards, according to Eq. (9), the third stego pixel is decreased by the corresponding mean value to obtain the embedded value, i.e., $e$ = 193 − 192 = 1. In addition, the extracted value is mapped into the dictionary
Fig. 6. Example of recovering the original AMBTC-decompressed image.
to obtain the original bits “11.” The remaining pixels are processed in the same way. After obtaining all original bits, including flag bits and secret data, the former are used to recover $L_m$ and $H_m$ losslessly. The first pair of $L_m'$ and $H_m'$ is {68, 192} and its difference is greater than $2^{k+1}$, therefore $L_m'$ and $H_m'$ are the original means. The difference of the second pair {96, 102} is smaller than $2^{k+1}$, and the first flag bit is 1, thus $L_m$ and $H_m$ are recovered by $L_m = L_m' + 2 = 98$ and $H_m = H_m' - 2 = 100$. The difference of the third pair {95, 100} is smaller than $2^{k+1}$, and the second flag bit is 0, therefore they are the original means. The difference of the fourth pair {253, 255} is smaller than $2^{k+1}$ and the third flag bit is 0, thus the pair {253, 255} does not need any recovery procedure, as shown in Fig. 6.
3 Experimental Results
In order to compare the proposed method with prior methods, the 1338 UCID images and four standard grayscale images were used as test images, where the size of the former is 512 × 384 and that of the latter is 512 × 512, as shown in Fig. 7. In addition, the secret data were the same as those presented in the literature [11]. After embedding the secret data into the images, the peak signal-to-noise ratio (PSNR) is used to measure the difference between the original image and the stego image, i.e.,
$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}} \ (\mathrm{dB}), \tag{12}$$
Fig. 7. Cover images: (a) part of the 1338 UCID images; (b) four standard images (Airplane, Lena, Splash, Boats).
$$\mathrm{MSE} = \frac{1}{M \times N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( P_{i,j} - P_{i,j}' \right)^2, \tag{13}$$
where $P_{i,j}$ and $P_{i,j}'$ are the original and stego pixel values located at $(i, j)$, respectively. The visual quality of the stego image becomes better as the PSNR value becomes greater. In other words, a higher PSNR value indicates a smaller difference between the original image and the stego image. Figure 8 shows that the PSNR values of the smooth image “Airplane” are higher than those of the complex image “Boats”, and that the maximum embedding capacity of the smooth image is slightly lower than that of the complex image. This is because the AMBTC method is very suitable for smooth images, and the proposed method slightly modifies $H_m$ and $L_m$ to change a smooth block into a complex block and thereby embed secret data.
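Equations (12) and (13) can be sketched as follows, assuming 8-bit grayscale images stored as arrays of equal shape and the usual squared-error form of the MSE. The function name `psnr` and the toy images are hypothetical.

```python
import numpy as np

def psnr(original, stego):
    """PSNR in dB between two 8-bit images of equal shape, Eqs. (12)-(13)."""
    original = np.asarray(original, dtype=float)
    stego = np.asarray(stego, dtype=float)
    mse = np.mean((original - stego) ** 2)    # Eq. (13)
    if mse == 0:
        return float("inf")                   # identical images
    return 10 * np.log10(255 ** 2 / mse)      # Eq. (12)

cover = np.zeros((4, 4))
stego = cover + 1          # every pixel changed by one gray level
print(round(psnr(cover, stego), 2))
```

Returning infinity for identical images is a convention chosen for this sketch, since Eq. (12) is undefined when the MSE is zero.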
Figure 9 compares the results of the proposed method with those of the related methods [7, 8, 10, 11]. Under the same embedding ratio, the PSNR values of the proposed method were higher than those of Lin et al.’s method [7]. This is because in Lin et al.’s method the bit-planes are replaced directly by secret bits, and both the two kinds of mean and the variable in the block are re-calculated as decompressed pixels. In other words, the method cannot recover the AMBTC decompressed image. The proposed method has higher PSNR values than Ou and Sun’s method [8]. This is because both the high mean and the low mean of Ou and Sun’s method are re-calculated after replacing the bit-plane in the smooth block by secret bits, which decreases the quality of the stego image. By contrast, the proposed method directly modifies the decompressed pixels, which makes its hiding capacity higher than that of both Lin et al.’s method [7] and Ou and Sun’s method [8]. Compared with Malik et al.’s method [10] and Yeh et al.’s method [11], most PSNR values of the proposed method are slightly lower because the proposed method also embeds flag bits, which degrades the quality of the image. However, the proposed method can embed secret data into the whole image, including the smooth blocks and the complex blocks. In other words, the maximum embedding capacity of the proposed method is superior to that of Malik et al.’s method [10] and Yeh et al.’s method [11].
Fig. 8. Experimental results of the proposed method with different n.
Figure 10 shows the experimental results of the 1,338 UCID images with different values of k, where most PSNR values are higher than 25 dB while most embedding capacities are higher than 1 bpp. This suggests that the proposed method is practical.
Fig. 9. Comparison of the proposed method with the related methods: (a) Airplane; (b) Lena; (c) Splash.
Fig. 10. Experimental results of the 1338 UCID images.
4 Conclusions
An improved hiding method was proposed that can embed secret data into both smooth and complex regions of the image, and it therefore has a higher embedding capacity than prior methods. In the proposed method, the difference between $H_m$ and $L_m$ in the smooth region is expanded, and only a few extra bits are generated to discriminate the expanded values from the non-expanded values. Consequently, the proposed method can embed massive secret data and recover the decompressed image. In addition, we improved the handling of overflow and underflow, which does not require recording the ID numbers of the blocks with overflow and underflow. It is expected that the proposed method will achieve higher practicability.
References 1. Lu, T.C., Wu, J.H., Huang, C.C.: Dual-image based reversible data hiding method using center folding strategy. Signal Process. 115, 195–213 (2015)
2. Lu, T.C., Tseng, C.Y., Huang, C.C., Deng, K.M.: 16-bit DICOM medical images lossless hiding scheme based on edge sensing prediction mechanism. In: The Eighth International Conference on Genetic and Evolutionary Computing (ICGEC), pp. 189–196 (2014) 3. Lin, C.C., Lai, C.S., Liao, W.Y.: A novel data hiding scheme for color images based on GSBTC. In: 2nd International Conference on Ubiquitous Information Management and Communication, pp. 561–565 (2008) 4. Chou, C., Lin, C.C.: Hybrid color image steganography method used for copyright protection and content authentication. J. Inf. Hiding Multimed. Signal Process. 6(4), 686–696 (2015) 5. Jeni, M., Srinivasan, S.: Reversible data hiding in videos using low distortion transform. In: 2013 International Conference on Information Communication and Embedded Systems (ICICES), Chennai, India (2013) 6. Yan, D., Wang, R.: Reversible data hiding for audio based on prediction error expansion. In: 2008 International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Harbin, China (2008) 7. Lin, C.-C., Liu, X.-L., Tai, W.-L., Yuan, S.-M.: A novel reversible data hiding scheme based on AMBTC compression technique. Multimed. Tools Appl. 74(11), 3823–3842 (2013). https://doi.org/10.1007/s11042-013-1801-5 8. Ou, D., Sun, W.: High payload image steganography with minimum distortion based on absolute moment block truncation coding. Multimed. Tools Appl. 74(21), 9117–9139 (2014). https://doi.org/10.1007/s11042-014-2059-2 9. Malik, A., Sikka, G., Verma, H.K.: A high payload data hiding scheme based on modified AMBTC technique. Multimed. Tools Appl. 76(12), 14151–14167 (2016). https://doi.org/10.1007/s11042-016-3815-2 10. Malik, A., Sikka, G., Verma, H.K.: An AMBTC compression based data hiding scheme using pixel value adjusting strategy. Multidimens. Syst. Signal Process. 29(4), 1801–1818 (2017). https://doi.org/10.1007/s11045-017-0530-8 11. 
Yeh, J.Y., Chen, C.C., Liu, P.L., Huang, Y.H.: High-payload data hiding method for AMBTC decompressed images. Entropy 22(145), 1–13 (2020) 12. Tian, J.: Reversible data embedding using a difference expansion. IEEE Trans. Circuits Syst. Video Technol. 13(8), 890–896 (2003) 13. Thodi, D.M., Rodriguez, J.J.: Prediction-error based reversible watermarking. In: 2004 International Conference on Image Processing, Singapore, vol. 3, pp. 1549–1552 (2004) 14. Lee, C.F., Chen, H.L., Tso, H.K.: Embedding capacity raising in reversible data hiding based on prediction of difference expansion. J. Syst. Softw. 83(10), 1864–1872 (2010) 15. Ni, Z., Shi, Y.Q., Ansari, N., Su, W.: Reversible data hiding. IEEE Trans. Circuits Syst. Video Technol. 16(3), 354–362 (2006) 16. Tai, W.L., Yeh, C.M., Chang, C.C.: Reversible data hiding based on histogram modification of pixel differences. IEEE Trans. Circuits Syst. Video Technol. 19(6), 906–910 (2009) 17. Lu, T.-C., Chen, C.-M., Lin, M.-C., Huang, Y.-H.: Multiple predictors hiding scheme using asymmetric histograms. Multimed. Tools Appl. 76(3), 3361–3382 (2016). https://doi.org/10. 1007/s11042-016-3960-7 18. Wang, K., Lu, Z.M., Hu, Y.J.: A high capacity lossless data hiding scheme for JPEG images. J. Syst. Softw. 86(7), 1965–1975 (2013) 19. Li, C.H., Lu, Z.M., Su, Y.X.: Reversible data hiding for BTC-compressed images based on bitplane flipping and histogram shifting of mean tables. Inf. Technol. J. 10(7), 1421–1426 (2011) 20. Delp, E., Mitchell, O.: Image compression using block truncation coding. IEEE Trans. Commun. 27(9), 1335–1342 (1979) 21. Lema, M., Mitchell, O.: Absolute moment block truncation coding and its application to color images. IEEE Trans. Commun. 32(10), 1148–1157 (1984)
22. Hong, W., Chen, T.S., Shiu, C.W.: Lossless steganography for AMBTC-compressed images. In: 2008 Congress on Image and Signal Processing, Sanya, China, vol. 2, pp. 13–17 (2008) 23. Chen, J., Hong, W., Chen, T.S., Shiu, C.W.: Steganography for BTC compressed images using no distortion technique. Imaging Sci. J. 58(4), 177–185 (2010)
SIFCM-Shape: State-of-the-Art Algorithm for Clustering Correlated Time Series
Chen Avni(B), Maya Herman, and Ofer Levi
Computer Science Department, Open University of Israel, Ra’anana, Israel
Abstract. Time series clustering is an important and challenging problem in data mining that is used to gain insight into the mechanisms that generate the time series. Large volumes of time series sequences appear in almost every field, including astronomy, biology, meteorology, medicine, finance, robotics, engineering and others. With the increase in the availability and volume of time series data, many time series clustering algorithms have been proposed to extract valuable information. Time series clustering algorithms can be organized into three main groups, depending upon whether they work directly on raw data, with features extracted from the data, or with a model built to best reflect the data. In this article, we present a novel algorithm, SIFCM-Shape, for clustering correlated time series. The algorithm is based on the K-Shape and Fuzzy C-Shape time series clustering algorithms. SIFCM-Shape improves K-Shape and Fuzzy C-Shape by adding a fuzzy membership degree that is incorporated into the clustering process. Moreover, it also takes into account the correlation between time series. Hence, the clustering results of this method are expected to be more accurate for related time series. We evaluated the algorithm on real time series datasets from the UCR archive and compared it with K-Shape and Fuzzy C-Shape. Numerical experiments on 48 real time series datasets show that the new algorithm outperforms state-of-the-art shape-based clustering algorithms in terms of accuracy.
Keywords: Big data · Time series clustering · K-Shape · Heart disease detection
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 404–418, 2022. https://doi.org/10.1007/978-3-030-82196-8_30
Big Data is a field in computer science that searches for ways to analyze massive amounts of data and extract information from them. Big data deals with three main challenges in gathered data: volume, variety and velocity. Data sets are growing rapidly, and large volumes of time series data are becoming available from a diverse range of sources. The revolution in the availability of internet-of-things and information-sensing devices has caused the problem to scale considerably. Hence, techniques for analyzing time series data and identifying interesting patterns are essential. A main challenge in the field of time series analysis is to cluster a set of time series according to the similarity of their shapes. Most approaches for shape-based clustering suffer from two serious problems: (i) they are computationally expensive and therefore
do not perform well on large volumes of data, and (ii) they are domain and data dependent. To overcome those problems, a novel method named K-Shape was introduced. K-Shape clusters time series effectively and is domain independent [1,2]. For clustering, K-Shape uses an iterative refinement procedure similar to the one used in k-means, with a new distance measure based on cross-correlation and a new centroid computation technique based on a shape extraction method. The authors of this approach demonstrated that it is highly effective in terms of clustering accuracy and efficiency. Although the results were good, they were not yet satisfactory, so to improve the effectiveness of K-Shape two new algorithms were introduced [3]: Fuzzy C-Shape (FCS+ and FCS++). The idea of Fuzzy C-Shape is to replace the k-means clustering in K-Shape with fuzzy c-means, so that each item has a crisp membership probability for each cluster. Hence, each item in Fuzzy C-Shape can belong to more than one cluster. Fateme Fahiman et al. [3] described how the Fuzzy C-Shape algorithm improved the K-Shape results regarding accuracy and efficiency. The problem with the Fuzzy C-Shape algorithm introduced in [3] is that it is a semi-fuzzy clustering: the fuzzy membership degree in Fuzzy C-Shape is not incorporated into the clustering process, and therefore its full potential for improving clustering accuracy is not realized. Moreover, both methods, K-Shape and Fuzzy C-Shape, do not take into consideration domains where the correlation between time series is important in the clustering process. In this paper, we present a novel algorithm, SIFCM-Shape, for clustering correlated time series. The algorithm improves the K-Shape and Fuzzy C-Shape time series clustering algorithms. The uniqueness of SIFCM-Shape is that its clustering method is based on SIFCM. SIFCM-Shape adds a fuzzy membership degree and the correlation between time series, both of which are incorporated into the clustering process.
The next section presents the relevant theoretical background on K-Shape and Fuzzy C-Shape clustering, on which we base our new algorithm. Section Three introduces the SIFCM-Shape algorithm. In Section Four we examine our algorithm and compare its results with the results reported in the K-Shape and Fuzzy C-Shape papers on the same UCR archive [4]. Section Five includes a discussion as well as conclusions and suggestions for future research.
2 Preliminaries
In this section we review the relevant theoretical background. First we review the numerical fuzzy clustering algorithms that are the basis of our work. Then we review the theoretical background for time series clustering and show how these methods were inspired by the numerical clustering algorithms and how they were adapted and evolved.
2.1 Fuzzy Clustering Algorithms
Many clustering algorithms exist for numerical feature vectors. The clustering algorithms attempt to partition a collection of elements into c clusters such that items in the same class are as similar as possible while items in different classes
C. Avni et al.
are as dissimilar as possible. The novelty in fuzzy clustering algorithms is that data points can belong to more than one cluster [5]. The basic steps of fuzzy clustering algorithms are: first, randomly select c cluster centers; second, calculate the fuzzy membership grade of every item to every cluster; then recompute the centers according to the new clusters; and finally, decide whether to stop or to repeat the previous steps in order to get more accurate centers.
Theorem 1 (IFCM [6]). This approach improves Fuzzy C-Means (FCM) by adding to the fuzzy membership function a parameter known as the hesitation degree, which comes from intuitionistic fuzzy sets. The membership is updated as
$$u_{ij} = u_{ij} + \pi_i^{*} \qquad (1)$$
The hesitation degree is the uncertainty in the membership of an element to a subset. This uncertainty is expressed as the complement, to a full (100%) membership score, of $u_A(X_i)$ and $v_A(X_i)$, the membership and non-membership degrees of each element to a subset:
$$\pi_i^{*} = 1 - u_A(X_i) - v_A(X_i) \qquad (2)$$
Theorem 2 (SFCM [7,8]). This approach improves Fuzzy C-Means (FCM) by using a parameter of correlation between objects. SFCM adds a spatial function $h_{ij}$ representing the probability that object $x_i$ belongs to the jth cluster. The spatial function of an object for a cluster is large if the majority of its correlated objects belong to the same cluster. In the equations, $N_n$ denotes the set of objects correlated with the ith object.
$$h_{ij} = \sum_{k \in N_n} u_{jk} \qquad (3)$$
The spatial function is incorporated into the membership function as follows:
$$u_{ij} = \frac{u_{ij}^{p}\, h_{ij}^{q}}{\sum_{k=1}^{C} u_{kj}^{p}\, h_{kj}^{q}} \qquad (4)$$
where p and q are parameters that control the relative importance of the two functions.
Theorem 3 (SIFCM [9,10]). This approach uses both parameters introduced above: the correlation between objects and the hesitation degree.
2.2 Time Series Clustering - Theoretical Background
Time-series clustering is an important data mining technique used to gain insight into time series. Several methods have been proposed to cluster time series [11,12]. The three main categories of time series clustering algorithms are shape-based, feature-based and model-based. In the shape-based approach, the shapes of two time series are matched by working directly with the raw time-series data. In the feature-based approach, the time series are compared
using a feature vector of lower dimension extracted from each time series. In model-based methods, a raw time series is transformed into model parameters and clustered accordingly. We focus our review on the K-Shape algorithm, which is a shape-based time series clustering algorithm. We review its techniques for capturing the shape of time series and for measuring the distances between time series efficiently and domain independently in the clustering process.
Distance Measure. A distance is a numerical measurement that defines how far apart two objects are, that is, how similar or dissimilar they are. In time series distance calculation, the measure must be invariant to (1) scaling, (2) shift, (3) uniform scaling, and (4) occlusion and complexity. The majority of the most common approaches for time series comparison first z-normalize the sequences and then use a distance to determine their similarity and possibly capture more invariances. The most widely used distances are the Euclidean Distance (ED) and Dynamic Time Warping (DTW). In K-Shape [1,2] a new measure for capturing shape-based similarity was introduced: the Shape-Based Distance (SBD). SBD is based on an efficient cross-correlation computation that compares and measures the similarity of two time series sequences even if they are not aligned. The algorithm receives two z-normalized sequences and outputs the dissimilarity between them. SBD, described in detail in [1,2], is an efficient and parameter-free measure that achieves results similar to those of the most accurate but computationally expensive distance measures, which require parameter tuning. Regarding time complexity, if p is the length of the two time series, then SBD requires Θ(p log p) time to calculate the distance between them. The algorithm efficiently searches for the shift that aligns sequence y towards sequence x and then computes the dissimilarity between them.
Algorithm 1: [dist, y'] = SBD(x, y) [1,2]
Input: Two z-normalized sequences x and y
Output: dist - Dissimilarity of x and y
        y' - Aligned sequence of y towards x
length = 2^nextpower2(2 * length(x) - 1)
CC = IFFT{FFT(x, length) * conj(FFT(y, length))}
NCCc = CC / (||x|| * ||y||)
[value, index] = max(NCCc)
dist = 1 - value
shift = index - length(x)
if shift >= 0 then
    y' = [zeros(1, shift), y(1 : end - shift)]
else
    y' = [y(1 - shift : end), zeros(1, -shift)]
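The steps of Algorithm 1 can be sketched in Python with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `sbd`, the use of the real-valued FFT, and the zero-norm guard are assumptions of the sketch, and indices are 0-based, so the shift is `index - (n - 1)` instead of `index - length(x)`.

```python
import numpy as np

def sbd(x, y):
    """Shape-based distance between two equal-length series (hypothetical helper).

    Returns (dist, y_aligned), mirroring the outputs of Algorithm 1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # FFT length: a power of two >= 2n - 1, so circular correlation
    # equals linear correlation.
    fft_len = 1 << (2 * n - 1).bit_length()
    # Cross-correlation theorem: corr = IFFT(FFT(x) * conj(FFT(y))).
    cc = np.fft.irfft(np.fft.rfft(x, fft_len) * np.conj(np.fft.rfft(y, fft_len)), fft_len)
    # Reorder so that position k corresponds to lag k - (n - 1).
    cc = np.concatenate((cc[fft_len - (n - 1):], cc[:n]))
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    ncc = cc / denom if denom > 0 else np.zeros_like(cc)  # guard is an assumption
    idx = int(np.argmax(ncc))
    dist = 1.0 - ncc[idx]
    shift = idx - (n - 1)
    if shift >= 0:
        y_aligned = np.concatenate((np.zeros(shift), y[:n - shift]))
    else:
        y_aligned = np.concatenate((y[-shift:], np.zeros(-shift)))
    return dist, y_aligned
```

For example, if `y` is a copy of `x` delayed by two samples, the call returns a distance close to zero and an aligned sequence equal to `x`.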
Centroids Calculations. Finding cluster centroids is the task of finding a small number of sequences (in most cases just one) that effectively summarize a set of objects and capture its shared characteristics. For time series centroid calculation we look for an algorithm that is invariant to scaling and shifting, so that it can effectively capture the characteristics shared by the class. The obvious way to estimate the average sequence is to compute each coordinate as the arithmetic mean of the corresponding coordinates of all sequences. However, this approach does not offer invariance to scaling, shifting or noise. Hence, in the K-Shape papers [1,2] a new algorithm for centroid calculation was introduced: Shape Extraction (SE). The SE method calculates a single centroid that summarizes the entire set of time series. It treats this task as an optimization problem whose goal is to find a sequence with the minimum sum of squared distances to all other sequences in the cluster. The algorithm is described in detail in [1,2].
Algorithm 2: C = Shape-Extraction(X, R)
Input: X - An n-by-p matrix with z-normalized time series.
       R - A 1-by-p vector with the reference sequence against which the time series of X are aligned.
Output: C - A 1-by-p vector with the centroid.
X' = []
for i = 1 to n do
    [dist, x'] = SBD(R, X(i))
    X' = [X'; x']
S = X'^T * X'
Q = IdentityMatrix - (1/m) * [1]    % [1] is the all-ones matrix; m is the series length (p above)
M = Q^T * S * Q
C = Eig(M, 1)    % The first eigenvector of M
Time Series Clustering Algorithms
Theorem 4 (K-Shape [1,2]). K-Shape addresses the problem of time series clustering with an iterative refinement procedure that scales linearly in the number of sequences and generates homogeneous and well-separated clusters. The procedure is similar to the one used in k-means. In every iteration, K-Shape performs two steps: (i) an assignment step, which updates the cluster membership of every time series by finding, for each time series, the closest centroid; (ii) a refinement step, which adapts to the changes made by the previous step by computing a new centroid for each cluster with the SE method. K-Shape repeats these steps until either no change in cluster membership occurs or the maximum number of iterations allowed is reached.
Theorem 5 (K-MS [1,2]). K-MS is significantly more accurate than K-Shape on datasets with large variance in the proximity and spatial distribution of time series. K-MS is similar to K-Shape except that it uses MultiShapesExtraction instead
of ShapeExtraction to generate cluster centroids. MultiShapesExtraction extracts multiple centroids to summarize the time series in a cluster, which makes it more suitable for clustering time series in the presence of outliers and noise.
Theorem 6 (Fuzzy C-Shape [3]). Fuzzy C-Shape improves K-Shape by adding a fuzzy membership function. Like K-Shape, Fuzzy C-Shape uses the SBD function for distance measurement and SE for centroid calculation. However, unlike K-Shape, which bases its clustering on k-means, Fuzzy C-Shape bases its clustering on fuzzy c-means. Hence, in the Fuzzy C-Shape algorithm each item has a membership degree for each cluster, meaning that an item can belong to more than one cluster.
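The eigenvector step of Algorithm 2 (Shape Extraction) can be illustrated with a short NumPy sketch. To stay self-contained it assumes the rows of the input have already been aligned to the reference sequence, so the SBD alignment loop is omitted. The function name, the sign correction and the final z-normalization are assumptions of this sketch and are not stated in the text above.

```python
import numpy as np

def shape_extraction(X_aligned):
    """Centroid of already-aligned, z-normalized series (rows of X_aligned)."""
    X = np.asarray(X_aligned, dtype=float)
    p = X.shape[1]
    S = X.T @ X                               # S = X'^T * X'
    Q = np.eye(p) - np.ones((p, p)) / p       # centering matrix
    M = Q.T @ S @ Q
    # eigh returns eigenvalues of a symmetric matrix in ascending order,
    # so the last column is the eigenvector of the largest eigenvalue.
    _, vecs = np.linalg.eigh(M)
    c = vecs[:, -1]
    # Assumption: an eigenvector is defined up to sign, so keep the sign
    # that lies closer to the data.
    if np.sum((X - c) ** 2) > np.sum((X + c) ** 2):
        c = -c
    # Assumption: return a z-normalized centroid.
    std = c.std()
    return (c - c.mean()) / std if std > 0 else c
```

When every row equals the same z-normalized sequence, the sketch returns that sequence, which is a quick sanity check of the construction.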
3 SIFCM-Shape Time Series Clustering
In this section we present IFCM-Shape, SFCM-Shape and SIFCM-Shape, our three novel time series clustering algorithms.
Rationale: We believe in the potential of the K-Shape algorithm and want to sharpen its ability to solve time series clustering problems. To enhance its accuracy, we first incorporate a fuzzy membership degree. In addition, we want to generalize the algorithm and enhance its effectiveness in areas where spatial relations influence the clustering process. To achieve this, we add a membership degree based on the spatial relations between time series. This allows us to get more accurate results in fields where there is a correlation between time series.
Proposed Algorithms: IFCM-Shape, SFCM-Shape and SIFCM-Shape are extensions of a crisp c-shapes time series clustering algorithm. These new algorithms are partitional clustering methods based on an iterative refinement procedure similar to the one used in K-Shape. However, unlike K-Shape, which bases its clustering on k-means, IFCM-Shape, SFCM-Shape and SIFCM-Shape base their clustering on IFCM, SFCM and SIFCM respectively. Through the iterative clustering procedure, each algorithm minimizes the sum of squared distances and manages to: (i) produce homogeneous and well-separated clusters; (ii) scale linearly with the number of time series; (iii) add a fuzzy membership degree for more accurate results; (iv) add a spatial membership degree. These algorithms sharpen the abilities of K-Shape to compare sequences efficiently, compute centroids and cluster time series effectively under scaling, translation and shift invariances.
Algorithms Description: The IFCM-Shape, SFCM-Shape and SIFCM-Shape algorithms accept as input the time series set X and the number of clusters k that we want to produce. Additionally, the spatial algorithms (SFCM-Shape and SIFCM-Shape) also receive N, the correlation between the items in set X.
The output of the algorithm is the assignment of sequences to clusters and the centroid of each cluster. Initially we randomly assign the time series in X to clusters. Then, as in the K-Shape algorithm, we perform two steps in every iteration: (i) assignment step: compare each time series with all computed centroids, using the SBD function for distance measurement, and compute the membership degree of each series to each cluster; (ii) refinement step: update the cluster centroids using the SE method to reflect the changes in cluster memberships made in the previous step. The algorithm repeats these steps until it converges or reaches the maximum number of iterations. However, we had to generalize the K-Shape algorithm and these steps for fuzzy and for spatial clustering. To generalize K-Shape for fuzzy clustering, we first create a membership matrix U rather than the membership vector u used in K-Shape. Entry $U_{ij}$ of matrix U represents the fuzzy membership value of item $x_i \in X$ to cluster $c_j$. During the refinement step the membership is determined with respect to a fuzzy grade. IFCM-Shape adds the hesitation degree to the fuzzy membership grade, as presented in Theorem 1. SFCM-Shape adds the spatial function (based on the input N) to the fuzzy membership grade, as presented in Theorem 2. SIFCM-Shape adds both the hesitation degree and the spatial function to the fuzzy membership grade. To integrate the fuzzy membership grade into the refinement step and take it into consideration in the SE stage, we devised a novel technique. We need a technique that does not damage the effectiveness of SE in capturing the class characteristics. Hence we use the membership grades as importance weights: each item in X is replicated according to its importance (its fuzzy membership grade), and the new set is then passed to the SE function. This way we maintain the ability of SE to capture the class characteristics and also integrate the fuzzy membership grade into the SE centroid calculation. To sum up, by replacing the clustering algorithm we could add a fuzzy membership degree and also the correlation between time series.
Time Complexity Analysis: Let n denote the number of given time series, p the length of each time series and c the number of clusters. All three algorithms use the SBD function to measure the distance between items and centroids. The time complexity of SBD is $\Theta(p \log p)$. The implementation of IFCM-Shape is $\Theta(npc^2)$, and that of SFCM-Shape and SIFCM-Shape is $\Theta(\max\{npc^2, n^2c\})$. In all three algorithms, the refinement step calculates a matrix M for each cluster, which takes $\Theta(p^2)$ time, and then performs an eigenvalue decomposition of M, which takes $\Theta(p^3)$ time. Therefore, the complexity of the refinement step is $\Theta(\max\{np^2, cp^3\})$. As a result, the per-iteration time complexity of the algorithms is $\Theta(\max\{nc^2 p \log p, n^2 c, np^2, cp^3\})$.
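The replication technique described above can be sketched in a few lines of Python. The factor of 10 follows Algorithm 3; the function name and the rounding of fractional counts with Python's built-in `round` are assumptions of this sketch, since the text does not say how non-integer counts are handled.

```python
def replicate_by_membership(series, memberships, factor=10):
    """Repeat series[i] round(factor * memberships[i]) times.

    `memberships` holds the membership grades of every series to one
    cluster (one column of U). The rounding rule is an assumption.
    """
    replicated = []
    for x, u in zip(series, memberships):
        count = round(factor * u)
        replicated.extend(list(x) for _ in range(count))
    return replicated
```

Because Shape Extraction only uses the replicated set through the product S = X'^T * X', repeating a series c times has the same effect as weighting its contribution to S by c.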
Algorithm 3: [IDX, C] = SIFCM-SHAPE(X, k)
Input: X - An n-by-p matrix containing n time series of length p that are initially z-normalized.
       k - The number of clusters to produce.
Output: IDX - An n-by-1 vector with the assignment of n time series to k clusters.
        C - A k-by-p matrix with k centroids of length p.
#define SBD_get_dist(x, y) { [dist, y'] = SBD(x, y); dist; }
#define Harden(U) { Replace the largest value in each column of U with 1, and place 0 in the other k-1 slots of that column. }
iter = 0
U = [][]
Randomly select k cluster centers
do
    // Calculate the fuzzy membership grade function
    for i = 1 to n do
        for j = 1 to k do
            $U_{ij} = 1 \Big/ \sum_{t=1}^{k} \left( \frac{\mathrm{SBD\_get\_dist}(X(i), c(j))}{\mathrm{SBD\_get\_dist}(X(i), c(t))} \right)^{\frac{2}{m-1}}$
    // Add the hesitation degree
    for i = 1 to n do
        for j = 1 to k do
            $U_{ij} = U_{ij} + \pi_i^{*}$    % See Eq. (2)
    // Add spatial grade
    for i = 1 to n do
        for j = 1 to k do
            $h_{ij} = \sum_{t \in N_n} U_{jt}$    % See Eq. (3)
    for i = 1 to n do
        for j = 1 to k do
            $U_{ij} = \frac{U_{ij}^{p}\, h_{ij}^{q}}{\sum_{t=1}^{k} U_{tj}^{p}\, h_{tj}^{q}}$    % See Eq. (4)
    // Shape-Extraction refinement step
    for j = 1 to k do
        X' = []
        for i = 1 to n do
            X' = [X'; (10 * U_ij) copies of X(i)]
        C(j) = Shape-Extraction(X', C(j))
    Ct = [C(1) C(2) ... C(k)]
    iter = iter + 1
while (iter < 100) and (the coefficient change between two iterations is more than the tolerance)
[IDX, C] = [Harden(U), Ct]
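The membership computations of Algorithm 3 can be illustrated with a plain Python sketch. It covers the fuzzy c-means grade and the spatial update of Eqs. (3) and (4). The hesitation degree is left out because it needs a non-membership function that this section does not specify. The function names, the default fuzzifier m = 2, the representation of N as neighbour index lists, and the normalization over the clusters of each item are assumptions of the sketch.

```python
def fuzzy_membership(dist, m=2.0, eps=1e-12):
    """FCM grade: dist[i][j] is the distance from series i to centroid j."""
    expo = 2.0 / (m - 1.0)
    U = []
    for row in dist:
        d = [max(v, eps) for v in row]        # avoid division by zero
        U.append([1.0 / sum((dj / dt) ** expo for dt in d) for dj in d])
    return U

def spatial_update(U, neighbours, p=1.0, q=1.0):
    """Combine memberships with the spatial function.

    neighbours[i] lists the indices of the series correlated with series i
    (a hypothetical encoding of the input N).
    """
    k = len(U[0])
    updated = []
    for i, row in enumerate(U):
        # Spatial function: summed membership of the correlated series.
        h = [sum(U[t][j] for t in neighbours[i]) for j in range(k)]
        w = [(row[j] ** p) * (h[j] ** q) for j in range(k)]
        total = sum(w)
        # Assumption: a series without informative neighbours keeps its row.
        updated.append([v / total for v in w] if total > 0 else list(row))
    return updated
```

With distances [1, 3] to two centroids and m = 2, the first function gives memberships 0.9 and 0.1; the spatial update then pulls each series towards the cluster favoured by its correlated neighbours.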
4 Experimental Settings
We now describe the experimental settings used to evaluate the SIFCM-Shape algorithm.
Datasets: We use the UCR archive, the largest public collection of class-labeled time series datasets [4]. This is the archive that was used to evaluate K-Shape and FCS [1–3], although it has since been expanded from 45 to 85 datasets. The UCR archive consists of 85 synthetic and real datasets from various domains. Figure 1 displays the variety of dataset domain types. Each dataset contains from 40 to 16,637 labeled sequences. The sequences in each dataset have equal length, but from one dataset to another the sequence length varies from 24 to 2,709. The datasets are all z-normalized, crisply labeled, and split into training and test sets by the dataset collector. To decrease the running time of the tests we exclude the 50words, Adiac, Phoneme, NonInvasiveFatalECG_Thorax1, NonInvasiveFatalECG_Thorax2, ShapesAll, WordsSynonyms and ElectricDevices datasets. We chose this archive because of its size and diversity, and because it has already been used in [1–3] to evaluate the K-Shape and FCS algorithms, so their results are documented.
Fig. 1. Histogram of UCR’s domains types
Platform: We ran our experiments on an Intel Core i7-6700HQ computer with a clock speed of 2.60 GHz and 16 GB RAM, running the 64-bit Windows 10 OS. We implemented the clustering methods in 64-bit Python.
Metrics: To evaluate the clustering accuracy we use the Adjusted Rand Index (ARI) as the quality measure. ARI measures the agreement between two partitions and shows how close the clustering results are to the ground truth. ARI is a corrected-for-chance version of the Rand Index (RI). The Adjusted Rand Index is thus ensured to have a value close to 0.0 for random labeling independently
of the number of clusters and samples, and exactly 1.0 when the clusterings are identical (up to a permutation). Given a set of n elements $S = \{O_1, \ldots, O_n\}$ and two partitions of S to compare, $X = \{X_1, \ldots, X_r\}$, a partition of S into r subsets, and $Y = \{Y_1, \ldots, Y_s\}$, a partition of S into s subsets, define the following: a, the number of pairs of elements in S that are in the same subset in X and in the same subset in Y; and b, the number of pairs of elements in S that are in different subsets in X and in different subsets in Y. The Rand Index is calculated as
$$RI = \frac{a + b}{n(n-1)/2} \qquad (5)$$
The Adjusted Rand Index is calculated as
$$ARI = \frac{RI - \text{Expected } RI}{\max(RI) - \text{Expected } RI} \qquad (6)$$
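As a worked illustration of the two indices, the following sketch computes RI from the pair counts a and b and ARI from the standard contingency-table (pair-counting) form. That the experiments evaluated the ARI in exactly this form is an assumption; the function names are hypothetical, and both inputs are label sequences of equal length with at least two elements.

```python
from collections import Counter
from math import comb

def _pair_counts(labels_x, labels_y):
    """Pairs grouped together in both partitions, in X only, and in Y only."""
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_x, labels_y)).values())
    same_x = sum(comb(c, 2) for c in Counter(labels_x).values())
    same_y = sum(comb(c, 2) for c in Counter(labels_y).values())
    return same_both, same_x, same_y

def rand_index(labels_x, labels_y):
    pairs = comb(len(labels_x), 2)
    same_both, same_x, same_y = _pair_counts(labels_x, labels_y)
    a = same_both                                  # together in X and in Y
    b = pairs - same_x - same_y + same_both        # apart in X and in Y
    return (a + b) / pairs

def adjusted_rand_index(labels_x, labels_y):
    pairs = comb(len(labels_x), 2)
    same_both, same_x, same_y = _pair_counts(labels_x, labels_y)
    expected = same_x * same_y / pairs
    maximum = (same_x + same_y) / 2
    if maximum == expected:                        # degenerate partitions
        return 1.0
    return (same_both - expected) / (maximum - expected)
```

Two labelings that differ only by a permutation of the label names, such as [0, 0, 1, 1] and [1, 1, 0, 0], score 1.0 on both indices.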
5 Experimental Results
In this section, we discuss our experiments to evaluate our algorithms against K-Shape and Fuzzy C-Shape.
Results: Table 1 shows a comparison of our new algorithms IFCM-Shape, SFCM-Shape and SIFCM-Shape against K-Shape and FCS, both of which are scalable time series clustering algorithms that have been shown to outperform other algorithms. Each row presents the number of UCR datasets over which the method's ARI score is better than, equal to or worse than that of K-Shape. At first glance, Table 1 seems to show inconsistent results regarding the comparison of FCS to K-Shape. This is because, to save time, we executed FCS with a maximum of 4 refinement iterations and K-Shape with 100, which drastically degrades the accuracy of FCS relative to K-Shape. However, the advantage of FCS over K-Shape was demonstrated in [3] and does not need further testing. Table 1 should therefore be read only as a comparison among the fuzzy algorithms.
Table 1. Comparison of FCS, IFCM-Shape, SFCM-Shape and SIFCM-Shape against K-Shape. Columns “>”, “=”, “ =
and a joint distribution P are faithful to one another iff every conditional independence entailed by the graph G and the Markov condition is also present in P [5].
2.4 Colliders and V-Structures [3]
Three nodes in a BN, X, Y and Z, form a V-structure if node Y has two incoming edges from X and Z, forming X → Y ← Z, and X is not adjacent to Z. Y is a collider if Y has two incoming edges from X and Z in a path, whether X and Z are adjacent or not. Y with nonadjacent parents X and Z is said to be an unshielded collider for the path from X to Z [1].
2.5 d-Separation [3]
Two nodes X and Y are d-separated by a set of nodes Z if and only if every path from X to Y is blocked by Z. Such a set Z is called a sepset of X from Y and is denoted as Sep_Y{X}. Most local BN learning methods systematically perform CI tests to identify d-separators as a key step in the structure discovery process [1].
2.6 Markov Blanket
The Markov blanket of a target variable T (Fig. 1), denoted as MB(T), is the minimal set of variables that renders T independent of all other variables that do not belong to MB(T) [1]. Equation 3 formalizes this concept.
$$X \perp\!\!\!\perp T \mid MB(T), \quad \forall X \subseteq V \setminus \{T\} \setminus MB(T) \qquad (3)$$
It has been said that any variable that is not in the Markov blanket of T can be considered as not providing new information about T once we know the values of MB(T) (Koller and Sahami, 1996) [2].
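When the graph is known, the Markov blanket can be read off directly: under faithfulness it consists of the parents, children and spouses of T (see Fig. 1). The sketch below illustrates this on a DAG stored as a mapping from each node to its set of parents; the representation and the function name are assumptions made for the example.

```python
def markov_blanket(parents, target):
    """Parents, children and spouses of `target` in a DAG.

    `parents` maps every node to the set of its parents.
    """
    pa = set(parents.get(target, ()))
    children = {v for v, ps in parents.items() if target in ps}
    # Spouses: the other parents of the target's children.
    spouses = {p for c in children for p in parents[c]} - {target}
    return pa | children | spouses
```

For the graph A → T → C ← B with C → D, the blanket of T is {A, B, C}: D is excluded because C already shields it from T.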
Max-Min Random Walk Parents and Children
Fig. 1. The Markov Blanket of Target Node T Contains the Parents, Children and Spouses of T . The Nodes in Pink are the Parents and Children Set.
2.7 Theorem 1 (Adjacent Nodes in a BN) [6]
If a BN G is faithful to a joint probability distribution P, then: 1) nodes X and Y are adjacent in G if and only if X and Y are dependent given every set of nodes that does not include X and Y, and 2) for nodes X, Y and Z in G, if X and Y are adjacent, Y is adjacent to Z, and Z is not adjacent to X, then (X, Y, Z) form a V-structure with Y as a collider node if and only if $X \not\perp\!\!\!\perp Z \mid S$, $\forall S$ such that $X, Z \notin S$ and $Y \in S$.
2.8 Theorem 2 (MB Uniqueness [3])
If a BN G and a joint distribution P are faithful to each other, then MB(T), ∀T ∈ V, is unique and is the set of parents, children and spouses of T. In addition, the set of parents and children of T, PC(T), is also unique.
2.9 InterIAMB
IAMB is a popular local learning method that has influenced the development of many other methods [5]. It follows the Grow and Shrink (GS) approach [5]: it orders the variables to be included in the current MB(T) (initially empty) according to their strength of association with T. It then admits into the candidate MB(T) the next variable in the ordering that is not conditionally independent of T given the current MB(T). In each iteration of the algorithm, the forward phase is interleaved with a shrinking phase in which false positives are sought and removed [5]. Figures 2 and 3 illustrate the forward and backward phases, respectively.
S. del Río and E. Villanueva
Fig. 2. Forward Phase of InterIAMB. Function f Usually Measures the Association between T and X, given a Set of Candidate Nodes to the Markov Blanket CMB, and an Arbitrary Threshold Value t. All Nodes Surpassing t will be Added to the CMB.
Fig. 3. Backward Phase of InterIAMB. Given m = ||CMB|| (Cardinality of CMB), if there is a Set Z in All Possible Combinations of Sets of CMB that Makes X Conditionally Independent of T, it is Removed from the CMB Set. This Calculation is Repeated for Every X in CMB.
Fig. 4. Forward Phase of MMHC. For Each Iteration, Only One Node X with the Max Score will be Added to CMB.
Algorithm 1. MMPC
procedure MMPC
    Input: Target node T, Features set U
    Output: Set of Candidate Parents and Children Features for T.
    -- Forward Phase
    CPC ← [ ]
    loop
        <F, assocF> ← MaxMinH(T, CPC, U)
        if assocF != 0 then
            CPC ← CPC ∪ F
        end if
    until CPC not modified
    -- Backward Phase
    for X in CPC:
        if exists sep(X) inside CPC
            discard X
        end if
    end for
    return CPC

procedure MaxMinH
    Input: T, CPC, U
    Output: Feature A, belonging to U, with highest association to T
    evaluatedFeatures ← [ ]
    for X in U:
        evaluatedFeatures ← evaluatedFeatures ∪ MinAsoc(X, T, CPC, U)
    end for
    <assocF, F> ← max(evaluatedFeatures)
    return <F, assocF>
2.10 Max-Min Parents and Children [4]
The Max-Min Hill Climbing (MMHC) algorithm is a popular local learning method that improves on the IAMB method. The overall algorithm consists of two steps: the Max-Min Parents and Children (MMPC) discovery step and the spouse identification step. MMPC discovery consists of a forward phase (Fig. 4), in which all possible candidates are added, and a backward phase (Fig. 3), in which candidates that contain redundant information about T are identified. Algorithm 1 illustrates the MMPC algorithm.
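The two phases of Algorithm 1 can be sketched in Python under simplifying assumptions. The association measure is passed in as a hypothetical callable `assoc(x, target, cond)` that returns 0 when x and the target are independent given the set `cond`; how that value is computed from data is outside the sketch. Conditioning sets are enumerated exhaustively, which is exponential in the size of the candidate set, so the code is only suitable for small examples.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items`, smallest first."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def min_assoc(x, target, cpc, assoc):
    """Minimum association of x with the target over all subsets of cpc."""
    return min(assoc(x, target, s) for s in subsets(cpc))

def mmpc(target, variables, assoc):
    cpc = []
    remaining = [v for v in variables if v != target]
    # Forward phase: add the variable whose minimum association is largest.
    while remaining:
        scores = {v: min_assoc(v, target, cpc, assoc) for v in remaining}
        best = max(scores, key=scores.get)
        if scores[best] <= 0:
            break
        cpc.append(best)
        remaining.remove(best)
    # Backward phase: discard variables separated from the target by some
    # subset of the other candidates.
    for x in list(cpc):
        others = [v for v in cpc if v != x]
        if any(assoc(x, target, s) <= 0 for s in subsets(others)):
            cpc.remove(x)
    return set(cpc)
```

With an oracle for the chain B → A → T → C, in which B becomes independent of T once A is in the conditioning set, the sketch returns {A, C}.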
3 MMRWPC
MMHC has been widely adopted in different tools and libraries for BN structure learning (e.g., Causal Explorer). However, it imposes a high computational burden, mainly due to the high-order conditional independence (CI) tests
performed in the MMPC step. To alleviate this, some authors have proposed restricting the order of the CI tests, at the cost of some learning precision [8]. In this paper we propose a different approach. We introduce a random walk process into the MMPC algorithm in order to reduce the number of higher-order CI tests. The random walk is embedded in the backward phase of MMPC (line 28, Algorithm 2) in order to create communities of nodes based on their mutual correlation and thus to decrease the order of the executed CI tests. Williams et al. [7] have already shown the robustness of random walks in identifying network modules in a systems biology application. The proposed algorithm, called Max-Min Random Walk Parents and Children (MMRWPC, Algorithm 2), implements three core modifications to the classic version of MMPC:
1. Community creation (Fig. 5), by embedding a random walk in the backward phase, to avoid as much as possible executing CI tests between variables that are unrelated. We achieve this with the following steps:
– Calculate a zero-order CI test matrix between all the variables in the dataset. This matrix serves as a cache of zero-order CI tests, as these values are used to determine the grouping inside the Candidate Parents and Children before each iteration of the backward phase.
– Before each iteration of the backward phase, all the variables included in the Candidate Parents and Children go through a selection process: a random variable is chosen (line 27), and the random walk algorithm makes an arbitrary number of jumps to different variables (line 28); the number of jumps received by each variable determines which variables are grouped with the initial one. This process is repeated until no variable is left ungrouped, and then we proceed to the "classic" backward phase, but only inside each group (lines 33 and 34, Fig. 6).
This modification minimizes the number of higher-order conditional independence tests while maintaining the elimination of all redundant variables in the Candidate Parents and Children.
2. Interleaving. Because the first modification reduces the number of higher-order conditional independence tests, we can cleanse the Candidate Parents and Children without incurring expensive operational costs (line 12).
3. Group-based insertion of variables if more than one attains the highest dependency score during the given iteration (line 5).
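The community creation stage can be illustrated with the following Python sketch. Several details are assumptions made for the example, because the description leaves them open: jumps are drawn with probability proportional to a cached zero-order association `weight(a, b)`, the walk length `n_jumps` and the visit `threshold` are arbitrary constants, and a variable joins the community of the starting variable when it is visited more often than the threshold.

```python
import random

def random_walk_community(start, candidates, weight, n_jumps=200, threshold=10, rng=None):
    """Variables visited more than `threshold` times by a walk from `start`."""
    rng = rng if rng is not None else random.Random(0)
    visits = {v: 0 for v in candidates}
    current = start
    for _ in range(n_jumps):
        others = [v for v in candidates if v != current]
        weights = [weight(current, v) for v in others]
        if not others or sum(weights) <= 0:
            break                       # isolated variable: stop walking
        current = rng.choices(others, weights=weights, k=1)[0]
        visits[current] += 1
    return {start} | {v for v, c in visits.items() if c > threshold}

def group_candidates(cpc, weight, rng=None, **walk_options):
    """Partition the candidate set into communities, one walk per community."""
    rng = rng if rng is not None else random.Random(0)
    remaining = list(cpc)
    groups = []
    while remaining:
        centroid = rng.choice(remaining)
        community = random_walk_community(centroid, remaining, weight, rng=rng, **walk_options)
        groups.append(community)
        remaining = [v for v in remaining if v not in community]
    return groups
```

Each community always contains its starting variable, so the loop in `group_candidates` terminates. The backward phase would then search for separators only inside each returned group.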
4 Experiments and Results
In this section, we present the experiments and results. First we describe the data used in the experiments. Then, we present the experimental setup and discuss the results achieved.
Algorithm 2. MMRWPC
 1: procedure MMRWPC
        Input: Target node T, Features set U
        Output: Set of Candidate Parents and Children Features for T.
 2:     -- Forward Phase
 3:     CPC ← [ ]
 4:     loop
 5:         ArrAssoc[<F_i, AssocF_i>] ← MaxMinH(T, CPC, U)
 6:         for X in ArrAssoc:
 7:             if X<assocF> > 0:
 8:                 CPC ← CPC ∪ X<F>
 9:             end if
10:         end for
11:         -- Backward Phase interleaved in Forward Phase
12:         CPC ← FilterWithSeparator(T, CPC, U)
13:     until CPC is not modified
14:     return CPC
15:
16:
17: procedure MaxMinH
        Input: T, CPC, U
        Output: Feature A, belonging to U, with highest association to T
18:     evaluatedFeatures ← [ ]
19:     for X in U:
20:         evaluatedFeatures ← evaluatedFeatures ∪ MinAsoc(X, T, CPC, U)
21:     end for
22:     return evaluatedFeatures
23: procedure FilterWithSeparator
        Input: T, CPC
        Output: Filtered CPC
24:     -- Community Creation stage:
25:     groupedCPC ← [ ]
26:     while CPC not empty:
27:         CentroidAtt ← RandomlyChosenAtt(CPC)
28:         community ← RandomWalk(CentroidAtt, CPC)
29:         groupedCPC ← groupedCPC ∪ community
30:         CPC ← CPC - community
31:     end while
32:     -- Filtering phase:
33:     for community in groupedCPC:
34:         community ← SeparatorsLookup(T, community)
35:     end for
36:     return groupedCPC
Fig. 5. Community Creation Stage of MMRWPC. Using the Correlation Matrix, the Current CPC and a Randomly Selected Feature Fa, a Community is Created by Selecting the Features that Obtained a Greater Number of Jumps than an Arbitrary Threshold Value.
Fig. 6. Backward Phase of MMRWPC. For Each Iteration, the Evaluated Sets are the Communities Previously Formed on CPC.
4.1 Data
To evaluate the proposed method we used artificial data generated from benchmark Bayesian networks [8]:
1. Alarm (370 nodes)
2. Child (200 nodes)
3. Hailfinder (560 nodes)
4. Insurance (270 nodes)
For each network we sampled 10 datasets for each sample size in {500, 1000, 5000}.
4.2 Experimental Setup
We executed the following eight versions of the proposed method on each dataset of each network:
1. MMPC without any modification.
2. MMPC with only modification 1, abbreviated as MODS_1.
3. MMPC with only modification 2, abbreviated as MODS_2.
4. MMPC with only modification 3, abbreviated as MODS_3.
5. MMPC with modifications 1 and 2, abbreviated as MODS_1_2.
6. MMPC with modifications 1 and 3, abbreviated as MODS_1_3.
7. MMPC with modifications 2 and 3, abbreviated as MODS_2_3.
8. MMRWPC (MMPC with all modifications).
The idea behind these experiments was to assess the individual contribution of each proposed modification to MMPC, as well as the suitability of combining these modifications. An experiment on a dataset consisted of the following steps:
1. Sample 4% of all the nodes randomly.
2. For each selected node T:
– Recover and register the candidate parents and children of the node, CPC(T), with each method.
– Shuffle the columns of the dataset before continuing to the next node. This is done to obtain results that are robust to the node ordering.
3. Compute the performance metrics of each method.
As performance metrics, we used the popular True Positive Rate (TPR), True Negative Rate (TNR) and PC distance [8]. These metrics assess how well the set of parents and children retrieved by the method, CPC(T), approximates the actual PC(T) set. They are defined as follows:
– True Positive Rate (TPR): the number of parents and children correctly inferred divided by the total number of nodes in PC(T).
$$TPR = \frac{|CPC(T) \cap PC(T)|}{|PC(T)|} \qquad (4)$$
– True Negative Rate (TNR): one minus the number of parents and children incorrectly inferred divided by the total number of nodes not in PC(T), where $\overline{PC}(T)$ denotes the set of nodes not in PC(T).
$$TNR = 1 - \frac{|CPC(T) \cap \overline{PC}(T)|}{|\overline{PC}(T)|} \qquad (5)$$
– PC distance: measures the overall distance of the inferred CPC(T) from PC(T).
$$PC_{distance} = \sqrt{(1 - TPR)^2 + (1 - TNR)^2} \qquad (6)$$
Additionally, we registered the number of CI test calls by order of the conditioning set. We also differentiate between true CI test calls and unsuccessful CI test calls executed in each algorithm. An unsuccessful CI test call is a call to the dependency function (Dep()) that did not actually perform the dependency calculation because there were not enough samples (at least 5 samples are required for each combination of values of the conditioning variables, similar to [4]). In that case complete dependency is assumed.
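The three metrics can be computed from plain sets, as in the sketch below. Two choices are assumptions of the sketch rather than statements from the text: the target itself is excluded from the set of negatives, and the PC distance is taken as the Euclidean distance to the ideal point (TPR, TNR) = (1, 1). The function name is hypothetical.

```python
from math import sqrt

def pc_metrics(cpc, pc, all_nodes, target):
    """Return (TPR, TNR, PC distance) for an inferred set `cpc` against `pc`."""
    cpc, pc = set(cpc), set(pc)
    negatives = set(all_nodes) - pc - {target}    # nodes not in PC(T)
    tpr = len(cpc & pc) / len(pc) if pc else 1.0
    tnr = 1.0 - len(cpc & negatives) / len(negatives) if negatives else 1.0
    distance = sqrt((1.0 - tpr) ** 2 + (1.0 - tnr) ** 2)
    return tpr, tnr, distance
```

For instance, with PC(T) = {A, B}, an inferred set {A, C} and three non-members C, D and E, the sketch gives TPR = 1/2 and TNR = 2/3.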
4.3 Results and Discussion
Here we present the results obtained with the different configurations of the proposed method MMRWPC. To obtain these results we averaged the metrics obtained for each Bayesian network and sample size (the average over the sampled nodes and the ten dataset versions of each sample size). We imposed a limit of 15 min per execution on each dataset to keep the evaluation time manageable. First, we present results for the Child network and a sample size of 500 instances. Table 1 presents the total number of successful executions for each method (those that took less than the imposed limit of 15 min). MODS_1_3 and MODS_3 are the only methods with a few missing execution results due to the runtime limit.
Table 1. Number of Successful Executions over Ten Datasets of 500 Samples of Child BN. The Maximum Number of Executions is 800 (10 Datasets, 4% of Nodes and 10 Independent Repetitions for Each Node)
Method      Number of successful executions
MMPC        800
MODS_1      800
MODS_2      800
MODS_3      790
MODS_1_2    800
MODS_1_3    791
MODS_2_3    800
MMRWPC      800
Table 2 shows the average performance metrics obtained on the Child datasets of 500 samples. There are no relevant differences in the metrics between the methods. Tables 3 and 4 show the average number of calls to the dependency function, broken down by the degree of the CI test. Both the proposed method MMRWPC and MODS_2 execute a higher number of dependency calls of degree zero and a lower number of higher-degree calls than the other methods. This is the opposite of the classical MMPC, which executes a lower number of zero-order CI tests.
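Table 2. Average Performance of PC Discovery Methods using Child 500 Samples.

| Method | PC distance max | PC distance min | PC distance mean | PC distance std | TPR max | TPR min | TPR mean | TPR std | TNR max | TNR min | TNR mean | TNR std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMPC | 1.00 | 0.00 | 0.15 | 0.22 | 1.00 | 0.00 | 0.85 | 0.23 | 1.00 | 0.98 | 0.99 | 0.01 |
| MODS_1 | 1.00 | 0.00 | 0.15 | 0.22 | 1.00 | 0.00 | 0.85 | 0.23 | 1.00 | 0.98 | 0.99 | 0.01 |
| MODS_2 | 1.00 | 0.00 | 0.15 | 0.22 | 1.00 | 0.00 | 0.85 | 0.23 | 1.00 | 0.98 | 0.99 | 0.01 |
| MODS_3 | 1.00 | 0.00 | 0.15 | 0.22 | 1.00 | 0.00 | 0.85 | 0.23 | 1.00 | 0.97 | 0.99 | 0.01 |
| MODS_1_2 | 1.00 | 0.00 | 0.15 | 0.22 | 1.00 | 0.00 | 0.85 | 0.23 | 1.00 | 0.98 | 0.99 | 0.01 |
| MODS_1_3 | 1.00 | 0.00 | 0.15 | 0.22 | 1.00 | 0.00 | 0.85 | 0.23 | 1.00 | 0.97 | 0.99 | 0.01 |
| MODS_2_3 | 1.00 | 0.00 | 0.15 | 0.22 | 1.00 | 0.00 | 0.85 | 0.23 | 1.00 | 0.98 | 0.99 | 0.01 |
| MMRWPC | 1.00 | 0.00 | 0.15 | 0.22 | 1.00 | 0.00 | 0.85 | 0.23 | 1.00 | 0.98 | 0.99 | 0.01 |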
Table 3. Average Number of Dependency Function Calls by Degree of the CI Test (Degrees 0 to 4).

| Method | Degree 0 | Degree 1 | Degree 2 | Degree 3 | Degree 4 |
|---|---|---|---|---|---|
| MMPC | 199.00 | 34.29 | 85.67 | 135.03 | 135.87 |
| MODS_1 | 199.00 | 63.71 | 125.08 | 169.36 | 144.93 |
| MODS_2 | 2486.62 | 27.58 | 61.01 | 87.49 | 82.25 |
| MODS_3 | 199.00 | 32.95 | 101.94 | 201.24 | 266.10 |
| MODS_1_2 | 5985.87 | 55.19 | 97.62 | 117.63 | 97.91 |
| MODS_1_3 | 199.00 | 38.78 | 99.59 | 182.22 | 240.16 |
| MODS_2_3 | 199.00 | 34.29 | 85.67 | 135.03 | 135.87 |
| MMRWPC | 3874.57 | 64.29 | 78.63 | 59.96 | 25.18 |
Figure 7 shows the number of unsuccessful CI test calls for each method as a function of the test degree on the Child datasets of 500 samples. These results represent the number of tests skipped by the method because of a lack of data, in which case dependency is assumed. From this plot we can infer that the proposed method MMRWPC performs the lowest number of unsuccessful CI tests in comparison to the other methods, especially for orders greater than two. This means that MMRWPC makes fewer dependency assumptions in higher-order tests due to insufficient data.
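The notion of an unsuccessful CI test call can be sketched as follows. This is only one possible reading of the sample-size rule quoted in Sect. 4 (at least 5 samples per combination of values of the conditioning variables): here only the combinations actually observed in the data are counted, which is an assumption. The names `enough_samples` and `dep`, and the stand-in test `never_dependent`, are hypothetical.

```python
from collections import Counter

MIN_SAMPLES = 5  # threshold quoted in the text

def enough_samples(rows, cond_vars, min_samples=MIN_SAMPLES):
    """True when every observed combination of conditioning values
    is supported by at least min_samples rows (assumed reading)."""
    if not rows:
        return False
    counts = Counter(tuple(row[v] for v in cond_vars) for row in rows)
    return min(counts.values()) >= min_samples

def dep(rows, x, t, cond_vars, ci_test):
    """Return (dependent, performed).

    When data are insufficient the test is not performed and
    dependence is assumed: this is an unsuccessful CI test call."""
    if not enough_samples(rows, cond_vars):
        return True, False
    return ci_test(rows, x, t, cond_vars), True

# Toy data: 6 rows with Z = 0 and only 2 rows with Z = 1.
rows = [{"X": i % 2, "T": i % 2, "Z": 0} for i in range(6)]
rows += [{"X": 0, "T": 1, "Z": 1} for _ in range(2)]

# Hypothetical stand-in for a real statistical test.
never_dependent = lambda rows, x, t, cond: False

order0 = dep(rows, "X", "T", [], never_dependent)     # test is performed
order1 = dep(rows, "X", "T", ["Z"], never_dependent)  # too few rows with Z = 1
print(order0, order1)
```

The higher the order of the conditioning set, the more combinations must be populated, which is why unsuccessful calls concentrate in the higher degrees.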
Table 4. Average Number of Dependency Function Calls by Degree of the CI Test (Degrees 5 to 9, Continuation of Table 3)

| Method | Degree 5 | Degree 6 | Degree 7 | Degree 8 | Degree 9 |
|---|---|---|---|---|---|
| MMPC | 97.89 | 59.00 | 33.22 | 21.47 | 15.61 |
| MODS_1 | 89.83 | 42.36 | 19.78 | 8.04 | 2.05 |
| MODS_2 | 56.27 | 32.70 | 18.02 | 11.50 | 8.21 |
| MODS_3 | 270.24 | 231.11 | 157.13 | 104.25 | 63.65 |
| MODS_1_2 | 63.34 | 31.76 | 13.28 | 4.82 | 1.68 |
| MODS_1_3 | 244.15 | 216.16 | 161.16 | 124.72 | 72.69 |
| MODS_2_3 | 97.89 | 59.00 | 33.22 | 21.47 | 15.61 |
| MMRWPC | 10.70 | 4.83 | 1.00 | 0 | 0 |
Fig. 7. Averaged Numbers of Unsuccessful CI Tests Per Degree for Child Datasets of 500 Instances
Table 5 shows the average numbers of CI test calls by degree for datasets of 1000 instances of the Child network. We observe the same pattern as in the datasets of 500 instances (Table 3), that is, MMRWPC performs a higher number of CI test calls of order zero than MMPC, but drastically fewer CI test calls of higher order. Figure 8 shows the number of unsuccessful CI test calls for each method as a function of the test degree on the Child datasets of 1000 samples. We again observe the pattern found in the datasets of 500 samples: MMRWPC performs the lowest number of unsuccessful CI tests for orders greater than 2, which means fewer dependency assumptions when data is scarce.
Table 5. Average Number of Dependency Function Calls by Degree of the CI Test (Degrees 0 to 4) using Child 1000 Samples.

| Method | Degree 0 | Degree 1 | Degree 2 | Degree 3 | Degree 4 |
|---|---|---|---|---|---|
| MMPC | 199.00 | 39.55 | 107.66 | 168.36 | 165.65 |
| MODS_1 | 199.00 | 70.89 | 147.42 | 183.92 | 152.05 |
| MODS_2 | 2186.15 | 34.95 | 86.09 | 123.75 | 113.79 |
| MODS_3 | 199.00 | 42.27 | 213.78 | 599.09 | 1120.48 |
| MODS_1_2 | 5358.69 | 68.48 | 136.13 | 166.07 | 140.21 |
| MODS_1_3 | 199.00 | 47.618 | 211.85 | 569.02 | 1050.40 |
| MODS_2_3 | 199.00 | 39.55 | 107.66 | 168.37 | 165.65 |
| MMRWPC | 2865.96 | 54.69 | 84.50 | 75.28 | 35.12 |
Table 6 shows the total number of successful executions for each method in the experiments with datasets of 5000 instances of the Child network. The methods MODS_3 and MODS_1_3 failed to finish even half of the executions; these are also the methods that perform the highest number of high-order dependency calls.

Table 6. Number of Successful Executions using Child 5000 Samples.

| Method | Number of successful executions |
|---|---|
| MMPC | 800 |
| MODS_1 | 800 |
| MODS_2 | 800 |
| MODS_3 | 305 |
| MODS_1_2 | 800 |
| MODS_1_3 | 386 |
| MODS_2_3 | 800 |
| MMRWPC | 800 |
Fig. 8. Averaged Numbers of Unsuccessful CI Tests Per Degree for Child Datasets of 1000 Instances

The pattern of results found in the Child network was also observed in the other Bayesian networks. Table 7 shows the ratios of the average metrics obtained with MMRWPC to the average metrics obtained with MMPC across the different networks for datasets of 500 instances. Both methods perform similarly across all Bayesian networks (values very close to 1). This indicates that the proposed method does not sacrifice accuracy in PC recovery in comparison to MMPC, while reducing the computational burden (represented by the higher-order CI tests), as shown before.

Table 7. Ratios of the Average Metrics Obtained with MMRWPC to the Average Metrics Obtained with MMPC Across the Different Bayesian Networks for Sample Size = 500. For TNR and TPR, Values > 1 Mean that MMRWPC Presented Better Average Metrics Compared to MMPC, and Vice Versa for PC-Distance.

| Bayesian network | PC-distance (MMRWPC/MMPC) | TNR (MMRWPC/MMPC) | TPR (MMRWPC/MMPC) |
|---|---|---|---|
| Alarm | 0.957784 | 1.000278 | 1.006214 |
| Child | 1.014436 | 1.000405 | 0.997072 |
| HailFinder | 0.983696 | 0.999418 | 1.011004 |
| Insurance | 1.008649 | 1.000103 | 0.995981 |
Figure 9 shows the ratio of the average number of CI test calls performed by MMRWPC to the average number of CI calls performed by MMPC, as a function of the order. In 3 of 4 networks, the proposed MMRWPC performed more zero-order CI tests but notably fewer of the costly higher-order tests. Only in the HailFinder network was the reduction of higher-order CI tests not enough to reach MMPC performance.
Fig. 9. Ratio of the Average Number of CI Test Calls Performed by MMRWPC to the Average Number of CI Calls Performed by MMPC as a Function of the Order Across Bayesian Networks. Values < 1 Mean that MMRWPC Presented Better Metrics Compared to MMPC.
Finally, Fig. 10 shows the ratio of the average number of unsuccessful CI test calls performed by MMRWPC to the average number of unsuccessful CI calls performed by MMPC, as a function of the order. The proposed method performs significantly fewer unsuccessful CI calls than MMPC, which translates into fewer assumptions of dependency between nodes. Figures 9 and 10 illustrate the trade-off made by the modifications in MMRWPC with respect to MMPC: the number of CI test calls of degree zero is increased in order to reduce the CI test calls of higher degrees. By decreasing the executed CI test calls of higher degrees, MMRWPC makes fewer assumptions of dependency between nodes, thus increasing the statistical reliability of the proposed algorithm.
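The trade-off can be made concrete with the per-degree averages reported for the Child datasets of 500 samples. The short Python sketch below takes the MMPC and MMRWPC values for degrees 0 to 9 from Tables 3 and 4 and computes the ratio used in Fig. 9; the variable names are illustrative only.

```python
# Average number of dependency calls per degree (0..9), Child, 500 samples,
# as reported in Tables 3 and 4.
mmpc = [199.00, 34.29, 85.67, 135.03, 135.87, 97.89, 59.00, 33.22, 21.47, 15.61]
mmrwpc = [3874.57, 64.29, 78.63, 59.96, 25.18, 10.70, 4.83, 1.00, 0.0, 0.0]

# Ratio MMRWPC / MMPC per degree; values below 1 favour MMRWPC.
ratios = [new / old for new, old in zip(mmrwpc, mmpc)]

for degree, ratio in enumerate(ratios):
    print(f"degree {degree}: ratio {ratio:.2f}")
```

For this network the ratio is well above 1 at degree zero and falls below 1 from degree two onwards, which is the pattern described above.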
Fig. 10. Ratio of Unsuccessful CI Test Calls of MMRWPC Against MMPC Across Different Artificial Bayesian Networks. Values < 1 Mean that MMRWPC Presented Better Metrics Compared to MMPC.
5 Conclusion

According to the obtained results, we can conclude that the proposed method MMRWPC is an efficient alternative to the classical MMPC algorithm: it increases the execution of inexpensive low-order conditional independence tests but, in most cases, drastically reduces the execution of expensive high-order CI tests, thus avoiding unsuccessful CI tests whenever possible. The accuracy in recovering parents and children proved to be on par with that obtained with the MMPC algorithm. All modifications proved to be useful in MMRWPC. The combination of these modifications is potentially applicable to other Markov blanket discovery methods, which is a plausible direction for further investigation.

Acknowledgment. The authors gratefully acknowledge financial support by INNOVATE PERU (Grant 334-INNOVATEPERU-BRI-2016).
References

1. Gao, T., Ji, Q.: Efficient Markov blanket discovery and its application. IEEE Trans. Cybern. 47(5), 1169–1179 (2017)
2. Koller, D., Sahami, M.: Toward optimal feature selection. In: Proceedings of the 13th International Conference on Machine Learning (ICML-1996), October 2000
3. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco (1988)
4. Pellet, J.-P., Elisseeff, A.: Using Markov blankets for causal structure learning. J. Mach. Learn. Res. 9(43), 1295–1342 (2008)
5. Shunkai, F., Desmarais, M.: Markov blanket based feature selection: a review of past decade. Lecture Notes in Engineering and Computer Science, no. 2183, June 2010
6. Spirtes, P., Glymour, C., Scheines, R.: Regression, causation and prediction. In: Spirtes, P., Glymour, C., Scheines, R. (eds.) Causation, Prediction, and Search. Lecture Notes in Statistics, vol. 81, pp. 238–258. Springer, New York (1993). https://doi.org/10.1007/978-1-4612-2748-9_8
7. Williams, T.D., et al.: Towards a system level understanding of non-model organisms sampled from the environment: a network biology approach. PLoS Comput. Biol. 7, e1002126 (2011)
8. Villanueva, E., Maciel, C.D.: Efficient methods for learning Bayesian network super-structures. Neurocomputing 123, 3–12 (2014). Contains Special issue articles: Advances in Pattern Recognition Applications and Methods
9. Yang, X., Wang, Y., Yang, O., Tong, Y.: Three-fast-inter incremental association Markov blanket learning algorithm. Pattern Recogn. Lett. 122, 73–78 (2019)
A Complete Index Base for Querying Data Cube

Viet Phan-Luong 1,2 (✉)

1 Aix-Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France
2 LIS - Team DANA, Marseille, France
[email protected]
Abstract. We call a base of data cubes a structure that allows data cube queries to be computed. In a previous work, we presented a compact index base, called the first-half index base, that consists of tuple indexes. This base is stored on disk. To compute queries over the whole data cube, we need to compute further indexes based on this stored base. Those further indexes form the last-half index base. The present work shows that the last-half index base can be integrated into the stored first-half index base at a very small cost in computing and storage. The integration, called the complete index base, significantly improves data cube query computation. The efficiency of the complete base, in storage space and query response time, is shown through experimentation on real datasets.
Keywords: Data warehouse · Data cube · Data mining · Database

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 486–500, 2022. https://doi.org/10.1007/978-3-030-82196-8_36

1 Introduction

Given a relational fact table with n attributes (also called dimensions), a measure m (a numerical attribute), and an aggregate function g (such as COUNT, SUM, MAX, etc.), a cuboid over a relational scheme S built on k of the n attributes (k ≤ n) is the result of an SQL group-by query that groups on S and applies g to m. The data cube over the n dimensions, the measure m and the aggregate function g is the set of all cuboids built on g and m over all such schemes S. In business intelligence, each cuboid represents an aggregate view of the business over a set of dimensions. The data cube can therefore offer managers multidimensional views of their business. However, computing data cube queries raises important issues concerning the size of large datasets and the number of cuboids. In Online Analytical Processing (OLAP), the data cube is precomputed and stored on disk. The storage space can be tremendous since, for n dimensions, there are $2^n$ cuboids.

There exist many approaches to these issues. The work [9] proposed an approximate and space-efficient representation of data cubes based on a multiresolution wavelet decomposition. In [8,10,21,23], we find approaches that compute the data cube partially: only cuboids above a certain threshold are computed. Based on the equivalence defined on aggregate functions or on the concept of closed itemsets in frequent itemset mining, the works [1,2,7,18,19] proposed efficient methods for the computation and storage of the entire data cube. In the same category, the works [2,11,16,17,20,22,24] proposed methods for reducing the storage space by using tuple references. In this category, the computation is usually organized on the complete lattice of cuboid schemes, in a top-down or bottom-up manner. To create a cuboid, tuples are sorted and grouped. An aggregated tuple is a tuple that represents a group of several tuples. A non-aggregated tuple is a tuple that represents a group with only one tuple (a singleton). Only aggregated tuples are stored on disk. Non-aggregated tuples are not stored but represented by references to the stored tuples. The experimental results [24] of many of these approaches on real and synthetic datasets show that the Totally-Redundant-Segment BottomUpCube approach (TRS-BUC) nearly dominates its competitors in all aspects: fast computation of a fully materialized cube in compressed form, incremental updatability, and quick query response time.

We remark that in almost all of the above works, the data cube is precomputed for a fixed aggregate function and a fixed measure. When another aggregate function or another measure is considered, the whole data cube needs to be recomputed. The works [12–15] present a new approach to computing data cubes. In this approach, the data cube is not precomputed for a fixed aggregate function and a fixed measure. Instead, an index structure of the data cube is precomputed, based on which data cube queries can be efficiently computed for any aggregate function and any measure, without recomputing the index structure.
We observe that the index structure of a data cube can be divided into two halves such that one can be computed efficiently from the other. These two halves are called the first-half data cube index and the last-half data cube index. In [12,13], the last-half data cube index is computed and stored on disk. This approach follows top-down computing: the index of the cuboid over the largest scheme is computed first. In [15], the first-half data cube index is computed and stored on disk. This approach follows bottom-up computing: the indexes of the cuboids over the smallest schemes are computed first. According to the experimental results, bottom-up computing is more efficient than top-down computing, both in computing time and in storage space. The present work studies the possibility of integrating the last-half index base into the first-half index base and the resulting efficiency in storage space and computing runtime.

The paper is organized as follows. Section 2 recalls the main concepts of the approaches in [13] and [15] that we need for the present approach. Section 3 presents the method for integrating the last-half index base into the first-half index base; the result is called the complete index base. Section 4 presents the methods for computing data cube queries based on the complete base. Experimental results and discussions are in Sect. 5. Finally, the conclusion and further work are in Sect. 6.
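To make the definition of a cuboid and the count of $2^n$ cuboids given at the start of this section concrete, the following minimal Python sketch materializes every cuboid of a toy fact table with three dimensions and one measure. It is a naive illustration of the definition, not one of the methods discussed above; the toy data and the function name `cuboid` are hypothetical, and dimensions are numbered from 0 here for convenience.

```python
from collections import defaultdict
from itertools import combinations

# Toy fact table: three dimensions followed by one measure (last column).
fact_table = [
    ("a", "x", 1, 10.0),
    ("a", "x", 2, 5.0),
    ("a", "y", 1, 7.0),
    ("b", "x", 1, 3.0),
]
N_DIMS = 3

def cuboid(rows, scheme, agg):
    """Group rows on the dimensions in `scheme` and aggregate the measure."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in scheme)
        groups[key].append(row[-1])
    return {key: agg(values) for key, values in groups.items()}

# One scheme per subset of the dimensions: 2^n schemes in total.
schemes = [s for k in range(N_DIMS + 1)
           for s in combinations(range(N_DIMS), k)]

# The data cube for the aggregate function SUM.
data_cube = {scheme: cuboid(fact_table, scheme, sum) for scheme in schemes}

print(len(data_cube))    # number of cuboids
print(data_cube[(0,)])   # cuboid over the first dimension only
```

Changing the aggregate function (for example passing `max` instead of `sum`) requires rebuilding every cuboid in this naive scheme, which is the cost that the index-based approach aims to avoid.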
2 Preliminary
This work builds on the main concepts and algorithms presented in [13] and [15]. In what follows and in the following sections, we consider a relational fact table T with a dimension scheme $R_n = \{1, 2, ..., n\}$ and a set of k measures $M = \{m_1, m_2, ..., m_k\}$.

2.1 The First-Half and the Last-Half Data Cube
Let $P_n$ denote the power set of $R_n$. $P_n$ can be defined as follows:

1. For $R_0 = \emptyset$ (the empty set), $P_0 = \{\emptyset\}$.
2. For $R_n = \{1, 2, ..., n\}$, $n \geq 1$,

$$P_n = P_{n-1} \cup \{X \cup \{n\} \mid X \in P_{n-1}\} \qquad (1)$$

$P_{n-1}$ is called the first-half of $P_n$ and $\{X \cup \{n\} \mid X \in P_{n-1}\}$ the last-half of $P_n$.

Example 1: For n = 3, $R_3 = \{1, 2, 3\}$, we have: $P_0 = \{\emptyset\}$, $P_1 = \{\emptyset, \{1\}\}$, $P_2 = \{\emptyset, \{1\}, \{2\}, \{1, 2\}\}$, $P_3 = \{\emptyset, \{1\}, \{2\}, \{1, 2\}, \{3\}, \{1, 3\}, \{2, 3\}, \{1, 2, 3\}\}$. The first-half of $P_3$ is $P_2 = \{\emptyset, \{1\}, \{2\}, \{1, 2\}\}$ and the last-half of $P_3$ is $\{\{3\}, \{1, 3\}, \{2, 3\}, \{1, 2, 3\}\}$.

The first-half data cube over $R_n$ is the set of all cuboids over the schemes in the first-half of $P_n$, and the last-half data cube is the set of all cuboids over the schemes in the last-half of $P_n$. The works [12,13] compute and store the tuple indexes of the last-half data cube, and data cube queries are computed based on these indexes. In contrast, the work [15] computes and stores the tuple indexes of the first-half data cube.

2.2 The First-Half Index Base for Data Cubes
The first-half index base [15] is a set of indexes for tuples in the first-half data cube. It serves to compute all data cube queries, including queries on the last-half data cube. The computation of this index base is based on a structure called the attribute index tree and on an elementary algorithm called InsData2AttIndex. The attribute index tree is a binary search tree with two data fields: the first stores an attribute value and the second stores the list of rowids of the tuples that have this attribute value. The search in this tree is organized on the first field. The algorithm InsData2AttIndex inserts the values of an attribute (dimension) into the attribute index tree of this dimension.

The creation of the first-half index base is incremental, using the algorithm TupleIndex. This algorithm builds the index of tuples over a scheme $\{A_1, ..., A_k\} \subseteq R_n$ (for $1 \leq i \leq k$, $1 \leq A_i \leq n$). The construction of the index over $\{A_1, ..., A_k\}$ supposes that the index over $\{A_1, ..., A_{k-1}\}$ has already been created. Note that the algorithm InsData2AttIndex is used to build the indexes over the schemes {1}, ..., and {n}.

The tuple index over a scheme $\{A_1, ..., A_k\}$ is a partition of the set of all rowids of the fact table T. Each set p in the partition is the set of the rowids of all tuples of T that have the same values on $\{A_1, ..., A_k\}$.

The first-half index base representation for data cubes of T has three components: the fact table T, the indexes over all sub-schemes in the first-half of $P_n$ (the power set of $R_n$), and the list of structures that allow access to all those indexes. These components are stored on disk. This representation is built by the algorithm GenFHIndex.

For information on the computation of data cube queries based on the first-half base, see [15]. Here we observe that the only difference between a query on the first-half data cube and one on the last-half data cube is that the latter needs an extra step to partition the indexes that are available in the first-half index base. This takes additional computation time. In the next section, we present a solution to reduce this cost.
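The notions recalled in this section can be sketched in a few lines of Python. The sketch below builds the power set by the recursion of formula (1), then builds a tuple index as a partition of rowids and refines it by one more attribute. It is only an illustration under simplifying assumptions: a dictionary stands in for the attribute index tree, and the functions `power_set`, `attribute_index` and `refine` are hypothetical names that mimic the roles of the recursive definition, InsData2AttIndex and TupleIndex without reproducing the algorithms of [15].

```python
from collections import defaultdict

def power_set(n):
    """P_n by the recursion P_n = P_{n-1} U {X U {n} | X in P_{n-1}}."""
    if n == 0:
        return [frozenset()]
    first_half = power_set(n - 1)
    last_half = [x | {n} for x in first_half]
    return first_half + last_half

def attribute_index(table, att):
    """Partition all rowids by the value of dimension `att` (1-based).
    A dict replaces the binary search tree of the attribute index tree."""
    index = defaultdict(list)
    for rowid, row in enumerate(table):
        index[row[att - 1]].append(rowid)
    return list(index.values())

def refine(partition, table, att):
    """Split every block of rowids by the value of dimension `att`,
    giving the index over the scheme extended with `att`."""
    refined = []
    for block in partition:
        by_value = defaultdict(list)
        for rowid in block:
            by_value[table[rowid][att - 1]].append(rowid)
        refined.extend(by_value.values())
    return refined

p3 = power_set(3)
first_half, last_half = p3[:4], p3[4:]

table = [("a", "x"), ("a", "y"), ("b", "x"), ("a", "x")]
index_1 = attribute_index(table, 1)    # index over the scheme {1}
index_12 = refine(index_1, table, 2)   # index over the scheme {1, 2}
print(index_1, index_12)
```

Each block of `index_12` is contained in a block of `index_1`, which is what makes the incremental construction possible.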
3 Complete Index Base for Querying Data Cube
We recall from Sect. 2 that a data cube can be divided into two parts, the first-half and the last-half, whose cuboids are over the dimension schemes in the first-half and the last-half, respectively, of the power set of the scheme $R_n = \{1, 2, ..., n\}$, $n \geq 1$. The latter are defined by the recursive formula $P_n = P_{n-1} \cup \{X \cup \{n\} \mid X \in P_{n-1}\}$, where the power set of $R_0 = \emptyset$ (the empty set) is $P_0 = \{\emptyset\}$. $P_{n-1}$ is called the first-half of $P_n$, and the second operand, i.e., $\{X \cup \{n\} \mid X \in P_{n-1}\}$, the last-half of $P_n$. The last-half power set of $R_n$ is obtained by adding n to each element of $P_{n-1}$.

As a consequence of this definition, the creation of the index base can be incremental: for a scheme $X = \{A_1, ..., A_k\} \subseteq R_n$, where $A_1 < ... < A_{k-1} < A_k$, the index over X is created from the index over $\{A_1, ..., A_{k-1}\}$. Each partition in the index over $\{A_1, ..., A_{k-1}\}$ is partitioned on the values of attribute $A_k$ using algorithm InsData2AttIndex. All indexes over the first-half power set of $R_n$ are saved on disk and used for querying the data cube. In this way, we save more than half of the storage space, but we must pay for computing queries in the last-half data cube. We now show how this cost can be reduced with the complete index base, without considerably increasing the storage space.

Let us suppose that we build the complete index base in the same way as we built the first-half index base. That is, after building the first-half index base, for each partition p in the index for tuples over a scheme $\{A_1, ..., A_{k-1}\}$, to obtain the index for tuples over $\{A_1, ..., A_{k-1}, n\}$ (where n is the last dimension of the fact table scheme), we partition p based on the values of attribute n using algorithm InsData2AttIndex, and save those new partitions on disk. In such a way, we save no storage space.
To save storage space, we do not save those new partitions in the usual way, but in the space that we have used to save the first-half index base. Note that each partition p in the index for tuples over a scheme $\{A_1, ..., A_{k-1}\}$ is a list of rowids. The partition of this list based on the values of attribute n is a list $p_1, ..., p_i$ where

– $\forall j$, $1 \leq j \leq i$, the tuples at rowids in $p_j$ have the same value on the attribute n,
– the set of all values on the attribute n in all tuples at rowids in $p_1 \cup ... \cup p_i$ is the set of all values on the attribute n in all tuples at rowids in p, and
– $p = p_1 \cup ... \cup p_i$ and $p_j \cap p_h = \emptyset$, $\forall 1 \leq h, j \leq i$, $h \neq j$.

Based on the last point, instead of saving the partition over $\{A_1, ..., A_{k-1}, n\}$ in a new space, we can use the storage space of the indexes over $\{A_1, ..., A_{k-1}\}$: for each partition p in this space, we insert separator marks into p to separate $p_1$, ..., and $p_i$.

Moreover, if we generate the first-half index base and then generate the last-half index base and save it as explained above, we must spend time reading the first-half index base into memory. To avoid this cost, it is better to generate the first-half index base and the last-half index base in one phase. In this way, we can generate the complete index base by slightly modifying the algorithm GenFHIndex as follows: for each index of the first-half index base, over a scheme $\{A_1, ..., A_{k-1}\}$, before saving it, we partition it over attribute n to create the index over the scheme $\{A_1, ..., A_{k-1}, n\}$, and save the latter on disk. In what follows, we formalize the above method for generating the complete index base in one phase.

Algorithm GenCompleteIndex:
Input: the fact table T over $R_n = \{1, ..., n\}$.
Output: the complete index base for the data cube over $R_n$.
Method:
0. Let LS = ∅ // LS: list of schemes
1.1 Use InsData2AttIndex to build the n − 1 attribute indexes over the schemes {1}, ..., {n − 1}. Partition these indexes over attribute n using InsData2AttIndex. Save the result to disk, using ' ' to separate rowids, '\n' to separate the partitions over attribute i, for 1 ≤ i ≤ n − 1, and ';' to separate the sub-partitions over attribute n;
1.2 Generate the index over attribute n, using InsData2AttIndex. Save it to disk, using ' ' to separate rowids and '\n' to separate the partitions over attribute n.
1.3 Append successively the schemes {1}, ..., {n} to LS.
2. Set a pointer pt to the head of LS;
3. Let X be the scheme pointed by pt (i.e., X = {1});
4. while pt ≠ NULL do
4.1 Let lastAtt be the greatest number in X;
4.2 For i from 1 to n − 1 such that i > lastAtt do
4.2.2 Let nsch = append i to X and LS = append nsch to LS;
4.2.3 Use TupleIndex to build the index of tuples over the scheme nsch, and use InsData2AttIndex to partition each partition of rowids of this index over attribute n;
4.2.4 Save the index to disk, using ' ' to separate rowids, '\n' to separate the partitions of rowids over the tuple scheme, and ';' to separate the sub-partitions over attribute n;
4.2.5 done;
4.3 Move pt to the next element of LS and let X be the scheme it points to;
4.4 done;
5. Save LS to disk.
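The storage convention used in steps 1.1 and 4.2.4 can be illustrated with a small Python sketch. It encodes one tuple index as text, with one line per partition over the scheme and ';' marks between the sub-partitions over the last attribute. The sketch keeps everything in memory and covers only the encoding of a single index; the function name `encode_complete_index`, the toy table, and the 0-based column numbering are assumptions made for illustration.

```python
def encode_complete_index(partition, table, last_col):
    """Encode an index over a first-half scheme together with its
    refinement by the last attribute.

    partition: list of blocks of rowids (the index over the scheme)
    last_col:  0-based column of the last attribute n in `table`
    Format: ' ' between rowids, ';' between sub-partitions over the
    last attribute, newline between partitions over the scheme.
    """
    lines = []
    for block in partition:
        sub = {}
        for rowid in block:
            sub.setdefault(table[rowid][last_col], []).append(rowid)
        lines.append(";".join(" ".join(map(str, ids)) for ids in sub.values()))
    return "\n".join(lines) + "\n"

# Toy fact table with three dimensions; the last one plays the role of n.
table = [
    ("a", "x", "u"),
    ("a", "y", "u"),
    ("b", "x", "v"),
    ("a", "x", "v"),
    ("a", "x", "u"),
]
index_over_1 = [[0, 1, 3, 4], [2]]   # partition of rowids on the first dimension
encoded = encode_complete_index(index_over_1, table, last_col=2)
print(repr(encoded))
```

The encoded text contains exactly the same rowids as the first-half index; only the ';' marks are added, which is consistent with the small storage overhead claimed for the complete index base.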
We can see that the algorithm GenCompleteIndex is a slightly modified version of the algorithm GenFHIndex: The partition over the last attribute n is applied to tuple indexes that are generated in each step of tuple index generation of GenFHIndex and the result of the partition is stored in the same space of those tuple indexes.
4 Data Cube Query Based on the Complete Index Base
We call the complete index base for querying the data cube on the fact table T the triple (T, LS, CIndex), where CIndex and LS are respectively the set of tuple indexes and the list of dimension schemes saved by GenCompleteIndex(T). The list of dimension schemes contains the information needed to access the tuple indexes.

To compute a cuboid with an aggregate function g (such as MAX, COUNT, SUM, AVERAGE, etc.) over a scheme X, we consider two cases. If X is the scheme of a cuboid in the first-half data cube, then in the file storing the tuple index over X, the partitions of rowids are separated by '\n'. Otherwise, X is the scheme of a cuboid in the last-half data cube, and in the file storing the tuple index over X − {n}, the partitions of rowids of the index over X are separated by ';' (and by '\n'). So we can obtain precisely the partitions over X, whether X is the scheme of a cuboid in the first-half data cube or in the last-half data cube. Then, for each partition p of rowids of a tuple index over X, the computation of the aggregate function g on the set of tuples whose rowids are in p is the same as the computation based on the first-half index base representation.
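The two cases can be sketched in Python as follows. The sketch reads an index encoded with the separators described above and evaluates a cuboid for an arbitrary aggregate function. It assumes the index text is already in memory and that the measure is a column of the fact table; the functions `read_blocks` and `cuboid`, the toy data, and the 0-based column numbering are hypothetical.

```python
def read_blocks(text, last_half):
    """Return the partitions of rowids stored in an encoded index.

    last_half=False: each line is one partition (';' marks are ignored).
    last_half=True:  each ';'-separated part of a line is one partition.
    """
    blocks = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if last_half:
            blocks.extend([int(r) for r in part.split()]
                          for part in line.split(";"))
        else:
            blocks.append([int(r) for r in line.replace(";", " ").split()])
    return blocks

def cuboid(text, table, scheme, measure, agg, last_half):
    """Apply `agg` to the measure of the tuples of each partition."""
    result = {}
    for block in read_blocks(text, last_half):
        key = tuple(table[block[0]][a] for a in scheme)
        result[key] = agg([table[r][measure] for r in block])
    return result

# Three dimensions (columns 0..2) and one measure (column 3).
table = [
    ("a", "x", "u", 10.0),
    ("a", "y", "u", 5.0),
    ("b", "x", "v", 7.0),
    ("a", "x", "v", 3.0),
    ("a", "x", "u", 2.0),
]
# Index over the first dimension, sub-partitioned by the last dimension.
index_text = "0 1 4;3\n2\n"

by_d0_sum = cuboid(index_text, table, (0,), 3, sum, last_half=False)
by_d0_max = cuboid(index_text, table, (0,), 3, max, last_half=False)
by_d0_d2_sum = cuboid(index_text, table, (0, 2), 3, sum, last_half=True)
print(by_d0_sum, by_d0_max, by_d0_d2_sum)
```

The same stored text answers a first-half query, a last-half query, and queries with different aggregate functions, without any change to the index.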
5 Experimental Results and Discussions
We evaluate the first-half index base and the complete index base for data cube queries on a laptop with an Intel Core i5-3320 CPU at 2.60 GHz, running Ubuntu 18.04 LTS, using programs written in C, on the following real datasets:

– CovType [3] is a forest cover-types dataset with 581,012 tuples on ten dimensions,
– SEP85L [4] is a weather dataset with 1,015,367 tuples on nine dimensions,
– STCO-MR2010_AL_MO [5] is a dataset on the population census with 640,586 tuples on ten dimensions, and
– OnlineRetail [6] is a UK transactions dataset with 393,127 complete data tuples on ten dimensions.

For the complete description of these datasets, see [15].

5.1 Base Construction - Complete Index Base Vs First-Half Index Base

Table 1 reports the construction time and the disk use of the first-half index base and the complete index base, where

– RT: the run time in seconds, from the beginning to the end of the construction of the index base. The run time includes the time to read and write the input and output files. RT-FH is the run time for building the first-half index base and RT-CO is the run time for building the complete index base.
– DU: the disk use in gigabytes to store the index base. DU-FH is the storage space of the first-half index base and DU-CO is the storage space of the complete index base.

The central memory use of both methods, for the four datasets, varies from 80 megabytes to 200 megabytes.

Table 1. Construction time and disk use of first-half base and complete base

| Datasets | RT-FH | RT-CO | DU-FH | DU-CO |
|---|---|---|---|---|
| CovType | 125 s | 178 s | 2 GB | 2.3 GB |
| SEP85L | 114 s | 158 s | 1.8 GB | 1.9 GB |
| STCO-... | 131 s | 158 s | 2.2 GB | 2.3 GB |
| OnlineRetail | 84 s | 104 s | 1.4 GB | 1.5 GB |
Observations: (i) RT-CO is longer than RT-FH, with an increase in time of 20% to 42%, and (ii) DU-CO is larger than DU-FH, with an increase in storage space of 4% to 15%. The increase in time and storage space is due to the integration of the last-half index base into the first-half index base to create the complete index base.
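The ranges quoted in these observations can be checked directly from Table 1. The following short Python sketch computes the relative increase of the complete base over the first-half base for each dataset; the dictionary layout and the helper `pct_increase` are illustrative only.

```python
# (RT-FH in s, RT-CO in s, DU-FH in GB, DU-CO in GB), from Table 1.
table1 = {
    "CovType": (125, 178, 2.0, 2.3),
    "SEP85L": (114, 158, 1.8, 1.9),
    "STCO-...": (131, 158, 2.2, 2.3),
    "OnlineRetail": (84, 104, 1.4, 1.5),
}

def pct_increase(old, new):
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

time_overhead = {name: pct_increase(v[0], v[1]) for name, v in table1.items()}
disk_overhead = {name: pct_increase(v[2], v[3]) for name, v in table1.items()}

for name in table1:
    print(f"{name}: time +{time_overhead[name]:.1f}%, "
          f"disk +{disk_overhead[name]:.1f}%")
```

The smallest and largest values obtained this way fall within the ranges stated above, up to rounding.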
5.2 On Query with Aggregate Functions
For experimentation, we run the following query for all cuboids in each half of the above data cubes.

Select $A_1, ..., A_k$, g(m) From T Group by $A_1, ..., A_k$;

where $\{A_1, ..., A_k\} \subseteq R_n$, m is a measure, and g is an aggregate function among MAX, COUNT, SUM, AVG, and VARIANCE. For example, for CovType the query is run on 512 cuboids of the first-half and on 512 cuboids of the last-half.

Table 2 shows the total time in seconds (including all computing and I/O time) for computing the aggregate queries on the first-half data cubes. For each aggregate function, the suffix FH indicates the computing time based on the first-half index base, and the suffix CO indicates the computing time based on the complete index base. The line Mean-FH (respectively, Mean-CO) contains the mean of the above lines with suffix FH (respectively, suffix CO). For example, in column CovType, Mean-FH is the mean of the aggregate query computing times for the dataset CovType based on the first-half index base, and Mean-CO is the mean of the aggregate query computing times for the dataset CovType based on the complete index base. The avgQRT-FH is the average query response time: Mean-FH divided by the number of cuboids in the first-half data cube, e.g., for CovType, avgQRT-FH = 256/512. The avgQRT-CO is defined similarly. The Increase % is the rounded percentage increase in time of avgQRT-CO with respect to avgQRT-FH. Figure 1 shows the comparison of the avgQRT of the two approaches on the first-half data cubes.

Table 3 is similar to Table 2, but for computing the aggregate queries for all cuboids in the last-half data cubes. In this table, the Decrease % is the rounded percentage decrease in time of avgQRT-CO with respect to avgQRT-FH. Figure 2 shows the comparison of the avgQRT of the two approaches on the last-half data cubes.

By integrating the last-half index base into the first-half index base, the avgQRT-CO for the first-half data cube increases by 2% to 7% with respect to the avgQRT-FH, but the avgQRT-CO for the last-half data cube decreases by 6% to 20% with respect to the avgQRT-FH.

Table 4 shows the average query response time (in seconds) for each aggregate function, in the complete index base approach. This table is computed from Tables 2 and 3. For each half of a data cube, the average time is taken over the number of cuboids in the half. For example, in the first-half CovType data cube, the AvgQRT for SUM is 253 s/512 = 0.49 s, and for the five aggregate functions the AvgQRT is 273 s/512 = 0.53 s. For the entire data cube, the AvgQRT for SUM is (253 s + 284 s)/1024 = 0.52 s, and for the five aggregate functions the AvgQRT is (273 s + 303 s)/1024 = 0.56 s.
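The averages quoted in this example follow from the fact that each half of a data cube over n dimensions contains $2^{n-1}$ cuboids. A minimal Python sketch of the arithmetic for CovType (ten dimensions), using the Mean-CO totals of Tables 2 and 3, is given below; the helper `avg_qrt` is a hypothetical name.

```python
def avg_qrt(total_seconds, n_dims, halves=1):
    """Average query response time per cuboid.

    Each half of the data cube has 2^(n_dims - 1) cuboids;
    halves=2 averages over the entire cube."""
    return total_seconds / (halves * 2 ** (n_dims - 1))

# CovType has ten dimensions: 512 cuboids per half, 1024 in total.
first_half = avg_qrt(273, 10)              # Mean-CO on the first-half cube
last_half = avg_qrt(303, 10)               # Mean-CO on the last-half cube
entire = avg_qrt(273 + 303, 10, halves=2)  # both halves together

print(f"{first_half:.4f} {last_half:.4f} {entire:.4f}")
```

The three values agree, after rounding to two decimals, with the CovType entries 0.53, 0.59 and 0.56 of the MEAN column of Table 4.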
Table 2. Computing time of first-half cube queries

| Agg Funct | CovType | Sep85 | STCO | OnlineRet |
|---|---|---|---|---|
| Count-FH | 207 s | 112 s | 122 s | 102 s |
| Count-CO | 225 s | 116 s | 137 s | 106 s |
| Max-FH | 237 s | 143 s | 146 s | 119 s |
| Max-CO | 257 s | 146 s | 157 s | 120 s |
| Sum-FH | 239 s | 140 s | 147 s | 115 s |
| Sum-CO | 253 s | 145 s | 157 s | 119 s |
| Avg-FH | 312 s | 173 s | 176 s | 148 s |
| Avg-CO | 332 s | 180 s | 189 s | 150 s |
| Var-FH | 286 s | 171 s | 179 s | 142 s |
| Var-CO | 301 s | 174 s | 187 s | 144 s |
| Mean-FH | 256 s | 148 s | 154 s | 125 s |
| Mean-CO | 273 s | 152 s | 165 s | 128 s |
| avgQRT-FH | 0.50 s | 0.58 s | 0.30 s | 0.24 s |
| avgQRT-CO | 0.53 s | 0.59 s | 0.32 s | 0.25 s |
| Increase % | 7% | 3% | 7% | 2% |
Fig. 1. AvgQRT comparison on first-half data cubes
Fig. 2. AvgQRT comparison on last-half data cubes
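Table 3. Computing time of last-half cube queries

| Agg Funct | CovType | Sep85 | STCO | OnlineRet |
|---|---|---|---|---|
| Count-FH | 261 s | 159 s | 161 s | 130 s |
| Count-CO | 252 s | 128 s | 149 s | 118 s |
| Max-FH | 301 s | 188 s | 180 s | 145 s |
| Max-CO | 279 s | 155 s | 166 s | 131 s |
| Sum-FH | 301 s | 190 s | 180 s | 144 s |
| Sum-CO | 284 s | 158 s | 166 s | 133 s |
| Avg-FH | 381 s | 221 s | 212 s | 176 s |
| Avg-CO | 359 s | 187 s | 196 s | 157 s |
| Var-FH | 365 s | 229 s | 223 s | 177 s |
| Var-CO | 340 s | 191 s | 201 s | 159 s |
| Mean-FH | 322 s | 197 s | 191 s | 154 s |
| Mean-CO | 303 s | 164 s | 176 s | 140 s |
| avgQRT-FH | 0.63 s | 0.77 s | 0.37 s | 0.30 s |
| avgQRT-CO | 0.59 s | 0.64 s | 0.34 s | 0.27 s |
| Decrease % | 6% | 20% | 9% | 11% |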
Table 4. Avg response times of aggregate functions on the complete index base

ON FIRST-HALF CUBES
DATASETS        COUNT   MAX     SUM     AVG     VAR     MEAN
CoveType        0.44    0.50    0.49    0.65    0.59    0.53
SEP85L          0.45    0.57    0.57    0.70    0.68    0.59
STCO            0.27    0.31    0.31    0.37    0.36    0.32
OnlineRetail    0.23    0.21    0.23    0.28    0.28    0.25

ON LAST-HALF CUBES
DATASETS        COUNT   MAX     SUM     AVG     VAR     MEAN
CoveType        0.49    0.54    0.55    0.70    0.66    0.59
SEP85L          0.5     0.60    0.62    0.73    0.75    0.64
STCO            0.29    0.32    0.32    0.38    0.39    0.34
OnlineRetail    0.23    0.26    0.26    0.31    0.31    0.27

ON ENTIRE CUBES
DATASETS        COUNT   MAX     SUM     AVG     VAR     MEAN
CoveType        0.46    0.52    0.52    0.67    0.62    0.56
SEP85L          0.47    0.59    0.59    0.72    0.71    0.62
STCO            0.28    0.31    0.31    0.37    0.38    0.33
OnlineRetail    0.22    0.24    0.25    0.30    0.30    0.26
Table 5 is similar to Table 4, but it only shows the AvgQRT over the entire cube in the first-half index base approach. Comparing Tables 4 and 5 shows that, on the whole data cube, the avgQRT of the complete index base approach is only slightly better than that of the first-half index base approach; only for the dataset SEP85L is the improvement considerable.

Table 5. Avg response times of aggregate functions on the first-half index base

ON ENTIRE CUBES
DATASETS        COUNT   MAX     SUM     AVG     VAR     MEAN
CoveType        0.46    0.52    0.53    0.68    0.63    0.56
SEP85L          0.53    0.65    0.64    0.77    0.78    0.67
STCO            0.28    0.32    0.32    0.38    0.39    0.34
OnlineRetail    0.23    0.26    0.25    0.32    0.31    0.27

5.3 Complete Index Base Vs TRS-BUC
As the TRS-BUC approach [24] is known as the most competitive among the approaches to data cube query, we now consider the experimental results of the complete index base approach together with those of the TRS-BUC approach, on the datasets used in [24]. Those results are presented in Table 6 and graphically in Figs. 3, 4, and 5. However, this presentation is not a real comparison between the complete index base approach and the TRS-BUC approach. The reasons are:

– The TRS-BUC implementation is for the data cube query on a fixed aggregate function and a fixed measure; when we change the aggregate function or the measure, the method must redo the whole computation. This is not necessary in the complete index base approach: the data cube query is computed on the index base, and the query can be on any aggregate function and on any measure, without changing the index base.
– In the TRS-BUC experimental result report [24], we do not know whether the construction time and the average query response time include the I/O time. In the complete index base approach, the I/O time is included in all times reported. Moreover, in this approach the avgQRT reported is the average over the five aggregate functions MAX, COUNT, SUM, AVG, and VARIANCE.
– TRS-BUC is implemented on a PC Pentium 4, 2.80 GHz, running Windows XP, while this work runs the experiments on a laptop PC with an Intel Core i5-3320 CPU at 2.60 GHz running Ubuntu 18.04 LTS.

The differences between the index base approaches and TRS-BUC are:

– TRS-BUC optimizes the storage space using tuple references. The index base approaches optimize the storage by the indexes of a half cube.
Table 6. Results of TRS-BUC and Index Base

TRS-BUC
Datasets    Construction time   Storage space   avgQRT
CoveType    300 s               0.4 Gb          0.7s
SEP85L      1150 s              1.2 Gb          0.5s

FIRST-HALF INDEX BASE
Datasets    RT-FH               DU-FH           avgQRT
CoveType    125 s               2 Gb            0.56s
SEP85L      114 s               1.8 Gb          0.67s

COMPLETE INDEX BASE
Datasets    RT-CO               DU-CO           avgQRT
CoveType    178 s               2.3 Gb          0.56s
SEP85L      158 s               1.9 Gb          0.62s
Fig. 3. Construction/Run times in seconds
Fig. 4. Storage space of the representations in giga bytes
Fig. 5. Average query response times in seconds
– TRS-BUC computes the entire data cube for a specific aggregate function and only one measure. The index base approach computes the indexes for building any data cube on any measure and with any aggregate function.
6 Conclusion and Further Work
Based on Tables 2 and 3, we can see that, at a small integration cost in construction time and in storage space, the complete index base representation really improves on the first-half index base representation in terms of average query response time, in particular for queries on the last-half data cube. Though the complete index base and the first-half index base cannot really be compared to TRS-BUC, on functionality and on the experimental results we can see that they can be competitive with TRS-BUC, because

– being based on indexes, they allow computing data cube queries on any measure and any aggregate function,
– the time for building the index bases is substantially reduced with respect to the construction time of TRS-BUC, and
– their avgQRT can be competitive with that of TRS-BUC.

For further work, we plan to study the incremental construction of the complete index base and a more interesting question: can we further reduce the storage space of the index base without substantially increasing the average query response time?
References

1. Agarwal, S., et al.: On the computation of multidimensional aggregates. In: Proceedings of VLDB 1996, pp. 506–521 (1996)
2. Harinarayan, V., Rajaraman, A., Ullman, J.: Implementing data cubes efficiently. In: Proceedings of SIGMOD 1996, pp. 205–216 (1996)
3. Blackard, J.A.: The forest covertype dataset. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/covtype
4. Hahn, C., Warren, S., London, J.: Edited synoptic cloud reports from ships and land stations over the globe. http://cdiac.esd.ornl.gov/cdiac/ndps/ndp026b.html
5. Census Modified Race Data Summary File for Counties Alabama through Missouri. http://www.census.gov/popest/research/modified/STCO-MR2010_AL_MO.csv
6. Online Retail Data Set, UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Online+Retail
7. Ross, K.A., Srivastava, D.: Fast computation of sparse data cubes. In: Proceedings of VLDB 1997, pp. 116–125 (1997)
8. Beyer, K.S., Ramakrishnan, R.: Bottom-up computation of sparse and iceberg cubes. In: Proceedings of ACM Special Interest Group on Management of Data (SIGMOD 1999), pp. 359–370 (1999)
9. Vitter, J.S., Wang, M., Iyer, B.R.: Data cube approximation and histograms via wavelets. In: Proceedings of International Conference on Information and Knowledge Management (CIKM 1998), pp. 96–104 (1998)
10. Han, J., Pei, J., Dong, G., Wang, K.: Efficient computation of iceberg cubes with complex measures. In: Proceedings of ACM SIGMOD 2001, pp. 441–448 (2001)
11. Lakshmanan, L., Pei, J., Han, J.: Quotient cube: how to summarize the semantics of a data cube. In: Proceedings of VLDB 2002, pp. 778–789 (2002)
12. Phan-Luong, V.: A simple and efficient method for computing data cubes. In: Proceedings of the 4th International Conference on Communications, Computation, Networks and Technologies INNOV 2015, pp. 50–55 (2015)
13. Phan-Luong, V.: A simple data cube representation for efficient computing and updating. Int. J. Adv. Intell. Syst. 9(3&4), 255–264 (2016). http://www.iariajournals.org/intelligent_systems
14. Phan-Luong, V.: Searching data cube for submerging and emerging cuboids. In: Proceedings of The 2017 IEEE International Conference on Advanced Information Networking and Applications Science AINA 2017, pp. 586–593. IEEE (2017)
15. Phan-Luong, V.: First-half index base for querying data cube. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2018. AISC, vol. 868, pp. 1129–1144. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01054-6_78
16. Sismanis, Y., Deligiannakis, A., Roussopoulos, N., Kotidis, Y.: Dwarf: shrinking the petacube. In: Proceedings of ACM SIGMOD 2002, pp. 464–475 (2002)
17. Wang, W., Lu, H., Feng, J., Yu, J.X.: Condensed cube: an efficient approach to reducing data cube size. In: Proceedings of International Conference on Data Engineering 2002, pp. 155–165 (2002)
18. Casali, A., Cicchetti, R., Lakhal, L.: Extracting semantics from data cubes using cube transversals and closures. In: Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD 2003), pp. 69–78 (2003)
19. Casali, A., Nedjar, S., Cicchetti, R., Lakhal, L., Novelli, N.: Lossless reduction of datacubes using partitions. Int. J. Data Warehous. Min. (IJDWM) 5(1), 18–35 (2009)
20. Lakshmanan, L., Pei, J., Zhao, Y.: QC-trees: an efficient summary structure for semantic OLAP. In: Proceedings of ACM SIGMOD 2003, pp. 64–75 (2003)
21. Xin, D., Han, J., Li, X., Wah, B.W.: Star-cubing: computing iceberg cubes by top-down and bottom-up integration. In: Proceedings of VLDB 2003, pp. 476–487 (2003)
22. Feng, Y., Agrawal, D., Abbadi, A.E., Metwally, A.: Range cube: efficient cube computation by exploiting data correlation. In: Proceedings of International Conference on Data Engineering, pp. 658–670 (2004)
23. Shao, Z., Han, J., Xin, D.: Mm-cubing: computing iceberg cubes by factorizing the lattice space. In: Proceedings of International Conference on Scientific and Statistical Database Management (SSDBM 2004), pp. 213–222 (2004)
24. Morfonios, K., Ioannidis, Y.: Supporting the data cube lifecycle: the power of ROLAP. VLDB J. 17(4), 729–764 (2008)
Multi-resolution SVD, Linear Regression, and Extreme Learning Machine for Traffic Accidents Forecasting with Climatic Variable

Lida Barba1(B), Nibaldo Rodríguez2, Ana Congacha1, and Lady Espinoza1
1 Universidad Nacional de Chimborazo, Riobamba 060108, Ecuador
[email protected]
2 Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
[email protected]
Abstract. The purpose of this work is to forecast variables related to traffic accidents in Ecuador from 2015 to 2019. Traffic accidents lead to severe injuries and fatalities in Ecuador, with 4925 deaths and 20794 injured in the period of analysis. Models based on Multiresolution Singular Value Decomposition (MSVD) and Extreme Learning Machine (ELM) are proposed to improve the accuracy of multi-week ahead forecasting. This study adds a climatic variable to enhance the effectiveness of both types of models. The performance of the MSVD+ELM-based models is compared with a conventional Linear Regression Model (LRM) combined with MSVD. To assess the forecasting accuracy, three metrics were used: Root Mean Squared Error (RMSE), modified Index of Agreement (IoAm), and modified Nash-Sutcliffe Efficiency (NSEm). Models based on Linear Regression (MSVD+LRM) without the climatic variable present the lowest accuracy, with an average Nash-Sutcliffe Efficiency of 65.4% for 12-week ahead forecasting, whereas models that integrate the climatic variable as input present gains in prediction accuracy, with an average Nash-Sutcliffe Efficiency of 94.6% for Linear Regression-based models and 95.9% for ELM-based models. The implementation of the proposed models will help to guide the planning and decision-making of government institutions in the face of the complex problem of traffic accidents addressed in this work.

Keywords: Extreme learning machine · Forecasting · Linear regression · Multiresolution singular value decomposition · Traffic accidents
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 501–517, 2022. https://doi.org/10.1007/978-3-030-82196-8_37

1 Introduction

According to the World Health Organization (WHO), around 3500 people die every day in road accidents, and tens of millions of people are injured or disabled each year. Traffic accident forecasting can be used as prior knowledge to handle multiple issues that must be addressed by the government and society, leading to better traffic management. Traffic accident forecasting involves spatial and temporal information commonly provided by government and police agencies. The historical data about the number of events, fatalities, injuries, drivers, and vehicles involved in traffic accidents in different geographical zones are the main source of information used as input of the estimation models, whereas the future conditions are obtained as output. In Ecuador, according to the data provided by the National Directorate of Traffic Control and Road Safety (DNTSV), 21575 accidents were registered between 2015 and 2019, with 4925 deaths and 20794 injured.

Regressive models have been used for traffic accident prediction, among them linear and nonlinear regression [1], Seasonal Autoregressive Integrated Moving Average [2], and K-nearest neighbor in conjunction with neural networks [3, 4]. Researchers must deal with the nonlinear and nonstationary characteristics of traffic accident time series. Non-parametric methods have been preferred over parametric methods to achieve improved prediction accuracy. In this context, in recent years, deep learning (DL) has been found to be useful in different applications for prediction and classification [5]. ELM is an improved version of a single hidden layer feedforward neural network, with fixed architecture and randomly assigned hidden nodes, providing fast learning speed and good generalization performance in different fields [6–8].

The aim of this work is to obtain more accurate prediction results for traffic accidents without introducing high time consumption. The proposed models are based on Multiresolution Singular Value Decomposition in conjunction with Extreme Learning Machine (MSVD+ELM), for multi-step ahead forecasting using a multi-input, multi-output (MIMO) strategy [9]. MSVD has been shown to reach high effectiveness in traffic accident prediction [10]; however, it has not been used with ELM. In order to enhance the accuracy of the proposed models, we also introduce a climatic time series as explanatory variable.

The proposed models are evaluated with twenty time series of traffic accidents of Ecuador. Five series correspond to the entire country, while the remaining fifteen correspond to zones 3, 6, and 9 of Ecuador; the data contain the counts of fatalities, injured, uninjured, drivers, and involved vehicles, from 2015 to 2019, with weekly sampling. Furthermore, comparative experiments are carried out with the conventional Linear Regression model combined with MSVD (MSVD+LRM).

The rest of the paper is organized as follows. In Sect. 2, we describe the methodology: we present ELM and MSVD for multi-week ahead forecasting, and the complementary methods for experimentation and evaluation. In Sect. 3, the experimental results of the proposed models are shown and discussed, in comparison with conventional linear regression. In Sect. 4, we conclude the paper and point out future work.
2 Methodology

The methodology proposed in this work consists of the implementation of 80 models based on Linear Regression and Extreme Learning Machine, with and without the climatic variable; all models are based on Multiresolution Singular Value Decomposition. The aim of this study is to obtain improved accuracy in multi-week ahead forecasting of five variables of traffic accidents (fatalities, injured, uninjured, drivers, and vehicles) by means of real historical databases of Ecuador. The proposed methodology is summarized in the block diagram of Fig. 1.
Fig. 1. Block Diagram of MSVD+LRM and MSVD+ELM for Multi-Week Ahead Forecasting of Traffic Accidents, with and without Climatic Variable T.
2.1 Linear Regression Model

The well-known linear regression model is defined through the relationship between two groups of variables, responses and inputs, which can be represented as a straight line and is denoted by

$$y(n) = \beta Z + \varepsilon$$

where y(n) is the matrix of observations at instant n, Z is the regression matrix, β is the matrix of unknown parameters, and ε is the matrix of random errors. The coefficients β are computed with the Moore–Penrose pseudoinverse matrix.

2.2 Extreme Learning Machine

The Extreme Learning Machine is a nonparametric model applied to solve classification and regression problems. ELM presents good performance and higher learning speed than conventional neural networks. ELM only updates the output weights, which connect the hidden nodes and the output nodes, while the remaining parameters, i.e., the input weights and biases, are randomly generated and kept fixed during training [11–13]. The output function of an ELM for a MIMO model at a time instant n is:

$$y_k(n) = \sum_{j=1}^{N_H} \beta_{(k,j)} h_j(x)$$ (1)

where $\beta_{(k,j)}$ is the weight of the connection between the hidden layer and the output layer for a horizon $k = 1, \ldots, \tau$ and $j = 1, \ldots, N_H$ hidden nodes; $h_j(x)$ is the output of the j-th hidden node with respect to the input x, obtained with a nonlinear piecewise continuous function such as sigmoid, hyperbolic, Gaussian, sine, or radial basis. The weight matrix solution is:

$$\beta = H^{\dagger} T$$ (2)

where $H^{\dagger}$ is the Moore–Penrose generalized inverse of the matrix of hidden-layer outputs and T is the target matrix of the training sample. ELM applies nonlinear mapping functions, unlike conventional artificial intelligence-based techniques such as Support Vector Machine, which uses kernel functions for feature mapping [14], or deep neural networks, which use Restricted Boltzmann Machines (RBM) or Auto-Encoders/Decoders for feature learning [15, 16]. ELM is based on the concept that hidden nodes need not be tuned and can be independent of the training data; therefore, generalization can be reached without iterative adjustment of the connection weights [17].
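As an illustration of Eqs. (1) and (2), the following is a minimal sketch of ELM training and prediction with NumPy. It assumes a sigmoid activation and standard-normal random input weights and biases; the function names `elm_fit` and `elm_predict` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def _hidden_outputs(X, W, b):
    """Sigmoid hidden-layer outputs h_j(x) for every sample (rows of X)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def elm_fit(X, T, n_hidden, rng):
    """Draw random input weights/biases, then solve beta = pinv(H) @ T (Eq. 2)."""
    W = rng.standard_normal((X.shape[1], n_hidden))  # fixed after being drawn
    b = rng.standard_normal(n_hidden)                # fixed after being drawn
    H = _hidden_outputs(X, W, b)
    beta = np.linalg.pinv(H) @ T                     # only the output weights are learned
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Multi-output prediction: one column per forecast horizon k (Eq. 1)."""
    return _hidden_outputs(X, W, b) @ beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 3))                        # toy inputs
    T = np.column_stack([np.sin(X[:, 0]), X[:, 1] * X[:, 2]])  # two toy outputs
    W, b, beta = elm_fit(X, T, n_hidden=40, rng=rng)
    print(elm_predict(X, W, b, beta).shape)
```

Because the hidden layer is never tuned, training reduces to a single least-squares solve, which is where the speed advantage described above comes from.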
2.3 Multilevel Singular Value Decomposition

Singular Value Decomposition (SVD) is an old technique of matrix algebra, initially applied to square matrices and later extended to rectangular matrices. SVD has been used in numerous works for different purposes, for example, feature reduction [19], denoising [20–22], and image compression [23]. Multilevel SVD (MSVD) is an algorithm inspired by the pyramidal process implemented in the multiresolution analysis of the Mallat algorithm [24], created for the representation of the wavelet transform. MSVD implements the iterative embedding of a time series in a Hankel matrix for a pyramidal decomposition into components of low and high frequency [10]. The pseudocode of MSVD is presented in Fig. 2.
Fig. 2. Pseudocode of Multilevel Singular Value Decomposition Algorithm ( Source: Barba, 2018)
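The pseudocode of Fig. 2 is not reproduced in this text. As a rough illustration only, the sketch below decomposes a series with a 2 × (N − 1) Hankel embedding, an SVD, and an unembedding step. Two parts are assumptions rather than the authors' method: the unembedding is done by anti-diagonal averaging, and the stopping rule (stop when the ratio of successive relative energies is within a tolerance `tol` of 1) is a hypothetical reading of the SSR criterion. The function names are also hypothetical.

```python
import numpy as np


def msvd_level(x):
    """One decomposition level: returns (low component, high component, R)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.vstack([x[:-1], x[1:]])                 # embedding: 2 x (N-1) Hankel matrix
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    H1 = s[0] * np.outer(U[:, 0], Vt[0])           # elementary matrix, low frequency
    H2 = s[1] * np.outer(U[:, 1], Vt[1])           # elementary matrix, high frequency

    def unembed(E):
        # Assumed unembedding: average the two entries on each anti-diagonal.
        c = np.empty(n)
        c[0], c[-1] = E[0, 0], E[1, -1]
        c[1:-1] = 0.5 * (E[0, 1:] + E[1, :-1])
        return c

    r = s[0] / (s[0] + s[1])                       # relative energy of lambda_1
    return unembed(H1), unembed(H2), r


def msvd(x, max_levels=20, tol=1e-3):
    """Iterate on the low-frequency component; tol is a hypothetical stopping threshold."""
    x = np.asarray(x, dtype=float)
    c_low, r_prev = x, None
    for level in range(1, max_levels + 1):
        c_low, _, r = msvd_level(c_low)
        if r_prev is not None and abs(r / r_prev - 1.0) < tol:
            break
        r_prev = r
    return c_low, x - c_low, level                 # C_L, C_H, number of levels used


if __name__ == "__main__":
    t = np.arange(100)
    series = np.sin(2 * np.pi * t / 12) + 0.1 * np.cos(1.7 * t)
    c_low, c_high, levels = msvd(series)
    print(levels, np.allclose(c_low + c_high, series))
```

With anti-diagonal averaging, the two components of a level add up to the input series exactly, because the two elementary matrices add up to the Hankel matrix.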
From Fig. 2, the raw time series X is the input, while the outputs are two intrinsic components of low and high frequency, $C_L$ and $C_H$. MSVD consists of three stages: embedding, decomposing, and unembedding. In the embedding stage, the normalized time series of length N is mapped into a Hankel matrix $H_{L \times M}$, where L = 2 and M = N − 1. In the second stage, H is decomposed into the orthogonal matrices of eigenvectors U and V and the singular values $\lambda_1$, $\lambda_2$; from U, V and the singular values, the elementary matrices $H_1$ and $H_2$ are computed, which contain the candidate components of low and high frequency, respectively. In the third stage, the unembedding, the components of low and high frequency are extracted from the elementary matrices $H_1$ and $H_2$, respectively. MSVD is processed iteratively and is controlled by the Singular Spectrum Rate (SSR), computed as follows:

$$SSR_j = \frac{R_j}{R_{j-1}}$$ (3)

$$R_j = \frac{\lambda_1}{\lambda_1 + \lambda_2}$$ (4)

where $R_j$ is the relative energy of the singular values $\lambda_1$ and $\lambda_2$ at iteration j. When SSR converges to its maximum value, the algorithm stops, and the final components are used as input of the prediction model.

2.4 Air Temperature as Explanatory Variable

The influence of climatic conditions on traffic accidents has been well established over recent decades [25–27]. It has been shown that the average temperature is correlated with traffic accidents and the fatalities produced in urban and rural areas, unlike precipitation amounts, which do not explain the traffic accident phenomena [27]. This work uses the historical data of the air temperature in Ecuador provided by the Global Historical Climatology Network (GHCN). The data represent the weekly average temperature from January 1st, 2015 to December 31st, 2019, from 10 weather stations. The data were prepared through weekly sampling, for use as an additional set of input data in the proposed MSVD-ELM model. The time series of air temperature are shown in Fig. 3.
Fig. 3. Air Temperature Average in Ecuador from January 2015 to December 2019
2.5 Multi-week Ahead Forecasting

The inputs of the prediction model MSVD-ELM are the components of low and high frequency extracted through MSVD. The intrinsic components of five time series coming from Ecuador and three relevant regions are used. The model also takes the climatic variable, the average air temperature, as input.

The model was set through a training sample and a validation sample. The forecasting strategy is multiple input, multiple output (MIMO), where the inputs are the observations of P lags, identified through spectral periodograms computed on the intrinsic components $C_L$ and $C_H$. The outputs are the future observations, based on the forecast horizon defined after the execution of the training and validation processes. MIMO computes the output for the forecast horizon τ in a single simulation with a unique model; the result is a vector rather than a scalar value, as below [29]:

$$[\tilde{X}(n+1), \tilde{X}(n+2), \ldots, \tilde{X}(n+\tau)] = f[Z(n), Z(n-1), \ldots, Z(n-P+1)]$$ (5)

where n is the time instant, τ is the maximum forecast horizon, and Z is the regressor vector composed of $C_L$, $C_H$ and the air temperature T, which is built with P units of time delay (lags), as follows:

$$Z(n) = [C_L(n), C_H(n), T_H(n)]$$ (6)
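The construction of MIMO training samples from Eqs. (5) and (6) can be sketched as follows. The function name `build_mimo_samples` and the flattened row layout are assumptions for illustration; each input row concatenates Z(n), Z(n−1), ..., Z(n−P+1), and each output row holds the next τ values of the series being forecast.

```python
def build_mimo_samples(c_low, c_high, temp, target, n_lags, horizon):
    """Return (inputs, outputs) for a MIMO forecaster.

    inputs[i]  = [Z(n), Z(n-1), ..., Z(n-P+1)] flattened, Z(k) = [C_L(k), C_H(k), T(k)]
    outputs[i] = [X(n+1), ..., X(n+tau)]
    """
    inputs, outputs = [], []
    for n in range(n_lags - 1, len(target) - horizon):
        row = []
        for lag in range(n_lags):                  # most recent observation first
            k = n - lag
            row.extend([c_low[k], c_high[k], temp[k]])
        inputs.append(row)
        outputs.append(list(target[n + 1 : n + 1 + horizon]))
    return inputs, outputs


if __name__ == "__main__":
    # Toy series of length 10 (values chosen so that indices are easy to read).
    c_low = list(range(10))
    c_high = [10 + i for i in range(10)]
    temp = [20 + i for i in range(10)]
    target = [100 + i for i in range(10)]
    Z, Y = build_mimo_samples(c_low, c_high, temp, target, n_lags=3, horizon=2)
    print(len(Z), Z[0], Y[0])
```

A single model fitted on these pairs returns the whole horizon at once, which is the property that distinguishes MIMO from recursive one-step forecasting.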
2.6 Assessment Methods for Forecast Accuracy

Building the forecasting models named MSVD-ELM requires setting processes to obtain the components of low and high frequency via MSVD and to identify the number of hidden nodes for each ELM implemented. To assess the forecasting accuracy, three metrics were used: Root Mean Square Error (RMSE), modified Index of Agreement (IoAm), and modified Nash-Sutcliffe Efficiency (NSEm). The equations of the assessment methods for forecast accuracy are presented next:

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \tilde{x}_i)^2}$$ (7)

$$SAE = \sum_{i=1}^{N} |x_i - \tilde{x}_i|$$ (8)

$$SAD = \sum_{i=1}^{N} |x_i - \mu_x|$$ (9)

$$IoAm = 1 - \frac{SAE}{2\,SAD} \quad \text{if } SAE \le 2\,SAD$$ (10)

$$IoAm = \frac{2\,SAD}{SAE} - 1 \quad \text{if } SAE > 2\,SAD$$ (11)

$$NSEm = 1 - \frac{SAE}{SAD}$$ (12)

where N is the number of elements of the validation sample, $x_i$ is the i-th observed value, $\tilde{x}_i$ is the i-th forecasted value, and $\mu_x$ is the mean of the observed values.
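The three metrics can be computed with a few lines of standard-library Python. In this sketch the function name `forecast_metrics` is hypothetical, and the branch used when SAE exceeds 2·SAD follows the refined index of agreement of Willmott et al., which is an assumption about the intended definition of IoAm.

```python
import math


def forecast_metrics(observed, forecast):
    """Return RMSE, IoAm and NSEm for two equal-length sequences."""
    n = len(observed)
    mu = sum(observed) / n
    rmse = math.sqrt(sum((x - f) ** 2 for x, f in zip(observed, forecast)) / n)
    sae = sum(abs(x - f) for x, f in zip(observed, forecast))   # sum of absolute errors
    sad = sum(abs(x - mu) for x in observed)                    # sum of absolute deviations
    # SAD is zero for a constant observed series; the ratios are undefined in that case.
    if sae <= 2 * sad:
        ioam = 1 - sae / (2 * sad)
    else:
        ioam = 2 * sad / sae - 1        # assumed second branch (refined index of agreement)
    nsem = 1 - sae / sad
    return {"RMSE": rmse, "IoAm": ioam, "NSEm": nsem}


if __name__ == "__main__":
    obs = [1.0, 2.0, 3.0, 4.0]
    print(forecast_metrics(obs, obs))          # perfect forecast
    print(forecast_metrics(obs, [2.5] * 4))    # forecasting the mean everywhere
```

A forecast equal to the observed mean gives NSEm = 0 and IoAm = 0.5, which is a convenient baseline when reading the percentages reported for the models.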
3 Results and Discussion

3.1 Data

Historical data of traffic accidents were provided by the DNTSV, an institution of the Ecuadorian government responsible for planning, regulating and controlling the management of land transport, traffic and road safety, in order to guarantee free and safe land mobility. Its areas of competence correspond to 22 of the 24 provinces of Ecuador, in the coast, sierra, and amazon regions, excluding the provinces of Guayas and Santa Elena (zone 8). The accident numbers are divided into 20 different series with weekly count data from 2015 to 2019. The first five correspond to the entire Ecuadorian territory; the remaining fifteen correspond to three relevant zones of Ecuador: Zone 3 (provinces: Chimborazo, Cotopaxi, Pastaza, Tungurahua), Zone 6 (provinces: Azuay, Cañar, Morona Santiago), and Zone 9 (Distrito Metropolitano de Quito). Figures 4, 5, 6 and 7 show the time series of Ecuador and its zones. Figure 4(a–e) shows high variability in the time series, with 4925 deaths, 20794 injured, 20392 uninjured, 33524 vehicles involved in accidents, and 27551 drivers. Figure 4(f) shows the periodogram of the time series corresponding to the number of traffic accidents; this technique was applied to identify the relevant frequencies/periods for selecting the time lags used in the prediction models.
Fig. 4. Traffic Accidents Data in Ecuador since January 2015 to December 2019 with Weekly Sampling. (a) Number of Fatalities, (b) Number of Injured, (c) Number of Uninjured, (d) Number of Involved Vehicles, (e) Number of Involved Drivers, (f) Periodogram of Traffic Accidents Time Series, which Contains the Number of Accidents in Ecuador.
Table 1 presents the descriptive statistics of the time series: total, mean, standard deviation, kurtosis, skewness, quartile 1, and quartile 3, computed to characterize the location and variability of the data sets. Kurtosis measures whether the data are heavy-tailed or light-tailed relative to a normal distribution; the standard normal distribution has a kurtosis of zero, positive kurtosis indicates a heavy-tailed distribution, and negative kurtosis indicates a light-tailed distribution. Skewness is an indicator of lack of symmetry: a value of zero indicates a symmetric data set (as for the standard normal distribution), positive values indicate data that are skewed right, and negative values indicate data that are skewed left. The counts of deaths have higher values of kurtosis and skewness than the remaining series, followed by the injured counts. The uninjured, vehicle, and driver series of Ecuador and Zone 3 present skewness values near or equal to zero. Quartile 1 (Q1) is the median of the lower half of the data set: about 25% of the values in the data set lie below Q1 and about 75% lie above it. Quartile 3 (Q3), or 75th percentile, indicates that 75% of the observations are below its value.

Table 1. Descriptive statistics of weekly accident count series from 2015 to 2019

Series                    Total    Mean    Std. deviation   Kurtosis   Skewness   Quartile 1   Quartile 3
1. Ecuador: Deaths        4925     18,6    6,2              0,9        0,6        14           22
2. Ecuador: Injured       20794    78,5    36,6             −0,4       0,5        48           100
3. Ecuador: Uninjured     20392    77,0    33,0             −0,5       0,0        53           102
4. Ecuador: Vehicles      33524    126,5   46,1             −0,7       −0,1       89           163
5. Ecuador: Drivers       27551    104,0   41,8             −0,8       0,0        70           136
6. Z3: Deaths             1261     4,8     3,0              6,2        1,6        3            6
7. Z3: Injured            4064     15,3    8,9              0,9        1,0        9            20
8. Z3: Uninjured          6506     24,6    11,0             −0,1       0,1        18           32
9. Z3: Vehicles           10208    38,5    14,2             −0,3       −0,1       29           48
10. Z3: Drivers           7789     29,4    12,0             −0,3       0,1        21           37
11. Z6: Deaths            441      1,7     1,6              3,0        1,4        0            2
12. Z6: Injured           2074     7,9     6,4              3,6        1,7        3            11
13. Z6: Uninjured         1475     5,6     3,8              0,4        0,9        3            8
14. Z6: Vehicles          2717     10,3    5,4              0,3        0,8        6            13
15. Z6: Drivers           2084     7,9     4,6              0,5        0,8        5            10
16. Z9: Deaths            294      1,1     2,1              50,0       6,0        0            1
17. Z9: Injured           1246     4,7     5,0              11,0       2,8        2            6
18. Z9: Uninjured         1600     6,1     4,0              2,3        1,0        3            8
19. Z9: Vehicles          2256     8,5     4,3              0,0        0,3        5            11
20. Z9: Drivers           1868     7,1     3,9              −0,3       0,3        4            10
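For readers who want to reproduce statistics of the kind listed in Table 1, the sketch below computes the mean, standard deviation, skewness, and excess kurtosis (zero for a normal distribution, as described above) from central moments. The function name `describe` is hypothetical, and the use of population (1/N) moments is an assumption; the text does not state which estimator was used.

```python
def describe(values):
    """Mean, standard deviation, skewness and excess kurtosis from population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment (variance)
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return {
        "mean": mean,
        "std": m2 ** 0.5,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,             # excess kurtosis
    }


if __name__ == "__main__":
    print(describe([1, 2, 3, 4, 5]))    # symmetric: skewness is 0
    print(describe([1, 1, 1, 10]))      # long right tail: positive skewness
```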
Figures 5, 6, and 7 present the time series of traffic accidents of Zone 3 (Z3), Zone 6 (Z6), and Zone 9 (Z9), respectively. Figures 5(f), 6(f), and 7(f) show the periodograms of the time series corresponding to the number of traffic accidents in each zone; they show the relevant frequencies/periods used to select the time lags of the prediction models.
Fig. 5. Traffic Accidents Data in Zone 3 of Ecuador since January 2015 to December 2019 with Weekly Sampling. (a) Number of Fatalities, (b) Number of Injured, (c) Number of Uninjured, (d) Number of Involved Vehicles, (e) Number of Involved Drivers, (f) Periodogram of Traffic Accidents Time Series (Number of Accidents in Zone 3).
According to [28], 96.3% of traffic accidents in Ecuador from 2015 to 2018 were associated with ten causes. The predominant cause was driving with inattention to traffic conditions, with an incidence of 56.8% of accidents, followed by drunkenness, not yielding the right of way (including to pedestrians), inattention while driving, speeding, the recklessness of the pedestrian, the climatic factor, improper overtaking, and mechanical damage.
Fig. 6. Traffic Accidents Data in Zone 6 of Ecuador since January 2015 to December 2019 with Weekly Sampling. (a) Number of Fatalities, (b) Number of Injured, (c) Number of Uninjured, (d) Number of Involved Vehicles, (e) Number of Involved Drivers, (f) Periodogram of Traffic Accidents Time Series (Number of Accidents in Zone 6).
3.2 Intrinsic Components Obtained from MSVD

The proposed models are based on time series decomposition and artificial intelligence, via the MSVD algorithm and ELM, respectively. The time series were decomposed into components of low and high frequency through Multilevel Singular Value Decomposition, which consists of a pyramidal process for multiresolution analysis; the stopping of the algorithm was controlled with the Singular Spectrum Rate (Eq. 3). The SSR results are shown in Table 2; for example, a value of 16 means that the time series was decomposed in 16 decomposition levels. The decomposition into intrinsic components of low and high frequency for the time series of deaths and injured is illustrated in Fig. 8. It can be observed that the components of low frequency present features of long-memory periodicity, while the components of high frequency present features of short-term periodic fluctuations.
Fig. 7. Traffic Accidents Data in Zone 9 of Ecuador since January 2015 to December 2019 with Weekly Sampling. (a) Number of Fatalities, (b) Number of Injured, (c) Number of Uninjured, (d) Number of Involved Vehicles, (e) Number of Involved Drivers, (f) Periodogram of Traffic Accidents Time Series (Number of Accidents in Zone 9).
3.3 Forecasting Based on an ELM and Regression Models

Twenty multi-week ahead forecasting models based on ELMs were implemented. The intrinsic components of the time series were extracted with MSVD and used as input of the ELMs; additionally, a climatic variable, the average air temperature of Ecuador over the data collection period, was used as an input variable of each model. For comparison purposes, typical parametric Linear Regression-based models were used, because this type of model is easily implemented and suited to prediction tasks. The time lags of the prediction models are shown in Table 2. The optimal number of hidden nodes for the ELM-based models was identified for one-week ahead forecasting; the assessment method used was NSEm, for which values close to one indicate good performance. Figure 9 shows the model performance for each group of series: the counts of traffic accidents of Ecuador show the best performance with 70 hidden nodes, the counts in Zone 3 with 60 hidden nodes, the counts in Zone 6 with 50, and for counts in Zone 9, hidden nodes.
MSVD+LRM with climatic variable
2,6
5,4
3,8
6,1
5,0
1,7
3,3
2,6
3,1
2,8
1,1
2,6
1,8
2,1
1,9
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
77,7
79,7
74,9
77,5
65,2
89,2
90,1
88,5
80,8
71,1
95,8
95,6
95,2
93,5
80,5
58,2
61,4
54,2
58,8
44,0
78,6
80,2
77,4
63,1
48,1
91,5
91,1
90,4
86,9
61,9
1,0
1,0
0,7
0,9
0,4
2,0
2,3
1,7
1,0
0,5
4,7
5,8
3,2
4,3
1,9
RMSE IoAm NSEm RMSE
Series MSVD+LRM
1,0
1,0
1,0
1,0
1,0
1,0
1,0
1,0
1,0
1,0
1,0
1,0
1,0
1,0
1,0
94,6
94,6
95,4
95,6
96,0
93,4
92,1
94,2
95,4
95,9
93,6
92,8
93,8
91,9
91,2
1,0
1,0
0,8
1,1
0,4
2,0
2,3
1,7
1,2
0,6
4,3
5,3
2,9
3,9
1,9
IoAm NSEm RMSE
MSVD+ELM
95,64
95,84
95,87
95,8
95,16
96,61
96,52
96,6
96,7
95,39
98,34
98,12
98,32
97,93
94,95
91,2
91,6
91,6
91,4
90,1
93,1
93,0
93,1
93,3
90,6
96,7
96,2
96,6
95,9
89,8
1,0
1,0
0,7
0,9
0,4
1,9
2,3
1,7
1,0
0,5
4,3
5,3
2,9
3,9
1,9
IoAm NSEm RMSE
MSVD+ELM with climatic variable
97,4
97,5
97,8
98,0
98,0
97,7
97,6
97,9
98,3
97,9
98,6
98,4
98,7
98,4
96,3
(continued)
94,8
95,0
95,5
95,9
96,0
95,5
95,1
95,8
96,7
95,9
97,3
96,8
97,4
96,8
92,6
IoAm NSEm
Table 2. Information and Performance of Multi-Week Ahead Forecasting Models, ELM-Based, ELM+Climatic Variable, and LRM-Based Model, All Results Correspond to a Forecast Horizon of 12 Weeks.
Table 2. (continued)
MSVD+LRM with climatic variable
1.7 3.1 1.7 1.4 1.5 2.8
16 17 18 19 20 Mean
80.6 80.6 82.1 77.4 65.3 51.4
65.4 63.6 65.9 58.7 41.3 32.2
1.7 0.5 0.5 0.5 0.8 0.4
RMSE IoAm NSEm RMSE
Series MSVD+LRM
1.0 1.0 1.0 1.0 1.0 1.0
94.6 96.7 96.9 97.5 94.7 95.0
1.7 0.6 0.6 0.6 1.1 1.1
IoAm NSEm RMSE
MSVD+ELM
96.2 96.39 96.57 96.6 94.53 91.32
92.1 92.6 93.0 93.1 88.7 81.1
1.6 0.5 0.5 0.5 0.8 0.4
IoAm NSEm RMSE
MSVD+ELM with climatic variable
98.0 98.4 98.8 98.8 97.4 97.4
95.9 96.8 97.5 97.6 94.9 94.8
IoAm NSEm
Table 2 records the forecasting results by means of the assessment metrics IoAm and NSEm, the criteria optimized in this study in order to improve upon the predictions obtained by regression models without the climatic variable, for twelve-week ahead forecasting. The results of the ELM-based models correspond to the best performance of 50 repetitions with the number of hidden nodes shown in the same table. The average accuracy of MSVD+LRM is 80.6% for IoAm and 65.4% for NSEm. The second-best performance was obtained with the MSVD+ELM models, with an average accuracy of 82.1% for IoAm and 67.3% for NSEm. The highest accuracy was reached by the MSVD+ELM models with climatic variable, with an average accuracy of 96.9% for IoAm and 93.8% for NSEm. Table 2 also presents the gain among models: applying MSVD+ELM in place of MSVD+LRM yields an average gain of 3.7% for NSEm, whereas applying MSVD+ELM with climatic variable increases accuracy by up to 54.8% for NSEm. The experiment shows that the proposed forecasting methods, linear and nonlinear, based on MSVD, have high efficiency, with accuracy increasing when the climatic variable is added as a model input. Our results are consistent with the findings in [29], which concluded that the weather regression model yields the most accurate prediction for series of crash numbers in Germany, followed by the model without weather variables.
Fig. 8. Time Series of Deaths and Injured and the Intrinsic Components of Low and High Frequency. (a) Fatalities in Ecuador, (b) Injured in Ecuador, (c) Fatalities in Zone 3, (d) Injured in Zone 3, (e) Fatalities in Zone 6, (f) Injured in Zone 6, (g) Fatalities in Zone 9, (h) Injured in Zone 9.
Fig. 9. Selection of Optimal Number of Hidden Nodes for ELM-Based Models. (a) Hidden Nodes vs NSEm for One-Week Ahead Forecasting of Traffic Accidents in Ecuador, (b) Hidden Nodes vs NSEm for One-Week Ahead Forecasting of Traffic Accidents in Zone 3, (c) Hidden Nodes vs NSEm for One-Week Ahead Forecasting of Traffic Accidents in Zone 6, (d) Hidden Nodes vs NSEm for One-Week Ahead Forecasting of Traffic Accidents in Zone 9.
4 Conclusions
In this paper, the validity of the proposed linear and nonlinear models for multi-week ahead forecasting has been evaluated. Twenty time series of traffic accidents, sampled weekly in Ecuador from 2015 to 2019, were used to evaluate the models' effectiveness. All models were based on intrinsic components of the time series, extracted with Multilevel Singular Value Decomposition. Models based on Linear Regression present the lowest accuracy, with an average Nash-Sutcliffe Efficiency of 80.6%. ELM-based models present improved accuracy, with an average Nash-Sutcliffe Efficiency of 82.1%. Furthermore, models based on Extreme Learning Machine that integrate the climatic variable as an input present clear gains in prediction accuracy, with an average Nash-Sutcliffe Efficiency of 96.9%. Accordingly, it has been observed that the climatic variable is a good predictor of the complex dynamics of traffic accidents, resulting in improved accuracy. The clear gains in prediction accuracy allow us to conclude that the approaches proposed in this work provide valid alternatives to forecast time series of traffic accidents, number of deaths, injured, uninjured, vehicles, and drivers for the whole of Ecuador and three zones of relevance in this country.
The methodology presented in this paper can be applied to other areas, whether or not they are influenced by climatic variables. It should be expected that forecasting models in areas influenced by climatic variables will increase their accuracy when that exogenous input is added. The implementation of these models will help to guide the planning and decision-making of government institutions in the face of the complex problem of traffic accidents addressed in this work. As future work, we aim to apply new forecasting strategies for comparison purposes, seeking more accuracy and versatility, such as the stationary wavelet transform as a preprocessing technique for decomposing time series into approximation coefficients and detail coefficients to be used as inputs of an extreme learning machine. Moreover, we will implement new spectral analysis techniques to identify elementary frequencies in time series signals.
Acknowledgment. We thank the Modeling and Simulation research group of Universidad Nacional de Chimborazo for its endorsement of the research project “Intelligent Traffic Management for accident prevention”.
Conflicts of Interest. The authors declare that there are no conflicts of interest of any nature.
References
1. Igissinov, N., et al.: Prediction mortality rate due to the road-traffic accidents in Kazakhstan. Iran. J. Public Health 49(1), 68–76 (2020)
2. Zhang, X., Pang, Y., Cui, M., Stallones, L., Xiang, H.: Forecasting mortality of road traffic injuries in China using seasonal autoregressive integrated moving average model. Ann. Epidemiol. 25(2), 101–106 (2015)
3. Gu, X., Li, T., Wang, Y., Zhang, L., Wang, Y., Yao, J.: Traffic fatalities prediction using support vector machine with hybrid particle swarm optimization. J. Algorithms Comput. Technol. 12(1), 20–29 (2018)
4. Kuang, L., Yan, H., Zhu, Y., Tu, S., Fan, X.: Predicting duration of traffic accidents based on cost-sensitive Bayesian network and weighted K-nearest neighbor. J. Intell. Transp. Syst. 23(2), 161–174 (2019)
5. Emmert-Streib, F., Yang, Z., Feng, H., Tripathi, S., Dehmer, M.: An introductory review of deep learning for prediction models with big data. Front. Artif. Intell. 3, 1–23
6. Zhang, R., Lan, Y., Huang, G., Xu, Z.: Universal approximation of extreme learning machine with adaptive growth of hidden nodes. IEEE Trans. Neural Networks Learn. Syst. 23(2), 365–371 (2020)
7. Hu, R., Ratner, E., Stewart, D., Björk, K., Lendasse, A.: A modified Lanczos Algorithm for fast regularization of extreme learning machines. Neurocomputing 414, 172–181 (2020)
8. Ren, L., Gao, Y., Liu, J., Zhu, R., Kong, X.: L2,1-extreme learning machine: an efficient robust classifier for tumor classification. Comput. Biol. Chem. 89, 107368 (2020)
9. Wang, J., Song, Y., Liu, F., Hou, R.: Analysis and application of forecasting models in wind power integration: a review of multi-step-ahead wind speed forecasting models. Renew. Sustain. Energy Rev. 60, 960–981 (2016)
10. Barba, L.: Multiscale Forecasting Models. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-94992-5
11. Huang, G., Zhu, Q., Siew, C.: Extreme learning machine: a new learning scheme of feedforward neural networks. In: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541), Budapest, vol. 2, pp. 985–990 (2004)
12. Huang, G., Zhou, H., Ding, X., Zhang, R.: Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. Part B 42(2), 513–529 (2012)
13. Huang, G., Song, S., You, K.: Trends in extreme learning machines: a review. Neural Networks 61, 32–48 (2015)
14. Catanzaro, B., Sundaram, N., Keutzer, K.: Fast support vector machine training and classification on graphics processors. In: Proceedings of the 25th International Conference on Machine Learning, pp. 104–111 (2008)
15. Bengio, Y.: Learning deep architectures for AI. Found. Trends® Mach. Learn. 2(1), 1–127 (2009)
16. Kasun, L., Zhou, H., Huang, G.: Representational learning with ELMs for big data. IEEE Intell. Syst. 28(5), 31–34 (2013)
17. Huang, G.: An insight to extreme learning machines: random neurons, random features and kernels. Cogn. Comput. (Online) 6, 376–390 (2014)
18. Pozo, C., Ruíz-Femenia, R., Caballero, J., Guillén-Gosálbez, G., Jiménez, L.: On the use of principal component analysis for reducing the number of environmental objectives in multiobjective optimization: application to the design of chemical supply chains. Chem. Eng. Sci. 69(1), 146–158 (2012)
19. Abu-Shikhah, N., Elkarmi, F.: Medium-term electric load forecasting using Singular Value Decomposition. Energy 36(7), 4259–4271 (2012)
20. Jha, S., Yadava, R.: Denoising by singular value decomposition and its application to electronic nose data processing. IEEE Sens. J. 11(1), 34–44 (2011)
21. Zhao, X., Ye, B.: Selection of effective singular values using difference spectrum and its application to fault diagnosis of headstock. Mech. Syst. Sig. Process. 25(5), 1617–1631 (2011)
22. Ranade, A., Mahabalarao, S., Kale, S.: A variation on SVD based image compression. Image Vis. Comput. 25(6), 771–777 (2007)
23. Mallat, S.: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 11(7), 674–693 (1989)
24. Karlaftis, M., Yannis, G.: Weather effects on daily traffic accidents and fatalities: a time series count data approach. In: Proceedings of the 89th Annual Meeting of the Transportation Research Board, Washington, D.C., USA (2010)
25. Bergel-Hayat, J., Debbarh, M., Antoniou, C., Yannis, G.: Explaining the road accident risk: Weather effects. Accid. Anal. Prevent. 60, 456–465 (2013)
26. Drosu, A., Cofaru, C., Popescu, M.V.: Influence of weather conditions on fatal road accidents on highways and urban and rural roads in Romania. Int. J. Automot. Technol. 21(2), 309–317 (2020). https://doi.org/10.1007/s12239-020-0029-4
27. Congacha, A., Barba, J., Palacios, L., Delgado, J.: Characterization of traffic accidents in Ecuador. Novasinergia 2(2), 17–29 (2019)
28. Barba, L., Rodríguez, N.: A novel multilevel-SVD method to improve multistep ahead forecasting in traffic accidents domain. Comput. Intell. Neurosci. 2017, 1–12 (2017). Article ID 7951395
29. Diependaele, K., Martensen, K., Lerner, M., Schepers, A., Bijleveld, F., Commandeur, J.: Forecasting German crash numbers: the effect of meteorological variables. Accid. Anal. Prev. 125, 336–343 (2019)
Identifying Leading Indicators for Tactical Truck Parts’ Sales Predictions Using LASSO
Dylan Gerritsen and Vahideh Reshadat
Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
Abstract. This paper aimed to identify leading indicators for a case company that supplies truck parts to the European truck aftersales market. We used LASSO to extract relevant information from a collected pool of business, economic, and market indicators. We propose the efficient one-standard error rule, as an alternative to the default one-standard error rule, to reduce the influence of sampling variation on the LASSO tuning parameter value. We found that applying the efficient one-standard error rule instead of the default one improved forecasting performance by an average of 0.73%. Next to that, we found that for our case study, applying forecast combination yielded the best forecasting performance, outperforming all other considered models, with an average improvement of 2.38%. Thus, including leading context information did lead to more accurate parts sales predictions for the case company. Also, due to the transparency of LASSO, using LASSO provided business intelligence about relevant predictors and lead effects. Finally, from a pool of 34 indicators, 7 indicators appeared to have clear lead effects for the case company.
Keywords: LASSO · Sales forecasting · Leading indicators
1 Introduction
Sales forecasting plays a significant role in business strategies nowadays. In particular, tactical (i.e. up to 12 months) forecasting often supports short-term decision-making in supply chain management, as it serves as a basis for raw material purchase, inventory planning, and production scheduling. Strategic forecasting, by contrast, is often referred to as long-term forecasting and principally supports decision-making in the development of overall strategies and capacity planning. Both forecasting strategies commonly use observed values of the past and available knowledge to predict the future as accurately as possible [2]. However, including external information could enhance the performance of a sales forecasting model: according to Currie and Rowley [1], using additional information can enhance forecasting performance in volatile environments. The main focus of previous research has been enhancing operational forecasts (i.e. up to 48 h ahead). For example, Williams et al. [3] successfully integrated supply chain information into a forecasting model, whereas Ma et al. [4] used additional price and promotional data to improve forecast accuracy. Conversely, the dynamics of tactical forecasts can be different due to the relevant horizons and business models. Moreover, leading indicators, such as macroeconomic variables, can contain leading context information in terms of changing economic conditions [2]. These indicators are mainly published on a monthly or quarterly basis and are therefore useless for operational forecasting purposes. However, for medium to long-term horizons (i.e. 3 to 12 months ahead), macroeconomic information is relevant and could enhance forecast performance [5].
This paper aims to identify leading indicators for a case company that supplies truck parts to the European truck aftersales market. Also, this paper explores whether including leading context information leads to more accurate sales predictions in comparison with traditional time series methods. Additionally, this paper makes several contributions to the existing academic literature:
• We propose the efficient one-standard error rule, as an alternative to the default one-standard error rule, by combining efficient CV, proposed in Jung [20], with the commonly used one-standard error rule, described in Hastie et al. [16]. The purpose of the efficient one-standard error rule is to reduce the influence of sampling variation on the LASSO tuning parameter value.
• The studies of Sagaert et al. [2] and Verstraete et al. [6] reported forecasting performance losses of LASSO on the longer horizons, compared to traditional methods. These studies solely considered and compared the forecasting performance of individual models, whereas we found that, for our case study, applying forecast combination resulted in improved forecasting performance over almost all horizons.
The rest of this paper is organized as follows. The next section discusses relevant literature related to identifying leading indicators in a tactical sales forecasting environment. Section 3 introduces the case study, the modeling characteristics we are dealing with, and the defined experimental setup. Finally, the results are presented in Sect. 4, followed by a brief conclusion in Sect. 5.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 518–535, 2022. https://doi.org/10.1007/978-3-030-82196-8_38
2 Related Work
2.1 Tactical Sales Forecasting Problem
Forecasting methods have a large influence on the development of different branches of artificial intelligence, including Fuzzy Systems [7], Natural Language Processing [8–11], and Expert Systems [12]. According to Verstraete et al. [6], using macroeconomic indicators as input variables for tactical sales forecasting introduces two major challenges. The first challenge is the limited sample size of available sales data, which is considered a typical challenge in sales forecasting in general. Companies often lack effective data management practices and therefore cannot access historical data. On the other hand, even if companies do have effective data management practices to a certain extent, historical data is often no longer representative due to changing product portfolios and customer behaviors. Moreover, as macroeconomic data is mainly reported at monthly or higher aggregation levels, the amount of usable data can become even more limited. The second challenge is the large number of available macroeconomic indicators across multiple publicly available data sources. For example, the best-known economic databases, such as those of the Organisation for Economic Co-operation and Development (OECD), Federal Reserve Economic Data (FRED), and Eurostat, provide access to thousands of macroeconomic time series. As a result, selecting potential macroeconomic indicators can become very time-consuming and quite complex. According to Sagaert et al. [2], these two challenges together create a distinct tactical sales forecasting problem: a considerably large set of predictors (p) combined with limited sales data sample sizes (n).
2.2 Forecasting Frameworks
Sagaert et al. [13] proposed a framework to improve tactical sales forecasting using macroeconomic leading indicators. The proposed framework automatically identified key leading indicators from an enormous set of macroeconomic indicators for a supplier to the tire industry. In many studies, the gross domestic product (GDP) is used to represent the ongoing economic activity at some point in time. For clarity, GDP represents the value of all finished goods and services produced within a country in a specific time period. However, Sagaert et al. [2] indicated that GDP is principally an aggregate variable and therefore does not capture detailed changes in the various sectors and economic activities. Moreover, they mention that more detailed macroeconomic indicators could provide relevant information, so the number of potentially relevant indicators increases extensively, especially for supply chains across multiple markets and countries. The case company of Sagaert et al. [2] has a global supply chain and supplies numerous tire manufacturers across multiple markets; as a result, they initially selected 67,851 monthly macroeconomic variables from several sections of the FRED database. To model any leading effects of an indicator on the sales variable, each input variable is lagged in time up to a maximum considered time lag. A maximum leading effect of 12 months is assumed, and therefore the number of input variables increases to a total of 67,851 × 12 = 814,212 predictors. With this many predictors, causal-regression modeling becomes highly complex and practically infeasible. For this reason, [13] proposed using LASSO regression. Moreover, they opted for LASSO regression because “the LASSO forecast is transparent, and provides insights into the selected leading indicators. Experts can benefit by gaining a better understanding of their market and can thus improve their understanding of market dynamics and interactions” [13, p. 127]. Additionally, since each input variable is lagged in time multiple times, they highlight that multicollinearity may be present among the input variables. Due to the shrinkage properties of LASSO, the authors mention that LASSO is capable of effectively dealing with multicollinearity.
Verstraete et al. [6] proposed a comparable methodology that automatically generates tactical sales forecasts by using a large group of macroeconomic indicators. A noticeable difference is that they assumed that macroeconomic conditions determine the trend of sales. Therefore, they opted to use the LASSO regression technique of Sagaert et al. [2] to forecast the trend component. The sales data is decomposed into a trend, a seasonal, and a remainder component using the STL decomposition proposed by Cleveland et al. [14]. They motivate their choice for STL decomposition by noting that it can be robust to outliers, that the seasonal component may change across periods, and that the smoothness of the trend component is controllable. Furthermore, as the data is divided into three independent components, each component is forecasted separately. Verstraete et al. [6] used the seasonal naive method as proposed by Xiong et al. [15] to forecast the seasonal component. For clarity, the seasonal naive method uses the latest seasonal observation as a forecast for the consecutive seasonal period. Verstraete et al. [6] assumed that the remainder component is determined by factors other than macroeconomic indicators. For instance, they mention social media, promotions, weather, and random noise as factors that mainly determine the remainder component. However, they considered predicting the remainder component out of scope, as they assumed that the predictive power of these factors lies outside the tactical time window.
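The seasonal naive method described above is simple enough to sketch in a few lines. The block below is only an illustration of the idea (repeat the most recent seasonal cycle); the function name and the toy values are hypothetical and are not taken from the cited studies.

```python
def seasonal_naive(history, period, horizon):
    """Forecast by repeating the last observed seasonal cycle."""
    if len(history) < period:
        raise ValueError("need at least one full seasonal cycle")
    last_cycle = history[-period:]
    return [last_cycle[h % period] for h in range(horizon)]


# Toy monthly seasonal component covering two years (hypothetical values).
seasonal_component = [float(month) for month in range(1, 13)] * 2
forecast = seasonal_naive(seasonal_component, period=12, horizon=14)
print(forecast)
```

Forecasts beyond one full period simply wrap around, so the 13th step reuses the value of the 1st step.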
3 Case Study
3.1 Target Variable
Leading indicators are defined as variables that contain predictive information and ideally can predict a certain movement for a target variable in advance. Hence, to identify any leading indicators that are relevant for the case company’s parts sales, it is necessary to specify a target variable. For the case company, we specified the target variable as the total monthly truck parts sales, reported by the entire European dealer network. Figure 1 shows the specified target variable in this case study. It should be noted that the y-axis is normalized due to data confidentiality. As can be seen, observed sales data is available from January 2002 to February 2020 which is equivalent to 218 monthly observations. Moreover, it becomes clear that the sales data is following a certain trend and that it contains a seasonal pattern that repeats every 12 months.
Fig. 1. Parts sales across the entire European dealer network.
3.2 Leading Indicators
The studies of Sagaert et al. [13] and Verstraete et al. [6] only included publicly available macroeconomic indicators as potential leading indicators. For this field project, it was possible to access non-publicly available databases with data specifically related to the case company and the European road freight market. Therefore, a distinction is made between three different types of potential leading indicators: business, economic, and market indicators. To collect indicators that have the potential of being a leading indicator for the case company’s parts sales, different data sources were used. First, internal departments of the case company served as data sources for several business indicators that covered the case company’s business activities. Second, the publicly available Eurostat and OECD economic databases provided access to thousands of macroeconomic time series related to European territories and served as data sources for several economic indicators that covered Europe’s overall economic climate. Finally, Rementum Research and Management, further denoted as Rementum, is a market research and advisory firm specialized in both the European road freight market and commercial vehicles with a gross vehicle weight over 6 tonnes. Rementum collects data from a broad array of sources relevant for heavy commercial road transport, such as road carriers, transport equipment OEMs, and OE & aftermarket component suppliers, to analyze the European road transport market conditions. The expertise of Rementum was used to collect several market indicators that covered the ongoing activities in Europe’s road transport sector. Accordingly, Table 1 presents an overview of all business, economic and market indicators included in this case study. Notably, all indicators have data available from January 2005, and thus we have January 2005 to February 2020 available for data preparation and modeling, which is equivalent to 182 observations.

Table 1. Collected indicators that have the potential of being a leading indicator.
ID   Indicator description                                    Unit     Source
Y    Autoregressive information: NSA                          €        Case comp
X1   Truck deliveries: NSA                                    Trucks   Case comp
X2   Business climate: SA                                     Index    Rementum
X3   Economic sentiment: SA                                   Index    Eurostat
X4   Passenger car registrations: NSA                         Cars     Rementum
X5   Retail confidence: SA                                    Balance  Eurostat
X6   Industrial confidence: SA                                Balance  Eurostat
X7   Industrial production: SA                                Index    Eurostat
X8   Gross domestic product: SA                               Index    OECD
X9   Producer price index: SA                                 Index    Eurostat
X10  Construction confidence: SA                              Balance  Eurostat
X11  Construction spending: NSA                               Index    Rementum
X12  Construction activity: SA                                Index    Eurostat
X13  Construction and mining equipment sales: MA (3mos), NSA  Index    Rementum
X14  Replacement truck tire sales (ST): MA (2mos), NSA        Index    Rementum
X15  Replacement truck tire sales (LT): MA (12mos), NSA       Index    Rementum
X16  OE truck tire sales (ST): MA (2mos), NSA                 Index    Rementum
X17  OE truck tire sales (LT): MA (12mos), NSA                Index    Rementum
X18  Aftermarket truck tire deliveries: NSA                   Index    Rementum
X19  Diesel consumption growth (LT): MA (12mos), NSA          %        Rementum
X20  Diesel consumption growth (ST): MA (2mos), NSA           %        Rementum
X21  Retail diesel price: NSA                                 €        Rementum
X22  Automotive diesel deliveries: NSA                        m³       Rementum
X23  OEM truck orders growth (LT): MA (12mos), NSA            %        Rementum
X24  OEM truck orders (ST): MA (3mos), NSA                    Trucks   Rementum
X25  OEM order intake expectations: MA (6mos), NSA            Balance  Rementum
X26  OEM production expectations: MA (6mos), NSA              Balance  Rementum
X27  Road transport activity: NSA                             Index    Rementum
X28  Road transport capacity: SA                              Index    Rementum
X29  Freight volume index: MA (3mos), NSA                     Index    Rementum
X30  Carrier confidence: NSA                                  Balance  Rementum
X31  Carrier demand expectations: NSA                         Balance  Rementum
X32  Carrier hiring expectations: NSA                         Balance  Rementum
X33  Carrier pricing expectations: NSA                        Balance  Rementum
ST = Short-term; LT = Long-term; SA = Seasonally adjusted; NSA = Not seasonally adjusted; MA = Moving average; mos = months
3.3 Modeling of Lead Effects
The previous sections have shown that the identification of leading indicators in a sales forecasting environment results in a high-dimensional problem with multicollinearity among the predictors. LASSO has already been found useful in the identification of leading indicators, due to its shrinkage properties and transparency [2,6,13]. For the same reasons, LASSO is chosen as the modeling technique for this case study. In order to use regression techniques on a forecasting problem with temporal dependencies, the data must be restructured into a supervised learning task. Given a time series, this is done by using previous time steps as input variables and the next time step as the output variable. Sagaert et al. [13] indicated that macroeconomic indicators can contain leading context information up to a maximum horizon of 12 months, and thus this case study includes results up to 12 months ahead. In addition, to model any lead effects, the decision was made to include all 12 previous time steps as input variables. This means that every potential leading indicator is lagged in time 12 times, and therefore the number of predictors increases significantly, from 34 business, economic and market indicators to a total of 34 × 12 = 408 predictors. Hence, we are dealing with a high-dimensional problem, since the number of predictors exceeds the number of observations (p > n). Figure 2 shows how the data is restructured into a supervised learning task for the first three observations when predicting (t + 1) ahead.
Fig. 2. Time series data to supervised learning setting.
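The restructuring described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function name `make_supervised`, the column naming, and the toy data are hypothetical, and any stationarity transformation is left out.

```python
def make_supervised(indicators, target, n_lags=12, horizon=1):
    """Build (columns, X, y) from aligned time series.

    indicators: dict mapping a name to a list of values, all of equal length.
    target: list of target values with the same length.
    Each row holds, for every indicator, its values at t, t-1, ..., t-n_lags+1
    (lag 0 is the most recent value); the label is the target at t + horizon.
    """
    n_obs = len(target)
    names = sorted(indicators)
    columns = [f"{name}_lag{lag}" for name in names for lag in range(n_lags)]
    X, y = [], []
    for t in range(n_lags - 1, n_obs - horizon):
        row = [indicators[name][t - lag] for name in names for lag in range(n_lags)]
        X.append(row)
        y.append(target[t + horizon])
    return columns, X, y


# Toy data: 34 hypothetical indicators with 30 monthly observations each.
n_obs = 30
indicators = {f"x{k}": [float(100 * k + t) for t in range(n_obs)] for k in range(34)}
target = [float(t) for t in range(n_obs)]

columns, X, y = make_supervised(indicators, target, n_lags=12, horizon=1)
print(len(columns), len(X))  # number of predictors, number of usable rows
```

With 34 indicators and 12 lags the sketch produces 34 × 12 = 408 columns, so the number of predictors quickly exceeds the number of usable rows, which is the p > n setting described above.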
3.4 Proposed Method for Identifying Lead Effects
LASSO is a linear regression analysis method that performs both variable selection and regularization in order to prevent overfitting of high-dimensional data and to enhance both prediction accuracy and model interpretability [18]. The LASSO solution minimizes a penalized residual sum of squares, yielding coefficients that are shrunken toward zero:

$$\hat{\beta}^{lasso} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \qquad (1)$$

The solution, and thus the $\hat{\beta}^{lasso}$ estimator, highly depends on the magnitude of regularization, which equals a value between 0 and 1 and is represented by tuning parameter λ. Hastie et al. [16] propose to determine λ based on the cross-validation estimate of the prediction error. Typically, K-fold cross-validation (KCV) randomly splits the data into K folds, fits a model using K − 1 folds, and uses the Kth fold for testing. As the data is partitioned randomly, using KCV in a time series environment may not seem applicable, as temporal dependencies are ignored. Nevertheless, using CV in a time series environment was extensively studied by Bergmeir and Benítez [17], who did not find any practical problems with standard cross-validation. Moreover, they suggest using standard KCV or blocked CV together with stationary data, as this uses all available information for training and testing. Accordingly, we used 10-fold CV with stationary data to determine the CV estimate on the in-sample data. Additionally, we selected the λ corresponding to the most regularized model within one standard error (λ1se) of the minimum CV error estimate (λmin), also known as the one-standard error rule [16]. Figure 3 shows an example of how tuning parameter values λmin and λ1se are determined based on the CV error curve with vertical standard error bars.
Fig. 3. Tuning parameter values λmin and λ1se .
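The default one-standard error rule can be illustrated with a short sketch. It assumes the cross-validation step has already produced a mean error and a standard error per candidate λ; the function name and the numbers below are hypothetical and do not come from the case study.

```python
def select_lambda(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation results.

    lambda_min minimises the mean CV error. lambda_1se is the largest
    (most regularised) lambda whose mean CV error lies within one
    standard error of that minimum.
    """
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    candidates = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    return lambdas[best], max(candidates)


# Hypothetical CV error curve over a small grid of lambda values.
lambdas = [0.01, 0.05, 0.10, 0.20, 0.40]
cv_mean = [1.30, 1.10, 1.00, 1.05, 1.40]
cv_se = [0.10, 0.09, 0.08, 0.08, 0.12]

lambda_min, lambda_1se = select_lambda(lambdas, cv_mean, cv_se)
print(lambda_min, lambda_1se)
```

In this toy curve the minimum sits at λ = 0.10, and λ = 0.20 is the most regularised value whose error (1.05) stays below the threshold of 1.00 + 0.08, so the rule trades a small increase in CV error for a sparser model.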
3.5 Experimental Setup
To conduct any experiments, the observations available for modeling must be split into two subsets: in-sample observations (i.e. training data) and out-of-sample observations (i.e. test data). A common approach to splitting training and test sets when dealing with temporal dependencies is time series cross-validation [19]. This approach uses a series of test sets, with each test set consisting of a single observation. The corresponding training set consists only of observations prior to the observation in the test set. Figure 4 shows how the training and test sets are defined when predicting (t + 1) ahead. We set the initial size of the in-sample data to 70% of the available sample size, after which the in-sample data is updated every time new observations become available.
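The expanding-window scheme just described can be sketched as a small generator. The function name is hypothetical; the sizes in the example (147 observations, initial training size 102) are the figures this section reports for model Mt+12, used here only to show the resulting number of test sets.

```python
def expanding_window_splits(n_obs, initial_train_size):
    """Yield (train_indices, test_index) pairs for time series cross-validation.

    Every test set is a single observation, and the training set holds all
    observations that precede it, so the window grows by one at each step.
    """
    for test_idx in range(initial_train_size, n_obs):
        yield list(range(test_idx)), test_idx


splits = list(expanding_window_splits(n_obs=147, initial_train_size=102))
print(len(splits))                       # number of one-observation test sets
print(len(splits[0][0]), splits[0][1])   # first training size, first test index
```

Each split would feed one pass of the experiment: tune and fit on the training indices, then predict the single held-out observation.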
Fig. 4. Time series cross-validation data split when predicting (t + 1) ahead.

Table 2. Data available for model performance evaluation.
Model  Period used as test sets  Number of test sets
Mt+1   Jul 2015 to Feb 2020      56
Mt+2   Aug 2015 to Feb 2020      55
Mt+3   Sep 2015 to Feb 2020      54
Mt+4   Oct 2015 to Feb 2020      53
Mt+5   Nov 2015 to Feb 2020      52
Mt+6   Dec 2015 to Feb 2020      51
Mt+7   Jan 2016 to Feb 2020      50
Mt+8   Feb 2016 to Feb 2020      49
Mt+9   Mar 2016 to Feb 2020      48
Mt+10  Apr 2016 to Feb 2020      47
Mt+11  May 2016 to Feb 2020      46
Mt+12  Jun 2016 to Feb 2020      45
The smallest sample size, after data preparation activities, equals 147 observations for model Mt+12; as a result, the initial size of the training set was set to 102 observations for all 12 models. Accordingly, Table 2 presents an overview of how many observations are used as a series of test sets in order to evaluate model performance. When aggregating all observations across all horizons, a total of 606 observations are available for model performance evaluation. Fig. 5 shows the experiment design that is used at every time step, across all forecast horizons. First, we determine the CV estimate of the prediction error using the in-sample data and a 10-fold CV grid search. Thereafter, using this grid search, we determine the value of tuning parameter λ with the commonly used one-standard error rule. Then, we fit a $\hat{\beta}^{lasso}$ estimator using all available in-sample data and the selected tuning parameter λ, after which the estimator is used to predict the observation in the test set.
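The fit-and-predict step at the end of this procedure can be illustrated with a small coordinate descent solver for the objective in Eq. (1). This is a sketch under assumptions: the paper does not state which solver or software was used, the function names are hypothetical, and the toy data is invented so that the second predictor is uninformative.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(X, y, lam, n_iter=200):
    """Minimise sum_i (y_i - b0 - sum_j b_j x_ij)^2 + lam * sum_j |b_j|."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centring removes the unpenalised intercept from the coordinate updates.
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    residual = [v - y_mean for v in y]  # residual of the all-zero model
    col_sq = [sum(row[j] ** 2 for row in Xc) for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Inner product of column j with the residual that excludes j.
            rho = sum(Xc[i][j] * (residual[i] + Xc[i][j] * beta[j]) for i in range(n))
            new_bj = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                beta[j] = new_bj
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


def predict(intercept, beta, row):
    return intercept + sum(b * v for b, v in zip(beta, row))


# Toy in-sample data: y depends on the first column only.
X_train = [[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]]
y_train = [2.0, 4.0, 6.0, 8.0]
intercept, beta = fit_lasso(X_train, y_train, lam=2.0)
print(intercept, beta)
print(predict(intercept, beta, [5.0, 1.0]))  # one-step "test" observation
```

The threshold is λ/2 because the squared-error term in Eq. (1) carries no 1/2 factor. In the toy data the second coefficient is set exactly to zero while the first is shrunk below its least-squares value of 2, which is the variable selection and shrinkage behaviour the section relies on.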
Identifying Leading Indicators
4 Results

4.1 Case Study
The purpose of extracting information from leading indicators is to improve the parts sales predictions by including information on changing economic and market conditions. To assess whether including leading indicators improves forecasting performance, the performance of LASSO is benchmarked against commonly used univariate methods that are unable to respond to these changing conditions. Since we are dealing with a very small sample size, complex machine learning techniques such as recurrent neural networks are considered out of scope, as these techniques require a large sample size for training. The methods used as benchmarks are additive Holt-Winters (AHW), multiplicative Holt-Winters (MHW), and SARIMA. Both Holt-Winters methods were implemented with optimal smoothing parameters α = 0.2, β = 0.1, γ = 0.2, whereas a SARIMA model with (p, d, q)(P, D, Q)m equal to (2, 1, 0)(1, 1, 1)12 was selected using the Akaike information criterion (AIC). Table 3 shows the mean absolute prediction error across all considered models and forecast horizons.
Fig. 5. Experimental setup.
Table 3. Mean absolute prediction error across all models and forecast horizons.

Forecast horizon   LASSO      SARIMA     AHW        MHW
1-month            1,865.93   1,876.55   1,732.56   1,994.59
2-months           1,883.23   1,778.85   1,636.93   1,935.19
3-months           1,895.85   1,955.72   1,669.06   1,999.18
4-months           1,808.90   1,778.09   1,722.89   2,043.26
5-months           1,803.76   1,820.68   1,726.69   2,087.26
6-months           1,785.07   1,872.53   1,887.68   2,140.95
7-months           2,067.32   1,968.04   2,125.34   2,313.06
8-months           2,092.85   1,939.43   2,038.74   2,248.73
9-months           2,242.76   2,021.24   2,120.01   2,291.26
10-months          2,138.72   2,232.01   2,342.59   2,547.03
11-months          2,377.19   2,363.90   2,405.85   2,680.01
12-months          2,349.52   2,316.47   2,378.72   2,614.66
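The comparison in Table 3 can be checked mechanically. The short script below transcribes the table values (the variable names are illustrative) and determines, per horizon, which model has the lowest mean absolute prediction error, and whether AHW beats MHW on every horizon.

```python
from collections import Counter

# Mean absolute prediction errors transcribed from Table 3 (horizon in months).
TABLE_3 = {
    1: {"LASSO": 1865.93, "SARIMA": 1876.55, "AHW": 1732.56, "MHW": 1994.59},
    2: {"LASSO": 1883.23, "SARIMA": 1778.85, "AHW": 1636.93, "MHW": 1935.19},
    3: {"LASSO": 1895.85, "SARIMA": 1955.72, "AHW": 1669.06, "MHW": 1999.18},
    4: {"LASSO": 1808.90, "SARIMA": 1778.09, "AHW": 1722.89, "MHW": 2043.26},
    5: {"LASSO": 1803.76, "SARIMA": 1820.68, "AHW": 1726.69, "MHW": 2087.26},
    6: {"LASSO": 1785.07, "SARIMA": 1872.53, "AHW": 1887.68, "MHW": 2140.95},
    7: {"LASSO": 2067.32, "SARIMA": 1968.04, "AHW": 2125.34, "MHW": 2313.06},
    8: {"LASSO": 2092.85, "SARIMA": 1939.43, "AHW": 2038.74, "MHW": 2248.73},
    9: {"LASSO": 2242.76, "SARIMA": 2021.24, "AHW": 2120.01, "MHW": 2291.26},
    10: {"LASSO": 2138.72, "SARIMA": 2232.01, "AHW": 2342.59, "MHW": 2547.03},
    11: {"LASSO": 2377.19, "SARIMA": 2363.90, "AHW": 2405.85, "MHW": 2680.01},
    12: {"LASSO": 2349.52, "SARIMA": 2316.47, "AHW": 2378.72, "MHW": 2614.66},
}

# Model with the lowest error for each horizon.
best_per_horizon = {h: min(row, key=row.get) for h, row in TABLE_3.items()}
# Number of horizons on which each model is the best one.
wins = Counter(best_per_horizon.values())
# True if additive Holt-Winters beats the multiplicative variant everywhere.
ahw_always_beats_mhw = all(row["AHW"] < row["MHW"] for row in TABLE_3.values())

if __name__ == "__main__":
    print(best_per_horizon)
    print(dict(wins), ahw_always_beats_mhw)
```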
As can be seen in Table 3, AHW consistently outperforms MHW, and thus the seasonal variations can be considered additive. The performances of LASSO, SARIMA and AHW are more competitive, as no model consistently outperforms the others. In particular, AHW predicts more accurately on the shorter horizons, whereas SARIMA predicts more accurately on the longer horizons. Thus, although LASSO uses information from external indicators, forecasting performance has not improved compared to traditional time series forecasting methods. Accordingly, the remainder of Sect. 4 elaborates on two experiments that were conducted to analyze whether forecasting performance can be enhanced by applying efficient tuning parameter selection or forecast combination.

4.2 Effect of Efficient Tuning Parameter Selection
With regard to the case study, the commonly used one-standard error rule was used for choosing the value of λ [2,6]. Jung [20] stated that tuning parameter selection is often one of the crucial parts of high-dimensional modeling, and that using CV to select a single value as the optimal tuning parameter can be unstable due to sampling variation. A possible way to account for this sampling variation is to apply repeated CV. However, repeated CV significantly increases computational costs when predicting multiple steps ahead, and as a result Jung [20] proposed the use of efficient CV. Efficient CV selects multiple candidate parameter values and calculates a weighted average of them, with weights depending on their performance, without significant additional computational costs. As a criterion to select C candidates, Jung [20] opts for the top C best performing parameter values.

This experiment explores and analyzes an extension that combines efficient CV with the one-standard error rule. Thus, instead of choosing the top C best performing parameter values as candidates, all parameter values which are considered by the one-standard error rule {λmin, ..., λ1se} are selected as candidates. This combination will be further denoted as the efficient one-standard error rule. It calculates a weighted average of all candidates, with weights depending on the CV error estimates as proposed by Jung [20]. The weights are designed in such a way that candidate values with lower CV errors are assigned a greater weight. Additionally, the weights are normalized so that the weights of all candidate models add up to 1 [20]. The tuning parameter corresponding to the efficient one-standard error rule is obtained by:

$$\hat{\lambda}_{\mathrm{eff1se}} = \sum_{c=1}^{C} w_c \lambda_c \quad \text{with} \quad w_c = \frac{1/\mathrm{CV}(\lambda_c)}{\sum_{c=1}^{C} 1/\mathrm{CV}(\lambda_c)} \tag{2}$$
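Equation (2) can be sketched in Python as a weighted average with weights proportional to the inverse CV error. The function name, the candidate grid and the CV error values below are hypothetical and serve only to illustrate the arithmetic.

```python
import math
from typing import Sequence


def efficient_lambda(lambdas: Sequence[float], cv_errors: Sequence[float]) -> float:
    """Weighted average of candidate penalties as in Eq. (2).

    Each weight is proportional to 1 / CV error, so candidates with a lower
    CV error contribute more, and the weights are normalized to sum to 1.
    """
    inverse_errors = [1.0 / err for err in cv_errors]
    total = sum(inverse_errors)
    weights = [inv / total for inv in inverse_errors]
    return sum(w * lam for w, lam in zip(weights, lambdas))


if __name__ == "__main__":
    # Hypothetical candidates between lambda_min and lambda_1se on a log grid.
    candidates = [math.exp(x) for x in (-1.3, -1.2, -1.1, -1.0)]
    cv_errors = [4.00, 4.05, 4.10, 4.20]  # hypothetical CV error estimates
    lam = efficient_lambda(candidates, cv_errors)
    print(f"log(lambda_eff1se) = {math.log(lam):.3f}")
```

Because the result is an average of grid points, it generally falls between them, so the selected value does not have to coincide with any point of the original search grid.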
Figure 6 shows an example of which parameter values are selected as candidate values by the efficient one-standard error rule.
Fig. 6. Tuning parameter values λmin, λ1se and λeff1se.
As can be seen in Fig. 6, the efficient one-standard error rule selects a total of 4 candidate models: log(λ) = −1.3, log(λ) = −1.2, log(λ) = −1.1 and log(λ) = −1.0. After determining the weights $w_c$, the efficient one-standard error tuning parameter is calculated at log(λeff1se) = −1.14. It should be noted that the value of λeff1se is not a value on the grid used for the parameter search. Hence, the efficient one-standard error rule is capable of finding parameter values on a finer grid without any additional computational costs [20]. Table 4 presents an overview of the model performances when using both the default and the efficient one-standard error rule.

Table 4. Mean absolute prediction error when using the default and efficient one-standard error rule.

Model    One-standard error rule (λ1se)   Efficient one-standard error rule (λeff1se)
Mt+1     1,865.93                         1,855.95
Mt+2     1,883.23                         1,855.71
Mt+3     1,895.85                         1,891.97
Mt+4     1,808.90                         1,788.00
Mt+5     1,803.76                         1,786.83
Mt+6     1,785.07                         1,818.96
Mt+7     2,067.32                         2,063.71
Mt+8     2,092.85                         2,082.58
Mt+9     2,242.76                         2,190.58
Mt+10    2,138.72                         2,150.93
Mt+11    2,377.19                         2,335.49
Mt+12    2,349.52                         2,297.18
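The summary statistics for Table 4 can be recomputed from the table itself. The sketch below assumes that the average improvement is the mean of the per-model relative error reductions; the text does not state the formula, so this reading is an assumption, and the variable names are illustrative.

```python
# Mean absolute prediction errors transcribed from Table 4 (Mt+1 ... Mt+12).
default_1se = [1865.93, 1883.23, 1895.85, 1808.90, 1803.76, 1785.07,
               2067.32, 2092.85, 2242.76, 2138.72, 2377.19, 2349.52]
efficient_1se = [1855.95, 1855.71, 1891.97, 1788.00, 1786.83, 1818.96,
                 2063.71, 2082.58, 2190.58, 2150.93, 2335.49, 2297.18]

# Relative error reduction per model (positive means the efficient rule wins).
relative_gain = [(d - e) / d for d, e in zip(default_1se, efficient_1se)]

# Number of models for which the efficient rule has the lower error.
n_improved = sum(gain > 0 for gain in relative_gain)
# Mean relative reduction across the 12 models, in percent.
mean_gain_pct = 100 * sum(relative_gain) / len(relative_gain)

if __name__ == "__main__":
    print(f"improved models: {n_improved} of {len(relative_gain)}")
    print(f"mean relative improvement: {mean_gain_pct:.2f}%")
```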
As can be seen in Table 4, the proposed efficient one-standard error rule outperforms the default one-standard error rule for 10 out of 12 models, with an average improvement of 0.73%. Hence, choosing multiple candidate values to reduce the influence of sampling variation on the tuning parameter value, instead of choosing a single optimal value, does seem to improve both the tuning parameter selection process and forecasting performance.

4.3 Effect of Forecast Combination
The case study has shown that LASSO did not outperform traditional time series forecasting methods, and the studies of Sagaert et al. [2] and Verstraete et al. [6] likewise reported forecasting performance losses on the longer horizons compared to traditional methods. It should be noted that these studies solely compared the forecasting performance of individual models. Bates and Granger [21] noted that combining sets of forecasts can lead to improvements if each set contains independent information. Moreover, Bates and Granger [21] indicated that this independent information could be of two types: (1) forecasts are based on variables or information that other forecasts have not considered, and (2) forecasts make different assumptions about the form of relationships between variables. For clarity, Fig. 7 illustrates when forecast combinations are superior to individual forecasts.
Fig. 7. Forecast Combinations Considering Five Forecast Vectors u(1), u(2), u(3), u(4) and u(5) and Two Steps Ahead y1 and y2 . The Solid Lines Represent the Forecast Combination in Pairs of Two, Whereas the Dotted Lines to y Represent the Corresponding Error of the Forecast Vectors [22].
As can be seen in Fig. 7, forecasts u(1) and u(3) are highly correlated, and therefore combining them will not improve the forecasting performance significantly. Forecast u(5) is a considerably poor forecast, as its distance to y is large. However, combining forecast u(5) with u(2) will improve the forecasting performance significantly, as the distance between y and the solid line between u(5) and u(2) is reduced. Clearly, the performance improvement is due to the diversity of both models [22].

With regard to LASSO, SARIMA and AHW, each individual model creates forecasts based on independent information. For example, LASSO extracts information from leading indicators, whereas SARIMA extracts information from autocorrelations and AHW extracts information from level, trend and seasonal variations. To quantitatively assess their diversity, the correlations between the individual forecast errors are presented in Table 5.

Table 5. Correlation of individual forecast errors.

          LASSO   AHW     SARIMA
LASSO     1       –       –
AHW       0.758   1       –
SARIMA    0.825   0.944   1

As can be seen in Table 5, the lowest correlation exists between the forecast errors of LASSO and AHW. Thus, combining the individual forecasts of LASSO and AHW has the highest potential for enhanced forecasting performance. In order to obtain combined forecasts, weights must be allocated to the individual forecasts. Accordingly, Bates and Granger [21] introduced numerous methods for determining the weights of each individual forecast, as it is preferred to
assign a greater weight to an individual forecast with higher accuracy. However, Armstrong [23] mentioned that applying weights is only beneficial if there is strong evidence that particular forecasting models are likely to predict better than others. Otherwise, the use of equal weights is likely to perform better under almost all other circumstances [24]. In our case study, there is no strong evidence that LASSO outperforms AHW or vice versa, and thus the decision was made to allocate equal weights to the individual forecasts, i.e. the individual forecasts of LASSO and AHW are averaged. Table 6 shows the forecasting performance of the combined LASSO and AHW forecasts (LASSO-AHW) in comparison to the individual models.

Table 6. Mean absolute prediction error across all forecast horizons.

Forecast horizon   LASSO      SARIMA     AHW        LASSO-AHW
1-month            1,855.95   1,876.55   1,732.56   1,618.09
2-months           1,855.71   1,778.85   1,636.93   1,581.07
3-months           1,891.97   1,955.72   1,669.06   1,650.04
4-months           1,788.00   1,778.09   1,722.89   1,648.04
5-months           1,786.83   1,820.68   1,726.69   1,642.26
6-months           1,818.96   1,872.53   1,887.68   1,736.84
7-months           2,063.71   1,968.04   2,125.34   1,928.84
8-months           2,082.58   1,939.43   2,038.74   1,939.34
9-months           2,190.58   2,021.24   2,120.01   2,047.29
10-months          2,150.93   2,232.01   2,342.59   2,120.66
11-months          2,335.49   2,363.90   2,405.85   2,279.48
12-months          2,297.18   2,316.47   2,378.72   2,316.44
As can be seen in Table 6, after combining the individual forecasts of LASSO and AHW, LASSO-AHW outperforms all individual models for almost all forecast horizons. Thus, it seems that the LASSO and AHW models are so diverse that combining their predictions results in enhanced forecasting performance, with an average improvement of 2.38%. Apparently, LASSO extracted valuable information from leading indicators, whereas Holt-Winters extracted valuable information from level, trend and seasonal variations, and ultimately, combining all of this information resulted in forecasting performance improvements. Hence, with regard to this case study, the inclusion of information extracted from leading indicators did lead to more accurate parts sales predictions.
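The mechanics of equal-weight combination and of the error-correlation diagnostic can be illustrated with a small, self-contained sketch. All numbers below are invented for illustration and are not the case-study data; the function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def mae(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute prediction error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def combine_equal(f1: Sequence[float], f2: Sequence[float]) -> List[float]:
    """Equal-weight combination: the plain average of two forecasts."""
    return [(a + b) / 2 for a, b in zip(f1, f2)]


# Invented toy data: two models whose errors tend to have opposite signs.
actual = [100.0, 120.0, 90.0, 110.0]
model_a = [110.0, 115.0, 95.0, 100.0]
model_b = [95.0, 130.0, 80.0, 112.0]
combined = combine_equal(model_a, model_b)

errors_a = [f - a for f, a in zip(model_a, actual)]
errors_b = [f - a for f, a in zip(model_b, actual)]

if __name__ == "__main__":
    print("MAE A:", mae(actual, model_a))
    print("MAE B:", mae(actual, model_b))
    print("MAE combined:", mae(actual, combined))
    print("error correlation:", round(pearson(errors_a, errors_b), 3))
```

By the triangle inequality, the error of the averaged forecast can never exceed the average of the two individual errors, and the gain grows as the errors become less correlated, which is why the least correlated pair was chosen for combination.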
5 Conclusion
This paper aimed to identify leading indicators for a case company that supplies truck parts to the European truck aftersales market. In addition, it explored whether including leading context information leads to more accurate predictions than the traditional time series methods often used in businesses. LASSO was used to extract relevant information from a collected pool of business, economic, and market indicators. It was found that combining the predictions of LASSO and the traditional Holt-Winters method yielded the best forecasting performance, outperforming all other considered Holt-Winters and SARIMA models. Thus, including leading context information did improve forecasting performance for the case company. Also, owing to the transparency of LASSO, using it provided business intelligence about relevant predictors and any lead effects. Finally, from a pool of 34 indicators, 7 indicators appeared to have clear lead effects for the case company. The exact indicators and lead effects are not revealed due to confidentiality.

Additionally, this research makes several contributions to the existing academic literature. First, we proposed the efficient one-standard error rule as an alternative to the default one-standard error rule, by combining efficient CV, proposed in Jung [20], with the commonly used one-standard error rule, described in Hastie et al. [16]. The purpose of the efficient one-standard error rule is to reduce the influence of sampling variation on the actual tuning parameter value. As stated earlier, we found that applying the efficient one-standard error rule instead of the default one improved forecasting performance in 10 out of 12 models. Future research should explore whether the efficient one-standard error rule improves performance, compared to the default one-standard error rule, when applied to multiple and larger data sets. Second, the studies of Sagaert et al. [13] and Verstraete et al. [6] reported forecasting performance losses of LASSO on the longer horizons, compared to traditional methods. These studies solely considered and compared the forecasting performance of individual models, whereas we found that, for our case study, applying forecast combination resulted in improved forecasting performance over almost all horizons.
References

1. Currie, C.S., Rowley, I.T.: Consumer behaviour and sales forecast accuracy: what’s going on and how should revenue managers respond. J. Revenue Pricing Manag. 9(4), 374–376 (2010). https://doi.org/10.1057/rpm.2010.22
2. Sagaert, Y.R., Aghezzaf, E.H., Kourentzes, N., Desmet, B.: Tactical sales forecasting using a very large set of macroeconomic indicators. Eur. J. Oper. Res. 264(2), 558–569 (2018). https://doi.org/10.1016/j.ejor.2017.06.054
3. Williams, B.D., Waller, M.A., Ahire, S., Ferrier, G.D.: Predicting retailer orders with POS and order data: the inventory balance effect. Eur. J. Oper. Res. 232(3), 593–600 (2014). https://doi.org/10.1016/j.ejor.2013.07.016
4. Ma, S., Fildes, R., Huang, T.: Demand forecasting with high dimensional data: the case of SKU retail sales forecasting with intra- and inter-category promotional information. Eur. J. Oper. Res. 249(1), 245–257 (2016). https://doi.org/10.1016/j.ejor.2015.08.029
5. Fildes, R., Goodwin, P., Lawrence, M., Nikolopoulos, K.: Effective forecasting and judgmental adjustments: an empirical evaluation and strategies for improvement in supply-chain planning. Int. J. Forecast. 25(1), 3–23 (2009). https://doi.org/10.1016/j.ijforecast.2008.11.010
6. Verstraete, G., Aghezzaf, E.H., Desmet, B.: A leading macroeconomic indicators’ based framework to automatically generate tactical sales forecasts. Comput. Ind. Eng. 139(August 2019), 106169 (2020). https://doi.org/10.1016/j.cie.2019.106169
7. Atsalakis, G.S., Protopapadakis, E.E., Valavanis, K.P.: Stock trend forecasting in turbulent market periods using neuro-fuzzy systems. Oper. Res. Int. J. 16(2), 245–269 (2015). https://doi.org/10.1007/s12351-015-0197-6
8. Reshadat, V., Hoorali, M., Faili, H.: A hybrid method for open information extraction based on shallow and deep linguistic analysis. Interdiscip. Inf. Sci. 22(1), 87–100 (2016). https://doi.org/10.4036/iis.2016.R.03
9. Nourani, E., Reshadat, V.: Association extraction from biomedical literature based on representation and transfer learning. J. Theoret. Biol. 488, 110112. https://doi.org/10.1016/j.jtbi.2019.110112
10. Reshadat, V., Faili, H.: A new open information extraction system using sentence difficulty estimation. Comput. Inform. 38(4), 986–1008 (2019). https://doi.org/10.31577/cai_2019_4_986
11. Reshadat, V., Feizi-Derakhshi, M.R.: Studying of semantic similarity methods in ontology. Res. J. Appl. Sci. Eng. Technol. 4(12), 1815–1821 (2012)
12. Collopy, F., Armstrong, J.S.: Rule-based forecasting: development and validation of an expert systems approach to combining time series extrapolations. Manag. Sci. 38(10), 1394–1414 (1992)
13. Sagaert, Y.R., Aghezzaf, E.H., Kourentzes, N., Desmet, B.: Temporal big data for tactical sales forecasting in the tire industry. Interfaces 48(2), 121–129 (2017). https://doi.org/10.1287/inte.2017.0901
14. Cleveland, R.B., Cleveland, W.S., McRae, J.E., Terpenning, I.: STL: a seasonal-trend decomposition procedure based on loess. J. Off. Stat. 6(1), 3–73 (1990)
15. Xiong, T., Li, C., Bao, Y.: Seasonal forecasting of agricultural commodity price using a hybrid STL and ELM method: evidence from the vegetable market in China. Neurocomputing 275, 2831–2844 (2018). https://doi.org/10.1016/j.neucom.2017.11.053
16. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Data Mining, Inference, and Prediction, 2nd edn. Springer, Heidelberg (2009)
17. Bergmeir, C., Benítez, J.M.: On the use of cross-validation for time series predictor evaluation. Inf. Sci. 191, 192–213 (2012). https://doi.org/10.1016/j.ins.2011.12.028
18. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. Roy. Stat. Soc. Ser. B (Methodol.) 58(1), 267–288 (1996)
19. Hyndman, R., Athanasopoulos, G.: Forecasting: Principles and Practice, 2nd edn. OTexts, Melbourne (2018)
20. Jung, Y.: Efficient tuning parameter selection by cross-validated score in high dimensional models. World Acad. Sci. Eng. Technol. 10, 19–25 (2016)
21. Bates, A.J.M., Granger, C.W.J.: The combination of forecasts. 20(4), 451–468 (1969). https://doi.org/10.2307/3008764
22. Atiya, A.F.: Why does forecast combination work so well? Int. J. Forecast. 36(1), 197–200 (2020). https://doi.org/10.1016/j.ijforecast.2019.03.010
23. Armstrong, J.S.: Combining forecasts. Principles of Forecasting: A Handbook for Researchers and Practitioners, pp. 417–439 (2001). https://doi.org/10.4018/jncr.2012070103
24. Clemen, R.T.: Combining forecasts: a review and annotated bibliography. Int. J. Forecast. 5(4), 559–583 (1989). https://doi.org/10.1016/0169-2070(89)90012-5
Detecting Number of Passengers in a Moving Vehicle with Publicly Available Data

Luciano Branco1, Fengxiang Qiao2, and Yunpeng Zhang3(B)

1 Electrical and Computer Engineering, University of Houston, Houston, USA
[email protected]
2 Innovative Transportation Research Institute, Texas Southern University, Houston, USA
[email protected]
3 Information and Logistics Technology, University of Houston, Houston, USA
[email protected]
Abstract. Detecting and counting people inside vehicles with little to no human input has many applications today, from assisting rescue authorities to smart transportation systems, and from automatic crash response to law enforcement of High Occupancy Vehicle and High Occupancy Toll lanes. We propose a framework for counting passengers in moving vehicles based on various image features from various objects and contextual information. Each feature can be computed with state-of-the-art techniques, like Fisher Vectors, before being consolidated into a final detection score. Images from publicly available surveillance road cameras were taken to create a real-world data set, which was then used to train the convolutional neural network YOLOv3. Preliminary results show good prospects for this approach, with the potential for improvements in the detection of each object in the scene, thus improving the overall results. Future work can explore image enhancement with generative adversarial networks.

Keywords: Passenger detection · Machine learning · HOV · HOT · Law enforcement

1 Introduction
Autonomously counting people has gained increasing attention from researchers in recent years. Specifically, counting the number of people inside a vehicle has many applications. The European emergency eCall system, which has been mandatory in every manufactured vehicle since 2018 [3], automatically informs authorities in the event of a car accident, drastically improving response times and helping with victims' recovery [9]. In designing these systems, manufacturers can choose to relay to authorities the number of passengers and the status of seat belts, along with other emergency-related information such as current location [4].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 536–548, 2022. https://doi.org/10.1007/978-3-030-82196-8_39

In addition to improving emergency aid systems, placing cameras inside public vehicles like buses or trains, and using computer vision techniques to unveil passenger data such as passenger flow or to perform object detection, can help with statistical analysis of the system's operation and therefore aid in the development of smart systems [37]. Moreover, image data from private vehicles could provide useful information for automatic systems such as airbag deployment adjustment [17], or help with autonomous driving during handover situations [13].

Another approach that has gained considerable popularity among researchers over the years is to obtain the vehicle occupancy from outside cameras, usually mounted above highways to monitor each lane. Although this is a challenging field of study, recent advances in artificial intelligence and computer vision have shown potential for using this information to ensure proper road law enforcement in High Occupancy Vehicle (HOV) and High Occupancy Toll (HOT) lanes [21,34]. HOV lanes are car-pool lanes where at least two vehicle occupants are required to use the lane legally, making them less congested and helping with traffic flow [14]. Typically, however, they are not very effective, as enforcement rates can be as low as 10% [30], and compliance with the law requires the presence of roadside officers. The manual enforcement of HOV and HOT lane regulations makes the process more costly, less safe, and less efficient than a fully automated one [7]. By developing automatic systems to enforce HOV and HOT regulations, the reliance on roadside officers can be diminished, which can improve the efficiency, cost and general safety of HOV and HOT enforcement [7]. In the present work, an alternative that addresses the challenges of automatic HOV and HOT lane enforcement is proposed, using machine learning methods to augment vehicle occupancy detection.
2 Related Work
To investigate automatic ways of detecting passengers in moving vehicles, previous scholars have used standard-sensor [23,31] and thermal [24] cameras, in addition to near-infrared sensors [5,6,33,34] and radars [1,10,20], along with various approaches. Here we present some of the most relevant literature.

2.1 Standard-Sensor Cameras
Complementary metal oxide semiconductor (CMOS) and charge-coupled device (CCD) image sensors are the standard types of sensors used in common digital cameras and smartphones [18]. The authors of [31] and [23] used these standard-sensor cameras to work on the passenger detection problem. In addition to the typical sensor, flood lights just outside the spectrum of light visible to humans were used in [23] to increase image quality. In [31], a novel windshield segmentation using a Local Appearance Model (LAM) was used together with Histogram of Oriented Gradients (HOG) features to accomplish a computationally inexpensive machine learning classification of front-seat occupancy. Furthermore, in [19], HOG features were used to detect bus passengers' heads and perform classification with a support vector machine (SVM). The authors also used the Kanade-Lucas-Tomasi feature tracker to perform real-time tracking of the passengers' heads to obtain a trajectory and unveil the passenger flow. The authors of [36] also estimated the number of passengers in a bus, but used a compact convolutional neural network (CNN) to estimate a density map for a given image of the bus interior, and were able to obtain performance that is feasible for embedded systems.

2.2 Radar Cameras
In [1,10,20], a frequency-modulated continuous-wave radar was used as the sensor to estimate the number of passengers in a vehicle. The authors of [1] used principal component analysis (PCA) and machine learning classification with an SVM to achieve over 97% accuracy in experiments on detecting passengers inside a minivan.

2.3 Near-Infrared Cameras
Most of the publications reviewed explored the use of near-infrared (NIR) camera sensors [5,6,21,33,34], as they have been shown to address many of the challenges of windshield penetration and varying environmental conditions while retaining good image quality [32]. In [6], the authors explored NIR-based images to capture in-vehicle occupancy, along with seat belt and cellphone violations. To do so, they used deformable part models (DPM) for windshield detection and compared different local aggregation-based image features for classification. Moreover, in [7], DPM was used along with Fisher Vectors (FV) to estimate the presence or absence of a front seat passenger. The authors experimented with an in-house database of 3000 vehicle front passenger seat images, obtaining approximately 96% accuracy.

Using NIR cameras, the authors of [5,33,34] used DPM for region of interest (ROI) extraction and FV classifiers. In addition to this approach, [33] also used HOG, PCA, and SVM to conclude that classifying the front and side views of the vehicle separately is better than aggregating the two views and classifying them together. In addition to the FV classifier, the authors of [5] also used bag-of-visual-words (BOW) and concluded that image classification methods (such as BOW and FV) outperform object detection (i.e. face or seat belt detection) for detecting passengers in a moving vehicle, in agreement with other authors [6,21]. Usually, however, FV tends to yield better results [7].

The authors of [21] also used NIR sensors, but claim that using the neural network “You only look once” (YOLOv3) outperforms DPM without depending on image preprocessing [21]. They used the CNN GoogleNet for the front-view binary classification of occupancy. In addition, they compared three different CNNs (GoogleNet, ResNet, and VGGNet) for a 3-class problem of estimating the number of passengers in the back seat from side-view images. The authors claim to be able to classify more than one person in the back seat, making their approach suitable for HOV3+ lanes.

2.4 Occupancy Detection Methods
According to [21], there are mainly three occupancy detection methods that rely only on information from outside of the vehicle:

Object Detection. Firstly, there are object detection methods, which rely on detecting specific items with defined locations. For instance, Alves [2] disclosed a roadside imaging unit that is based on face detection and the relative location of the detected faces within the car. In contrast, Fan et al. [16] proposed to use seat pattern recognition to detect front passengers inside the vehicle. To do so, they make use of detected features that differ between vacant and occupied seats, such as long horizontal seat lines, softness of edges and number of lines. Pérez-Jiménez et al. showed improved results by combining face and seat belt detection, classified with k-nearest neighbors [26]. Usually, however, this approach of finding objects in a scene is more computationally intensive than feature-based approaches, and suffers more heavily from occlusions and varying passenger positions, leading to less than optimal results [7,21].

Feature Based. Next, there are feature-based methods, which categorize image patterns to find differences between a passenger and his or her surroundings. For example, Xu et al. [35] disclosed a vehicle passenger detection design which calculates features from both the driver and a candidate passenger. The two sets of features are then compared against a previously obtained distance threshold to determine whether a passenger is present. In the work of Artan and Paul [7], mentioned previously, the performance of the image features FV, BoW and Vector of Locally Aggregated Descriptors (VLAD) is compared in the task of detecting the front passenger, with FV achieving the best results. DPM has also been an important tool in aggregating good detection results of image descriptors in vehicle passenger detection [6,7,21,34]. In general, this class of methods performs better than object detection [5] but still encounters problems with low-resolution images, such as the ones taken from roadside cameras [21].

Density Maps. The last occupancy detection approach uses density maps learned from training image data sets. In this approach, various local patch features are extracted from an image and fed to a previously trained CNN to obtain the passenger count. Additional information over time is needed to compare possible detected objects with the respective background. Yang et al. used the supposedly less computationally intensive feature-weighted Euclidean loss to highlight dense regions of bus images, in addition to constraints from consecutive frames, in order to obtain bus passenger count and flow [36]. Typically, this method requires knowledge of the background behind targets and fails to estimate passenger numbers in lower density scenarios [21].

Based on the analysis presented in this section, previous methods of tackling this issue typically focused on one criterion to determine passenger occupancy in vehicles. We present an alternative approach that uses multiple sources of information to augment the automatic decision of whether a passenger is present or not. This way, we can use the best techniques for each detection module, suppressing any shortcomings of a given approach and achieving an improved overall result.

2.5 Solution
The proposed solution for automatic detection of passengers in moving vehicles comprises multiple parts. First, we use YOLOv3 for windshield and side window detection [28], as previous scholars have shown that this convolutional neural network (CNN) outperforms other high-accuracy algorithms such as the deformable part model (DPM) in this task [21]. These two objects then form regions of interest (ROI) in our image. Next, by using multiple ROIs and other context information such as seat belt detection, face detection, and person detection, we can use state-of-the-art techniques for each separate object detection module, building on their specific advantages, and thus construct a final score for the vehicle's occupancy. In other words, we propose to detect multiple objects using various image features to compute a final detection score.
3 Methods

3.1 Data Set
The data set used in this research was obtained by extracting stills from the live feed of cameras in the city of Austin, Texas [25]. Upon initial inspection, 35 of around 100 cameras were selected because they provide higher quality images and are located in places where a view of the passenger seat is possible. Sequential images were taken every minute in early morning and late evening periods. Next, we discarded images in which no car is present or in which the vehicles are too far away for any practical recognition. All images of the data set were then labelled for all cars in the scene, as well as for their windshields and their frontal side windows closer to the camera. The vehicles were also labelled with the presence or absence of a visible front passenger. This process also resulted in a few images being discarded due to poor visibility of the passenger seat. The data set was finally separated into train, test, and validation sets with roughly 70%, 20%, and 10% of the images, respectively, totalling 568 images of preliminary data.

3.2 Preprocessing of Images
After data set extraction, we preprocessed the images in accordance with previous authors, to help the machine learning algorithm find the necessary patterns in the data [6]. We first converted the images to grayscale and then applied the contrast-limited adaptive histogram equalization (CLAHE) algorithm [27] to increase the contrast of images with poor lighting conditions. Figure 1 shows an example of one fully labelled image from the data set, after preprocessing.
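The contrast-limiting step can be illustrated with a small, dependency-free sketch. It is a simplification, not the implementation used in this work: it equalizes a single tile, whereas full CLAHE splits the image into tiles and interpolates between the per-tile mappings. The grayscale weights (ITU-R BT.601) and the clip limit are assumptions chosen for illustration.

```python
# Simplified, single-tile illustration of contrast-limited histogram
# equalization on an 8-bit grayscale image stored as a list of rows.
# Full CLAHE also tiles the image and interpolates between tile mappings;
# that part is omitted here.

def to_grayscale(rgb_rows):
    """Convert rows of (R, G, B) tuples to 8-bit luma (BT.601 weights, assumed)."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in row]
            for row in rgb_rows]


def clipped_equalize(gray_rows, clip_limit=4, levels=256):
    """Equalize one tile after clipping its histogram at `clip_limit` counts."""
    hist = [0] * levels
    for row in gray_rows:
        for value in row:
            hist[value] += 1

    # Clip the histogram and spread the excess evenly over all bins, which
    # limits how strongly any single intensity range can be stretched.
    excess = sum(max(count - clip_limit, 0) for count in hist)
    hist = [min(count, clip_limit) + excess / levels for count in hist]

    # The cumulative distribution of the clipped histogram is the mapping.
    total = sum(hist)
    mapping, running = [], 0.0
    for count in hist:
        running += count
        mapping.append(int(round((levels - 1) * running / total)))

    return [[mapping[value] for value in row] for row in gray_rows]


# A dark, low-contrast 2x4 patch: all values sit between 10 and 13.
dark_patch = [[10, 11, 12, 13], [10, 11, 12, 13]]
equalized = clipped_equalize(dark_patch)
print(equalized)
```

After the mapping, the four intensity levels of the patch are spread over a much wider range, which is the effect sought for poorly lit images.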
Fig. 1. Example of preprocessed image with manually inserted labels. Each rectangle depicts a class of the data set, including cars, windshields, side windows, passengers and empty seats.
3.3 Training Session
The YOLOv3 neural network was used, in accordance with previous studies, to detect the windshield and side window of the vehicles, as well as the vehicles themselves [21]. The goal is to use this information to augment the detection of the car and to compute relative positions that give the expected position of the passenger seat: if the side window is to the left of the windshield, the ROI for the passenger overlaps the ROI of the windshield. However, if the side window is detected to the right of the windshield, the passenger seat is expected to be found in a region purely composed of the windshield, without overlapping. This analysis is necessary since detections will come from various angles and different cameras. The training phase was performed over 80 iterations, and the resulting algorithm is trained to detect windshields, passengers, and empty seats. The validation set was kept totally separate from the training process and is used to evaluate the performance. Images from the training set were used to train the neural network, while those from the test set were used iteratively throughout the training phase to verify the improvement of the neural net.
3.4 Detection Modules
To reach the final score for the vehicle's occupancy, we used logic comparisons and the detection of various objects in the scene. First, by considering the relative position of the car's windows, we estimate the expected passenger location. Next, we run various detection modules in the ROI, each one optimized for one specific task.
In accordance with previous authors, we use Fisher Vectors (FV) to extract image features for the passenger detection module [5,6]. In other words, this module does not rely on detecting specific objects like faces or seat belts, as those can be occluded, but rather tries to detect the overall presence of the passenger. FV can be seen as an extension of Bag-of-Visual-Words (BoVW) and work similarly by transforming low-level local features (such as Scale-Invariant Feature Transform features) into a high-level image representation [6]. The main improvement over BoVW is that, in FV, each feature is not only paired with its closest word in the vocabulary but also carries information on how likely it is to belong to that cluster (word). In addition, the distribution of the vocabulary words is also embedded into this image classification technique. With FV, we obtain a fixed-size image representation, sometimes considered a signature of the desired ROI [12,29].
For the face detection module, we use the free, open-source library Dlib, which has been considered state-of-the-art by previous scholars [11]. The detector implemented in this library works by identifying facial landmarks with a histogram of oriented gradients (HOG) and a linear support vector machine (SVM) approach.
We also implement a seat belt detection module, consisting of gradient orientation detection in specific regions of the image where the seat belt is expected to be visible. Unlike the previous methods, this one is based on object detection and relies on predetermined knowledge of the seat belt's shape.
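To make the difference between BoVW and FV described above concrete, the following sketch contrasts hard and soft assignment for a toy vocabulary. It is an illustration under simplifying assumptions: descriptors are one-dimensional, the vocabulary is a hypothetical two-component Gaussian mixture, and only the gradient with respect to the component means is computed. Real SIFT descriptors are 128-dimensional, and a full FV also includes the gradients with respect to the variances.

```python
import math

# Toy comparison of BoVW hard assignment and Fisher-vector soft assignment.
# The vocabulary (weights, means, stds) below is hypothetical.

def gaussian(x, mean, std):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def posteriors(x, weights, means, stds):
    """Soft assignment: probability that descriptor x belongs to each word."""
    joint = [w * gaussian(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


def fisher_vector_means(descriptors, weights, means, stds):
    """Normalized gradient w.r.t. the means: one entry per vocabulary word."""
    n = len(descriptors)
    fv = []
    for k, (w, m, s) in enumerate(zip(weights, means, stds)):
        acc = sum(posteriors(x, weights, means, stds)[k] * (x - m) / s
                  for x in descriptors)
        fv.append(acc / (n * math.sqrt(w)))
    return fv


weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

x = 1.5
soft = posteriors(x, weights, means, stds)
hard = soft.index(max(soft))  # BoVW keeps only the index of the closest word

# The representation has the same length regardless of the descriptor count.
fv_small = fisher_vector_means([0.2, 1.5, 3.9], weights, means, stds)
fv_large = fisher_vector_means([0.2, 1.5, 3.9, 0.1, 4.2, 2.0], weights, means, stds)
print(hard, [round(p, 3) for p in soft], len(fv_small), len(fv_large))
```

The soft assignment retains how confidently the descriptor belongs to each word, and the resulting vector length depends only on the vocabulary size, which is what gives the fixed-size signature of the ROI.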
4 Evaluation and Testing
To evaluate the results, we use the Jaccard index, as shown in (1), to measure the overlap between ground truth and detected regions [15]. We consider a Jaccard index greater than or equal to 60% a true detection [6]; otherwise, the detection is labeled a miss.
\[ \mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{1} \]
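For axis-aligned bounding boxes, the criterion in (1) reduces to a few lines of arithmetic. The sketch below assumes boxes given as (x_min, y_min, x_max, y_max) tuples; this box format and the example coordinates are illustrative assumptions, while the 0.6 threshold is the one stated above.

```python
def jaccard(box_a, box_b):
    """Jaccard index of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0


def is_true_detection(ground_truth, detection, threshold=0.6):
    """A detection counts as true if it overlaps the ground truth enough."""
    return jaccard(ground_truth, detection) >= threshold


ground_truth = (0, 0, 10, 10)
close_detection = (1, 1, 11, 11)    # overlap 81, union 119
shifted_detection = (5, 5, 15, 15)  # overlap 25, union 175

print(is_true_detection(ground_truth, close_detection),
      is_true_detection(ground_truth, shifted_detection))
```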
In Fig. 2, we present an image from the validation set that was given to the algorithm for prediction. The image highlights the results for the class of empty seats. With this method, we were able to correctly identify three empty seats, as shown by the green rectangles. However, there were also two false positives in this image, represented by the yellow rectangles. The top one depicts a finding that was not previously labelled manually, so it could be a legitimate detection if the image quality were higher, but the bottom one had a passenger present and thus represents a misdetection.
Fig. 2. Detection example for "empty seats" in an image of the validation set. Green rectangles depict a correctly identified region, with respect to the manually labeled ground truth in orange. Yellow rectangles show false-positive cases, where the algorithm either detected a wrong object in the image or incorrectly identified the class being tested. The yellow rectangle at the bottom of the image shows a case in which a "passenger" was incorrectly given an "empty seat" label. (Color figure online)
Across all classes, the detection algorithm had a sensitivity of 0.6653, having a 37% false positive (FP) detection rate, a 42% true positive (TP) rate, and 21% false negative (FN) rate. Focusing on each class, windshields had the best detection performance, with a sensitivity of 0.8145, 30% FP rate, 56% TP rate, and 0.1% FN rate. The detection of empty seats had a sensitivity of 0.6213, with a 47% FP, 32% TP detection rate and 0.2% FN rate. However, of the 21 examples of passengers in the validation set, none was correctly identified. In Fig. 3, we present an overview of the most interesting results. On the left half of this image, we show results for the detection of empty seats and on the
right half, we show the detection of windshields. The algorithm was able to detect windshields with good performance, and empty seats as well. However, the presence of passengers was not detected well by this algorithm. We can observe in this image that the green rectangles (correct detections) occur at empty seats, but yellow rectangles also appear in places where passengers are present.
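Sensitivity depends only on the true positive and false negative counts, so the overall value reported above can be cross-checked against the overall TP and FN rates. The sketch below uses the standard definition TP / (TP + FN); since the reported rates are rounded percentages, a small deviation from the reported 0.6653 is expected.

```python
def sensitivity(true_positive, false_negative):
    """Sensitivity (recall): share of actual objects that were detected."""
    return true_positive / (true_positive + false_negative)


# Overall rates across all classes as reported in the text (in percent).
overall = sensitivity(true_positive=42, false_negative=21)
print(round(overall, 4))
```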
Fig. 3. Overview of the most interesting results. The left half of this image depicts detections for the "empty seats" class, while the right half shows the results for detecting windshields. Green rectangles depict a correctly identified region, with respect to the manually labeled ground truth in orange. Yellow rectangles show false-positive cases for this class. We can observe that the algorithm is able to detect windshields reasonably well. It was also able to correctly identify some empty seats but erroneously assigned passengers to this class as well, as depicted by the yellow boxes in this image. (Color figure online)
5 Conclusion and Future Work
In the present work, we propose an alternative approach for automatic detection of vehicle occupancy based on information from various sources. Our approach comprises a framework in which state-of-the-art detection algorithms can be used to identify various objects in the scene and thus augment the score of the final decision regarding vehicle occupancy.
We investigate the use of features obtained from cameras with standard sensors. However, this approach could benefit from other types of sensors as well, such as NIR cameras or radars. In addition, the proposed framework could be further expanded to estimate the vehicle position and, from that, an expected passenger position. This approach would give a stochastic aspect to the problem and allow conditions such as, “given the good current angle of the car, the passenger should be very visible”, making the accuracy of the approach higher. We can also have conditions for when the angle of the car is poor for passenger visualization, which would lower the algorithm’s confidence in the detection, decreasing false positive cases and possibly changing the result for HOV/HOT lane law enforcement.
This study has several limitations. In contrast with most previous research on this topic, the data set created for this research comprises images taken from road surveillance cameras, made publicly available by the city of Austin, Texas. The cameras were not optimized for the task at hand, and sometimes the angle at which the photographs are taken does not allow for visibility of the front passenger seat. Besides, the use of seat belt detection relies on the compliance of the passengers, which cannot always be assumed. In addition, the live feed refreshes at roughly one frame per minute, which is usually more than enough time for any given vehicle in the frame to move out of the image before the next frame. This makes it difficult to explore temporal techniques to optimize detection or to choose the best possible frame for running the algorithm. Moreover, the nature of the data set creation also prevents knowledge of a totally accurate ground truth for the images, and mismatches can sometimes occur.
However, this feasibility study proposes an augmentation framework for passenger detection in moving vehicles, which means that further detection features can be added for improvement, and thus further investigation is required. Even with these limitations, the results of this study are promising: the algorithm was able to achieve good performance in detecting empty seats and windshields of the vehicles, using this real-world data set after simple preprocessing techniques. The approach was able to detect empty seats, but not many passengers. This could be due to the imbalance of the data set, in which there are many more examples of empty passenger seats than occupied ones. This skewness could reflect real-world conditions, since the majority of trips with personal vehicles are made with only the driver and without extra passengers [8]. This is possibly a secondary outcome of the need for High-Occupancy Vehicle and High-Occupancy Toll lane enforcement, as these car-pooling lanes would benefit even more from automatic enforcement of the regulations.
Finally, in the future this approach could be enhanced with the use of generative adversarial networks (GAN) to improve visibility through the windshield and side windows. Previous scholars have investigated this to improve images that are difficult to interpret, such as images with more glare, noise, and reflections in the windshield, with exciting results [22]. This could have a tremendous positive impact in the field, making the use of expensive camera sensors obsolete and improving the performance of automatic passenger detection.
Acknowledgments. This work is funded by Deep Learning Based Intrusion Detection Approaches for Advanced Traffic Management Systems, Data Science Institute, University of Houston, U.S.A.
References
1. Alizadeh, M., Abedi, H., Shaker, G.: Low-cost low-power in-vehicle occupant detection with mm-wave FMCW radar. In: 2019 IEEE SENSORS, pp. 1–4, October 2019
2. Alves, J.F.: High occupancy vehicle (HOV) lane enforcement. US Patent 7,786,897, 31 August 2010
3. Anonymous. eCall in all new cars from April 2018, 2015. https://ec.europa.eu/digital-single-market/en/news/ecall-all-new-cars-april-2018. Accessed 21 May 2020
4. Anonymous. eCall EENA Operations Document. http://www.eena.org/ressource/static/files/2012_04_04_3_1_5_ecall_v1.6.pdf. Accessed 21 May 2020
5. Artan, Y., Paul, P., Perronnin, F., Burry, A.: Comparison of face detection and image classification for detecting front seat passengers in vehicles. In: IEEE Winter Conference on Applications of Computer Vision, pp. 1006–1012, March 2014
6. Artan, Y., Bulan, O., Loce, R.P., Paul, P.: Passenger compartment violation detection in HOV/HOT lanes. IEEE Trans. Intell. Transp. Syst. 17(2), 395–405 (2015)
7. Artan, Y., Paul, P.: Occupancy detection in vehicles using fisher vector image representation. arXiv preprint arXiv:1312.6024 (2013)
8. Benita, F.: Carpool to work: determinants at the county-level in the united states. J. Transp. Geogr. 87, 102791 (2020)
9. Bonyár, A., Géczy, A., Harsanyi, G., Hanák, P.: Passenger detection and counting inside vehicles for ecall-a review on current possibilities. In: 2018 IEEE 24th International Symposium for Design and Technology in Electronic Packaging (SIITME), pp. 221–225, October 2018
10. Chen, W., Chen, R., Li, J., Wu, P.: Compact X-band FMCW sensor module for fast and accurate vehicle occupancy detection. In: 2014 International Symposium on Antennas and Propagation Conference Proceedings, pp. 155–156, December 2014
11. Cheney, J., Klein, B., Jain, A.K., Klare, B.F.: Unconstrained face detection: state of the art baseline and challenges. In: Proceedings of 2015 International Conference on Biometrics, ICB 2015 (2015)
12. Csurka, G., Perronnin, F.: Fisher vectors: beyond bag-of-visual-words image representations. In: Communications in Computer and Information Science (2011)
13. Da Cruz, S.D., Wasenmüller, O., Beise, H., Stifter, T., Stricker, D.: Sviro: synthetic vehicle interior rear seat occupancy dataset and benchmark. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 962–971, March 2020
14. Daley, W., et al.: Sensing System Development for HOV (High Occupancy Vehicle) Lane Monitoring Draft Final Report. Technical report, February 2011
15. Everingham, M., et al.: The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 303–338 (2010)
16. Fan, Z., Islam, A.S., Paul, P., Xu, B., Mestha, L.K.: Front seat vehicle occupancy detection via seat pattern recognition. US Patent 8,611,608, 17 December 2013
17. Farmer, M.E., Jain, A.K.: Occupant classification system for automotive airbag suppression. In: Proceedings of 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, p. 1. IEEE (2003)
18. Fossum, E.R.: CMOS image sensors: electronic camera-on-a-chip. IEEE Trans. Electron Dev. 44(10), 1689–1698 (1997)
19. Haq, E.U., Huarong, X., Xuhui, C., Wanqing, Z., Jianping, F., Abid, F.: A fast hybrid computer vision technique for real-time embedded bus passenger flow calculation through camera. Multimedia Tools Appl. 79(1), 1007–1036 (2020)
20. Hoffmann, M., Tatarinov, D., Landwehr, J., Diewald, A.R.: A four-channel radar system for rear seat occupancy detection in the 24 GHZ ISM band. In: 2018 11th German Microwave Conference (GeMiC), pp. 95–98, March 2018
21. Kumar, A., et al.: VPDS: an AI-based automated vehicle occupancy and violation detection system. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9498–9503 (2019)
22. Ma, D., Bai, Y., Wan, R., Wang, C., Shi, B., Duan, L.-Y.: See through the windshield from surveillance camera. dl.acm.org, pp. 1481–1489, October 2019
23. Miyamoto, S.: Passenger in vehicle counting method of HOV/HOT system. In: 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1536–1541, August 2018
24. Nowruzi, F.E., El Ahmar, W.A., Laganiere, R., Ghods, A.H.: In-vehicle occupancy detection with convolutional networks on thermal images. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 941–948, June 2019
25. The City of Austin. Home | AustinTexas.gov. https://austintexas.gov/
26. Pérez-Jiménez, A.J., Guardiola, J.L., Pérez-Cortés, J.C.: High occupancy vehicle detection. In: da Vitoria Lobo, N., et al. (eds.) SSPR/SPR 2008. LNCS, vol. 5342, pp. 782–789. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-89689-0_82
27. Pizer, S.M., et al.: Adaptive histogram equalization and its variations. Comput. Vis. Graph. Image Process. 39(3), 355–368 (1987)
28. Redmon, J., Farhadi, A.: YOLOv3: An Incremental Improvement. arXiv, April 2018
29. Sánchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the fisher vector: theory and practice. Int. J. Comput. Vis. 105(3), 222–245 (2013)
30. Schijns, S., Mathews, P.: A breakthrough in automated vehicle occupancy monitoring systems for HOV/HOT facilities. In: 12th HOV Systems Conference, vol. 1 (2005)
31. Silva, B., Martins, P., Batista, J.: Vehicle occupancy detection for HOV/HOT lanes enforcement. In: 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 311–318, October 2019
32. Smith, B.L., Yook, D., et al.: Investigation of enforcement techniques and technologies to support high-occupancy vehicle and high-occupancy toll operations. Technical report, Virginia Transportation Research Council (2009)
33. Xu, B., Bulan, O., Kumar, J., Wshah, S., Kozitsky, V., Paul, P.: Comparison of early and late information fusion for multi-camera HOV lane enforcement. In: 2015 IEEE 18th International Conference on Intelligent Transportation Systems, pp. 913–918, September 2015
34. Xu, B., Paul, P., Artan, Y., Perronnin, F.: A machine learning approach to vehicle occupancy detection. In: 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), pp. 1232–1237. IEEE (2014)
35. Xu, B., Paul, P., Perronnin, F.: Vehicle occupancy detection using passenger to driver feature distance. US Patent 9,760,783, 12 September 2017
36. Yang, B., Cao, J., Liu, X., Wang, N., Lv, J.: Edge computing-based real-time passenger counting using a compact convolutional neural network. Neural Comput. Appl. 32, 1–13 (2018)
37. Zhang, S., Wu, Y., Men, C., Li, X.: Tiny yolo optimization oriented bus passenger object detection. Chin. J. Electron. 29(1), 132–138 (2020)
Towards Context-Awareness for Enhanced Safety of Autonomous Vehicles
Nikita Bhardwaj Haupt and Peter Liggesmeyer
Technische Universität Kaiserslautern, 67663 Kaiserslautern, Germany
{haupt,liggesmeyer}@cs.uni-kl.de
https://seda.cs.uni-kl.de/
Abstract. Autonomous vehicles operate in dynamic environments, continuously encountering safety-critical scenarios. This necessitates employing methodologies that can handle these scenarios and ensure the safety of the vehicle as well as other traffic participants. Besides, random failures or malfunctions in a vehicle's components might result in hazardous situation(s), further raising concerns regarding safety. The intensity of the hazards caused by these malfunctions depends upon the current state of the operational context in which they occur. Thus, to guarantee safe behavior of the vehicle, one must be aware of its operational context in the first place. To this end, we propose to systematically model the operational context of an autonomous vehicle with respect to its safety-relevant aspects. This paper puts forth our initial work on context-awareness-aided safety, including our perspective on context and its modeling, and its categorization based on relevance and goal. We also propose a context meta-model and its fundamental elements, which are crucial for developing a safety-relevant context model.
Keywords: Autonomous vehicles · Context-awareness · Safety-relevant context · Autonomous systems
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 549–563, 2022. https://doi.org/10.1007/978-3-030-82196-8_40
1 Introduction
Autonomous vehicles (AVs) operate in dynamic environments, continuously encountering safety-critical scenarios. Since they navigate with little or no human assistance, ensuring safety for themselves as well as other traffic participants becomes a vital requirement. Moreover, randomly occurring errors or malfunctions in system components further raise concerns about their safe behavior. It is crucial to consider that the intensity of the hazard(s) caused by these malfunctions depends upon the current state of the operational context in which they occur. That is to say, not all malfunctions result in a catastrophic situation, like a high-speed collision, at runtime. In particular, the severity and controllability of a safety-critical malfunction are influenced by the state of the physical context in which the vehicle is currently driving. Some malfunctions can be tolerated provided the operational context is in a favorable state and the system can be brought back to a safe state via graceful degradation. On the other hand, tolerating some malfunctions merely with graceful degradation is not feasible if the system's operational context is in a critical state. In such cases, stringent safety measures, like an instant shut-down of the driver-assistance system and handing full control to the driver, must be executed. It can thus be inferred that there exists a system-context association that plays a crucial role in tolerating unsafe behavior of the AV at runtime, thereby ensuring safe operation of the vehicle.
Traditional safety assurance demands complete system information to be available at design time, from the operational behavior of the system to the environment in which it is functioning. Since AVs adapt their behavior in response to changes in their own state and/or their operational environment, this information cannot be fully determined well in advance. As a result, traditional safety methodologies, despite still being vital, are inadequate to ensure safe behavior. This brings forth two essential facets to consider for the safety assurance of an AV: first, shifting conventional design-time safety approaches to runtime safety methodologies and, second, explicitly incorporating the system-context association into these runtime methodologies. To this end, we propose modeling the operational context of an AV in terms of its safety-relevant aspects. The context model is subsequently integrated into the process of risk analysis, which is carried out to ensure safe adaptation of the system at runtime.
This paper sketches our initial work towards utilizing context and its awareness for enhanced safety of an AV. It includes our perspective on context and its modeling, its categorization based on relevance and goal, and the identification of the system-context association with regard to safety. We introduce the context meta-model and its fundamental elements, which are crucial for developing a safety-relevant context model.
The rest of the paper is structured as follows: Work related to our research is outlined in Sect. 2. Section 3 discusses the elemental concepts of context and how we categorize its elements from a safety standpoint. In Sect. 4, we present the context meta-model and its fundamental entities essential for context modeling. We also discuss how the system and its goal are two crucial entities for understanding the system-context association. We conclude the paper in Sect. 5 with a discussion of the current and planned future work on the presented approach.
2 Related Work
Depending on the application, there have been multiple efforts to define the term context [1,6–8], some of which have even treated it as synonymous with situation or environment. For our work, we consider the definition of context given by Dey [1]: “Context is any information that can be used to characterise the situation of an entity. An entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and applications themselves.” Regarding the categorization of context, several different schemes have been identified in different works. From an IoT standpoint, a comparison of context categories is presented in [8–11], emphasizing that the categorization depends upon the perspective towards context and thus varies
widely from researcher to researcher. Dey [1], for instance, categorized context into primary and secondary context types, where primary context consists of location, identity, time and activity, and secondary context refers to something that can be obtained using primary context. Perera et al. [8], on the other hand, defined a context categorization scheme which categorizes a given value of context from an operational perspective. According to them, primary context is any information obtained without using an existing context and without carrying out any kind of data fusion on the sensor data, whereas secondary context is information derived by performing computations using the primary context. The former aids in understanding difficulties in data acquisition, while the latter aids in comprehending relationships between contexts. Based on the operational categorization scheme, Henricksen [10] introduced four context categories: sensed, static, profiled and derived. Unlike others, Van Bunningen et al. [11] introduced a classification of the context classification schemes themselves: operational and conceptual. Operational classification refers to how the context was obtained, modeled and treated, and conceptual classification refers to the context on the basis of its meaning and the conceptual connections within it.
In the automotive domain, and from a safety assurance standpoint in particular, context-awareness (CAW) has been used to achieve objectives such as situation prediction, accident reduction or prevention, risk assessment and safety monitoring. In [12], Wardziński identifies situation-awareness as a vital element for the safety assurance of autonomous vehicles and introduces an ontology-based situation model to address issues of trust, uncertainty and insufficient knowledge of the operational environment. Armand et al. [13] employed ontologies to represent the environment and the entities involved in the drivable space of the vehicle. The information gathered about the context is provided to the Advanced Driving Assistance System (ADAS) for improved situation awareness. Boehmlaender et al. [14] developed a context-aware system to discern potential collisions and, in response, activate measures to prevent the accident, thus reducing the severity of the collision. Tawfeek et al. [15] proposed integrating a context identification layer into the context-aware Driver Assistance System (DAS) to improve the adaptability of the system, with the aim of enhancing the alertness of the driver. Reich et al. [16] introduced a situation-aware framework that lays the foundation for the synthesis of an assurable runtime risk monitor. The monitor considers knowledge about the environment along with risk control behavior and performs residual risk assessment. In [17], the authors propose knowledge-based scene creation in natural language for automated vehicles. Ryu et al. [18] presented an approach to design and implement a driving assistance system for autonomous vehicles using context-awareness of the environment around the vehicles. They develop an ontological model of the contextual information within the vehicle space.
After reviewing the above work in the autonomous automotive domain, it is clear that CAW has not yet been exploited to ensure safety by integrating design-time and runtime methodologies in the way presented in this paper. Besides, the system-context association with reference to safety as a non-functional goal and categorizing context accordingly has not
been considered yet. Furthermore, utilizing CAW and translating it into a set of safety rules to monitor and analyze the system state with respect to its operational context is a novel technique introduced in this paper.
3 Fundamentals of Context and Its Categorization
To gain an in-depth understanding of context and its challenges, and to establish the significance of context-awareness (CAW) for achieving system safety goals at runtime, we begin with a set of questions which encompass the fundamental aspects of CAW: What is context? Who benefits from being aware of its context? When is CAW useful? Where is this awareness exploited? Since we aim at utilizing CAW for safety, we approach these questions from a safety perspective. Besides, we introduce a few terms that facilitate answering these questions, thereby establishing the relevance and focus of CAW-aided safety.
What: We define context as the surroundings in which - or where - the AV functions or operates to accomplish its designated tasks or goals. These goals can be functional, non-functional or both. A context consists of assorted elements which participate in it in one way or another. These participants are called elements of context (EOC). An AV is itself a participant in its operational context. We call it the primary participant/primary EOC (pEOC), as the context is observed with respect to it; the remainder are referred to as secondary participants/secondary EOC (sEOC). Furthermore, based on how an EOC interacts with the context, it can be categorized as an active, partially-active or passive EOC. Each EOC provides some information about itself in relation to its present operational context, referred to as contextual information (CXI). This information conveys a great variety of knowledge in the form of possible values, states and/or potential behavior of EOCs.
Fig. 1. Context categorization and identification of safety-relevant elements
Who: With CAW-aided safety, the AV is the principal beneficiary. Detailed knowledge about the context and its elements assists in a refined analysis of the situation, which further allows an AV to take adequate safety
measures. Unlike traditional safety measures, which were based on worst-case assumptions, CAW-aided safety measures take into account other potential cases of the context to ensure safe behavior of the system. This prevents the AV from taking stringent safety actions in cases where moderate actions can be applied, thereby eliminating inessential limitations on its functionality, performance and efficiency.
When: In order to enhance the safety of an AV, we aim to exploit CAW at runtime. This can be achieved by utilizing obtainable knowledge about the current state of the context for the safety assessment of the AV. To this end, we perform a sequential conceptual [8] categorization of context, as shown in Fig. 1. Our first step is to categorize context based on its relevance to the AV. This results in two categories of context, relevant and irrelevant, and eliminates the part of the context that does not explicitly influence the operation of the AV. The relevant context still forms a huge context space comprising a diverse range of EOCs. To be able to utilize this space and alleviate the associated complexity, we employ a set of categories to classify the EOCs in this space, such as Environment, Traffic Participants (other vehicles and people), Weather and Time of the Day. Some of them are further divided into sub-categories to facilitate a refined representation and identification of EOCs.
Where: As our goal is to use CAW to aid the safe behavior of the AV, we require information about EOCs relevant to the safety of the AV as well as other traffic participants. Thus, in our subsequent step, we categorize the CXI associated with the EOCs into safety-relevant and safety-irrelevant information. It is essential to highlight that the safety-relevant context obtained in this step consists of safety-critical CXI about EOCs, i.e. a particular set of information which influences safe behavior of the AV.
Since not all available information about EOCs is essential for the safety of the AV, the non-safety-critical CXI is not taken into consideration.
Fig. 2. Safety-relevant context and its elements of an AV
Figure 2 illustrates the categories of context in an exemplary scenario of driving in a city. The blue vehicle in the figure is our AV (V1), which is capable of adaptation at runtime. With respect to V1, the figure in its entirety is its operational context. However, its relevant context is the region outlined in green. In the presented relevant context, V1 is the pEOC, whereas the cyclist (Cyc), the vehicle (V2) diagonally ahead in the adjacent lane, and the communication tower (CommT) are sEOCs. The safety-relevant context of V1 comprises safety-critical information about the sEOCs, such as the distance between V1 and Cyc (dist_V1_Cyc) and between V1 and V2 (dist_V1_V2), the speed of Cyc (spd_Cyc) and of V2 (spd_V2), and the like.
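The two-step categorization applied to this scenario can be sketched as a small data structure with a filter. This is an illustrative sketch only: the class names, the numeric values, the "helmet_colour_code" attribute and the "ParkedCar" element are hypothetical, while V1's neighbours Cyc and V2 and the four safety-relevant quantities are taken from the scenario above.

```python
from dataclasses import dataclass, field

# Hypothetical data model for the two-step categorization: first keep the
# relevant elements of context (EOC), then keep only their safety-relevant
# contextual information (CXI). All values are made up for illustration.

@dataclass
class ContextualInformation:
    name: str
    value: float
    safety_relevant: bool


@dataclass
class ElementOfContext:
    name: str
    role: str                      # "primary" (pEOC) or "secondary" (sEOC)
    relevant: bool = True          # inside the relevant context of V1?
    information: list = field(default_factory=list)


def safety_relevant_context(elements):
    """Return the safety-relevant CXI of all relevant elements of context."""
    result = {}
    for element in elements:
        if not element.relevant:
            continue               # step 1: drop the irrelevant context
        for info in element.information:
            if info.safety_relevant:
                result[info.name] = info.value  # step 2: keep safety-critical CXI
    return result


cyclist = ElementOfContext("Cyc", "secondary", information=[
    ContextualInformation("dist_V1_Cyc", 12.0, True),
    ContextualInformation("spd_Cyc", 4.5, True),
    ContextualInformation("helmet_colour_code", 3.0, False),
])
v2 = ElementOfContext("V2", "secondary", information=[
    ContextualInformation("dist_V1_V2", 20.0, True),
    ContextualInformation("spd_V2", 13.9, True),
])
parked_car = ElementOfContext("ParkedCar", "secondary", relevant=False, information=[
    ContextualInformation("dist_V1_ParkedCar", 60.0, True),
])

context = safety_relevant_context([cyclist, v2, parked_car])
print(sorted(context))
```

Only the distances and speeds of the cyclist and V2 survive both steps; information about elements outside the relevant context and non-safety-critical attributes are dropped.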
4 Context Modeling and System State-Space
4.1 The Context Meta-model
In order to design an adequate context model for CAW-aided safety, we begin with a generic meta-model. The safety-relevant context model is subsequently developed by means of this meta-model and the categorization presented in the previous section. Figure 3 depicts the proposed meta-model of context and its entities, some of which have already been introduced in this paper. The previously introduced entities, along with the new ones, are described below from the meta-model standpoint.
Fig. 3. Context meta-model and its entities
– Situation is an abstraction of the context and can be described using detailed information about the context. In essence, a situation is a context or a group of contexts, without information about its participating elements, relations, constraints and other attributes.
– Context is the operational environment of the system. It is not a stand-alone entity, i.e. it belongs to the system, and is, therefore, always defined with respect to it. It aids in evaluating and characterizing the situation.
Towards Context-Awareness for Safety of Autonomous Vehicles
– Elements of Context are participants of the context which contribute to its current state at a given point in time. These elements altogether construct the context as a whole.
– Contextual information is the knowledge used to represent the elements of context. An element can be represented in the form of its state of being, behavior, or its potential values.
– Each element of context, as well as the information representing it, is associated with a set of attributes. In the case of elements, these attributes define their relevance, their relations with other elements and their interaction with the context. For contextual information, they specify its type, its certainty, the levels at which this information is collected and the like.
– System is the AV, the entity with reference to which the context is being identified and modeled. It is an element of its own operational context.
– Lastly, goal depicts the objective of the system for which its context is being monitored and its information is being utilized. It can be a functional or a non-functional goal, a single goal or a set of related or unrelated goals. The type of information to be considered about the elements of context depends upon this very goal.
4.2 Identification and Classification of Critical Parameters
The fundamental entities of this meta-model are the system (the AV) and its goal; all the other entities are determined in accordance with them. It is this tuple that lays the foundation to understand and identify the system-context association, and aids in answering crucial questions like: Why must the system be aware of its context? What must be done with the collected knowledge about the context? How and where should this awareness be implemented? Thus, it can be inferred that there are two implicit phases to achieve CAW-aided safety: first, identifying the system and its goal, and second, modeling the context accordingly. As we have already identified our system (the AV) and its goal (safety assurance), our subsequent step in the first phase is to identify the system-context association. We know that random errors or malfunctions in system components can result in a hazardous situation, thereby affecting the safe behavior of the system. Besides, the intensity of these hazards is strongly influenced by the state of the operational context in which they occur. Therefore, to determine whether a malfunction has resulted in a hazardous situation and, if so, with what intensity, and which safety measures can be implemented to adequately avoid or tolerate the hazard(s), the current state of the operational context must be taken into consideration. To this end, we exploit the process of Hazard Analysis and Risk Assessment (HARA) at design time. Conventionally, HARA is performed in the concept phase of ISO 26262 [2] to classify the risk associated with the potential hazards. This classification is then used to determine safety measures like prevention or mitigation strategies for these hazards. Based on the ascertained system-context association, we use HARA to systematically analyze system malfunctions and
their potential consequences with respect to their operational scenario. By means of this analysis, we identify and extract critical parameters of the AV. These parameters represent attributes of the operational context which are critical for safe behavior of the vehicle.
Table 1. Critical parameters extraction using HARA of ISO 26262
ID | Service | Malfunction | Operational Mode | Hazardous Situation and Consequences | (S,E,C) | ASIL
1 | V1 vsp | Omission (missing value of vehicle speed) | ACC activated | Driving on city road with other vehicles around. V1 is unable to maintain a constant velocity and distance to the front vehicle. Sudden deceleration due to delayed speed value might result in rear-end collision with the following vehicle. Delayed response from the driver to switch to manual-mode might result in front-end collision with the vehicle ahead and/or roadside collision involving pedestrians or other traffic participants. | S2 E4 C2 | B
2 | V1 brk | Commission (unintended braking during driving) | ACC activated | Driving on city road with other vehicles around. V1 brakes abruptly. Sudden deceleration might result in rear-end collision with the following vehicle and/or roadside collision involving pedestrians or other traffic participants. | S2 E4 C3 | C
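The ASIL values in Table 1 follow from the severity (S), exposure (E) and controllability (C) classes. The following is a minimal Python sketch, assuming the commonly used additive shortcut for the ISO 26262 risk graph (classes S1 to S3, E1 to E4, C1 to C3; a class sum of 10 yields D, 9 yields C, 8 yields B, 7 yields A, and anything lower yields QM). The function name `asil` is illustrative and not taken from the paper, and the classes S0, E0 and C0 are deliberately left out of this sketch.

```python
def asil(severity: int, exposure: int, controllability: int) -> str:
    """Return the ASIL for classes S1-S3, E1-E4 and C1-C3, given as integers."""
    if not (1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3):
        raise ValueError("expected S in 1..3, E in 1..4, C in 1..3")
    # Additive shortcut (assumption of this sketch): the sum of the class
    # indices selects the integrity level; sums below 7 map to QM.
    total = severity + exposure + controllability
    return {10: "D", 9: "C", 8: "B", 7: "A"}.get(total, "QM")


row1 = asil(2, 4, 2)  # S2 E4 C2, malfunction 1 of Table 1
row2 = asil(2, 4, 3)  # S2 E4 C3, malfunction 2 of Table 1
print(row1, row2)
```

Under this assumption, both rows of Table 1 reproduce the ASIL values listed there.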
Table 1 shows HARA for the malfunctions missing vehicle speed (V1 vsp) and unintended braking (V1 brk), with their corresponding critical parameters marked in red. The identified parameters are mainly traffic participants like vehicles and pedestrians, and objects like trees, signal lights etc. These parameters are physical participants of the context, as they exist physically in the context and their influence on system safety is evident. There also exist other participants which are not physically evident, yet play a crucial role in system safety. We refer to these as non-physical critical parameters. Weather, for instance, is one such non-physical parameter; it affects contextual elements like visibility and ground interface, thereby influencing safe operation of the AV. Furthermore, critical parameters are not restricted to the operational context. We define critical parameters for the AV as well. One type of system-critical parameter is the output services of safety-critical (sub-)components whose malfunctioning or erroneous behavior might result in an unsafe situation, and the other type is related to system functionality. The latter are parameters which
have an impact on severity and/or controllability aspects of system risk. For ease of use, we classify these parameters into a set of classes, as illustrated in Fig. 4. This classification lays the foundation for categorizing elements of context during context modeling.
Fig. 4. Classification of critical parameters of system and context
Once we have identified and classified all physical and non-physical critical parameters, we determine their values. For some parameters, these are the possible values that can be assigned to them, e.g. vehicle speed (Vspeed) can range over [0,120] kph. For others, the values represent their state of being or different possibilities; for instance, Time of the Day can be either of (Day, Night), and Infrastructure, which represents the area where the AV is operating, can be roads in different areas like (Country, City, Highways, Field).
4.3 Critical Parameters as Elements of Context
Critical parameters, together with their types and possible values identified and classified at design time, form the elements of the operational context of the AV at runtime. Table 2 illustrates a set of EOCs and their attributes with respect to their participation in the context. We classify participant attributes into the following four classes: Perspective defines the standpoint from which the context is being observed, Existence refers to the physical existence of an EOC, Involvement describes how actively an EOC participates in the context, and Relevance states whether the EOC is relevant to the goal. The AV with respect to which the context is being observed has a primary perspective towards the context. Since we have only one AV, the perspective of all other elements is secondary; that is, the pEOC has the primary perspective, whereas all sEOCs have a secondary perspective towards the context. Involvement can be classified as active, partially-active or passive.
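The four participation attributes can be captured in a small data structure. The following is a minimal Python sketch under our own naming assumptions: the classes `Perspective`, `Existence`, `Involvement` and `ElementOfContext`, and the field names, are hypothetical and only mirror the attribute classes described above.

```python
from dataclasses import dataclass
from enum import Enum


class Perspective(Enum):
    PRIMARY = "Primary"
    SECONDARY = "Secondary"


class Existence(Enum):
    PHYSICAL = "Physical"
    NON_PHYSICAL = "Non-Physical"


class Involvement(Enum):
    ACTIVE = "Active"
    PARTIALLY_ACTIVE = "Partially-Active"
    PASSIVE = "Passive"


@dataclass(frozen=True)
class ElementOfContext:
    """One EOC with its participation attributes (hypothetical structure)."""
    name: str
    eoc_type: str
    perspective: Perspective
    existence: Existence
    involvement: Involvement
    safety_relevant: bool = True  # the goal considered here is safety

    @property
    def is_primary(self) -> bool:
        return self.perspective is Perspective.PRIMARY


car1 = ElementOfContext("Car1", "Vehicle", Perspective.PRIMARY,
                        Existence.PHYSICAL, Involvement.ACTIVE)
stoplight = ElementOfContext("Stoplights", "Roadside Unit", Perspective.SECONDARY,
                             Existence.PHYSICAL, Involvement.PARTIALLY_ACTIVE)
context = [car1, stoplight]

# With a single AV there is exactly one element with a primary perspective.
primary = [e.name for e in context if e.is_primary]
print(primary)
```

Enumerations keep the attribute values closed, so an EOC cannot be given an involvement level outside the three named in the text.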
Table 2. Elements of context and their participation attributes
Element of context | Type | Perspective | Existence | Involvement | Relevance
Car1 | Vehicle | Primary | Physical | Active | Safety
Car2 | Vehicle | Secondary | Physical | Active | Safety
Bicycle | Vehicle | Secondary | Physical | Active | Safety
Pedestrians | People | Secondary | Physical | Active | Safety
Construction Workers | People | Secondary | Physical | Active | Safety
Stoplights | Roadside Unit | Secondary | Physical | Partially-Active | Safety
Communication Tower | Roadside Unit | Secondary | Physical | Passive | Safety
Sunny | Weather | Secondary | Physical | Active | Safety
Foggy | Weather | Secondary | Physical | Active | Safety
Windy | Weather | Secondary | Non-Physical | Active | Safety
Cars and the bicycle, of Vehicle type, and pedestrians, of People type, are active EOCs, as they constantly move and their respective CXI changes frequently with time. Roadside units like stoplights are in a sense partially-active: they do not move physically, but the CXI they provide changes periodically. Finally, a passive EOC is one which does not directly interact with the other EOCs, but whose change of state influences their behavior; the communication tower is one such example. For instance, a tower in the vicinity is an element of the relevant context for the primary vehicle. In case of a malfunction resulting in an abrupt lane change or loss of control, the vehicle must consider a potential collision with this tower. Thus, the presence of the tower influences the hazard of the vehicle and thereby its safety measure. Lastly, as our goal of CAW is to ensure safety, all EOCs in the table are safety-relevant elements. Weather also has a significant influence on the driving condition of the vehicle: on a Sunny day, the road surface is dry and the tyres have good grip, whereas on a rainy day visibility and tyre grip reduce significantly with the amount of water on the road. Therefore, Sunny, Rainy or Windy weather are active EOCs, as they keep varying, though not frequently, with time. Table 3 represents EOCs alongside their categories, contextual information, instances, scope values and units. Vehicles as EOCs can be of different types: cars, bicycles, motorbikes, trucks, buses etc. Our AV, being the pEOC, is always represented as Vehicle1. Car1, a pEOC, belongs to the Traffic Participant category of the context, but from the AV critical parameter standpoint, contextual information like Vspeed, ACC and ABS belongs to Component parameters, whereas Platooning and Plat Nr are Functionality parameters. Plat Nr indicates the total number of vehicles in the platoon, including Car1 itself.
sEOCs like Car2 and Bicycle, also of Vehicle type, belong to the Traffic Participant category. The contextual information associated with them, such as their driving speed (V2speed, V3speed), their distance (Dist V2, Dist V3) and their driving constellation, i.e. driving diagonally ahead (Diagonal) or straight ahead (Front), is safety-relevant information for Car1 and is thus measured with respect to it. Another type of Traffic Participants is People; Pedestrians and Construction Workers belong to this class. Safety-relevant contextual information about Pedestrians and Construction Workers includes the number of pedestrians Nr. of Pedestrians, their walking speed Speed and their distance Dist V1 to Car1.
Table 3. Elements of context, their categories, instances and values
Element of Context | Category | Contextual Information | Instance | Scope | Units
Car1 | Traffic Participant | V1speed | 50 | [0,160] | kph
Car1 | Traffic Participant | ACC | On | (On,Off) | -
Car1 | Traffic Participant | ABS | On | (On,Off) | -
Car1 | Traffic Participant | Platooning | On | (On,Off) | -
Car1 | Traffic Participant | Plat Nr | 3 | (2,4) | -
Car2 | Traffic Participant | V2speed | 45 | [0,160] | kph
Car2 | Traffic Participant | Dist V1 | 40 | [0,80] | m
Car2 | Traffic Participant | Dist V3 | 30 | [0,80] | m
Car2 | Traffic Participant | Const V1 | Diagonal | (V,H,D) | -
Car2 | Traffic Participant | Const V3 | Diagonal | (V,H,D) | -
Bicycle | Traffic Participant | V3speed | 15 | [0,30] | kph
Bicycle | Traffic Participant | Dist V1 | 50 | [0,80] | m
Bicycle | Traffic Participant | Dist V2 | 20 | [0,80] | m
Bicycle | Traffic Participant | Const V1 | Front | (V,H,D) | -
Bicycle | Traffic Participant | Const V2 | Diagonal | (V,H,D) | -
Pedestrians | Traffic Participants | Speed | 4 | [0,15] | kph
Pedestrians | Traffic Participants | Dist V1 | 15 | [0,30] | m
Pedestrians | Traffic Participants | Nr. of Pedestrians | 12 | [0,15) | -
Construction Workers | Traffic Participants | Speed | 3 | [0,30] | kph
Construction Workers | Traffic Participants | Dist V1 | 10 | [0,30] | m
Construction Workers | Traffic Participants | Nr. of Workers | 4 | [0,10) | -
Lanes | Area | Lim Vspeed | 40 | [0,50] | kph
Lanes | Area | Nr.ofLanes | 2 | [1,2] | -
Streets | Area | Lim Vspeed | 120 | [0,130] | kph
Stoplights | Area | Traffic Signal | Red | (R,Y,G) | -
Stoplights | Area | Dist V1 | 20 | [0,50] | m
Rainy | Weather | Visibility | Average | (G,A,P) | -
Rainy | Weather | Ground Interface | Poor | (G,A,P) | -
Lanes, streets, stoplights, junctions and the like are sEOCs that are part of the Area category of the context. Area mainly represents the infrastructure and its elements present in the context. Contextual information related to these elements that is essential for the safe behavior of Car1 includes the number of lanes Nr.ofLanes, the speed limit on the lane Lim Vspeed, the status of the stoplight and its distance Dist V1 to Car1. Lanes and streets as elements are of type Roads, whereas Stoplights are Roadside Units. In the Weather category of context, multiple factors play a role in defining good, bad or average weather. This can be determined using information about factors like Visibility and Ground Interface, as both of them are influenced by rain, snow or fog. Instance depicts an exemplary value a CXI can take from the range of values defined by its scope. For some EOCs, the scope is an integer value range, e.g. [0,160] for V1speed; for others, it is defined by the element's possible states of being. As an example, the Scope of the Roadside Unit Stoplight is (Red,Yellow,Green), where a stoplight can be either red, yellow or green. For some EOCs, the scope can be a wider range of values; for example, the CXI Const V2 gives knowledge about the vehicle constellation in the operational context. Vehicles can be in three main configurations when observed in top view from the primary vehicle: (Vertical, Horizontal, Diagonal), where Vertical can be (Front,Rear) or both, Horizontal can be (SideLeft,SideRight) or both, and Diagonal can be (FrontRight,FrontLeft,RearRight,RearLeft) or all, depending upon the number of secondary vehicles in the context. Sometimes the Scope can depend on other factors, e.g. Visibility and Ground Interface each have a Scope of (Good, Average, Poor), but the value taken depends upon weather conditions like Rainy, Snowy or Foggy. In the example of Table 3, the day is Rainy and thus the Visibility is Average.
4.4 Safety State-Space Generation
At a given point in time, each EOC has a particular value. A set of EOCs and their corresponding values, at a given time, defines the state of the system and its operational context. Any change in the value of an EOC represents a different state. All possible values of all EOCs form a state-space for the system in a specified operational context. Since all these EOCs are safety-critical, we call this space the system safety state-space. The states of the safety state-space create a spectrum of possible system-context scenarios that can occur at runtime. Based on the occurrence of a particular scenario, the risk associated with the system for a specific malfunction can be determined. This gives the freedom to consider many more possible system-context scenarios for runtime risk assessment than just a set of worst-case scenarios. As a consequence, ensuring safe behavior of the system is not confined to a set of worst-case scenario assumptions and their corresponding intense safety measures, but extends to a wide range of potentially relevant scenarios which are not worst-case.
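The state-space described above can be read as the Cartesian product of the scopes of all EOCs. The following Python sketch illustrates this reading with a few EOCs; the coarse bucketing of the speed range and the names `scopes` and `state_space` are assumptions made for illustration, since the paper defines continuous ranges such as [0,160] kph and does not prescribe a discretisation.

```python
from itertools import product
from math import prod

# Hypothetical, coarsely discretised scopes for a few EOCs. The value sets
# for ACC, TrafficSignal and Visibility follow Table 3; the speed buckets
# are an assumption of this sketch.
scopes = {
    "V1speed_kph": (0, 40, 80, 120, 160),
    "ACC": ("On", "Off"),
    "TrafficSignal": ("Red", "Yellow", "Green"),
    "Visibility": ("Good", "Average", "Poor"),
}


def state_space(scopes):
    """Yield every state as a dict mapping each EOC name to one scope value."""
    names = list(scopes)
    for values in product(*(scopes[name] for name in names)):
        yield dict(zip(names, values))


states = list(state_space(scopes))
# The number of states is the product of the scope sizes.
size = prod(len(values) for values in scopes.values())
print(size, states[0])
```

The size grows multiplicatively with every added EOC, which is one reason a restriction to safety-relevant elements matters in practice.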
Fig. 5. System & context spaces collectively make the safety state-space
As illustrated in Fig. 5, we consider four states to represent the safety status of the system: safe, warning, hazardous and accident. The latter three are a refinement of an unsafe system state. Depending upon which state the system is currently in, a corresponding safety measure is taken at runtime. The system can only be in one of the four states at a specific time. It is in a safe state if all system critical parameters are within their expected values, i.e. there is no malfunctioning or erroneous/unintended behavior of the vehicle. The occurrence of a malfunction or any random error brings the system to an unsafe state. Detailed information about the operational context of the system, together with the malfunction, allows a refined analysis of the unsafe state, thereby determining whether the system is in the warning, hazardous, or accident state. We distinguish between the warning and hazardous states in terms of the Level of Automation (LOA) [3]. If the system is operating with a higher LOA, like level 4 or 5, a malfunction in its (sub-)component results in a state change from safe to hazardous. This is because with increasing automation of the system, human involvement and interaction decrease, and automated driving features at higher levels do not require the driver to take over. Therefore, in case of an unintended behavior with LOA 4 or 5, the system does not expect human assistance, and thus undergoes a suitable adaptation as a countermeasure to bring itself back to the safe state. In case of LOA 3 or below, an erroneous behavior results in a state change from safe to warning, where the human involved is not only informed via a warning, but is also expected to take over the system operations and decide whether a certain adaptation must be conducted or not. The accident state is where the system has met with an accident in its operational context and cannot be brought back to a safe state using any mitigation or toleration techniques.
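The distinction between the four states can be sketched as a small classification function. This Python sketch covers only the LOA-based rule stated above; the names `SafetyState` and `classify`, and the boolean inputs, are hypothetical, and the refinement by detailed contextual information that the text mentions is not modeled here.

```python
from enum import Enum


class SafetyState(Enum):
    SAFE = "safe"
    WARNING = "warning"
    HAZARDOUS = "hazardous"
    ACCIDENT = "accident"


def classify(malfunction: bool, loa: int, accident: bool = False) -> SafetyState:
    """Map a malfunction flag and the Level of Automation (0-5) to a state."""
    if not 0 <= loa <= 5:
        raise ValueError("LOA must be between 0 and 5")
    if accident:
        return SafetyState.ACCIDENT  # no recovery to safe is possible
    if not malfunction:
        return SafetyState.SAFE
    # LOA 4 or 5: no human takeover is expected, so the system must adapt.
    # LOA 3 or below: the human is warned and expected to take over.
    return SafetyState.HAZARDOUS if loa >= 4 else SafetyState.WARNING


print(classify(malfunction=True, loa=4).value)
print(classify(malfunction=True, loa=3).value)
```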
To ascertain that the vehicle stays in a safe state or is brought back to it in case of a transition, it is monitored with respect to a set of safety rules (SFR) [4,5]. SFRs represent the safety status of the system and its corresponding restorative measure. They act like safety constraints and trigger the safety measure in case of a safety violation. Generic SFRs carry no contextual information, i.e. they encompass only system safety-related information and its corresponding safety measure, e.g., IF (Vspeed = novalue ∧ ACC = ON) THEN (Switch to ManualMode). We envision refining these rules with the available knowledge about the operational context. The newly synthesized rules containing contextual information are referred to as system-contextual safety rules (SFCR), and have information about both the system and its operational context. An SFCR looks like: IF ((V1speed = novalue ∧ ACC = ON) ∧ (V2speed = 60 kph ∧ V2 dist V1 ≤ 50 m)) THEN (Reduce V1speed to 50 kph ∧ Switch to ManualMode). Monitoring the system with respect to SFCRs, carrying out runtime risk analysis and determining a corresponding safety measure are out of the scope of this paper, and thus not discussed further.
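The two example rules can be expressed as condition and action pairs evaluated against a state. The following Python sketch is only an illustration under stated assumptions: the dictionary keys, the function names and the choice to check the more specific SFCR before the generic SFR are ours, not the monitoring approach of the paper, which is explicitly out of scope there.

```python
from typing import Optional


def sfr_condition(state: dict) -> bool:
    """Generic SFR: the speed value is missing while ACC is on."""
    return state.get("V1speed") is None and state.get("ACC") == "On"


def sfcr_condition(state: dict) -> bool:
    """SFCR: the generic condition plus context about the vehicle V2."""
    return (
        sfr_condition(state)
        and state.get("V2speed") == 60
        and state.get("V2_dist_V1", float("inf")) <= 50
    )


# Assumption of this sketch: rules are tried in order, most specific first.
RULES = [
    (sfcr_condition, "Reduce V1speed to 50 kph and switch to ManualMode"),
    (sfr_condition, "Switch to ManualMode"),
]


def safety_measure(state: dict) -> Optional[str]:
    """Return the action of the first rule whose condition holds, if any."""
    for condition, action in RULES:
        if condition(state):
            return action
    return None


with_context = {"V1speed": None, "ACC": "On", "V2speed": 60, "V2_dist_V1": 40}
without_context = {"V1speed": None, "ACC": "On"}
print(safety_measure(with_context))
print(safety_measure(without_context))
```

When no contextual information is available, the sketch falls back to the generic rule, which mirrors the idea that SFCRs refine SFRs rather than replace them.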
5 Conclusion and Future Work
This paper presents our initial work on CAW-aided safety assurance for autonomous vehicles. We propose a systematic method to categorize and model the operational context of the vehicle based on the aspects relevant for safety. The gathered knowledge about the operational context, along with the system safety status, aids in analyzing system malfunction(s) in the current operational context, thereby assisting in employing safety measures specific to the system context. Based on the proposed meta-model, the context categories and the necessary domain knowledge, we are currently modeling the context using ontologies. We chose ontologies because they are a suitable means to analyse domain knowledge, they facilitate the separation of domain knowledge from operational knowledge, and they aid in inferring high-level information [17]. The ontology-based context model encompasses high-level temporal information about the context. Our subsequent step is to integrate the system safety status into the context model and create refined system-contextual rules. These rules lay the basis for a safety-oriented, rule-based runtime risk analysis which enables the vehicle to carry out a safe self-adaptation in the event of an unexpected safety-critical behavior. Thereafter, we intend to incorporate an uncertainty measure to tackle the uncertainty associated with contextual information and enhance its quality.
References
1. Abowd, G.D., Dey, A.K., Brown, P.J., Davies, N., Smith, M., Steggles, P.: Towards a better understanding of context and context-awareness. In: Gellersen, H.-W. (ed.) HUC 1999. LNCS, vol. 1707, pp. 304–307. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48157-5_29
2. ISO: ISO 26262: Road Vehicles - Functional Safety. International Organization for Standardization, Geneva, Switzerland, International Standard (2011)
3. SAE: Taxonomy and definitions for terms related to driving automation systems for on-road motor vehicles. SAE J3016, Technical report (2016)
4. Haupt, N.B., Liggesmeyer, P.: Systematic specification of a service safety monitor for autonomous vehicles. In: 5th Workshop on Critical Automotive Applications: Robustness & Safety, EDCC (2019)
5. Haupt, N.B., Liggesmeyer, P.: A runtime safety monitoring approach for adaptable autonomous systems. In: Romanovsky, A., Troubitsyna, E., Gashi, I., Schoitsch, E., Bitsch, F. (eds.) SAFECOMP 2019. LNCS, vol. 11699, pp. 166–177. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-26250-1_13
6. Bazire, M., Brézillon, P.: Understanding context before using it. In: Dey, A., Kokinov, B., Leake, D., Turner, R. (eds.) CONTEXT 2005. LNCS (LNAI), vol. 3554, pp. 29–40. Springer, Heidelberg (2005). https://doi.org/10.1007/11508373_3
7. Zimmermann, A., Lorenz, A., Oppermann, R.: An operational definition of context. In: Kokinov, B., Richardson, D.C., Roth-Berghofer, T.R., Vieu, L. (eds.) CONTEXT 2007. LNCS (LNAI), vol. 4635, pp. 558–571. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74255-5_42
8. Perera, C., Zaslavsky, A., Christen, P., Georgakopoulos, D.: Context aware computing for the internet of things: a survey. IEEE Commun. Surv. Tutor. 16(1), 414–454 (2014)
9. Schilit, B., Adams, N., Want, R.: Context-aware computing applications. In: Mobile Computing Systems and Applications, pp. 85–90 (1994)
10. Henricksen, K.: A framework for context-aware pervasive computing applications. Computer Science, School of Information Technology and Electrical Engineering, The University of Queensland, September 2003
11. van Bunningen, A., Feng, L., Apers, P.: Context for ubiquitous data management. In: Ubiquitous Data Management, pp. 17–24 (2005)
12. Wardziński, A.: The role of situation awareness in assuring safety of autonomous vehicles. In: Górski, J. (ed.) SAFECOMP 2006. LNCS, vol. 4166, pp. 205–218. Springer, Heidelberg (2006). https://doi.org/10.1007/11875567_16
13. Armand, A., Filliat, D., Ibañez-Guzman, J.: Ontology-based context awareness for driving assistance systems. In: IEEE Intelligent Vehicles Symposium Proceedings, Dearborn, MI, pp. 227–233 (2014)
14. Boehmlaender, D., Dirndorfer, T., Al-Bayatti, A.H., Brandmeier, T.: Context-aware system for pre-triggering irreversible vehicle safety actuators. Accid. Anal. Prev. 103, 72–84 (2017)
15. Tawfeek, M.H., El-Basyouny, K.: A context identification layer to the reasoning subsystem of context-aware driver assistance systems based on proximity to intersections. Transp. Res. Part C: Emerg. Technol. 117, 102703 (2020)
16. Reich, J., Trapp, M.: SINADRA: towards a framework for assurable situation-aware dynamic risk assessment of autonomous vehicles. In: 16th European Dependable Computing Conference (EDCC) (2020)
17. Bagschik, G., Menzel, T., Maurer, M.: Ontology based scene creation for the development of automated vehicles. In: IEEE Intelligent Vehicles Symposium, Proceedings 2018-June, pp. 1813–1820 (2018)
18. Ryu, M., Cha, S.-H.: Context-awareness based driving assistance system for autonomous vehicles. Int. J. Control Autom. 11(1), 153–162 (2018)
Hybrid Recurrent Traffic Flow Model (URTFM-RNN)
Elena Sofronova
Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, Moscow, Russia
sofronova [email protected]
Abstract. Consistent traffic representation and simulation are the key steps towards optimal traffic control. The paper presents a hybrid recurrent mathematical model of traffic flow control that potentially enables the representation of large-scale networks with detailed analysis of areas of specific interest. The hybrid model consists of two types of models: a newly developed universal recurrent traffic flow model for subnetworks with known parameters, and a model based on a recurrent neural network for subnetworks with partially unknown parameters. An integration framework is presented. Evolutionary computation techniques are applied to both models. A case study is presented.
Keywords: Hybrid traffic flows model · Optimal control · Urban traffic control · Recurrent model
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 564–573, 2022. https://doi.org/10.1007/978-3-030-82196-8_41
To ensure the sustainability of modern cities, there is a growing demand for improvement of the urban traffic situation. For the last several decades, the efforts of applied mathematicians, programmers and engineers have been aimed at solving the traffic flow control problem in urban road networks. Simulation of traffic flow dynamics is based on models of different types [1,2]. Macroscopic models [3,4] present traffic at a strategic level as a continuous flow, often following hydrodynamic flow theories. Mesoscopic models [5,6] simulate traffic at an aggregate level, usually by speed–density relationships and queuing theory approaches. Microscopic models consider traffic dynamics in detail, for example the interaction of vehicles, the behavior of drivers, incidents and pedestrians, which is appropriate and effective for intelligent transportation systems. This has resulted in the creation of different traffic-responsive signal control systems such as VISSIM [7], SCOOT [8,10], SCATS [11], CRONOS [12] and RHODES [13], yet it still cannot be stated that the problem has been solved. One of the main obstacles is that the mathematical models used are too complicated, incomplete or do not coincide in form with the models used in optimal control theory. Most models either do not use ordinary differential equations with a free control vector in the right-hand sides, or lack a mathematical expression for the function that explicitly describes the relationship between the value of the control vector and the quantitative characteristics of the traffic flows. Hybridization of models, or so-called multiresolution modeling, has become a promising attempt to cope with the obstacles mentioned above. In large-scale networks, a traffic signal control system should be capable of examining both macro- and micro-levels and adjusting various traffic signal control parameters in response to the varying traffic situation. For example, the mesoscopic model DYNASMART [6] was combined with the microscopic model PARAMICS [14] in an embedded structure [15], and macro- and micro-hybrid models based on the principles of the Lighthill–Whitham–Richards (LWR) model are proposed in [16,17]. Known multiagent systems and computational intelligence-based approaches develop distributed traffic-responsive signal control models [18,19]. In this paper, a microscopic universal recurrent traffic flow model (URTFM) that most closely matches the classical statement of the optimal control problem is considered. The model is based on controlled networks theory [20] and is a system of recurrent nonlinear finite-difference equations [21]. Controlled networks have a changeable configuration, which depends on the maneuvers that are permitted or forbidden by certain signals. In previous studies, the controlled networks theory was effectively used for subnetworks with full information on the parameters of flows and the road network [22–24]. The state of the system depends on the state of the traffic flows at the previous step and the current configuration of the network. If the network is large, the parameters of some subnetworks and their models may be unknown. To solve this problem, it is proposed to use a generalized hybrid mathematical model of traffic control in the city road network.
When applying hybridization, the focus is on which types of models are combined, the way in which the models are combined and how the interfaces at the boundaries are dealt with. This paper continues the previous study [25]. The objective of the paper is to present a methodology for integrating URTFM and recurrent neural networks (RNNs) into a single hybrid model that uses the benefits of both approaches, such as high-fidelity simulation in areas of potential impact or particular interest and accounting for the incompleteness of information on some neighbouring subnetworks. Overall, this makes it possible to represent large-scale networks and to significantly reduce the data collection, model calibration and computational requirements. The rest of the paper is organized as follows. The methodology of the developed hybrid recurrent traffic flow model (HRTFM) is given in Sect. 2. The optimal control problem statement is presented in Sect. 3. Section 4 contains evolutionary methods for the search of an optimal admissible control program and the training of the RNN. A case study is described in Sect. 5.
2 Hybrid Traffic Flow Model Methodology
Figure 1 shows the proposed simplified integration architecture. The model of the most important subnetwork, with known parameters of network configuration and signal groups, capacity of maneuvers, distribution of flows, etc., is represented by the previously developed URTFM based on the controlled networks theory [22,25]. URTFM is a new microscopic model that was developed in the context of fundamental traffic control theory. Traffic flows are controlled by traffic lights at intersections. The control is a state-responsive control program that includes all intersections and determines the control steps at which it is necessary to change the signal of a certain traffic light.
Fig. 1. Integration architecture
URTFM is a discrete recurrent model. Control time is divided into control steps. The state vector at each control step depends on the state at the previous step. Being a graph model, URTFM can describe all network configurations and is extendable through linking matrices [27]. Known parameters of the subnetworks are transferred to URTFM. Subnetworks with unknown parameters are included in the control loop of the system under study. Since there is little information on their configuration and other parameters, these subnetworks are approximated by recurrent neural networks. The hybrid model can be considered a supply-and-demand model: flows from one model influence the other and vice versa. Both models communicate through boundary sections. It is necessary to exchange the following information: positioning of the boundaries (i.e. indices of entrance and exit nodes), constraints on boundary sections, flows passing between the models and their characteristics, the number of control steps to be simulated, etc. This is possible since both models have a vehicle-based representation of traffic flows and an open architecture.
Hybrid Recurrent Traffic Flow Model (URTFM-RNN)
A human expert is included in the system: the expert defines the control strategy and sets a basic solution for optimization. Evolutionary algorithms are used for both models: they search for the weights in the RNN and for the control program in URTFM. Using the proposed hybrid model, the human expert can simulate existing control programs or optimize them, and if the solution found meets all the requirements, it can be implemented in the real network.
3
Statement of Optimal Control Problem
The control object is described by

$\dot{x} = f(x, u, \varphi(t)),$ (1)

where $\varphi(t) = \psi(t) + \gamma(t)$, $\psi(t)$ is a vector of incoming flows from subnetworks with URTF models, and $\gamma(t)$ is a vector of incoming flows from subnetworks approximated by RNNs. The initial vector of traffic flow is

$x^0 = x(0), \quad x = [x_1^T \; x_2^T \; x_3^T]^T,$ (2)

$\dim x_1 + \dim x_2 + \dim x_3 = L,$

where $x_1$ is a state subvector that describes quantitative characteristics of flows at entrance sections, $x_2$ is a state subvector that describes quantitative characteristics of flows at internal sections, $x_3$ is a state subvector that describes quantitative characteristics of flows at exit sections, and $L$ is the number of sections in URTFM. The state vector is constrained by

$x^+ = [x_1^+ \ldots x_L^+]^T.$ (3)

The quality criterion is

$J = \sum_{k=1}^{K} f_0(x(k), u(k)) \to \min,$ (4)

where $K$ is the number of control steps. The solution is a control program $\tilde{u}(\cdot)$ that minimizes the quality criterion (4),

$\tilde{u}(\cdot) = (\tilde{u}(0), \ldots, \tilde{u}(K)),$ (5)

$\tilde{u}(k) = [\tilde{u}_1(k) \ldots \tilde{u}_M(k)]^T,$ (6)

where $\tilde{u}_i(k) \in \{0, 1\}$, $i = \overline{1, M}$, and $M$ is the number of regulated intersections. The order of the working phases of the traffic lights is fixed, and the control program switches the phases. The elements of the control program are ones and zeros. The ones switch the current phase to the next one in the specified order,
and the zeros leave the phase unchanged. When the maximum phase number $u_i^+$ is reached, the phase returns to the initial one:

$\tilde{u}_i(k) = \begin{cases} (\tilde{u}_i(k-1) + 1) \bmod u_i^+, & \text{if } \tilde{u}_i(k-1) = 1, \\ \tilde{u}_i(k-1), & \text{otherwise.} \end{cases}$ (7)

The duration of traffic signals is constrained. The control program is searched for by a modification of the genetic algorithm, the variational genetic algorithm [24].
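The phase-switching rule can be illustrated with a minimal sketch. It assumes a reading of Eq. (7) in which a separate 0-based phase index is stored for each intersection and a control value of 1 advances that index cyclically. The names `step_phases` and `run_program` are hypothetical and are not taken from URTFM.

```python
def step_phases(phases, switches, max_phases):
    """Advance every intersection by one control step.

    phases     -- current phase index per intersection (0-based, an assumption)
    switches   -- control vector for this step, each element 0 or 1
    max_phases -- number of phases u_i^+ per intersection
    """
    return [
        (p + 1) % m if s == 1 else p   # 1 switches to the next phase, 0 keeps it
        for p, s, m in zip(phases, switches, max_phases)
    ]


def run_program(initial_phases, program, max_phases):
    """Apply a whole control program (one 0/1 vector per control step)."""
    history = [list(initial_phases)]
    for switches in program:
        history.append(step_phases(history[-1], switches, max_phases))
    return history
```

For two intersections with 3 and 2 phases, `run_program([0, 1], [[1, 0], [1, 0], [1, 1]], [3, 2])` wraps both phase indices back to 0 on the last step.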
4
Evolutionary Algorithms
To train the RNN and to solve the optimal control problem, a modification of the genetic algorithm (GA) called the variational genetic algorithm (VarGA) is used. VarGA was developed in 2014 from the classic genetic algorithm to reduce the amount of computation. The main genetic operations resemble those of the GA. Firstly, a population of possible solutions is generated. Then each solution is evaluated by the quality criterion (4), and the best solution is found. The genetic operations, crossover and mutation, are performed on ordered sets of vectors of small variations, taking into account the evaluation of each solution and the probabilities of the operations. New solutions are also evaluated, and if they turn out to be better than some existing solutions, they replace them. To vary the search, the basic solution may be replaced after several generations by the best solution found so far. The algorithm runs for a given number of iterations, called generations. After all generations, the best solution found is taken as the solution of the problem.

The principle of small variations [26] is a powerful tool for obtaining new solutions in the neighbourhood of an admissible one. To use it in practical applications, the researcher needs to specify what is considered a small variation for the particular problem.

Representation of Solutions for URTFM. To apply VarGA to the optimization problem, a human expert sets one basic control program, called the basic solution, and a set of small variations of the basic solution. A vector of small variations contains three elements: the index of the intersection, the index of the control step, and a new value of the control program element. A vector of small variations changes the basic solution and thus the control program. The basic solution, as a control program for each control step, is

$\tilde{u}^0(\cdot) = (\tilde{u}^0(0), \ldots, \tilde{u}^0(K)).$ (8)
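How a vector of small variations changes the basic control program can be shown with a short sketch. It assumes the program is stored as a list of control steps, each holding one 0/1 value per intersection. This storage layout and the function names are illustrative assumptions, not the author's implementation.

```python
import copy


def apply_variation(program, w):
    """Apply one small-variation vector w = (intersection, step, new_value) in place."""
    intersection, step, value = w
    program[step][intersection] = value


def apply_variations(basic_program, variations):
    """Return a new control program: the basic solution changed by an
    ordered set of variation vectors, applied in the given order."""
    program = copy.deepcopy(basic_program)   # the basic solution itself is kept intact
    for w in variations:
        apply_variation(program, w)
    return program
```

Because the set is ordered, a later vector overrides an earlier one that targets the same element, so a candidate is fully described by the basic solution and its list of variations.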
The initial population of possible solutions consists of the basic solution (8) and a set of ordered sets of vectors of small variations

$W = (W_1, \ldots, W_H),$ (9)

where $W_i$ is an ordered set of vectors of small variations,

$W_i = (w^{i,1}, \ldots, w^{i,d}),$ (10)

$w^{i,j} = [w_1^{i,j} \; w_2^{i,j} \; w_3^{i,j}]^T$, $i = \overline{1, H}$, $j = \overline{1, d}$, $H$ is the number of possible solutions in the initial population, and $d$ is the depth of variation, i.e. the maximal number of variations of the basic solution.

Representation of Solutions for RNNs. To apply the variational genetic algorithm to RNN training, the vector of small variations was modified. The vector of small variations contains six components:
0 - a pointer to vary the matrix or the bias,
1 - a layer number,
2 - a row number in the matrix or in the vector,
3 - a column number in the matrix,
4 - the number of the bit in the Gray code of the selected element,
5 - a new value of the bit.
To obtain new solutions, vectors of small variations are applied to the basic solution. The basic solution is a randomly generated Elman network with a hidden layer. VarGA does not require the generation of new neural networks: each new neural network is described by an ordered set of vectors of variations of the basic solution. After the variations are applied, a correct new neural network is obtained.
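The six-component variation for RNNs can be sketched as follows. The sketch assumes each weight or bias is stored as a non-negative integer code (a fixed-point encoding, which the text does not specify), that bit 0 is the least significant bit, and that pointer value 0 selects the weight matrix and 1 the bias vector. All names are hypothetical.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def from_gray(g):
    """Inverse of to_gray."""
    n = g
    shift = g >> 1
    while shift:
        n ^= shift
        shift >>= 1
    return n


def set_gray_bit(code, bit, value):
    """Set one bit in the Gray-coded form of `code`, then decode the result."""
    g = to_gray(code)
    if value:
        g |= 1 << bit
    else:
        g &= ~(1 << bit)
    return from_gray(g)


def apply_rnn_variation(net, w):
    """Apply w = (target, layer, row, col, bit, value) in place.

    net is a list of layers, each a dict with integer-coded "W" (matrix)
    and "b" (vector); this layout is an assumption for illustration.
    """
    target, layer, row, col, bit, value = w
    if target == 0:
        net[layer]["W"][row][col] = set_gray_bit(net[layer]["W"][row][col], bit, value)
    else:
        net[layer]["b"][row] = set_gray_bit(net[layer]["b"][row], bit, value)
```

Working in the Gray code means that adjacent integer codes differ in exactly one bit, which is one common motivation for this encoding in genetic algorithms.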
5
Case Study
The behaviour of the hybrid URTFM-RNN model was studied for a test road subnetwork with 4 intersections and 24 sections represented by URTFM (see Table 1 for parameters), and a neighbouring subnetwork with partially known parameters represented by an Elman RNN with one input, 1 hidden, 3 output layers, and ReLU activation function. In URTFM, the entrance road sections are sections 1–8, the internal sections are 9–16, and the exit sections are 17–24. Interaction with the RNN is performed via sections 4, 5, 6, 20, 21 and 22. The incoming flows on the other entrance sections are increased by a given increment at each control step. Firstly, the partially known subnetwork was represented as an Elman RNN. The training sample consisted of observations of the flow state on sections at 200 control steps. URTFM was used to obtain the training sample: random flow states were forwarded to the entrance sections of URTFM and the resulting flows at exit sections were recorded. Since both models are vehicle-based, no special pre- or postprocessing of data was needed. In real subnetworks, faulty or incomplete data should be taken into consideration. The RNN was trained by VarGA, which searched for parameter matrices and a bias vector. Parameters of VarGA for RNN training: number of possible solutions, H = 512; number of generations, G = 50; number of pairs for crossover in 1 generation, R = 128; depth of variations, d = 8; probability of mutation, pmut = 0.75. The obtained accuracy was less than 15%. Then flows from the RNN were moved to the boundary sections of URTFM, and the optimization problem was solved for the whole hybrid model with the following parameters:
initial state of the URTFM $x^0 = (30, 25, 28, 29, 22, 32, 26, 29, 6, 9, 8, 7, 9, 8, 7, 5, 0, 0, 0, 0, 0, 0, 0, 0)$; constraints on road sections $x^+ = (100, 100, 100, 150, 120, 150, 220, 100, 20, 20, 20, 30, 40, 20, 20, 25, 8000, 8000, 8000, 8000, 8000, 8000, 8000, 8000)$; increments to the entrance road sections $\Delta = (20, 20, 20, x_4^{RNN}, x_5^{RNN}, x_6^{RNN}, 16, 22)$. In the computational experiment, a single quality criterion that takes into account the total throughput of the network and penalties for overflow on internal sections was used:

$J = \alpha \sum_{k=0}^{K} \sum_{i \notin I_0 \cup I_1} \vartheta(x_i(k) - x_i^+)(x_i(k) - x_i^+) - \sum_{i \in I_1} x_i(K) \to \min,$ (11)

where $x(k)$ is the state vector, $x(k) = [x_1(k) \ldots x_L(k)]^T$, $x_i(k)$ is the number of average-sized vehicles on section $i$ at control step $k$, $x_i(k) < x_i^+$, $x_i^+$ is a constraint on the flow on section $i$, $x_i(k) \in R^1$, $i = \overline{1, L}$, $L$ is the number of sections in the subnetwork, $k = \overline{1, K}$, $K$ is the duration of the control process in control steps, $I_0$ is the set of elements of the state vector that correspond to incoming sections, $I_1$ is the set of elements of the state vector that correspond to exit sections, $\alpha$ is a weight coefficient, and

$\vartheta(a) = \begin{cases} 1, & \text{if } a > 0, \\ 0, & \text{otherwise.} \end{cases}$

Parameters of VarGA for the search for an optimal admissible control program: number of possible solutions, H = 1024; number of generations, G = 50; number of pairs for crossover in 1 generation, R = 128; depth of variations, d = 12; probability of mutation, pmut = 0.75. Control at intersections is carried out by sequentially switching the phases. The maximum number of control modes at each intersection is 6. It is supposed that all traffic lights are synchronized. Each control mode is characterized by permitted maneuvers. The control modes change in the established order, and the durations of the control modes are the control actions. A series of 20 experiments has been conducted. An average value of quality criterion was Javg = −10322 with standard deviation SD

80% of time) were the same stations in the most unused stations list. We can assume that they are unpopular for bike returns, but maybe they would be popular for rentals. Regrettably, we have too little information on these stations to reliably calculate the relative frequency. The information received from these stations is not intended for use in training the demand prediction model, but predictions will be made for them based on their similarity to other stations with available statistics. Based on the estimated forecasts and subsequent statistics obtained after rebalancing, the profitability of these stations can be determined.
Figure 2 shows the percentage of observed time each station was empty or overloaded.
Fig. 2. Graph by station showing percentage of time station is empty (orange) or overloaded (purple).
To calculate historical demand, 4-h relocation windows were chosen, which inevitably split the day into 6 parts. After calculating target values for historical data (bike demand and rental demand), investigations can be conducted on the divergence between the average demand for different relocation windows on weekends and working days, as well as the impact of other factors. Figure 3 displays the average difference in demand on bicycles and racks for relocation window 1 (7 am – 11 am) and window 3 (3 pm – 7 pm) on workdays. The radius depends on the size of the difference. When
Optimizing the Belfast Bike Sharing Scheme
purple, the value is positive (more returns) while when orange, the value is negative (more rentals). Some stations were self-rebalanced or their demands were equal to zero. This demonstrates contrasting demand trends during morning and evening rush hours and confirms that station demand differs by time and location.
Fig. 3. Figure showing the average difference in demand on bicycles and racks, (a, b) – workdays, windows 1, 3.
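The split of the day into six 4-h relocation windows can be sketched in a few lines. The text only states that window 1 is 7 am to 11 am and window 3 is 3 pm to 7 pm. The sketch assumes that the numbering is 0-based and that window 0 therefore starts at 3 am, which is an inference rather than something the paper states.

```python
WINDOW_HOURS = 4          # length of one relocation window, from the text
FIRST_WINDOW_START = 3    # assumed start hour of window 0 (inferred, not stated)


def relocation_window(hour):
    """Map an hour of day (0-23) to its relocation window index (0-5)."""
    return ((hour - FIRST_WINDOW_START) % 24) // WINDOW_HOURS


def window_bounds(index):
    """Return (start_hour, end_hour) of a relocation window on a 24-h clock."""
    start = (FIRST_WINDOW_START + index * WINDOW_HOURS) % 24
    return start, (start + WINDOW_HOURS) % 24
```

Under these assumptions, a rental at 8 am falls into window 1 and a return at 6 pm into window 3, matching the two rush-hour windows compared in Fig. 3.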
4.2 Demand Predictions
As previously outlined, demand predictions were made for bikes and racks. To train the bike demand prediction model, only data records where the station was available (at least one bike present) for at least 80% of the relocation window were used, which left 6,442 examples. The filtered data were randomly divided into training and testing parts in an 80:20 ratio. Since the stations were always available for bike returns, 11,656 examples were available to train the rack demand prediction models; these data were also randomly split into training and testing parts in an 80:20 ratio. We use cross-validation to compare models. During cross-validation, the training data are split into n folds (we set n = 5); in each iteration, one fold is used as validation (unseen) data and the model is trained on the remaining n − 1 (4) folds, with the validation fold changing on each iteration. This lets us use all the data to validate the model and to find the average RMSE and R2 values, which gives a more accurate picture of the models' performance. Then, we use the whole training dataset to train the selected model. Table 2 shows the results of cross-validation performed on the training datasets for both bike and rack demand and the final results for the test dataset. The XGBoost model performs best for the racks demand prediction as well as on the training set for the bikes demand prediction. Despite the fact that the Catboost model has the best performance on the test set for bikes demand prediction, the difference in R2 scores
N. Demidova et al.
is negligible when compared to the XGBoost model. Therefore, for consistency's sake we select the XGBoost model for both bikes and racks demand prediction.

Table 2. Bikes and racks demand prediction

                   Training data, 5-fold cross-validation   Test data
Model              RMSE mean (std)    R2 mean (std)         RMSE      R2
Bikes demand prediction
  Catboost         1.1095 (0.128)     0.4877 (0.125)        0.9835    0.6427
  Random forest    1.1054 (0.125)     0.4920 (0.121)        1.0090    0.6240
  XGBoost          1.0956 (0.126)     0.5008 (0.117)        1.0190    0.6165
Racks demand prediction
  Catboost         1.1380 (0.058)     0.5335 (0.036)        1.0712    0.5457
  Random forest    1.1089 (0.056)     0.5573 (0.029)        1.0095    0.5966
  XGBoost          1.1073 (0.056)     0.5588 (0.026)        1.0076    0.5981
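The 5-fold cross-validation procedure used to compare the models can be sketched in plain Python. A one-variable least-squares line stands in for the CatBoost, random forest and XGBoost regressors, and the fold assignment by shuffled striding is an assumption, since the paper does not describe how the folds were drawn. The function names are illustrative.

```python
import math
import random


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, validation_indices) for each fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i, val in enumerate(folds):
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, val


def fit_line(xs, ys):
    """Stand-in model (not a boosted tree): least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b


def cross_validate(xs, ys, n_folds=5, seed=0):
    """Return mean RMSE and mean R2 over the validation folds."""
    scores = []
    for train, val in k_fold_indices(len(xs), n_folds, seed):
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        truth = [ys[i] for i in val]
        pred = [model(xs[i]) for i in val]
        scores.append((rmse(truth, pred), r2(truth, pred)))
    return (sum(s[0] for s in scores) / n_folds,
            sum(s[1] for s in scores) / n_folds)
```

Every sample is used for validation exactly once, which is why the averaged scores are a steadier basis for model comparison than a single hold-out split.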
We can observe a slightly better result on the test dataset for all models, because we use all 5 folds of the training data to train the model before the final check on the test data. The top ten features that were identified in making the best bike demand predictions with the XGBoost model are listed below:
1. next to a waterfront
2. length of cycle paths (all types) nearby
3. morning rush hours
4. evening rush hours
5. lunch time
6. distance to the closest train station
7. office area
8. relocation window
9. weekend
10. length of bus lane (with cycle provision) nearby

Similarly, the top ten features that were identified in making the best rack demand predictions with the model are listed below:
1. morning rush hours
2. evening rush hours
3. next to a waterfront
4. length of shared use footways nearby
5. office area
6. length of bus lane (with cycle provision) nearby
7. distance to the closest train station
8. number of train stations nearby
9. number of cycle infrastructure routes nearby
10. public space
The most important features for the XGBoost model show how important location-based characteristics of the stations, such as closeness to a waterfront, train station or cycle infrastructure, are for bikes and racks demand prediction. In combination with the time-based and weather features, they help us to predict customers' activity.

4.3 New Sites
Following this investigation and model development, two locations were recommended for consideration for installing new stations. These are:
• An opposite entrance to the Ormeau Park (54.590278, −5.911046)
• Entrance to the Botanic garden from the river side (54.581055, −5.927895)
The suggested locations are situated at the entrances to large parks and will fill "empty" places. They are not far from, but also not too close to, other stations and can relieve the burden on overloaded stations. Furthermore, they are also on the path of the most popular routes. The model was trained on all available stations without using the station name as a feature, and predictions of bikes and racks demand were made for the new sites using the same weather and time conditions.
Fig. 4. Figure showing the average demand for the entire period (for all relocation windows). For existing stations using the historical values and for new sites using the predictions.
Figure 4 shows the average demand for bicycles/racks for the entire period (for all relocation windows). For existing stations the historical values were used while for new
sites, predictions were made. It is evident from this image that the demand in the area is shared by the new proposed station locations and demand across the stations is more evenly distributed.
5 Conclusion
Following the study and application of the proposed rebalancing approach on the data obtained from the Belfast system, it is clear that there is a need for periodic reallocation of bicycles in the system. The validity of the proposed approach to determine the demand for bicycles and free racks has been proven, but it may be possible to improve the quality of investigation by increasing the amount of data used for training. Further research should determine the target number of bikes needed at each station, based on predicted demand, station capacity, and the number of bikes in the system. It is also important to find a suitable approach for rebalancing when there is a change in demand. Further research on routes and travel time can also offer interesting insights into the use of the system and suggest new locations for stations. In fact, the search for new stations should critically examine changes in user flows when new stations are established as well as the redistribution of activity at neighbouring stations (and the potential for growth due to new users).
As the study covered a fairly short period of time, increasing the amount of data for training the model and considering seasonal features and events held in the city will enrich the results. Enhanced knowledge about the features of each station will improve the accuracy of forecasts for rebalancing operations and demand prediction for new sites. The new locations proposed on the basis of the data show a good predicted demand for bicycles and racks, yet it is difficult to verify the accuracy of the assumption. It is also important to further explore questions on the expansion of the system and how the addition of new stations changes user demand.

Acknowledgments. The authors wish to acknowledge Allstate for proposing the challenge and for their valuable feedback throughout the project and Queen's University Belfast for funding.
Vehicle-to-Grid Based Microgrid Modeling and Control Samith Chowdhury, Hessam Keshtkar(B) , and Farideh Doost Mohammadi Christopher Newport University, Newport News, VA 23606, USA {Samith.chowdhury.18,Hessam.keshtkar, Farideh.d.mohammadi}@cnu.edu
Abstract. A vehicle-to-grid based microgrid system consisting of multiple renewable energy sources and distributed generation units is modeled and tested in this study. Realistic EV penetration and PV solar contributions are modeled in the microgrid to improve the validity of the responses. An optimal control method is utilized to minimize the frequency deviations resulting from the generation and load changes during 24-h simulations. The analysis results show the effect of stochastic signals on the frequency and power flow responses of the system and demonstrate improved system stability when the proposed secondary control is implemented.
Keywords: Microgrid control · Particle Swarm Optimization (PSO) algorithm · Vehicle-to-grid system · Renewable energy sources
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 600–608, 2022. https://doi.org/10.1007/978-3-030-82196-8_44

1 Introduction
Microgrids are compact, small-scale power systems consisting of renewable energy sources (RES) and distributed generators (DG) as generation units that can operate in either islanded or on-grid mode. Recently, the integration of electric vehicles (EVs) into the grid, known as vehicle-to-grid (V2G), has drawn a lot of attention. EVs entering the grid will have a considerable impact on several aspects of the power system. The grid has hardly any storage, so the transmission and generation of power have to be constantly monitored and managed to match the changing customer load [1]. It has been shown that at any given time 95 percent of cars are parked and not in motion; therefore, the batteries inside the EVs could have a secondary function as energy storage through V2G technology [1]. For instance, if customer load is high or if weather conditions do not allow renewable generation at the time, these batteries can compensate for the shortfall and stabilize the grid. These batteries could also serve as mobile power stations in case of a power outage or other emergency situations [2, 3]. Investigations into the use of the V2G system for different reliability and stability purposes have begun. In [4], the author researched the V2G system to increase the stability of the microgrid. It was found that even a 10% penetration of EVs can improve the stability of the grid by regulating the grid during peak hours. However, the profiles and
data simulated in the model of this study are based upon theoretical values and not real data. In [5], a plan and framework for mass V2G adoption is outlined. The EVs act as a controllable load to stabilize the demand for power during low consumption hours and as a battery device during peak demand to provide storage capacity to the grid. In [6], the authors propose a decentralized V2G system to improve the frequency control in the grid. It addresses the biased use of EV batteries by controlling their State of Charge (SOC) so that they can be synchronized. The work in [7] utilizes V2G technology for active power regulation and tracking of several renewable energy sources, offering load balancing, and enabling ancillary services such as spinning reserve. These papers have not included secondary load control concepts and have not used realistic EV penetration data. In this paper, the modeling and control of a microgrid consisting of multiple renewable energy sources is studied. The microgrid model consists of different generation units such as a Diesel Engine Generator (DEG), Photovoltaic (PV) systems, and Wind Turbine Generators (WTG), along with different loads such as industrial and residential loads as well as electric vehicles. We explore the impact of real data for Vehicle-to-Grid (V2G) systems integrated into the microgrid and attempt to utilize this system for frequency regulation purposes in the microgrid. A centralized control approach is proposed to regulate the frequency oscillations of the microgrid caused by load, electric vehicle, and generation fluctuations. We also study the impact of renewable energy sources by importing real data into our microgrid system. The remainder of this paper is organized as follows: Sect. 2 explains the modeling of the V2G based microgrid which is used as a case study. The algorithms used in the control technique, such as Particle Swarm Optimization (PSO), are explained in Sect. 3.
Section 4 analyzes the stability and performance of the controller based on simulation results. Section 5 is the conclusion of this paper.
2 V2G Based Microgrid Model
The microgrid simulated in this study operates in islanded mode and is made up of a 15 MW diesel engine generator, photovoltaic (PV) generators, wind turbine generators, load, and an EV aggregator. The 15 MW diesel generator provides load-generation balance, since the output of the renewable energy sources depends on the weather conditions of the day. The PV generators use solar panels to supply energy based on irradiance data obtained from the Pecan Street Database [8] at a minute-level sampling rate. The wind turbine generators use wind data rated at 8 MW to supply energy. The total load is comprised of a residential load of 1000 households following a consumption profile, as well as an asynchronous machine (ASM). Figure 1 shows a microgrid model containing this hybrid power generation and energy storage system. A control method was implemented and the impact of renewable energy and V2G was analyzed in this microgrid model. With this microgrid operating in islanded mode, the tie-line to the distribution system could be disconnected to isolate the grid from the utility. The total output power from the wind turbine generators (Pw), diesel
S. Chowdhury et al.
generators (Pg), photovoltaic generators (Ppv), and negative/positive power from EVs using V2G technology (Pe) gives the total power supplied to the load (Ps). This is shown below:

$P_s = P_{pv} + P_g + P_w \pm P_e$ (1)
Fig. 1. Microgrid model with vehicle-to-grid implementation
2.1 Modeling of Electric Vehicles
The simulated microgrid comprises level 2 chargers only. These chargers are commonly seen in residential and commercial properties, offering an output of 6.6 kW from a 240 V/32 A AC supply [7]. Level 1 (120 V AC supply) and level 3 chargers (800 V DC supply) were not considered because they are less common. The aggregated model of 200 charging stations therefore has a total rated power of 1320 kW. The total contribution of aggregated EVs is subject to change and is presented in the results. Five user profiles were simulated:
User Profile 1, representing 35% of the EV population, commutes two hours to his/her workplace with a schedule from 8 AM to 4 PM. His/her place of work has level two charging available.
User Profile 2, representing 25% of the EV population, commutes three hours to his/her workplace with a schedule from 8 AM to 4 PM. His/her place of work has level two charging available.
User Profile 3, representing 10% of the EV population, commutes two hours to his/her workplace with a schedule from 8 AM to 4 PM. His/her place of work has no charging capabilities.
User Profile 4, representing 20% of the EV population, has the EV connected to the grid for all 24 h of the day, not using his/her EV.
User Profile 5, representing 10% of the EV population, commutes two hours to his/her workplace with a schedule from 10 PM to 4 AM. His/her place of work has no charging capabilities.

2.2 Modeling of Wind Turbine Generator
The wind turbine is modeled by the power coefficient curve $C_p$, which is a function of the blade tip speed ratio $\lambda$ and the pitch angle $\beta$. The tip speed ratio is the ratio of the blade tip speed to the wind speed $V_W$:

$\lambda = \frac{R_b \omega_b}{V_W}$ (2)

where $R_b$ is the blade radius and $\omega_b$ is the blade rotational speed. The power coefficient is approximated as a function of $\lambda$ and $\beta$ by

$C_p = (0.44 - 0.0167\beta) \sin\left[\frac{\pi(\lambda - 3)}{15 - 0.3\beta}\right] - 0.0184(\lambda - 3)\beta$ (3)

The output power of the studied wind turbine generator is calculated by

$P_W = \frac{1}{2} \rho A_r C_p V_W^3$ (4)

where $\rho$ is the air density and $A_r$ is the blade swept area [10].
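Equations (2) to (4) translate directly into a short calculation. The sketch below assumes SI units and a pitch angle given in degrees (the usual convention for this approximation, although the text does not state it). The function names and the numerical values in the usage note are illustrative, not parameters of the studied turbine.

```python
import math


def tip_speed_ratio(blade_radius, rotor_speed, wind_speed):
    """Eq. (2): lambda = R_b * omega_b / V_W (m, rad/s, m/s)."""
    return blade_radius * rotor_speed / wind_speed


def power_coefficient(lam, beta):
    """Eq. (3): C_p as a function of tip speed ratio and pitch angle (degrees assumed)."""
    return ((0.44 - 0.0167 * beta)
            * math.sin(math.pi * (lam - 3.0) / (15.0 - 0.3 * beta))
            - 0.0184 * (lam - 3.0) * beta)


def wind_power(air_density, swept_area, cp, wind_speed):
    """Eq. (4): P_W = 0.5 * rho * A_r * C_p * V_W^3, in watts."""
    return 0.5 * air_density * swept_area * cp * wind_speed ** 3
```

With zero pitch, the sine term peaks at a tip speed ratio of 10.5, where `power_coefficient(10.5, 0.0)` reaches its maximum of 0.44. A hypothetical rotor with a 40 m blade radius turning at 2.1 rad/s in an 8 m/s wind operates at exactly that ratio.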
3 Algorithm and Approaches
This section discusses the algorithms and infrastructure used for microgrid control. The PSO algorithm, which is used to tune the parameters of the controller, is also described in this section.

3.1 Microgrid Frequency Control
Conventionally, the droop method is used as a robust strategy against power-frequency oscillations in power systems [9]; however, it needs to be equipped with a parameter tuning strategy such as optimization to obtain the desired result. Therefore, a secondary control mechanism is necessary to restore the frequency to its nominal value after different contingencies in the system [10]. One possibility is a centralized method that tunes the control parameters based on the deviation of the measured frequency from a reference signal. The PSO algorithm used to optimize the parameters of the controllers is discussed in the next part.
3.2 Particle Swarm Optimization PSO is used to tune the control parameters of the diesel engine governor. The objective function of the optimization is to minimize the frequency deviation resulted from the generation and load changes. This section discusses the application of PSO algorithm in this study. PSO is based on a multi-agent search strategy, which mimics the motion of birds searching for food and traces its evolution. The particles in this search approach are called swarm. Each swarm moves in the search space and look for the global optimum. PSO search space is multidimensional, and each particle moves in different directions and changes its position based on the experience of its own as well as its neighbor’s. Therefore, it selects the best position calculated by itself and its neighbors. Similarly, the swarm moves with the particle speed which is calculated by the data obtained by itself and its neighboring particles [11]. Let’s assume p and s are particle’s position and speed, respectively. The best position of a particle in each step is stored as Pbest . The best particle’s performance among all is represented as Gbest . Finally, the speed and position of each particle will be updated based on (5) and (6). sd +1 = k ∗ (γ ∗ vd + ac1 .rand () ∗ (Pbest − Pd ) + ac2 ∗ rand () ∗ (Gbest − Pd )) (5) Pd +1 = Pd + sd +1
(6)
where d is the iteration index, $P_d$ and $s_d$ are the particle's position and speed at the d-th iteration, respectively, $\gamma$ is the inertia weight factor, $ac_1$ and $ac_2$ are acceleration constants, rand() is a uniform random function in the range between 0 and 1, and k is the constriction factor, which is calculated from $ac_1$ and $ac_2$ according to (7).
$$k = \frac{2}{\left|2 - ac - \sqrt{ac^2 - 4\,ac}\right|} \quad (7)$$
where $ac = ac_1 + ac_2$ and $ac > 4$. An appropriate choice of $\gamma$ balances global and local search. In general, $\gamma$ is updated according to (8).
$$\gamma = \gamma_{max} - \frac{\gamma_{max} - \gamma_{min}}{iter_{max}} \times iter \quad (8)$$
where $iter_{max}$ is the specified number of iterations and $iter$ is the current iteration number [10]. In the above procedure, the particle speed is limited by a maximum value, $S_{max}$ [12]. Finally, the described PSO, with the objective of minimizing the frequency deviations, is used to find the best values for the control parameters.
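As an illustration of updates (5)-(8), the following is a minimal Python sketch, not the implementation used in this study. The function names `constriction` and `pso`, the default constants, the choice of $S_{max}$ as half the search range, and the clamping of positions to the bounds are all assumptions made for the example; the frequency-deviation objective of the microgrid model is replaced by whatever callable is passed in.

```python
import math
import random


def constriction(ac1, ac2):
    """Constriction factor k of Eq. (7); requires ac = ac1 + ac2 > 4."""
    ac = ac1 + ac2
    return 2.0 / abs(2.0 - ac - math.sqrt(ac * ac - 4.0 * ac))


def pso(objective, bounds, n_particles=20, iter_max=100,
        ac1=2.05, ac2=2.05, g_max=0.9, g_min=0.4, seed=0):
    """Minimize `objective` over box `bounds` using Eqs. (5), (6) and (8)."""
    rng = random.Random(seed)
    dim = len(bounds)
    k = constriction(ac1, ac2)
    # Assumed speed limit S_max: half of each dimension's range.
    s_max = [0.5 * (hi - lo) for lo, hi in bounds]
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    spd = [[0.0] * dim for _ in range(n_particles)]
    p_best = [p[:] for p in pos]
    p_cost = [objective(p) for p in pos]
    g = min(range(n_particles), key=p_cost.__getitem__)
    g_best, g_cost = p_best[g][:], p_cost[g]
    for it in range(iter_max):
        # Eq. (8): inertia weight decreases linearly with the iteration count.
        gamma = g_max - (g_max - g_min) * it / iter_max
        for i in range(n_particles):
            for j in range(dim):
                # Eq. (5): speed update, then limited to [-S_max, S_max].
                s = k * (gamma * spd[i][j]
                         + ac1 * rng.random() * (p_best[i][j] - pos[i][j])
                         + ac2 * rng.random() * (g_best[j] - pos[i][j]))
                s = max(-s_max[j], min(s_max[j], s))
                spd[i][j] = s
                # Eq. (6): position update (clamping to bounds is an assumption).
                lo, hi = bounds[j]
                pos[i][j] = max(lo, min(hi, pos[i][j] + s))
            cost = objective(pos[i])
            if cost < p_cost[i]:
                p_best[i], p_cost[i] = pos[i][:], cost
                if cost < g_cost:
                    g_best, g_cost = pos[i][:], cost
    return g_best, g_cost
```

In the setting of this paper, `objective` would run the microgrid simulation for a candidate set of governor parameters and return the resulting frequency deviation; a simple quadratic function can be used as a stand-in to check the search loop.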
4 Simulation Results
Initially, the microgrid model is simulated with a one-second time step over 24 h without real PV and EV data. The power flows of the generating units and demand loads are
Vehicle-to-Grid Based Microgrid Modeling and Control
shown in Fig. 2. This figure illustrates how the output of the renewable energy sources varies during a day. In addition, the load profile changes with the penetration of electric vehicles into the system.
Fig. 2. Microgrid power flow results during 24-h simulations (without real data). Legend: Load (red), Total Power (blue), Diesel (yellow), Solar (magenta), Wind (green)
Real PV data are used to model the solar panels in this microgrid power system. Figure 3 illustrates the PV output power during 24-h simulations.
Fig. 3. PV output power during 24-h simulations (with real data)
The real data related to the electric vehicles in the microgrid are also collected for 24 h, and the total power of the V2G system during 24-h simulations is shown in Fig. 4. The power stored in the electric vehicles' batteries would be used for frequency regulation when there is a lack of renewable energy during the day. The PV and electric vehicle real data are integrated into the microgrid system to obtain a more realistic response of the system. The power flows of the generating units and demand loads are shown in Fig. 5. As shown in this figure, the power flow results
Fig. 4. V2G system output power during 24-h simulations (with real data)
Fig. 5. Microgrid power flow results during 24-h simulations (with real data). Legend: Load (yellow), Total Power (blue), Diesel (red), Solar (green), Wind (magenta)
in the modeled microgrid have become more oscillatory due to the stochastic behavior of the real electric vehicle and solar panel data. As a result, the dynamic response of the microgrid system to these perturbations needs to be monitored and controlled. Figure 6 illustrates the frequency of the microgrid with PV and EV real data. As can be seen, the oscillations in the frequency response of the system increase to the point that the frequency extremes over the 24-h simulation (approximately 59.4 Hz and 61 Hz) exceed the limits allowed by the standards. For instance, the EN 50160 standard [13] mandates load shedding for a 1 Hz drop, starting with 5% of the total load. Therefore, a frequency control based on the method proposed in Sect. 3 is implemented in the governor of the diesel engine generator to minimize the frequency oscillations. The frequency response of the microgrid after implementing the proposed method is shown in Fig. 7. The method reduces the frequency extremes over the 24-h simulation (to approximately 59.6 Hz and 60.5 Hz).
Fig. 6. Frequency response of the microgrid with PV and EV real data (axes: Frequency (Hz) versus Time (sec), ×10^4)
Fig. 7. Frequency signal of the microgrid with proposed control method (axes: Frequency (Hz) versus Time (sec), ×10^4)
5 Conclusions
A vehicle-to-grid microgrid system is modeled and tested in this study. Realistic EV penetration and PV solar contributions are modeled in the microgrid. It is observed that the extra power provided by electric vehicles cannot always guarantee better power quality. Therefore, an optimal control method based on the Particle Swarm Optimization algorithm is implemented to minimize the frequency deviations due to generation and load changes during 24-h simulations. The results show that a properly designed optimal controller can bring the frequency deviations within the standard limits for reliable microgrid operation. In future work, our next point of interest is to study the vulnerabilities and cyber security aspects at the aggregator level of the V2G system. Our goal is to find exploitable points of entry and feasible forms of attack, and to analyze how the risks to microgrid operation are handled in case of cyber intrusions.
Power management strategies can also be implemented by considering the contribution of the V2G system to peak shaving during high demand hours throughout the day.
References
1. Short-Term Energy Outlook, 10 July 2018. www.eia.gov/outlooks/steo/
2. Mohamed, Y.A.-R.I., El-Saadany, E.F.: Adaptive decentralized droop controller to preserve power sharing stability of paralleled inverters in distributed generation microgrids. IEEE Trans. Power Electron. 23(6), 2806–2816 (2008)
3. Junxiong, Z., Fei, L., Ze-xiang, C., Jianzhong, Y.: Coordinated control strategies between unit and grid in islanded power system. In: 4th International Conference on Electric Utility (DRPT), pp. 1454–1458 (2011)
4. Kempton, W., Tomic, J.: Vehicle-to-grid power fundamentals: calculating capacity and net revenue. J. Power Sources 144(1), 268–279 (2005)
5. Guille, C., Gross, G.: A conceptual framework for the vehicle-to-grid (V2G) implementation. Energy Policy 37(11), 4379–4390 (2009)
6. Orihara, D., Kimura, S., Saitoh, H.: Frequency regulation by decentralized V2G control with consensus-based SOC synchronization. IFAC-PapersOnLine 51(28), 604–609 (2018)
7. Cundeva, S., Dimovski, A.: Vehicle-to-grid system used to regulate the frequency of a microgrid. In: IEEE EUROCON, 6–8 July 2017, Ohrid, R. Macedonia (2017)
8. https://dataport.pecanstreet.org/
9. Poudel, B., Cecchi, V.: Frequency-dependent models of overhead power lines for steady-state harmonic analysis: Model derivation, evaluation and practical applications. J. Electric Power Syst. Res. 151, 266–272 (2017)
10. Keshtkar, H., Solanki, J., Solanki, S.K.: Application of PHEV in load frequency problem of a hybrid microgrid. In: North American Power Symposium (NAPS), pp. 1–6. IEEE (2012)
11. Li, H., Yang, D., Su, W., Lü, J., Yu, X.: An overall distribution particle swarm optimization MPPT algorithm for photovoltaic system under partial shading. IEEE Trans. Ind. Electron. 66(1), 265–275 (2019)
12. Babaei, E., Galvani, S., Ahmadi Jirdehi, M.: Design of robust power system stabilizer based on PSO. In: IEEE Symposium on Industrial Electronics and Applications (ISIEA 2009), Malaysia, October 2009
13. EN 50160:2007: Voltage characteristics of electricity supplied by public distribution networks. ed. European Committee for Electrotechnical Standardization – CENELEC (2007)
Intelligent Time Synchronization Protocol for Energy Efficient Sensor Systems
Jalil Boudjadar^1(B) and Mads Mørk Beck^2
1 Aarhus University, Aarhus, Denmark
[email protected]
2 ReMoni ApS, Aarhus, Denmark
Abstract. The ubiquity of sensors has offered great opportunities for monitoring infrastructures to track the real-time state of observed environments. Monitoring systems rely on massive synchronous data to calculate the environment state. However, collecting and communicating massive data usually leads to high energy consumption due to the high sampling frequency and communication cost of the sensors' operation. Clamp-on wireless sensors are a flexible plug-and-play technology that is easy to deploy, where sensors operate on integrated batteries. This paper introduces a highly accurate time synchronization protocol to coordinate the operation of individual clamp-on sensors deployed as an energy-constrained monitoring solution. While keeping sensors highly synchronized, the proposed protocol drastically reduces the data communication and energy consumption of the sensor network. To minimize the energy consumption further, we optimize the protocol behavior using a genetic algorithm. The protocol has been modeled in VDM for formal analysis, implemented, and compared to a state-of-the-art protocol.
Keywords: Sensor networks · Time synchronization · Energy efficiency · Intelligent control
1 Introduction
The EU Commission has stated that all new buildings after 2021 should be Nearly Zero Energy Buildings (NZEB) [1]. This will be achieved by integrating local green energy resources in buildings, so that individual buildings become almost self-sufficient in terms of energy; technically known as islanded micro-grids [7]. This will require smart grid systems to guarantee profitable deployment of such infrastructures. The reason is that buildings can generate enough green energy on their own to satisfy the private demand, whereas the extra energy can be sold to the grid. Achieving a transparent, profitable business model for NZEB requires monitoring the building's energy generation and consumption with high accuracy and low latency, as false readings could result in energy shortages or expensive bills.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 609–623, 2022. https://doi.org/10.1007/978-3-030-82196-8_45
Different monitoring solutions have been proposed that use sensors to measure energy and collect data, thus delivering a real-time energy state for NZEBs
[8,11–14]. The conventional solutions often require many sensors that collect large amounts of data. This comes at a significant energy cost for collecting, communicating and processing data. This is compliant with the EU's new green vision, and motivates the use of a fine-grained monitoring model where sensors are deployed at appliance level instead of just on the in-going power to the building. The rationale behind this recommendation is simply to identify energy waste within buildings and optimize the underlying energy management. Achieving this fine-grained monitoring requires a constellation of sensor networks, data communication and processing [15]. Data-driven solutions rely on collecting massive data to obtain accurate tracking of the energy dynamics over time. Clamp-on wireless sensors are an important low-cost ingredient for building such monitoring solutions given their flexibility and mobility [16]. A clamp-on sensor measures both voltage and current, which are the main parameters in calculating the energy consumption. The collection of both parameters from different sensors must be synchronized so that energy consumption is calculated at the timestamp at which the data is sampled. The main challenge is that such sensors operate on batteries and have very limited computation resources to track the advancement of time, thus individual clocks can drift from the master node clock. This requires the sensors to synchronize frequently with the master node in order to calibrate their individual clocks. Time synchronization gets even worse in case of a failure, e.g. loss of communication to the master node. Different time synchronization protocols have been proposed for energy-efficient operation of sensor networks [6,17,19–22]. The common denominator of most of these protocols is that time accuracy is obtained at the expense of energy consumption.
In this paper, we propose an intelligent time synchronization protocol to coordinate the operation of individual clamp-on sensors deployed as an energy-constrained monitoring solution. The proposed protocol is able to deliver highly accurate time synchronization (drift converging to zero) to prevent the drifting of sensor clocks while increasing the energy efficiency of the sensors. The protocol has been formally modeled in VDM and analyzed using combinatorial testing against deployment constraints. Furthermore, a prototype has been implemented and examined on an actual case study. The rest of the paper is structured as follows: Sect. 2 introduces the overall system architecture and modeling. Relevant related work is described in Sect. 3. Section 4 presents our energy-efficient time synchronization protocol. Analysis of both time accuracy and energy efficiency of the proposed protocol is provided in Sect. 5. Implementation is provided in Sect. 6. Finally, Sect. 7 concludes the paper.
2 System Architecture and Modeling
This section describes the system architecture and its elements, as well as what makes time synchronization of clamp-on wireless sensor networks challenging. The architecture of the wireless sensor network (WSN) is depicted in Fig. 1. It consists of a set of sensors sampling voltage and current data in real time, and a master node (sink) where all data is deposited. The master node forwards the sensor data to the cloud where control decisions are calculated,
e.g. which energy resource to schedule, when to sell energy to the grid, etc. Sensors operate on integrated batteries and communicate wirelessly with the master node to forward data and receive time synchronization messages, given that the sensors do not have physical clocks (the hardware is reduced in order to reduce the sensors' energy consumption). The master node is able to track physical time given that it has a clock. To prevent time drifting of sensors, the master node assists by frequently sending timestamp (sync) messages to the sensors to calibrate their local time.
Fig. 1. Overall system architecture
The energy consumption P at any time point t is calculated using the following equation:
$$P(t) = V(t) \cdot I(t) \cdot \cos(\varphi)$$
where V(t) is the voltage at time t, I(t) is the current at time t and $\varphi$ is the phase angle between voltage and current. Both voltage and current must therefore be sampled at the same time point in order to obtain meaningful readings.
If the sensor readings are non-synchronous, the actual energy consumption state will be inaccurate, which in turn leads to misleading control actuations.
Fig. 2. Data synchronization problem (diagram labels: Sensor1, Sensor2, Master node; current and voltage readings)
In Fig. 2, one can see that although data is sampled at the same time point in sensor1 and sensor2, the data from sensor1 arrives at the master node much earlier than the data from sensor2. This is in fact due to the communication environment and the size of the data sent by the different sensors. Thus the master node cannot rely on the arrival time to merge data, but rather on the time point at which the data is sampled. The sampling instant is attached by the sensor to each sample as a timestamp. This motivates the need for the different sensors to have their internal clocks synchronized.
Definition 1. A sensor $S_i$ is given by $\langle T_i, E_i, D_i \rangle$ where $T_i$ denotes the actual time of the sensor, $E_i$ is the energy consumption of the sensor and $D_i$ is the drift of the sensor time compared to the time in the master node. A sensor network is given accordingly by $\langle M, (S_1, .., S_n) \rangle$ where $M = \langle C, E \rangle$ is the master node, with C a physical clock and E the energy consumption of the node, and $(S_1, .., S_n)$ are sensors.
In fact, $T_i$ is the number of clock ticks since the sensor was powered on. This variable can be updated through a synchronization operation. The energy consumption $E_i$ is a critical attribute of the sensor as it determines the lifetime and impacts how often the sensor can sample and send data to the master node. As for behavior, the sensors go through four different operation phases: Standby, Transmit, Receive and Measure. In the Standby phase the sensor is inactive and consumes very little energy, whereas in the other three phases the sensor is either sending, receiving or sampling data while consuming energy proportional to the data throughput. The different phases can run with different time intervals depending essentially on the sampling frequency and the time synchronization protocol used to align the sensors' local time to the clock of the
master node. Further details of the sensors' behavior and the underlying energy consumption are provided in Sect. 4.
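To make the merging rule of this section concrete, the following is a small Python sketch under stated assumptions: samples are represented as hypothetical (timestamp, value) pairs, the helper names `merge_by_timestamp` and `power` are illustrative, and the phase angle is taken as a known constant. It pairs voltage and current readings by their sampling timestamps rather than by arrival order before applying P(t) = V(t)·I(t)·cos(φ).

```python
import math


def merge_by_timestamp(voltage, current):
    """Pair voltage and current samples that carry the same sampling timestamp.

    Each input is a list of (timestamp, value) tuples in any arrival order.
    Samples without a counterpart at the same timestamp are dropped.
    """
    v_by_t = dict(voltage)
    return {t: (v_by_t[t], i) for t, i in current if t in v_by_t}


def power(v, i, phi):
    """P = V * I * cos(phi), with phi the phase angle in radians."""
    return v * i * math.cos(phi)


# Illustrative data: the current samples arrive later and cover other instants.
voltage = [(0, 230.0), (1, 231.0), (2, 229.0)]
current = [(2, 2.5), (1, 2.0), (3, 2.4)]
phi = 0.0  # assumed purely resistive load for the example

paired = merge_by_timestamp(voltage, current)
readings = {t: power(v, i, phi) for t, (v, i) in paired.items()}
```

Only timestamps 1 and 2 appear in both streams, so only those two instants yield a power reading; if the sensor clocks drift, matching timestamps no longer refer to the same physical instant, which is the problem the protocol addresses.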
3 Related Work
Time synchronization and energy efficiency for wireless sensor networks have been studied thoroughly in the literature [5,18–23] for different applications. Besides improving time synchronization accuracy and energy efficiency, a challenging task is how to balance the trade-off between these conflicting attributes [9,10]. The authors of [5] proposed a networked control approach to identify sensor synchronization and estimate potential drifting errors. This might lead to more accurate synchronization, but at the expense of energy, given that the drifting of one sensor would require a synchronization of the entire network. The larger the network is, the more frequently synchronization packets are issued. In contrast, our protocol calculates the drifting error of sensors individually so that only the sensors experiencing a drift issue a sync request. In [21], Ali et al. introduced a time synchronization protocol based on an averaging consensus algorithm. The protocol relies on updating the sensors' local time following the expected communication time. This improves time synchronization accuracy but increases the number of communication instances needed to estimate the actual communication time, by which the energy consumption increases. The authors of [20] proposed a time synchronization protocol for WSNs to compensate for the time discrepancy of sensors. It relies on a broadcast scheme and a timestamping mechanism to achieve low execution time and low network traffic along with accurate synchronization. However, similarly to [21], performing a broadcast communication every time a sensor drifts is expensive in terms of energy, and it is not even needed for sensors whose clocks are well synchronized. The authors of [19] introduced a protocol, as an integration of two processes, to reduce the energy consumption of sensors while maintaining high time accuracy.
Such a protocol relies on the assumption that the master node knows the complete topology of the network, which makes it unsuitable for ad-hoc networks due to the lack of flexibility. The authors of [22] designed a high-precision time synchronization based on a common performance clock source. Based on mutual drift estimation, each pair of activated sensors fully adjusts clock rate and offset to achieve network-wide time synchronization without necessarily going through the gateway (master node). The protocol considers stochastic communication to model random packet loss. Given that our protocol does not consider packet loss, this would be an attractive feature to add in the future.
4 An Intelligent Time Synchronization Protocol
The challenge with wireless sensor networks that have a single (master) node equipped with a physical clock is that sensors may request time synchronization from the master node for each data packet to send [6]. Although this guarantees an accurate synchronization of sensors to the master node's clock, it can end up draining the sensor batteries much faster. This situation gets even worse when sensors operate with a high sampling frequency. Such a naive synchronization approach is depicted in Fig. 3.
Fig. 3. Naive time synchronization
In order to improve the energy efficiency of WSNs while keeping sensors highly synchronized, we propose a new time synchronization protocol. The protocol dynamics as well as the underlying energy consumption are described below.
4.1 Time Synchronization Dynamics
One way to reduce the sensors' energy consumption is by reducing the size of the data packets to send. The proposed protocol relies on the assumption that each sensor is responsible for its own synchronization with the master node. It first reduces the synchronization request packet by 4 bytes (out of the 13 bytes used in the naive synchronization), which enables the sensors to turn on their transmitters/receivers for a shorter time to communicate with the master node. Figure 4 depicts the packet structure of the proposed protocol. Given the high
number of data packets to communicate, the longer the WSN runs, the larger the energy savings our protocol achieves.
Fig. 4. Data packets for synchronization
At a second stage, the proposed protocol reduces the number of synchronization requests and answers exchanged between the sensors and the master node. This is achieved by enabling each sensor to use the timestamps (synchronization packets) from the master node to calculate its own clock drift. Based on this drift, the sensor requests new synchronization packets only if the drift is considerable¹. Otherwise, as long as the sensor's internal time is synchronized with the master's clock, no synchronization requests are issued by the sensor, and thus no sync packet is sent by the master node to that sensor. This drastically reduces the number of packets exchanged between sensors and master node, which means a longer Standby phase and higher energy savings for the sensors. Following such a behavior, the sensor can be seen as a phase-locked PI controller [4]. Figure 5 depicts the overall behavior of the sensor with respect to drift estimation and correction. The time interval u(t) after which the PI controller requests a new synchronization is calculated as follows (discretized):
$$u(t) = K_p \cdot D_i(t) + K_i \sum_{0}^{t} \big(D_i(t) - D_i(t-1)\big)$$
where $K_p$ and $K_i$ are the proportional and integral parameters used for optimization. The drift (error) $D_i(t)$ of a sensor at time point t is calculated in turn as the difference between the timestamp sent by the master node, together with the trip time of the synchronization packet, and the sensor time (number of ticks) $T_i(t)$ at time point t.
¹ By clock drifting we refer to the deviation of a sensor's time from the physical clock of the master node.
Fig. 5. PI controller for time synchronization (block labels: T_world, T_Estimate, Error, Kp, Ki, Correction factor)
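The PI behavior described above can be sketched in a few lines of Python. This is an illustrative sketch only: it follows the accumulate-then-add form of the integral term that appears in Listing 1.1 (integrator updated by $K_i$ times the error, output equal to the integrator plus $K_p$ times the error), which is an assumption about how the discretized rule is realized. The class name `PISync`, the helper `needs_sync` and the drift threshold are hypothetical, since the text only states that a request is issued when the drift is considerable.

```python
class PISync:
    """Discrete PI correction driven by the measured clock drift D_i(t)."""

    def __init__(self, kp, ki, integrator=0.0):
        self.kp = kp                  # proportional parameter K_p
        self.ki = ki                  # integral parameter K_i
        self.integrator = integrator  # accumulated (integral) term

    def update(self, drift):
        """Return the controller output u(t) for the latest drift sample."""
        self.integrator += self.ki * drift
        return self.kp * drift + self.integrator


def needs_sync(drift, threshold):
    """Hypothetical rule: request a sync packet only for a considerable drift."""
    return abs(drift) >= threshold
```

A sensor would call `update` each time it receives a timestamp from the master node, and stay in Standby with respect to synchronization while `needs_sync` returns False.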
In fact, the error calculation is performed as follows:
$$D_i(t) = \frac{(T_{world} - [T_i(t)]) \cdot F}{[T_i(t)] - [T_i(t-1)]}$$
where $T_{world} = C + R/2$, C is the physical clock value provided by the master node and R is the round trip time for a message communication between the sensor and the master node. F is a scaling factor defined as follows: $F = C(t) - [T_i(t)]$.
4.2 Optimization
To further improve the time accuracy and energy efficiency of the proposed protocol, we have conducted an optimization process using a genetic algorithm [2]. The optimization amounts to calibrating the PI controller parameters ($K_p$, $K_i$, F), so that optimal starting and result values are identified for which a lower packet count is used for synchronization, i.e. the number of sync packets needed before an actual drifting error occurs is minimized. $K_p$ dictates how much the controller should correct for the proportional error. $K_i$ controls how much of the previous error to correct for, and F is an integrator starting value used to track the constant error. The genetic algorithm works by taking random starting sets of values for the control parameters and passing them through a fitness function. The fitness function weights the resulting errors for different parameter values. It is defined as follows:
$$W_i = |u(t)| \cdot Count(t)$$
where Count(t) is the number of synchronization packets to send in order to reach a satisfactory synchronization between the sensor time and the master node clock. The optimization results and sketch code are presented in Sects. 5 and 6 respectively.
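The optimization loop described above can be illustrated with the following Python sketch. It is a toy under explicit assumptions, not the optimization code used in this work: `simulate` is a hypothetical stand-in for the sensor model (a clock rate estimate corrected by a PI loop until the residual is below an assumed level), the fitness multiplies the residual error by the packet count in the spirit of $W_i = |u(t)| \cdot Count(t)$, and the genetic operators (truncation selection, uniform crossover, Gaussian mutation, elitism) and all numeric defaults are illustrative choices.

```python
import random


def simulate(kp, ki, f0, true_rate=1.0002, sync_level=1e-6, max_packets=200):
    """Toy sensor model: returns (residual error, sync packets used)."""
    integrator, rate = f0, f0  # F is used as the integrator starting value
    for count in range(1, max_packets + 1):
        err = true_rate - rate          # error seen at this sync packet
        integrator += ki * err
        rate = integrator + kp * err
        if abs(true_rate - rate) < sync_level:
            return abs(true_rate - rate), count
    return abs(true_rate - rate), max_packets


def fitness(genes):
    """Weight the remaining error by the number of sync packets (lower is better)."""
    residual, count = simulate(*genes)
    return residual * count


def genetic_search(fitness, bounds, pop_size=20, generations=30,
                   mut_sigma=0.05, seed=1):
    """Minimal GA: keep the best half, recombine, mutate, preserve the elite."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness)
        history.append(fitness(pop[0]))
        parents = pop[:pop_size // 2]
        children = [pop[0][:]]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [min(hi, max(lo, g + rng.gauss(0.0, mut_sigma * (hi - lo))))
                     for g, (lo, hi) in zip(child, bounds)]   # bounded mutation
            children.append(child)
        pop = children
    best = min(pop, key=fitness)
    history.append(fitness(best))
    return best, history
```

A call such as `genetic_search(fitness, [(0.0, 1.0), (0.0, 1.0), (0.999, 1.001)])` returns a candidate ($K_p$, $K_i$, F) together with the best fitness per generation; because the elite is always kept, that sequence never increases.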
Fig. 6. Sensors drifting without synchronization (ms) (axes: drifting time versus simulation time)
5 Analysis of the Time Accuracy and Energy
This section analyzes the time accuracy and the improvement of the energy efficiency of our time synchronization protocol.
5.1 Time Accuracy Analysis
Given that both the sensor network and our protocol are modeled using VDM, we first formally analyze the time accuracy of our protocol against the baseline delay requirement (500 ms) imposed by our industrial partner. Furthermore, Python simulations are used to analyze the time accuracy improvement obtained through optimization. Without any time synchronization mechanism, the drifting error of a sensor can reach up to 400 ms (Fig. 6). Using the combinatorial testing of VDM [3], we showed that the drift obtained with our protocol satisfies the baseline requirement. The drift can be either positive or negative. In fact, the drift of our protocol is far below that requirement and does not exceed 66 ms (Fig. 7). Using the optimization algorithm on the sensor behavior (modeled as a PI controller), the drifting time of the sensors has been drastically reduced and converges to zero with a maximum value of 1 ms. Figure 8 depicts the probability distribution of the sensor drift obtained through optimization. One can see that over 100 iterations the objective function of our optimization algorithm converges to zero (Fig. 9), which is the optimal value of the objective function.
Fig. 7. Sensor drifting using our protocol (ms) (axes: time error versus simulation time)
Fig. 8. Sensor drifting after optimization (plot title: Normalized time error on a single sensor; axes: Probability [%] versus Clock error [time units])
5.2 Energy Consumption Analysis
The energy consumption of the sensor amounts to the energy consumed throughout the four behavior phases. The time spent in each phase depends on the sampling frequency and the output of the drift calculation of the time synchronization protocol. In the Standby phase, the energy consumption $E_i^S(t, t')$ of a sensor i over the interval from t to t' is the consumption rate $I_s$ accumulated over the duration of the stay.
$$E_i^S(t, t') = I_s \cdot (t' - t)$$
The next phase is when the sensor is transmitting over the radio interface, which is often a high-energy operation. The energy consumption $E_i^T$ during the Transmit phase is obtained as the energy consumed per packet $I_t$, corresponding to the energy consumption rate for transmission, times the number of packets, obtained as the time duration divided by the time to send one packet ($T_m$).
$$E_i^T(t, t') = I_t \cdot \frac{t' - t}{T_m}$$
In a similar way, the energy consumed during the Receive phase $E_i^R$ is calculated as follows:
$$E_i^R(t, t') = I_r \cdot \frac{t' - t}{T_r}$$
where $I_r$ is the energy consumption for receiving a single packet and $T_r$ is the time duration to receive the packet. We assume that during the Transmit and Receive phases, the sensor is active exactly for the time needed to send, respectively receive, packets without waiting time between the different packets. As for the Measure phase, the energy consumption is calculated as the consumption rate per sample ($I_m$) times the number of samples, obtained as the sampling duration divided by the sampling period 1/H, where H is the sampling frequency.
$$E_i^M(t, t') = I_m \cdot \frac{t' - t}{1/H}$$
Figure 10 depicts a simulation-based analysis of the energy consumption of a sensor with naive synchronization, i.e. a synchronization packet is requested for each data sample. The energy consumption of our protocol, after optimization, is depicted in Fig. 11. One can see that our optimized protocol consumes considerably less energy than the naive synchronization approach.
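The four phase formulas of this section can be combined into a single total, as in the Python sketch below. The function name `sensor_energy`, the dictionary keys and the numbers in the usage note are illustrative assumptions; no measured consumption rates from the actual sensors are implied.

```python
def sensor_energy(durations, rates, t_m, t_r, h):
    """Total energy over the Standby, Transmit, Receive and Measure phases.

    durations: time spent in each phase (t' - t), keyed by phase name
    rates:     I_s (per time unit), I_t and I_r (per packet), I_m (per sample)
    t_m, t_r:  time to send / receive one packet
    h:         sampling frequency H (samples per time unit)
    """
    standby = rates["standby"] * durations["standby"]
    transmit = rates["transmit"] * durations["transmit"] / t_m
    receive = rates["receive"] * durations["receive"] / t_r
    measure = rates["measure"] * durations["measure"] / (1.0 / h)
    return standby + transmit + receive + measure
```

For example, with rates of 0.01, 20, 15 and 2 for the four phases, durations of 100, 2, 1 and 5 time units, `t_m = t_r = 0.5` and `h = 10`, the phases contribute 1, 80, 30 and 100 respectively. Fewer sync packets shorten the Transmit and Receive durations, which is where the proposed protocol saves energy relative to the naive approach.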
6 Implementation and Experimental Results
The proposed time synchronization protocol has first been implemented in Python for simulation and optimization purposes. Thereafter, in order to test the proposed protocol in an actual setup and integrate it into the ReMoni cloud solution for remote monitoring², it has been implemented in C. To enable comparison, the naive synchronization protocol has also been implemented in Python. A code sketch of the main time synchronization function is given in Listing 1.1. It is important to mention that, even for a large network, our optimization function converges to the optimal result after a reasonable number of iterations.
² https://www.remoni.com/solutions/.
Fig. 9. Optimization convergence
Fig. 10. Energy consumption for a Naive synchronization (plot title: Total power usage; axes: Power [mA], ×1e7, versus simulation time)
Fig. 11. Energy consumption of our protocol after optimization (plot title: Total power usage; axes: Power [mA], ×1e7, versus simulation time)
Listing 1.1. Python code for time synchronization

    def SyncTime(self, Ctx, Crx, Trx, timeQa):
        if self.ts.state == timesyncStates.FIRST_BOOTED:
            # First sync packet: take it as the reference point.
            self.ts.state = timesyncStates.CONVERIGN
            self.ts.Tref = Trx + ((Crx - Ctx) / 2) * self.ts.R
            self.ts.Cref = Crx
        else:
            if self.ts.err == 0:
                self.ts.badness = abs(self.ts.err)
            Tnow = self.SyncTimeNow(Crx)
            # Round trip in local ticks; half of it is added to the master timestamp.
            Crtt = Crx - Ctx
            Tworld = Trx + (Crtt / 2) * self.ts.R
            dC = (Crx - self.ts.Cref)
            # Drift error, scaled as in the error formula of Sect. 4.1.
            self.ts.err = (Tworld - Tnow) * dC / (Tnow - self.ts.Tref)
            # PI update: integral term, then corrected rate R.
            self.ts.integrator = self.ts.integrator + (
                self.ts.err * self.ts.ki)
            self.ts.R = self.ts.integrator + self.ts.kp * self.ts.err
            self.ts.Tref = Tnow
            self.ts.Cref = Crx
            # Exponentially smoothed error magnitude decides the sync state.
            self.ts.badness = 0.9 * self.ts.badness + (
                0.1 * abs(self.ts.err))
            if self.ts.badness < SYNC_LEVEL:
                self.ts.state = timesyncStates.SYNCRONISED
            else:
                self.ts.state = timesyncStates.CONVERIGN
7 Conclusion
This paper proposed an intelligent energy-efficient time synchronization protocol for wireless (clamp-on) sensor networks. Sensors operate on battery and have very limited hardware. The mission of the protocol is to maintain sensors time
synchronized with the clock of the master node so that the data sampling from sensors is aligned. This results in an accurate real-time state of the monitored environment. The proposed protocol requests a synchronization packet from the master node only if a sensor's time is drifting considerably from the master node clock. The protocol has been formally specified in VDM to verify that the maximum drift does not violate the baseline requirement. Furthermore, an optimization of the protocol has been conducted using a genetic algorithm to further improve the time accuracy. The proposed protocol has been implemented and integrated into the ReMoni cloud solution for remote monitoring. The analysis results show that the proposed protocol outperforms the state-of-the-art protocols with respect to time accuracy while maintaining a high energy efficiency. Future work will study the trade-off between time accuracy and energy efficiency, and find the optimal sampling frequency to reduce the energy consumption further.
References 1. European Commission.: NZEB Buildings. https://ec.europa.eu/energy/topics/ energy-efficiency/energy-efficient-buildings/nearly-zero-energy-buildings en 2. Jayachitra, A., Vinodha, R.: Genetic algorithm based PID controller tuning approach for continuous stirred tank reactor. Adv. Artif. Intell. (2014) 3. Kulik, T.P., Tran-Jørgensen, W.V., Boudjadar, J., Schultz, C.: A Framework for threat-driven cyber security verification of IoT systems. In: 2018 IEEE International Conference on Software Testing, Verification and Validation Workshops ICSTW, pp. 89–97 (2018) 4. Crowe, J., Johnson, M.: Phase-Locked Loop Methods and PID Control. Springer Publisher, pp. 259–296 (2005) 5. Wang, T., Cai, C., Guo, D., Tang, X., Wang, H.:Clock synchronization in wireless sensor networks: a new model and analysis approach based on networked control perspective. Math. Prob. Eng. 19 (2014) 6. Ranganathan, P., Nygard, K.: Time synchronization in wireless sensor networks: a survey. Int. J. UbiComp 1(2) (2010) 7. Banaei, M., Rafiei, M., Boudjadar, J., Khooban, M.: A comparative analysis of optimal operation scenarios in hybrid emission-free ferry ships. IEEE Trans. Transp. Electrification 6(1), 318–333 (2020) 8. Ingeli, R., Cekon, M.: Analysis of energy consumption in building with NZEB concept. Appl. Mech. Mater. 824, 347–354 (2015) 9. Rafiei, M., Khooban, M., Igder, M.A., Boudjadar, J.: A novel approach to overcome the limitations of reliability centered maintenance implementation on the smart grid distance protection system. IEEE Trans. Circ. Syst. Expr. Briefs 67(2), 320– 324 (2020) 10. Gheisarnejad, M., Boudjadar, J., Khooban, M.: A new adaptive type-II fuzzy-based deep reinforcement learning control: fuel cell air-feed sensors control. IEEE Sens. J. 19(20), 9081–9089 (2019)
11. Magrini, A., Lentini, G., Cuman, S., Bodrato, A., Marenco, L.: From nearly zero energy buildings (NZEB) to positive energy buildings (PEB): the next challenge - the most recent European trends with some notes on the energy analysis of a forerunner PEB example. In: Developments in the Built Environment, vol. 3 (2020) 12. Magrini, A., Lentini, G.: NZEB analyses by means of dynamic simulation and experimental monitoring in mediterranean climate. Energies J. 13(18), 4784 (2020) 13. Cao, X., Dai, X., Liu, J.: Building energy-consumption status worldwide and the state-of-the-art technologies for zero-energy buildings during the past decade. Energy Build. 128, 198–213 (2016) 14. Van de Putte, S., Bracke, W., Delghust, M., Steeman, M., Janssens, A.: Comparison of the actual and theoretical energy use in nZEB renovations of multi-family buildings using in situ monitoring. E3S Web Conference (2020) 15. Haq, A.U., Jacobsen, H.-A.: Prospects of appliance-level load monitoring in offthe-shelf energy monitors: a technical review. Energies J. 11 (2018) 16. ReMoni.: Clamp-on IoT Sensors. https://www.remoni.com/products2/productoverview/ 17. Elson, J., Girod, L., Estrin, D.: Fine-grained network time synchronization using reference broadcasts. ACM SIGOPS Oper. Syst. Rev. 36, 147–163 (2002) 18. Ganeriwal, S.K., Srivastava, R., Mani, B.: Timing sync protocol for sensor networks. In: SenSys 2003: Proceedings of the First International Conference on Embedded Networked Sensor Systems (2003) 19. Li, F., He, G., Wang, X.: An improved hybrid time synchronization approach in wireless sensor networks for smart grid application. In: HPCC Conference (2015) 20. Kim, K.-H., Hong, W.-K., Kim, H.: Low cost time synchronization protocol for wireless sensor network. IEICE Trans. Commun. 92-B(4), 1137–1143 (2009) 21. Al-Shaikhi, A.: Accuracy-enhanced time synchronization method for WSNs using average consensus control. 
In: Proceedings of International Multi-Conference on Systems, Signals and Devices (2018) 22. Xiong, N., Fei, M., Yang, T., Tian, Y.-C.: Randomized and efficient time synchronization in dynamic wireless sensor networks: a gossip-consensus-based approach. Complex. J. (2018) 23. Upadhyay, Divya., Dubey, Ashwani Kumar, Santhi Thilagam, P.: Time synchronization problem of wireless sensor network using maximum probability theory. Int. J. Syst. Assur. Eng. Manag. 9(2), 517–524 (2018). https://doi.org/10.1007/ s13198-018-0698-9
Intelligent Sensors for Intelligent Systems: Fault Tolerant Measurement Methods for Intelligent Strain Gauge Pressure Sensors Thomas Barker1,2 , Giles Tewkesbury1 , David Sanders1(B) , and Ian Rogers2 1 University of Portsmouth, Portsmouth, UK
{thomas.barker,david.sanders}@port.ac.uk 2 Gems Sensors & Controls, Basingstoke, UK
Abstract. A new method is described for measuring an existing pressure transducer with greater potential for analysis. The new method has the potential to allow more comprehensive and early detection of sensor faults. This supports further work to develop fault tolerance, onboard data quality estimation and failure prediction, enabling intelligent sensors to operate more independently and reliably. A computer model of the sensor was constructed, and measurement approaches were compared. A typical measurement scenario was simulated under normal operating conditions before and after an overpressure damage event. The range of failure modes detectable using this approach is discussed. The new method was then simulated using the same overpressure damage event. The results of the simulation are discussed and compared. The new measurement method has the potential to allow more comprehensive and early detection of sensor faults.
Keywords: Pressure sensing · Strain gauge · Fault tolerance · Failure detection · Intelligent sensors
1 Introduction
The research presented in this paper describes new methods to create fault tolerant strain gauge pressure sensors. The work presented here is part of broader research to use AI techniques to provide onboard failure prediction and data quality estimation methods for intelligent fluid pressure sensors. Intelligent systems can be defined as those capable of autonomous action and decision making based upon the information available. Intelligent sensors are here defined as those capable of reporting not just raw signals but processed and useful information valuable to intelligent systems. Intelligent sensors may increase the value of the information they produce in numerous ways. They may advertise their capabilities to a connected system, present data digitally in standard units and report on their state of health. More specialised sensors may indicate not values, but states and events of interest identified by the sensor.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 624–632, 2022. https://doi.org/10.1007/978-3-030-82196-8_46
Pressure sensors are used in a wide variety of applications to monitor system performance and the safety of critical systems. The monitoring and modelling of compressed air systems can reduce energy consumption [1, 2]. Real-time observation of pressure data from sensors can be used in oil refining processes to provide new intelligent systems that eliminate anomalies by monitoring crude oil distillation columns [3] and to predict storage tank leakage to improve safety [4]. These sensors are critical to ensure safe and reliable operation, to provide feedback to control the systems and to enable condition-based maintenance. However, all sensors are subject to wear and aging. Some installation locations make servicing difficult, for example in high-recirculation airlift reactors for treating wastewater [5], and damage is often not visible. These concerns are exacerbated by the growth of the Internet of Things (IOT), wherein rapid deployment, unattended operation and machine-to-machine communication are commonplace. The concept of virtual task machines was discussed in [7], where advanced production machines provided design-level advice about their manufacturing capabilities and the tasks that they could perform. The concept of task machines able to advise on their abilities was integral even at the joint level, where intelligent actuators were able to advise on their ability and function [8]. Extending this intelligence to sensors can provide additional information, such as confidence ratings, that can be used in the automatic processing of sensor data. The wider aim of this project is to create intelligent sensors that are able to report on their state of health and the quality of their measurements in terms of accuracy, units and repeatability, and to provide advance warning of problems such as deterioration or failure.
This approach extends the benefits of condition-based maintenance [9] to the sensor itself, enabling the sensor to inform maintenance planning processes and offer enhanced confidence in safety monitoring applications [6]. In this paper a new measurement approach was used with an existing sensor, which allowed more in-depth condition monitoring. Section 2 describes a typical configuration and the results from simulating an overpressure event. Section 3 describes the new measurement approach and the same simulation of an overpressure event. Results are discussed in Sect. 4 along with further work, and conclusions are drawn in Sect. 5.
1.1 The Sensor
The sensor used in this research was a resistive strain gauge sensor commonly used in several commercial products. A pressure summing diaphragm was mounted within a metal housing. One side of the diaphragm was at atmospheric pressure, with the other side connected to the applied pressure to be measured. The diaphragm deflected according to the pressure applied with respect to atmospheric pressure. Several thin film strain gauges were bonded to the diaphragm in a Wheatstone bridge configuration, shown in Fig. 1.
Fig. 1. Diagram of resistive sensor elements. VP represents the measured output. Excitation voltage is applied between VS and VG.
Under deformation, two elements are in areas largely under tension and two in areas largely under compression, as shown in Fig. 2.
Fig. 2. Cutaway view of sensor with deformation under pressure exaggerated.
2 Typical Configuration
2.1 Measurement
The measurement setup used in a typical application is shown in Fig. 3. The sensor was excited with a voltage applied between VS and VG, and the voltage VP was then measured to determine the pressure of the measured fluid.
Fig. 3. Typical sensor usage
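To make the typical measurement concrete, the following minimal Python sketch computes the differential output of an ideal four-element Wheatstone bridge. The element labels r1 to r4, their arrangement into two half-bridges, and all numerical values are assumptions made for illustration; they are not taken from the sensor described here.

```python
def bridge_output(v_excitation, r1, r2, r3, r4):
    """Differential output of an ideal four-element Wheatstone bridge.

    Assumed layout (hypothetical labels): r1 and r2 form one half-bridge
    between VS and VG, r3 and r4 form the other. The output is the
    difference between the two mid-point voltages.
    """
    v_left = v_excitation * r2 / (r1 + r2)
    v_right = v_excitation * r4 / (r3 + r4)
    return v_left - v_right


R0 = 1000.0   # assumed nominal element resistance, ohms
DELTA = 2.0   # assumed resistance change under pressure, ohms
VS = 5.0      # assumed excitation voltage, volts

# Balanced bridge: all elements equal, so the output is zero.
balanced = bridge_output(VS, R0, R0, R0, R0)

# Under pressure: two elements in tension (+DELTA), two in compression (-DELTA).
loaded = bridge_output(VS, R0 - DELTA, R0 + DELTA, R0 + DELTA, R0 - DELTA)

print(f"balanced output: {balanced:.6f} V")
print(f"loaded output:   {loaded:.6f} V")
```

In this idealised model the output scales with the excitation voltage, which is the ratiometric behaviour usually exploited in bridge measurements.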
Before use the sensor typically underwent a calibration process, wherein known pressures were applied and the resultant voltages recorded to establish the relationship. A curve fitting function generated an equation, which was then used by the conditioning electronics to obtain pressure from the measured voltage.
2.2 Measurement Analysis
With this configuration some failure detection was possible. Upper and lower bounds could be placed on each signal based on the expected response to the expected minimum and maximum input signals. Signals outside this range could be treated as erroneous and indicative of potential damage. Additional hardware such as small current sources or pull-up/pull-down resistors on measurement inputs could augment this technique by forcing inputs to extremes when connections were broken. If the measurement setup allowed, the common-mode component of the sensor output could also be measured, though this was rarely used in practice. Combined, these techniques could detect a range of major failures such as open or short circuit connections or large changes in single elements, but smaller changes could go unnoticed. For example, an overpressure event (application of pressure well above the specified limits) could shift the sensor output artificially high; however, this would be indistinguishable from an actual rise in input pressure. Without prior knowledge of the input signals, only the most extreme failures could fall outside the range of possible correct readings. To illustrate, a simulated overpressure event was applied to a computer model of the sensor when measured as described above. The plots in Fig. 4 show the model before and after simulated overpressure damage. It can be seen that only near the upper limit of rated pressure input did any of the measured parameters exceed the limits set at calibration. The pressure signal, however, now indicated approximately 10% higher than the actual value across the input range.
The sensor was specified to give an error of less than 0.5%, so this represents a significant degradation in data quality that could be considered a failure.
Fig. 4. Effect of overpressure on sensor output. Grey planes indicate the upper and lower bounds of signals corresponding to the rated operating pressure and temperature range.
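The limitation of bounds checking can be illustrated with a short Python sketch. It assumes a straight-line calibration fit, a rated range of 0 to 10 bar, a sensitivity of 2 mV/bar and damage modelled as a 10% gain error; all of these are hypothetical simplifications and not the model used to produce Fig. 4.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical calibration data: known pressures (bar) and recorded outputs (mV).
CAL_PRESSURES = [0.0, 2.5, 5.0, 7.5, 10.0]
CAL_VOLTAGES = [0.0, 5.0, 10.0, 15.0, 20.0]

gain, offset = fit_line(CAL_VOLTAGES, CAL_PRESSURES)
V_MIN, V_MAX = min(CAL_VOLTAGES), max(CAL_VOLTAGES)  # bounds set at calibration


def read_pressure(voltage_mv):
    """Convert a measured voltage to pressure using the calibration fit."""
    return gain * voltage_mv + offset


def in_bounds(voltage_mv):
    """Range check: is the signal within the limits set at calibration?"""
    return V_MIN <= voltage_mv <= V_MAX


def damaged_voltage(true_pressure_bar):
    """Assumed damage model: output reads 10% high at 2 mV/bar."""
    return 1.10 * 2.0 * true_pressure_bar


for true_p in (2.0, 5.0, 9.0, 9.5):
    v = damaged_voltage(true_p)
    print(f"true {true_p:4.1f} bar -> reported {read_pressure(v):5.2f} bar, "
          f"within bounds: {in_bounds(v)}")
```

Under these assumptions the range check only trips once the true pressure is close to the top of the rated range, while every reading below that is accepted despite being 10% high.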
3 Proposed Configuration
3.1 Measurement
A new measurement method is presented here, with the aim of increasing the reliability of the sensor through fault tolerance and early fault prediction. The method can be seen in Fig. 5. The resulting voltage measurements allowed for more advanced calculations. The calibration process was similar to that of the typical approach described above. An additional temperature sense element (not shown) was used to compensate for thermal effects, which were less prevalent in the typical approach.
Fig. 5. Block diagram of proposed measurement system
3.2 Measurement Analysis
The new configuration provided multiple pressure signals, offering some redundancy which could be exploited. For example, the pressure signals could be represented as an average and a span, as in Fig. 6.
Fig. 6. Sensor model represented as an average and span.
When the same overpressure event was applied to the model measured in this way, the signal remained largely unchanged; however, the span changed significantly. A greater deviation indicated that the sensor calibration functions were performing poorly and that the data produced was less trustworthy.
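The average and span representation can be sketched in a few lines of Python. The four per-element pressure values and the span limit below are invented purely to illustrate the behaviour described above; they are not simulation results from this work.

```python
def average_and_span(pressures):
    """Summarise redundant pressure signals as an average and a span."""
    average = sum(pressures) / len(pressures)
    span = max(pressures) - min(pressures)
    return average, span


SPAN_LIMIT = 0.05  # bar, hypothetical threshold for trustworthy data

# Hypothetical per-element pressure estimates (bar) at the same true pressure.
healthy = [5.01, 4.99, 5.00, 5.00]
damaged = [5.40, 4.62, 5.01, 4.99]  # assumed: elements shifted in opposite directions

for label, signals in (("healthy", healthy), ("damaged", damaged)):
    avg, span = average_and_span(signals)
    trusted = span <= SPAN_LIMIT
    print(f"{label}: average {avg:.3f} bar, span {span:.3f} bar, trusted: {trusted}")
```

Because the span compares the signals with each other rather than with an absolute limit, it does not depend on knowing the true input pressure.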
4 Discussion
The new method offers the potential to increase sensor reliability through fault tolerance and early detection, allowing early corrective action.
4.1 Fault Identification
Basic processing of the signals allowed the generation of sensor integrity metrics which were to some extent decoupled from the input signal. Analysis of the resulting data was conducted manually, but the true value of this approach is in automatic processing. As a first step the manual process could be replicated using a basic expert system. This could be used by the system to detect easily identifiable failure modes and inform on compensation strategies and the likely impact on performance. More challenging decisions could be approached with a multi-criteria decision making method [10]. This approach appears capable of detecting all of the common failure modes observed without significant complexity, but is ultimately limited by the knowledge of the authors. The exact mechanics of sensor failure are often unknown, but the conditions under which they occur are well documented. With this in mind, a machine learning based approach [11] may be more suitable. For example, a neural network could be trained on data obtained through destructive testing to classify failure types. This task is well suited to the black box function mapping abilities of deep neural networks, and it is likely that the resultant AI would be able to detect subtle features in the signals which are not readily apparent. This approach also simplifies the transition to different sensor technologies, as the data collection and training process can be carried out for each new sensor without the need for the implementers to develop a deep understanding of sensor operation. A number of tools are now available to assist in evaluating neural networks on the low performance processors typically used for onboard signal processing [12–14], making this approach viable to implement in a commercial product.
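As a rough indication of what a basic expert system for this purpose might look like, the Python sketch below applies an ordered list of rules to two integrity metrics. The metric names, thresholds and conclusions are all hypothetical placeholders; real rules would have to be derived from observed failure modes.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical sensor integrity metrics."""
    span: float          # disagreement between element signals, bar
    out_of_range: bool   # any signal outside the bounds set at calibration


# Ordered rules: the first matching condition gives the conclusion.
# Thresholds are illustrative assumptions, not values from this work.
RULES = [
    (lambda m: m.out_of_range, "major fault: signal outside calibrated bounds"),
    (lambda m: m.span > 0.5, "degraded: element signals disagree strongly"),
    (lambda m: m.span > 0.05, "warning: element signals starting to diverge"),
]


def diagnose(metrics):
    """Return the conclusion of the first rule that fires, else 'healthy'."""
    for condition, conclusion in RULES:
        if condition(metrics):
            return conclusion
    return "healthy"


print(diagnose(Metrics(span=0.02, out_of_range=False)))
print(diagnose(Metrics(span=0.78, out_of_range=False)))
print(diagnose(Metrics(span=0.10, out_of_range=True)))
```

Keeping the rules in a plain list makes the knowledge base easy to inspect and extend, which is one reason such a system is a natural first step before a learned classifier.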
4.2 Limitations
When using a Wheatstone bridge sensor outside of its intended configuration, many of the original configuration’s advantages are lost. The ratiometricity inherent in Wheatstone bridge measurements is not used, so precise voltage and current references are required to ensure accurate results. Another advantage of the original configuration is the ability to balance out undesired effects such as thermal change in resistivity. This too is lost, so these effects must be compensated for later. Both of these requirements place a greater burden on the signal conditioning and calibration process; in practical terms this is likely to add cost and complexity to the sensor. Another major limitation of this approach is that the functions used to calculate pressure from element resistance require a reliable temperature signal. The sensor used during testing has a temperature sense element bonded to the diaphragm in a position largely unstressed even during pressure extremes, but this may not always be trusted. Reliance on this single point of failure negates some of the reliability advantages of this approach.
4.3 Further Developments
With better characterisation of the measured sensor it may be possible to identify and compensate for more types of sensor damage. The model of sensor elements could be expanded to enable further analysis or prediction. This could include dynamic as well as static electrical properties, for example modelling the capacitance between gauge and diaphragm to detect gauge delamination. Alternatively, the physical and chemical properties of the sensor could be modelled to predict the rate of aging under extreme conditions. Finally, there is more interdependency between sensor elements that is not well utilized by the methods described here. Perhaps the most significant is the effect of temperature on the pressure measuring elements, presently compensated for by a separate temperature source that is assumed to be reliable. With further work a multi-variable algorithm could be implemented to allow temperature and pressure sensing using the pressure elements alone, reducing reliance on the single temperature element. Trends in time could also be established. Many sensors are affected by long-term drift, which typically cannot be corrected in the sensor because it happens long after the calibration process. The described multi-element measurement approach may be able to detect sensor drift. With sufficient test data, it may even be possible to correct for this long-term drift, or at least to understand how much deviation is acceptable so that trends can be extrapolated to determine suitable replacement intervals.
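One simple way to extrapolate a trend towards a replacement interval is sketched below in Python. It assumes that the span between element signals is logged over time, that drift is approximately linear, and that an acceptable span limit is known; the data and the limit are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(days, spans, limit):
    """Estimate days remaining until the fitted span trend reaches the limit.

    Returns None when the trend is flat or decreasing, since no crossing
    can then be extrapolated.
    """
    slope, intercept = fit_line(days, spans)
    if slope <= 0:
        return None
    crossing_day = (limit - intercept) / slope
    return crossing_day - days[-1]


# Hypothetical log: days since calibration and observed span (bar).
LOG_DAYS = [0, 100, 200, 300]
LOG_SPANS = [0.010, 0.020, 0.030, 0.040]
ACCEPTABLE_SPAN = 0.10  # bar, assumed acceptable deviation

remaining = days_until_limit(LOG_DAYS, LOG_SPANS, ACCEPTABLE_SPAN)
print(f"estimated days until replacement: {remaining:.0f}")
```

A linear trend is only the simplest choice; real drift may be non-linear, in which case a different model would need to be fitted to the test data.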
5 Conclusions
In conclusion, the new method, which allows each element of a resistive strain gauge sensor to be measured individually, providing four separate pressure signals, has the potential to allow more comprehensive and early detection of sensor faults. By comparing multiple signals, the internal relationships of the sensor can be established and monitored to detect changes in the operation of the sensor. The resultant data offers enhanced potential for automated AI analysis, through techniques such as expert systems and artificial neural networks. This analysis can be used to inform decisions about damage compensation strategies and to estimate the impact of damage upon the output signal accuracy. This can enable greater confidence in the output signals or inform condition-based maintenance strategies. Both of these traits are desirable in the deployment and management of large or remote sensor networks and IOT devices.
References
1. Thabet, M., et al.: Management of compressed air to reduce energy consumption using intelligent systems. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys. AISC, vol. 1252, pp. 206–217. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_16
2. Sanders, D.A., Robinson, D.C., Hassan, M., Haddad, M., Gegov, A., Ahmed, N.: Making decisions about saving energy in compressed air systems using ambient intelligence and artificial intelligence. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys. AISC, vol. 869, pp. 1229–1236. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01057-7_92
3. Omoarebun, P.O., Sanders, D., Haddad, M., Hassan Sayed, M., Tewkesbury, G., Giasin, K.: An intelligent monitoring system for a crude oil distillation column. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 159–164. IEEE IS Proceedings Series. IEEE (2020). https://doi.org/10.1109/IS48319.2020.9200175
4. Ikwan, F., et al.: Intelligent risk prediction of storage tank leakage using an Ishikawa diagram with probability and impact analysis. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys. AISC, vol. 1252, pp. 604–616. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_45
5. Sanders, D.: New method to design large scale high-recirculation airlift reactors. J. Environ. Eng. Sci. 12(3), 62–78 (2017). https://doi.org/10.1680/jenes.17.00008
6. Painting, A., Sanders, D.: Disaster prevention through intelligent monitoring. J. Syst. Saf. 52(3), 23–30 (2016). http://www.system-safety.org/jss/
7. Tewkesbury, G.E.: Design using distributed intelligence within advanced production machinery. Ph.D. thesis, University of Portsmouth, UK (1994)
8. Tewkesbury, G., Sanders, D., Strickland, P., Hollis, J.: Task orientated programming of advanced production machinery. In: Dedicated Conference on Mechatronics, 1993, pp. 623–630. Automotive Automation (1993)
9. Ahmad, R., Kamaruddin, S.: An overview of time-based and condition-based maintenance in industrial application. Comput. Ind. Eng. 63(1), 135–149 (2012). https://doi.org/10.1016/j.cie.2012.02.002
10. Haddad, M.J.M., Sanders, D.: Selection of discrete multiple criteria decision making methods in the presence of risk and uncertainty. Oper. Res. Perspect. 5, 357–370 (2018). https://doi.org/10.1016/j.orp.2018.10.003
11. Liang, H., Chen, H., Lu, Y.: Research on sensor error compensation of comprehensive logging unit based on machine learning. J. Intell. Fuzzy Syst. 37(3), 3113–3123 (2019). https://doi.org/10.3233/JIFS-179114
12. X-CUBE-AI – STMicroelectronics. STMicroelectronics (2021). https://www.st.com/en/embedded-software/x-cube-ai.html. Accessed 26 Apr 2021
13. Lai, L., Suda, N., Chandra, V.: CMSIS-NN: Efficient neural network kernels for arm cortex-M CPUs, arXiv, pp. 1–10 (2018)
14. TensorFlow Lite for Microcontrollers. TensorFlow (2021). https://www.tensorflow.org/lite/microcontrollers. Accessed 26 Apr 2021
IoT Computing for Monitoring NFT-I Cultivation Technique in Vegetable Production Manuel J. Ibarra1(B) , Edgar W. Alcarraz1 , Olivia Tapia1 , Aydeé Kari1 , Yalmar Ponce2 , and Rosmery S. Pozo3 1 Universidad Nacional Micaela Bastidas de Apurimac, Abancay, Perú
{mibarra,otapia,akari}@unamba.edu.pe
2 Universidad Nacional José María Arguedas, Andahuaylas, Perú
[email protected] 3 Universidad Tecnológica de Los Andes Abancay, Abancay, Perú
Abstract. This article compares the production and growth times of three types of lettuce in three cultivation systems: NFT-I, RF and soil with worm humus. Additionally, it describes the NFT-I cultivation system, a cultivation technique supported by the Internet of Things (IoT). NFT-I measures and stores three parameters: ambient temperature, pH level and electrical conductivity. Its advantage is that it notifies the farmer about the current status of each variable through the social network Telegram (using bots). The methodology was to start the planting process in the three systems on the same day; the NFT-I system then saved the data read by the sensors, and the time and growth of each planted lettuce were measured later. The results show that this system can reduce electricity consumption by 91.6% and that it helps farmers monitor plant growth. Regarding harvest time, the RF, NFT-I and soil systems were harvested in 61, 69 and 105 days respectively, which shows that RF is the most efficient. In terms of size (number of leaves, length and width), the RF lettuces were also larger than those from the NFT-I and soil crops. Finally, in these times of confinement due to the coronavirus disease (COVID-19), in which the economy has slowed and needs are many, the NFT-I system could help people create their own vegetable growing system quickly and cheaply.
Keywords: Hydroponic · IoT · Automation · Lettuce · Parsley · Vegetables · Raspberry PI · Arduino · COVID-19 · NFT · RF · NFT-I
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 633–647, 2022. https://doi.org/10.1007/978-3-030-82196-8_47
1 Introduction
The traditional production of plants and vegetables is based on planting in soil and, for water, depends on rain; only in some cases is manual or automated irrigation available [1]. The deterioration of the land due to the excessive use of fertilizers is forcing farmers to think about new ways of growing vegetables.
In these times of confinement because of COVID-19, people are restricted from going out and working; therefore, unemployment, tension and nervousness grow every day, and people need other forms of distraction and occupation to reduce stress. Growing a hydroponic crop at home can be a favorable alternative occupation for families.
Hydroponics is a method of growing plants in water using nutrient solutions. It is mostly used for edible leafy plants. There are many hydroponic cultivation techniques; among the most important are the Nutrient Film Technique (NFT) and floating root (RF) [2]. In NFT the roots of the plants are directly in contact with the nutrient solution; the advantage of this technique is that the nutrient solution is collected and supplied again to the roots through the PVC culture channels. Floating root is a technique that uses a rectangular tank of water (like a small pool) with a sheet of Styrofoam board that floats on the water and covers the entire surface of the tank. The roots of the plants are constantly in contact with the water, which must be oxygenated periodically every day, either manually or with air pumps [3].
Hydroponic production is the art of growing plants using water instead of soil. Whether hydroponic production is more efficient than soil-based production has often been questioned. In this regard, there are studies that show that hydroponic cultivation is more efficient than cultivation in soil [4, 5]. An experimental study by Gashgari et al. [4] presented a comparative analysis between plant growth in a hydroponic system and in soil; both production systems received the same germination and growth conditions, and the experiment was carried out on different types of plants for a month. The results show that hydroponic plants germinated and grew faster than soil plants.
For a plant to improve its growth and reduce its production time, it must grow in a natural environment, and several factors must be controlled, such as temperature, pH level, conductivity, air humidity and nutrient solutions (chloride, potassium, calcium, magnesium, bicarbonate, sulfate, ammonia, nitrite, nitrate and others). Each of these factors is difficult to control on a daily basis and considerably influences the quality of the crop [6–8]. There are some techniques to optimize and automate nutrient control [9–12].
In recent years the need has arisen to take advantage of the free space on the roofs of houses to gain space and, above all, to avoid environmental pollution. As the world population grows, so does the demand and need for food products; due to this increasing demand, there may be a food crisis in the coming years [4]. Urban agriculture is a new concept that implements agriculture in the city and can solve part of the problem of food shortages in markets. China has densely populated megacities that will continue to expand in number and size, requiring increased food production. It has been estimated that the total rooftop space in China is about 1 million hectares, which can be used for growing vegetables [13, 14].
A well-known term has been incorporated into urban agriculture: the Internet of Things (IoT). This involves the use of sensors, wired and wireless communication networks, RFID technology, the Internet, data management, architecture, specialized software, etc. [15–19]. Nowadays, farmers need crop information in a consistent and timely way to make decisions immediately; with IoT the farmer can perceive the growing environment and control and monitor it remotely through the infrastructure of an existing network [6, 19, 20].
2 Related Work
In Cusco, Meza M. [21] developed a thesis for his professional title in which he determined the behavior of three hydroponic cultivation techniques (NFT, RF and gravel) with lettuce. The methodology used four treatments (the three cultivation techniques plus a control without nutrients) and three repetitions, giving a total of 12 experimental units. The results were processed using analysis of variance and the Tukey test. They show that the RF cultivation technique gives better results in weight, diameter and root length than the NFT and gravel techniques. Regarding height, the gravel technique showed better results than the other techniques.
In Ecuador, Zambrano O. [2] carried out a thesis for the title of agronomist in which five types of lettuce were cultivated in two hydroponic production systems, NFT and RF, with three repetitions and 30 treatments. The results show that the RF production system gives better results in terms of height, leaf size, number of leaves, plant weight and yield per hectare.
In Argentina, Scaturro [22] carried out a thesis for the title of Agronomist Engineer in which he studied lettuce cultivation in three different systems: hydroponics under cover, hydroponics under controlled conditions and field under cover. After planting, he carried out a weekly check of the study variables. The system under controlled conditions gave the best result, and there was no presence of pests that could harm growth.
This research also incorporates the concept of the Internet of Things (IoT), so some works related to that subject have also been reviewed. In Thailand, Changmai et al. [6] carried out an investigation in which they state that smart agriculture is the application of IoT to the cultivation of plants, with the main objectives of saving labor and resources and of controlling irrigation and fertilization in more detail. They developed smart hydroponics for lettuce, using IoT technology to investigate its benefits compared to regular hydroponic farming. The developed system can monitor the growing environment and automatically adjust the nutrient solution, air temperature and air humidity. The results show that smart farm lettuces have on average about 36.59% more weight, 17.2% more leaves and 13.9% more stem diameter than those grown with normal hydroponics.
In Indonesia, Crisnapati et al. [7] carried out a study on monitoring and collecting information from NFT hydroponic cultivation using IoT, in which the values read by the sensors could be viewed through a web application. The results show that the web system works as intended for monitoring a traditional NFT hydroponic system; however, the nutrient concentration is still controlled manually, which requires one person to do it every day and implies increased costs.
3 Design and Implementation of NFT-I, RF and Earth Cultivation Techniques
3.1 NFT-I System
NFT-I is an autonomous irrigation system based on IoT, which is a variant of the traditional NFT system. The related works mentioned above use the NFT technique. The main characteristic of this type of system is that the tubes must have an incline or slope between 5% and 15% so that the water flows by gravity inside the tube; for a 15% slope the equivalent angle is
$\alpha = \arctan\left(\frac{15}{100}\right) \cdot \frac{180}{3.1415} = 8.53^{\circ}$
However, this generates an additional cost in electrical energy consumption because the pumps must run all the time. Figure 1 shows the traditional NFT system with an inclination of 8.53°. Figure 2 shows the optimized NFT-I system without a tilt angle.
Fig. 1. Traditional NFT system
Fig. 2. Optimized NFT-I system
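The conversion from a slope percentage to the equivalent tilt angle used above can be checked with a short Python sketch. The function name is hypothetical; the calculation is simply the arctangent of rise over run, converted to degrees.

```python
import math


def slope_percent_to_degrees(slope_percent):
    """Convert a slope given as a percentage (rise per 100 units of run)
    to the equivalent angle in degrees."""
    return math.degrees(math.atan(slope_percent / 100))


# Lower and upper ends of the 5% to 15% range quoted for traditional NFT tubes.
for slope in (5, 15):
    angle = slope_percent_to_degrees(slope)
    print(f"{slope}% slope -> {angle:.2f} degrees")
```

For the 15% upper limit this reproduces the 8.53° inclination of the traditional NFT system.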
Specifications for the Construction of the NFT-I System
Table 1 shows the elements of the system. The most important elements are denoted by the letters A, B, C, D, E, F in Fig. 2, and each of them is described below.
Table 1. Elements of the NFT-I system
Letter | Description of functionality
A | Bucket 0.30 m in diameter with a capacity of 20 L, which holds the nutrient solution and the submersible water pump
B | Submersible water pump, whose function is to send the water from ground level to a height of approximately 0.96 m (0.38 m + 0.42 m + 0.16 m)
C | Tube 0.03 m in diameter, which returns the water with nutrients from the tubes to the water bucket
D | Tube 0.08 m in diameter and 1.4 m in length, used for planting lettuces. These tubes have no angle of inclination
E | Water pumping system, built with tubes 0.03 m in diameter, whose function is to carry the water pumped by the pump (B) up to about 0.16 m above the level of the tubes in which the plants are located
F | Holes 0.06 m in diameter, separated by 0.22 m between centers. Each hole holds a special cup that supports a lettuce plant wrapped in a sponge. In total, there are 64 lettuce plants (eight tubes, each with eight holes and their respective plants)
G | Seedling planted in cups (lettuce)
Electronic Devices Used in NFT-I

Various electronic devices were used in the design of the system. Table 2 gives a summary of each of them.

Table 2. Electronic devices
Device | Description
Electric timer | Automates the switching on and off of the submersible water pump
Arduino Uno | Open-source microcontroller board based on the ATmega328P microchip, developed by Arduino
Temperature sensor | Submersible probe for temperature measurement (DS18B20)
pH sensor | Electronic device used to measure the pH of the water that contains the solution that feeds the plant (pH electrode)
Raspberry Pi | Mini-computer that receives the data read by the sensors, saves it in a database and answers queries through a web server
Ultrasound sensor | Measures the distance to an object using sound waves; it is used to measure the volume of water (HC-SR04)
Submersible water pump | Motor that lifts the water to a certain height (SOBO brand, model WP 3550). According to its technical specifications, it can pump water up to 2.8 m, with a flow of 2800 L/h and 25 W of power
Others | Cables, connectors, resistors, LEDs and others
The proposed NFT-I system has the advantage that the water circulates through the tube periodically rather than continuously: for about 10 min every 2 h, in order to save the electrical energy consumed by the motor that drives the water. The nutrient solution therefore remains in the tube for one hour and 50 min. The total volume of the tube is $V_t = 3.1415 \times 0.04 \times 0.04 \times 1.4 \times 1\,000\,000 \approx 7037\ \text{cm}^3$, and about 50% of it stays filled with nutrient solution, $V_n = V_t / 2 \approx 3.5$ liters, so that the root is always in contact with the nutrient solution and can take up the nutrients. Each time the pump raises the water, the nutrient solution begins to circulate again.

Variable Control Through IoT in NFT-I

With NFT-I it is possible to measure the most important variables in the growth of vegetables, namely temperature, pH and electrical conductivity. The control of these variables is thus automated; doing it by hand would require a person and raise the cost of production. There are standard values for the growth of vegetables; for example, Table 3 shows the ranges in which lettuce grows.

Table 3. Range of ideal values for lettuce growth
Variable | Minimum value | Maximum value
pH | 5.5 | 6.5
Electric conductivity | 1.5 | 2.5
Temperature | 15 °C | 25 °C
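The ranges in Table 3 lend themselves to a simple check of each new reading. The sketch below is a hypothetical illustration, not the authors' code: the names `IDEAL_RANGES` and `out_of_range` are invented here, the temperature bounds are read as 15 to 25 °C, and the electrical conductivity unit is left open because the table does not state it.

```python
# Ideal ranges for lettuce, taken from Table 3.
# Assumption: the temperature row is read as minimum 15 C, maximum 25 C.
IDEAL_RANGES = {
    "pH": (5.5, 6.5),
    "EC": (1.5, 2.5),          # unit not stated in the table
    "temperature": (15.0, 25.0),
}

def out_of_range(readings):
    """Return {variable: (value, low, high)} for every reading outside its range."""
    problems = {}
    for name, value in readings.items():
        low, high = IDEAL_RANGES[name]
        if not (low <= value <= high):
            problems[name] = (value, low, high)
    return problems

# Hypothetical readings: only the pH is outside its range.
sample = {"pH": 6.9, "EC": 1.8, "temperature": 21.0}
for name, (value, low, high) in out_of_range(sample).items():
    print(f"ALERT: {name} = {value} is outside [{low}, {high}]")
```

A check of this kind could be run on every sensor reading before deciding whether an alert message needs to be sent.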
The values of the variables read by the sensors can be viewed at any time through Telegram on a mobile device or laptop. Telegram was chosen because it is an open-source social network and allows the creation of a “bot” (a computer program that automatically performs repetitive tasks over the Internet), unlike WhatsApp, which is closed source and does not allow the creation of a “bot”. The program was written in Python and sends the data read by the sensors to the user; if a value goes outside the allowed range, it alerts the user so that the necessary corrective measures can be taken. A snippet of the code is shown in Fig. 3.

    import serial   # pyserial: serial link to the Arduino
    import telebot  # pyTelegramBotAPI: Telegram bot client

    bot = telebot.TeleBot("943769984:AAGOjMs0T4Vu9fuGzUcZ1fVU3YQctbupAE")
    arduino = serial.Serial('/dev/ttyACM0', 9600)  # Arduino on USB, 9600 baud

    # Handle every incoming message: echo it back and log it on the console.
    @bot.message_handler(func=lambda message: True)
    def echo_all(message):
        bot.reply_to(message, message.text)
        print(message.text)
Fig. 3. Telegram.py file
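The snippet in Fig. 3 opens the serial port to the Arduino but does not show how a reading is decoded. As a minimal sketch, assuming the Arduino sends one text line per reading in a `name=value;name=value` layout (this layout and the function `parse_sensor_line` are hypothetical and not taken from the paper), the decoding step could look like this:

```python
def parse_sensor_line(raw: bytes) -> dict:
    """Parse one serial line such as b'pH=6.12;EC=1.85;T=21.4' into floats.

    The 'name=value;...' layout is an assumed wire format for illustration.
    """
    text = raw.decode("ascii", errors="ignore").strip()
    readings = {}
    for field in text.split(";"):
        if "=" not in field:
            continue  # skip empty or malformed fields
        name, _, value = field.partition("=")
        try:
            readings[name.strip()] = float(value)
        except ValueError:
            continue  # ignore values that are not numeric
    return readings

# In the real system the bytes would come from arduino.readline();
# here a literal line stands in for the sensor output.
print(parse_sensor_line(b"pH=6.12;EC=1.85;T=21.4\r\n"))
```

Malformed fields are skipped rather than raising an error, so that one corrupted line on the serial link does not stop the bot.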
3.2 Floating Root System (RF)

The RF system was built as a kind of small pool lined with thick black plastic, on which a thick Styrofoam board with 64 holes was placed. The Styrofoam board floats on the water, which must be oxygenated about every 2 h. This can be done manually, by generating a stream of water or bubbles with a jug, or automatically, by means of a non-submersible water pump motor; see Fig. 4.

Fig. 4. RF system (labels: floating root, nutritive solution)
3.3 Planting System on Land The land planting system is a traditional plant growing system in general. There are several ways to use organic compost, which can be Compost, Worm Humus or sheep guano. Worm castings were used for this project. For the land planting system, a wooden container was designed and lined with thick plastic, then in said container, a layer of worm humus was placed, which has organic and ecological fertilizer, see Fig. 5.
Fig. 5. Land planting system with earthworm humus (labels: lettuce plantation, earth with worm castings)
4 Performed Tests

4.1 The Geographical Context of the Evidence

The tests were carried out in the city of Abancay, in the Apurímac region in the south-central part of Peru, at 2378 m.a.s.l. The temperature ranges between a minimum of 8 °C (June, July and August) and a maximum of 28 °C (September, October and November). The project was tested in April, May, June and July.

4.2 Analysis of the Quality of Drinking Water and Rainwater

Samples of drinking water were taken in a sterile one-liter bottle and sent to the soil laboratory of the Universidad Nacional Agraria La Molina for physicochemical analysis; the nutrient solution was formulated from the data obtained. The analysis showed that the drinking water has a pH of 7.25. The full results are shown in Table 4.

Table 4. Physical-chemical analysis of drinking water
Parameter | Drinking water
pH | 7.25
E.C. (dS/m) | 0.22
Calcium (meq/L) | 1.35
Magnesium (meq/L) | 0.55
Potassium (meq/L) | 0.02
Sodium (meq/L) | 0.40
Sum of cations | 2.32
Nitrates (meq/L) | 0.01
Carbonates (meq/L) | 0.00
Bicarbonates (meq/L) | 2.19
Sulfates (meq/L) | 0.05
Chlorides (meq/L) | 0.20
Sum of anions | 2.45
Sodium (%) | 17.24
SAR (RAS) | 0.41
Boron (ppm) | 0.11
Classification | C1-S1
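The derived rows of Table 4 follow from the individual ion concentrations, which can be verified with a short calculation. The sketch below assumes that the RAS row is the sodium adsorption ratio, computed with the usual formula Na / sqrt((Ca + Mg) / 2) on concentrations in meq/L; the variable names are illustrative.

```python
import math

# Ion concentrations from Table 4, in meq/L.
cations = {"Ca": 1.35, "Mg": 0.55, "K": 0.02, "Na": 0.40}
anions = {"NO3": 0.01, "CO3": 0.00, "HCO3": 2.19, "SO4": 0.05, "Cl": 0.20}

sum_cations = sum(cations.values())
sum_anions = sum(anions.values())

# Sodium as a percentage of all cations.
sodium_percent = 100.0 * cations["Na"] / sum_cations

# Sodium adsorption ratio (assumed meaning of the RAS row).
sar = cations["Na"] / math.sqrt((cations["Ca"] + cations["Mg"]) / 2.0)

print(f"Sum of cations: {sum_cations:.2f} meq/L")
print(f"Sum of anions:  {sum_anions:.2f} meq/L")
print(f"Sodium:         {sodium_percent:.2f} %")
print(f"SAR:            {sar:.2f}")
```

The four computed values agree with the corresponding rows of the table (2.32, 2.45, 17.24 and 0.41).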
4.3 Preparation of the Nutrient Solution for Growing Vegetables

Once the results of the water analysis had been obtained, the Hoagland formulation was adjusted accordingly (Hoagland II-modified). The nutrient solution must contain micro and macronutrients; their types, quantities and units are shown in Tables 5 and 6.

Table 5. Macronutrients
Fertilizer | Quantity | Unit | Observation
Potassium nitrate | 625 | g | Concentrated solution A
Ammonium nitrate | 26 | g | Concentrated solution A
Calcium triple superphosphate | 200 | g | Concentrated solution A
Calcium nitrate | 676 | g | Concentrated solution C
Table 6. Micronutrients
Fertilizer | Quantity | Unit | Observation
Iron chelate | 16.67 | g | Concentrated solution B
Manganese sulphate | 2.78 | g | Concentrated solution B
Boric acid | 1.56 | g | Concentrated solution B
Zinc sulfate | 0.65 | g | Concentrated solution B
Copper sulphate | 0.40 | g | Concentrated solution B
4.4 Planting Period The cultivation process was carried out in three phases: germination in which the seeds begin to germinate; then, the plantation stage in which the plant is transplanted to the three forms of cultivation; finally, the harvest phase, see Table 7.
Table 7. Cultivation phases
System | Germination | Plantation | Harvest | Total days
Floating root | 11-Apr-20 | 5-May-20 | 11-Jun-20 | 61
NFT | 11-Apr-20 | 5-May-20 | 19-Jun-20 | 69
Land | 11-Apr-20 | 5-May-20 | 25-Jul-20 | 105
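The total days in Table 7 are counted from germination to harvest. A small sketch with Python's standard `datetime` module reproduces them; the dictionary layout is only for illustration.

```python
from datetime import date

germination = date(2020, 4, 11)  # same germination date for all three systems
harvests = {
    "Floating root": date(2020, 6, 11),
    "NFT": date(2020, 6, 19),
    "Land": date(2020, 7, 25),
}

# Days elapsed between germination and harvest for each system.
total_days = {system: (harvest - germination).days
              for system, harvest in harvests.items()}

for system, days in total_days.items():
    print(f"{system}: {days} days")
```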
4.5 Analysis of Electrical Energy Consumption with NFT-I

Conserving the environment is an important concern worldwide, and reducing electrical energy consumption is a vital part of it. When the motor runs 24 h a day for 40 days, the consumption is:

$$C_{NFT} = 25\,\text{W} \times \frac{24\,\text{h}}{\text{day}} \times 40\,\text{days} \times \frac{1\,\text{kW}}{1000\,\text{W}} = 24\,\text{kWh}$$

With NFT-I the consumption can be significantly reduced, since the motor is only turned on for 10 min every 2 h, that is, 2 h per day:

$$C_{NFT\text{-}I} = 25\,\text{W} \times \frac{2\,\text{h}}{\text{day}} \times 40\,\text{days} \times \frac{1\,\text{kW}}{1000\,\text{W}} = 2\,\text{kWh}$$
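The two consumption figures can be reproduced from the pump power and the duty cycle. The following sketch uses the values stated in the text (25 W, 40 days, 10 min every 2 h); the function name `energy_kwh` is illustrative.

```python
def energy_kwh(power_w: float, hours_per_day: float, days: int) -> float:
    """Energy in kWh for a load of power_w watts running hours_per_day for days."""
    return power_w * hours_per_day * days / 1000.0

POWER_W = 25   # pump power from the text
DAYS = 40      # cultivation period used in the text

# Traditional NFT: the pump runs continuously.
continuous = energy_kwh(POWER_W, 24, DAYS)

# NFT-I: 10 min every 2 h gives 12 cycles per day, i.e. 2 h of pumping per day.
hours_on_per_day = (24 / 2) * 10 / 60
intermittent = energy_kwh(POWER_W, hours_on_per_day, DAYS)

reduction_percent = 100.0 * (continuous - intermittent) / continuous

print(f"Traditional NFT: {continuous:.1f} kWh")
print(f"NFT-I:           {intermittent:.1f} kWh")
print(f"Reduction:       {reduction_percent:.2f} %")
```

The reduction comes out at about 91.67%, which corresponds to the 91.6% reported in the results.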
5 Results

5.1 Results of Electrical Energy Consumption NFT-I

By growing vegetables with the NFT-I system, electricity consumption was reduced from 24 kWh to 2 kWh, a reduction of 91.6%. Depending on the rate applied in each country per kWh, the economic savings would be considerable for each plantation carried out. For example, in Peru the cost per kWh is 0.6569 soles; applying the charge for public lighting (6.91 soles) and the legal taxes (IGV = 18%) to the consumption difference of 22 kWh, the saving would be 25.2 soles (approximately 7.4 dollars).

5.2 Results of the Operation of the NFT-I System with IoT

The IoT-based NFT-I system worked adequately during the planting of vegetables. The pH, electrical conductivity and ambient temperature values were read adequately, although the equipment must be well calibrated to obtain good results. Alpha and beta software tests were performed. Functional tests were also carried out to check whether the values read corresponded to the real values. At the beginning there were difficulties with the sensors due to poor calibration, as well as synchronization difficulties with the database and with the Telegram social network; these were gradually overcome.
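The saving of 25.2 soles quoted in Sect. 5.1 can be reproduced with a short calculation. The order of operations used below (the public lighting charge is added before the 18% IGV tax is applied) is an assumption that happens to match the published figure; the paper does not spell it out.

```python
RATE_SOLES_PER_KWH = 0.6569   # energy rate in Peru, from the text
PUBLIC_LIGHTING_SOLES = 6.91  # fixed public lighting charge, from the text
IGV = 0.18                    # legal tax rate, from the text

saved_kwh = 24 - 2            # consumption difference between NFT and NFT-I

# Assumed billing order: energy cost plus lighting charge, then tax on the total.
subtotal = saved_kwh * RATE_SOLES_PER_KWH + PUBLIC_LIGHTING_SOLES
saving_soles = subtotal * (1 + IGV)

print(f"Saving per plantation: {saving_soles:.2f} soles")
```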
5.3 Results of Sending Alert Messages to the User by Telegram

The IoT-based NFT-I hydroponic system has sensors that read the values of pH, electrical conductivity and ambient temperature four times a day; these values can be read from anywhere in the world with a simple Internet connection through the social network Telegram. Many tests were carried out, and the system now works properly. In addition, the system has a small camera controlled by a Raspberry Pi 3 mini-computer [23]. The camera was configured to communicate with Telegram, so that the user (farmer) can take a photo whenever he wants and check the growth of the vegetables. These photos could also be analyzed to determine the size of the plant, its color, its maturity and whether it has been invaded by insects that eat the plants, among other aspects (these functionalities are not implemented in this first version). Figure 6 shows an example of the query a farmer makes to see the pH reading on a certain date; for this purpose, the farmer types “/ pH”. In the same way, he can query the values of electrical conductivity, temperature, etc.

Fig. 6. The farmer makes a query about the pH value on his smartphone

5.4 Production Time Results

Three types of lettuce were grown: Black Rose (cherry color), Bohemia (crepe) and Duett (traditional). They were planted in three different types of cultivation system: NFT-I, RF and soil, as shown in Fig. 7, Fig. 8 and Fig. 9.
Fig. 7. IoT-based NFT-I growing system
Fig. 8. RF growing system
Fig. 9. Soil cultivation system with worm humus
As Table 7 shows, the shortest production time was obtained with the RF system, in which the lettuces were ready for consumption in 61 days. Next comes NFT-I, where some lettuces were ready from day 69 while others still had to mature a little. Production on land was the slowest: it took 105 days and the growth was uneven.

5.5 Growth Results

Table 8 shows the growth results, which were measured at harvest time.

Table 8. Growth results
System | Height (cm) | No. of leaves | Length (cm) | Width (cm)
RF | 15.58 | 12.69 | 20.58 | 17.08
NFT-I | 13.75 | 10.88 | 18.69 | 15.42
Land | 13.08 | 10.00 | 18.08 | 14.67
The results show that cultivation in the RF system outperforms NFT-I and land in terms of height, number of leaves, length and width. For example, the average height is 15.58 cm in RF, 13.75 cm in NFT-I and 13.08 cm on land; the average number of leaves is 12.69 in RF, 10.88 in NFT-I and 10.00 on land. Similarly, the lettuces grown in RF are larger in length and width.
6 Conclusion and Future Work This research focused on creating a hydroponic growing system called NFT-I based on IoT, and it was created to help farmers or owners of the hydroponic growing system
to care for or create a hydroponic growing system by using a hardware module that is easily found in the market. NFT-I is an improved version of the traditional NFT hydroponics system: its tubes have no angle of inclination, so that 50% of the nutrient solution is retained in the tube, and the planted vegetables only require 10 min of oxygenation every 2 h, which reduces energy consumption by up to 91.6%. The other important conclusion is that, after comparing three types of lettuce, Black Rose (cherry color), Bohemia (crisp) and Duett (traditional), in three cultivation systems (RF, NFT-I and earth with earthworm humus), the RF cultivation system proved to be the most efficient in cultivation time and in growth, followed by NFT-I and then by cultivation in soil. The analysis was based on the averages of the height, the number of leaves of each lettuce, the length and the width. As future work, it is intended to use a solar panel so that the system is autonomous, works only with sunlight and thus saves on the cost of electricity. Likewise, it is intended to test sending messages to the user or farmer through other means, for example WhatsApp, Facebook, text messages or email. A coupled system will also be added that allows the nutrient solution to be calculated and added automatically using fuzzy logic, artificial intelligence or machine learning techniques. Finally, a study will be made at a medium production scale to see how the system works.

Acknowledgments. Thanks to the Micaela Bastidas National University of Apurímac for supporting the financing of this project, which was the winner of the III contest of basic and applied research projects for teachers funded by the mining canon.
References

1. Ross, N.: Hidroponía: La Guía Completa de Hidroponía Para Principiantes. Babelcube Inc. (2017)
2. Zambrano Mendoza, O.O.: Validación de cinco genotipos de lechuga Lactusa sativa L. cultivados en dos sistemas de producción hidropónica (2016)
3. Perez Reategui, F.I., Perez Reategui, U.F.: Aplicación de software para controlar el balance de la solución nutritiva de un sistema cultivo de lechuga (Lactuca Sativa) bajo técnica de hidroponía automatizada a raíz del monitoreo de nitrógeno, PH y conductividad eléctrica en Pucallpa (2016). http://repositorio.unu.edu.pe/handle/UNU/3888
4. Gashgari, R., Alharbi, K., Mughrbil, K., Jan, A., Glolam, A.: Comparison between growing plants in hydroponic system and soil based system. In: Proceedings of the 4th World Congress on Mechanical, Chemical, and Material Engineering (2018). https://doi.org/10.11159/icmie18.131
5. Samangooei, M., Sassi, P., Lack, A.: Soil-less systems vs. soil-based systems for cultivating edible plants on buildings in relation to the contribution towards sustainable cities. J. Food Agric. Soc. 4, 24–39 (2016)
6. Changmai, T., Gertphol, S., Chulak, P.: Smart hydroponic lettuce farm using internet of things. In: 2018 10th International Conference on Knowledge and Smart Technology: Cybernetics in the Next Decades, KST 2018, pp. 231–236 (2018). https://doi.org/10.1109/KST.2018.8426141
7. Crisnapati, P.N., Wardana, I.N.K., Aryanto, I.K.A.A., Hermawan, A.: Hommons: hydroponic management and monitoring system for an IOT based NFT farm using web technology. In: 2017 5th International Conference on Cyber and IT Service Management (CITSM), pp. 1–6 (2017)
8. Wortman, S.E.: Crop physiological response to nutrient solution electrical conductivity and pH in an ebb-and-flow hydroponic system. Sci. Hortic. 194, 34–42 (2015). https://doi.org/10.1016/j.scienta.2015.07.045
9. Jsm, L.M., Sridevi, C.: Design of efficient hydroponic nutrient solution control system using soft computing based solution grading. In: 2014 International Conference on Computation of Power, Energy, Information and Communication (ICCPEIC), pp. 148–154 (2014)
10. Umamaheswari, S., Preethi, A., Pravin, E., Dhanusha, R.: Integrating scheduled hydroponic system. In: 2016 IEEE International Conference on Advances in Computer Applications, ICACA 2016, pp. 333–337 (2017). https://doi.org/10.1109/ICACA.2016.7887976
11. Yolanda, D., Hindersah, H., Hadiatna, F., Triawan, M.A.: Implementation of real-time fuzzy logic control for NFT-based hydroponic system on Internet of Things environment. In: Proceedings of the 2016 6th International Conference on System Engineering and Technology, ICSET 2016, pp. 153–159 (2017). https://doi.org/10.1109/FIT.2016.7857556
12. Filho, A.F.M., et al.: Monitoring, calibration and maintenance of optimized nutrient solutions in curly lettuce (Lactuca sativa, L.) hydroponic cultivation. Aust. J. Crop Sci. 12(04), 572–582 (2018). https://doi.org/10.21475/ajcs.18.12.04.pne858
13. Liu, T., Yang, M., Han, Z., Ow, D.W.: Rooftop production of leafy vegetables can be profitable and less contaminated than farm-grown vegetables. Agron. Sustain. Dev. 36(3), 1–9 (2016). https://doi.org/10.1007/s13593-016-0378-6
14. Li, B., et al.: Preliminary study on roof agriculture. Acta Agriculturae Zhejiangensis 24, 449–454 (2012)
15. Ray, P.P.: Internet of things for smart agriculture: technologies, practices and future direction. J. Ambient Intell. Smart Environ. 9, 395–420 (2017)
16. Gubbi, J., Buyya, R., Marusic, S., Palaniswami, M.: Internet of Things (IoT): a vision, architectural elements, and future directions. Futur. Gener. Comput. Syst. 29, 1645–1660 (2013)
17. Li, J., Weihua, G., Yuan, H.: Research on IOT technology applied to intelligent agriculture. In: Huang, Bo., Yao, Y. (eds.) Proceedings of the 5th International Conference on Electrical Engineering and Automatic Control. LNEE, vol. 367, pp. 1217–1224. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-48768-6_136
18. Carrión, G., Huerta, M., Barzallo, B.: Internet of Things (IoT) applied to an urban garden. In: Proceedings - 2018 IEEE 6th International Conference on Future Internet of Things and Cloud, FiCloud 2018, pp. 155–161 (2018). https://doi.org/10.1109/FiCloud.2018.00030
19. Pitakphongmetha, J., Boonnam, N., Wongkoon, S., Horanont, T., Somkiadcharoen, D., Prapakornpilai, J.: Internet of things for planting in smart farm hydroponics style. In: 2016 International Computer Science and Engineering Conference (ICSEC), pp. 1–5 (2016)
20. Ruengittinun, S., Phongsamsuan, S., Sureeratanakorn, P.: Applied internet of thing for smart hydroponic farming ecosystem (HFE). In: 2017 10th International Conference on Ubi-media Computing and Workshops (Ubi-Media), pp. 1–4 (2017)
21. Meza Arroyo, M.: Comportamiento de tres técnicas de cultivo hidropónico con lechuga (Lactuca sativa L.) en un sistema acuapónico-Echarati-La Convención-Cusco (2018)
22. Scaturro, G.N.: Evaluación de dos sistemas de producción de lechuga en hidroponia y un cultivo tradicional bajo cubierta (2019)
23. Ibarra, M.J., Huaraca, C., Soto, W., Palomino, C.: MLMS: mini learning management system for schools without internet connection. In: Twelfth Latin American Conference on Learning Technologies (LACLO), pp. 1–7 (2017)
Selective Windows Autoregressive Model for Temporal IoT Forecasting

Samer Sawalha1(B), Ghazi Al-Naymat2,3, and Arafat Awajan3

1 King Hussein School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan
[email protected]
2 College of Engineering and IT, Ajman University, Ajman, UAE
3 King Hussein School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan
Abstract. Temporal Internet of Things (IoT) data is ubiquitous. Many highly accurate prediction models have been proposed in this area, such as Long Short-Term Memory (LSTM), the Autoregressive Integrated Moving Average model (ARIMA), and Rolling Window Regression. However, all of these models employ the direct-previous window of data or all previous data in the training process; therefore, the training data may include various data patterns irrelevant to the current pattern, which reduces the overall prediction accuracy. In this paper, we propose to search the previous historical data for a pattern that is close to the current pattern of the data being processed and then to employ the window of data that follows it in the regression process. We then use Support Vector Regression with a Radial Basis Function (RBF) kernel to train our model. The proposed model increases the overall accuracy of the predicted data because of the high relevancy between the latest data and the extracted pattern. The implemented methodology is compared with other well-known prediction models, such as ARIMA and the rolling window model. Our model outperformed the other models with a Mean Square Error (MSE) of 9.91, compared with 12.02 and 18.79 for ARIMA and rolling window, respectively.

Keywords: Internet of Things · Machine learning · Predictive analytics · Regression · Selective window · Time series forecasting
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 648–657, 2022. https://doi.org/10.1007/978-3-030-82196-8_48

1 Introduction

The Internet of Things (IoT) is an emerging technology that connects digital and smart devices worldwide through the Internet. These devices sense, monitor, and interact with each other to facilitate our daily life [1]. IoT is used in many fields, including the environment, economy, medicine, and many others [2]. The number of connected sensors was expected to reach 50 billion worldwide in 2020 [3]. Each sensor generates a massive amount of data that needs to be appropriately analyzed to help decision-makers take the right decision in a short time. Each sensed IoT value is attached to the date and time at which it was measured, so accumulating a sequence of sensed values generates time-series data. These time-series data can be generated from
various sensors, such as weather sensors, RFID tags, motion sensors, light sensors, pressure sensors, mechanical stress sensors, etc. [4, 5]. The expected economic impact of IoT in 2025 is 6.2 trillion dollars; the largest share of this impact belongs to the healthcare market with 41%, followed by the industry market with 33% and energy with 7%. The rest belongs to other domains such as transportation, security, infrastructure, and agriculture [6]. IoT time-series data can be analyzed and used in business process optimization and knowledge discovery. The analyzed data can reveal operational trends and uncover abnormal events, patterns, changes, etc. [4]. Because of the vast amount of data, the analysis process is considered a significant challenge for IoT technology. IoT data analysis is defined as the process in which various IoT data are examined to reveal trends, unseen patterns and hidden correlations, and to extract new information. IoT data analytics aims to extract knowledge from these data using different Data Mining and Machine Learning techniques that allow the computer to learn without being explicitly programmed. If such patterns are found, the model can successfully predict new values [7]. It can also adapt itself using newly received data to enhance the accuracy of its predictions and decisions [8, 9]. In this research, we focus on forecasting models for time-series IoT data. We propose a novel framework that predicts the next set of values from the previous history by finding the closest pattern in all of the previous data and then training the model on the window of values that follows it, in order to gain a higher prediction accuracy than other prediction models. The main idea of the proposed framework is to train the prediction model by searching the previous historical data for a pattern similar to the last set of values; this increases the overall accuracy of the prediction model.
The rest of the paper is organized as follows: A discussion of the previous studies in Sect. 2. The proposed methodology is presented in Sect. 3. Then the implementation and evaluation are given in Sect. 4, and finally, we conclude our paper in Sect. 5.
2 Previous Studies

Many prediction models have been proposed to solve the problem of forecasting time-series data. The Auto-Regressive Integrated Moving Average model (ARIMA) is considered the most traditional one used for IoT data prediction. It uses the historical time series of a specific sensor to predict the subsequent time-series data [10]. The primary role of the ARIMA model is to describe the autocorrelations in the previous data. The ARIMA model contains two parts, the autoregressive model and the moving average model. The Autoregressive (AR) part predicts a value using a linear combination of past values of the same variable; this part can be used when there is a correlation between the values. The Moving Average (MA) part does not use the previous values to forecast a value. Instead, MA uses the past forecast errors in the regression process, so that the regression can be thought of as a weighted moving average of the past few forecast errors. This part is primarily used when the correlation between the data is low. The major drawback of this model is that the parameters need to be initialized by an expert to obtain highly accurate predictions [11]. ARIMA has an advantage over other similar methods because it determines the proper model to best fit the selected time series; if the data are not correlated, then the
ARIMA uses a constant mean and the predicted value will be around the mean. If there is a correlation in the selected data, then the mean is adjusted to represent this correlation and obtain highly accurate predictions [12]. The ARIMA model breaks the time series into different components, namely trend and seasonality components, and estimates a model for each component. However, it requires an expert in statistics to calibrate the model parameters [13]. The ARIMA model also has difficulty modeling nonlinear relationships between variables. A further limitation is that it only gives accurate results when the standard deviation of the error is constant, which may not be satisfied in practice. A new version of ARIMA was proposed by integrating it with the Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model, which addresses the constant standard deviation problem. However, optimizing the GARCH model and finding its optimal parameters is a challenging problem [14]. The Rolling Window Regression model, sometimes called sliding window regression, is a simple algorithm. Its main idea is to predict the next value by using a set of previous values (a window of values). When the value is predicted, the window is moved to predict the next value using the last predicted value. The prediction process can use different learning methods, including statistical methods such as the average value and machine learning methods such as linear regression. This method gives impressive results without prior experience [15]. It is commonly used because of its simplicity and highly accurate predictions. The values predicted by this model are based on a statistical representation of the historical data. The historical data is divided into the estimation sample and the prediction sample.
The model is then fitted using the estimation sample by applying statistical methods or machine learning algorithms to predict a set of values ahead. The predicted values are then compared with the prediction sample to evaluate the adequacy of the model [16]. The Long Short-Term Memory Network (LSTM) is a Recurrent Neural Network (RNN) that contains recurrent connections. The state from the previous activations of the neuron at the previous time step is used as context for generating the output. Unlike a plain RNN, LSTM has a unique formulation that allows it to avoid the problems that prevent the training and scaling of other RNNs. LSTM can also effectively solve the vanishing gradient problem by introducing a set of memory units. LSTM allows the network to learn when to forget the historical information in the memory unit and when to update the memory unit with new information. Only the current state and some of the previous states are needed to train the network. LSTM can track relationships and dependencies across many time steps [4]. For this reason, LSTM is a widespread technique [17]. LSTM is designed to learn long-term dependencies. Deep learning-based approaches are used in many studies to predict time-series data. Thomas Fischer and Christopher Krauss [18] used different forecasting algorithms, such as gradient-boosted trees, deep learning, and random forest. They found that gradient-boosted trees and random forests outperformed the deep learning-based modeling. They also reported that the training process of the neural network in deep learning algorithms is challenging. Sang Lee and Seong Yoo [19] proposed a framework using the RNN internal layers. They adjusted the threshold levels of the values by internal
layers of RNN; their proposed framework enhanced the overall accuracy of the predicted data. Hyeong Choi et al. [20] proposed a hybrid model for forecasting time-series data; they combined the ARIMA model with the LSTM model to predict stock prices in financial time series. Mahmoud Ghofrani et al. [21] proposed a combination of clustering and a Bayesian Neural Network (BNN): they used K-Means clustering to identify the most appropriate training data and the BNN as the forecasting model, and this combination achieved the highest accuracy in comparison with other models. Rathnayaka et al. [22] proposed a combination of ANN and ARIMA to predict stock prices; they studied the behavior of the values from the pattern of the time-series data. Horelu et al. [23] used Hidden Markov Models (HMM) for values with short-term dependencies, such as temperature patterns, and RNN for values with long-term dependencies, such as wind speed. However, all of the previous models employ the direct-previous window of data or all previous data in the training process; therefore, the training data may include various data patterns irrelevant to the current pattern, which reduces the overall prediction accuracy. The main motivation of this research is to answer the following questions: (1) How can the previous historical data (not only the direct previous window) be used to train a model based on the similarity of patterns, in order to enhance the overall prediction accuracy? (2) How can newly sensed values be used to improve the trained model? (3) How should the best window size be chosen from the current data and the previous historical data? (4) How many values can be predicted?
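The rolling window regression described earlier in this section can be illustrated with its simplest statistical variant, the window average. The sketch below is not taken from any of the cited works; the function name `rolling_window_forecast` and the sample values are hypothetical.

```python
def rolling_window_forecast(history, window, steps):
    """Forecast `steps` values; each one is the mean of the last `window` values.

    Every prediction is appended to the series, so the window rolls forward
    over its own output, as in the rolling window scheme described above.
    """
    if window < 1 or window > len(history):
        raise ValueError("window must be between 1 and len(history)")
    values = list(history)
    predictions = []
    for _ in range(steps):
        next_value = sum(values[-window:]) / window
        predictions.append(next_value)
        values.append(next_value)  # the prediction enters the next window
    return predictions

# Hypothetical sensor readings: forecast two steps with a window of two values.
print(rolling_window_forecast([1.0, 2.0, 3.0, 4.0], window=2, steps=2))
```

Replacing the average with a fitted regression line over the window gives the machine learning variant mentioned in the text; the rolling mechanism stays the same.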
3 Research Methodology
Training on the directly preceding sensed data may produce wrong predictions, especially when the pattern repeats only after a long period. A small training window trains the model on data that are not close to the values to be predicted, which results in wrong predictions. A very large window, on the other hand, covers the necessary data but also unrelated data, which decreases the accuracy in the prediction phase. For this reason, we propose a novel method to forecast time series values based on the historical data of the current sensor node. The proposed method divides the data into two parts: a small validation part and a training part. The validation part is the latest window of values received from the sensor node, immediately preceding the period to be predicted. It serves as a reference that is compared against the previous historical data (the training part) to find the closest pattern, i.e., the one with the lowest Mean Square Error (MSE). This window slides over all the previous historical data to find the set of points that best matches the current sequence of values. After the comparison, a window of values is extracted from the training part; it comprises the closest pattern plus the values that directly follow it, as many as the number of values to be predicted. These extracted points are considered a good portion of data for the training process. Machine learning techniques are applied to these points to build a model and extract the regression line, which is used to predict the next set of values following the current state. We tested various machine learning techniques with the proposed method and found that Support Vector Regression (SVR) obtained the best results and the highest accuracy. The proposed framework is shown in Fig. 1. When a new value or set of values is received from the sensor node, the proposed method updates the predicted values. The training and validation windows are moved by one or more steps to include the new values, and the validation part is used again to find the closest pattern in the training dataset. The newly trained model may use the same portion of data if the sensed value is close to the previously predicted one. If the value predicted by the trained model is far from the newly sensed value, the model automatically retrains itself on another set of values close to the newly received data. This retraining process enhances the prediction accuracy with every new value received. To choose a validation window size that represents the current state, we relate it to the number of values to be predicted: the requested prediction period is taken to represent 30% of the data and the validation part 70%. For example, if the requested prediction period is one month, the validation window size is 70/30 ≈ 2.33 months. In other words, to predict the next month's values, a validation window of 2.33 months is compared with the previous historical data.
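As a quick check of the 30/70 rule, the arithmetic can be sketched in Python. The function name `validation_window_size` is illustrative only and is not part of the original method description; the second call uses the three-month hourly case (2160 values) reported in the implementation section.

```python
def validation_window_size(pred_period: float) -> float:
    """Validation window length under the 30/70 rule.

    The prediction period is taken as 30% of the combined span and the
    validation window as the remaining 70%, so the window is 70/30 times
    the prediction period (in the same unit as the input).
    """
    if pred_period <= 0:
        raise ValueError("pred_period must be positive")
    return pred_period * 70 / 30


one_month = validation_window_size(1)         # unit: months
three_months = validation_window_size(2160)   # unit: hourly values (90 days * 24)
print(round(one_month, 2), round(three_months))
```

The two results, about 2.33 months and 5040 values, agree with the figures quoted in the text.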
Fig. 1. The proposed framework (block diagram; labels: Sensor Node, Sensed Data, Divide, Training Part, Validation Part, Search Best Match, Best Match, Extract Next Window, Train, Regression Function, Trained Model, Generate, Predict, Predicted Values)
The pseudocode of the proposed framework is shown below in Algorithm 1:

Algorithm 1 Predict Next Set of Values
Input:
    PredPeriod: Requested Prediction Period
    NodeDataset: Historical Dataset of a Sensor Node
PredDataset ← [ ]
ValWindowSize ← PredPeriod * 70 / 30
For each new value SVal received from the sensor node do:
    Add SVal to NodeDataset
    ValWindowStart ← SVal.Time – ValWindowSize
    ValWindowEnd ← SVal.Time
    ValDataSet ← NodeDataset [ ValWindowStart : ValWindowEnd ]
    MinError ← ∞
    MinErrorPos ← 0
    For ( i ← 0 ; i ≤ ValWindowStart – ValWindowSize – PredPeriod ; i++ )
        TrainDataSlice ← NodeDataset [ i : i + ValWindowSize ]
        MSE ← Mean ( ( ValDataSet.Value – TrainDataSlice.Value )^2 )
        If ( MSE < MinError )
            MinError ← MSE
            MinErrorPos ← i
    TrainDataEnd ← MinErrorPos + ValWindowSize + PredPeriod
    TrainData ← NodeDataset [ MinErrorPos : TrainDataEnd ]
    PredModel ← Fit ( TrainData.Time , TrainData.Value , algorithm = 'SVR' )
    For ( j ← 0 ; j < PredPeriod ; j++ )
        PredDataset [j] ← Predict ( SVal.Time[j] , PredModel )
    Return PredDataset
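The window-selection step of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the series is a list of values indexed by position rather than by timestamp, the search covers only the history before the validation window and leaves room for the values that follow the match, and the names `mse`, `find_best_window`, and `select_training_slice` are hypothetical. Fitting the regression model (SVR in the paper) on the returned slice is left out.

```python
from typing import List, Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean square error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def find_best_window(history: Sequence[float],
                     val_window: Sequence[float],
                     pred_len: int) -> Tuple[int, float]:
    """Slide val_window over history one step at a time.

    Returns the start index and MSE of the closest historical slice.
    Start positions are limited so that pred_len values still follow
    the matched slice inside the history.
    """
    w = len(val_window)
    last_start = len(history) - w - pred_len
    if last_start < 0:
        raise ValueError("history too short for the requested windows")
    best_pos, best_err = 0, float("inf")
    for i in range(last_start + 1):
        err = mse(history[i:i + w], val_window)
        if err < best_err:
            best_pos, best_err = i, err
    return best_pos, best_err


def select_training_slice(series: Sequence[float],
                          pred_len: int) -> Tuple[List[float], float]:
    """Return the matched pattern plus the pred_len values after it."""
    w = round(pred_len * 70 / 30)   # validation window size (30/70 rule)
    val_window = series[-w:]        # latest values, used as the reference
    history = series[:-w]           # everything before the reference
    pos, err = find_best_window(history, val_window, pred_len)
    return list(history[pos:pos + w + pred_len]), err


# Toy periodic series with period 10; predict 3 values ahead.
series = [t % 10 for t in range(47)]
train, err = select_training_slice(series, pred_len=3)
print(train, err)
```

On this toy series the matched slice reproduces the latest seven values exactly, and the three values that follow it (7, 8, 9) are the true continuation of the pattern, which is the behaviour the method relies on.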
4 Implementation and Results
This section presents an implementation of the proposed method to evaluate its overall accuracy. The dataset used is the historical weather data of the city of Basel, Switzerland [24]. It contains many weather features, including humidity, temperature, wind speed, and cloud cover, sensed every hour from 1/1/1985 to the present. In our experiments, the temperature column is used in the training and prediction processes. First, we preprocessed the dataset: we removed unnecessary columns, combined the date and the time into one column, and normalized the temperature values. The data is divided into three parts: testing, validation, and training. In the prediction (testing) part, we fixed the window to be predicted to three months at hourly resolution (2160 values). The validation window contains 5040 values (the previous section discusses how the window size is chosen); this part is used as the reference that is compared against the history. The last part is the training part, which contains 300263 values; the validation part is compared against it to extract the closest window of consecutive values. After dividing the data, the MSE is calculated for each window of values, starting from 1/1/1985 and ending at the last point before the validation window. The window moves one value at each step. The resulting window contains the 5040 values closest to the validation window (lowest MSE). As mentioned before, some temporal data, such as the weather, can be represented as a repeating pattern; the extracted data therefore has the pattern closest to the validation part. To capture the pattern of the next window of values (the testing part), a window of the same size as the prediction window (in our case 2160 values) is extracted from the values directly following the last point of the extracted window. The result of this step is a set of 7200 values. A machine learning technique is then used to learn from the extracted data; in this implementation, we used SVR with the RBF kernel, which gave the best results compared with other kernels.
The features of the model are the month, day, and hour, and the target is the temperature value. Finally, the resulting model is used to predict the temperature values of the testing window. The mean square error is used to measure the difference between the predicted values and the actual values of the testing window. The accuracy of the proposed method is compared with other temporal prediction techniques. The first is the ARIMA model with P = 0, D = 0, Q = 1, where P is the order of the Auto-Regressive (AR) term, i.e., the number of lags of Y used as predictors; D is the minimum number of differences needed to make the series stationary; and Q is the order of the Moving Average (MA) term, i.e., the number of lagged forecast errors that enter the ARIMA model. The second is the Rolling Window model with a window size of 50. The proposed method is also compared with a general machine learning technique that does not exploit the temporal structure of the data, namely RBF SVR trained on the overall data with gamma equal to 1/(n_features * X.var()). The results are presented in Table 1 and Fig. 2. The proposed method achieved the lowest MSE, 9.9168, followed by ARIMA with 12.018, SVR-RBF with 13.2815, and finally the Rolling Window with 18.79. An MSE of 9.9168 corresponds to a root-mean-square error of about 3.15 degrees (√9.9168) per hourly prediction, which we consider a good result for an hourly forecast.

Table 1. Prediction performance comparison
Model            Parameters                          MSE
ARIMA            P = 0, D = 0, Q = 1                 12.018
Rolling Window   Window size = 50                    18.79
SVR-RBF          Gamma = 1/(n_features * X.var())    13.2815
Proposed Model   Window step = 1                     9.9168

Fig. 2. Prediction performance comparison (bar chart of the Mean Square Error (MSE) for each forecasting model: ARIMA, Rolling Window, SVR-RBF, Proposed Model)

As the results show, training the model on selected historical data (rather than on the directly preceding data) increases the overall forecasting accuracy relative to the compared methods. The model is trained on the closest pattern extracted from the historical data, i.e., the pattern most similar to that of the latest sensed data. The similarity is found by computing the MSE between the window of newly sensed data and each window of the historical data; this window must contain at least one seasonal period.
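The evaluation metric used in this comparison can be sketched as follows. The function names and the toy temperature values are illustrative assumptions; only the MSE of 9.9168 comes from the reported results.

```python
import math
from typing import Sequence


def mean_square_error(actual: Sequence[float],
                      predicted: Sequence[float]) -> float:
    """Average of the squared differences between two sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and equally long")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual: Sequence[float],
                           predicted: Sequence[float]) -> float:
    """Square root of the MSE, expressed in the unit of the data."""
    return math.sqrt(mean_square_error(actual, predicted))


# Toy example: squared errors are 1, 4, 0 and 9, so the MSE is 14 / 4.
actual = [10.0, 12.0, 11.0, 9.0]
predicted = [11.0, 10.0, 11.0, 12.0]
print(mean_square_error(actual, predicted))

# Root of the MSE reported for the proposed model.
print(round(math.sqrt(9.9168), 2))
```

Reporting the root of the MSE alongside the MSE makes the error directly comparable with the temperature values themselves.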
5 Conclusions
Prediction and forecasting are among the main challenges facing IoT data analytics. Prediction is considered a core element of decision-making activities. Each value sensed by an IoT device is attached to time-series information, and in various applications, such as environmental applications, IoT data has periodic patterns that repeat over time. For this reason, we proposed a novel framework that trains the model on the previous pattern that best matches the current (latest sensed) values. The predicted values then follow a pattern very close to that of the previous matching pattern. As shown in the implementation part, the overall accuracy is enhanced compared with methods such as the ARIMA and Rolling Window models, which use the directly preceding data or the whole data in the training process. Predicting hourly values for three months, the proposed method obtained an MSE of 9.9168, compared with 12.018 and 18.79 for the ARIMA and Rolling Window models, respectively.
6 Future Work
Since temporal IoT data forecasting is an active topic with many applications, the proposed method will be extended to use more than one relevant window of data (the closest set of windows) in the learning process to increase the accuracy. The identification of the closest windows will also be evaluated using other techniques, such as correlation measures. More data (small and large datasets) and different learning models will be evaluated.
References
1. Ben-Daya, M., Hassini, E., Bahroun, Z.: Internet of things and supply chain management: a literature review. Int. J. Prod. Res. 57(15–16), 4719–4742 (2019)
2. Zeinab, K.A.M., Elmustafa, S.A.A.: Internet of things applications, challenges and related future technologies. World Sci. News 67(2), 126–148 (2017)
3. Economides, A.: User perceptions of internet of things (IoT) systems. In: Obaidat, M.S. (ed.) ICETE 2016. CCIS, vol. 764, pp. 3–20. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67876-4_1
4. Xie, X., Wu, D., Liu, S., Li, R.: IoT data analytics using deep learning. arXiv preprint arXiv:1708.03854 (2017)
5. Hassan, S.A., Syed, S.S., Hussain, F.: Communication technologies in IoT networks. In: Hussain, F. (ed.) Internet of Things. Springer Briefs in Electrical and Computer Engineering, pp. 13–26. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-55405-1_2
6. Mohammadi, M., Al-Fuqaha, A., Sorour, S., Guizani, M.: Deep learning for IoT big data and streaming analytics: a survey. IEEE Commun. Surv. Tutor. 20(4), 2923–2960 (2018)
7. Wazurkar, P., Bhadoria, R.S., Bajpai, D.: Predictive analytics in data science for business intelligence solutions. In: 2017 7th International Conference on Communication Systems and Network Technologies (CSNT), pp. 367–370. IEEE, November 2017
8. Virmani, C., Choudhary, T., Pillai, A., Rani, M.: Applications of machine learning in cyber security. In: Handbook of Research on Machine and Deep Learning Applications for Cyber Security, pp. 83–103. IGI Global (2020)
9. Marjani, M., et al.: Big IoT data analytics: architecture, opportunities, and open research challenges. IEEE Access 5, 5247–5261 (2017)
10. Zhang, C., Liu, Y., Wu, F., Fan, W., Tang, J., Liu, H.: Multi-dimensional joint prediction model for IoT sensor data search. IEEE Access 7, 90863–90873 (2019)
11. Hyndman, R.J., Athanasopoulos, G.: Forecasting: principles and practice. OTexts (2018)
12. Sen, P., Roy, M., Pal, P.: Application of ARIMA for forecasting energy consumption and GHG emission: a case study of an Indian pig iron manufacturing organization. Energy 116, 1031–1038 (2016)
13. Calheiros, R.N., Masoumi, E., Ranjan, R., Buyya, R.: Workload prediction using ARIMA model and its impact on cloud applications' QoS. IEEE Trans. Cloud Comput. 3(4), 449–458 (2014)
14. Kane, M.J., Price, N., Scotch, M., Rabinowitz, P.: Comparison of ARIMA and random forest time series models for prediction of avian influenza H5N1 outbreaks. BMC Bioinform. 15(1), 276 (2014)
15. Khan, I.A., Akber, A., Xu, Y.: Sliding Window Regression based Short-Term Load Forecasting of a Multi-Area Power System. arXiv preprint arXiv:1905.08111 (2019)
16. Zivot, E., Wang, J.: Rolling analysis of time series. In: Zivot, E., Wang, J. (eds.) Modeling Financial Time Series with S-Plus®, pp. 299–346. Springer New York, New York, NY (2003). https://doi.org/10.1007/978-0-387-21763-5_9
17. Brownlee, J.: Long Short-term Memory Networks with Python: Develop Sequence Prediction Models with Deep Learning. Machine Learning Mastery (2017)
18. Fischer, T., Krauss, C.: Deep learning with long short-term memory networks for financial market predictions. Eur. J. Oper. Res. 270(2), 654–669 (2018)
19. Lee, S.I., Yoo, S.J.: A deep efficient frontier method for optimal investments. In Department of Computer Engineering. Sejong University (2017)
20. Choi, H.K.: Stock price correlation coefficient prediction with ARIMA-LSTM hybrid model. arXiv preprint arXiv:1808.01560 (2018)
21. Ghofrani, M., Carson, D., Ghayekhloo, M.: Hybrid clustering-time series-Bayesian neural network short-term load forecasting method. In: 2016 North American Power Symposium (NAPS), pp. 1–5. IEEE, September 2016
22. Rathnayaka, R.K.T., Seneviratna, D.M.K.N., Jianguo, W., Arumawadu, H.I.: A hybrid statistical approach for stock market forecasting based on Artificial Neural Network and ARIMA time series models. In: 2015 International Conference on Behavioral, Economic and Socio-cultural Computing (BESC), pp. 54–60. IEEE, October 2015
23. Horelu, A., Leordeanu, C., Apostol, E., Huru, D., Mocanu, M., Cristea, V.: Forecasting techniques for time series from sensor data. In: 2015 17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pp. 261–264. IEEE, September 2015
24. Weather history download Basel. https://www.meteoblue.com/en/weather/archive/export/basel_switzerland_2661604
Distance Estimation Methods for Smartphone-Based Navigation Support Systems
Bineeth Kuriakose, Raju Shrestha, and Frode Eika Sandnes
Department of Computer Science, Oslo Metropolitan University, Oslo, Norway
[email protected]
Abstract. Distance estimation is a key element of a navigation system. Various methods and instruments are used in distance estimation procedures. The methods and instruments used usually depend on the contexts of the application area. This paper compares the accuracy of five practical distance estimation methods that can be used on portable devices. Some of the methods selected for this study have currently not yet been used in the context of navigation systems. The experimental results show that Rule 57 and AR based distance estimation methods hold great potential for the practical application of navigation support as they provide adequate accuracy and computational efficiency.
Keywords: Distance estimation · Object detection · Computer vision · Navigation · Smartphones · Assistive technology
1 Introduction
Navigation involves monitoring or controlling the movement of a vehicle or person, or any machine from one location to another through an environment with constraints and obstacles. Much research and development have recently been reported in the domain of navigation systems. The advancements in the field of computer vision and machine learning have probably contributed to the accelerated developments in this area. Autonomous cars [12,42], robotic navigation [5,43], navigation of people with or without accessible needs [10,13,14] are some of the areas receiving research attention. In addition to obstacle (object) detection, distance estimation is also a key component in navigation systems [57]. Obviously, information about how far an object is from the viewer can be used to avoid collisions during navigation. Distance estimation methods can be classified as active methods and passive methods [54]. Active methods send signals to the obstacle, and based on the time the signal takes to reach the obstacle and bounce back, the distance to the obstacle is estimated [41,44]. Such methods may use laser beams [2,51], ultrasound [31,40], or radio signals [8,36]. Passive methods estimate the distance by receiving information about the object’s position by applying computer vision techniques on camera images. Passive methods can be based on monovision (single camera) [4,27] or stereo vision systems (double cameras separated by a small distance) [32,50].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 658–673, 2022. https://doi.org/10.1007/978-3-030-82196-8_49
The primary concern of this work is the use of distance estimation methods in portable navigation support systems that assist people with visual impairment. According to the World Health Organization (WHO), more than 2.2 billion people worldwide have a vision impairment or can be classified as technically blind¹. People with visual impairments often experience challenges during navigation. However, tools such as white canes and guide dogs give the users more freedom to navigate independently. WHO uses assistive technology as an umbrella term to refer to both systems and services for people with reduced functioning. The primary purpose of assistive devices and technologies is to maintain or improve independence and dignity, to facilitate participation, and to enhance overall well-being². Portability is one of the vital requirements in navigation systems for people with visual impairment [35]. Portability is mainly characterized by small physical dimensions and low mass. Portable systems provide comfort and convenience during carriage and usage. When prioritizing portability, smartphones emerge as a feasible and practical technological platform. Several works have reported on how smartphones can be used in navigation systems for people with visual impairments [19,22,34,55]. Moreover, one should use the functionalities available on smartphones and avoid peripheral hardware, such as bulky cameras or add-on sensors, that increases the overall system dimensions and mass [34]. Some studies have also reported on the use of various sensors for navigation distance estimation [6]. The main objective of this work was to evaluate five distance estimation methods in the context of a portable navigation system for people with visual impairments. The methods were chosen based on their potential to be used with smartphones. Even though many works have been reported on object distance estimation in general, few studies have specifically addressed distance estimation in the context of smartphone-assisted navigation for blind users. To the best of our knowledge, this is one of the first attempts at experimenting with and analyzing established object distance estimation methods in the problem domain of smartphone-based navigation support. The paper is organized as follows. Section 2 discusses distance estimation in general and the major components involved. Section 3 describes the five distance estimation methods that we considered for our experiments. Section 4 describes the experimental procedures. The results are presented and discussed in Sect. 5. The paper ends with the conclusion in Sect. 6.
2 Distance Estimation
Many methods have been proposed for estimating the distance between an object (or obstacle) and the viewer. The distance estimation process in navigation usually consists of object detection and computation of the distance to the detected objects. These are described in the following subsections.
¹ https://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment
² https://www.who.int/news-room/fact-sheets/detail/assistive-technology
2.1 Object Detection
Object detection is the task of detecting instances of objects of interest in an image. Challenges such as partial or full occlusion, varying illumination conditions, poses, and scale need to be handled while performing object detection [21,26]. As an essential navigation component, object detection is sometimes considered a pre-phase of the distance computation procedure. Object detection methods, in general, use machine learning and deep learning algorithms. A typical object detection pipeline contains a Convolutional Neural Network (CNN), a type of feed-forward neural network that works on the principle of weight sharing. Benchmark datasets such as MS COCO [37] and ImageNet [11] make deep learning-based object detection a preferable choice among developers [26]. In addition to object detection models that are computationally expensive, various lightweight models intended for mobile devices are available; You Only Look Once (YOLO) [45–47] and Single Shot Detector (SSD)-MobileNetV1 [24] are some examples. YOLO divides each image into a grid, and each grid cell predicts bounding boxes of detected objects with confidence values. The SSD architecture is a single convolutional network that learns to predict bounding box locations and classify them in a single pass [38]. SSD combined with MobileNet as its base network gives better object detection accuracy than other similar deep learning models [33]. This study uses the SSD-MobileNetV2 [38,48] object detection model as a pre-phase to the distance estimation methods. The main reasons for this choice are portability and usability. Moreover, SSD-MobileNet has the highest mean average precision (mAP) among the models that facilitate real-time processing.
Although the latest version of the series, MobileNetV3, offers higher accuracy and speed than MobileNetV2 in general classification tasks, MobileNetV2 provides higher performance in object detection tasks [23]. In the SSD-MobileNetV2 model, input image features are extracted by the CNN layers of MobileNetV2, and SSD predicts the obstacles based on the feature maps. The MobileNetV2 architecture is shown in Fig. 1. MobileNetV2 is based on an inverted residual structure with residual connections between the bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of nonlinearity. The MobileNetV2 architecture contains an initial fully convolutional layer with 32 filters, followed by 19 residual bottleneck layers. ReLU6 is used as the nonlinearity because of its robustness when used with low-precision computation [48]. After the extraction of basic features, several layers of depthwise separable convolution generate feature maps with decreasing scales. SSD operates on these multiscale feature maps to predict objects at multiple scales [38]. Each feature map is evenly divided into cells; every cell predicts k bounding boxes and c category confidences. For each category, the top n bounding boxes are retained. Non-maximum suppression is then performed to filter out bounding boxes with considerable overlap, and finally the detection results are output. The model output comprises the detected objects along with their bounding box coordinates.
Fig. 1. The architecture of MobileNetV2 (adapted from [48]).
2.2 Distance Computation
Several distance computation methods have been proposed in the literature. Some require object detection as a prerequisite to compute distance, while others compute distance based on information from specific sensors. Examples of such sensors include stereo cameras [9,28], monocular cameras [7,52], ultrasonic sensors [15,56], Light Detection and Ranging (LiDAR) [25], and Time of Flight (ToF) sensors [16]. With the stereo vision method, two cameras separated by a certain distance capture two images of a scene from two slightly different vantage points, which are used to calculate their disparity. This disparity helps to estimate the depth, enabling projection of the scene to a 3D world that can be used for navigation [9,28]. With monocular vision-based distance estimation, the image captured by a single camera is used to compute the distance. The distance is estimated either with a traditional distance estimation model, such as the pinhole imaging model, or with deep learning-based methods [7,52]. Ultrasonic distance sensors are also widely used to measure the distance between the source and the target using ultrasonic waves [15,56]. In robotics applications and autonomous vehicles, LiDAR is commonly used to measure the distance by illuminating the target with laser light and measuring the reflection with a sensor. LiDAR-based systems commonly use the Simultaneous Localization and Mapping (SLAM) technique, which builds a map representing the spatial environment while keeping track of any robot or vehicle within the map of the physical world [25]. Time-of-flight (ToF) sensors are also widely used in range imaging camera systems. A ToF sensor computes the distance between the camera and the object for each point of the image by measuring the round-trip time of an artificial light signal from a laser or an LED [16]. Besides sensors, different computational methods can be used to compute the distance.
Optical methods [30] and Rule of 57 [1] are examples of methods that do not require any hardware besides a smartphone. Therefore, the purpose of this work was to compare these methods. These methods are described in detail in the next section.
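Two of the sensor principles summarized above reduce to one-line formulas, sketched here for orientation. These are the standard textbook relations (round-trip time for ToF, LiDAR, and ultrasound; disparity for a rectified stereo pair), not code or parameters from the paper, and the function names and example numbers are assumptions.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0  # exact by definition of the metre


def tof_distance(round_trip_s: float,
                 wave_speed_m_s: float = SPEED_OF_LIGHT_M_S) -> float:
    """Distance from a round-trip time.

    The signal travels to the target and back, so the one-way
    distance is half of speed * time.
    """
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    return wave_speed_m_s * round_trip_s / 2.0


def stereo_depth(focal_px: float, baseline_m: float,
                 disparity_px: float) -> float:
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# A light pulse returning after 20 ns: roughly 3 m.
print(tof_distance(20e-9))
# An ultrasonic echo after 10 ms, assuming sound at 343 m/s in air.
print(tof_distance(0.01, wave_speed_m_s=343.0))
# Focal length 700 px, baseline 0.1 m, disparity 35 px.
print(stereo_depth(700.0, 0.1, 35.0))
```

Both relations need dedicated hardware (an emitter with a timer, or a second camera), which is why the methods compared in the next section rely on other cues.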
3 Distance Estimation Methods
Among the distance estimation methods described in Sect. 2, we selected five that can be used on a smartphone without additional hardware and without a prohibitive computational cost. Some of these methods are currently used in scenarios other than navigation support; in this work we explored how to adapt them for use on a smartphone. To ensure an unbiased assessment, SSD-MobileNetV2 was used for object detection with all distance estimation methods.
3.1 Optical Method
The relationship between the object distance and the image distance is given by the lens maker's equation [30],
$$\frac{1}{f} = \frac{1}{d_o} + \frac{1}{d_i} \tag{1}$$
where $f$ is the focal length of the camera lens, $d_o$ is the distance from the lens to the target object, and $d_i$ is the distance between the lens and the projected image. The following expression is used to compute the distance ($d_o$) from the object's bounding box [30],
$$\text{distance (in inches)} = \frac{2 \times \pi \times 180}{w + h \times 360} \times 1000 + 3 \tag{2}$$
where $w$ and $h$ are the width and height of the bounding box of the object detected by the object detection model. We refer to [53] for details on the derivation of the expression.
3.2 Smartphone Position Sensors Based
Most of today’s smartphones come with two types of position sensors that can help determine a device’s position: the geomagnetic field sensor and the accelerometer. These position sensors can be used to determine the physical position in the world’s frame of reference. The combination of the geomagnetic field sensor with the accelerometer can be used to determine a device’s position relative to the magnetic north. These sensors can also be used to determine the device orientation with respect to the frame of reference defined in the app. Both the Android and the iOS platforms support specific functions to access these sensor data [17,29,49].
If the angle between the camera and the object is $a$, then by the right-angled triangle property, the tangent of $a$ gives the distance $d$ between the camera and the object,
$$d = h \times \tan a \tag{3}$$
where $h$ is the height of the camera above the base and $d$ is the distance from the object to the camera. The angle can be estimated using the sensors present in the smartphone. On the Android platform, this sensor data can be accessed through the sensor framework [18]. Using sensors such as the accelerometer and the magnetometer, it is possible to find the angle between the object and the phone camera [17,29,49]. The implementation details for Android are as follows. From the accelerometer and magnetometer values, a rotation matrix is computed; it maps a point in the phone coordinate system to the real-world coordinate system. The orientation matrix is derived from the rotation matrix, and from it the pitch and roll are obtained. Using the pitch or the roll, depending on whether the phone is in portrait or landscape mode, the distance is estimated as
$$d = h \times \tan\big((\text{pitch} \mid \text{roll}) \times \pi/180\big) \tag{4}$$
where $h$ denotes the height of the camera from the base in meters; in our case, it was set to 1.4.
3.3 Augmented Reality-Based Methods
Augmented reality (AR) can be described as an enhanced version of the real physical world, achieved through digital visual elements, sound, or other sensory stimuli delivered using technology. Several commercial stakeholders such as Google and Apple are incorporating AR technology into smartphones for multiple applications. ARCore is Google's AR developer platform, which provides simple but powerful tools for creating AR experiences [3]. The com.google.ar.core package helps to design applications that can determine the distance from a smartphone's camera to an object. The Anchor class in the same package describes a fixed location and orientation in the real world. To stay at a fixed location in physical space, the numerical description of that position is updated as ARCore's understanding of the space improves. A limitation of ARCore is that it works only on ARCore-compatible devices³.
³ https://medium.com/@shibuiyusuke/measuring-distance-with-arcore-6eb15bf38a8f
ARCore can create depth maps containing data about the distance between surfaces from a given point, using the primary RGB camera of a supported device. ARCore uses the Simultaneous Localization and Mapping (SLAM) technique to understand where the phone is relative to the world around it. It detects visually distinct features in the captured camera image, called feature points, and uses them to compute its change in location. The visual information is combined with inertial estimation from the device's IMU to estimate the camera's pose (position and orientation) relative to the world over time. Using ARCore, it is possible to place an anchor, a fixed location in the real world, and find the camera's distance to the anchor. Both the anchor position and the camera position can be acquired as x, y, and z values (width, height, and depth) corresponding to the world position of objects in the ARCore package [39]. Once the two positions are known, it is straightforward to calculate the Euclidean distance between them.
3.4 Method Based on Rule of 57
The Rule of 57 states that an object with an angular size of 1° is about 57 times farther away than it is big (see Fig. 2). The rule follows from the fact that the ratio of an object's angular size (in degrees) to a whole 360-degree circle equals the ratio of the object's actual size to the circumference of a circle at that distance from the observer. This method was derived for measuring distances and angles from telescope images in astronomy [1]. The key to using telescope images to measure distances is to realize that an object's apparent angular size is directly related to its actual size and its distance from the observer: an object appears smaller the farther it is from the observer. In our experiments, we found that the rule can also be applied to find the distance to an object whose angular size within the field of view of the smartphone camera sensor is more than 1°.
Fig. 2. The rule of 57 (adapted from [1]).
From Fig. 2, we can write
\[ \frac{\text{angular size}}{360^\circ} = \frac{\text{actual size}}{2\pi D} \tag{5} \]
Using object distance for D and object size for actual size, and noting that 360/(2π) ≈ 57, we get the following equation to calculate the distance:
\[ \text{object distance} = (\text{object size}) \times \frac{1}{(\text{angular size in degrees})} \times 57 \tag{6} \]
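As a small worked sketch of Eq. (6), the Python functions below compute the distance from a known object size and a measured angular size. The function names and the sample values are illustrative assumptions, not the app's code; the second function keeps the unrounded factor 360/(2π) from Eq. (5) so that the effect of rounding it to 57 can be seen.

```python
import math


def distance_rule_of_57(object_size, angular_size_deg):
    """Eq. (6): distance = object size * 57 / angular size in degrees."""
    if angular_size_deg <= 0:
        raise ValueError("angular size must be positive")
    return object_size * 57.0 / angular_size_deg


def distance_from_ratio(object_size, angular_size_deg):
    """Eq. (5) solved for D, with the unrounded factor 360 / (2 * pi)."""
    if angular_size_deg <= 0:
        raise ValueError("angular size must be positive")
    return object_size * 360.0 / (2.0 * math.pi * angular_size_deg)


# Hypothetical example: a 1.75 m tall object spanning 20 degrees.
size_m, angle_deg = 1.75, 20.0
print(distance_rule_of_57(size_m, angle_deg))
print(distance_from_ratio(size_m, angle_deg))
```

Both results are in the same unit as the object size. Because Eq. (5) treats the object as an arc of a circle, the estimate is an approximation that is tightest for small angular sizes.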
To use this approach, an estimate of the object's size is needed before the distance to it can be found. For our experiment, we measured the size (height) of each object. The angular size is measured with the geomagnetic field sensor and the accelerometer [17,29,49]. Both sensors are used in a manner similar to that described in Sect. 3.2.
3.5 DisNet Method
DisNet uses a multi-layer neural network to estimate distance [20]. The method can be used to learn and predict the distance between an object and the camera sensor. A six-dimensional feature vector $v$, obtained from the bounding box of the object detected by the object detection model, is used as input to DisNet:
\[ v = [\, 1/B_h \;\; 1/B_w \;\; 1/B_d \;\; C_h \;\; C_w \;\; C_b \,] \tag{7} \]
where $B_h$, $B_w$ and $B_d$ denote the height, width, and diagonal of the object bounding box in pixels/image, respectively, and $C_h$, $C_w$ and $C_b$ represent the average height, width, and breadth of an object of the particular class. For example, for the class person, $C_h$, $C_w$ and $C_b$ are 175 cm, 55 cm and 30 cm, respectively. These values were chosen based on an average-case assumption. The features $C_h$, $C_w$, and $C_b$ are assigned to objects labeled by the SSD + MobileNetV2 detector as belonging to the particular class, and provide additional information for distinguishing different objects. Finally, the DisNet model outputs the estimated distance of the object from the camera sensor.
In the original DisNet work [20], YOLO was used as the object detector. In our work, however, we used the SSD + MobileNetV2 object detection model, which is common to all the methods described above. An illustration of how the model works is given in Fig. 3. The SSD + MobileNetV2 model was pretrained on the COCO dataset [37]. We applied transfer learning to train additional classes such as person, bag, and chair. The images were collected using a smartphone camera. For the DisNet model, the distance to the objects (classes) must be collected in addition to the images. For the class person, the dataset was already available from the reference paper [20]. For the other two classes (chair and bag), the ground-truth distance to the camera was measured and recorded along with the images. To train the network, the input dataset was randomly split into a training set (80% of the data), a validation set (10%), and a test set (10%).
After calculating the input vector, the DisNet model was trained using the custom dataset. The output of the model gives the distance of the object from the camera sensor.
Fig. 3. The DisNet based distance estimation (adapted from [20]).
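The construction of the input vector in Eq. (7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the code used in the study: the function and dictionary names are hypothetical, and "pixels/image" is read here as the bounding-box dimension divided by the corresponding image dimension. Only the class dimensions for person are given in the text, so no other class is filled in.

```python
import math

# Average class dimensions in cm (height, width, breadth). Only the
# "person" values are stated in the text; other classes are left out.
CLASS_DIMENSIONS_CM = {"person": (175.0, 55.0, 30.0)}


def disnet_features(box_w_px, box_h_px, image_w_px, image_h_px, class_name):
    """Build the six-dimensional vector of Eq. (7) for one detection."""
    # Assumption: "pixels/image" means box size divided by image size.
    b_h = box_h_px / image_h_px
    b_w = box_w_px / image_w_px
    b_d = math.hypot(box_w_px, box_h_px) / math.hypot(image_w_px, image_h_px)
    c_h, c_w, c_b = CLASS_DIMENSIONS_CM[class_name]
    return [1.0 / b_h, 1.0 / b_w, 1.0 / b_d, c_h, c_w, c_b]


# Hypothetical detection: a 100 x 200 px box in a 400 x 800 px image.
print(disnet_features(100, 200, 400, 800, "person"))
```

The inverse box dimensions grow as the object shrinks in the image, which is the cue the network can relate to distance, while the class dimensions tell it how large the object is in the real world.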
4 Experiment
An Android app was developed and deployed on a Huawei P30 Pro smartphone to assess the performance of each method. One reason for selecting this smartphone is that AR-based distance estimation requires an AR-enabled device. The independent variables were the ground-truth distance and the object size. The observed distance and the computational load were the dependent variables. We also evaluated how a moving camera affects distance estimation, and how object size, distance, and accuracy are related for each method.
Four types of objects were used in our experiments. They were selected for their varying physical sizes: a bottle, a bag, a chair, and a person. This selection was made to understand the effect of object size on each distance estimation method. We use the term distance marker to refer to ground-truth values. Different distance markers were chosen to analyze each estimation method at various distances. The distance markers were placed at four spots (very near: 1 m, near: 2 m, medium: 5 m, far: 10 m) away from the observation point.
In the first round of the experiment, the bag was placed at the 1 m marker, and we attempted to measure the distance using the five methods. However, none of the methods could estimate the distance with the object at the 1 m marker. Next, the object was placed 2 m away. Again, the five methods were used, and the measurements were recorded. The next step involved measuring the distance at 5 m. Finally, for the 10 m case, we were unable to estimate the distance to the object. We repeated the same procedure with the other objects (chair and person) and observed the same behavior as with the bag.
We were unable to measure the distance at 1 m and 10 m but were able to estimate the distance when the object was placed at 2 m and 5 m. The possible causes are explained in the discussion section.
To identify whether size was a factor in distance estimation, we placed a smaller object, the bottle, at the 1 m marker. Surprisingly, this time we were able to measure the distance with four methods; only the DisNet method failed in the 1 m case. However, at the other distance markers (2 m, 5 m, 10 m), we were unable to find the distance to the bottle.
To identify the maximum range of the methods, we moved the object from 10 m toward the camera. We observed that distance estimation was not possible beyond 5 m. We therefore concluded that the maximum distance that can be estimated with the methods studied is 5 m. All the experiments were conducted in a controlled indoor environment (a room) during midday.
5 Results and Discussions
5.1 Results
We took five samples using each method and then calculated the mean and standard deviation of the values for each object. The results of the experiments with different objects using the different distance estimation methods are given in Table 1. The results are also shown graphically in Fig. 4.
Table 1. Mean and standard deviation of the estimated distances of the different objects for three different distance markers using the five methods.
The 1 m case applies only to the smallest object, the bottle. The 10 m case is out of range, since we were unable to obtain any estimates. At the 5 m distance marker, the estimates varied for all objects and for each distance estimation method.
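The summary statistics used here (the mean and standard deviation in Table 1, and the standard errors shown as error bars in Fig. 4) can be computed from five samples as sketched below in Python. The readings are hypothetical placeholders, not values from the experiment, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)  # n - 1 in the denominator
    sem = std / math.sqrt(len(samples))
    return mean, std, sem


# Hypothetical readings in metres for one object at the 2 m marker.
readings = [1.9, 2.1, 2.0, 2.2, 1.8]
mean, std, sem = summarize(readings)
print(f"mean={mean:.3f} m, std={std:.3f} m, sem={sem:.3f} m")
```

With only five samples per condition, the standard error is roughly 45% of the standard deviation (a factor of 1/√5), which is worth keeping in mind when comparing the table with the error bars.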
Fig. 4. Mean distances (error bars show the standard errors), estimated with the five distance estimation methods.
5.2 Discussion
From Fig. 4, it is clear that the deviation from the ground truth is smallest with the ARCore method when the object is placed 1 m away. However, when the object is placed either 2 m or 5 m away, the Rule of 57 method gives the most accurate results. The largest variation is observed with the smartphone sensors method when the object is placed at the 1 m distance marker. We also considered how various factors can affect distance estimation. They are discussed below.
Effect of Distance: When we compared different objects (bag, chair, person), the distance estimation methods showed fair results up to 4 m. However, the results varied when the object was placed more than 4 m away. Hence, we took five samples and computed the mean and standard deviation when the object was placed at the 5 m marker. When the object was placed more than 5 m away, no distance estimate could be obtained.
Size of Object: The size of the object is also an important parameter that affects distance estimation. The experiment showed that the selected methods could not estimate the distance when the object was small or placed far away. However, we ran a small experiment to see how well the methods perform when a small object is placed 1 m away. Testing with the bottle shows that distance estimation is possible within 1 m. When we placed the object farther away than that, none of the distance estimation methods produced any results.
Moving Camera: We also tried to estimate the distance with a moving camera. For this, we placed the object at a distance of 5 m and marked a path with distance markers 1 m and 2 m away from the starting point (the initial position of the camera). We moved with the camera to each of those distance markers, parallel to the object (left to right). We observed fluctuations in the distance estimates while the camera was moving, but estimation was still possible. However, some methods, such as the AR-based and smartphone sensor-based methods, require the camera to focus before the object's distance can be estimated.
We assume there are several reasons for these observations. The size of the camera sensor is one factor that can affect the estimation for distant objects and small objects. Since we tested all methods with a smartphone camera, its limitations may affect the detection of distant objects. As a result, none of the methods considered in this study can be used for long-range applications. A long-range camera could probably overcome this limitation. However, since we focus only on portable navigation solutions, long-range cameras, which would increase the system weight, are not suitable in our case. When the object is small and kept far away, it is not easily detected by the methods described in the experiment. Another factor is the lighting of the environment. Since the experiment was done in an indoor setting, the effect of lighting should be considered as well. Furthermore, the smartphone used for the experiment was held in bare hands without any stabilizing device. Therefore, camera shake could have affected the object detection.
This might be one reason for the varying observations at the same distance marker. However, it should also be noted that the smartphone we used for the experiment has good video image stabilization.⁴
AR-enabled smartphones available today are already used to estimate various metrics, such as an object's length. The AR features of smartphones can be further enhanced and used in applications such as navigation to assist people. Moreover, the distance estimation methods explored in this work, although tested on smartphones, can be used on other portable or miniature computing devices, such as the Raspberry Pi, to develop applications other than navigation.
6 Conclusion
This study analyzes different distance estimation methods and their performance. We conducted a controlled and structured experiment on how a smartphone can be used in distance estimation tasks without any additional hardware. Our findings reveal which distance estimation method is appropriate for short-range navigation applications. The Rule of 57 distance estimation method holds great potential. The AR-based method could also be considered a viable alternative, though it requires an AR-compatible device. Moreover, the results show that none of the methods is suitable for long-range applications beyond 5 m. We believe that this study could help developers and researchers make informed choices of technologies when designing systems involving distance estimation. Future work involves using the results of this research in the development of a smartphone-based navigation system for people with visual impairments.
⁴ https://www.dxomark.com/huawei-p30-pro-camera-review/
References
1. Measuring size from images: a wrangle with angles and image scale, November 2012. https://www.cfa.harvard.edu/webscope/activities/pdfs/measureSize.pdf. Accessed 01 Oct 2020
2. Aghili, F., Parsa, K.: Motion and parameter estimation of space objects using laser-vision data. J. Guid. Control Dyn. 32(2), 538–550 (2009)
3. Google ARVR: Build new augmented reality experiences that seamlessly blend the digital and physical worlds, June 2016. https://developers.google.com/ar. Accessed 5 Nov 2020
4. Celik, K., Chung, S.J., Somani, A.: Mono-vision corner SLAM for indoor navigation. In: 2008 IEEE International Conference on Electro/Information Technology, pp. 343–348. IEEE (2008)
5. Haiyang Chao, Y.G., Napolitano, M.: A survey of optical flow techniques for robotics navigation applications. J. Intell. Robot. Syst. 73(1–4), 361–372 (2014)
6. Chen, S., Fang, X., Shen, J., Wang, L., Shao, L.: Single-image distance measurement by a smart mobile device. IEEE Trans. Cybern. 47(12), 4451–4462 (2016)
7. Chenchen, L., Fulin, S., Haitao, W., Jianjun, G.: A camera calibration method for obstacle distance measurement based on monocular vision. In: 2014 Fourth International Conference on Communication Systems and Network Technologies, pp. 1148–1151. IEEE (2014)
8. Coronel, P., Furrer, S., Schott, W., Weiss, B.: Indoor location tracking using inertial navigation sensors and radio beacons. In: Floerkemeier, C., Langheinrich, M., Fleisch, E., Mattern, F., Sarma, S.E. (eds.) IOT 2008. LNCS, vol. 4952, pp. 325–340. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78731-0_21
9. Dandil, E., Kürşat Çevik, K.: Computer vision based distance measurement system using stereo camera view. In: 2019 3rd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), pp. 1–4. IEEE (2019)
10. Ramanamurthy, D.: Methods and systems for indoor navigation, June 2012. https://patents.google.com/patent/US20120143495A1/en. Accessed 1 Oct 2020
11. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
12. Emani, S., Soman, K.P., Sajith Variyar, V.V., Adarsh, S.: Obstacle detection and distance estimation for autonomous electric vehicle using stereo vision and DNN. In: Wang, J., Reddy, G.R.M., Prasad, V.K., Reddy, V.S. (eds.) Soft Computing and Signal Processing. AISC, vol. 898, pp. 639–648. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-3393-4_65
13. Fallah, N., Apostolopoulos, I., Bekris, K., Folmer, E.: Indoor human navigation systems: a survey. Interact. Comput. 25(1), 21–33 (2013)
14. Filipe, V., Fernandes, F., Fernandes, H., Sousa, A., Paredes, H., Barroso, J.: Blind navigation support system based on Microsoft Kinect. Procedia Comput. Sci. 14, 94–101 (2012)
15. Găşpăresc, G., Gontean, A.: Performance evaluation of ultrasonic sensors accuracy in distance measurement. In: 2014 11th International Symposium on Electronics and Telecommunications (ISETC), pp. 1–4. IEEE (2014)
16. Gokturk, S.B., Yalcin, H., Bamji, C.: A time-of-flight depth sensor-system description, issues and solutions. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 35–35. IEEE (2004)
17. Google: Position sensors, April 2018. https://developer.android.com/guide/topics/sensors/sensors_position. Accessed 1 Oct 2020
18. Google: Sensors overview: android developers, 2020. https://developer.android.com/guide/topics/sensors/sensors_overview. Accessed 5 Nov 2020
19. Han, D., Wang, C.: Tree height measurement based on image processing embedded in smart mobile phone. In: 2011 International Conference on Multimedia Technology, pp. 3293–3296. IEEE (2011)
20. Haseeb, M.A., Guan, J., Ristic-Durrant, D., Gräser, A.: DisNet: a novel method for distance estimation from monocular camera. In: 10th Planning, Perception and Navigation for Intelligent Vehicles (PPNIV18), IROS (2018)
21. Hechun, W., Xiaohong, Z.: Survey of deep learning based object detection. In: Proceedings of the 2nd International Conference on Big Data Technologies, pp. 149–153 (2019)
22. Holzmann, C., Hochgatterer, M.: Measuring distance with mobile phones using single-camera stereo vision. In: 2012 32nd International Conference on Distributed Computing Systems Workshops, pp. 88–93. IEEE (2012)
23. Howard, A., et al.: Searching for MobileNetV3. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324 (2019)
24. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
25. Jiang, G., Yin, L., Jin, S., Tian, C., Ma, X., Ou, Y.: A simultaneous localization and mapping (SLAM) framework for 2.5D map building based on low-cost LiDAR and vision fusion. Appl. Sci. 9(10), 2105 (2019)
26. Jiao, L., et al.: A survey of deep learning-based object detection. IEEE Access 7, 128837–128868 (2019)
27. Johannsdottir, K.R., Stelmach, L.B., et al.: Monovision: a review of the scientific literature. Optom. Vis. Sci. 78(9), 646–651 (2001)
28. Kala, R.: On-Road Intelligent Vehicles: Motion Planning for Intelligent Transportation Systems. Butterworth-Heinemann, Oxford (2016)
29. Katevas, K.: SensingKit/SensingKit-iOS, October 2019. https://github.com/SensingKit/SensingKit-iOS. Accessed 1 Oct 2020
30. Khan, M.A., Paul, P., Rashid, M., Hossain, M., Ahad, M.A.R.: An AI-based visual aid with integrated reading assistant for the completely blind. IEEE Trans. Hum. Mach. Syst. 50(6), 507–517 (2020)
31. Kim, S.J., Kim, B.K.: Dynamic ultrasonic hybrid localization system for indoor mobile robots. IEEE Trans. Ind. Electron. 60(10), 4562–4573 (2012)
32. Kriegman, D.J., Triendl, E., Binford, T.O.: Stereo vision and navigation in buildings for mobile robots. IEEE Trans. Robot. Autom. 5(6), 792–803 (1989)
33. Kurdthongmee, W.: A comparative study of the effectiveness of using popular DNN object detection algorithms for pith detection in cross-sectional images of parawood. Heliyon 6(2), e03480 (2020)
34. Kuriakose, B., Shrestha, R., Sandnes, F.E.: Smartphone navigation support for blind and visually impaired people - a comprehensive analysis of potentials and opportunities. In: Antona, M., Stephanidis, C. (eds.) HCII 2020. LNCS, vol. 12189, pp. 568–583. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-49108-6_41
35. Kuriakose, B., Shrestha, R., Sandnes, F.E.: Tools and technologies for blind and visually impaired navigation support: a review. IETE Tech. Rev. 1–16 (2020)
36. Leppäkoski, H., Collin, J., Takala, J.: Pedestrian navigation based on inertial sensors, indoor map, and WLAN signals. J. Sig. Process. Syst. 71(3), 287–296 (2013)
37. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
38. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
39. Ian, M.: How to measure distance using ARCore? August 2017. https://stackoverflow.com/questions/45982196/how-to-measure-distance-using-arcore. Accessed 5 Nov 2020
40. Majchrzak, J., Michalski, M., Wiczynski, G.: Distance estimation with a long-range ultrasonic sensor system. IEEE Sens. J. 9(7), 767–773 (2009)
41. Mufti, F., Mahony, R., Heinzmann, J.: Robust estimation of planar surfaces using spatio-temporal RANSAC for applications in autonomous vehicle navigation. Robot. Auton. Syst. 60(1), 16–28 (2012)
42. Obradovic, D., Lenz, H., Schupfner, M.: Fusion of map and sensor data in a modern car navigation system. J. VLSI Sig. Process. Syst. Sign. Image Video Technol. 45(1–2), 111–122 (2006)
43. Ponce, H., Brieva, J., Moya-Albor, E.: Distance estimation using a bio-inspired optical flow strategy applied to neuro-robotics. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE (2018)
44. Prusak, A., Melnychuk, O., Roth, H., Schiller, I., Koch, R.: Pose estimation and map building with a time-of-flight-camera for robot navigation. Int. J. Intell. Syst. Technol. Appl. 5(3–4), 355–364 (2008)
45. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
46. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. arXiv preprint arXiv:1612.08242 (2016)
47. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv (2018)
48. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
49. StackOverFlow: How can we measure distance between object and android phone camera, May 2013. https://stackoverflow.com/questions/15949777/how-can-we-measure-distance-between-object-and-android-phone-camera. Accessed 5 Nov 2020
50. Stelzer, A., Hirschmüller, H., Görner, M.: Stereo-vision-based navigation of a six-legged walking robot in unknown rough terrain. Int. J. Robot. Res. 31(4), 381–402 (2012)
51. Utaminingrum, F., et al.: A laser-vision based obstacle detection and distance estimation for smart wheelchair navigation. In: 2016 IEEE International Conference on Signal and Image Processing (ICSIP), pp. 123–127. IEEE (2016)
52. Wang, X., Zhou, B., Ji, J., Bin, P.: Recognition and distance estimation of an irregular object in package sorting line based on monocular vision. Int. J. Adv. Robot. Syst. 16(1), 1729881419827215 (2019)
53. Xiaoming, L., Tian, Q., Wanchun, C., Xingliang, Y.: Real-time distance measurement using a modified camera. In: 2010 IEEE Sensors Applications Symposium (SAS), pp. 54–58 (2010)
54. Zaarane, A., Slimani, I., Al Okaishi, W., Atouf, I., Hamdoun, A.: Distance measurement system for autonomous vehicles using stereo camera. Array 5, 100016 (2020)
55. Zhang, J., Huang, X.Y.: Measuring method of tree height based on digital image processing technology. In: 2009 First International Conference on Information Science and Engineering, pp. 1327–1331. IEEE (2009)
56. Zhang, L., Zhao, L.: Research of ultrasonic distance measurement system based on DSP. In: 2011 International Conference on Computer Science and Service System (CSSS), pp. 2455–2458. IEEE (2011)
57. Zhu, J., Fang, Y.: Learning object-specific distance from a monocular image. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3839–3848 (2019)
Adversarial Domain Adaptation for Medieval Instrument Recognition
Imad Eddine Ibrahim Bekkouch¹(✉), Nicolae Dragoş Constantin², Victoria Eyharabide³, and Frederic Billiet⁴
¹ Sorbonne Center for Artificial Intelligence, Sorbonne University, Paris, France [email protected]
² Institutul de cercetări pentru Inteligenta Artificiala, Academia Romana, Bucharest, Romania
³ Sens Texte Informatique Histoire Laboratory, Sorbonne University, Paris, France
⁴ Institute for Research in Musicology, Sorbonne University, Paris, France
Abstract. Image classification models have improved drastically thanks to neural networks. However, as a direct consequence of being trained on a specific dataset, neural networks tend to be biased towards their training data and provide worse results on other domains. Hence, a new sub-field of Transfer Learning emerged, namely Domain Adaptation, which in most cases uses a combination of adversarial methods and mathematical heuristics applied to the model's latent space. In this paper, we present a new method for Unsupervised Domain Adaptation that is both fast and resilient. Our method starts by applying style transformations to the input images and trains a transformation discriminator module to predict these style changes. The feature extractor part of our model, in contrast, is trained on the adversarial part of that loss, allowing it to forget and not extract style information, which in turn improves the accuracy of the classifier. Our second contribution is our new dataset of Musical Instruments Recognition In Medieval Artworks, which provides a better benchmark for transfer learning and domain adaptation methods and pushes research in this area further. We evaluated our method on two main benchmarks, namely MNIST-USPS-SVHN and MIMO-Musiconis-Vihuelas, and in all cases our method provides state-of-the-art performance.
Keywords: Computer vision · Transfer learning · Domain adaptation
1 Introduction
Machine learning (ML) has become part of our everyday lives, from ads on our phones to self-driving cars and smart homes, mainly because we now have an abundance of computational power and large datasets to train models for every task. There are three main types of ML: supervised learning (Video Recognition [1], Diagnosis [3]), unsupervised learning (Domain Adaptation [4], outlier detection [13,16,20]) and semi-supervised learning (Speech analysis, Spam Detection [8]), which are defined by the availability of annotations for the training data. Supervised learning is usually the easiest and most advanced type, given its predictability and stability compared to the other types, as it gives the required output and can be tailored to each task.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 674–687, 2022. https://doi.org/10.1007/978-3-030-82196-8_50
The machine learning field has been dominated by one large family of algorithms: neural networks, commonly known as Deep Learning (DL). Inspired by the brain's architecture and neurons, artificial neural networks showed a huge gain in performance when applied to computer vision tasks and other human-intelligence tasks such as Natural Language Processing, painting, or detecting outliers. The first neural network architecture to prove its efficiency was the Convolutional Neural Network (CNN) [19], which reflects the human vision system and allows machines to outperform humans on several tasks such as medical imaging, diagnostics, and object detection.
The biggest drawback of neural networks is their bias towards their training data and their inability to generalize to new domains: if they are trained on a specific type of images, their performance decreases drastically when they are tested on a different type of images, even if the difference is not clear to the human eye. This problem is commonly known as the domain gap, which is different from over-fitting. Over-fitting occurs when the training and testing data are similar but the model still fails to generalize, whereas domain gaps occur between different datasets. The initial solution was to retrain the model from scratch on the new dataset, which is a common practice for industrial models. However, collecting a new dataset, labeling it, and then retraining the whole model is a very costly and challenging task.
A better approach is to leverage the knowledge extracted from an easier domain (dataset) and later transfer those weights (containing knowledge) to the target domain, which commonly shares the same classes and objects but might follow a different marginal distribution. This strategy helps reduce the cost of data re-collection and speeds up redeployment. More formally, we denote the source and target domains throughout the paper as $D_s$ and $D_t$, respectively. Domain adaptation (DA) is a sub-field of transductive transfer learning (TTL), which solves a problem in the target domain $D_t$, where data are hard to collect or insufficient for training, using data from the source domain $D_s$. Both domains share the same classification problem, i.e., $T^t = T^s$, but the marginal distributions of their inputs differ, i.e., $P(X^s) \neq P(X^t)$. DA is achieved by learning a shared latent space, i.e., $Z^s = Z^t$ [15].
Domain Adaptation is either closed-set or open-set. On the one hand, closed-set DA [11] deals with the case where the classes of $D_t$ are the same as those of $D_s$. On the other hand, open-set DA [5,17] handles the case where only a few classes are shared between the two domains, and the source or the target domain might contain more classes. Our paper works on a particular case where the classes are shared, but their design has changed over time; hence it is considered an open-set DA problem.
As for most machine learning sub-fields, DA is split into supervised [14], unsupervised [6], and semi-supervised [21], depending on the availability of labeled data in $D_t$. In the unsupervised domain adaptation (UDA) case, the target domain samples are completely unlabeled, whereas the source domain is fully labeled, which is useful in situations where data collection is easy but labeling the target data is time-consuming. In the opposite cases of supervised domain adaptation (SDA) and semi-supervised domain adaptation (SSDA), the target domain labels are entirely or partially available, respectively. Our paper handles a particular case of unsupervised domain adaptation where the source dataset is not labeled, but the target data is fully labeled.
Domain Adaptation is closely related to Domain Generalisation (DG), which applies when we have no access to the target data. DG methods address this problem by leveraging various easy-to-collect datasets from different domains, allowing the model to learn a non-biased representation of the datasets and generalize well to unseen domains [2,9]. Our method focuses on unsupervised domain adaptation where the source dataset does not contain labels, as this is a less researched area with several applications to historical and medieval datasets. Although such data is always annotated, the datasets are small, making it hard to train models on them directly.
Previous UDA methods aim at achieving two requirements for the shared latent space: (i) extracting (or learning) a latent space representation from $D_s$ and $D_t$ that is class-informative and useful for separating the classes from each other, and (ii) making the feature spaces of $D_s$ and $D_t$ similar to each other, so that similar results are obtained for both domains.
The most common domain adaptation methods rely heavily on either mathematical heuristics, which are formulated as loss functions affecting the latent space or a min-max problem formulated with adversarial learning. Our method leverages adversarial learning only since mathematical heuristics can be added to any model to improve its performances. We assume that the classifiers suffer with the classification of new data since the latent space extracted contains several information about the input, which is not useful for classification. This information is due to the variation in the data style and is considered noise. Hence, we apply drastic transformations and data augmentation techniques to the input images and build a classifier that predicts the transformations applied to each image. This transformation-classifier is trained separately, and its loss does not influence the encoder part of our model (the feature extractor). On the contrary, the encoder is trained on the adversarial side of that classification loss, removing the transformation and style information from the extracted latent representations. We evaluate our method on a new dataset of Medieval Musical Instruments annotated by five expert musical instruments historians. The images are extracted from three sources (Musiconis, Vihuelas, MIMO), providing us with
Adversarial DA for Medieval Instrument Recognition
images of instruments throughout a long time period. We created a dataset that presents a challenge to machine learning models for three main reasons:
1. Variation in Style: Our dataset contains images of musical instruments from historical periods and recent times. The instruments are represented on various supporting materials, mainly paintings, manuscripts, photographs, and sculptures.
2. Difficulty in Acquiring and Labeling: Due to the damaged nature of ancient artworks, experts may have difficulty classifying objects, which makes training on such images a more demanding task.
3. Scarcity of Data: Finding medieval images is a difficult task, especially in the musicology domain, making our dataset a valuable contribution to research.
The rest of the paper is organized as follows: Sect. 2 gives an overview of the different methods and architectures in Transfer Learning and Domain Adaptation. Section 3 describes our model in detail. The empirical evaluation of our method is shown in Sect. 4. Finally, Sect. 5 summarizes the paper.
2 Related Works
Transfer learning is a vast field, in particular the sub-field of domain adaptation, which has been tackled with deep learning in several ways to build more resilient models. Most papers use discriminators as the core component of domain adaptation and accuracy as the metric to evaluate the similarity of the source and target domains' latent spaces. Works relying on adversarial losses often keep the generative part and construct new images, which can be time-consuming. Our model is an example of the case where a discrimination loss is used without any reconstruction of the input images. Pseudo-labeling is a widely used technique in the field that improves results. We briefly discuss these topics below.
2.1 Transfer Learning
Transfer Learning (TL), or Knowledge Transfer (KT), is a successful and growing field of artificial intelligence (AI) that focuses on transferring knowledge obtained from training on one domain (or task) to another. For deep learning-based models, the transferred knowledge is represented in the weights of the model. Image classification is the first field where transfer learning was applied, given the homogeneity of image classification models, which are based mainly on CNNs and encoder/classifier settings. Transfer learning methods rely on various techniques: 1) unsupervised, such as image reconstruction [12], adversarial losses [4], image coloring, and jigsaw puzzle solving [7]; or 2) supervised, such as classification loss, latent space-based losses, pseudo-labeling, and separability losses [4].
I. E. I. Bekkouch et al.
2.2 Adversarial and Generative Models
Generative adversarial networks (GANs) not only produced state-of-the-art data generation and unlocked applications previously thought unachievable; they also introduced a completely new approach to building and training neural networks, namely adversarial training. In other terms, we train two separate neural networks to trick each other and hence learn from each other in order to improve their performance. Domain adaptation techniques in this area are split into two main categories. 1) Adversarial and generative models: mainly models such as DupGAN [12], which was very successful in mitigating domain gaps by taking images from each domain and re-generating the same image in the style of both domains, making its latent space representation domain invariant. This type of model showed great results but was very difficult to train, which led researchers to move towards the adversarial part only, without the need to generate images. 2) Adversarial-only models: such as TripNet [4], which greatly reduces the number of components, keeping only one domain discriminator that tries to predict the domain of an image based on its latent representation, whereas the encoder is trained on the adversarial loss of the discriminator, trying to hide and forget the domain-related information.
2.3 Pseudo Labeling
Pseudo labeling is a technique initially developed for semi-supervised learning to help models improve their results by leveraging more unlabeled data. Now, it can be found in almost any unsupervised domain adaptation method. Pseudo labeling aims at reducing the differences between the target and source domains by providing pseudo labels for the unlabeled samples from the target domain. There are two main techniques for producing pseudo labels in the literature: 1) Similarity based: in [18], KNN graphs were used to find the annotations for the unlabeled data using the labeled ones. 2) Classifier based: DupGAN [12] used a classifier pre-trained on another dataset that shares the same classes.
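The classifier-based variant can be illustrated with a minimal sketch. It assumes the classifier outputs one probability vector per unlabeled sample and keeps only the predictions whose top probability reaches a confidence threshold. The function name `select_pseudo_labels` and the threshold value 0.9 are illustrative assumptions, not values taken from the cited works.

```python
def select_pseudo_labels(prob_rows, threshold=0.9):
    """Return (sample_index, predicted_class) pairs for confident predictions.

    prob_rows: list of per-class probability lists, one per unlabeled sample.
    threshold: hypothetical confidence cut-off (an assumption for illustration).
    """
    selected = []
    for index, probs in enumerate(prob_rows):
        # Class with the highest predicted probability for this sample.
        best_class = max(range(len(probs)), key=lambda k: probs[k])
        # Keep the sample only if the classifier is confident enough.
        if probs[best_class] >= threshold:
            selected.append((index, best_class))
    return selected
```

Samples that do not pass the threshold simply stay unlabeled and can be reconsidered once the classifier has improved.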
3 Methodology
In this section, we describe our new method for unsupervised domain adaptation. Before getting into the details of our method, we define the notation used throughout the section. Let the source domain data be denoted as $X^s = (x_i^s)_{i=1}^{M}$, while the target domain data and annotations are $X^t = (x_i^t, y_i^t)_{i=1}^{N}$. It is important to note that the inputs $x_i^s$ and $x_i^t$ have the same dimensions but come from different marginal distributions. Since our research focuses on open-set domain adaptation, the classes of the two domains overlap but do not necessarily have to be the same. Our model consists of an encoder (the main focus of our method), a classifier (used only for classification and not influencing our method), and a transformation-discriminator (the added component of our method). The encoder and the classifier form the final classification model, as in a typical CNN classification scenario, whereas the transformation-discriminator is used only at training time and removed at inference. Furthermore, we can formulate our inference classification function $f$ as the composition of two sub-functions, $f = c \circ e$, such that $e : X \to Z$ represents the encoder, which extracts latent space vectors from the input images, and $c : Z \to Y$ classifies these latent space vectors into their appropriate classes. The transformation-discriminator function $g$ is similarly a composition of two sub-functions, $g = d \circ e$, where $e$ is the exact same encoder function and $d : Z \to A$ is the multi-class multi-label transformation detector function. Moreover, besides the commonly used classification loss and the transformation-discriminator loss, we also use a separation loss, which was shown to improve the results of domain adaptation models and operates similarly to Linear Discriminant Analysis (LDA).
3.1 Architecture
In this subsection, we present only the components of our method; the next subsection covers the losses used to train these neural networks. Encoder: Our encoder $E(\cdot)$ is a typical, purely convolutional neural network with weights $W^E$ (by default it contains only convolutional and max-pooling layers followed by a flattening layer, but depending on the use of a pretrained model the architecture might include other layer types). The goal of the encoder is to encode the input images of both domains into a latent space representation in vector form, as given by the following formula:
$$z = E(x), \quad x \in X^s \cup X^t \tag{1}$$
such that $z \in Z$ represents the desired latent representation, which we push towards being more domain invariant and category informative. Hence, we denote the output of the encoder for the source input images as $z^s = E(x^s)$ and for the target input images as $z^t = E(x^t)$. Both the classifier and the transformation-discriminator take as input the flattened output of the encoder. Classifier: Our classifier is a vanilla artificial neural network (ANN) $C(\cdot)$ of the kind commonly used for multi-class or binary classification by switching the loss function between cross entropy and binary cross entropy (for our dataset we use the cross entropy formulation, as it is a multi-class classification task). As previously stated, its input is the output of the encoder $E$, which is the latent space vector representation $z$, and its output is the vector of per-class probabilities $\hat{y}$. The function we used in our case is the following:
$$\hat{y} = C(z) = C(E(x)), \quad x \in X, \; X = X^s \cup X^t \tag{2}$$
In the above equation, $\hat{y}$ represents the output of the classifier, which is the vector of per-class probabilities such that $\hat{y} \in \hat{Y}$, $\hat{Y} = \hat{Y}^s \cup \hat{Y}^t$, meaning it is common to both domains and covers both the labels of the target domain and the pseudo labels of the source domain. The first step is to train the classifier and the encoder on the target dataset only; we then use the confidently predicted classes of the source data as pseudo labels for later training. We repeat the pseudo-labeling step on every iteration of the next step; it usually yields very few confident samples in the beginning, but their number increases with time. Transformation Discriminator: Our hypothesis is that the classifier is not able to generalize well to other domains because the encoder extracts information not just for classification but also about the style of the images. Our transformation discriminator $D(\cdot)$ is a fully connected neural network with weights $W^D$, similar to the discriminator of generative adversarial networks, but it has a multi-class output instead of a binary output. The transformation discriminator works in the following manner:
$$a = D(z) = D(E(x)), \quad a \in A, \; A = \{[0, 1, \ldots, 0], \ldots\} \tag{3}$$
such that $a$ is the predicted vector of the transformations applied to the image.
3.2 Losses
In this subsection, we give an overview of the three losses that we use to train our model. Classification Loss: We start with the classification loss, as it is the most common loss in our method. As previously explained, we use the cross entropy loss $H(\cdot, \cdot)$ applied to the predictions on target data and their annotations, and on source images and their pseudo-labels (if any). It is computed as below:
$$L_C(W^E, W^C) = 1 \cdot \sum_{x^s \in X^s} H(\hat{y}^s, y^s) + \lambda_t \sum_{x^t \in X^t} H(\hat{y}^t, y^t) \tag{4}$$
where $\lambda_t$ is used as a balancing hyperparameter between the two domains. As the loss function states, this loss affects both the classifier and the encoder in the same manner. Transformation Discrimination Loss: The goal of the classification loss is to ensure the class-informative quality of the latent space, whereas the goal of the transformation discrimination loss is to ensure its domain independence. In order to get domain-independent features, we use the discrimination loss to train the discriminator to distinguish between the features for both domains using the categorical cross entropy (CCE) loss, which operates on multi-label multi-class classification problems:
$$L_D(W^D) = \sum_{x^s \in X^s} \mathrm{CCE}(D(E(x^s)), Tr(x^s)) + \sum_{x^t \in X^t} \mathrm{CCE}(D(E(x^t)), Tr(x^t)) \tag{5}$$
where $Tr(\cdot)$ is the boolean vector of transformations applied to the input images. This loss affects only the weights of the transformation discriminator $W^D$. The encoder, on the other hand, is trained on the opposite loss, that is, it maximizes $L_D$ and tries to hide the information related to transformation and style, as follows:
$$L_P(W^E) = -L_D \tag{6}$$
$L_P$ is the loss used to train the weights of our encoder in order to deceive the transformation discriminator and remove the information related to style and transformation. Separability Loss: This is a typical example of a mathematical heuristic applied to an encoder to improve the results by providing a cleaner latent space and hence an easier classification problem for both domains. This loss is an extension of Linear Discriminant Analysis (LDA), which is in turn an extension of Fisher's linear discriminant, used to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination is then used as a linear classifier to separate the classes. We use it as a continuous function that shapes the latent space into a combination of features that separates the classes as linearly as possible, giving the classifier a better generalization ability. It is defined as follows:
$$L_{Sep}(W^E) = \frac{\sum_{i \in Y} \sum_{z_{ij} \in Z_i} d(z_{ij}, \mu_i)}{\sum_{i \in Y} d(\mu_i, \mu)} \times \lambda_{BF} \tag{7}$$
$$\lambda_{BF} = \frac{\min_i |Y_i^t|}{\max_i |Y_i^t|} \tag{8}$$
such that $\lambda_{BF}$ is a balancing parameter used to reduce the effect of badly annotated source images.
3.3 Optimization
To sum up, our model is trained to minimize the balanced loss, a weighted sum of the three losses above, given in the equation below:
$$L = \min_{W^D, W^C, W^E} \; 1 \cdot L_C + \beta_P L_P + \beta_{Sep} L_{Sep} \tag{9}$$
where $\beta_P$ and $\beta_{Sep}$ are the balancing parameters. We detail how our model improves its performance in Algorithm 1.
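As a rough illustration of Eqs. (7) to (9), the separability loss and the weighted total can be sketched in plain Python. This is only a sketch under stated assumptions: the distance $d$ is taken to be Euclidean, $\mu_i$ the mean latent vector of class $i$, and $\mu$ the mean of all latent vectors, none of which is spelled out in the text. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Distance d(., .); using the Euclidean distance is an assumption."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def balance_factor(class_counts):
    """Eq. (8): ratio of the smallest to the largest class size."""
    return min(class_counts.values()) / max(class_counts.values())


def separability_loss(latents, labels, class_counts):
    """Eq. (7): within-class spread over between-class spread, times lambda_BF."""
    by_class = {}
    for z, y in zip(latents, labels):
        by_class.setdefault(y, []).append(z)
    global_mean = mean_vector(latents)        # mu (assumed: mean of all latents)
    within, between = 0.0, 0.0
    for members in by_class.values():
        class_mean = mean_vector(members)     # mu_i (assumed: class mean)
        within += sum(euclidean(z, class_mean) for z in members)
        between += euclidean(class_mean, global_mean)
    return within / between * balance_factor(class_counts)


def total_loss(l_c, l_p, l_sep, beta_p, beta_sep):
    """Eq. (9): weighted sum of the three losses."""
    return 1.0 * l_c + beta_p * l_p + beta_sep * l_sep
```

Tight, well separated classes give a small numerator and a large denominator, so minimizing this term pulls samples towards their class mean and pushes class means away from the global mean.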
4 Dataset and Results
In this section we go through the dataset description and the results that we obtained on the different benchmarks.
Algorithm 1: The Training Process of TripNet
Input:
  $X^s$: Source domain images
  $X^t$: Target domain images
  $Y^t$: Target domain image labels
  $I$: Number of iterations
Output:
  $W^E$: Weights of the encoder
  $W^C$: Weights of the classifier
Pre-train $E$ and $C$ using $X^t$ and $Y^t$;
for $i \leftarrow 1$ to $I$ do
  Sample a batch of images for both domains $x^s$, $(x^t, y^t)$;
  Get pseudo-labelling $\hat{y}^s$ for $x^s$ using $C$;
  Update $W^D$ by deriving $L_D$;
  Update $W^C$ by deriving $L_C$;
  Update $W^E$ by deriving $L_C$, $L_P$ and $L_{Sep}$;
end
return $W^C$, $W^E$
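The alternating updates of $W^D$ and $W^E$ in the loop can be illustrated on a deliberately tiny scalar model. This is a toy sketch, not the networks used in the paper: the encoder and the discriminator are single weights, the discriminator predicts one transformation flag with a sigmoid, and the $L_C$ and $L_{Sep}$ terms of the encoder update are omitted. All names and the learning rate are assumptions.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def disc_loss(w_e, w_d, x, t):
    """Binary cross entropy of the discriminator for one sample (toy L_D)."""
    p = sigmoid(w_d * (w_e * x))
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))


def gradients(w_e, w_d, x, t):
    """Return (dL_D/dw_e, dL_D/dw_d) for the scalar model."""
    z = w_e * x                      # latent value produced by the encoder
    err = sigmoid(w_d * z) - t       # derivative of the loss w.r.t. the logit
    return err * w_d * x, err * z


def discriminator_step(w_e, w_d, x, t, lr=0.1):
    """Update the discriminator weight by descending L_D."""
    _, g_d = gradients(w_e, w_d, x, t)
    return w_d - lr * g_d


def encoder_step(w_e, w_d, x, t, lr=0.1):
    """Update the encoder weight by descending L_P = -L_D (ascending L_D)."""
    g_e, _ = gradients(w_e, w_d, x, t)
    return w_e + lr * g_e
```

In this toy setting a discriminator step lowers the discriminator loss and the following encoder step raises it again, which is the min-max behaviour the algorithm relies on.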
4.1 Datasets
To validate our method, we chose two benchmarks: digit recognition (SVHN, MNIST, USPS) and musical instrument recognition in medieval artworks (MIMO, https://mimo-international.com/MIMO/; Musiconis, http://musiconis.huma-num.fr/fr/; Vihuelas, https://vihuelagriffiths.com/).
Digital Digit Recognition. We evaluated our model for unsupervised domain adaptation on the digit classification task, on datasets with ten labels ranging from 0 to 9.
MNIST: The MNIST database (Modified National Institute of Standards and Technology database) is the most commonly known machine learning database for handwritten digit recognition and is used for benchmarking almost every image processing system. It contains a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST, whose images were originally 20 * 20 and were converted into 28 * 28 grayscale images centered around the center of mass of the pixels.
SVHN: Street View House Numbers (SVHN) is built from pictures of real-world scenes and is used in most benchmarks in association with MNIST. SVHN was created from house number plates found in Google Street View images, and it provides a more challenging scenario than MNIST because of the large amount of side artifacts in its images and because the images are RGB and not only grayscale.
USPS: The US Post Office Zip Code Data of Handwritten Digits contains 7291 training samples and 2007 testing samples. The images are 16 * 16 grayscale, but for the sake of our experiments we convert them to 28 * 28 to make them similar to MNIST, although they remain less complex overall.
Musical Instruments Recognition in Medieval Artworks. To push the research in domain adaptation and transfer learning, we created and annotated a new dataset for musical instrument recognition in medieval artworks, which provides a more challenging scenario for several reasons:
1. Difference in periods: Our images were extracted from a variety of manuscripts covering large geographical and historical ranges. Our datasets provide images from 11 different countries, going from the 9th century to the 17th century, along with images of the same instruments from the 20th century as a source for improving the results of models on the historical instruments.
2. A Variety of Image Styles: The long period that we covered allowed us to obtain artworks in a variety of forms, such as manuscripts, embroideries, stained glass, paintings, photographs, and stone and ivory sculptures.
3. A Variation in Conservation States: As our primary focus is musical instruments in historical artworks, some of the objects we found have broken parts and are often highly damaged, yet they are still recognisable. This provides a more difficult challenge for both neural networks and expert musicians, which is why our musicology experts only kept undamaged images or those with a maximum of 10% damage.
MIMO Database. MIMO is an abbreviation of Musical Instrument Museums Online. MIMO is one of the public datasets for musical instruments. We chose to use only a subset of the large MIMO dataset that contains only stringed instruments, with a total of 10258 images and the following distribution of classes: 1) Vielles: commonly known as violins, with 3508 samples, 2) Lutes: or luths, with 3163 samples, 3) Zithers: with 2102 samples, 4) Harps: with 867 samples, and 5) Lyres: with 181 samples.
As MIMO is the biggest dataset we used in this benchmark, we chose to use it as the source dataset.
Musiconis Database. The Musiconis database is our own catalog of iconographic representations annotating musical and sound performances spanning the entire Middle Ages. This database is the fruit of different partnerships with a variety of specificities, as covered in the previous subsection. Examples of the sources of this dataset include (i) the Musicastallis database (http://www.plm.paris-sorbonne.fr/musicastallis/), which stores a variety of musical representations carved on religious buildings (the stalls); (ii) the Gothic Ivories database (http://www.gothicivories.courtauld.ac.uk), which focuses on historical ivory artworks; and (iii) the Romane database (http://base-romane.fr/accueil2.aspx) and the Initial database (http://initiale.irht.cnrs.fr), which focus largely on medieval manuscripts. From this large Musiconis dataset, we annotated 662 images of stringed instruments for our experiments. The classes are the same as in MIMO, although the instruments themselves evolved considerably over time. Overall we have 1) Zithers: with 112 samples, 2) Harps: with 132 samples, 3) Lutes: with 56 samples, 4) Lyres: with 75 samples, and 5) Vielles: with 327 samples.
Vihuela Database. The Vihuelas database [10] is a collection of Spanish Renaissance musical instruments, especially the famous vihuelas. It focuses mainly on the period from 1470 to 1630. The vihuela is also a stringed instrument that was widely used in Spain, Portugal, Italy, and Latin America. We selected a subset of this dataset with a total of 165 stringed instruments. As the name suggests, it is a dataset of vihuelas, which are similar to lutes and violins, and hence most of the images belong to these two instrument families. The images are distributed in the following manner: 1) Lutes: with 130 samples, 2) Vielles: with 31 samples, 3) Harps: with five samples, and 4) Lyres: with two samples.
4.2 Results
Digital Digit Recognition. To evaluate our method, we implemented the following strategy. We used one dataset as a source without any labels and a target dataset with labels. We compared our method against the baseline of training only on the source dataset (Deep Source) and against two state-of-the-art methods in domain adaptation, DupGAN and TripNet. These two models are also based on adversarial losses, but they use the same type of discriminator as GANs, which takes a long time to stabilize, whereas our method is much faster. We also compared our model with and without the separability loss to demonstrate its effect.
Table 1. The test accuracy comparison for UDA on digit classification.
Source - Target   Deep Source   DupGAN   TripNet   ADA (ours)   ADA-Sep (ours)
SVHN - MNIST      98.55         98.72    98.79     99.12        99.43
MNIST - USPS      95.02         96.70    96.24     98.23        98.90
USPS - MNIST      98.55         99.31    99.18     99.53        99.51
Avg               97.37         98.24    98.07     98.96        99.28
We report the results in Table 1, where we can see that all methods improve over the baseline on average, as expected, and that our method clearly outperforms the others on average, especially in the MNIST-USPS experiment. We also see that our method improves on average when using the separability loss. In these experiments, we found that the following transformations gave the best results: Random Gray Scale, Random Color Jitter (for the SVHN experiment), Random Scale, IAA Superpixels (SLIC algorithm), and Blurring.
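To make the role of the transformation labels concrete, the following sketch applies a few transformations at random and records which ones were used as a boolean vector, in the spirit of $Tr(x)$. It is an illustration only: images are nested lists of RGB tuples, and the three simple transformations (`to_grayscale`, `darken`, `blur_rows`) are hypothetical stand-ins for the augmentations listed above, not the implementations used in the experiments.

```python
import random


def to_grayscale(img):
    """Replace every pixel by the average of its three channels."""
    return [[(sum(p) / 3.0,) * 3 for p in row] for row in img]


def darken(img, factor=0.5):
    """Scale all channel values by a fixed factor (simple brightness change)."""
    return [[tuple(c * factor for c in p) for p in row] for row in img]


def blur_rows(img):
    """Box blur of width 3 along each row, with clamped borders."""
    out = []
    for row in img:
        n = len(row)
        new_row = []
        for j in range(n):
            window = [row[max(j - 1, 0)], row[j], row[min(j + 1, n - 1)]]
            new_row.append(tuple(sum(p[k] for p in window) / 3.0 for k in range(3)))
        out.append(new_row)
    return out


TRANSFORMS = [to_grayscale, darken, blur_rows]


def augment(img, p=0.5, rng=None):
    """Apply each transformation with probability p.

    Returns the transformed image and a multi-hot list of flags, one per
    entry of TRANSFORMS, marking which transformations were applied.
    """
    rng = rng or random.Random()
    flags = []
    for transform in TRANSFORMS:
        if rng.random() < p:
            img = transform(img)
            flags.append(1)
        else:
            flags.append(0)
    return img, flags
```

The flag vector is what a transformation discriminator would be asked to predict from the latent representation of the augmented image.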
Musical Instruments Recognition in Medieval Artworks. For the purpose of future comparison with our method and the newly proposed benchmark, we implemented the following setup. We used the MIMO dataset as a source without any labels for two experiments (one with Musiconis and one with Vihuelas as the target) and ran an extra experiment where Musiconis is the source and Vihuelas is the target dataset. We compared our method against the baseline of training only on the source dataset (Deep Source) and against the TripNet method. We also present the results of our model with and without the separability loss.
Table 2. The test accuracy comparison for UDA on musical instruments recognition in medieval artworks.
Source - Target        Deep Source   TripNet   ADA (ours)   ADA-Sep (ours)
MIMO - Vihuelas        92.39         93.24     95.96        96.93
MIMO - Musiconis       74.24         81.05     83.90        86.63
Musiconis - Vihuelas   92.39         92.72     94.81        95.83
Avg                    86.34         89.00     91.55        93.13
We report the results in Table 2. Our method again clearly outperforms the TripNet model on average, which in turn improves over the Deep Source baseline, as expected. We also see that our method improves on average when using the separability loss. In these experiments, we found that the following transformations gave the best results: Random Gray Scale, ISO Noise, Shift Scale Rotate, Random Color Jitter (for the SVHN experiment), Random Scale, IAA Superpixels (SLIC algorithm), and Blurring.
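The Avg row of Table 2 can be recomputed from the per-experiment accuracies with a few lines of arithmetic. The sketch below assumes the values are grouped by method in the order MIMO - Vihuelas, MIMO - Musiconis, Musiconis - Vihuelas; the dictionary `ACCURACY` and the helper `mean` are illustrative names.

```python
# Per-method test accuracies (%), one value per experiment of Table 2.
ACCURACY = {
    "Deep Source": [92.39, 74.24, 92.39],
    "TripNet": [93.24, 81.05, 92.72],
    "ADA": [95.96, 83.90, 94.81],
    "ADA-Sep": [96.93, 86.63, 95.83],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def average_row():
    """Average accuracy per method over the three experiments."""
    return {method: mean(scores) for method, scores in ACCURACY.items()}
```

Under this grouping the averages increase from the Deep Source baseline through TripNet and ADA to ADA-Sep.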
5 Conclusion
Domain gaps are a main reason for performance drops of neural network models. In this paper we made two main contributions. The first is our new manually annotated image dataset of historical musical instruments, created by five expert musicologists at Sorbonne University. Our MMV (MIMO-Musiconis-Vihuelas) dataset will help push forward research on domain adaptation, domain generalization, and transfer learning in general by providing a new benchmark and a more challenging scenario to evaluate models on. Our second main contribution is our new method for unsupervised domain adaptation, a fast, adversarial, non-generative technique for bridging domain gaps between datasets: it performs style transformations and forces the encoder to forget style-related information, allowing the classifier to improve its accuracy across several domains.
References
1. Batanina, E., Bekkouch, I.E.I., Youssry, Y., Khan, A., Khattak, A.M., Bortnikov, M.: Domain adaptation for car accident detection in videos. In: 2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 1–6 (2019)
2. Bekkouch, I.E.I., Nicolae, D.C., Khan, A., Kazmi, S.M.A., Khattak, A.M., Ibragimov, B.: Adversarial reconstruction loss for domain generalization. IEEE Access 9, 42424–42437 (2021)
3. Bekkouch, I.E.I., Aidinovich, T., Vrtovec, T., Kuleev, R., Ibragimov, B.: Multiagent shape models for hip landmark detection in MR scans. In: Išgum, I., Landman, B.A. (eds.) Medical Imaging 2021: Image Processing, vol. 11596, pp. 153–162. International Society for Optics and Photonics, SPIE (2021)
4. Bekkouch, I.E.I., Youssry, Y., Gafarov, R., Khan, A., Khattak, A.M.: Triplet loss network for unsupervised domain adaptation. Algorithms 12(5), 96 (2019)
5. Busto, P.P., Gall, J.: Open set domain adaptation. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 754–763 (October 2017)
6. Cai, G., Wang, Y., Zhou, M., He, L.: Unsupervised domain adaptation with adversarial residual transform networks. CoRR abs/1804.09578 (2018)
7. Carlucci, F.M., D'Innocente, A., Bucci, S., Caputo, B., Tommasi, T.: Domain generalization by solving jigsaw puzzles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2229–2238 (2019)
8. Cheng, V., Li, C.H.: Personalized spam filtering with semi-supervised classifier ensemble. In: 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings) (WI 2006), pp. 195–201. IEEE (2006)
9. Deshmukh, A.A., Bansal, A., Rastogi, A.: Domain2vec: deep domain generalization. CoRR abs/1807.02919 (2018)
10. Griffiths, J.: At court and at home with the vihuela de mano: Current perspectives on the instrument, its music, and its world. J. Lute Soc. Am. 22, 1–27 (1989)
11. Hoffman, J., et al.: Cycada: cycle-consistent adversarial domain adaptation. In: ICML (2018)
12. Hu, L., Kan, M., Shan, S., Chen, X.: Duplex generative adversarial network for unsupervised domain adaptation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
13. Ibrahim, B.I., Nicolae, D.C., Khan, A., Ali, S.I., Khattak, A.: VAE-GAN based zero-shot outlier detection. In: Proceedings of the 2020 4th International Symposium on Computer Science and Intelligent Control, ISCSIC 2020, New York, NY, USA. Association for Computing Machinery (2020)
14. Motiian, S., Piccirilli, M., Adjeroh, D.A., Doretto, G.: Unified deep supervised domain adaptation and generalization. CoRR abs/1709.10190 (2017)
15. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010)
16. Rivera, A.R., Khan, A., Bekkouch, I.E.I., Sheikh, T.S.: Anomaly detection based on zero-shot outlier synthesis and hierarchical feature distillation. IEEE Trans. Neural Netw. Learn. Syst. 1–11 (2020)
17. Saito, K., Yamamoto, S., Ushiku, Y., Harada, T.: Open set domain adaptation by backpropagation. CoRR abs/1804.10427 (2018)
18. Sener, O., Song, H.O., Saxena, A., Savarese, S.: Learning transferrable representations for unsupervised domain adaptation. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 2110–2118. Curran Associates, Inc. (2016)
19. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)
20. Yakovlev, K., Bekkouch, I.E.I., Khan, A.M., Khattak, A.M.: Abstraction-based outlier detection for image data. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications, pp. 540–552. Springer, Cham (2021)
21. Yao, T., Pan, Y., Ngo, C., Li, H., Mei, T.: Semi-supervised domain adaptation with subspace learning for visual recognition. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2142–2150 (June 2015)
Transfer Learning Methods for Training Person Detector in Drone Imagery
Saša Sambolek1 and Marina Ivašić-Kos2(B)
1 High School Tina Ujevića, Kutina, Croatia
2 Department of Informatics, University in Rijeka, Rijeka, Croatia
[email protected]
Abstract. Deep neural networks achieve excellent results on various computer vision tasks, but training such models requires large amounts of labeled images, which are often unavailable. An alternative to using a large amount of data to achieve better results and greater generalization of the model is to take previously trained models and adapt them to the task at hand, which is known as transfer learning. The aim of this paper is to improve the results of detecting people in search and rescue scenes using YOLOv4 detectors. Since the original SARD data set for training human detectors in search and rescue scenes is modest, different transfer learning approaches are analyzed. Additionally, the VisDrone data set containing drone images of urban areas is used to increase the training data in order to improve person detection results.
Keywords: Transfer learning · YOLO v4 · Person detection · Drone dataset
1 Introduction
Deep learning methods have been successfully applied in many computer vision applications in recent years. Unlike traditional machine learning methods, deep learning methods allow automatic learning of features from data and reduce manual feature extraction and representation. However, it should be emphasized that a deep learning model is highly data-dependent. Large amounts of data are needed in the training set to detect patterns in the data, generate the features of the deep learning model, and identify the information needed to make a final decision. Insufficient data for training deep learning models is a significant problem in specific application domains such as search and rescue (SAR) operations in non-urban areas. Collecting relevant image data in this case is demanding and expensive because it requires the use of drones or helicopters to monitor and record non-urban areas such as mountains, forests, fields, or water surfaces. An additional problem is that scenes with detected casualties, which are the most useful for training a model to detect an injured person, rarely appear in the recorded material. Besides, the collected data has to be processed: each frame must be inspected, and each occurrence of a person marked with a bounding box and labeled, which is a tedious and time-consuming process.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 688–701, 2022. https://doi.org/10.1007/978-3-030-82196-8_51
Transfer Learning Methods for Training Person Detector
One way to overcome the problem of data scarcity is to use transfer learning. Transfer learning allows a domain model not to be learned from scratch, assuming that the learning set data is not necessarily independent and identically distributed as the data in the test set. This assumption makes it possible to significantly reduce the amount of data required in the learning set and the time required to learn the target domain model. This paper aims to detect persons on the scenes of search and rescue (SAR) operations. Today, it has become commonplace to use drones in SAR missions that fly over the search area and film it from a bird’s eye view. They can capture a larger area at higher altitudes, but then the people in the image are tiny and take up only a few pixels. People can be detected more efficiently at lower altitudes, but in that case, the field of view is smaller. People who are searched for are very often barely noticeable because of the branches and trees, occluded by some vegetation, in the shadow, fused with the ground, which further complicates the search even for favorable weather conditions. During SAR operations, the drone operator has a demanding task to analyze the recorded material in real-time to detect a relatively small person on a large, inaccessible surface that requires great concentration, so automatic detection can be valuable. We used the YOLOv4 model for the person detector trained on the MS COCO dataset, which proved to be the most successful in previous research after additional learning on domain images [1–3]. To train the YOLOv4 model, we used the custom-made set of SARD scenes that were shot in a non-urban area with actors simulating injured people and prepared for machine learning. To increase the set, we have generated the Corr-SARD set from SARD scenes by adding atmospheric conditions. 
Since the tailor-made SARD and Corr-SARD datasets are relatively small for training deep learning models, we additionally used the VisDrone dataset to include more images of people taken by drone, although not in non-urban areas. This paper examines three different transfer learning methods for building YOLOv4 models for detecting persons in search and rescue operations. In the next section, three different methods of transfer learning are presented. In the third section, the experimental setup is given along with a description of the image datasets SARD, Corr-SARD, and VisDrone, and basic information about the YOLOv4 detector. In the fourth section, the experimental results of applying different transfer learning methods are presented and compared. In the conclusion, we list important findings regarding the impact of different transfer learning approaches on person detection in search and rescue scenes and outline a plan for future research.
2 Transfer Learning
Transfer learning involves taking a pre-trained neural network and adapting it to a new, distinct set of data by transferring or repurposing the learned features. Transfer learning is beneficial when models are trained with limited computing resources and when only a modest set of data is available for training. Many state-of-the-art models took days, or even weeks, to train on powerful GPU machines. So, instead of repeating the same lengthy procedure, transfer learning allows us to use pre-trained weights as a starting point.
S. Sambolek and M. Ivašić-Kos
Different levels and methods of applying deep transfer learning can be classified into four categories according to [4]: network-based transfer learning, instance-based transfer learning, mapping-based transfer learning, and adversarial-based transfer learning, the last of which we will not examine here.
2.1 Network-Based Deep Transfer Learning
Network-based deep transfer learning refers to reusing a part of the network (without the fully connected layers) previously trained in the source domain as part of the target network used in the target domain [4]. The CNN architecture contains many parameters, so it is difficult to learn them from a relatively small number of images. Therefore, for example, in [5], the network is first trained on a large set of data for classification (ImageNet, the source domain), and the pre-trained parameters of the inner layers of the network are transferred to the target tasks (classification or detection in the target domain). An additional network layer is added and trained on the labeled target data to compensate for the differences between the source and the target data in various image statistics (object type, camera position, lighting) and to fit the model to the target task. If the source domain and the target domain differ in scenes, the objects’ appearance, lighting, background, position, distance from the camera, and the like, lower detection results can be expected on the target set than on the source. For example, the original YOLO object detector trained on the COCO dataset was used for detecting players in video frames of handball matches [6] and for person detection in thermal images [7]. In the case of player detection in handball scenes, the original YOLO model achieved an AP of 43.4%, which is considerably better than for person detection in thermal images, where an AP of 19.63% was achieved.
The lower results on thermal images are due to significant differences between thermal and RGB images. The lower detection results on handball scenes occurred because the detector did not localize the player accurately and often failed to include a high-raised hand or a leg in a jump, as handball-specific poses did not exist in the original set.
2.2 Instances-Based Deep Transfer Learning
Instance-based deep transfer learning refers to a method in which a union of selected instances from the source domain and instances of the target domain is used for training. It is assumed that, regardless of the differences between domains, the source domain’s instances will improve detection in the target domain. In deep learning, fine-tuning models pre-trained on large benchmark datasets of source domains on the target domain is a standard way to improve results in similar target domains. The authors in [8] use an instance-based deep transfer approach to measure each training sample’s impact in the target domain. The primary purpose was to improve the model’s performance in the target domain by optimizing its training data. In particular, they use a selected pre-trained model to assess each training sample’s impact in the target domain. Samples with a negative impact are then removed according to the impact value, which optimizes the target domain’s training set.
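The sample-removal step described for [8] can be illustrated with a minimal sketch. It assumes that an impact score per training sample is already available; how [8] actually computes that score is not reproduced here, and the function name, file names, and scores are hypothetical.

```python
def select_training_samples(samples, impact):
    """Keep only samples whose estimated impact on the target task is non-negative."""
    return [s for s in samples if impact[s] >= 0.0]

# Hypothetical image names and impact scores, for illustration only
impact_scores = {
    "img_001.jpg": 0.42,
    "img_002.jpg": -0.15,
    "img_003.jpg": 0.03,
    "img_004.jpg": -0.01,
}

kept = select_training_samples(sorted(impact_scores), impact_scores)
print(kept)
```

The threshold of zero is an assumption of this sketch; a stricter or looser cut-off could be used without changing the idea.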
In the previously mentioned research in the sports domain [6] and on thermal images [7], it was shown that additional learning on an appropriate set and fine-tuning the parameters of the pre-trained model to the task of interest could significantly improve the detection results on the target set. Thus, the basic model’s AP of 19.63% on the set of thermal images rose to 97.93% after additional adjustment on the customized set of thermal images. With additional learning on the handball scenes, AP increased from an initial 43% to 67%. Similar results after fine-tuning with state-of-the-art backbone deep neural networks such as Inception v2, ResNet 50, and ResNet 101 were also reported in [9].
2.3 Mapping-Based Deep Transfer Learning
Mapping-based deep transfer learning refers to mapping instances from the source domain and the target domain to a new data space [4]. It finds a common latent space in which feature representations for the source and target domains are invariant [10]. In [11], a CNN architecture was proposed for domain adaptation by introducing an adaptation layer for learning feature representations. The maximum mean discrepancy (MMD) metric is used to calculate the distance between the source and target distributions with respect to a particular representation, which helps select the architecture’s depth and width and regularize the loss function during fine-tuning. Later, a multiple-kernel variant of MMD (MK-MMD) and joint MMD (JMMD) were proposed in [12] and [13] to improve domain adaptation performance. However, the main limitation of the MMD methods is that the computational cost of MMD increases quadratically with the number of samples when calculating Integral Probability Metrics (IPM) [14]. Therefore, the Wasserstein distance has recently been proposed in [15] as an alternative for finding a better distribution mapping.
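The quadratic cost of MMD can be seen in a direct estimate of the squared MMD between two feature samples. The sketch below is not taken from [11–14]; it assumes a Gaussian (RBF) kernel with a hypothetical bandwidth `sigma` and the simple biased estimator, and the function names and toy feature vectors are introduced here only for illustration.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two feature vectors (sigma is an assumed bandwidth)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs; this double loop is the quadratic cost."""
    total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two samples of feature vectors."""
    return (mean_kernel(source, source, sigma)
            + mean_kernel(target, target, sigma)
            - 2.0 * mean_kernel(source, target, sigma))

# Toy features, made up for illustration only
source_feats = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
near_target = [[0.1, 0.0], [0.0, 0.2], [0.1, 0.2]]
far_target = [[3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]

print(mmd_squared(source_feats, source_feats))  # identical samples
print(mmd_squared(source_feats, near_target))   # similar distributions
print(mmd_squared(source_feats, far_target))    # clearly different distributions
```

Every call to `mean_kernel` visits all pairs of samples, which is why the cost grows with the square of the sample count.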
2.4 Adversarial-Based Deep Transfer Learning
Adversarial-based deep transfer learning mainly refers to introducing adversarial techniques inspired by generative adversarial networks (GAN) [16] to find transferable representations that apply to both the source and the target domain, but it can also refer to the use of synthetic data to enlarge the original dataset artificially. In adversarial networks, the features extracted from the two domains (source and target) are sent to an adversarial layer that tries to discriminate the features’ origin. If there is only a slight difference between the two types of features, the adversarial network performs worse, which signals better transferability, and vice versa. In this way, general features with greater transferability are revealed during training. When synthesized data is used to enlarge the learning set of a deep learning model, it is necessary to analyze the content of the reference video scene and select the elements to be generated in the virtual scene, taking into account the background, the objects in the scene, and accessories, as in [17].
3 Experimental Setup
3.1 Dataset
In this paper, three datasets were used: the publicly available VisDrone dataset, the custom-made SARD dataset, and the synthetically enlarged SARD dataset, Corr-SARD. From the VisDrone dataset [18], which contains images of urban scenes taken by drone, we selected 2,129 images that include a person or pedestrian tag. We combined both labels into one class: person. The obtained dataset was divided into a training set (1,598 images) and a test set (531 images). The selected part of the VisDrone set includes shots of people taken under different weather and lighting conditions in different urban scenarios such as roads, squares, parks, parking lots, and the like. The SARD dataset [19] was recorded in a non-urban area to show persons in scenes specific to search and rescue operations. The set contains footage simulating the poses of injured people found by search and rescue teams in inaccessible terrain in hills, forests, and similar places, as well as standard poses of people such as walking, running, and sitting. The set contains 1,981 images divided into two subsets, a training set containing 1,189 images and a test set with 792 images. The Corr-SARD dataset is derived from the SARD set by adding the effects of snow, fog, frost, and motion blur to the SARD images. Its training set has the same number of images as the SARD training set, while its test set has slightly fewer images (714) because images in which no persons can be seen after adding the effect were removed. For the experiment, we created three additional datasets containing images from the sets mentioned above.
Fig. 1. Example of images from SARD dataset.
SV refers to a mixture of the SARD and VisDrone sets. Similarly, SC is a mixture of the SARD and Corr-SARD sets, and SVC is a mixture of the SARD, VisDrone, and Corr-SARD sets.
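As an illustration of how such mixed training sets could be assembled, the sketch below concatenates per-dataset image lists. The training-set sizes are those reported in Sect. 3.1; the file paths are placeholders, and the resulting mixed-set sizes are derived here under the assumption that every training image is used exactly once, which the paper does not state explicitly.

```python
# Training-set sizes reported in Sect. 3.1
TRAIN_SIZES = {"SARD": 1189, "VisDrone": 1598, "Corr-SARD": 1189}

# Mixed sets as described in the text
MIXTURES = {
    "SV": ["SARD", "VisDrone"],
    "SC": ["SARD", "Corr-SARD"],
    "SVC": ["SARD", "VisDrone", "Corr-SARD"],
}

def build_image_list(parts, image_lists):
    """Concatenate per-dataset image lists into one training list."""
    merged = []
    for name in parts:
        merged.extend(image_lists[name])
    return merged

# Placeholder file names standing in for the real images (hypothetical paths)
image_lists = {
    name: [f"{name}/train_{i:04d}.jpg" for i in range(size)]
    for name, size in TRAIN_SIZES.items()
}

mixed = {key: build_image_list(parts, image_lists) for key, parts in MIXTURES.items()}
for key, images in mixed.items():
    print(key, len(images))
```

A list of image paths like this could then be written to the text file from which the training framework reads its training images.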
3.2 YOLOv4 Person Detection Model
Detection of persons in high-resolution images taken by a drone is a challenging and demanding task. People who are searched for due to loss of orientation, a fall, or dementia are very often in unusual places, away from roads, in atypical body positions due to injury or a fall, lying on the ground due to exhaustion, or covered with stones due to slipping or landslides (Fig. 1). On top of all that, the target object is relatively small and often camouflaged in the environment, so it is difficult to spot. In this experiment, we used the YOLOv4 model [20] for person detection. YOLOv4 uses CSPDarkNet53 as a backbone [21], which combines DarkNet53, a deep residual network with 53 layers, with CSPNet (Cross Stage Partial Network). To increase the receptive field without reducing speed, the authors added Spatial Pyramid Pooling (SPP) [22] as the neck and the Path Aggregation Network (PAN) [23] for path aggregation, instead of the Feature Pyramid Network (FPN) used in YOLOv3. The original YOLOv3 network is used for the head [24]. In addition to the new architecture, the authors also used training optimizations called the “Bag of Freebies”, such as CutMix, Mosaic, CIoU loss, and DropBlock regularization, to achieve greater accuracy without additional hardware costs. There is also the “Bag of Specials”, a set of modules that only slightly increase the hardware costs while significantly increasing detection accuracy. To train and evaluate the YOLOv4 model, we used the Darknet framework [25], an open-source neural network framework written in C and CUDA that supports CPU and GPU computing. For the experiments, we used Google Colab [26], a free tool for machine learning, and a local Dell G3 computer with an i7-9750H CPU, 16 GB of RAM, and a GeForce GTX 1660 Ti 6 GB GPU, running the 64-bit Ubuntu 16.04 operating system.
3.3 Evaluation Metrics
We use average precision (AP) to evaluate the detection results.
AP is a metric that considers the number of correctly and incorrectly classified samples of a particular class and is used to determine the detection model’s overall detection power, not just accuracy [27]. In this experiment, we used three AP measures in the MS COCO format that take localization accuracy (IoU) into account:
- AP, averaged over 10 IoU thresholds (0.50:0.05:0.95),
- AP50, at IoU = 0.50,
- AP75, at IoU = 0.75.
The original COCO script was used to calculate the results.
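To make the role of the IoU thresholds concrete, the following sketch computes the IoU of two boxes and lists the COCO thresholds at which a detection would still count as a match. It is a simplified illustration and not the COCO evaluation script; the box format `(x_min, y_min, x_max, y_max)` and the two example boxes are assumptions made here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# The ten IoU thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged
COCO_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

ground_truth = (10, 10, 50, 90)  # hypothetical annotated person
detection = (14, 12, 52, 88)     # hypothetical predicted box

overlap = iou(ground_truth, detection)
matched_at = [t for t in COCO_THRESHOLDS if overlap >= t]
print(round(overlap, 3), matched_at)
```

A detection that is a true positive for AP50 may thus become a false positive at the stricter thresholds, which is why AP75 and the averaged AP are lower than AP50 in the tables that follow.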
4 Results of Transfer Learning Methods and Discussion
This section presents the overall performance results from the conducted experiments. It is worth mentioning that the pre-trained YOLOv4 with weights (yolov4.conv.137 [25]) learned on the MS COCO [28] dataset was trained on three training datasets with
different transfer learning methods to identify the transfer learning variant that provides the best solution for person detection in SAR scenes. In all cases, the YOLOv4 model was trained with a batch size of 64, 32 subdivisions, and 6,000 iterations. The learning rate, momentum, and decay for the training process were set to 0.001, 0.949, and 0.0005, and the input width and height to 512. Before training, the parameters of the original model must be changed and adapted to our domain. The first step is to change the number of classes from 80, which corresponds to the number of MS COCO classes, to 1, the person class in this experiment. After defining the number of classes, the number of filters in the convolutional layer preceding each YOLO layer must be set to 18, as defined in (1), where classes is the number of classes (classes = 1 in our case).
filters = (classes + 5) × 3    (1)
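Equation (1) can be checked with a few lines of code. In the sketch below, the function name and the `anchors_per_scale` parameter are introduced for illustration; the factor 3 is the number of anchor boxes per detection scale, and the constant 5 covers the four box coordinates plus one objectness score.

```python
def yolo_filters(num_classes, anchors_per_scale=3):
    """Filters in the convolutional layer before each YOLO layer, as in Eq. (1).

    Each anchor predicts 4 box coordinates, 1 objectness score, and one score per class.
    """
    return (num_classes + 5) * anchors_per_scale

print(yolo_filters(80))  # MS COCO configuration with 80 classes
print(yolo_filters(1))   # single person class used in this experiment
```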
The impact of applying each of the transfer learning methods in training the detection model on the detector’s results in search and rescue operations is given below.
4.1 Fine-Tuning the YOLOv4 Model to the Target Domain
In network-based deep transfer learning, the pre-trained YOLOv4 model trained on the COCO source domain was fine-tuned to the target domain: the SARD, VisDrone, or Corr-SARD dataset. A sketch of network-based deep transfer learning is shown in Fig. 2.
Fig. 2. A network-based deep transfer learning: the first network was trained in the source domain (in our case MS COCO), and then the pre-trained network was fine-tuned on the target domain (SARD dataset).
For a more straightforward presentation of the results, the model trained on the SARD training dataset is designated as the SARD model. The model labeled COCO refers to the model pre-trained on the MS COCO dataset. Table 1 shows the results of person detection on SARD images in terms of the AP metrics for the original YOLOv4 model and for the YOLOv4 model that was further trained on SARD images. The results show a significant improvement in AP (Imp 37.9), AP50, and AP75 after fine-tuning the model on the SARD dataset.

Table 1. Results of YOLOv4 models on SARD test dataset in case of network-based deep transfer learning

Model   AP     AP50   AP75   Imp
COCO    23.4   40.2   25.3   -
SARD    61.3   95.7   71.7   37.9
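The Imp column appears to be the absolute difference in AP relative to the pre-trained COCO model. This reading is inferred from the reported values rather than stated as a definition by the authors; the short check below, with names introduced here, reproduces the Table 1 entry under that assumption.

```python
# AP values on the SARD test set, copied from Table 1
AP_BASELINE_COCO = 23.4
AP_FINE_TUNED_SARD = 61.3

def improvement(ap_model, ap_baseline=AP_BASELINE_COCO):
    """Absolute AP gain over the pre-trained COCO model, in percentage points."""
    return round(ap_model - ap_baseline, 1)

print(improvement(AP_FINE_TUNED_SARD))
```

The same subtraction also matches the Imp values reported in Tables 2 to 4, for example 62.0 - 23.4 = 38.6 for the V + C + S model.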
4.2 Instances-Based Deep Transfer Learning with SARD, Corr-SARD, and VisDrone Datasets
After applying network-based transfer learning, we applied several variants of instance-based deep transfer learning to further train the YOLOv4 model on a series of sets (VisDrone, Corr-SARD, and SARD). From the VisDrone set, we selected only those instances relevant to our target domain, i.e., images that contain a person. The VisDrone training set that we used contains approximately the same number of images as the SARD training set, but it has 25,876 more annotated objects: 29,797 marked persons in VisDrone compared with 3,921 in the SARD dataset. In the first case of instance-based transfer learning, the original model was first trained on the selected part of the VisDrone dataset and then fine-tuned on the SARD training dataset (V + S model). A sketch of instance-based deep transfer learning with the VisDrone and SARD datasets is shown in Fig. 3.
Fig. 3. Instance-based deep transfer learning. We selected only images relevant to our target domain and trained the model with it from the source domain. In the second step, the model was trained on the SARD dataset.
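The two-stage procedure can be sketched as a chain of Darknet training runs in which each stage starts from the weights produced by the previous one. The commands below are not the authors’ actual commands: the `.data` and `.cfg` file names and the `backup/` directory are hypothetical, and the use of the `-clear` flag of the AlexeyAB fork [25] to reset the stored iteration counter between stages is an assumption of this sketch.

```python
def training_commands(stages, initial_weights="yolov4.conv.137"):
    """Build one Darknet training command per stage; each stage starts from the previous result."""
    commands = []
    weights = initial_weights
    for name in stages:
        data_file = f"data/{name}.data"      # hypothetical dataset description file
        cfg_file = f"cfg/yolov4-{name}.cfg"  # hypothetical single-class configuration
        command = ["./darknet", "detector", "train", data_file, cfg_file, weights]
        if weights != initial_weights:
            # Assumed: reset the iteration counter when continuing from a trained model
            command.append("-clear")
        commands.append(command)
        # Darknet names the latest checkpoint after the cfg file (backup directory assumed)
        weights = f"backup/yolov4-{name}_last.weights"
    return commands

for cmd in training_commands(["visdrone", "sard"]):  # V + S order
    print(" ".join(cmd))
```

Reversing the list to `["sard", "visdrone"]` gives the S + V order, which makes it easy to compare the effect of the training order.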
According to the results presented in Table 2, additional model training on the VisDrone set (model V + S) did not change the AP obtained with the SARD model. However, it improved the results compared to the original model (Imp 37.9). Training on the Corr-SARD training dataset contributed to a slight improvement in detection results compared with the SARD model and a significant AP improvement over the original model (Fig. 4). The results also show that transfer learning is not commutative and that the order of the sets used to train the model affects the detection results. The best results are achieved when the model is last fine-tuned on the dataset on whose examples it will be tested, so the V + S model achieves significantly better results than the S + V model. We also tested instance-based deep transfer learning using three datasets, in which the original model was fine-tuned on the SARD training set after training on the VisDrone and Corr-SARD datasets (V + C + S model).

Table 2. Results of YOLOv4 models built with instance-based transfer learning on the SARD test set

Model       AP     AP50   AP75   Imp
S + V       22.8   41.7   23.7   −0.6
V + S       61.3   95.8   70.6   37.9
V + C + S   62.0   95.9   71.9   38.6
Table 3 shows the individual detection results on the SARD test set obtained when the original model was additionally trained on the VisDrone and Corr-SARD sets. To simplify the notation, the model trained on the VisDrone dataset is designated as VisDrone, and the model trained on the Corr-SARD dataset as Corr-SARD.
Fig. 4. Using the Corr-SARD dataset for transfer learning. After training on the SARD dataset, the model was re-trained on the same images with added effects.
The results are interesting: fine-tuning the original model on the VisDrone set lowered the detection results, even though the original COCO dataset does not include shots of people taken by drone, whereas the VisDrone set, like the target SARD test set, does include them, albeit in urban areas. The use of the synthetic Corr-SARD set contributed to improved person detection on the SARD test set.

Table 3. Results of YOLOv4 models on SARD test dataset after learning on the VisDrone set and Corr-SARD set

Model       AP     AP50   AP75   Imp
VisDrone    18.9   33.2   20.5   −4.5
Corr-SARD   54.9   90.5   61.9   31.5
4.3 Mapping-Based Deep Transfer Learning with Images from SARD, Corr-SARD, and VisDrone Datasets
In mapping-based deep transfer learning, several new sets were made for training the model as unions of images from the VisDrone, SARD, and Corr-SARD training sets. These are the SV set, created as a union of images from the SARD and VisDrone training sets, the SC set, created by merging images from the SARD and Corr-SARD training sets, and the SVC set, created as a union of images from all three sets. A sketch of mapping-based deep transfer learning is shown in Fig. 5. The results in Table 4 show that transfer learning on the newly created sets (SV, SC, SVC) significantly improved the detection results compared with the original model, with relatively high AP scores: 59.4% for the SC model, 55.4% for SV, and 56.4% for SVC. After transfer learning on the new sets, AP is 32 to 36 percentage points higher than with the original model (Imp column in Table 4). However, the results of the models trained on the newly created sets SV, SC, and SVC are comparable to but still slightly lower than those of the model fine-tuned only on the training data from the target set (the SARD model).

Table 4. Results of YOLOv4 models built with mapping-based transfer learning methods on the SARD test set

Model   AP     AP50   AP75   Imp
SV      55.4   92.5   60.8   32.0
SC      59.4   94.7   67.4   36.0
SVC     56.4   93.6   63.1   33.0
From the obtained results, it can be concluded that deep transfer learning based on mapping achieved relatively good AP results, but that these results are still
worse than those of deep transfer learning based on instances and on network transfer. Overall, the best AP score of 62.0% was achieved with the V + C + S model, followed closely by the SARD model, fine-tuned only on the SARD training set, with an AP of 61.3%.
Fig. 5. Mapping-Based Deep Transfer Learning. Images from the Target SARD Dataset are Mapped with Images from the VisDrone and Corr-SARD Datasets.
Additionally, to evaluate the performance of the SV, SC, and SVC models built with mapping-based transfer learning on matching test sets, the models were also tested on test sets generated in the same way as the SV, SC, and SVC training sets but from the corresponding test sets.

Table 5. Results of YOLOv4 models built with mapping-based transfer learning on the corresponding test sets

Model   Test set   AP     AP50   AP75
SV      SV test    29.7   61.7   24.6
SC      SC test    55.8   91.6   61.7
SVC     SVC test   31.7   64.4   27.9
The results of the models built with mapping-based transfer learning and tested on the test parts of the SV, SC, and SVC sets are shown in Table 5; they are worse than when the models are tested only on the SARD test set. The SC model showed only a minor difference between its performance on the SC test set and its detection results on the SARD test set. This was expected because the Corr-SARD images included in the SC test set are SARD images with added effects of bad weather.
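The size of the gap between the two test conditions can be read directly from Tables 4 and 5. The short calculation below uses only the AP values reported there; the dictionary and function names are introduced for illustration.

```python
# AP of each mixed-set model on the SARD test set (Table 4) and on its own mixed test set (Table 5)
AP_ON_SARD_TEST = {"SV": 55.4, "SC": 59.4, "SVC": 56.4}
AP_ON_MIXED_TEST = {"SV": 29.7, "SC": 55.8, "SVC": 31.7}

def ap_drop(model):
    """Difference in AP (percentage points) between the SARD test set and the mixed test set."""
    return round(AP_ON_SARD_TEST[model] - AP_ON_MIXED_TEST[model], 1)

for model in AP_ON_SARD_TEST:
    print(model, ap_drop(model))
```

The drop is a few points for SC but roughly 25 points for the two models whose test sets contain VisDrone images, which is consistent with the explanation given above.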
5 Conclusions
In this paper, transfer learning approaches to improve person detection in drone images for SAR missions were examined. We fine-tuned the YOLOv4 model using different transfer learning methods on three datasets: SARD, a tailor-made set for SAR missions; VisDrone, a drone-recorded dataset of urban places; and Corr-SARD, a dataset with synthetically added weather effects on SARD images. We compared and discussed the impact of the transfer learning methods used in YOLOv4 model training on detection results. Testing was performed on the target dataset SARD and on the newly created datasets SV, SC, and SVC, created by merging the initial sets. The results show that the best detection results on the target SARD domain are achieved using network-based transfer learning when the set on which the model is fine-tuned has the same distribution as the set on which the model is tested. The best results were achieved by applying the network-based transfer learning method, which transfers features obtained on large datasets, and the instance-based transfer learning method, in which the model is trained on images of the domain corresponding to the images on which the model will be tested. The use of synthetic image instances further improved the performance of the model. From the results, we also see that the worst results were obtained when the datasets were merged because, in that case, the model could not fully adapt to the data of interest. However, by increasing the learning data in this way, a more general model can be achieved. It has been shown that when training models with multiple datasets, it matters whether we train with all images simultaneously or on each set individually, and in which order the sets are used during training. 
For future work, we plan to explore the impact of different transfer learning methods on various application domains and determine the key characteristics of learning datasets that positively impact model performance.
Also, we are interested in further exploring different network strategies for selecting, merging, and changing network layers to improve detection results.
References
1. Sambolek, S., Ivašić-Kos, M.: Detection of toy soldiers taken from a bird’s perspective using convolutional neural networks. In: Gievska, S., Madjarov, G. (eds.) ICT Innovations 2019. CCIS, vol. 1110, pp. 13–26. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-33110-8_2
2. Sambolek, S., Ivasic-Kos, M.: Person detection in drone imagery. In: 2020 5th International Conference on Smart and Sustainable Technologies (SpliTech), pp. 1–6. IEEE, September 2020
3. Kristo, M., Ivasic-Kos, M., Pobar, M.: Thermal object detection in difficult weather conditions using YOLO. IEEE Access 2020, 125459–125476 (2020)
4. Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., Liu, C.: A survey on deep transfer learning. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds.) ICANN 2018. LNCS, vol. 11141, pp. 270–279. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01424-7_27
5. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724 (2014)
6. Buric, M., Pobar, M., Ivasic-Kos, M.: Adapting YOLO network for ball and player detection. In: 8th International Conference on Pattern Recognition Applications and Methods, pp. 845–851 (2019)
7. Ivasic-Kos, M., Kristo, M., Pobar, M.: Human detection in thermal imaging using YOLO. In: 5th International Conference on Computer and Technology Applications, pp. 20–24 (2019)
8. Wang, T., Huan, J., Zhu, M.: Instance-based deep transfer learning. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 367–375. IEEE, January 2019
9. Pobar, M., Ivasic-Kos, M.: Active player detection in handball scenes based on activity measures. Sensors 20(5), 1475 (2020)
10. Cheng, C., Zhou, B., Ma, G., Wu, D., Yuan, Y.: Wasserstein distance based deep adversarial transfer learning for intelligent fault diagnosis. arXiv preprint arXiv:1903.06753 (2019)
11. Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., Darrell, T.: Deep domain confusion: maximizing for domain invariance. arXiv preprint arXiv:1412.3474 (2014)
12. Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: International Conference on Machine Learning, pp. 97–105. PMLR, June 2015
13. Long, M., Zhu, H., Wang, J., Jordan, M.I.: Deep transfer learning with joint adaptation networks. In: International Conference on Machine Learning, pp. 2208–2217. PMLR, July 2017
14. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13(1), 723–773 (2012)
15. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
16. Goodfellow, I.J., et al.: Generative adversarial networks. arXiv preprint arXiv:1406.2661 (2014)
17. Buric, M., Paulin, G., Ivasic-Kos, M.: Object detection using synthesized data. In: ICT Innovations 2019, Web Proceedings (2019)
18. Zhu, P., Wen, L., Bian, X., Ling, H., Hu, Q.: Vision meets drones: a challenge. arXiv preprint arXiv:1804.07437 (2018)
19. Sambolek, S., Ivasic-Kos, M.: Detecting objects in drone imagery: a brief overview of recent progress. In: 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO), pp. 1052–1057. IEEE (2020)
20. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
21. Wang, C.Y., et al.: CSPNet: a new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 390–391 (2020)
22. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916 (2015)
23. Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768 (2018)
24. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
25. Darknet. https://github.com/AlexeyAB/darknet. Accessed 21 Feb 2021
26. Google Colab. https://colab.research.google.com/. Accessed 21 Feb 2021
27. Padilla, R., Netto, S.L., da Silva, E.A.: A survey on performance metrics for object-detection algorithms. In: 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 237–242. IEEE, July 2020
28. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Video Processing Algorithm in Foggy Environment for Intelligent Video Surveillance
Alexey Nikolayevich Subbotin1(B), Nataly Alexandrovna Zhukova1,2, and Tianxing Man3
1 Saint-Petersburg State Electrotechnical University, St. Petersburg, Russia
2 Saint Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
3 ITMO University, St. Petersburg, Russia
Abstract. This article discusses the use of foggy environments for intelligent video surveillance. Foggy environments are needed for processing data using embedded computers with low computing power. The use of a foggy environment significantly reduced the processing time of video information on single-board and embedded computers and systems. The proposed algorithm is significant for the developers of intelligent surveillance systems. The effectiveness of the algorithm for processing video images in foggy environments is shown on examples from various subject domains, in particular, for subway video images.
Keywords: Foggy environments · Intelligent video surveillance · Fog computing · Embedded computers · Video processing algorithms
1 Introduction
Intelligent video surveillance systems are not just a program or an algorithm but complex hardware and software systems for the automated collection of information from streaming video. These systems rely on various algorithms for image recognition, systematization, and processing of the obtained data. The capabilities of such systems are wide and varied. One example is a digital pass, where people can enter an organization or institution only by face recognition, or get a loan from an ATM without visiting a bank office. A huge number of such systems are available on the service market (here are some of them: https://www.aurabi.ru, https://securtv.ru, https://www.proline-rus.ru, https://sks-sp.ru, and many other offers). Another example is a solution for stores, supermarkets, and shopping centers to search for intruders. The economic and organizational effect, as well as the increase in the level of security achieved through the use of an intelligent video surveillance system, is visible both for large networks with a wide territorial distribution and for small business systems. At the moment, it is customary to classify intelligent video surveillance systems into three classes according to the type of hardware and software: server systems, built-in intelligent systems, and systems for distributed processing of video data.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 702–715, 2022. https://doi.org/10.1007/978-3-030-82196-8_52
Analytical data processing in server systems is carried out centrally on a video server or PC. The hardware components are two interacting processors: the central processor (CPU) and the graphics processor (GPU). The main advantage of server-based intelligent video surveillance systems is the ability to use software that allows adding modules and video processing algorithms to the existing shell, as well as combining existing algorithms [1]. However, the main disadvantage of a server system is the need for constant transmission of high-resolution video from the cameras to the server, which significantly loads the communication channels. A variant in which intelligent algorithms are built directly into the video cameras is often used. A partially or fully processed picture with the analysis results (metadata) is transmitted to the video recorder or server. This method significantly reduces (by a factor of 10 or even 100) the load on the communication channels [2]. However, video cameras have a limited set of analytical functions, and their cost is much higher than that of conventional devices. The distributed version of video data processing assumes that primary information analysis, which does not require complex algorithms, is performed on the video cameras themselves, while more complicated intelligent processing, which requires CPU load, is performed using the server’s capacity [3–5]. Single-board computers such as the Raspberry Pi and other embedded computers lag far behind PCs in performance [6]. An analysis of the software market shows that the majority of video surveillance systems use, for one reason or another, embedded computers, which are severely limited in performance and cannot process video information quickly. Thus, the main problem of the existing systems is the low speed of video processing on embedded computers. We suggest solving this problem using foggy environments. 
It is the foggy environment that contributes to load balancing between devices and cloud systems.
Using fog technologies greatly simplifies the video processing task. We present a new algorithm for video processing in foggy environments that is based on machine learning models. The rest of this article is structured as follows: Sect. 2 reviews related work on intelligent video surveillance; Sect. 3 describes the proposed video processing algorithm for foggy environments; Sect. 4 presents a case study that demonstrates the effectiveness of the algorithm. The last section gives the conclusion and future work.
2 Background
2.1 Intelligent Video Surveillance
The main component of intelligent video surveillance is the set of video analysis algorithms, which include the following. Perimeter control analytics are used in systems that protect long perimeters. They react to the shape, speed, and location of objects. One of the most reliable algorithms is presented in [7, 12]; it works well not only for cyclists, dogs, or cats but also for objects lying near the controlled area.
A. N. Subbotin et al.
Business analysis in intelligent video surveillance systems is used to monitor staff productivity, optimize service processes, identify dissatisfied customers, and investigate the reasons for their dissatisfaction [8, 13, 14]. It involves building a large number of reports with the ability to create individual data filters. Biometric analysis covers the various methods of biological identification of an object. Here the algorithm traditionally operates with concepts such as a base of tolerances and black and white lists. Some intelligent video surveillance systems can run more complex algorithms. Multi-camera analysis enables automatic tracking of an object across multiple cameras; the result is the trajectory of the object within the protected area [9, 15]. Tamper detection is the constant monitoring of equipment, with special attention paid to technical malfunctions and to preventing attempts to block a camera, illuminate or darken the lens, shift the body, or change the picture [10, 11].
2.2 Intelligent Video Surveillance Systems
The area of video surveillance systems is open to development. Ready-made systems are available with the software included in the package, but their algorithms are proprietary and closed to research under international copyright. They have several limitations that often prevent their use in practice:
• Use of algorithms with low object detection accuracy;
• Low speed and delays in video processing;
• Lack of a clear software structure, and constant patches;
• No support for scheduling at high server load;
• Storage of information in the cloud.
We considered video surveillance systems such as SecurOS (https://securos.ruprogram.ru), Cyber Vision Control (https://cvc.ai), Selectel (https://selectel.ru), Finnlock Security (https://finnlock.spb.ru), ProfiTB (http://profitsec.ru), Maskom Vostok (https://www.mascom-vostok.ru), Domofonov.net Moscow (https://domofonov.net), Elzam (http://elzamvideo.ru), Nienschanz-Avtomatika (https://nnz-ipc.ru), Dom.ru (https://spb.b2b.domru.ru) and many other systems that are less popular in the intelligent video surveillance market.
2.3 Hardware for Intelligent Video Surveillance Systems
Intelligent video surveillance systems built on video servers are considered optimal in terms of functional efficiency [2], but the equipment is quite expensive. For example, a video server based on a Xeon E3 V3 processor for connecting 85 2-Mpix video cameras costs at least $2,500 (without a disk array). Such devices support the integration of any video analytics modules. A similar device for 100 network cameras costs at least $6,000, and the maximum configuration for 700 cameras costs about $29,000. In addition, one should take into account the need to purchase several specialized disks for storing the video archive. For example, to store data
with a retention period of only 30 days for 60 cameras recording continuously at 12 frames/s, six HDDs with a total archive capacity of 36 TB, combined into a RAID 5 or RAID 6 array, are needed, which not every organization can afford [9, 16]. The workstation of the video surveillance dispatcher must be equipped with widescreen monitors with a diagonal of at least 23 inches so that up to 16 streams can be displayed simultaneously [3, 17]. If necessary, the PC must support hardware decoding of the incoming video signal. This requires significant computing power and a CPU of at least Intel Core i7 class. Most intelligent video surveillance systems support not only IP video cameras but also high-resolution analog cameras in the AHD, TVI, and CVI formats. Analog devices can be connected to a video server through a specialized adapter [3, 18, 19].
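The archive figures quoted above (36 TB for 60 cameras over 30 days) can be checked with a short calculation. The Python sketch below assumes that the 36 TB is usable capacity in decimal terabytes; the text does not say whether RAID parity is included, and the function names are illustrative only.

```python
# Figures quoted in the text above.
CAMERAS = 60
RETENTION_DAYS = 30
ARCHIVE_TB = 36  # assumption: usable capacity, decimal terabytes


def per_camera_daily_gb(archive_tb: float, cameras: int, days: int) -> float:
    """Average archive space consumed by one camera in one day, in GB."""
    return archive_tb * 1000 / (cameras * days)


def average_bitrate_mbps(daily_gb: float) -> float:
    """Average stream bitrate (Mbit/s) that fills `daily_gb` in 24 hours."""
    bits_per_day = daily_gb * 1e9 * 8
    return bits_per_day / 86_400 / 1e6


daily_gb = per_camera_daily_gb(ARCHIVE_TB, CAMERAS, RETENTION_DAYS)
print(f"{daily_gb:.1f} GB per camera per day")
print(f"{average_bitrate_mbps(daily_gb):.2f} Mbit/s average per camera")
```

Under these assumptions each camera may write about 20 GB per day, which corresponds to an average stream of a little under 2 Mbit/s.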
3 Video Processing Algorithm with Foggy Environment
We propose a new video processing algorithm for foggy environments. It is shown as a block diagram in Fig. 1. The general idea is to apply several machine learning models repeatedly to improve the accuracy of object detection and to use fog technologies to increase the speed of the algorithm. Different types of objects are considered: people, children, mothers with strollers, disabled people, elderly people with large bags, people with large objects, and people in winter clothes. The machine learning models are prepared by experts, by amateurs on the Internet, or by analytics professionals, or are purchased commercially from various organizations. A central concept in this work is the script, a program in the Python programming language that is generated by software tools (Lazarus 2.0.10 and the Everest 1.7.3 network libraries for the Linux operating system). Generating and modifying the machine learning scripts relies on an additional pre-installed application, SDR 2.4.7 (Remote Rendering Systems). The program shares computing resources between devices and cloud servers so that the assigned tasks are solved quickly and accurately. The foggy environment in this algorithm is an intermediate layer through which all devices with limited computing resources (tablets, smartphones, PCs, embedded computers, etc.) interact. The proposed algorithm searches for and processes objects in the video image with higher accuracy and speed than its analogs. It builds on machine learning and neural networks for object detection and state signaling.
The proposed algorithm meets the following requirements:
• Fast image processing on single-board or embedded computers;
• Image processing in foggy environments that is significantly faster than on a single-board or embedded computer alone;
• Two levels of image processing (server-client) using fog technologies;
• Transmission of the requirements and conditions for video processing from the operator to the server;
• The ability to receive video images from various sources (on the principle of one server and many clients) and to synchronize video processing between different programs;
Fig. 1. Block diagram of the algorithm operation
• Centralized management of video processing via the website, the mobile application, or the operators' software.
The algorithm is presented below as a series of structured steps combined into blocks.
Dynamic Editing of Machine Learning Scripts
At the first stage, the machine learning models and algorithms are chosen according to the video processing tasks to be solved. The following factors are considered when choosing an algorithm:
• Parameters transmitted by the operator (types of sought-after objects: people, mothers with strollers, disabled people, animals; people with luggage, carts, and suitcases; children, etc.);
• Machine learning models available on the image processing server.
To run the algorithm with the required parameters, machine learning scripts are generated dynamically. Script generation consists of editing a predefined script and creating input files. The generated script is installed and executed on a server, and the results are saved on the image processing server for further analysis.
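As an illustration of "editing a predefined script and creating input files", the following minimal Python sketch fills a template with operator parameters and writes an input file next to it. The template text, the file names `detect.py` and `input.json`, and the function `generate_script` are hypothetical; the paper does not show the scripts produced by its own tools.

```python
import json
import tempfile
from pathlib import Path
from string import Template

# Hypothetical predefined script with placeholders (not the authors' template).
SCRIPT_TEMPLATE = Template('''\
import json

with open("$input_file") as fh:
    params = json.load(fh)
print("model:", "$model_name", "targets:", params["object_types"])
''')


def generate_script(out_dir: Path, model_name: str, object_types: list[str]) -> Path:
    """Write the input file and the edited script; return the script path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    input_file = out_dir / "input.json"
    input_file.write_text(json.dumps({"object_types": object_types}))
    script = out_dir / "detect.py"
    script.write_text(
        SCRIPT_TEMPLATE.substitute(
            input_file=input_file.as_posix(), model_name=model_name
        )
    )
    return script


# Usage: generate a script for two object types and show the result.
with tempfile.TemporaryDirectory() as tmp:
    path = generate_script(Path(tmp), "people_v1", ["people", "strollers"])
    print(path.read_text())
```

Keeping the parameters in a separate input file means the same generated script can be rerun with different operator settings without regenerating the code.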
Selecting Profile and Keywords
At the second stage, the operator selects keywords for machine learning video processing according to the type of video being processed (animation, people, layered images, passenger counting on an escalator, etc.) and defines the video processing parameters. The system performs a preliminary analysis of a sequence of images from the video stream to determine their main characteristics (frames per second, resolution, quality, color depth for all channels, etc.). It analyzes the parameters defined by the operator together with these characteristics and decides whether to take into account glare, fog, darkness, frost in winter, etc. Taking the video characteristics into account, the system runs a script to detect objects in the sequence of images. The identified objects are checked against the keywords and the object identification accuracy required by the operator. The keywords that match the identified objects and meet the accuracy requirements are used to build a machine learning model and to generate the corresponding machine learning script.
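The keyword selection described above can be sketched as a simple filter. In this Python sketch the `Detection` record, the confidence scale of 0 to 1, and the function `select_keywords` are assumptions made for illustration; the paper does not specify how detections or accuracy requirements are represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str         # object type reported by the detection script
    confidence: float  # assumed to lie in [0, 1]


def select_keywords(detections, keywords, min_confidence):
    """Return the operator keywords matched by a sufficiently confident detection."""
    best = {}
    for det in detections:
        if det.label in keywords:
            # Keep the highest confidence seen for each matching keyword.
            best[det.label] = max(best.get(det.label, 0.0), det.confidence)
    return sorted(k for k, c in best.items() if c >= min_confidence)


detections = [
    Detection("person", 0.92),
    Detection("dog", 0.95),       # not an operator keyword, ignored
    Detection("stroller", 0.40),  # below the required accuracy
    Detection("person", 0.50),
]
print(select_keywords(detections, {"person", "stroller"}, min_confidence=0.8))
```

Only the keywords returned by the filter would then be passed on to model building and script generation.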
The system runs the machine learning script. The script determines the number of objects in each image and identifies them. The probability of occurrence of objects of each type is also estimated, and the total probability of identifying the required objects is calculated.
Dynamic Editing of the Database of Images
At the third stage, the input images are edited dynamically. The system obtains the technical characteristics of the original images (resolution, detail, color depth in all channels, and which channels are present if the image is black and white) and determines possible image defects (sepia, streaks, highlights, overlays, quick inserts, blurring, applied filters, changed color saturation, etc.). The defects are determined using a pre-trained machine learning model. The system classifies the processed images according to the detected defects and selects a defect detection model from the models that have been prepared in advance and loaded on the server. The machine learning script for identifying objects is then edited to take into account the defects reported by the defect classifier. The system processes the images and compares the defect probabilities. Images for which the probability of detected defects exceeds 0.80 (80%) are rejected; the remaining images are added to the list of images for further processing.
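The rejection rule at the end of this stage can be sketched as a threshold filter. The sketch below is in Python and reads the stated threshold as a probability of 0.80, which is an assumption about the intended value; the image names and the function `split_by_defect` are hypothetical.

```python
DEFECT_THRESHOLD = 0.80  # assumption: the threshold is a probability (80%)


def split_by_defect(defect_probs: dict[str, float],
                    threshold: float = DEFECT_THRESHOLD):
    """Partition image names into (kept, rejected) by defect probability."""
    kept = [name for name, p in defect_probs.items() if p <= threshold]
    rejected = [name for name, p in defect_probs.items() if p > threshold]
    return kept, rejected


# Hypothetical defect probabilities produced by a defect classifier.
probs = {"frame_001.png": 0.10, "frame_002.png": 0.80, "frame_003.png": 0.93}
kept, rejected = split_by_defect(probs)
print("kept:", kept)
print("rejected:", rejected)
```

An image exactly at the threshold is kept here, because the text rejects only images whose defect probability exceeds it.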
Temporary Additional Training
The fourth stage aims to increase the accuracy of object detection using recurrent neural networks (NN). For NN-based image processing, the system creates catalogs containing images of different types. We propose using NN models of no more than 10 layers, because beyond 10 layers the additional computational cost is out of proportion to the gain in identification accuracy. NN-based processing finds pairs of images that contain the same objects recorded from different angles. These pairs are in turn grouped into pairs of pairs, and so on. For each resulting group, the maximum probability that similar objects are present in all images of the group is estimated. Based on these estimates for all types of objects, the array of images for subsequent analysis is formed.
Searching of Objects in the Images
The fifth stage is the search for objects in the images. The system gets all images from the different surveillance cameras for a certain period of time. The machine learning script is edited so that objects are analyzed down to their fine details. To this end, different machine learning models are used in combination, which makes it possible to identify objects from images in which they appear at different angles. This increases the accuracy of object identification because different fragments of the objects are considered. Summary statistics are calculated, and the probability of identifying each object is estimated.
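The grouping step of the fourth stage, in which matched image pairs are merged into progressively larger groups, can be sketched with a union-find structure. This is one possible realization chosen for illustration in Python; the paper does not state how the grouping is implemented, and the image identifiers are hypothetical.

```python
def group_images(pairs):
    """Merge matched image pairs into groups of images showing the same object."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two groups

    groups = {}
    for image in parent:
        groups.setdefault(find(image), set()).add(image)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical pairs reported by the matching network.
pairs = [("cam1_a", "cam2_a"), ("cam2_a", "cam3_a"), ("cam1_b", "cam3_b")]
print(group_images(pairs))
```

Pairs that share an image end up in the same group, so a chain of pairwise matches across cameras yields one group per object.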
Defining Objects Using Various Sources and Different Surveillance Cameras
At the sixth stage, different sources and different surveillance cameras are used to define objects in the video images more accurately. The system creates one machine learning model for the images of each object type. Only images that were not excluded at the defect detection stage (glare, frost, fog, etc.) are considered. For each object type, the corresponding machine learning script is edited. The probabilities of identifying objects of the different types are calculated.
Dynamic Selecting of Trained Models
At the seventh stage, the image processing results are evaluated. They are compared with the results of processing the same images with external machine learning models. The external models are selected dynamically using expert
systems. Access to the selected external models is provided through Microsoft.API, Google.API, and Amazon.API. The results obtained with the proposed algorithm and with the external models are compared, and summary statistics on the accuracy of image processing are compiled.
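The comparison with external models can be sketched as an agreement summary over per-image labels. The Python sketch below assumes that each source yields one label per image, in the same image order; the label values and the function `agreement_summary` are illustrative, and no real external API is called.

```python
from collections import Counter


def agreement_summary(own_labels, external_labels):
    """Compare per-image labels from two sources and summarize agreement."""
    if len(own_labels) != len(external_labels):
        raise ValueError("both label lists must cover the same images")
    matches = sum(a == b for a, b in zip(own_labels, external_labels))
    disagreements = Counter(
        (a, b) for a, b in zip(own_labels, external_labels) if a != b
    )
    total = len(own_labels)
    return {
        "images": total,
        "agreement": matches / total if total else 0.0,
        "disagreements": dict(disagreements),
    }


# Hypothetical labels from the proposed algorithm and from an external model.
own = ["person", "dog", "person", "stroller"]
external = ["person", "cat", "person", "stroller"]
print(agreement_summary(own, external))
```

The disagreement counts show which object types the two sources confuse, which is the kind of summary statistic the seventh stage calls for.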
The proposed algorithm effectively finds objects in video images in which objects are difficult to identify. The accuracy of object identification is increased through dynamic editing of the image processing scripts and the use of machine learning models and recurrent neural networks (RNN). The algorithm has significant prospects because it requires neither expensive equipment nor specialized support personnel. This saves a considerable amount of human, financial, operational, and organizational resources. The high speed and accuracy of the algorithm offset the subscription fee for working in foggy environments.
4 Case Study
Multilayer video processing is a good example of video processing in foggy environments. The video output can contain multiple layers in different sectors of the image; in particular, some sectors can contain other video images. Processing such images currently requires high system performance and expensive equipment. Figure 2 illustrates a case in which a high frame rate is required while multiple layers are superimposed to obtain one video image.
The input and output frames of the 5-min video are shown in Fig. 2:
Fig. 2. Input and output frame of the sequence of images after processing them in foggy environments
An equally important task for intelligent video surveillance systems is the overlay of two or more video images in one layer, which requires considerable computing capacity. On embedded computers and systems, as well as on outdated PCs, this task is not always feasible. Consider an example in which a video surveillance operator needs to monitor three image streams on one computer screen: front, left side, and right side. Processing video frames with the preinstalled software takes a lot of time and handles only one frame at a time (Fig. 3). With fog technologies, the problem becomes solvable.
Fig. 3. Visual presentation of video image with multiple layers.
On the left is the video image with the layer overlay disabled, and on the right is the already processed video, where layers are combined into a single video image. The efficiency of the proposed algorithm was estimated using the developed SDR 2.4.7 application (Remote Rendering Systems) (Fig. 4).
Fig. 4. Measurements of the processing speed of a 5-min video fragment.
The processing times for a 5-min fragment on systems with different parameters are given in Table 1.
Table 1. Comparative analysis of system performance.
System | Parameters | Result
Notebook ASUS | AMD A9 9425, 2 × 3.1 GHz, RAM 4 GB, SSD 128 GB, Radeon R5 | 1 min 27 sec
PC i3 | Intel Core i3 9100F, 4 × 3600 MHz, 8 GB DDR4, GeForce GTX 1050 Ti, HDD 1 TB | 42 sec
PC Ryzen 3 | AMD Ryzen 3 1200, 4 × 3100 MHz, 8 GB DDR4, Radeon RX 570, HDD 1 TB, SSD 120 GB | 36 sec
Notebook MSI GF63 9RCX870RU | Intel Core i5 9300H, 4 × 2.4 GHz, RAM 8 GB, SSD 256 GB, GeForce GTX 1050 Ti Max-Q 4 GB | 39 sec
Notebook Lenovo IdeaPad 5 | AMD Ryzen 5 4500U, 6 × 2.3 GHz, 16 GB RAM, 512 GB SSD, Radeon Vega 6 | 37 sec
PC i7 | Intel Core i7 10700, 8 × 2900 MHz, 16 GB DDR4, GeForce RTX 2070 SUPER, HDD 1 TB, SSD 1000 GB | 32 sec
SDR 2.4.7 (this) | Azure, Core i7, 32 GB RAM on 4 virtual machines | 23 sec
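The relative speed of the fog-based system can be read off Table 1 with a short calculation. The Python sketch below assumes that the seven results are listed in the same order as the seven systems, which is how the table is read here; the function name `speedup_over` is illustrative.

```python
# Processing times for a 5-min fragment, in seconds, as read from Table 1.
# Assumption: the results column follows the order of the systems column.
TIMES_S = {
    "Notebook ASUS": 87,  # 1 min 27 sec
    "PC i3": 42,
    "PC Ryzen 3": 36,
    "Notebook MSI GF63": 39,
    "Notebook Lenovo IdeaPad 5": 37,
    "PC i7": 32,
    "SDR 2.4.7 (this)": 23,
}


def speedup_over(baseline: str, times: dict[str, int]) -> dict[str, float]:
    """How many times faster `baseline` is than each other system."""
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}


for name, ratio in speedup_over("SDR 2.4.7 (this)", TIMES_S).items():
    print(f"{name}: {ratio:.2f}x")
```

Under that reading, the ratios for this single fragment range from about 1.4 (PC i7) to about 3.8 (Notebook ASUS).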
We analyzed a number of 5-min fragments and plotted a graph (Fig. 5) that shows how much the processing time of video fragments in intelligent video surveillance systems decreases when the proposed algorithm is used.
Fig. 5. Time spent on processing with and without the proposed algorithm.
Processing of more complex sequences of video images with animation (Fig. 6) was also fast compared with traditional methods and algorithms. Animation processing requires linking multiple layers with different parameters of color, transparency, width, length, shape, and motion curves, and superimposing them on other layers. Figure 6 shows a sequence of rotating and intersecting cubes with a transparency effect and an overlay of an additional layer with the “Institute of Film and Television” logo. The entire process of composing the output video is highly time consuming.
Fig. 6. Video sequence with special effects modeled in Adobe After Effects.
By processing the video stream in foggy environments, it was possible to reduce the processing time of one frame from 8 to 4 s, that is, 4 times faster.
5 Conclusion
This article describes the capabilities of intelligent video surveillance in foggy environments. It addresses the problem of the low speed of video processing on embedded computers. The proposed algorithm targets image processing tasks whose performance requirements are very high while the computers in use, in particular single-board and embedded computers such as the Raspberry Pi, are severely limited in performance. By using foggy environments, machine learning models, and neural networks, the algorithm increases the speed of video processing by a factor of 4 and the accuracy of object detection by 17.59%. Processing speed and accuracy were measured on different systems, and the results obtained were the same.
References
1. Guraya, F.F.E., Cheikh, F.A.: Neural networks based visual attention model for surveillance videos. Neurocomputing 149, 1348–1359 (2015)
2. Sreenu, G., Saleem Durai, M.A.: Intelligent video surveillance: a review through deep learning techniques for crowd analysis. J. Big Data 6(48), 3–27 (2019)
3. Chitade, A.Z.: Colour based image segmentation using k-means clustering. Int. J. Eng. Sci. Technol. 2(10), 5319–5325 (2017)
4. Beauchemin, S.S., Barron, J.L.: The computation of optical flow. ACM, New York (2015)
5. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 711–720 (1997)
6. Cauwenberghs, G., Poggio, T.: Incremental and decremental support vector machine learning. Neural Inf. Process. Syst. 7, 209–218 (2019)
7. Burges, C.: A tutorial on support vector machines for pattern recognition. Data Mining Knowl. Discov. 2, 121–167 (2018)
8. Chapelle, O., Haffner, P., Vapnik, V.N.: Support vector machines for histogram-based image classification. IEEE Trans. Neural Netw. 10(5), 1055–1064 (1999)
9. Chen, X.R., Yuille, A.L.: Detecting and reading text in natural scenes. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Los Angeles, USA (2016)
10. Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press - University of London - United Kingdom (UK) (2017)
11. Hu, B., Zhou, N., Zhou, Q., Wang, X., Liu, W.: Diffnet: a learning to compare deep network for product recognition. IEEE Access 8, 19336–19344 (2020)
12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2018)
13. Aurelio, Y.S., Almeida, G.M., Castro, C.L., Braga, A.P.: Learning from imbalanced data sets with weighted cross-entropy function. Neural Process. Lett. 50(2), 1937–1949 (2019)
14. Chen, Y., Bai, Y., Zhang, W., Mei, T.: Destruction and construction learning for fine-grained image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5157–5166 (2019)
15. Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L., Fergus, R.: Training convolutional networks with noisy labels. In: ICLR (2018)
16. Cui, Y., Jia, M., Lin, T.-Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9268–9277 (2019)
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2018)
18. Tan, M., Le, Q.V.: Efficientnet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946 (2019)
19. Li, Y., Vasconcelos, N.: Repair: removing representation bias by dataset resampling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Document Digitization Technology and Its Application in Tanzania
Mbonimpaye John, Beatus Mbunda, Victor Willa, Neema Mduma, Dina Machuve, and Shubi Kaijage
The Nelson Mandela Institution of Science and Technology, Arusha, Tanzania
https://www.nm-aist.ac.tz/
Abstract. Document digitization is the process of converting information into a digital, computer-readable format. The traditional way of handling documents is challenging and has many disadvantages. The number of people using the internet and smartphones in Tanzania is growing, with mobile subscribers reaching 42% of the population in 2018, which makes it easy for people to adopt the technology. Most well-known digital scanner applications have limitations: they display advertisements, lock features behind extra payment, and add watermarks that reduce the user's sense of ownership of the document. To create a competitive advantage, we used an experimental procedure to validate our application and observed other document digitization applications to obtain valid primary data. Although other document digitization applications exist, most of their features are limited or paid, and they carry advertisements that may cause discomfort to users. Various Android applications help people digitize their documents easily. In this project, a mobile application was developed for scanning documents and converting them to PDF, extracting text from images, printing wirelessly, and sharing documents on social media. With this application, users get a document digitization application that is lightweight compared to others, uses device resources efficiently, and improves on functionality that other applications restrict.
Keywords: Machine learning · Optical Character Recognition (OCR) · Portable Document Format (PDF) · Joint Photographic Experts Group (JPEG) · Edge detection · Document conversion
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 716–723, 2022. https://doi.org/10.1007/978-3-030-82196-8_53
Document digitization is the process of converting information into a digital format that computers can understand; information in a computer is organized into bits [1,2]. Tanzania is undergoing a massive digital transformation, reflected in the growing number of people connected to communications and internet services. This is having a profound impact on the country's social, cultural, and economic frameworks through enhanced access to key services, which leads to improved productivity and efficiency across various sectors. Keeping documents in hard copy is challenging for communities, private organizations, government, and individuals. Currently, many private organizations, government bodies, and the community in general keep documents in hard copy. Converting these documents to digital form is very costly because equipment must be purchased for the purpose. Previously it was hard to create an editable version of a document: the text had to be retyped, which is time consuming and error-prone [3]. In our approach, digitization is done with a mobile phone camera, which reduces costs, increases productivity, improves access to information, and prevents the loss of critical information or records by providing a backup if the physical documents are lost. The document digitization application can edit and scan papers with high-resolution images as well as in black and white, and store them in different formats and at different scales. This study aims to provide services that help organizations become completely paperless, which is especially critical for data-sensitive documents. The proposed application requires a negligible amount of manpower and care. It also addresses the need to keep data easily accessible in case of natural disasters.
The following sections cover related work, the methodology and procedure of the project study, the results and discussion, and future work to improve the study.
2 Literature Review
The use of technology has increased with its rapid development, and this has helped to solve real-life problems. Document digitization is one such case, and several efforts have been made to improve the process. Smartphones are now replacing handheld digital scanners. The authors of [4] proposed methods for evaluating captured documents based on the text extracted with OCR technology. The study showed that the output differs when the same document is captured under different conditions and parameters. As a benchmark for document digitization with a smartphone, [5] proposed a model for checking the performance and efficiency of document digitization technology and its application to constructing digital libraries in China using the China Asian Optical Character Recognition (TH-OCR) system. The study showed that CamScanner helps in the digitization of ancient Chinese documents. Even though CamScanner is easy to use, can scan any document directly to Portable Document Format, and performs OCR on the
scanned documents, it uses its own camera application, which increases storage size and consumes smartphone resources [6]. The free version of CamScanner adds watermarks, shows advertisements, produces OCR output and PDF documents that cannot be edited, and has had security loopholes that exposed a version of it to malware [7]. The Grizzly Labs describes Genius Scan as an easy document scanner that captures images quickly and generates PDFs of any object. The application automatically processes and enhances the scanned object, bringing out the text and making it more legible in the final image. Even though Genius Scan has the best customer care and provides regular updates, the free version lacks OCR functionality, shows advertisements, does not produce editable PDF documents, and cannot place signatures in the digitized documents [8]. Evernote Scanner scans business cards and multi-page documents and automatically files and organizes the resulting images and files in the user's account. The scans are automatically cropped to remove backgrounds and enhanced to make the text readable. However, it does not offer signature services for the converted documents [9]. Google Drive, a well-known tool from Google, is considered the best tool for scanning documents and saving them directly to Google Drive. However, the free Google Drive scanner has advertisements, content analysis, and storage limitations. The maximum upload storage is 5 terabytes (TB). Freemium document digitization software provides a few functions for free and requires payment for additional ones [10]. Therefore, this study proposes a mobile application that is smaller in size, uses phone resources such as the built-in camera application, and offers all the important document digitization features for free. The proposed solution uses the smartphone's internal camera to make good use of resources.
The literature review shows that the available document digitization applications add watermarks and advertisements, have a large storage footprint, and ship their own camera code, which increases memory load. This study therefore proposes a lightweight application that is free of watermarks and advertisements and makes better use of memory.
3 Methodology
In this project, qualitative data were collected to gather the latest information about the document digitization process. The application was implemented for Android smartphones to make it available to many users. Figure 1 shows the flowchart of the project.
3.1 Qualitative Data Collection
The first step was to collect information from different sources, including online resources, by reading and analyzing publications to learn more about existing applications. The focus was on the strengths and
Fig. 1. Flow chart of the application development.
weaknesses of the existing applications, with consideration for target users with lower incomes.
3.2 Development Tool Kit
The application was developed in the Android Studio Integrated Development Environment (IDE) using the Java programming language. The Android Software Development Kit (SDK) was used; it includes a variety of tools and libraries for developing mobile applications on the Android platform [11], as well as the Android Emulator and the Android Development Tools (ADT) plug-in for Android Studio. The User Interface (UI) was kept simple and understandable so that an ordinary user can easily use the application and understand its functionality [12]. Because people form mental models of icons, the icons were chosen so that users can easily operate the application. The screens were designed using Extensible Markup Language (XML), and the basic logic was written in Java. Several libraries were used to make the application behave as desired, for example the Room database library, which provides an abstraction layer over SQLite for robust database access; Google Mobile Vision, which was used as the text recognition Application Program Interface (API); and the iTextPDF document conversion library, which was used to convert scanned documents into PDF format.
3.3 Graphical User Interface
The Graphical User Interface (GUI) was designed to be simple and user-friendly. Table 1 lists the main screens and features of the application.
Table 1. Menu of the application
Screen | Features
Splash screen | Click on icon to open app
Home screen | Main screen that shows all the specific functionality of the app
Camera | Click to open the camera and scan document
Edge detection | Click to open the camera and detect edges of document
OCR (Optical Character Recognition) | Click to open the camera and convert image of typed or printed text into machine-encoded text
4
Results and Discussion
The developed mobile application, called “Document Digitization app”, stores documents in digital format and provides OCR functionality with the ability to export documents in other formats. Its features are use of the smartphone's internal camera, automatic cropping, automatic edge detection, electronic signatures, conversion of documents to PDF, saving of documents in JPEG format, and freedom from advertisements. Figure 2 shows the home page of the application and the features it provides.
Fig. 2. Home page, [13].
The home screen of the developed application has an icon for scanning or taking a picture, an icon for edge detection, and an icon for OCR (Optical Character Recognition), as shown in Fig. 3.
5
Unit Testing
Various modules were tested manually to check whether the expected results were achieved and displayed on the screen. Table 2 lists the unit testing results used to check the efficiency and accuracy of the developed application.
Fig. 3. Home page of the application.
Table 2. Unit Testing Results
Test case | Expected results
When the application icon is pressed on the smartphone | Opens the application and shows the splash screen, then the home screen containing the main functions of the application
When the edge detection icon is pressed | Opens and lets you choose the camera in your phone, then detects the edges of the object
When the camera icon is pressed | Opens the camera and lets you choose the camera in your phone, then scans the document you need; you can save or edit, share, and print wirelessly
When the OCR (Optical Character Recognition) icon is pressed | Opens the camera and lets you choose the camera in your phone, scans or reads the document and processes it, then lets you edit the recognized text and save or share it
5.1
Compatibility Testing
The application was designed mainly for Android 4.4 (KitKat, originally code-named Key Lime Pie), which covers a large number of Android phone users. Screen size and resolution differ from one Android smartphone to another. The application was developed to be compatible with Android devices regardless of their screen size and resolution, including devices running older Android versions.
6
Future Enhancement of the Application
Currently, the application can detect the edges of a document, scan the document, and save it in different formats. It can also perform optical character recognition, converting an image of typed or printed text into machine-encoded text. OCR technology replaces manual rewriting of printed documents into electronic form. The OCR application can recognize printed text [14].
The study can be extended by adding a module for machine learning object detection using labeling techniques in the Swahili language, so that foreigners can scan a document and have its words displayed in Swahili. The OCR technology can be extended to support more font sizes, font types, and languages. The study can also be extended to convert documents into Microsoft Word and Microsoft PowerPoint formats. In addition, the mobile application could be given the ability to save documents to the cloud and to integrate with LinkedIn. The application is currently implemented for Android; as part of future work it will be extended to other mobile operating systems.
7
Conclusion
Converting hard-copy documents into soft copy is a daily challenge. Document digitization is the technology that accelerates this conversion. Keeping documents in hard copy brings several problems: environmental impact, physical storage limitations, difficulty in processing when the need arises, documents that are not easily searchable, high access times, higher cost, susceptibility to damage and misplacement, and the difficulty of physically carrying and handling the documents. The home screen of the developed mobile application has an icon for scanning or taking a picture and an icon for edge detection. The developed mobile application, “Document Digitization app”, provides automatic cropping, automatic edge detection, conversion of documents to PDF format, sharing of documents on social media, electronic signatures for the converted documents, text extraction from images, and freedom from advertisements.
Acknowledgment. We would like to thank God Almighty for His grace and blessings of life, which helped us to conduct this work. We also extend our special thanks to the Centre of Excellence for ICT in East Africa (CENIT@EA) for contributing to capacity building through knowledge building and financial support to develop our Android application, and to our supervisor, Dr. Dina Machuve, for providing the project title and guidance for our work. We are thankful to all our classmates, friends, and instructors from the School of Computational and Communication Science and Engineering (CoCSE) for their valuable assistance and mentorship during this period of our study.
References
1. UNESCO: Fundamental principles of digitization of documentary heritage (2017). http://www.unesco.org/new/fileadmin/MULTIMEDIA/HQ/CI/CI/pdf/mow/digitization_guidelines_for_web.pdf
2. Zhang, R., Yang, Y., Wang, W.: Research on document digitization processing technology. In: MATEC Web of Conferences, vol. 309, p. 02014 (2020). https://doi.org/10.1051/matecconf/202030902014
3. Medium Platform: Optical character recognition with google cloud vision API (2018). https://medium.com/hackernoon/optical-character-recognition-with-google-cloud-vision-api-255bb8241235
4. Burie, Je., Chazalon, J., Coustaty, M.: ICDAR2015 competition on smartphone document capture and OCR (SmartDoc) (2015)
5. Ding, X., Wen, D., Peng, L.: Document digitization technology and its applications for digital library in China (China) (2004)
6. CamScanner app: CamScanner prices (2020). https://www.camscanner.com/team/price
7. Kaspersky: Malicious android app had more than 100 million downloads in Google Play (2019). https://www.kaspersky.com/blog/camscanner-malicious-android-app/28156/
8. Grizzly Labs: Genius scan (2020). https://thegrizzlylabs.com/genius-scan/
9. Evernote Scanner (2020). https://evernote.com/
10. Google Drive (2020). https://support.google.com/drive/
11. Android Studio IDE (2020). https://developer.android.com/studio
12. Android Studio Plugin (2020). https://developer.android.com/studio/intro/updatesdk-manager
13. CamScanner for high work and learning efficiency (2020). https://www.camscanner.com/
14. Stoliński, S., Bieniecki, W.: Application of OCR systems to processing and digitization of paper documents. In: Information Systems in Management VIII, vol. 102 (2011)
Risk and Returns Around FOMC Press Conferences: A Novel Perspective from Computer Vision
Alexis Marchal
EPFL, Lausanne, Switzerland
[email protected]
Abstract. I propose a new tool to characterize the resolution of uncertainty around FOMC press conferences. It relies on the construction of a measure capturing the level of discussion complexity between the Fed Chair and reporters during the Q&A sessions. I show that complex discussions are associated with higher equity returns and a drop in realized volatility. The method creates an attention score by quantifying how much the Chair needs to rely on reading internal documents to be able to answer a question. This is accomplished by building a novel dataset of video images of the press conferences and leveraging recent deep learning algorithms from computer vision. This alternative data provides new information on nonverbal communication that cannot be extracted from the widely analyzed FOMC transcripts. This paper can be seen as a proof of concept that certain videos contain valuable information for the study of financial markets.
Keywords: FOMC · Machine learning · Computer vision · Video data · Asset pricing · Finance
1
Introduction
Most central banks actively try to shape the expectations of market participants through forward guidance. The main objectives include influencing the prices of various securities, which in turn affects the financing costs of companies, and reducing market volatility during turbulent times. Over the last few years, we have witnessed an explosion of research papers employing machine learning to analyze various documents produced by central banks. The goal is to measure quantitatively how they communicate. This is usually done by assigning a sentiment score (positive/negative) to the language employed by the bankers using Natural Language Processing (NLP) techniques. The contribution of this paper is to provide a new method to characterize the complexity of the discussion between reporters and the Chair of the Fed. Instead of analyzing the text documents, I use the video recordings of the FOMC press conferences and introduce a measure of attention exploiting computer vision algorithms. This is based on the simple premise that complex questions from journalists are followed by complex answers from the Chair, which often creates the need to consult internal documents in order to reply. The main idea is to differentiate between two questions asked by reporters, not by studying their text content, but rather by analyzing how the Chair behaves on the video when answering each question. This way, I am able to identify complex discussions by quantifying how often the Chair needs to look at internal documents. This is the key variable that video images are able to provide over other sources of data. I identify the events that involve more complex discussions and show that they have the largest (positive) impact on equity returns and reduce realized volatility. This highlights a mechanism of uncertainty resolution that works as follows. Answers to complex questions resolve more uncertainty than answers to simple questions, and this ultimately impacts stock returns, volatility and the equity risk premium around the press conferences.
Macroeconomic announcement days have been substantially discussed in the asset pricing literature, which studies how much of the equity risk premium is earned around these events. [6, 15, 16] and [12] all find that a significant risk premium is earned around macroeconomic announcements. [8] argue that once sample selection and day-of-the-month fixed effects are accounted for, these days are not special and the risk premium is not that concentrated around macroeconomic announcement days. Regardless of the fraction of the equity premium that is earned on those days, some risk premium is earned around these events, and they reduce some uncertainty by disclosing important information to market participants. This alone makes these events an important object of interest for researchers.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 295, pp. 724–735, 2022. https://doi.org/10.1007/978-3-030-82196-8_54
Together with [2] and [14], all of the above-mentioned papers study the build-up and resolution of uncertainty around macroeconomic announcements. My addition to this literature is to identify why some press conferences reduce more uncertainty than others. To this end, I compare stock returns on FOMC press conference days when reporters had a complex discussion with other days when the talks were arguably simpler according to a new measure of attention. This allows me to identify a channel through which the Fed reduces market uncertainty and affects asset prices: by discussing with financial reporters. This implies that the Chair reveals additional information during the Q&A sessions that is not redundant with the pre-written opening statements. My findings are consistent with [13], who show that monetary policy affects the pricing of risk by identifying shocks to risky assets that are uncorrelated with changes in the risk-free rate (i.e. “FOMC risk shifts”).
This paper also contributes to the literature on machine learning methods used to analyze central bank communication. I quantify the degree of complexity of a discussion without relying on NLP techniques, hence avoiding their limitations.¹ This new alternative dataset of videos allows me to analyze the press events from a new angle. Indeed, I investigate the same events but leverage computer vision to extract a different signal, namely the time spent by the Chair reading documents while answering questions. In other words, the NLP literature has focused on what is being said during the press conferences while I focus on how it is being said. This is accomplished by exploiting the images of the conferences and scrutinizing human behavior. This information is not present in the transcripts and I argue that it is valuable for financial markets. It is likely that the signal I construct using videos could be augmented by sentiment measures extracted from text data. This is why I view my method as complementary to what has been done in the NLP literature. The combination of both approaches is left for future research.
Another interesting use of machine learning to analyze FOMC conferences is present in [9]. Their dataset is closely related to mine in the sense that they also use the videos from FOMC press conferences, but they only analyze the audio in order to match sentences of the Chair with market reactions in real time. In comparison, my paper is the first to use the images from these videos. Overall, I present a proof of concept that FOMC videos actually provide useful information for financial economists. In accounting, papers like [3, 7] and [4] have used video data to analyze the effects of disclosure through videos. However, they do not use any systematic algorithm to extract the visual content, which makes the approaches hardly scalable. Some authors like [1, 10] or [11] use machines to process the videos, but they focus on extracting emotions either from CEOs or entrepreneurs. None of their methods are suited to analyze FOMC press conferences because central bankers make an effort to appear as neutral as possible when they speak. In contrast to this literature, I develop a simpler tool that systematically computes the reading time of a person appearing in a video. This fully objective measure does not rely at all on emotions.
The rest of the paper is organized as follows. Section 2 establishes the methodology to construct the attention measure. Section 3 presents the main results.
¹ One common drawback of NLP methods in finance/economics is the need to create a dictionary of positive and negative words. The choice of which words belong to which set is somewhat subjective. Another problem with more advanced methods is the necessity to label the data, and the labels might have to be chosen by the researcher.
2
Dataset and Methodology
I use the video of each FOMC press conference (available on the Fed website) from their start in April 2011 to September 2020.² The market data consists of the time series of the S&P500 index sampled at a 1-minute frequency. Each press conference can be decomposed into two parts. (i) The first one is an introductory statement in which the Chair reads a pre-written speech, detailing the recent decisions of the Fed. (ii) The second part is a Q&A session between financial reporters and the Fed Chair. I focus solely on the Q&A for the following reasons. Most of the literature analyzing press conferences has focused on the first part (with a few rare exceptions) even though the Q&A occupies around 82% of the time of the press conference. Moreover, the unprepared character of the Q&A session means that the behavior of the Chair when answering questions (whether he needs to read documents to answer questions or not, for instance) does bring valuable information that has never been analyzed. Indeed, the Q&A is spontaneous and the Chair did not prepare answers to the reporters’ questions.
² I remove the conference from the 15th of March 2020 simply because there is no video available (it is only audio).
Using this data, the main problems I try to solve are
H1: How can we measure the complexity of a question and its associated answer?
H2: Do complex discussions contribute more to reduce uncertainty?
In order to answer these questions, I need to characterize the content of the press conferences. As previously explained, the existing literature has done so by assigning a sentiment score to the verbal content by combining text transcripts with some NLP algorithm. The new idea in my paper is to characterize a discussion between a reporter and the Chair of the FOMC, not by analyzing the language but rather by considering how the Chair reacts after being asked a question. To this end, I focus on the following dimension: Does the Chair reply directly or does he read some internal documents in order to provide an answer? This information is available in the videos provided by the Fed but it needs to be extracted and converted into a numerical quantity that can serve as input for statistical inference tools. This is done by employing various computer vision algorithms that are new in finance but have been applied for years to solve engineering problems. In this paper, I focus on the economic mechanisms and the value of the information that can be extracted from this alternative data. Therefore I keep the discussion of the methodology at a high (non-technical) level. The need for a technical discussion on computer vision can be (partially) avoided because every image-processing step can be easily illustrated. I simply present the result of every computation visually, by showing an image and what kind of information I extract from it.
Given that a video is nothing but a collection of still images, I will use these two words interchangeably. The first step is to construct facial landmarks l ∈ ℝ², which are key points on a human face used to localize specific regions such as the eyes, the mouth, the jaw, etc. In this paper, they help me track certain movements of the Fed Chair during the press conferences when he is answering a question. Basically, I want to know every time the Chair is looking at some documents. This is accomplished in two steps. (i) First, I extract the identity of the people in every frame. This is done via a technique called deep metric learning. I do not linger on this method because it does not add any economic intuition. It is solely used to keep the images where the Chair appears and disregard the others. (ii) Once I have isolated the frames with the Chair, I only use the landmarks associated with the eyes. Figure 1 shows the facial landmarks (black dots) that are specific to them. Each eye is localized by 6 vectors, the so-called landmarks (l1, ..., l6), on the 2D plane that is the picture. When the eyes close and open, the landmarks move, effectively tracking their movements. These points are created using an ensemble of regression trees.
Fig. 1. Facial Landmarks for the Left Eye and Associated Distances (Colored Arrows).
From there, I compute a measure of eye openness: a scalar value indicating how open or closed the eyes are at every point in time. For this purpose I use the eye aspect ratio (EAR) developed in [5]. On each still image (i.e. at one instant in time), I compute the EAR for eye j from the L2 norms of the differences between the eye landmarks
\[
\mathrm{EAR}_j = \frac{\lVert l_{j2} - l_{j6} \rVert + \lVert l_{j3} - l_{j5} \rVert}{2\,\lVert l_{j1} - l_{j4} \rVert}, \qquad j \in \{\text{left eye}, \text{right eye}\} \tag{1}
\]
where lj1, lj2, ..., lj6 are the facial landmarks characterizing one eye and depicted on the diagram in Fig. 1. The vertical distances are represented by green arrows and appear in the numerator of the EAR. The horizontal distance (red arrow) serves to normalize. The final EAR is a simple average over both eyes
\[
\mathrm{EAR} = \frac{\mathrm{EAR}_{\text{left eye}} + \mathrm{EAR}_{\text{right eye}}}{2}. \tag{2}
\]
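The two formulas, Eqs. (1) and (2), can be illustrated with a minimal Python sketch. It assumes the six eye landmarks are already available as (x, y) pixel coordinates ordered l1, ..., l6 as in Fig. 1. The function names and the sample coordinates are hypothetical and serve only as an illustration; this is not the author's implementation.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def eye_aspect_ratio(landmarks: Sequence[Point]) -> float:
    """EAR of one eye from its six landmarks (l1, ..., l6), as in Eq. (1)."""
    if len(landmarks) != 6:
        raise ValueError("expected exactly six (x, y) landmarks")
    l1, l2, l3, l4, l5, l6 = landmarks
    # Numerator: the two vertical distances ||l2 - l6|| and ||l3 - l5||.
    vertical = math.dist(l2, l6) + math.dist(l3, l5)
    # Denominator: twice the horizontal distance ||l1 - l4||.
    horizontal = math.dist(l1, l4)
    if horizontal == 0:
        raise ValueError("degenerate eye: l1 and l4 coincide")
    return vertical / (2.0 * horizontal)


def frame_ear(left_eye: Sequence[Point], right_eye: Sequence[Point]) -> float:
    """Average EAR over both eyes for one frame, as in Eq. (2)."""
    return 0.5 * (eye_aspect_ratio(left_eye) + eye_aspect_ratio(right_eye))


# Hypothetical pixel coordinates (assumption, not data from the paper).
open_eye = [(0, 0), (10, -5), (20, -5), (30, 0), (20, 5), (10, 5)]
closed_eye = [(0, 0), (10, -2), (20, -2), (30, 0), (20, 2), (10, 2)]

print(round(eye_aspect_ratio(open_eye), 3))    # wide-open eye
print(round(eye_aspect_ratio(closed_eye), 3))  # nearly closed eye
```

With these made-up coordinates the horizontal distance is the same for both eyes, so the EAR falls only because the vertical landmark distances shrink as the eye closes.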
Computing this variable for each frame of the videos means that an EAR scalar value is associated with every single image in my dataset. I denote by EARi,t the eye aspect ratio, during the press conference i at instant (frame) t. For each FOMC press conference, I obtain a time series of eye aspect ratio for the Chair. An example is provided in the plot of Fig. 2. The blank spaces correspond to times when reporters ask questions and appear on camera. Given that the Chair is not visible during these periods I do not compute an EAR. The scalar quantity EAR provides a simple way to measure if the eyes are open or closed at every point in time. The EAR takes high values when the eyes are wide open and approaches 0 as they close. To convince the reader that this variable indeed captures what I intend, I provide an example of two video frames in Fig. 3 in order to compare the EAR in different situations. In Fig. 3a, the Chair Janet Yellen is not looking at the documents on the desk and the associated EAR is 0.33 (relatively high value). The convex hull connecting all the landmarks l1 , ..., l6 (drawn in green around the eyes) creates a relatively large set. In the other picture, Fig. 3b, she is clearly reading and the EAR drops to 0.16. Here the convex envelope generates a smaller set. When the Chair spends a substantial amount of time reading, it will produce a series of EARi,t that will be lower during this time period. It is natural to wonder what type of questions will cause some reading by the Chair. To clarify this, I report below a comparison of two questions asked by reporters. They are copied from the transcript of the press conference of September 21, 2016.
Fig. 2. Time Series of the Eye Aspect Ratio (EAR) of the Fed Chair during the Q&A Session of the FOMC Press Conference (April 29, 2020). The Blank Spaces Correspond to Times when Reporters Ask Questions and Appear on Camera. I therefore do not Calculate an EAR during these Periods.
Fig. 3. Convex Hulls Created by the Landmarks l1, ..., l6 and Associated Eye Aspect Ratios (EARs). (a) EAR when the Chair is not reading. (b) EAR when the Chair is reading a document.
Q∗: Question from a reporter that does not lead to the consultation of internal documents by the Chair: “Chair Yellen, at a time when the public is losing faith in many institutions, did the FOMC discuss the importance of today as an opportunity to dispel the thinking that the Fed is politically compromised or beholden to markets?”
Q’: Question from a reporter that does trigger substantial reading from the Chair: “Madam Chair, critics of the Federal Reserve have said that you look for any excuse not to hike, that the goalposts constantly move. And it looks, indeed, like there are new goalposts now when you say looking for further evidence and-and you suggest that it’s evidence that labor-labor market slack is being taken up. Could you explain what for the time being means, in terms of a time frame, and what that further evidence you would look for in order to hike interest rates? And also, this notion that the goalposts seem to move, and that you’ve indeed introduced a new goalpost with this statement.”
The whole idea of this paper is to differentiate between Q∗ and Q’, not by studying the text content, but by analyzing how the Chair behaves when answering each question. The question Q’ will be associated with a complex discussion because my measure of attention EAR will be low due to the reading from the Chair. On the other hand, the EAR stays relatively high when Janet Yellen answers question Q∗. For simplicity, I focus solely on where the Chair looks while answering reporters’ questions. More sophisticated measures incorporating extra facial landmarks on top of the ones locating the eyes could produce a more precise signal. So far, for each press conference i I have a time series of EAR. In order to compare the macroeconomic announcements, I summarize the time series information into a variable Λi that takes one single value per conference. This is done by integrating the EAR over time. I only include the values below some threshold c in order to approximate the total time spent looking at internal documents. The attention measure is therefore defined as Λi = 0
Ti
EARi,t 1{EARi,t